linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Oded Gabbay <oded.gabbay@gmail.com>
To: Jason Gunthorpe <jgg@ziepe.ca>,
	Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>, John Hubbard <jhubbard@nvidia.com>,
	 Leon Romanovsky <leonro@nvidia.com>,
	Linux-MM <linux-mm@kvack.org>,
	 Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	"Maya B . Gokhale" <gokhale2@llnl.gov>,
	 Yang Shi <yang.shi@linux.alibaba.com>,
	Marty Mcfadden <mcfadden8@llnl.gov>,
	 Kirill Shutemov <kirill@shutemov.name>,
	Oleg Nesterov <oleg@redhat.com>, Jann Horn <jannh@google.com>,
	 Kirill Tkhai <ktkhai@virtuozzo.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	 Christoph Hellwig <hch@lst.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Daniel Vetter <daniel.vetter@ffwll.ch>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	 Peter Xu <peterx@redhat.com>
Subject: Re: [PATCH 1/4] mm: Trial do_wp_page() simplification
Date: Wed, 16 Feb 2022 18:59:22 +0200	[thread overview]
Message-ID: <CAFCwf10eNmwq2wD71xjUhqkvv5+_pJMR1nPug2RqNDcFT4H86Q@mail.gmail.com> (raw)
In-Reply-To: <20200921120301.GD8409@ziepe.ca>

On Mon, Sep 21, 2020 at 3:03 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Mon, Sep 21, 2020 at 10:35:05AM +0200, Jan Kara wrote:
> > > My thinking is to hit this issue you have to already be doing
> > > FOLL_LONGTERM, and if some driver hasn't been properly marked and
> > > regresses, the fix is to mark it.
> > >
> > > Remember, this use case requires the pin to extend after a system
> > > call, past another fork() system call, and still have data-coherence.
> > >
> > > IMHO that can only happen in the FOLL_LONGTERM case as it inhernetly
> > > means the lifetime of the pin is being controlled by userspace, not by
> > > the kernel. Otherwise userspace could not cause new DMA touches after
> > > fork.
> >
> > I agree that the new aggressive COW behavior is probably causing issues
> > only for FOLL_LONGTERM users. That being said it would be nice if even
> > ordinary threaded FOLL_PIN users would not have to be that careful about
> > fork(2) and possible data loss due to COW - we had certainly reports of
> > O_DIRECT IO loosing data due to fork(2) and COW exactly because it is very
> > subtle how it behaves... But as I wrote above this is not urgent since that
> > problematic behavior exists since the beginning of O_DIRECT IO in Linux.
>
> Yes, I agree - what I was thinking is to do this FOLL_LONGTERM for the
> rc and then a small patch to make it wider for the next cycle so it
> can test in linux-next for a responsible time period.
>
> Interesting to hear you confirm block has also seen subtle user
> problems with this as well.
>
> Jason
>

Hi Jason, Linus,
Sorry for waking up this thread, but I've filed a bug against this change:
https://bugzilla.kernel.org/show_bug.cgi?id=215616

In the past week, I've bisected a problem we have in one of our new
demos running on our Gaudi accelerator, and after a very long
bisection, I've come to this commit.
All the details are in the bug, but the bottom line is that somehow,
this patch causes corruption when the numa balancing feature is
enabled AND we don't use process affinity AND we use GUP to pin pages
so our accelerator can DMA to/from system memory.

Either disabling numa balancing, using process affinity to bind to
specific numa-node or reverting this patch causes the bug to
disappear.
I validated the bug and the revert on kernels 5.9, 5.11 and 5.17-rc4.

You can see our GUP code in the driver in get_user_memory() in
drivers/misc/habanalabs/common/memory.c.
It is fairly standard and I think I got that line from Daniel (cc'ed
on this email).

I would appreciate help from the mm experts here to understand how to
fix this, but it looks as if this simplification caused or exposed
some race between numa migration code and GUP.

Thanks,
Oded


  reply	other threads:[~2022-02-16 16:59 UTC|newest]

Thread overview: 98+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-21 23:49 [PATCH 0/4] mm: Simplfy cow handling Peter Xu
2020-08-21 23:49 ` [PATCH 1/4] mm: Trial do_wp_page() simplification Peter Xu
2020-08-24  8:36   ` Kirill Tkhai
2020-08-24 14:30     ` Jan Kara
2020-08-24 15:37       ` Kirill Tkhai
2020-08-24 18:22         ` Linus Torvalds
2020-09-01  7:01           ` Hugh Dickins
2020-09-14 14:38   ` Jason Gunthorpe
2020-09-14 17:32     ` Linus Torvalds
2020-09-14 18:34       ` Peter Xu
2020-09-14 21:15         ` Peter Xu
2020-09-14 22:55           ` Jason Gunthorpe
2020-09-14 22:59             ` Linus Torvalds
2020-09-14 23:28               ` Jason Gunthorpe
2020-09-15  0:19                 ` Linus Torvalds
2020-09-15 14:50                 ` Peter Xu
2020-09-15 15:17                   ` Peter Xu
2020-09-15 16:05                   ` Jason Gunthorpe
2020-09-15 18:29                     ` Jason Gunthorpe
2020-09-15 19:13                       ` Peter Xu
2020-09-15 19:38                         ` Jason Gunthorpe
2020-09-15 21:33                           ` Peter Xu
2020-09-15 23:22                             ` Jason Gunthorpe
2020-09-16  1:50                               ` John Hubbard
2020-09-16 17:48                                 ` Jason Gunthorpe
2020-09-16 18:46                                   ` Peter Xu
2020-09-17 11:25                                     ` Jason Gunthorpe
2020-09-17 18:11                                       ` Linus Torvalds
2020-09-17 19:38                                         ` Jason Gunthorpe
2020-09-17 19:51                                           ` Linus Torvalds
2020-09-18 16:40                                             ` Peter Xu
2020-09-18 17:16                                               ` Linus Torvalds
2020-09-18 19:57                                                 ` Peter Xu
2020-09-18 17:32                                               ` Jason Gunthorpe
2020-09-18 20:40                                                 ` Peter Xu
2020-09-18 20:59                                                   ` Linus Torvalds
2020-09-19  0:28                                                     ` Jason Gunthorpe
2020-09-18 21:06                                                   ` John Hubbard
2020-09-19  0:01                                                     ` Jason Gunthorpe
2020-09-21  8:35                                                       ` Jan Kara
2020-09-21 12:03                                                         ` Jason Gunthorpe
2022-02-16 16:59                                                           ` Oded Gabbay [this message]
2022-02-16 17:24                                                             ` Oded Gabbay
2022-02-16 19:04                                                             ` Linus Torvalds
2022-02-16 19:20                                                               ` Oded Gabbay
2022-02-16 19:24                                                               ` David Hildenbrand
2020-09-21 13:42                                               ` Michal Hocko
2020-09-21 14:18                                                 ` Peter Xu
2020-09-21 14:28                                                   ` Michal Hocko
2020-09-21 14:38                                                     ` Tejun Heo
2020-09-21 14:43                                                       ` Christian Brauner
2020-09-21 14:55                                                         ` Michal Hocko
2020-09-21 15:04                                                           ` Christian Brauner
2020-09-21 16:06                                                             ` Michal Hocko
2020-09-23  7:53                                                               ` Michal Hocko
2020-09-21 14:41                                                 ` Christian Brauner
2020-09-21 14:57                                                   ` Michal Hocko
2020-09-21 16:31                                                     ` Peter Xu
2020-09-17 18:14                                       ` Peter Xu
2020-09-17 18:26                                         ` Linus Torvalds
2020-09-17 19:03                                           ` Peter Xu
2020-09-17 19:42                                             ` Linus Torvalds
2020-09-17 19:55                                               ` John Hubbard
2020-09-17 20:06                                               ` Jason Gunthorpe
2020-09-17 20:19                                                 ` John Hubbard
2020-09-17 20:25                                                   ` Jason Gunthorpe
2020-09-17 20:35                                                 ` Linus Torvalds
2020-09-17 21:40                                                   ` Peter Xu
2020-09-17 22:09                                                     ` Jason Gunthorpe
2020-09-17 22:25                                                       ` Linus Torvalds
2020-09-17 22:48                                                       ` Ira Weiny
2020-09-18  9:36                                                         ` Jan Kara
2020-09-18  9:44                                                       ` Jan Kara
2020-09-18 16:19                                             ` Jason Gunthorpe
2020-09-15 10:23           ` Leon Romanovsky
2020-09-15 15:56           ` Jason Gunthorpe
2020-09-15 15:03   ` Oleg Nesterov
2020-09-15 16:18     ` Peter Xu
2020-08-21 23:49 ` [PATCH 2/4] mm/ksm: Remove reuse_ksm_page() Peter Xu
2020-08-21 23:49 ` [PATCH 3/4] mm/gup: Remove enfornced COW mechanism Peter Xu
2020-09-14 14:27   ` Oleg Nesterov
2020-09-14 17:59     ` Peter Xu
2020-09-14 19:03       ` Linus Torvalds
2020-08-21 23:49 ` [PATCH 4/4] mm: Add PGREUSE counter Peter Xu
2020-08-22 16:14   ` Linus Torvalds
2020-08-24  0:24     ` Peter Xu
2020-08-22 16:05 ` [PATCH 0/4] mm: Simplfy cow handling Linus Torvalds
2020-08-23 23:58   ` Peter Xu
2020-08-24  8:38 ` Kirill Tkhai
2020-08-27 14:15 ` Peter Xu
2021-02-02 14:40 [PATCH 1/4] mm: Trial do_wp_page() simplification Gal Pressman
2021-02-02 16:31 ` Peter Xu
2021-02-02 16:44   ` Jason Gunthorpe
2021-02-02 17:05     ` Peter Xu
2021-02-02 17:13       ` Jason Gunthorpe
2021-02-03 12:43         ` Gal Pressman
2021-02-03 14:00           ` Jason Gunthorpe
2021-02-03 14:47             ` Gal Pressman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAFCwf10eNmwq2wD71xjUhqkvv5+_pJMR1nPug2RqNDcFT4H86Q@mail.gmail.com \
    --to=oded.gabbay@gmail.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=daniel.vetter@ffwll.ch \
    --cc=gokhale2@llnl.gov \
    --cc=gregkh@linuxfoundation.org \
    --cc=hch@lst.de \
    --cc=jack@suse.cz \
    --cc=jannh@google.com \
    --cc=jgg@ziepe.ca \
    --cc=jhubbard@nvidia.com \
    --cc=kirill@shutemov.name \
    --cc=ktkhai@virtuozzo.com \
    --cc=leonro@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mcfadden8@llnl.gov \
    --cc=oleg@redhat.com \
    --cc=peterx@redhat.com \
    --cc=torvalds@linux-foundation.org \
    --cc=yang.shi@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).