* Possible race with page_maybe_dma_pinned?
@ 2021-09-29 19:57 Peter Xu
2021-09-29 22:47 ` Linus Torvalds
0 siblings, 1 reply; 4+ messages in thread
From: Peter Xu @ 2021-09-29 19:57 UTC (permalink / raw)
To: Linux MM Mailing List
Cc: Jason Gunthorpe, Linus Torvalds, John Hubbard, Jan Kara,
Andrew Morton, Andrea Arcangeli
Hi, all,
It seems to be racy to call page_maybe_dma_pinned() without properly taking the
mm->write_protect_seq lock, which is taken read for fast gup.
Now we have 3 callers of page_maybe_dma_pinned():
1. page_needs_cow_for_dma
2. pte_is_pinned
3. shrink_page_list
The 1st one is good as it takes the seqlock for write properly. The 2nd & 3rd
are missing, we may need to add them.
The race could trigger when the fast-gup of FOLL_PIN happened right after a
call to page_maybe_dma_pinned() which returned false. One example for page
reclaim of above case 3:
fast-gup thread page reclaim thread
--------------- -------------------
page_maybe_dma_pinned --> false
put the page into swap cache
fast-gup with FOLL_PIN
unmap page in pgtables
...
So commit feb889fb40fa ("mm: don't put pinned pages into the swap cache",
2021-01-17) could still have a small window that will stop working.
Same thing to the pte_is_pinned for clear_refs, which is case 2nd above.
If anyone agrees, and if anyone would like to fix this, please add:
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
As this is originally spotted and reported by Andrea.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: Possible race with page_maybe_dma_pinned?
2021-09-29 19:57 Possible race with page_maybe_dma_pinned? Peter Xu
@ 2021-09-29 22:47 ` Linus Torvalds
2021-09-30 1:57 ` John Hubbard
0 siblings, 1 reply; 4+ messages in thread
From: Linus Torvalds @ 2021-09-29 22:47 UTC (permalink / raw)
To: Peter Xu
Cc: Linux MM Mailing List, Jason Gunthorpe, John Hubbard, Jan Kara,
Andrew Morton, Andrea Arcangeli
On Wed, Sep 29, 2021 at 12:57 PM Peter Xu <peterx@redhat.com> wrote:
>
> Now we have 3 callers of page_maybe_dma_pinned():
>
> 1. page_needs_cow_for_dma
> 2. pte_is_pinned
> 3. shrink_page_list
>
> The 1st one is good as it takes the seqlock for write properly. The 2nd & 3rd
> are missing, we may need to add them.
Well, the pte_is_pinned() case at least could do the seqlock in
clear_soft_dirty() - it has the vma and mm available.
The page shrinker has always been problematic since it doesn't have
the vm (and by "always" I mean "modern times" - long ago we used to
scan virtually, in the days before rmap)
One option might be for fast-gup to give up on locked pages. I think
the page lock is the only thing that shrink_page_list() serializes
with.
Linus
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: Possible race with page_maybe_dma_pinned?
2021-09-29 22:47 ` Linus Torvalds
@ 2021-09-30 1:57 ` John Hubbard
2021-09-30 11:11 ` Jan Kara
0 siblings, 1 reply; 4+ messages in thread
From: John Hubbard @ 2021-09-30 1:57 UTC (permalink / raw)
To: Linus Torvalds, Peter Xu
Cc: Linux MM Mailing List, Jason Gunthorpe, Jan Kara, Andrew Morton,
Andrea Arcangeli
On 9/29/21 15:47, Linus Torvalds wrote:
> On Wed, Sep 29, 2021 at 12:57 PM Peter Xu <peterx@redhat.com> wrote:
>>
>> Now we have 3 callers of page_maybe_dma_pinned():
>>
>> 1. page_needs_cow_for_dma
>> 2. pte_is_pinned
>> 3. shrink_page_list
>>
>> The 1st one is good as it takes the seqlock for write properly. The 2nd & 3rd
>> are missing, we may need to add them.
>
> Well, the pte_is_pinned() case at least could do the seqlock in
> clear_soft_dirty() - it has the vma and mm available.
That part seems straightforward, OK.
>
> The page shrinker has always been problematic since it doesn't have
> the vm (and by "always" I mean "modern times" - long ago we used to
> scan virtually, in the days before rmap)
>
> One option might be for fast-gup to give up on locked pages. I think
> the page lock is the only thing that shrink_page_list() serializes
> with.
>
In order to avoid locked pages in gup fast, it is easiest to do a
check for locked pages *after* fast-pinning them, and unpin them before
returning to the caller. This makes the change much smaller.
However, doing so leaves a window of time during which the pages are
still marked as maybe-dma-pinned, although those pages are never
returned to the caller as such. There is already code that is subject to
this in lockless_pages_from_mm(), for the case of a failed seqlock read.
I'm thinking it's probably OK, because the pages are not going to be
held long-term. They will be unpinned before returning from
lockless_pages_from_mm().
The counter argument is that this is merely making the race window
smaller, which is usually something that I argue against because it just
leads to harder-to-find bugs...
To be specific, here's what I mean:
diff --git a/mm/gup.c b/mm/gup.c
index 886d6148d3d03..8ba871a927668 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2657,6 +2657,7 @@ static unsigned long lockless_pages_from_mm(unsigned long start,
{
unsigned long flags;
int nr_pinned = 0;
+ int i;
unsigned seq;
if (!IS_ENABLED(CONFIG_HAVE_FAST_GUP) ||
@@ -2693,7 +2694,23 @@ static unsigned long lockless_pages_from_mm(unsigned long start,
unpin_user_pages(pages, nr_pinned);
return 0;
}
+
+ /*
+ * Avoiding locked pages, in this fast/lockless context, will
+ * avoid interfering with shrink_page_list(), in particular.
+ * Give up upon finding the first locked page, but keep the
+ * earlier pages, so that slow gup does not have to re-pin them.
+ */
+ for (i = 0; i < nr_pinned; i++) {
+ if (PageLocked(pages[i])) {
+ unpin_user_pages(&pages[i], nr_pinned - i);
+ nr_pinned = i + 1;
+ break;
+ }
+ }
}
+
+
return nr_pinned;
}
thanks,
--
John Hubbard
NVIDIA
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: Possible race with page_maybe_dma_pinned?
2021-09-30 1:57 ` John Hubbard
@ 2021-09-30 11:11 ` Jan Kara
0 siblings, 0 replies; 4+ messages in thread
From: Jan Kara @ 2021-09-30 11:11 UTC (permalink / raw)
To: John Hubbard
Cc: Linus Torvalds, Peter Xu, Linux MM Mailing List, Jason Gunthorpe,
Jan Kara, Andrew Morton, Andrea Arcangeli
On Wed 29-09-21 18:57:33, John Hubbard wrote:
> On 9/29/21 15:47, Linus Torvalds wrote:
> > On Wed, Sep 29, 2021 at 12:57 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Now we have 3 callers of page_maybe_dma_pinned():
> > >
> > > 1. page_needs_cow_for_dma
> > > 2. pte_is_pinned
> > > 3. shrink_page_list
> > >
> > > The 1st one is good as it takes the seqlock for write properly. The 2nd & 3rd
> > > are missing, we may need to add them.
> >
> > Well, the pte_is_pinned() case at least could do the seqlock in
> > clear_soft_dirty() - it has the vma and mm available.
>
> That part seems straightforward, OK.
>
> >
> > The page shrinker has always been problematic since it doesn't have
> > the vm (and by "always" I mean "modern times" - long ago we used to
> > scan virtually, in the days before rmap)
> >
> > One option might be for fast-gup to give up on locked pages. I think
> > the page lock is the only thing that shrink_page_list() serializes
> > with.
> >
>
> In order to avoid locked pages in gup fast, it is easiest to do a
> check for locked pages *after* fast-pinning them, and unpin them before
> returning to the caller. This makes the change much smaller.
>
> However, doing so leaves a window of time during which the pages are
> still marked as maybe-dma-pinned, although those pages are never
> returned to the caller as such. There is already code that is subject to
> this in lockless_pages_from_mm(), for the case of a failed seqlock read.
> I'm thinking it's probably OK, because the pages are not going to be
> held long-term. They will be unpinned before returning from
> lockless_pages_from_mm().
>
> The counter argument is that this is merely making the race window
> smaller, which is usually something that I argue against because it just
> leads to harder-to-find bugs...
Yeah, what you propose actually does not guarantee that the reclaim and
page pinning cannot race in a way that the page gets unmapped and
gup_fast() returns a pinned page. Which is what the code in
shrink_page_list() tries to prevent AFAIU.
The only place where I can see us doing a sanely race-free check in
shrink_page_list() path is inside try_to_unmap() - we could implement
unmap_if_not_pinned semantics and inside the rmap walk we can bump up the
seqcounts (it even conceptually makes some sense) to make sure
page_maybe_dma_pinned() check we'd do during the rmap walk is not racing
with pup_fast(). By the time we leave the seqcount critical section, the
page will be unmapped so we can be sure there will be no new pins of the
page.
Honza
>
> To be specific, here's what I mean:
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 886d6148d3d03..8ba871a927668 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2657,6 +2657,7 @@ static unsigned long lockless_pages_from_mm(unsigned long start,
> {
> unsigned long flags;
> int nr_pinned = 0;
> + int i;
> unsigned seq;
>
> if (!IS_ENABLED(CONFIG_HAVE_FAST_GUP) ||
> @@ -2693,7 +2694,23 @@ static unsigned long lockless_pages_from_mm(unsigned long start,
> unpin_user_pages(pages, nr_pinned);
> return 0;
> }
> +
> + /*
> + * Avoiding locked pages, in this fast/lockless context, will
> + * avoid interfering with shrink_page_list(), in particular.
> + * Give up upon finding the first locked page, but keep the
> + * earlier pages, so that slow gup does not have to re-pin them.
> + */
> + for (i = 0; i < nr_pinned; i++) {
> + if (PageLocked(pages[i])) {
> + unpin_user_pages(&pages[i], nr_pinned - i);
> + nr_pinned = i + 1;
> + break;
> + }
> + }
> }
> +
> +
> return nr_pinned;
> }
>
>
> thanks,
> --
> John Hubbard
> NVIDIA
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2021-09-30 11:11 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-29 19:57 Possible race with page_maybe_dma_pinned? Peter Xu
2021-09-29 22:47 ` Linus Torvalds
2021-09-30 1:57 ` John Hubbard
2021-09-30 11:11 ` Jan Kara
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.