From: Mauricio Faria de Oliveira <mfo@canonical.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: Minchan Kim <minchan@kernel.org>, Yu Zhao <yuzhao@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-block@vger.kernel.org,
	Miaohe Lin <linmiaohe@huawei.com>, Yang Shi <shy828301@gmail.com>
Subject: Re: [PATCH v2] mm: fix race between MADV_FREE reclaim and blkdev direct IO read
Date: Thu, 13 Jan 2022 11:54:36 -0300
Message-ID: <CAO9xwp1QbmLNA6her4HveuBOZSwcNY5jZqtc00XQ0=V=HEV6Aw@mail.gmail.com>
In-Reply-To: <87ilunu8au.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Thu, Jan 13, 2022 at 9:30 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> "Huang, Ying" <ying.huang@intel.com> writes:
>
> > Minchan Kim <minchan@kernel.org> writes:
> >
> >> On Wed, Jan 12, 2022 at 06:53:07PM -0300, Mauricio Faria de Oliveira wrote:
> >>> Hi Minchan Kim,
> >>>
> >>> Thanks for handling the hard questions! :)
> >>>
> >>> On Wed, Jan 12, 2022 at 2:33 PM Minchan Kim <minchan@kernel.org> wrote:
> >>> >
> >>> > On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote:
> >>> > > Yu Zhao <yuzhao@google.com> writes:
> >>> > >
> >>> > > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
> >>> > > >> diff --git a/mm/rmap.c b/mm/rmap.c
> >>> > > >> index 163ac4e6bcee..8671de473c25 100644
> >>> > > >> --- a/mm/rmap.c
> >>> > > >> +++ b/mm/rmap.c
> >>> > > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >>> > > >>
> >>> > > >>                    /* MADV_FREE page check */
> >>> > > >>                    if (!PageSwapBacked(page)) {
> >>> > > >> -                          if (!PageDirty(page)) {
> >>> > > >> +                          int ref_count = page_ref_count(page);
> >>> > > >> +                          int map_count = page_mapcount(page);
> >>> > > >> +
> >>> > > >> +                          /*
> >>> > > >> +                           * The only page refs must be from the isolation
> >>> > > >> +                           * (checked by the caller shrink_page_list() too)
> >>> > > >> +                           * and one or more rmap's (dropped by discard:).
> >>> > > >> +                           *
> >>> > > >> +                           * Check the reference count before dirty flag
> >>> > > >> +                           * with memory barrier; see __remove_mapping().
> >>> > > >> +                           */
> >>> > > >> +                          smp_rmb();
> >>> > > >> +                          if ((ref_count - 1 == map_count) &&
> >>> > > >> +                              !PageDirty(page)) {
> >>> > > >>                                    /* Invalidate as we cleared the pte */
> >>> > > >>                                    mmu_notifier_invalidate_range(mm,
> >>> > > >>                                            address, address + PAGE_SIZE);
> >>> > > >
> >>> > > > Out of curiosity, how does it work with COW in terms of reordering?
> >>> > > > Specifically, it seems to me get_page() and page_dup_rmap() in
> >>> > > > copy_present_pte() can happen in any order, and if page_dup_rmap()
> >>> > > > is seen first, and direct io is holding a refcnt, this check can still
> >>> > > > pass?
> >>> > >
> >>> > > I think that you are correct.
> >>> > >
> >>> > > After more thought, it appears very tricky to compare page count and
> >>> > > map count.  Even if we have added smp_rmb() between page_ref_count() and
> >>> > > page_mapcount(), an interrupt may happen between them.  During the
> >>> > > interrupt, the page count and map count may be changed, for example,
> >>> > > unmapped, or do_swap_page().
> >>> >
> >>> > Yeah, it happens, but what specific problem are you concerned about from the
> >>> > count change under race? The fork case Yu pointed out was already known
> >>> > for breaking DIO, so the user should take care not to fork under DIO (please
> >>> > look at the O_DIRECT section in man 2 open). If you could give a specific
> >>> > example, it would be great to think over the issue.
> >>> >
> >>> > I agree it's a little tricky, but it seems to be the way other places have
> >>> > done it for a long time (please look at write_protect_page() in ksm.c).
> >>>
> >>> Ah, that's great to see it's being used elsewhere, for DIO particularly!
> >>>
> >>> > So, what we are missing here is a TLB flush before the check.
> >>>
> >>> That shouldn't be required for this particular issue/case, IIUC.
> >>> One of the things we checked early on was disabling deferred TLB flush
> >>> (similarly to what you've done), and it didn't help with the issue; also, the
> >>> issue happens on uniprocessor mode too (thus no remote CPU involved.)
> >>
> >> I guess you didn't try it with page_mapcount + 1 == page_count at that
> >> time?  Anyway, I agree we don't need a TLB flush here like KSM.
> >> I think the reason KSM does a TLB flush before the check is to
> >> make sure a trap is triggered on a write from a user process on another core.
> >> However, in this MADV_FREE case, HW already guarantees the trap.
> >> Please see below.
> >>
> >>>
> >>>
> >>> >
> >>> > Something like this.
> >>> >
> >>> > diff --git a/mm/rmap.c b/mm/rmap.c
> >>> > index b0fd9dc19eba..b4ad9faa17b2 100644
> >>> > --- a/mm/rmap.c
> >>> > +++ b/mm/rmap.c
> >>> > @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >>> >
> >>> >                         /* MADV_FREE page check */
> >>> >                         if (!PageSwapBacked(page)) {
> >>> > -                               int refcount = page_ref_count(page);
> >>> > -
> >>> > -                               /*
> >>> > -                                * The only page refs must be from the isolation
> >>> > -                                * (checked by the caller shrink_page_list() too)
> >>> > -                                * and the (single) rmap (dropped by discard:).
> >>> > -                                *
> >>> > -                                * Check the reference count before dirty flag
> >>> > -                                * with memory barrier; see __remove_mapping().
> >>> > -                                */
> >>> > -                               smp_rmb();
> >>> > -                               if (refcount == 2 && !PageDirty(page)) {
> >>> > +                               if (!PageDirty(page) &&
> >>> > +                                       page_mapcount(page) + 1 == page_count(page)) {
> >>>
> >>> In the interest of avoiding a different race/bug, it seemed worth following the
> >>> suggestion outlined in __remove_mapping(), i.e., checking PageDirty()
> >>> after the page's reference count, with a memory barrier in between.
> >>
> >> True, so your patch as-is looks good to me.
> >
> > If my understanding were correct, a shared anonymous page will be mapped
> > read-only.  If so, will a private anonymous page be called
> > SetPageDirty() concurrently after direct IO case has been dealt with
> > via comparing page_count()/page_mapcount()?
>
> Sorry, I found that I am not quite right here.  When direct IO read
> completes, it will call SetPageDirty() and put_page() finally.  And
> MADV_FREE in try_to_unmap_one() needs to deal with that too.
>
> Checking direct IO code, it appears that set_page_dirty_lock() is used
> to set page dirty, which will use lock_page().
>
>   dio_bio_complete
>     bio_check_pages_dirty
>       bio_dirty_fn  /* through workqueue */
>         bio_release_pages
>           set_page_dirty_lock
>     bio_release_pages
>       set_page_dirty_lock
>
> So in theory, for direct IO, the memory barrier may be unnecessary.  But
> I don't think it's a good idea to depend on this specific behavior of
> direct IO.  The original code with memory barrier looks more generic and
> I don't think it will introduce visible overhead.
>

Thanks for thinking through all the potential corner cases!

Regarding the overhead, agreed. This is in memory reclaim, which isn't a
fast path (and even under direct reclaim, things have already slowed down),
so the barrier should be fine.
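
For reference, the ordering discussed above (read the reference and map
counts, then a read barrier, then the dirty flag) can be modeled in plain
userspace C. This is only a rough sketch, not the kernel code: struct
model_page and the model_*() functions are hypothetical names made up for
illustration, and the C11 acquire fence is an assumed stand-in for
smp_rmb(), not an exact equivalent.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the struct page state involved. */
struct model_page {
	atomic_int ref_count;	/* models page_ref_count() */
	atomic_int map_count;	/* models page_mapcount() */
	atomic_bool dirty;	/* models PageDirty() */
};

/* Set up a model page with the given counts and dirty state. */
void model_page_init(struct model_page *p, int refs, int maps, bool dirty)
{
	atomic_init(&p->ref_count, refs);
	atomic_init(&p->map_count, maps);
	atomic_init(&p->dirty, dirty);
}

/*
 * Returns true if the lazily-freed page may be discarded: the only
 * references are the isolation reference plus one per mapping, and the
 * page is clean. The counts are loaded before the dirty flag.
 */
bool model_can_discard(struct model_page *p)
{
	int ref_count = atomic_load_explicit(&p->ref_count, memory_order_relaxed);
	int map_count = atomic_load_explicit(&p->map_count, memory_order_relaxed);

	/* Assumed stand-in for smp_rmb(): order the loads above before the load below. */
	atomic_thread_fence(memory_order_acquire);

	bool dirty = atomic_load_explicit(&p->dirty, memory_order_relaxed);

	return (ref_count - 1 == map_count) && !dirty;
}

/* Small self-check covering the cases discussed in the thread. */
void model_self_check(void)
{
	struct model_page p;

	model_page_init(&p, 2, 1, false);	/* isolation + one rmap, clean */
	assert(model_can_discard(&p));

	model_page_init(&p, 3, 1, false);	/* extra ref, e.g. held by direct IO */
	assert(!model_can_discard(&p));

	model_page_init(&p, 2, 1, true);	/* redirtied after MADV_FREE */
	assert(!model_can_discard(&p));
}
```

The model is single-threaded, so it only illustrates the order of the
checks and the count arithmetic; it does not reproduce the race itself.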

cheers,

> Best Regards,
> Huang, Ying



-- 
Mauricio Faria de Oliveira
