From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	"xishi.qiuxishi@alibaba-inc.com" <xishi.qiuxishi@alibaba-inc.com>,
	"zy.zhengyi@alibaba-inc.com" <zy.zhengyi@alibaba-inc.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2 1/2] mm: fix race on soft-offlining free huge pages
Date: Thu, 19 Jul 2018 06:19:45 +0000
Message-ID: <20180719061945.GB22154@hori1.linux.bs1.fc.nec.co.jp>
In-Reply-To: <20180718085032.GS7193@dhcp22.suse.cz>

On Wed, Jul 18, 2018 at 10:50:32AM +0200, Michal Hocko wrote:
> On Wed 18-07-18 00:55:29, Naoya Horiguchi wrote:
> > On Tue, Jul 17, 2018 at 04:27:43PM +0200, Michal Hocko wrote:
> > > On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
> > > > There's a race condition between soft offline and hugetlb_fault which
> > > > causes unexpected process killing and/or hugetlb allocation failure.
> > > >
> > > > The process killing is caused by the following flow:
> > > >
> > > >   CPU 0               CPU 1              CPU 2
> > > >
> > > >   soft offline
> > > >     get_any_page
> > > >     // find the hugetlb is free
> > > >                       mmap a hugetlb file
> > > >                       page fault
> > > >                         ...
> > > >                           hugetlb_fault
> > > >                             hugetlb_no_page
> > > >                               alloc_huge_page
> > > >                               // succeed
> > > >       soft_offline_free_page
> > > >       // set hwpoison flag
> > > >                                          mmap the hugetlb file
> > > >                                          page fault
> > > >                                            ...
> > > >                                              hugetlb_fault
> > > >                                                hugetlb_no_page
> > > >                                                  find_lock_page
> > > >                                                    return VM_FAULT_HWPOISON
> > > >                                            mm_fault_error
> > > >                                              do_sigbus
> > > >                                              // kill the process
> > > >
> > > >
> > > > The hugetlb allocation failure comes from the following flow:
> > > >
> > > >   CPU 0                          CPU 1
> > > >
> > > >                                  mmap a hugetlb file
> > > >                                  // reserve all free pages but don't fault-in
> > > >   soft offline
> > > >     get_any_page
> > > >     // find the hugetlb is free
> > > >       soft_offline_free_page
> > > >       // set hwpoison flag
> > > >         dissolve_free_huge_page
> > > >         // fail because all free hugepages are reserved
> > > >                                  page fault
> > > >                                    ...
> > > >                                      hugetlb_fault
> > > >                                        hugetlb_no_page
> > > >                                          alloc_huge_page
> > > >                                            ...
> > > >                                              dequeue_huge_page_node_exact
> > > >                                              // ignore hwpoisoned hugepage
> > > >                                              // and finally fail due to no-mem
> > > >
> > > > The root cause of this is that the current soft-offline code is written
> > > > based on the assumption that the PageHWPoison flag should be set first to
> > > > avoid accessing the corrupted data.  This makes sense for memory_failure()
> > > > or hard offline, but not for soft offline, because soft offline is
> > > > about corrected (not uncorrected) errors and is safe from data loss.
> > > > This patch changes soft offline semantics so that it sets the PageHWPoison
> > > > flag only after containment of the error page completes successfully.
> > >
> > > Could you please expand on the workflow here? The code is really
> > > hard to grasp. I must be missing something because the thing shouldn't
> > > be really complicated. Either the page is in the free pool and you just
> > > remove it from the allocator (with hugetlb asking for a new hugetlb page
> > > to guarantee reserves) or it is used and you just migrate the content to
> > > a new page (again with the hugetlb reserves consideration). Why should
> > > PageHWPoison flag ordering have any relevance?
> >
> > (Considering soft offlining free hugepage,)
> > PageHWPoison is set at first before this patch, which is racy with
> > hugetlb fault code because it's not protected by hugetlb_lock.
> >
> > Originally this was written in a similar manner to hard-offline, where
> > the race is accepted and the PageHWPoison flag is set as soon as possible.
> > But that turned out to be neither necessary nor correct, because soft
> > offline is supposed to be less aggressive and failure is OK.
>
> OK
>
> > So this patch is suggesting to make soft-offline less aggressive by
> > moving SetPageHWPoison into the lock.
>
> I guess I still do not understand why we should even care about the
> ordering of the HWPoison flag setting. Why cannot we simply have the
> following code flow? Or maybe we are doing that already and I just do
> not follow the code.
>
> 	soft_offline
> 	  check page_count
> 	    - free - normal page - remove from the allocator
> 	           - hugetlb - allocate a new hugetlb page && remove from the pool
> 	    - used - migrate to a new page && never release the old one
>
> Why do we even need HWPoison flag here? Everything can be completely
> transparent to the application. It shouldn't fail from what I
> understood.

The PageHWPoison flag is used for the 'remove from the allocator' part,
which looks like this:

  static inline
  struct page *rmqueue(
          ...
          do {
                  page = NULL;
                  if (alloc_flags & ALLOC_HARDER) {
                          page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
                          if (page)
                                  trace_mm_page_alloc_zone_locked(page, order, migratetype);
                  }
                  if (!page)
                          page = __rmqueue(zone, order, migratetype);
          } while (page && check_new_pages(page, order));

check_new_pages() returns true if the block taken from the free list
contains a hwpoisoned page, so the allocator goes around the loop again
to get another page.
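
To make the retry loop concrete, here is a minimal userspace sketch of the
same idea. It is not kernel code: struct toy_page, struct toy_free_list and
all toy_* helpers are hypothetical names made up for illustration, and the
real allocator has per-zone, per-order free lists and zone locking that are
left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: only the bits this sketch needs. */
struct toy_page {
	bool hwpoison;            /* models PageHWPoison */
	struct toy_page *next;    /* free-list link */
};

/* Hypothetical single free list (an assumption of this sketch). */
struct toy_free_list {
	struct toy_page *head;
};

/* Put a page back on the free list. */
static void toy_free(struct toy_free_list *fl, struct toy_page *page)
{
	page->next = fl->head;
	fl->head = page;
}

/* Models __rmqueue(): unlink and return the first free page, or NULL. */
static struct toy_page *toy_rmqueue_one(struct toy_free_list *fl)
{
	struct toy_page *page = fl->head;

	if (page) {
		fl->head = page->next;
		page->next = NULL;
	}
	return page;
}

/* Models check_new_pages(): true means "bad page, do not hand it out". */
static bool toy_check_new_page(const struct toy_page *page)
{
	return page->hwpoison;
}

/*
 * Models the retry loop in rmqueue(): a poisoned page is unlinked by
 * toy_rmqueue_one() and then never returned to the caller, so it leaves
 * the free list as a side effect of allocation.
 */
static struct toy_page *toy_alloc(struct toy_free_list *fl)
{
	struct toy_page *page;

	do {
		page = toy_rmqueue_one(fl);
	} while (page && toy_check_new_page(page));

	assert(!page || !page->hwpoison);
	return page;
}
```

In this model nothing ever walks the free list looking for the poisoned
page; it just drops out the first time an allocation reaches it.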

There's no function that can be called from outside the allocator to
remove a page from the allocator.  So the actual page removal is done at
allocation time, not at error handling time. That's the reason why we
need PageHWPoison.
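
For the free hugepage case, one way to picture "moving SetPageHWPoison
into the lock" is the userspace sketch below. It is only an illustration
under stated assumptions: toy_pool_lock stands in for hugetlb_lock, and
struct toy_hpage and the toy_* functions are hypothetical names, not the
functions touched by the patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a hugepage in the pool. */
struct toy_hpage {
	bool in_use;     /* already handed out to a fault handler */
	bool hwpoison;   /* models PageHWPoison */
};

/* Stands in for hugetlb_lock (an assumption of this sketch). */
static pthread_mutex_t toy_pool_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Soft-offline side: check that the page is still free and mark it
 * under the same lock. Returns false and changes nothing if the page
 * was allocated in the meantime; failing is acceptable for soft offline.
 */
static bool toy_soft_offline_free_page(struct toy_hpage *page)
{
	bool marked = false;

	pthread_mutex_lock(&toy_pool_lock);
	if (!page->in_use) {
		page->hwpoison = true;
		marked = true;
	}
	pthread_mutex_unlock(&toy_pool_lock);
	return marked;
}

/*
 * Fault side: take a page from the pool under the same lock, skipping
 * poisoned pages. Returns NULL if no usable page is left.
 */
static struct toy_hpage *toy_alloc_hpage(struct toy_hpage *pool, int n)
{
	struct toy_hpage *found = NULL;

	pthread_mutex_lock(&toy_pool_lock);
	for (int i = 0; i < n; i++) {
		if (!pool[i].in_use && !pool[i].hwpoison) {
			pool[i].in_use = true;
			found = &pool[i];
			break;
		}
	}
	/* A page handed out here can never carry the poison flag. */
	assert(!found || !found->hwpoison);
	pthread_mutex_unlock(&toy_pool_lock);
	return found;
}
```

Because the "is it still free?" check and the flag update happen under one
lock, a page that a fault handler already holds is never marked behind its
back; soft offline simply reports failure in that case.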

Thanks,
Naoya Horiguchi


> > > Do I get it right that the only difference between the hard and soft
> > > offlining is that hugetlb reserves might break for the former while not
> > > for the latter
> >
> > Correct.
> >
> > > and that the failed migration kills all owners for the
> > > former while not for latter?
> >
> > Hard-offline doesn't cause any page migration because the data is already
> > lost, but yes it can kill the owners.
> > Soft-offline never kills processes even if it fails (due to migration failure
> > or some other reason).
> >
> > I listed below some common points and differences between hard-offline
> > and soft-offline.
> >
> >   common points
> >     - they are both contained by the PageHWPoison flag,
> >     - errors are injected via similar interfaces.
> >
> >   differences
> >     - the data on the page is considered lost in hard offline, but is not
> >       in soft offline,
> >     - hard offline likely kills the affected processes, but soft offline
> >       never kills processes,
> >     - soft offline causes page migration, but hard offline does not,
> >     - hard offline prioritizes preventing consumption of broken data,
> >       accepting some race, and soft offline prioritizes not impacting
> >       userspace, accepting failure.
> >
> > It looks to me like there are more differences than common points.
>
> Thanks for the summary. It certainly helped me
> --
> Michal Hocko
> SUSE Labs
>
