From: Mel Gorman <mgorman@suse.de>
To: David Hildenbrand <david@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>,
	Christoph Lameter <cl@linux.com>,
	Aaron Tomlin <atomlin@atomlin.com>,
	Frederic Weisbecker <frederic@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Vlastimil Babka <vbabka@suse.cz>
Subject: Re: [PATCH v2 01/11] mm/vmstat: remove remote node draining
Date: Tue, 21 Mar 2023 15:20:31 +0000	[thread overview]
Message-ID: <20230321152031.2bzcury6k6aj7p6k@suse.de> (raw)
In-Reply-To: <3329f63e-5671-1500-0730-cd46ba461d04@redhat.com>

On Thu, Mar 02, 2023 at 11:10:03AM +0100, David Hildenbrand wrote:
> [...]
> 
> > 
> > > (2) drain_zone_pages() documents that we're draining the PCP
> > >      (bulk-freeing them) of the current CPU on remote nodes. That bulk-
> > >      freeing will properly adjust free memory counters. What exactly is
> > >      the impact when no longer doing that? Won't the "snapshot" of some
> > >      counters eventually be wrong? Do we care?
> > 
> > Don't see why the snapshot of counters will be wrong.
> > 
> > Instead of freeing pages on pcp list of remote nodes after they are
> > considered idle ("3 seconds idle till flush"), what will happen is that
> > drain_all_pages() will free those pcps, for example after an allocation
> > fails on direct reclaim:
> > 
> >          page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
> > 
> >          /*
> >           * If an allocation failed after direct reclaim, it could be because
> >           * pages are pinned on the per-cpu lists or in high alloc reserves.
> >           * Shrink them and try again
> >           */
> >          if (!page && !drained) {
> >                  unreserve_highatomic_pageblock(ac, false);
> >                  drain_all_pages(NULL);
> >                  drained = true;
> >                  goto retry;
> >          }
> > 
> > In both cases the pages are freed (and counters maintained) here:
> > 
> > static inline void __free_one_page(struct page *page,
> >                  unsigned long pfn,
> >                  struct zone *zone, unsigned int order,
> >                  int migratetype, fpi_t fpi_flags)
> > {
> >          struct capture_control *capc = task_capc(zone);
> >          unsigned long buddy_pfn = 0;
> >          unsigned long combined_pfn;
> >          struct page *buddy;
> >          bool to_tail;
> > 
> >          VM_BUG_ON(!zone_is_initialized(zone));
> >          VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> > 
> >          VM_BUG_ON(migratetype == -1);
> >          if (likely(!is_migrate_isolate(migratetype)))
> >                  __mod_zone_freepage_state(zone, 1 << order, migratetype);
> > 
> >          VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
> >          VM_BUG_ON_PAGE(bad_range(zone, page), page);
> > 
> >          while (order < MAX_ORDER - 1) {
> >                  if (compaction_capture(capc, page, order, migratetype)) {
> >                          __mod_zone_freepage_state(zone, -(1 << order),
> >                                                                  migratetype);
> >                          return;
> >                  }
> > 
> > > Describing the difference between instructed refresh of vmstat and "remotely
> > > drain per-cpu lists" in order to move free memory from the pcp to the buddy
> > > would be great.
> > 
> > The difference is that now remote PCPs will be drained on demand, either via
> > kcompactd or direct reclaim (through drain_all_pages), when memory is
> > low.
> > 
> > For example, with the following test:
> > 
> > dd if=/dev/zero of=file bs=1M count=32000 on a tmpfs filesystem:
> > 
> >        kcompactd0-116     [005] ...1 228232.042873: drain_all_pages <-kcompactd_do_work
> >        kcompactd0-116     [005] ...1 228232.042873: __drain_all_pages <-kcompactd_do_work
> >                dd-479485  [003] ...1 228232.455130: __drain_all_pages <-__alloc_pages_slowpath.constprop.0
> >                dd-479485  [011] ...1 228232.721994: __drain_all_pages <-__alloc_pages_slowpath.constprop.0
> >       gnome-shell-3750    [015] ...1 228232.723729: __drain_all_pages <-__alloc_pages_slowpath.constprop.0
> > 
> > The commit message was indeed incorrect. Updated one:
> > 
> > "mm/vmstat: remove remote node draining
> > 
> > Draining of pages from the local pcp for a remote zone should not be
> > necessary, since once the system is low on memory (or compaction on a
> > zone is in effect), drain_all_pages should be called freeing any unused
> > pcps."
> > 
> > Thanks!
> 
> Thanks for the explanation, that makes sense to me. Feel free to add my
> 
> Acked-by: David Hildenbrand <david@redhat.com>
> 
> ... hoping that some others (Mel, Vlastimil?) can have another look.
> 

I was on extended leave and am still in the process of triaging a few
thousand mails, so I'm working from memory here rather than the code. This
is a straightforward enough question to answer quickly in case I forget
later.

Short answer: I'm not a fan of the patch in concept and I do not think it
should be merged.

I agree that drain_all_pages() would free the PCP pages on demand in
direct reclaim context, but that happens only after reclaim has already
taken place. Hence, the reclaim may be unnecessary and may cause
overreclaim in some situations due to remote CPUs pinning memory in PCP
lists.

Similarly, kswapd may be woken early because PCP pages do not contribute
to NR_FREE_PAGES, so watermark checks can fail even though pages are
free, just inaccessible to the allocator.

Finally, pages on remote PCP lists are expired because CPUs should
ideally be allocating local memory, assuming memory policies are not
forcing use of remote nodes. The expiry means that remote pages get
freed back to the buddy lists after a short period. By removing the
expiry, it's possible that a local allocation will fail and spill over
to a remote node prematurely because free pages were pinned on the PCP
lists.

As this patch has the possibility of reclaiming early in both direct and
kswapd context and increases the risk of remote node fallback, I think it
needs much stronger justification and a warning about the side-effects. For
this version unless I'm very wrong -- NAK :(

-- 
Mel Gorman
SUSE Labs

Thread overview: 47+ messages
2023-02-09 15:01 [PATCH v2 00/11] fold per-CPU vmstats remotely Marcelo Tosatti
2023-02-09 15:01 ` [PATCH v2 01/11] mm/vmstat: remove remote node draining Marcelo Tosatti
2023-02-28 15:53   ` David Hildenbrand
2023-02-28 19:36     ` Marcelo Tosatti
2023-03-02 10:10       ` David Hildenbrand
2023-03-21 15:20         ` Mel Gorman [this message]
2023-03-21 17:31           ` Marcelo Tosatti
2023-03-02 17:21   ` Peter Xu
2023-03-02 17:27     ` Peter Xu
2023-03-02 19:17       ` Marcelo Tosatti
2023-03-02 18:56     ` Marcelo Tosatti
2023-02-09 15:01 ` [PATCH v2 02/11] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function Marcelo Tosatti
2023-03-02 10:42   ` David Hildenbrand
2023-03-02 10:51     ` David Hildenbrand
2023-03-02 14:32     ` Marcelo Tosatti
2023-03-02 20:53   ` Peter Xu
2023-03-02 21:04     ` Marcelo Tosatti
2023-03-02 21:25       ` Peter Xu
2023-03-03 15:39         ` Marcelo Tosatti
2023-03-03 15:47     ` Marcelo Tosatti
2023-03-15 23:56   ` Christoph Lameter
2023-03-16 10:54     ` Marcelo Tosatti
2023-02-09 15:01 ` [PATCH v2 03/11] this_cpu_cmpxchg: loongarch: " Marcelo Tosatti
2023-02-09 15:01 ` [PATCH v2 04/11] this_cpu_cmpxchg: S390: " Marcelo Tosatti
2023-02-09 15:01 ` [PATCH v2 05/11] this_cpu_cmpxchg: x86: " Marcelo Tosatti
2023-02-09 15:01 ` [PATCH v2 06/11] this_cpu_cmpxchg: asm-generic: " Marcelo Tosatti
2023-02-09 15:01 ` [PATCH v2 07/11] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local Marcelo Tosatti
2023-03-02 20:54   ` Peter Xu
2023-02-09 15:01 ` [PATCH v2 08/11] mm/vmstat: switch counter modification to cmpxchg Marcelo Tosatti
2023-03-02 10:47   ` David Hildenbrand
2023-03-02 14:47     ` Marcelo Tosatti
2023-03-02 16:20       ` Peter Xu
2023-03-02 19:11         ` Marcelo Tosatti
2023-03-02 20:06           ` Peter Xu
2023-02-09 15:01 ` [PATCH v2 09/11] mm/vmstat: use cmpxchg loop in cpu_vm_stats_fold Marcelo Tosatti
2023-03-01 22:57   ` Peter Xu
2023-03-02 13:55     ` Marcelo Tosatti
2023-03-02 21:19       ` Peter Xu
2023-03-03 15:17         ` Marcelo Tosatti
2023-02-09 15:02 ` [PATCH v2 10/11] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely Marcelo Tosatti
2023-03-02 21:01   ` Peter Xu
2023-03-02 21:16     ` Marcelo Tosatti
2023-03-02 21:30       ` Peter Xu
2023-02-09 15:02 ` [PATCH v2 11/11] mm/vmstat: refresh stats remotely instead of via work item Marcelo Tosatti
2023-02-23 14:54 ` [PATCH v2 00/11] fold per-CPU vmstats remotely Marcelo Tosatti
2023-02-24  2:34   ` Hillf Danton
2023-02-27 19:41     ` Marcelo Tosatti
