Linux-rt-users Archive on lore.kernel.org
 help / color / Atom feed
From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: Linux-MM <linux-mm@kvack.org>,
	Linux-RT-Users <linux-rt-users@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Chuck Lever <chuck.lever@oracle.com>,
	Matthew Wilcox <willy@infradead.org>,
	brouer@redhat.com
Subject: Re: [RFC PATCH 0/6] Use local_lock for pcp protection and reduce stat overhead
Date: Tue, 30 Mar 2021 20:51:54 +0200
Message-ID: <20210330205154.1fe1e479@carbon> (raw)
In-Reply-To: <20210329120648.19040-1-mgorman@techsingularity.net>

On Mon, 29 Mar 2021 13:06:42 +0100
Mel Gorman <mgorman@techsingularity.net> wrote:

> This series requires patches in Andrew's tree so the series is also
> available at
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r15
> 
> tldr: Jesper and Chuck, it would be nice to verify if this series helps
> 	the allocation rate of the bulk page allocator. RT people, this
> 	*partially* addresses some problems PREEMPT_RT has with the page
> 	allocator but it needs review.

I've run a new micro-benchmark[1] which shows:
(CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)

BASELINE
 single_page alloc+put: 194 cycles(tsc) 54.106 ns

LIST variant: time_bulk_page_alloc_free_list: step=bulk size

 Per elem: 200 cycles(tsc) 55.667 ns (step:1)
 Per elem: 143 cycles(tsc) 39.755 ns (step:2)
 Per elem: 132 cycles(tsc) 36.758 ns (step:3)
 Per elem: 128 cycles(tsc) 35.795 ns (step:4)
 Per elem: 123 cycles(tsc) 34.339 ns (step:8)
 Per elem: 120 cycles(tsc) 33.396 ns (step:16)
 Per elem: 118 cycles(tsc) 32.806 ns (step:32)
 Per elem: 115 cycles(tsc) 32.169 ns (step:64)
 Per elem: 116 cycles(tsc) 32.258 ns (step:128)

ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size

 Per elem: 195 cycles(tsc) 54.225 ns (step:1)
 Per elem: 127 cycles(tsc) 35.492 ns (step:2)
 Per elem: 117 cycles(tsc) 32.643 ns (step:3)
 Per elem: 111 cycles(tsc) 30.992 ns (step:4)
 Per elem: 106 cycles(tsc) 29.606 ns (step:8)
 Per elem: 102 cycles(tsc) 28.532 ns (step:16)
 Per elem: 99 cycles(tsc) 27.728 ns (step:32)
 Per elem: 98 cycles(tsc) 27.252 ns (step:64)
 Per elem: 97 cycles(tsc) 27.090 ns (step:128)

[1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org#micro-benchmark-page_bench04_bulk-local_lock-v1r15

This should be seen in comparison with the older micro-benchmark[2]
done on branch mm-bulk-rebase-v5r9.

BASELINE
 single_page alloc+put: Per elem: 199 cycles(tsc) 55.472 ns

LIST variant: time_bulk_page_alloc_free_list: step=bulk size

 Per elem: 206 cycles(tsc) 57.478 ns (step:1)
 Per elem: 154 cycles(tsc) 42.861 ns (step:2)
 Per elem: 145 cycles(tsc) 40.536 ns (step:3)
 Per elem: 142 cycles(tsc) 39.477 ns (step:4)
 Per elem: 142 cycles(tsc) 39.610 ns (step:8)
 Per elem: 137 cycles(tsc) 38.155 ns (step:16)
 Per elem: 135 cycles(tsc) 37.739 ns (step:32)
 Per elem: 134 cycles(tsc) 37.282 ns (step:64)
 Per elem: 133 cycles(tsc) 36.993 ns (step:128)

ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size

 Per elem: 202 cycles(tsc) 56.383 ns (step:1)
 Per elem: 144 cycles(tsc) 40.047 ns (step:2)
 Per elem: 134 cycles(tsc) 37.339 ns (step:3)
 Per elem: 128 cycles(tsc) 35.578 ns (step:4)
 Per elem: 120 cycles(tsc) 33.592 ns (step:8)
 Per elem: 116 cycles(tsc) 32.362 ns (step:16)
 Per elem: 113 cycles(tsc) 31.476 ns (step:32)
 Per elem: 110 cycles(tsc) 30.633 ns (step:64)
 Per elem: 110 cycles(tsc) 30.596 ns (step:128)

[2] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org#micro-benchmark-page_bench04_bulk

This new patchset does show some improvements in the micro-benchmark.


> The PCP (per-cpu page allocator in page_alloc.c) share locking requirements
> with vmstat which is inconvenient and causes some issues. Possibly because
> of that, the PCP list and vmstat share the same per-cpu space meaning that
> it's possible that vmstat updates dirty cache lines holding per-cpu lists
> across CPUs unless padding is used. The series splits that structure and
> separates the locking.
> 
> Second, PREEMPT_RT considers the following sequence to be unsafe
> as documented in Documentation/locking/locktypes.rst
> 
>    local_irq_disable();
>    spin_lock(&lock);
> 
> The pcp allocator has this sequence for rmqueue_pcplist (local_irq_save)
> -> __rmqueue_pcplist -> rmqueue_bulk (spin_lock). This series explicitly  
> separates the locking requirements for the PCP list (local_lock) and stat
> updates (irqs disabled). Once that is done, the length of time IRQs are
> disabled can be reduced and in some cases, IRQ disabling can be replaced
> with preempt_disable.
> 
> After that, it was very obvious that zone_statistics in particular has way
> too much overhead and leaves IRQs disabled for longer than necessary. It
> has perfectly accurate counters requiring IRQs be disabled for parallel
> RMW sequences when inaccurate ones like vm_events would do. The series
> makes the NUMA statistics (NUMA_HIT and friends) inaccurate counters that
> only require preempt be disabled.
> 
> Finally the bulk page allocator can then do all the stat updates in bulk
> with IRQs enabled which should improve the efficiency of the bulk page
> allocator. Technically, this could have been done without the local_lock
> and vmstat conversion work and the order simply reflects the timing of
> when different series were implemented.
> 
> No performance data is included because despite the overhead of the
> stats, it's within the noise for most workloads but Jesper and Chuck may
> observe a significant different with the same tests used for the bulk
> page allocator. The series is more likely to be interesting to the RT
> folk in terms of slowing getting the PREEMPT tree into mainline.

I've try to run some longer packet benchmarks later.  A quick test
showed performance was within same range, and slightly better.  The
perf report and objdump, did reveal that code layout prefers the label
"failed:" as the primary code path, which should only be used for
single page allocs, which is kind of weird. (But as performance is the
same or slightly better, I will not complain).


>  drivers/base/node.c    |  18 +--
>  include/linux/mmzone.h |  29 +++--
>  include/linux/vmstat.h |  65 ++++++-----
>  mm/mempolicy.c         |   2 +-
>  mm/page_alloc.c        | 173 ++++++++++++++++------------
>  mm/vmstat.c            | 254 +++++++++++++++--------------------------
>  6 files changed, 254 insertions(+), 287 deletions(-)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


  parent reply index

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-29 12:06 Mel Gorman
2021-03-29 12:06 ` [PATCH 1/6] mm/page_alloc: Split per cpu page lists and zone stats Mel Gorman
2021-03-29 12:06 ` [PATCH 2/6] mm/page_alloc: Convert per-cpu list protection to local_lock Mel Gorman
2021-03-31  9:55   ` Thomas Gleixner
2021-03-31 11:01     ` Mel Gorman
2021-03-31 17:42       ` Thomas Gleixner
2021-03-31 17:46         ` Thomas Gleixner
2021-03-31 20:42         ` Mel Gorman
2021-03-29 12:06 ` [PATCH 3/6] mm/vmstat: Convert NUMA statistics to basic NUMA counters Mel Gorman
2021-03-29 12:06 ` [PATCH 4/6] mm/vmstat: Inline NUMA event counter updates Mel Gorman
2021-03-29 12:06 ` [PATCH 5/6] mm/page_alloc: Batch the accounting updates in the bulk allocator Mel Gorman
2021-03-29 12:06 ` [PATCH 6/6] mm/page_alloc: Reduce duration that IRQs are disabled for VM counters Mel Gorman
2021-03-30 18:51 ` Jesper Dangaard Brouer [this message]
2021-03-31  7:38   ` [RFC PATCH 0/6] Use local_lock for pcp protection and reduce stat overhead Mel Gorman
2021-03-31  8:17     ` Jesper Dangaard Brouer
2021-03-31  8:52 ` Mel Gorman
2021-03-31  9:51   ` Thomas Gleixner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210330205154.1fe1e479@carbon \
    --to=brouer@redhat.com \
    --cc=chuck.lever@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-rt-users@vger.kernel.org \
    --cc=mgorman@techsingularity.net \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-rt-users Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-rt-users/0 linux-rt-users/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-rt-users linux-rt-users/ https://lore.kernel.org/linux-rt-users \
		linux-rt-users@vger.kernel.org
	public-inbox-index linux-rt-users

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-rt-users


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git