From: "ying.huang@intel.com" <ying.huang@intel.com>
To: Aaron Lu <aaron.lu@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>,
kernel test robot <oliver.sang@intel.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Vlastimil Babka <vbabka@suse.cz>,
Dave Hansen <dave.hansen@linux.intel.com>,
Jesper Dangaard Brouer <brouer@redhat.com>,
Michal Hocko <mhocko@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
LKML <linux-kernel@vger.kernel.org>,
lkp@lists.01.org, lkp@intel.com, feng.tang@intel.com,
zhengjun.xing@linux.intel.com, fengwei.yin@intel.com
Subject: Re: [mm/page_alloc] f26b3fa046: netperf.Throughput_Mbps -18.0% regression
Date: Sat, 07 May 2022 15:11:41 +0800
Message-ID: <ae763d63e50d14650c5762103d113934412bef57.camel@intel.com>
In-Reply-To: <YnXnLuYjmEWdVyBP@ziqianlu-desk1>

On Sat, 2022-05-07 at 11:27 +0800, Aaron Lu wrote:
> On Sat, May 07, 2022 at 08:54:35AM +0800, ying.huang@intel.com wrote:
> > On Fri, 2022-05-06 at 20:17 +0800, Aaron Lu wrote:
> > > On Fri, May 06, 2022 at 04:40:45PM +0800, ying.huang@intel.com wrote:
> > > > On Fri, 2022-04-29 at 19:29 +0800, Aaron Lu wrote:
> > > > > Hi Mel,
> > > > >
> > > > > On Wed, Apr 20, 2022 at 09:35:26AM +0800, kernel test robot wrote:
> > > > > >
> > > > > > > (please note that we reported
> > > > > > "[mm/page_alloc] 39907a939a: netperf.Throughput_Mbps -18.1% regression"
> > > > > > on
> > > > > > https://lore.kernel.org/all/20220228155733.GF1643@xsang-OptiPlex-9020/
> > > > > > while the commit is on branch.
> > > > > > now we still observe similar regression when it's on mainline, and we also
> > > > > > observe a 13.2% improvement on another netperf subtest.
> > > > > > so report again for information)
> > > > > >
> > > > > > Greeting,
> > > > > >
> > > > > > FYI, we noticed a -18.0% regression of netperf.Throughput_Mbps due to commit:
> > > > > >
> > > > > >
> > > > > > commit: f26b3fa046116a7dedcaafe30083402113941451 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free")
> > > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > > > >
> > > > >
> > > > > So what this commit did is: if a CPU is always doing free (pcp->free_factor > 0)
> > > >
> > > > IMHO, this means the consumer and producer are running on different
> > > > CPUs.
> > > >
> > >
> > > Right.
> > >
> > > > > and if the order of the high-order page being freed is <= PAGE_ALLOC_COSTLY_ORDER,
> > > > > then do not use PCP but free the page directly to buddy.
> > > > >
> > > > > The rationale as explained in the commit's changelog is:
> > > > > "
> > > > > Netperf running on localhost exhibits this pattern and while it does not
> > > > > matter for some machines, it does matter for others with smaller caches
> > > > > where cache misses cause problems due to reduced page reuse. Pages
> > > > > freed directly to the buddy list may be reused quickly while still cache
> > > > > hot where as storing on the PCP lists may be cold by the time
> > > > > free_pcppages_bulk() is called.
> > > > > "
> > > > >
> > > > > This regression occurred on a machine that has large caches, so this
> > > > > optimization brings no value to it, only overhead (skipped PCP). I
> > > > > guess this is the reason why there is a regression.
> > > >
> > > > Per my understanding, not only the cache size is larger, but also the L2
> > > > cache (1MB) is per-core on this machine. So if the consumer and
> > > > producer are running on different cores, the cache-hot page may cause
> > > > more core-to-core cache transfer. This may hurt performance too.
> > > >
> > >
> > > Client side allocates skb(page) and server side recvfrom() it.
> > > recvfrom() copies the page data to server's own buffer and then releases
> > > the page associated with the skb. Client does all the allocation and
> > > server does all the free, page reuse happens at client side.
> > > So I think core-2-core cache transfer due to page reuse can occur when
> > > client task migrates.
> >
> > The core-to-core cache transferring can be cross-socket or cross-L2 in
> > one socket. I mean the latter one.
> >
> > > I have modified the job to have the client and server bound to a
> > > specific CPU of different cores on the same node, and testing it on the
> > > same Icelake 2 sockets server, the result is
> > >
> > > kernel          throughput
> > > 8b10b465d0e1    125168
> > > f26b3fa04611    102039    -18%
> > >
> > > It's also an 18% drop. I think this means c2c is not a factor?
> >
> > Can you test with client and server bound to 2 hardware threads
> > (hyperthread) of one core? The two hardware threads of one core will
> > share the L2 cache.
> >
>
> 8b10b465d0e1: 89702
> f26b3fa04611: 95823 +6.8%
>
> When binding client and server on the 2 threads of the same core, the
> bisected commit is now an improvement on this 2-socket Icelake server.

Good. I guess cache-hot works now.
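For anyone who wants to repeat the same-core binding, here is a minimal sketch of how the two hyperthreads of one core could be identified. It assumes the sibling list is read from /sys/devices/system/cpu/cpuN/topology/thread_siblings_list, which uses the cpulist format (e.g. "0,64" or "2-3"). The helper name parse_cpulist() is hypothetical and is not taken from the test job used in this thread.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a sysfs cpulist string such as "0,64" or
 * "2-3" and store up to `max` CPU ids in `out`.
 * Returns the number of ids stored, or -1 if the list is malformed.
 */
static int parse_cpulist(const char *s, int *out, int max)
{
    int n = 0;

    assert(s != NULL && out != NULL);
    while (*s != '\0' && *s != '\n') {
        char *end;
        long lo = strtol(s, &end, 10);
        long hi = lo;

        if (end == s || lo < 0)
            return -1;              /* no digits where a CPU id was expected */
        s = end;
        if (*s == '-') {            /* range such as "2-3" */
            s++;
            hi = strtol(s, &end, 10);
            if (end == s || hi < lo)
                return -1;
            s = end;
        }
        for (long c = lo; c <= hi && n < max; c++)
            out[n++] = (int)c;
        if (*s == ',')
            s++;                    /* move on to the next entry */
        else if (*s != '\0' && *s != '\n')
            return -1;              /* unexpected character */
    }
    return n;
}
```

The two ids returned for one core could then be used to pin the client and the server, for example with taskset -c or sched_setaffinity(2). How the job in this thread actually did the binding is not stated, so treat this only as one possible way.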
> > > > > I have also tested this case on a small machine: a skylake desktop and
> > > > > this commit shows improvement:
> > > > > 8b10b465d0e1: "netperf.Throughput_Mbps": 72288.76,
> > > > > f26b3fa04611: "netperf.Throughput_Mbps": 90784.4, +25.6%
> > > > >
> > > > > So this means those directly freed pages get reused by allocator side
> > > > > and that brings performance improvement for machines with smaller cache.
> > > >
> > > > Per my understanding, the L2 cache on this desktop machine is shared
> > > > among cores.
> > > >
> > >
> > > The said CPU is i7-6700 and according to this wikipedia page,
> > > L2 is per core:
> > > https://en.wikipedia.org/wiki/Skylake_(microarchitecture)#Mainstream_desktop_processors
> >
> > Sorry, my memory was wrong. Skylake and later server CPUs have a much
> > larger private L2 cache (1MB vs. 256KB on client parts); this may
> > increase the possibility of core-2-core transferring.
> >
>
> I'm trying to understand where the core-2-core cache transfer is.
>
> When server needs to do the copy in recvfrom(), there is core-2-core
> cache transfer from client cpu to server cpu. But this is the same no
> matter page gets reused or not, i.e. the bisected commit and its parent
> don't differ in this step.

Yes.

> Then when page gets reused in
> the client side, there is no core-2-core cache transfer as the server
> side didn't write to the page's data.

The "reused" pages were read by the server side, so their cache lines
are in the "shared" state. Some inter-core traffic is needed to
invalidate the server's copies of these cache lines before the client
side can write them. This incurs some overhead.
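To make the argument concrete, here is a toy two-core model of the state transitions. It is only a sketch under simplifying assumptions: a textbook MESI protocol, one cache line, and one message counted per cross-core action. It is not a description of the real coherence protocol on these CPUs, and the names (struct line, core_read(), core_write()) are made up for illustration.

```c
#include <assert.h>

/* Textbook MESI states for one cache line in one core's private cache. */
enum line_state { INVALID, SHARED, EXCLUSIVE, MODIFIED };

/* One cache line as seen by two cores: st[0] = client, st[1] = server. */
struct line {
    enum line_state st[2];
};

/* Core `c` reads the line. Returns 1 if the other core had to be involved. */
static int core_read(struct line *l, int c)
{
    int other = 1 - c;

    assert(c == 0 || c == 1);
    if (l->st[c] != INVALID)
        return 0;                   /* local hit, no traffic */
    if (l->st[other] != INVALID) {
        l->st[other] = SHARED;      /* owner downgrades, data is forwarded */
        l->st[c] = SHARED;
        return 1;
    }
    l->st[c] = EXCLUSIVE;           /* filled from memory in this model */
    return 0;
}

/* Core `c` writes the line. Returns 1 if the other copy had to be invalidated. */
static int core_write(struct line *l, int c)
{
    int other = 1 - c;
    int msgs;

    assert(c == 0 || c == 1);
    msgs = (l->st[other] != INVALID) ? 1 : 0;
    l->st[other] = INVALID;
    l->st[c] = MODIFIED;
    return msgs;
}
```

In this model, the sequence core_write(&l, 0), core_read(&l, 1), core_write(&l, 0) costs one transfer for the server's read and one more invalidation when the client writes the reused line. A write to a line the server never read costs none. The server's read is the same with or without reuse; the extra invalidation is the part I mean.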
Best Regards,
Huang, Ying

> So page reuse or not, it
> shouldn't cause any difference regarding core-2-core cache transfer.
> Is this correct?
>
> > > > > I wonder if we should still use PCP a little bit under the above said
> > > > > condition, for the purpose of:
> > > > > 1 reduced overhead in the free path for machines with large cache;
> > > > > 2 still keeps the benefit of reused pages for machines with smaller cache.
> > > > >
> > > > > For this reason, I tested increasing nr_pcp_high() from returning 0 to
> > > > > either returning pcp->batch or (pcp->batch << 2):
> > > > > machine \ nr_pcp_high() returns:   pcp->high   0        pcp->batch   (pcp->batch << 2)
> > > > > skylake desktop:                   72288       90784    92219        91528
> > > > > icelake 2sockets:                  120956      99177    98251        116108
> > > > >
> > > > > Note that returning pcp->high from nr_pcp_high() is the behaviour of
> > > > > this commit's parent, and returning 0 is the behaviour of this commit.
> > > > >
> > > > > The result shows, if we effectively use a PCP high as (pcp->batch << 2)
> > > > > for the described condition, then this workload's performance on
> > > > > small machine can remain while the regression on large machines can be
> > > > > greatly reduced (from -18% to -4%).
> > > > >
> > > >
> > > > Can we use cache size and topology information directly?
> > >
> > > It can be complicated by the fact that the system can have multiple
> > > producers(cpus that are doing free) running at the same time and getting
> > > the perfect number can be a difficult job.
> >
> > We can discuss this after verifying whether it's related to core-2-core
> > transferring.
> >
> > Best Regards,
> > Huang, Ying
> >
> >