From: Mel Gorman <mgorman@techsingularity.net>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	mhocko@kernel.org, ying.huang@intel.com, s.priebe@profihost.ag,
	Linux List Kernel Mailing <linux-kernel@vger.kernel.org>,
	alex.williamson@redhat.com, lkp@01.org,
	David Rientjes <rientjes@google.com>,
	kirill@shutemov.name, Andrew Morton <akpm@linux-foundation.org>,
	zi.yan@cs.rutgers.edu
Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
Date: Wed, 5 Dec 2018 10:06:24 +0000
Message-ID: <20181205100624.GX23260@techsingularity.net>
In-Reply-To: <20181204104558.GV23260@techsingularity.net>

On Tue, Dec 04, 2018 at 10:45:58AM +0000, Mel Gorman wrote:
> I have *one* result of the series on a 1-socket machine running
> "thpscale". It creates a file, punches holes in it to create a
> very light form of fragmentation and then tries THP allocations
> using madvise measuring latency and success rates. It's the
> global-dhp__workload_thpscale-madvhugepage in mmtests using XFS as the
> filesystem.
> 
> thpscale Fault Latencies
>                                     4.20.0-rc4             4.20.0-rc4
>                                 mmots-20181130       gfpthisnode-v1r1
> Amean     fault-base-3      5358.54 (   0.00%)     2408.93 *  55.04%*
> Amean     fault-base-5      9742.30 (   0.00%)     3035.25 *  68.84%*
> Amean     fault-base-7     13069.18 (   0.00%)     4362.22 *  66.62%*
> Amean     fault-base-12    14882.53 (   0.00%)     9424.38 *  36.67%*
> Amean     fault-base-18    15692.75 (   0.00%)    16280.03 (  -3.74%)
> Amean     fault-base-24    28775.11 (   0.00%)    18374.84 *  36.14%*
> Amean     fault-base-30    42056.32 (   0.00%)    21984.55 *  47.73%*
> Amean     fault-base-32    38634.26 (   0.00%)    22199.49 *  42.54%*
> Amean     fault-huge-1         0.00 (   0.00%)        0.00 (   0.00%)
> Amean     fault-huge-3      3628.86 (   0.00%)      963.45 *  73.45%*
> Amean     fault-huge-5      4926.42 (   0.00%)     2959.85 *  39.92%*
> Amean     fault-huge-7      6717.15 (   0.00%)     3828.68 *  43.00%*
> Amean     fault-huge-12    11393.47 (   0.00%)     5772.92 *  49.33%*
> Amean     fault-huge-18    16979.38 (   0.00%)     4435.95 *  73.87%*
> Amean     fault-huge-24    16558.00 (   0.00%)     4416.46 *  73.33%*
> Amean     fault-huge-30    20351.46 (   0.00%)     5099.73 *  74.94%*
> Amean     fault-huge-32    23332.54 (   0.00%)     6524.73 *  72.04%*
> 
> So, looks like massive latency improvements but then the THP allocation
> success rates
> 
> thpscale Percentage Faults Huge
>                                4.20.0-rc4             4.20.0-rc4
>                            mmots-20181130       gfpthisnode-v1r1
> Percentage huge-3        95.14 (   0.00%)        7.94 ( -91.65%)
> Percentage huge-5        91.28 (   0.00%)        5.00 ( -94.52%)
> Percentage huge-7        86.87 (   0.00%)        9.36 ( -89.22%)
> Percentage huge-12       83.36 (   0.00%)       21.03 ( -74.78%)
> Percentage huge-18       83.04 (   0.00%)       30.73 ( -63.00%)
> Percentage huge-24       83.74 (   0.00%)       27.47 ( -67.20%)
> Percentage huge-30       83.66 (   0.00%)       31.85 ( -61.93%)
> Percentage huge-32       83.89 (   0.00%)       29.09 ( -65.32%)
> 

Other results arrived once the grid caught up and it's a mixed bag of
gains and losses, roughly along the lines already predicted by the
discussion -- namely locality is better as long as the workload fits in
a single node, compaction is reduced, reclaim is reduced, THP allocation
success rates are reduced but latencies are often better.

Whether this is "good" or "bad" depends on whether you have a workload
that benefits, because it's neither universally good nor bad. It would
still be nice to hear how Andrea fared but I think we'll reach the same
conclusion -- the patches shuffle the problem around with limited effort
to address the root causes, so all we end up changing is the identity of
the person who complains about their workload. One might be tempted to
think that the reduced latencies in some cases are great, but not if the
workload is one that accepts higher startup costs in exchange for lower
runtime costs in the active phase.

For the much longer answer, I'll focus on the two-socket results because
they are more relevant to the current discussion. The workloads are
not realistic in the slightest, they just happen to trigger some of the
interesting corner cases.

global-dhp__workload_usemem-stress-numa-compact
o Plain anonymous faulting workload
o defrag=always (not representative, simply triggers a bad case)
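
For reference, the fault pattern being exercised is roughly the
following. This is a minimal sketch rather than the actual usemem
source and the 1G size is an illustrative value; the point is that
nothing madvises the mapping so the THP behaviour comes entirely from
the system-wide transparent_hugepage settings.

/*
 * Plain anonymous faulting, no madvise(). With transparent_hugepage
 * enabled and defrag=always in
 * /sys/kernel/mm/transparent_hugepage/defrag, even these ordinary
 * faults may stall in reclaim/compaction trying to get a THP, which
 * is why the configuration is not representative.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 1UL << 30;         /* 1G of anonymous memory */
        char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (map == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        memset(map, 1, len);            /* first touch faults every page */

        munmap(map, len);
        return 0;
}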

                               4.20.0-rc4             4.20.0-rc4
                           mmots-20181130       gfpthisnode-v1r1
Amean     Elapsd-1       26.79 (   0.00%)       34.92 * -30.37%*
Amean     Elapsd-3        7.32 (   0.00%)        8.10 * -10.61%*
Amean     Elapsd-4        5.53 (   0.00%)        5.64 (  -1.94%)

Units are seconds; time to complete is 30.37% worse for the
single-threaded case. There is no direct reclaim activity but other
activity is interesting and I'll pick out some snippets:

                            4.20.0-rc4  4.20.0-rc4
                        mmots-20181130  gfpthisnode-v1r1
Swap Ins                             8           0
Swap Outs                         1546           0
Allocation stalls                    0           0
Fragmentation stalls                 0        2022
Direct pages scanned                 0           0
Kswapd pages scanned             42719        1078
Kswapd pages reclaimed           41082        1049
Page writes by reclaim            1546           0
Page writes file                     0           0
Page writes anon                  1546           0
Page reclaim immediate               2           0

The baseline kernel swaps out (bad), David's patch reclaims less (good).
That's reasonably positive. Less positive is that fragmentation stalls
are triggered with David's patch. This is due to a patch of mine in
Andrew's tree which I've asked him to drop: while it helps control
long-term fragmentation, there was always a risk that the short stalls
would be problematic, and it's a distraction here.

THP fault alloc                 540043      456714
THP fault fallback                   0       83329
THP collapse alloc                   0           4
THP collapse fail                    0           0
THP split                            1           0
THP split failed                     0           0

David's patch falls back to base page allocation to a much higher degree
(bad).

Compaction pages isolated        85381       11432
Compaction migrate scanned      204787       42635
Compaction free scanned          72376       13061
Compact scan efficiency           282%        326%

David's patch also compacts less.

NUMA alloc hit                 1188182     1093244
NUMA alloc miss                  68199    42764192
NUMA interleave hit                  0           0
NUMA alloc local               1179614     1084665
NUMA base PTE updates         28902547    23270389
NUMA huge PMD updates            56437       45438
NUMA page range updates       57798291    46534645
NUMA hint faults                 61395       47838
NUMA hint local faults           46440       47833
NUMA hint local percent            75%         99%
NUMA pages migrated            2000156           5

Interestingly, the NUMA misses are higher with David's patch, indicating
that it's allocating *more* from remote nodes. However, there are also
hints that the accessing process then moves to the remote node, whereas
the current mmotm kernel instead tries to migrate the memory locally.
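
As an aside, one way to see this placement directly instead of
inferring it from the counters is to query the backing node of each
page after faulting. A sketch using move_pages() with a NULL node array
(query-only, nothing is migrated; builds with -lnuma and uses an
arbitrary 16M buffer):

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        long page_size = sysconf(_SC_PAGESIZE);
        size_t len = 16UL << 20;                /* 16M, illustrative */
        unsigned long count = len / page_size;
        char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        void **pages = malloc(count * sizeof(void *));
        int *status = malloc(count * sizeof(int));
        unsigned long i;

        if (map == MAP_FAILED || !pages || !status) {
                perror("setup");
                return 1;
        }

        memset(map, 1, len);                    /* fault everything first */
        for (i = 0; i < count; i++)
                pages[i] = map + i * page_size;

        /* A NULL nodes array means "report where each page is" */
        if (move_pages(0, count, pages, NULL, status, 0)) {
                perror("move_pages");
                return 1;
        }

        for (i = 0; i < count; i++)
                printf("page %lu is on node %d\n", i, status[i]);

        return 0;
}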

So, in line with expectations. The baseline kernel works harder to
allocate the THPs whereas David's gives up quickly and moves on. At
one level this is good, but the bottom line is that the total time to
complete the workload goes from the baseline of 280 seconds up to 344
seconds. This is overall mixed because depending on what you look at,
it's both good and bad.

global-dhp__workload_thpscale-xfs
o Workload creates a large file, punches holes in it
o Mapping is created and faulted to measure allocation success rates and
  latencies
o No special madvise
o Considered a relatively "simple" case
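
For anyone unfamiliar with the benchmark, the pattern is approximately
the one below. It is a rough sketch, not the mmtests implementation --
the real thing runs multiple threads, reports latency percentiles and
checks whether each fault was actually backed by a huge page -- and the
file name, sizes and hole layout are made-up values.

/*
 * Rough approximation of the thpscale pattern: populate a file,
 * punch holes in it to lightly fragment memory, then fault an
 * anonymous mapping and time each fault. Illustrative only.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define FILE_SIZE       (1UL << 30)     /* 1G backing file */
#define HOLE_STRIDE     (4UL << 20)     /* punch a hole every 4M */
#define HOLE_SIZE       (64UL << 10)    /* each hole is 64K */
#define MAP_SIZE        (512UL << 20)   /* anonymous mapping to fault */
#define HPAGE_SIZE      (2UL << 20)     /* one touch per huge page extent */

int main(void)
{
        int fd = open("thpscale.dat", O_CREAT | O_RDWR | O_TRUNC, 0600);
        char *buf = malloc(HOLE_STRIDE);
        char *map;
        size_t off;

        if (fd < 0 || !buf) {
                perror("setup");
                return 1;
        }

        /* Populate the file so the page cache occupies memory */
        memset(buf, 1, HOLE_STRIDE);
        for (off = 0; off < FILE_SIZE; off += HOLE_STRIDE)
                if (pwrite(fd, buf, HOLE_STRIDE, off) < 0)
                        perror("pwrite");

        /* Punch holes to create a light form of fragmentation */
        for (off = 0; off < FILE_SIZE; off += HOLE_STRIDE)
                if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                              off, HOLE_SIZE))
                        perror("fallocate");

        /* Fault an anonymous mapping and time each fault */
        map = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (map == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        for (off = 0; off < MAP_SIZE; off += HPAGE_SIZE) {
                struct timespec start, end;
                long ns;

                clock_gettime(CLOCK_MONOTONIC, &start);
                map[off] = 1;           /* first touch triggers the fault */
                clock_gettime(CLOCK_MONOTONIC, &end);

                ns = (end.tv_sec - start.tv_sec) * 1000000000L +
                     (end.tv_nsec - start.tv_nsec);
                printf("fault at offset %zu: %ld ns\n", off, ns);
        }

        munmap(map, MAP_SIZE);
        close(fd);
        unlink("thpscale.dat");
        return 0;
}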

                                    4.20.0-rc4             4.20.0-rc4
                                mmots-20181130       gfpthisnode-v1r1
Amean     fault-base-3      2021.05 (   0.00%)     2633.11 * -30.28%*
Amean     fault-base-5      2475.25 (   0.00%)     2997.15 * -21.08%*
Amean     fault-base-7      5595.79 (   0.00%)     7523.10 ( -34.44%)
Amean     fault-base-12    15604.91 (   0.00%)    16355.02 (  -4.81%)
Amean     fault-base-18    20277.13 (   0.00%)    22062.73 (  -8.81%)
Amean     fault-base-24    24218.46 (   0.00%)    25772.49 (  -6.42%)
Amean     fault-base-30    28516.75 (   0.00%)    28208.14 (   1.08%)
Amean     fault-base-32    36722.30 (   0.00%)    20712.46 *  43.60%*
Amean     fault-huge-1         0.00 (   0.00%)        0.00 (   0.00%)
Amean     fault-huge-3       685.38 (   0.00%)      512.02 *  25.29%*
Amean     fault-huge-5      3639.75 (   0.00%)      807.33 (  77.82%)
Amean     fault-huge-7      1139.54 (   0.00%)      555.45 *  51.26%*
Amean     fault-huge-12     1012.64 (   0.00%)      850.68 (  15.99%)
Amean     fault-huge-18     6694.45 (   0.00%)     1310.39 *  80.43%*
Amean     fault-huge-24    10165.27 (   0.00%)     3822.23 *  62.40%*
Amean     fault-huge-30    13496.19 (   0.00%)    19248.06 * -42.62%*
Amean     fault-huge-32     4477.05 (   0.00%)    63463.78 *-1317.54%*

These latency outliers can be huge so take them with a grain of salt.
Sometimes I'll look at the percentiles but it takes an age to discuss.

In general, David's patch faults huge pages faster, particularly with
higher thread counts. The allocation success rates are also mostly better

                               4.20.0-rc4             4.20.0-rc4
                           mmots-20181130       gfpthisnode-v1r1
Percentage huge-3         2.86 (   0.00%)       26.48 ( 825.27%)
Percentage huge-5         1.07 (   0.00%)        1.41 (  31.16%)
Percentage huge-7        20.38 (   0.00%)       54.82 ( 168.94%)
Percentage huge-12       19.07 (   0.00%)       38.10 (  99.76%)
Percentage huge-18       10.72 (   0.00%)       30.18 ( 181.49%)
Percentage huge-24        8.44 (   0.00%)       15.48 (  83.39%)
Percentage huge-30        7.41 (   0.00%)       10.78 (  45.38%)
Percentage huge-32       29.08 (   0.00%)        3.23 ( -88.91%)

Overall system activity looks similar, which is counter-intuitive. The
only hint of what is going on is that David's patch reclaims less from
kswapd context. Direct reclaim scanning is high in both cases but does
not reclaim much. David's patch scans for free pages as compaction
targets much more aggressively but there is no indication as to why.
Locality information looks similar.

So, I'm not sure what to think about this one. The headline results look
good but there is no obvious explanation as to why exactly. It could be
that the stalls (higher with David's patch) mean there is less
interference between threads, but that's thin.

global-dhp__workload_thpscale-madvhugepage-xfs
o Same as above except that MADV_HUGEPAGE is used
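
Relative to the sketch above, the only difference on the faulting side
is an madvise() call before the memory is touched; roughly the
following, again with an arbitrary size:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 64UL << 20;        /* 64M, illustrative */
        size_t off;
        char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (map == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /*
         * Explicitly request THP for the range. The fault path is then
         * willing to work much harder (reclaim and compaction) to
         * provide huge pages, which is what this configuration measures.
         */
        if (madvise(map, len, MADV_HUGEPAGE))
                perror("madvise");

        for (off = 0; off < len; off += 2UL << 20)
                map[off] = 1;           /* first touch faults each 2M extent */

        munmap(map, len);
        return 0;
}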

                                    4.20.0-rc4             4.20.0-rc4
                                mmots-20181130       gfpthisnode-v1r1
Amean     fault-base-1         0.00 (   0.00%)        0.00 (   0.00%)
Amean     fault-base-3     18880.35 (   0.00%)     6341.60 *  66.41%*
Amean     fault-base-5     27608.74 (   0.00%)     6515.10 *  76.40%*
Amean     fault-base-7     28345.03 (   0.00%)     7529.98 *  73.43%*
Amean     fault-base-12    35690.33 (   0.00%)    13518.77 *  62.12%*
Amean     fault-base-18    56538.31 (   0.00%)    23933.91 *  57.67%*
Amean     fault-base-24    71485.33 (   0.00%)    26927.03 *  62.33%*
Amean     fault-base-30    54286.39 (   0.00%)    23453.61 *  56.80%*
Amean     fault-base-32    92143.50 (   0.00%)    19474.99 *  78.86%*
Amean     fault-huge-1         0.00 (   0.00%)        0.00 (   0.00%)
Amean     fault-huge-3      5666.72 (   0.00%)     1351.55 *  76.15%*
Amean     fault-huge-5      8307.35 (   0.00%)     2776.28 *  66.58%*
Amean     fault-huge-7     10651.96 (   0.00%)     2397.70 *  77.49%*
Amean     fault-huge-12    15489.56 (   0.00%)     7034.98 *  54.58%*
Amean     fault-huge-18    20278.54 (   0.00%)     6417.46 *  68.35%*
Amean     fault-huge-24    29378.24 (   0.00%)    16173.41 *  44.95%*
Amean     fault-huge-30    29237.66 (   0.00%)    81198.70 *-177.72%*
Amean     fault-huge-32    27177.37 (   0.00%)    18966.08 *  30.21%*

A superb improvement in latencies, coupled with the following:

                               4.20.0-rc4             4.20.0-rc4
                           mmots-20181130       gfpthisnode-v1r1
Percentage huge-1         0.00 (   0.00%)        0.00 (   0.00%)
Percentage huge-3        99.74 (   0.00%)       49.62 ( -50.25%)
Percentage huge-5        99.24 (   0.00%)       12.19 ( -87.72%)
Percentage huge-7        97.98 (   0.00%)       19.20 ( -80.40%)
Percentage huge-12       95.76 (   0.00%)       21.33 ( -77.73%)
Percentage huge-18       94.91 (   0.00%)       31.63 ( -66.67%)
Percentage huge-24       94.36 (   0.00%)        9.27 ( -90.18%)
Percentage huge-30       92.15 (   0.00%)        9.60 ( -89.58%)
Percentage huge-32       94.18 (   0.00%)        8.67 ( -90.79%)

THP allocation success rates are through the floor which is why
latencies overall are better.
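
For completeness, an application that really cares can check how much
of its address space actually ended up backed by huge pages instead of
trusting the latency figures. A sketch that simply sums the
AnonHugePages fields in /proc/self/smaps follows; it's one way of
checking and not necessarily how thpscale itself accounts for success:

#include <stdio.h>

/* Sum the AnonHugePages: fields in /proc/self/smaps, in kilobytes */
static long anon_huge_kb(void)
{
        FILE *fp = fopen("/proc/self/smaps", "r");
        char line[256];
        long total = 0, kb;

        if (!fp)
                return -1;

        while (fgets(line, sizeof(line), fp))
                if (sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
                        total += kb;

        fclose(fp);
        return total;
}

int main(void)
{
        printf("AnonHugePages: %ld kB\n", anon_huge_kb());
        return 0;
}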

This goes back to the fundamental question -- does your workload benefit
from THP or not, and is it the primary metric? If yes (potentially the
case with KVM) then this is a disaster. It's actually a mixed bag for
David because THP was desired but so was locality. In this case, the
application specifically requested THP, so presumably a real application
specifying the flag means it.

The high-level system stats reflect the level of effort: David's patch
does less work in the system, which is both good and bad depending on
your requirements.

                            4.20.0-rc4  4.20.0-rc4
                        mmots-20181130  gfpthisnode-v1r1
Swap Ins                          1564           0
Swap Outs                        12283         163
Allocation stalls                30236          24
Fragmentation stalls              1069       24683

Baseline kernel swaps and has high allocation stalls to reclaim memory.
David's patch stalls on trying to control fragmentation instead.

                            4.20.0-rc4  4.20.0-rc4
                        mmots-20181130  gfpthisnode-v1r1
Direct pages scanned          12780511     9955217
Kswapd pages scanned           1944181    16554296
Kswapd pages reclaimed          870023     4029534
Direct pages reclaimed         6738924        5884
Kswapd efficiency                  44%         24%
Kswapd velocity               1308.975   11200.850
Direct efficiency                  52%          0%
Direct velocity               8604.840    6735.828

The baseline kernel does much of the reclaim work in direct context
while David's does it in kswapd context.

THP fault alloc                 316843      238810
THP fault fallback               17224       95256
THP collapse alloc                   2           0
THP collapse fail                    0           5
THP split                       177536      180673
THP split failed                 10024           2

The baseline kernel allocates THPs while David's falls back to base pages.

Compaction stalls               100198       75267
Compaction success               65803        3964
Compaction failures              34395       71303
Compaction efficiency              65%          5%
Page migrate success          40807601    17963914
Page migrate failure             16206       16782
Compaction pages isolated     90818819    41285100
Compaction migrate scanned    98628306    36990342
Compaction free scanned     6547623619  6870889207

Unsurprisingly, David's patch tries to compact less. The collapse
activity shows that not enough time passed for khugepaged to intervene.
While outside the context of the current discussion, that compaction
free scanning activity is mental but also unsurprising; a lot of it is
from kcompactd activity and there's a separate series to deal with that.

Given the mix of gains and losses, the patch simply shuffles the problem
around in a circle. Some workloads benefit, some don't and whether it's
merged or not merged, someone ends up annoyed as their workload suffers.

I know I didn't review the patch in much detail because in this context
it was more interesting to know "what it does" than the specifics of the
approach. I'm going to go back to hopping my face off the compaction
series because I think it has the potential to reduce the problem
overall instead of shuffling the deckchairs around the Titanic[1].

[1] Famous last words, the series could end up being the iceberg

-- 
Mel Gorman
SUSE Labs
