* Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
@ 2020-09-30 19:27 Sebastiaan Meijer
2020-10-01 12:30 ` Michal Hocko
0 siblings, 1 reply; 23+ messages in thread
From: Sebastiaan Meijer @ 2020-09-30 19:27 UTC (permalink / raw)
To: mhocko
Cc: akpm, buddy.lumpkin, hannes, linux-kernel, linux-mm, mgorman,
riel, willy
> yes it shows the bottleneck but it is quite artificial. Read data is
> usually processed and/or written back and that changes the picture a
> lot.
Apologies for reviving an ancient thread (and apologies in advance for my lack
of knowledge on how mailing lists work), but I'd like to offer up another
reason why merging this might be a good idea.
From what I understand, zswap runs its compression on the same kswapd thread,
limiting it to a single thread for compression. Given enough processing power,
zswap can get great throughput using heavier compression algorithms like zstd,
but this is currently greatly limited by the lack of threading.
People on other sites have claimed that applying this patchset greatly
improved zswap performance on their systems, even with lighter compression
algorithms. For me personally, I currently have a swap-heavy zswap-enabled
server whose single-threaded kswapd0 constantly consumes 100% CPU, and
performance is suffering because of it.
The server has 32 cores sitting mostly idle that I'd love to put to zswap work.
This setup could be considered a corner case, but it's definitely a
production workload that would greatly benefit from this change.
--
Sebastiaan Meijer
^ permalink raw reply [flat|nested] 23+ messages in thread
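A quick way to confirm the reclaim picture described above is to compare the
kswapd and direct-reclaim scan counters. This is a minimal sketch (not from
the thread) that reads /proc/vmstat; the prefix match also sums the per-zone
variants that older kernels export:

```shell
# Compare background (kswapd) and direct reclaim page-scan counters.
# On a healthy system pgscan_direct stays near zero; a saturated kswapd
# shows up as a steadily climbing direct-scan count.
kswapd_scans=$(awk '$1 ~ /^pgscan_kswapd/ { s += $2 } END { print s + 0 }' /proc/vmstat)
direct_scans=$(awk '$1 ~ /^pgscan_direct/ { s += $2 } END { print s + 0 }' /proc/vmstat)
echo "pgscan_kswapd=${kswapd_scans} pgscan_direct=${direct_scans}"
```

Sampling the two values over an interval (rather than reading the raw
counters once) shows whether direct reclaim is happening right now.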
* Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
From: Michal Hocko @ 2020-10-01 12:30 UTC (permalink / raw)
To: Sebastiaan Meijer
Cc: akpm, buddy.lumpkin, hannes, linux-kernel, linux-mm, mgorman,
    riel, willy

On Wed 30-09-20 21:27:12, Sebastiaan Meijer wrote:
> > yes it shows the bottleneck but it is quite artificial. Read data is
> > usually processed and/or written back and that changes the picture a
> > lot.
> Apologies for reviving an ancient thread (and apologies in advance for my lack
> of knowledge on how mailing lists work), but I'd like to offer up another
> reason why merging this might be a good idea.
>
> From what I understand, zswap runs its compression on the same kswapd thread,
> limiting it to a single thread for compression. Given enough processing power,
> zswap can get great throughput using heavier compression algorithms like zstd,
> but this is currently greatly limited by the lack of threading.

Isn't this a problem of the zswap implementation rather than general
kswapd reclaim? Why doesn't zswap do the same as normal swap out, in a
context outside of the reclaim?

My recollection of the particular patch is dim, but I do remember that it
tried to add more kswapd threads, which would just paper over the problem
you are seeing rather than solve it.
--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
From: Sebastiaan Meijer @ 2020-10-01 16:18 UTC (permalink / raw)
To: Michal Hocko
Cc: akpm, buddy.lumpkin, hannes, linux-kernel, linux-mm, mgorman,
    riel, willy

(Apologies for messing up the mailing list thread; Gmail had fooled me into
believing that it properly picked up the thread.)

On Thu, 1 Oct 2020 at 14:30, Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 30-09-20 21:27:12, Sebastiaan Meijer wrote:
> > > yes it shows the bottleneck but it is quite artificial. Read data is
> > > usually processed and/or written back and that changes the picture a
> > > lot.
> > Apologies for reviving an ancient thread (and apologies in advance for
> > my lack of knowledge on how mailing lists work), but I'd like to offer
> > up another reason why merging this might be a good idea.
> >
> > From what I understand, zswap runs its compression on the same kswapd
> > thread, limiting it to a single thread for compression. Given enough
> > processing power, zswap can get great throughput using heavier
> > compression algorithms like zstd, but this is currently greatly
> > limited by the lack of threading.
>
> Isn't this a problem of the zswap implementation rather than general
> kswapd reclaim? Why doesn't zswap do the same as normal swap out, in a
> context outside of the reclaim?

I wouldn't be able to tell you; the documentation on zswap is fairly
limited from what I've found.

> My recollection of the particular patch is dim, but I do remember that
> it tried to add more kswapd threads, which would just paper over the
> problem you are seeing rather than solve it.

Yeah, that's exactly what it does, just adding more kswapd threads.

I've tried updating the patch to the latest mainline kernel to test its
viability for our use case, but the kswapd code has changed too much over
the past two years, and updating it is beyond my ability right now.

For the time being I've switched over to zram, which better suits our use
case either way and is threaded, but lacks zswap's memory deduplication.
Even with zram I'm still seeing kswapd frequently max out a core, so
there's definitely still a case for further optimization of kswapd.

In our case it's not a single big application taking up our memory;
rather, we are running 2000 high-memory applications. They store a lot of
data in swap but rarely ever access said data, so the actual swap I/O
isn't even that high.
--
Sebastiaan Meijer
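For the zram setup mentioned above, the effective compression ratio can be
read out of /sys/block/zram0/mm_stat, whose first field is orig_data_size
and second field is compr_data_size (both in bytes, per the kernel's zram
documentation). A minimal sketch; sample values are hard-coded here so it
runs without a zram device:

```shell
# Effective zram compression ratio from mm_stat fields 1 and 2.
# On a live system use: mm_stat=$(cat /sys/block/zram0/mm_stat)
mm_stat="1073741824 268435456 285212672 0 285212672 0 0"
echo "$mm_stat" | awk '{ printf "compression ratio: %.2f:1\n", $1 / $2 }'
# prints: compression ratio: 4.00:1
```

A ratio well above 1 means the CPU time kswapd burns compressing is at
least buying memory back; a ratio near 1 means it is mostly wasted.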
* Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
From: Michal Hocko @ 2020-10-02 7:03 UTC (permalink / raw)
To: Sebastiaan Meijer
Cc: akpm, buddy.lumpkin, hannes, linux-kernel, linux-mm, mgorman,
    riel, willy

On Thu 01-10-20 18:18:10, Sebastiaan Meijer wrote:
> (Apologies for messing up the mailing list thread; Gmail had fooled me
> into believing that it properly picked up the thread.)
>
> On Thu, 1 Oct 2020 at 14:30, Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 30-09-20 21:27:12, Sebastiaan Meijer wrote:
> > > > yes it shows the bottleneck but it is quite artificial. Read data
> > > > is usually processed and/or written back and that changes the
> > > > picture a lot.
> > > Apologies for reviving an ancient thread (and apologies in advance
> > > for my lack of knowledge on how mailing lists work), but I'd like to
> > > offer up another reason why merging this might be a good idea.
> > >
> > > From what I understand, zswap runs its compression on the same
> > > kswapd thread, limiting it to a single thread for compression.
> > > Given enough processing power, zswap can get great throughput using
> > > heavier compression algorithms like zstd, but this is currently
> > > greatly limited by the lack of threading.
> >
> > Isn't this a problem of the zswap implementation rather than general
> > kswapd reclaim? Why doesn't zswap do the same as normal swap out, in a
> > context outside of the reclaim?
>
> I wouldn't be able to tell you; the documentation on zswap is fairly
> limited from what I've found.

I would recommend talking to the zswap maintainers: describe your problem
and suggest offloading the heavy lifting into a separate context, the way
the standard swap-out path does its IO. You are not the only one to hit
this problem:
http://lkml.kernel.org/r/CALvZod43VXKZ3StaGXK_EZG_fKcW3v3=cEYOWFwp4HNJpOOf8g@mail.gmail.com
Cc'ing Shakeel on such an email might help you collect more use cases.

> > My recollection of the particular patch is dim, but I do remember that
> > it tried to add more kswapd threads, which would just paper over the
> > problem you are seeing rather than solve it.
>
> Yeah, that's exactly what it does, just adding more kswapd threads.

Which is far from trivial, because it has side effects on the overall
system balance. See my reply to the original request and the follow-up
discussion. I am not saying this is impossible to achieve and tune
properly, but it is certainly non-trivial and would require a really
strong justification.
--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
From: Mel Gorman @ 2020-10-02 8:40 UTC (permalink / raw)
To: Michal Hocko
Cc: Sebastiaan Meijer, akpm, buddy.lumpkin, hannes, linux-kernel,
    linux-mm, riel, willy

On Fri, Oct 02, 2020 at 09:03:33AM +0200, Michal Hocko wrote:
> > > My recollection of the particular patch is dim, but I do remember
> > > that it tried to add more kswapd threads, which would just paper
> > > over the problem you are seeing rather than solve it.
> >
> > Yeah, that's exactly what it does, just adding more kswapd threads.
>
> Which is far from trivial, because it has side effects on the overall
> system balance.

While I have not read the original patches, multiple kswapd threads will
smash into the LRU lock repeatedly. It's already the case that plain
storms of page cache allocations hammer that lock on pagevec releases,
and this gets worse as memory sizes increase. Increasing LRU lock
contention when memory is low is going to have diminishing returns.
--
Mel Gorman
SUSE Labs
* Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
From: Rik van Riel @ 2020-10-02 13:53 UTC (permalink / raw)
To: Michal Hocko, Sebastiaan Meijer
Cc: akpm, buddy.lumpkin, hannes, linux-kernel, linux-mm, mgorman,
    willy

On Fri, 2020-10-02 at 09:03 +0200, Michal Hocko wrote:
> On Thu 01-10-20 18:18:10, Sebastiaan Meijer wrote:
> > (Apologies for messing up the mailing list thread; Gmail had fooled
> > me into believing that it properly picked up the thread.)
> >
> > On Thu, 1 Oct 2020 at 14:30, Michal Hocko <mhocko@suse.com> wrote:
> > > On Wed 30-09-20 21:27:12, Sebastiaan Meijer wrote:
> > > > > yes it shows the bottleneck but it is quite artificial. Read
> > > > > data is usually processed and/or written back and that changes
> > > > > the picture a lot.
> > > > Apologies for reviving an ancient thread (and apologies in advance
> > > > for my lack of knowledge on how mailing lists work), but I'd like
> > > > to offer up another reason why merging this might be a good idea.
> > > >
> > > > From what I understand, zswap runs its compression on the same
> > > > kswapd thread, limiting it to a single thread for compression.
> > > > Given enough processing power, zswap can get great throughput
> > > > using heavier compression algorithms like zstd, but this is
> > > > currently greatly limited by the lack of threading.
> > >
> > > Isn't this a problem of the zswap implementation rather than general
> > > kswapd reclaim? Why doesn't zswap do the same as normal swap out, in
> > > a context outside of the reclaim?

On systems with lots of very fast IO devices, we have also seen kswapd
take 100% CPU time without any zswap in use.

This seems like a generic issue, though zswap does manage to bring it out
on lower-end systems.
--
All Rights Reversed.
* Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
From: Matthew Wilcox @ 2020-10-02 14:00 UTC (permalink / raw)
To: Rik van Riel
Cc: Michal Hocko, Sebastiaan Meijer, akpm, buddy.lumpkin, hannes,
    linux-kernel, linux-mm, mgorman

On Fri, Oct 02, 2020 at 09:53:05AM -0400, Rik van Riel wrote:
> On Fri, 2020-10-02 at 09:03 +0200, Michal Hocko wrote:
> > On Thu 01-10-20 18:18:10, Sebastiaan Meijer wrote:
> > > (Apologies for messing up the mailing list thread; Gmail had fooled
> > > me into believing that it properly picked up the thread.)
> > >
> > > On Thu, 1 Oct 2020 at 14:30, Michal Hocko <mhocko@suse.com> wrote:
> > > > On Wed 30-09-20 21:27:12, Sebastiaan Meijer wrote:
> > > > > > yes it shows the bottleneck but it is quite artificial. Read
> > > > > > data is usually processed and/or written back and that
> > > > > > changes the picture a lot.
> > > > > Apologies for reviving an ancient thread (and apologies in
> > > > > advance for my lack of knowledge on how mailing lists work),
> > > > > but I'd like to offer up another reason why merging this might
> > > > > be a good idea.
> > > > >
> > > > > From what I understand, zswap runs its compression on the same
> > > > > kswapd thread, limiting it to a single thread for compression.
> > > > > Given enough processing power, zswap can get great throughput
> > > > > using heavier compression algorithms like zstd, but this is
> > > > > currently greatly limited by the lack of threading.
> > > >
> > > > Isn't this a problem of the zswap implementation rather than
> > > > general kswapd reclaim? Why doesn't zswap do the same as normal
> > > > swap out, in a context outside of the reclaim?
>
> On systems with lots of very fast IO devices, we have also seen kswapd
> take 100% CPU time without any zswap in use.
>
> This seems like a generic issue, though zswap does manage to bring it
> out on lower-end systems.

Then, given Mel's observation about contention on the LRU lock, what's
the solution? Partition the LRU list? Batch removals from the LRU list
by kswapd and hand off to per-?node?cpu? worker threads?

Rik, if you have access to one of those systems, I'd be interested to
know whether using file THPs would help with your workload. Tracking
only one THP instead of, say, 16 regular-size pages is going to reduce
the amount of time taken to pull things off the LRU list.
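The arithmetic behind the THP point above is easy to sketch. The numbers
below are illustrative (1 GiB of page cache, the x86-64 4 KiB base page
and 2 MiB THP sizes), not figures from the thread:

```shell
# LRU entries needed to track 1 GiB of page cache at two page sizes.
cache_bytes=$((1 << 30))            # 1 GiB of page cache
base=$((cache_bytes / 4096))        # 4 KiB base pages
thp=$((cache_bytes / (2 << 20)))    # 2 MiB transparent huge pages
echo "base pages: ${base} LRU entries, 2MiB THPs: ${thp} LRU entries"
# prints: base pages: 262144 LRU entries, 2MiB THPs: 512 LRU entries
```

Fewer LRU entries per byte of cache means fewer acquisitions of the LRU
lock per byte reclaimed, which is the contention Mel describes.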
* Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
From: Michal Hocko @ 2020-10-02 14:29 UTC (permalink / raw)
To: Rik van Riel
Cc: Sebastiaan Meijer, akpm, buddy.lumpkin, hannes, linux-kernel,
    linux-mm, mgorman, willy

On Fri 02-10-20 09:53:05, Rik van Riel wrote:
> On Fri, 2020-10-02 at 09:03 +0200, Michal Hocko wrote:
> > On Thu 01-10-20 18:18:10, Sebastiaan Meijer wrote:
> > > (Apologies for messing up the mailing list thread; Gmail had fooled
> > > me into believing that it properly picked up the thread.)
> > >
> > > On Thu, 1 Oct 2020 at 14:30, Michal Hocko <mhocko@suse.com> wrote:
> > > > On Wed 30-09-20 21:27:12, Sebastiaan Meijer wrote:
> > > > > > yes it shows the bottleneck but it is quite artificial. Read
> > > > > > data is usually processed and/or written back and that
> > > > > > changes the picture a lot.
> > > > > Apologies for reviving an ancient thread (and apologies in
> > > > > advance for my lack of knowledge on how mailing lists work),
> > > > > but I'd like to offer up another reason why merging this might
> > > > > be a good idea.
> > > > >
> > > > > From what I understand, zswap runs its compression on the same
> > > > > kswapd thread, limiting it to a single thread for compression.
> > > > > Given enough processing power, zswap can get great throughput
> > > > > using heavier compression algorithms like zstd, but this is
> > > > > currently greatly limited by the lack of threading.
> > > >
> > > > Isn't this a problem of the zswap implementation rather than
> > > > general kswapd reclaim? Why doesn't zswap do the same as normal
> > > > swap out, in a context outside of the reclaim?
>
> On systems with lots of very fast IO devices, we have also seen kswapd
> take 100% CPU time without any zswap in use.

Do you have more details? Does the saturated kswapd lead to premature
direct reclaim? What is the saturated number of reclaimed pages per unit
of time? Have you tried playing with this to see whether an additional
worker would help?
--
Michal Hocko
SUSE Labs
* [RFC PATCH 0/1] mm: Support multiple kswapd threads per node
From: Buddy Lumpkin @ 2018-04-02 9:24 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: buddy.lumpkin, hannes, riel, mgorman, willy, akpm

I created this patch to address performance problems we are seeing in
Oracle Cloud Infrastructure. We run the Oracle Linux UEK4 kernel
internally, which is based on upstream 4.1. I created and tested this
patch for the latest upstream kernel and UEK4, and I was able to show
substantial benefits in both kernels using workloads that provide a mix
of anonymous memory allocations with filesystem writes.

As I went through the process of getting this patch approved internally,
I learned that it was hard to come up with a concise set of test results
that clearly demonstrate that devoting more threads toward proactive page
replacement is actually necessary. I was more focused on the impact that
direct reclaims had on latency at the time, so I came up with a systemtap
script that measures the latency of direct reclaims. On systems that were
doing large volumes of filesystem IO, I saw order-0 allocations regularly
taking over 10ms, and occasionally over 100ms.

Since we were seeing large volumes of direct reclaims triggered as a side
effect of filesystem IO, I figured this had to have a substantial impact
on throughput. I compared the maximum read throughput that could be
obtained using direct IO streams to standard filesystem IO through the
page cache on one of the dense storage systems that we vend. Direct IO
was 55% higher in throughput than standard filesystem IO. I can't
remember the last time I measured this, but I know it was over 15 years
ago, and I am quite sure the number was no more than 10% back then. I was
pretty sure that direct reclaims were to blame for most of this, and it
would only take a few more tests to prove it.

At 23GB/s, it only takes 32.6 seconds to fill the page cache on one of
these systems, but that is enough time to measure throughput without any
page replacement occurring. In this case direct IO throughput was only
13.5% higher. It was pretty clear that direct reclaims were causing a
substantial reduction in throughput, and I decided this would be the
ideal way to show the benefits of threading kswapd.

On the UEK4 kernel, six kswapd threads provided a 48% increase over one.
When I ran the same tests on upstream kernel version 4.16.0-rc7, I only
saw a 20% increase with 6 threads, and the numbers fluctuated quite a bit
when I watched with iostat at a 2-second sample interval. The output
stalled periodically as well. When I profiled the system using perf, I
saw that 70% of the CPU time was being spent in a single function,
native_queued_spin_lock_slowpath(): 38% during shrink_inactive_list()
and another 34% during __lru_cache_add().

I eventually determined that my tests were presenting a difficult pattern
for the logic that uses shadow entries to periodically resize the LRU
lists. This was not a problem in the UEK4 kernel, which also has shadow
entries, so something has changed in that regard. I have not had time to
really dig into this particular problem; I assume those who are more
familiar with the code might see the test results below and have an idea
about what is going on. I have appended a small patch to the end of this
cover letter that effectively disables most of the routines in
mm/workingset.c so that filesystem IO can be used to demonstrate the
benefits of a threaded kswapd. I am not suggesting that this is the
correct solution for this problem.

The test results below are from the same tests that were run to
demonstrate threaded kswapd performance. For more context, read the patch
commit log before continuing and the test results below will make more
sense.

Direct IO results are roughly the same as expected ...

Test #1: Direct IO - shadow entries enabled

dd  sy  dd_cpu  throughput
 6   0    2.33  14726026.40
10   1    2.95  19954974.80
16   1    2.63  24419689.30
22   1    2.63  25430303.20
28   1    2.91  26026513.20
34   1    2.53  26178618.00
40   1    2.18  26239229.20
46   1    1.91  26250550.40
52   1    1.69  26251845.60
58   1    1.54  26253205.60
64   1    1.43  26253780.80
70   1    1.31  26254154.80
76   1    1.21  26253660.80
82   1    1.12  26254214.80
88   1    1.07  26253770.00
90   1    1.04  26252406.40

Going through the pagecache is a different story entirely. Let's look at
throughput with a single kswapd thread per node, with shadow entries
enabled vs. disabled:

shadow entries ENABLED, 1 kswapd thread per node

dd  sy  dd_cpu  kswapd0  kswapd1  throughput   dr      pgscan_kswapd  pgscan_direct
10   5   27.96    35.52    34.94   7964174.80       0      460161197           0
16   8   40.75    84.86    81.92  11143540.00       0      907793664           0
22  12   45.01    99.96    99.98  12790778.40    6751      884827215   162344947
28  18   49.10    99.97    99.97  14410621.02   17989      719328362   536886953
34  22   52.87    99.80    99.98  14331978.80   25180      609680315   661201785
40  26   55.66    99.90    99.96  14612901.20   26843      449047388   810399311
46  28   56.37    99.74    99.96  15831410.40   33854      518952367   807944791
52  37   59.78    99.80    99.97  15264190.80   37042      372258422   881626890
58  50   71.90    99.44    99.53  14979692.40   45761      190511392  1114810023
64  53   72.14    99.84    99.95  14747164.80   83665      168461850  1013498958
70  50   68.09    99.80    99.90  15176129.60  113546      203506041  1008655113
76  59   73.77    99.73    99.96  14947922.40   98798      137174015  1057487320
82  66   79.25    99.66    99.98  14624100.40  100242      101830859  1074332196
88  73   81.26    98.85    99.98  14827533.60  101262       90402914  1086186724
90  78   85.48    99.55    99.98  14469963.20  101063       75722196  1083603245

shadow entries DISABLED, 1 kswapd thread per node

dd  sy  dd_cpu  kswapd0  kswapd1  throughput   dr      pgscan_kswapd  pgscan_direct
10   4   26.07    28.56    27.03   7355924.40       0      459316976           0
16   7   34.94    69.33    69.66  10867895.20       0      872661643           0
22  10   36.03    93.99    99.33  13130613.60     489     1037654473    11268334
28  10   30.34    95.90    98.60  14601509.60     671     1182591373    15429142
34  14   34.77    97.50    99.23  16468012.00   10850     1069005644   249839515
40  17   36.32    91.49    97.11  17335987.60   18903      975417728   434467710
46  19   38.40    90.54    91.61  17705394.40   25369      855737040   582427973
52  22   40.88    83.97    83.70  17607680.40   31250      709532935   724282458
58  25   40.89    82.19    80.14  17976905.60   35060      657796473   804117540
64  28   41.77    73.49    75.20  18001910.00   39073      561813658   895289337
70  33   45.51    63.78    64.39  17061897.20   44523      379465571  1020726436
76  36   46.95    57.96    60.32  16964459.60   47717      291299464  1093172384
82  39   47.16    55.43    56.16  16949956.00   49479      247071062  1134163008
88  42   47.41    53.75    47.62  16930911.20   51521      195449924  1180442208
90  43   47.18    51.40    50.59  16864428.00   51618      190758156  1183203901

When shadow entries are disabled, kernel mode CPU consumption drops and
peak throughput increases by 13.7%.

Here is the same test with 4 kswapd threads:

shadow entries ENABLED, 4 kswapd threads per node

dd  sy  dd_cpu  kswapd0  kswapd1  throughput   dr      pgscan_kswapd  pgscan_direct
10   6   30.09    17.36    16.82   7692440.40       0      460386412           0
16  11   42.86    34.35    33.86  10836456.80      23      885908695      550482
22  14   46.00    55.30    50.53  13125285.20       0     1075382922           0
28  17   43.74    87.18    44.18  15298355.20       0     1254927179           0
34  26   53.78    99.88    89.93  16203179.20    3443     1247514636    80817567
40  35   62.99    99.88    97.58  16653526.80   15376      960519369   369681969
46  36   51.66    99.85    90.87  18668439.60   10907     1239045416   259575692
52  46   66.96    99.61    99.96  16970211.60   24264      751180033   577278765
58  52   76.53    99.91    99.97  15336601.60   30676      513418729   725394427
64  58   78.20    99.79    99.96  15266654.40   33466      450869495   791218349
70  65   82.98    99.93    99.98  15285421.60   35647      370270673   843608871
76  69   81.52    99.87    99.87  15681812.00   37625      358457523   889023203
82  78   85.68    99.97    99.98  15370775.60   39010      302132025   921379809
88  85   88.52    99.88    99.56  15410439.20   40100      267031806   947441566
90  88   90.11    99.67    99.41  15400593.20   40443      249090848   953893493

shadow entries DISABLED, 4 kswapd threads per node

dd  sy  dd_cpu  kswapd0  kswapd1  throughput   dr      pgscan_kswapd  pgscan_direct
10   5   27.09    16.65    14.17   7842605.60       0      459105291           0
16  10   37.12    26.02    24.85  11352920.40      15      920527796      358515
22  11   36.94    37.13    35.82  13771869.60       0     1132169011           0
28  13   35.23    48.43    46.86  16089746.00       0     1312902070           0
34  15   33.37    53.02    55.69  18314856.40       0     1476169080           0
40  19   35.90    69.60    64.41  19836126.80       0     1629999149           0
46  22   36.82    88.55    57.20  20740216.40       0     1708478106           0
52  24   34.38    93.76    68.34  21758352.00       0     1794055559           0
58  24   30.51    79.20    82.33  22735594.00       0     1872794397           0
64  26   30.21    97.12    76.73  23302203.60     176     1916593721     4206821
70  33   32.92    92.91    92.87  23776588.00    3575     1817685086    85574159
76  37   31.62    91.20    89.83  24308196.80    4752     1812262569   113981763
82  29   25.53    93.23    92.33  24802791.20     306     2032093122     7350704
88  43   37.12    76.18    77.01  25145694.40   20310     1253204719   487048202
90  42   38.56    73.90    74.57  22516787.60   22774     1193637495   545463615

With four kswapd threads, the effects are more pronounced: kernel mode
CPU consumption is substantially higher with shadow entries enabled,
while throughput is substantially lower. When shadow entries are
disabled, additional kswapd tasks increase throughput while kernel mode
CPU consumption stays roughly the same.

---
 mm/workingset.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/workingset.c b/mm/workingset.c
index b7d616a3bbbe..656451ce2d5e 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -213,6 +213,7 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 	unsigned long eviction;
 	struct lruvec *lruvec;

+	return NULL;
 	/* Page is fully exclusive and pins page->mem_cgroup */
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 	VM_BUG_ON_PAGE(page_count(page), page);

Buddy Lumpkin (1):
  vmscan: Support multiple kswapd threads per node

 Documentation/sysctl/vm.txt |  21 ++++++++
 include/linux/mm.h          |   2 +
 include/linux/mmzone.h      |  10 +++-
 kernel/sysctl.c             |  10 ++++
 mm/page_alloc.c             |  15 ++++++
 mm/vmscan.c                 | 116 +++++++++++++++++++++++++++++++++++++-------
 6 files changed, 155 insertions(+), 19 deletions(-)

--
1.8.3.1
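The systemtap script mentioned in the cover letter was not posted. As an
assumption (not the author's script), an equivalent direct reclaim latency
histogram can be built today with bpftrace on the mainline
mm_vmscan_direct_reclaim_begin/end tracepoints:

```shell
# Generate a small bpftrace program that histograms the latency of each
# direct reclaim, keyed begin-to-end per thread.
cat > direct-reclaim-lat.bt <<'EOF'
tracepoint:vmscan:mm_vmscan_direct_reclaim_begin { @start[tid] = nsecs; }
tracepoint:vmscan:mm_vmscan_direct_reclaim_end /@start[tid]/ {
	@latency_us = hist((nsecs - @start[tid]) / 1000);
	delete(@start[tid]);
}
EOF
# Run with: bpftrace direct-reclaim-lat.bt
# (needs root; Ctrl-C prints the microsecond latency histogram)
```

The histogram makes the 10ms-to-100ms tail the cover letter describes
directly visible without instrumenting the kernel.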
* [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node 2018-04-02 9:24 [RFC PATCH 0/1] mm: " Buddy Lumpkin @ 2018-04-02 9:24 ` Buddy Lumpkin 2018-04-03 13:31 ` Michal Hocko 0 siblings, 1 reply; 23+ messages in thread From: Buddy Lumpkin @ 2018-04-02 9:24 UTC (permalink / raw) To: linux-mm, linux-kernel; +Cc: buddy.lumpkin, hannes, riel, mgorman, willy, akpm Page replacement is handled in the Linux Kernel in one of two ways: 1) Asynchronously via kswapd 2) Synchronously, via direct reclaim At page allocation time the allocating task is immediately given a page from the zone free list allowing it to go right back to work doing whatever it was doing; Probably directly or indirectly executing business logic. Just prior to satisfying the allocation, free pages is checked to see if it has reached the zone low watermark and if so, kswapd is awakened. Kswapd will start scanning pages looking for inactive pages to evict to make room for new page allocations. The work of kswapd allows tasks to continue allocating memory from their respective zone free list without incurring any delay. When the demand for free pages exceeds the rate that kswapd tasks can supply them, page allocation works differently. Once the allocating task finds that the number of free pages is at or below the zone min watermark, the task will no longer pull pages from the free list. Instead, the task will run the same CPU-bound routines as kswapd to satisfy its own allocation by scanning and evicting pages. This is called a direct reclaim. The time spent performing a direct reclaim can be substantial, often taking tens to hundreds of milliseconds for small order0 allocations to half a second or more for order9 huge-page allocations. In fact, kswapd is not actually required on a linux system. It exists for the sole purpose of optimizing performance by preventing direct reclaims. 
When memory shortfall is sufficient to trigger direct reclaims, they can occur in any task that is running on the system. A single aggressive memory allocating task can set the stage for collateral damage to occur in small tasks that rarely allocate additional memory. Consider the impact of injecting an additional 100ms of latency when nscd allocates memory to facilitate caching of a DNS query. The presence of direct reclaims 10 years ago was a fairly reliable indicator that too much was being asked of a Linux system. Kswapd was likely wasting time scanning pages that were ineligible for eviction. Adding RAM or reducing the working set size would usually make the problem go away. Since then hardware has evolved to bring a new struggle for kswapd. Storage speeds have increased by orders of magnitude while CPU clock speeds stayed the same or even slowed down in exchange for more cores per package. This presents a throughput problem for a single threaded kswapd that will get worse with each generation of new hardware. Test Details NOTE: The tests below were run with shadow entries disabled. See the associated patch and cover letter for details The tests below were designed with the assumption that a kswapd bottleneck is best demonstrated using filesystem reads. This way, the inactive list will be full of clean pages, simplifying the analysis and allowing kswapd to achieve the highest possible steal rate. Maximum steal rates for kswapd are likely to be the same or lower for any other mix of page types on the system. Tests were run on a 2U Oracle X7-2L with 52 Intel Xeon Skylake 2GHz cores, 756GB of RAM and 8 x 3.6 TB NVMe Solid State Disk drives. Each drive has an XFS file system mounted separately as /d0 through /d7. SSD drives require multiple concurrent streams to show their potential, so I created eleven 250GB zero-filled files on each drive so that I could test with parallel reads. The test script runs in multiple stages. 
At each stage, the number of dd tasks run concurrently is increased by 2. I did not include all of the test output for brevity. During each stage dd tasks are launched to read from each drive in a round robin fashion until the specified number of tasks for the stage has been reached. Then iostat, vmstat and top are started in the background with 10 second intervals. After five minutes, all of the dd tasks are killed and the iostat, vmstat and top output is parsed in order to report the following: CPU consumption - sy - aggregate kernel mode CPU consumption from vmstat output. The value doesn't tend to fluctuate much so I just grab the highest value. Each sample is averaged over 10 seconds - dd_cpu - for all of the dd tasks averaged across the top samples since there is a lot of variation. Throughput - in Kbytes - Command is iostat -x -d 10 -g total This first test performs reads using O_DIRECT in order to show the maximum throughput that can be obtained using these drives. It also demonstrates how rapidly throughput scales as the number of dd tasks are increased. The dd command for this test looks like this: Command Used: dd iflag=direct if=/d${i}/$n of=/dev/null bs=4M Test #1: Direct IO dd sy dd_cpu throughput 6 0 2.33 14726026.40 10 1 2.95 19954974.80 16 1 2.63 24419689.30 22 1 2.63 25430303.20 28 1 2.91 26026513.20 34 1 2.53 26178618.00 40 1 2.18 26239229.20 46 1 1.91 26250550.40 52 1 1.69 26251845.60 58 1 1.54 26253205.60 64 1 1.43 26253780.80 70 1 1.31 26254154.80 76 1 1.21 26253660.80 82 1 1.12 26254214.80 88 1 1.07 26253770.00 90 1 1.04 26252406.40 Throughput was close to peak with only 22 dd tasks. Very little system CPU was consumed as expected as the drives DMA directly into the user address space when using direct IO. In this next test, the iflag=direct option is removed and we only run the test until the pgscan_kswapd from /proc/vmstat starts to increment. 
At that point metrics are parsed and reported and the pagecache contents
are dropped prior to the next test. Lather, rinse, repeat.

Test #2: standard file system IO, no page replacement

dd   sy   dd_cpu   throughput
6    2    28.78    5134316.40
10   3    31.40    8051218.40
16   5    34.73    11438106.80
22   7    33.65    14140596.40
28   8    31.24    16393455.20
34   10   29.88    18219463.60
40   11   28.33    19644159.60
46   11   25.05    20802497.60
52   13   26.92    22092370.00
58   13   23.29    22884881.20
64   14   23.12    23452248.80
70   15   22.40    23916468.00
76   16   22.06    24328737.20
82   17   20.97    24718693.20
88   16   18.57    25149404.40
90   16   18.31    25245565.60

Each read has to pause after the buffer in kernel space is populated while
those pages are added to the pagecache and copied into the user address
space. For this reason, more parallel streams are required to achieve peak
throughput. The copy operation consumes substantially more CPU than direct
IO as expected.

The next test measures throughput after kswapd starts running. This is the
same test only we wait for kswapd to wake up before we start collecting
metrics. The script actually keeps track of a few things that were not
mentioned earlier. It tracks direct reclaims and page scans by watching
the metrics in /proc/vmstat. CPU consumption for kswapd is tracked the
same way it is tracked for dd. Since the test is 100% reads, you can
assume that the page steal rate for kswapd and direct reclaims is almost
identical to the scan rate.
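The /proc/vmstat bookkeeping described above (sampling pgscan_kswapd and pgscan_direct and turning the deltas into rates) can be sketched in a few lines. The counter names are the ones /proc/vmstat actually exposes; the sample values and the 10-second interval below are illustrative, not from the test runs:

```python
# Sketch: derive page-scan rates from two /proc/vmstat-style samples,
# the way the test harness described above attributes scanning to
# kswapd vs. direct reclaim. Sample data here is made up.

def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, value = line.split()
        stats[name] = int(value)
    return stats

def scan_rates(before, after, interval_secs):
    """Pages scanned per second by kswapd and by direct reclaim."""
    kswapd = (after["pgscan_kswapd"] - before["pgscan_kswapd"]) / interval_secs
    direct = (after["pgscan_direct"] - before["pgscan_direct"]) / interval_secs
    return kswapd, direct

sample_t0 = "pgscan_kswapd 1000000\npgscan_direct 50000\nallocstall 120"
sample_t1 = "pgscan_kswapd 1600000\npgscan_direct 90000\nallocstall 150"

k_rate, d_rate = scan_rates(parse_vmstat(sample_t0),
                            parse_vmstat(sample_t1), 10)
print(k_rate, d_rate)  # 60000.0 4000.0
```

Because the tests are 100% clean-page reads, these scan rates can be read as steal rates, which is what makes the dr/pgscan columns in the tables below directly comparable.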
Test #3: 1 kswapd thread per node

dd   sy   dd_cpu   kswapd0   kswapd1   throughput    dr      pgscan_kswapd   pgscan_direct
10   4    26.07    28.56     27.03     7355924.40    0       459316976       0
16   7    34.94    69.33     69.66     10867895.20   0       872661643       0
22   10   36.03    93.99     99.33     13130613.60   489     1037654473      11268334
28   10   30.34    95.90     98.60     14601509.60   671     1182591373      15429142
34   14   34.77    97.50     99.23     16468012.00   10850   1069005644      249839515
40   17   36.32    91.49     97.11     17335987.60   18903   975417728       434467710
46   19   38.40    90.54     91.61     17705394.40   25369   855737040       582427973
52   22   40.88    83.97     83.70     17607680.40   31250   709532935       724282458
58   25   40.89    82.19     80.14     17976905.60   35060   657796473       804117540
64   28   41.77    73.49     75.20     18001910.00   39073   561813658       895289337
70   33   45.51    63.78     64.39     17061897.20   44523   379465571       1020726436
76   36   46.95    57.96     60.32     16964459.60   47717   291299464       1093172384
82   39   47.16    55.43     56.16     16949956.00   49479   247071062       1134163008
88   42   47.41    53.75     47.62     16930911.20   51521   195449924       1180442208
90   43   47.18    51.40     50.59     16864428.00   51618   190758156       1183203901

In the previous test where kswapd was not involved, the system-wide kernel
mode CPU consumption with 90 dd tasks was 16%. In this test CPU
consumption with 90 tasks is at 43%. With 52 cores, and two kswapd tasks
(one per NUMA node), kswapd can only be responsible for a little over 4%
of the increase. The rest is likely caused by 51,618 direct reclaims that
scanned 1.2 billion pages over the five minute time period of the test.
Same test, more kswapd tasks:

Test #4: 4 kswapd threads per node

dd   sy   dd_cpu   kswapd0   kswapd1   throughput    dr      pgscan_kswapd   pgscan_direct
10   5    27.09    16.65     14.17     7842605.60    0       459105291       0
16   10   37.12    26.02     24.85     11352920.40   15      920527796       358515
22   11   36.94    37.13     35.82     13771869.60   0       1132169011      0
28   13   35.23    48.43     46.86     16089746.00   0       1312902070      0
34   15   33.37    53.02     55.69     18314856.40   0       1476169080      0
40   19   35.90    69.60     64.41     19836126.80   0       1629999149      0
46   22   36.82    88.55     57.20     20740216.40   0       1708478106      0
52   24   34.38    93.76     68.34     21758352.00   0       1794055559      0
58   24   30.51    79.20     82.33     22735594.00   0       1872794397      0
64   26   30.21    97.12     76.73     23302203.60   176     1916593721      4206821
70   33   32.92    92.91     92.87     23776588.00   3575    1817685086      85574159
76   37   31.62    91.20     89.83     24308196.80   4752    1812262569      113981763
82   29   25.53    93.23     92.33     24802791.20   306     2032093122      7350704
88   43   37.12    76.18     77.01     25145694.40   20310   1253204719      487048202
90   42   38.56    73.90     74.57     22516787.60   22774   1193637495      545463615

By increasing the number of kswapd threads, throughput increased by ~50%
while kernel mode CPU utilization decreased or stayed the same, likely due
to a decrease in the number of parallel tasks at any given time doing page
replacement.

Signed-off-by: Buddy Lumpkin <buddy.lumpkin@oracle.com>
---
 Documentation/sysctl/vm.txt |  23 +++++++++
 include/linux/mm.h          |   2 +
 include/linux/mmzone.h      |  10 +++-
 kernel/sysctl.c             |  10 ++++
 mm/page_alloc.c             |  15 ++++++
 mm/vmscan.c                 | 116 +++++++++++++++++++++++++++++++++++++-------
 6 files changed, 157 insertions(+), 19 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index ff234d229cbb..aa54cbc14dd9 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -31,6 +31,7 @@ Currently, these files are in /proc/sys/vm:
 - drop_caches
 - extfrag_threshold
 - hugetlb_shm_group
+- kswapd_threads
 - laptop_mode
 - legacy_va_layout
 - lowmem_reserve_ratio
@@ -267,6 +268,28 @@ shared memory segment using hugetlb page.
 ==============================================================
 
+kswapd_threads
+
+kswapd_threads allows you to control the number of kswapd threads per node
+running on the system. This provides the ability to devote additional CPU
+resources toward proactive page replacement with the goal of reducing
+direct reclaims. When direct reclaims are prevented, the CPU consumed
+by them is prevented as well. Depending on the workload, the result can
+cause aggregate CPU usage on the system to go up, down or stay the same.
+
+More aggressive page replacement can reduce direct reclaims which cause
+latency for tasks and decrease throughput when doing filesystem IO through
+the pagecache. Direct reclaims are recorded using the allocstall counter
+in /proc/vmstat.
+
+The default value is 1 and the range of acceptable values is 1-16.
+Always start with lower values in the 2-6 range. Higher values should
+be justified with testing. If direct reclaims occur in spite of high
+values, the cost of direct reclaims (in latency) that occur can be
+higher due to increased lock contention.
+
+==============================================================
+
 laptop_mode
 
 laptop_mode is a knob that controls "laptop mode".
All the things that are

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad06d42adb1a..e25b8da76f7d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2078,6 +2078,7 @@ static inline void zero_resv_unavail(void) {}
 extern void memmap_init_zone(unsigned long, int, unsigned long,
 		unsigned long, enum memmap_context, struct vmem_altmap *);
 extern void setup_per_zone_wmarks(void);
+extern void update_kswapd_threads(void);
 extern int __meminit init_per_zone_wmark_min(void);
 extern void mem_init(void);
 extern void __init mmap_init(void);
@@ -2098,6 +2099,7 @@ extern __printf(3, 4)
 extern void zone_pcp_reset(struct zone *zone);
 
 /* page_alloc.c */
+extern int kswapd_threads;
 extern int min_free_kbytes;
 extern int watermark_scale_factor;
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7522a6987595..ad36a5b5c3b8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -36,6 +36,8 @@
  */
 #define PAGE_ALLOC_COSTLY_ORDER 3
 
+#define MAX_KSWAPD_THREADS 16
+
 enum migratetype {
 	MIGRATE_UNMOVABLE,
 	MIGRATE_MOVABLE,
@@ -653,8 +655,10 @@ struct zonelist {
 	int node_id;
 	wait_queue_head_t kswapd_wait;
 	wait_queue_head_t pfmemalloc_wait;
-	struct task_struct *kswapd;	/* Protected by
-					   mem_hotplug_begin/end() */
+	/*
+	 * Protected by mem_hotplug_begin/end()
+	 */
+	struct task_struct *kswapd[MAX_KSWAPD_THREADS];
 	int kswapd_order;
 	enum zone_type kswapd_classzone_idx;
@@ -882,6 +886,8 @@ static inline int is_highmem(struct zone *zone)
 /* These two functions are used to setup the per zone pages min values */
 struct ctl_table;
+int kswapd_threads_sysctl_handler(struct ctl_table *, int,
+					void __user *, size_t *, loff_t *);
 int min_free_kbytes_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
 int watermark_scale_factor_sysctl_handler(struct ctl_table *, int,
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index f98f28c12020..3cef65ce1d46 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -134,6 +134,7 @@
 #ifdef
CONFIG_PERF_EVENTS
 static int six_hundred_forty_kb = 640 * 1024;
 #endif
 
+static int max_kswapd_threads = MAX_KSWAPD_THREADS;
+
 /* this is needed for the proc_doulongvec_minmax of vm_dirty_bytes */
 static unsigned long dirty_bytes_min = 2 * PAGE_SIZE;
 
@@ -1437,6 +1438,15 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
 		.extra1		= &zero,
 	},
 	{
+		.procname	= "kswapd_threads",
+		.data		= &kswapd_threads,
+		.maxlen		= sizeof(kswapd_threads),
+		.mode		= 0644,
+		.proc_handler	= kswapd_threads_sysctl_handler,
+		.extra1		= &one,
+		.extra2		= &max_kswapd_threads,
+	},
+	{
 		.procname	= "watermark_scale_factor",
 		.data		= &watermark_scale_factor,
 		.maxlen		= sizeof(watermark_scale_factor),
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1741dd23e7c1..de30683aeb0f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7143,6 +7143,21 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write,
 	return 0;
 }
 
+int kswapd_threads_sysctl_handler(struct ctl_table *table, int write,
+	void __user *buffer, size_t *length, loff_t *ppos)
+{
+	int rc;
+
+	rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
+	if (rc)
+		return rc;
+
+	if (write)
+		update_kswapd_threads();
+
+	return 0;
+}
+
 int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write,
 	void __user *buffer, size_t *length, loff_t *ppos)
 {
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cd5dc3faaa57..663ff14080e7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -118,6 +118,14 @@ struct scan_control {
 	unsigned long	nr_reclaimed;
 };
 
+
+/*
+ * Number of active kswapd threads
+ */
+#define DEF_KSWAPD_THREADS_PER_NODE 1
+int kswapd_threads = DEF_KSWAPD_THREADS_PER_NODE;
+int kswapd_threads_current = DEF_KSWAPD_THREADS_PER_NODE;
+
 #ifdef ARCH_HAS_PREFETCH
 #define prefetch_prev_lru_page(_page, _base, _field)			\
 	do {								\
@@ -3624,21 +3632,83 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
  *   restore their cpu bindings.
 */
 static int kswapd_cpu_online(unsigned int cpu)
 {
-	int nid;
+	int nid, hid;
+	int nr_threads = kswapd_threads_current;
 
 	for_each_node_state(nid, N_MEMORY) {
 		pg_data_t *pgdat = NODE_DATA(nid);
 		const struct cpumask *mask;
 
 		mask = cpumask_of_node(pgdat->node_id);
-
-		if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
-			/* One of our CPUs online: restore mask */
-			set_cpus_allowed_ptr(pgdat->kswapd, mask);
+		if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids) {
+			for (hid = 0; hid < nr_threads; hid++) {
+				/* One of our CPUs online: restore mask */
+				set_cpus_allowed_ptr(pgdat->kswapd[hid], mask);
+			}
+		}
 	}
 	return 0;
 }
 
+static void update_kswapd_threads_node(int nid)
+{
+	pg_data_t *pgdat;
+	int drop, increase;
+	int last_idx, start_idx, hid;
+	int nr_threads = kswapd_threads_current;
+
+	pgdat = NODE_DATA(nid);
+	last_idx = nr_threads - 1;
+	if (kswapd_threads < nr_threads) {
+		drop = nr_threads - kswapd_threads;
+		for (hid = last_idx; hid > (last_idx - drop); hid--) {
+			if (pgdat->kswapd[hid]) {
+				kthread_stop(pgdat->kswapd[hid]);
+				pgdat->kswapd[hid] = NULL;
+			}
+		}
+	} else {
+		increase = kswapd_threads - nr_threads;
+		start_idx = last_idx + 1;
+		for (hid = start_idx; hid < (start_idx + increase); hid++) {
+			pgdat->kswapd[hid] = kthread_run(kswapd, pgdat,
+						"kswapd%d:%d", nid, hid);
+			if (IS_ERR(pgdat->kswapd[hid])) {
+				pr_err("Failed to start kswapd%d on node %d\n",
+					hid, nid);
+				pgdat->kswapd[hid] = NULL;
+				/*
+				 * We are out of resources. Do not start any
+				 * more threads.
+				 */
+				break;
+			}
+		}
+	}
+}
+
+void update_kswapd_threads(void)
+{
+	int nid;
+
+	if (kswapd_threads_current == kswapd_threads)
+		return;
+
+	/*
+	 * Hold the memory hotplug lock to avoid racing with memory
+	 * hotplug initiated updates
+	 */
+	mem_hotplug_begin();
+	for_each_node_state(nid, N_MEMORY)
+		update_kswapd_threads_node(nid);
+
+	pr_info("kswapd_thread count changed, old:%d new:%d\n",
+		kswapd_threads_current, kswapd_threads);
+	kswapd_threads_current = kswapd_threads;
+	mem_hotplug_done();
+}
+
+
 /*
  * This kswapd start function will be called by init and node-hot-add.
  * On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added.
@@ -3647,18 +3717,25 @@ int kswapd_run(int nid)
 {
 	pg_data_t *pgdat = NODE_DATA(nid);
 	int ret = 0;
+	int hid, nr_threads;
 
-	if (pgdat->kswapd)
+	if (pgdat->kswapd[0])
 		return 0;
 
-	pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
-	if (IS_ERR(pgdat->kswapd)) {
-		/* failure at boot is fatal */
-		BUG_ON(system_state < SYSTEM_RUNNING);
-		pr_err("Failed to start kswapd on node %d\n", nid);
-		ret = PTR_ERR(pgdat->kswapd);
-		pgdat->kswapd = NULL;
+	nr_threads = kswapd_threads;
+	for (hid = 0; hid < nr_threads; hid++) {
+		pgdat->kswapd[hid] = kthread_run(kswapd, pgdat, "kswapd%d:%d",
+						nid, hid);
+		if (IS_ERR(pgdat->kswapd[hid])) {
+			/* failure at boot is fatal */
+			BUG_ON(system_state < SYSTEM_RUNNING);
+			pr_err("Failed to start kswapd%d on node %d\n",
+				hid, nid);
+			ret = PTR_ERR(pgdat->kswapd[hid]);
+			pgdat->kswapd[hid] = NULL;
+		}
 	}
+	kswapd_threads_current = nr_threads;
 	return ret;
 }
 
@@ -3668,11 +3745,16 @@ int kswapd_run(int nid)
  */
 void kswapd_stop(int nid)
 {
-	struct task_struct *kswapd = NODE_DATA(nid)->kswapd;
+	struct task_struct *kswapd;
+	int hid;
+	int nr_threads = kswapd_threads_current;
 
-	if (kswapd) {
-		kthread_stop(kswapd);
-		NODE_DATA(nid)->kswapd = NULL;
+	for (hid = 0; hid < nr_threads; hid++) {
+		kswapd = NODE_DATA(nid)->kswapd[hid];
+		if (kswapd) {
+			kthread_stop(kswapd);
+			NODE_DATA(nid)->kswapd[hid] = NULL;
+		}
 	}
 }
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 23+ messages in thread
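Stepping back from the diff: the resize behavior of update_kswapd_threads_node() (stop threads from the top of the per-node array when shrinking, start new ones past the last used slot when growing) is easy to sanity-check outside the kernel. A sketch with Python stand-ins for kthread_run()/kthread_stop(); MAX_KSWAPD_THREADS and the indexing mirror the patch, everything else is illustrative:

```python
MAX_KSWAPD_THREADS = 16  # mirrors the patch's mmzone.h definition

def update_threads(kswapd, current, target):
    """Model of update_kswapd_threads_node(): shrink by stopping threads
    from the top of the array, or grow past the last used slot."""
    last_idx = current - 1
    if target < current:
        drop = current - target
        for hid in range(last_idx, last_idx - drop, -1):
            if kswapd[hid] is not None:
                kswapd[hid] = None              # stand-in for kthread_stop()
    else:
        for hid in range(last_idx + 1, last_idx + 1 + (target - current)):
            kswapd[hid] = "kswapd0:%d" % hid    # stand-in for kthread_run()
    return target  # caller updates kswapd_threads_current

slots = [None] * MAX_KSWAPD_THREADS
slots[0] = "kswapd0:0"          # boot-time kswapd_run() starts thread 0
cur = 1

cur = update_threads(slots, cur, 4)   # sysctl raised to 4
print([s for s in slots if s])        # ['kswapd0:0', 'kswapd0:1', 'kswapd0:2', 'kswapd0:3']

cur = update_threads(slots, cur, 2)   # sysctl lowered to 2
print([s for s in slots if s])        # ['kswapd0:0', 'kswapd0:1']
```

The model also shows why slot 0 is special in the patch: kswapd_run() checks pgdat->kswapd[0] to decide whether a node already has its thread pool, and kswapd_stop() walks only the currently active slots.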
* Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
  2018-04-02  9:24 ` [RFC PATCH 1/1] vmscan: " Buddy Lumpkin
@ 2018-04-03 13:31 ` Michal Hocko
  2018-04-03 19:07   ` Matthew Wilcox
                     ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Michal Hocko @ 2018-04-03 13:31 UTC (permalink / raw)
To: Buddy Lumpkin; +Cc: linux-mm, linux-kernel, hannes, riel, mgorman, willy, akpm

On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
> Page replacement is handled in the Linux Kernel in one of two ways:
>
> 1) Asynchronously via kswapd
> 2) Synchronously, via direct reclaim
>
> At page allocation time the allocating task is immediately given a page
> from the zone free list allowing it to go right back to work doing
> whatever it was doing; Probably directly or indirectly executing business
> logic.
>
> Just prior to satisfying the allocation, free pages is checked to see if
> it has reached the zone low watermark and if so, kswapd is awakened.
> Kswapd will start scanning pages looking for inactive pages to evict to
> make room for new page allocations. The work of kswapd allows tasks to
> continue allocating memory from their respective zone free list without
> incurring any delay.
>
> When the demand for free pages exceeds the rate that kswapd tasks can
> supply them, page allocation works differently. Once the allocating task
> finds that the number of free pages is at or below the zone min watermark,
> the task will no longer pull pages from the free list. Instead, the task
> will run the same CPU-bound routines as kswapd to satisfy its own
> allocation by scanning and evicting pages. This is called a direct reclaim.
>
> The time spent performing a direct reclaim can be substantial, often
> taking tens to hundreds of milliseconds for small order0 allocations to
> half a second or more for order9 huge-page allocations. In fact, kswapd is
> not actually required on a linux system.
> It exists for the sole purpose of
> optimizing performance by preventing direct reclaims.
>
> When memory shortfall is sufficient to trigger direct reclaims, they can
> occur in any task that is running on the system. A single aggressive
> memory allocating task can set the stage for collateral damage to occur in
> small tasks that rarely allocate additional memory. Consider the impact of
> injecting an additional 100ms of latency when nscd allocates memory to
> facilitate caching of a DNS query.
>
> The presence of direct reclaims 10 years ago was a fairly reliable
> indicator that too much was being asked of a Linux system. Kswapd was
> likely wasting time scanning pages that were ineligible for eviction.
> Adding RAM or reducing the working set size would usually make the problem
> go away. Since then hardware has evolved to bring a new struggle for
> kswapd. Storage speeds have increased by orders of magnitude while CPU
> clock speeds stayed the same or even slowed down in exchange for more
> cores per package. This presents a throughput problem for a single
> threaded kswapd that will get worse with each generation of new hardware.

AFAIR we used to scale the number of kswapd workers many years ago. It
just turned out to be not all that great. We have a kswapd reclaim
window for quite some time and that can allow to tune how much proactive
kswapd should be.

Also please note that the direct reclaim is a way to throttle overly
aggressive memory consumers. The more we do in the background context
the easier for them it will be to allocate faster. So I am not really
sure that more background threads will solve the underlying problem. It
is just a matter of memory hogs tuning to end in the very same
situation AFAICS. Moreover the more they are going to allocate the
less CPU time _other_ (non-allocating) tasks will get.

> Test Details

I will have to study this more to comment.

[...]
> By increasing the number of kswapd threads, throughput increased by ~50%
> while kernel mode CPU utilization decreased or stayed the same, likely due
> to a decrease in the number of parallel tasks at any given time doing page
> replacement.

Well, isn't that just an effect of more work being done on behalf of
other workload that might run along with your tests (and which doesn't
really need to allocate a lot of memory)? In other words how
does the patch behave with non-artificial mixed workloads?

Please note that I am not saying that we absolutely have to stick with the
current single-thread-per-node implementation but I would really like to
see more background on why we should be allowing heavy memory hogs to
allocate faster or how to prevent that. I would be also very interested
to see how to scale the number of threads based on how CPUs are utilized
by other workloads.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
  2018-04-03 13:31 ` Michal Hocko
@ 2018-04-03 19:07   ` Matthew Wilcox
  2018-04-03 20:49     ` Buddy Lumpkin
  2018-04-11  3:52     ` Buddy Lumpkin
  2018-04-03 20:13   ` Buddy Lumpkin
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 23+ messages in thread
From: Matthew Wilcox @ 2018-04-03 19:07 UTC (permalink / raw)
To: Michal Hocko
Cc: Buddy Lumpkin, linux-mm, linux-kernel, hannes, riel, mgorman, akpm

On Tue, Apr 03, 2018 at 03:31:15PM +0200, Michal Hocko wrote:
> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
> > The presence of direct reclaims 10 years ago was a fairly reliable
> > indicator that too much was being asked of a Linux system. Kswapd was
> > likely wasting time scanning pages that were ineligible for eviction.
> > Adding RAM or reducing the working set size would usually make the problem
> > go away. Since then hardware has evolved to bring a new struggle for
> > kswapd. Storage speeds have increased by orders of magnitude while CPU
> > clock speeds stayed the same or even slowed down in exchange for more
> > cores per package. This presents a throughput problem for a single
> > threaded kswapd that will get worse with each generation of new hardware.
>
> AFAIR we used to scale the number of kswapd workers many years ago. It
> just turned out to be not all that great. We have a kswapd reclaim
> window for quite some time and that can allow to tune how much proactive
> kswapd should be.
>
> Also please note that the direct reclaim is a way to throttle overly
> aggressive memory consumers. The more we do in the background context
> the easier for them it will be to allocate faster. So I am not really
> sure that more background threads will solve the underlying problem. It
> is just a matter of memory hogs tuning to end in the very same
> situation AFAICS. Moreover the more they are going to allocate the
> less CPU time _other_ (non-allocating) tasks will get.
>
> > Test Details
>
> I will have to study this more to comment.
>
> [...]
> > By increasing the number of kswapd threads, throughput increased by ~50%
> > while kernel mode CPU utilization decreased or stayed the same, likely due
> > to a decrease in the number of parallel tasks at any given time doing page
> > replacement.
>
> Well, isn't that just an effect of more work being done on behalf of
> other workload that might run along with your tests (and which doesn't
> really need to allocate a lot of memory)? In other words how
> does the patch behave with non-artificial mixed workloads?
>
> Please note that I am not saying that we absolutely have to stick with the
> current single-thread-per-node implementation but I would really like to
> see more background on why we should be allowing heavy memory hogs to
> allocate faster or how to prevent that. I would be also very interested
> to see how to scale the number of threads based on how CPUs are utilized
> by other workloads.

Yes, very much this. If you have a single-threaded workload which is
using the entirety of memory and would like to use even more, then it
makes sense to use as many CPUs as necessary getting memory out of its
way. If you have N CPUs and N-1 threads happily occupying themselves in
their own reasonably-sized working sets with one monster process trying
to use as much RAM as possible, then I'd be pretty unimpressed to see
the N-1 well-behaved threads preempted by kswapd.

My biggest problem with the patch-as-presented is that it's yet one more
thing for admins to get wrong. We should spawn more threads automatically
if system conditions are right to do that.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
  2018-04-03 19:07 ` Matthew Wilcox
@ 2018-04-03 20:49   ` Buddy Lumpkin
  2018-04-03 21:12     ` Matthew Wilcox
  2018-04-11  3:52   ` Buddy Lumpkin
  1 sibling, 1 reply; 23+ messages in thread
From: Buddy Lumpkin @ 2018-04-03 20:49 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Michal Hocko, linux-mm, linux-kernel, hannes, riel, mgorman, akpm

> On Apr 3, 2018, at 12:07 PM, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Apr 03, 2018 at 03:31:15PM +0200, Michal Hocko wrote:
>> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
>>> The presence of direct reclaims 10 years ago was a fairly reliable
>>> indicator that too much was being asked of a Linux system. Kswapd was
>>> likely wasting time scanning pages that were ineligible for eviction.
>>> Adding RAM or reducing the working set size would usually make the problem
>>> go away. Since then hardware has evolved to bring a new struggle for
>>> kswapd. Storage speeds have increased by orders of magnitude while CPU
>>> clock speeds stayed the same or even slowed down in exchange for more
>>> cores per package. This presents a throughput problem for a single
>>> threaded kswapd that will get worse with each generation of new hardware.
>>
>> AFAIR we used to scale the number of kswapd workers many years ago. It
>> just turned out to be not all that great. We have a kswapd reclaim
>> window for quite some time and that can allow to tune how much proactive
>> kswapd should be.
>>
>> Also please note that the direct reclaim is a way to throttle overly
>> aggressive memory consumers. The more we do in the background context
>> the easier for them it will be to allocate faster. So I am not really
>> sure that more background threads will solve the underlying problem. It
>> is just a matter of memory hogs tuning to end in the very same
>> situation AFAICS. Moreover the more they are going to allocate the
>> less CPU time _other_ (non-allocating) tasks will get.
>>
>>> Test Details
>>
>> I will have to study this more to comment.
>>
>> [...]
>>> By increasing the number of kswapd threads, throughput increased by ~50%
>>> while kernel mode CPU utilization decreased or stayed the same, likely due
>>> to a decrease in the number of parallel tasks at any given time doing page
>>> replacement.
>>
>> Well, isn't that just an effect of more work being done on behalf of
>> other workload that might run along with your tests (and which doesn't
>> really need to allocate a lot of memory)? In other words how
>> does the patch behave with non-artificial mixed workloads?
>>
>> Please note that I am not saying that we absolutely have to stick with the
>> current single-thread-per-node implementation but I would really like to
>> see more background on why we should be allowing heavy memory hogs to
>> allocate faster or how to prevent that. I would be also very interested
>> to see how to scale the number of threads based on how CPUs are utilized
>> by other workloads.
>
> Yes, very much this. If you have a single-threaded workload which is
> using the entirety of memory and would like to use even more, then it
> makes sense to use as many CPUs as necessary getting memory out of its
> way. If you have N CPUs and N-1 threads happily occupying themselves in
> their own reasonably-sized working sets with one monster process trying
> to use as much RAM as possible, then I'd be pretty unimpressed to see
> the N-1 well-behaved threads preempted by kswapd.

The default value provides one kswapd thread per NUMA node, the same as it
was without the patch. Also, I would point out that just because you devote
more threads to kswapd, doesn't mean they are busy. If multiple kswapd
threads are busy, they are almost certainly doing work that would have
resulted in direct reclaims, which are often substantially more expensive
than a couple extra context switches due to preemption.
Also, the code still uses wake_up_interruptible to wake kswapd threads, so
after starting the first kswapd thread, free pages minus the size of the
allocation would still need to be below the low watermark for a page
allocation at that time to cause another kswapd thread to wake up.

When I first decided to try this out, I figured a lot of tuning would be
needed to see good behavior. But what I found in practice was that it
actually works quite well. When you look closely, you see that there is
very little difference between a direct reclaim and kswapd. In fact,
direct reclaims work a little harder than kswapd, and they should continue
to do so because that prevents the number of parallel scanning tasks from
increasing unnecessarily. Please try it out, you might be surprised at how
well it works.

>
> My biggest problem with the patch-as-presented is that it's yet one more
> thing for admins to get wrong. We should spawn more threads automatically
> if system conditions are right to do that.

I totally agree with this. In my previous response to Michal Hocko, I
described how I think we could scale watermarks in response to direct
reclaims, and launch more kswapd threads when kswapd peaks at 100% CPU
usage.

^ permalink raw reply	[flat|nested] 23+ messages in thread
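[Editorial note: the auto-scaling idea floated in the message above — grow the pool when kswapd saturates its CPUs while direct reclaims still occur, shrink it when kswapd idles — is not part of the posted patch. A hypothetical sketch of such a controller, with made-up thresholds, just to make the proposal concrete:]

```python
def next_thread_count(current, max_threads, kswapd_cpu_pct, direct_reclaims):
    """Hypothetical controller for the auto-scaling proposal: add a kswapd
    thread when the existing ones are pegged and allocations still stall
    into direct reclaim; drop one when kswapd has plenty of headroom.
    The 95%/50% thresholds are illustrative, not from the patch."""
    if kswapd_cpu_pct >= 95 and direct_reclaims > 0:
        return min(current + 1, max_threads)
    if kswapd_cpu_pct < 50 and current > 1:
        return current - 1
    return current

print(next_thread_count(1, 16, 99, 1200))   # 2: saturated and still stalling
print(next_thread_count(4, 16, 30, 0))      # 3: plenty of headroom, shrink
print(next_thread_count(16, 16, 100, 50))   # 16: already at the cap
```

In kernel terms, `direct_reclaims` would be a delta of the allocstall counter and `kswapd_cpu_pct` the CPU usage of the node's kswapd tasks over some sampling window; how to pick that window and avoid oscillation is exactly the open question raised in this subthread.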
* Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
  2018-04-03 20:49 ` Buddy Lumpkin
@ 2018-04-03 21:12   ` Matthew Wilcox
  2018-04-04 10:07     ` Buddy Lumpkin
                       ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Matthew Wilcox @ 2018-04-03 21:12 UTC (permalink / raw)
To: Buddy Lumpkin
Cc: Michal Hocko, linux-mm, linux-kernel, hannes, riel, mgorman, akpm

On Tue, Apr 03, 2018 at 01:49:25PM -0700, Buddy Lumpkin wrote:
> > Yes, very much this. If you have a single-threaded workload which is
> > using the entirety of memory and would like to use even more, then it
> > makes sense to use as many CPUs as necessary getting memory out of its
> > way. If you have N CPUs and N-1 threads happily occupying themselves in
> > their own reasonably-sized working sets with one monster process trying
> > to use as much RAM as possible, then I'd be pretty unimpressed to see
> > the N-1 well-behaved threads preempted by kswapd.
>
> The default value provides one kswapd thread per NUMA node, the same as
> it was without the patch. Also, I would point out that just because you devote
> more threads to kswapd, doesn't mean they are busy. If multiple kswapd threads
> are busy, they are almost certainly doing work that would have resulted in
> direct reclaims, which are often substantially more expensive than a couple
> extra context switches due to preemption.

[...]

> In my previous response to Michal Hocko, I described
> how I think we could scale watermarks in response to direct reclaims, and
> launch more kswapd threads when kswapd peaks at 100% CPU usage.

I think you're missing my point about the workload ... kswapd isn't
"nice", so it will compete with the N-1 threads which are chugging along
at 100% CPU inside their working sets. In this scenario, we _don't_
want to kick off kswapd at all; we want the monster thread to clean up
its own mess.
If we have idle CPUs, then yes, absolutely, let's have
them clean up for the monster, but otherwise, I want my N-1 threads
doing their own thing.

Maybe we should renice kswapd anyway ... thoughts? We don't seem to have
had a nice'd kswapd since 2.6.12, but maybe we played with that earlier
and discovered it was a bad idea?

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
  2018-04-03 21:12 ` Matthew Wilcox
@ 2018-04-04 10:07   ` Buddy Lumpkin
  2018-04-05  4:08   ` Buddy Lumpkin
  2018-04-11  6:37   ` Buddy Lumpkin
  2 siblings, 0 replies; 23+ messages in thread
From: Buddy Lumpkin @ 2018-04-04 10:07 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Michal Hocko, linux-mm, linux-kernel, hannes, riel, mgorman, akpm

> On Apr 3, 2018, at 2:12 PM, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Apr 03, 2018 at 01:49:25PM -0700, Buddy Lumpkin wrote:
>>> Yes, very much this. If you have a single-threaded workload which is
>>> using the entirety of memory and would like to use even more, then it
>>> makes sense to use as many CPUs as necessary getting memory out of its
>>> way. If you have N CPUs and N-1 threads happily occupying themselves in
>>> their own reasonably-sized working sets with one monster process trying
>>> to use as much RAM as possible, then I'd be pretty unimpressed to see
>>> the N-1 well-behaved threads preempted by kswapd.
>>
>> The default value provides one kswapd thread per NUMA node, the same as
>> it was without the patch. Also, I would point out that just because you devote
>> more threads to kswapd, doesn't mean they are busy. If multiple kswapd threads
>> are busy, they are almost certainly doing work that would have resulted in
>> direct reclaims, which are often substantially more expensive than a couple
>> extra context switches due to preemption.
>
> [...]
>
>> In my previous response to Michal Hocko, I described
>> how I think we could scale watermarks in response to direct reclaims, and
>> launch more kswapd threads when kswapd peaks at 100% CPU usage.
>
> I think you're missing my point about the workload ... kswapd isn't
> "nice", so it will compete with the N-1 threads which are chugging along
> at 100% CPU inside their working sets.
> In this scenario, we _don't_
> want to kick off kswapd at all; we want the monster thread to clean up
> its own mess. If we have idle CPUs, then yes, absolutely, let's have
> them clean up for the monster, but otherwise, I want my N-1 threads
> doing their own thing.

For the scenario you describe above, I have my own opinions, but I would
rather not speculate on what happens. Tomorrow I will try to simulate this
situation and I'll report back on the results.

I think this actually makes a case for accepting the patch as-is for now.
Please hear me out on this:

You mentioned being concerned that an admin will do the wrong thing with
this tunable. I worked in the System Administrator/System Engineering job
families for many years and even though I transitioned to spending most of
my time on performance and kernel work, I still maintain an active role in
System Engineering related projects, hiring and mentoring.

The kswapd_threads tunable defaults to a value of one, which is the
current default behavior. I think there are plenty of sysctls that are
more confusing than this one. If you want to make a comparison, I would
say that Transparent Hugepages is one of the best examples of a feature
that has confused System Administrators. I am sure it works a lot better
today, but it has a history of really sharp edges, and it has been
shipping enabled by default for a long time in the OS distributions I am
familiar with. I am hopeful that it works better in later kernels as I
think we need more features like it. Specifically, features that bring
high performance to naive third party apps that do not make use of
advanced features like hugetlbfs, spoke, direct IO, or clumsy interfaces
like posix_fadvise. But until they are absolutely polished, I wish these
kinds of features would not be turned on by default. This includes
kswapd_threads.
More reasons why implementing this tunable makes sense for now:

- A feature like this is a lot easier to reason about after it has been
  used in the field for a while. This includes trying to auto-tune it
- We need an answer for this problem today. Today there are single NVMe
  drives capable of 10GB/s and larger systems than the system I used for
  testing
- In the scenario you describe above, an admin would have no reason to
  touch this sysctl
- I think I mentioned this before. I honestly thought a lot of tuning
  would be necessary after implementing this but so far that hasn't been
  the case. It works pretty well.

>
> Maybe we should renice kswapd anyway ... thoughts? We don't seem to have
> had a nice'd kswapd since 2.6.12, but maybe we played with that earlier
> and discovered it was a bad idea?
>

^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node 2018-04-03 21:12 ` Matthew Wilcox 2018-04-04 10:07 ` Buddy Lumpkin @ 2018-04-05 4:08 ` Buddy Lumpkin 2018-04-11 6:37 ` Buddy Lumpkin 2 siblings, 0 replies; 23+ messages in thread From: Buddy Lumpkin @ 2018-04-05 4:08 UTC (permalink / raw) To: Matthew Wilcox Cc: Michal Hocko, linux-mm, linux-kernel, hannes, riel, mgorman, akpm > On Apr 3, 2018, at 2:12 PM, Matthew Wilcox <willy@infradead.org> wrote: > > On Tue, Apr 03, 2018 at 01:49:25PM -0700, Buddy Lumpkin wrote: >>> Yes, very much this. If you have a single-threaded workload which is >>> using the entirety of memory and would like to use even more, then it >>> makes sense to use as many CPUs as necessary getting memory out of its >>> way. If you have N CPUs and N-1 threads happily occupying themselves in >>> their own reasonably-sized working sets with one monster process trying >>> to use as much RAM as possible, then I'd be pretty unimpressed to see >>> the N-1 well-behaved threads preempted by kswapd. >> >> The default value provides one kswapd thread per NUMA node, the same >> it was without the patch. Also, I would point out that just because you devote >> more threads to kswapd, doesn’t mean they are busy. If multiple kswapd threads >> are busy, they are almost certainly doing work that would have resulted in >> direct reclaims, which are often substantially more expensive than a couple >> extra context switches due to preemption. > > [...] > >> In my previous response to Michal Hocko, I described >> how I think we could scale watermarks in response to direct reclaims, and >> launch more kswapd threads when kswapd peaks at 100% CPU usage. > > I think you're missing my point about the workload ... kswapd isn't > "nice", so it will compete with the N-1 threads which are chugging along > at 100% CPU inside their working sets. 
In this scenario, we _don't_ > want to kick off kswapd at all; we want the monster thread to clean up > its own mess. If we have idle CPUs, then yes, absolutely, lets have > them clean up for the monster, but otherwise, I want my N-1 threads > doing their own thing. > > Maybe we should renice kswapd anyway ... thoughts? We don't seem to have > had a nice'd kswapd since 2.6.12, but maybe we played with that earlier > and discovered it was a bad idea? > Trying to distinguish between the monster and a high value task that you want to run as quickly as possible would be challenging. I like your idea of using renice. It probably makes sense to continue to run the first thread on each node at a standard nice value, and run each additional task with a positive nice value. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node 2018-04-03 21:12 ` Matthew Wilcox 2018-04-04 10:07 ` Buddy Lumpkin 2018-04-05 4:08 ` Buddy Lumpkin @ 2018-04-11 6:37 ` Buddy Lumpkin 2 siblings, 0 replies; 23+ messages in thread From: Buddy Lumpkin @ 2018-04-11 6:37 UTC (permalink / raw) To: Matthew Wilcox Cc: Michal Hocko, linux-mm, linux-kernel, hannes, riel, mgorman, akpm > On Apr 3, 2018, at 2:12 PM, Matthew Wilcox <willy@infradead.org> wrote: > > On Tue, Apr 03, 2018 at 01:49:25PM -0700, Buddy Lumpkin wrote: >>> Yes, very much this. If you have a single-threaded workload which is >>> using the entirety of memory and would like to use even more, then it >>> makes sense to use as many CPUs as necessary getting memory out of its >>> way. If you have N CPUs and N-1 threads happily occupying themselves in >>> their own reasonably-sized working sets with one monster process trying >>> to use as much RAM as possible, then I'd be pretty unimpressed to see >>> the N-1 well-behaved threads preempted by kswapd. >> >> The default value provides one kswapd thread per NUMA node, the same >> it was without the patch. Also, I would point out that just because you devote >> more threads to kswapd, doesn’t mean they are busy. If multiple kswapd threads >> are busy, they are almost certainly doing work that would have resulted in >> direct reclaims, which are often substantially more expensive than a couple >> extra context switches due to preemption. > > [...] > >> In my previous response to Michal Hocko, I described >> how I think we could scale watermarks in response to direct reclaims, and >> launch more kswapd threads when kswapd peaks at 100% CPU usage. > > I think you're missing my point about the workload ... kswapd isn't > "nice", so it will compete with the N-1 threads which are chugging along > at 100% CPU inside their working sets. 
If the memory hog is generating enough demand for multiple kswapd tasks to
be busy, then it is generating enough demand to trigger direct reclaims.
Since direct reclaims are 100% CPU bound, the preemptions you are concerned
about are happening anyway.

> In this scenario, we _don't_
> want to kick off kswapd at all; we want the monster thread to clean up
> its own mess.

This makes direct reclaims sound like a positive thing overall, and that is
simply not the case. If cleaning is the metaphor to describe direct
reclaims, then it's happening in the kitchen using a garden hose. When
conditions for direct reclaims are present, they can occur in any task that
is allocating on the system. They inject latency in random places and they
decrease filesystem throughput.

When software engineers try to build their own cache, I usually try to talk
them out of it. This rarely works, as they usually have reasons they
believe make the project compelling, so I just ask that they compare their
results using direct IO and a private cache to simply allowing the page
cache to do its thing. I can't make this pitch anymore because direct
reclaims have too much of an impact on filesystem throughput.

The only positive thing that direct reclaims provide is a means to prevent
the system from crashing or deadlocking when it falls too low on memory.

> If we have idle CPUs, then yes, absolutely, let's have
> them clean up for the monster, but otherwise, I want my N-1 threads
> doing their own thing.
>
> Maybe we should renice kswapd anyway ... thoughts? We don't seem to have
> had a nice'd kswapd since 2.6.12, but maybe we played with that earlier
> and discovered it was a bad idea?
>

^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node 2018-04-03 19:07 ` Matthew Wilcox 2018-04-03 20:49 ` Buddy Lumpkin @ 2018-04-11 3:52 ` Buddy Lumpkin 1 sibling, 0 replies; 23+ messages in thread From: Buddy Lumpkin @ 2018-04-11 3:52 UTC (permalink / raw) To: Matthew Wilcox Cc: Michal Hocko, linux-mm, linux-kernel, hannes, riel, mgorman, akpm > On Apr 3, 2018, at 12:07 PM, Matthew Wilcox <willy@infradead.org> wrote: > > On Tue, Apr 03, 2018 at 03:31:15PM +0200, Michal Hocko wrote: >> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote: >>> The presence of direct reclaims 10 years ago was a fairly reliable >>> indicator that too much was being asked of a Linux system. Kswapd was >>> likely wasting time scanning pages that were ineligible for eviction. >>> Adding RAM or reducing the working set size would usually make the problem >>> go away. Since then hardware has evolved to bring a new struggle for >>> kswapd. Storage speeds have increased by orders of magnitude while CPU >>> clock speeds stayed the same or even slowed down in exchange for more >>> cores per package. This presents a throughput problem for a single >>> threaded kswapd that will get worse with each generation of new hardware. >> >> AFAIR we used to scale the number of kswapd workers many years ago. It >> just turned out to be not all that great. We have a kswapd reclaim >> window for quite some time and that can allow to tune how much proactive >> kswapd should be. >> >> Also please note that the direct reclaim is a way to throttle overly >> aggressive memory consumers. The more we do in the background context >> the easier for them it will be to allocate faster. So I am not really >> sure that more background threads will solve the underlying problem. It >> is just a matter of memory hogs tunning to end in the very same >> situtation AFAICS. Moreover the more they are going to allocate the more >> less CPU time will _other_ (non-allocating) task get. 
>> >>> Test Details >> >> I will have to study this more to comment. >> >> [...] >>> By increasing the number of kswapd threads, throughput increased by ~50% >>> while kernel mode CPU utilization decreased or stayed the same, likely due >>> to a decrease in the number of parallel tasks at any given time doing page >>> replacement. >> >> Well, isn't that just an effect of more work being done on behalf of >> other workload that might run along with your tests (and which doesn't >> really need to allocate a lot of memory)? In other words how >> does the patch behaves with a non-artificial mixed workloads? >> >> Please note that I am not saying that we absolutely have to stick with the >> current single-thread-per-node implementation but I would really like to >> see more background on why we should be allowing heavy memory hogs to >> allocate faster or how to prevent that. I would be also very interested >> to see how to scale the number of threads based on how CPUs are utilized >> by other workloads. > > Yes, very much this. If you have a single-threaded workload which is > using the entirety of memory and would like to use even more, then it > makes sense to use as many CPUs as necessary getting memory out of its > way. If you have N CPUs and N-1 threads happily occupying themselves in > their own reasonably-sized working sets with one monster process trying > to use as much RAM as possible, then I'd be pretty unimpressed to see > the N-1 well-behaved threads preempted by kswapd. A single thread cannot create the demand to keep any number of kswapd tasks busy, so this memory hog is going to need to have multiple threads if it is going to do any measurable damage to the amount of work performed by the compute bound tasks, and once we increase the number of tasks used for the memory hog, preemption is already happening. 
So let's say we are willing to accept that it is going to take multiple
threads to create enough demand to keep multiple kswapd tasks busy; we just
do not want any additional preemptions strictly due to additional kswapd
tasks. You have to consider: if we manage to create enough demand to keep
multiple kswapd tasks busy, then we are creating enough demand to trigger
direct reclaims. A _lot_ of direct reclaims, and direct reclaims consume a
_lot_ of CPU. So if we are running multiple kswapd threads, they might be
preempting your N-1 threads, but if they were not running, the memory hog
tasks would be preempting your N-1 threads.

>
> My biggest problem with the patch-as-presented is that it's yet one more
> thing for admins to get wrong. We should spawn more threads automatically
> if system conditions are right to do that.

One way an admin could get this patch-as-presented wrong is to start with a
setting of 16, decide that it didn't help, and reduce it back to one. It
allows for 16 threads because I actually saw a benefit with large numbers
of kswapd threads when a substantial amount of the memory pressure was
created using anonymous memory mappings that do not involve the page cache.
This really is a special case, and the maximum number of threads allowed
should probably be reduced to a more sensible value like 8 or even 6 if
there is concern about admins doing the wrong thing.

^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node 2018-04-03 13:31 ` Michal Hocko 2018-04-03 19:07 ` Matthew Wilcox @ 2018-04-03 20:13 ` Buddy Lumpkin 2018-04-11 3:10 ` Buddy Lumpkin [not found] ` <EB9E8FC6-8B02-4D7C-AA50-2B5B6BD2AF40@oracle.com> 3 siblings, 0 replies; 23+ messages in thread From: Buddy Lumpkin @ 2018-04-03 20:13 UTC (permalink / raw) To: Michal Hocko; +Cc: linux-mm, linux-kernel, hannes, riel, mgorman, willy, akpm Very sorry, I forgot to send my last response as plain text. > On Apr 3, 2018, at 6:31 AM, Michal Hocko <mhocko@kernel.org> wrote: > > On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote: >> Page replacement is handled in the Linux Kernel in one of two ways: >> >> 1) Asynchronously via kswapd >> 2) Synchronously, via direct reclaim >> >> At page allocation time the allocating task is immediately given a page >> from the zone free list allowing it to go right back to work doing >> whatever it was doing; Probably directly or indirectly executing business >> logic. >> >> Just prior to satisfying the allocation, free pages is checked to see if >> it has reached the zone low watermark and if so, kswapd is awakened. >> Kswapd will start scanning pages looking for inactive pages to evict to >> make room for new page allocations. The work of kswapd allows tasks to >> continue allocating memory from their respective zone free list without >> incurring any delay. >> >> When the demand for free pages exceeds the rate that kswapd tasks can >> supply them, page allocation works differently. Once the allocating task >> finds that the number of free pages is at or below the zone min watermark, >> the task will no longer pull pages from the free list. Instead, the task >> will run the same CPU-bound routines as kswapd to satisfy its own >> allocation by scanning and evicting pages. This is called a direct reclaim. 
>>
>> The time spent performing a direct reclaim can be substantial, often
>> taking tens to hundreds of milliseconds for small order0 allocations to
>> half a second or more for order9 huge-page allocations. In fact, kswapd is
>> not actually required on a linux system. It exists for the sole purpose of
>> optimizing performance by preventing direct reclaims.
>>
>> When memory shortfall is sufficient to trigger direct reclaims, they can
>> occur in any task that is running on the system. A single aggressive
>> memory allocating task can set the stage for collateral damage to occur in
>> small tasks that rarely allocate additional memory. Consider the impact of
>> injecting an additional 100ms of latency when nscd allocates memory to
>> facilitate caching of a DNS query.
>>
>> The presence of direct reclaims 10 years ago was a fairly reliable
>> indicator that too much was being asked of a Linux system. Kswapd was
>> likely wasting time scanning pages that were ineligible for eviction.
>> Adding RAM or reducing the working set size would usually make the problem
>> go away. Since then hardware has evolved to bring a new struggle for
>> kswapd. Storage speeds have increased by orders of magnitude while CPU
>> clock speeds stayed the same or even slowed down in exchange for more
>> cores per package. This presents a throughput problem for a single
>> threaded kswapd that will get worse with each generation of new hardware.
>
> AFAIR we used to scale the number of kswapd workers many years ago. It
> just turned out to be not all that great. We have a kswapd reclaim
> window for quite some time and that can allow to tune how much proactive
> kswapd should be.

Are you referring to vm.watermark_scale_factor? This helps quite a bit.
Previously I had to increase min_free_kbytes in order to get a larger gap
between the low and min watermarks. I was very excited when I saw that this
had been added upstream.
>
> Also please note that the direct reclaim is a way to throttle overly
> aggressive memory consumers.

I totally agree; in fact, I think this should be the primary role of direct
reclaims because they have a substantial impact on performance. Direct
reclaims are the emergency brakes for page allocation, and the case I am
making here is that they used to occur only when kswapd had to skip over a
lot of pages. This changed over time as the rate a system can allocate
pages increased. Direct reclaims slowly became a normal part of page
replacement.

> The more we do in the background context
> the easier for them it will be to allocate faster. So I am not really
> sure that more background threads will solve the underlying problem. It
> is just a matter of memory hogs tunning to end in the very same
> situtation AFAICS. Moreover the more they are going to allocate the more
> less CPU time will _other_ (non-allocating) task get.

The important thing to realize here is that kswapd and direct reclaims run
the same code paths. There is very little that they do differently. If you
compare my test results with one kswapd thread vs four, you can see that
direct reclaims increase the kernel mode CPU consumption considerably. By
dedicating more threads to proactive page replacement, you eliminate direct
reclaims, which reduces the total number of parallel threads that are
spinning on the CPU.

>
>> Test Details
>
> I will have to study this more to comment.
>
> [...]
>> By increasing the number of kswapd threads, throughput increased by ~50%
>> while kernel mode CPU utilization decreased or stayed the same, likely due
>> to a decrease in the number of parallel tasks at any given time doing page
>> replacement.
>
> Well, isn't that just an effect of more work being done on behalf of
> other workload that might run along with your tests (and which doesn't
> really need to allocate a lot of memory)? In other words how
> does the patch behaves with a non-artificial mixed workloads?
It works quite well. We are just starting to test our production apps. I
will have results to share soon.

>
> Please note that I am not saying that we absolutely have to stick with the
> current single-thread-per-node implementation but I would really like to
> see more background on why we should be allowing heavy memory hogs to
> allocate faster or how to prevent that.

My test results demonstrate the problem very well. They show that a handful
of SSDs can create enough demand for kswapd that it consumes ~100% CPU long
before throughput is able to reach its peak. Direct reclaims start
occurring at that point. Aggregate throughput continues to increase, but
eventually the pauses generated by the direct reclaims cause throughput to
plateau:

Test #3: 1 kswapd thread per node

dd  sy  dd_cpu  kswapd0  kswapd1   throughput     dr  pgscan_kswapd  pgscan_direct
10   4   26.07    28.56    27.03   7355924.40      0      459316976              0
16   7   34.94    69.33    69.66  10867895.20      0      872661643              0
22  10   36.03    93.99    99.33  13130613.60    489     1037654473       11268334
28  10   30.34    95.90    98.60  14601509.60    671     1182591373       15429142
34  14   34.77    97.50    99.23  16468012.00  10850     1069005644      249839515
40  17   36.32    91.49    97.11  17335987.60  18903      975417728      434467710
46  19   38.40    90.54    91.61  17705394.40  25369      855737040      582427973
52  22   40.88    83.97    83.70  17607680.40  31250      709532935      724282458
58  25   40.89    82.19    80.14  17976905.60  35060      657796473      804117540
64  28   41.77    73.49    75.20  18001910.00  39073      561813658      895289337
70  33   45.51    63.78    64.39  17061897.20  44523      379465571     1020726436
76  36   46.95    57.96    60.32  16964459.60  47717      291299464     1093172384
82  39   47.16    55.43    56.16  16949956.00  49479      247071062     1134163008
88  42   47.41    53.75    47.62  16930911.20  51521      195449924     1180442208
90  43   47.18    51.40    50.59  16864428.00  51618      190758156     1183203901

I think we have reached the point where it makes sense for page replacement
to have more than one mode.
Enterprise-class servers with lots of memory and a large number of CPU
cores would benefit heavily if more threads could be devoted toward
proactive page replacement. The polar opposite case is my Raspberry Pi,
which I want to run as efficiently as possible. This problem is only going
to get worse. I think it makes sense to be able to choose between
efficiency and performance (throughput and latency reduction).

> I would be also very interested
> to see how to scale the number of threads based on how CPUs are utilized
> by other workloads.
> --
> Michal Hocko
> SUSE Labs

I agree. I think it would be nice to have a work queue that can sense when
CPU utilization for a task peaks at 100% and use that as the criterion to
start another task, up to some maximum that was determined at boot time. I
would also determine a max gap size for the watermarks at boot time as
well, specifically the gap between min and low, since that provides the
buffer that absorbs spiky reclaim behavior as free pages drop. Each time a
direct reclaim occurs, increase the gap up to the limit. Make the limit
tunable as well. If at any time along the way CPU peaks at 100%, start
another thread up to the limit established at boot (which is also tunable).

^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node 2018-04-03 13:31 ` Michal Hocko 2018-04-03 19:07 ` Matthew Wilcox 2018-04-03 20:13 ` Buddy Lumpkin @ 2018-04-11 3:10 ` Buddy Lumpkin 2018-04-12 13:23 ` Michal Hocko [not found] ` <EB9E8FC6-8B02-4D7C-AA50-2B5B6BD2AF40@oracle.com> 3 siblings, 1 reply; 23+ messages in thread From: Buddy Lumpkin @ 2018-04-11 3:10 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, linux-kernel, hannes, riel, mgorman, Matthew Wilcox, akpm > On Apr 3, 2018, at 6:31 AM, Michal Hocko <mhocko@kernel.org> wrote: > > On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote: >> Page replacement is handled in the Linux Kernel in one of two ways: >> >> 1) Asynchronously via kswapd >> 2) Synchronously, via direct reclaim >> >> At page allocation time the allocating task is immediately given a page >> from the zone free list allowing it to go right back to work doing >> whatever it was doing; Probably directly or indirectly executing business >> logic. >> >> Just prior to satisfying the allocation, free pages is checked to see if >> it has reached the zone low watermark and if so, kswapd is awakened. >> Kswapd will start scanning pages looking for inactive pages to evict to >> make room for new page allocations. The work of kswapd allows tasks to >> continue allocating memory from their respective zone free list without >> incurring any delay. >> >> When the demand for free pages exceeds the rate that kswapd tasks can >> supply them, page allocation works differently. Once the allocating task >> finds that the number of free pages is at or below the zone min watermark, >> the task will no longer pull pages from the free list. Instead, the task >> will run the same CPU-bound routines as kswapd to satisfy its own >> allocation by scanning and evicting pages. This is called a direct reclaim. 
>> >> The time spent performing a direct reclaim can be substantial, often >> taking tens to hundreds of milliseconds for small order0 allocations to >> half a second or more for order9 huge-page allocations. In fact, kswapd is >> not actually required on a linux system. It exists for the sole purpose of >> optimizing performance by preventing direct reclaims. >> >> When memory shortfall is sufficient to trigger direct reclaims, they can >> occur in any task that is running on the system. A single aggressive >> memory allocating task can set the stage for collateral damage to occur in >> small tasks that rarely allocate additional memory. Consider the impact of >> injecting an additional 100ms of latency when nscd allocates memory to >> facilitate caching of a DNS query. >> >> The presence of direct reclaims 10 years ago was a fairly reliable >> indicator that too much was being asked of a Linux system. Kswapd was >> likely wasting time scanning pages that were ineligible for eviction. >> Adding RAM or reducing the working set size would usually make the problem >> go away. Since then hardware has evolved to bring a new struggle for >> kswapd. Storage speeds have increased by orders of magnitude while CPU >> clock speeds stayed the same or even slowed down in exchange for more >> cores per package. This presents a throughput problem for a single >> threaded kswapd that will get worse with each generation of new hardware. > > AFAIR we used to scale the number of kswapd workers many years ago. It > just turned out to be not all that great. We have a kswapd reclaim > window for quite some time and that can allow to tune how much proactive > kswapd should be. I am not aware of a previous version of Linux that offered more than one kswapd thread per NUMA node. > > Also please note that the direct reclaim is a way to throttle overly > aggressive memory consumers. The more we do in the background context > the easier for them it will be to allocate faster. 
> So I am not really
> sure that more background threads will solve the underlying problem.

A single kswapd thread used to keep up with all of the demand you could
create on a Linux system quite easily, provided it didn't have to scan a
lot of pages that were ineligible for eviction. Ten years ago, Fibre
Channel was the popular high-performance interconnect, and if you were
lucky enough to have the latest hardware rated at 10GFC, you could get
1.2GB/s per host bus adapter. Also, most high-end storage solutions were
still using spinning rust, so it took an insane number of spindles behind
each host bus adapter to saturate the channel if the access patterns were
random. There really wasn't a reason to try to thread kswapd, and I am
pretty sure there haven't been any attempts to do this in the last 10
years.

> It is just a matter of memory hogs tunning to end in the very same
> situtation AFAICS. Moreover the more they are going to allocate the more
> less CPU time will _other_ (non-allocating) task get.

Please describe the scenario a bit more clearly.
> > Please note that I am not saying that we absolutely have to stick with the > current single-thread-per-node implementation but I would really like to > see more background on why we should be allowing heavy memory hogs to > allocate faster or how to prevent that. I would be also very interested > to see how to scale the number of threads based on how CPUs are utilized > by other workloads. > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node 2018-04-11 3:10 ` Buddy Lumpkin @ 2018-04-12 13:23 ` Michal Hocko 0 siblings, 0 replies; 23+ messages in thread From: Michal Hocko @ 2018-04-12 13:23 UTC (permalink / raw) To: Buddy Lumpkin Cc: linux-mm, linux-kernel, hannes, riel, mgorman, Matthew Wilcox, akpm

On Tue 10-04-18 20:10:24, Buddy Lumpkin wrote:
[...]
> > Also please note that the direct reclaim is a way to throttle overly
> > aggressive memory consumers. The more we do in the background context
> > the easier for them it will be to allocate faster. So I am not really
> > sure that more background threads will solve the underlying problem.
>
> A single kswapd thread used to keep up with all of the demand you could
> create on a Linux system quite easily provided it didn't have to scan a lot
> of pages that were ineligible for eviction.

Well, what do you mean by ineligible for eviction? Could you be more
specific? Are we talking about pages on the LRU list, or metadata and
shrinker-based reclaim?

> 10 years ago, Fibre Channel was
> the popular high performance interconnect and if you were lucky enough
> to have the latest hardware rated at 10GFC, you could get 1.2GB/s per host
> bus adapter. Also, most high end storage solutions were still using spinning
> rust so it took an insane number of spindles behind each host bus adapter
> to saturate the channel if the access patterns were random. There really
> wasn't a reason to try to thread kswapd, and I am pretty sure there hasn't
> been any attempts to do this in the last 10 years.

I do not really see your point. Yeah, you can get faster storage today. So
what? Pagecache has always been bound by RAM speed.
> Once you start constructing
> the workload that can create this scenario, I think you will find that you
> end up with a mix that is rarely seen in practice.

What I meant is that the more you reclaim in the background, the more you
allow memory hogs to allocate, because they will not get throttled. All
that on behalf of another workload which is not memory bound and cannot use
the CPU cycles additional kswapd threads would consume. Think of any
computation-intensive workload spreading over most CPUs alongside a
memory-hungry data-processing task.

--
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <EB9E8FC6-8B02-4D7C-AA50-2B5B6BD2AF40@oracle.com>]
* Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node [not found] ` <EB9E8FC6-8B02-4D7C-AA50-2B5B6BD2AF40@oracle.com> @ 2018-04-12 13:16 ` Michal Hocko 2018-04-17 3:02 ` Buddy Lumpkin 0 siblings, 1 reply; 23+ messages in thread From: Michal Hocko @ 2018-04-12 13:16 UTC (permalink / raw) To: Buddy Lumpkin; +Cc: linux-mm, linux-kernel, hannes, riel, mgorman, willy, akpm On Tue 03-04-18 12:41:56, Buddy Lumpkin wrote: > > > On Apr 3, 2018, at 6:31 AM, Michal Hocko <mhocko@kernel.org> wrote: > > > > On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote: > >> Page replacement is handled in the Linux Kernel in one of two ways: > >> > >> 1) Asynchronously via kswapd > >> 2) Synchronously, via direct reclaim > >> > >> At page allocation time the allocating task is immediately given a page > >> from the zone free list allowing it to go right back to work doing > >> whatever it was doing; Probably directly or indirectly executing business > >> logic. > >> > >> Just prior to satisfying the allocation, free pages is checked to see if > >> it has reached the zone low watermark and if so, kswapd is awakened. > >> Kswapd will start scanning pages looking for inactive pages to evict to > >> make room for new page allocations. The work of kswapd allows tasks to > >> continue allocating memory from their respective zone free list without > >> incurring any delay. > >> > >> When the demand for free pages exceeds the rate that kswapd tasks can > >> supply them, page allocation works differently. Once the allocating task > >> finds that the number of free pages is at or below the zone min watermark, > >> the task will no longer pull pages from the free list. Instead, the task > >> will run the same CPU-bound routines as kswapd to satisfy its own > >> allocation by scanning and evicting pages. This is called a direct reclaim. 
> >> > >> The time spent performing a direct reclaim can be substantial, often > >> taking tens to hundreds of milliseconds for small order0 allocations to > >> half a second or more for order9 huge-page allocations. In fact, kswapd is > >> not actually required on a Linux system. It exists for the sole purpose of > >> optimizing performance by preventing direct reclaims. > >> > >> When memory shortfall is sufficient to trigger direct reclaims, they can > >> occur in any task that is running on the system. A single aggressive > >> memory allocating task can set the stage for collateral damage to occur in > >> small tasks that rarely allocate additional memory. Consider the impact of > >> injecting an additional 100ms of latency when nscd allocates memory to > >> facilitate caching of a DNS query. > >> > >> The presence of direct reclaims 10 years ago was a fairly reliable > >> indicator that too much was being asked of a Linux system. Kswapd was > >> likely wasting time scanning pages that were ineligible for eviction. > >> Adding RAM or reducing the working set size would usually make the problem > >> go away. Since then hardware has evolved to bring a new struggle for > >> kswapd. Storage speeds have increased by orders of magnitude while CPU > >> clock speeds stayed the same or even slowed down in exchange for more > >> cores per package. This presents a throughput problem for a single > >> threaded kswapd that will get worse with each generation of new hardware. > > > > AFAIR we used to scale the number of kswapd workers many years ago. It > > just turned out to be not all that great. We have a kswapd reclaim > > window for quite some time and that can allow to tune how much proactive > > kswapd should be. > > Are you referring to vm.watermark_scale_factor? Yes along with min_free_kbytes > This helps quite a bit. Previously > I had to increase min_free_kbytes in order to get a larger gap between the low > and min watermarks. 
I was very excited when I saw that this had been added > upstream. > > > > > Also please note that the direct reclaim is a way to throttle overly > > aggressive memory consumers. > > I totally agree, in fact I think this should be the primary role of direct reclaims > because they have a substantial impact on performance. Direct reclaims are > the emergency brakes for page allocation, and the case I am making here is > that they used to only occur when kswapd had to skip over a lot of pages. Or when it is busy reclaiming, which can be the case quite easily if you do not have the inactive file LRU full of clean page cache. And that is another problem. If you have a trivial reclaim situation then a single kswapd thread can reclaim quickly enough. But once you hit a wall with hard-to-reclaim pages then I would expect multiple threads will simply contend more (e.g. on fs locks in shrinkers etc...). Or how do you want to prevent that? Or more specifically: how is the admin supposed to know how many background threads are still improving the situation? > This changed over time as the rate a system can allocate pages increased. > Direct reclaims slowly became a normal part of page replacement. > > > The more we do in the background context > > the easier for them it will be to allocate faster. So I am not really > > sure that more background threads will solve the underlying problem. It > > is just a matter of memory hogs tuning to end in the very same > > situation AFAICS. Moreover the more they are going to allocate, the > > less CPU time _other_ (non-allocating) tasks will get. > > The important thing to realize here is that kswapd and direct reclaims run the > same code paths. There is very little that they do differently. Their target is however completely different. Kswapd wants to keep nodes balanced while direct reclaim aims to reclaim _some_ memory. That is quite some difference. Especially for the throttle-by-reclaiming-memory part. 
> If you compare > my test results with one kswapd vs four, you can see that direct reclaims > increase the kernel mode CPU consumption considerably. By dedicating > more threads to proactive page replacement, you eliminate direct reclaims > which reduces the total number of parallel threads that are spinning on the > CPU. I still haven't looked at your test results in detail because they seem quite artificial. Clean pagecache reclaim is not all that interesting IMHO [...] > > I would be also very interested > > to see how to scale the number of threads based on how CPUs are utilized > > by other workloads. > > I think we have reached the point where it makes sense for page replacement to have more > than one mode. Enterprise class servers with lots of memory and a large number of CPU > cores would benefit heavily if more threads could be devoted toward proactive page > replacement. The polar opposite case is my Raspberry PI which I want to run as efficiently > as possible. This problem is only going to get worse. I think it makes sense to be able to > choose between efficiency and performance (throughput and latency reduction). The thing is that as long as this would require admin to guess then this is not all that useful. People will simply not know what to set and we are going to end up with stupid admin guides claiming that you should use 1/N of per node cpus for kswapd and that will not work. Not to mention that the reclaim logic is full of heuristics which change over time and a subtle implementation detail that would work for a particular scaling might break without anybody noticing. Really, if we are not able to come up with some auto tuning then I think that this is not really worth it. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node 2018-04-12 13:16 ` Michal Hocko @ 2018-04-17 3:02 ` Buddy Lumpkin 2018-04-17 9:03 ` Michal Hocko 0 siblings, 1 reply; 23+ messages in thread From: Buddy Lumpkin @ 2018-04-17 3:02 UTC (permalink / raw) To: Michal Hocko; +Cc: linux-mm, linux-kernel, hannes, riel, mgorman, willy, akpm > On Apr 12, 2018, at 6:16 AM, Michal Hocko <mhocko@kernel.org> wrote: > > On Tue 03-04-18 12:41:56, Buddy Lumpkin wrote: >> >>> On Apr 3, 2018, at 6:31 AM, Michal Hocko <mhocko@kernel.org> wrote: >>> >>> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote: >>>> Page replacement is handled in the Linux Kernel in one of two ways: >>>> >>>> 1) Asynchronously via kswapd >>>> 2) Synchronously, via direct reclaim >>>> >>>> At page allocation time the allocating task is immediately given a page >>>> from the zone free list allowing it to go right back to work doing >>>> whatever it was doing; Probably directly or indirectly executing business >>>> logic. >>>> >>>> Just prior to satisfying the allocation, free pages is checked to see if >>>> it has reached the zone low watermark and if so, kswapd is awakened. >>>> Kswapd will start scanning pages looking for inactive pages to evict to >>>> make room for new page allocations. The work of kswapd allows tasks to >>>> continue allocating memory from their respective zone free list without >>>> incurring any delay. >>>> >>>> When the demand for free pages exceeds the rate that kswapd tasks can >>>> supply them, page allocation works differently. Once the allocating task >>>> finds that the number of free pages is at or below the zone min watermark, >>>> the task will no longer pull pages from the free list. Instead, the task >>>> will run the same CPU-bound routines as kswapd to satisfy its own >>>> allocation by scanning and evicting pages. This is called a direct reclaim. 
>>>> >>>> The time spent performing a direct reclaim can be substantial, often >>>> taking tens to hundreds of milliseconds for small order0 allocations to >>>> half a second or more for order9 huge-page allocations. In fact, kswapd is >>>> not actually required on a Linux system. It exists for the sole purpose of >>>> optimizing performance by preventing direct reclaims. >>>> >>>> When memory shortfall is sufficient to trigger direct reclaims, they can >>>> occur in any task that is running on the system. A single aggressive >>>> memory allocating task can set the stage for collateral damage to occur in >>>> small tasks that rarely allocate additional memory. Consider the impact of >>>> injecting an additional 100ms of latency when nscd allocates memory to >>>> facilitate caching of a DNS query. >>>> >>>> The presence of direct reclaims 10 years ago was a fairly reliable >>>> indicator that too much was being asked of a Linux system. Kswapd was >>>> likely wasting time scanning pages that were ineligible for eviction. >>>> Adding RAM or reducing the working set size would usually make the problem >>>> go away. Since then hardware has evolved to bring a new struggle for >>>> kswapd. Storage speeds have increased by orders of magnitude while CPU >>>> clock speeds stayed the same or even slowed down in exchange for more >>>> cores per package. This presents a throughput problem for a single >>>> threaded kswapd that will get worse with each generation of new hardware. >>> >>> AFAIR we used to scale the number of kswapd workers many years ago. It >>> just turned out to be not all that great. We have a kswapd reclaim >>> window for quite some time and that can allow to tune how much proactive >>> kswapd should be. >> >> Are you referring to vm.watermark_scale_factor? > > Yes along with min_free_kbytes > >> This helps quite a bit. Previously >> I had to increase min_free_kbytes in order to get a larger gap between the low >> and min watermarks. 
I was very excited when I saw that this had been added >> upstream. >> >>> >>> Also please note that the direct reclaim is a way to throttle overly >>> aggressive memory consumers. >> >> I totally agree, in fact I think this should be the primary role of direct reclaims >> because they have a substantial impact on performance. Direct reclaims are >> the emergency brakes for page allocation, and the case I am making here is >> that they used to only occur when kswapd had to skip over a lot of pages. > > Or when it is busy reclaiming, which can be the case quite easily if you > do not have the inactive file LRU full of clean page cache. And that is > another problem. If you have a trivial reclaim situation then a single > kswapd thread can reclaim quickly enough. A single kswapd thread does not help quickly enough. That is the entire point of this patch. > But once you hit a wall with > hard-to-reclaim pages then I would expect multiple threads will simply > contend more (e.g. on fs locks in shrinkers etc…). If that is the case, this is already happening since direct reclaims do just about everything that kswapd does. I have tested with a mix of filesystem reads, writes and anonymous memory with and without a swap device. The only locking problems I have run into so far are related to routines in mm/workingset.c. It is a lot harder to burden the page scan logic than it used to be. Somewhere around 2007 a change was made where page types that had to be skipped over were simply removed from the LRU list. Anonymous pages are only scanned if a swap device exists, and mlocked pages are not scanned at all. It took a couple of years before this was available in the common distros though. Also, 64-bit kernels help, as you don’t have the problem where objects held in ZONE_NORMAL pin pages in ZONE_HIGHMEM. Getting real-world results is a waiting game on my end. Once we have a version available to service owners, they need to coordinate an outage so that systems can be rebooted. 
Only then can I coordinate with them to test for improvements. > Or how do you want > to prevent that? Kswapd has a throughput problem. Once that problem is solved, new bottlenecks will reveal themselves. There is nothing to prevent here. When you remove bottlenecks, new bottlenecks materialize and someone will need to identify them and make them go away. > > Or more specifically: how is the admin supposed to know how many > background threads are still improving the situation? Reduce the setting and check to see if pgscan_direct is still incrementing. > >> This changed over time as the rate a system can allocate pages increased. >> Direct reclaims slowly became a normal part of page replacement. >> >>> The more we do in the background context >>> the easier for them it will be to allocate faster. So I am not really >>> sure that more background threads will solve the underlying problem. It >>> is just a matter of memory hogs tuning to end in the very same >>> situation AFAICS. Moreover the more they are going to allocate, the >>> less CPU time _other_ (non-allocating) tasks will get. >> >> The important thing to realize here is that kswapd and direct reclaims run the >> same code paths. There is very little that they do differently. > > Their target is however completely different. Kswapd wants to keep nodes > balanced while direct reclaim aims to reclaim _some_ memory. That is > quite some difference. Especially for the throttle-by-reclaiming-memory > part. Routines like balance_pgdat showed up in 2.4.10 when Andrea Arcangeli rewrote a lot of the page replacement logic. He referred to his work as the classzone patch, and the whole selling point of what it would provide was making allocation and page replacement more cohesive and balanced to avoid cases where kswapd would behave pathologically, scanning to evict pages in the wrong location, or in the wrong order. 
That doesn’t mean that kswapd’s primary occupation is balancing, in fact, if you read the comments, direct reclaims and kswapd sound pretty similar to me: /* * This is the direct reclaim path, for page-allocating processes. We only * try to reclaim pages from zones which will satisfy the caller's allocation * request. * * If a zone is deemed to be full of pinned pages then just give it a light * scan then give up on it. */ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) /* * For kswapd, balance_pgdat() will reclaim pages across a node from zones * that are eligible for use by the caller until at least one zone is * balanced. * * Returns the order kswapd finished reclaiming at. * * kswapd scans the zones in the highmem->normal->dma direction. It skips * zones which have free_pages > high_wmark_pages(zone), but once a zone is * found to have free_pages <= high_wmark_pages(zone), any page in that zone * or lower is eligible for reclaim until at least one usable zone is * balanced. */ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) Kswapd makes an effort toward providing balance, but that is clearly not the main goal. Both code paths are triggered by a need for memory, and both code paths scan zones that are eligible to satisfy the allocation that triggered them. > >> If you compare >> my test results with one kswapd vs four, you can see that direct reclaims >> increase the kernel mode CPU consumption considerably. By dedicating >> more threads to proactive page replacement, you eliminate direct reclaims >> which reduces the total number of parallel threads that are spinning on the >> CPU. > > I still haven't looked at your test results in detail because they seem > quite artificial. Clean pagecache reclaim is not all that interesting > IMHO Clean page cache is extremely interesting for demonstrating this bottleneck. 
kswapd reads from the tail of the inactive list, and practically every page it encounters is eligible for eviction, and yet it still cannot keep up with the demand for fresh pages. In the test data I provided, you can see that peak throughput with direct IO was: 26,254,215 Kbytes/s Peak throughput without direct IO and 1 kswapd thread was: 18,001,910 Kbytes/s Direct IO is 46% higher, and this gap is only going to continue to increase. It used to be around 10%. Any negative effects that can be seen with additional kswapd threads can already be seen with multiple concurrent direct reclaims. The additional throughput that is gained by scanning proactively in kswapd can certainly push harder against any additional lock contention. In that case kswapd is just the canary in the coal mine, finding problems that would eventually need to be solved anyway. > > [...] >>> I would be also very interested >>> to see how to scale the number of threads based on how CPUs are utilized >>> by other workloads. >> >> I think we have reached the point where it makes sense for page replacement to have more >> than one mode. Enterprise class servers with lots of memory and a large number of CPU >> cores would benefit heavily if more threads could be devoted toward proactive page >> replacement. The polar opposite case is my Raspberry PI which I want to run as efficiently >> as possible. This problem is only going to get worse. I think it makes sense to be able to >> choose between efficiency and performance (throughput and latency reduction). > > The thing is that as long as this would require admin to guess then this > is not all that useful. People will simply not know what to set and we > are going to end up with stupid admin guides claiming that you should > use 1/N of per node cpus for kswapd and that will not work. I think this sysctl is very intuitive to use. Only use it if direct reclaims are occurring. This can be seen with sar -B. Justify any increase with testing. 
That is a whole lot easier to wrap your head around than a lot of the other sysctls that are available today. Find me an admin that actually understands what the swappiness tunable does. > Not to > mention that the reclaim logic is full of heuristics which change over > time and a subtle implementation detail that would work for a particular > scaling might break without anybody noticing. Really, if we are not able > to come up with some auto tuning then I think that this is not really > worth it. This is all speculation about how a patch behaves that you have not even tested. Similar arguments can be made about most of the sysctls that are available. > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node 2018-04-17 3:02 ` Buddy Lumpkin @ 2018-04-17 9:03 ` Michal Hocko 0 siblings, 0 replies; 23+ messages in thread From: Michal Hocko @ 2018-04-17 9:03 UTC (permalink / raw) To: Buddy Lumpkin; +Cc: linux-mm, linux-kernel, hannes, riel, mgorman, willy, akpm On Mon 16-04-18 20:02:22, Buddy Lumpkin wrote: > > > On Apr 12, 2018, at 6:16 AM, Michal Hocko <mhocko@kernel.org> wrote: [...] > > But once you hit a wall with > > hard-to-reclaim pages then I would expect multiple threads will simply > > contend more (e.g. on fs locks in shrinkers etc…). > > If that is the case, this is already happening since direct reclaims do just about > everything that kswapd does. I have tested with a mix of filesystem reads, writes > and anonymous memory with and without a swap device. The only locking > problems I have run into so far are related to routines in mm/workingset.c. You haven't tried hard enough. Try to generate a bigger fs metadata pressure. In other words something less of a toy than a pure reader without any real processing. [...] > > Or more specifically. How is the admin supposed to know how many > > background threads are still improving the situation? > > Reduce the setting and check to see if pgscan_direct is still incrementing. This just doesn't work. You are oversimplifying a lot! There are much more aspects to this. How many background threads are still worth it without stealing cycles from others? Is half of CPUs per NUMA node worth devoting to background reclaim or is it better to let those excessive memory consumers to be throttled by the direct reclaim? You are still ignoring/underestimating the fact that kswapd steals cycles even from other workload that is not memory bound while direct reclaim throttles (mostly) memory consumers. [...] > > I still haven't looked at your test results in detail because they seem > > quite artificial. 
Clean pagecache reclaim is not all that interesting > > IMHO > > Clean page cache is extremely interesting for demonstrating this bottleneck. yes it shows the bottleneck but it is quite artificial. Read data is usually processed and/or written back and that changes the picture a lot. Anyway, I do agree that the reclaim can be made faster. I am just not (yet) convinced that multiplying the number of workers is the way to achieve that. [...] > >>> I would be also very interested > >>> to see how to scale the number of threads based on how CPUs are utilized > >>> by other workloads. > >> > >> I think we have reached the point where it makes sense for page replacement to have more > >> than one mode. Enterprise class servers with lots of memory and a large number of CPU > >> cores would benefit heavily if more threads could be devoted toward proactive page > >> replacement. The polar opposite case is my Raspberry PI which I want to run as efficiently > >> as possible. This problem is only going to get worse. I think it makes sense to be able to > >> choose between efficiency and performance (throughput and latency reduction). > > > > The thing is that as long as this would require admin to guess then this > > is not all that useful. People will simply not know what to set and we > > are going to end up with stupid admin guides claiming that you should > > use 1/N of per node cpus for kswapd and that will not work. > > I think this sysctl is very intuitive to use. Only use it if direct reclaims are > occurring. This can be seen with sar -B. Justify any increase with testing. > That is a whole lot easier to wrap your head around than a lot of the other > sysctls that are available today. Find me an admin that actually understands > what the swappiness tunable does. Well, you have pointed to a nice example actually. Yes swappiness is confusing and you can find _many_ different howtos for tuning. Do they work? 
No, for a long time on most workloads because we are simply pagecache biased so much these days that we simply ignore the value most of the time. I am pretty sure your "just watch sar -B and tune accordingly" will become obsolete in a short time and people will get confused again. Because they are explicitly tuning for their workload but it doesn't help anymore because the internal implementation of the reclaim has changed again (this happens all the time). No, I simply do not want to repeat past errors and expose too many implementation details for admins who will most likely have no clue how to use the tuning and rely on random advice on the internet or, even worse, admin guides of questionable quality full of cargo-cult advice (remember the advice to disable THP for basically any performance problem you see). > > Not to > > mention that the reclaim logic is full of heuristics which change over > > time and a subtle implementation detail that would work for a particular > > scaling might break without anybody noticing. Really, if we are not able > > to come up with some auto tuning then I think that this is not really > > worth it. This is all speculation about how a patch behaves that you have not even tested. Similar arguments can be made about most of the sysctls that are available. I really do want a solid background for a change like this. You are throwing corner-case numbers at me and ignoring some important points. So let me repeat. If we want to allow more kswapd threads per node then we really have to evaluate the effect on memory-hog throttling and we should have a decent idea on how to scale those threads. If we are not able to handle that in the kernel with the full picture then I fail to see how an admin can do that. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2020-10-02 14:29 UTC | newest] Thread overview: 23+ messages -- 2020-09-30 19:27 [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node Sebastiaan Meijer 2020-10-01 12:30 ` Michal Hocko 2020-10-01 16:18 ` Sebastiaan Meijer 2020-10-02 7:03 ` Michal Hocko 2020-10-02 8:40 ` Mel Gorman 2020-10-02 13:53 ` Rik van Riel 2020-10-02 14:00 ` Matthew Wilcox 2020-10-02 14:29 ` Michal Hocko -- strict thread matches above, loose matches on Subject: below -- 2018-04-02 9:24 [RFC PATCH 0/1] mm: " Buddy Lumpkin 2018-04-02 9:24 ` [RFC PATCH 1/1] vmscan: " Buddy Lumpkin 2018-04-03 13:31 ` Michal Hocko 2018-04-03 19:07 ` Matthew Wilcox 2018-04-03 20:49 ` Buddy Lumpkin 2018-04-03 21:12 ` Matthew Wilcox 2018-04-04 10:07 ` Buddy Lumpkin 2018-04-05 4:08 ` Buddy Lumpkin 2018-04-11 6:37 ` Buddy Lumpkin 2018-04-11 3:52 ` Buddy Lumpkin 2018-04-03 20:13 ` Buddy Lumpkin 2018-04-11 3:10 ` Buddy Lumpkin 2018-04-12 13:23 ` Michal Hocko [not found] ` <EB9E8FC6-8B02-4D7C-AA50-2B5B6BD2AF40@oracle.com> 2018-04-12 13:16 ` Michal Hocko 2018-04-17 3:02 ` Buddy Lumpkin 2018-04-17 9:03 ` Michal Hocko