linux-kernel.vger.kernel.org archive mirror
* Re: [REGRESSION] [BISECTED] kswapd high CPU usage
From: Alexey Vlasov @ 2020-07-15 10:04 UTC
  To: linux-kernel; +Cc: kirill

Hi,

After upgrading from 3.14 to 4.14.173, I ran into exactly the same problem
that the original post described. Namely, kswapd sometimes starts to consume
100% of the CPU, and the system freezes for several minutes.

Below is an example of such an event (orange - system cpu, red - total cpu):
https://www.dropbox.com/s/5wr5su3p0fubq0a/kswapd_100.png?dl=0

Here is the top output:

top - 23:44:16 up 9 days,  2:06, 14 users,  load average: 14.03, 12.32, 13.07
Tasks: 7108 total,  16 running, 6921 sleeping,   0 stopped,   9 zombie
%Cpu(s): 28.1 us, 18.1 sy,  0.0 ni, 51.7 id,  1.2 wa,  0.0 hi,  0.9 si,  0.0 st
KiB Mem : 19803248+total,   596160 free, 11094233+used, 86493992 buff/cache
KiB Swap: 62914556 total, 62302912 free,   611644 used. 71269504 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  134 root      20   0       0      0      0 R  86.2  0.0 383:21.35 kswapd0
  135 root      20   0       0      0      0 R  84.9  0.0 344:00.17 kswapd1

This is the beginning of the collapse. A few minutes later the system has
thousands of processes in D state and stops responding:

top - 23:57:33 up 9 days,  2:19, 14 users,  load average: 1223.43, 1083.85, 662.
Tasks: 8356 total, 344 running, 7821 sleeping,   0 stopped,  44 zombie
%Cpu(s): 28.1 us, 18.2 sy,  0.0 ni, 51.6 id,  1.2 wa,  0.0 hi,  0.9 si,  0.0 st
KiB Mem : 19803248+total,   800516 free, 11587540+used, 81356560 buff/cache
KiB Swap: 62914556 total, 62130072 free,   784484 used. 62231208 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
10704 w_defau+  20   0  393476 117160  15160 D 100.0  0.1   0:00.16 httpd
16056 w_sti46+  20   0  599048  21528   9504 S 100.0  0.0   0:00.00 httpd
12649 w_divan+  20   0   41764   8064   3904 D 100.0  0.0   0:06.62 menu1.pl
13739 w_defau+  20   0  248696  24168  14132 S 100.0  0.0   0:00.01 httpd
 5172 mysql     20   0 6993508 2.310g   9660 D  38.9  1.2   3866:26 mysqld_aux3
 4683 mysql     20   0 9974.1m 4.366g   8268 D  38.7  2.3   2553:14 mysqld
 4791 mysql     20   0 10.359g 4.180g   9784 D  28.5  2.2   1659:40 mysqld_aux1
 5078 mysql     20   0  9.871g 3.774g   9888 D  25.4  2.0   2445:08 mysqld_aux2
    9 root      20   0       0      0      0 I   3.4  0.0  13:56.16 rcu_sched
  135 root      20   0       0      0      0 D   2.8  0.0 344:29.12 kswapd1
  134 root      20   0       0      0      0 D   2.6  0.0 383:49.86 kswapd0

Nevertheless, there is no I/O activity before, during, or after this collapse.

I tried your suggestion of restoring
"late_initcall(set_recommended_min_free_kbytes)"; unfortunately, it did not help.

In my experience this could be solved by adding RAM, but unfortunately this
server has no free slots left; 188 GB of RAM is its maximum.

Also, I cannot go back to the 3.14 kernel, since one of the partitions contains an
XFS filesystem with the new v5 superblock, which the 3.14 kernel does not support.

If you need more information, for example vmstat or /proc/meminfo, I can send it.

Is there any solution to this problem?
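
For anyone who wants to capture the same kind of data during such an event,
here is a minimal sketch of two helpers. The function names meminfo_kb and
count_d_state are hypothetical and not part of any tool; they only read
/proc/meminfo and filter ps output, so they should be safe to run on a loaded
machine, although I have only reasoned through them, not measured them here.

```shell
# Print the value (in kB) of one /proc/meminfo field.
# $1 = field name without the colon, e.g. MemFree
# $2 = file to read (optional, defaults to /proc/meminfo)
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# Count processes in uninterruptible sleep (state D).
# Expects one process state per line on stdin, as printed by "ps -eo state=".
count_d_state() {
    # grep -c exits non-zero when the count is 0; "|| true" hides that.
    grep -c '^D' || true
}
```

Typical use would be "meminfo_kb MemFree" and "ps -eo state= | count_d_state",
sampled every few seconds together with "vmstat 1", so that the growth in
D-state processes can be lined up against free memory.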

> On Fri, Jan 22, 2016 at 12:28:10AM +1000, Nalorokk wrote:
>> It appears that kernels newer than 4.1 have kswapd-related bug resulting in
>> high CPU usage. CPU 100% usage could last for several minutes or several
>> days, with CPU being busy entirely with serving kswapd. It happens usually
>> after server being mostly idle, sometimes after days, sometimes after weeks
>> of uptime. But the issue appears much sooner if the machine is loaded with
>> something like building a kernel.
>>
>> Here are the graphs of CPU load: first
>> <http://i.piccy.info/i9/9ee6c0620c9481a974908484b2a52a0f/1453384595/44012/994698/cpu_month.png>,
>> second
>> <http://i.piccy.info/i9/7c97c2f39620bb9d7ea93096312dbbb6/1453384649/41222/994698/cpu_year.png>.
>> Perf top output is here <http://pastebin.com/aRzTjb2x> as well.
>>
>> To find the cause of this problem I've started with the fact that the issue
>> appeared after 4.1 kernel update. Then I performed longterm test of 3.18,
>> and discovered that 3.18 is unaffected by this bug. Then I did some tests
>> of 4.0 to confirm that this version behaves well too.
>>
>> Then I performed git bisect from tag v4.0 to v4.1-rc1 and found exact
>> commits that seem to be reason of high CPU usage.
>>
>> The first really "bad" commit is 79553da293d38d63097278de13e28a3b371f43c1.
>> 2 previous commits cause weird behavior as well resulting in kswapd
>> consuming more CPU than unaffected kernels, but not that much as the commit
>> pointed above. I believe those commits are related to the same mm tree
>> merge.
>>
>> I tried adding transparent_hugepage=never to the kernel boot parameters, but
>> it did not change anything. Switching the allocator from SLUB to SLAB alters
>> the behavior and lowers the CPU load, but does not solve the problem.
>>
>> Here <https://bugzilla.kernel.org/show_bug.cgi?id=110501> is the kernel
>> bugzilla bug report as well.
>>
>> Ideas?
>
> Could you try to insert "late_initcall(set_recommended_min_free_kbytes);"
> back and check if it makes any difference?
>
>-- 
>Kirill A. Shutemov


* Re: [REGRESSION] [BISECTED] kswapd high CPU usage
From: Alexey Vlasov @ 2020-08-10 13:47 UTC
  To: linux-kernel; +Cc: kirill

I have found a workaround that prevents these hangs.
First, disable THP:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

Next, increase vm.min_free_kbytes; in my case 16 GB is enough:

vm.min_free_kbytes = 16777216
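
The value above is simply 16 GiB expressed in kB. A small sketch for deriving
it and for checking that the THP setting took effect follows; the helper names
gib_to_kib and thp_active_mode are hypothetical, made up for this example.

```shell
# Convert GiB to KiB, the unit that vm.min_free_kbytes is expressed in.
gib_to_kib() {
    echo $(( $1 * 1024 * 1024 ))
}

# Print the active mode from a THP sysfs file, i.e. the bracketed word
# in a line such as "always madvise [never]".
# $1 = path, e.g. /sys/kernel/mm/transparent_hugepage/enabled
thp_active_mode() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}
```

As root, the reserve can then be set with
sysctl -w vm.min_free_kbytes="$(gib_to_kib 16)", and
thp_active_mode /sys/kernel/mm/transparent_hugepage/enabled should print
"never" after the echo commands above. Values written through sysfs and
sysctl -w do not survive a reboot, so the sysctl line would also have to go
into /etc/sysctl.conf or a file under /etc/sysctl.d/, and the THP writes into
a boot-time script.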



* Re: [REGRESSION] [BISECTED] kswapd high CPU usage
From: Kirill A. Shutemov @ 2016-01-21 16:16 UTC
  To: Nalorokk
  Cc: Kirill A. Shutemov, Stefan Strogin, Andrew Morton, Sasha Levin,
	Mel Gorman, linux-mm, linux-kernel, oleksandr

On Fri, Jan 22, 2016 at 12:28:10AM +1000, Nalorokk wrote:
> It appears that kernels newer than 4.1 have kswapd-related bug resulting in
> high CPU usage. CPU 100% usage could last for several minutes or several
> days, with CPU being busy entirely with serving kswapd. It happens usually
> after server being mostly idle, sometimes after days, sometimes after weeks
> of uptime. But the issue appears much sooner if the machine is loaded with
> something like building a kernel.
> 
> Here are the graphs of CPU load: first
> <http://i.piccy.info/i9/9ee6c0620c9481a974908484b2a52a0f/1453384595/44012/994698/cpu_month.png>,
> second
> <http://i.piccy.info/i9/7c97c2f39620bb9d7ea93096312dbbb6/1453384649/41222/994698/cpu_year.png>.
> Perf top output is here <http://pastebin.com/aRzTjb2x> as well.
> 
> To find the cause of this problem I've started with the fact that the issue
> appeared after 4.1 kernel update. Then I performed longterm test of 3.18,
> and discovered that 3.18 is unaffected by this bug. Then I did some tests
> of 4.0 to confirm that this version behaves well too.
> 
> Then I performed git bisect from tag v4.0 to v4.1-rc1 and found exact
> commits that seem to be reason of high CPU usage.
> 
> The first really "bad" commit is 79553da293d38d63097278de13e28a3b371f43c1.
> 2 previous commits cause weird behavior as well resulting in kswapd
> consuming more CPU than unaffected kernels, but not that much as the commit
> pointed above. I believe those commits are related to the same mm tree
> merge.
> 
> I tried adding transparent_hugepage=never to the kernel boot parameters, but
> it did not change anything. Switching the allocator from SLUB to SLAB alters
> the behavior and lowers the CPU load, but does not solve the problem.
> 
> Here <https://bugzilla.kernel.org/show_bug.cgi?id=110501> is the kernel
> bugzilla bug report as well.
> 
> Ideas?

Could you try to insert "late_initcall(set_recommended_min_free_kbytes);"
back and check if it makes any difference?

-- 
 Kirill A. Shutemov

