From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758362AbcLBJOw (ORCPT ); Fri, 2 Dec 2016 04:14:52 -0500 Received: from mx1.molgen.mpg.de ([141.14.17.9]:41919 "EHLO mx1.molgen.mpg.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754978AbcLBJOt (ORCPT ); Fri, 2 Dec 2016 04:14:49 -0500 Subject: Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node` To: Michal Hocko References: <20161121134130.GB18112@dhcp22.suse.cz> <20161121140122.GU3612@linux.vnet.ibm.com> <20161121141818.GD18112@dhcp22.suse.cz> <20161121142901.GV3612@linux.vnet.ibm.com> <68025f6c-6801-ab46-b0fc-a9407353d8ce@molgen.mpg.de> <20161124101525.GB20668@dhcp22.suse.cz> <583AA50A.9010608@molgen.mpg.de> <20161128110449.GK14788@dhcp22.suse.cz> <109d5128-f3a4-4b6e-db17-7a1fcb953500@molgen.mpg.de> <29196f89-c35e-f79d-8e4d-2bf73fe930df@molgen.mpg.de> <20161130110944.GD18432@dhcp22.suse.cz> <099c4569-a010-5414-0934-6af3734d8460@molgen.mpg.de> Cc: Paul Menzel , dvteam@molgen.mpg.de, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Josh Triplett , "Paul E. McKenney" , Peter Zijlstra From: Donald Buczek Message-ID: <6c8599e5-7f4c-c60b-8954-08168af6343b@molgen.mpg.de> Date: Fri, 2 Dec 2016 10:14:45 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0 MIME-Version: 1.0 In-Reply-To: <099c4569-a010-5414-0934-6af3734d8460@molgen.mpg.de> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/30/16 12:43, Donald Buczek wrote: > On 11/30/16 12:09, Michal Hocko wrote: >> [CCing Paul] >> >> On Wed 30-11-16 11:28:34, Donald Buczek wrote: >> [...] >>> shrink_active_list gets and releases the spinlock and calls >>> cond_resched(). >>> This should give other tasks a chance to run. Just as an experiment, >>> I'm >>> trying >>> >>> --- a/mm/vmscan.c >>> +++ b/mm/vmscan.c >>> @@ -1921,7 +1921,7 @@ static void shrink_active_list(unsigned long >>> nr_to_scan, >>> spin_unlock_irq(&pgdat->lru_lock); >>> >>> while (!list_empty(&l_hold)) { >>> - cond_resched(); >>> + cond_resched_rcu_qs(); >>> page = lru_to_page(&l_hold); >>> list_del(&page->lru); >>> >>> and didn't hit a rcu_sched warning for >21 hours uptime now. We'll see. >> This is really interesting! Is it possible that the RCU stall detector >> is somehow confused? > > Wait... 21 hours is not yet a test result. For the records: We didn't have any stall warnings after 2 days and 20 hours now and so I'm quite confident, that my above patch fixed the problem for v4.8.0. On previous boots the rcu warnings started after 37,0.2,1,2,0.8 hours uptime. Now I've applied this patch to stable latest (v4.8.11) on another backup machine which suffered even more rcu stalls. Donald > [...] -- Donald Buczek buczek@molgen.mpg.de Tel: +49 30 8413 1433