From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1422641AbcFCNv7 (ORCPT ); Fri, 3 Jun 2016 09:51:59 -0400 Received: from mx1.redhat.com ([209.132.183.28]:53967 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932831AbcFCNv5 (ORCPT ); Fri, 3 Jun 2016 09:51:57 -0400 Date: Fri, 3 Jun 2016 15:51:54 +0200 From: Andrea Arcangeli To: Michal Hocko Cc: Sergey Senozhatsky , Sergey Senozhatsky , Andrew Morton , Vlastimil Babka , "Kirill A. Shutemov" , Stephen Rothwell , linux-mm@kvack.org, linux-next@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup Message-ID: <20160603135154.GD29930@redhat.com> References: <20160601131122.7dbb0a65@canb.auug.org.au> <20160602014835.GA635@swordfish> <20160602092113.GH1995@dhcp22.suse.cz> <20160602120857.GA704@swordfish> <20160602122109.GM1995@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160602122109.GM1995@dhcp22.suse.cz> User-Agent: Mutt/1.6.1 (2016-04-27) X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.39]); Fri, 03 Jun 2016 13:51:56 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jun 02, 2016 at 02:21:10PM +0200, Michal Hocko wrote: > Testing with the patch makes some sense as well, but I would like to > hear from Andrea whether the approach is good because I am wondering why > he hasn't done that before - it feels so much simpler than the current > code. The down_write in the exit path comes from __ksm_exit. If you don't like it there I'd suggest to also remove it from __ksm_exit. This is a proposed cleanup correct? The first thing that I can notice is that khugepaged_test_exit() then can only be called and provide the expected retval, after atomic_inc_not_zero(mm_users). Also note mmget_not_zero() should be used instead. However the code still uses khugepaged_test_exit in __khugepage_enter that won't increase the mm_users, so then the patch relaxes that check too much, albeit only for a debug check not strictly a bug. The cons of this change purely that it'll decrease the responsiveness in releasing the RAM of a killed task a bit. To me the fewer time we hold the mm_users the better and I don't see an obvious runtime improvement coming from this change. It's a bit simpler yes, but the down_write in the exit path is well understood, ksm does the same thing and it's in a slow path (it only happens if the mm that exited is the current one under scan by either ksmd or khugepaged, so normally the down_write is not executed in the exit path and the "mm" is collected right away both as a mm_users and mm_count). In short I think it's a tradeoff: pros) removes down_write in a slow path of the the mm exit which may simplify the code a bit, cons) it could increase the latency in freeing memory as result of a task exiting or being killed during the khugepaged scan, for example while the THP is being allocated. While compaction runs to allocate the THP in collapse_huge_page, if the task is killed currently the memory is released right away, without waiting for the allocation to succeed or fail. I don't see a big enough problem with the down_write in a slow path of khugepaged_exit to justify the increased latency in releasing memory. I was very happy by Oleg's patch reducing the mm_users holding of userfaultfd too. That was controlled by userland so it would only be an issue for non-cooperative usage which isn't upstream yet, and it was also much wider than this one would become with the patch applied, but I liked the direction. If prefer instead to remove the down_write, you probably could move the test_exit before the down_read/write to bail out before taking the lock: you don't need the mmap_sem to do test_exit anymore. The only reason the text_exit would remain in fact is just to reduce the latency of the memory freeing, it then becomes a voluntary preempt cond_resched() to release the memory to make a parallel ;), but unable to let the kernel free the memory while the THP allocation runs. Thanks, Andrea