Subject: Re: [PATCH v3 04/20] mm: VMA sequence count
From: Laurent Dufour
Date: Thu, 14 Sep 2017 09:55:13 +0200
To: Sergey Senozhatsky
Cc: paulmck@linux.vnet.ibm.com, peterz@infradead.org, akpm@linux-foundation.org,
    kirill@shutemov.name, ak@linux.intel.com, mhocko@kernel.org, dave@stgolabs.net,
    jack@suse.cz, Matthew Wilcox, benh@kernel.crashing.org, mpe@ellerman.id.au,
    paulus@samba.org, Thomas Gleixner, Ingo Molnar, hpa@zytor.com, Will Deacon,
    Sergey Senozhatsky, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    haren@linux.vnet.ibm.com, khandual@linux.vnet.ibm.com, npiggin@gmail.com,
    bsingharora@gmail.com, Tim Chen, linuxppc-dev@lists.ozlabs.org, x86@kernel.org
Message-Id: <441ff1c6-72a7-5d96-02c8-063578affb62@linux.vnet.ibm.com>
In-Reply-To: <20170914003116.GA599@jagdpanzerIV.localdomain>
References: <1504894024-2750-1-git-send-email-ldufour@linux.vnet.ibm.com>
 <1504894024-2750-5-git-send-email-ldufour@linux.vnet.ibm.com>
 <20170913115354.GA7756@jagdpanzerIV.localdomain>
 <44849c10-bc67-b55e-5788-d3c6bb5e7ad1@linux.vnet.ibm.com>
 <20170914003116.GA599@jagdpanzerIV.localdomain>

Hi,

On 14/09/2017 02:31, Sergey Senozhatsky wrote:
> Hi,
>
> On (09/13/17 18:56), Laurent Dufour wrote:
>> Hi Sergey,
>>
>> On 13/09/2017 13:53, Sergey Senozhatsky wrote:
>>> Hi,
>>>
>>> On (09/08/17 20:06), Laurent Dufour wrote:
> [..]
>>> ok, so what I got on my box is:
>>>
>>> vm_munmap()  -> down_write_killable(&mm->mmap_sem)
>>>  do_munmap()
>>>   __split_vma()
>>>    __vma_adjust()  -> write_seqcount_begin(&vma->vm_sequence)
>>>                    -> write_seqcount_begin_nested(&next->vm_sequence,
>>>                                                   SINGLE_DEPTH_NESTING)
>>>
>>> so this gives 3 dependencies:
>>>     ->mmap_sem  ->  ->vm_seq
>>>     ->vm_seq    ->  ->vm_seq/1
>>>     ->mmap_sem  ->  ->vm_seq/1
>>>
>>>
>>> SyS_mremap()  -> down_write_killable(&current->mm->mmap_sem)
>>>  move_vma()   -> write_seqcount_begin(&vma->vm_sequence)
>>>               -> write_seqcount_begin_nested(&new_vma->vm_sequence,
>>>                                              SINGLE_DEPTH_NESTING)
>>>   move_page_tables()
>>>    __pte_alloc()
>>>     pte_alloc_one()
>>>      __alloc_pages_nodemask()
>>>       fs_reclaim_acquire()
>>>
>>> I think here we have a prepare_alloc_pages() call, that does
>>>
>>>     -> fs_reclaim_acquire(gfp_mask)
>>>     -> fs_reclaim_release(gfp_mask)
>>>
>>> so that adds one more dependency:
>>>     ->mmap_sem  ->  ->vm_seq    ->  fs_reclaim
>>>     ->mmap_sem  ->  ->vm_seq/1  ->  fs_reclaim
>>>
>>>
>>> now, under memory pressure we hit the slow path and perform direct
>>> reclaim. direct reclaim is done under the fs_reclaim lock, so we end
>>> up with the following call chain:
>>>
>>> __alloc_pages_nodemask()
>>>  __alloc_pages_slowpath()
>>>   __perform_reclaim()  -> fs_reclaim_acquire(gfp_mask)
>>>    try_to_free_pages()
>>>     shrink_node()
>>>      shrink_active_list()
>>>       rmap_walk_file()  -> i_mmap_lock_read(mapping)
>>>
>>> and this breaks the existing dependency, since we now take the leaf
>>> lock (fs_reclaim) first and then the root lock (->mmap_sem).
>>
>> Thanks for looking at this.
>> I'm sorry, I must have missed something.
>
> no prob :)
>
>> My understanding is that there are 3 chains of locks:
>>  1. from __vma_adjust():           mmap_sem -> i_mmap_rwsem -> vm_seq
>>  2. from move_vma():               mmap_sem -> vm_seq -> fs_reclaim
>>  3. from __alloc_pages_nodemask(): fs_reclaim -> i_mmap_rwsem
>
> yes, as far as the lockdep warning suggests.
>
>> So the solution would be to have in __vma_adjust():
>>     mmap_sem -> vm_seq -> i_mmap_rwsem
>>
>> But this will raise the following dependency from unmap_mapping_range():
>>     unmap_mapping_range()   -> i_mmap_rwsem
>>      unmap_mapping_range_tree()
>>       unmap_mapping_range_vma()
>>        zap_page_range_single()
>>         unmap_single_vma()
>>          unmap_page_range() -> vm_seq
>>
>> And there is no way to get rid of it easily, as in unmap_mapping_range()
>> there is no VMA identified yet.
>>
>> That being said, I can't see any clear way to get the lock dependencies
>> cleaned up here.
>> Furthermore, it is not clear to me how a deadlock could happen, as
>> vm_seq is a sequence lock and there is no way to get blocked here.
>
> as far as I understand, seq locks can deadlock, technically. not on the
> write() side, but on the read() side:
>
>     read_seqcount_begin()
>      raw_read_seqcount_begin()
>       __read_seqcount_begin()
>
> and __read_seqcount_begin() spins forever:
>
>     __read_seqcount_begin()
>     {
>     repeat:
>             ret = READ_ONCE(s->sequence);
>             if (unlikely(ret & 1)) {
>                     cpu_relax();
>                     goto repeat;
>             }
>             return ret;
>     }
>
> so if there are two CPUs, one doing write_seqcount() and the other one
> doing read_seqcount(), then what can happen is something like this:
>
>     CPU0                        CPU1
>
>                                 fs_reclaim_acquire()
>     write_seqcount_begin()
>     fs_reclaim_acquire()        read_seqcount_begin()
>     write_seqcount_end()
>
> CPU0 can't write_seqcount_end() because of fs_reclaim_acquire() from
> CPU1, and CPU1 can't read_seqcount_begin() because CPU0 did
> write_seqcount_begin() and now waits for fs_reclaim_acquire(). makes
> sense?

Yes, this makes sense.
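To make the read side of this series concrete, the speculative handler
follows roughly this pattern (a sketch only: the helper name and return
values here are illustrative, not the actual patch code):

	#include <linux/errno.h>
	#include <linux/mm.h>
	#include <linux/mm_types.h>
	#include <linux/seqlock.h>

	/*
	 * Illustrative sketch: spf_read_vma_sequence() is a made-up
	 * helper; vm_sequence is the seqcount_t this series adds to
	 * struct vm_area_struct.
	 */
	static int spf_read_vma_sequence(struct vm_area_struct *vma,
					 unsigned int *seq)
	{
		/* raw_read_seqcount() does not spin on an odd count */
		*seq = raw_read_seqcount(&vma->vm_sequence);
		if (*seq & 1)
			return -EBUSY;	/* writer in progress: bail out, no waiting */
		return 0;
	}

	/* ... the speculative page table walk runs without mmap_sem ... */

	/* before committing, check that no writer touched the VMA meanwhile */
	if (read_seqcount_retry(&vma->vm_sequence, seq))
		return VM_FAULT_RETRY;	/* fall back to the regular mmap_sem path */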
But in the case of this series there is no call to __read_seqcount_begin():
the reader (the speculative page fault handler) only checks whether
(vm_seq & 1) is set and, if it is, simply exits the speculative path
without waiting. So there is no deadlock possibility.

The bad case would be two threads concurrently calling
write_seqcount_begin() on the same VMA, leaving the sequence count in a
corrupted state, but this can't happen because mmap_sem is held for
write in such a case.

Cheers,
Laurent.