Date: Tue, 18 Jan 2022 11:39:57 +0100
From: Vlastimil Babka
To: Liam Howlett, "maple-tree@lists.infradead.org", "linux-mm@kvack.org",
    "linux-kernel@vger.kernel.org", Andrew Morton
Cc: Song Liu, Davidlohr Bueso,
McKenney" , Matthew Wilcox , Laurent Dufour , David Rientjes , Axel Rasmussen , Suren Baghdasaryan , Rik van Riel , Peter Zijlstra , Michel Lespinasse , Jerome Glisse , Minchan Kim , Joel Fernandes , Rom Lemarchand References: <20211201142918.921493-1-Liam.Howlett@oracle.com> <20211201142918.921493-27-Liam.Howlett@oracle.com> Content-Language: en-US Subject: Re: [PATCH v4 26/66] mm/mmap: Reorganize munmap to use maple states In-Reply-To: <20211201142918.921493-27-Liam.Howlett@oracle.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 12/1/21 15:30, Liam Howlett wrote: > From: "Liam R. Howlett" > > Remove __do_munmap() in favour of do_munmap(), do_mas_munmap(), and > do_mas_align_munmap(). > > do_munmap() is a wrapper to create a maple state for any callers that > have not been converted to the maple tree. > > do_mas_munmap() takes a maple state to mumap a range. This is just a > small function which checks for error conditions and aligns the end of > the range. > > do_mas_align_munmap() uses the aligned range to mumap a range. > do_mas_align_munmap() starts with the first VMA in the range, then finds > the last VMA in the range. Both start and end are split if necessary. > Then the VMAs are unlocked and removed from the linked list at the same > time. Followed by a single tree operation of overwriting the area in > with a NULL. Finally, the detached list is unmapped and freed. > > By reorganizing the munmap calls as outlined, it is now possible to > avoid extra work of aligning pre-aligned callers which are known to be > safe, avoid extra VMA lookups or tree walks for modifications. > > detach_vmas_to_be_unmapped() is no longer used, so drop this code. > > Signed-off-by: Liam R. Howlett > -/* Munmap is split into 2 main parts -- this part which finds > - * what needs doing, and the areas themselves, which do the > - * work. This now handles partial unmappings. > - * Jeremy Fitzhardinge > +/* > + * do_mas_align_munmap() - munmap the aligned region from @start to @end. > + * @mas: The maple_state, ideally set up to alter the correct tree location. > + * @vma: The starting vm_area_struct > + * @mm: The mm_struct > + * @start: The aligned start address to munmap. > + * @end: The aligned end address to munmap. > + * @uf: The userfaultfd list_head > + * @downgrade: Set to true to attempt a downwrite of the mmap_sem s/downwrite/write downgrade/? > + * > + * @mas must be locked before calling this function. If @downgrade is true, > + * check return code for potential release of the lock. How is 'mas' locked? The downgrade still calls mmap_write_downgrade(mm). It should say "mm's mmap_lock should be write locked" no? > */ > -int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len, > - struct list_head *uf, bool downgrade) > +static int > +do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma, > + struct mm_struct *mm, unsigned long start, > + unsigned long end, struct list_head *uf, bool downgrade) > { > - unsigned long end; > - struct vm_area_struct *vma, *prev, *last; > - > - if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start) > - return -EINVAL; > - > - len = PAGE_ALIGN(len); > - end = start + len; > - if (len == 0) > - return -EINVAL; > - > - /* arch_unmap() might do unmaps itself. 
>  */
> -int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
> -               struct list_head *uf, bool downgrade)
> +static int
> +do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
> +                   struct mm_struct *mm, unsigned long start,
> +                   unsigned long end, struct list_head *uf, bool downgrade)
>  {
> -       unsigned long end;
> -       struct vm_area_struct *vma, *prev, *last;
> -
> -       if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start)
> -               return -EINVAL;
> -
> -       len = PAGE_ALIGN(len);
> -       end = start + len;
> -       if (len == 0)
> -               return -EINVAL;
> -
> -       /* arch_unmap() might do unmaps itself. */
> -       arch_unmap(mm, start, end);
> -
> -       /* Find the first overlapping VMA where start < vma->vm_end */
> -       vma = find_vma_intersection(mm, start, end);
> -       if (!vma)
> -               return 0;
> -       prev = vma->vm_prev;
> +       struct vm_area_struct *prev, *last;
>        /* we have start < vma->vm_end  */
> 
>        /*
> @@ -2458,16 +2418,26 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>                if (error)
>                        return error;
>                prev = vma;
> +              vma = __vma_next(mm, prev);
> +              mas->index = start;
> +              mas_reset(mas);
> +      } else {
> +              prev = vma->vm_prev;
>        }
> 
> +      if (vma->vm_end >= end)
> +              last = vma;
> +      else
> +              last = find_vma_intersection(mm, end - 1, end);
> +
>        /* Does it split the last one? */
> -       last = find_vma(mm, end);
> -       if (last && end > last->vm_start) {
> +      if (last && end < last->vm_end) {
>                int error = __split_vma(mm, last, end, 1);
>                if (error)
>                        return error;
> +              vma = __vma_next(mm, prev);

Should be needed only if last == vma?

> +              mas_reset(mas);
>        }
> -       vma = __vma_next(mm, prev);
> 
>        if (unlikely(uf)) {
>                /*
> @@ -2480,22 +2450,47 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>                 * failure that it's not worth optimizing it for.
>                 */
>                int error = userfaultfd_unmap_prep(vma, start, end, uf);
> +
>                if (error)
>                        return error;
>        }
> 
>        /*
> -        * unlock any mlock()ed ranges before detaching vmas
> +       * unlock any mlock()ed ranges before detaching vmas, count the number
> +       * of VMAs to be dropped, and return the tail entry of the affected
> +       * area.
>         */
> -       if (mm->locked_vm)
> -               unlock_range(vma, end);
> +      mm->map_count -= unlock_range(vma, &last, end);
> +      /* Drop removed area from the tree */
> +      mas_store_gfp(mas, NULL, GFP_KERNEL);
> 
> -       /* Detach vmas from the MM linked list and remove from the mm tree*/
> -       if (!detach_vmas_to_be_unmapped(mm, vma, prev, end))
> -               downgrade = false;
> +      /* Detach vmas from the MM linked list */
> +      vma->vm_prev = NULL;
> +      if (prev)
> +              prev->vm_next = last->vm_next;
> +      else
> +              mm->mmap = last->vm_next;
> 
> -       if (downgrade)
> -               mmap_write_downgrade(mm);
> +      if (last->vm_next) {
> +              last->vm_next->vm_prev = prev;
> +              last->vm_next = NULL;
> +      } else
> +              mm->highest_vm_end = prev ? vm_end_gap(prev) : 0;
> +
> +      /*
> +       * Do not downgrade mmap_lock if we are next to VM_GROWSDOWN or
> +       * VM_GROWSUP VMA.  Such VMAs can change their size under
> +       * down_read(mmap_lock) and collide with the VMA we are about to unmap.
> +       */
> +      if (downgrade) {
> +              if (last && (last->vm_flags & VM_GROWSDOWN))
> +                      downgrade = false;
> +              else if (prev && (prev->vm_flags & VM_GROWSUP))
> +                      downgrade = false;
> +              else {
> +                      mmap_write_downgrade(mm);
> +              }

remove { } brackets?

> +      }
> 
>        unmap_region(mm, vma, prev, start, end);
> 
> @@ -2505,10 +2500,61 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>        return downgrade ? 1 : 0;
>  }
> 
> +/*
> + * do_mas_munmap() - munmap a given range.
> + * @mas: The maple state
> + * @mm: The mm_struct
> + * @start: The start address to munmap
> + * @len: The length of the range to munmap
> + * @uf: The userfaultfd list_head
> + * @downgrade: set to true if the user wants to attempt to write_downgrade the
> + *             mmap_sem
> + *
> + * This function takes a @mas that is in the correct state to remove the
> + * mapping(s).  The @len will be aligned and any arch_unmap work will be
> + * performed.
> + * @mas must be locked. @mas may be unlocked if @degraded is true.
> + *
> + * Returns: -EINVAL on failure, 1 on success and unlock, 0 otherwise.
> + */
> +int do_mas_munmap(struct ma_state *mas, struct mm_struct *mm,
> +                 unsigned long start, size_t len, struct list_head *uf,
> +                 bool downgrade)
> +{
> +      unsigned long end;
> +      struct vm_area_struct *vma;
> +
> +      if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start)
> +              return -EINVAL;
> +
> +      end = start + PAGE_ALIGN(len);
> +      if (end == start)
> +              return -EINVAL;
> +
> +      /* arch_unmap() might do unmaps itself. */
> +      arch_unmap(mm, start, end);
> +
> +      /* Find the first overlapping VMA */
> +      vma = mas_find(mas, end - 1);
> +      if (!vma)
> +              return 0;
> +
> +      mas->last = end - 1;

Why not set this before mas_find() above? Hm but that takes its own second
parameter instead of looking at mas->last. To be honest, I'm a bit confused
wrt the role of mas->last in the API. Perhaps another suggestion for the "how
to improve docs" discussion earlier. Or maybe I just missed/forgot it.
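FWIW, my mental model of the unconverted-caller path, pieced together from
the commit message and the mremap hunk below (do_munmap() itself is not
quoted here, so this is only a guess at its shape):

        int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
                      struct list_head *uf)
        {
                MA_STATE(mas, &mm->mm_mt, start, start);

                return do_mas_munmap(&mas, mm, start, len, uf, false);
        }

i.e. the maple state starts out with index == last == start, and
do_mas_munmap() then widens mas->last to end - 1 after mas_find(), which is
what made me wonder about the ordering above.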
> +      return do_mas_align_munmap(mas, vma, mm, start, end, uf, downgrade);
> +}
> +
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 002eec83e91e..b09e107cd18b 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -978,20 +978,23 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
>        /*
>         * Always allow a shrinking remap: that just unmaps
>         * the unnecessary pages..
> -        * __do_munmap does all the needed commit accounting, and
> +       * do_mas_munmap does all the needed commit accounting, and
>         * downgrades mmap_lock to read if so directed.
>         */
>        if (old_len >= new_len) {
>                int retval;
> +              MA_STATE(mas, &mm->mm_mt, addr + new_len, addr + new_len);
> 
> -               retval = __do_munmap(mm, addr+new_len, old_len - new_len,
> -                                    &uf_unmap, true);
> -               if (retval < 0 && old_len != new_len) {
> -                       ret = retval;
> -                       goto out;
> +              retval = do_mas_munmap(&mas, mm, addr + new_len,
> +                                     old_len - new_len, &uf_unmap, true);
>                /* Returning 1 indicates mmap_lock is downgraded to read. */
> -               } else if (retval == 1)
> +              if (retval == 1) {
>                        downgraded = true;
> +              } else if (retval < 0 && old_len != new_len) {
> +                      ret = retval;
> +                      goto out;
> +              }
> +
>                ret = addr;
>                goto out;
>        }
> @@ -1006,7 +1009,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
>        }
> 
>        /* old_len exactly to the end of the area..
> -        */
> +       */

Spurious edit?

>        if (old_len == vma->vm_end - addr) {
>                /* can we just expand the current mapping? */
>                if (vma_expandable(vma, new_len - old_len)) {
> @@ -1048,9 +1051,9 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
>                map_flags |= MAP_SHARED;
> 
>                new_addr = get_unmapped_area(vma->vm_file, 0, new_len,
> -                                       vma->vm_pgoff +
> -                                       ((addr - vma->vm_start) >> PAGE_SHIFT),
> -                                       map_flags);
> +                                            vma->vm_pgoff +
> +                                            ((addr - vma->vm_start) >> PAGE_SHIFT),
> +                                            map_flags);

And this?

>        if (IS_ERR_VALUE(new_addr)) {
>                ret = new_addr;
>                goto out;