Date: Mon, 17 Aug 2020 13:22:20 -0400
From: Vivek Goyal <vgoyal@redhat.com>
To: Jan Kara
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	virtio-fs@redhat.com, miklos@szeredi.hu, stefanha@redhat.com,
	dgilbert@redhat.com, Dan Williams, linux-nvdimm@lists.01.org
Subject: Re: [PATCH v2 02/20] dax: Create a range version of dax_layout_busy_page()
Message-ID: <20200817172220.GB630630@redhat.com>
References: <20200807195526.426056-1-vgoyal@redhat.com>
	<20200807195526.426056-3-vgoyal@redhat.com>
	<20200817165339.GA22500@quack2.suse.cz>
In-Reply-To: <20200817165339.GA22500@quack2.suse.cz>

On Mon, Aug 17, 2020 at 06:53:39PM +0200, Jan Kara wrote:
> On Fri 07-08-20 15:55:08, Vivek Goyal wrote:
> > The virtiofs device has a range of memory which is mapped into file
> > inodes using DAX. This memory is mapped by qemu on the host and maps
> > different sections of a real file on the host. The size of this memory
> > is limited (determined by the administrator) and, depending on the
> > filesystem size, we will soon reach a situation where all the memory
> > is in use and we need to reclaim some.
> >
> > As part of the reclaim process, we will need to make sure that there
> > are no active references to pages (taken by get_user_pages()) in the
> > memory range we are trying to reclaim. I am planning to use
> > dax_layout_busy_page() for this. But in its current form it is per
> > inode and scans through all the pages of the inode.
> >
> > We want to reclaim only a portion of memory (say a 2MB page). So we
> > want to make sure that only that 2MB range of pages has no references
> > (and don't want to unmap all the pages of the inode).
> >
> > Hence, create a range version of this function named
> > dax_layout_busy_page_range() which can be used to pass a range which
> > needs to be unmapped.
> >
> > Cc: Dan Williams
> > Cc: linux-nvdimm@lists.01.org
> > Signed-off-by: Vivek Goyal
>
> The API looks OK. Some comments WRT the implementation below.
>
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 11b16729b86f..0d51b0fbb489 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -558,27 +558,20 @@ static void *grab_mapping_entry(struct xa_state *xas,
> >  	return xa_mk_internal(VM_FAULT_FALLBACK);
> >  }
> >  
> > -/**
> > - * dax_layout_busy_page - find first pinned page in @mapping
> > - * @mapping: address space to scan for a page with ref count > 1
> > - *
> > - * DAX requires ZONE_DEVICE mapped pages. These pages are never
> > - * 'onlined' to the page allocator so they are considered idle when
> > - * page->count == 1. A filesystem uses this interface to determine if
> > - * any page in the mapping is busy, i.e. for DMA, or other
> > - * get_user_pages() usages.
> > - *
> > - * It is expected that the filesystem is holding locks to block the
> > - * establishment of new mappings in this address_space. I.e. it expects
> > - * to be able to run unmap_mapping_range() and subsequently not race
> > - * mapping_mapped() becoming true.
> > +/*
> > + * Partial pages are included. If end is LLONG_MAX, pages in the range from
> > + * start to the end of the file are included.
> >   */
>
> I think the big kerneldoc comment should stay with
> dax_layout_busy_page_range() since dax_layout_busy_page() will be just a
> trivial wrapper around it..

Hi Jan,

Thanks for the review. Will move the kerneldoc comment.
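Something along these lines, I am thinking (sketch only; the final wording
of the new parameter descriptions may differ):

/**
 * dax_layout_busy_page_range - find first pinned page in @mapping
 * @mapping: address space to scan for a page with ref count > 1
 * @start: start of the scan range, in bytes
 * @end: end of the scan range, in bytes; LLONG_MAX means scan from
 *       @start to the end of the file. Partial pages are included.
 *
 * [rest of the current dax_layout_busy_page() kerneldoc moves here
 *  unchanged]
 */
struct page *dax_layout_busy_page_range(struct address_space *mapping,
					loff_t start, loff_t end)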
> > -struct page *dax_layout_busy_page(struct address_space *mapping)
> > +struct page *dax_layout_busy_page_range(struct address_space *mapping,
> > +					loff_t start, loff_t end)
> >  {
> > -	XA_STATE(xas, &mapping->i_pages, 0);
> >  	void *entry;
> >  	unsigned int scanned = 0;
> >  	struct page *page = NULL;
> > +	pgoff_t start_idx = start >> PAGE_SHIFT;
> > +	pgoff_t end_idx = end >> PAGE_SHIFT;
> > +	XA_STATE(xas, &mapping->i_pages, start_idx);
> > +	loff_t len, lstart = round_down(start, PAGE_SIZE);
> >  
> >  	/*
> >  	 * In the 'limited' case get_user_pages() for dax is disabled.
> > @@ -589,6 +582,22 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
> >  	if (!dax_mapping(mapping) || !mapping_mapped(mapping))
> >  		return NULL;
> >  
> > +	/* If end == LLONG_MAX, include all pages from start till end of file */
> > +	if (end == LLONG_MAX) {
> > +		end_idx = ULONG_MAX;
> > +		len = 0;
> > +	} else {
> > +		/* The length is calculated from lstart and not from start.
> > +		 * This is due to the behavior of unmap_mapping_range(). If
> > +		 * start is say 4094 and end is 4096, then we want to
> > +		 * unmap two pages, idx 0 and 1. But unmap_mapping_range()
> > +		 * will unmap only the page at idx 0. If we calculate len
> > +		 * from the rounded-down start, this problem should not
> > +		 * happen.
> > +		 */
> > +		len = end - lstart + 1;
> > +	}
>
> Maybe it would be more understandable to use
> 	unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1);
> below and avoid all this rounding and special-casing.

Will do.
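IOW, something like this (untested sketch of the simplification; the final
patch may differ in details):

	pgoff_t start_idx = start >> PAGE_SHIFT;
	pgoff_t end_idx;

	/* If end == LLONG_MAX, include all pages from start till end of file */
	if (end == LLONG_MAX)
		end_idx = ULONG_MAX;
	else
		end_idx = end >> PAGE_SHIFT;
	...
	/* last argument is even_cows == 0, as with unmap_mapping_range(..., 0) */
	unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1, 0);

That drops 'len' and 'lstart' entirely: unmap_mapping_pages() works on page
indices, so the partial-page rounding falls out of the shift by PAGE_SHIFT.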
> > +
> >  	/*
> >  	 * If we race get_user_pages_fast() here either we'll see the
> >  	 * elevated page count in the iteration and wait, or
> > @@ -601,10 +610,10 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
> >  	 * guaranteed to either see new references or prevent new
> >  	 * references from being established.
> >  	 */
> > -	unmap_mapping_range(mapping, 0, 0, 0);
> > +	unmap_mapping_range(mapping, start, len, 0);
> >  
> >  	xas_lock_irq(&xas);
> > -	xas_for_each(&xas, entry, ULONG_MAX) {
> > +	xas_for_each(&xas, entry, end_idx) {
> >  		if (WARN_ON_ONCE(!xa_is_value(entry)))
> >  			continue;
> >  		if (unlikely(dax_is_locked(entry)))
> > @@ -625,6 +634,27 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
> >  	xas_unlock_irq(&xas);
> >  	return page;
> >  }
> > +EXPORT_SYMBOL_GPL(dax_layout_busy_page_range);
> > +
> > +/**
> > + * dax_layout_busy_page - find first pinned page in @mapping
> > + * @mapping: address space to scan for a page with ref count > 1
> > + *
> > + * DAX requires ZONE_DEVICE mapped pages. These pages are never
> > + * 'onlined' to the page allocator so they are considered idle when
> > + * page->count == 1. A filesystem uses this interface to determine if
> > + * any page in the mapping is busy, i.e. for DMA, or other
> > + * get_user_pages() usages.
> > + *
> > + * It is expected that the filesystem is holding locks to block the
> > + * establishment of new mappings in this address_space. I.e. it expects
> > + * to be able to run unmap_mapping_range() and subsequently not race
> > + * mapping_mapped() becoming true.
> > + */
> > +struct page *dax_layout_busy_page(struct address_space *mapping)
> > +{
> > +	return dax_layout_busy_page_range(mapping, 0, 0);
>
> Should the 'end' rather be LLONG_MAX?

My bad. I forgot to change this. A previous version of the patches had the
semantic that 'end == 0' signifies till the end of file. Yes, 'end' should
be LLONG_MAX now. Will fix it.
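IOW the wrapper becomes:

	struct page *dax_layout_busy_page(struct address_space *mapping)
	{
		return dax_layout_busy_page_range(mapping, 0, LLONG_MAX);
	}

so the existing whole-file callers keep their current behavior.

FWIW, the reclaim-side caller I have in mind would look roughly like the
existing xfs_break_dax_layouts(), just applied to a range (hypothetical
sketch, not from this series; wait_dax_page() here stands in for a
lock-dropping helper along the lines of xfs_wait_dax_page()):

	static int break_dax_layouts_range(struct inode *inode, loff_t start,
					   loff_t end, bool *retry)
	{
		struct page *page;

		page = dax_layout_busy_page_range(inode->i_mapping, start, end);
		if (!page)
			return 0;

		*retry = true;
		/* sleep until the page's refcount drops back to 1 (idle) */
		return ___wait_var_event(&page->_refcount,
				atomic_read(&page->_refcount) == 1,
				TASK_INTERRUPTIBLE, 0, 0,
				wait_dax_page(inode));
	}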
Thanks
Vivek

> 
> Otherwise the patch looks good to me.
> 
> 								Honza
> -- 
> Jan Kara
> SUSE Labs, CR