Date: Wed, 12 Apr 2023 12:41:51 +0200
From: Jan Kara
To: "Teterevkov, Ivan"
Cc: Alistair Popple, "linux-mm@kvack.org", "jhubbard@nvidia.com", "jack@suse.cz", "rppt@linux.ibm.com", "jglisse@redhat.com", "ira.weiny@intel.com", "linux-kernel@vger.kernel.org"
Subject: Re: find_get_page() VS pin_user_pages()
Message-ID: <20230412104151.hkl5navnaoc7l7ob@quack3>
References: <87mt3ehti4.fsf@nvidia.com>

On Wed 12-04-23 09:04:33, Teterevkov, Ivan wrote:
> From: Alistair Popple
>
> > "Teterevkov, Ivan" writes:
> >
> > > Hello folks,
> > >
> > > I work with an application which aims to share memory in the userspace and
> > > interact with the NIC DMA.
> > > The memory allocation workflow begins in the
> > > userspace, which creates a new file backed by 2MiB hugepages with
> > > memfd_create(MFD_HUGETLB, MFD_HUGE_2MB) and fallocate(). Then the userspace
> > > makes an IOCTL to the kernel module with the file descriptor and size so that
> > > the kernel module can get the struct page with find_get_page(). Then the kernel
> > > module calls dma_map_single(page_address(page)) for NIC, which concludes the
> > > datapath. The allocated memory may (significantly) outlive the originating
> > > userspace application. The hugepages stay mapped with NIC, and the kernel
> > > module wants to continue using them and map to other applications that come and
> > > go with vm_mmap().
> > >
> > > I am studying the pin_user_pages*() family of functions, and I wonder if the
> > > outlined workflow requires it. The hugepages do not page out, but they can move
> > > as they may be allocated with GFP_HIGHUSER_MOVABLE. However, find_get_page()
> > > must increment the page reference counter without mapping and prevent it from
> > > moving. In particular, https://docs.kernel.org/mm/page_migration.html:
> >
> > I'm not super familiar with the memfd_create()/find_get_page() workflow
> > but is there some reason you're not using pin_user_pages*(FOLL_LONGTERM)
> > to get the struct page initially? Your description above sounds
> > exactly the use case pin_user_pages() was designed for because it marks
> > the page as being written to by DMA, makes sure it's not in a movable
> > zone, etc.
> >
>
> The biggest obstacle with the application workflow is that the memory
> allocation is mostly kernel-driven. The kernel module may want to tell DMA
> about the hugepages before the userspace application maps it into its address
> space, so the kernel module does not have the starting user address at hand.

I'm a bit confused.
Above you write that:

"The memory allocation workflow begins in the userspace, which creates a new
file backed by 2MiB hugepages with memfd_create(MFD_HUGETLB, MFD_HUGE_2MB)
and fallocate(). Then the userspace makes an IOCTL to the kernel module with
the file descriptor and size so that the kernel module can get the struct
page with find_get_page()."

So the memory allocation actually does happen from fallocate(2) as far as I
can tell. What the guys are suggesting is that instead of passing the
prepared 'fd' to ioctl(2), your application should mmap the file and pass
the address of the mmapped area. That's how things are usually done, and it
also gives userspace more freedom over how it prepares buffers for DMA.
pin_user_pages() then comes as the natural API to use in the driver.

Now I'm not sure whether changing the ioctl(2) is still an option for you.
If not, then you have to resort to some kind of workaround as you mentioned.
But still, pin_user_pages(FOLL_LONGTERM) is definitely the API you should be
using for telling the kernel you are going to DMA into these pages and want
to hold onto them for a long time.

								Honza
--
Jan Kara
SUSE Labs, CR
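
For illustration, the userspace side of the suggestion above (mmap the
file, then pass the address rather than the fd) might look roughly like
the sketch below. It is only a sketch under assumptions: struct
dma_region_req and prepare_dma_region() are hypothetical names, not part
of any existing driver interface, and the ioctl(2) request itself is
driver-specific and therefore not shown.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallbacks for older libc headers; values match the kernel UAPI. */
#ifndef MFD_HUGETLB
#define MFD_HUGETLB 0x0004U
#endif
#ifndef MFD_HUGE_2MB
#define MFD_HUGE_2MB (21U << 26)	/* log2(2 MiB) << MFD_HUGE_SHIFT */
#endif

#define REGION_ALIGN (2UL << 20)	/* 2 MiB */

/* Hypothetical ioctl argument: a user address and length instead of an fd. */
struct dma_region_req {
	uint64_t addr;	/* start of the mmapped area */
	uint64_t len;	/* length in bytes */
};

/*
 * Create a memfd with the given flags, allocate 'size' bytes with
 * fallocate(), mmap it and describe the mapping in 'req'.
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 */
static int prepare_dma_region(struct dma_region_req *req, size_t size,
			      unsigned int memfd_flags)
{
	void *p;
	int fd;

	assert(size != 0 && size % REGION_ALIGN == 0);

	fd = memfd_create("dma-region", memfd_flags);
	if (fd < 0)
		return -1;

	if (fallocate(fd, 0, 0, (off_t)size) < 0) {
		close(fd);
		return -1;
	}

	p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping keeps the file alive */
	if (p == MAP_FAILED)
		return -1;

	req->addr = (uint64_t)(uintptr_t)p;
	req->len = size;
	return 0;
}
```

A caller wanting the hugepage-backed file from the original description
would pass MFD_HUGETLB | MFD_HUGE_2MB as memfd_flags (this needs 2MiB
hugepages reserved on the system) and then hand the filled-in request to
the driver's ioctl(2). With a user address and length at hand, the driver
can use pin_user_pages(FOLL_LONGTERM) on that range.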