From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2731DC433FE for ; Wed, 12 Oct 2022 17:00:21 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229491AbiJLRAT (ORCPT ); Wed, 12 Oct 2022 13:00:19 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40650 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230030AbiJLRAE (ORCPT ); Wed, 12 Oct 2022 13:00:04 -0400 Received: from ale.deltatee.com (ale.deltatee.com [204.191.154.188]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 51B50D9E for ; Wed, 12 Oct 2022 09:59:51 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=deltatee.com; s=20200525; h=Subject:In-Reply-To:From:References:Cc:To: MIME-Version:Date:Message-ID:content-disposition; bh=5Dr2oEmE6UnZFK8La87ojFbkgXphKoO6+tb9L1ERD0A=; b=In8ebRrGUhArT9ZOouKg9cWKyP YMWGVr3BQOxQaik09nbOzFhnEKNaIofaI+4AJB/aASMeZwNQrFhLGqd10Tp1peTO9KaTwob3XPbkB yLf7lixn5zSDFY5YJk2umhU9WrG8oO2sE99Ki5Lmbzq/rj3qeuydQk5p44h4jK6/Gp0xyVjjYVG8u YfZp/ln8n5g+eKqEv5gXtFoZhi/rnhMJpS37YiJyNnlWkzuTd1y3z3E4sgZyhg93eucCWH2u5r71v juUnY5T482+n2fuZVYkAdbORfKmTD49RH9ieErebpsxxKmhE8kbmlMXhONEjV58rQwclMvWS71EgJ 7ysy52hA==; Received: from s0106ac1f6bb1ecac.cg.shawcable.net ([70.73.163.230] helo=[192.168.11.155]) by ale.deltatee.com with esmtpsa (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1oif5A-006czx-D2; Wed, 12 Oct 2022 10:59:50 -0600 Message-ID: <8ee5368c-1808-d2bc-9ad2-2f8332d2704e@deltatee.com> Date: Wed, 12 Oct 2022 10:59:45 -0600 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.13.0 Content-Language: en-US To: Xiao Ni Cc: linux-raid@vger.kernel.org, Jes Sorensen , Guoqing Jiang , Mariusz Tkaczyk , Coly Li , Chaitanya Kulkarni , Jonmichael Hands , Stephen Bates , Martin Oliveira , David Sloan , "Martin K. Petersen" References: <20221007201037.20263-1-logang@deltatee.com> From: Logan Gunthorpe In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-SA-Exim-Connect-IP: 70.73.163.230 X-SA-Exim-Rcpt-To: xni@redhat.com, linux-raid@vger.kernel.org, jes@trained-monkey.org, guoqing.jiang@linux.dev, mariusz.tkaczyk@linux.intel.com, colyli@suse.de, chaitanyak@nvidia.com, jm@chia.net, sbates@raithlin.com, Martin.Oliveira@eideticom.com, David.Sloan@eideticom.com, martin.petersen@oracle.com X-SA-Exim-Mail-From: logang@deltatee.com Subject: Re: [PATCH mdadm v4 0/7] Write Zeroes option for Creating Arrays X-SA-Exim-Version: 4.2.1 (built Sat, 13 Feb 2021 17:57:42 +0000) X-SA-Exim-Scanned: Yes (on ale.deltatee.com) Precedence: bulk List-ID: X-Mailing-List: linux-raid@vger.kernel.org @ccing Martin hoping he has an opinion on the write zeroes interface On 2022-10-11 19:09, Xiao Ni wrote: > Hi Logan > > I did a test with the patchset. There is a problem like this: > > mdadm -CR /dev/md0 -l5 -n3 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme0n1 --write-zero > mdadm: zeroing data from 135266304 to 960061505536 on: /dev/nvme1n1 > mdadm: zeroing data from 135266304 to 960061505536 on: /dev/nvme2n1 > mdadm: zeroing data from 135266304 to 960061505536 on: /dev/nvme0n1 > > I ran ctrl+c when waiting, then the raid can't be created anymore. Because the > processes that write zero to nvmes are stuck. > > ps auxf | grep mdadm > root 68764 0.0 0.0 9216 1104 pts/0 S+ 21:09 0:00 > \_ grep --color=auto mdadm > root 68633 0.1 0.0 27808 336 pts/0 D 21:04 0:00 > mdadm -CR /dev/md0 -l5 -n3 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme0n1 > --write-zero > root 68634 0.2 0.0 27808 336 pts/0 D 21:04 0:00 > mdadm -CR /dev/md0 -l5 -n3 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme0n1 > --write-zero > root 68635 0.0 0.0 27808 336 pts/0 D 21:04 0:00 > mdadm -CR /dev/md0 -l5 -n3 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme0n1 > --write-zero Yes, this is because the fallocate() call that the child processes use to write_zeros will submit a large number of bios in the kernel and then wait with submit_bio_wait() which is non-interruptible. So when the child processes get the SIGINT, they will not stop until after the fallocate() call completes which will pretty much be after the entire disk is zeroed. So if you are zeroing a very large disk, those processes will be stuck around for several minutes after the parent process terminates; though they do go away eventually. There aren't many great solutions for this: 1) We could install as signal handler in the parent so it sticks around until the zeroing is complete. This would mean mdadm will not be able to be terminated while the zeroing is occurring and the user has to wait. 2) We could split up the fallocate call into multiple calls to zero the entire disk. This would allow a quicker ctrl-c to occur, however it's not clear what the best size would be to split it into. Even zeroing 1GB can take a few seconds, but the smaller we go, the less efficient it will be if the block layer and devices ever get write-zeroes optimized in the same way discard has been optimized (with NVMe, discard only requires a single command to handle the entire disk where as write-zeroes requires a minimum of one command per 2MB of data to zero). I was hoping write-zeroes could be made faster in the future, at least for NVMe. Thoughts? Logan