Date: Thu, 23 Mar 2023 12:30:30 +0100
From: Jan Kara
To: Ritesh Harjani
Cc: Jan Kara, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	lsf-pc@lists.linux-foundation.org
Subject: Re: [RFCv1][WIP] ext2: Move direct-io to use iomap
Message-ID: <20230323113030.ryne2oq27b6cx6xz@quack3>
References: <87ttz889ns.fsf@doe.com> <20230320175139.l5oqbwuae4schgcu@quack3> <87zg85pa5i.fsf@doe.com>
In-Reply-To: <87zg85pa5i.fsf@doe.com>

On Wed 22-03-23 12:04:01, Ritesh Harjani wrote:
> Jan Kara writes:
> >> +	pos += size;
> >> +	if (pos > i_size_read(inode))
> >> +		i_size_write(inode, pos);
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +static const struct iomap_dio_ops ext2_dio_write_ops = {
> >> +	.end_io = ext2_dio_write_end_io,
> >> +};
> >> +
> >> +static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >> +{
> >> +	struct file *file = iocb->ki_filp;
> >> +	struct inode *inode = file->f_mapping->host;
> >> +	ssize_t ret;
> >> +	unsigned int flags;
> >> +	unsigned long blocksize = inode->i_sb->s_blocksize;
> >> +	loff_t offset = iocb->ki_pos;
> >> +	loff_t count = iov_iter_count(from);
> >> +
> >> +
> >> +	inode_lock(inode);
> >> +	ret = generic_write_checks(iocb, from);
> >> +	if (ret <= 0)
> >> +		goto out_unlock;
> >> +	ret = file_remove_privs(file);
> >> +	if (ret)
> >> +		goto out_unlock;
> >> +	ret = file_update_time(file);
> >> +	if (ret)
> >> +		goto out_unlock;
> >> +
> >> +	/*
> >> +	 * We pass IOMAP_DIO_NOSYNC because otherwise iomap_dio_rw()
> >> +	 * calls for generic_write_sync in iomap_dio_complete().
> >> +	 * Since ext2_fsync must be called w/o inode lock,
> >> +	 * hence we pass IOMAP_DIO_NOSYNC and handle generic_write_sync()
> >> +	 * ourselves.
> >> +	 */
> >> +	flags = IOMAP_DIO_NOSYNC;
> >
> > Meh, this is kind of ugly and we should come up with something better for
> > simple filesystems so that they don't have to play these games. Frankly,
> > these days I doubt there's anybody really needing inode_lock in
> > __generic_file_fsync(). Neither sync_mapping_buffers() nor
> > sync_inode_metadata() need inode_lock for their self-consistency. So it is
> > only about flushing more consistent set of metadata to disk when fsync(2)
> > races with other write(2)s to the same file so after a crash we have higher
> > chances of seeing some real state of the file. But I'm not sure it's really
> > worth keeping for filesystems that are still using sync_mapping_buffers().
> > People that care about consistency after a crash have IMHO moved to other
> > filesystems long ago.
> >
>
> One way which hch is suggesting is to use __iomap_dio_rw() -> unlock
> inode -> call generic_write_sync(). I haven't yet worked on this part.

So I see two problems with what Christoph suggests:

a) It is unfortunate API design to require trivial (and low maintenance)
filesystems to do these relatively complex locking games.
But this can be solved by providing an appropriate wrapper for them, I guess.

b) When you unlock the inode, other stuff can happen with the inode. And
e.g. i_size update needs to happen after IO is completed so filesystems
would have to be taught to avoid say two racing expanding writes. That's
IMHO really too much to ask.

> Are you suggesting to rip out inode_lock from __generic_file_fsync()?
> Won't it have a much larger implications?

Yes and yes :). But inode writeback already happens from other paths
without inode_lock so there's hardly any surprise there.
sync_mapping_buffers() is impossible to "customize" by filesystems and the
generic code is fine without inode_lock. So I have a hard time imagining how
any filesystem would really depend on inode_lock in this path (famous last
words ;)).

> >> +	if (iocb->ki_pos + iov_iter_count(from) > i_size_read(inode) ||
> >> +	   (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(from), blocksize)))
> >> +		flags |= IOMAP_DIO_FORCE_WAIT;
> >> +
> >> +	ret = iomap_dio_rw(iocb, from, &ext2_iomap_ops, &ext2_dio_write_ops,
> >> +			   flags, NULL, 0);
> >> +
> >> +	if (ret == -ENOTBLK)
> >> +		ret = 0;
> >
> > So iomap_dio_rw() doesn't have the DIO_SKIP_HOLES behavior of
> > blockdev_direct_IO(). Thus you have to implement that in your
> > ext2_iomap_ops, in particular in iomap_begin...
> >
>
> Aah yes. Thanks for pointing that out -
> ext2_iomap_begin() should have something like this -
> 	/*
> 	 * We cannot fill holes in indirect tree based inodes as that could
> 	 * expose stale data in the case of a crash. Use the magic error code
> 	 * to fallback to buffered I/O.
> 	 */
>
> Also I think ext2_iomap_end() should also handle a case like in ext4 -
>
> 	/*
> 	 * Check to see whether an error occurred while writing out the data to
> 	 * the allocated blocks. If so, return the magic error code so that we
> 	 * fallback to buffered I/O and attempt to complete the remainder of
> 	 * the I/O. Any blocks that may have been allocated in preparation for
> 	 * the direct I/O will be reused during buffered I/O.
> 	 */
> 	if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
> 		return -ENOTBLK;
>
>
> I am wondering if we have testcases in xfstests which really tests these
> functionalities also or not? Let me give it a try...
> ... So I did and somehow couldn't find any testcase which fails w/o
> above changes.

I guess we don't. It isn't that simple (but certainly possible) to test for
stale data exposure...

> Another query -
>
> We have this function ext2_iomap_end() (pasted below)
> which calls ext2_write_failed().
>
> Here IMO two cases are possible -
>
> 1. written is 0. which means an error has occurred.
> In that case calling ext2_write_failed() make sense.
>
> 2. consider a case where written > 0 && written < length.
> (This is possible right?). In that case we still go and call
> ext2_write_failed(). This function will truncate the pagecache and disk
> blocks beyond i_size. Now we haven't yet updated inode->i_size (we do
> that in ->end_io which gets called in the end during completion)
> So that means it just removes everything.
>
> Then in ext2_dax_write_iter(), we might go and update inode->i_size
> to iocb->ki_pos including for short writes. This looks like it isn't
> consistent because earlier we had destroyed all the blocks for the short
> writes and we will be returning ret > 0 to the user saying these many
> bytes have been written.
> Again I haven't yet found a test case at least not in xfstests which
> can trigger this short writes. Let me know your thoughts on this.
> All of this lies on the fact that there can be a case where
> written > 0 && written < length. I will read more to see if this even
> happens or not. But I at least wanted to capture this somewhere.

So as far as I remember, direct IO writes as implemented in iomap are
all-or-nothing (see iomap_dio_complete()).
But it would be good to assert that in ext4 code to avoid surprises if the
generic code changes.

> Another thing -
> In dax while truncating the inode i_size in ext2_setsize(),
> I think we don't properly call dax_zero_blocks() when we are trying to
> zero the last block beyond EOF. i.e. for e.g. it can be called with len
> as 0 if newsize is page_aligned. It then will call ext2_get_blocks() with
> len = 0 and can bug_on at maxblocks == 0.

How will it call ext2_get_blocks() with len == 0? AFAICS iomap_iter() will
not call iomap_begin() at all if iter.len == 0.

> I think it should be this. I will spend some more time analyzing this
> and also test it once against DAX paths.
>
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index 7ff669d0b6d2..cc264b1e288c 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -1243,9 +1243,8 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
>  	inode_dio_wait(inode);
>
>  	if (IS_DAX(inode))
> -		error = dax_zero_range(inode, newsize,
> -				       PAGE_ALIGN(newsize) - newsize, NULL,
> -				       &ext2_iomap_ops);
> +		error = dax_truncate_page(inode, newsize, NULL,
> +					  &ext2_iomap_ops);
>  	else
>  		error = block_truncate_page(inode->i_mapping,
>  				newsize, ext2_get_block);

That being said this is indeed a nice cleanup.

								Honza
-- 
Jan Kara
SUSE Labs, CR