From mboxrd@z Thu Jan 1 00:00:00 1970
From: Damien Le Moal <Damien.LeMoal@wdc.com>
To: Dave Chinner
CC: "linux-fsdevel@vger.kernel.org", "linux-xfs@vger.kernel.org",
	Christoph Hellwig, Johannes Thumshirn, Hannes Reinecke
Subject: Re: [PATCH RFC] fs: New zonefs file system
Date: Tue, 16 Jul 2019 11:21:45 +0000
References: <20190712030017.14321-1-damien.lemoal@wdc.com>
	<20190715011935.GM7689@dread.disaster.area>

Dave,

On 2019/07/15 10:20, Dave Chinner wrote:
> Just a few quick things as I read through this to see how it uses
> iomap....
> 
> On Fri, Jul 12, 2019 at 12:00:17PM +0900, Damien Le Moal wrote:
>> +static int zonefs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>> +			      unsigned int flags, struct iomap *iomap)
>> +{

[...]

>> +	/* An IO cannot exceed the zone size */
>> +	if (offset >= max_isize)
>> +		return -EFBIG;
> 
> So a write() call that is for a length longer than max_isize is
> going to end up being a short write? i.e. iomap_apply() will loop
> mapping the inode until either we reach the end of the user write
> or we hit max_isize?
> 
> How is userspace supposed to tell the difference between a short
> write and a write that overruns max_isize?

offset >= max_isize is already checked when zonefs_file_write_iter() is called,
and the write size is truncated in that function too, so that a write never goes
beyond the zone size. The check here is probably redundant and not needed. I
left it in temporarily to make sure I was using iomap correctly and to catch bugs.

>> +static int zonefs_map_blocks(struct iomap_writepage_ctx *wpc,
>> +			     struct inode *inode, loff_t offset)
>> +{
>> +	if (offset >= wpc->iomap.offset &&
>> +	    offset < wpc->iomap.offset + wpc->iomap.length)
>> +		return 0;
>> +
>> +	memset(&wpc->iomap, 0, sizeof(wpc->iomap));
>> +	return zonefs_iomap_begin(inode, offset, INT_MAX, 0, &wpc->iomap);
> 
> Why is the write length set to INT_MAX here? What happens when we
> get a zone that is larger than 2GB? i.e. the length parameter is a
> loff_t, not an int....

Yes indeed. Since no access can go beyond the file size,
zonefs_file_max_size(inode) is the right value here I think, even though
LLONG_MAX would work too.

>> +static int zonefs_truncate_seqfile(struct inode *inode)
>> +{
>> +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
>> +	int ret;
>> +
>> +	/* Serialize against page faults */
>> +	down_write(&zi->i_mmap_sem);
>> +
>> +	ret = blkdev_reset_zones(inode->i_sb->s_bdev,
>> +				 zonefs_file_addr(inode) >> SECTOR_SHIFT,
>> +				 zonefs_file_max_size(inode) >> SECTOR_SHIFT,
>> +				 GFP_KERNEL);
> 
> Not sure GFP_KERNEL is safe here. This is called holding a
> filesystem lock here, so it's not immediately clear to me if this
> can deadlock through memory reclaim or not...

Conventional zones can be written with buffered I/Os and mmap, so recursing into
the FS on the memory allocation while holding i_mmap_sem is a possibility I
think. I changed this to GFP_NOFS.
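That is, something like this (untested sketch, the final patch may differ):

	/* Serialize against page faults */
	down_write(&zi->i_mmap_sem);

	/*
	 * blkdev_reset_zones() may allocate memory, so use GFP_NOFS to
	 * avoid recursing into the FS through memory reclaim while
	 * holding i_mmap_sem (and the caller's i_rwsem).
	 */
	ret = blkdev_reset_zones(inode->i_sb->s_bdev,
				 zonefs_file_addr(inode) >> SECTOR_SHIFT,
				 zonefs_file_max_size(inode) >> SECTOR_SHIFT,
				 GFP_NOFS);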

>> +static int zonefs_inode_setattr(struct dentry *dentry, struct iattr *iattr)
>> +{
>> +	struct inode *inode = d_inode(dentry);
>> +	int ret;
>> +
>> +	ret = setattr_prepare(dentry, iattr);
>> +	if (ret)
>> +		return ret;
>> +
>> +	if ((iattr->ia_valid & ATTR_UID &&
>> +	     !uid_eq(iattr->ia_uid, inode->i_uid)) ||
>> +	    (iattr->ia_valid & ATTR_GID &&
>> +	     !gid_eq(iattr->ia_gid, inode->i_gid))) {
>> +		ret = dquot_transfer(inode, iattr);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>> +	if (iattr->ia_valid & ATTR_SIZE) {
>> +		/* The size of conventional zone files cannot be changed */
>> +		if (zonefs_file_is_conv(inode))
>> +			return -EPERM;
>> +
>> +		/*
>> +		 * For sequential zone files, we can only allow truncating to
>> +		 * 0 size which is equivalent to a zone reset.
>> +		 */
>> +		if (iattr->ia_size != 0)
>> +			return -EPERM;
>> +
>> +		ret = zonefs_truncate_seqfile(inode);
>> +		if (ret)
>> +			return ret;
> 
> Ok, so we are calling zonefs_truncate_seqfile() holding the i_rwsem
> as well. That does tend to imply GFP_NOFS should probably be used
> for the blkdev_reset_zones() call.

Yes. Fixed. Thanks.

>> +/*
>> + * Open a file.
>> + */
>> +static int zonefs_file_open(struct inode *inode, struct file *file)
>> +{
>> +	/*
>> +	 * Note: here we can do an explicit open of the file zone,
>> +	 * on the first open of the inode. The explicit close can be
>> +	 * done on the last release (close) call for the inode.
>> +	 */
>> +
>> +	return generic_file_open(inode, file);
>> +}
> 
> Why is a wrapper needed for this?

It is not needed for now. As the comment says, we may want to have a wrapper
like this in the future to do a device level zone open/close in sync with file
open/release. This is not useful for SMR HDDs but probably will be for NVMe
zoned namespace devices.

>> +static int zonefs_file_fsync(struct file *file, loff_t start, loff_t end,
>> +			     int datasync)
>> +{
>> +	struct inode *inode = file_inode(file);
>> +	int ret;
>> +
>> +	/*
>> +	 * Since only direct writes are allowed in sequential files, we only
>> +	 * need a device flush for these files.
>> +	 */
>> +	if (zonefs_file_is_seq(inode))
>> +		goto flush;
>> +
>> +	ret = file_write_and_wait_range(file, start, end);
>> +	if (ret == 0)
>> +		ret = file_check_and_advance_wb_err(file);
>> +	if (ret)
>> +		return ret;
> 
>> +
>> +flush:
>> +	return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
> 
> The goto can be avoided in this case simply:
> 
>	if (zonefs_file_is_conv(inode)) {
>		/* do flush */
>	}
>	return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);

Good point. Fixed.
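The rework will look something like this (untested sketch):

	/*
	 * Only conventional zone files can have dirty page cache data
	 * (buffered writes), so only these need the writeback. All files
	 * need the device cache flush.
	 */
	if (zonefs_file_is_conv(inode)) {
		ret = file_write_and_wait_range(file, start, end);
		if (ret == 0)
			ret = file_check_and_advance_wb_err(file);
		if (ret)
			return ret;
	}

	return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);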

>> +static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>> +{
>> +	struct inode *inode = file_inode(iocb->ki_filp);
>> +	struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
>> +	loff_t max_pos = zonefs_file_max_size(inode);
>> +	size_t count;
>> +	ssize_t ret = 0;
>> +
>> +	/*
>> +	 * Check that the read operation does not go beyond the maximum
>> +	 * file size.
>> +	 */
>> +	if (iocb->ki_pos >= zonefs_file_max_size(inode))
>> +		return -EFBIG;
>> +
>> +	/*
>> +	 * For sequential zones, limit reads to written data.
>> +	 */
>> +	if (zonefs_file_is_seq(inode))
>> +		max_pos = i_size_read(inode);
>> +	if (iocb->ki_pos >= max_pos)
>> +		return 0;
> 
> Isn't this true for both types of zone at this point? i.e. at this
> point:
> 
>	max_pos = i_size_read(inode);
>	if (iocb->ki_pos >= max_pos)
>		return 0;
> 
> because i_size is either the zonefs_file_max_size() for conventional
> zones (which we've already checked) or it's the write pointer for
> a sequential zone. i.e. it's the max position for either case.

Yes, correct. I fixed it.

> 
>> +	iov_iter_truncate(to, max_pos - iocb->ki_pos);
>> +	count = iov_iter_count(to);
>> +	if (!count)
>> +		return 0;
> 
> The iov_iter should never be zero length here, because that implies
> the position was >= max_pos and that will be caught by the above
> checks...

Absolutely. Fixed.

> 
>> +	/* Direct IO reads must be aligned to device physical sector size */
>> +	if ((iocb->ki_flags & IOCB_DIRECT) &&
>> +	    ((iocb->ki_pos | count) & sbi->s_blocksize_mask))
>> +		return -EINVAL;
>> +
>> +	if (iocb->ki_flags & IOCB_NOWAIT) {
>> +		if (!inode_trylock_shared(inode))
>> +			return -EAGAIN;
>> +	} else {
>> +		inode_lock_shared(inode);
>> +	}
> 
> IIUC, write IO completion takes the inode lock to serialise file
> size updates for sequential zones. In that case, shouldn't this lock
> be taken before we do the EOF checks above?

Yes. Fixed.

>> +/*
>> + * We got a write error: get the sequential zone information from the device to
>> + * figure out where the zone write pointer is and verify the inode size against
>> + * it.
>> + */
>> +static int zonefs_write_failed(struct inode *inode, int error)
>> +{
>> +	struct super_block *sb = inode->i_sb;
>> +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
>> +	sector_t sector = zi->i_addr >> SECTOR_SHIFT;
>> +	unsigned int noio_flag;
>> +	struct blk_zone zone;
>> +	int n = 1, ret;
>> +
>> +	zonefs_warn(sb, "Updating inode zone %llu info\n", sector);
>> +
>> +	noio_flag = memalloc_noio_save();
>> +	ret = blkdev_report_zones(sb->s_bdev, sector, &zone, &n);
>> +	memalloc_noio_restore(noio_flag);
> 
> What deadlock does the memalloc_noio_save() avoid? There should be a
> comment explaining what problem memalloc_noio_save() avoids
> everywhere it is used like this. If it isn't safe to do GFP_KERNEL
> allocations here under the i_rwsem, then why would it be safe to
> do GFP_KERNEL allocations in the truncate code under the i_rwsem?

Similarly to the truncate (zone reset) code, GFP_NOFS is safer here. So I
changed the call to memalloc_nofs_save/restore and added a comment.
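That is, something like this (untested sketch; the variable name is illustrative):

	/*
	 * blkdev_report_zones() may allocate memory and we are called with
	 * the inode lock held, so prevent recursion into the FS through
	 * memory reclaim.
	 */
	nofs_flag = memalloc_nofs_save();
	ret = blkdev_report_zones(sb->s_bdev, sector, &zone, &n);
	memalloc_nofs_restore(nofs_flag);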

>> +
>> +	if (!n)
>> +		ret = -EIO;
>> +	if (ret) {
>> +		zonefs_err(sb, "Get zone %llu report failed %d\n",
>> +			   sector, ret);
>> +		return ret;
>> +	}
>> +
>> +	zi->i_wpoffset = (zone.wp - zone.start) << SECTOR_SHIFT;
>> +	if (i_size_read(inode) != zi->i_wpoffset) {
>> +		i_size_write(inode, zi->i_wpoffset);
>> +		truncate_pagecache(inode, zi->i_wpoffset);
>> +	}
> 
> This looks .... dangerous. If the write pointer was advanced, but
> the data wasn't written properly, this causes stale data exposure on
> write failure. i.e. it's not failsafe.

It is safe because the device will return zeros or the format pattern for sector
reads that are beyond the zone write pointer. No stale data will be exposed, as
long as the device can be trusted to correctly handle reads beyond the write
pointer.

> 
> I suspect that on a sequential zone write failure and the write
> pointer does not equal the offset of the write, we should consider
> the zone corrupt. Also, this is for direct IO completion for
> sequential writes, yes? So what does the page cache truncation
> achieve given that only direct IO writes are allowed to these files?

Yes, the page cache truncation can be removed.

>> +static int zonefs_dio_seqwrite_end_io(struct kiocb *iocb, ssize_t size,
>> +				      unsigned int flags)
>> +{
>> +	struct inode *inode = file_inode(iocb->ki_filp);
>> +	int ret;
>> +
>> +	inode_lock(inode);
>> +	if (size < 0)
>> +		ret = zonefs_write_failed(inode, size);
>> +	else
>> +		ret = zonefs_update_size(inode, iocb->ki_pos + size);
>> +	inode_unlock(inode);
>> +	return ret;
> 
> Shouldn't this have a check that it's being called on a sequential
> zone inode?

Yes, a check can be added, but the only caller is zonefs_file_dio_aio_write(),
after it has checked that the inode is not a conventional zone file, so I thought
it was not needed.

>> +static ssize_t zonefs_file_dio_aio_write(struct kiocb *iocb,
>> +					 struct iov_iter *from)
>> +{
>> +	struct inode *inode = file_inode(iocb->ki_filp);
>> +	struct zonefs_inode_info *zi = ZONEFS_I(inode);
>> +	size_t count;
>> +
>> +	/*
>> +	 * The size of conventional zone files is fixed to the zone size.
>> +	 * So only direct writes to sequential zones need adjusting the
>> +	 * inode size on IO completion.
>> +	 */
>> +	if (zonefs_file_is_conv(inode))
>> +		return iomap_dio_rw(iocb, from, &zonefs_iomap_ops, NULL);
>> +
>> +	/* Enforce append only sequential writes */
>> +	count = iov_iter_count(from);
>> +	if (iocb->ki_pos != zi->i_wpoffset) {
>> +		zonefs_err(inode->i_sb,
>> +			   "Unaligned write at %llu + %zu (wp %llu)\n",
>> +			   iocb->ki_pos, count, zi->i_wpoffset);
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (is_sync_kiocb(iocb)) {
>> +		/*
>> +		 * Don't use the end_io callback for synchronous iocbs,
>> +		 * as we'd deadlock on i_rwsem. Instead perform the same
>> +		 * actions manually here.
>> +		 */
>> +		count = iomap_dio_rw(iocb, from, &zonefs_iomap_ops, NULL);
>> +		if (count < 0)
>> +			return zonefs_write_failed(inode, count);
>> +		zonefs_update_size(inode, iocb->ki_pos);
>> +		return count;
> 
> Urk. This locking is nasty, and doesn't avoid the problem.....

Do you mean the stale data exposure problem? As said above, the device
guarantees that stale data will not be returned for reads after the write
pointer. Old data (data written before the last zone reset) is not readable.

> 
>> +	}
>> +
>> +	return iomap_dio_rw(iocb, from, &zonefs_iomap_ops,
>> +			    zonefs_dio_seqwrite_end_io);
> 
> ... because I think this can deadlock.
> 
> AFAIA, the rule is that IO completion callbacks cannot take
> a lock that is held across IO submission. The reason is that
> IO can complete so fast that the submission code runs the
> completion. i.e. iomap_dio_rw() can be the function that calls
> iomap_dio_complete() and runs the IO completion.
> 
> In which case, this will deadlock because we are already holding the
> i_rwsem and the end_io completion will try to take it again.

Arrg! Yes! Beginner's mistake. Will rework this.
> 
> 
> 
>> +}
>> +
>> +static ssize_t zonefs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> +{
>> +	struct inode *inode = file_inode(iocb->ki_filp);
>> +	struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
>> +	size_t count;
>> +	ssize_t ret;
>> +
>> +	/*
>> +	 * Check that the read operation does not go beyond the file
>> +	 * zone boundary.
>> +	 */
>> +	if (iocb->ki_pos >= zonefs_file_max_size(inode))
>> +		return -EFBIG;
>> +	iov_iter_truncate(from, zonefs_file_max_size(inode) - iocb->ki_pos);
>> +	count = iov_iter_count(from);
>> +
>> +	if (!count)
>> +		return 0;
>> +
>> +	/*
>> +	 * Direct IO writes are mandatory for sequential zones so that write IO
>> +	 * order is preserved. The direct writes also must be aligned to
>> +	 * device physical sector size.
>> +	 */
>> +	if (iocb->ki_flags & IOCB_DIRECT) {
>> +		if ((iocb->ki_pos | count) & sbi->s_blocksize_mask)
>> +			return -EINVAL;
>> +	} else {
>> +		if (zonefs_file_is_seq(inode))
>> +			return -EOPNOTSUPP;
> 
> zonefs_iomap_begin() returns -EIO in this case and issues a warning.
> This seems somewhat inconsistent....

Yes. The check in zonefs_iomap_begin() is probably redundant. Will remove it.

>> +	}
>> +
>> +	if (iocb->ki_flags & IOCB_NOWAIT) {
>> +		if (!inode_trylock(inode))
>> +			return -EAGAIN;
>> +	} else {
>> +		inode_lock(inode);
>> +	}
>> +
>> +	ret = generic_write_checks(iocb, from);
>> +	if (ret <= 0)
>> +		goto out;
> 
> Shouldn't this be done before the iov_iter is truncated?

Yes. Will move it.
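Roughly, the ordering would become something like this (untested sketch, the
surrounding alignment checks will move accordingly):

	/* Do the generic checks first... */
	ret = generic_write_checks(iocb, from);
	if (ret <= 0)
		goto out;

	/* ...and only then limit the write to the zone boundary */
	iov_iter_truncate(from, zonefs_file_max_size(inode) - iocb->ki_pos);
	count = iov_iter_count(from);
	if (!count) {
		ret = 0;
		goto out;
	}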

> 
>> +
>> +	if (iocb->ki_flags & IOCB_DIRECT)
>> +		ret = zonefs_file_dio_aio_write(iocb, from);
>> +	else
>> +		ret = iomap_file_buffered_write(iocb, from, &zonefs_iomap_ops);
>> +
>> +out:
>> +	inode_unlock(inode);
>> +
>> +	if (ret > 0 && (!(iocb->ki_flags & IOCB_DIRECT))) {
>> +		iocb->ki_pos += ret;
>> +		ret = generic_write_sync(iocb, ret);
>> +	}
> 
> Hmmm. The split of checks and doing stuff between direct IO and
> buffered IO seems a bit arbitrary. e.g. the "sequential zones can
> only do append writes" is in zonefs_file_dio_aio_write(), but we
> do a check that "sequential zones can only do direct IO" here.
> 
> And then we have the sync code that can only occur on buffered IO,
> which we don't have a wrapper function for but really should. And I
> suspect that the locking is going to have to change here because of
> the direct IO issues, so maybe it would be best to split this up
> similar to the way XFS has two completely separate functions for the
> two paths....

Yes, very good idea. I will clean this up to clarify all cases.

>> +static struct kmem_cache *zonefs_inode_cachep;
>> +
>> +static struct inode *zonefs_alloc_inode(struct super_block *sb)
>> +{
>> +	struct zonefs_inode_info *zi;
>> +
>> +	zi = kmem_cache_alloc(zonefs_inode_cachep, GFP_KERNEL);
>> +	if (!zi)
>> +		return NULL;
>> +
>> +	init_rwsem(&zi->i_mmap_sem);
>> +	inode_init_once(&zi->i_vnode);
>> +
>> +	return &zi->i_vnode;
>> +}
>> +
>> +static void zonefs_destroy_cb(struct rcu_head *head)
>> +{
>> +	struct inode *inode = container_of(head, struct inode, i_rcu);
>> +
>> +	kmem_cache_free(zonefs_inode_cachep, ZONEFS_I(inode));
>> +}
>> +
>> +static void zonefs_destroy_inode(struct inode *inode)
>> +{
>> +	call_rcu(&inode->i_rcu, zonefs_destroy_cb);
>> +}
> 
> If this is all the inode destructor is, then implement ->free_inode
> instead. i.e.
> 
> zonefs_free_inode(inode)
> {
>	kmem_cache_free(zonefs_inode_cachep, ZONEFS_I(inode));
> }
> 
> and the VFS takes care of the RCU freeing of the inode.

Yes, I remember now that the RCU inode freeing code was made generic in the VFS
a while back. I will clean this up.

> 
>> +/*
>> + * File system stat.
>> + */
>> +static int zonefs_statfs(struct dentry *dentry, struct kstatfs *buf)
>> +{
>> +	struct super_block *sb = dentry->d_sb;
>> +	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
>> +	sector_t nr_sectors = sb->s_bdev->bd_part->nr_sects;
>> +	enum zonefs_ztype t;
>> +
>> +	buf->f_type = ZONEFS_MAGIC;
>> +	buf->f_bsize = dentry->d_sb->s_blocksize;
>> +	buf->f_namelen = ZONEFS_NAME_MAX;
>> +
>> +	buf->f_blocks = nr_sectors >> (sb->s_blocksize_bits - SECTOR_SHIFT);
>> +	buf->f_bfree = 0;
>> +	buf->f_bavail = 0;
>> +
>> +	buf->f_files = sbi->s_nr_zones[ZONEFS_ZTYPE_ALL] - 1;
>> +	for (t = ZONEFS_ZTYPE_ALL; t < ZONEFS_ZTYPE_MAX; t++) {
>> +		if (sbi->s_nr_zones[t])
>> +			buf->f_files++;
>> +	}
>> +	buf->f_ffree = 0;
>> +
>> +	/* buf->f_fsid = 0; uuid, see ext2 */
> 
> This doesn't tell me anything useful. Does it mean "we should use
> the uuid like ext2" or something else? is it a "TODO:" item?

Basically yes. Forgot to code it after too much copy-paste :)
Will fix that.

>> +static char *zgroups_name[ZONEFS_ZTYPE_MAX] = {
>> +	NULL,
>> +	"cnv",
>> +	"seq"
>> +};
> 
> What's the reason for a NULL in the first entry?

This is for ZONEFS_ZTYPE_ALL, which does not have a directory. The NULL is only
there because the array is sized using ZONEFS_ZTYPE_MAX. This can be simplified.
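E.g. once ZONEFS_ZTYPE_ALL goes away (see below), something like this (sketch):

static const char *zgroups_name[] = {
	"cnv",
	"seq"
};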

> 
>> +
>> +/*
>> + * Create a zone group and populate it with zone files.
>> + */
>> +static int zonefs_create_zgroup(struct super_block *sb, struct blk_zone *zones,
>> +				enum zonefs_ztype type)
>> +{
>> +	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
>> +	struct blk_zone *zone, *next, *end;
>> +	char name[ZONEFS_NAME_MAX];
>> +	unsigned int nr_files = 0;
>> +	struct dentry *dir;
>> +
>> +	/* If the group is empty, nothing to do */
>> +	if (!sbi->s_nr_zones[type])
>> +		return 0;
>> +
>> +	dir = zonefs_create_inode(sb->s_root, zgroups_name[type], NULL);
>> +	if (!dir)
>> +		return -ENOMEM;
>> +
>> +	/*
>> +	 * Note: The first zone contains the super block: skip it.
>> +	 */
>> +	end = zones + sbi->s_nr_zones[ZONEFS_ZTYPE_ALL];
>> +	for (zone = &zones[1]; zone < end; zone = next) {
>> +
>> +		next = zone + 1;
>> +		if (zonefs_zone_type(zone) != type)
>> +			continue;
>> +
>> +		/* Ignore offline zones */
>> +		if (zonefs_zone_offline(zone))
>> +			continue;
>> +
>> +		/*
>> +		 * For conventional zones, contiguous zones can be aggregated
>> +		 * together to form larger files.
>> +		 * Note that this overwrites the length of the first zone of
>> +		 * the set of contiguous zones aggregated together.
>> +		 * Only zones with the same condition can be aggregated so that
>> +		 * offline zones are excluded and readonly zones are aggregated
>> +		 * together into a read only file.
>> +		 */
>> +		if (type == ZONEFS_ZTYPE_CNV &&
>> +		    zonefs_has_feature(sbi, ZONEFS_F_AGRCNV)) {
>> +			for (; next < end; next++) {
>> +				if (zonefs_zone_type(next) != type ||
>> +				    next->cond != zone->cond)
>> +					break;
>> +				zone->len += next->len;
>> +			}
>> +		}
>> +
>> +		if (zonefs_has_feature(sbi, ZONEFS_F_STARTSECT_NAME))
>> +			/* Use zone start sector as file names */
>> +			snprintf(name, ZONEFS_NAME_MAX - 1, "%llu",
>> +				 zone->start);
>> +		else
>> +			/* Use file number as file names */
>> +			snprintf(name, ZONEFS_NAME_MAX - 1, "%u", nr_files);
>> +		nr_files++;
>> +
>> +		if (!zonefs_create_inode(dir, name, zone))
>> +			return -ENOMEM;
> 
> I guess this means partial setup due to failure needs to be torn
> down by the kill_super() code?

Yes. And the call to d_genocide(sb->s_root) does that, unless I missed something.

>> +static struct blk_zone *zonefs_get_zone_info(struct super_block *sb)
>> +{
>> +	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
>> +	struct block_device *bdev = sb->s_bdev;
>> +	sector_t nr_sectors = bdev->bd_part->nr_sects;
>> +	unsigned int i, n, nr_zones = 0;
>> +	struct blk_zone *zones, *zone;
>> +	sector_t sector = 0;
>> +	int ret;
>> +
>> +	sbi->s_blocksize_mask = sb->s_blocksize - 1;
>> +	sbi->s_nr_zones[ZONEFS_ZTYPE_ALL] = blkdev_nr_zones(bdev);
>> +	zones = kvcalloc(sbi->s_nr_zones[ZONEFS_ZTYPE_ALL],
>> +			 sizeof(struct blk_zone), GFP_KERNEL);
>> +	if (!zones)
>> +		return ERR_PTR(-ENOMEM);
> 
> Hmmm. That's a big allocation. That might be several megabytes for a
> typical 16TB SMR drive, right? It might be worth adding a comment
> indicating just how large this is, because it's somewhat unusual in
> kernel space, even for temporary storage.

Yes. A full zone report (64B per zone) takes 4-5MB on a 15+ TB disk. I will add
a comment about this. Doing the big allocation to have all zones on hand before
creating the inodes makes the code easier, but this can be optimized with a
partial report done in a loop too. I will rewrite this code to avoid the big
temporary allocation.
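Roughly, the loop would look like this (untested sketch; the batch size macro is
illustrative):

#define ZONEFS_REPORT_NR_ZONES	128	/* illustrative batch size */

	struct blk_zone *zones;
	sector_t sector = 0;
	unsigned int i, n;
	int ret;

	/* Report zones in small batches instead of all at once */
	zones = kcalloc(ZONEFS_REPORT_NR_ZONES, sizeof(struct blk_zone),
			GFP_KERNEL);
	if (!zones)
		return ERR_PTR(-ENOMEM);

	while (sector < nr_sectors) {
		n = ZONEFS_REPORT_NR_ZONES;
		ret = blkdev_report_zones(bdev, sector, zones, &n);
		if (ret) {
			kfree(zones);
			return ERR_PTR(ret);
		}
		if (!n)
			break;
		for (i = 0; i < n; i++) {
			/* ... count this zone and/or create its inode ... */
			sector = zones[i].start + zones[i].len;
		}
	}
	kfree(zones);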

>> + * defined in linux/blkzoned.h, that is, BLK_ZONE_TYPE_SEQWRITE_REQ and
>> + * BLK_ZONE_TYPE_SEQWRITE_PREF.
>> + */
>> +enum zonefs_ztype {
>> +	ZONEFS_ZTYPE_ALL = 0,
>> +	ZONEFS_ZTYPE_CNV,
>> +	ZONEFS_ZTYPE_SEQ,
>> +	ZONEFS_ZTYPE_MAX,
>> +};
> 
> What is ZONEFS_ZTYPE_ALL supposed to be used for?

It is defined only to allow the declaration of sbi->s_nr_zones as an array of
ZONEFS_ZTYPE_MAX unsigned ints, so that we get 3 counters: the total number of
zones is always larger than or equal to conv zones + seq zones. Initially, this
made the code somewhat cleaner, but I realize now it is not really easier to
understand :)

I will remove it and make things more obvious.

>> +static inline bool zonefs_zone_offline(struct blk_zone *zone)
>> +{
>> +	return zone->cond == BLK_ZONE_COND_OFFLINE;
>> +}
>> +
>> +static inline bool zonefs_zone_readonly(struct blk_zone *zone)
>> +{
>> +	return zone->cond == BLK_ZONE_COND_READONLY;
>> +}
> 
> These should be block layer helpers as they operate on blk_zone,
> not zonefs structures.

Yes. Will fix that.

>> +/*
>> + * Address (byte offset) on disk of a file zone.
>> + */
>> +static inline loff_t zonefs_file_addr(struct inode *inode)
>> +{
>> +	return ZONEFS_I(inode)->i_addr;
>> +}
> 
> so it's a disk address, but it's encoded in bytes rather than sectors
> so that makes it an offset. That's kinda confusing coming from a
> filesystem that makes a clear distinction between these two things.

Yes. This simplifies the code a little, but it is not obvious. I will rework this.

>> +
>> +/*
>> + * Maximum possible size of a file (i.e. the zone size).
>> + */
>> +static inline loff_t zonefs_file_max_size(struct inode *inode)
>> +{
>> +	return ZONEFS_I(inode)->i_max_size;
>> +}
>> +
>> +/*
>> + * On-disk super block (block 0).
>> + */
>> +struct zonefs_super {
>> +
>> +	/* Magic number */
>> +	__le32 s_magic;			/*   4 */
>> +
>> +	/* Metadata version number */
>> +	__le32 s_version;		/*   8 */
>> +
>> +	/* Features */
>> +	__le64 s_features;		/*  16 */
> 
> On-disk version numbers are kinda redundant when you have
> fine grained feature fields to indicate the on-disk layout...

Yes, I guess. I was kind of reserving the s_features field for optional things,
hence the version number for on-disk mandatory things. But I may as well use it
for everything, both mandatory format features and options.
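That is, define everything as bits of the s_features field, e.g. (sketch only,
the exact encoding is to be decided):

/* Feature flags: bits of the on-disk super block s_features field */
#define ZONEFS_F_STARTSECT_NAME	(1ULL << 0)	/* zone start sector as file name */
#define ZONEFS_F_AGRCNV		(1ULL << 1)	/* aggregate contiguous conv zones */
#define ZONEFS_F_UID		(1ULL << 2)	/* use super block specified UID */
#define ZONEFS_F_GID		(1ULL << 3)	/* use super block specified GID */
#define ZONEFS_F_PERM		(1ULL << 4)	/* use super block specified perms */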
But I may as well us= e it=0A= for everything, both mandatory format features and options.=0A= =0A= >> +/*=0A= >> + * Feature flags.=0A= >> + */=0A= >> +enum zonefs_features {=0A= >> + /*=0A= >> + * Use a zone start sector value as file name.=0A= >> + */=0A= >> + ZONEFS_F_STARTSECT_NAME,=0A= >> + /*=0A= >> + * Aggregate contiguous conventional zones into a single file.=0A= >> + */=0A= >> + ZONEFS_F_AGRCNV,=0A= >> + /*=0A= >> + * Use super block specified UID for files instead of default.=0A= >> + */=0A= >> + ZONEFS_F_UID,=0A= >> + /*=0A= >> + * Use super block specified GID for files instead of default.=0A= >> + */=0A= >> + ZONEFS_F_GID,=0A= >> + /*=0A= >> + * Use super block specified file permissions instead of default 640.= =0A= >> + */=0A= >> + ZONEFS_F_PERM,=0A= >> +};=0A= > =0A= > Are these the on-disk feature bit definitions, or just used in=0A= > memory? Or both?=0A= =0A= These are used for both on-disk super block and in-memory sb info.=0A= =0A= Thank you very much for the detailed review. I will rework this and repost.= =0A= =0A= Best regards.=0A= =0A= -- =0A= Damien Le Moal=0A= Western Digital Research=0A=