From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: linux-nfs-owner@vger.kernel.org
Received: from ipmail06.adl2.internode.on.net ([150.101.137.129]:9569 "EHLO
	ipmail06.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751919AbbAGVLp (ORCPT ); Wed, 7 Jan 2015 16:11:45 -0500
Date: Thu, 8 Jan 2015 08:11:40 +1100
From: Dave Chinner <david@fromorbit.com>
To: Christoph Hellwig <hch@lst.de>
Cc: "J. Bruce Fields", Jeff Layton, linux-nfs@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com
Subject: Re: [PATCH 17/18] xfs: implement pnfs export operations
Message-ID: <20150107211140.GC25000@dastard>
References: <1420561721-9150-1-git-send-email-hch@lst.de>
	<1420561721-9150-18-git-send-email-hch@lst.de>
	<20150107002434.GG31508@dastard> <20150107104010.GD28783@lst.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20150107104010.GD28783@lst.de>
Sender: linux-nfs-owner@vger.kernel.org
List-ID:

On Wed, Jan 07, 2015 at 11:40:10AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 07, 2015 at 11:24:34AM +1100, Dave Chinner wrote:
> > > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > > index fdc6422..2b86be8 100644
> > > --- a/fs/xfs/xfs_fsops.c
> > > +++ b/fs/xfs/xfs_fsops.c
> > > @@ -601,6 +601,8 @@ xfs_growfs_data(
> > >  	if (!mutex_trylock(&mp->m_growlock))
> > >  		return -EWOULDBLOCK;
> > >  	error = xfs_growfs_data_private(mp, in);
> > > +	if (!error)
> > > +		mp->m_generation++;
> > >  	mutex_unlock(&mp->m_growlock);
> > >  	return error;
> > >  }
> >
> > I couldn't find an explanation of what this generation number is
> > for. What are its semantics w.r.t. server crashes?
>
> The generation is incremented when we grow the filesystem, so that
> a new layout (block mapping) returned to the client refers to the
> new NFS device ID, which will make the client aware of the new size.
>
> The device IDs aren't persistent, so after a server crash / reboot
> we'll start at zero again.

So what happens if a grow occurs, then the server crashes, and the
client on reboot sees the same generation as before the grow
occurred?

Perhaps it would be better to just initialise the generation with a
random number?
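i.e. seed it once at mount time rather than always starting from zero.
A completely untested sketch of what I mean - the helper name and the
call site are just for illustration, the prandom_u32() seeding is the
only point:

	#include <linux/random.h>

	/*
	 * Sketch only: seed the in-core pNFS device ID generation with a
	 * random value at mount time, so a freshly rebooted server is
	 * unlikely to hand out the same generation it used before a
	 * crash.  This doesn't need to be cryptographically strong, just
	 * unlikely to repeat across reboots.
	 */
	static inline void
	xfs_pnfs_init_generation(
		struct xfs_mount	*mp)
	{
		mp->m_generation = prandom_u32();
	}

That keeps the generation non-persistent (as it is now), but makes a
stale-generation match after a crash/reboot much less likely.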
> I'll add comments explaining this to the code.
>
> > Why does this function get passed an offset it does not actually use?
>
> Historic reasons..
>
> > > +static int
> > > +xfs_fs_update_flags(
> > > +	struct xfs_inode	*ip)
> > > +{
> > > +	struct xfs_mount	*mp = ip->i_mount;
> > > +	struct xfs_trans	*tp;
> > > +	int			error;
> > > +
> > > +	/*
> > > +	 * Update the mode, and prealloc flag bits.
> > > +	 */
> > > +	tp = xfs_trans_alloc(mp, XFS_TRANS_WRITEID);
> > > +	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_writeid, 0, 0);
> > > +	if (error) {
> > > +		xfs_trans_cancel(tp, 0);
> > > +		return error;
> > > +	}
> > > +
> > > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > +	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > > +	ip->i_d.di_mode &= ~S_ISUID;
> > > +	if (ip->i_d.di_mode & S_IXGRP)
> > > +		ip->i_d.di_mode &= ~S_ISGID;
> > > +
> > > +	ip->i_d.di_flags |= XFS_DIFLAG_PREALLOC;
> > > +
> > > +	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> > > +	return xfs_trans_commit(tp, 0);
> > > +}
> >
> > That needs timestamp changes as well. i.e.:
> >
> > 	xfs_trans_ichgtime(tp, ip, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
>
> The time stamps are only updated when we actually commit the data.
> Updating them here might be harmless, but I'll have to dig into the
> protocol specification and tests a bit more to check if doing the
> additional timestamp update would be harmless.
>
> > > +
> > > +/*
> > > + * Get a layout for the pNFS client.
> > > + *
> > > + * Note that in the allocation case we do force out the transaction here.
> > > + * There is no metadata update that is required to be stable for NFS
> > > + * semantics, and layouts are not valid over a server crash.  Instead
> > > + * we'll have to be careful in the commit routine as it might pass us
> > > + * blocks for an allocation that never made it to disk in the recovery
> > > + * case.
> >
> > I think you are saying that because block allocation is an async
> > transaction, then we have to deal with the possibility that we crash
> > before the transaction hits the disk.
> >
> > How often do we have to allocate new blocks like this? Do we need to
> > use async transactions for this case, or should we simply do the
> > brute force thing (by making the allocation transaction synchronous)
> > initially and then, if performance problems arise, optimise from
> > there?
>
> Every block allocation from a pNFS client goes through this path, so
> yes it is performance critical.

Sure, but how many allocations per second are we expecting to have to
support? We can do tens of thousands of synchronous transactions per
second on luns with non-volatile write caches, so I'm really wondering
how much of a limitation this is going to be in the real world. Do you
have any numbers?
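If the answer is "not that many", the brute force approach is pretty
much a one-liner on the allocation side. Untested sketch - the exact
placement depends on how the final mapping helper is structured, but
the idea is just:

	/*
	 * Sketch only: force the allocation transaction to stable
	 * storage before the layout is handed back to the client, so a
	 * server crash can't leave the client holding blocks that were
	 * never allocated on disk.
	 */
	xfs_trans_set_sync(tp);
	return xfs_trans_commit(tp, 0);

Start simple, and only move back to async commits (plus the careful
recovery handling the comment describes) if the numbers show the sync
commit actually hurts.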
> > So whenever the server first starts up the generation number in a
> > map is going to be zero - what purpose does this actually serve?
>
> So that we can communicate if a device was grown to the client, which
> in this case needs to re-read the device information.

Why does it need to re-read the device information? The layouts that
are handed out to it are still going to be valid from the server
POV...

> > > +	/*
> > > +	 * Make sure reads through the pagecache see the new data.
> > > +	 */
> > > +	invalidate_inode_pages2(inode->i_mapping);
> >
> > Probably should do that first. Also, what happens if there is local
> > dirty data on the file at this point? Doesn't this just toss them
> > away?
>
> If there was local data it will be tossed.  For regular writes this can't
> happen because we recall outstanding layouts in the write path.  For
> mmap we for now ignore this problem, as a pNFS server should generally
> not be used locally.

Comments, please. ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com