Subject: Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.
From: Oleg Drokin
Date: Mon, 4 Jul 2016 22:32:21 -0400
To: Al Viro
Cc: Mailing List <linux-kernel@vger.kernel.org>
In-Reply-To: <1B63970B-923A-4C4C-98ED-70F0087C513D@linuxhacker.ru>
References: <20160617042914.GD14480@ZenIV.linux.org.uk> <20160703062917.GG14480@ZenIV.linux.org.uk> <1B63970B-923A-4C4C-98ED-70F0087C513D@linuxhacker.ru>

Ah, and of course the testcase that I forgot to attach ;)

[Attachment: lstest.sh]

#!/bin/bash
# Set up a loopback ext4 filesystem, export it over NFS and mount it back
# twice (once NFSv3, once NFSv4).  Assumes /mnt/lustre, /mnt/nfs and
# /mnt/nfs2 already exist.
cd /home/green/git/lustre-release/lustre/tests/racer
dd if=/dev/zero of=/tmp/loop bs=1024k count=1024
mkfs.ext4 /tmp/loop
service rpcbind start
mount none /var/lib/nfs -t tmpfs
touch /var/lib/nfs/etab
service nfs-mountd start
mount /tmp/loop /mnt/lustre -o loop
service nfs-server start
mount localhost:/mnt/lustre /mnt/nfs -t nfs -o nolock
mount localhost:/ /mnt/nfs2 -t nfs4
mkdir /mnt/lustre/racer
# One writer keeps creating, renaming and removing entries directly on the
# exported filesystem...
while :; do mkdir /mnt/lustre/racer/a ; mv /mnt/lustre/racer/a /mnt/lustre/racer/b ; touch /mnt/lustre/racer/a ; mv /mnt/lustre/racer/a /mnt/lustre/racer/b/ ; rm -rf /mnt/lustre/racer/b ; done &
# ...while nine readers hammer the same tree through the NFSv4 mount.
while :; do ls -l /mnt/nfs2/racer/a/ ; ls -l /mnt/nfs2/racer/b/ ; ls -lR /mnt/nfs2/racer/ ; done >/dev/null 2>&1 &
while :; do ls -l /mnt/nfs2/racer/a/ ; ls -l /mnt/nfs2/racer/b/ ; ls -lR /mnt/nfs2/racer/ ; done >/dev/null 2>&1 &
while :; do ls -l /mnt/nfs2/racer/a/ ; ls -l /mnt/nfs2/racer/b/ ; ls -lR /mnt/nfs2/racer/ ; done >/dev/null 2>&1 &
while :; do ls -l /mnt/nfs2/racer/a/ ; ls -l /mnt/nfs2/racer/b/ ; ls -lR /mnt/nfs2/racer/ ; done >/dev/null 2>&1 &
while :; do ls -l /mnt/nfs2/racer/a/ ; ls -l /mnt/nfs2/racer/b/ ; ls -lR /mnt/nfs2/racer/ ; done >/dev/null 2>&1 &
while :; do ls -l /mnt/nfs2/racer/a/ ; ls -l /mnt/nfs2/racer/b/ ; ls -lR /mnt/nfs2/racer/ ; done >/dev/null 2>&1 &
while :; do ls -l /mnt/nfs2/racer/a/ ; ls -l /mnt/nfs2/racer/b/ ; ls -lR /mnt/nfs2/racer/ ; done >/dev/null 2>&1 &
while :; do ls -l /mnt/nfs2/racer/a/ ; ls -l /mnt/nfs2/racer/b/ ; ls -lR /mnt/nfs2/racer/ ; done >/dev/null 2>&1 &
while :; do ls -l /mnt/nfs2/racer/a/ ; ls -l /mnt/nfs2/racer/b/ ; ls -lR /mnt/nfs2/racer/ ; done >/dev/null 2>&1 &
wait %1

On Jul 4, 2016, at 10:28 PM, Oleg Drokin wrote:

> On Jul 3, 2016, at 2:29 AM, Al Viro wrote:
> 
>> On Sat, Jun 25, 2016 at 12:38:40PM -0400, Oleg Drokin wrote:
>> 
>>> Sorry to nag you about this, but did any of those pan out?
>>> 
>>> d_alloc_parallel() sounds a bit too heavy there, esp. considering we came in with
>>> a dentry already (though a potentially shared one, I understand).
>>> Wouldn't it be better to try and establish some dentry locking rule for calling into
>>> d_splice_alias() instead? At least then the callers can make sure the dentry does
>>> not change under them?
>>> Though I guess if there's dentry locking like that, we might as well do all the
>>> checking in d_splice_alias(), but that means unhashed dentries would no
>>> longer be disallowed, which is a change of semantics from what we have now.
>> 
>> FWIW, the only interesting case here is this:
>> 	* no O_CREAT in flags (otherwise the parent is held exclusive).
>> 	* dentry is found in hash
>> 	* dentry is negative
>> 	* dentry has passed ->d_revalidate() (i.e. in case of
>> NFS it had nfs_neg_need_reval() return false).
>> 
>> Only two instances are non-trivial in that respect - NFS and Lustre.
>> Everything else will simply fail open() with ENOENT in that case.
>> 
>> And at least for NFS we could bloody well do d_drop + d_alloc_parallel +
>> finish_no_open and bugger off in case it's not in_lookup, otherwise do
>> pretty much what we do in case we'd got in_lookup from the very beginning.
>> Some adjustments are needed for that case (basically, we need to make
>> sure we hit d_lookup_done() matching that d_alloc_parallel() and deal
>> with refcounting correctly).
>> 
>> Tentative NFS patch follows; I don't understand Lustre well enough, but it
>> looks like a plausible strategy there as well.
> 
> This patch seems to have brought the other crash back (or something similar),
> the one with a negative dentry being hashed when it's already hashed.
> It's not as easy to hit as before, but a lot easier than the race we are hitting here.
> 
> Also, in all cases it is ls that is crashing now, which seems to indicate it is
> taking some of these newly added paths.
> I have half a dozen crashdumps if you want me to look something up there.
> This is 4.7.0-rc6 (a99cde438de0c4c0cecc1d1af1a55a75b10bfdef) with just your
> patch on top.
> 
> I can reproduce it with both the complex workload and a simplified one (attached);
> it takes anywhere from 5 to 30 minutes to hit.
> 
>> 
>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
>> index d8015a03..5474e39 100644
>> --- a/fs/nfs/dir.c
>> +++ b/fs/nfs/dir.c
>> @@ -1485,11 +1485,13 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
>>  		    struct file *file, unsigned open_flags,
>>  		    umode_t mode, int *opened)
>>  {
>> +	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
>>  	struct nfs_open_context *ctx;
>>  	struct dentry *res;
>>  	struct iattr attr = { .ia_valid = ATTR_OPEN };
>>  	struct inode *inode;
>>  	unsigned int lookup_flags = 0;
>> +	bool switched = false;
>>  	int err;
>> 
>>  	/* Expect a negative dentry */
>> @@ -1528,6 +1530,17 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
>>  		attr.ia_size = 0;
>>  	}
>> 
>> +	if (!(open_flags & O_CREAT) && !d_unhashed(dentry)) {
>> +		d_drop(dentry);
>> +		switched = true;
>> +		dentry = d_alloc_parallel(dentry->d_parent,
>> +					  &dentry->d_name, &wq);
> 
> Hm, it seems d_alloc_parallel() can return some preexisting dentry it was able to
> look up under RCU, attached to the same parent with the same name?
> What's to stop a parallel thread from rehashing the dentry after we dropped it here,
> so that d_alloc_parallel() happily finds it?
> I am running a test to verify this theory now.
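Spelling out the interleaving I have in mind (purely hypothetical, just to
illustrate the theory; the two columns and their labels are mine):

    open(..., no O_CREAT) path               parallel lookup of the same name
    --------------------------               --------------------------------
    nfs_atomic_open() gets a hashed,
    revalidated negative dentry
    d_drop(dentry)
                                             instantiates/rehashes a dentry
                                             with the same parent and name
    d_alloc_parallel(parent, name, &wq)
      -> finds that dentry in the hash,
         d_in_lookup() is false
    finish_no_open(file, dentry)
      -> we return through the no-open
         path with a dentry that is
         hashed again behind our back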
> 
>> +		if (IS_ERR(dentry))
>> +			return PTR_ERR(dentry);
>> +		if (unlikely(!d_in_lookup(dentry)))
>> +			return finish_no_open(file, dentry);
>> +	}
>> +
>>  	ctx = create_nfs_open_context(dentry, open_flags);
>>  	err = PTR_ERR(ctx);
>>  	if (IS_ERR(ctx))
>> @@ -1563,14 +1576,23 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
>>  	trace_nfs_atomic_open_exit(dir, ctx, open_flags, err);
>>  	put_nfs_open_context(ctx);
>>  out:
>> +	if (unlikely(switched)) {
>> +		d_lookup_done(dentry);
>> +		dput(dentry);
>> +	}
>>  	return err;
>> 
>>  no_open:
>>  	res = nfs_lookup(dir, dentry, lookup_flags);
>> -	err = PTR_ERR(res);
>> +	if (switched) {
>> +		d_lookup_done(dentry);
>> +		if (!res)
>> +			res = dentry;
>> +		else
>> +			dput(dentry);
>> +	}
>>  	if (IS_ERR(res))
>> -		goto out;
>> -
>> +		return PTR_ERR(res);
>>  	return finish_no_open(file, res);
>>  }
>>  EXPORT_SYMBOL_GPL(nfs_atomic_open);
> 
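P.S. To check the theory above, one way I'm thinking of instrumenting this is
roughly the following on top of your patch (untested sketch; the orig local,
which would really go with the other declarations at the top, and the
pr_warn() are mine, not part of your patch):

	struct dentry *orig = dentry;	/* the dentry we are about to drop */

	d_drop(dentry);
	switched = true;
	dentry = d_alloc_parallel(dentry->d_parent, &dentry->d_name, &wq);
	if (!IS_ERR(dentry) && !d_in_lookup(dentry))
		/* somebody put a dentry for this name back into the hash
		 * between our d_drop() and d_alloc_parallel() */
		pr_warn("nfs_atomic_open: got %s dentry %p back (dropped %p)\n",
			d_is_negative(dentry) ? "negative" : "positive",
			dentry, orig);

If the two pointers match and the dentry is still negative, that would confirm
that a parallel thread simply rehashed the dentry we had just dropped.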