From mboxrd@z Thu Jan  1 00:00:00 1970
From: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Subject: Re: Race between __sync_single_inode() and LogFS garbage collector
Date: Mon, 19 Feb 2007 17:05:55 -0600
Message-ID: <1171926356.9771.34.camel@kleikamp.austin.ibm.com>
References: <20070219213150.GD7813@lazybastard.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: linux-fsdevel@vger.kernel.org
To: =?ISO-8859-1?Q?J=F6rn?= Engel <joern@lazybastard.org>
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from e4.ny.us.ibm.com ([32.97.182.144]:41428 "EHLO e4.ny.us.ibm.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S965521AbXBSXGF (ORCPT <rfc822;linux-fsdevel@vger.kernel.org>);
	Mon, 19 Feb 2007 18:06:05 -0500
Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236])
	by e4.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l1JN63R5011286
	for <linux-fsdevel@vger.kernel.org>; Mon, 19 Feb 2007 18:06:03 -0500
Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215])
	by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.2) with ESMTP id l1JN63bm300790
	for <linux-fsdevel@vger.kernel.org>; Mon, 19 Feb 2007 18:06:03 -0500
Received: from d01av01.pok.ibm.com (loopback [127.0.0.1])
	by d01av01.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l1JN620B003627
	for <linux-fsdevel@vger.kernel.org>; Mon, 19 Feb 2007 18:06:02 -0500
In-Reply-To: <20070219213150.GD7813@lazybastard.org>
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

On Mon, 2007-02-19 at 21:31 +0000, Jörn Engel wrote:
> Looks like I really am writing the first log-structured filesystem for
> Linux.  At least I ran into a fairly arcane race that seems to be generic
> to all of them.
>
> Writing when space is tight may involve calling the garbage collector.
> The garbage collector will iget() random inodes, either to verify if a
> block is valid or to copy the block around.  At this point, all writes
> to LogFS are serialized.
>
> __sync_single_inode() will first lock a random inode, then call
> write_inode(), then unlock the inode.  So we can get this:
>
>
> __sync_single_inode()                 garbage collector
> ---------------------------------------------------------------------
> inode->i_state |= I_LOCK;             ...
> ...                                   mutex_lock(&super->s_w_mutex);
> write_inode(inode, wait);             ...
>   ...                                 iget(sb, ino);
>   mutex_lock(&super->s_w_mutex);      ...
>   ...                                   wait_on_inode(inode);
>   mutex_unlock(&super->s_w_mutex);
>   ...
> ...
> inode->i_state &= ~I_LOCK;
>
>
> And once in a blue moon, those two will race for the same inode.  As far
> as I can see, the race can only get fixed in two ways:
> 1. Never iget() inside the garbage collector.  That would require having
>    a private inode cache for LogFS.
> 2. Synchronize __sync_single_inode() and the garbage collector somehow.
>
> Variant 1 would result in double caching for the same object, something
> I would like to avoid.  So does anyone have suggestions how variant 2
> could be achieved?  Essentially what I need is a way to say "don't sync
> any inodes right now, I'll be back in 5 milliseconds or so".

It'd be nice if you could drop s_w_mutex when the garbage collector
calls iget().

Otherwise, you may be able to call ilookup5_nowait() in the garbage
collector, and skip that inode if I_LOCK is set.
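
To make the second idea concrete, here is a rough userspace model of
"look the inode up without waiting, and skip it if it is locked".  It is
only a sketch and not LogFS or VFS code: struct model_inode,
MODEL_I_LOCK, model_lookup_nowait() and model_gc_pass() are hypothetical
names made up for illustration, and the lookup function merely stands in
for the non-blocking behaviour of ilookup5_nowait().

```c
#include <stddef.h>

/* Userspace model only; none of these names exist in the kernel. */
#define MODEL_I_LOCK 0x1u          /* stands in for the I_LOCK state bit */

struct model_inode {
	unsigned long ino;
	unsigned int state;        /* MODEL_I_LOCK set while a sync is in flight */
};

/*
 * Non-blocking cache lookup: returns the cached inode even if it is
 * locked, or NULL if it is not cached.  It never sleeps.
 */
static struct model_inode *model_lookup_nowait(struct model_inode *cache,
					       size_t n, unsigned long ino)
{
	for (size_t i = 0; i < n; i++)
		if (cache[i].ino == ino)
			return &cache[i];
	return NULL;
}

/*
 * One garbage collector pass over a list of candidate inode numbers.
 * Locked inodes are skipped instead of waited on, so the collector never
 * sleeps on an inode while holding its own write lock.  Returns the
 * number of inodes processed; *skipped counts the locked ones.
 */
static size_t model_gc_pass(struct model_inode *cache, size_t n,
			    const unsigned long *victims, size_t nvictims,
			    size_t *skipped)
{
	size_t done = 0;

	*skipped = 0;
	for (size_t i = 0; i < nvictims; i++) {
		struct model_inode *inode =
			model_lookup_nowait(cache, n, victims[i]);

		if (!inode)
			continue;  /* not cached; outside the scope of this sketch */
		if (inode->state & MODEL_I_LOCK) {
			(*skipped)++;  /* sync in progress: retry on a later pass */
			continue;
		}
		done++;            /* safe to validate or move this inode's blocks */
	}
	return done;
}
```

The catch, I'd assume, is that a skipped inode keeps its segment from
being reclaimed on that pass, so the collector has to come back to it
later or pick a different segment.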

>
> Jörn
>
-- 
David Kleikamp
IBM Linux Technology Center
