From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: from zeniv.linux.org.uk ([195.92.253.2]:46766 "EHLO ZenIV.linux.org.uk"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752603AbcBJQol
	(ORCPT ); Wed, 10 Feb 2016 11:44:41 -0500
Date: Wed, 10 Feb 2016 16:44:36 +0000
From: Al Viro
To: Mike Marshall
Cc: Linus Torvalds, linux-fsdevel, Stephen Rothwell
Subject: Re: Orangefs ABI documentation
Message-ID: <20160210164435.GA4950@ZenIV.linux.org.uk>
References: <20160207035331.GZ17997@ZenIV.linux.org.uk>
 <20160208233535.GC17997@ZenIV.linux.org.uk>
 <20160209033203.GE17997@ZenIV.linux.org.uk>
 <20160209174049.GG17997@ZenIV.linux.org.uk>
 <20160209221623.GI17997@ZenIV.linux.org.uk>
 <20160209224050.GJ17997@ZenIV.linux.org.uk>
 <20160209231328.GK17997@ZenIV.linux.org.uk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160209231328.GK17997@ZenIV.linux.org.uk>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

On Tue, Feb 09, 2016 at 11:13:28PM +0000, Al Viro wrote:
> On Tue, Feb 09, 2016 at 10:40:50PM +0000, Al Viro wrote:
>
> > And the version in orangefs-2.9.3.tar.gz (your Frankenstein module?) is
> > vulnerable to the same race.  2.8.1 isn't - it ignores signals on the
> > cancel, but that means waiting for the cancel to be processed (or timed
> > out) on any interrupted read() before we return to userland.  We can
> > return to that behaviour, of course, but I suspect that offloading it to
> > something async (along with freeing the slot used by the original
> > operation) would be better from a QoI point of view.
>
> That breakage had been introduced between 2.8.5 and 2.8.6 (at some point
> during the spring of 2012).  AFAICS, all versions starting with 2.8.6 are
> vulnerable...

BTW, what about kill -9 delivered to a readdir in progress?  There's no
cancel for those (and AFAICS the daemon will reject a cancel on anything
other than FILE_IO), so what's to stop another thread from picking the same
readdir slot and getting (daemon-side) two of them spewing into the same
area of shared memory?

Is it simply that daemon-side the shared memory on readdir is touched only
upon request completion, in the completely serialized process_vfs_requests()?
That doesn't seem to be enough - suppose the second readdir request completes
(daemon-side) first, its results get packed into the shared memory slot and
it is reported to the kernel, which proceeds to repack and copy that data to
userland.  In the meanwhile the daemon completes the _earlier_ readdir and
proceeds to pack its results into the same slot of shared memory.  Sure, the
kernel won't take that response (the op with the matching tag is already
gone), but the data is stored into shared memory *before* the writev() on
the control device that would pass the response to the kernel, so it still
gets overwritten.  Right under the readdir decoding...

Or is there something in the daemon that would guarantee readdir responses
to happen in the same order in which it had picked the requests?  I'm not
familiar enough with that beast (and the overall control flow in there is,
er, not the most transparent one I've seen), so I might be missing something,
but I don't see anything obvious that would guarantee such ordering.
Please clarify.
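
PS: to make the interleaving I'm worried about concrete, here's a throwaway
userspace sketch (pthreads; every name in it is made up, it's emphatically
*not* the orangefs code).  One thread plays the daemon responding to the
second readdir, the other plays the daemon finishing the earlier,
already-killed readdir into the same slot.  Note that serializing the two
stores (the mutex standing in for the serialization in
process_vfs_requests()) doesn't help - the clobbering store still lands
before the "kernel" gets around to decoding:

	/* sketch only: shows a late completion of an old request
	 * overwriting a reused shared-memory slot before the new
	 * request's data has been consumed. */
	#include <pthread.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	static char slot[64];	/* one "shared memory" readdir slot */
	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

	static void *daemon_new_readdir(void *arg)
	{
		(void)arg;
		/* second readdir: pack results, then "respond" to kernel */
		pthread_mutex_lock(&lock);
		strcpy(slot, "results of readdir #2");
		pthread_mutex_unlock(&lock);
		return NULL;
	}

	static void *daemon_old_readdir(void *arg)
	{
		(void)arg;
		/* earlier (killed) readdir completing late: packs into the
		 * *same* slot.  The kernel will reject its response (stale
		 * tag), but the bytes in the slot are already clobbered. */
		sleep(1);
		pthread_mutex_lock(&lock);
		strcpy(slot, "results of readdir #1");
		pthread_mutex_unlock(&lock);
		return NULL;
	}

	int main(void)
	{
		pthread_t t_new, t_old;

		pthread_create(&t_new, NULL, daemon_new_readdir, NULL);
		pthread_create(&t_old, NULL, daemon_old_readdir, NULL);
		pthread_join(t_new, NULL);

		/* "kernel" side: decoding readdir #2 out of the slot later */
		sleep(2);
		printf("kernel decodes: %s\n", slot);

		pthread_join(t_old, NULL);
		return 0;
	}

Build with -pthread and it prints readdir #1's data, i.e. whatever the
kernel would be decoding for the second readdir has already been stomped on.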