From mboxrd@z Thu Jan 1 00:00:00 1970 From: James Bottomley Subject: Re: [PATCH 13/17] scsi: push host_lock down into scsi_{host,target}_queue_ready Date: Thu, 06 Feb 2014 08:56:59 -0800 Message-ID: <1391705819.22335.8.camel@dabdike> References: <20140205123930.150608699@bombadil.infradead.org> <20140205124021.286457268@bombadil.infradead.org> Mime-Version: 1.0 Content-Type: text/plain; charset="ISO-8859-15" Content-Transfer-Encoding: 7bit Return-path: Received: from bedivere.hansenpartnership.com ([66.63.167.143]:48574 "EHLO bedivere.hansenpartnership.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751500AbaBFQ5A (ORCPT ); Thu, 6 Feb 2014 11:57:00 -0500 In-Reply-To: <20140205124021.286457268@bombadil.infradead.org> Sender: linux-scsi-owner@vger.kernel.org List-Id: linux-scsi@vger.kernel.org To: Christoph Hellwig Cc: Jens Axboe , Nicholas Bellinger , linux-scsi@vger.kernel.org On Wed, 2014-02-05 at 04:39 -0800, Christoph Hellwig wrote: > Prepare for not taking a host-wide lock in the dispatch path by pushing > the lock down into the places that actually need it. Note that this > patch is just a preparation step, as it will actually increase lock > roundtrips and thus decrease performance on its own. I'm not really convinced by the rest of the series. I think patches 4,8,9,10,11,12 (lock elimination from fast path and get/put reduction) are where the improvements are at and they mostly look ready to apply modulo some author and signoff fixes. I'm dubious about replacing a locked set of checks and increments with atomics for the simple reason that atomics are pretty expensive on non-x86, so you've likely slowed the critical path down for them. Even on x86, atomics can be very expensive because of the global bus lock. I think about three of them in a row is where you might as well stick with the lock. Could you benchmark this lot and show what the actual improvement is just for this series, if any? I also think we should be getting more utility out of threading guarantees. So, if there's only one thread active per device we don't need any device counters to be atomic. Likewise, u32 read/write is an atomic operation, so we might be able to use sloppy counters for the target and host stuff (one per CPU that are incremented/decremented on that CPU ... this will only work using CPU locality ... completion on same CPU but that seems to be an element of a lot of stuff nowadays). I'm not saying this will all pan out, but I am saying I don't think just making all the counters atomic to reduce locking will buy us a huge amount, so I'd appreciate careful benchmarking to confirm or invalidate this hypothesis first. James