Date: Wed, 27 May 2009 00:31:43 +0900
To: James.Bottomley@HansenPartnership.com
Cc: fujita.tomonori@lab.ntt.co.jp, jens.axboe@oracle.com, rdreier@cisco.com,
    bharrosh@panasas.com, linux-kernel@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, chris.mason@oracle.com,
    david@fromorbit.com, hch@infradead.org, akpm@linux-foundation.org,
    jack@suse.cz, yanmin_zhang@linux.intel.com, linux-scsi@vger.kernel.org
Subject: Re: [PATCH 03/13] scsi: unify allocation of scsi command and sense buffer
From: FUJITA Tomonori
In-Reply-To: <1243349222.2815.22.camel@localhost.localdomain>
References: <20090526073229.GC11363@kernel.dk>
    <20090526163823U.fujita.tomonori@lab.ntt.co.jp>
    <1243349222.2815.22.camel@localhost.localdomain>
Message-Id: <20090527003152X.fujita.tomonori@lab.ntt.co.jp>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 26 May 2009 09:47:02 -0500
James Bottomley wrote:

> On Tue, 2009-05-26 at 16:38 +0900, FUJITA Tomonori wrote:
> > On Tue, 26 May 2009 09:32:29 +0200
> > Jens Axboe wrote:
> >
> > > On Tue, May 26 2009, FUJITA Tomonori wrote:
> > > > On Tue, 26 May 2009 08:29:53 +0200
> > > > Jens Axboe wrote:
> > > >
> > > > > On Tue, May 26 2009, FUJITA Tomonori wrote:
> > > > > > On Mon, 25 May 2009 18:45:25 -0700
> > > > > > Roland Dreier wrote:
> > > > > >
> > > > > > > > Ideally there should be a MACRO that is defined to WORD_SIZE on cache-coherent
> > > > > > > > ARCHs and to SMP_CACHE_BYTES on non-cache-coherent systems and use that size
> > > > > > > > at the __align() attribute. (So only stupid ARCHES get hurt)
> > > > > > >
> > > > > > > this seems to come up repeatedly -- I had a proposal a _long_ time ago
> > > > > > > that never quite got merged, cf http://lwn.net/Articles/2265/ and
> > > > > > > http://lwn.net/Articles/2269/ -- from 2002 (!?). The idea is to go a
> > > > > >
> > > > > > Yeah, I think that Benjamin did last time:
> > > > > >
> > > > > > http://www.mail-archive.com/linux-scsi@vger.kernel.org/msg12632.html
> > > > > >
> > > > > > IIRC, James didn't like it so I wrote the current code. I didn't see
> > > > > > any big performance difference with scsi_debug:
> > > > > >
> > > > > > http://marc.info/?l=linux-scsi&m=120038907123706&w=2
> > > > > >
> > > > > > Jens, you see the performance difference due to this unification?
> > > > >
> > > > > Yes, it's definitely a worthwhile optimization. The problem isn't as
> > > > > such this specific allocation, it's the total number of allocations we
> > > > > do for a piece of IO. This sense buffer one is just one of many, I'm
> > > > > continually working to reduce them. If we get rid of this one and add
> > > > > the ->alloc_cmd() stuff, we can kill one more. The bio path already lost
> > > > > one. So in the IO stack, we went from 6 allocations to 3 for a piece of
> > > > > IO. And then it starts to add up. Even at just 30-50k iops, that's more
> > > > > than 1% of time in the testing I did.
> > > >
> > > > I see, thanks. Hmm, possibly slab becomes slower. ;)
> > > >
> > > > Then I think that we need something like the ->alloc_cmd()
> > > > method. Let's ask James.
> > > >
> > > > I don't think that it's just about simply adding the hook; there are
> > > > some issues that we need to think about.
> > > > Though Boaz worries a bit too much, I think.
> > > >
> > > > I'm not sure about this patch if we add ->alloc_cmd(). I doubt that
> > > > there are any LLDs that don't use ->alloc_cmd() but worry about the
> > > > overhead of the separate sense buffer allocation. If an LLD doesn't
> > > > define its own alloc_cmd, then I think it's fine to use the generic
> > > > command allocator that does the separate sense buffer allocation.
> > >
> > > I think we should do the two things separately. If we can safely inline
> > > the sense buffer in the command by doing the right alignment, then let's
> > > do that. The ->alloc_cmd() approach will be easier to do with an inline
> > > sense buffer.
> >
> > James rejected this in the past. Let's wait for his verdict.
>
> OK, so the reason for the original problems where the sense buffer was
> inlined with the scsi_command was that we need to DMA to the sense
> buffer but not to the command. Plus the command is in fairly constant
> use so we get cacheline interference unless they're always in separate
> caches. This necessitates opening up a hole in the command to achieve
> this (you can separate to the next cache line if you can guarantee that
> the command always begins on a cacheline. If not, it has to be
> 2*cacheline). The L1 cacheline can be up to 128 bytes on some
> architectures, so we'd need to know the waste of space is worth it in
> terms of speed. The other problem is that the entire command now has to
> be allocated in DMAable memory, which restricts the allocation on some
> systems.

Yeah, I think that there are good reasons why we shouldn't inline the
sense buffer. As I already wrote, it seems that the DMA requirement
wasn't properly understood; it's not about the alignment.

> > Yeah, we can inline the sense buffer but as we discussed in the past
> > several times, there are some good reasons that we should not do so, I
> > think.
>
> There are several other approaches:
>
>      1. Keep the sense buffer packed in the command but disallow DMA to
>         it, which fixes all the alignment problems. Then we supply a
>         set of rotating DMA buffers to drivers which need to do the DMA
>         (which isn't the majority).

Can we just fix some drivers not to do the DMA with the sense buffer
in scsi_cmnd? IIRC, there are only five or six drivers that do so.
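
For reference, a rough user-space sketch of the space cost behind the
"hole in the command" that James describes. Nothing here is kernel code:
the struct names, field choices, DEMO_CACHELINE (128, the worst case
mentioned above) and DEMO_SENSE_SIZE are all made up for illustration,
and it assumes a C11 compiler. It only compares a packed layout against
one where the sense buffer gets cachelines of its own.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHELINE  128 /* assumption: worst-case L1 line size */
#define DEMO_SENSE_SIZE 96  /* assumption: illustrative sense size */

/* Packed: sense data shares cachelines with hot command fields. */
struct demo_cmd_packed {
	unsigned long state;
	unsigned char cdb[16];
	unsigned char sense[DEMO_SENSE_SIZE];
	int result;
};

/*
 * Separated: the sense buffer starts on its own cacheline and the
 * next field is pushed to the following one, so a DMA write to
 * sense[] never shares a line with CPU-owned fields. This only
 * holds if the object itself starts on a cacheline boundary.
 */
struct demo_cmd_separated {
	unsigned long state;
	unsigned char cdb[16];
	alignas(DEMO_CACHELINE) unsigned char sense[DEMO_SENSE_SIZE];
	alignas(DEMO_CACHELINE) int result;
};

static_assert(offsetof(struct demo_cmd_separated, sense) % DEMO_CACHELINE == 0,
	      "sense buffer must start on a cacheline boundary");

/* Extra bytes per command paid for the separation. */
size_t demo_hole_bytes(void)
{
	return sizeof(struct demo_cmd_separated) -
	       sizeof(struct demo_cmd_packed);
}

/*
 * If the allocator does not guarantee an aligned start, the boundary
 * has to be found at run time, which costs up to one more cacheline
 * of slack. 'line' must be a power of two.
 */
uintptr_t demo_align_up(uintptr_t addr, uintptr_t line)
{
	return (addr + line - 1) & ~(line - 1);
}
```

With these made-up fields the separated layout occupies three full
cachelines per command, which is the kind of waste that would have to be
justified by the saved allocation.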