From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: from verein.lst.de ([213.95.11.211]:40578 "EHLO newverein.lst.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S933172AbcIFQu6 (ORCPT ); Tue, 6 Sep 2016 12:50:58 -0400
Date: Tue, 6 Sep 2016 18:50:56 +0200
From: Christoph Hellwig 
To: Keith Busch 
Cc: axboe@fb.com, linux-block@vger.kernel.org,
	linux-nvme@lists.infradead.org, Thomas Gleixner 
Subject: Re: [PATCH 4/7] blk-mq: allow the driver to pass in an affinity mask
Message-ID: <20160906165056.GB26214@lst.de>
References: <1472468013-29936-1-git-send-email-hch@lst.de>
	<1472468013-29936-5-git-send-email-hch@lst.de>
	<20160831163852.GB5598@localhost.localdomain>
	<20160901084624.GC4115@lst.de>
	<20160901142410.GA10903@localhost.localdomain>
	<20160905194759.GA26008@lst.de>
	<20160906143928.GA25201@localhost.localdomain>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <20160906143928.GA25201@localhost.localdomain>
Sender: linux-block-owner@vger.kernel.org
List-Id: linux-block@vger.kernel.org

[adding Thomas as it's about the affinity_mask he (we) added to the
 IRQ core]

On Tue, Sep 06, 2016 at 10:39:28AM -0400, Keith Busch wrote:
> > Always the previous one.  Below is a patch to get us back to the
> > previous behavior:
> 
> No, that's not right.
> 
> Here's my topology info:
> 
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
> node 0 size: 15745 MB
> node 0 free: 15319 MB
> node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
> node 1 size: 16150 MB
> node 1 free: 15758 MB
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10

How do you get that mapping?  Does this CPU use Hyperthreading and thus
expose siblings using topology_sibling_cpumask?  That's the only thing
the old code used for any sort of special casing.  I'll need to see if
I can find a system with such a mapping to reproduce.

> If I have 16 vectors, the affinity_mask generated by what you're doing
> looks like 0000ffff, CPUs 0-15.  So the first 16 bits are set since each
> of those is the first unique CPU, getting a unique vector just like you
> wanted.  If an unset bit just means "share with the previous one", then
> all of my thread siblings (CPUs 16-31) get to share a vector with CPU 15.
> That's awful!
> 
> What we want for my CPU topology is the 16th CPU to pair with CPU 0,
> 17 to pair with 1, 18 with 2, and so on.  You can't convey that
> information with this scheme.  We need affinity_masks per vector.

We actually have per-vector masks, but they are hidden inside the IRQ
core and awkward to use.

We could do the get_first_sibling magic in the blk-mq queue mapping
(and in fact with the current code I guess we need to).

Or we could take a step back from trying to emulate the old code and
look at NUMA nodes instead of siblings, which some folks suggested a
while ago.
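
To make that pairing concrete, here is a quick userspace model of the
get_first_sibling idea (just a sketch, not the old blk-mq code; it
hard-codes the assumption from your numactl output that CPU i and
CPU i + 16 are thread siblings): collapsing every CPU onto its first
thread sibling before handing out queues puts CPU 16 on the same queue
as CPU 0, CPU 17 on the same queue as CPU 1, and so on, which is
exactly the pairing a single per-device affinity mask cannot express.

#include <stdio.h>

#define NR_CPUS		32
#define NR_QUEUES	16

/*
 * Assumed topology (taken from the numactl output above): CPU i and
 * CPU i + 16 are thread siblings, so the first sibling of any CPU is
 * simply cpu % 16.
 */
static int first_sibling(int cpu)
{
	return cpu % (NR_CPUS / 2);
}

int main(void)
{
	int map[NR_CPUS];
	int cpu, queue = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		int first = first_sibling(cpu);

		if (first == cpu)
			map[cpu] = queue++ % NR_QUEUES;	/* first thread of a core gets its own queue */
		else
			map[cpu] = map[first];		/* later siblings share the first sibling's queue */
	}

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		printf("cpu %2d -> queue %2d\n", cpu, map[cpu]);
	return 0;
}

In the kernel the first_sibling() helper would presumably be
cpumask_first(topology_sibling_cpumask(cpu)) instead of a hard-coded
modulo, and that collapse could live either in the blk-mq queue mapping
or in whatever ends up spreading the per-vector masks.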