From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
    id S1753087AbcFOCfH (ORCPT ); Tue, 14 Jun 2016 22:35:07 -0400
Received: from mx1.redhat.com ([209.132.183.28]:55885 "EHLO mx1.redhat.com"
    rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
    id S1752407AbcFOCfE (ORCPT ); Tue, 14 Jun 2016 22:35:04 -0400
Date: Tue, 14 Jun 2016 22:35:02 -0400
From: Mike Snitzer
To: Dan Williams
Cc: Jeff Moyer, "Kani, Toshimitsu", "axboe@kernel.dk",
    "linux-nvdimm@lists.01.org", "linux-kernel@vger.kernel.org",
    "linux-raid@vger.kernel.org", "dm-devel@redhat.com",
    "viro@zeniv.linux.org.uk", "ross.zwisler@linux.intel.com",
    "agk@redhat.com"
Subject: Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices
Message-ID: <20160615023502.GC5443@redhat.com>
References: <1465856497-19698-1-git-send-email-toshi.kani@hpe.com>
    <1465861755.3504.185.camel@hpe.com>
    <20160614154131.GB25876@redhat.com>
    <20160615014658.GA5443@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.21 (2010-09-15)
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16
    (mx1.redhat.com [10.5.110.32]); Wed, 15 Jun 2016 02:35:04 +0000 (UTC)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Jun 14 2016 at 10:07pm -0400,
Dan Williams wrote:

> On Tue, Jun 14, 2016 at 6:46 PM, Mike Snitzer wrote:
> > On Tue, Jun 14 2016 at 4:19pm -0400,
> > Jeff Moyer wrote:
> >
> >> Mike Snitzer writes:
> >>
> >> > On Tue, Jun 14 2016 at 9:50am -0400,
> >> > Jeff Moyer wrote:
> >> >
> >> >> "Kani, Toshimitsu" writes:
> >> >>
> >> >> >> I had dm-linear and md-raid0 support on my list of things to look at,
> >> >> >> did you have raid0 in your plans?
> >> >> >
> >> >> > Yes, I hope to extend further and raid0 is a good candidate.
> >> >>
> >> >> dm-flakey would allow more xfstests test cases to run.
> >> >> I'd say that's more important than linear or raid0. ;-)
> >> >
> >> > Regardless of which target(s) grow DAX support the most pressing initial
> >> > concern is getting the DM device stacking correct. And verifying that
> >> > IO that crosses pmem device boundaries is being properly split by DM
> >> > core (via drivers/md/dm.c:__split_and_process_non_flush()'s call to
> >> > max_io_len).
> >>
> >> That was a tongue-in-cheek comment. You're reading way too much into
> >> it.
> >>
> >> >> Also, the next step in this work is to then decide how to determine on
> >> >> what numa node an LBA resides. We had discussed this at a prior
> >> >> plumbers conference, and I think the consensus was to use xattrs.
> >> >> Toshi, do you also plan to do that work?
> >> >
> >> > How does the associated NUMA node relate to this? Does the
> >> > DM request_queue need to be set up to only allocate from the NUMA node
> >> > the pmem device is attached to? I recently added support for this to
> >> > DM. But there will likely be some code needed to propagate the NUMA node
> >> > id accordingly.
> >>
> >> I assume you mean allocate memory (the volatile kind). That should work
> >> the same between pmem and regular block devices, no?
> >
> > This is the commit I made to train DM to be numa node aware:
> > 115485e83f497fdf9b4 ("dm: add 'dm_numa_node' module parameter")
>
> Hmm, but this is global for all DM device instances.

Right, only because I didn't have a convenient way to allow the user to
specify it on a per-device level. But I'll defer skinning that cat for
now since in this pmem case we'd inherit from the underlying device(s).

> > As is, the DM code is focused on memory allocations. But I think blk-mq
> > may use the NUMA node via tag_set->numa_node. But that is moot
> > given pmem is bio-based, right?
>
> Right.
>
> >
> > Steps could be taken to make all threads DM creates for a given device
> > get pinned to the specified NUMA node too.
>
> I think it would be useful if a DM instance inherited the numa node
> from the component devices by default (assuming they're all from the
> same node). A "dev_to_node(disk_to_dev(disk))" conversion works for
> pmem devices.

OK, I can look to make that happen.

> As far as I understand, Jeff wants to go further and have a linear
> span across component devices from different nodes with an interface
> to do an LBA-to-numa-node conversion.

All that variability makes DM's ability to do anything sane with it
close to impossible, considering memory pools, threads, etc. are all
pinned during the first activation of the DM device.
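
To make the two ideas above concrete, here is a rough userspace sketch
(not kernel code, and not what DM does today). It models "inherit the
node only if every component device agrees" and a per-sector lookup for
a linear span. struct component, inherit_node(), sector_to_node() and
NO_NODE are all hypothetical names made up for illustration; NO_NODE
merely stands in for the kernel's NUMA_NO_NODE, and in the kernel the
per-device node would presumably come from something like
dev_to_node(disk_to_dev(disk)) as mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

#define NO_NODE (-1)	/* illustrative stand-in for NUMA_NO_NODE */

/* Hypothetical model of one component device in a linear span. */
struct component {
	uint64_t start;	/* first sector of this component within the span */
	uint64_t len;	/* length in sectors */
	int node;	/* NUMA node the backing device is attached to */
};

/*
 * Return the node shared by all components, or NO_NODE if the list is
 * empty or the components sit on different nodes.
 */
int inherit_node(const struct component *c, size_t n)
{
	size_t i;

	if (n == 0)
		return NO_NODE;
	for (i = 1; i < n; i++)
		if (c[i].node != c[0].node)
			return NO_NODE;
	return c[0].node;
}

/*
 * Return the node of the component containing 'sector', or NO_NODE if
 * the sector falls outside every component.
 */
int sector_to_node(const struct component *c, size_t n, uint64_t sector)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (sector >= c[i].start && sector - c[i].start < c[i].len)
			return c[i].node;
	return NO_NODE;
}
```

The first helper yields a single answer that can be fixed at activation
time. The second yields a different answer per sector, which is the
part that does not fit resources pinned once per device.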