From mboxrd@z Thu Jan 1 00:00:00 1970
From: Benjamin Herrenschmidt
Reply-To: benh@au1.ibm.com
To: Logan Gunthorpe, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org,
	linux-nvdimm@lists.01.org, linux-block@vger.kernel.org
Cc: Stephen Bates, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy,
	Dan Williams, Jérôme Glisse, Alex Williamson, Oliver OHalloran
Subject: Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
Date: Fri, 02 Mar 2018 07:29:55 +1100
Organization: IBM Australia
In-Reply-To: <8e808448-fc01-5da0-51e7-1a6657d5a23a@deltatee.com>
References: <20180228234006.21093-1-logang@deltatee.com>
	 <1519876489.4592.3.camel@kernel.crashing.org>
	 <1519876569.4592.4.camel@au1.ibm.com>
	 <8e808448-fc01-5da0-51e7-1a6657d5a23a@deltatee.com>
Message-Id: <1519936195.4592.18.camel@au1.ibm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 2018-03-01 at 11:04 -0700, Logan Gunthorpe wrote:
> 
> On 28/02/18 08:56 PM, Benjamin Herrenschmidt wrote:
> > On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
> > > The problem is that according to him (I didn't double check the
> > > latest patches) you effectively hotplug the PCIe memory into the
> > > system when creating struct pages.
> > > 
> > > This cannot possibly work for us. First, we cannot map PCIe memory
> > > as cacheable. (Note that doing so is a bad idea if you are behind a
> > > PLX switch anyway, since you'd have to manage cache coherency in SW.)
> > 
> > Note: I think the above means it won't work behind a switch on x86
> > either, will it ?
> 
> This works perfectly fine on x86 behind a switch and we've tested it on
> multiple machines. We've never had an issue of running out of virtual
> space despite our PCI BARs typically being located at an offset of 56TB
> or more. The arch code on x86 also somehow figures out not to map the
> memory as cacheable, so that's not an issue (though, at this point, the
> CPU never accesses the memory, so even if it were, it wouldn't affect
> anything).

Oliver, can you look into this ? You said the memory was effectively
hotplugged into the system when creating the struct pages.

That would mean to me that it's a) mapped (which for us means cacheable,
maybe x86 has tricks to avoid that) and b) potentially used to populate
userspace pages (which will definitely be cacheable). Unless there's
something in there you didn't see that prevents it.
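For the record, my rough understanding of the mechanism (a sketch only,
based on how the existing upstream ZONE_DEVICE users drive
devm_memremap_pages() -- I haven't verified the actual patches do
exactly this, and the helper name below is made up):

#include <linux/err.h>
#include <linux/memremap.h>
#include <linux/pci.h>

/*
 * Sketch: hand (part of) a PCI BAR to ZONE_DEVICE so the core
 * allocates struct pages for it. This is the step that "hotplugs"
 * the MMIO range into the kernel's memory map -- the part I'm
 * worried about on ppc64. Real users also wire up pgmap->ref
 * (a percpu_ref for teardown) and pgmap->type; elided here.
 */
static void *sketch_memremap_bar(struct pci_dev *pdev, int bar, size_t size)
{
	struct dev_pagemap *pgmap;

	pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
	if (!pgmap)
		return ERR_PTR(-ENOMEM);

	/* Describe the chunk of the BAR to be covered by struct pages */
	pgmap->res.start = pci_resource_start(pdev, bar);
	pgmap->res.end = pgmap->res.start + size - 1;
	pgmap->res.flags = pci_resource_flags(pdev, bar);

	/* Creates the struct pages and maps the range into the kernel */
	return devm_memremap_pages(&pdev->dev, pgmap);
}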
> We also had this working on ARM64 a while back, but it required some
> out-of-tree ZONE_DEVICE patches and some truly horrid hacks to its
> arch code to ioremap the memory into the page map.
> 
> You didn't mention what architecture you were trying this on.

ppc64.

> It may make sense at this point to make this feature dependent on x86
> until more work is done to make it properly portable. Something like
> arch functions that allow adding IO memory pages with a specific cache
> setting. Though, if an arch has such restrictive limits on the map
> size, it would probably need to address that too somehow.

Not a fan of that approach. So there are two issues to consider here:

 - Our MMIO space is very far away from memory (high bits set in the
   address), which causes problems with things like vmemmap,
   page_address, virt_to_page, etc... Do you have similar issues on
   arm64 ?

 - We need to ensure that the mechanism (which I'm not familiar with)
   that you use to create the struct pages for the device doesn't end
   up turning those device pages into normal "general use" pages for
   the system. Oliver thinks it does, you say it doesn't, ...

Jerome (Glisse), what's your take on this ? Smells like something that
could be covered by HMM...

Logan, the only reason you need struct pages to begin with is for the
DMA API, right ? Or am I missing something here ? (See the P.S. at the
bottom for what I mean.)

Cheers,
Ben.

> Thanks,
> 
> Logan
> 
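P.S. To spell out the DMA API point: the generic mapping path is
entirely struct-page based, so without ZONE_DEVICE pages there is no
way to push the P2P memory through dma_map_sg(). A minimal illustration
of that generic usage (not the actual patch code; the function name is
made up):

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* You cannot build a scatterlist entry without a struct page. */
static int sketch_dma_one_page(struct device *dev, struct page *page,
			       unsigned int len)
{
	struct scatterlist sg;

	sg_init_table(&sg, 1);
	sg_set_page(&sg, page, len, 0);	/* this is where the page is needed */

	if (!dma_map_sg(dev, &sg, 1, DMA_BIDIRECTIONAL))
		return -EIO;

	/* ... program sg_dma_address(&sg) into the device ... */

	dma_unmap_sg(dev, &sg, 1, DMA_BIDIRECTIONAL);
	return 0;
}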