Subject: Re: NVMe vs DMA addressing limitations
To: Christoph Hellwig
References: <1483044304-2085-1-git-send-email-nikita.yoush@cogentembedded.com>
 <2723285.JORgusvJv4@wuerfel>
 <9a03c05d-ad4c-0547-d1fe-01edb8b082d6@cogentembedded.com>
 <6374144.HVL0QxNJiT@wuerfel> <20170109205746.GA6274@lst.de>
 <20170110070719.GA17208@lst.de>
Cc: Arnd Bergmann, linux-arm-kernel@lists.infradead.org, Catalin Marinas,
 Will Deacon, linux-kernel@vger.kernel.org,
 linux-renesas-soc@vger.kernel.org, Simon Horman,
 linux-pci@vger.kernel.org, Bjorn Helgaas,
 artemi.ivanov@cogentembedded.com, Keith Busch, Jens Axboe,
 Sagi Grimberg, linux-nvme@lists.infradead.org
From: Nikita Yushchenko
Message-ID:
Date: Tue, 10 Jan 2017 10:31:47 +0300
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
 Icedove/45.5.1
MIME-Version: 1.0
In-Reply-To: <20170110070719.GA17208@lst.de>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit

Christoph, thanks for the clear input.

Arnd, I think that given this discussion, the best short-term solution is
indeed the patch I submitted yesterday. That is, your version + coherent
mask support. With that, set_dma_mask(DMA_BIT_MASK(64)) will succeed and
the hardware will work with swiotlb (see the sketch at the end of this
mail).

A possible next step is to teach swiotlb to dynamically allocate bounce
buffers within the entire arm64 ZONE_DMA.

There is also some hope that R-Car *can* iommu-translate the addresses
that the PCIe module issues to the system bus, although previous attempts
to make that work failed. Additional research is needed here.

Nikita

> On Tue, Jan 10, 2017 at 09:47:21AM +0300, Nikita Yushchenko wrote:
>> I'm now working with HW that:
>> - is in no way "low end" or "obsolete": it has 4G of RAM and 8 CPU
>>   cores, and is being manufactured and developed,
>> - has 75% of its RAM located beyond the first 4G of address space,
>> - can't physically handle incoming PCIe transactions addressed to
>>   memory beyond 4G.
>
> It might not be low end or obsolete, but it's absolutely braindead.
> Your I/O performance will suffer badly for the life of the platform
> because someone tried to save 2 cents, and there is not much we can do
> about it.
>
>> (1) it constantly runs out of swiotlb space, logs are full of warnings
>> despite rate limiting,
>
>> Per my current understanding, blk-level bounce buffering will at least
>> help with (1) - if done properly it will allocate bounce buffers within
>> the entire memory below 4G, not within the dedicated swiotlb space
>> (which is small, and enlarging it makes memory permanently unavailable
>> for other use). This looks simple and safe (in the sense of not
>> breaking unrelated use cases).
>
> Yes. Although there is absolutely no reason why swiotlb could not
> do the same.
>
>> (2) it runs far suboptimally due to bounce-buffering almost all I/O,
>> despite lots of free memory in the area where direct DMA is possible.
>
>> Addressing (2) looks much more difficult because a different memory
>> allocation policy is required for that.
>
> It's basically not possible.
> Every piece of memory in a Linux kernel is a possible source of I/O,
> and depending on the workload type it might even be the prime source
> of I/O.
>
>>> NVMe should never bounce, the fact that it currently possibly does
>>> for highmem pages is a bug.
>>
>> The entire topic is absolutely not related to highmem (i.e. memory not
>> directly addressable by a 32-bit kernel).
>
> I did not say this affects you, but thanks to your mail I noticed that
> NVMe has a suboptimal setting there. Also note that highmem does not
> have to imply a 32-bit kernel, just physical memory that is not in the
> kernel mapping.
>
>> What we are discussing is a hw-originated restriction on where DMA is
>> possible.
>
> Yes, where hw means the SOC, and not the actual I/O device, which is an
> important distinction.
>
>>> Or even better, remove the call to dma_set_mask_and_coherent with
>>> DMA_BIT_MASK(32). NVMe is designed around having proper 64-bit DMA
>>> addressing, there is no point in trying to pretend it works without
>>> that.
>>
>> Are you claiming that the NVMe driver in mainline is intentionally
>> designed not to work on HW that can't do DMA to the entire 64-bit
>> space?
>
> It is not intended to handle the case where the SOC / chipset
> can't handle DMA to all physical memory, yes.
>
>> Such setups do exist and there is interest in making them work.
>
> Sure, but it's not the job of the NVMe driver to work around such a
> broken system. It's something your architecture code needs to do, maybe
> with a bit of core kernel support.
>
>> Quite a few pages used for block I/O are allocated by filemap code -
>> and at allocation point it is known which inode the page is being
>> allocated for. If this inode is from a filesystem located on a known
>> device with known DMA limitations, this knowledge can be used to
>> allocate a page that can be DMAed directly.
>
> But in other cases we might never DMA to it. Or we rarely DMA to it,
> say for a machine running databases or qemu and using lots of direct
> I/O. Or a storage target using its local alloc_pages buffers.
>
>> Sure, there are lots of cases when at allocation time there is no idea
>> what device will run DMA on the page being allocated, or perhaps the
>> page is going to be shared, or whatever. Such cases unavoidably require
>> bounce buffers if the page ends up being used with a device with DMA
>> limitations. But still there are cases when better allocation can
>> remove the need for bounce buffers - without hurting other cases.
>
> It takes your max 1GB of DMA-addressable memory away from other uses,
> and introduces the crazy highmem VM tuning issues we had with big
> 32-bit x86 systems in the past.
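
P.S. For reference, the mask-with-fallback pattern mentioned at the top of
this mail looks roughly like the sketch below. The function name and the
surrounding context are made up for illustration; this is not the actual
nvme-pci probe code, just the generic dma_set_mask_and_coherent() fallback
idiom.

#include <linux/device.h>
#include <linux/dma-mapping.h>

/* Illustrative only: request full 64-bit DMA, fall back to 32-bit. */
static int example_setup_dma(struct device *dev)
{
	/*
	 * Ask for 64-bit DMA first.  On a SoC whose PCIe host bridge
	 * cannot reach all of physical memory, the arch DMA code
	 * (arm64 + swiotlb here) is then expected to bounce-buffer the
	 * transfers that land above the reachable range.
	 */
	if (!dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64)))
		return 0;

	/* Otherwise fall back to 32-bit addressing. */
	return dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32));
}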
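
And the per-inode allocation idea quoted above could, in principle, build
on the existing mapping_set_gfp_mask() helper. A minimal sketch, assuming
the filesystem somehow knows its backing device is limited to 32-bit DMA;
the function name and the place it would be called from are my
assumptions, not an existing patch:

#include <linux/gfp.h>
#include <linux/pagemap.h>

/*
 * Illustrative only: restrict page-cache allocations for this inode's
 * mapping to memory below 4G, so that block I/O to a 32-bit-DMA-only
 * device can go directly to the page instead of via bounce buffers.
 * When and where to apply this is exactly the open question above.
 */
static void example_limit_mapping_to_dma32(struct address_space *mapping)
{
	gfp_t gfp = mapping_gfp_mask(mapping);

	/* Drop any highmem preference and allocate from ZONE_DMA32. */
	mapping_set_gfp_mask(mapping, (gfp & ~__GFP_HIGHMEM) | GFP_DMA32);
}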