Subject: Re: NVMe vs DMA addressing limitations
To: Christoph Hellwig
References: <1483044304-2085-1-git-send-email-nikita.yoush@cogentembedded.com>
 <2723285.JORgusvJv4@wuerfel>
 <9a03c05d-ad4c-0547-d1fe-01edb8b082d6@cogentembedded.com>
 <6374144.HVL0QxNJiT@wuerfel> <20170109205746.GA6274@lst.de>
 <20170110070719.GA17208@lst.de>
Cc: Arnd Bergmann, linux-arm-kernel@lists.infradead.org, Catalin Marinas,
 Will Deacon, linux-kernel@vger.kernel.org,
 linux-renesas-soc@vger.kernel.org, Simon Horman,
 linux-pci@vger.kernel.org, Bjorn Helgaas,
 artemi.ivanov@cogentembedded.com, Keith Busch, Jens Axboe,
 Sagi Grimberg, linux-nvme@lists.infradead.org
From: Nikita Yushchenko
Message-ID:
Date: Tue, 10 Jan 2017 10:31:47 +0300
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
 Icedove/45.5.1
MIME-Version: 1.0
In-Reply-To: <20170110070719.GA17208@lst.de>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit

Christoph, thanks for the clear input.

Arnd, I think that given this discussion, the best short-term solution is
indeed the patch I submitted yesterday. That is, your version + coherent
mask support. With that, set_dma_mask(DMA_BIT_MASK(64)) will succeed and
the hardware will work with swiotlb (see the sketch at the end of this
mail).

A possible next step is to teach swiotlb to dynamically allocate bounce
buffers within the entire arm64 ZONE_DMA.

There is also some hope that R-Car *can* iommu-translate the addresses
that the PCIe module issues to the system bus, although previous attempts
to make that work failed. Additional research is needed here.

Nikita

> On Tue, Jan 10, 2017 at 09:47:21AM +0300, Nikita Yushchenko wrote:
>> I'm now working with HW that:
>> - is in no way "low end" or "obsolete": it has 4G of RAM and 8 CPU
>>   cores, and is being manufactured and developed,
>> - has 75% of its RAM located beyond the first 4G of address space,
>> - can't physically handle incoming PCIe transactions addressed to
>>   memory beyond 4G.
>
> It might not be low end or obsolete, but it's absolutely braindead.
> Your I/O performance will suffer badly for the life of the platform
> because someone tried to save 2 cents, and there is not much we can do
> about it.
>
>> (1) it constantly runs out of swiotlb space, logs are full of warnings
>> despite rate limiting,
>
>> Per my current understanding, blk-level bounce buffering will at least
>> help with (1) - if done properly it will allocate bounce buffers within
>> the entire memory below 4G, not within the dedicated swiotlb space
>> (which is small, and enlarging it makes memory permanently unavailable
>> for other use). This looks simple and safe (in the sense of not
>> breaking unrelated use cases).
>
> Yes. Although there is absolutely no reason why swiotlb could not
> do the same.
>
>> (2) it runs far suboptimally due to bounce-buffering almost all I/O,
>> despite lots of free memory in the area where direct DMA is possible.
>
>> Addressing (2) looks much more difficult because a different memory
>> allocation policy is required for that.
>
> It's basically not possible.
> Every piece of memory in a Linux kernel is a possible source of I/O,
> and depending on the workload type it might even be the prime source
> of I/O.
>
>>> NVMe should never bounce, the fact that it currently possibly does
>>> for highmem pages is a bug.
>>
>> The entire topic is absolutely not related to highmem (i.e. memory not
>> directly addressable by a 32-bit kernel).
>
> I did not say this affects you, but thanks to your mail I noticed that
> NVMe has a suboptimal setting there. Also note that highmem does not
> have to imply a 32-bit kernel, just physical memory that is not in the
> kernel mapping.
>
>> What we are discussing is a hw-originated restriction on where DMA is
>> possible.
>
> Yes, where hw means the SOC, and not the actual I/O device, which is an
> important distinction.
>
>>> Or even better, remove the call to dma_set_mask_and_coherent with
>>> DMA_BIT_MASK(32). NVMe is designed around having proper 64-bit DMA
>>> addressing, there is no point in trying to pretend it works without
>>> that.
>>
>> Are you claiming that the NVMe driver in mainline is intentionally
>> designed not to work on HW that can't do DMA to the entire 64-bit
>> space?
>
> It is not intended to handle the case where the SOC / chipset
> can't handle DMA to all physical memory, yes.
>
>> Such setups do exist and there is interest in making them work.
>
> Sure, but it's not the job of the NVMe driver to work around such a
> broken system. It's something your architecture code needs to do, maybe
> with a bit of core kernel support.
>
>> Quite a few pages used for block I/O are allocated by filemap code -
>> and at allocation point it is known which inode the page is being
>> allocated for. If this inode is from a filesystem located on a known
>> device with known DMA limitations, this knowledge can be used to
>> allocate a page that can be DMAed directly.
>
> But in other cases we might never DMA to it. Or we rarely DMA to it,
> say for a machine running databases or qemu and using lots of direct
> I/O. Or a storage target using its local alloc_pages buffers.
>
>> Sure, there are lots of cases when at allocation time there is no idea
>> what device will run DMA on the page being allocated, or perhaps the
>> page is going to be shared, or whatever. Such cases unavoidably require
>> bounce buffers if the page ends up being used with a device with DMA
>> limitations. But still there are cases when better allocation can
>> remove the need for bounce buffers - without hurting other cases.
>
> It takes your max 1GB of DMA-addressable memory away from other uses,
> and introduces the crazy highmem VM tuning issues we had with big
> 32-bit x86 systems in the past.
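
P.S. For reference, the mask-with-fallback pattern mentioned at the top of
this mail looks roughly like the sketch below. The function name and the
surrounding context are made up for illustration; this is not the actual
nvme-pci probe code, just the generic dma_set_mask_and_coherent() fallback
idiom.

#include <linux/device.h>
#include <linux/dma-mapping.h>

/* Illustrative only: request full 64-bit DMA, fall back to 32-bit. */
static int example_setup_dma(struct device *dev)
{
	/*
	 * Ask for 64-bit DMA first.  On a SoC whose PCIe host bridge
	 * cannot reach all of physical memory, the arch DMA code
	 * (arm64 + swiotlb here) is then expected to bounce-buffer the
	 * transfers that land above the reachable range.
	 */
	if (!dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64)))
		return 0;

	/* Otherwise fall back to 32-bit addressing. */
	return dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32));
}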
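
And the per-inode allocation idea quoted above could, in principle, build
on the existing mapping_set_gfp_mask() helper. A minimal sketch, assuming
the filesystem somehow knows its backing device is limited to 32-bit DMA;
the function name and the place it would be called from are my
assumptions, not an existing patch:

#include <linux/gfp.h>
#include <linux/pagemap.h>

/*
 * Illustrative only: restrict page-cache allocations for this inode's
 * mapping to memory below 4G, so that block I/O to a 32-bit-DMA-only
 * device can go directly to the page instead of via bounce buffers.
 * When and where to apply this is exactly the open question above.
 */
static void example_limit_mapping_to_dma32(struct address_space *mapping)
{
	gfp_t gfp = mapping_gfp_mask(mapping);

	/* Drop any highmem preference and allocate from ZONE_DMA32. */
	mapping_set_gfp_mask(mapping, (gfp & ~__GFP_HIGHMEM) | GFP_DMA32);
}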