From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dan Williams <dan.j.williams@intel.com>
Date: Wed, 10 Jun 2015 09:57:06 -0700
Subject: Re: [PATCH 1/2] scatterlist: use sg_phys()
To: Russell King - ARM Linux
Cc: Joerg Roedel, Jens Axboe, Michal Simek, Greg Kroah-Hartman,
	"linux-kernel@vger.kernel.org", Julia Lawall,
	"dmaengine@vger.kernel.org", David Woodhouse, Christoph Hellwig,
	"linux-arm-kernel@lists.infradead.org"
In-Reply-To: <20150610163121.GP7557@n2100.arm.linux.org.uk>
References: <20150609162659.21910.41681.stgit@dwillia2-desk3.amr.corp.intel.com>
	<20150609162710.21910.57295.stgit@dwillia2-desk3.amr.corp.intel.com>
	<20150610093252.GA20384@8bytes.org>
	<20150610163121.GP7557@n2100.arm.linux.org.uk>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Jun 10, 2015 at 9:31 AM, Russell King - ARM Linux wrote:
> On Wed, Jun 10, 2015 at 09:00:31AM -0700, Dan Williams wrote:
>> On Wed, Jun 10, 2015 at 2:32 AM, Joerg Roedel wrote:
>> > On Tue, Jun 09, 2015 at 12:27:10PM -0400, Dan Williams wrote:
>> >> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
>> >> index 7e7583ddd607..9f6ff6671f01 100644
>> >> --- a/arch/arm/mm/dma-mapping.c
>> >> +++ b/arch/arm/mm/dma-mapping.c
>> >> @@ -1502,7 +1502,7 @@ static int __map_sg_chunk(struct device *dev, struct scatterlist *sg,
>> >>  		return -ENOMEM;
>> >>
>> >>  	for (count = 0, s = sg; count < (size >> PAGE_SHIFT); s = sg_next(s)) {
>> >> -		phys_addr_t phys = page_to_phys(sg_page(s));
>> >> +		phys_addr_t phys = sg_phys(s) - s->offset;
>> >
>> > So sg_phys() turns out to be 'page_to_phys(sg_page(s)) + s->offset',
>> > which turns the above statement into:
>> >
>> > 	page_to_phys(sg_page(s)) + s->offset - s->offset;
>> >
>> > The compiler will probably optimize that away, but it still doesn't
>> > look like an improvement.
>>
>> The goal is to eventually stop leaking struct page deep into the i/o
>> stack. Anything that relies on being able to retrieve a struct page
>> out of an sg entry needs to be converted. I think we need a new
>> helper for this case, "sg_phys_aligned()"?
>
> Why? The aim of the code is not to detect whether the address is
> aligned to a page (if it were, it'd be testing for a zero s->offset,
> or it would be testing for an s->offset that is a multiple of the
> page size).
>
> Let's first understand the code that's being modified (which seems to
> be something that isn't done very much today - people seem to just
> like changing code for the hell of it).
>
> 	for (count = 0, s = sg; count < (size >> PAGE_SHIFT); s = sg_next(s)) {
> 		phys_addr_t phys = page_to_phys(sg_page(s));
> 		unsigned int len = PAGE_ALIGN(s->offset + s->length);
>
> 		if (!is_coherent &&
> 		    !dma_get_attr(DMA_ATTR_SKIP_CPU_SYNC, attrs))
> 			__dma_page_cpu_to_dev(sg_page(s), s->offset,
> 					      s->length, dir);
>
> 		prot = __dma_direction_to_prot(dir);
>
> 		ret = iommu_map(mapping->domain, iova, phys, len, prot);
> 		if (ret < 0)
> 			goto fail;
> 		count += len >> PAGE_SHIFT;
> 		iova += len;
> 	}
>
> What it's doing is trying to map each scatterlist entry via an IOMMU.
> Unsurprisingly, IOMMUs are page based - you can't map a partial IOMMU
> page.
>
> However, what says that the IOMMU page size is the same as the host
> CPU page size - it may not be... so the above code is wrong for a
> completely different reason.
>
> What we _should_ be doing is finding out what the IOMMU minimum page
> size is, and using that in conjunction with the sg_phys() of the
> scatterlist entry to determine the start and length of the mapping,
> such that the IOMMU mapping covers the range described by the
> scatterlist entry. (iommu_map() takes arguments which must be aligned
> to the IOMMU minimum page size.)
>
> However, what we can also see from the above is that we have other
> code here using sg_page() - which is a necessity to be able to
> perform the required DMA cache maintenance to ensure that the data is
> visible to the DMA device. We need to kmap_atomic() these in order to
> flush them, and there's no way other than struct page to represent a
> highmem page.
>
> So, I think your intent to remove struct page from the I/O subsystem
> is a false hope, unless you want to end up rewriting lots of
> architecture code, and you can come up with an alternative method to
> represent highmem pages.

I think there will always be cases that need to do pfn_to_page() for
buffer management. Those configurations will be blocked from seeing
cases where a pfn is not struct page backed. So, you can have highmem
or dma to pmem, but not both.

Christoph, this is why I have Kconfig symbols (DEV_PFN etc.) to gate
whether an arch/config supports pfn-only i/o.
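
[For illustration, the helper Dan floats above could be as small as the
following sketch. The name sg_phys_aligned() is only the suggestion made
in this thread, not an existing kernel API; the body simply restates the
patch's sg_phys(s) - s->offset expression, so it returns the same value
as page_to_phys(sg_page(s)) without handing struct page to the caller.
It would sit next to sg_phys() in include/linux/scatterlist.h:

	/*
	 * Hypothetical helper, named per the suggestion in this thread;
	 * not in mainline. Physical address of the page backing an sg
	 * entry, expressed without dereferencing struct page.
	 */
	static inline phys_addr_t sg_phys_aligned(struct scatterlist *sg)
	{
		return sg_phys(sg) - sg->offset;
	}

With such a helper the hunk above would read "phys_addr_t phys =
sg_phys_aligned(s);", avoiding Joerg's "+ s->offset - s->offset"
objection while still keeping sg_page() out of the caller.]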
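[And a rough sketch of the fix Russell outlines: derive the IOMMU's
minimum page size and align the mapping to it, rather than assuming it
equals the CPU's PAGE_SIZE. This assumes the 4.1-era struct iommu_ops
layout, where pgsize_bitmap lives behind domain->ops; it is untested
illustration against the loop quoted above, not a reviewed patch:

	/*
	 * The IOMMU's minimum page size is the lowest set bit of its
	 * supported-page-size bitmap; iommu_map() requires the address
	 * and length to be aligned to it.
	 */
	unsigned long pgsize =
		1UL << __ffs(mapping->domain->ops->pgsize_bitmap);
	phys_addr_t phys = round_down(sg_phys(s), pgsize);
	size_t len = round_up(sg_phys(s) + s->length, pgsize) - phys;

	ret = iommu_map(mapping->domain, iova, phys, len, prot);

The count and iova bookkeeping in the rest of the loop would need the
same pgsize-based treatment, which is why this is a sketch and not a
patch.]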