From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1758588Ab1CCRWo (ORCPT <rfc822;w@1wt.eu>);
	Thu, 3 Mar 2011 12:22:44 -0500
Received: from usmamail.tilera.com ([206.83.70.70]:15996 "EHLO
	USMAMAIL.TILERA.COM" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1758503Ab1CCRWl (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Thu, 3 Mar 2011 12:22:41 -0500
Message-ID: <4D6FCE5D.4030904@tilera.com>
Date: Thu, 3 Mar 2011 12:22:37 -0500
From: Chris Metcalf <cmetcalf@tilera.com>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.14) Gecko/20110221 Lightning/1.0b2 Thunderbird/3.1.8
MIME-Version: 1.0
To: David Miller <davem@davemloft.net>
CC: <a.p.zijlstra@chello.nl>, <torvalds@linux-foundation.org>,
        <aarcange@redhat.com>, <tglx@linutronix.de>, <riel@redhat.com>,
        <mingo@elte.hu>, <akpm@linux-foundation.org>,
        <linux-kernel@vger.kernel.org>, <linux-arch@vger.kernel.org>,
        <linux-mm@kvack.org>, <benh@kernel.crashing.org>,
        <hugh.dickins@tiscali.co.uk>, <mel@csn.ul.ie>, <npiggin@kernel.dk>,
        <rmk@arm.linux.org.uk>, <schwidefsky@de.ibm.com>
Subject: Re: [RFC][PATCH 2/6] mm: Change flush_tlb_range() to take an mm_struct
References: <20110302180258.956518392@chello.nl>	<AANLkTimhWKhHojZ-9XZGSh3OzfPhvo__Dib9VfeMWoBQ@mail.gmail.com>	<1299102027.1310.39.camel@laptop> <20110302.134735.260066220.davem@davemloft.net>
In-Reply-To: <20110302.134735.260066220.davem@davemloft.net>
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 3/2/2011 4:47 PM, David Miller wrote:
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Wed, 02 Mar 2011 22:40:27 +0100
>
>> On Wed, 2011-03-02 at 11:19 -0800, Linus Torvalds wrote:
>>> On Wed, Mar 2, 2011 at 9:59 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>>>> In order to be able to properly support architecture that want/need to
>>>> support TLB range invalidation, we need to change the
>>>> flush_tlb_range() argument from a vm_area_struct to an mm_struct
>>>> because the range might very well extend past one VMA, or not have a
>>>> VMA at all.
>>> I really don't think this is right. The whole "drop the icache
>>> information" thing is a total anti-optimization, since for some
>>> architectures, the icache flush is the _big_ deal. 
>> Right, so Tile has the I-cache flush from flush_tlb_range(), I'm not
>> sure if that's the right thing to do, Documentation/cachetlb.txt seems
>> to suggest doing it from update_mmu_cache() like things.
> Sparc32 chips that require a valid TLB entry for I-cache flushes do
> the flush from flush_cache_range() and similar.
>
> Sparc64 does not have the "present TLB entry" requirement (since I-cache
> is physical), and we handle it in update_mmu_cache() but only as an
> optimization.  This scheme works in concert with flush_dcache_page().
>
> Either scheme is valid, the former is best when flushing is based upon
> virtual addresses.
>
> But I'll be the first to admit that the interfaces we have for doing
> this stuff is basically nothing more than a set of hooks, with
> assurances that the hooks will be called in specific situations.  Like
> anything else, it's evolved over time based upon architectural needs.

I'm finding it hard to understand how the Sparc code handles icache
coherence.  It seems that the Spitfire MMU is the interesting one, but the
hard case seems to be when a process migrates around to various cores
during execution (thus leaving incoherent icache lines everywhere), and the
page is then freed and re-used for different executable code.  I'd think
that there would have to be xcall IPIs to flush all the cpus' icaches, or
to flush every core in the cpu_vm_mask plus do something at context switch,
but I don't see any of that.  No doubt I'm missing something :-)

Currently on Tile I assume that we flush icaches in cpu_vm_mask at TLB
flush time, and flush the icache on context-switch, since I'm confident I
can reason correctly about that and prove that with this model you can
never have stale icache data.  But the "every context-switch" is a
nuisance, only somewhat mitigated by the fact that with 64 cores we don't
do a lot of context-switching.

To give some more specificity to my thinking, here's one optimization we
could do on Tile, that would both address Peter Zijlstra's generic
architecture in an obvious way, and also improve context switch time:

- Add a "free time" field to struct page.  The free time field could be a
64-bit cycle counter value, or maybe some kind of 32-bit counter that just
increments every time we free, etc., though then we'd need to worry about
handling wraparound.  We'd record the free time when we freed the page back
to the buddy allocator.  Since we only care about executable page frees,
we'd want to use a page bit to track if a given page was ever associated
with an executable PTE, and if it wasn't, we could just record the "free
time" as zero, for book-keeping purposes.

- Keep a per-cpu "icache flush time", with the same timekeeping system as
the page free time.  Every time we flush the whole icache on a cpu, we
update its per-cpu timestamp.

- When writing an executable PTE into the page table, we'd check the
cpu_vm_mask, and any cpu that hadn't done a full icache flush since the
page in question was previously freed would be IPI'ed and would do the
icache flush, making it safe to start running code on the page with its new
code.  We'd also update a per-mm "latest free" timestamp to hold the most
recent "free time" of all the pages faulted in for that mm.

- When context-switching, we'd check the per-mm "latest free" timestamp,
and if the mm held a page that was freed more recently than that cpu's
timestamp, we'd do a full icache flush and update the per-cpu timestamp.

This has several good properties:

- We are unlikely to do much icache flushing, since we only do it when an
executable page is freed back to the buddy allocator and then reused as
executable again.

- If two processes share a cpu, they don't end up having to icache flush at
every context switch.

- We never need to IPI a cpu that isn't actively involved with the process
that is faulting in a new executable page.  (This is particularly important
since we want to avoid disturbing "dataplane" cpus that are running
latency-sensitive tasks.)

- We don't need to worry about vma's at flush_tlb_range() time, thus making
Peter happy :-)

I'm not worrying about kernel module executable pages, since I'm happy to
do much more heavy-weight operations for them, i.e. flush all the icaches
on all the cores.

So: does this general approach seem appropriate, or am I missing a key
subtlety of the Sparc approach that makes this all unnecessary?

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com


From mboxrd@z Thu Jan  1 00:00:00 1970
From: Chris Metcalf <cmetcalf@tilera.com>
Subject: Re: [RFC][PATCH 2/6] mm: Change flush_tlb_range() to take an mm_struct
Date: Thu, 3 Mar 2011 12:22:37 -0500
Message-ID: <4D6FCE5D.4030904@tilera.com>
References: <20110302180258.956518392@chello.nl>	<AANLkTimhWKhHojZ-9XZGSh3OzfPhvo__Dib9VfeMWoBQ@mail.gmail.com>	<1299102027.1310.39.camel@laptop> <20110302.134735.260066220.davem@davemloft.net>
Mime-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 7bit
Return-path: <owner-linux-mm@kvack.org>
In-Reply-To: <20110302.134735.260066220.davem@davemloft.net>
Sender: owner-linux-mm@kvack.org
To: David Miller <davem@davemloft.net>
Cc: a.p.zijlstra@chello.nl, torvalds@linux-foundation.org, aarcange@redhat.com, tglx@linutronix.de, riel@redhat.com, mingo@elte.hu, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, linux-mm@kvack.org, benh@kernel.crashing.org, hugh.dickins@tiscali.co.uk, mel@csn.ul.ie, npiggin@kernel.dk, rmk@arm.linux.org.uk, schwidefsky@de.ibm.com
List-Id: linux-arch.vger.kernel.org

On 3/2/2011 4:47 PM, David Miller wrote:
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Wed, 02 Mar 2011 22:40:27 +0100
>
>> On Wed, 2011-03-02 at 11:19 -0800, Linus Torvalds wrote:
>>> On Wed, Mar 2, 2011 at 9:59 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>>>> In order to be able to properly support architecture that want/need to
>>>> support TLB range invalidation, we need to change the
>>>> flush_tlb_range() argument from a vm_area_struct to an mm_struct
>>>> because the range might very well extend past one VMA, or not have a
>>>> VMA at all.
>>> I really don't think this is right. The whole "drop the icache
>>> information" thing is a total anti-optimization, since for some
>>> architectures, the icache flush is the _big_ deal. 
>> Right, so Tile has the I-cache flush from flush_tlb_range(), I'm not
>> sure if that's the right thing to do, Documentation/cachetlb.txt seems
>> to suggest doing it from update_mmu_cache() like things.
> Sparc32 chips that require a valid TLB entry for I-cache flushes do
> the flush from flush_cache_range() and similar.
>
> Sparc64 does not have the "present TLB entry" requirement (since I-cache
> is physical), and we handle it in update_mmu_cache() but only as an
> optimization.  This scheme works in concert with flush_dcache_page().
>
> Either scheme is valid, the former is best when flushing is based upon
> virtual addresses.
>
> But I'll be the first to admit that the interfaces we have for doing
> this stuff is basically nothing more than a set of hooks, with
> assurances that the hooks will be called in specific situations.  Like
> anything else, it's evolved over time based upon architectural needs.

I'm finding it hard to understand how the Sparc code handles icache
coherence.  It seems that the Spitfire MMU is the interesting one, but the
hard case seems to be when a process migrates around to various cores
during execution (thus leaving incoherent icache lines everywhere), and the
page is then freed and re-used for different executable code.  I'd think
that there would have to be xcall IPIs to flush all the cpus' icaches, or
to flush every core in the cpu_vm_mask plus do something at context switch,
but I don't see any of that.  No doubt I'm missing something :-)

Currently on Tile I assume that we flush icaches in cpu_vm_mask at TLB
flush time, and flush the icache on context-switch, since I'm confident I
can reason correctly about that and prove that with this model you can
never have stale icache data.  But the "every context-switch" is a
nuisance, only somewhat mitigated by the fact that with 64 cores we don't
do a lot of context-switching.

To give some more specificity to my thinking, here's one optimization we
could do on Tile, that would both address Peter Zijlstra's generic
architecture in an obvious way, and also improve context switch time:

- Add a "free time" field to struct page.  The free time field could be a
64-bit cycle counter value, or maybe some kind of 32-bit counter that just
increments every time we free, etc., though then we'd need to worry about
handling wraparound.  We'd record the free time when we freed the page back
to the buddy allocator.  Since we only care about executable page frees,
we'd want to use a page bit to track if a given page was ever associated
with an executable PTE, and if it wasn't, we could just record the "free
time" as zero, for book-keeping purposes.

- Keep a per-cpu "icache flush time", with the same timekeeping system as
the page free time.  Every time we flush the whole icache on a cpu, we
update its per-cpu timestamp.

- When writing an executable PTE into the page table, we'd check the
cpu_vm_mask, and any cpu that hadn't done a full icache flush since the
page in question was previously freed would be IPI'ed and would do the
icache flush, making it safe to start running code on the page with its new
code.  We'd also update a per-mm "latest free" timestamp to hold the most
recent "free time" of all the pages faulted in for that mm.

- When context-switching, we'd check the per-mm "latest free" timestamp,
and if the mm held a page that was freed more recently than that cpu's
timestamp, we'd do a full icache flush and update the per-cpu timestamp.

This has several good properties:

- We are unlikely to do much icache flushing, since we only do it when an
executable page is freed back to the buddy allocator and then reused as
executable again.

- If two processes share a cpu, they don't end up having to icache flush at
every context switch.

- We never need to IPI a cpu that isn't actively involved with the process
that is faulting in a new executable page.  (This is particularly important
since we want to avoid disturbing "dataplane" cpus that are running
latency-sensitive tasks.)

- We don't need to worry about vma's at flush_tlb_range() time, thus making
Peter happy :-)

I'm not worrying about kernel module executable pages, since I'm happy to
do much more heavy-weight operations for them, i.e. flush all the icaches
on all the cores.

So: does this general approach seem appropriate, or am I missing a key
subtlety of the Sparc approach that makes this all unnecessary?

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-arch-owner@vger.kernel.org>
Received: from usmamail.tilera.com ([206.83.70.70]:15996 "EHLO
	USMAMAIL.TILERA.COM" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1758503Ab1CCRWl (ORCPT
	<rfc822;linux-arch@vger.kernel.org>); Thu, 3 Mar 2011 12:22:41 -0500
Message-ID: <4D6FCE5D.4030904@tilera.com>
Date: Thu, 3 Mar 2011 12:22:37 -0500
From: Chris Metcalf <cmetcalf@tilera.com>
MIME-Version: 1.0
Subject: Re: [RFC][PATCH 2/6] mm: Change flush_tlb_range() to take an mm_struct
References: <20110302180258.956518392@chello.nl>	<AANLkTimhWKhHojZ-9XZGSh3OzfPhvo__Dib9VfeMWoBQ@mail.gmail.com>	<1299102027.1310.39.camel@laptop> <20110302.134735.260066220.davem@davemloft.net>
In-Reply-To: <20110302.134735.260066220.davem@davemloft.net>
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 7bit
Sender: linux-arch-owner@vger.kernel.org
List-ID: <linux-arch.vger.kernel.org>
To: David Miller <davem@davemloft.net>
Cc: a.p.zijlstra@chello.nl, torvalds@linux-foundation.org, aarcange@redhat.com, tglx@linutronix.de, riel@redhat.com, mingo@elte.hu, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, linux-mm@kvack.org, benh@kernel.crashing.org, hugh.dickins@tiscali.co.uk, mel@csn.ul.ie, npiggin@kernel.dk, rmk@arm.linux.org.uk, schwidefsky@de.ibm.com
Message-ID: <20110303172237.324EtsxPBqVpJr7wDXf7e6VEKifoycv8spqtioHntxs@z>

On 3/2/2011 4:47 PM, David Miller wrote:
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Wed, 02 Mar 2011 22:40:27 +0100
>
>> On Wed, 2011-03-02 at 11:19 -0800, Linus Torvalds wrote:
>>> On Wed, Mar 2, 2011 at 9:59 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>>>> In order to be able to properly support architecture that want/need to
>>>> support TLB range invalidation, we need to change the
>>>> flush_tlb_range() argument from a vm_area_struct to an mm_struct
>>>> because the range might very well extend past one VMA, or not have a
>>>> VMA at all.
>>> I really don't think this is right. The whole "drop the icache
>>> information" thing is a total anti-optimization, since for some
>>> architectures, the icache flush is the _big_ deal. 
>> Right, so Tile has the I-cache flush from flush_tlb_range(), I'm not
>> sure if that's the right thing to do, Documentation/cachetlb.txt seems
>> to suggest doing it from update_mmu_cache() like things.
> Sparc32 chips that require a valid TLB entry for I-cache flushes do
> the flush from flush_cache_range() and similar.
>
> Sparc64 does not have the "present TLB entry" requirement (since I-cache
> is physical), and we handle it in update_mmu_cache() but only as an
> optimization.  This scheme works in concert with flush_dcache_page().
>
> Either scheme is valid, the former is best when flushing is based upon
> virtual addresses.
>
> But I'll be the first to admit that the interfaces we have for doing
> this stuff is basically nothing more than a set of hooks, with
> assurances that the hooks will be called in specific situations.  Like
> anything else, it's evolved over time based upon architectural needs.

I'm finding it hard to understand how the Sparc code handles icache
coherence.  It seems that the Spitfire MMU is the interesting one, but the
hard case seems to be when a process migrates around to various cores
during execution (thus leaving incoherent icache lines everywhere), and the
page is then freed and re-used for different executable code.  I'd think
that there would have to be xcall IPIs to flush all the cpus' icaches, or
to flush every core in the cpu_vm_mask plus do something at context switch,
but I don't see any of that.  No doubt I'm missing something :-)

Currently on Tile I assume that we flush icaches in cpu_vm_mask at TLB
flush time, and flush the icache on context-switch, since I'm confident I
can reason correctly about that and prove that with this model you can
never have stale icache data.  But the "every context-switch" is a
nuisance, only somewhat mitigated by the fact that with 64 cores we don't
do a lot of context-switching.

To give some more specificity to my thinking, here's one optimization we
could do on Tile, that would both address Peter Zijlstra's generic
architecture in an obvious way, and also improve context switch time:

- Add a "free time" field to struct page.  The free time field could be a
64-bit cycle counter value, or maybe some kind of 32-bit counter that just
increments every time we free, etc., though then we'd need to worry about
handling wraparound.  We'd record the free time when we freed the page back
to the buddy allocator.  Since we only care about executable page frees,
we'd want to use a page bit to track if a given page was ever associated
with an executable PTE, and if it wasn't, we could just record the "free
time" as zero, for book-keeping purposes.

- Keep a per-cpu "icache flush time", with the same timekeeping system as
the page free time.  Every time we flush the whole icache on a cpu, we
update its per-cpu timestamp.

- When writing an executable PTE into the page table, we'd check the
cpu_vm_mask, and any cpu that hadn't done a full icache flush since the
page in question was previously freed would be IPI'ed and would do the
icache flush, making it safe to start running code on the page with its new
code.  We'd also update a per-mm "latest free" timestamp to hold the most
recent "free time" of all the pages faulted in for that mm.

- When context-switching, we'd check the per-mm "latest free" timestamp,
and if the mm held a page that was freed more recently than that cpu's
timestamp, we'd do a full icache flush and update the per-cpu timestamp.

This has several good properties:

- We are unlikely to do much icache flushing, since we only do it when an
executable page is freed back to the buddy allocator and then reused as
executable again.

- If two processes share a cpu, they don't end up having to icache flush at
every context switch.

- We never need to IPI a cpu that isn't actively involved with the process
that is faulting in a new executable page.  (This is particularly important
since we want to avoid disturbing "dataplane" cpus that are running
latency-sensitive tasks.)

- We don't need to worry about vma's at flush_tlb_range() time, thus making
Peter happy :-)

I'm not worrying about kernel module executable pages, since I'm happy to
do much more heavy-weight operations for them, i.e. flush all the icaches
on all the cores.

So: does this general approach seem appropriate, or am I missing a key
subtlety of the Sparc approach that makes this all unnecessary?

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com