* [RFC] page-table walkers vs memory order
@ 2012-07-23 17:34 ` Peter Zijlstra
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2012-07-23 17:34 UTC (permalink / raw)
  To: Linus Torvalds, paulmck, Hugh Dickins
  Cc: Rik van Riel, Andrew Morton, Nick Piggin, linux-kernel,
	Andrea Arcangeli, linux-arch, linux-mm


While staring at mm/huge_memory.c I found a very under-commented
smp_wmb() in __split_huge_page_map(). It turns out that it's copied from
__{pte,pmd,pud}_alloc() but forgot the useful comment (or a reference
thereto).

Now, afaict we're not good, as per that comment. Paul has since
convinced some of us that compiler writers are pure evil and out to get
us.

Therefore we should do what rcu_dereference() does and use
ACCESS_ONCE()/barrier() followed by smp_read_barrier_depends() every time
we dereference a page-table pointer.
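
Spelled out on a single pmd step, the pattern would look something like
this (illustration only, not part of the patch below; pmdp and addr are
assumed to come from the usual pmd_offset()/pmd_addr_end() machinery):

	pmd_t pmd = ACCESS_ONCE(*pmdp);	/* exactly one load; no refetch, no reuse */

	smp_read_barrier_depends();	/* Alpha: order the pte-page reads after it */
	if (!pmd_none(pmd) && !pmd_bad(pmd)) {
		pte_t *pte = pte_offset_map(&pmd, addr);	/* use the local snapshot */
		/* ... walk the ptes, then pte_unmap(pte) ... */
	}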


In particular it looks like things like
mm/memcontrol.c:mem_cgroup_count_precharge(), which use
walk_page_range() under down_read(&mm->mmap_sem) and can thus be
concurrent with __{pte,pmd,pud}_alloc() from faults (and possibly
itself) are quite broken on Alpha and subtly broken for those of us with
'creative' compilers.

Should I go do a more extensive audit of page-table walkers or are we
happy with the status quo?

---
 arch/x86/mm/gup.c |    6 +++---
 mm/pagewalk.c     |   24 ++++++++++++++++++++++++
 2 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index dd74e46..4958fb1 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -150,7 +150,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 
 	pmdp = pmd_offset(&pud, addr);
 	do {
-		pmd_t pmd = *pmdp;
+		pmd_t pmd = ACCESS_ONCE(*pmdp);
 
 		next = pmd_addr_end(addr, end);
 		/*
@@ -220,7 +220,7 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
 
 	pudp = pud_offset(&pgd, addr);
 	do {
-		pud_t pud = *pudp;
+		pud_t pud = ACCESS_ONCE(*pudp);
 
 		next = pud_addr_end(addr, end);
 		if (pud_none(pud))
@@ -280,7 +280,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	local_irq_save(flags);
 	pgdp = pgd_offset(mm, addr);
 	do {
-		pgd_t pgd = *pgdp;
+		pgd_t pgd = ACCESS_ONCE(*pgdp);
 
 		next = pgd_addr_end(addr, end);
 		if (pgd_none(pgd))
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 6c118d0..2ba2a74 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -10,6 +10,14 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	int err = 0;
 
 	pte = pte_offset_map(pmd, addr);
+	/*
+	 * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
+	 * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
+	 * actual dereference of p[gum]d, but that's hidden deep within the
+	 * bowels of {pte,pmd,pud}_offset.
+	 */
+	barrier();
+	smp_read_barrier_depends();
 	for (;;) {
 		err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
 		if (err)
@@ -32,6 +40,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	int err = 0;
 
 	pmd = pmd_offset(pud, addr);
+	/*
+	 * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
+	 * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
+	 * actual dereference of p[gum]d, but that's hidden deep within the
+	 * bowels of {pte,pmd,pud}_offset.
+	 */
+	barrier();
+	smp_read_barrier_depends();
 	do {
 again:
 		next = pmd_addr_end(addr, end);
@@ -77,6 +93,14 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 	int err = 0;
 
 	pud = pud_offset(pgd, addr);
+	/*
+	 * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
+	 * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
+	 * actual dereference of p[gum]d, but that's hidden deep within the
+	 * bowels of {pte,pmd,pud}_offset.
+	 */
+	barrier();
+	smp_read_barrier_depends();
 	do {
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud)) {


* Re: [RFC] page-table walkers vs memory order
  2012-07-23 17:34 ` Peter Zijlstra
@ 2012-07-23 19:27   ` Paul E. McKenney
  -1 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2012-07-23 19:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Hugh Dickins, Rik van Riel, Andrew Morton,
	Nick Piggin, linux-kernel, Andrea Arcangeli, linux-arch,
	linux-mm

On Mon, Jul 23, 2012 at 07:34:30PM +0200, Peter Zijlstra wrote:
> 
> While staring at mm/huge_memory.c I found a very under-commented
> smp_wmb() in __split_huge_page_map(). It turns out that it's copied from
> __{pte,pmd,pud}_alloc() but forgot the useful comment (or a reference
> thereto).
> 
> Now, afaict we're not good, as per that comment. Paul has since
> convinced some of us that compiler writers are pure evil and out to get
> us.

I have seen the glint in their eyes when they discuss optimization
techniques that you would not want your children to know about!

> Therefore we should do what rcu_dereference() does and use
> ACCESS_ONCE()/barrier() followed by smp_read_barrier_depends() every time
> we dereference a page-table pointer.
> 
> 
> In particular it looks like things like
> mm/memcontrol.c:mem_cgroup_count_precharge(), which use
> walk_page_range() under down_read(&mm->mmap_sem) and can thus be
> concurrent with __{pte,pmd,pud}_alloc() from faults (and possibly
> itself) are quite broken on Alpha and subtly broken for those of us with
> 'creative' compilers.

Looks good to me, though given my ignorance of mm, not sure that counts
for much.

							Thanx, Paul

> Should I go do a more extensive audit of page-table walkers or are we
> happy with the status quo?
> 
> ---
>  arch/x86/mm/gup.c |    6 +++---
>  mm/pagewalk.c     |   24 ++++++++++++++++++++++++
>  2 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> index dd74e46..4958fb1 100644
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -150,7 +150,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>  
>  	pmdp = pmd_offset(&pud, addr);
>  	do {
> -		pmd_t pmd = *pmdp;
> +		pmd_t pmd = ACCESS_ONCE(*pmdp);

Here, ACCESS_ONCE() suffices because this is x86-specific code.
Core code would need to worry about Alpha, and would thus also need
smp_read_barrier_depends(), as you in fact do have further down.

>  		next = pmd_addr_end(addr, end);
>  		/*
> @@ -220,7 +220,7 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
>  
>  	pudp = pud_offset(&pgd, addr);
>  	do {
> -		pud_t pud = *pudp;
> +		pud_t pud = ACCESS_ONCE(*pudp);
>  
>  		next = pud_addr_end(addr, end);
>  		if (pud_none(pud))
> @@ -280,7 +280,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  	local_irq_save(flags);
>  	pgdp = pgd_offset(mm, addr);
>  	do {
> -		pgd_t pgd = *pgdp;
> +		pgd_t pgd = ACCESS_ONCE(*pgdp);
>  
>  		next = pgd_addr_end(addr, end);
>  		if (pgd_none(pgd))
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 6c118d0..2ba2a74 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -10,6 +10,14 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	int err = 0;
>  
>  	pte = pte_offset_map(pmd, addr);
> +	/*
> +	 * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
> +	 * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
> +	 * actual dereference of p[gum]d, but that's hidden deep within the
> +	 * bowels of {pte,pmd,pud}_offset.
> +	 */
> +	barrier();
> +	smp_read_barrier_depends();
>  	for (;;) {
>  		err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
>  		if (err)
> @@ -32,6 +40,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>  	int err = 0;
>  
>  	pmd = pmd_offset(pud, addr);
> +	/*
> +	 * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
> +	 * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
> +	 * actual dereference of p[gum]d, but that's hidden deep within the
> +	 * bowels of {pte,pmd,pud}_offset.
> +	 */
> +	barrier();
> +	smp_read_barrier_depends();
>  	do {
>  again:
>  		next = pmd_addr_end(addr, end);
> @@ -77,6 +93,14 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
>  	int err = 0;
>  
>  	pud = pud_offset(pgd, addr);
> +	/*
> +	 * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
> +	 * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
> +	 * actual dereference of p[gum]d, but that's hidden deep within the
> +	 * bowels of {pte,pmd,pud}_offset.
> +	 */
> +	barrier();
> +	smp_read_barrier_depends();
>  	do {
>  		next = pud_addr_end(addr, end);
>  		if (pud_none_or_clear_bad(pud)) {
> 


* Re: [RFC] page-table walkers vs memory order
  2012-07-23 17:34 ` Peter Zijlstra
@ 2012-07-24 21:51   ` Hugh Dickins
  -1 siblings, 0 replies; 42+ messages in thread
From: Hugh Dickins @ 2012-07-24 21:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Paul E. McKenney, Rik van Riel, Andrew Morton,
	Nick Piggin, Andrea Arcangeli, linux-kernel, linux-arch,
	linux-mm

On Mon, 23 Jul 2012, Peter Zijlstra wrote:
> 
> While staring at mm/huge_memory.c I found a very under-commented
> smp_wmb() in __split_huge_page_map(). It turns out that it's copied from
> __{pte,pmd,pud}_alloc() but forgot the useful comment (or a reference
> thereto).
> 
> Now, afaict we're not good, as per that comment. Paul has since
> convinced some of us that compiler writers are pure evil and out to get
> us.
> 
> Therefore we should do what rcu_dereference() does and use
> ACCESS_ONCE()/barrier() followed by smp_read_barrier_depends() every time
> we dereference a page-table pointer.
> 
> 
> In particular it looks like things like
> mm/memcontrol.c:mem_cgroup_count_precharge(), which use
> walk_page_range() under down_read(&mm->mmap_sem) and can thus be
> concurrent with __{pte,pmd,pud}_alloc() from faults (and possibly
> itself) are quite broken on Alpha

The Alpha pmd_offset() and pte_offset_map() already contain an
smp_read_barrier_depends() (362a61ad from Nick); with comment that
it's not needed on the pgd, and I presume the pud level is folded.
Does Alpha really need more of them, as you have put below?

> and subtly broken for those of us with 'creative' compilers.

I don't want to fight against ACCESS_ONCE() (or barrier(): that's
interesting, thank you, I hadn't seen it used as an ACCESS_ONCE()
substitute before); but I do want to question it a little.

I'm totally unclear whether the kernel ever gets built with these
'creative' compilers that you refer to.  Is ACCESS_ONCE() a warning
of where some future compiler would be permitted to mess with our
assumptions?  Or is it actually saving us already today?  Would we
know?  Could there be a boottime test that would tell us?  Is it
likely that a future compiler would have an "--access_once"
option that the kernel build would want to turn on?

Those may all be questions for Paul!

> 
> Should I go do a more extensive audit of page-table walkers or are we
> happy with the status quo?

I do love the status quo, but an audit would be welcome.  When
it comes to patches, personally I tend to prefer ACCESS_ONCE() and
smp_read_barrier_depends() and accompanying comments to be hidden away
in the underlying macros or inlines where reasonable, rather than
repeated all over; but I may have my priorities wrong on that.

The last time we rewrote the main pgd-pud-pmd-pte walkers,
we believed that no ACCESS_ONCE() was necessary, because although a
pgd-pud-pmd entry might be racily instantiated at any instant, it
could never change beneath us - the freeing of page tables happens
only when we cannot reach them by other routes.

(Never quite true: those _clear_bad() things can zero entries at any
instant, but we're already in a bad place when those come into play,
so we never worried about racing against them.)

Since then, I think THP has made the rules more complicated; but I
believe Andrea paid a great deal of attention to that kind of issue.

I suspect your arch/x86/mm/gup.c ACCESS_ONCE()s are necessary:
gup_fast() breaks as many rules as it can, and in particular may
be racing with the freeing of page tables; but I'm not so sure
about the pagewalk mods - we could say "cannot do any harm",
but I don't like adding lines on that basis.

Hugh

> 
> ---
>  arch/x86/mm/gup.c |    6 +++---
>  mm/pagewalk.c     |   24 ++++++++++++++++++++++++
>  2 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> index dd74e46..4958fb1 100644
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -150,7 +150,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>  
>  	pmdp = pmd_offset(&pud, addr);
>  	do {
> -		pmd_t pmd = *pmdp;
> +		pmd_t pmd = ACCESS_ONCE(*pmdp);
>  
>  		next = pmd_addr_end(addr, end);
>  		/*
> @@ -220,7 +220,7 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
>  
>  	pudp = pud_offset(&pgd, addr);
>  	do {
> -		pud_t pud = *pudp;
> +		pud_t pud = ACCESS_ONCE(*pudp);
>  
>  		next = pud_addr_end(addr, end);
>  		if (pud_none(pud))
> @@ -280,7 +280,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  	local_irq_save(flags);
>  	pgdp = pgd_offset(mm, addr);
>  	do {
> -		pgd_t pgd = *pgdp;
> +		pgd_t pgd = ACCESS_ONCE(*pgdp);
>  
>  		next = pgd_addr_end(addr, end);
>  		if (pgd_none(pgd))
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 6c118d0..2ba2a74 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -10,6 +10,14 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	int err = 0;
>  
>  	pte = pte_offset_map(pmd, addr);
> +	/*
> +	 * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
> +	 * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
> +	 * actual dereference of p[gum]d, but that's hidden deep within the
> +	 * bowels of {pte,pmd,pud}_offset.
> +	 */
> +	barrier();
> +	smp_read_barrier_depends();
>  	for (;;) {
>  		err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
>  		if (err)
> @@ -32,6 +40,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>  	int err = 0;
>  
>  	pmd = pmd_offset(pud, addr);
> +	/*
> +	 * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
> +	 * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
> +	 * actual dereference of p[gum]d, but that's hidden deep within the
> +	 * bowels of {pte,pmd,pud}_offset.
> +	 */
> +	barrier();
> +	smp_read_barrier_depends();
>  	do {
>  again:
>  		next = pmd_addr_end(addr, end);
> @@ -77,6 +93,14 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
>  	int err = 0;
>  
>  	pud = pud_offset(pgd, addr);
> +	/*
> +	 * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
> +	 * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
> +	 * actual dereference of p[gum]d, but that's hidden deep within the
> +	 * bowels of {pte,pmd,pud}_offset.
> +	 */
> +	barrier();
> +	smp_read_barrier_depends();
>  	do {
>  		next = pud_addr_end(addr, end);
>  		if (pud_none_or_clear_bad(pud)) {
> 
> 

* Re: [RFC] page-table walkers vs memory order
  2012-07-24 21:51   ` Hugh Dickins
@ 2012-07-25 17:56     ` Paul E. McKenney
  -1 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2012-07-25 17:56 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Peter Zijlstra, Linus Torvalds, Rik van Riel, Andrew Morton,
	Nick Piggin, Andrea Arcangeli, linux-kernel, linux-arch,
	linux-mm

On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> On Mon, 23 Jul 2012, Peter Zijlstra wrote:
> > 
> > While staring at mm/huge_memory.c I found a very under-commented
> > smp_wmb() in __split_huge_page_map(). It turns out that it's copied from
> > __{pte,pmd,pud}_alloc() but forgot the useful comment (or a reference
> > thereto).
> > 
> > Now, afaict we're not good, as per that comment. Paul has since
> > convinced some of us that compiler writers are pure evil and out to get
> > us.
> > 
> > Therefore we should do what rcu_dereference() does and use
> > ACCESS_ONCE()/barrier() followed by smp_read_barrier_depends() every time
> > we dereference a page-table pointer.
> > 
> > 
> > In particular it looks like things like
> > mm/memcontrol.c:mem_cgroup_count_precharge(), which use
> > walk_page_range() under down_read(&mm->mmap_sem) and can thus be
> > concurrent with __{pte,pmd,pud}_alloc() from faults (and possibly
> > itself) are quite broken on Alpha
> 
> The Alpha pmd_offset() and pte_offset_map() already contain an
> smp_read_barrier_depends() (362a61ad from Nick); with comment that
> it's not needed on the pgd, and I presume the pud level is folded.
> Does Alpha really need more of them, as you have put below?
> 
> > and subtly broken for those of us with 'creative' compilers.
> 
> I don't want to fight against ACCESS_ONCE() (or barrier(): that's
> interesting, thank you, I hadn't seen it used as an ACCESS_ONCE()
> substitute before); but I do want to question it a little.
> 
> I'm totally unclear whether the kernel ever gets built with these
> 'creative' compilers that you refer to.  Is ACCESS_ONCE() a warning
> of where some future compiler would be permitted to mess with our
> assumptions?  Or is it actually saving us already today?  Would we
> know?  Could there be a boottime test that would tell us?  Is it
> likely that a future compiler would have an "--access_once"
> option that the kernel build would want to turn on?
> 
> Those may all be questions for Paul!

The problem is that, unless you tell it otherwise, the compiler is
permitted to assume that the code that it is generating is the only thing
active in that address space at that time.  So the compiler might know
that it already has a perfectly good copy of that value somewhere in
its registers, or it might decide to fetch the value twice rather than
once due to register pressure, either of which can be fatal in SMP code.
And then there are more aggressive optimizations as well.

ACCESS_ONCE() is a way of telling the compiler to access the value
once, regardless of what cute single-threaded optimizations it might
otherwise want to apply.
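
For reference, the definition is essentially just a volatile cast, which
is what forces the single access (roughly what include/linux/compiler.h
has today):

	#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))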

							Thanx, Paul

> > Should I go do a more extensive audit of page-table walkers or are we
> > happy with the status quo?
> 
> I do love the status quo, but an audit would be welcome.  When
> it comes to patches, personally I tend to prefer ACCESS_ONCE() and
> smp_read_barrier_depends() and accompanying comments to be hidden away
> in the underlying macros or inlines where reasonable, rather than
> repeated all over; but I may have my priorities wrong on that.
> 
> The last time we rewrote the main pgd-pud-pmd-pte walkers,
> we believed that no ACCESS_ONCE() was necessary, because although a
> pgd-pud-pmd entry might be racily instantiated at any instant, it
> could never change beneath us - the freeing of page tables happens
> only when we cannot reach them by other routes.
> 
> (Never quite true: those _clear_bad() things can zero entries at any
> instant, but we're already in a bad place when those come into play,
> so we never worried about racing against them.)
> 
> Since then, I think THP has made the rules more complicated; but I
> believe Andrea paid a great deal of attention to that kind of issue.
> 
> I suspect your arch/x86/mm/gup.c ACCESS_ONCE()s are necessary:
> gup_fast() breaks as many rules as it can, and in particular may
> be racing with the freeing of page tables; but I'm not so sure
> about the pagewalk mods - we could say "cannot do any harm",
> but I don't like adding lines on that basis.
> 
> Hugh
> 
> > 
> > ---
> >  arch/x86/mm/gup.c |    6 +++---
> >  mm/pagewalk.c     |   24 ++++++++++++++++++++++++
> >  2 files changed, 27 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> > index dd74e46..4958fb1 100644
> > --- a/arch/x86/mm/gup.c
> > +++ b/arch/x86/mm/gup.c
> > @@ -150,7 +150,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> >  
> >  	pmdp = pmd_offset(&pud, addr);
> >  	do {
> > -		pmd_t pmd = *pmdp;
> > +		pmd_t pmd = ACCESS_ONCE(*pmdp);
> >  
> >  		next = pmd_addr_end(addr, end);
> >  		/*
> > @@ -220,7 +220,7 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
> >  
> >  	pudp = pud_offset(&pgd, addr);
> >  	do {
> > -		pud_t pud = *pudp;
> > +		pud_t pud = ACCESS_ONCE(*pudp);
> >  
> >  		next = pud_addr_end(addr, end);
> >  		if (pud_none(pud))
> > @@ -280,7 +280,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
> >  	local_irq_save(flags);
> >  	pgdp = pgd_offset(mm, addr);
> >  	do {
> > -		pgd_t pgd = *pgdp;
> > +		pgd_t pgd = ACCESS_ONCE(*pgdp);
> >  
> >  		next = pgd_addr_end(addr, end);
> >  		if (pgd_none(pgd))
> > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > index 6c118d0..2ba2a74 100644
> > --- a/mm/pagewalk.c
> > +++ b/mm/pagewalk.c
> > @@ -10,6 +10,14 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> >  	int err = 0;
> >  
> >  	pte = pte_offset_map(pmd, addr);
> > +	/*
> > +	 * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
> > +	 * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
> > +	 * actual dereference of p[gum]d, but that's hidden deep within the
> > +	 * bowels of {pte,pmd,pud}_offset.
> > +	 */
> > +	barrier();
> > +	smp_read_barrier_depends();
> >  	for (;;) {
> >  		err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
> >  		if (err)
> > @@ -32,6 +40,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> >  	int err = 0;
> >  
> >  	pmd = pmd_offset(pud, addr);
> > +	/*
> > +	 * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
> > +	 * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
> > +	 * actual dereference of p[gum]d, but that's hidden deep within the
> > +	 * bowels of {pte,pmd,pud}_offset.
> > +	 */
> > +	barrier();
> > +	smp_read_barrier_depends();
> >  	do {
> >  again:
> >  		next = pmd_addr_end(addr, end);
> > @@ -77,6 +93,14 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
> >  	int err = 0;
> >  
> >  	pud = pud_offset(pgd, addr);
> > +	/*
> > +	 * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
> > +	 * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
> > +	 * actual dereference of p[gum]d, but that's hidden deep within the
> > +	 * bowels of {pte,pmd,pud}_offset.
> > +	 */
> > +	barrier();
> > +	smp_read_barrier_depends();
> >  	do {
> >  		next = pud_addr_end(addr, end);
> >  		if (pud_none_or_clear_bad(pud)) {
> > 
> > 
> 


* Re: [RFC] page-table walkers vs memory order
  2012-07-25 17:56     ` Paul E. McKenney
@ 2012-07-25 20:26       ` Hugh Dickins
  -1 siblings, 0 replies; 42+ messages in thread
From: Hugh Dickins @ 2012-07-25 20:26 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Linus Torvalds, Rik van Riel, Andrew Morton,
	Nick Piggin, Andrea Arcangeli, linux-kernel, linux-arch,
	linux-mm

On Wed, 25 Jul 2012, Paul E. McKenney wrote:
> On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> > 
> > I'm totally unclear whether the kernel ever gets built with these
> > 'creative' compilers that you refer to.  Is ACCESS_ONCE() a warning
> > of where some future compiler would be permitted to mess with our
> > assumptions?  Or is it actually saving us already today?  Would we
> > know?  Could there be a boottime test that would tell us?  Is it
> > likely that a future compiler would have an "--access_once"
> > option that the kernel build would want to turn on?
> 
> The problem is that, unless you tell it otherwise, the compiler is
> permitted to assume that the code that it is generating is the only thing
> active in that address space at that time.  So the compiler might know
> that it already has a perfectly good copy of that value somewhere in
> its registers, or it might decide to fetch the value twice rather than
> once due to register pressure, either of which can be fatal in SMP code.
> And then there are more aggressive optimizations as well.
> 
> ACCESS_ONCE() is a way of telling the compiler to access the value
> once, regardless of what cute single-threaded optimizations that it
> otherwise might want to apply.

Right, but you say "might": I have never heard it asserted that we do
build the kernel with a compiler which actually makes such optimizations.

There's a lot of other surprising things which a compiler is permitted
to do, but we would simply not use such a compiler to build the kernel.

Does some version of gcc, under the options which we insist upon,
make such optimizations on any of the architectures which we support?

Or is there some other compiler in use on the kernel, which makes
such optimizations?  It seems a long time since I heard of building
the kernel with icc.  clang?

I don't mind the answer "Yes, you idiot" - preferably with an example
or two of which compiler and which piece of code it has bitten us on.
I don't mind the answer "We just don't know" if that's the case.

But I'd like a better idea of how much to worry: is ACCESS_ONCE
demonstrably needed today, or rather future-proofing and documentation?

Thanks,
Hugh

* Re: [RFC] page-table walkers vs memory order
  2012-07-25 20:26       ` Hugh Dickins
@ 2012-07-25 21:12         ` Paul E. McKenney
  -1 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2012-07-25 21:12 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Peter Zijlstra, Linus Torvalds, Rik van Riel, Andrew Morton,
	Nick Piggin, Andrea Arcangeli, linux-kernel, linux-arch,
	linux-mm

On Wed, Jul 25, 2012 at 01:26:43PM -0700, Hugh Dickins wrote:
> On Wed, 25 Jul 2012, Paul E. McKenney wrote:
> > On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> > > 
> > > I'm totally unclear whether the kernel ever gets built with these
> > > 'creative' compilers that you refer to.  Is ACCESS_ONCE() a warning
> > > of where some future compiler would be permitted to mess with our
> > > assumptions?  Or is it actually saving us already today?  Would we
> > > know?  Could there be a boottime test that would tell us?  Is it
> > > likely that a future compiler would have an "--access_once"
> > > option that the kernel build would want to turn on?
> > 
> > The problem is that, unless you tell it otherwise, the compiler is
> > permitted to assume that the code that it is generating is the only thing
> > active in that address space at that time.  So the compiler might know
> > that it already has a perfectly good copy of that value somewhere in
> > its registers, or it might decide to fetch the value twice rather than
> > once due to register pressure, either of which can be fatal in SMP code.
> > And then there are more aggressive optimizations as well.
> > 
> > ACCESS_ONCE() is a way of telling the compiler to access the value
> > once, regardless of what cute single-threaded optimizations that it
> > otherwise might want to apply.
> 
> Right, but you say "might": I have never heard it asserted, that we do
> build the kernel with a compiler which actually makes such optimizations.

The compiler we use today can hurt us, and has, with double-fetching
and old-value-reuse optimizations.  There have been several cases where it
"optimized" things like "while (foo)" into "tmp = foo; while (tmp)"
in the Linux kernel, which have been dealt with by recoding.
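
A minimal sketch of that hazard ("pending" is just a made-up flag that an
interrupt handler clears):

	static int pending;		/* hypothetical flag, cleared by an interrupt handler */

	static void spin_until_cleared(void)
	{
		/*
		 * The compiler is allowed to rewrite this as
		 *	tmp = pending; while (tmp) cpu_relax();
		 * after which we spin forever on the stale value.
		 */
		while (pending)
			cpu_relax();
	}

	static void spin_until_cleared_fixed(void)
	{
		while (ACCESS_ONCE(pending))	/* one real load per iteration */
			cpu_relax();
	}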

You might argue that the compiler cannot reasonably apply such an
optimization in some given case, but the compiler does much more detailed
analysis of the code than most people are willing to do (certainly more
than I am usually willing to do!), so I believe that a little paranoia is
quite worthwhile.

> There's a lot of other surprising things which a compiler is permitted
> to do, but we would simply not use such a compiler to build the kernel.

Unless we get the gcc folks to build and boot the Linux kernel as part
of their test suite (maybe they already do, but not that I know of),
how would either they or we know that they had deployed a destructive
optimization?

> Does some version of gcc, under the options which we insist upon,
> make such optimizations on any of the architectures which we support?

Pretty much any production-quality compiler will do double-fetch
and old-value-reuse optimizations, the former especially on 32-bit
x86.  I don't know of any production-quality compilers that do value
speculation, which would make the compiler act like DEC Alpha hardware,
and I would hope that if this does appear, (1) we would have warning
and (2) it could be turned off.  But there has been a lot of work on
this topic, so we would be foolish to rule it out.

But the currently deployed optimizations can already cause enough trouble.

> Or is there some other compiler in use on the kernel, which makes
> such optimizations?  It seems a long time since I heard of building
> the kernel with icc.  clang?
> 
> I don't mind the answer "Yes, you idiot" - preferably with an example
> or two of which compiler and which piece of code it has bitten us on.
> I don't mind the answer "We just don't know" if that's the case.
> 
> But I'd like a better idea of how much to worry: is ACCESS_ONCE
> demonstrably needed today, or rather future-proofing and documentation?

Both.  If you are coding "while (foo)" where "foo" can be changed by an
interrupt handler, you had better instead write "while (ACCESS_ONCE(foo))"
or something similar, because most compilers are happy to optimize your
loop into an infinite loop in that case.  There are places in the Linux
kernel that would have problems if the compiler decided to refetch a
value -- if a pointer was changed in the meantime, part of your code
might be working on the old structure, and part on the new structure.
This really can happen today, and this is why rcu_dereference() contains
an ACCESS_ONCE().
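
Stripped of its debug and sparse checks, rcu_dereference() boils down to
roughly this, which is also the shape being proposed for the page-table
loads:

	#define rcu_dereference_sketch(p)			\
	({							\
		typeof(p) __p = ACCESS_ONCE(p);			\
		smp_read_barrier_depends();	/* Alpha */	\
		__p;						\
	})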

If you are making lockless non-atomic access to a variable, I strongly
suggest ACCESS_ONCE() or something similar even if you cannot see how
the compiler can mess you up, especially in cases involving a lot of
inline functions.  In this case, the compiler can be looking at quite
a bit of code and optimizing across the entire mess.
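
For instance (a minimal sketch; "stop_flag" is a made-up variable set
by an interrupt handler, while ACCESS_ONCE() and cpu_relax() are the
real kernel primitives):

int stop_flag;

void wait_for_stop(void)
{
	/*
	 * Without the volatile cast the compiler may load stop_flag
	 * once and turn the loop into "if (!stop_flag) for (;;);".
	 * ACCESS_ONCE() forces a fresh load on every iteration.
	 */
	while (!ACCESS_ONCE(stop_flag))
		cpu_relax();
}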

/me wonders what he stepped into with this email thread.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] page-table walkers vs memory order
  2012-07-25 21:12         ` Paul E. McKenney
@ 2012-07-25 22:09           ` Hugh Dickins
  -1 siblings, 0 replies; 42+ messages in thread
From: Hugh Dickins @ 2012-07-25 22:09 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Linus Torvalds, Rik van Riel, Andrew Morton,
	Nick Piggin, Andrea Arcangeli, linux-kernel, linux-arch,
	linux-mm

On Wed, 25 Jul 2012, Paul E. McKenney wrote:
> On Wed, Jul 25, 2012 at 01:26:43PM -0700, Hugh Dickins wrote:
> > On Wed, 25 Jul 2012, Paul E. McKenney wrote:
> > > On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> > > > 
> > > > I'm totally unclear whether the kernel ever gets built with these
> > > > 'creative' compilers that you refer to.  Is ACCESS_ONCE() a warning
> > > > of where some future compiler would be permitted to mess with our
> > > > assumptions?  Or is it actually saving us already today?  Would we
> > > > know?  Could there be a boottime test that would tell us?  Is it
> > > > likely that a future compiler would have an "--access_once"
> > > > option that the kernel build would want to turn on?
> > > 
> > > The problem is that, unless you tell it otherwise, the compiler is
> > > permitted to assume that the code that it is generating is the only thing
> > > active in that address space at that time.  So the compiler might know
> > > that it already has a perfectly good copy of that value somewhere in
> > > its registers, or it might decide to fetch the value twice rather than
> > > once due to register pressure, either of which can be fatal in SMP code.
> > > And then there are more aggressive optimizations as well.
> > > 
> > > ACCESS_ONCE() is a way of telling the compiler to access the value
> > > once, regardless of what cute single-threaded optimizations that it
> > > otherwise might want to apply.
> > 
> > Right, but you say "might": I have never heard it asserted, that we do
> > build the kernel with a compiler which actually makes such optimizations.
> 
> The compiler we use today can and has hurt us with double-fetching
> and old-value-reuse optimizations.  There have been several that have
> "optimized" things like "while (foo)" into "tmp = foo; while (tmp)"
> in the Linux kernel, which have been dealt with by recoding.

Ah yes, those: I think we need ACCESS_EVERY_TIME() for those ones ;)

I consider the double-fetching ones more insidious,
less obviously in need of the volatile cast.

> 
> You might argue that the compiler cannot reasonably apply such an
> optimization in some given case, but the compiler does much more detailed
> analysis of the code than most people are willing to do (certainly more
> than I am usually willing to do!), so I believe that a little paranoia is
> quite worthwhile.
> 
> > There's a lot of other surprising things which a compiler is permitted
> > to do, but we would simply not use such a compiler to build the kernel.
> 
> Unless we get the gcc folks to build and boot the Linux kernel as part
> of their test suite (maybe they already do, but not that I know of),
> how would either they or we know that they had deployed a destructive
> optimization?

We find out after it hits us, and someone studies the disassembly -
if we're lucky enough to crash near the origin of the problem.

> 
> > Does some version of gcc, under the options which we insist upon,
> > make such optimizations on any of the architectures which we support?
> 
> Pretty much any production-quality compiler will do double-fetch
> and old-value-reuse optimizations, the former especially on 32-bit
> x86.

That makes good sense, yes: so, under register pressure, they may
refetch from global memory, instead of using a temporary on local stack.
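
As a made-up illustration of that refetch (struct item, shared_ptr and
the two reader functions are invented names, not kernel code):

struct item {
	int a;
	int b;
};

struct item *shared_ptr;	/* replaced concurrently by another CPU */

int broken_reader(void)
{
	if (!shared_ptr)
		return 0;
	/*
	 * The compiler is free to reload shared_ptr for each use
	 * below (e.g. under register pressure), so ->a and ->b may
	 * come from two different items, or even from a NULL pointer.
	 */
	return shared_ptr->a + shared_ptr->b;
}

int fixed_reader(void)
{
	struct item *p = ACCESS_ONCE(shared_ptr);	/* one load */

	if (!p)
		return 0;
	return p->a + p->b;	/* all uses go through the local copy */
}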

> I don't know of any production-quality compilers that do value
> speculation, which would make the compiler act like DEC Alpha hardware,
> and I would hope that if this does appear, (1) we would have warning
> and (2) it could be turned off.  But there has been a lot of work on
> this topic, so we would be foolish to rule it out.

I think you're justified in expecting both (1) and (2) there.

> 
> But the currently deployed optimizations can already cause enough trouble.
> 
> > Or is there some other compiler in use on the kernel, which makes
> > such optimizations?  It seems a long time since I heard of building
> > the kernel with icc.  clang?
> > 
> > I don't mind the answer "Yes, you idiot" - preferably with an example
> > or two of which compiler and which piece of code it has bitten us on.
> > I don't mind the answer "We just don't know" if that's the case.
> > 
> > But I'd like a better idea of how much to worry: is ACCESS_ONCE
> > demonstrably needed today, or rather future-proofing and documentation?
> 
> Both.  If you are coding "while (foo)" where "foo" can be changed by an
> interrupt handler, you had better instead write "while (ACCESS_ONCE(foo))"
> or something similar, because most compilers are happy to optimize your
> loop into an infinite loop in that case.  There are places in the Linux
> kernel that would have problems if the compiler decided to refetch a
> value -- if a pointer was changed in the meantime, part of your code
> might be working on the old structure, and part on the new structure.
> This really can happen today, and this is why rcu_dereference() contains
> an ACCESS_ONCE().
> 
> If you are making lockless non-atomic access to a variable, I strongly
> suggest ACCESS_ONCE() or something similar even if you cannot see how
> the compiler can mess you up, especially in cases involving a lot of
> inline functions.  In this case, the compiler can be looking at quite
> a bit of code and optimizing across the entire mess.

Thank you for your fuller reply, Paul: I should be able to hold that
i386 register pressure example in mind in future (not, of course,
that it would be limited to i386 at all).

> 
> /me wonders what he stepped into with this email thread.  ;-)
> 
> 							Thanx, Paul

Come on, it wasn't that painful, was it?
Just a quick extraction of info ;-)

Thanks,
Hugh

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] page-table walkers vs memory order
  2012-07-25 22:09           ` Hugh Dickins
@ 2012-07-25 22:37             ` Paul E. McKenney
  -1 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2012-07-25 22:37 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Peter Zijlstra, Linus Torvalds, Rik van Riel, Andrew Morton,
	Nick Piggin, Andrea Arcangeli, linux-kernel, linux-arch,
	linux-mm

On Wed, Jul 25, 2012 at 03:09:48PM -0700, Hugh Dickins wrote:
> On Wed, 25 Jul 2012, Paul E. McKenney wrote:
> > On Wed, Jul 25, 2012 at 01:26:43PM -0700, Hugh Dickins wrote:
> > > On Wed, 25 Jul 2012, Paul E. McKenney wrote:
> > > > On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> > > > > 
> > > > > I'm totally unclear whether the kernel ever gets built with these
> > > > > 'creative' compilers that you refer to.  Is ACCESS_ONCE() a warning
> > > > > of where some future compiler would be permitted to mess with our
> > > > > assumptions?  Or is it actually saving us already today?  Would we
> > > > > know?  Could there be a boottime test that would tell us?  Is it
> > > > > likely that a future compiler would have an "--access_once"
> > > > > option that the kernel build would want to turn on?
> > > > 
> > > > The problem is that, unless you tell it otherwise, the compiler is
> > > > permitted to assume that the code that it is generating is the only thing
> > > > active in that address space at that time.  So the compiler might know
> > > > that it already has a perfectly good copy of that value somewhere in
> > > > its registers, or it might decide to fetch the value twice rather than
> > > > once due to register pressure, either of which can be fatal in SMP code.
> > > > And then there are more aggressive optimizations as well.
> > > > 
> > > > ACCESS_ONCE() is a way of telling the compiler to access the value
> > > > once, regardless of what cute single-threaded optimizations that it
> > > > otherwise might want to apply.
> > > 
> > > Right, but you say "might": I have never heard it asserted, that we do
> > > build the kernel with a compiler which actually makes such optimizations.
> > 
> > The compiler we use today can and has hurt us with double-fetching
> > and old-value-reuse optimizations.  There have been several that have
> > "optimized" things like "while (foo)" into "tmp = foo; while (tmp)"
> > in the Linux kernel, which have been dealt with by recoding.
> 
> Ah yes, those: I think we need ACCESS_EVERY_TIME() for those ones ;)

;-) ;-) ;-)

> I consider the double-fetching ones more insidious,
> less obviously in need of the volatile cast.

Agreed!

> > You might argue that the compiler cannot reasonably apply such an
> > optimization in some given case, but the compiler does much more detailed
> > analysis of the code than most people are willing to do (certainly more
> > than I am usually willing to do!), so I believe that a little paranoia is
> > quite worthwhile.
> > 
> > > There's a lot of other surprising things which a compiler is permitted
> > > to do, but we would simply not use such a compiler to build the kernel.
> > 
> > Unless we get the gcc folks to build and boot the Linux kernel as part
> > of their test suite (maybe they already do, but not that I know of),
> > how would either they or we know that they had deployed a destructive
> > optimization?
> 
> We find out after it hits us, and someone studies the disassembly -
> if we're lucky enough to crash near the origin of the problem.

Color me unreassured.  ;-)

> > > Does some version of gcc, under the options which we insist upon,
> > > make such optimizations on any of the architectures which we support?
> > 
> > Pretty much any production-quality compiler will do double-fetch
> > and old-value-reuse optimizations, the former especially on 32-bit
> > x86.
> 
> That makes good sense, yes: so, under register pressure, they may
> refetch from global memory, instead of using a temporary on local stack.
> 
> > I don't know of any production-quality compilers that do value
> > speculation, which would make the compiler act like DEC Alpha hardware,
> > and I would hope that if this does appear, (1) we would have warning
> > and (2) it could be turned off.  But there has been a lot of work on
> > this topic, so we would be foolish to rule it out.
> 
> I think you're justified in expecting both (1) and (2) there.

Here is hoping!

> > But the currently deployed optimizations can already cause enough trouble.
> > 
> > > Or is there some other compiler in use on the kernel, which makes
> > > such optimizations?  It seems a long time since I heard of building
> > > the kernel with icc.  clang?
> > > 
> > > I don't mind the answer "Yes, you idiot" - preferably with an example
> > > or two of which compiler and which piece of code it has bitten us on.
> > > I don't mind the answer "We just don't know" if that's the case.
> > > 
> > > But I'd like a better idea of how much to worry: is ACCESS_ONCE
> > > demonstrably needed today, or rather future-proofing and documentation?
> > 
> > Both.  If you are coding "while (foo)" where "foo" can be changed by an
> > interrupt handler, you had better instead write "while (ACCESS_ONCE(foo))"
> > or something similar, because most compilers are happy to optimize your
> > loop into an infinite loop in that case.  There are places in the Linux
> > kernel that would have problems if the compiler decided to refetch a
> > value -- if a pointer was changed in the meantime, part of your code
> > might be working on the old structure, and part on the new structure.
> > This really can happen today, and this is why rcu_dereference() contains
> > an ACCESS_ONCE().
> > 
> > If you are making lockless non-atomic access to a variable, I strongly
> > suggest ACCESS_ONCE() or something similar even if you cannot see how
> > the compiler can mess you up, especially in cases involving a lot of
> > inline functions.  In this case, the compiler can be looking at quite
> > a bit of code and optimizing across the entire mess.
> 
> Thank you for your fuller reply, Paul: I should be able to hold that
> i386 register pressure example in mind in future (not, of course,
> that it would be limited to i386 at all).

Good point -- given a large enough pile of inline functions, the
compiler might want to use a surprisingly large number of registers.

> > /me wonders what he stepped into with this email thread.  ;-)
> > 
> > 							Thanx, Paul
> 
> Come on, it wasn't that painful, was it?
> Just a quick extraction of info ;-)

It didn't hurt a bit, and it was over before I knew it.  ;-)

							Thanx, Paul

> Thanks,
> Hugh
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] page-table walkers vs memory order
  2012-07-25 22:09           ` Hugh Dickins
@ 2012-07-26  8:11             ` Peter Zijlstra
  -1 siblings, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2012-07-26  8:11 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Paul E. McKenney, Linus Torvalds, Rik van Riel, Andrew Morton,
	Nick Piggin, Andrea Arcangeli, linux-kernel, linux-arch,
	linux-mm

On Wed, 2012-07-25 at 15:09 -0700, Hugh Dickins wrote:
> We find out after it hits us, and someone studies the disassembly -
> if we're lucky enough to crash near the origin of the problem. 

This is a rather painful way... see

  https://lkml.org/lkml/2009/1/5/555

we were lucky there in that the lack of ACCESS_ONCE() caused an infinite
loop so we knew exactly where we got stuck.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] page-table walkers vs memory order
  2012-07-24 21:51   ` Hugh Dickins
@ 2012-07-26 20:39     ` Peter Zijlstra
  -1 siblings, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2012-07-26 20:39 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Paul E. McKenney, Rik van Riel, Andrew Morton,
	Nick Piggin, Andrea Arcangeli, linux-kernel, linux-arch,
	linux-mm

On Tue, 2012-07-24 at 14:51 -0700, Hugh Dickins wrote:
> I do love the status quo, but an audit would be welcome.  When
> it comes to patches, personally I tend to prefer ACCESS_ONCE() and
> smp_read_barrier_depends() and accompanying comments to be hidden away
> in the underlying macros or inlines where reasonable, rather than
> repeated all over; but I may have my priorities wrong on that.
> 
> 
Yeah, I was being lazy, and I totally forgot to actually look at the
alpha code.

How about we do a generic (cribbed from rcu_dereference):

#define page_table_deref(p)					\
({								\
	typeof(*p) *______p = (typeof(*p) __force *)ACCESS_ONCE(p);\
	smp_read_barrier_depends();				\
	((typeof(*p) __force __kernel *)(______p));		\
})

and use that all over to dereference page-tables. That way all this
lives in one place. Granted, I'll have to go edit all arch code, but I
seem to be doing that on a frequent basis anyway :/



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] page-table walkers vs memory order
  2012-07-26 20:39     ` Peter Zijlstra
@ 2012-07-27 19:22       ` Hugh Dickins
  -1 siblings, 0 replies; 42+ messages in thread
From: Hugh Dickins @ 2012-07-27 19:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Paul E. McKenney, Rik van Riel, Andrew Morton,
	Nick Piggin, Andrea Arcangeli, linux-kernel, linux-arch,
	linux-mm

On Thu, 26 Jul 2012, Peter Zijlstra wrote:
> On Tue, 2012-07-24 at 14:51 -0700, Hugh Dickins wrote:
> > I do love the status quo, but an audit would be welcome.  When
> > it comes to patches, personally I tend to prefer ACCESS_ONCE() and
> > smp_read_barrier_depends() and accompanying comments to be hidden away
> > in the underlying macros or inlines where reasonable, rather than
> > repeated all over; but I may have my priorities wrong on that.

I notice from that old radix_tree thread you pointed to in the previous
mail (for which many thanks: lots of meat to digest in there) that this
is also Linus's preference.

> > 
> > 
> Yeah, I was being lazy, and I totally forgot to actually look at the
> alpha code.
> 
> How about we do a generic (cribbed from rcu_dereference):
> 
> #define page_table_deref(p)					\
> ({								\
> 	typeof(*p) *______p = (typeof(*p) __force *)ACCESS_ONCE(p);\
> 	smp_read_barrier_depends();				\
> 	((typeof(*p) __force __kernel *)(______p));		\
> })
> 
> and use that all over to dereference page-tables. That way all this
> lives in one place. Granted, I'll have to go edit all arch code, but I
> seem to be doing that on a frequent basis anyway :/

If you're convinced that we now have (or are in danger of growing)
a number of places which need this safety, yes, I suppose so.

Personally, I'd have gone for just adding the relatively-understandable
ACCESS_ONCEs in all the arch/*/include/asm macros (which you're going to
visit to make the above change), and leave the smp_read_barrier_depends()
entirely in Alpha - one level of indirection less for the reader.
But that's just me, you're the one proposing to do the work, and
you may have very good reason for the above.
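
Something like this, say -- a sketch only, assuming x86's pmd_offset()
still looks the way I remember it (pud_page_vaddr() and pmd_index()
as today, untested):

static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
	pud_t pudval = ACCESS_ONCE(*pud);	/* read the entry exactly once */

	return (pmd_t *)pud_page_vaddr(pudval) + pmd_index(address);
}

with smp_read_barrier_depends() left entirely to the Alpha accessors.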

I'm unfamiliar with what value the __force __kernel annotations add.
But I am interested to notice that you are only 6/9ths as insane as
Paul: any chance of helping global underscore availability by not
hoarding quite so many in there? 

Hugh

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] page-table walkers vs memory order
  2012-07-27 19:22       ` Hugh Dickins
@ 2012-07-27 19:39         ` Paul E. McKenney
  -1 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2012-07-27 19:39 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Peter Zijlstra, Linus Torvalds, Rik van Riel, Andrew Morton,
	Nick Piggin, Andrea Arcangeli, linux-kernel, linux-arch,
	linux-mm

On Fri, Jul 27, 2012 at 12:22:46PM -0700, Hugh Dickins wrote:
> On Thu, 26 Jul 2012, Peter Zijlstra wrote:
> > On Tue, 2012-07-24 at 14:51 -0700, Hugh Dickins wrote:
> > > I do love the status quo, but an audit would be welcome.  When
> > > it comes to patches, personally I tend to prefer ACCESS_ONCE() and
> > > smp_read_barrier_depends() and accompanying comments to be hidden away
> > > in the underlying macros or inlines where reasonable, rather than
> > > repeated all over; but I may have my priorities wrong on that.
> 
> I notice from that old radix_tree thread you pointed to in the previous
> mail (for which many thanks: lots of meat to digest in there) that this
> is also Linus's preference.
> 
> > > 
> > > 
> > Yeah, I was being lazy, and I totally forgot to actually look at the
> > alpha code.
> > 
> > How about we do a generic (cribbed from rcu_dereference):
> > 
> > #define page_table_deref(p)					\
> > ({								\
> > 	typeof(*p) *______p = (typeof(*p) __force *)ACCESS_ONCE(p);\
> > 	smp_read_barrier_depends();				\
> > 	((typeof(*p) __force __kernel *)(______p));		\
> > })
> > 
> > and use that all over to dereference page-tables. That way all this
> > lives in one place. Granted, I'll have to go edit all arch code, but I
> > seem to be doing that on a frequent basis anyway :/
> 
> If you're convinced that we now have (or are in danger of growing)
> a number of places which need this safety, yes, I suppose so.
> 
> Personally, I'd have gone for just adding the relatively-understandable
> ACCESS_ONCEs in all the arch/*/include/asm macros (which you're going to
> visit to make the above change), and leave the smp_read_barrier_depends()
> entirely in Alpha - one level of indirection less for the reader.
> But that's just me, you're the one proposing to do the work, and
> you may have very good reason for the above.
> 
> I'm unfamiliar with what value the __force __kernel annotations add.
> But I am interested to notice that you are only 6/9ths as insane as
> Paul: any chance of helping global underscore availability by not
> hoarding quite so many in there? 

Heh!!!  The number of underscores for the original rcu_dereference()'s
local variable was the outcome of an argument about how obfuscated that
variable's name should be in order to avoid possible collisions with names
in the enclosing scope.  Nine leading underscores might seem excessive,
or even as you say, insane, but on the other hand no name collisions
have ever come to my attention.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] page-table walkers vs memory order
  2012-07-25 21:12         ` Paul E. McKenney
@ 2012-07-30 19:21           ` Jamie Lokier
  -1 siblings, 0 replies; 42+ messages in thread
From: Jamie Lokier @ 2012-07-30 19:21 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Hugh Dickins, Peter Zijlstra, Linus Torvalds, Rik van Riel,
	Andrew Morton, Nick Piggin, Andrea Arcangeli, linux-kernel,
	linux-arch, linux-mm

Paul E. McKenney wrote:
> > Does some version of gcc, under the options which we insist upon,
> > make such optimizations on any of the architectures which we support?
> 
> Pretty much any production-quality compiler will do double-fetch
> and old-value-reuse optimizations, the former especially on 32-bit
> x86.  I don't know of any production-quality compilers that do value
> speculation, which would make the compiler act like DEC Alpha hardware,
> and I would hope that if this does appear, (1) we would have warning
> and (2) it could be turned off.  But there has been a lot of work on
> this topic, so we would be foolish to rule it out.

GCC documentation for IA-64:

   -msched-ar-data-spec
   -mno-sched-ar-data-spec
     (En/Dis)able data speculative scheduling after reload. This results
     in generation of ld.a instructions and the corresponding check
     instructions (ld.c / chk.a). The default is 'enable'.

I don't know if that results in value speculation of the relevant kind.

-- Jamie

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] page-table walkers vs memory order
  2012-07-30 19:21           ` Jamie Lokier
@ 2012-07-30 20:28             ` Paul E. McKenney
  -1 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2012-07-30 20:28 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Hugh Dickins, Peter Zijlstra, Linus Torvalds, Rik van Riel,
	Andrew Morton, Nick Piggin, Andrea Arcangeli, linux-kernel,
	linux-arch, linux-mm

On Mon, Jul 30, 2012 at 08:21:40PM +0100, Jamie Lokier wrote:
> Paul E. McKenney wrote:
> > > Does some version of gcc, under the options which we insist upon,
> > > make such optimizations on any of the architectures which we support?
> > 
> > Pretty much any production-quality compiler will do double-fetch
> > and old-value-reuse optimizations, the former especially on 32-bit
> > x86.  I don't know of any production-quality compilers that do value
> > speculation, which would make the compiler act like DEC Alpha hardware,
> > and I would hope that if this does appear, (1) we would have warning
> > and (2) it could be turned off.  But there has been a lot of work on
> > this topic, so we would be foolish to rule it out.
> 
> GCC documentation for IA-64:
> 
>    -msched-ar-data-spec
>    -mno-sched-ar-data-spec
>      (En/Dis)able data speculative scheduling after reload. This results
>      in generation of ld.a instructions and the corresponding check
>      instructions (ld.c / chk.a). The default is 'enable'.
> 
> I don't know if that results in value speculation of the relevant kind.

If I remember correctly, the chk.a instruction will detect failed
speculation via cache state and deal with the situation correctly,
but I really need to defer to someone with more recent IA-64 experience.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] page-table walkers vs memory order
  2012-07-24 21:51   ` Hugh Dickins
@ 2012-08-04 14:37     ` Andrea Arcangeli
  -1 siblings, 0 replies; 42+ messages in thread
From: Andrea Arcangeli @ 2012-08-04 14:37 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Peter Zijlstra, Linus Torvalds, Paul E. McKenney, Rik van Riel,
	Andrew Morton, Nick Piggin, linux-kernel, linux-arch, linux-mm

On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> Since then, I think THP has made the rules more complicated; but I
> believe Andrea paid a great deal of attention to that kind of issue.

There were many issues; one unexpected one was commit
1a5a9906d4e8d1976b701f889d8f35d54b928f25.

Keep in mind that when holding only the mmap_sem in read mode (walk
page range speculative mode), or in gup_fast, the result is always
undefined and racy if another CPU is concurrently running munmap,
mremap or any other pmd mangling against the mapping you're walking:
all we have to do is not crash, it doesn't matter what happens.

The fact that you need a barrier() or ACCESS_ONCE() to avoid a lockup
in a while (rcu_dereference()) loop is no good reason to worry about
every possible, purely theoretical gcc issue.

One important thing that wasn't mentioned so far in this thread is
that we also entirely rely on gcc for all pagetable and device driver
writes (to do 1 movq instead of 8 movb), see native_set_pmd and writel.
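
(For reference, x86-64's native_set_pmd is -- from memory, so treat
this as a sketch rather than a quote of the current source -- just a
plain assignment:

static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
{
	*pmdp = pmd;	/* nothing tells gcc this must be one 8-byte store */
}

we simply rely on gcc not splitting that write into smaller stores.)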

We must separate all different cases to avoid huge confusion:

1) tmp=*ptr, while(tmp) -> possible, needs barrier or better
   ACCESS_ONCE if possible

2) orig_pmd = *pmdp before do_wp_huge_page in memory.c -> possible, needs
   barrier() and it should be possible to convert to ACCESS_ONCE

3) native_set_pmd and friends -> possible but not worth fixing; I
   tried to fix it a decade ago for peace of mind, was advised to
   desist, and it hasn't bitten us yet

4) writel -> possible but same as 3

5) compiler behaving like alpha -> impossible (I may be wrong but I
   believe so after thinking more on it)

6) I was told a decade ago by Honza never to touch any RAM that can
   change under the compiler unless it's declared volatile (it could
   crash over a switch/case statement implemented with a table if the
   switch/case value is re-read by the compiler).  -> depends, we
   don't always obey this rule: clearly gup_fast currently disobeys
   and even the generic pmd_read_atomic still disobeys (MADV_DONTNEED
   can zero the pmd). If there's no "switch/case" I'm not aware of
   other troubles.

7) barrier in pmd_none_or_trans_huge_or_clear_bad -> possible, same
   issue as 2, full explanation in git show 1a5a9906d4e8d1976b701f

Note: here I'm ignoring CPU reordering, this is only about the compiler.

5 is impossible because:

a) the compiler can't read a guessed address, or it could crash the
   kernel

b) the compiler has nowhere to keep a "guessed" valid address once the
   function returns and the stack is unwound

For the compiler to behave like alpha, it would have to read the
pteval before the pmdp, which it can't do, because it has no address
to guess from and it would Oops if it really tried to guess one!

So far it has been said that "the compiler can guess the address", but
there was no valid explanation of how it could do that, and I don't
see it, so please explain if I'm wrong about a) and b) above.

Furthermore the ACCESS_ONCE that Peter's patch added to the gup_fast
pud/pgd reads can't prevent the compiler from reading a guessed pmdp
address as a volatile variable before reading the pmdp pointer and
comparing it with the guessed address! So if it's 5 you worry about,
then adding ACCESS_ONCE on pudp/pgdp/pmdp is useless and won't fix it.
You should have added a barrier() instead.
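
(For the record, the two primitives being contrasted are defined in
the kernel headers of this era roughly as below -- quoted from memory,
so double-check include/linux/compiler*.h. barrier() is a full
compiler barrier: gcc must assume all memory has changed, so no value
loaded from memory can be reused across it. ACCESS_ONCE() only
constrains the single access it wraps.)

	/* include/linux/compiler-gcc.h */
	#define barrier()	__asm__ __volatile__("" : : : "memory")

	/* include/linux/compiler.h */
	#define ACCESS_ONCE(x)	(*(volatile typeof(x) *)&(x))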

> I suspect your arch/x86/mm/gup.c ACCESS_ONCE()s are necessary:
> gup_fast() breaks as many rules as it can, and in particular may
> be racing with the freeing of page tables; but I'm not so sure
> about the pagewalk mods - we could say "cannot do any harm",
> but I don't like adding lines on that basis.

I agree with adding ACCESS_ONCE, but because it's cases 2 and 7 above:
it could race with free_pgtables on the pgd/pud/pmd or with
MADV_DONTNEED on the pmd.

The other part of the patch, in pagewalk.c, is superfluous and should
be dropped: the pud/pgd can't change under walk_page_range, whose
callers are required to hold mmap_sem at least in read mode (it does
not disable irqs the way gup_fast does).

The pmd side instead can change, but only with THP enabled, and only
because of MADV_DONTNEED (never because of free_pgtables), and it's
already fully handled through pmd_none_or_trans_huge_or_clear_bad. The
->pmd_entry callers are required to call pmd_trans_unstable() before
proceeding, as the pmd may have been zeroed out by the time we get
there. This solution has zero barrier()/ACCESS_ONCE impact for THP=n.
If there are smp_read_barrier_depends() calls already in the alpha pte
methods we're fine.
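
(As an illustration of that convention -- a hedged sketch only, not
taken from any in-tree walker; my_pmd_entry and its body are made up:)

	static int my_pmd_entry(pmd_t *pmd, unsigned long addr,
				unsigned long end, struct mm_walk *walk)
	{
		pte_t *pte;
		spinlock_t *ptl;

		/*
		 * With THP, MADV_DONTNEED on another CPU may have zeroed
		 * this pmd after the walker looked at it; bail out instead
		 * of dereferencing a possibly unstable entry.
		 */
		if (pmd_trans_unstable(pmd))
			return 0;

		pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
		/* ... inspect the ptes covering [addr, end) ... */
		pte_unmap_unlock(pte, ptl);
		return 0;
	}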

Sorry for the long email, but without a list that separates all the
possible cases above, I don't think we can understand what we're
fixing in that patch and why the gup.c part is good.

Peter, I suggest resending the fix with a more detailed explanation
of the case 2/7 kind of race, for gup.c only, and dropping the
pagewalk.c part.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] page-table walkers vs memory order
  2012-08-04 14:37     ` Andrea Arcangeli
@ 2012-08-04 22:02       ` Paul E. McKenney
  -1 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2012-08-04 22:02 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Peter Zijlstra, Linus Torvalds, Rik van Riel,
	Andrew Morton, Nick Piggin, linux-kernel, linux-arch, linux-mm

On Sat, Aug 04, 2012 at 04:37:19PM +0200, Andrea Arcangeli wrote:
> On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> > Since then, I think THP has made the rules more complicated; but I
> > believe Andrea paid a great deal of attention to that kind of issue.
> 
> There were many issues, one unexpected was
> 1a5a9906d4e8d1976b701f889d8f35d54b928f25.

[ . . . ]

> 5) compiler behaving like alpha -> impossible (I may be wrong but I
>    believe so after thinking more on it)
> 
> 6) I was told a decade ago by Honza never to touch any RAM that can
>    change under the compiler unless it's declared volatile (it could
>    crash over a switch/case statement implemented with a jump table if
>    the switch/case value is re-read by the compiler).  -> depends, we
>    don't always obey this rule: gup_fast currently disobeys, and even
>    the generic pmd_read_atomic still disobeys (MADV_DONTNEED can zero
>    the pmd). If there's no "switch/case" I'm not aware of other
>    troubles.
> 
> 7) barrier in pmd_none_or_trans_huge_or_clear_bad -> possible, same
>    issue as 2, full explanation in git show 1a5a9906d4e8d1976b701f
> 
> Note: here I'm ignoring CPU reordering, this is only about the compiler.
> 
> 5 is impossible because:
> 
> a) the compiler can't read a guessed address, or it could crash the
>    kernel
> 
> b) the compiler has nowhere to keep a "guessed" valid address once the
>    function returns and the stack is unwound
> 
> For the compiler to behave like alpha, it would have to read the
> pteval before the pmdp, which it can't do, because it has no address
> to guess from and it would Oops if it really tried to guess one!
> 
> So far it was said "compiler can guess the address" but there was no
> valid explanation of how it could do it, and I don't see it, so please
> explain if I'm wrong about the a, b above.

OK, I'll bite.  ;-)

The most sane way for this to happen is with feedback-driven techniques
involving profiling, similar to what is done for basic-block reordering
or branch prediction.  The idea is that you compile the kernel in an
as-yet (and thankfully) mythical pointer-profiling mode, which records
the values of pointer loads and also measures the pointer-load latency.
If a situation is found where a given pointer almost always has the
same value but has high load latency (for example, is almost always a
high-latency cache miss), this fact is recorded and fed back into a
subsequent kernel build.  This subsequent kernel build might choose to
speculate the value of the pointer concurrently with the pointer load.

And of course, when interpreting the phrase "most sane way" at the
beginning of the prior paragraph, it would probably be wise to keep
in mind who wrote it.  And that "most sane way" might have little or
no resemblance to anything that typical kernel hackers would consider
anywhere near sanity.  ;-)
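
(Purely as an illustration of the transformation described above --
nothing that any real gcc emits today. The struct, head_ptr, use() and
the GUESSED constant are all invented for the example:)

	struct foo { int field; };
	extern struct foo *head_ptr;
	extern void use(int val);

	/* What the programmer wrote: */
	void reader_as_written(void)
	{
		struct foo *p = head_ptr;
		use(p->field);
	}

	/* What a value-speculating build could conceptually turn it into;
	 * GUESSED would be a constant baked in from the profiling run. */
	#define GUESSED ((struct foo *)0xffff880012345000UL)

	void reader_as_speculated(void)
	{
		/* may be loaded before head_ptr; blindly dereferencing a
		 * guessed kernel address is exactly what would oops */
		int speculated = GUESSED->field;
		struct foo *p = head_ptr;

		/* use the speculative load only if the guess was right */
		use(p == GUESSED ? speculated : p->field);
	}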

> Furthermore the ACCESS_ONCE that Peter's patch added to the gup_fast
> pud/pgd reads can't prevent the compiler from reading a guessed pmdp
> address as a volatile variable before reading the pmdp pointer and
> comparing it with the guessed address! So if it's 5 you worry about,
> then adding ACCESS_ONCE on pudp/pgdp/pmdp is useless and won't fix it.
> You should have added a barrier() instead.

Most compiler writers I have discussed this with agreed that a volatile
cast would suppress value speculation.  The "volatile" keyword is not
all that well specified in the C and C++ standards, but as "nix" said
at http://lwn.net/Articles/509731/:

	volatile's meaning as 'minimize optimizations applied to things
	manipulating anything of volatile type, do not duplicate, elide,
	move, fold, spindle or mutilate' is of long standing.

That said, value speculation as a compiler optimization makes me a bit
nervous, so my current feeling is that it should be suppressed entirely.

Hey, you asked, even if only implicitly!  ;-)

							Thanx, Paul

> > I suspect your arch/x86/mm/gup.c ACCESS_ONCE()s are necessary:
> > gup_fast() breaks as many rules as it can, and in particular may
> > be racing with the freeing of page tables; but I'm not so sure
> > about the pagewalk mods - we could say "cannot do any harm",
> > but I don't like adding lines on that basis.
> 
> I agree to add ACCESS_ONCE but because it's case 2, 7 above and it
> could race with free_pgtables of pgd/pud/pmd or MADV_DONTNEED with
> pmd.
> 
> The other part of the patch, in pagewalk.c, is superfluous and should
> be dropped: the pud/pgd can't change under walk_page_range, whose
> callers are required to hold mmap_sem at least in read mode (it does
> not disable irqs the way gup_fast does).
> 
> The pmd side instead can change but only with THP enabled, and only
> because MADV_DONTNEED (never because of free_pgtables) but it's
> already fully handled through pmd_none_or_trans_huge_or_clear_bad. The
> ->pmd_entry callers are required to call pmd_trans_unstable() before
> proceeding as the pmd may have been zeroed out by the time we get
> there. The solution is zero barrier()/ACCESS_ONCE impact for THP=n. If
> there are smp_read_barrier_depends already in alpha pte methods we're
> fine.
> 
> Sorry for the long email but without a list that separates all
> possible cases above, I don't think we can understand what we're
> fixing in that patch and why the gup.c part is good.
> 
> Peter, I suggest resending the fix with a more detailed explanation
> of the case 2/7 kind of race, for gup.c only, and dropping the
> pagewalk.c part.
> 
> Thanks,
> Andrea
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] page-table walkers vs memory order
  2012-08-04 22:02       ` Paul E. McKenney
@ 2012-08-04 22:47         ` Andrea Arcangeli
  -1 siblings, 0 replies; 42+ messages in thread
From: Andrea Arcangeli @ 2012-08-04 22:47 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Hugh Dickins, Peter Zijlstra, Linus Torvalds, Rik van Riel,
	Andrew Morton, Nick Piggin, linux-kernel, linux-arch, linux-mm

On Sat, Aug 04, 2012 at 03:02:45PM -0700, Paul E. McKenney wrote:
> OK, I'll bite.  ;-)

:))

> The most sane way for this to happen is with feedback-driven techniques
> involving profiling, similar to what is done for basic-block reordering
> or branch prediction.  The idea is that you compile the kernel in an
> as-yet (and thankfully) mythical pointer-profiling mode, which records
> the values of pointer loads and also measures the pointer-load latency.
> If a situation is found where a given pointer almost always has the
> same value but has high load latency (for example, is almost always a
> high-latency cache miss), this fact is recorded and fed back into a
> subsequent kernel build.  This subsequent kernel build might choose to
> speculate the value of the pointer concurrently with the pointer load.
> 
> And of course, when interpreting the phrase "most sane way" at the
> beginning of the prior paragraph, it would probably be wise to keep
> in mind who wrote it.  And that "most sane way" might have little or
> no resemblance to anything that typical kernel hackers would consider
> anywhere near sanity.  ;-)

I see. The above scenario is a fair enough assumption. We're clearly
stretching the constraints to see what is theoretically possible, and
this is a very clear explanation of how gcc could end up with a
hardcoded "guessed" address in the .text.

The next step to clarify now is how gcc could safely dereference such
a "guessed" address without the kernel knowing about it.

If gcc really dereferenced a guessed address coming from a profiling
run without the kernel being aware of it, it would eventually crash
the kernel with an oops. gcc cannot know what another CPU will do with
the kernel pagetables. It would be perfectly legitimate to temporarily
move the data at the "guessed address" to another page and to update
the pointer through stop_cpu during some weird "cpu offlining
scenario" or anything else you can imagine. I mean, gcc must behave in
all cases, so it's not allowed to dereference the guessed address at
any given time.

The only way gcc could do the alpha thing and dereference the guessed
address before the real pointer is with cooperation from the kernel.
The kernel would have to provide gcc "safe ranges" that won't crash
the kernel, and/or gcc could provide a .fixup section similar to the
current .fixup that the kernel would look up in the page fault
handler, in case the kernel is ok with temporarily taking faults in
that range. And in turn that can't happen unless we explicitly decide
to allow gcc to do it.
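
(For comparison, the cooperation mechanism that already exists inside
the kernel itself is the exception table used by e.g. __get_user();
below is a from-memory sketch in that style, with the macro name and
the operand details invented/approximate:)

	/*
	 * A load that is allowed to fault: if the access at label 1 faults,
	 * the page-fault handler finds the __ex_table entry generated by
	 * _ASM_EXTABLE(1b, 3b) and resumes at the fixup code at label 3
	 * instead of oopsing.
	 */
	#define load_with_fixup(val, addr, err)				\
		asm volatile("1:	mov %2, %1\n"			\
			     "2:\n"					\
			     ".section .fixup,\"ax\"\n"			\
			     "3:	mov %3, %0\n"			\
			     "	jmp 2b\n"				\
			     ".previous\n"				\
			     _ASM_EXTABLE(1b, 3b)			\
			     : "=r" (err), "=r" (val)			\
			     : "m" (*(addr)), "i" (-EFAULT), "0" (err))

A value-speculating compiler would need an equivalent of this for
every speculative load it emits, which is exactly the kind of
kernel/compiler handshake that doesn't exist today.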

> > Furthermore the ACCESS_ONCE that Peter's patch added to the gup_fast
> > pud/pgd reads can't prevent the compiler from reading a guessed pmdp
> > address as a volatile variable before reading the pmdp pointer and
> > comparing it with the guessed address! So if it's 5 you worry about,
> > then adding ACCESS_ONCE on pudp/pgdp/pmdp is useless and won't fix it.
> > You should have added a barrier() instead.
> 
> Most compiler writers I have discussed this with agreed that a volatile
> cast would suppress value speculation.  The "volatile" keyword is not
> all that well specified in the C and C++ standards, but as "nix" said
> at http://lwn.net/Articles/509731/:
> 
> 	volatile's meaning as 'minimize optimizations applied to things
> 	manipulating anything of volatile type, do not duplicate, elide,
> 	move, fold, spindle or mutilate' is of long standing.

Ok, so if the above optimization were possible, volatile would stop
it too; thanks for the quote and the explanation.

On a side note, I believe there are a few barrier()s that may be
worth converting to ACCESS_ONCE; that would take care of case 6) too,
in addition to avoiding clobbering more CPU registers than strictly
necessary. Not very important, but a possible micro-optimization.
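
(A minimal sketch of that micro-optimization, using the barrier() and
ACCESS_ONCE() definitions quoted earlier in the thread; the two
functions are invented for illustration. barrier()'s "memory" clobber
makes gcc discard every value it had cached from memory, while
ACCESS_ONCE() only forces the one wrapped load to be redone.)

	int sum_twice_with_barrier(int *p, int *q)
	{
		int s = *p + *q;
		barrier();	/* *p and *q must both be reloaded below */
		return s + *p + *q;
	}

	int sum_twice_with_access_once(int *p, int *q)
	{
		int s = *p + *q;
		/* only the *p load is forced to be redone; *q may stay
		 * cached in a register */
		return s + ACCESS_ONCE(*p) + *q;
	}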

> That said, value speculation as a compiler optimization makes me a bit
> nervous, so my current feeling is that it should be suppressed entirely.
> 
> Hey, you asked, even if only implicitly!  ;-)

You're reading my mind! :)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] page-table walkers vs memory order
  2012-08-04 22:47         ` Andrea Arcangeli
@ 2012-08-04 22:59           ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 42+ messages in thread
From: Dr. David Alan Gilbert @ 2012-08-04 22:59 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Paul E. McKenney, Hugh Dickins, Peter Zijlstra, Linus Torvalds,
	Rik van Riel, Andrew Morton, Nick Piggin, linux-kernel,
	linux-arch, linux-mm

* Andrea Arcangeli (aarcange@redhat.com) wrote:
> On Sat, Aug 04, 2012 at 03:02:45PM -0700, Paul E. McKenney wrote:
> > OK, I'll bite.  ;-)
> 
> :))
> 
> > The most sane way for this to happen is with feedback-driven techniques
> > involving profiling, similar to what is done for basic-block reordering
> > or branch prediction.  The idea is that you compile the kernel in an
> > as-yet (and thankfully) mythical pointer-profiling mode, which records
> > the values of pointer loads and also measures the pointer-load latency.
> > If a situation is found where a given pointer almost always has the
> > same value but has high load latency (for example, is almost always a
> > high-latency cache miss), this fact is recorded and fed back into a
> > subsequent kernel build.  This subsequent kernel build might choose to
> > speculate the value of the pointer concurrently with the pointer load.
> > 
> > And of course, when interpreting the phrase "most sane way" at the
> > beginning of the prior paragraph, it would probably be wise to keep
> > in mind who wrote it.  And that "most sane way" might have little or
> > no resemblance to anything that typical kernel hackers would consider
> > anywhere near sanity.  ;-)
> 
> I see. The above scenario is a fair enough assumption. We're clearly
> stretching the constraints to see what is theoretically possible, and
> this is a very clear explanation of how gcc could end up with a
> hardcoded "guessed" address in the .text.
> 
> The next step to clarify now is how gcc could safely dereference such
> a "guessed" address without the kernel knowing about it.
> 
> If gcc really dereferenced a guessed address coming from a profiling
> run without the kernel being aware of it, it would eventually crash
> the kernel with an oops. gcc cannot know what another CPU will do with
> the kernel pagetables. It would be perfectly legitimate to temporarily
> move the data at the "guessed address" to another page and to update
> the pointer through stop_cpu during some weird "cpu offlining
> scenario" or anything else you can imagine. I mean, gcc must behave in
> all cases, so it's not allowed to dereference the guessed address at
> any given time.

A compiler could decide to dereference it using a non-faulting load,
do the calculations or whatever on the returned value of the
non-faulting load, and then check whether the load actually faulted,
and whether the address matched the prediction, before it did a store
based on its guess.

Dave
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\ gro.gilbert @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] page-table walkers vs memory order
  2012-08-04 22:47         ` Andrea Arcangeli
@ 2012-08-04 23:06           ` Paul E. McKenney
  -1 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2012-08-04 23:06 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Peter Zijlstra, Linus Torvalds, Rik van Riel,
	Andrew Morton, Nick Piggin, linux-kernel, linux-arch, linux-mm

On Sun, Aug 05, 2012 at 12:47:05AM +0200, Andrea Arcangeli wrote:
> On Sat, Aug 04, 2012 at 03:02:45PM -0700, Paul E. McKenney wrote:
> > OK, I'll bite.  ;-)
> 
> :))
> 
> > The most sane way for this to happen is with feedback-driven techniques
> > involving profiling, similar to what is done for basic-block reordering
> > or branch prediction.  The idea is that you compile the kernel in an
> > as-yet (and thankfully) mythical pointer-profiling mode, which records
> > the values of pointer loads and also measures the pointer-load latency.
> > If a situation is found where a given pointer almost always has the
> > same value but has high load latency (for example, is almost always a
> > high-latency cache miss), this fact is recorded and fed back into a
> > subsequent kernel build.  This subsequent kernel build might choose to
> > speculate the value of the pointer concurrently with the pointer load.
> > 
> > And of course, when interpreting the phrase "most sane way" at the
> > beginning of the prior paragraph, it would probably be wise to keep
> > in mind who wrote it.  And that "most sane way" might have little or
> > no resemblance to anything that typical kernel hackers would consider
> > anywhere near sanity.  ;-)
> 
> I see. The above scenario is a fair enough assumption. We're clearly
> stretching the constraints to see what is theoretically possible, and
> this is a very clear explanation of how gcc could end up with a
> hardcoded "guessed" address in the .text.
> 
> The next step to clarify now is how gcc could safely dereference such
> a "guessed" address without the kernel knowing about it.
> 
> If gcc really dereferenced a guessed address coming from a profiling
> run without the kernel being aware of it, it would eventually crash
> the kernel with an oops. gcc cannot know what another CPU will do with
> the kernel pagetables. It would be perfectly legitimate to temporarily
> move the data at the "guessed address" to another page and to update
> the pointer through stop_cpu during some weird "cpu offlining
> scenario" or anything else you can imagine. I mean, gcc must behave in
> all cases, so it's not allowed to dereference the guessed address at
> any given time.
> 
> The only way gcc could do the alpha thing and dereference the guessed
> address before the real pointer is with cooperation from the kernel.
> The kernel would have to provide gcc "safe ranges" that won't crash
> the kernel, and/or gcc could provide a .fixup section similar to the
> current .fixup that the kernel would look up in the page fault
> handler, in case the kernel is ok with temporarily taking faults in
> that range. And in turn that can't happen unless we explicitly decide
> to allow gcc to do it.

And these are indeed some good reasons why I am not a fan of pointer-value
speculation.  ;-)

> > > Furthermore the ACCESS_ONCE that Peter's patch added to the gup_fast
> > > pud/pgd reads can't prevent the compiler from reading a guessed pmdp
> > > address as a volatile variable before reading the pmdp pointer and
> > > comparing it with the guessed address! So if it's 5 you worry about,
> > > then adding ACCESS_ONCE on pudp/pgdp/pmdp is useless and won't fix it.
> > > You should have added a barrier() instead.
> > 
> > Most compiler writers I have discussed this with agreed that a volatile
> > cast would suppress value speculation.  The "volatile" keyword is not
> > all that well specified in the C and C++ standards, but as "nix" said
> > at http://lwn.net/Articles/509731/:
> > 
> > 	volatile's meaning as 'minimize optimizations applied to things
> > 	manipulating anything of volatile type, do not duplicate, elide,
> > 	move, fold, spindle or mutilate' is of long standing.
> 
> Ok, so if the above optimization were possible, volatile would stop
> it too; thanks for the quote and the explanation.
> 
> On a side note, I believe there are a few barrier()s that may be
> worth converting to ACCESS_ONCE; that would take care of case 6) too,
> in addition to avoiding clobbering more CPU registers than strictly
> necessary. Not very important, but a possible micro-optimization.

Agreed on both points.

> > That said, value speculation as a compiler optimization makes me a bit
> nervous, so my current feeling is that it should be suppressed entirely.
> > 
> > Hey, you asked, even if only implicitly!  ;-)
> 
> You're reading my mind! :)

Or successfully carrying out value speculation on it.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] page-table walkers vs memory order
  2012-08-04 22:59           ` Dr. David Alan Gilbert
@ 2012-08-04 23:11             ` Paul E. McKenney
  -1 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2012-08-04 23:11 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Andrea Arcangeli, Hugh Dickins, Peter Zijlstra, Linus Torvalds,
	Rik van Riel, Andrew Morton, Nick Piggin, linux-kernel,
	linux-arch, linux-mm

On Sat, Aug 04, 2012 at 11:59:10PM +0100, Dr. David Alan Gilbert wrote:
> * Andrea Arcangeli (aarcange@redhat.com) wrote:
> > On Sat, Aug 04, 2012 at 03:02:45PM -0700, Paul E. McKenney wrote:
> > > OK, I'll bite.  ;-)
> > 
> > :))
> > 
> > > The most sane way for this to happen is with feedback-driven techniques
> > > involving profiling, similar to what is done for basic-block reordering
> > > or branch prediction.  The idea is that you compile the kernel in an
> > > as-yet (and thankfully) mythical pointer-profiling mode, which records
> > > the values of pointer loads and also measures the pointer-load latency.
> > > If a situation is found where a given pointer almost always has the
> > > same value but has high load latency (for example, is almost always a
> > > high-latency cache miss), this fact is recorded and fed back into a
> > > subsequent kernel build.  This subsequent kernel build might choose to
> > > speculate the value of the pointer concurrently with the pointer load.
> > > 
> > > And of course, when interpreting the phrase "most sane way" at the
> > > beginning of the prior paragraph, it would probably be wise to keep
> > > in mind who wrote it.  And that "most sane way" might have little or
> > > no resemblance to anything that typical kernel hackers would consider
> > > anywhere near sanity.  ;-)
> > 
> > I see. The above scenario is a fair enough assumption. We're clearly
> > stretching the constraints to see what is theoretically possible, and
> > this is a very clear explanation of how gcc could end up with a
> > hardcoded "guessed" address in the .text.
> > 
> > The next step to clarify now is how gcc could safely dereference such
> > a "guessed" address without the kernel knowing about it.
> > 
> > If gcc really dereferenced a guessed address coming from a profiling
> > run without the kernel being aware of it, it would eventually crash
> > the kernel with an oops. gcc cannot know what another CPU will do with
> > the kernel pagetables. It would be perfectly legitimate to temporarily
> > move the data at the "guessed address" to another page and to update
> > the pointer through stop_cpu during some weird "cpu offlining
> > scenario" or anything else you can imagine. I mean, gcc must behave in
> > all cases, so it's not allowed to dereference the guessed address at
> > any given time.
> 
> A compiler could decide to dereference it using a non-faulting load,
> do the calculations or whatever on the returned value of the
> non-faulting load, and then check whether the load actually faulted,
> and whether the address matched the prediction, before it did a store
> based on its guess.

Or the compiler could record a recovery address in a per-thread variable
before doing the speculative reference.  The page-fault handler could
consult the per-thread variable and take appropriate action.

But both this approach and your approach are vulnerable to things like
having the speculation area mapped to (say) MMIO space.  Not good!

So I am with Andrea on this one -- there would need to be some handshake
between kernel and compiler to avoid messing with possibly-unsafe
mappings.  And I am still not much in favor of value speculation.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] page-table walkers vs memory order
  2012-08-04 23:11             ` Paul E. McKenney
@ 2012-08-05  0:10               ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 42+ messages in thread
From: Dr. David Alan Gilbert @ 2012-08-05  0:10 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrea Arcangeli, Hugh Dickins, Peter Zijlstra, Linus Torvalds,
	Rik van Riel, Andrew Morton, Nick Piggin, linux-kernel,
	linux-arch, linux-mm

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Sat, Aug 04, 2012 at 11:59:10PM +0100, Dr. David Alan Gilbert wrote:

<snip>

> > A compiler could decide to dereference it using a non-faulting load,
> > do the calculations or whatever on the returned value of the
> > non-faulting load, and then check whether the load actually faulted,
> > and whether the address matched the prediction, before it did a store
> > based on its guess.
> 
> Or the compiler could record a recovery address in a per-thread variable
> before doing the speculative reference.  The page-fault handler could
> consult the per-thread variable and take appropriate action.

The difference is that I'd expect a compiler writer to think that
they've got a free hand in terms of instruction usage that the
OS/library doesn't see: if an instruction is in the manual and is
marked as usable from user space and non-faulting, I'd say it's fair
game. Once they know that they're going to take a fault or mark pages
specially, then they already know they're going to have to cooperate
with the OS, or worry about what other normal library calls are going
to do.
(A bit of googling seems to suggest IA64 and SPARC have played around
with non-faulting load optimisations, but I can't tell how much.)

> But both this approach and your approach are vulnerable to things like
> having the speculation area mapped to (say) MMIO space.  Not good!

Not good for someone doing MMIO, but from an evil-compiler point
of view, they might well assume that a pointer points to ordinary
memory unless someone has made an effort to tell them otherwise (not
that there is a good standard way to do that).

> So I am with Andrea on this one -- there would need to be some handshake
> between kernel and compiler to avoid messing with possibly-unsafe
> mappings.  And I am still not much in favor of value speculation.  ;-)

Dave
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\ gro.gilbert @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2012-08-05  0:11 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-23 17:34 [RFC] page-table walkers vs memory order Peter Zijlstra
2012-07-23 17:34 ` Peter Zijlstra
2012-07-23 19:27 ` Paul E. McKenney
2012-07-23 19:27   ` Paul E. McKenney
2012-07-24 21:51 ` Hugh Dickins
2012-07-24 21:51   ` Hugh Dickins
2012-07-25 17:56   ` Paul E. McKenney
2012-07-25 17:56     ` Paul E. McKenney
2012-07-25 20:26     ` Hugh Dickins
2012-07-25 20:26       ` Hugh Dickins
2012-07-25 21:12       ` Paul E. McKenney
2012-07-25 21:12         ` Paul E. McKenney
2012-07-25 22:09         ` Hugh Dickins
2012-07-25 22:09           ` Hugh Dickins
2012-07-25 22:37           ` Paul E. McKenney
2012-07-25 22:37             ` Paul E. McKenney
2012-07-26  8:11           ` Peter Zijlstra
2012-07-26  8:11             ` Peter Zijlstra
2012-07-30 19:21         ` Jamie Lokier
2012-07-30 19:21           ` Jamie Lokier
2012-07-30 20:28           ` Paul E. McKenney
2012-07-30 20:28             ` Paul E. McKenney
2012-07-26 20:39   ` Peter Zijlstra
2012-07-26 20:39     ` Peter Zijlstra
2012-07-27 19:22     ` Hugh Dickins
2012-07-27 19:22       ` Hugh Dickins
2012-07-27 19:39       ` Paul E. McKenney
2012-07-27 19:39         ` Paul E. McKenney
2012-08-04 14:37   ` Andrea Arcangeli
2012-08-04 14:37     ` Andrea Arcangeli
2012-08-04 22:02     ` Paul E. McKenney
2012-08-04 22:02       ` Paul E. McKenney
2012-08-04 22:47       ` Andrea Arcangeli
2012-08-04 22:47         ` Andrea Arcangeli
2012-08-04 22:59         ` Dr. David Alan Gilbert
2012-08-04 22:59           ` Dr. David Alan Gilbert
2012-08-04 23:11           ` Paul E. McKenney
2012-08-04 23:11             ` Paul E. McKenney
2012-08-05  0:10             ` Dr. David Alan Gilbert
2012-08-05  0:10               ` Dr. David Alan Gilbert
2012-08-04 23:06         ` Paul E. McKenney
2012-08-04 23:06           ` Paul E. McKenney
