devicetree.vger.kernel.org archive mirror
* [PATCH] of: Rework and simplify phandle cache to use a fixed size
@ 2019-12-11 23:23 Rob Herring
  2019-12-11 23:48 ` Rob Herring
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Rob Herring @ 2019-12-11 23:23 UTC (permalink / raw)
  To: devicetree
  Cc: linux-kernel, Sebastian Andrzej Siewior, Michael Ellerman,
	Segher Boessenkool, Frank Rowand

The phandle cache was added to speed up of_find_node_by_phandle() by
avoiding walking the whole DT to find a matching phandle. The
implementation has several shortcomings:

  - The cache is designed to work on a linear set of phandle values.
    This is true for dtc-generated DTs, but not for other cases such as
    Power.
  - The cache isn't enabled until of_core_init() and a typical system
    may see hundreds of calls to of_find_node_by_phandle() before that
    point.
  - The cache is freed and re-allocated when the number of phandles
    changes.
  - It takes a raw spinlock around a memory allocation which breaks on
    RT.

Change the implementation to a fixed size and use hash_32() as the
cache index. This greatly simplifies the implementation. It avoids
the need for any re-allocation of the cache and for taking a reference
on nodes in the cache. The only place cache entries are removed is
of_detach_node().

Using hash_32() removes any assumption on phandle values, improving
the hit rate for non-linear phandle values. The effect on linear values
using hash_32() is about a 10% collision rate. The chances of thrashing
on colliding values seem to be low.
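
For reference, the kernel's generic hash_32() is a multiplicative hash
(val * GOLDEN_RATIO_32, keeping the top bits). A minimal, stand-alone
user-space sketch along those lines (illustrative only, not part of the
patch) can be used to estimate the collision rate for a linear 1..n
phandle range with a 128 entry cache:

  #include <stdio.h>
  #include <stdint.h>
  #include <string.h>

  #define GOLDEN_RATIO_32		0x61C88647
  #define OF_PHANDLE_CACHE_BITS	7
  #define OF_PHANDLE_CACHE_SZ	(1U << OF_PHANDLE_CACHE_BITS)

  /* mirrors the generic hash_32(): multiplicative hash, keep the top bits */
  static uint32_t hash_32(uint32_t val, unsigned int bits)
  {
  	return (val * GOLDEN_RATIO_32) >> (32 - bits);
  }

  int main(void)
  {
  	uint8_t used[OF_PHANDLE_CACHE_SZ];
  	uint32_t n = 128, collisions = 0;

  	memset(used, 0, sizeof(used));
  	/* linear phandles 1..n, as dtc would assign them */
  	for (uint32_t ph = 1; ph <= n; ph++) {
  		uint32_t idx = hash_32(ph, OF_PHANDLE_CACHE_BITS);
  		if (used[idx])
  			collisions++;
  		used[idx] = 1;
  	}
  	printf("%u of %u linear phandles collide\n", collisions, n);
  	return 0;
  }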

To compare performance, I used an RK3399 board, which is a fairly typical
system. I found that just measuring boot time as done previously is
noisy and can be affected by unrelated factors. Bringing up secondary
cores also causes some issues with measuring, so I booted with 'nr_cpus=1'.
With no caching, calls to of_find_node_by_phandle() take about 20124 us
for 1248 calls. There are an additional 288 calls before timekeeping is
up. Using the average time per hit/miss with the cache, we can calculate
these calls to take 690 us (277 hits / 11 misses) with a 128-entry cache
and 13319 us with no cache or an uninitialized cache.
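
(For scale, 20124 us / 1248 calls works out to roughly 16 us per uncached
lookup on average, and 13319 us / 288 calls to roughly 46 us per uncached
call for the early, pre-timekeeping lookups.)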

Comparing the 3 implementations the time spent in
of_find_node_by_phandle() is:

no cache:        20124 us (+ 13319 us)
128 entry cache:  5134 us (+ 690 us)
current cache:     819 us (+ 13319 us)

We could move the allocation of the cache earlier to improve the
current cache, but that further complicates things: the allocation has
to happen after the slab allocator is up, so it can't be done when
unflattening (which uses memblock).

Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Segher Boessenkool <segher@kernel.crashing.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Signed-off-by: Rob Herring <robh@kernel.org>
---
 drivers/of/base.c       | 133 ++++++++--------------------------------
 drivers/of/dynamic.c    |   2 +-
 drivers/of/of_private.h |   4 +-
 drivers/of/overlay.c    |  10 ---
 4 files changed, 28 insertions(+), 121 deletions(-)

diff --git a/drivers/of/base.c b/drivers/of/base.c
index db7fbc0c0893..f7da162f126d 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -135,115 +135,38 @@ int __weak of_node_to_nid(struct device_node *np)
 }
 #endif
 
-/*
- * Assumptions behind phandle_cache implementation:
- *   - phandle property values are in a contiguous range of 1..n
- *
- * If the assumptions do not hold, then
- *   - the phandle lookup overhead reduction provided by the cache
- *     will likely be less
- */
+#define OF_PHANDLE_CACHE_BITS	7
+#define OF_PHANDLE_CACHE_SZ	BIT(OF_PHANDLE_CACHE_BITS)
 
-static struct device_node **phandle_cache;
-static u32 phandle_cache_mask;
+static struct device_node *phandle_cache[OF_PHANDLE_CACHE_SZ];
 
-/*
- * Caller must hold devtree_lock.
- */
-static void __of_free_phandle_cache(void)
+static u32 of_phandle_cache_hash(phandle handle)
 {
-	u32 cache_entries = phandle_cache_mask + 1;
-	u32 k;
-
-	if (!phandle_cache)
-		return;
-
-	for (k = 0; k < cache_entries; k++)
-		of_node_put(phandle_cache[k]);
-
-	kfree(phandle_cache);
-	phandle_cache = NULL;
+	return hash_32(handle, OF_PHANDLE_CACHE_BITS);
 }
 
-int of_free_phandle_cache(void)
-{
-	unsigned long flags;
-
-	raw_spin_lock_irqsave(&devtree_lock, flags);
-
-	__of_free_phandle_cache();
-
-	raw_spin_unlock_irqrestore(&devtree_lock, flags);
-
-	return 0;
-}
-#if !defined(CONFIG_MODULES)
-late_initcall_sync(of_free_phandle_cache);
-#endif
-
 /*
  * Caller must hold devtree_lock.
  */
-void __of_free_phandle_cache_entry(phandle handle)
+void __of_phandle_cache_inv_entry(phandle handle)
 {
-	phandle masked_handle;
+	u32 handle_hash;
 	struct device_node *np;
 
 	if (!handle)
 		return;
 
-	masked_handle = handle & phandle_cache_mask;
+	handle_hash = of_phandle_cache_hash(handle);
 
-	if (phandle_cache) {
-		np = phandle_cache[masked_handle];
-		if (np && handle == np->phandle) {
-			of_node_put(np);
-			phandle_cache[masked_handle] = NULL;
-		}
-	}
-}
-
-void of_populate_phandle_cache(void)
-{
-	unsigned long flags;
-	u32 cache_entries;
-	struct device_node *np;
-	u32 phandles = 0;
-
-	raw_spin_lock_irqsave(&devtree_lock, flags);
-
-	__of_free_phandle_cache();
-
-	for_each_of_allnodes(np)
-		if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL)
-			phandles++;
-
-	if (!phandles)
-		goto out;
-
-	cache_entries = roundup_pow_of_two(phandles);
-	phandle_cache_mask = cache_entries - 1;
-
-	phandle_cache = kcalloc(cache_entries, sizeof(*phandle_cache),
-				GFP_ATOMIC);
-	if (!phandle_cache)
-		goto out;
-
-	for_each_of_allnodes(np)
-		if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL) {
-			of_node_get(np);
-			phandle_cache[np->phandle & phandle_cache_mask] = np;
-		}
-
-out:
-	raw_spin_unlock_irqrestore(&devtree_lock, flags);
+	np = phandle_cache[handle_hash];
+	if (np && handle == np->phandle)
+		phandle_cache[handle_hash] = NULL;
 }
 
 void __init of_core_init(void)
 {
 	struct device_node *np;
 
-	of_populate_phandle_cache();
 
 	/* Create the kset, and register existing nodes */
 	mutex_lock(&of_mutex);
@@ -253,8 +176,11 @@ void __init of_core_init(void)
 		pr_err("failed to register existing nodes\n");
 		return;
 	}
-	for_each_of_allnodes(np)
+	for_each_of_allnodes(np) {
 		__of_attach_node_sysfs(np);
+		if (np->phandle && !phandle_cache[of_phandle_cache_hash(np->phandle)])
+			phandle_cache[of_phandle_cache_hash(np->phandle)] = np;
+	}
 	mutex_unlock(&of_mutex);
 
 	/* Symlink in /proc as required by userspace ABI */
@@ -1235,36 +1161,29 @@ struct device_node *of_find_node_by_phandle(phandle handle)
 {
 	struct device_node *np = NULL;
 	unsigned long flags;
-	phandle masked_handle;
+	u32 handle_hash;
 
 	if (!handle)
 		return NULL;
 
+	handle_hash = of_phandle_cache_hash(handle);
+
 	raw_spin_lock_irqsave(&devtree_lock, flags);
 
-	masked_handle = handle & phandle_cache_mask;
-
-	if (phandle_cache) {
-		if (phandle_cache[masked_handle] &&
-		    handle == phandle_cache[masked_handle]->phandle)
-			np = phandle_cache[masked_handle];
-		if (np && of_node_check_flag(np, OF_DETACHED)) {
-			WARN_ON(1); /* did not uncache np on node removal */
-			of_node_put(np);
-			phandle_cache[masked_handle] = NULL;
-			np = NULL;
-		}
+	if (phandle_cache[handle_hash] &&
+	    handle == phandle_cache[handle_hash]->phandle)
+		np = phandle_cache[handle_hash];
+	if (np && of_node_check_flag(np, OF_DETACHED)) {
+		WARN_ON(1); /* did not uncache np on node removal */
+		phandle_cache[handle_hash] = NULL;
+		np = NULL;
 	}
 
 	if (!np) {
 		for_each_of_allnodes(np)
 			if (np->phandle == handle &&
 			    !of_node_check_flag(np, OF_DETACHED)) {
-				if (phandle_cache) {
-					/* will put when removed from cache */
-					of_node_get(np);
-					phandle_cache[masked_handle] = np;
-				}
+				phandle_cache[handle_hash] = np;
 				break;
 			}
 	}
diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
index 49b16f76d78e..08fd823edac9 100644
--- a/drivers/of/dynamic.c
+++ b/drivers/of/dynamic.c
@@ -276,7 +276,7 @@ void __of_detach_node(struct device_node *np)
 	of_node_set_flag(np, OF_DETACHED);
 
 	/* race with of_find_node_by_phandle() prevented by devtree_lock */
-	__of_free_phandle_cache_entry(np->phandle);
+	__of_phandle_cache_inv_entry(np->phandle);
 }
 
 /**
diff --git a/drivers/of/of_private.h b/drivers/of/of_private.h
index 66294d29942a..582844c158ae 100644
--- a/drivers/of/of_private.h
+++ b/drivers/of/of_private.h
@@ -85,14 +85,12 @@ int of_resolve_phandles(struct device_node *tree);
 #endif
 
 #if defined(CONFIG_OF_DYNAMIC)
-void __of_free_phandle_cache_entry(phandle handle);
+void __of_phandle_cache_inv_entry(phandle handle);
 #endif
 
 #if defined(CONFIG_OF_OVERLAY)
 void of_overlay_mutex_lock(void);
 void of_overlay_mutex_unlock(void);
-int of_free_phandle_cache(void);
-void of_populate_phandle_cache(void);
 #else
 static inline void of_overlay_mutex_lock(void) {};
 static inline void of_overlay_mutex_unlock(void) {};
diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index 9617b7df7c4d..97fe92c1f1d2 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -974,8 +974,6 @@ static int of_overlay_apply(const void *fdt, struct device_node *tree,
 		goto err_free_overlay_changeset;
 	}
 
-	of_populate_phandle_cache();
-
 	ret = __of_changeset_apply_notify(&ovcs->cset);
 	if (ret)
 		pr_err("overlay apply changeset entry notify error %d\n", ret);
@@ -1218,17 +1216,9 @@ int of_overlay_remove(int *ovcs_id)
 
 	list_del(&ovcs->ovcs_list);
 
-	/*
-	 * Disable phandle cache.  Avoids race condition that would arise
-	 * from removing cache entry when the associated node is deleted.
-	 */
-	of_free_phandle_cache();
-
 	ret_apply = 0;
 	ret = __of_changeset_revert_entries(&ovcs->cset, &ret_apply);
 
-	of_populate_phandle_cache();
-
 	if (ret) {
 		if (ret_apply)
 			devicetree_state_flags |= DTSF_REVERT_FAIL;
-- 
2.20.1



* Re: [PATCH] of: Rework and simplify phandle cache to use a fixed size
  2019-12-11 23:23 [PATCH] of: Rework and simplify phandle cache to use a fixed size Rob Herring
@ 2019-12-11 23:48 ` Rob Herring
  2019-12-12 13:05   ` Sebastian Andrzej Siewior
  2019-12-19 15:31   ` Frank Rowand
  2019-12-12 11:50 ` Frank Rowand
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 16+ messages in thread
From: Rob Herring @ 2019-12-11 23:48 UTC (permalink / raw)
  To: devicetree, Frank Rowand
  Cc: linux-kernel, Sebastian Andrzej Siewior, Michael Ellerman,
	Segher Boessenkool

On Wed, Dec 11, 2019 at 5:23 PM Rob Herring <robh@kernel.org> wrote:
>
> The phandle cache was added to speed up of_find_node_by_phandle() by
> avoiding walking the whole DT to find a matching phandle. The
> implementation has several shortcomings:
>
>   - The cache is designed to work on a linear set of phandle values.
>     This is true for dtc generated DTs, but not for other cases such as
>     Power.
>   - The cache isn't enabled until of_core_init() and a typical system
>     may see hundreds of calls to of_find_node_by_phandle() before that
>     point.
>   - The cache is freed and re-allocated when the number of phandles
>     changes.
>   - It takes a raw spinlock around a memory allocation which breaks on
>     RT.
>
> Change the implementation to a fixed size and use hash_32() as the
> cache index. This greatly simplifies the implementation. It avoids
> the need for any re-alloc of the cache and taking a reference on nodes
> in the cache. We only have a single source of removing cache entries
> which is of_detach_node().
>
> Using hash_32() removes any assumption on phandle values improving
> the hit rate for non-linear phandle values. The effect on linear values
> using hash_32() is about a 10% collision. The chances of thrashing on
> colliding values seems to be low.
>
> To compare performance, I used a RK3399 board which is a pretty typical
> system. I found that just measuring boot time as done previously is
> noisy and may be impacted by other things. Also bringing up secondary
> cores causes some issues with measuring, so I booted with 'nr_cpus=1'.
> With no caching, calls to of_find_node_by_phandle() take about 20124 us
> for 1248 calls. There's an additional 288 calls before time keeping is
> up. Using the average time per hit/miss with the cache, we can calculate
> these calls to take 690 us (277 hit / 11 miss) with a 128 entry cache
> and 13319 us with no cache or an uninitialized cache.
>
> Comparing the 3 implementations the time spent in
> of_find_node_by_phandle() is:
>
> no cache:        20124 us (+ 13319 us)
> 128 entry cache:  5134 us (+ 690 us)
> current cache:     819 us (+ 13319 us)
>
> We could move the allocation of the cache earlier to improve the
> current cache, but that just further complicates the situation as it
> needs to be after slab is up, so we can't do it when unflattening (which
> uses memblock).
>
> Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Segher Boessenkool <segher@kernel.crashing.org>
> Cc: Frank Rowand <frowand.list@gmail.com>
> Signed-off-by: Rob Herring <robh@kernel.org>
> ---
>  drivers/of/base.c       | 133 ++++++++--------------------------------
>  drivers/of/dynamic.c    |   2 +-
>  drivers/of/of_private.h |   4 +-
>  drivers/of/overlay.c    |  10 ---
>  4 files changed, 28 insertions(+), 121 deletions(-)

[...]

> -       if (phandle_cache) {
> -               if (phandle_cache[masked_handle] &&
> -                   handle == phandle_cache[masked_handle]->phandle)
> -                       np = phandle_cache[masked_handle];
> -               if (np && of_node_check_flag(np, OF_DETACHED)) {
> -                       WARN_ON(1); /* did not uncache np on node removal */
> -                       of_node_put(np);
> -                       phandle_cache[masked_handle] = NULL;
> -                       np = NULL;
> -               }
> +       if (phandle_cache[handle_hash] &&
> +           handle == phandle_cache[handle_hash]->phandle)
> +               np = phandle_cache[handle_hash];
> +       if (np && of_node_check_flag(np, OF_DETACHED)) {
> +               WARN_ON(1); /* did not uncache np on node removal */

BTW, I don't think this check is even valid. If we failed to detach
and remove the node from the cache, then we could be accessing np
after freeing it.

Rob


* Re: [PATCH] of: Rework and simplify phandle cache to use a fixed size
  2019-12-11 23:23 [PATCH] of: Rework and simplify phandle cache to use a fixed size Rob Herring
  2019-12-11 23:48 ` Rob Herring
@ 2019-12-12 11:50 ` Frank Rowand
  2019-12-19  3:38   ` Frank Rowand
  2019-12-12 13:00 ` Sebastian Andrzej Siewior
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Frank Rowand @ 2019-12-12 11:50 UTC (permalink / raw)
  To: Rob Herring, devicetree, Chintan Pandya
  Cc: linux-kernel, Sebastian Andrzej Siewior, Michael Ellerman,
	Segher Boessenkool


+Chintan

Hi Chintan,

On 12/11/19 5:23 PM, Rob Herring wrote:
> The phandle cache was added to speed up of_find_node_by_phandle() by
> avoiding walking the whole DT to find a matching phandle. The
> implementation has several shortcomings:
> 
>   - The cache is designed to work on a linear set of phandle values.
>     This is true for dtc generated DTs, but not for other cases such as
>     Power.
>   - The cache isn't enabled until of_core_init() and a typical system
>     may see hundreds of calls to of_find_node_by_phandle() before that
>     point.
>   - The cache is freed and re-allocated when the number of phandles
>     changes.
>   - It takes a raw spinlock around a memory allocation which breaks on
>     RT.
> 
> Change the implementation to a fixed size and use hash_32() as the
> cache index. This greatly simplifies the implementation. It avoids
> the need for any re-alloc of the cache and taking a reference on nodes
> in the cache. We only have a single source of removing cache entries
> which is of_detach_node().

Can you check this patch on the system that you reported phandle lookup
overhead issues for?

-Frank


> 
> Using hash_32() removes any assumption on phandle values improving
> the hit rate for non-linear phandle values. The effect on linear values
> using hash_32() is about a 10% collision. The chances of thrashing on
> colliding values seems to be low.
> 
> To compare performance, I used a RK3399 board which is a pretty typical
> system. I found that just measuring boot time as done previously is
> noisy and may be impacted by other things. Also bringing up secondary
> cores causes some issues with measuring, so I booted with 'nr_cpus=1'.
> With no caching, calls to of_find_node_by_phandle() take about 20124 us
> for 1248 calls. There's an additional 288 calls before time keeping is
> up. Using the average time per hit/miss with the cache, we can calculate
> these calls to take 690 us (277 hit / 11 miss) with a 128 entry cache
> and 13319 us with no cache or an uninitialized cache.
> 
> Comparing the 3 implementations the time spent in
> of_find_node_by_phandle() is:
> 
> no cache:        20124 us (+ 13319 us)
> 128 entry cache:  5134 us (+ 690 us)
> current cache:     819 us (+ 13319 us)
> 
> We could move the allocation of the cache earlier to improve the
> current cache, but that just further complicates the situation as it
> needs to be after slab is up, so we can't do it when unflattening (which
> uses memblock).
> 
> Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Segher Boessenkool <segher@kernel.crashing.org>
> Cc: Frank Rowand <frowand.list@gmail.com>
> Signed-off-by: Rob Herring <robh@kernel.org>
> ---
>  drivers/of/base.c       | 133 ++++++++--------------------------------
>  drivers/of/dynamic.c    |   2 +-
>  drivers/of/of_private.h |   4 +-
>  drivers/of/overlay.c    |  10 ---
>  4 files changed, 28 insertions(+), 121 deletions(-)
> 
> diff --git a/drivers/of/base.c b/drivers/of/base.c
> index db7fbc0c0893..f7da162f126d 100644
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -135,115 +135,38 @@ int __weak of_node_to_nid(struct device_node *np)
>  }
>  #endif
>  
> -/*
> - * Assumptions behind phandle_cache implementation:
> - *   - phandle property values are in a contiguous range of 1..n
> - *
> - * If the assumptions do not hold, then
> - *   - the phandle lookup overhead reduction provided by the cache
> - *     will likely be less
> - */
> +#define OF_PHANDLE_CACHE_BITS	7
> +#define OF_PHANDLE_CACHE_SZ	BIT(OF_PHANDLE_CACHE_BITS)
>  
> -static struct device_node **phandle_cache;
> -static u32 phandle_cache_mask;
> +static struct device_node *phandle_cache[OF_PHANDLE_CACHE_SZ];
>  
> -/*
> - * Caller must hold devtree_lock.
> - */
> -static void __of_free_phandle_cache(void)
> +static u32 of_phandle_cache_hash(phandle handle)
>  {
> -	u32 cache_entries = phandle_cache_mask + 1;
> -	u32 k;
> -
> -	if (!phandle_cache)
> -		return;
> -
> -	for (k = 0; k < cache_entries; k++)
> -		of_node_put(phandle_cache[k]);
> -
> -	kfree(phandle_cache);
> -	phandle_cache = NULL;
> +	return hash_32(handle, OF_PHANDLE_CACHE_BITS);
>  }
>  
> -int of_free_phandle_cache(void)
> -{
> -	unsigned long flags;
> -
> -	raw_spin_lock_irqsave(&devtree_lock, flags);
> -
> -	__of_free_phandle_cache();
> -
> -	raw_spin_unlock_irqrestore(&devtree_lock, flags);
> -
> -	return 0;
> -}
> -#if !defined(CONFIG_MODULES)
> -late_initcall_sync(of_free_phandle_cache);
> -#endif
> -
>  /*
>   * Caller must hold devtree_lock.
>   */
> -void __of_free_phandle_cache_entry(phandle handle)
> +void __of_phandle_cache_inv_entry(phandle handle)
>  {
> -	phandle masked_handle;
> +	u32 handle_hash;
>  	struct device_node *np;
>  
>  	if (!handle)
>  		return;
>  
> -	masked_handle = handle & phandle_cache_mask;
> +	handle_hash = of_phandle_cache_hash(handle);
>  
> -	if (phandle_cache) {
> -		np = phandle_cache[masked_handle];
> -		if (np && handle == np->phandle) {
> -			of_node_put(np);
> -			phandle_cache[masked_handle] = NULL;
> -		}
> -	}
> -}
> -
> -void of_populate_phandle_cache(void)
> -{
> -	unsigned long flags;
> -	u32 cache_entries;
> -	struct device_node *np;
> -	u32 phandles = 0;
> -
> -	raw_spin_lock_irqsave(&devtree_lock, flags);
> -
> -	__of_free_phandle_cache();
> -
> -	for_each_of_allnodes(np)
> -		if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL)
> -			phandles++;
> -
> -	if (!phandles)
> -		goto out;
> -
> -	cache_entries = roundup_pow_of_two(phandles);
> -	phandle_cache_mask = cache_entries - 1;
> -
> -	phandle_cache = kcalloc(cache_entries, sizeof(*phandle_cache),
> -				GFP_ATOMIC);
> -	if (!phandle_cache)
> -		goto out;
> -
> -	for_each_of_allnodes(np)
> -		if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL) {
> -			of_node_get(np);
> -			phandle_cache[np->phandle & phandle_cache_mask] = np;
> -		}
> -
> -out:
> -	raw_spin_unlock_irqrestore(&devtree_lock, flags);
> +	np = phandle_cache[handle_hash];
> +	if (np && handle == np->phandle)
> +		phandle_cache[handle_hash] = NULL;
>  }
>  
>  void __init of_core_init(void)
>  {
>  	struct device_node *np;
>  
> -	of_populate_phandle_cache();
>  
>  	/* Create the kset, and register existing nodes */
>  	mutex_lock(&of_mutex);
> @@ -253,8 +176,11 @@ void __init of_core_init(void)
>  		pr_err("failed to register existing nodes\n");
>  		return;
>  	}
> -	for_each_of_allnodes(np)
> +	for_each_of_allnodes(np) {
>  		__of_attach_node_sysfs(np);
> +		if (np->phandle && !phandle_cache[of_phandle_cache_hash(np->phandle)])
> +			phandle_cache[of_phandle_cache_hash(np->phandle)] = np;
> +	}
>  	mutex_unlock(&of_mutex);
>  
>  	/* Symlink in /proc as required by userspace ABI */
> @@ -1235,36 +1161,29 @@ struct device_node *of_find_node_by_phandle(phandle handle)
>  {
>  	struct device_node *np = NULL;
>  	unsigned long flags;
> -	phandle masked_handle;
> +	u32 handle_hash;
>  
>  	if (!handle)
>  		return NULL;
>  
> +	handle_hash = of_phandle_cache_hash(handle);
> +
>  	raw_spin_lock_irqsave(&devtree_lock, flags);
>  
> -	masked_handle = handle & phandle_cache_mask;
> -
> -	if (phandle_cache) {
> -		if (phandle_cache[masked_handle] &&
> -		    handle == phandle_cache[masked_handle]->phandle)
> -			np = phandle_cache[masked_handle];
> -		if (np && of_node_check_flag(np, OF_DETACHED)) {
> -			WARN_ON(1); /* did not uncache np on node removal */
> -			of_node_put(np);
> -			phandle_cache[masked_handle] = NULL;
> -			np = NULL;
> -		}
> +	if (phandle_cache[handle_hash] &&
> +	    handle == phandle_cache[handle_hash]->phandle)
> +		np = phandle_cache[handle_hash];
> +	if (np && of_node_check_flag(np, OF_DETACHED)) {
> +		WARN_ON(1); /* did not uncache np on node removal */
> +		phandle_cache[handle_hash] = NULL;
> +		np = NULL;
>  	}
>  
>  	if (!np) {
>  		for_each_of_allnodes(np)
>  			if (np->phandle == handle &&
>  			    !of_node_check_flag(np, OF_DETACHED)) {
> -				if (phandle_cache) {
> -					/* will put when removed from cache */
> -					of_node_get(np);
> -					phandle_cache[masked_handle] = np;
> -				}
> +				phandle_cache[handle_hash] = np;
>  				break;
>  			}
>  	}
> diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
> index 49b16f76d78e..08fd823edac9 100644
> --- a/drivers/of/dynamic.c
> +++ b/drivers/of/dynamic.c
> @@ -276,7 +276,7 @@ void __of_detach_node(struct device_node *np)
>  	of_node_set_flag(np, OF_DETACHED);
>  
>  	/* race with of_find_node_by_phandle() prevented by devtree_lock */
> -	__of_free_phandle_cache_entry(np->phandle);
> +	__of_phandle_cache_inv_entry(np->phandle);
>  }
>  
>  /**
> diff --git a/drivers/of/of_private.h b/drivers/of/of_private.h
> index 66294d29942a..582844c158ae 100644
> --- a/drivers/of/of_private.h
> +++ b/drivers/of/of_private.h
> @@ -85,14 +85,12 @@ int of_resolve_phandles(struct device_node *tree);
>  #endif
>  
>  #if defined(CONFIG_OF_DYNAMIC)
> -void __of_free_phandle_cache_entry(phandle handle);
> +void __of_phandle_cache_inv_entry(phandle handle);
>  #endif
>  
>  #if defined(CONFIG_OF_OVERLAY)
>  void of_overlay_mutex_lock(void);
>  void of_overlay_mutex_unlock(void);
> -int of_free_phandle_cache(void);
> -void of_populate_phandle_cache(void);
>  #else
>  static inline void of_overlay_mutex_lock(void) {};
>  static inline void of_overlay_mutex_unlock(void) {};
> diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
> index 9617b7df7c4d..97fe92c1f1d2 100644
> --- a/drivers/of/overlay.c
> +++ b/drivers/of/overlay.c
> @@ -974,8 +974,6 @@ static int of_overlay_apply(const void *fdt, struct device_node *tree,
>  		goto err_free_overlay_changeset;
>  	}
>  
> -	of_populate_phandle_cache();
> -
>  	ret = __of_changeset_apply_notify(&ovcs->cset);
>  	if (ret)
>  		pr_err("overlay apply changeset entry notify error %d\n", ret);
> @@ -1218,17 +1216,9 @@ int of_overlay_remove(int *ovcs_id)
>  
>  	list_del(&ovcs->ovcs_list);
>  
> -	/*
> -	 * Disable phandle cache.  Avoids race condition that would arise
> -	 * from removing cache entry when the associated node is deleted.
> -	 */
> -	of_free_phandle_cache();
> -
>  	ret_apply = 0;
>  	ret = __of_changeset_revert_entries(&ovcs->cset, &ret_apply);
>  
> -	of_populate_phandle_cache();
> -
>  	if (ret) {
>  		if (ret_apply)
>  			devicetree_state_flags |= DTSF_REVERT_FAIL;
> 



* Re: [PATCH] of: Rework and simplify phandle cache to use a fixed size
  2019-12-11 23:23 [PATCH] of: Rework and simplify phandle cache to use a fixed size Rob Herring
  2019-12-11 23:48 ` Rob Herring
  2019-12-12 11:50 ` Frank Rowand
@ 2019-12-12 13:00 ` Sebastian Andrzej Siewior
  2019-12-19 15:51 ` Frank Rowand
  2020-01-07 10:22 ` Jon Hunter
  4 siblings, 0 replies; 16+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-12-12 13:00 UTC (permalink / raw)
  To: Rob Herring
  Cc: devicetree, linux-kernel, Michael Ellerman, Segher Boessenkool,
	Frank Rowand

On 2019-12-11 17:23:45 [-0600], Rob Herring wrote:
> Change the implementation to a fixed size and use hash_32() as the
> cache index. This greatly simplifies the implementation. It avoids

The negative diffstat is nice. It solves the original issue I had and
does not introduce anything new; I just tested it. So feel free to add:

Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Sebastian


* Re: [PATCH] of: Rework and simplify phandle cache to use a fixed size
  2019-12-11 23:48 ` Rob Herring
@ 2019-12-12 13:05   ` Sebastian Andrzej Siewior
  2019-12-12 19:28     ` Rob Herring
  2019-12-19 15:33     ` Frank Rowand
  2019-12-19 15:31   ` Frank Rowand
  1 sibling, 2 replies; 16+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-12-12 13:05 UTC (permalink / raw)
  To: Rob Herring
  Cc: devicetree, Frank Rowand, linux-kernel, Michael Ellerman,
	Segher Boessenkool

On 2019-12-11 17:48:54 [-0600], Rob Herring wrote:
> > -       if (phandle_cache) {
> > -               if (phandle_cache[masked_handle] &&
> > -                   handle == phandle_cache[masked_handle]->phandle)
> > -                       np = phandle_cache[masked_handle];
> > -               if (np && of_node_check_flag(np, OF_DETACHED)) {
> > -                       WARN_ON(1); /* did not uncache np on node removal */
> > -                       of_node_put(np);
> > -                       phandle_cache[masked_handle] = NULL;
> > -                       np = NULL;
> > -               }
> > +       if (phandle_cache[handle_hash] &&
> > +           handle == phandle_cache[handle_hash]->phandle)
> > +               np = phandle_cache[handle_hash];
> > +       if (np && of_node_check_flag(np, OF_DETACHED)) {
> > +               WARN_ON(1); /* did not uncache np on node removal */
> 
> BTW, I don't think this check is even valid. If we failed to detach
> and remove the node from the cache, then we could be accessing np
> after freeing it.

This is kmalloc()ed memory, which is always valid to access. If the
memory has already been re-used then
	handle == phandle_cache[handle_hash]->phandle

will fail (the check, not the memory access itself). If the check still
passes then you can hope for the OF_DETACHED flag to trigger the
warning.

> Rob

Sebastian


* Re: [PATCH] of: Rework and simplify phandle cache to use a fixed size
  2019-12-12 13:05   ` Sebastian Andrzej Siewior
@ 2019-12-12 19:28     ` Rob Herring
  2019-12-18  9:47       ` Sebastian Andrzej Siewior
  2019-12-19 15:33     ` Frank Rowand
  1 sibling, 1 reply; 16+ messages in thread
From: Rob Herring @ 2019-12-12 19:28 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: devicetree, Frank Rowand, linux-kernel, Michael Ellerman,
	Segher Boessenkool

On Thu, Dec 12, 2019 at 7:05 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2019-12-11 17:48:54 [-0600], Rob Herring wrote:
> > > -       if (phandle_cache) {
> > > -               if (phandle_cache[masked_handle] &&
> > > -                   handle == phandle_cache[masked_handle]->phandle)
> > > -                       np = phandle_cache[masked_handle];
> > > -               if (np && of_node_check_flag(np, OF_DETACHED)) {
> > > -                       WARN_ON(1); /* did not uncache np on node removal */
> > > -                       of_node_put(np);
> > > -                       phandle_cache[masked_handle] = NULL;
> > > -                       np = NULL;
> > > -               }
> > > +       if (phandle_cache[handle_hash] &&
> > > +           handle == phandle_cache[handle_hash]->phandle)
> > > +               np = phandle_cache[handle_hash];
> > > +       if (np && of_node_check_flag(np, OF_DETACHED)) {
> > > +               WARN_ON(1); /* did not uncache np on node removal */
> >
> > BTW, I don't think this check is even valid. If we failed to detach
> > and remove the node from the cache, then we could be accessing np
> > after freeing it.
>
> this is kmalloc()ed memory which is always valid. If the memory is
> already re-used then
>         handle == phandle_cache[handle_hash]->phandle
>
> will fail (the check, not the memory access itself).

There's a 1 in 2^32 chance it won't.

> If the check
> remains valid then you can hope for the OF_DETACHED flag to trigger the
> warning.

Keyword is hope.

To look at it another way: do we need this check at all? It is in the
"fast path". There's a single location where we set OF_DETACHED, and the
cache entry is removed at the same time. Also, the code that frees the
node's memory checks for OF_DETACHED as well. Previously, a free
couldn't happen because we held a reference on nodes in the cache.
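
For illustration only, dropping that check would reduce the cached-lookup
path in of_find_node_by_phandle() to roughly the following (an untested
sketch based on the hunk above, not a follow-up patch):

	if (phandle_cache[handle_hash] &&
	    handle == phandle_cache[handle_hash]->phandle)
		np = phandle_cache[handle_hash];

	if (!np) {
		for_each_of_allnodes(np)
			if (np->phandle == handle &&
			    !of_node_check_flag(np, OF_DETACHED)) {
				phandle_cache[handle_hash] = np;
				break;
			}
	}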

Rob


* Re: [PATCH] of: Rework and simplify phandle cache to use a fixed size
  2019-12-12 19:28     ` Rob Herring
@ 2019-12-18  9:47       ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 16+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-12-18  9:47 UTC (permalink / raw)
  To: Rob Herring
  Cc: devicetree, Frank Rowand, linux-kernel, Michael Ellerman,
	Segher Boessenkool

On 2019-12-12 13:28:26 [-0600], Rob Herring wrote:
> On Thu, Dec 12, 2019 at 7:05 AM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
> >
> > On 2019-12-11 17:48:54 [-0600], Rob Herring wrote:
> > > > -       if (phandle_cache) {
> > > > -               if (phandle_cache[masked_handle] &&
> > > > -                   handle == phandle_cache[masked_handle]->phandle)
> > > > -                       np = phandle_cache[masked_handle];
> > > > -               if (np && of_node_check_flag(np, OF_DETACHED)) {
> > > > -                       WARN_ON(1); /* did not uncache np on node removal */
> > > > -                       of_node_put(np);
> > > > -                       phandle_cache[masked_handle] = NULL;
> > > > -                       np = NULL;
> > > > -               }
> > > > +       if (phandle_cache[handle_hash] &&
> > > > +           handle == phandle_cache[handle_hash]->phandle)
> > > > +               np = phandle_cache[handle_hash];
> > > > +       if (np && of_node_check_flag(np, OF_DETACHED)) {
> > > > +               WARN_ON(1); /* did not uncache np on node removal */
> > >
> > > BTW, I don't think this check is even valid. If we failed to detach
> > > and remove the node from the cache, then we could be accessing np
> > > after freeing it.
> >
> > this is kmalloc()ed memory which is always valid. If the memory is
> > already re-used then
> >         handle == phandle_cache[handle_hash]->phandle
> >
> > will fail (the check, not the memory access itself).
> 
> There's a 1 in 2^32 chance it won't.

:)

> > If the check
> > remains valid then you can hope for the OF_DETACHED flag to trigger the
> > warning.
> 
> Keyword is hope.
> 
> To look at it another way. Do we need this check? It is in the "fast
> path". There's a single location where we set OF_DETACHED and the
> cache entry is removed at the same time. Also, if we do free the
> node's memory, it also checks for OF_DETACHED. Previously, a free
> wouldn't happen because we incremented the ref count on nodes in the
> cache.

So get rid of it then. It is just __of_detach_node() that removes it.

> Rob

Sebastian


* Re: [PATCH] of: Rework and simplify phandle cache to use a fixed size
  2019-12-12 11:50 ` Frank Rowand
@ 2019-12-19  3:38   ` Frank Rowand
  0 siblings, 0 replies; 16+ messages in thread
From: Frank Rowand @ 2019-12-19  3:38 UTC (permalink / raw)
  To: Rob Herring, devicetree, Chintan Pandya
  Cc: linux-kernel, Sebastian Andrzej Siewior, Michael Ellerman,
	Segher Boessenkool


-Chintan

On 12/12/19 5:50 AM, Frank Rowand wrote:
> 
> +Chintan
> 
> Hi Chintan,
> 
> On 12/11/19 5:23 PM, Rob Herring wrote:
>> The phandle cache was added to speed up of_find_node_by_phandle() by
>> avoiding walking the whole DT to find a matching phandle. The
>> implementation has several shortcomings:
>>
>>   - The cache is designed to work on a linear set of phandle values.
>>     This is true for dtc generated DTs, but not for other cases such as
>>     Power.
>>   - The cache isn't enabled until of_core_init() and a typical system
>>     may see hundreds of calls to of_find_node_by_phandle() before that
>>     point.
>>   - The cache is freed and re-allocated when the number of phandles
>>     changes.
>>   - It takes a raw spinlock around a memory allocation which breaks on
>>     RT.
>>
>> Change the implementation to a fixed size and use hash_32() as the
>> cache index. This greatly simplifies the implementation. It avoids
>> the need for any re-alloc of the cache and taking a reference on nodes
>> in the cache. We only have a single source of removing cache entries
>> which is of_detach_node().
> 
> Can you check this patch on the system that you reported phandle lookup
> overhead issues for?

Adding Chintan to the distribution list was not successful -- his email
address bounces.

If anyone else from Qualcomm wants to look at this, the system Chintan
was measuring was:

https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree/arch/arm64/boot/dts/qcom/sda670-mtp.dts?h=msm-4.9

The various email threads were in January and February 2018.

-Frank

> 
> -Frank
> 
> 
>>
>> Using hash_32() removes any assumption on phandle values improving
>> the hit rate for non-linear phandle values. The effect on linear values
>> using hash_32() is about a 10% collision. The chances of thrashing on
>> colliding values seems to be low.
>>
>> To compare performance, I used a RK3399 board which is a pretty typical
>> system. I found that just measuring boot time as done previously is
>> noisy and may be impacted by other things. Also bringing up secondary
>> cores causes some issues with measuring, so I booted with 'nr_cpus=1'.
>> With no caching, calls to of_find_node_by_phandle() take about 20124 us
>> for 1248 calls. There's an additional 288 calls before time keeping is
>> up. Using the average time per hit/miss with the cache, we can calculate
>> these calls to take 690 us (277 hit / 11 miss) with a 128 entry cache
>> and 13319 us with no cache or an uninitialized cache.
>>
>> Comparing the 3 implementations the time spent in
>> of_find_node_by_phandle() is:
>>
>> no cache:        20124 us (+ 13319 us)
>> 128 entry cache:  5134 us (+ 690 us)
>> current cache:     819 us (+ 13319 us)
>>
>> We could move the allocation of the cache earlier to improve the
>> current cache, but that just further complicates the situation as it
>> needs to be after slab is up, so we can't do it when unflattening (which
>> uses memblock).
>>
>> Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>> Cc: Michael Ellerman <mpe@ellerman.id.au>
>> Cc: Segher Boessenkool <segher@kernel.crashing.org>
>> Cc: Frank Rowand <frowand.list@gmail.com>
>> Signed-off-by: Rob Herring <robh@kernel.org>
>> ---
>>  drivers/of/base.c       | 133 ++++++++--------------------------------
>>  drivers/of/dynamic.c    |   2 +-
>>  drivers/of/of_private.h |   4 +-
>>  drivers/of/overlay.c    |  10 ---
>>  4 files changed, 28 insertions(+), 121 deletions(-)
>>
>> diff --git a/drivers/of/base.c b/drivers/of/base.c
>> index db7fbc0c0893..f7da162f126d 100644
>> --- a/drivers/of/base.c
>> +++ b/drivers/of/base.c
>> @@ -135,115 +135,38 @@ int __weak of_node_to_nid(struct device_node *np)
>>  }
>>  #endif
>>  
>> -/*
>> - * Assumptions behind phandle_cache implementation:
>> - *   - phandle property values are in a contiguous range of 1..n
>> - *
>> - * If the assumptions do not hold, then
>> - *   - the phandle lookup overhead reduction provided by the cache
>> - *     will likely be less
>> - */
>> +#define OF_PHANDLE_CACHE_BITS	7
>> +#define OF_PHANDLE_CACHE_SZ	BIT(OF_PHANDLE_CACHE_BITS)
>>  
>> -static struct device_node **phandle_cache;
>> -static u32 phandle_cache_mask;
>> +static struct device_node *phandle_cache[OF_PHANDLE_CACHE_SZ];
>>  
>> -/*
>> - * Caller must hold devtree_lock.
>> - */
>> -static void __of_free_phandle_cache(void)
>> +static u32 of_phandle_cache_hash(phandle handle)
>>  {
>> -	u32 cache_entries = phandle_cache_mask + 1;
>> -	u32 k;
>> -
>> -	if (!phandle_cache)
>> -		return;
>> -
>> -	for (k = 0; k < cache_entries; k++)
>> -		of_node_put(phandle_cache[k]);
>> -
>> -	kfree(phandle_cache);
>> -	phandle_cache = NULL;
>> +	return hash_32(handle, OF_PHANDLE_CACHE_BITS);
>>  }
>>  
>> -int of_free_phandle_cache(void)
>> -{
>> -	unsigned long flags;
>> -
>> -	raw_spin_lock_irqsave(&devtree_lock, flags);
>> -
>> -	__of_free_phandle_cache();
>> -
>> -	raw_spin_unlock_irqrestore(&devtree_lock, flags);
>> -
>> -	return 0;
>> -}
>> -#if !defined(CONFIG_MODULES)
>> -late_initcall_sync(of_free_phandle_cache);
>> -#endif
>> -
>>  /*
>>   * Caller must hold devtree_lock.
>>   */
>> -void __of_free_phandle_cache_entry(phandle handle)
>> +void __of_phandle_cache_inv_entry(phandle handle)
>>  {
>> -	phandle masked_handle;
>> +	u32 handle_hash;
>>  	struct device_node *np;
>>  
>>  	if (!handle)
>>  		return;
>>  
>> -	masked_handle = handle & phandle_cache_mask;
>> +	handle_hash = of_phandle_cache_hash(handle);
>>  
>> -	if (phandle_cache) {
>> -		np = phandle_cache[masked_handle];
>> -		if (np && handle == np->phandle) {
>> -			of_node_put(np);
>> -			phandle_cache[masked_handle] = NULL;
>> -		}
>> -	}
>> -}
>> -
>> -void of_populate_phandle_cache(void)
>> -{
>> -	unsigned long flags;
>> -	u32 cache_entries;
>> -	struct device_node *np;
>> -	u32 phandles = 0;
>> -
>> -	raw_spin_lock_irqsave(&devtree_lock, flags);
>> -
>> -	__of_free_phandle_cache();
>> -
>> -	for_each_of_allnodes(np)
>> -		if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL)
>> -			phandles++;
>> -
>> -	if (!phandles)
>> -		goto out;
>> -
>> -	cache_entries = roundup_pow_of_two(phandles);
>> -	phandle_cache_mask = cache_entries - 1;
>> -
>> -	phandle_cache = kcalloc(cache_entries, sizeof(*phandle_cache),
>> -				GFP_ATOMIC);
>> -	if (!phandle_cache)
>> -		goto out;
>> -
>> -	for_each_of_allnodes(np)
>> -		if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL) {
>> -			of_node_get(np);
>> -			phandle_cache[np->phandle & phandle_cache_mask] = np;
>> -		}
>> -
>> -out:
>> -	raw_spin_unlock_irqrestore(&devtree_lock, flags);
>> +	np = phandle_cache[handle_hash];
>> +	if (np && handle == np->phandle)
>> +		phandle_cache[handle_hash] = NULL;
>>  }
>>  
>>  void __init of_core_init(void)
>>  {
>>  	struct device_node *np;
>>  
>> -	of_populate_phandle_cache();
>>  
>>  	/* Create the kset, and register existing nodes */
>>  	mutex_lock(&of_mutex);
>> @@ -253,8 +176,11 @@ void __init of_core_init(void)
>>  		pr_err("failed to register existing nodes\n");
>>  		return;
>>  	}
>> -	for_each_of_allnodes(np)
>> +	for_each_of_allnodes(np) {
>>  		__of_attach_node_sysfs(np);
>> +		if (np->phandle && !phandle_cache[of_phandle_cache_hash(np->phandle)])
>> +			phandle_cache[of_phandle_cache_hash(np->phandle)] = np;
>> +	}
>>  	mutex_unlock(&of_mutex);
>>  
>>  	/* Symlink in /proc as required by userspace ABI */
>> @@ -1235,36 +1161,29 @@ struct device_node *of_find_node_by_phandle(phandle handle)
>>  {
>>  	struct device_node *np = NULL;
>>  	unsigned long flags;
>> -	phandle masked_handle;
>> +	u32 handle_hash;
>>  
>>  	if (!handle)
>>  		return NULL;
>>  
>> +	handle_hash = of_phandle_cache_hash(handle);
>> +
>>  	raw_spin_lock_irqsave(&devtree_lock, flags);
>>  
>> -	masked_handle = handle & phandle_cache_mask;
>> -
>> -	if (phandle_cache) {
>> -		if (phandle_cache[masked_handle] &&
>> -		    handle == phandle_cache[masked_handle]->phandle)
>> -			np = phandle_cache[masked_handle];
>> -		if (np && of_node_check_flag(np, OF_DETACHED)) {
>> -			WARN_ON(1); /* did not uncache np on node removal */
>> -			of_node_put(np);
>> -			phandle_cache[masked_handle] = NULL;
>> -			np = NULL;
>> -		}
>> +	if (phandle_cache[handle_hash] &&
>> +	    handle == phandle_cache[handle_hash]->phandle)
>> +		np = phandle_cache[handle_hash];
>> +	if (np && of_node_check_flag(np, OF_DETACHED)) {
>> +		WARN_ON(1); /* did not uncache np on node removal */
>> +		phandle_cache[handle_hash] = NULL;
>> +		np = NULL;
>>  	}
>>  
>>  	if (!np) {
>>  		for_each_of_allnodes(np)
>>  			if (np->phandle == handle &&
>>  			    !of_node_check_flag(np, OF_DETACHED)) {
>> -				if (phandle_cache) {
>> -					/* will put when removed from cache */
>> -					of_node_get(np);
>> -					phandle_cache[masked_handle] = np;
>> -				}
>> +				phandle_cache[handle_hash] = np;
>>  				break;
>>  			}
>>  	}
>> diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
>> index 49b16f76d78e..08fd823edac9 100644
>> --- a/drivers/of/dynamic.c
>> +++ b/drivers/of/dynamic.c
>> @@ -276,7 +276,7 @@ void __of_detach_node(struct device_node *np)
>>  	of_node_set_flag(np, OF_DETACHED);
>>  
>>  	/* race with of_find_node_by_phandle() prevented by devtree_lock */
>> -	__of_free_phandle_cache_entry(np->phandle);
>> +	__of_phandle_cache_inv_entry(np->phandle);
>>  }
>>  
>>  /**
>> diff --git a/drivers/of/of_private.h b/drivers/of/of_private.h
>> index 66294d29942a..582844c158ae 100644
>> --- a/drivers/of/of_private.h
>> +++ b/drivers/of/of_private.h
>> @@ -85,14 +85,12 @@ int of_resolve_phandles(struct device_node *tree);
>>  #endif
>>  
>>  #if defined(CONFIG_OF_DYNAMIC)
>> -void __of_free_phandle_cache_entry(phandle handle);
>> +void __of_phandle_cache_inv_entry(phandle handle);
>>  #endif
>>  
>>  #if defined(CONFIG_OF_OVERLAY)
>>  void of_overlay_mutex_lock(void);
>>  void of_overlay_mutex_unlock(void);
>> -int of_free_phandle_cache(void);
>> -void of_populate_phandle_cache(void);
>>  #else
>>  static inline void of_overlay_mutex_lock(void) {};
>>  static inline void of_overlay_mutex_unlock(void) {};
>> diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
>> index 9617b7df7c4d..97fe92c1f1d2 100644
>> --- a/drivers/of/overlay.c
>> +++ b/drivers/of/overlay.c
>> @@ -974,8 +974,6 @@ static int of_overlay_apply(const void *fdt, struct device_node *tree,
>>  		goto err_free_overlay_changeset;
>>  	}
>>  
>> -	of_populate_phandle_cache();
>> -
>>  	ret = __of_changeset_apply_notify(&ovcs->cset);
>>  	if (ret)
>>  		pr_err("overlay apply changeset entry notify error %d\n", ret);
>> @@ -1218,17 +1216,9 @@ int of_overlay_remove(int *ovcs_id)
>>  
>>  	list_del(&ovcs->ovcs_list);
>>  
>> -	/*
>> -	 * Disable phandle cache.  Avoids race condition that would arise
>> -	 * from removing cache entry when the associated node is deleted.
>> -	 */
>> -	of_free_phandle_cache();
>> -
>>  	ret_apply = 0;
>>  	ret = __of_changeset_revert_entries(&ovcs->cset, &ret_apply);
>>  
>> -	of_populate_phandle_cache();
>> -
>>  	if (ret) {
>>  		if (ret_apply)
>>  			devicetree_state_flags |= DTSF_REVERT_FAIL;
>>
> 
> 



* Re: [PATCH] of: Rework and simplify phandle cache to use a fixed size
  2019-12-11 23:48 ` Rob Herring
  2019-12-12 13:05   ` Sebastian Andrzej Siewior
@ 2019-12-19 15:31   ` Frank Rowand
  1 sibling, 0 replies; 16+ messages in thread
From: Frank Rowand @ 2019-12-19 15:31 UTC (permalink / raw)
  To: Rob Herring, devicetree
  Cc: linux-kernel, Sebastian Andrzej Siewior, Michael Ellerman,
	Segher Boessenkool, Frank Rowand

On 12/11/19 5:48 PM, Rob Herring wrote:
> On Wed, Dec 11, 2019 at 5:23 PM Rob Herring <robh@kernel.org> wrote:
>>
>> The phandle cache was added to speed up of_find_node_by_phandle() by
>> avoiding walking the whole DT to find a matching phandle. The
>> implementation has several shortcomings:
>>
>>   - The cache is designed to work on a linear set of phandle values.
>>     This is true for dtc generated DTs, but not for other cases such as
>>     Power.
>>   - The cache isn't enabled until of_core_init() and a typical system
>>     may see hundreds of calls to of_find_node_by_phandle() before that
>>     point.
>>   - The cache is freed and re-allocated when the number of phandles
>>     changes.
>>   - It takes a raw spinlock around a memory allocation which breaks on
>>     RT.
>>
>> Change the implementation to a fixed size and use hash_32() as the
>> cache index. This greatly simplifies the implementation. It avoids
>> the need for any re-alloc of the cache and taking a reference on nodes
>> in the cache. We only have a single source of removing cache entries
>> which is of_detach_node().
>>
>> Using hash_32() removes any assumption on phandle values improving
>> the hit rate for non-linear phandle values. The effect on linear values
>> using hash_32() is about a 10% collision. The chances of thrashing on
>> colliding values seems to be low.
>>
>> To compare performance, I used a RK3399 board which is a pretty typical
>> system. I found that just measuring boot time as done previously is
>> noisy and may be impacted by other things. Also bringing up secondary
>> cores causes some issues with measuring, so I booted with 'nr_cpus=1'.
>> With no caching, calls to of_find_node_by_phandle() take about 20124 us
>> for 1248 calls. There's an additional 288 calls before time keeping is
>> up. Using the average time per hit/miss with the cache, we can calculate
>> these calls to take 690 us (277 hit / 11 miss) with a 128 entry cache
>> and 13319 us with no cache or an uninitialized cache.
>>
>> Comparing the 3 implementations the time spent in
>> of_find_node_by_phandle() is:
>>
>> no cache:        20124 us (+ 13319 us)
>> 128 entry cache:  5134 us (+ 690 us)
>> current cache:     819 us (+ 13319 us)
>>
>> We could move the allocation of the cache earlier to improve the
>> current cache, but that just further complicates the situation as it
>> needs to be after slab is up, so we can't do it when unflattening (which
>> uses memblock).
>>
>> Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>> Cc: Michael Ellerman <mpe@ellerman.id.au>
>> Cc: Segher Boessenkool <segher@kernel.crashing.org>
>> Cc: Frank Rowand <frowand.list@gmail.com>
>> Signed-off-by: Rob Herring <robh@kernel.org>
>> ---
>>  drivers/of/base.c       | 133 ++++++++--------------------------------
>>  drivers/of/dynamic.c    |   2 +-
>>  drivers/of/of_private.h |   4 +-
>>  drivers/of/overlay.c    |  10 ---
>>  4 files changed, 28 insertions(+), 121 deletions(-)
> 
> [...]
> 
>> -       if (phandle_cache) {
>> -               if (phandle_cache[masked_handle] &&
>> -                   handle == phandle_cache[masked_handle]->phandle)
>> -                       np = phandle_cache[masked_handle];
>> -               if (np && of_node_check_flag(np, OF_DETACHED)) {
>> -                       WARN_ON(1); /* did not uncache np on node removal */
>> -                       of_node_put(np);
>> -                       phandle_cache[masked_handle] = NULL;
>> -                       np = NULL;
>> -               }
>> +       if (phandle_cache[handle_hash] &&
>> +           handle == phandle_cache[handle_hash]->phandle)
>> +               np = phandle_cache[handle_hash];
>> +       if (np && of_node_check_flag(np, OF_DETACHED)) {
>> +               WARN_ON(1); /* did not uncache np on node removal */
> 
> BTW, I don't think this check is even valid. If we failed to detach
> and remove the node from the cache, then we could be accessing np
> after freeing it.
> 
> Rob
> 

I added the OF_DETACHED checks out of extreme paranoia.  They can be
removed.

-Frank


* Re: [PATCH] of: Rework and simplify phandle cache to use a fixed size
  2019-12-12 13:05   ` Sebastian Andrzej Siewior
  2019-12-12 19:28     ` Rob Herring
@ 2019-12-19 15:33     ` Frank Rowand
  1 sibling, 0 replies; 16+ messages in thread
From: Frank Rowand @ 2019-12-19 15:33 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Rob Herring
  Cc: devicetree, linux-kernel, Michael Ellerman, Segher Boessenkool,
	Frank Rowand

On 12/12/19 7:05 AM, Sebastian Andrzej Siewior wrote:
> On 2019-12-11 17:48:54 [-0600], Rob Herring wrote:
>>> -       if (phandle_cache) {
>>> -               if (phandle_cache[masked_handle] &&
>>> -                   handle == phandle_cache[masked_handle]->phandle)
>>> -                       np = phandle_cache[masked_handle];
>>> -               if (np && of_node_check_flag(np, OF_DETACHED)) {
>>> -                       WARN_ON(1); /* did not uncache np on node removal */
>>> -                       of_node_put(np);
>>> -                       phandle_cache[masked_handle] = NULL;
>>> -                       np = NULL;
>>> -               }
>>> +       if (phandle_cache[handle_hash] &&
>>> +           handle == phandle_cache[handle_hash]->phandle)
>>> +               np = phandle_cache[handle_hash];
>>> +       if (np && of_node_check_flag(np, OF_DETACHED)) {
>>> +               WARN_ON(1); /* did not uncache np on node removal */
>>
>> BTW, I don't think this check is even valid. If we failed to detach
>> and remove the node from the cache, then we could be accessing np
>> after freeing it.
> 
> this is kmalloc()ed memory which is always valid. If the memory is

It was kmalloc()ed memory _before_ applying Rob's patch.  It is no
longer kmalloc()ed, so the rest of this discussion no longer applies.

-Frank

> already re-used then
> 	handle == phandle_cache[handle_hash]->phandle
> 
> will fail (the check, not the memory access itself). If the check
> remains valid then you can hope for the OF_DETACHED flag to trigger the
> warning.
> 
>> Rob
> 
> Sebastian
> 



* Re: [PATCH] of: Rework and simplify phandle cache to use a fixed size
  2019-12-11 23:23 [PATCH] of: Rework and simplify phandle cache to use a fixed size Rob Herring
                   ` (2 preceding siblings ...)
  2019-12-12 13:00 ` Sebastian Andrzej Siewior
@ 2019-12-19 15:51 ` Frank Rowand
  2020-01-07 10:22 ` Jon Hunter
  4 siblings, 0 replies; 16+ messages in thread
From: Frank Rowand @ 2019-12-19 15:51 UTC (permalink / raw)
  To: Rob Herring, devicetree
  Cc: linux-kernel, Sebastian Andrzej Siewior, Michael Ellerman,
	Segher Boessenkool

On 12/11/19 5:23 PM, Rob Herring wrote:
> The phandle cache was added to speed up of_find_node_by_phandle() by
> avoiding walking the whole DT to find a matching phandle. The
> implementation has several shortcomings:
> 
>   - The cache is designed to work on a linear set of phandle values.
>     This is true for dtc generated DTs, but not for other cases such as
>     Power.
>   - The cache isn't enabled until of_core_init() and a typical system
>     may see hundreds of calls to of_find_node_by_phandle() before that
>     point.
>   - The cache is freed and re-allocated when the number of phandles
>     changes.
>   - It takes a raw spinlock around a memory allocation which breaks on
>     RT.
> 
> Change the implementation to a fixed size and use hash_32() as the
> cache index. This greatly simplifies the implementation. It avoids
> the need for any re-alloc of the cache and taking a reference on nodes
> in the cache. We only have a single source of removing cache entries
> which is of_detach_node().
> 
> Using hash_32() removes any assumption on phandle values improving
> the hit rate for non-linear phandle values. The effect on linear values
> using hash_32() is about a 10% collision. The chances of thrashing on
> colliding values seems to be low.
> 
> To compare performance, I used a RK3399 board which is a pretty typical
> system. I found that just measuring boot time as done previously is
> noisy and may be impacted by other things. Also bringing up secondary
> cores causes some issues with measuring, so I booted with 'nr_cpus=1'.
> With no caching, calls to of_find_node_by_phandle() take about 20124 us
> for 1248 calls. There's an additional 288 calls before time keeping is
> up. Using the average time per hit/miss with the cache, we can calculate
> these calls to take 690 us (277 hit / 11 miss) with a 128 entry cache
> and 13319 us with no cache or an uninitialized cache.
> 
> Comparing the 3 implementations the time spent in
> of_find_node_by_phandle() is:
> 
> no cache:        20124 us (+ 13319 us)
> 128 entry cache:  5134 us (+ 690 us)
> current cache:     819 us (+ 13319 us)
> 
> We could move the allocation of the cache earlier to improve the
> current cache, but that just further complicates the situation as it
> needs to be after slab is up, so we can't do it when unflattening (which
> uses memblock).

The patch changes phandle_cache to be a statically declared object, not
a dynamically allocated object, so I don't see why populating the cache
could not be moved earlier.  Not a big deal though.

On another note...
I am not happy that no one has bothered to check the performance impact
on the system that was the justification for finally implementing the
phandle cache.  I added the person who reported the performance problem
(Chintan Pandya) to the distribution list for the patch, but that address
bounced and no one else from Qualcomm has commented in this thread or
the related thread that led to this patch.  I am not unhappy enough
to delay this patch.

> 
> Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Segher Boessenkool <segher@kernel.crashing.org>
> Cc: Frank Rowand <frowand.list@gmail.com>
> Signed-off-by: Rob Herring <robh@kernel.org>
> ---
>  drivers/of/base.c       | 133 ++++++++--------------------------------
>  drivers/of/dynamic.c    |   2 +-
>  drivers/of/of_private.h |   4 +-
>  drivers/of/overlay.c    |  10 ---
>  4 files changed, 28 insertions(+), 121 deletions(-)
> 
> diff --git a/drivers/of/base.c b/drivers/of/base.c
> index db7fbc0c0893..f7da162f126d 100644
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -135,115 +135,38 @@ int __weak of_node_to_nid(struct device_node *np)
>  }
>  #endif
>  
> -/*
> - * Assumptions behind phandle_cache implementation:
> - *   - phandle property values are in a contiguous range of 1..n
> - *
> - * If the assumptions do not hold, then
> - *   - the phandle lookup overhead reduction provided by the cache
> - *     will likely be less
> - */
> +#define OF_PHANDLE_CACHE_BITS	7
> +#define OF_PHANDLE_CACHE_SZ	BIT(OF_PHANDLE_CACHE_BITS)
>  
> -static struct device_node **phandle_cache;
> -static u32 phandle_cache_mask;
> +static struct device_node *phandle_cache[OF_PHANDLE_CACHE_SZ];
>  
> -/*
> - * Caller must hold devtree_lock.
> - */
> -static void __of_free_phandle_cache(void)
> +static u32 of_phandle_cache_hash(phandle handle)
>  {
> -	u32 cache_entries = phandle_cache_mask + 1;
> -	u32 k;
> -
> -	if (!phandle_cache)
> -		return;
> -
> -	for (k = 0; k < cache_entries; k++)
> -		of_node_put(phandle_cache[k]);
> -
> -	kfree(phandle_cache);
> -	phandle_cache = NULL;
> +	return hash_32(handle, OF_PHANDLE_CACHE_BITS);
>  }
>  
> -int of_free_phandle_cache(void)
> -{
> -	unsigned long flags;
> -
> -	raw_spin_lock_irqsave(&devtree_lock, flags);
> -
> -	__of_free_phandle_cache();
> -
> -	raw_spin_unlock_irqrestore(&devtree_lock, flags);
> -
> -	return 0;
> -}
> -#if !defined(CONFIG_MODULES)
> -late_initcall_sync(of_free_phandle_cache);
> -#endif
> -
>  /*
>   * Caller must hold devtree_lock.
>   */
> -void __of_free_phandle_cache_entry(phandle handle)
> +void __of_phandle_cache_inv_entry(phandle handle)
>  {
> -	phandle masked_handle;
> +	u32 handle_hash;
>  	struct device_node *np;
>  
>  	if (!handle)
>  		return;
>  
> -	masked_handle = handle & phandle_cache_mask;
> +	handle_hash = of_phandle_cache_hash(handle);
>  
> -	if (phandle_cache) {
> -		np = phandle_cache[masked_handle];
> -		if (np && handle == np->phandle) {
> -			of_node_put(np);
> -			phandle_cache[masked_handle] = NULL;
> -		}
> -	}
> -}
> -
> -void of_populate_phandle_cache(void)
> -{
> -	unsigned long flags;
> -	u32 cache_entries;
> -	struct device_node *np;
> -	u32 phandles = 0;
> -
> -	raw_spin_lock_irqsave(&devtree_lock, flags);
> -
> -	__of_free_phandle_cache();
> -
> -	for_each_of_allnodes(np)
> -		if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL)
> -			phandles++;
> -
> -	if (!phandles)
> -		goto out;
> -
> -	cache_entries = roundup_pow_of_two(phandles);
> -	phandle_cache_mask = cache_entries - 1;
> -
> -	phandle_cache = kcalloc(cache_entries, sizeof(*phandle_cache),
> -				GFP_ATOMIC);
> -	if (!phandle_cache)
> -		goto out;
> -
> -	for_each_of_allnodes(np)
> -		if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL) {
> -			of_node_get(np);
> -			phandle_cache[np->phandle & phandle_cache_mask] = np;
> -		}
> -
> -out:
> -	raw_spin_unlock_irqrestore(&devtree_lock, flags);
> +	np = phandle_cache[handle_hash];
> +	if (np && handle == np->phandle)
> +		phandle_cache[handle_hash] = NULL;
>  }
>  
>  void __init of_core_init(void)
>  {
>  	struct device_node *np;
>  
> -	of_populate_phandle_cache();
>  
>  	/* Create the kset, and register existing nodes */
>  	mutex_lock(&of_mutex);
> @@ -253,8 +176,11 @@ void __init of_core_init(void)
>  		pr_err("failed to register existing nodes\n");
>  		return;
>  	}
> -	for_each_of_allnodes(np)
> +	for_each_of_allnodes(np) {
>  		__of_attach_node_sysfs(np);
> +		if (np->phandle && !phandle_cache[of_phandle_cache_hash(np->phandle)])
> +			phandle_cache[of_phandle_cache_hash(np->phandle)] = np;
> +	}
>  	mutex_unlock(&of_mutex);
>  
>  	/* Symlink in /proc as required by userspace ABI */
> @@ -1235,36 +1161,29 @@ struct device_node *of_find_node_by_phandle(phandle handle)
>  {
>  	struct device_node *np = NULL;
>  	unsigned long flags;
> -	phandle masked_handle;
> +	u32 handle_hash;
>  
>  	if (!handle)
>  		return NULL;
>  
> +	handle_hash = of_phandle_cache_hash(handle);
> +
>  	raw_spin_lock_irqsave(&devtree_lock, flags);
>  
> -	masked_handle = handle & phandle_cache_mask;
> -
> -	if (phandle_cache) {
> -		if (phandle_cache[masked_handle] &&
> -		    handle == phandle_cache[masked_handle]->phandle)
> -			np = phandle_cache[masked_handle];
> -		if (np && of_node_check_flag(np, OF_DETACHED)) {
> -			WARN_ON(1); /* did not uncache np on node removal */
> -			of_node_put(np);
> -			phandle_cache[masked_handle] = NULL;
> -			np = NULL;
> -		}
> +	if (phandle_cache[handle_hash] &&
> +	    handle == phandle_cache[handle_hash]->phandle)
> +		np = phandle_cache[handle_hash];
> +	if (np && of_node_check_flag(np, OF_DETACHED)) {
> +		WARN_ON(1); /* did not uncache np on node removal */
> +		phandle_cache[handle_hash] = NULL;
> +		np = NULL;
>  	}
>  
>  	if (!np) {
>  		for_each_of_allnodes(np)
>  			if (np->phandle == handle &&
>  			    !of_node_check_flag(np, OF_DETACHED)) {
> -				if (phandle_cache) {
> -					/* will put when removed from cache */
> -					of_node_get(np);
> -					phandle_cache[masked_handle] = np;
> -				}
> +				phandle_cache[handle_hash] = np;
>  				break;
>  			}
>  	}
> diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
> index 49b16f76d78e..08fd823edac9 100644
> --- a/drivers/of/dynamic.c
> +++ b/drivers/of/dynamic.c
> @@ -276,7 +276,7 @@ void __of_detach_node(struct device_node *np)
>  	of_node_set_flag(np, OF_DETACHED);
>  
>  	/* race with of_find_node_by_phandle() prevented by devtree_lock */
> -	__of_free_phandle_cache_entry(np->phandle);
> +	__of_phandle_cache_inv_entry(np->phandle);
>  }
>  
>  /**
> diff --git a/drivers/of/of_private.h b/drivers/of/of_private.h
> index 66294d29942a..582844c158ae 100644
> --- a/drivers/of/of_private.h
> +++ b/drivers/of/of_private.h
> @@ -85,14 +85,12 @@ int of_resolve_phandles(struct device_node *tree);
>  #endif
>  
>  #if defined(CONFIG_OF_DYNAMIC)
> -void __of_free_phandle_cache_entry(phandle handle);
> +void __of_phandle_cache_inv_entry(phandle handle);
>  #endif

This results in gcc and sparse warnings for drivers/of/base.c when
CONFIG_OF_DYNAMIC is not defined, because base.c defines the function
unconditionally while this prototype is only visible inside the same
"#if defined(" guard.

Should the "#if defined(" be removed from of_private.h?  Or
added to base.c?
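
Either way would avoid the warnings; untested sketches of the two
options:

/* Option 1: drop the guard in of_private.h so the prototype is
 * always visible to base.c.
 */
void __of_phandle_cache_inv_entry(phandle handle);

/* Option 2: mirror the guard around the definition in base.c, since
 * the only caller is in dynamic.c.
 */
#if defined(CONFIG_OF_DYNAMIC)
void __of_phandle_cache_inv_entry(phandle handle)
{
	/* ... body as in this patch ... */
}
#endif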


>  
>  #if defined(CONFIG_OF_OVERLAY)
>  void of_overlay_mutex_lock(void);
>  void of_overlay_mutex_unlock(void);
> -int of_free_phandle_cache(void);
> -void of_populate_phandle_cache(void);
>  #else
>  static inline void of_overlay_mutex_lock(void) {};
>  static inline void of_overlay_mutex_unlock(void) {};
> diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
> index 9617b7df7c4d..97fe92c1f1d2 100644
> --- a/drivers/of/overlay.c
> +++ b/drivers/of/overlay.c
> @@ -974,8 +974,6 @@ static int of_overlay_apply(const void *fdt, struct device_node *tree,
>  		goto err_free_overlay_changeset;
>  	}
>  
> -	of_populate_phandle_cache();
> -
>  	ret = __of_changeset_apply_notify(&ovcs->cset);
>  	if (ret)
>  		pr_err("overlay apply changeset entry notify error %d\n", ret);
> @@ -1218,17 +1216,9 @@ int of_overlay_remove(int *ovcs_id)
>  
>  	list_del(&ovcs->ovcs_list);
>  
> -	/*
> -	 * Disable phandle cache.  Avoids race condition that would arise
> -	 * from removing cache entry when the associated node is deleted.
> -	 */
> -	of_free_phandle_cache();
> -
>  	ret_apply = 0;
>  	ret = __of_changeset_revert_entries(&ovcs->cset, &ret_apply);
>  
> -	of_populate_phandle_cache();
> -
>  	if (ret) {
>  		if (ret_apply)
>  			devicetree_state_flags |= DTSF_REVERT_FAIL;
> 

Reviewed-by: Frank Rowand <frowand.list@gmail.com>
Tested-by: Frank Rowand <frowand.list@gmail.com>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] of: Rework and simplify phandle cache to use a fixed size
  2019-12-11 23:23 [PATCH] of: Rework and simplify phandle cache to use a fixed size Rob Herring
                   ` (3 preceding siblings ...)
  2019-12-19 15:51 ` Frank Rowand
@ 2020-01-07 10:22 ` Jon Hunter
  2020-01-10 23:50   ` Rob Herring
  4 siblings, 1 reply; 16+ messages in thread
From: Jon Hunter @ 2020-01-07 10:22 UTC (permalink / raw)
  To: Rob Herring, devicetree
  Cc: linux-kernel, Sebastian Andrzej Siewior, Michael Ellerman,
	Segher Boessenkool, Frank Rowand, linux-tegra

Hi Rob,

On 11/12/2019 23:23, Rob Herring wrote:
> The phandle cache was added to speed up of_find_node_by_phandle() by
> avoiding walking the whole DT to find a matching phandle. The
> implementation has several shortcomings:
> 
>   - The cache is designed to work on a linear set of phandle values.
>     This is true for dtc generated DTs, but not for other cases such as
>     Power.
>   - The cache isn't enabled until of_core_init() and a typical system
>     may see hundreds of calls to of_find_node_by_phandle() before that
>     point.
>   - The cache is freed and re-allocated when the number of phandles
>     changes.
>   - It takes a raw spinlock around a memory allocation which breaks on
>     RT.
> 
> Change the implementation to a fixed size and use hash_32() as the
> cache index. This greatly simplifies the implementation. It avoids
> the need for any re-alloc of the cache and taking a reference on nodes
> in the cache. We only have a single source of removing cache entries
> which is of_detach_node().
> 
> Using hash_32() removes any assumption on phandle values improving
> the hit rate for non-linear phandle values. The effect on linear values
> using hash_32() is about a 10% collision. The chances of thrashing on
> colliding values seems to be low.
> 
> To compare performance, I used a RK3399 board which is a pretty typical
> system. I found that just measuring boot time as done previously is
> noisy and may be impacted by other things. Also bringing up secondary
> cores causes some issues with measuring, so I booted with 'nr_cpus=1'.
> With no caching, calls to of_find_node_by_phandle() take about 20124 us
> for 1248 calls. There's an additional 288 calls before time keeping is
> up. Using the average time per hit/miss with the cache, we can calculate
> these calls to take 690 us (277 hit / 11 miss) with a 128 entry cache
> and 13319 us with no cache or an uninitialized cache.
> 
> Comparing the 3 implementations the time spent in
> of_find_node_by_phandle() is:
> 
> no cache:        20124 us (+ 13319 us)
> 128 entry cache:  5134 us (+ 690 us)
> current cache:     819 us (+ 13319 us)
> 
> We could move the allocation of the cache earlier to improve the
> current cache, but that just further complicates the situation as it
> needs to be after slab is up, so we can't do it when unflattening (which
> uses memblock).
> 
> Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Segher Boessenkool <segher@kernel.crashing.org>
> Cc: Frank Rowand <frowand.list@gmail.com>
> Signed-off-by: Rob Herring <robh@kernel.org>

With next-20200106 I have noticed a regression on Tegra210 where it
appears that only one of the eMMC devices is being registered. Bisect is
pointing to this patch and reverting on top of next fixes the problem.
That is as far as I have got so far, so if you have any ideas, please
let me know. Unfortunately, there do not appear to be any obvious errors
from the bootlog.

Thanks
Jon

-- 
nvpublic

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] of: Rework and simplify phandle cache to use a fixed size
  2020-01-07 10:22 ` Jon Hunter
@ 2020-01-10 23:50   ` Rob Herring
  2020-01-13 11:12     ` Jon Hunter
  0 siblings, 1 reply; 16+ messages in thread
From: Rob Herring @ 2020-01-10 23:50 UTC (permalink / raw)
  To: Jon Hunter
  Cc: devicetree, linux-kernel, Sebastian Andrzej Siewior,
	Michael Ellerman, Segher Boessenkool, Frank Rowand, linux-tegra

On Tue, Jan 7, 2020 at 4:22 AM Jon Hunter <jonathanh@nvidia.com> wrote:
>
> Hi Rob,
>
> On 11/12/2019 23:23, Rob Herring wrote:
> > The phandle cache was added to speed up of_find_node_by_phandle() by
> > avoiding walking the whole DT to find a matching phandle. The
> > implementation has several shortcomings:
> >
> >   - The cache is designed to work on a linear set of phandle values.
> >     This is true for dtc generated DTs, but not for other cases such as
> >     Power.
> >   - The cache isn't enabled until of_core_init() and a typical system
> >     may see hundreds of calls to of_find_node_by_phandle() before that
> >     point.
> >   - The cache is freed and re-allocated when the number of phandles
> >     changes.
> >   - It takes a raw spinlock around a memory allocation which breaks on
> >     RT.
> >
> > Change the implementation to a fixed size and use hash_32() as the
> > cache index. This greatly simplifies the implementation. It avoids
> > the need for any re-alloc of the cache and taking a reference on nodes
> > in the cache. We only have a single source of removing cache entries
> > which is of_detach_node().
> >
> > Using hash_32() removes any assumption on phandle values improving
> > the hit rate for non-linear phandle values. The effect on linear values
> > using hash_32() is about a 10% collision. The chances of thrashing on
> > colliding values seems to be low.
> >
> > To compare performance, I used a RK3399 board which is a pretty typical
> > system. I found that just measuring boot time as done previously is
> > noisy and may be impacted by other things. Also bringing up secondary
> > cores causes some issues with measuring, so I booted with 'nr_cpus=1'.
> > With no caching, calls to of_find_node_by_phandle() take about 20124 us
> > for 1248 calls. There's an additional 288 calls before time keeping is
> > up. Using the average time per hit/miss with the cache, we can calculate
> > these calls to take 690 us (277 hit / 11 miss) with a 128 entry cache
> > and 13319 us with no cache or an uninitialized cache.
> >
> > Comparing the 3 implementations the time spent in
> > of_find_node_by_phandle() is:
> >
> > no cache:        20124 us (+ 13319 us)
> > 128 entry cache:  5134 us (+ 690 us)
> > current cache:     819 us (+ 13319 us)
> >
> > We could move the allocation of the cache earlier to improve the
> > current cache, but that just further complicates the situation as it
> > needs to be after slab is up, so we can't do it when unflattening (which
> > uses memblock).
> >
> > Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > Cc: Michael Ellerman <mpe@ellerman.id.au>
> > Cc: Segher Boessenkool <segher@kernel.crashing.org>
> > Cc: Frank Rowand <frowand.list@gmail.com>
> > Signed-off-by: Rob Herring <robh@kernel.org>
>
> With next-20200106 I have noticed a regression on Tegra210 where it
> appears that only one of the eMMC devices is being registered. Bisect is
> pointing to this patch and reverting on top of next fixes the problem.
> That is as far as I have got so far, so if you have any ideas, please
> let me know. Unfortunately, there do not appear to be any obvious errors
> from the bootlog.

I guess that's tegra210-p2371-2180.dts because none of the others have
2 SD hosts enabled. I don't see anything obvious though. Are you doing
any runtime mods to the DT?

Can you try just removing the cache lookup at the beginning? I'm pretty
sure that will work, as that's pretty much what we had before any
cache. Next, try dumping out the phandle values and node pointer values.
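
Roughly, the first experiment is just this (untested) change to
of_find_node_by_phandle() from the patch, leaving everything else
as is:

	raw_spin_lock_irqsave(&devtree_lock, flags);

	/*
	 * Debug experiment: skip the cache hit check so every lookup
	 * falls through to the full tree walk, as before any caching.
	 */
	for_each_of_allnodes(np)
		if (np->phandle == handle &&
		    !of_node_check_flag(np, OF_DETACHED)) {
			phandle_cache[handle_hash] = np;
			break;
		}
	/* rest of the function unchanged */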

Rob

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] of: Rework and simplify phandle cache to use a fixed size
  2020-01-10 23:50   ` Rob Herring
@ 2020-01-13 11:12     ` Jon Hunter
  2020-04-14 15:00       ` Rob Herring
  0 siblings, 1 reply; 16+ messages in thread
From: Jon Hunter @ 2020-01-13 11:12 UTC (permalink / raw)
  To: Rob Herring
  Cc: devicetree, linux-kernel, Sebastian Andrzej Siewior,
	Michael Ellerman, Segher Boessenkool, Frank Rowand, linux-tegra


On 10/01/2020 23:50, Rob Herring wrote:
> On Tue, Jan 7, 2020 at 4:22 AM Jon Hunter <jonathanh@nvidia.com> wrote:
>>
>> Hi Rob,
>>
>> On 11/12/2019 23:23, Rob Herring wrote:
>>> The phandle cache was added to speed up of_find_node_by_phandle() by
>>> avoiding walking the whole DT to find a matching phandle. The
>>> implementation has several shortcomings:
>>>
>>>   - The cache is designed to work on a linear set of phandle values.
>>>     This is true for dtc generated DTs, but not for other cases such as
>>>     Power.
>>>   - The cache isn't enabled until of_core_init() and a typical system
>>>     may see hundreds of calls to of_find_node_by_phandle() before that
>>>     point.
>>>   - The cache is freed and re-allocated when the number of phandles
>>>     changes.
>>>   - It takes a raw spinlock around a memory allocation which breaks on
>>>     RT.
>>>
>>> Change the implementation to a fixed size and use hash_32() as the
>>> cache index. This greatly simplifies the implementation. It avoids
>>> the need for any re-alloc of the cache and taking a reference on nodes
>>> in the cache. We only have a single source of removing cache entries
>>> which is of_detach_node().
>>>
>>> Using hash_32() removes any assumption on phandle values improving
>>> the hit rate for non-linear phandle values. The effect on linear values
>>> using hash_32() is about a 10% collision. The chances of thrashing on
>>> colliding values seems to be low.
>>>
>>> To compare performance, I used a RK3399 board which is a pretty typical
>>> system. I found that just measuring boot time as done previously is
>>> noisy and may be impacted by other things. Also bringing up secondary
>>> cores causes some issues with measuring, so I booted with 'nr_cpus=1'.
>>> With no caching, calls to of_find_node_by_phandle() take about 20124 us
>>> for 1248 calls. There's an additional 288 calls before time keeping is
>>> up. Using the average time per hit/miss with the cache, we can calculate
>>> these calls to take 690 us (277 hit / 11 miss) with a 128 entry cache
>>> and 13319 us with no cache or an uninitialized cache.
>>>
>>> Comparing the 3 implementations the time spent in
>>> of_find_node_by_phandle() is:
>>>
>>> no cache:        20124 us (+ 13319 us)
>>> 128 entry cache:  5134 us (+ 690 us)
>>> current cache:     819 us (+ 13319 us)
>>>
>>> We could move the allocation of the cache earlier to improve the
>>> current cache, but that just further complicates the situation as it
>>> needs to be after slab is up, so we can't do it when unflattening (which
>>> uses memblock).
>>>
>>> Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>>> Cc: Michael Ellerman <mpe@ellerman.id.au>
>>> Cc: Segher Boessenkool <segher@kernel.crashing.org>
>>> Cc: Frank Rowand <frowand.list@gmail.com>
>>> Signed-off-by: Rob Herring <robh@kernel.org>
>>
>> With next-20200106 I have noticed a regression on Tegra210 where it
>> appears that only one of the eMMC devices is being registered. Bisect is
>> pointing to this patch and reverting on top of next fixes the problem.
>> That is as far as I have got so far, so if you have any ideas, please
>> let me know. Unfortunately, there do not appear to be any obvious errors
>> from the bootlog.
> 
> I guess that's tegra210-p2371-2180.dts because none of the others have
> 2 SD hosts enabled. I don't see anything obvious though. Are you doing
> any runtime mods to the DT?

I have noticed that the bootloader is doing some runtime modifications
to the DT, so I am checking whether this is the cause. I will let you
know, but it most likely is, seeing as I cannot find anything wrong
with this change itself.

Cheers
Jon

-- 
nvpublic

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] of: Rework and simplify phandle cache to use a fixed size
  2020-01-13 11:12     ` Jon Hunter
@ 2020-04-14 15:00       ` Rob Herring
  2020-04-14 19:43         ` Jon Hunter
  0 siblings, 1 reply; 16+ messages in thread
From: Rob Herring @ 2020-04-14 15:00 UTC (permalink / raw)
  To: Jon Hunter, Karol Herbst
  Cc: devicetree, linux-kernel, Sebastian Andrzej Siewior,
	Michael Ellerman, Segher Boessenkool, Frank Rowand, linux-tegra

+Karol

On Mon, Jan 13, 2020 at 5:12 AM Jon Hunter <jonathanh@nvidia.com> wrote:
>
>
> On 10/01/2020 23:50, Rob Herring wrote:
> > On Tue, Jan 7, 2020 at 4:22 AM Jon Hunter <jonathanh@nvidia.com> wrote:
> >>
> >> Hi Rob,
> >>
> >> On 11/12/2019 23:23, Rob Herring wrote:
> >>> The phandle cache was added to speed up of_find_node_by_phandle() by
> >>> avoiding walking the whole DT to find a matching phandle. The
> >>> implementation has several shortcomings:
> >>>
> >>>   - The cache is designed to work on a linear set of phandle values.
> >>>     This is true for dtc generated DTs, but not for other cases such as
> >>>     Power.
> >>>   - The cache isn't enabled until of_core_init() and a typical system
> >>>     may see hundreds of calls to of_find_node_by_phandle() before that
> >>>     point.
> >>>   - The cache is freed and re-allocated when the number of phandles
> >>>     changes.
> >>>   - It takes a raw spinlock around a memory allocation which breaks on
> >>>     RT.
> >>>
> >>> Change the implementation to a fixed size and use hash_32() as the
> >>> cache index. This greatly simplifies the implementation. It avoids
> >>> the need for any re-alloc of the cache and taking a reference on nodes
> >>> in the cache. We only have a single source of removing cache entries
> >>> which is of_detach_node().
> >>>
> >>> Using hash_32() removes any assumption on phandle values improving
> >>> the hit rate for non-linear phandle values. The effect on linear values
> >>> using hash_32() is about a 10% collision. The chances of thrashing on
> >>> colliding values seems to be low.
> >>>
> >>> To compare performance, I used a RK3399 board which is a pretty typical
> >>> system. I found that just measuring boot time as done previously is
> >>> noisy and may be impacted by other things. Also bringing up secondary
> >>> cores causes some issues with measuring, so I booted with 'nr_cpus=1'.
> >>> With no caching, calls to of_find_node_by_phandle() take about 20124 us
> >>> for 1248 calls. There's an additional 288 calls before time keeping is
> >>> up. Using the average time per hit/miss with the cache, we can calculate
> >>> these calls to take 690 us (277 hit / 11 miss) with a 128 entry cache
> >>> and 13319 us with no cache or an uninitialized cache.
> >>>
> >>> Comparing the 3 implementations the time spent in
> >>> of_find_node_by_phandle() is:
> >>>
> >>> no cache:        20124 us (+ 13319 us)
> >>> 128 entry cache:  5134 us (+ 690 us)
> >>> current cache:     819 us (+ 13319 us)
> >>>
> >>> We could move the allocation of the cache earlier to improve the
> >>> current cache, but that just further complicates the situation as it
> >>> needs to be after slab is up, so we can't do it when unflattening (which
> >>> uses memblock).
> >>>
> >>> Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> >>> Cc: Michael Ellerman <mpe@ellerman.id.au>
> >>> Cc: Segher Boessenkool <segher@kernel.crashing.org>
> >>> Cc: Frank Rowand <frowand.list@gmail.com>
> >>> Signed-off-by: Rob Herring <robh@kernel.org>
> >>
> >> With next-20200106 I have noticed a regression on Tegra210 where it
> >> appears that only one of the eMMC devices is being registered. Bisect is
> >> pointing to this patch and reverting on top of next fixes the problem.
> >> That is as far as I have got so far, so if you have any ideas, please
> >> let me know. Unfortunately, there do not appear to be any obvious errors
> >> from the bootlog.
> >
> > I guess that's tegra210-p2371-2180.dts because none of the others have
> > 2 SD hosts enabled. I don't see anything obvious though. Are you doing
> > any runtime mods to the DT?
>
> I have noticed that the bootloader is doing some runtime mods and so
> checking if this is the cause. I will let you know, but most likely,
> seeing as I cannot find anything wrong with this change itself.

Did you figure out the problem here? Karol sees a similar problem on
Tegra210 with the gpu node regulator.

It looks like /external-memory-controller@7001b000 has a duplicate
phandle. Comparing the dtb in the filesystem with what the kernel
gets, that node is added by the bootloader. So the bootloader is
definitely creating a broken dtb.
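
If it helps for confirming this kind of breakage, an untested debug
hack along these lines (O(n^2), so boot-time debugging only, run before
anything can modify the tree) would flag any duplicate phandles:

static void __init of_warn_duplicate_phandles(void)
{
	struct device_node *np, *dup;

	for_each_of_allnodes(np) {
		if (!np->phandle || np->phandle == OF_PHANDLE_ILLEGAL)
			continue;
		/* Compare against every node that follows np. */
		for_each_of_allnodes_from(np, dup)
			if (dup->phandle == np->phandle)
				pr_warn("duplicate phandle %u: %pOF and %pOF\n",
					np->phandle, np, dup);
	}
}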

Rob

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] of: Rework and simplify phandle cache to use a fixed size
  2020-04-14 15:00       ` Rob Herring
@ 2020-04-14 19:43         ` Jon Hunter
  0 siblings, 0 replies; 16+ messages in thread
From: Jon Hunter @ 2020-04-14 19:43 UTC (permalink / raw)
  To: Rob Herring, Karol Herbst
  Cc: devicetree, linux-kernel, Sebastian Andrzej Siewior,
	Michael Ellerman, Segher Boessenkool, Frank Rowand, linux-tegra

On 14/04/2020 16:00, Rob Herring wrote:
> +Karol
> 
> On Mon, Jan 13, 2020 at 5:12 AM Jon Hunter <jonathanh@nvidia.com> wrote:
>>
>>
>> On 10/01/2020 23:50, Rob Herring wrote:
>>> On Tue, Jan 7, 2020 at 4:22 AM Jon Hunter <jonathanh@nvidia.com> wrote:
>>>>
>>>> Hi Rob,
>>>>
>>>> On 11/12/2019 23:23, Rob Herring wrote:
>>>>> The phandle cache was added to speed up of_find_node_by_phandle() by
>>>>> avoiding walking the whole DT to find a matching phandle. The
>>>>> implementation has several shortcomings:
>>>>>
>>>>>   - The cache is designed to work on a linear set of phandle values.
>>>>>     This is true for dtc generated DTs, but not for other cases such as
>>>>>     Power.
>>>>>   - The cache isn't enabled until of_core_init() and a typical system
>>>>>     may see hundreds of calls to of_find_node_by_phandle() before that
>>>>>     point.
>>>>>   - The cache is freed and re-allocated when the number of phandles
>>>>>     changes.
>>>>>   - It takes a raw spinlock around a memory allocation which breaks on
>>>>>     RT.
>>>>>
>>>>> Change the implementation to a fixed size and use hash_32() as the
>>>>> cache index. This greatly simplifies the implementation. It avoids
>>>>> the need for any re-alloc of the cache and taking a reference on nodes
>>>>> in the cache. We only have a single source of removing cache entries
>>>>> which is of_detach_node().
>>>>>
>>>>> Using hash_32() removes any assumption on phandle values improving
>>>>> the hit rate for non-linear phandle values. The effect on linear values
>>>>> using hash_32() is about a 10% collision. The chances of thrashing on
>>>>> colliding values seems to be low.
>>>>>
>>>>> To compare performance, I used a RK3399 board which is a pretty typical
>>>>> system. I found that just measuring boot time as done previously is
>>>>> noisy and may be impacted by other things. Also bringing up secondary
>>>>> cores causes some issues with measuring, so I booted with 'nr_cpus=1'.
>>>>> With no caching, calls to of_find_node_by_phandle() take about 20124 us
>>>>> for 1248 calls. There's an additional 288 calls before time keeping is
>>>>> up. Using the average time per hit/miss with the cache, we can calculate
>>>>> these calls to take 690 us (277 hit / 11 miss) with a 128 entry cache
>>>>> and 13319 us with no cache or an uninitialized cache.
>>>>>
>>>>> Comparing the 3 implementations the time spent in
>>>>> of_find_node_by_phandle() is:
>>>>>
>>>>> no cache:        20124 us (+ 13319 us)
>>>>> 128 entry cache:  5134 us (+ 690 us)
>>>>> current cache:     819 us (+ 13319 us)
>>>>>
>>>>> We could move the allocation of the cache earlier to improve the
>>>>> current cache, but that just further complicates the situation as it
>>>>> needs to be after slab is up, so we can't do it when unflattening (which
>>>>> uses memblock).
>>>>>
>>>>> Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>>>>> Cc: Michael Ellerman <mpe@ellerman.id.au>
>>>>> Cc: Segher Boessenkool <segher@kernel.crashing.org>
>>>>> Cc: Frank Rowand <frowand.list@gmail.com>
>>>>> Signed-off-by: Rob Herring <robh@kernel.org>
>>>>
>>>> With next-20200106 I have noticed a regression on Tegra210 where it
>>>> appears that only one of the eMMC devices is being registered. Bisect is
>>>> pointing to this patch and reverting on top of next fixes the problem.
>>>> That is as far as I have got so far, so if you have any ideas, please
>>>> let me know. Unfortunately, there do not appear to be any obvious errors
>>>> from the bootlog.
>>>
>>> I guess that's tegra210-p2371-2180.dts because none of the others have
>>> 2 SD hosts enabled. I don't see anything obvious though. Are you doing
>>> any runtime mods to the DT?
>>
>> I have noticed that the bootloader is doing some runtime mods and so
>> checking if this is the cause. I will let you know, but most likely,
>> seeing as I cannot find anything wrong with this change itself.
> 
> Did you figure out the problem here? Karol sees a similar problem on
> Tegra210 with the gpu node regulator.
> 
> It looks like /external-memory-controller@7001b000 has a duplicate
> phandle. Comparing the dtb in the filesystem with what the kernel
> gets, that node is added by the bootloader. So the bootloader is
> definitely creating a broken dtb.

Yes, it was caused by the bootloader (u-boot) incorrectly copying some
nodes. After preventing u-boot from doing that, it was fine. There are
some u-boot environment variables [0] that you can try clearing to
prevent this. Alternatively, using an upstream u-boot should also work.

Cheers
Jon

[0] https://elinux.org/Jetson/TX1_Upstream_Kernel#Upstream_Linux_kernel

-- 
nvpublic

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2020-04-14 19:43 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-11 23:23 [PATCH] of: Rework and simplify phandle cache to use a fixed size Rob Herring
2019-12-11 23:48 ` Rob Herring
2019-12-12 13:05   ` Sebastian Andrzej Siewior
2019-12-12 19:28     ` Rob Herring
2019-12-18  9:47       ` Sebastian Andrzej Siewior
2019-12-19 15:33     ` Frank Rowand
2019-12-19 15:31   ` Frank Rowand
2019-12-12 11:50 ` Frank Rowand
2019-12-19  3:38   ` Frank Rowand
2019-12-12 13:00 ` Sebastian Andrzej Siewior
2019-12-19 15:51 ` Frank Rowand
2020-01-07 10:22 ` Jon Hunter
2020-01-10 23:50   ` Rob Herring
2020-01-13 11:12     ` Jon Hunter
2020-04-14 15:00       ` Rob Herring
2020-04-14 19:43         ` Jon Hunter
