All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] avoid a live lock in wear_leveling_worker()
@ 2015-11-23 18:09 Sebastian Andrzej Siewior
  2015-11-23 18:09 ` [RFC PATCH 1/2] mtd: nand: schedule() after releasing the device Sebastian Andrzej Siewior
  2015-11-23 18:09 ` [RFC PATCH 2/2] mtd: ubi: wl: avoid erasing a PEB which is empty Sebastian Andrzej Siewior
  0 siblings, 2 replies; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-11-23 18:09 UTC (permalink / raw)
  To: linux-mtd
  Cc: David Woodhouse, Brian Norris, Artem Bityutskiy,
	Richard Weinberger, tglx, Peter Zijlstra

[ NOTE:
  This has been debugged + tested on v3.0 and then forward ported to
  v4.0-rc2. From what I see the problem should still occur whith proper
  timming.
]

On boot I managed run into a live lock situation within UBI. For $reason
UBI triggered ensure_wear_leveling() and decided to move a block.

It reads the VID header and then invokes ubi_eba_copy_leb(). It tries to
leb_write_trylock() but the PEB is already locked by `sshd' (which is in
UBI-read path, waiting for the NAND device) so it returns MOVE_RETRY.
Based on MOVE_RETRY UBI decides to invoked schedule_erase() on the empty
PEB.
The erase_worker() shows up on the CPU almost right away. It erases the
PEB via sync_erase() and then invokes ensure_wear_leveling() and decides
to move a block. The situation repeats as described in the previous
chapter. The only thing changed since the previous iteration is the EC
counter.

UBI expects that MOVE_RETRY is a temporary situation while here it is
persistent.
Each invocation of schedule_erase() invokes nand_get_device() which
succeeds so the UBI background threads owns the NAND device and can erase
a block. During the erase process the UBI threads schedules away while the
NAND drivers waits for an interrupt. While idle, the scheduler puts run able
tasks on the CPU. Two of them are waiting on the wait queue in
nand_get_device() but did not yet have the chance to run since the last
nand_release_device() invocation. One of them owns the PEB lock that causes
ubi_eba_copy_leb() return MOVE_RETRY.
After the erase process completes the NAND device is released via
nand_release_device() and the two waiting tasks receive a wake up. They
are not put on the CPU right away and the UBI tasks continues (no need
resched flag set).
The UBI task tries to copy the block but because it can't get the PEB
lock it schedules the PEB for erasing and gets the nand_get_device()
immediately.
The live lock happens because the condition which was true at the wake up
time is no longer true at the time the waiting task managed to get on the
CPU.
>From a little tracing:

|18596594us : ensure_wear_leveling: schedule scrubbing
|18596609us : wear_leveling_worker: scrub PEB 1132 to PEB 640
|18596627us : __schedule: ubi_bgt0d -> ubifs_bgt0_2
|18596637us : __schedule: ubifs_bgt0_2 -> sshd
|18596645us : __schedule: sshd -> cat
|18596653us : __schedule: cat -> swapper
|18596839us : __schedule: swapper -> ubi_bgt0d
|18596855us : __schedule: ubi_bgt0d -> swapper
|18597045us : __schedule: swapper -> ubi_bgt0d
|18597073us : __schedule: ubi_bgt0d -> swapper
|18597263us : __schedule: swapper -> ubi_bgt0d
|18597280us : default_wake_function: cat
|18597286us : default_wake_function: sshd
|18597293us : default_wake_function: ubifs_bgt0_2 
  ^read block within UBI

|18597304us : ubi_eba_copy_leb: copy LEB 2:486, PEB 1132 to PEB 640
|18597311us : ubi_eba_copy_leb: Lock woned by process: sshd
|18597315us : ubi_eba_copy_leb: contention on LEB 2:486, cancel
|18597318us : wear_leveling_worker: MOVE_RETRY
|18597323us : wear_leveling_worker: cancel moving PEB 1132 (LEB 2:486) to PEB 640
|18597329us : schedule_erase: schedule erasure of PEB 640, EC 3136, torture 0
|18597336us : erase_worker: erase PEB 640 EC 3136
|18597340us : erase_worker: erase PEB 640, old EC 3136
|18597349us : nand_erase_nand: start = 0x000005500000, len = 131072
^ erase in progress

|18597366us : __schedule: ubi_bgt0d -> sshd
|18597376us : __schedule: sshd -> ubifs_bgt0_2
|18597384us : __schedule: ubifs_bgt0_2 -> cat
|18597391us : __schedule: cat -> swapper
|18597556us : __schedule: swapper -> ubi_bgt0d
|18597574us : __schedule: ubi_bgt0d -> swapper
|18598376us : __schedule: swapper -> ubi_bgt0d
|18598391us : __schedule: ubi_bgt0d -> swapper
|18598580us : __schedule: swapper -> ubi_bgt0d
|18598594us : default_wake_function: cat
|18598601us : default_wake_function: ubifs_bgt0_2
|18598607us : default_wake_function: sshd

The next thing that happens once the erase completes is
=> ensure_wear_leveling: schedule scrubbing

Patch #1 avoids the live lock by invoking schedule() in case there is
someone in the waitqueue so the waiter has a chance to get on the CPU
while the device is free.
Patch #2 is an optimization so the free (and empty) EB is not erased if
nothing was written into it.

Sebastian

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH 1/2] mtd: nand: schedule() after releasing the device
  2015-11-23 18:09 [RFC] avoid a live lock in wear_leveling_worker() Sebastian Andrzej Siewior
@ 2015-11-23 18:09 ` Sebastian Andrzej Siewior
  2015-11-23 18:18   ` Peter Zijlstra
  2015-11-23 18:09 ` [RFC PATCH 2/2] mtd: ubi: wl: avoid erasing a PEB which is empty Sebastian Andrzej Siewior
  1 sibling, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-11-23 18:09 UTC (permalink / raw)
  To: linux-mtd
  Cc: David Woodhouse, Brian Norris, Artem Bityutskiy,
	Richard Weinberger, tglx, Peter Zijlstra,
	Sebastian Andrzej Siewior

I have here a live lock in UBI doing
  ensure_wear_leveling() -> wear_leveling_worker() -> ubi_eba_copy_leb()
  MOVE_RETRY -> schedule_erase() -> ensure_wear_leveling()

on the same PEB over and over again. The reason for MOVE_RETRY is that
the LEB-Lock owner is stucked in nand_get_device() and does not get the
device lock. The PEB-lock owner is only scheduled on the CPU while the UBI
thread is idle during erase or read while (again) owning the device-lock
so the LEB-lock owner makes no progress.

To fix this live lock the patch adds a schedule() invocation if the wait
queue for the nand-device lock is not empty so the waiter can grab the
lock and make progress.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/mtd/nand/nand_base.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c
index ece544efccc3..3dc2dff01802 100644
--- a/drivers/mtd/nand/nand_base.c
+++ b/drivers/mtd/nand/nand_base.c
@@ -133,13 +133,23 @@ static int check_offs_len(struct mtd_info *mtd,
 static void nand_release_device(struct mtd_info *mtd)
 {
 	struct nand_chip *chip = mtd->priv;
+	bool do_sched = false;
 
 	/* Release the controller and the chip */
 	spin_lock(&chip->controller->lock);
 	chip->controller->active = NULL;
 	chip->state = FL_READY;
+	/*
+	 * Check if we have a waiter. If so we will schedule() right away so the
+	 * waiter can grab the device while it is released and not after _this_
+	 * caller gained the device (again) without leaving the CPU in between.
+	 */
+	if (waitqueue_active(&chip->controller->wq))
+		do_sched = true;
 	wake_up(&chip->controller->wq);
 	spin_unlock(&chip->controller->lock);
+	if (do_sched)
+		schedule();
 }
 
 /**
-- 
2.6.2

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 2/2] mtd: ubi: wl: avoid erasing a PEB which is empty
  2015-11-23 18:09 [RFC] avoid a live lock in wear_leveling_worker() Sebastian Andrzej Siewior
  2015-11-23 18:09 ` [RFC PATCH 1/2] mtd: nand: schedule() after releasing the device Sebastian Andrzej Siewior
@ 2015-11-23 18:09 ` Sebastian Andrzej Siewior
  2015-11-23 21:30   ` Richard Weinberger
  2015-11-24 12:58   ` Artem Bityutskiy
  1 sibling, 2 replies; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-11-23 18:09 UTC (permalink / raw)
  To: linux-mtd
  Cc: David Woodhouse, Brian Norris, Artem Bityutskiy,
	Richard Weinberger, tglx, Peter Zijlstra,
	Sebastian Andrzej Siewior

wear_leveling_worker() currently unconditionally puts a PEB on erase in
the error case even it just been taken from the free_list and never
used.
In case the PEB was never used it can be put back on the free list
saving an erase cycle.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/mtd/ubi/wl.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/mtd/ubi/wl.c b/drivers/mtd/ubi/wl.c
index eb4489f9082f..709ca27e103c 100644
--- a/drivers/mtd/ubi/wl.c
+++ b/drivers/mtd/ubi/wl.c
@@ -631,6 +631,7 @@ static int do_sync_erase(struct ubi_device *ubi, struct ubi_wl_entry *e,
 	return erase_worker(ubi, wl_wrk, 0);
 }
 
+static int ensure_wear_leveling(struct ubi_device *ubi, int nested);
 /**
  * wear_leveling_worker - wear-leveling worker function.
  * @ubi: UBI device description object
@@ -652,6 +653,7 @@ static int wear_leveling_worker(struct ubi_device *ubi, struct ubi_work *wrk,
 #endif
 	struct ubi_wl_entry *e1, *e2;
 	struct ubi_vid_hdr *vid_hdr;
+	int to_leb_clean = 0;
 
 	kfree(wrk);
 	if (shutdown)
@@ -756,6 +758,7 @@ static int wear_leveling_worker(struct ubi_device *ubi, struct ubi_work *wrk,
 
 	err = ubi_io_read_vid_hdr(ubi, e1->pnum, vid_hdr, 0);
 	if (err && err != UBI_IO_BITFLIPS) {
+		to_leb_clean = 1;
 		if (err == UBI_IO_FF) {
 			/*
 			 * We are trying to move PEB without a VID header. UBI
@@ -801,10 +804,12 @@ static int wear_leveling_worker(struct ubi_device *ubi, struct ubi_work *wrk,
 			 * protection queue.
 			 */
 			protect = 1;
+			to_leb_clean = 1;
 			goto out_not_moved;
 		}
 		if (err == MOVE_RETRY) {
 			scrubbing = 1;
+			to_leb_clean = 1;
 			goto out_not_moved;
 		}
 		if (err == MOVE_TARGET_BITFLIPS || err == MOVE_TARGET_WR_ERR ||
@@ -830,6 +835,7 @@ static int wear_leveling_worker(struct ubi_device *ubi, struct ubi_work *wrk,
 					ubi->erroneous_peb_count);
 				goto out_error;
 			}
+			to_leb_clean = 1;
 			erroneous = 1;
 			goto out_not_moved;
 		}
@@ -900,15 +906,26 @@ static int wear_leveling_worker(struct ubi_device *ubi, struct ubi_work *wrk,
 		wl_tree_add(e1, &ubi->scrub);
 	else
 		wl_tree_add(e1, &ubi->used);
+	if (to_leb_clean) {
+		wl_tree_add(e2, &ubi->free);
+		ubi->free_count++;
+	}
+
 	ubi_assert(!ubi->move_to_put);
 	ubi->move_from = ubi->move_to = NULL;
 	ubi->wl_scheduled = 0;
 	spin_unlock(&ubi->wl_lock);
 
 	ubi_free_vid_hdr(ubi, vid_hdr);
-	err = do_sync_erase(ubi, e2, vol_id, lnum, torture);
-	if (err)
-		goto out_ro;
+	if (to_leb_clean) {
+		ensure_wear_leveling(ubi, 0);
+	} else {
+		err = do_sync_erase(ubi, e2, vol_id, lnum, torture);
+		if (err) {
+			wl_entry_destroy(ubi, e2);
+			goto out_ro;
+		}
+	}
 
 	mutex_unlock(&ubi->move_mutex);
 	return 0;
-- 
2.6.2

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 1/2] mtd: nand: schedule() after releasing the device
  2015-11-23 18:09 ` [RFC PATCH 1/2] mtd: nand: schedule() after releasing the device Sebastian Andrzej Siewior
@ 2015-11-23 18:18   ` Peter Zijlstra
  2015-11-25 17:35     ` [PATCH] mtd: nand: do FIFO processing in nand_get_device() Sebastian Andrzej Siewior
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2015-11-23 18:18 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-mtd, David Woodhouse, Brian Norris, Artem Bityutskiy,
	Richard Weinberger, tglx

On Mon, Nov 23, 2015 at 07:09:06PM +0100, Sebastian Andrzej Siewior wrote:
>  	/* Release the controller and the chip */
>  	spin_lock(&chip->controller->lock);
>  	chip->controller->active = NULL;
>  	chip->state = FL_READY;
> +	/*
> +	 * Check if we have a waiter. If so we will schedule() right away so the
> +	 * waiter can grab the device while it is released and not after _this_
> +	 * caller gained the device (again) without leaving the CPU in between.
> +	 */
> +	if (waitqueue_active(&chip->controller->wq))
> +		do_sched = true;
>  	wake_up(&chip->controller->wq);
>  	spin_unlock(&chip->controller->lock);
> +	if (do_sched)
> +		schedule();

I've not looked at the code, but this _cannot_ be a correct fix.

schedule() can be a no-op, that is, current can be the most eligible
task to run and be selected again.

(with FIFO and this the highest prio task in the system that is a
guarantee)

I'll try and have an actual look at the problem description later.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] mtd: ubi: wl: avoid erasing a PEB which is empty
  2015-11-23 18:09 ` [RFC PATCH 2/2] mtd: ubi: wl: avoid erasing a PEB which is empty Sebastian Andrzej Siewior
@ 2015-11-23 21:30   ` Richard Weinberger
  2015-11-23 21:50     ` Richard Weinberger
  2015-11-24  8:26     ` Sebastian Andrzej Siewior
  2015-11-24 12:58   ` Artem Bityutskiy
  1 sibling, 2 replies; 23+ messages in thread
From: Richard Weinberger @ 2015-11-23 21:30 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-mtd
  Cc: David Woodhouse, Brian Norris, Artem Bityutskiy, tglx, Peter Zijlstra

Sebastian,

Am 23.11.2015 um 19:09 schrieb Sebastian Andrzej Siewior:
> wear_leveling_worker() currently unconditionally puts a PEB on erase in
> the error case even it just been taken from the free_list and never
> used.
> In case the PEB was never used it can be put back on the free list
> saving an erase cycle.

Makes sense, thanks for pointing out this optimization!

> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  drivers/mtd/ubi/wl.c | 23 ++++++++++++++++++++---
>  1 file changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/mtd/ubi/wl.c b/drivers/mtd/ubi/wl.c
> index eb4489f9082f..709ca27e103c 100644
> --- a/drivers/mtd/ubi/wl.c
> +++ b/drivers/mtd/ubi/wl.c
> @@ -631,6 +631,7 @@ static int do_sync_erase(struct ubi_device *ubi, struct ubi_wl_entry *e,
>  	return erase_worker(ubi, wl_wrk, 0);
>  }
>  
> +static int ensure_wear_leveling(struct ubi_device *ubi, int nested);
>  /**
>   * wear_leveling_worker - wear-leveling worker function.
>   * @ubi: UBI device description object
> @@ -652,6 +653,7 @@ static int wear_leveling_worker(struct ubi_device *ubi, struct ubi_work *wrk,
>  #endif
>  	struct ubi_wl_entry *e1, *e2;
>  	struct ubi_vid_hdr *vid_hdr;
> +	int to_leb_clean = 0;

Can we please rename this variable?
I know it means "the `to` LEB is clean". But for everybody else this reads as "I'm going to clean a LEB". ;)

>  	kfree(wrk);
>  	if (shutdown)
> @@ -756,6 +758,7 @@ static int wear_leveling_worker(struct ubi_device *ubi, struct ubi_work *wrk,
>  
>  	err = ubi_io_read_vid_hdr(ubi, e1->pnum, vid_hdr, 0);
>  	if (err && err != UBI_IO_BITFLIPS) {
> +		to_leb_clean = 1;
>  		if (err == UBI_IO_FF) {
>  			/*
>  			 * We are trying to move PEB without a VID header. UBI
> @@ -801,10 +804,12 @@ static int wear_leveling_worker(struct ubi_device *ubi, struct ubi_work *wrk,
>  			 * protection queue.
>  			 */
>  			protect = 1;
> +			to_leb_clean = 1;
>  			goto out_not_moved;
>  		}
>  		if (err == MOVE_RETRY) {
>  			scrubbing = 1;
> +			to_leb_clean = 1;
>  			goto out_not_moved;
>  		}
>  		if (err == MOVE_TARGET_BITFLIPS || err == MOVE_TARGET_WR_ERR ||
> @@ -830,6 +835,7 @@ static int wear_leveling_worker(struct ubi_device *ubi, struct ubi_work *wrk,
>  					ubi->erroneous_peb_count);
>  				goto out_error;
>  			}
> +			to_leb_clean = 1;
>  			erroneous = 1;
>  			goto out_not_moved;
>  		}
> @@ -900,15 +906,26 @@ static int wear_leveling_worker(struct ubi_device *ubi, struct ubi_work *wrk,
>  		wl_tree_add(e1, &ubi->scrub);
>  	else
>  		wl_tree_add(e1, &ubi->used);
> +	if (to_leb_clean) {
> +		wl_tree_add(e2, &ubi->free);
> +		ubi->free_count++;
> +	}
> +
>  	ubi_assert(!ubi->move_to_put);
>  	ubi->move_from = ubi->move_to = NULL;
>  	ubi->wl_scheduled = 0;
>  	spin_unlock(&ubi->wl_lock);
>  
>  	ubi_free_vid_hdr(ubi, vid_hdr);
> -	err = do_sync_erase(ubi, e2, vol_id, lnum, torture);
> -	if (err)
> -		goto out_ro;
> +	if (to_leb_clean) {
> +		ensure_wear_leveling(ubi, 0);

Why are you rescheduling wear leveling?
And the nested variable has to be set to no-zero as you're calling it from a UBI worker.

> +	} else {
> +		err = do_sync_erase(ubi, e2, vol_id, lnum, torture);
> +		if (err) {
> +			wl_entry_destroy(ubi, e2);

Why that? The erase_worker will free e2 if it encounters
a fatal error and gives up this PEB. You're introducing a double free.

I think you have copy&pasted from:
        err = do_sync_erase(ubi, e1, vol_id, lnum, 0);
        if (err) {
                if (e2)
                        wl_entry_destroy(ubi, e2);
                goto out_ro;
        }

...which looks wrong too. I'll double check tomorrow with a fresh brain.

Thanks,
//richard

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] mtd: ubi: wl: avoid erasing a PEB which is empty
  2015-11-23 21:30   ` Richard Weinberger
@ 2015-11-23 21:50     ` Richard Weinberger
  2015-11-24  8:26     ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 23+ messages in thread
From: Richard Weinberger @ 2015-11-23 21:50 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-mtd
  Cc: David Woodhouse, Brian Norris, Artem Bityutskiy, tglx, Peter Zijlstra

Am 23.11.2015 um 22:30 schrieb Richard Weinberger:
> Sebastian,
> 
> Am 23.11.2015 um 19:09 schrieb Sebastian Andrzej Siewior:
>> wear_leveling_worker() currently unconditionally puts a PEB on erase in
>> the error case even it just been taken from the free_list and never
>> used.
>> In case the PEB was never used it can be put back on the free list
>> saving an erase cycle.
> 
> Makes sense, thanks for pointing out this optimization!
> 
>> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>> ---
>>  drivers/mtd/ubi/wl.c | 23 ++++++++++++++++++++---
>>  1 file changed, 20 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/mtd/ubi/wl.c b/drivers/mtd/ubi/wl.c
>> index eb4489f9082f..709ca27e103c 100644
>> --- a/drivers/mtd/ubi/wl.c
>> +++ b/drivers/mtd/ubi/wl.c
>> @@ -631,6 +631,7 @@ static int do_sync_erase(struct ubi_device *ubi, struct ubi_wl_entry *e,
>>  	return erase_worker(ubi, wl_wrk, 0);
>>  }
>>  
>> +static int ensure_wear_leveling(struct ubi_device *ubi, int nested);
>>  /**
>>   * wear_leveling_worker - wear-leveling worker function.
>>   * @ubi: UBI device description object
>> @@ -652,6 +653,7 @@ static int wear_leveling_worker(struct ubi_device *ubi, struct ubi_work *wrk,
>>  #endif
>>  	struct ubi_wl_entry *e1, *e2;
>>  	struct ubi_vid_hdr *vid_hdr;
>> +	int to_leb_clean = 0;
> 
> Can we please rename this variable?
> I know it means "the `to` LEB is clean". But for everybody else this reads as "I'm going to clean a LEB". ;)
> 
>>  	kfree(wrk);
>>  	if (shutdown)
>> @@ -756,6 +758,7 @@ static int wear_leveling_worker(struct ubi_device *ubi, struct ubi_work *wrk,
>>  
>>  	err = ubi_io_read_vid_hdr(ubi, e1->pnum, vid_hdr, 0);
>>  	if (err && err != UBI_IO_BITFLIPS) {
>> +		to_leb_clean = 1;
>>  		if (err == UBI_IO_FF) {
>>  			/*
>>  			 * We are trying to move PEB without a VID header. UBI
>> @@ -801,10 +804,12 @@ static int wear_leveling_worker(struct ubi_device *ubi, struct ubi_work *wrk,
>>  			 * protection queue.
>>  			 */
>>  			protect = 1;
>> +			to_leb_clean = 1;
>>  			goto out_not_moved;
>>  		}
>>  		if (err == MOVE_RETRY) {
>>  			scrubbing = 1;
>> +			to_leb_clean = 1;
>>  			goto out_not_moved;
>>  		}
>>  		if (err == MOVE_TARGET_BITFLIPS || err == MOVE_TARGET_WR_ERR ||
>> @@ -830,6 +835,7 @@ static int wear_leveling_worker(struct ubi_device *ubi, struct ubi_work *wrk,
>>  					ubi->erroneous_peb_count);
>>  				goto out_error;
>>  			}
>> +			to_leb_clean = 1;
>>  			erroneous = 1;
>>  			goto out_not_moved;
>>  		}
>> @@ -900,15 +906,26 @@ static int wear_leveling_worker(struct ubi_device *ubi, struct ubi_work *wrk,
>>  		wl_tree_add(e1, &ubi->scrub);
>>  	else
>>  		wl_tree_add(e1, &ubi->used);
>> +	if (to_leb_clean) {
>> +		wl_tree_add(e2, &ubi->free);
>> +		ubi->free_count++;
>> +	}
>> +
>>  	ubi_assert(!ubi->move_to_put);
>>  	ubi->move_from = ubi->move_to = NULL;
>>  	ubi->wl_scheduled = 0;
>>  	spin_unlock(&ubi->wl_lock);
>>  
>>  	ubi_free_vid_hdr(ubi, vid_hdr);
>> -	err = do_sync_erase(ubi, e2, vol_id, lnum, torture);
>> -	if (err)
>> -		goto out_ro;
>> +	if (to_leb_clean) {
>> +		ensure_wear_leveling(ubi, 0);
> 
> Why are you rescheduling wear leveling?
> And the nested variable has to be set to no-zero as you're calling it from a UBI worker.
> 
>> +	} else {
>> +		err = do_sync_erase(ubi, e2, vol_id, lnum, torture);
>> +		if (err) {
>> +			wl_entry_destroy(ubi, e2);
> 
> Why that? The erase_worker will free e2 if it encounters
> a fatal error and gives up this PEB. You're introducing a double free.
> 
> I think you have copy&pasted from:
>         err = do_sync_erase(ubi, e1, vol_id, lnum, 0);
>         if (err) {
>                 if (e2)
>                         wl_entry_destroy(ubi, e2);
>                 goto out_ro;
>         }
> 
> ...which looks wrong too. I'll double check tomorrow with a fresh brain.

Okay. That one is fine, as we switch to ro mode anyway...

Thanks,
//richard

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] mtd: ubi: wl: avoid erasing a PEB which is empty
  2015-11-23 21:30   ` Richard Weinberger
  2015-11-23 21:50     ` Richard Weinberger
@ 2015-11-24  8:26     ` Sebastian Andrzej Siewior
  2015-11-24  8:39       ` Richard Weinberger
  1 sibling, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-11-24  8:26 UTC (permalink / raw)
  To: Richard Weinberger, linux-mtd
  Cc: David Woodhouse, Brian Norris, Artem Bityutskiy, tglx, Peter Zijlstra

On 11/23/2015 10:30 PM, Richard Weinberger wrote:
>> diff --git a/drivers/mtd/ubi/wl.c b/drivers/mtd/ubi/wl.c
>> index eb4489f9082f..709ca27e103c 100644
>> --- a/drivers/mtd/ubi/wl.c
>> +++ b/drivers/mtd/ubi/wl.c
>> @@ -652,6 +653,7 @@ static int wear_leveling_worker(struct ubi_device *ubi, struct ubi_work *wrk,
>>  #endif
>>  	struct ubi_wl_entry *e1, *e2;
>>  	struct ubi_vid_hdr *vid_hdr;
>> +	int to_leb_clean = 0;
> 
> Can we please rename this variable?
> I know it means "the `to` LEB is clean". But for everybody else this reads as "I'm going to clean a LEB". ;)

such a shame that you can't make files RO. Any suggestions? dst_clean ?

>>  	kfree(wrk);
>>  	if (shutdown)
>> @@ -900,15 +906,26 @@ static int wear_leveling_worker(struct ubi_device *ubi, struct ubi_work *wrk,
>>  		wl_tree_add(e1, &ubi->scrub);
>>  	else
>>  		wl_tree_add(e1, &ubi->used);
>> +	if (to_leb_clean) {
>> +		wl_tree_add(e2, &ubi->free);
>> +		ubi->free_count++;
>> +	}
>> +
>>  	ubi_assert(!ubi->move_to_put);
>>  	ubi->move_from = ubi->move_to = NULL;
>>  	ubi->wl_scheduled = 0;
>>  	spin_unlock(&ubi->wl_lock);
>>  
>>  	ubi_free_vid_hdr(ubi, vid_hdr);
>> -	err = do_sync_erase(ubi, e2, vol_id, lnum, torture);
>> -	if (err)
>> -		goto out_ro;
>> +	if (to_leb_clean) {
>> +		ensure_wear_leveling(ubi, 0);
> 
> Why are you rescheduling wear leveling?

Because we had it pending and we didn't do anything. If I don't
schedule it would be deferred until another LEB is going to be deleted.
Before the change it happens after do_sync_erase() work job completes
but now we add it back to the free list.

> And the nested variable has to be set to no-zero as you're calling it from a UBI worker.
Ah, okay. I've been looking at it and got wrong then. As I said in my
cover mail, this was forwarded ported.

> 
>> +	} else {
>> +		err = do_sync_erase(ubi, e2, vol_id, lnum, torture);
>> +		if (err) {
>> +			wl_entry_destroy(ubi, e2);
> 
> Why that? The erase_worker will free e2 if it encounters
> a fatal error and gives up this PEB. You're introducing a double free.

Hmmm. That is real bad error handling you have there. So you invoke
do_sync_erase(), the kmalloc() fails and how exactly you free e2 here?
You don't so I did what I did. But then if this succeeds but
erase_worker() fails then we have a double free here. Indeed.

> I think you have copy&pasted from:
>         err = do_sync_erase(ubi, e1, vol_id, lnum, 0);
>         if (err) {
>                 if (e2)
>                         wl_entry_destroy(ubi, e2);
>                 goto out_ro;
>         }
> 
> ...which looks wrong too. I'll double check tomorrow with a fresh brain.
> 
> Thanks,
> //richard

Sebastian

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] mtd: ubi: wl: avoid erasing a PEB which is empty
  2015-11-24  8:26     ` Sebastian Andrzej Siewior
@ 2015-11-24  8:39       ` Richard Weinberger
  2015-11-24  8:42         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 23+ messages in thread
From: Richard Weinberger @ 2015-11-24  8:39 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-mtd
  Cc: David Woodhouse, Brian Norris, Artem Bityutskiy, tglx, Peter Zijlstra

Am 24.11.2015 um 09:26 schrieb Sebastian Andrzej Siewior:
> On 11/23/2015 10:30 PM, Richard Weinberger wrote:
>>> diff --git a/drivers/mtd/ubi/wl.c b/drivers/mtd/ubi/wl.c
>>> index eb4489f9082f..709ca27e103c 100644
>>> --- a/drivers/mtd/ubi/wl.c
>>> +++ b/drivers/mtd/ubi/wl.c
>>> @@ -652,6 +653,7 @@ static int wear_leveling_worker(struct ubi_device *ubi, struct ubi_work *wrk,
>>>  #endif
>>>  	struct ubi_wl_entry *e1, *e2;
>>>  	struct ubi_vid_hdr *vid_hdr;
>>> +	int to_leb_clean = 0;
>>
>> Can we please rename this variable?
>> I know it means "the `to` LEB is clean". But for everybody else this reads as "I'm going to clean a LEB". ;)
> 
> such a shame that you can't make files RO. Any suggestions? dst_clean ?

Files RO? I don't get it.

dst_clean is good. :-)

> 
>>>  	kfree(wrk);
>>>  	if (shutdown)
>>> @@ -900,15 +906,26 @@ static int wear_leveling_worker(struct ubi_device *ubi, struct ubi_work *wrk,
>>>  		wl_tree_add(e1, &ubi->scrub);
>>>  	else
>>>  		wl_tree_add(e1, &ubi->used);
>>> +	if (to_leb_clean) {
>>> +		wl_tree_add(e2, &ubi->free);
>>> +		ubi->free_count++;
>>> +	}
>>> +
>>>  	ubi_assert(!ubi->move_to_put);
>>>  	ubi->move_from = ubi->move_to = NULL;
>>>  	ubi->wl_scheduled = 0;
>>>  	spin_unlock(&ubi->wl_lock);
>>>  
>>>  	ubi_free_vid_hdr(ubi, vid_hdr);
>>> -	err = do_sync_erase(ubi, e2, vol_id, lnum, torture);
>>> -	if (err)
>>> -		goto out_ro;
>>> +	if (to_leb_clean) {
>>> +		ensure_wear_leveling(ubi, 0);
>>
>> Why are you rescheduling wear leveling?
> 
> Because we had it pending and we didn't do anything. If I don't
> schedule it would be deferred until another LEB is going to be deleted.
> Before the change it happens after do_sync_erase() work job completes
> but now we add it back to the free list.

Makes sense.

>> And the nested variable has to be set to no-zero as you're calling it from a UBI worker.
> Ah, okay. I've been looking at it and got wrong then. As I said in my
> cover mail, this was forwarded ported.
> 
>>
>>> +	} else {
>>> +		err = do_sync_erase(ubi, e2, vol_id, lnum, torture);
>>> +		if (err) {
>>> +			wl_entry_destroy(ubi, e2);
>>
>> Why that? The erase_worker will free e2 if it encounters
>> a fatal error and gives up this PEB. You're introducing a double free.
> 
> Hmmm. That is real bad error handling you have there. So you invoke
> do_sync_erase(), the kmalloc() fails and how exactly you free e2 here?

Why do you want to free e2? We free an erase entry only if we give
it up. wear leveling entries are allocated at init time and destroyed
when you detach UBI.

Thanks,
//richard

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] mtd: ubi: wl: avoid erasing a PEB which is empty
  2015-11-24  8:39       ` Richard Weinberger
@ 2015-11-24  8:42         ` Sebastian Andrzej Siewior
  2015-11-24  9:02           ` Richard Weinberger
  0 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-11-24  8:42 UTC (permalink / raw)
  To: Richard Weinberger, linux-mtd
  Cc: David Woodhouse, Brian Norris, Artem Bityutskiy, tglx, Peter Zijlstra

On 11/24/2015 09:39 AM, Richard Weinberger wrote:
>>>> +	} else {
>>>> +		err = do_sync_erase(ubi, e2, vol_id, lnum, torture);
>>>> +		if (err) {
>>>> +			wl_entry_destroy(ubi, e2);
>>>
>>> Why that? The erase_worker will free e2 if it encounters
>>> a fatal error and gives up this PEB. You're introducing a double free.
>>
>> Hmmm. That is real bad error handling you have there. So you invoke
>> do_sync_erase(), the kmalloc() fails and how exactly you free e2 here?
> 
> Why do you want to free e2? We free an erase entry only if we give
> it up. wear leveling entries are allocated at init time and destroyed
> when you detach UBI.

The reference to it in the RB-tree (free) was removed. Is there another
reference to it?

> Thanks,
> //richard
> 

Sebastian

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] mtd: ubi: wl: avoid erasing a PEB which is empty
  2015-11-24  8:42         ` Sebastian Andrzej Siewior
@ 2015-11-24  9:02           ` Richard Weinberger
  2015-11-24  9:07             ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 23+ messages in thread
From: Richard Weinberger @ 2015-11-24  9:02 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-mtd
  Cc: David Woodhouse, Brian Norris, Artem Bityutskiy, tglx, Peter Zijlstra

Am 24.11.2015 um 09:42 schrieb Sebastian Andrzej Siewior:
> On 11/24/2015 09:39 AM, Richard Weinberger wrote:
>>>>> +	} else {
>>>>> +		err = do_sync_erase(ubi, e2, vol_id, lnum, torture);
>>>>> +		if (err) {
>>>>> +			wl_entry_destroy(ubi, e2);
>>>>
>>>> Why that? The erase_worker will free e2 if it encounters
>>>> a fatal error and gives up this PEB. You're introducing a double free.
>>>
>>> Hmmm. That is real bad error handling you have there. So you invoke
>>> do_sync_erase(), the kmalloc() fails and how exactly you free e2 here?
>>
>> Why do you want to free e2? We free an erase entry only if we give
>> it up. wear leveling entries are allocated at init time and destroyed
>> when you detach UBI.
> 
> The reference to it in the RB-tree (free) was removed. Is there another
> reference to it?

UBI supports only single references.
Everything else would be a bug.

That said, I agree that the error handling in the wear_leveling_worker()
can be improved.
Especially as currently an error while do_sync_erase() puts UBI into read-only
mode and cleanup is skipped as we're in read-only anyway then.
Would you volunteer? :-)

Thanks,
//richard

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] mtd: ubi: wl: avoid erasing a PEB which is empty
  2015-11-24  9:02           ` Richard Weinberger
@ 2015-11-24  9:07             ` Sebastian Andrzej Siewior
  2015-11-24  9:16               ` Richard Weinberger
  0 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-11-24  9:07 UTC (permalink / raw)
  To: Richard Weinberger, linux-mtd
  Cc: David Woodhouse, Brian Norris, Artem Bityutskiy, tglx, Peter Zijlstra

On 11/24/2015 10:02 AM, Richard Weinberger wrote:
> Am 24.11.2015 um 09:42 schrieb Sebastian Andrzej Siewior:
>> On 11/24/2015 09:39 AM, Richard Weinberger wrote:
>>>>>> +	} else {
>>>>>> +		err = do_sync_erase(ubi, e2, vol_id, lnum, torture);
>>>>>> +		if (err) {
>>>>>> +			wl_entry_destroy(ubi, e2);
>>>>>
>>>>> Why that? The erase_worker will free e2 if it encounters
>>>>> a fatal error and gives up this PEB. You're introducing a double free.
>>>>
>>>> Hmmm. That is real bad error handling you have there. So you invoke
>>>> do_sync_erase(), the kmalloc() fails and how exactly you free e2 here?
>>>
>>> Why do you want to free e2? We free an erase entry only if we give
>>> it up. wear leveling entries are allocated at init time and destroyed
>>> when you detach UBI.
>>
>> The reference to it in the RB-tree (free) was removed. Is there another
>> reference to it?
> 
> UBI supports only single references.
> Everything else would be a bug.

So if there is no reference to e2 which was just removed from the
RB-tree free and do_sync_erase() can't kmalloc() then we leak e2,
correct?

> Thanks,
> //richard
> 
Sebastian

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] mtd: ubi: wl: avoid erasing a PEB which is empty
  2015-11-24  9:07             ` Sebastian Andrzej Siewior
@ 2015-11-24  9:16               ` Richard Weinberger
  0 siblings, 0 replies; 23+ messages in thread
From: Richard Weinberger @ 2015-11-24  9:16 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-mtd
  Cc: David Woodhouse, Brian Norris, Artem Bityutskiy, tglx, Peter Zijlstra

Am 24.11.2015 um 10:07 schrieb Sebastian Andrzej Siewior:
> On 11/24/2015 10:02 AM, Richard Weinberger wrote:
>> Am 24.11.2015 um 09:42 schrieb Sebastian Andrzej Siewior:
>>> On 11/24/2015 09:39 AM, Richard Weinberger wrote:
>>>>>>> +	} else {
>>>>>>> +		err = do_sync_erase(ubi, e2, vol_id, lnum, torture);
>>>>>>> +		if (err) {
>>>>>>> +			wl_entry_destroy(ubi, e2);
>>>>>>
>>>>>> Why that? The erase_worker will free e2 if it encounters
>>>>>> a fatal error and gives up this PEB. You're introducing a double free.
>>>>>
>>>>> Hmmm. That is real bad error handling you have there. So you invoke
>>>>> do_sync_erase(), the kmalloc() fails and how exactly you free e2 here?
>>>>
>>>> Why do you want to free e2? We free an erase entry only if we give
>>>> it up. wear leveling entries are allocated at init time and destroyed
>>>> when you detach UBI.
>>>
>>> The reference to it in the RB-tree (free) was removed. Is there another
>>> reference to it?
>>
>> UBI supports only single references.
>> Everything else would be a bug.
> 
> So if there is no reference to e2 which was just removed from the
> RB-tree free and do_sync_erase() can't kmalloc() then we leak e2,
> correct?

Yes, you are right. That definitely needs improvement.

A possible solution would be iterating over ubi->lookuptbl upon
detach time and call wl_entry_destroy() on every non-NULL entry.

...or rework do_sync_erase(), currently it calls the erase worker directly.
The erase worker destroys a failed wl entry upon failure. But from the return code
of do_sync_erase() we cannot know whether the erase worker destroyed it already.

Thanks,
//richard

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] mtd: ubi: wl: avoid erasing a PEB which is empty
  2015-11-23 18:09 ` [RFC PATCH 2/2] mtd: ubi: wl: avoid erasing a PEB which is empty Sebastian Andrzej Siewior
  2015-11-23 21:30   ` Richard Weinberger
@ 2015-11-24 12:58   ` Artem Bityutskiy
  2015-11-24 13:33     ` Sebastian Andrzej Siewior
  2015-11-26 14:56     ` Sebastian Andrzej Siewior
  1 sibling, 2 replies; 23+ messages in thread
From: Artem Bityutskiy @ 2015-11-24 12:58 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-mtd
  Cc: David Woodhouse, Brian Norris, Richard Weinberger, tglx, Peter Zijlstra

On Mon, 2015-11-23 at 19:09 +0100, Sebastian Andrzej Siewior wrote:
> wear_leveling_worker() currently unconditionally puts a PEB on erase
> in
> the error case even it just been taken from the free_list and never
> used.
> In case the PEB was never used it can be put back on the free list
> saving an erase cycle.

Did you think about putting LEBs like that to the protection queue
instead of playing tricks with scheduler?

The protection queue is there in order to protect eraseblocks from the
wear-leveling subsystem (not the best choice of words, but terminology
could be improved separately)

And this is what you need - you want the wear-leveling subsystem to
leave the eraseblock alone for some time, right?

The protection queue uses the erase cycles counts instead of the actual
time, but this seems OK for the situation you described.

Just to remind why this protection queue exists - when the WL subsystem
gives an eraseblock to the user, this may be an eraseblock with a high
erase counter, and it may be a candidate for being moved, the WL
subsystem just did not have a chance to do this yet. But if we give the
eraseblock to the user, the user will probably write something there
soon, and we do not want the WL subsystem to initiate data relocation
while the user is writing the data there. Instead, we want to wait a
little, and then move the data in background without interfering with
the user.

Back to my idea - what if you add the MOVE_RETRY eraseblocks to the
protection queue. May be not to the tail, may be to the head.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] mtd: ubi: wl: avoid erasing a PEB which is empty
  2015-11-24 12:58   ` Artem Bityutskiy
@ 2015-11-24 13:33     ` Sebastian Andrzej Siewior
  2015-11-24 13:40       ` Artem Bityutskiy
  2015-11-24 13:57       ` Artem Bityutskiy
  2015-11-26 14:56     ` Sebastian Andrzej Siewior
  1 sibling, 2 replies; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-11-24 13:33 UTC (permalink / raw)
  To: dedekind1, linux-mtd
  Cc: David Woodhouse, Brian Norris, Richard Weinberger, tglx, Peter Zijlstra

On 11/24/2015 01:58 PM, Artem Bityutskiy wrote:

Hello Artem,

> On Mon, 2015-11-23 at 19:09 +0100, Sebastian Andrzej Siewior wrote:
>> wear_leveling_worker() currently unconditionally puts a PEB on erase
>> in
>> the error case even it just been taken from the free_list and never
>> used.
>> In case the PEB was never used it can be put back on the free list
>> saving an erase cycle.
> 
> Did you think about putting LEBs like that to the protection queue
> instead of playing tricks with scheduler?

Why am I playing tricks with the scheduler?

> The protection queue is there in order to protect eraseblocks from the
> wear-leveling subsystem (not the best choice of words, but terminology
> could be improved separately)
> 
> And this is what you need - you want the wear-leveling subsystem to
> leave the eraseblock alone for some time, right?

The empty eraseblock is not the problem. The src-LEB is the problem
because it can not be moved due to lock contention. Ideally I would do
nothing here (except putting it back to the free list) and on unlock of
the SRC-LEB it would trigger a wear-leveling cycle.

> The protection queue uses the erase cycles counts instead of the actual
> time, but this seems OK for the situation you described.
> 
> Just to remind why this protection queue exists - when the WL subsystem
> gives an eraseblock to the user, this may be an eraseblock with a high
> erase counter, and it may be a candidate for being moved, the WL
> subsystem just did not have a chance to do this yet. But if we give the
> eraseblock to the user, the user will probably write something there
> soon, and we do not want the WL subsystem to initiate data relocation
> while the user is writing the data there. Instead, we want to wait a
> little, and then move the data in background without interfering with
> the user.
> 
> Back to my idea - what if you add the MOVE_RETRY eraseblocks to the
> protection queue. May be not to the tail, may be to the head.

Hmm. About which erase blocks are you talking about? The e1 which is
the src EB and will be relocated _or_ the e2 which is the destination
EB where the data will be written to?
>From what you explain it does not make sense to put e2 on the protect
list. I just try to save here an erase cycle here.
e1 on the other hand might be a candidate.

So e1 has a low EC value and WL-subsystem decides to move it to a an
empty block with a high EC value. This fails due to MOVE_RETRY.
I add e2 on to free-RB-tree, e1 on the protect-list. No
ensure_wear_leveling() invocation. What will happen next? When will e1
be removed from the protection list?

Sebastian

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] mtd: ubi: wl: avoid erasing a PEB which is empty
  2015-11-24 13:33     ` Sebastian Andrzej Siewior
@ 2015-11-24 13:40       ` Artem Bityutskiy
  2015-11-24 13:57       ` Artem Bityutskiy
  1 sibling, 0 replies; 23+ messages in thread
From: Artem Bityutskiy @ 2015-11-24 13:40 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-mtd
  Cc: David Woodhouse, Brian Norris, Richard Weinberger, tglx, Peter Zijlstra

On Tue, 2015-11-24 at 14:33 +0100, Sebastian Andrzej Siewior wrote:
> What will happen next? When will e1
> be removed from the protection list?

If my memory still serves me, the answer is: roughly speaking, after a
number of erase operation, which is currently defined as

/*
 * Length of the protection queue. The length is effectively equivalent t o the
 * number of (global) erase cycles PEBs are protected from the wear-leveling
 * worker.
 */
#define UBI_PROT_QUEUE_LEN 10

But if you put it to the head of the protection queue, it'll be removed
from there as soon as any LEB is "put", which effectively means
"scheduled for erasure".

Artem.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] mtd: ubi: wl: avoid erasing a PEB which is empty
  2015-11-24 13:33     ` Sebastian Andrzej Siewior
  2015-11-24 13:40       ` Artem Bityutskiy
@ 2015-11-24 13:57       ` Artem Bityutskiy
  1 sibling, 0 replies; 23+ messages in thread
From: Artem Bityutskiy @ 2015-11-24 13:57 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-mtd
  Cc: David Woodhouse, Brian Norris, Richard Weinberger, tglx, Peter Zijlstra

A follow-up...

On Tue, 2015-11-24 at 14:33 +0100, Sebastian Andrzej Siewior wrote:
> > Did you think about putting LEBs like that to the protection queue
> > instead of playing tricks with scheduler?
> 
> Why am I playing tricks with the scheduler?

I should have replied the "1/2" e-mail, not this one, sorry. Because
this is what you seem to try ding in 1/2.

> Hmm. About which erase blocks are you talking about? The e1 which is
> the src EB and will be relocated _or_ the e2 which is the destination

I think the one which is currently unavailable. Or may be even both. I
suggest you to consider pros an cons :-)

> From what you explain it does not make sense to put e2 on the protect
> list. I just try to save here an erase cycle here.

I think if you put to the head of the PQ, that will be it.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH] mtd: nand: do FIFO processing in nand_get_device()
  2015-11-23 18:18   ` Peter Zijlstra
@ 2015-11-25 17:35     ` Sebastian Andrzej Siewior
  2015-11-30 16:15       ` Peter Zijlstra
  2015-12-02 18:52       ` [PATCH] " Brian Norris
  0 siblings, 2 replies; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-11-25 17:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mtd, David Woodhouse, Brian Norris, Artem Bityutskiy,
	Richard Weinberger, tglx

I have here a live lock in UBI doing
  ensure_wear_leveling() -> wear_leveling_worker() -> ubi_eba_copy_leb()
  MOVE_RETRY -> schedule_erase() -> ensure_wear_leveling()

on the same PEB over and over again. The reason for MOVE_RETRY is that
the LEB-Lock owner is stucked in nand_get_device() and does not get the
device lock. The PEB-lock owner is only scheduled on the CPU while the UBI
thread is idle during erase or read while (again) owning the device-lock
so the LEB-lock owner makes no progress.

To fix this live lock I ensure that there FIFO processing in
nand_get_device(). On release the first waiter is marked as the new
owner. If someone asks for the device and is not the waiter to which
nand device has been handed over then it will put itself on the
waitqueue.
The FIFO processing was suggested by Peter Zijlstra.

As a small optimization I use add_wait_queue_exclusive() instead
add_wait_queue() to make sure that only _one_ waiter is woken up and not
all of them.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
Would a stable be considered as reasonable?

 drivers/mtd/nand/nand_base.c  | 39 ++++++++++++++++++++++++++++++++-------
 include/linux/mtd/flashchip.h |  1 +
 include/linux/mtd/nand.h      |  1 +
 3 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c
index ece544efccc3..e2896af488c0 100644
--- a/drivers/mtd/nand/nand_base.c
+++ b/drivers/mtd/nand/nand_base.c
@@ -137,8 +137,25 @@ static void nand_release_device(struct mtd_info *mtd)
 	/* Release the controller and the chip */
 	spin_lock(&chip->controller->lock);
 	chip->controller->active = NULL;
-	chip->state = FL_READY;
-	wake_up(&chip->controller->wq);
+
+	if (waitqueue_active(&chip->controller->wq)) {
+		wait_queue_head_t *q;
+		wait_queue_t *waiter;
+		unsigned long flags;
+
+		q = &chip->controller->wq;
+		chip->state = FL_HANDOVER;
+
+		spin_lock_irqsave(&q->lock, flags);
+		waiter = list_first_entry(&q->task_list, wait_queue_t,
+					  task_list);
+		spin_unlock_irqrestore(&q->lock, flags);
+
+		chip->controller->handover_waiter = waiter;
+		wake_up(q);
+	} else {
+		chip->state = FL_READY;
+	}
 	spin_unlock(&chip->controller->lock);
 }
 
@@ -843,10 +860,18 @@ nand_get_device(struct mtd_info *mtd, int new_state)
 	if (!chip->controller->active)
 		chip->controller->active = chip;
 
-	if (chip->controller->active == chip && chip->state == FL_READY) {
-		chip->state = new_state;
-		spin_unlock(lock);
-		return 0;
+	if (chip->controller->active == chip) {
+		if (chip->state == FL_READY) {
+			chip->state = new_state;
+			spin_unlock(lock);
+			return 0;
+		}
+		if (chip->state == FL_HANDOVER &&
+		    chip->controller->handover_waiter == &wait) {
+			chip->state = new_state;
+			spin_unlock(lock);
+			return 0;
+		}
 	}
 	if (new_state == FL_PM_SUSPENDED) {
 		if (chip->controller->active->state == FL_PM_SUSPENDED) {
@@ -856,7 +881,7 @@ nand_get_device(struct mtd_info *mtd, int new_state)
 		}
 	}
 	set_current_state(TASK_UNINTERRUPTIBLE);
-	add_wait_queue(wq, &wait);
+	add_wait_queue_exclusive(wq, &wait);
 	spin_unlock(lock);
 	schedule();
 	remove_wait_queue(wq, &wait);
diff --git a/include/linux/mtd/flashchip.h b/include/linux/mtd/flashchip.h
index b63fa457febd..c855f4fd51b8 100644
--- a/include/linux/mtd/flashchip.h
+++ b/include/linux/mtd/flashchip.h
@@ -58,6 +58,7 @@ typedef enum {
 	FL_OTPING,
 	FL_PREPARING_ERASE,
 	FL_VERIFYING_ERASE,
+	FL_HANDOVER,
 
 	FL_UNKNOWN
 } flstate_t;
diff --git a/include/linux/mtd/nand.h b/include/linux/mtd/nand.h
index 5a9d1d4c2487..2686dc5dd51b 100644
--- a/include/linux/mtd/nand.h
+++ b/include/linux/mtd/nand.h
@@ -439,6 +439,7 @@ struct nand_hw_control {
 	spinlock_t lock;
 	struct nand_chip *active;
 	wait_queue_head_t wq;
+	wait_queue_t *handover_waiter;
 };
 
 /**
-- 
2.6.2

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] mtd: ubi: wl: avoid erasing a PEB which is empty
  2015-11-24 12:58   ` Artem Bityutskiy
  2015-11-24 13:33     ` Sebastian Andrzej Siewior
@ 2015-11-26 14:56     ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-11-26 14:56 UTC (permalink / raw)
  To: dedekind1, linux-mtd
  Cc: David Woodhouse, Brian Norris, Richard Weinberger, tglx, Peter Zijlstra

On 11/24/2015 01:58 PM, Artem Bityutskiy wrote:
> On Mon, 2015-11-23 at 19:09 +0100, Sebastian Andrzej Siewior wrote:
> Back to my idea - what if you add the MOVE_RETRY eraseblocks to the
> protection queue. May be not to the tail, may be to the head.

I've been thinking about this. The protected list protects EBs from the
WL worker. There are 10 slots and after one successful erase process
all EB from this slot will be moved back to ->used list. So e2 can't be
placed there because it belongs on the ->free list.

e1 could be added there so it is protected from the WL worker until at
least one erase cycle happened. This is no guarantee that MOVE_RETRY
does not happen again but an optimistic try :)

However once the protection time ends, it won't be added to the ->scrub
list but the ->used list instead. The difference is that if it has an
average number of erase cycles it won't be considered by the WL worker
any time soon. So if it was added to the ->scrub list because a bitflip
was detected on read then you lose this information.
For that reason I think it is not worth putting it on the protected
list.

Sebastian

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] mtd: nand: do FIFO processing in nand_get_device()
  2015-11-25 17:35     ` [PATCH] mtd: nand: do FIFO processing in nand_get_device() Sebastian Andrzej Siewior
@ 2015-11-30 16:15       ` Peter Zijlstra
  2015-12-06 14:17         ` Sebastian Andrzej Siewior
  2015-12-02 18:52       ` [PATCH] " Brian Norris
  1 sibling, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2015-11-30 16:15 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-mtd, David Woodhouse, Brian Norris, Artem Bityutskiy,
	Richard Weinberger, tglx

On Wed, Nov 25, 2015 at 06:35:43PM +0100, Sebastian Andrzej Siewior wrote:
> +++ b/drivers/mtd/nand/nand_base.c
> @@ -137,8 +137,25 @@ static void nand_release_device(struct mtd_info *mtd)
>  	/* Release the controller and the chip */
>  	spin_lock(&chip->controller->lock);
>  	chip->controller->active = NULL;
> -	chip->state = FL_READY;
> -	wake_up(&chip->controller->wq);
> +
> +	if (waitqueue_active(&chip->controller->wq)) {
> +		wait_queue_head_t *q;
> +		wait_queue_t *waiter;
> +		unsigned long flags;
> +
> +		q = &chip->controller->wq;
> +		chip->state = FL_HANDOVER;
> +
> +		spin_lock_irqsave(&q->lock, flags);

This lock is actually not required, as your add/remove_wait_queue calls
are also under chip->controller->lock.

> +		waiter = list_first_entry(&q->task_list, wait_queue_t,
> +					  task_list);
> +		spin_unlock_irqrestore(&q->lock, flags);

And the one read instruction under a spinlock is a tad pointless anyway.

> +
> +		chip->controller->handover_waiter = waiter;

You could consider using ->active for this; as it stands you never use
both at the same time. Its a tad icky, but it avoids adding that new
field.

> +		wake_up(q);
> +	} else {
> +		chip->state = FL_READY;
> +	}
>  	spin_unlock(&chip->controller->lock);
>  }
>  
> @@ -843,10 +860,18 @@ nand_get_device(struct mtd_info *mtd, int new_state)
>  	if (!chip->controller->active)
>  		chip->controller->active = chip;
>  
> -	if (chip->controller->active == chip && chip->state == FL_READY) {
> -		chip->state = new_state;
> -		spin_unlock(lock);
> -		return 0;
> +	if (chip->controller->active == chip) {
> +		if (chip->state == FL_READY) {
> +			chip->state = new_state;
> +			spin_unlock(lock);
> +			return 0;
> +		}
> +		if (chip->state == FL_HANDOVER &&
> +		    chip->controller->handover_waiter == &wait) {
> +			chip->state = new_state;
> +			spin_unlock(lock);
> +			return 0;
> +		}
>  	}

That means this would become:

	if (chip->state == FL_HANDOVER && chip->controller->active == &wait) {
		chip->controller->active = chip;
		chip->state = new_state;
	}

	... existing code ...

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] mtd: nand: do FIFO processing in nand_get_device()
  2015-11-25 17:35     ` [PATCH] mtd: nand: do FIFO processing in nand_get_device() Sebastian Andrzej Siewior
  2015-11-30 16:15       ` Peter Zijlstra
@ 2015-12-02 18:52       ` Brian Norris
  2015-12-02 20:41         ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 23+ messages in thread
From: Brian Norris @ 2015-12-02 18:52 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Peter Zijlstra, linux-mtd, David Woodhouse, Artem Bityutskiy,
	Richard Weinberger, tglx

On Wed, Nov 25, 2015 at 06:35:43PM +0100, Sebastian Andrzej Siewior wrote:
> I have here a live lock in UBI doing
>   ensure_wear_leveling() -> wear_leveling_worker() -> ubi_eba_copy_leb()
>   MOVE_RETRY -> schedule_erase() -> ensure_wear_leveling()
> 
> on the same PEB over and over again. The reason for MOVE_RETRY is that
> the LEB-Lock owner is stucked in nand_get_device() and does not get the
> device lock. The PEB-lock owner is only scheduled on the CPU while the UBI
> thread is idle during erase or read while (again) owning the device-lock
> so the LEB-lock owner makes no progress.
> 
> To fix this live lock I ensure that there FIFO processing in
> nand_get_device(). On release the first waiter is marked as the new
> owner. If someone asks for the device and is not the waiter to which
> nand device has been handed over then it will put itself on the
> waitqueue.
> The FIFO processing was suggested by Peter Zijlstra.
> 
> As a small optimization I use add_wait_queue_exclusive() instead
> add_wait_queue() to make sure that only _one_ waiter is woken up and not
> all of them.
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> Would a stable be considered as reasonable?

Your threading is a bit confusing. Is this patch still needed, or did
you fix everything by fixing UBI with these?

http://lists.infradead.org/pipermail/linux-mtd/2015-November/063745.html
http://lists.infradead.org/pipermail/linux-mtd/2015-November/063744.html
http://lists.infradead.org/pipermail/linux-mtd/2015-November/063746.html

Brian

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] mtd: nand: do FIFO processing in nand_get_device()
  2015-12-02 18:52       ` [PATCH] " Brian Norris
@ 2015-12-02 20:41         ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-12-02 20:41 UTC (permalink / raw)
  To: Brian Norris
  Cc: Peter Zijlstra, linux-mtd, David Woodhouse, Artem Bityutskiy,
	Richard Weinberger, tglx

On 12/02/2015 07:52 PM, Brian Norris wrote:
> Your threading is a bit confusing. Is this patch still needed, or did
> you fix everything by fixing UBI with these?

The three UBI patches are just an optimization and I trigger the life
lock with them. This patch is still required. I hope that I can respond
to Peter's comments by the end of the week. If you any suggestions
please let me know.

> Brian

Sebastian

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] mtd: nand: do FIFO processing in nand_get_device()
  2015-11-30 16:15       ` Peter Zijlstra
@ 2015-12-06 14:17         ` Sebastian Andrzej Siewior
  2015-12-06 14:23           ` [PATCH v2] " Sebastian Andrzej Siewior
  0 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-12-06 14:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mtd, David Woodhouse, Brian Norris, Artem Bityutskiy,
	Richard Weinberger, tglx

* Peter Zijlstra | 2015-11-30 17:15:49 [+0100]:

>On Wed, Nov 25, 2015 at 06:35:43PM +0100, Sebastian Andrzej Siewior wrote:
>> +++ b/drivers/mtd/nand/nand_base.c
>> @@ -137,8 +137,25 @@ static void nand_release_device(struct mtd_info *mtd)
>>  	/* Release the controller and the chip */
>>  	spin_lock(&chip->controller->lock);
>>  	chip->controller->active = NULL;
>> -	chip->state = FL_READY;
>> -	wake_up(&chip->controller->wq);
>> +
>> +	if (waitqueue_active(&chip->controller->wq)) {
>> +		wait_queue_head_t *q;
>> +		wait_queue_t *waiter;
>> +		unsigned long flags;
>> +
>> +		q = &chip->controller->wq;
>> +		chip->state = FL_HANDOVER;
>> +
>> +		spin_lock_irqsave(&q->lock, flags);
>
>This lock is actually not required, as your add/remove_wait_queue calls
>are also under chip->controller->lock.

yeah. And there can be only one wakeup and only after this function here
made it happen. So yes, not required. dropped it.

>> +
>> +		chip->controller->handover_waiter = waiter;
>
>You could consider using ->active for this; as it stands you never use
>both at the same time. Its a tad icky, but it avoids adding that new
>field.

There is this FL_PM_SUSPENDED which derefences `active` even if its state
is not FL_READY. So I think that we could go boom here.
Also in order to share that member we need either an union or an
explicit cast because it is a different type. Puh.
My estimate here is 3 * pointer + 2 * sizeof spinlock. So on 32bit we
should always end up in kmalloc-32 (except with spinlock debugging which
inflates this struct anyway so this extra pointer should not matter
much).

Sebastian

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v2] mtd: nand: do FIFO processing in nand_get_device()
  2015-12-06 14:17         ` Sebastian Andrzej Siewior
@ 2015-12-06 14:23           ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-12-06 14:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mtd, David Woodhouse, Brian Norris, Artem Bityutskiy,
	Richard Weinberger, tglx

I have here a live lock in UBI doing
  ensure_wear_leveling() -> wear_leveling_worker() -> ubi_eba_copy_leb()
  MOVE_RETRY -> schedule_erase() -> ensure_wear_leveling()

on the same PEB over and over again. The reason for MOVE_RETRY is that
the LEB-Lock owner is stucked in nand_get_device() and does not get the
device lock. The PEB-lock owner is only scheduled on the CPU while the UBI
thread is idle during erase or read while (again) owning the device-lock
so the LEB-lock owner makes no progress.

To fix this live lock I ensure that there FIFO processing in
nand_get_device(). On release the first waiter is marked as the new
owner. If someone asks for the device and is not the waiter to which
nand device has been handed over then it will put itself on the
waitqueue.
The FIFO processing was suggested by Peter Zijlstra.

As a small optimization I use add_wait_queue_exclusive() instead
add_wait_queue() to make sure that only _one_ waiter is woken up and not
all of them.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
v1…v2: drop superfluously locking around list_first_entry()

 drivers/mtd/nand/nand_base.c  | 36 +++++++++++++++++++++++++++++-------
 include/linux/mtd/flashchip.h |  1 +
 include/linux/mtd/nand.h      |  1 +
 3 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c
index ece544efccc3..bd46b75b3540 100644
--- a/drivers/mtd/nand/nand_base.c
+++ b/drivers/mtd/nand/nand_base.c
@@ -137,8 +137,22 @@ static void nand_release_device(struct mtd_info *mtd)
 	/* Release the controller and the chip */
 	spin_lock(&chip->controller->lock);
 	chip->controller->active = NULL;
-	chip->state = FL_READY;
-	wake_up(&chip->controller->wq);
+
+	if (waitqueue_active(&chip->controller->wq)) {
+		wait_queue_head_t *q;
+		wait_queue_t *waiter;
+
+		q = &chip->controller->wq;
+		chip->state = FL_HANDOVER;
+
+		waiter = list_first_entry(&q->task_list, wait_queue_t,
+					  task_list);
+
+		chip->controller->handover_waiter = waiter;
+		wake_up(q);
+	} else {
+		chip->state = FL_READY;
+	}
 	spin_unlock(&chip->controller->lock);
 }
 
@@ -843,10 +857,18 @@ nand_get_device(struct mtd_info *mtd, int new_state)
 	if (!chip->controller->active)
 		chip->controller->active = chip;
 
-	if (chip->controller->active == chip && chip->state == FL_READY) {
-		chip->state = new_state;
-		spin_unlock(lock);
-		return 0;
+	if (chip->controller->active == chip) {
+		if (chip->state == FL_READY) {
+			chip->state = new_state;
+			spin_unlock(lock);
+			return 0;
+		}
+		if (chip->state == FL_HANDOVER &&
+		    chip->controller->handover_waiter == &wait) {
+			chip->state = new_state;
+			spin_unlock(lock);
+			return 0;
+		}
 	}
 	if (new_state == FL_PM_SUSPENDED) {
 		if (chip->controller->active->state == FL_PM_SUSPENDED) {
@@ -856,7 +878,7 @@ nand_get_device(struct mtd_info *mtd, int new_state)
 		}
 	}
 	set_current_state(TASK_UNINTERRUPTIBLE);
-	add_wait_queue(wq, &wait);
+	add_wait_queue_exclusive(wq, &wait);
 	spin_unlock(lock);
 	schedule();
 	remove_wait_queue(wq, &wait);
diff --git a/include/linux/mtd/flashchip.h b/include/linux/mtd/flashchip.h
index b63fa457febd..c855f4fd51b8 100644
--- a/include/linux/mtd/flashchip.h
+++ b/include/linux/mtd/flashchip.h
@@ -58,6 +58,7 @@ typedef enum {
 	FL_OTPING,
 	FL_PREPARING_ERASE,
 	FL_VERIFYING_ERASE,
+	FL_HANDOVER,
 
 	FL_UNKNOWN
 } flstate_t;
diff --git a/include/linux/mtd/nand.h b/include/linux/mtd/nand.h
index 5a9d1d4c2487..2686dc5dd51b 100644
--- a/include/linux/mtd/nand.h
+++ b/include/linux/mtd/nand.h
@@ -439,6 +439,7 @@ struct nand_hw_control {
 	spinlock_t lock;
 	struct nand_chip *active;
 	wait_queue_head_t wq;
+	wait_queue_t *handover_waiter;
 };
 
 /**
-- 
2.6.2

^ permalink raw reply related	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2015-12-06 14:23 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-23 18:09 [RFC] avoid a live lock in wear_leveling_worker() Sebastian Andrzej Siewior
2015-11-23 18:09 ` [RFC PATCH 1/2] mtd: nand: schedule() after releasing the device Sebastian Andrzej Siewior
2015-11-23 18:18   ` Peter Zijlstra
2015-11-25 17:35     ` [PATCH] mtd: nand: do FIFO processing in nand_get_device() Sebastian Andrzej Siewior
2015-11-30 16:15       ` Peter Zijlstra
2015-12-06 14:17         ` Sebastian Andrzej Siewior
2015-12-06 14:23           ` [PATCH v2] " Sebastian Andrzej Siewior
2015-12-02 18:52       ` [PATCH] " Brian Norris
2015-12-02 20:41         ` Sebastian Andrzej Siewior
2015-11-23 18:09 ` [RFC PATCH 2/2] mtd: ubi: wl: avoid erasing a PEB which is empty Sebastian Andrzej Siewior
2015-11-23 21:30   ` Richard Weinberger
2015-11-23 21:50     ` Richard Weinberger
2015-11-24  8:26     ` Sebastian Andrzej Siewior
2015-11-24  8:39       ` Richard Weinberger
2015-11-24  8:42         ` Sebastian Andrzej Siewior
2015-11-24  9:02           ` Richard Weinberger
2015-11-24  9:07             ` Sebastian Andrzej Siewior
2015-11-24  9:16               ` Richard Weinberger
2015-11-24 12:58   ` Artem Bityutskiy
2015-11-24 13:33     ` Sebastian Andrzej Siewior
2015-11-24 13:40       ` Artem Bityutskiy
2015-11-24 13:57       ` Artem Bityutskiy
2015-11-26 14:56     ` Sebastian Andrzej Siewior

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.