* Summary so far - ubd breakage in 4.20-rc1
@ 2018-11-07 18:40 Anton Ivanov
  2018-11-07 18:53 ` Richard Weinberger
From: Anton Ivanov @ 2018-11-07 18:40 UTC (permalink / raw)
  To: linux-um; +Cc: Richard Weinberger

Hi list, hi Richard.

I spent some time digging into the 4.20-rc1 issue today, and unless I am 
missing something it looks like UBD breakage and it looks like memory 
corruption. I cannot pin down where it is coming from.

These are my findings so far:

1. It happens only for write requests - I have not picked up a case 
where a read req breaks in any way so far. UML boots fine until it tries 
to remount the root fs read only and then fails with an IO error.

2. In my config it looks like it is introduced by the "um: Convert ubd 
driver to blk-mq" commit. It appears in 4.19 if I cherry-pick it and 
disappears in 4.20-rc1 if I revert it.

3. The write req is correctly passed as far as the actual io handler and 
correctly processed by the io thread. Upon finishing the request in the 
io thread the value of req->error is 0 and all values look OK.

4. The moment the req is read back by the irq handler, req->error contains 
something which looks like data from elsewhere rather than from the 
request. E.g. error may contain 55AA55AA

5. Other bits of the req are also zapped in a similar manner.

6. The pointer to the req passed along the IPC pipe is correct. If a req 
at 00000000deafe300 is given for execution to the IO thread, that is 
what is in the io_req variable in the handler (see the sketch at the end 
of this message). It is just that the contents of that req are scrambled 
by that time.

7.  I see it only for write reqs.

I just do not see where it can be zapped. At all. I did a prototype to 
add the BLK_STS_AGAIN return code and continue-from-half-transmitted-req 
logic similar to the one in the nbd driver. It is not that. There is 
something else which causes this and I just do not see it :(

A.
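
To make point 6 above concrete: the request structure itself never travels
through the pipe, only its pointer does, and the same pointer comes back when
the work is done. Below is a minimal userspace sketch of that hand-off
pattern only - the struct fields, pipe names and flow are made up for the
illustration and are not the driver's actual code.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct io_req {
        int op;         /* 1 = write, 2 = flush, purely for the example */
        int error;      /* whoever builds the request should initialize this */
};

int main(void)
{
        int submit[2], done[2];
        struct io_req *req = calloc(1, sizeof(*req));
        struct io_req *back = NULL;

        if (!req || pipe(submit) || pipe(done))
                return 1;
        req->op = 1;

        /* "kernel" side: hand the pointer to the I/O thread */
        write(submit[1], &req, sizeof(req));

        /* "I/O thread": read the pointer, do the work, send it back */
        read(submit[0], &back, sizeof(back));
        back->error = 0;                        /* result of the I/O */
        write(done[1], &back, sizeof(back));

        /* "irq handler": read the pointer back and inspect the result */
        read(done[0], &back, sizeof(back));
        printf("req %p op %d error %d\n", (void *)back, back->op, back->error);

        free(req);
        return 0;
}

Since only the pointer is exchanged, anything that scribbles on the
pointed-to structure between submission and completion shows up exactly as
described in points 4-6.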



* Re: Summary so far - ubd breakage in 4.20-rc1
  2018-11-07 18:40 Summary so far - ubd breakage in 4.20-rc1 Anton Ivanov
@ 2018-11-07 18:53 ` Richard Weinberger
  2018-11-07 19:19   ` Jens Axboe
From: Richard Weinberger @ 2018-11-07 18:53 UTC (permalink / raw)
  To: Anton Ivanov; +Cc: Jens Axboe, linux-um, hch

[CC'ing hch and jens again.]

On Wednesday, 7 November 2018, 19:40:13 CET, Anton Ivanov wrote:
> Hi list, hi Richard.
> 
> I spent some time digging into the 4.20-rc1 issue today, and unless I am 
> missing something it looks like UBD breakage and it looks like memory 
> corruption. I cannot pin down where it is coming from.
> 
> These are my findings so far:
> 
> 1. It happens only for write requests - I have not picked up a case 
> where a read req breaks in any way so far. UML boots fine until it tries 
> to remount the root fs read only and then fails with an IO error.
> 
> 2. In my config it looks like it is introduced by the "um: Convert ubd 
> driver to blk-mq" commit. It appears in 4.19 if I cherry-pick it and 
> disappears in 4.20-rc1 if I revert it.
> 
> 3. The write req is correctly passed as far as the actual io handler and 
> correctly processed by the io thread. Upon finishing the request in the 
> io thread the value of req->error is 0 and all values look OK.
> 
> 4. The moment the req is read back by the irq handler, req->error contains 
> something which looks like data from elsewhere rather than from the 
> request. E.g. error may contain 55AA55AA
> 
> 5. Other bits of the req are also zapped in a similar manner.
> 
> 6. The pointer to the req passed along the IPC pipe is correct. If a req 
> at 00000000deafe300 is given for execution to the IO thread, that is 
> what is in the io_req variable in the handler. It is just that the 
> contents of that req are scrambled by that time.
> 
> 7.  I see it only for write reqs.
> 
> I just do not see where it can be zapped. At all. I did a prototype to 
> add the BLK_STS_AGAIN return code and continue-from-half-transmitted-req 
> logic similar to the one in the nbd driver. It is not that. There is 
> something else which causes this and I just do not see it :(
> 
> A.
> 
> 






* Re: Summary so far - ubd breakage in 4.20-rc1
  2018-11-07 18:53 ` Richard Weinberger
@ 2018-11-07 19:19   ` Jens Axboe
  2018-11-07 21:15     ` Richard Weinberger
From: Jens Axboe @ 2018-11-07 19:19 UTC (permalink / raw)
  To: Richard Weinberger, Anton Ivanov; +Cc: linux-um, hch

On 11/7/18 11:53 AM, Richard Weinberger wrote:
> [CC'ing hch and jens again.]
> 
> On Wednesday, 7 November 2018, 19:40:13 CET, Anton Ivanov wrote:
>> Hi list, hi Richard.
>>
>> I spent some time digging into the 4.20-rc1 issue today, and unless I am 
>> missing something it looks like UBD breakage and it looks like memory 
>> corruption. I cannot pin down where it is coming from.
>>
>> These are my findings so far:
>>
>> 1. It happens only for write requests - I have not picked up a case 
>> where a read req breaks in any way so far. UML boots fine until it tries 
>> to remount the root fs read only and then fails with an IO error.
>>
>> 2. In my config it looks like it is introduced by the "um: Convert ubd 
>> driver to blk-mq" commit. It appears in 4.19 if I cherry-pick it and 
>> disappears in 4.20-rc1 if I revert it.
>>
>> 3. The write req is correctly passed as far as the actual io handler and 
>> correctly processed by the io thread. Upon finishing the request in the 
>> io thread the value of req->error is 0 and all values look OK.
>>
>> 4. The moment the req is read back by the irq handler, req->error contains 
>> something which looks like data from elsewhere rather than from the 
>> request. E.g. error may contain 55AA55AA
>>
>> 5. Other bits of the req are also zapped in a similar manner.
>>
>> 6. The pointer to the req passed along the IPC pipe is correct. If a req 
>> at 00000000deafe300 is given for execution to the IO thread, that is 
>> what is in the io_req variable in the handler. It is just that the 
>> contents of that req are scrambled by that time.
>>
>> 7.  I see it only for write reqs.
>>
>> I just do not see where it can be zapped. At all. I did a prototype to 
>> add the BLK_STS_AGAIN return code and continue-from-half-transmitted-req 
>> logic similar to the one in the nbd driver. It is not that. There is 
>> something else which causes this and I just do not see it :(

My guess would be that the issue of requests is no longer blocking
interrupts, that looks like an oversight. So you could have your
IRQ handler race with writing new requests, which would be a problem...

Does the below help things?


diff --git a/arch/um/drivers/ubd_kern.c b/arch/um/drivers/ubd_kern.c
index 74c002ddc0ce..019bc4828e30 100644
--- a/arch/um/drivers/ubd_kern.c
+++ b/arch/um/drivers/ubd_kern.c
@@ -1341,11 +1341,14 @@ static int ubd_queue_one_vec(struct blk_mq_hw_ctx *hctx, struct request *req,
 static blk_status_t ubd_queue_rq(struct blk_mq_hw_ctx *hctx,
 				 const struct blk_mq_queue_data *bd)
 {
+	struct ubd *ubd_dev = hctx->queue->queuedata;
 	struct request *req = bd->rq;
 	int ret = 0;
 
 	blk_mq_start_request(req);
 
+	spin_lock_irq(&ubd_dev->lock);
+
 	if (req_op(req) == REQ_OP_FLUSH) {
 		ret = ubd_queue_one_vec(hctx, req, 0, NULL);
 	} else {
@@ -1361,6 +1364,8 @@ static blk_status_t ubd_queue_rq(struct blk_mq_hw_ctx *hctx,
 		}
 	}
 out:
+	spin_unlock_irq(&ubd_dev->lock);
+
 	if (ret < 0) {
 		blk_mq_requeue_request(req, true);
 	}

-- 
Jens Axboe
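
For readers following the patch: the pairing it relies on can be sketched as
below. The names are hypothetical and the completion side is only an
illustration of the pattern, not the actual ubd interrupt handler. The
submission path runs in process context with interrupts enabled, so it
disables them while holding the per-device lock; a completion handler running
in hard-IRQ context already has interrupts off, so a plain spin_lock() on the
same lock pairs with it and the two paths can no longer interleave.

#include <linux/blk-mq.h>
#include <linux/interrupt.h>
#include <linux/spinlock.h>

struct example_dev {
        spinlock_t lock;        /* protects state shared with the IRQ side */
};

static blk_status_t example_queue_rq(struct example_dev *dev)
{
        /* process context, interrupts known to be on: _irq variant is fine */
        spin_lock_irq(&dev->lock);
        /* ... hand the request to the I/O thread ... */
        spin_unlock_irq(&dev->lock);
        return BLK_STS_OK;
}

static irqreturn_t example_intr(int irq, void *data)
{
        struct example_dev *dev = data;

        /* hard-IRQ context: interrupts are already disabled here, so a
         * plain spin_lock() pairs correctly with the submission side */
        spin_lock(&dev->lock);
        /* ... reap completed requests and finish them ... */
        spin_unlock(&dev->lock);
        return IRQ_HANDLED;
}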



* Re: Summary so far - ubd breakage in 4.20-rc1
  2018-11-07 19:19   ` Jens Axboe
@ 2018-11-07 21:15     ` Richard Weinberger
  2018-11-07 21:16       ` Jens Axboe
From: Richard Weinberger @ 2018-11-07 21:15 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-um, Anton Ivanov, hch

Jens,

On Wednesday, 7 November 2018, 20:19:42 CET, Jens Axboe wrote:
> My guess would be that the issue of requests is no longer blocking
> interrupts, that looks like an oversight. So you could have your
> IRQ handler race with writing new requests, which would be a problem...

sounds reasonable.

> Does the below help things?
> 

It does!
Anton, please give it also a try.

> diff --git a/arch/um/drivers/ubd_kern.c b/arch/um/drivers/ubd_kern.c
> index 74c002ddc0ce..019bc4828e30 100644
> --- a/arch/um/drivers/ubd_kern.c
> +++ b/arch/um/drivers/ubd_kern.c
> @@ -1341,11 +1341,14 @@ static int ubd_queue_one_vec(struct blk_mq_hw_ctx *hctx, struct request *req,
>  static blk_status_t ubd_queue_rq(struct blk_mq_hw_ctx *hctx,
>  				 const struct blk_mq_queue_data *bd)
>  {
> +	struct ubd *ubd_dev = hctx->queue->queuedata;
>  	struct request *req = bd->rq;
>  	int ret = 0;
>  
>  	blk_mq_start_request(req);
>  
> +	spin_lock_irq(&ubd_dev->lock);
> +

BTW: Why not irq_save/restore?

Thanks,
//richard





* Re: Summary so far - ubd breakage in 4.20-rc1
  2018-11-07 21:15     ` Richard Weinberger
@ 2018-11-07 21:16       ` Jens Axboe
  2018-11-07 21:46         ` Richard Weinberger
From: Jens Axboe @ 2018-11-07 21:16 UTC (permalink / raw)
  To: Richard Weinberger; +Cc: linux-um, Anton Ivanov, hch

On 11/7/18 2:15 PM, Richard Weinberger wrote:
> Jens,
> 
> On Wednesday, 7 November 2018, 20:19:42 CET, Jens Axboe wrote:
>> My guess would be that the issue of requests is no longer blocking
>> interrupts, that looks like an oversight. So you could have your
>> IRQ handler race with writing new requests, which would be a problem...
> 
> sounds reasonable.
> 
>> Does the below help things?
>>
> 
> It does!
> Anton, please give it also a try.

OK good, it's definitely a bug, and given the symptoms, sounds like
THE bug in this case.

>> diff --git a/arch/um/drivers/ubd_kern.c b/arch/um/drivers/ubd_kern.c
>> index 74c002ddc0ce..019bc4828e30 100644
>> --- a/arch/um/drivers/ubd_kern.c
>> +++ b/arch/um/drivers/ubd_kern.c
>> @@ -1341,11 +1341,14 @@ static int ubd_queue_one_vec(struct blk_mq_hw_ctx *hctx, struct request *req,
>>  static blk_status_t ubd_queue_rq(struct blk_mq_hw_ctx *hctx,
>>  				 const struct blk_mq_queue_data *bd)
>>  {
>> +	struct ubd *ubd_dev = hctx->queue->queuedata;
>>  	struct request *req = bd->rq;
>>  	int ret = 0;
>>  
>>  	blk_mq_start_request(req);
>>  
>> +	spin_lock_irq(&ubd_dev->lock);
>> +
> 
> BTW: Why not irq_save/restore?

You only need the save/restore versions if you're called with
interrupts already disabled (or they may be); that's not the case
here. Hence we can use the faster versions.

-- 
Jens Axboe
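
Put as a hedged illustration (hypothetical lock, not driver code), the
difference is in what the unlock does: the plain variant unconditionally
re-enables interrupts, so it is only correct where interrupts are known to
be on when the lock is taken, while the save/restore variant puts back
whatever state the caller had.

        unsigned long flags;

        /* caller known to run with interrupts enabled */
        spin_lock_irq(&dev->lock);
        /* ... critical section ... */
        spin_unlock_irq(&dev->lock);            /* always exits with IRQs on */

        /* caller may already have interrupts disabled */
        spin_lock_irqsave(&dev->lock, flags);
        /* ... critical section ... */
        spin_unlock_irqrestore(&dev->lock, flags);  /* restores saved state */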



* Re: Summary so far - ubd breakage in 4.20-rc1
  2018-11-07 21:16       ` Jens Axboe
@ 2018-11-07 21:46         ` Richard Weinberger
  2018-11-07 21:47           ` Jens Axboe
From: Richard Weinberger @ 2018-11-07 21:46 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-um, Anton Ivanov, hch

On Wednesday, 7 November 2018, 22:16:20 CET, Jens Axboe wrote:
> >> Does the below help things?
> >>
> > 
> > It does!
> > Anton, please give it also a try.
> 
> OK good, it's definitely a bug, and given the symptoms, sounds like
> THE bug in this case.

Shall I send your fix as patch or do you want to?
 
> >> diff --git a/arch/um/drivers/ubd_kern.c b/arch/um/drivers/ubd_kern.c
> >> index 74c002ddc0ce..019bc4828e30 100644
> >> --- a/arch/um/drivers/ubd_kern.c
> >> +++ b/arch/um/drivers/ubd_kern.c
> >> @@ -1341,11 +1341,14 @@ static int ubd_queue_one_vec(struct blk_mq_hw_ctx *hctx, struct request *req,
> >>  static blk_status_t ubd_queue_rq(struct blk_mq_hw_ctx *hctx,
> >>  				 const struct blk_mq_queue_data *bd)
> >>  {
> >> +	struct ubd *ubd_dev = hctx->queue->queuedata;
> >>  	struct request *req = bd->rq;
> >>  	int ret = 0;
> >>  
> >>  	blk_mq_start_request(req);
> >>  
> >> +	spin_lock_irq(&ubd_dev->lock);
> >> +
> > 
> > BTW: Why not irq_save/restore?
> 
> You only need the save/restore versions if you're called with
> interrupts already disabled (or may be), that's not the case
> here. Hence we can use the faster versions.

Got it!

Thanks,
//richard





* Re: Summary so far - ubd breakage in 4.20-rc1
  2018-11-07 21:46         ` Richard Weinberger
@ 2018-11-07 21:47           ` Jens Axboe
  2018-11-08  8:30             ` Anton Ivanov
From: Jens Axboe @ 2018-11-07 21:47 UTC (permalink / raw)
  To: Richard Weinberger; +Cc: linux-um, Anton Ivanov, hch

On 11/7/18 2:46 PM, Richard Weinberger wrote:
> On Wednesday, 7 November 2018, 22:16:20 CET, Jens Axboe wrote:
>>>> Does the below help things?
>>>>
>>>
>>> It does!
>>> Anton, please give it also a try.
>>
>> OK good, it's definitely a bug, and given the symptoms, sounds like
>> THE bug in this case.
> 
> Shall I send your fix as patch or do you want to?

I'm going to queue it up, just waiting for Anton to also test so I
can add the appropriate Tested-by.

-- 
Jens Axboe



* Re: Summary so far - ubd breakage in 4.20-rc1
  2018-11-07 21:47           ` Jens Axboe
@ 2018-11-08  8:30             ` Anton Ivanov
  2018-11-08 12:06               ` Anton Ivanov
From: Anton Ivanov @ 2018-11-08  8:30 UTC (permalink / raw)
  To: Jens Axboe, Richard Weinberger; +Cc: linux-um, hch


On 11/7/18 9:47 PM, Jens Axboe wrote:
> On 11/7/18 2:46 PM, Richard Weinberger wrote:
>> On Wednesday, 7 November 2018, 22:16:20 CET, Jens Axboe wrote:
>>>>> Does the below help things?
>>>>>
>>>> It does!
>>>> Anton, please give it also a try.
>>> OK good, it's definitely a bug, and given the symptoms, sounds like
>>> THE bug in this case.
>> Shall I send your fix as patch or do you want to?
> I'm going to queue it up, just waiting for Anton to also test so I
> can add the appropriate Tested-by.
>
I am sorry to disappoint, but it fails testing in the same way. I tried 
the patch "as is" on top of 4.20-rc1, and I applied the same logic to my 
branch which had the changes to return BLK_STS_AGAIN instead of 
BLK_STS_OK for partially transmitted requests, as well as other fixes.

Just to clarify - it is "our" request, the UBD io_thread_req structure 
formed from the mq request, which is being corrupted, not the original mq 
request.

I do not understand how it can end up being corrupted in the first 
place. The only way this can happen is if the same instance of it is 
processed twice somewhere, and I do not see anything in the MQ patches 
which can produce that.

Ditto for why it affects only write requests. Ditto for why it fails on 
the very first write request - I have debug added on my branch, and it is 
the very first write request, at the point where the boot attempts to 
remount root, which blows up.

IMHO we need to concentrate on what is different between Richard's and 
my setup and why it passes in his case and fails in mine. This will 
hopefully yield some targeted information to isolate the culprit.

Here is my command line to start off with:

/var/autofs/local/src/linux-work/linux/vmlinux mem=$MEMORY umid=OPX \
     ubd0=/exports/UML-debian/OPX-3.0-Work.img \
vec0:transport=raw,ifname=$MGMT_IF,depth=128,gro=1,mac=92:9b:36:5e:38:68 \
vec1:transport=raw,ifname=$LEFT_IF,depth=128,gro=1,mac=92:9b:36:5e:38:69 \
vec2:transport=raw,ifname=$RIGHT_IF,depth=128,gro=1,mac=92:9b:36:5e:38:6a \
     root=/dev/ubda ro con=null con0=null,fd:2 con1=fd:0,fd:1 $@

All interfaces are the pair ends of veths. $MEMORY is 2G. I tried 
booting with various networking on/off combinations, single user mode, 
etc - the result is the same.

IMHO this is something purely on the UML side, not related to the MQ 
part of the equation. I just cannot grok what it is.

A.




* Re: Summary so far - ubd breakage in 4.20-rc1
  2018-11-08  8:30             ` Anton Ivanov
@ 2018-11-08 12:06               ` Anton Ivanov
  2018-11-08 12:31                 ` Anton Ivanov
From: Anton Ivanov @ 2018-11-08 12:06 UTC (permalink / raw)
  To: Jens Axboe, Richard Weinberger; +Cc: linux-um, hch


On 11/8/18 8:30 AM, Anton Ivanov wrote:
>
> On 11/7/18 9:47 PM, Jens Axboe wrote:
>> On 11/7/18 2:46 PM, Richard Weinberger wrote:
>>> On Wednesday, 7 November 2018, 22:16:20 CET, Jens Axboe wrote:
>>>>>> Does the below help things?
>>>>>>
>>>>> It does!
>>>>> Anton, please give it also a try.
>>>> OK good, it's definitely a bug, and given the symptoms, sounds like
>>>> THE bug in this case.
>>> Shall I send your fix as patch or do you want to?
>> I'm going to queue it up, just waiting for Anton to also test so I
>> can add the appropriate Tested-by.
>>
> I am sorry to disappoint, but it fails testing in the same way. I 
> tried the patch "as is" on top of 4.20-rc1 and I applied the same 
> logic to my branch which had the changes to return BLK_STS_AGAIN 
> instead of BLK_STS_OK for partial as well as other fixes.
>
> Just to clarify - it is "our" request - the UBD structure 
> io_thread_req formed from the mq which is being corrupted, not the 
> original mq request.
>
> I do not understand how it can end up being corrupted in the first 
> place. The only way this can happen is if the same instance of it is 
> processed twice somewhere and I do not see anything in the MQ patches 
> which can produce that.
>
> Ditto for why it affects only write requests. Ditto why it fails on 
> the very first write request - I have debug added on my branch and it 
> is the very first write request at the point where the boot attempts 
> to remount root which blows up.
>
> IMHO we need to concentrate on what is different between Richard's and 
> my setup and why it passes in his case and fails in mine. This will 
> hopefully yield some targeting information to isolate the culprit.
>
> Here is my command line to start off with:
>
> /var/autofs/local/src/linux-work/linux/vmlinux mem=$MEMORY umid=OPX \
>     ubd0=/exports/UML-debian/OPX-3.0-Work.img \
> vec0:transport=raw,ifname=$MGMT_IF,depth=128,gro=1,mac=92:9b:36:5e:38:68 
> \
> vec1:transport=raw,ifname=$LEFT_IF,depth=128,gro=1,mac=92:9b:36:5e:38:69 
> \
> vec2:transport=raw,ifname=$RIGHT_IF,depth=128,gro=1,mac=92:9b:36:5e:38:6a 
> \
>     root=/dev/ubda ro con=null con0=null,fd:2 con1=fd:0,fd:1 $@
>
> All interfaces are the pair ends of veths. $MEMORY is 2G. I tried 
> booting with various networking on/off combinations, single user mode, 
> etc - the result is the same.
>
> IMHO this is something purely on the UML side, not related to the MQ 
> part of the equation. I just cannot grok what it is.

Some movement.

Here is a series of debug printks with explanations:

[    9.580000] in req is 00000000ddb65500 op is 1, udb op is 1 = 1
[    9.580000] emit req is 00000000ddb65500 op is 1, udb op is 1 = 1
[    9.580000] thread req is 00000000ddb65500 op is 1, udb op is 1 = 1
[    9.580000] emit req is 00000000ddb65500 op is 2, udb op is 2 = 1
[    9.620000] irq req is 00000000ddb65500 op is 2, udb op is 2 = 1
[    9.620000] print_req_error: I/O error, dev ubda, sector 0
[    9.620000] Buffer I/O error on dev ubda, logical block 0, lost sync 
page write
[    9.620000] EXT4-fs (ubda): I/O error while writing superblock

"in" is at the beginning of ubd_queue_one_vec

"emit" is at the bottom of ubd_queue_one_vec - after the request has been written

"thread" is in the io thread

"irq" is the result as read back in the irq handler

The remount rw sequence touches the superblock and then SYNCs. That SYNC 
(udp op 2) is emitted using the SAME thread_req structure at the same 
address before it is freed in the irq handler.

Double use. Now why is it happening I will try to figure out after I 
grab something for lunch (unless someone smarter than me points to it 
first).

A.


>
> A.
>
>
>

* Re: Summary so far - ubd breakage in 4.20-rc1
  2018-11-08 12:06               ` Anton Ivanov
@ 2018-11-08 12:31                 ` Anton Ivanov
  2018-11-08 12:43                   ` Anton Ivanov
From: Anton Ivanov @ 2018-11-08 12:31 UTC (permalink / raw)
  To: Jens Axboe, Richard Weinberger; +Cc: linux-um, hch


On 11/8/18 12:06 PM, Anton Ivanov wrote:
>
> On 11/8/18 8:30 AM, Anton Ivanov wrote:
>>
>> On 11/7/18 9:47 PM, Jens Axboe wrote:
>>> On 11/7/18 2:46 PM, Richard Weinberger wrote:
>>>> On Wednesday, 7 November 2018, 22:16:20 CET, Jens Axboe wrote:
>>>>>>> Does the below help things?
>>>>>>>
>>>>>> It does!
>>>>>> Anton, please give it also a try.
>>>>> OK good, it's definitely a bug, and given the symptoms, sounds like
>>>>> THE bug in this case.
>>>> Shall I send your fix as patch or do you want to?
>>> I'm going to queue it up, just waiting for Anton to also test so I
>>> can add the appropriate Tested-by.
>>>
>> I am sorry to disappoint, but it fails testing in the same way. I 
>> tried the patch "as is" on top of 4.20-rc1 and I applied the same 
>> logic to my branch which had the changes to return BLK_STS_AGAIN 
>> instead of BLK_STS_OK for partial as well as other fixes.
>>
>> Just to clarify - it is "our" request - the UBD structure 
>> io_thread_req formed from the mq which is being corrupted, not the 
>> original mq request.
>>
>> I do not understand how it can end up being corrupted in the first 
>> place. The only way this can happen is if the same instance of it is 
>> processed twice somewhere and I do not see anything in the MQ patches 
>> which can produce that.
>>
>> Ditto for why it affects only write requests. Ditto why it fails on 
>> the very first write request - I have debug added on my branch and it 
>> is the very first write request at the point where the boot attempts 
>> to remount root which blows up.
>>
>> IMHO we need to concentrate on what is different between Richard's 
>> and my setup and why it passes in his case and fails in mine. This 
>> will hopefully yield some targeting information to isolate the culprit.
>>
>> Here is my command line to start off with:
>>
>> /var/autofs/local/src/linux-work/linux/vmlinux mem=$MEMORY umid=OPX \
>>     ubd0=/exports/UML-debian/OPX-3.0-Work.img \
>> vec0:transport=raw,ifname=$MGMT_IF,depth=128,gro=1,mac=92:9b:36:5e:38:68 
>> \
>> vec1:transport=raw,ifname=$LEFT_IF,depth=128,gro=1,mac=92:9b:36:5e:38:69 
>> \
>> vec2:transport=raw,ifname=$RIGHT_IF,depth=128,gro=1,mac=92:9b:36:5e:38:6a 
>> \
>>     root=/dev/ubda ro con=null con0=null,fd:2 con1=fd:0,fd:1 $@
>>
>> All interfaces are the pair ends of veths. $MEMORY is 2G. I tried 
>> booting with various networking on/off combinations, single user 
>> mode, etc - the result is the same.
>>
>> IMHO this is something purely on the UML side, not related to the MQ 
>> part of the equation. I just cannot grok what it is.
>
> Some movement.
>
> Here is a series of debug printks with explanations:
>
> [    9.580000] in req is 00000000ddb65500 op is 1, udb op is 1 = 1
> [    9.580000] emit req is 00000000ddb65500 op is 1, udb op is 1 = 1
> [    9.580000] thread req is 00000000ddb65500 op is 1, udb op is 1 = 1
> [    9.580000] emit req is 00000000ddb65500 op is 2, udb op is 2 = 1
> [    9.620000] irq req is 00000000ddb65500 op is 2, udb op is 2 = 1
> [    9.620000] print_req_error: I/O error, dev ubda, sector 0
> [    9.620000] Buffer I/O error on dev ubda, logical block 0, lost 
> sync page write
> [    9.620000] EXT4-fs (ubda): I/O error while writing superblock


printks are reordered (not surprising, two threads trying to write).

It is not going in twice. Simply, the error field is uninitialized for 
SYNC and is set to a non-zero value only upon an error.

The error is from SYNC, not from the write.

A.
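
The direction this points to can be sketched as follows - the helper name is
hypothetical and this is not the eventual patch, only an illustration of
initializing the status field when the io_thread_req is prepared, so that an
op which never writes a data error (such as a flush/SYNC) does not complete
with whatever the allocator left in that field:

static void ubd_prep_one_req(struct io_thread_req *io_req)
{
        /* success unless the I/O thread reports otherwise; without this,
         * a request that never touches ->error completes with stale
         * memory in its status field */
        io_req->error = 0;

        /* ... fill in the rest of the request ... */
}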


>
> in is beginning of ubd_queue_one_vec
>
> emit is the bottom of ubd_queue_one_vec - after it has been written
>
> thread is in the io thread
>
> irq is the result
>
> The remount rw sequence touches the superblock and then SYNCs. That 
> SYNC (udp op 2) is emitted using the SAME thread_req structure at the 
> same address before it is freed in the irq handler.
>
> Double use. Now why is it happening I will try to figure out after I 
> grab something for lunch (unless someone smarter than me points to it 
> first).
>
> A.
>
>
>>
>> A.
>>
>>
>>

* Re: Summary so far - ubd breakage in 4.20-rc1
  2018-11-08 12:31                 ` Anton Ivanov
@ 2018-11-08 12:43                   ` Anton Ivanov
  2018-11-08 13:08                     ` Jens Axboe
From: Anton Ivanov @ 2018-11-08 12:43 UTC (permalink / raw)
  To: Jens Axboe, Richard Weinberger; +Cc: linux-um, hch


On 11/8/18 12:31 PM, Anton Ivanov wrote:
>
> On 11/8/18 12:06 PM, Anton Ivanov wrote:
>>
>> On 11/8/18 8:30 AM, Anton Ivanov wrote:
>>>
>>> On 11/7/18 9:47 PM, Jens Axboe wrote:
>>>> On 11/7/18 2:46 PM, Richard Weinberger wrote:
>>>>> On Wednesday, 7 November 2018, 22:16:20 CET, Jens Axboe wrote:
>>>>>>>> Does the below help things?
>>>>>>>>
>>>>>>> It does!
>>>>>>> Anton, please give it also a try.
>>>>>> OK good, it's definitely a bug, and given the symptoms, sounds like
>>>>>> THE bug in this case.
>>>>> Shall I send your fix as patch or do you want to?
>>>> I'm going to queue it up, just waiting for Anton to also test so I
>>>> can add the appropriate Tested-by.
>>>>
>>> I am sorry to disappoint, but it fails testing in the same way. I 
>>> tried the patch "as is" on top of 4.20-rc1 and I applied the same 
>>> logic to my branch which had the changes to return BLK_STS_AGAIN 
>>> instead of BLK_STS_OK for partial as well as other fixes.
>>>
>>> Just to clarify - it is "our" request - the UBD structure 
>>> io_thread_req formed from the mq which is being corrupted, not the 
>>> original mq request.
>>>
>>> I do not understand how it can end up being corrupted in the first 
>>> place. The only way this can happen is if the same instance of it is 
>>> processed twice somewhere and I do not see anything in the MQ 
>>> patches which can produce that.
>>>
>>> Ditto for why it affects only write requests. Ditto why it fails on 
>>> the very first write request - I have debug added on my branch and 
>>> it is the very first write request at the point where the boot 
>>> attempts to remount root which blows up.
>>>
>>> IMHO we need to concentrate on what is different between Richard's 
>>> and my setup and why it passes in his case and fails in mine. This 
>>> will hopefully yield some targeting information to isolate the culprit.
>>>
>>> Here is my command line to start off with:
>>>
>>> /var/autofs/local/src/linux-work/linux/vmlinux mem=$MEMORY umid=OPX \
>>>     ubd0=/exports/UML-debian/OPX-3.0-Work.img \
>>> vec0:transport=raw,ifname=$MGMT_IF,depth=128,gro=1,mac=92:9b:36:5e:38:68 
>>> \
>>> vec1:transport=raw,ifname=$LEFT_IF,depth=128,gro=1,mac=92:9b:36:5e:38:69 
>>> \
>>> vec2:transport=raw,ifname=$RIGHT_IF,depth=128,gro=1,mac=92:9b:36:5e:38:6a 
>>> \
>>>     root=/dev/ubda ro con=null con0=null,fd:2 con1=fd:0,fd:1 $@
>>>
>>> All interfaces are the pair ends of veths. $MEMORY is 2G. I tried 
>>> booting with various networking on/off combinations, single user 
>>> mode, etc - the result is the same.
>>>
>>> IMHO this is something purely on the UML side, not related to the MQ 
>>> part of the equation. I just cannot grok what it is.
>>
>> Some movement.
>>
>> Here is a series of debug printks with explanations:
>>
>> [    9.580000] in req is 00000000ddb65500 op is 1, udb op is 1 = 1
>> [    9.580000] emit req is 00000000ddb65500 op is 1, udb op is 1 = 1
>> [    9.580000] thread req is 00000000ddb65500 op is 1, udb op is 1 = 1
>> [    9.580000] emit req is 00000000ddb65500 op is 2, udb op is 2 = 1
>> [    9.620000] irq req is 00000000ddb65500 op is 2, udb op is 2 = 1
>> [    9.620000] print_req_error: I/O error, dev ubda, sector 0
>> [    9.620000] Buffer I/O error on dev ubda, logical block 0, lost 
>> sync page write
>> [    9.620000] EXT4-fs (ubda): I/O error while writing superblock
>
>
> printks are reordered (not surprising, two threads trying to write).
>
> It is not going in twice. Simply, error is uninitialized for SYNC and 
> set to a non-zero value only upon an error.
>
> The error is from SYNC, not from the write.

And the spinlock needs to be irqsave/irqrestore after all.

I will send a patch to finish this one off shortly.

A.

>
> A.
>
>
>>
>> in is beginning of ubd_queue_one_vec
>>
>> emit is the bottom of ubd_queue_one_vec - after it has been written
>>
>> thread is in the io thread
>>
>> irq is the result
>>
>> The remount rw sequence touches the superblock and then SYNCs. That 
>> SYNC (udp op 2) is emitted using the SAME thread_req structure at the 
>> same address before it is freed in the irq handler.
>>
>> Double use. Now why is it happening I will try to figure out after I 
>> grab something for lunch (unless someone smarter than me points to it 
>> first).
>>
>> A.
>>
>>
>>>
>>> A.
>>>
>>>
>>>

* Re: Summary so far - ubd breakage in 4.20-rc1
  2018-11-08 12:43                   ` Anton Ivanov
@ 2018-11-08 13:08                     ` Jens Axboe
From: Jens Axboe @ 2018-11-08 13:08 UTC (permalink / raw)
  To: Anton Ivanov, Richard Weinberger; +Cc: linux-um, hch

>> printks are reordered (not surprising, two threads trying to write).
>>
>> It is not going in twice. Simply, error is uninitialized for SYNC and 
>> set to a non-zero value only upon an error.

That makes more sense, since the io_req is allocated every time. The
double-use theory would have placed the error in the allocator, which
would be extremely unlikely.

>> The error is from SYNC, not from the write.
> 
> And the spinlock needs to be irqsave/irqrestore after all.
> 
> I will send a patch to finish this one off shortly.

The lock does NOT need to be save/restore; I already explained why.

-- 
Jens Axboe
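
On the allocator point above, an illustrative aside: kmalloc() returns
uninitialized memory, so a field that is only written on failure can carry
whatever happened to be there before; allocating the request zeroed is one
way to rule that out regardless of where the explicit initialization ends
up. The GFP flag below is illustrative, not necessarily what the driver uses.

        struct io_thread_req *io_req;

        io_req = kzalloc(sizeof(*io_req), GFP_ATOMIC);  /* ->error starts at 0 */
        if (!io_req)
                return BLK_STS_RESOURCE;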


