* uml segfault during I/O
@ 2019-10-17 11:37 Ritesh Raj Sarraf
       [not found] ` <1ccf27d8-6b6a-7d08-acef-93077f07511b@cambridgegreys.com>
  0 siblings, 1 reply; 12+ messages in thread
From: Ritesh Raj Sarraf @ 2019-10-17 11:37 UTC (permalink / raw)
  To: linux-um; +Cc: Anton Ivanov


This happens on the 5.2 Linux kernel with Debian patches on top.
The config details are:

Linux Host: 5.2 (Debian)
Linux UML: 5.2 (Debian)

This tends to happen very easily when doing a good amount of I/O, in
this case an `apt upgrade`.


[  OK  ] Started Update UTMP about System Runlevel Changes.
NET: Registered protocol family 10
Segment Routing with IPv6
random: crng init done

Modules linked in: ipv6 crc_ccitt
Pid: 870, comm: kworker/0:1H Not tainted 5.2.17
RIP: 0033:[<00000000607e9c03>]
RSP: 000000009d40fde0  EFLAGS: 00010297
RAX: 0000000000000000 RBX: 0000000000004801 RCX: 000000009d40fe30
RDX: 0000000000000001 RSI: 000000009dbc35c0 RDI: 0000000000000000
RBP: 000000009d40fe30 R08: 0000000c00000204 R09: 8080808080808080
R10: fefefefefefefeff R11: 0000000000000246 R12: 0000000000000000
R13: 000000009dbc35c0 R14: 000000009dbc3608 R15: 000000009dbb9800
Kernel panic - not syncing: Segfault with no mm
CPU: 0 PID: 870 Comm: kworker/0:1H Not tainted 5.2.17 #2
Workqueue: kblockd blk_mq_requeue_work
Stack:
 00000000 100000000000001 607d4425 00000001
 9dbc35c0 00000000 00000000 00000000
 9dbb9800 607d44e9 9d40fe30 9d40fe30
Call Trace:
 [<607d4425>] ? blk_mq_sched_insert_request+0x0/0xff
 [<607d44e9>] ? blk_mq_sched_insert_request+0xc4/0xff
 [<607d4425>] ? blk_mq_sched_insert_request+0x0/0xff
 [<607d07c6>] ? blk_mq_requeue_work+0xd0/0x133
 [<60434285>] ? process_one_work+0x1e4/0x34c
 [<60431fd0>] ? move_linked_works+0x0/0x4f
 [<609791f0>] ? __schedule+0x0/0x41d
 [<6043569f>] ? wq_worker_running+0xd/0x2f
 [<60431fd0>] ? move_linked_works+0x0/0x4f
 [<60435362>] ? worker_thread+0x324/0x45e
 [<6043215f>] ? set_pf_worker+0x0/0x5e
 [<604167c7>] ? get_signals+0x0/0xa
 [<60439485>] ? __kthread_parkme+0x4a/0x94
 [<60421a53>] ? do_exit+0x0/0x948
 [<6043503e>] ? worker_thread+0x0/0x45e
 [<604399aa>] ? kthread+0x198/0x1a0
 [<603fea08>] ? new_thread_handler+0x82/0xb8

/home/rrs/bin/uml-debian: line 8: 16743 Aborted                 (core dumped) linux ubd0=~/rrs-home/Libvirt-Images/uml.img eth0=daemon mem=1024M rw
 16:58 ♒ ॐ   ☹ 😟=> 134  


And stray `linux` processes remain on the host machine:

rrs      18159  0.0  0.0 1057692 11496 ?       Ss   16:57   0:00 linux ubd0=/home/rrs/rrs-home/Libvirt-Images/uml.img eth0=daemon mem=1024M rw
rrs      18187  0.0  0.0 1057704 11496 ?       Ss   16:57   0:00 linux ubd0=/home/rrs/rrs-home/Libvirt-Images/uml.img eth0=daemon mem=1024M rw

-- 
Ritesh Raj Sarraf
RESEARCHUT - http://www.researchut.com
"Necessity is the mother of invention."


* Re: uml segfault during I/O
       [not found] ` <1ccf27d8-6b6a-7d08-acef-93077f07511b@cambridgegreys.com>
@ 2019-10-17 13:30   ` Ritesh Raj Sarraf
  2019-10-17 13:33     ` Anton Ivanov
  0 siblings, 1 reply; 12+ messages in thread
From: Ritesh Raj Sarraf @ 2019-10-17 13:30 UTC (permalink / raw)
  To: Anton Ivanov, linux-um


On Thu, 2019-10-17 at 14:02 +0100, Anton Ivanov wrote:
> Looking into that. I have not run into anything like that, but I have
> not used any of the legacy networking for 5 odd years now.
> 

Do you think this is related to networking? I ask because there was no
network activity going on.

apt is just an example. The packages had all already been downloaded
and there was no network activity; there was only block I/O.

One thing I noticed, which may or may not be useful to this report: I
was booting the uml guest, logging into it immediately, and running the
block I/O, and the segfault would occur.

If, instead, I booted the uml vm, let it sit idle for a minute or so,
and then did the I/O, it worked fine. So I wonder whether there is some
lazy initialization happening in the background that gets corrupted by
I/O issued immediately after boot.

On the other hand, if there are any commands I could run locally to
generate more information, please let me know.

Thanks,
Ritesh

-- 
Ritesh Raj Sarraf
RESEARCHUT - http://www.researchut.com
"Necessity is the mother of invention."


* Re: uml segfault during I/O
  2019-10-17 13:30   ` Ritesh Raj Sarraf
@ 2019-10-17 13:33     ` Anton Ivanov
  2019-10-17 15:03       ` Anton Ivanov
  0 siblings, 1 reply; 12+ messages in thread
From: Anton Ivanov @ 2019-10-17 13:33 UTC (permalink / raw)
  To: rrs, linux-um



On 17/10/2019 14:30, Ritesh Raj Sarraf wrote:
> On Thu, 2019-10-17 at 14:02 +0100, Anton Ivanov wrote:
>> Looking into that. I have not run into anything like that, but I have
>> not used any of the legacy networking for 5 odd years now.
>>
> 
> Do you think this is related to networking ? I ask because there was no
> network activity going on.

No, it's the disk somewhere. I managed to reproduce it with stock 5.2
on a Debian 5.2 host.

> 
> apt is just an example. The packages were all already downloaded and
> there was no network activity. Rather, there was block I/O.

I am looking into that.

> 
> One thing I noticed, which may or may not be useful to this report. I
> was booting the uml guest and immediately logging into it and running
> the block I/O. And the segfault would occur.
> 
> If I, instead, booted the uml vm and let it remain idle for a minute or
> so and then did the I/O, it worked fine. So I am not sure if there is
> any lazy initialization happening in the background which gets
> corrupted during immediate hot boot I/O.

Interesting...

> 
> On the other hand, if you think there can be any number of commands to
> run locally that could generate more information, please assist me so.

As I said, I managed to reproduce it and I am looking at it. As a
first step I am trying a couple of versions up/down to see if this is
5.2 specific.

> 
> Thanks,
> Ritesh
> 

-- 
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/


* Re: uml segfault during I/O
  2019-10-17 13:33     ` Anton Ivanov
@ 2019-10-17 15:03       ` Anton Ivanov
  2019-10-17 16:29         ` Anton Ivanov
  0 siblings, 1 reply; 12+ messages in thread
From: Anton Ivanov @ 2019-10-17 15:03 UTC (permalink / raw)
  To: rrs, linux-um



On 17/10/2019 14:33, Anton Ivanov wrote:
> 
> 
> On 17/10/2019 14:30, Ritesh Raj Sarraf wrote:
>> On Thu, 2019-10-17 at 14:02 +0100, Anton Ivanov wrote:
>>> Looking into that. I have not run into anything like that, but I have
>>> not used any of the legacy networking for 5 odd years now.
>>>
>>
>> Do you think this is related to networking ? I ask because there was no
>> network activity going on.
> 
> no, it's disk somewhere. I managed to reproduce it with 5.2 stock on 
> Debian 5.2 host.
> 
>>
>> apt is just an example. The packages were all already downloaded and
>> there was no network activity. Rather, there was block I/O.
> 
> I am looking into that.
> 
>>
>> One thing I noticed, which may or may not be useful to this report. I
>> was booting the uml guest and immediately logging into it and running
>> the block I/O. And the segfault would occur.
>>
>> If I, instead, booted the uml vm and let it remain idle for a minute or
>> so and then did the I/O, it worked fine. So I am not sure if there is
>> any lazy initialization happening in the background which gets
>> corrupted during immediate hot boot I/O.
> 
> Interesting...
> 
>>
>> On the other hand, if you think there can be any number of commands to
>> run locally that could generate more information, please assist me so.
> 
> As I said - I managed to reproduce it, I am looking at it. In first 
> instance I am trying with a couple of version up/down to see if this is 
> 5.2 specific.

I cannot even get it to start on 5.4-rc1; 5.3 shows the same symptoms.

> 
>>
>> Thanks,
>> Ritesh
>>
> 

-- 
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/


* Re: uml segfault during I/O
  2019-10-17 15:03       ` Anton Ivanov
@ 2019-10-17 16:29         ` Anton Ivanov
  2019-10-18  7:35           ` Anton Ivanov
  0 siblings, 1 reply; 12+ messages in thread
From: Anton Ivanov @ 2019-10-17 16:29 UTC (permalink / raw)
  To: rrs, linux-um



On 17/10/2019 16:03, Anton Ivanov wrote:
> 
> 
> On 17/10/2019 14:33, Anton Ivanov wrote:
>>
>>
>> On 17/10/2019 14:30, Ritesh Raj Sarraf wrote:
>>> On Thu, 2019-10-17 at 14:02 +0100, Anton Ivanov wrote:
>>>> Looking into that. I have not run into anything like that, but I have
>>>> not used any of the legacy networking for 5 odd years now.
>>>>
>>>
>>> Do you think this is related to networking ? I ask because there was no
>>> network activity going on.
>>
>> no, it's disk somewhere. I managed to reproduce it with 5.2 stock on 
>> Debian 5.2 host.
>>
>>>
>>> apt is just an example. The packages were all already downloaded and
>>> there was no network activity. Rather, there was block I/O.
>>
>> I am looking into that.
>>
>>>
>>> One thing I noticed, which may or may not be useful to this report. I
>>> was booting the uml guest and immediately logging into it and running
>>> the block I/O. And the segfault would occur.
>>>
>>> If I, instead, booted the uml vm and let it remain idle for a minute or
>>> so and then did the I/O, it worked fine. So I am not sure if there is
>>> any lazy initialization happening in the background which gets
>>> corrupted during immediate hot boot I/O.
>>
>> Interesting...
>>
>>>
>>> On the other hand, if you think there can be any number of commands to
>>> run locally that could generate more information, please assist me so.
>>
>> As I said - I managed to reproduce it, I am looking at it. In first 
>> instance I am trying with a couple of version up/down to see if this 
>> is 5.2 specific.
> 
> I cannot even get it to start on 5.4-rc1, 5.3 shows the same symptoms.

This is something outside the UBD driver as such. There have been no
notable changes to it since we ported it to block-mq and added DISCARD,
which was quite a while back.

I am going to check the other usual suspects such as IRQs, but that is 
something I test quite extensively when working on the networking side.

So I suspect that this is something outside the UML-specific code
which is showing up only in UML for some reason.

A.

> 
>>
>>>
>>> Thanks,
>>> Ritesh
>>>
>>
> 

-- 
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/


* Re: uml segfault during I/O
  2019-10-17 16:29         ` Anton Ivanov
@ 2019-10-18  7:35           ` Anton Ivanov
  2019-10-18  8:57             ` Johannes Berg
  2019-10-18  9:55             ` Anton Ivanov
  0 siblings, 2 replies; 12+ messages in thread
From: Anton Ivanov @ 2019-10-18  7:35 UTC (permalink / raw)
  To: rrs, linux-um



On 17/10/2019 17:29, Anton Ivanov wrote:
> 
> 
> On 17/10/2019 16:03, Anton Ivanov wrote:
>>
>>
>> On 17/10/2019 14:33, Anton Ivanov wrote:
>>>
>>>
>>> On 17/10/2019 14:30, Ritesh Raj Sarraf wrote:
>>>> On Thu, 2019-10-17 at 14:02 +0100, Anton Ivanov wrote:
>>>>> Looking into that. I have not run into anything like that, but I have
>>>>> not used any of the legacy networking for 5 odd years now.
>>>>>
>>>>
>>>> Do you think this is related to networking ? I ask because there was no
>>>> network activity going on.
>>>
>>> no, it's disk somewhere. I managed to reproduce it with 5.2 stock on 
>>> Debian 5.2 host.
>>>
>>>>
>>>> apt is just an example. The packages were all already downloaded and
>>>> there was no network activity. Rather, there was block I/O.
>>>
>>> I am looking into that.
>>>
>>>>
>>>> One thing I noticed, which may or may not be useful to this report. I
>>>> was booting the uml guest and immediately logging into it and running
>>>> the block I/O. And the segfault would occur.
>>>>
>>>> If I, instead, booted the uml vm and let it remain idle for a minute or
>>>> so and then did the I/O, it worked fine. So I am not sure if there is
>>>> any lazy initialization happening in the background which gets
>>>> corrupted during immediate hot boot I/O.
>>>
>>> Interesting...
>>>
>>>>
>>>> On the other hand, if you think there can be any number of commands to
>>>> run locally that could generate more information, please assist me so.
>>>
>>> As I said - I managed to reproduce it, I am looking at it. In first 
>>> instance I am trying with a couple of version up/down to see if this 
>>> is 5.2 specific.
>>
>> I cannot even get it to start on 5.4-rc1, 5.3 shows the same symptoms.
> 
> This is something outside the UBD driver as such. There were no notable 
> changes to it since we ported it to block-mq and added DISCARD which was 
> quite a while back.
> 
> I am going to check the other usual suspects such as IRQs, but that is 
> something I test quite extensively when working on the networking side.
> 
> So I suspect that this is something outside UML which is showing only in 
> UML for some reason.

Still happening with 5.2.21, albeit a bit more difficult to reproduce.

Looking at the backtraces, it is ALWAYS the result of a re-queue.

[  263.990000]  [<60440633>] blk_mq_dispatch_rq_list+0xf3/0x5f0
[  263.990000]  [<60440501>] ? blk_mq_get_driver_tag+0xc1/0x100
[  263.990000]  [<60440540>] ? blk_mq_dispatch_rq_list+0x0/0x5f0
[  263.990000]  [<60445966>] blk_mq_do_dispatch_sched+0x66/0xe0
[  263.990000]  [<60446107>] blk_mq_sched_dispatch_requests+0x107/0x190
[  263.990000]  [<6043f130>] ? blk_mq_run_hw_queue+0x0/0x120
[  263.990000]  [<6043efb4>] __blk_mq_run_hw_queue+0x74/0xd0
[  263.990000]  [<6043f0d4>] __blk_mq_delay_run_hw_queue+0xc4/0xd0
[  263.990000]  [<6043f1d9>] blk_mq_run_hw_queue+0xa9/0x120
[  263.990000]  [<6043f292>] blk_mq_run_hw_queues+0x42/0x60
[  263.990000]  [<60440bc0>] ? blk_mq_request_bypass_insert+0x0/0x90
[  263.990000]  [<604462a0>] ? blk_mq_sched_insert_request+0x0/0x1c0
ALWAYS >>[  263.990000]  [<604414b0>] blk_mq_requeue_work+0x160/0x170
[  263.990000]  [<6008c91b>] process_one_work+0x1eb/0x490
[  263.990000]  [<607f5aa0>] ? __schedule+0x0/0x620
[  263.990000]  [<6008e000>] ? wq_worker_running+0x10/0x40
[  263.990000]  [<6008c730>] ? process_one_work+0x0/0x490
[  263.990000]  [<6008cc06>] worker_thread+0x46/0x670
[  263.990000]  [<600931c1>] ? __kthread_parkme+0xa1/0xd0
[  263.990000]  [<6008cbc0>] ? worker_thread+0x0/0x670
[  263.990000]  [<6008cbc0>] ? worker_thread+0x0/0x670
[  263.990000]  [<60093bc4>] kthread+0x194/0x1c0
[  263.990000]  [<6002a091>] new_thread_handler+0x81/0xc0


A.

> 
> A.
> 
>>
>>>
>>>>
>>>> Thanks,
>>>> Ritesh
>>>>
>>>
>>
> 

-- 
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/


* Re: uml segfault during I/O
  2019-10-18  7:35           ` Anton Ivanov
@ 2019-10-18  8:57             ` Johannes Berg
  2019-10-18  9:13               ` Anton Ivanov
  2019-10-18  9:55             ` Anton Ivanov
  1 sibling, 1 reply; 12+ messages in thread
From: Johannes Berg @ 2019-10-18  8:57 UTC (permalink / raw)
  To: Anton Ivanov, rrs, linux-um

On Fri, 2019-10-18 at 08:35 +0100, Anton Ivanov wrote:
> 
> Still happening with 5.2.21, albeit a bit more difficult to reproduce.

Just randomly reviewing the code, isn't there a bug in io_thread()?

--- a/arch/um/drivers/ubd_kern.c
+++ b/arch/um/drivers/ubd_kern.c
@@ -1602,7 +1602,8 @@ int io_thread(void *arg)
                written = 0;
 
                do {
-                       res = os_write_file(kernel_fd, ((char *) io_req_buffer) + written, n);
+                       res = os_write_file(kernel_fd, ((char *) io_req_buffer) + written,
+                                           n - written);
                        if (res >= 0) {
                                written += res;
                        }
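
Passing the full length `n` on every retry means that after a short
write the next call starts `written` bytes into the buffer but still
asks for `n` bytes, so it reads past the end of the buffer and pushes
too many bytes down the pipe. As a self-contained illustration of the
corrected pattern (a userspace sketch with assumed names, not the
driver code):

#include <errno.h>
#include <unistd.h>

/* Write all n bytes of buf to fd, resuming correctly after a short write. */
static int write_all(int fd, const char *buf, size_t n)
{
        size_t written = 0;

        while (written < n) {
                ssize_t res = write(fd, buf + written, n - written);

                if (res < 0) {
                        if (errno == EINTR)
                                continue;   /* interrupted, retry */
                        return -errno;      /* real error */
                }
                written += res;             /* skip what was written */
        }
        return 0;
}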

johannes



* Re: uml segfault during I/O
  2019-10-18  8:57             ` Johannes Berg
@ 2019-10-18  9:13               ` Anton Ivanov
  2019-10-18  9:49                 ` Anton Ivanov
  0 siblings, 1 reply; 12+ messages in thread
From: Anton Ivanov @ 2019-10-18  9:13 UTC (permalink / raw)
  To: Johannes Berg, rrs, linux-um



On 18/10/2019 09:57, Johannes Berg wrote:
> On Fri, 2019-10-18 at 08:35 +0100, Anton Ivanov wrote:
>>
>> Still happening with 5.2.21, albeit a bit more difficult to reproduce.
> 
> Just randomly reviewing the code, isn't there a bug in io_thread()?

Indeed. Well spotted. It is good that a short write to a block device is 
so rare :)

> 
> --- a/arch/um/drivers/ubd_kern.c
> +++ b/arch/um/drivers/ubd_kern.c
> @@ -1602,7 +1602,8 @@ int io_thread(void *arg)
>                  written = 0;
>   
>                  do {
> -                       res = os_write_file(kernel_fd, ((char *) io_req_buffer) + written, n);
> +                       res = os_write_file(kernel_fd, ((char *) io_req_buffer) + written,
> +                                           n - written);
>                          if (res >= 0) {
>                                  written += res;
>                          }
> 
> johannes
> 

I have verified this.

It is always in a requeue, and always after the io thread has returned
EAGAIN on the attempt to submit a request.

That part of the code has been tested very heavily before, and it works
fine in 4.19.
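
For context, submitting a request to the io thread is just a
non-blocking write of a request descriptor down a pipe, so under load
it can legitimately come back with EAGAIN. Roughly along these lines (a
sketch with assumed names, not the exact driver code):

#include <os.h>    /* os_write_file() */

/* Sketch: hand one request descriptor to the io thread. thread_fd is
 * assumed to be the non-blocking end of the pipe; UML passes pointers
 * through it, so a full pipe surfaces as -EAGAIN and the request has
 * to be retried later. */
static int submit_to_io_thread(int thread_fd, struct io_thread_req *io_req)
{
        int n = os_write_file(thread_fd, &io_req, sizeof(io_req));

        if (n == sizeof(io_req))
                return 0;
        return n < 0 ? n : -EAGAIN;    /* error or short write */
}

The question is what happens to the request after that EAGAIN.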

-- 
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/


* Re: uml segfault during I/O
  2019-10-18  9:13               ` Anton Ivanov
@ 2019-10-18  9:49                 ` Anton Ivanov
  0 siblings, 0 replies; 12+ messages in thread
From: Anton Ivanov @ 2019-10-18  9:49 UTC (permalink / raw)
  To: Johannes Berg, rrs, linux-um



On 18/10/2019 10:13, Anton Ivanov wrote:
> 
> 
> On 18/10/2019 09:57, Johannes Berg wrote:
>> On Fri, 2019-10-18 at 08:35 +0100, Anton Ivanov wrote:
>>>
>>> Still happening with 5.2.21, albeit a bit more difficult to reproduce.
>>
>> Just randomly reviewing the code, isn't there a bug in io_thread()?
> 
> Indeed. Well spotted. It is good that a short write to a block device is 
> so rare :)
> 
>>
>> --- a/arch/um/drivers/ubd_kern.c
>> +++ b/arch/um/drivers/ubd_kern.c
>> @@ -1602,7 +1602,8 @@ int io_thread(void *arg)
>>                  written = 0;
>>                  do {
>> -                       res = os_write_file(kernel_fd, ((char *) 
>> io_req_buffer) + written, n);
>> +                       res = os_write_file(kernel_fd, ((char *) 
>> io_req_buffer) + written,
>> +                                           n - written);
>>                          if (res >= 0) {
>>                                  written += res;
>>                          }
>>
>> johannes
>>
> 
> I have verified this.
> 
> It is always in a requeue and always after the io thread has returned 
> EAGAIN on the attempt to submit a request.
> 
> That part of the code has been tested very heavily before and it works 
> fine in 4.19
> 

I have applied it, and it is not the culprit.

This is something in the blk-mq code, and we should forward the whole
thread to the blk-mq maintainers.

A




-- 
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/


* Re: uml segfault during I/O
  2019-10-18  7:35           ` Anton Ivanov
  2019-10-18  8:57             ` Johannes Berg
@ 2019-10-18  9:55             ` Anton Ivanov
  2019-10-18 10:51               ` Anton Ivanov
  1 sibling, 1 reply; 12+ messages in thread
From: Anton Ivanov @ 2019-10-18  9:55 UTC (permalink / raw)
  To: rrs, linux-um; +Cc: axboe, hch

Adding Jens and Christoph

On 18/10/2019 08:35, Anton Ivanov wrote:
> 
> 
> On 17/10/2019 17:29, Anton Ivanov wrote:
>>
>>
>> On 17/10/2019 16:03, Anton Ivanov wrote:
>>>
>>>
>>> On 17/10/2019 14:33, Anton Ivanov wrote:
>>>>
>>>>
>>>> On 17/10/2019 14:30, Ritesh Raj Sarraf wrote:
>>>>> On Thu, 2019-10-17 at 14:02 +0100, Anton Ivanov wrote:
>>>>>> Looking into that. I have not run into anything like that, but I have
>>>>>> not used any of the legacy networking for 5 odd years now.
>>>>>>
>>>>>
>>>>> Do you think this is related to networking ? I ask because there 
>>>>> was no
>>>>> network activity going on.
>>>>
>>>> no, it's disk somewhere. I managed to reproduce it with 5.2 stock on 
>>>> Debian 5.2 host.
>>>>
>>>>>
>>>>> apt is just an example. The packages were all already downloaded and
>>>>> there was no network activity. Rather, there was block I/O.
>>>>
>>>> I am looking into that.
>>>>
>>>>>
>>>>> One thing I noticed, which may or may not be useful to this report. I
>>>>> was booting the uml guest and immediately logging into it and running
>>>>> the block I/O. And the segfault would occur.
>>>>>
>>>>> If I, instead, booted the uml vm and let it remain idle for a 
>>>>> minute or
>>>>> so and then did the I/O, it worked fine. So I am not sure if there is
>>>>> any lazy initialization happening in the background which gets
>>>>> corrupted during immediate hot boot I/O.
>>>>
>>>> Interesting...
>>>>
>>>>>
>>>>> On the other hand, if you think there can be any number of commands to
>>>>> run locally that could generate more information, please assist me so.
>>>>
>>>> As I said - I managed to reproduce it, I am looking at it. In first 
>>>> instance I am trying with a couple of version up/down to see if this 
>>>> is 5.2 specific.
>>>
>>> I cannot even get it to start on 5.4-rc1, 5.3 shows the same symptoms.
>>
>> This is something outside the UBD driver as such. There were no 
>> notable changes to it since we ported it to block-mq and added DISCARD 
>> which was quite a while back.
>>
>> I am going to check the other usual suspects such as IRQs, but that is 
>> something I test quite extensively when working on the networking side.
>>
>> So I suspect that this is something outside UML which is showing only 
>> in UML for some reason.
> 
> Still happening with 5.2.21, albeit a bit more difficult to reproduce.
> 
> Looking at the backtraces it is ALWAYS a result of a re-queue.
> 
> [  263.990000]  [<60440633>] blk_mq_dispatch_rq_list+0xf3/0x5f0
> [  263.990000]  [<60440501>] ? blk_mq_get_driver_tag+0xc1/0x100
> [  263.990000]  [<60440540>] ? blk_mq_dispatch_rq_list+0x0/0x5f0
> [  263.990000]  [<60445966>] blk_mq_do_dispatch_sched+0x66/0xe0
> [  263.990000]  [<60446107>] blk_mq_sched_dispatch_requests+0x107/0x190
> [  263.990000]  [<6043f130>] ? blk_mq_run_hw_queue+0x0/0x120
> [  263.990000]  [<6043efb4>] __blk_mq_run_hw_queue+0x74/0xd0
> [  263.990000]  [<6043f0d4>] __blk_mq_delay_run_hw_queue+0xc4/0xd0
> [  263.990000]  [<6043f1d9>] blk_mq_run_hw_queue+0xa9/0x120
> [  263.990000]  [<6043f292>] blk_mq_run_hw_queues+0x42/0x60
> [  263.990000]  [<60440bc0>] ? blk_mq_request_bypass_insert+0x0/0x90
> [  263.990000]  [<604462a0>] ? blk_mq_sched_insert_request+0x0/0x1c0
> ALWAYS >>[  263.990000]  [<604414b0>] blk_mq_requeue_work+0x160/0x170
> [  263.990000]  [<6008c91b>] process_one_work+0x1eb/0x490
> [  263.990000]  [<607f5aa0>] ? __schedule+0x0/0x620
> [  263.990000]  [<6008e000>] ? wq_worker_running+0x10/0x40
> [  263.990000]  [<6008c730>] ? process_one_work+0x0/0x490
> [  263.990000]  [<6008cc06>] worker_thread+0x46/0x670
> [  263.990000]  [<600931c1>] ? __kthread_parkme+0xa1/0xd0
> [  263.990000]  [<6008cbc0>] ? worker_thread+0x0/0x670
> [  263.990000]  [<6008cbc0>] ? worker_thread+0x0/0x670
> [  263.990000]  [<60093bc4>] kthread+0x194/0x1c0
> [  263.990000]  [<6002a091>] new_thread_handler+0x81/0xc0
> 
> 
> A.
> 
>>
>> A.
>>
>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>> Ritesh
>>>>>
>>>>
>>>
>>
> 

To cut a long story short: we have a reproducible segfault when UML
re-queues a block request in 5.x.

This used to work in 4.x.

We found a couple of minor issues in UML while looking at this, which
we will fix shortly, but the root problem seems to be either in blk-mq
or in the way we are using it to re-queue requests.

-- 
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/


* Re: uml segfault during I/O
  2019-10-18  9:55             ` Anton Ivanov
@ 2019-10-18 10:51               ` Anton Ivanov
  2019-10-28 16:51                 ` Anton Ivanov
  0 siblings, 1 reply; 12+ messages in thread
From: Anton Ivanov @ 2019-10-18 10:51 UTC (permalink / raw)
  To: rrs, linux-um; +Cc: axboe, hch



On 18/10/2019 10:55, Anton Ivanov wrote:
> Adding Jens and Christoph
> 
> On 18/10/2019 08:35, Anton Ivanov wrote:
>>
>>
>> On 17/10/2019 17:29, Anton Ivanov wrote:
>>>
>>>
>>> On 17/10/2019 16:03, Anton Ivanov wrote:
>>>>
>>>>
>>>> On 17/10/2019 14:33, Anton Ivanov wrote:
>>>>>
>>>>>
>>>>> On 17/10/2019 14:30, Ritesh Raj Sarraf wrote:
>>>>>> On Thu, 2019-10-17 at 14:02 +0100, Anton Ivanov wrote:
>>>>>>> Looking into that. I have not run into anything like that, but I 
>>>>>>> have
>>>>>>> not used any of the legacy networking for 5 odd years now.
>>>>>>>
>>>>>>
>>>>>> Do you think this is related to networking ? I ask because there 
>>>>>> was no
>>>>>> network activity going on.
>>>>>
>>>>> no, it's disk somewhere. I managed to reproduce it with 5.2 stock 
>>>>> on Debian 5.2 host.
>>>>>
>>>>>>
>>>>>> apt is just an example. The packages were all already downloaded and
>>>>>> there was no network activity. Rather, there was block I/O.
>>>>>
>>>>> I am looking into that.
>>>>>
>>>>>>
>>>>>> One thing I noticed, which may or may not be useful to this report. I
>>>>>> was booting the uml guest and immediately logging into it and running
>>>>>> the block I/O. And the segfault would occur.
>>>>>>
>>>>>> If I, instead, booted the uml vm and let it remain idle for a 
>>>>>> minute or
>>>>>> so and then did the I/O, it worked fine. So I am not sure if there is
>>>>>> any lazy initialization happening in the background which gets
>>>>>> corrupted during immediate hot boot I/O.
>>>>>
>>>>> Interesting...
>>>>>
>>>>>>
>>>>>> On the other hand, if you think there can be any number of 
>>>>>> commands to
>>>>>> run locally that could generate more information, please assist me 
>>>>>> so.
>>>>>
>>>>> As I said - I managed to reproduce it, I am looking at it. In first 
>>>>> instance I am trying with a couple of version up/down to see if 
>>>>> this is 5.2 specific.
>>>>
>>>> I cannot even get it to start on 5.4-rc1, 5.3 shows the same symptoms.
>>>
>>> This is something outside the UBD driver as such. There were no 
>>> notable changes to it since we ported it to block-mq and added 
>>> DISCARD which was quite a while back.
>>>
>>> I am going to check the other usual suspects such as IRQs, but that 
>>> is something I test quite extensively when working on the networking 
>>> side.
>>>
>>> So I suspect that this is something outside UML which is showing only 
>>> in UML for some reason.
>>
>> Still happening with 5.2.21, albeit a bit more difficult to reproduce.
>>
>> Looking at the backtraces it is ALWAYS a result of a re-queue.
>>
>> [  263.990000]  [<60440633>] blk_mq_dispatch_rq_list+0xf3/0x5f0
>> [  263.990000]  [<60440501>] ? blk_mq_get_driver_tag+0xc1/0x100
>> [  263.990000]  [<60440540>] ? blk_mq_dispatch_rq_list+0x0/0x5f0
>> [  263.990000]  [<60445966>] blk_mq_do_dispatch_sched+0x66/0xe0
>> [  263.990000]  [<60446107>] blk_mq_sched_dispatch_requests+0x107/0x190
>> [  263.990000]  [<6043f130>] ? blk_mq_run_hw_queue+0x0/0x120
>> [  263.990000]  [<6043efb4>] __blk_mq_run_hw_queue+0x74/0xd0
>> [  263.990000]  [<6043f0d4>] __blk_mq_delay_run_hw_queue+0xc4/0xd0
>> [  263.990000]  [<6043f1d9>] blk_mq_run_hw_queue+0xa9/0x120
>> [  263.990000]  [<6043f292>] blk_mq_run_hw_queues+0x42/0x60
>> [  263.990000]  [<60440bc0>] ? blk_mq_request_bypass_insert+0x0/0x90
>> [  263.990000]  [<604462a0>] ? blk_mq_sched_insert_request+0x0/0x1c0
>> ALWAYS >>[  263.990000]  [<604414b0>] blk_mq_requeue_work+0x160/0x170
>> [  263.990000]  [<6008c91b>] process_one_work+0x1eb/0x490
>> [  263.990000]  [<607f5aa0>] ? __schedule+0x0/0x620
>> [  263.990000]  [<6008e000>] ? wq_worker_running+0x10/0x40
>> [  263.990000]  [<6008c730>] ? process_one_work+0x0/0x490
>> [  263.990000]  [<6008cc06>] worker_thread+0x46/0x670
>> [  263.990000]  [<600931c1>] ? __kthread_parkme+0xa1/0xd0
>> [  263.990000]  [<6008cbc0>] ? worker_thread+0x0/0x670
>> [  263.990000]  [<6008cbc0>] ? worker_thread+0x0/0x670
>> [  263.990000]  [<60093bc4>] kthread+0x194/0x1c0
>> [  263.990000]  [<6002a091>] new_thread_handler+0x81/0xc0
>>
>>
>> A.
>>
>>>
>>> A.
>>>
>>>>
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Ritesh
>>>>>>
>>>>>
>>>>
>>>
>>
> 
> To put a long story short we have a reproducible segfault when UML 
> re-queues a block request in 5.x
> 
> This used to work in 4.x
> 
> We found a couple of minor things in UML when looking at it which we 
> will fix shortly, but the root problem seems to be either in block_mq or 
> in the way we are using it to re-queue requests.
> 

My (probably wrong) guess is that it is something related to the
changes from 4.x to 5.x where blk-mq no longer gets the hw context out
of the cpu map.

For some reason it hits a NULL dereference at some point.

I may be wrong, of course.

I will not be able to pick this up again before Tuesday, so if someone
can have a go before then it will be greatly appreciated.

-- 
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/


* Re: uml segfault during I/O
  2019-10-18 10:51               ` Anton Ivanov
@ 2019-10-28 16:51                 ` Anton Ivanov
  0 siblings, 0 replies; 12+ messages in thread
From: Anton Ivanov @ 2019-10-28 16:51 UTC (permalink / raw)
  To: rrs, linux-um; +Cc: axboe, hch


On 18/10/2019 11:51, Anton Ivanov wrote:
>
>
> On 18/10/2019 10:55, Anton Ivanov wrote:
>> Adding Jens and Christoph
>>
>> On 18/10/2019 08:35, Anton Ivanov wrote:
>>>
>>>
>>> On 17/10/2019 17:29, Anton Ivanov wrote:
>>>>
>>>>
>>>> On 17/10/2019 16:03, Anton Ivanov wrote:
>>>>>
>>>>>
>>>>> On 17/10/2019 14:33, Anton Ivanov wrote:
>>>>>>
>>>>>>
>>>>>> On 17/10/2019 14:30, Ritesh Raj Sarraf wrote:
>>>>>>> On Thu, 2019-10-17 at 14:02 +0100, Anton Ivanov wrote:
>>>>>>>> Looking into that. I have not run into anything like that, but 
>>>>>>>> I have
>>>>>>>> not used any of the legacy networking for 5 odd years now.
>>>>>>>>
>>>>>>>
>>>>>>> Do you think this is related to networking ? I ask because there 
>>>>>>> was no
>>>>>>> network activity going on.
>>>>>>
>>>>>> no, it's disk somewhere. I managed to reproduce it with 5.2 stock 
>>>>>> on Debian 5.2 host.
>>>>>>
>>>>>>>
>>>>>>> apt is just an example. The packages were all already downloaded 
>>>>>>> and
>>>>>>> there was no network activity. Rather, there was block I/O.
>>>>>>
>>>>>> I am looking into that.
>>>>>>
>>>>>>>
>>>>>>> One thing I noticed, which may or may not be useful to this 
>>>>>>> report. I
>>>>>>> was booting the uml guest and immediately logging into it and 
>>>>>>> running
>>>>>>> the block I/O. And the segfault would occur.
>>>>>>>
>>>>>>> If I, instead, booted the uml vm and let it remain idle for a 
>>>>>>> minute or
>>>>>>> so and then did the I/O, it worked fine. So I am not sure if 
>>>>>>> there is
>>>>>>> any lazy initialization happening in the background which gets
>>>>>>> corrupted during immediate hot boot I/O.
>>>>>>
>>>>>> Interesting...
>>>>>>
>>>>>>>
>>>>>>> On the other hand, if you think there can be any number of 
>>>>>>> commands to
>>>>>>> run locally that could generate more information, please assist 
>>>>>>> me so.
>>>>>>
>>>>>> As I said - I managed to reproduce it, I am looking at it. In 
>>>>>> first instance I am trying with a couple of version up/down to 
>>>>>> see if this is 5.2 specific.
>>>>>
>>>>> I cannot even get it to start on 5.4-rc1, 5.3 shows the same 
>>>>> symptoms.
>>>>
>>>> This is something outside the UBD driver as such. There were no 
>>>> notable changes to it since we ported it to block-mq and added 
>>>> DISCARD which was quite a while back.
>>>>
>>>> I am going to check the other usual suspects such as IRQs, but that 
>>>> is something I test quite extensively when working on the 
>>>> networking side.
>>>>
>>>> So I suspect that this is something outside UML which is showing 
>>>> only in UML for some reason.
>>>
>>> Still happening with 5.2.21, albeit a bit more difficult to reproduce.
>>>
>>> Looking at the backtraces it is ALWAYS a result of a re-queue.
>>>
>>> [  263.990000]  [<60440633>] blk_mq_dispatch_rq_list+0xf3/0x5f0
>>> [  263.990000]  [<60440501>] ? blk_mq_get_driver_tag+0xc1/0x100
>>> [  263.990000]  [<60440540>] ? blk_mq_dispatch_rq_list+0x0/0x5f0
>>> [  263.990000]  [<60445966>] blk_mq_do_dispatch_sched+0x66/0xe0
>>> [  263.990000]  [<60446107>] blk_mq_sched_dispatch_requests+0x107/0x190
>>> [  263.990000]  [<6043f130>] ? blk_mq_run_hw_queue+0x0/0x120
>>> [  263.990000]  [<6043efb4>] __blk_mq_run_hw_queue+0x74/0xd0
>>> [  263.990000]  [<6043f0d4>] __blk_mq_delay_run_hw_queue+0xc4/0xd0
>>> [  263.990000]  [<6043f1d9>] blk_mq_run_hw_queue+0xa9/0x120
>>> [  263.990000]  [<6043f292>] blk_mq_run_hw_queues+0x42/0x60
>>> [  263.990000]  [<60440bc0>] ? blk_mq_request_bypass_insert+0x0/0x90
>>> [  263.990000]  [<604462a0>] ? blk_mq_sched_insert_request+0x0/0x1c0
>>> ALWAYS >>[  263.990000]  [<604414b0>] blk_mq_requeue_work+0x160/0x170
>>> [  263.990000]  [<6008c91b>] process_one_work+0x1eb/0x490
>>> [  263.990000]  [<607f5aa0>] ? __schedule+0x0/0x620
>>> [  263.990000]  [<6008e000>] ? wq_worker_running+0x10/0x40
>>> [  263.990000]  [<6008c730>] ? process_one_work+0x0/0x490
>>> [  263.990000]  [<6008cc06>] worker_thread+0x46/0x670
>>> [  263.990000]  [<600931c1>] ? __kthread_parkme+0xa1/0xd0
>>> [  263.990000]  [<6008cbc0>] ? worker_thread+0x0/0x670
>>> [  263.990000]  [<6008cbc0>] ? worker_thread+0x0/0x670
>>> [  263.990000]  [<60093bc4>] kthread+0x194/0x1c0
>>> [  263.990000]  [<6002a091>] new_thread_handler+0x81/0xc0
>>>
>>>
>>> A.
>>>
>>>>
>>>> A.
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ritesh
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>> To put a long story short we have a reproducible segfault when UML 
>> re-queues a block request in 5.x
>>
>> This used to work in 4.x
>>
>> We found a couple of minor things in UML when looking at it which we 
>> will fix shortly, but the root problem seems to be either in block_mq 
>> or in the way we are using it to re-queue requests.
>>
>
> My (probably wrong) guess is that it is something related to the 
> changes from 4.x to 5.x where blk-mq is no longer getting the hw 
> context out of the cpu map.
>
> For some reason it gets a null deref at some point.
>
> I may be wrong of course.
>
> I will not be able to pick this up again before Tuesday so if someone 
> can have a go before then it will be greatly appreciated.
>
Unless I am mistaken, it should not re-queue at all.

Xen does not:
https://elixir.bootlin.com/linux/latest/source/drivers/block/xen-blkfront.c#L910

If you return a form of BUSY, the upper layers will re-queue for you.

That explains the crash: we end up with the same bio enqueued multiple
times when the device stalls.

I am testing a patch, and if it all tests out I will submit it shortly.
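
In outline, the direction is to drop the driver-side
blk_mq_requeue_request() call and report busy to the core instead. A
hypothetical sketch of the idea (example_submit() is an assumed helper
standing in for the write to the io thread; this is not the actual
patch):

#include <linux/blk-mq.h>

/* Hypothetical sketch: let the blk-mq core own the retry. On a
 * transient -EAGAIN from the io thread, report a device resource
 * shortage; the core puts the one request back on its dispatch list
 * instead of the driver inserting a second copy behind its back. */
static blk_status_t example_queue_rq(struct blk_mq_hw_ctx *hctx,
                                     const struct blk_mq_queue_data *bd)
{
        struct request *req = bd->rq;
        int err;

        blk_mq_start_request(req);
        err = example_submit(req);              /* assumed helper */
        if (err == -EAGAIN)
                return BLK_STS_DEV_RESOURCE;    /* core retries for us */
        if (err < 0)
                return BLK_STS_IOERR;
        return BLK_STS_OK;
}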

-- 

Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/



Thread overview: 12+ messages
2019-10-17 11:37 uml segfault during I/O Ritesh Raj Sarraf
     [not found] ` <1ccf27d8-6b6a-7d08-acef-93077f07511b@cambridgegreys.com>
2019-10-17 13:30   ` Ritesh Raj Sarraf
2019-10-17 13:33     ` Anton Ivanov
2019-10-17 15:03       ` Anton Ivanov
2019-10-17 16:29         ` Anton Ivanov
2019-10-18  7:35           ` Anton Ivanov
2019-10-18  8:57             ` Johannes Berg
2019-10-18  9:13               ` Anton Ivanov
2019-10-18  9:49                 ` Anton Ivanov
2019-10-18  9:55             ` Anton Ivanov
2019-10-18 10:51               ` Anton Ivanov
2019-10-28 16:51                 ` Anton Ivanov
