* uml segfault during I/O
From: Ritesh Raj Sarraf @ 2019-10-17 11:37 UTC (permalink / raw)
To: linux-um; +Cc: Anton Ivanov
This happens on the 5.2 Linux kernel with Debian patches on top.
The config details are:
Linux Host: 5.2 (Debian)
Linux UML: 5.2 (Debian)
This tends to happen very easily when doing a good amount of I/O, in
this case an `apt upgrade`:
[ OK ] Started Update UTMP about System Runlevel Changes.
NET: Registered protocol family 10
Segment Routing with IPv6
random: crng init done
Modules linked in: ipv6 crc_ccitt
Pid: 870, comm: kworker/0:1H Not tainted 5.2.17
RIP: 0033:[<00000000607e9c03>]
RSP: 000000009d40fde0 EFLAGS: 00010297
RAX: 0000000000000000 RBX: 0000000000004801 RCX: 000000009d40fe30
RDX: 0000000000000001 RSI: 000000009dbc35c0 RDI: 0000000000000000
RBP: 000000009d40fe30 R08: 0000000c00000204 R09: 8080808080808080
R10: fefefefefefefeff R11: 0000000000000246 R12: 0000000000000000
R13: 000000009dbc35c0 R14: 000000009dbc3608 R15: 000000009dbb9800
Kernel panic - not syncing: Segfault with no mm
CPU: 0 PID: 870 Comm: kworker/0:1H Not tainted 5.2.17 #2
Workqueue: kblockd blk_mq_requeue_work
Stack:
00000000 100000000000001 607d4425 00000001
9dbc35c0 00000000 00000000 00000000
9dbb9800 607d44e9 9d40fe30 9d40fe30
Call Trace:
[<607d4425>] ? blk_mq_sched_insert_request+0x0/0xff
[<607d44e9>] ? blk_mq_sched_insert_request+0xc4/0xff
[<607d4425>] ? blk_mq_sched_insert_request+0x0/0xff
[<607d07c6>] ? blk_mq_requeue_work+0xd0/0x133
[<60434285>] ? process_one_work+0x1e4/0x34c
[<60431fd0>] ? move_linked_works+0x0/0x4f
[<609791f0>] ? __schedule+0x0/0x41d
[<6043569f>] ? wq_worker_running+0xd/0x2f
[<60431fd0>] ? move_linked_works+0x0/0x4f
[<60435362>] ? worker_thread+0x324/0x45e
[<6043215f>] ? set_pf_worker+0x0/0x5e
[<604167c7>] ? get_signals+0x0/0xa
[<60439485>] ? __kthread_parkme+0x4a/0x94
[<60421a53>] ? do_exit+0x0/0x948
[<6043503e>] ? worker_thread+0x0/0x45e
[<604399aa>] ? kthread+0x198/0x1a0
[<603fea08>] ? new_thread_handler+0x82/0xb8
/home/rrs/bin/uml-debian: line 8: 16743 Aborted (core dumped) linux ubd0=~/rrs-home/Libvirt-Images/uml.img eth0=daemon mem=1024M rw
16:58 ♒ ॐ ☹ 😟=> 134
And the `linux` processes remain behind on the host machine:
rrs 18159 0.0 0.0 1057692 11496 ? Ss 16:57 0:00 linux ubd0=/home/rrs/rrs-home/Libvirt-Images/uml.img eth0=daemon mem=1024M rw
rrs 18187 0.0 0.0 1057704 11496 ? Ss 16:57 0:00 linux ubd0=/home/rrs/rrs-home/Libvirt-Images/uml.img eth0=daemon mem=1024M rw
--
Ritesh Raj Sarraf
RESEARCHUT - http://www.researchut.com
"Necessity is the mother of invention."
_______________________________________________
linux-um mailing list
linux-um@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-um
* Re: uml segfault during I/O
From: Ritesh Raj Sarraf @ 2019-10-17 13:30 UTC (permalink / raw)
To: Anton Ivanov, linux-um
On Thu, 2019-10-17 at 14:02 +0100, Anton Ivanov wrote:
> Looking into that. I have not run into anything like that, but I have
> not used any of the legacy networking for 5 odd years now.
>
Do you think this is related to networking? I ask because there was no
network activity going on.
apt is just an example. The packages were all already downloaded and
there was no network activity. Rather, there was block I/O.
One thing I noticed, which may or may not be relevant to this report: I
was booting the UML guest, immediately logging into it, and running the
block I/O, and the segfault would occur.
If I instead booted the UML VM and let it sit idle for a minute or so
before doing the I/O, it worked fine. So I am not sure whether there is
any lazy initialization happening in the background that gets corrupted
by I/O immediately after boot.
On the other hand, if there are any commands I could run locally to
generate more information, please let me know.
Thanks,
Ritesh
* Re: uml segfault during I/O
From: Anton Ivanov @ 2019-10-17 13:33 UTC (permalink / raw)
To: rrs, linux-um
On 17/10/2019 14:30, Ritesh Raj Sarraf wrote:
> On Thu, 2019-10-17 at 14:02 +0100, Anton Ivanov wrote:
>> Looking into that. I have not run into anything like that, but I have
>> not used any of the legacy networking for 5 odd years now.
>>
>
> Do you think this is related to networking ? I ask because there was no
> network activity going on.
No, it's in the disk path somewhere. I managed to reproduce it with
stock 5.2 on a Debian 5.2 host.
>
> apt is just an example. The packages were all already downloaded and
> there was no network activity. Rather, there was block I/O.
I am looking into that.
>
> One thing I noticed, which may or may not be useful to this report. I
> was booting the uml guest and immediately logging into it and running
> the block I/O. And the segfault would occur.
>
> If I, instead, booted the uml vm and let it remain idle for a minute or
> so and then did the I/O, it worked fine. So I am not sure if there is
> any lazy initialization happening in the background which gets
> corrupted during immediate hot boot I/O.
Interesting...
>
> On the other hand, if you think there can be any number of commands to
> run locally that could generate more information, please assist me so.
As I said, I managed to reproduce it and I am looking at it. As a first
step I am trying a couple of versions up and down to see whether this
is 5.2-specific.
>
> Thanks,
> Ritesh
--
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/
* Re: uml segfault during I/O
From: Anton Ivanov @ 2019-10-17 15:03 UTC (permalink / raw)
To: rrs, linux-um
On 17/10/2019 14:33, Anton Ivanov wrote:
> As I said - I managed to reproduce it, I am looking at it. In first
> instance I am trying with a couple of version up/down to see if this is
> 5.2 specific.
I cannot even get it to start on 5.4-rc1; 5.3 shows the same symptoms.
* Re: uml segfault during I/O
From: Anton Ivanov @ 2019-10-17 16:29 UTC (permalink / raw)
To: rrs, linux-um
On 17/10/2019 16:03, Anton Ivanov wrote:
> I cannot even get it to start on 5.4-rc1, 5.3 shows the same symptoms.
This is something outside the UBD driver as such. There have been no
notable changes to it since we ported it to blk-mq and added DISCARD,
which was quite a while back.
I am going to check the other usual suspects such as IRQs, but that is
something I test quite extensively when working on the networking side.
So I suspect that this is something outside UML which is showing up only
in UML for some reason.
A.
* Re: uml segfault during I/O
From: Anton Ivanov @ 2019-10-18 7:35 UTC (permalink / raw)
To: rrs, linux-um
On 17/10/2019 17:29, Anton Ivanov wrote:
> This is something outside the UBD driver as such. There were no notable
> changes to it since we ported it to block-mq and added DISCARD which was
> quite a while back.
>
> I am going to check the other usual suspects such as IRQs, but that is
> something I test quite extensively when working on the networking side.
>
> So I suspect that this is something outside UML which is showing only in
> UML for some reason.
Still happening with 5.2.21, albeit a bit more difficult to reproduce.
Looking at the backtraces, it is ALWAYS the result of a requeue.
[ 263.990000] [<60440633>] blk_mq_dispatch_rq_list+0xf3/0x5f0
[ 263.990000] [<60440501>] ? blk_mq_get_driver_tag+0xc1/0x100
[ 263.990000] [<60440540>] ? blk_mq_dispatch_rq_list+0x0/0x5f0
[ 263.990000] [<60445966>] blk_mq_do_dispatch_sched+0x66/0xe0
[ 263.990000] [<60446107>] blk_mq_sched_dispatch_requests+0x107/0x190
[ 263.990000] [<6043f130>] ? blk_mq_run_hw_queue+0x0/0x120
[ 263.990000] [<6043efb4>] __blk_mq_run_hw_queue+0x74/0xd0
[ 263.990000] [<6043f0d4>] __blk_mq_delay_run_hw_queue+0xc4/0xd0
[ 263.990000] [<6043f1d9>] blk_mq_run_hw_queue+0xa9/0x120
[ 263.990000] [<6043f292>] blk_mq_run_hw_queues+0x42/0x60
[ 263.990000] [<60440bc0>] ? blk_mq_request_bypass_insert+0x0/0x90
[ 263.990000] [<604462a0>] ? blk_mq_sched_insert_request+0x0/0x1c0
ALWAYS >>[ 263.990000] [<604414b0>] blk_mq_requeue_work+0x160/0x170
[ 263.990000] [<6008c91b>] process_one_work+0x1eb/0x490
[ 263.990000] [<607f5aa0>] ? __schedule+0x0/0x620
[ 263.990000] [<6008e000>] ? wq_worker_running+0x10/0x40
[ 263.990000] [<6008c730>] ? process_one_work+0x0/0x490
[ 263.990000] [<6008cc06>] worker_thread+0x46/0x670
[ 263.990000] [<600931c1>] ? __kthread_parkme+0xa1/0xd0
[ 263.990000] [<6008cbc0>] ? worker_thread+0x0/0x670
[ 263.990000] [<6008cbc0>] ? worker_thread+0x0/0x670
[ 263.990000] [<60093bc4>] kthread+0x194/0x1c0
[ 263.990000] [<6002a091>] new_thread_handler+0x81/0xc0
A.
* Re: uml segfault during I/O
From: Johannes Berg @ 2019-10-18 8:57 UTC (permalink / raw)
To: Anton Ivanov, rrs, linux-um
On Fri, 2019-10-18 at 08:35 +0100, Anton Ivanov wrote:
>
> Still happening with 5.2.21, albeit a bit more difficult to reproduce.
Just randomly reviewing the code, isn't there a bug in io_thread()?
--- a/arch/um/drivers/ubd_kern.c
+++ b/arch/um/drivers/ubd_kern.c
@@ -1602,7 +1602,8 @@ int io_thread(void *arg)
written = 0;
do {
- res = os_write_file(kernel_fd, ((char *) io_req_buffer) + written, n);
+ res = os_write_file(kernel_fd, ((char *) io_req_buffer) + written,
+ n - written);
if (res >= 0) {
written += res;
}
johannes
* Re: uml segfault during I/O
From: Anton Ivanov @ 2019-10-18 9:13 UTC (permalink / raw)
To: Johannes Berg, rrs, linux-um
On 18/10/2019 09:57, Johannes Berg wrote:
> On Fri, 2019-10-18 at 08:35 +0100, Anton Ivanov wrote:
>>
>> Still happening with 5.2.21, albeit a bit more difficult to reproduce.
>
> Just randomly reviewing the code, isn't there a bug in io_thread()?
Indeed. Well spotted. It is good that a short write to a block device is
so rare :)
I have verified this.
It is always in a requeue, and always after the io thread has returned
EAGAIN on an attempt to submit a request.
That part of the code was tested very heavily before, and it works
fine in 4.19.
* Re: uml segfault during I/O
From: Anton Ivanov @ 2019-10-18 9:49 UTC (permalink / raw)
To: Johannes Berg, rrs, linux-um
On 18/10/2019 10:13, Anton Ivanov wrote:
>
>
> On 18/10/2019 09:57, Johannes Berg wrote:
>> On Fri, 2019-10-18 at 08:35 +0100, Anton Ivanov wrote:
>>>
>>> Still happening with 5.2.21, albeit a bit more difficult to reproduce.
>>
>> Just randomly reviewing the code, isn't there a bug in io_thread()?
>
> Indeed. Well spotted. It is good that a short write to a block device is
> so rare :)
> I have verified this.
>
> It is always in a requeue and always after the io thread has returned
> EAGAIN on the attempt to submit a request.
>
> That part of the code has been tested very heavily before and it works
> fine in 4.19
>
Applied, and it is not it.
This is something in the blk-mq code, and we should forward the whole
thread to the blk-mq maintainers.
A
* Re: uml segfault during I/O
From: Anton Ivanov @ 2019-10-18 9:55 UTC (permalink / raw)
To: rrs, linux-um; +Cc: axboe, hch
Adding Jens and Christoph
To cut a long story short: we have a reproducible segfault when UML
requeues a block request in 5.x.
This used to work in 4.x.
We found a couple of minor things in UML while looking at it, which we
will fix shortly, but the root problem seems to be either in blk-mq or
in the way we are using it to requeue requests.
* Re: uml segfault during I/O
From: Anton Ivanov @ 2019-10-18 10:51 UTC (permalink / raw)
To: rrs, linux-um; +Cc: axboe, hch
On 18/10/2019 10:55, Anton Ivanov wrote:
> To put a long story short we have a reproducible segfault when UML
> re-queues a block request in 5.x
>
> This used to work in 4.x
>
> We found a couple of minor things in UML when looking at it which we
> will fix shortly, but the root problem seems to be either in block_mq or
> in the way we are using it to re-queue requests.
>
My (probably wrong) guess is that it is related to the changes between
4.x and 5.x where blk-mq no longer gets the hw context out of the CPU
map.
For some reason it hits a NULL dereference at some point.
I may be wrong, of course.
I will not be able to pick this up again before Tuesday, so if someone
can have a go at it before then, it will be greatly appreciated.
* Re: uml segfault during I/O
From: Anton Ivanov @ 2019-10-28 16:51 UTC (permalink / raw)
To: rrs, linux-um; +Cc: axboe, hch
On 18/10/2019 11:51, Anton Ivanov wrote:
>
>
> On 18/10/2019 10:55, Anton Ivanov wrote:
>> Adding Jens and Christoph
>>
>> On 18/10/2019 08:35, Anton Ivanov wrote:
>>>
>>>
>>> On 17/10/2019 17:29, Anton Ivanov wrote:
>>>>
>>>>
>>>> On 17/10/2019 16:03, Anton Ivanov wrote:
>>>>>
>>>>>
>>>>> On 17/10/2019 14:33, Anton Ivanov wrote:
>>>>>>
>>>>>>
>>>>>> On 17/10/2019 14:30, Ritesh Raj Sarraf wrote:
>>>>>>> On Thu, 2019-10-17 at 14:02 +0100, Anton Ivanov wrote:
>>>>>>>> Looking into that. I have not run into anything like that, but
>>>>>>>> I have
>>>>>>>> not used any of the legacy networking for 5 odd years now.
>>>>>>>>
>>>>>>>
>>>>>>> Do you think this is related to networking ? I ask because there
>>>>>>> was no
>>>>>>> network activity going on.
>>>>>>
>>>>>> no, it's disk somewhere. I managed to reproduce it with 5.2 stock
>>>>>> on Debian 5.2 host.
>>>>>>
>>>>>>>
>>>>>>> apt is just an example. The packages were all already downloaded
>>>>>>> and
>>>>>>> there was no network activity. Rather, there was block I/O.
>>>>>>
>>>>>> I am looking into that.
>>>>>>
>>>>>>>
>>>>>>> One thing I noticed, which may or may not be useful to this
>>>>>>> report. I
>>>>>>> was booting the uml guest and immediately logging into it and
>>>>>>> running
>>>>>>> the block I/O. And the segfault would occur.
>>>>>>>
>>>>>>> If I, instead, booted the uml vm and let it remain idle for a
>>>>>>> minute or
>>>>>>> so and then did the I/O, it worked fine. So I am not sure if
>>>>>>> there is
>>>>>>> any lazy initialization happening in the background which gets
>>>>>>> corrupted during immediate hot boot I/O.
>>>>>>
>>>>>> Interesting...
>>>>>>
>>>>>>>
>>>>>>> On the other hand, if you think there can be any number of
>>>>>>> commands to
>>>>>>> run locally that could generate more information, please assist
>>>>>>> me so.
>>>>>>
>>>>>> As I said - I managed to reproduce it, I am looking at it. In
>>>>>> first instance I am trying with a couple of version up/down to
>>>>>> see if this is 5.2 specific.
>>>>>
>>>>> I cannot even get it to start on 5.4-rc1; 5.3 shows the same
>>>>> symptoms.
>>>>
>>>> This is something outside the UBD driver as such. There were no
>>>> notable changes to it since we ported it to block-mq and added
>>>> DISCARD which was quite a while back.
>>>>
>>>> I am going to check the other usual suspects such as IRQs, but that
>>>> is something I test quite extensively when working on the
>>>> networking side.
>>>>
>>>> So I suspect that this is something outside UML which is showing
>>>> only in UML for some reason.
>>>
>>> Still happening with 5.2.21, albeit a bit more difficult to reproduce.
>>>
>>> Looking at the backtraces it is ALWAYS a result of a re-queue.
>>>
>>> [ 263.990000] [<60440633>] blk_mq_dispatch_rq_list+0xf3/0x5f0
>>> [ 263.990000] [<60440501>] ? blk_mq_get_driver_tag+0xc1/0x100
>>> [ 263.990000] [<60440540>] ? blk_mq_dispatch_rq_list+0x0/0x5f0
>>> [ 263.990000] [<60445966>] blk_mq_do_dispatch_sched+0x66/0xe0
>>> [ 263.990000] [<60446107>] blk_mq_sched_dispatch_requests+0x107/0x190
>>> [ 263.990000] [<6043f130>] ? blk_mq_run_hw_queue+0x0/0x120
>>> [ 263.990000] [<6043efb4>] __blk_mq_run_hw_queue+0x74/0xd0
>>> [ 263.990000] [<6043f0d4>] __blk_mq_delay_run_hw_queue+0xc4/0xd0
>>> [ 263.990000] [<6043f1d9>] blk_mq_run_hw_queue+0xa9/0x120
>>> [ 263.990000] [<6043f292>] blk_mq_run_hw_queues+0x42/0x60
>>> [ 263.990000] [<60440bc0>] ? blk_mq_request_bypass_insert+0x0/0x90
>>> [ 263.990000] [<604462a0>] ? blk_mq_sched_insert_request+0x0/0x1c0
>>> ALWAYS >>[ 263.990000] [<604414b0>] blk_mq_requeue_work+0x160/0x170
>>> [ 263.990000] [<6008c91b>] process_one_work+0x1eb/0x490
>>> [ 263.990000] [<607f5aa0>] ? __schedule+0x0/0x620
>>> [ 263.990000] [<6008e000>] ? wq_worker_running+0x10/0x40
>>> [ 263.990000] [<6008c730>] ? process_one_work+0x0/0x490
>>> [ 263.990000] [<6008cc06>] worker_thread+0x46/0x670
>>> [ 263.990000] [<600931c1>] ? __kthread_parkme+0xa1/0xd0
>>> [ 263.990000] [<6008cbc0>] ? worker_thread+0x0/0x670
>>> [ 263.990000] [<6008cbc0>] ? worker_thread+0x0/0x670
>>> [ 263.990000] [<60093bc4>] kthread+0x194/0x1c0
>>> [ 263.990000] [<6002a091>] new_thread_handler+0x81/0xc0
>>>
>>>
>>> A.
>>>
>>>>
>>>> A.
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ritesh
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>> To put a long story short, we have a reproducible segfault when UML
>> re-queues a block request in 5.x.
>>
>> This used to work in 4.x.
>>
>> We found a couple of minor things in UML when looking at it which we
>> will fix shortly, but the root problem seems to be either in blk-mq
>> or in the way we are using it to re-queue requests.
>>
>
> My (probably wrong) guess is that it is something related to the
> changes from 4.x to 5.x where blk-mq is no longer getting the hw
> context out of the cpu map.
>
> For some reason it gets a null deref at some point.
>
> I may be wrong of course.
>
> I will not be able to pick this up again before Tuesday so if someone
> can have a go before then it will be greatly appreciated.
>
Unless I am mistaken, it should not re-queue.
Xen does not:
https://elixir.bootlin.com/linux/latest/source/drivers/block/xen-blkfront.c#L910
If you return a form of BUSY, the upper layers will re-queue for you.
That explains the crash: we have the same bio enqueued multiple times
when the device stalls.
I am testing a patch and if it all tests out I will submit it shortly.
--
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/
end of thread, other threads:[~2019-10-28 17:21 UTC | newest]
Thread overview: 12+ messages
2019-10-17 11:37 uml segfault during I/O Ritesh Raj Sarraf
[not found] ` <1ccf27d8-6b6a-7d08-acef-93077f07511b@cambridgegreys.com>
2019-10-17 13:30 ` Ritesh Raj Sarraf
2019-10-17 13:33 ` Anton Ivanov
2019-10-17 15:03 ` Anton Ivanov
2019-10-17 16:29 ` Anton Ivanov
2019-10-18 7:35 ` Anton Ivanov
2019-10-18 8:57 ` Johannes Berg
2019-10-18 9:13 ` Anton Ivanov
2019-10-18 9:49 ` Anton Ivanov
2019-10-18 9:55 ` Anton Ivanov
2019-10-18 10:51 ` Anton Ivanov
2019-10-28 16:51 ` Anton Ivanov