io-uring.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Fixed buffer have out-dated content
@ 2021-01-08 23:39 Martin Raiber
  2021-01-09 16:23 ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Martin Raiber @ 2021-01-08 23:39 UTC (permalink / raw)
  To: io-uring

Hi,

I have a gnarly issue with io_uring and fixed buffers (fixed 
read/write). It seems the contents of those buffers contain old data in 
some rare cases under memory pressure after a read/during a write.

Specifically I use io_uring with fuse and to confirm this is not some 
user space issue let fuse print the unique id it adds to each request. 
Fuse adds this request data to a pipe, and when the pipe buffer is later 
copied to the io_uring fixed buffer it has the id of a fuse request 
returned earlier using the same buffer while returning the size of the 
new request. Or I set the unique id in the buffer, write it to fuse (via 
writing to a pipe, then splicing) and then fuse returns with e.g. 
ENOENT, because the unique id is not correct because in kernel it reads 
the id of the previous, already completed, request using this buffer.

To make reproducing this faster running memtester (which mlocks a 
configurable amount of memory) with a large amount of user memory every 
30s helps. So it has something to do with swapping? It seems to not 
occur if no swap space is active. Problem occurs without warning when 
the kernel is build with KASAN and slab debugging.

If I don't use the _FIXED opcodes (which is easy to do), the problem 
does not occur.

Problem occurs with 5.9.16 and 5.10.5.

Regards,
Martin Raiber


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Fixed buffer have out-dated content
  2021-01-08 23:39 Fixed buffer have out-dated content Martin Raiber
@ 2021-01-09 16:23 ` Jens Axboe
  2021-01-09 16:58   ` Martin Raiber
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2021-01-09 16:23 UTC (permalink / raw)
  To: Martin Raiber, io-uring

On 1/8/21 4:39 PM, Martin Raiber wrote:
> Hi,
> 
> I have a gnarly issue with io_uring and fixed buffers (fixed 
> read/write). It seems the contents of those buffers contain old data in 
> some rare cases under memory pressure after a read/during a write.
> 
> Specifically I use io_uring with fuse and to confirm this is not some 
> user space issue let fuse print the unique id it adds to each request. 
> Fuse adds this request data to a pipe, and when the pipe buffer is later 
> copied to the io_uring fixed buffer it has the id of a fuse request 
> returned earlier using the same buffer while returning the size of the 
> new request. Or I set the unique id in the buffer, write it to fuse (via 
> writing to a pipe, then splicing) and then fuse returns with e.g. 
> ENOENT, because the unique id is not correct because in kernel it reads 
> the id of the previous, already completed, request using this buffer.
> 
> To make reproducing this faster running memtester (which mlocks a 
> configurable amount of memory) with a large amount of user memory every 
> 30s helps. So it has something to do with swapping? It seems to not 
> occur if no swap space is active. Problem occurs without warning when 
> the kernel is build with KASAN and slab debugging.
> 
> If I don't use the _FIXED opcodes (which is easy to do), the problem 
> does not occur.
> 
> Problem occurs with 5.9.16 and 5.10.5.

Can you mention more about what kind of IO you are doing, I'm assuming
it's O_DIRECT? I'll see if I can reproduce this.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Fixed buffer have out-dated content
  2021-01-09 16:23 ` Jens Axboe
@ 2021-01-09 16:58   ` Martin Raiber
  2021-01-09 20:32     ` Pavel Begunkov
  0 siblings, 1 reply; 13+ messages in thread
From: Martin Raiber @ 2021-01-09 16:58 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 09.01.2021 17:23 Jens Axboe wrote:
> On 1/8/21 4:39 PM, Martin Raiber wrote:
>> Hi,
>>
>> I have a gnarly issue with io_uring and fixed buffers (fixed
>> read/write). It seems the contents of those buffers contain old data in
>> some rare cases under memory pressure after a read/during a write.
>>
>> Specifically I use io_uring with fuse and to confirm this is not some
>> user space issue let fuse print the unique id it adds to each request.
>> Fuse adds this request data to a pipe, and when the pipe buffer is later
>> copied to the io_uring fixed buffer it has the id of a fuse request
>> returned earlier using the same buffer while returning the size of the
>> new request. Or I set the unique id in the buffer, write it to fuse (via
>> writing to a pipe, then splicing) and then fuse returns with e.g.
>> ENOENT, because the unique id is not correct because in kernel it reads
>> the id of the previous, already completed, request using this buffer.
>>
>> To make reproducing this faster running memtester (which mlocks a
>> configurable amount of memory) with a large amount of user memory every
>> 30s helps. So it has something to do with swapping? It seems to not
>> occur if no swap space is active. Problem occurs without warning when
>> the kernel is build with KASAN and slab debugging.
>>
>> If I don't use the _FIXED opcodes (which is easy to do), the problem
>> does not occur.
>>
>> Problem occurs with 5.9.16 and 5.10.5.
> Can you mention more about what kind of IO you are doing, I'm assuming
> it's O_DIRECT? I'll see if I can reproduce this.

It's writing to/reading from pipes (nonblocking, no O_DIRECT).

I can reproduce it with https://github.com/uroni/fuseuring on e.g. a 2GB 
VPS. Modify bench.sh so that fio loops. Add swap, then run 1400M 
memtester while it runs (so it swaps, I guess). I can try further 
reducing the reproducer, but I wanted to avoid that work in case it is 
something obvious. The next step would be to remove fuse from the 
equation -- it does try to move the pages from the pipe when splicing to 
it, for example.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Fixed buffer have out-dated content
  2021-01-09 16:58   ` Martin Raiber
@ 2021-01-09 20:32     ` Pavel Begunkov
  2021-01-10 16:50       ` Martin Raiber
  0 siblings, 1 reply; 13+ messages in thread
From: Pavel Begunkov @ 2021-01-09 20:32 UTC (permalink / raw)
  To: Martin Raiber, Jens Axboe, io-uring

On 09/01/2021 16:58, Martin Raiber wrote:
> On 09.01.2021 17:23 Jens Axboe wrote:
>> On 1/8/21 4:39 PM, Martin Raiber wrote:
>>> Hi,
>>>
>>> I have a gnarly issue with io_uring and fixed buffers (fixed
>>> read/write). It seems the contents of those buffers contain old data in
>>> some rare cases under memory pressure after a read/during a write.
>>>
>>> Specifically I use io_uring with fuse and to confirm this is not some
>>> user space issue let fuse print the unique id it adds to each request.
>>> Fuse adds this request data to a pipe, and when the pipe buffer is later
>>> copied to the io_uring fixed buffer it has the id of a fuse request
>>> returned earlier using the same buffer while returning the size of the
>>> new request. Or I set the unique id in the buffer, write it to fuse (via
>>> writing to a pipe, then splicing) and then fuse returns with e.g.
>>> ENOENT, because the unique id is not correct because in kernel it reads
>>> the id of the previous, already completed, request using this buffer.
>>>
>>> To make reproducing this faster running memtester (which mlocks a
>>> configurable amount of memory) with a large amount of user memory every
>>> 30s helps. So it has something to do with swapping? It seems to not
>>> occur if no swap space is active. Problem occurs without warning when
>>> the kernel is build with KASAN and slab debugging.
>>>
>>> If I don't use the _FIXED opcodes (which is easy to do), the problem
>>> does not occur.
>>>
>>> Problem occurs with 5.9.16 and 5.10.5.
>> Can you mention more about what kind of IO you are doing, I'm assuming
>> it's O_DIRECT? I'll see if I can reproduce this.
> 
> It's writing to/reading from pipes (nonblocking, no O_DIRECT).

A blind guess, does it handle short reads and writes? If not, can you
check whether they happen or not?

> 
> I can reproduce it with https://github.com/uroni/fuseuring on e.g. a 2GB VPS. Modify bench.sh so that fio loops. Add swap, then run 1400M memtester while it runs (so it swaps, I guess). I can try further reducing the reproducer, but I wanted to avoid that work in case it is something obvious. The next step would be to remove fuse from the equation -- it does try to move the pages from the pipe when splicing to it, for example.


-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Fixed buffer have out-dated content
  2021-01-09 20:32     ` Pavel Begunkov
@ 2021-01-10 16:50       ` Martin Raiber
  2021-01-14 21:50         ` Fixed buffers " Martin Raiber
  0 siblings, 1 reply; 13+ messages in thread
From: Martin Raiber @ 2021-01-10 16:50 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe, io-uring

On 09.01.2021 21:32 Pavel Begunkov wrote:
> On 09/01/2021 16:58, Martin Raiber wrote:
>> On 09.01.2021 17:23 Jens Axboe wrote:
>>> On 1/8/21 4:39 PM, Martin Raiber wrote:
>>>> Hi,
>>>>
>>>> I have a gnarly issue with io_uring and fixed buffers (fixed
>>>> read/write). It seems the contents of those buffers contain old data in
>>>> some rare cases under memory pressure after a read/during a write.
>>>>
>>>> Specifically I use io_uring with fuse and to confirm this is not some
>>>> user space issue let fuse print the unique id it adds to each request.
>>>> Fuse adds this request data to a pipe, and when the pipe buffer is later
>>>> copied to the io_uring fixed buffer it has the id of a fuse request
>>>> returned earlier using the same buffer while returning the size of the
>>>> new request. Or I set the unique id in the buffer, write it to fuse (via
>>>> writing to a pipe, then splicing) and then fuse returns with e.g.
>>>> ENOENT, because the unique id is not correct because in kernel it reads
>>>> the id of the previous, already completed, request using this buffer.
>>>>
>>>> To make reproducing this faster running memtester (which mlocks a
>>>> configurable amount of memory) with a large amount of user memory every
>>>> 30s helps. So it has something to do with swapping? It seems to not
>>>> occur if no swap space is active. Problem occurs without warning when
>>>> the kernel is build with KASAN and slab debugging.
>>>>
>>>> If I don't use the _FIXED opcodes (which is easy to do), the problem
>>>> does not occur.
>>>>
>>>> Problem occurs with 5.9.16 and 5.10.5.
>>> Can you mention more about what kind of IO you are doing, I'm assuming
>>> it's O_DIRECT? I'll see if I can reproduce this.
>> It's writing to/reading from pipes (nonblocking, no O_DIRECT).
> A blind guess, does it handle short reads and writes? If not, can you
> check whether they happen or not?

Something like this was what I suspected at first as well. It does check 
for short read/writes and I added (unnecessary -- because the fuse 
request structure is 40 bytes and it does io in page sizes) code for 
retrying short reads at some point. I also checked for the pipes to be 
empty before they are used at some point and let the kernel log 
allocation failures (idea was that it was short pipe read/writes because 
of allocation failure or that something doesn't get rewound properly in 
this case). Beyond that three things that make a user space problem 
unlikely:

  - occurs only when using fixed buffers and does not occur when running 
same code without fixed buffer opcodes
  - doesn't occur when there is no memory pressure
  - I added print(k/f) logging that pointed me in this direction as well

>> I can reproduce it with https://github.com/uroni/fuseuring on e.g. a 2GB VPS. Modify bench.sh so that fio loops. Add swap, then run 1400M memtester while it runs (so it swaps, I guess). I can try further reducing the reproducer, but I wanted to avoid that work in case it is something obvious. The next step would be to remove fuse from the equation -- it does try to move the pages from the pipe when splicing to it, for example.
>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Fixed buffers have out-dated content
  2021-01-10 16:50       ` Martin Raiber
@ 2021-01-14 21:50         ` Martin Raiber
  2021-01-16 19:30           ` Pavel Begunkov
  2021-01-16 22:12           ` Jens Axboe
  0 siblings, 2 replies; 13+ messages in thread
From: Martin Raiber @ 2021-01-14 21:50 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe, io-uring

On 10.01.2021 17:50 Martin Raiber wrote:
> On 09.01.2021 21:32 Pavel Begunkov wrote:
>> On 09/01/2021 16:58, Martin Raiber wrote:
>>> On 09.01.2021 17:23 Jens Axboe wrote:
>>>> On 1/8/21 4:39 PM, Martin Raiber wrote:
>>>>> Hi,
>>>>>
>>>>> I have a gnarly issue with io_uring and fixed buffers (fixed
>>>>> read/write). It seems the contents of those buffers contain old 
>>>>> data in
>>>>> some rare cases under memory pressure after a read/during a write.
>>>>>
>>>>> Specifically I use io_uring with fuse and to confirm this is not some
>>>>> user space issue let fuse print the unique id it adds to each 
>>>>> request.
>>>>> Fuse adds this request data to a pipe, and when the pipe buffer is 
>>>>> later
>>>>> copied to the io_uring fixed buffer it has the id of a fuse request
>>>>> returned earlier using the same buffer while returning the size of 
>>>>> the
>>>>> new request. Or I set the unique id in the buffer, write it to 
>>>>> fuse (via
>>>>> writing to a pipe, then splicing) and then fuse returns with e.g.
>>>>> ENOENT, because the unique id is not correct because in kernel it 
>>>>> reads
>>>>> the id of the previous, already completed, request using this buffer.
>>>>>
>>>>> To make reproducing this faster running memtester (which mlocks a
>>>>> configurable amount of memory) with a large amount of user memory 
>>>>> every
>>>>> 30s helps. So it has something to do with swapping? It seems to not
>>>>> occur if no swap space is active. Problem occurs without warning when
>>>>> the kernel is build with KASAN and slab debugging.
>>>>>
>>>>> If I don't use the _FIXED opcodes (which is easy to do), the problem
>>>>> does not occur.
>>>>>
>>>>> Problem occurs with 5.9.16 and 5.10.5.
>>>> Can you mention more about what kind of IO you are doing, I'm assuming
>>>> it's O_DIRECT? I'll see if I can reproduce this.
>>> It's writing to/reading from pipes (nonblocking, no O_DIRECT).
>> A blind guess, does it handle short reads and writes? If not, can you
>> check whether they happen or not?
>
> Something like this was what I suspected at first as well. It does 
> check for short read/writes and I added (unnecessary -- because the 
> fuse request structure is 40 bytes and it does io in page sizes) code 
> for retrying short reads at some point. I also checked for the pipes 
> to be empty before they are used at some point and let the kernel log 
> allocation failures (idea was that it was short pipe read/writes 
> because of allocation failure or that something doesn't get rewound 
> properly in this case). Beyond that three things that make a user 
> space problem unlikely:
>
>  - occurs only when using fixed buffers and does not occur when 
> running same code without fixed buffer opcodes
>  - doesn't occur when there is no memory pressure
>  - I added print(k/f) logging that pointed me in this direction as well
>
>>> I can reproduce it with https://github.com/uroni/fuseuring on e.g. a 
>>> 2GB VPS. Modify bench.sh so that fio loops. Add swap, then run 1400M 
>>> memtester while it runs (so it swaps, I guess). I can try further 
>>> reducing the reproducer, but I wanted to avoid that work in case it 
>>> is something obvious. The next step would be to remove fuse from the 
>>> equation -- it does try to move the pages from the pipe when 
>>> splicing to it, for example.

When I use 5.10.7 with 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 ("mm: 
do_wp_page() simplification"), 1a0cf26323c80e2f1c58fc04f15686de61bfab0c 
("mm/ksm: Remove reuse_ksm_page()") and 
be068f29034fb00530a053d18b8cf140c32b12b3 ("mm: fix misplaced unlock_page 
in do_wp_page()") reverted the issue doesn't seem to occur.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Fixed buffers have out-dated content
  2021-01-14 21:50         ` Fixed buffers " Martin Raiber
@ 2021-01-16 19:30           ` Pavel Begunkov
  2021-01-16 19:39             ` Jens Axboe
  2021-01-16 22:12           ` Jens Axboe
  1 sibling, 1 reply; 13+ messages in thread
From: Pavel Begunkov @ 2021-01-16 19:30 UTC (permalink / raw)
  To: Martin Raiber, Jens Axboe, io-uring

On 14/01/2021 21:50, Martin Raiber wrote:
> On 10.01.2021 17:50 Martin Raiber wrote:
>> On 09.01.2021 21:32 Pavel Begunkov wrote:
>>> On 09/01/2021 16:58, Martin Raiber wrote:
>>>> On 09.01.2021 17:23 Jens Axboe wrote:
>>>>> On 1/8/21 4:39 PM, Martin Raiber wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I have a gnarly issue with io_uring and fixed buffers (fixed
>>>>>> read/write). It seems the contents of those buffers contain old data in
>>>>>> some rare cases under memory pressure after a read/during a write.
>>>>>>
>>>>>> Specifically I use io_uring with fuse and to confirm this is not some
>>>>>> user space issue let fuse print the unique id it adds to each request.
>>>>>> Fuse adds this request data to a pipe, and when the pipe buffer is later
>>>>>> copied to the io_uring fixed buffer it has the id of a fuse request
>>>>>> returned earlier using the same buffer while returning the size of the
>>>>>> new request. Or I set the unique id in the buffer, write it to fuse (via
>>>>>> writing to a pipe, then splicing) and then fuse returns with e.g.
>>>>>> ENOENT, because the unique id is not correct because in kernel it reads
>>>>>> the id of the previous, already completed, request using this buffer.
>>>>>>
>>>>>> To make reproducing this faster running memtester (which mlocks a
>>>>>> configurable amount of memory) with a large amount of user memory every
>>>>>> 30s helps. So it has something to do with swapping? It seems to not
>>>>>> occur if no swap space is active. Problem occurs without warning when
>>>>>> the kernel is build with KASAN and slab debugging.
>>>>>>
>>>>>> If I don't use the _FIXED opcodes (which is easy to do), the problem
>>>>>> does not occur.
>>>>>>
>>>>>> Problem occurs with 5.9.16 and 5.10.5.
>>>>> Can you mention more about what kind of IO you are doing, I'm assuming
>>>>> it's O_DIRECT? I'll see if I can reproduce this.
>>>> It's writing to/reading from pipes (nonblocking, no O_DIRECT).
>>> A blind guess, does it handle short reads and writes? If not, can you
>>> check whether they happen or not?
>>
>> Something like this was what I suspected at first as well. It does check for short read/writes and I added (unnecessary -- because the fuse request structure is 40 bytes and it does io in page sizes) code for retrying short reads at some point. I also checked for the pipes to be empty before they are used at some point and let the kernel log allocation failures (idea was that it was short pipe read/writes because of allocation failure or that something doesn't get rewound properly in this case). Beyond that three things that make a user space problem unlikely:
>>
>>  - occurs only when using fixed buffers and does not occur when running same code without fixed buffer opcodes
>>  - doesn't occur when there is no memory pressure
>>  - I added print(k/f) logging that pointed me in this direction as well
>>
>>>> I can reproduce it with https://github.com/uroni/fuseuring on e.g. a 2GB VPS. Modify bench.sh so that fio loops. Add swap, then run 1400M memtester while it runs (so it swaps, I guess). I can try further reducing the reproducer, but I wanted to avoid that work in case it is something obvious. The next step would be to remove fuse from the equation -- it does try to move the pages from the pipe when splicing to it, for example.
> 
> When I use 5.10.7 with 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 ("mm: do_wp_page() simplification"), 1a0cf26323c80e2f1c58fc04f15686de61bfab0c ("mm/ksm: Remove reuse_ksm_page()") and be068f29034fb00530a053d18b8cf140c32b12b3 ("mm: fix misplaced unlock_page in do_wp_page()") reverted the issue doesn't seem to occur.

Thanks for tracking it down. Was it reported to Linus and Peter?

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Fixed buffers have out-dated content
  2021-01-16 19:30           ` Pavel Begunkov
@ 2021-01-16 19:39             ` Jens Axboe
  0 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2021-01-16 19:39 UTC (permalink / raw)
  To: Pavel Begunkov, Martin Raiber, io-uring

On 1/16/21 12:30 PM, Pavel Begunkov wrote:
> On 14/01/2021 21:50, Martin Raiber wrote:
>> On 10.01.2021 17:50 Martin Raiber wrote:
>>> On 09.01.2021 21:32 Pavel Begunkov wrote:
>>>> On 09/01/2021 16:58, Martin Raiber wrote:
>>>>> On 09.01.2021 17:23 Jens Axboe wrote:
>>>>>> On 1/8/21 4:39 PM, Martin Raiber wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have a gnarly issue with io_uring and fixed buffers (fixed
>>>>>>> read/write). It seems the contents of those buffers contain old data in
>>>>>>> some rare cases under memory pressure after a read/during a write.
>>>>>>>
>>>>>>> Specifically I use io_uring with fuse and to confirm this is not some
>>>>>>> user space issue let fuse print the unique id it adds to each request.
>>>>>>> Fuse adds this request data to a pipe, and when the pipe buffer is later
>>>>>>> copied to the io_uring fixed buffer it has the id of a fuse request
>>>>>>> returned earlier using the same buffer while returning the size of the
>>>>>>> new request. Or I set the unique id in the buffer, write it to fuse (via
>>>>>>> writing to a pipe, then splicing) and then fuse returns with e.g.
>>>>>>> ENOENT, because the unique id is not correct because in kernel it reads
>>>>>>> the id of the previous, already completed, request using this buffer.
>>>>>>>
>>>>>>> To make reproducing this faster running memtester (which mlocks a
>>>>>>> configurable amount of memory) with a large amount of user memory every
>>>>>>> 30s helps. So it has something to do with swapping? It seems to not
>>>>>>> occur if no swap space is active. Problem occurs without warning when
>>>>>>> the kernel is build with KASAN and slab debugging.
>>>>>>>
>>>>>>> If I don't use the _FIXED opcodes (which is easy to do), the problem
>>>>>>> does not occur.
>>>>>>>
>>>>>>> Problem occurs with 5.9.16 and 5.10.5.
>>>>>> Can you mention more about what kind of IO you are doing, I'm assuming
>>>>>> it's O_DIRECT? I'll see if I can reproduce this.
>>>>> It's writing to/reading from pipes (nonblocking, no O_DIRECT).
>>>> A blind guess, does it handle short reads and writes? If not, can you
>>>> check whether they happen or not?
>>>
>>> Something like this was what I suspected at first as well. It does check for short read/writes and I added (unnecessary -- because the fuse request structure is 40 bytes and it does io in page sizes) code for retrying short reads at some point. I also checked for the pipes to be empty before they are used at some point and let the kernel log allocation failures (idea was that it was short pipe read/writes because of allocation failure or that something doesn't get rewound properly in this case). Beyond that three things that make a user space problem unlikely:
>>>
>>>  - occurs only when using fixed buffers and does not occur when running same code without fixed buffer opcodes
>>>  - doesn't occur when there is no memory pressure
>>>  - I added print(k/f) logging that pointed me in this direction as well
>>>
>>>>> I can reproduce it with https://github.com/uroni/fuseuring on e.g. a 2GB VPS. Modify bench.sh so that fio loops. Add swap, then run 1400M memtester while it runs (so it swaps, I guess). I can try further reducing the reproducer, but I wanted to avoid that work in case it is something obvious. The next step would be to remove fuse from the equation -- it does try to move the pages from the pipe when splicing to it, for example.
>>
>> When I use 5.10.7 with 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 ("mm: do_wp_page() simplification"), 1a0cf26323c80e2f1c58fc04f15686de61bfab0c ("mm/ksm: Remove reuse_ksm_page()") and be068f29034fb00530a053d18b8cf140c32b12b3 ("mm: fix misplaced unlock_page in do_wp_page()") reverted the issue doesn't seem to occur.
> 
> Thanks for tracking it down. Was it reported to Linus and Peter?

That seems very strange and should then affect a bunch of other stuff,
too... Do you have a test case? I'd love to dive into this and figure
out what is going on, and would save me a lot of time.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Fixed buffers have out-dated content
  2021-01-14 21:50         ` Fixed buffers " Martin Raiber
  2021-01-16 19:30           ` Pavel Begunkov
@ 2021-01-16 22:12           ` Jens Axboe
  2021-01-16 23:05             ` Linus Torvalds
  1 sibling, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2021-01-16 22:12 UTC (permalink / raw)
  To: Martin Raiber, Pavel Begunkov, io-uring, Linus Torvalds

On 1/14/21 2:50 PM, Martin Raiber wrote:
> On 10.01.2021 17:50 Martin Raiber wrote:
>> On 09.01.2021 21:32 Pavel Begunkov wrote:
>>> On 09/01/2021 16:58, Martin Raiber wrote:
>>>> On 09.01.2021 17:23 Jens Axboe wrote:
>>>>> On 1/8/21 4:39 PM, Martin Raiber wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I have a gnarly issue with io_uring and fixed buffers (fixed
>>>>>> read/write). It seems the contents of those buffers contain old 
>>>>>> data in
>>>>>> some rare cases under memory pressure after a read/during a write.
>>>>>>
>>>>>> Specifically I use io_uring with fuse and to confirm this is not some
>>>>>> user space issue let fuse print the unique id it adds to each 
>>>>>> request.
>>>>>> Fuse adds this request data to a pipe, and when the pipe buffer is 
>>>>>> later
>>>>>> copied to the io_uring fixed buffer it has the id of a fuse request
>>>>>> returned earlier using the same buffer while returning the size of 
>>>>>> the
>>>>>> new request. Or I set the unique id in the buffer, write it to 
>>>>>> fuse (via
>>>>>> writing to a pipe, then splicing) and then fuse returns with e.g.
>>>>>> ENOENT, because the unique id is not correct because in kernel it 
>>>>>> reads
>>>>>> the id of the previous, already completed, request using this buffer.
>>>>>>
>>>>>> To make reproducing this faster running memtester (which mlocks a
>>>>>> configurable amount of memory) with a large amount of user memory 
>>>>>> every
>>>>>> 30s helps. So it has something to do with swapping? It seems to not
>>>>>> occur if no swap space is active. Problem occurs without warning when
>>>>>> the kernel is build with KASAN and slab debugging.
>>>>>>
>>>>>> If I don't use the _FIXED opcodes (which is easy to do), the problem
>>>>>> does not occur.
>>>>>>
>>>>>> Problem occurs with 5.9.16 and 5.10.5.
>>>>> Can you mention more about what kind of IO you are doing, I'm assuming
>>>>> it's O_DIRECT? I'll see if I can reproduce this.
>>>> It's writing to/reading from pipes (nonblocking, no O_DIRECT).
>>> A blind guess, does it handle short reads and writes? If not, can you
>>> check whether they happen or not?
>>
>> Something like this was what I suspected at first as well. It does 
>> check for short read/writes and I added (unnecessary -- because the 
>> fuse request structure is 40 bytes and it does io in page sizes) code 
>> for retrying short reads at some point. I also checked for the pipes 
>> to be empty before they are used at some point and let the kernel log 
>> allocation failures (idea was that it was short pipe read/writes 
>> because of allocation failure or that something doesn't get rewound 
>> properly in this case). Beyond that three things that make a user 
>> space problem unlikely:
>>
>>  - occurs only when using fixed buffers and does not occur when 
>> running same code without fixed buffer opcodes
>>  - doesn't occur when there is no memory pressure
>>  - I added print(k/f) logging that pointed me in this direction as well
>>
>>>> I can reproduce it with https://github.com/uroni/fuseuring on e.g. a 
>>>> 2GB VPS. Modify bench.sh so that fio loops. Add swap, then run 1400M 
>>>> memtester while it runs (so it swaps, I guess). I can try further 
>>>> reducing the reproducer, but I wanted to avoid that work in case it 
>>>> is something obvious. The next step would be to remove fuse from the 
>>>> equation -- it does try to move the pages from the pipe when 
>>>> splicing to it, for example.
> 
> When I use 5.10.7 with 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 ("mm: 
> do_wp_page() simplification"), 1a0cf26323c80e2f1c58fc04f15686de61bfab0c 
> ("mm/ksm: Remove reuse_ksm_page()") and 
> be068f29034fb00530a053d18b8cf140c32b12b3 ("mm: fix misplaced unlock_page 
> in do_wp_page()") reverted the issue doesn't seem to occur.

Linus, I'm looking into the above report, all of it should still be
intact in the quoted section. Figured it would not hurt to loop you in,
in case this is a generic problem and since Martin identified that
reverting the above changes on the mm side makes it go away. Maybe
there's already some discussion elsewhere about it that I'm not privy
to.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Fixed buffers have out-dated content
  2021-01-16 22:12           ` Jens Axboe
@ 2021-01-16 23:05             ` Linus Torvalds
  2021-01-16 23:34               ` Linus Torvalds
  0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2021-01-16 23:05 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Martin Raiber, Pavel Begunkov, io-uring

On Sat, Jan 16, 2021 at 2:12 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Linus, I'm looking into the above report, all of it should still be
> intact in the quoted section. Figured it would not hurt to loop you in,
> in case this is a generic problem and since Martin identified that
> reverting the above changes on the mm side makes it go away. Maybe
> there's already some discussion elsewhere about it that I'm not privy
> to.

Ok, that commit clearly ended up more painful that it should have been.

The problem is that with it, if we revert it, then we'd have to also
revert  a308c71bf1e6 ("mm/gup: Remove enfornced COW mechanism"). which
depends on it.

And that fixed its own set of problems.

Darn.

I'll go think about this.

Martin, since you can apparently trigger the problem easily, hopefully
you're willing to try a couple of patches?

           Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Fixed buffers have out-dated content
  2021-01-16 23:05             ` Linus Torvalds
@ 2021-01-16 23:34               ` Linus Torvalds
  2021-01-17 20:07                 ` Martin Raiber
  0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2021-01-16 23:34 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Martin Raiber, Pavel Begunkov, io-uring

[-- Attachment #1: Type: text/plain, Size: 819 bytes --]

On Sat, Jan 16, 2021 at 3:05 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'll go think about this.
>
> Martin, since you can apparently trigger the problem easily, hopefully
> you're willing to try a couple of patches?

Hmm. It might end up being as simple as the attached patch.

I'm not super-happy with this situation (that whole nasty security
issue had some horrible cascading problems), and this only really
fixes _pinned_ pages.

In particular, somebody doing a plain get_user_pages() for writing can
still hit issues (admittedly that's true in general, but the vm
changes made it much more obviously true).

But for the case of io_uring buffers, this looks like the obvious simple fix.

I don't have a load to test this with, so I'll come back to ask Martin
to do so...

                 Linus

[-- Attachment #2: patch --]
[-- Type: application/octet-stream, Size: 526 bytes --]

 mm/vmscan.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 257cba79a96d..b1b574ad199d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1238,6 +1238,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 			if (!PageSwapCache(page)) {
 				if (!(sc->gfp_mask & __GFP_IO))
 					goto keep_locked;
+				if (page_maybe_dma_pinned(page))
+					goto keep_locked;
 				if (PageTransHuge(page)) {
 					/* cannot split THP, skip it */
 					if (!can_split_huge_page(page, NULL))

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: Fixed buffers have out-dated content
  2021-01-16 23:34               ` Linus Torvalds
@ 2021-01-17 20:07                 ` Martin Raiber
  2021-01-17 20:14                   ` Linus Torvalds
  0 siblings, 1 reply; 13+ messages in thread
From: Martin Raiber @ 2021-01-17 20:07 UTC (permalink / raw)
  To: Linus Torvalds, Jens Axboe; +Cc: Martin Raiber, Pavel Begunkov, io-uring

On 17.01.2021 00:34 Linus Torvalds wrote:
> On Sat, Jan 16, 2021 at 3:05 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> I'll go think about this.
>>
>> Martin, since you can apparently trigger the problem easily, hopefully
>> you're willing to try a couple of patches?
> Hmm. It might end up being as simple as the attached patch.
>
> I'm not super-happy with this situation (that whole nasty security
> issue had some horrible cascading problems), and this only really
> fixes _pinned_ pages.
>
> In particular, somebody doing a plain get_user_pages() for writing can
> still hit issues (admittedly that's true in general, but the vm
> changes made it much more obviously true).
>
> But for the case of io_uring buffers, this looks like the obvious simple fix.
>
> I don't have a load to test this with, so I'll come back to ask Martin
> to do so...
>
>                   Linus

Thanks! With the patch (skip swapping pinned pages) the problem doesn't 
occur anymore, so it seems to fix it.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Fixed buffers have out-dated content
  2021-01-17 20:07                 ` Martin Raiber
@ 2021-01-17 20:14                   ` Linus Torvalds
  0 siblings, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2021-01-17 20:14 UTC (permalink / raw)
  To: Martin Raiber, Peter Xu; +Cc: Jens Axboe, Pavel Begunkov, io-uring

On Sun, Jan 17, 2021 at 12:07 PM Martin Raiber <martin@urbackup.org> wrote:
>
> Thanks! With the patch (skip swapping pinned pages) the problem doesn't
> occur anymore, so it seems to fix it.

Heh, this email came in just as I had committed it to my tree and was
actively writing an email about how you likely wouldn't test it before
I did rc4 because it's a weekend ;)

But since I hadn't pushed it out yet (or done some of the pulls I have
pending), I amended the commit message with your tested-by as well.
Thanks.

It's commit feb889fb40fa ("mm: don't put pinned pages into the swap cache").

I was pretty sure that was the cause from the symptoms you saw, and
the commit explains the whole chain (and explains why the "simple and
stupid" two-liner is actually the right thing to do).

I was very tempted to make the condition for "don't put it into the
swap cache" be much more aggressive, to handle the "GUP with write"
case too, something like

        /* Single mapper, more references than us and the map? */
        if (page_mapcount(page) == 1 && page_count(page) > 2)
                goto keep_locked;

but just using page_maybe_dma_pinned() is the more targeted one for now.

(Added Peter to the cc so that he sees this).

                     Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-01-17 20:15 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-08 23:39 Fixed buffer have out-dated content Martin Raiber
2021-01-09 16:23 ` Jens Axboe
2021-01-09 16:58   ` Martin Raiber
2021-01-09 20:32     ` Pavel Begunkov
2021-01-10 16:50       ` Martin Raiber
2021-01-14 21:50         ` Fixed buffers " Martin Raiber
2021-01-16 19:30           ` Pavel Begunkov
2021-01-16 19:39             ` Jens Axboe
2021-01-16 22:12           ` Jens Axboe
2021-01-16 23:05             ` Linus Torvalds
2021-01-16 23:34               ` Linus Torvalds
2021-01-17 20:07                 ` Martin Raiber
2021-01-17 20:14                   ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).