linux-block.vger.kernel.org archive mirror
* uring regression - lost write request
@ 2021-10-22  3:12 Daniel Black
  2021-10-22  9:10 ` Pavel Begunkov
  0 siblings, 1 reply; 36+ messages in thread
From: Daniel Black @ 2021-10-22  3:12 UTC (permalink / raw)
  To: linux-block

Somewhere after 5.11, and fixed by 5.15-rcX (rc6 extensively tested
over the last few days), there is a kernel regression we are tracking in
https://jira.mariadb.org/browse/MDEV-26674 and
https://jira.mariadb.org/browse/MDEV-26555
5.10 and earlier, across many distros and hardware, appear not to have the problem.

I'd appreciate some help identifying a suitable patch for the 5.14 linux
stable tree, as I observe the fault in mainline 5.14.14 (built from
https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.14.14/). This is of
interest to Debian (sid)
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=996951 , Ubuntu
(Impish) and Fedora fc33-35 (TODO bug report).

Marko, in https://jira.mariadb.org/browse/MDEV-26555?focusedCommentId=198601&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-198601,
traced this down to an io_uring_wait_cqe() call never returning after a
request was pushed.
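
To illustrate the shape of the hang (this is only a hand-written sketch,
not MariaDB/InnoDB source; the file name, buffer size and minimal error
handling are my own simplifications), the pattern is a plain liburing
submit-then-wait. If the completion for the queued write is never posted,
the final io_uring_wait_cqe() below blocks forever:

/* sketch.c: illustrative only. Build with: gcc sketch.c -o sketch -luring */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char buf[4096];
    int fd, ret;

    memset(buf, 'x', sizeof(buf));
    fd = open("testfile", O_WRONLY | O_CREAT, 0644);
    if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
        return 1;

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, buf, sizeof(buf), 0); /* queue one write */
    io_uring_sqe_set_data(sqe, buf);                   /* tag the request */
    io_uring_submit(&ring);

    /* The completion thread does the equivalent of this wait; if the
     * request's CQE is never posted, it hangs here and InnoDB's
     * fatal-semaphore watchdog eventually fires. */
    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret == 0) {
        printf("write completed: res=%d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}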

The observed behavior below uses the mariadb-test package for 10.6:

dan@impish:~$ uname -a
Linux impish 5.14.14-051414-generic #202110201037 SMP Wed Oct 20
11:04:11 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

dan@impish:~$ cd /usr/share/mysql/mysql-test/

dan@impish:/usr/share/mysql/mysql-test$ ./mtr --vardir=/tmp/var
--parallel=4 stress.ddl_innodb stress.ddl_innodb stress.ddl_innodb
stress.ddl_innodb    stress.ddl_innodb stress.ddl_innodb
stress.ddl_innodb stress.ddl_innodb  stress.ddl_innodb
stress.ddl_innodb stress.ddl_innodb stress.ddl_innodb
stress.ddl_innodb stress.ddl_innodb stress.ddl_innodb
stress.ddl_innodb
Logging: ./mtr  --vardir=/tmp/var --parallel=4 stress.ddl_innodb
stress.ddl_innodb stress.ddl_innodb stress.ddl_innodb
stress.ddl_innodb stress.ddl_innodb stress.ddl_innodb
stress.ddl_innodb stress.ddl_innodb stress.ddl_innodb
stress.ddl_innodb stress.ddl_innodb stress.ddl_innodb
stress.ddl_innodb stress.ddl_innodb stress.ddl_innodb
vardir: /tmp/var
Removing old var directory...
Creating var directory '/tmp/var'...
Checking supported features...
MariaDB Version 10.6.5-MariaDB-1:10.6.5+maria~impish
 - SSL connections supported
 - binaries built with wsrep patch
Collecting tests...
Installing system database...

==============================================================================

TEST                                  WORKER RESULT   TIME (ms) or COMMENT
--------------------------------------------------------------------------

worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
worker[4] Using MTR_BUILD_THREAD 301, with reserved ports 16020..16039
worker[3] Using MTR_BUILD_THREAD 302, with reserved ports 16040..16059
worker[2] Using MTR_BUILD_THREAD 303, with reserved ports 16060..16079
stress.ddl_innodb 'innodb'               w3 [ pass ]  185605
stress.ddl_innodb 'innodb'               w4 [ pass ]  186292
stress.ddl_innodb 'innodb'               w2 [ pass ]  193053
stress.ddl_innodb 'innodb'               w1 [ pass ]  202529
stress.ddl_innodb 'innodb'               w4 [ pass ]  213972
stress.ddl_innodb 'innodb'               w3 [ pass ]  214661
stress.ddl_innodb 'innodb'               w1 [ pass ]  213266
stress.ddl_innodb 'innodb'               w4 [ pass ]  181716
stress.ddl_innodb 'innodb'               w3 [ pass ]  194047
stress.ddl_innodb 'innodb'               w1 [ pass ]  208319
stress.ddl_innodb 'innodb'               w2 [ fail ]
        Test ended at 2021-10-22 01:24:22

----------SERVER LOG START-----------
2021-10-22  1:24:20 0 [ERROR] [FATAL] InnoDB:
innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch.
Please refer to
https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/

This threshold is 10 minutes, so it's not as if the hardware is that slow.

To my frustration, the hirsute-based container (below), created as a
test framework for you, has never produced a fault, even though it runs
on the same 5.14.14-200.fc34.x86_64 kernel that would otherwise fail
after 2-3 stress.ddl_innodb tests.

$ podman run   --rm --privileged=true
quay.io/danielgblack/mariadb-test:uring    --vardir=/var/tmp
stress.ddl_innodb{,,,,,,,,,,,,,}
...
--------------------------------------------------------------------------
The servers were restarted 0 times
Spent 2908.065 of 822 seconds executing testcases

Completed: All 18 tests were successful.

Looking at the server test logs in /var/tmp/[0-9]/*/*err*, the mariadbd
processes are using io_uring.

I hope this provides a hint.

In the meantime, the complete reproduction is to pull a 10.6 distro
package from https://mariadb.org/download/?tab=repo-config
It has to be a distro that provides liburing, like:
Debian sid
Ubuntu - groovy+
RHEL 8
Fedora

(centos/rhel currently have an incorrect baseurl; replace the last
fragment of the path with [rhel|centos][7|8]-$arch )
Install the repo.
Install the package mariadb-test (it pulls in the MariaDB server as a dependency).
ldd /usr/{s}bin/mariadbd to check that liburing is there.
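
As a quick aside (not one of the packaging steps above, just a hedged
sketch of my own), a tiny C probe can also confirm that the running kernel
supports io_uring at all; io_uring_queue_init() fails with -ENOSYS where it
does not:

/* probe.c: minimal io_uring availability check (illustrative only, not
 * part of the MariaDB test suite). Build with: gcc probe.c -o probe -luring */
#include <liburing.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    struct io_uring ring;
    int ret = io_uring_queue_init(4, &ring, 0);   /* 4 entries, no flags */

    if (ret < 0) {
        fprintf(stderr, "io_uring unavailable: %s\n", strerror(-ret));
        return 1;
    }
    io_uring_queue_exit(&ring);
    printf("io_uring available\n");
    return 0;
}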

cd /usr/share/mysql/mysql-test
./mtr --vardir=/tmp/var   --parallel=4 encryption.innochecksum{,,,,,}
./mtr --vardir=/tmp/var   --parallel=4 stress.ddl_innodb{,,,,,}

These should generate a backtrace like the one above.

With gdb and xterm installed, the following mtr argument will set a
breakpoint so the application stops in this state:
 --gdb='b ib::fatal::~fatal;r'

I'm happy to build from a tree like
https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=io_uring-5.15
if you'd like me to test something locally.

I can also run bpftrace scripts to pull out info if required.


* Re: uring regression - lost write request
  2021-10-22  3:12 uring regression - lost write request Daniel Black
@ 2021-10-22  9:10 ` Pavel Begunkov
  2021-10-25  9:57   ` Pavel Begunkov
  0 siblings, 1 reply; 36+ messages in thread
From: Pavel Begunkov @ 2021-10-22  9:10 UTC (permalink / raw)
  To: Daniel Black, linux-block; +Cc: io-uring

On 10/22/21 04:12, Daniel Black wrote:
> Sometime after 5.11 and is fixed in 5.15-rcX (rc6 extensively tested
> over last few days) is a kernel regression we are tracing in
> https://jira.mariadb.org/browse/MDEV-26674 and
> https://jira.mariadb.org/browse/MDEV-26555
> 5.10 and early across many distros and hardware appear not to have a problem.
> 
> I'd appreciate some help identifying a 5.14 linux stable patch
> suitable as I observe the fault in mainline 5.14.14 (built

Cc: io-uring@vger.kernel.org

Let me try to remember anything relevant from 5.15.
Thanks for letting us know.

-- 
Pavel Begunkov


* Re: uring regression - lost write request
  2021-10-22  9:10 ` Pavel Begunkov
@ 2021-10-25  9:57   ` Pavel Begunkov
  2021-10-25 11:09     ` Daniel Black
  0 siblings, 1 reply; 36+ messages in thread
From: Pavel Begunkov @ 2021-10-25  9:57 UTC (permalink / raw)
  To: Daniel Black, linux-block; +Cc: io-uring

On 10/22/21 10:10, Pavel Begunkov wrote:
> On 10/22/21 04:12, Daniel Black wrote:
>> Sometime after 5.11 and is fixed in 5.15-rcX (rc6 extensively tested
>> over last few days) is a kernel regression we are tracing in
>> https://jira.mariadb.org/browse/MDEV-26674 and
>> https://jira.mariadb.org/browse/MDEV-26555
>> 5.10 and early across many distros and hardware appear not to have a problem.
>>
>> I'd appreciate some help identifying a 5.14 linux stable patch
>> suitable as I observe the fault in mainline 5.14.14 (built
> 
> Cc: io-uring@vger.kernel.org
> 
> Let me try to remember anything relevant from 5.15,
> Thanks for letting know

Daniel, following the links I found this:

"From: Daniel Black <daniel@mariadb.org>
...
The good news is I've validated that the linux mainline 5.14.14 build
from https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.14.14/ has
actually fixed this problem."

To be clear, is the mainline 5.14 kernel affected by the issue?
Or does the problem exist only in debian/etc. kernel trees?

-- 
Pavel Begunkov


* Re: uring regression - lost write request
  2021-10-25  9:57   ` Pavel Begunkov
@ 2021-10-25 11:09     ` Daniel Black
  2021-10-25 11:25       ` Pavel Begunkov
  0 siblings, 1 reply; 36+ messages in thread
From: Daniel Black @ 2021-10-25 11:09 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: linux-block, io-uring

On Mon, Oct 25, 2021 at 8:59 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 10/22/21 10:10, Pavel Begunkov wrote:
> > On 10/22/21 04:12, Daniel Black wrote:
> >> Sometime after 5.11 and is fixed in 5.15-rcX (rc6 extensively tested
> >> over last few days) is a kernel regression we are tracing in
> >> https://jira.mariadb.org/browse/MDEV-26674 and
> >> https://jira.mariadb.org/browse/MDEV-26555
> >> 5.10 and early across many distros and hardware appear not to have a problem.
> >>
> >> I'd appreciate some help identifying a 5.14 linux stable patch
> >> suitable as I observe the fault in mainline 5.14.14 (built
> >
> > Cc: io-uring@vger.kernel.org
> >
> > Let me try to remember anything relevant from 5.15,
> > Thanks for letting know
>
> Daniel, following the links I found this:
>
> "From: Daniel Black <daniel@mariadb.org>
> ...
> The good news is I've validated that the linux mainline 5.14.14 build
> from https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.14.14/ has
> actually fixed this problem."
>
> To be clear, is the mainline 5.14 kernel affected with the issue?
> Or does the problem exists only in debian/etc. kernel trees?
>
> --
> Pavel Begunkov


Thanks Pavel for looking.

I'm retesting https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.14.14/
in earnest. I did get some assertions, but they may have been
unrelated. The testing continues...

The problem with debian trees on 5.14.12 (as
linux-image-5.14.0-3-amd64_5.14.12-1_amd64.deb) was quite real
https://jira.mariadb.org/browse/MDEV-26674?focusedCommentId=203155&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-203155


What is concrete is the fc34 package of 5.14.14 (which does carry a
Red Hat delta,
https://src.fedoraproject.org/rpms/kernel/blob/f34/f/patch-5.14-redhat.patch,
though I am unsure of its significance). Output below:

https://koji.fedoraproject.org/koji/buildinfo?buildID=1847210

$ uname -a
Linux localhost.localdomain 5.14.14-200.fc34.x86_64 #1 SMP Wed Oct 20
16:15:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

~/repos/mariadb-server-10.6 10.6
$ bd

~/repos/build-mariadb-server-10.6
$ mysql-test/mtr  --parallel=4 encryption.innochecksum{,,,,,}
Logging: /home/dan/repos/mariadb-server-10.6/mysql-test/mariadb-test-run.pl
 --parallel=4 encryption.innochecksum encryption.innochecksum
encryption.innochecksum encryption.innochecksum
encryption.innochecksum encryption.innochecksum
vardir: /home/dan/repos/build-mariadb-server-10.6/mysql-test/var
Removing old var directory...
 - WARNING: Using the 'mysql-test/var' symlink
The destination for symlink
/home/dan/repos/build-mariadb-server-10.6/mysql-test/var does not
exist; Removing it and creating a new var directory
Creating var directory
'/home/dan/repos/build-mariadb-server-10.6/mysql-test/var'...
Checking supported features...
MariaDB Version 10.6.5-MariaDB
 - SSL connections supported
 - binaries built with wsrep patch
Collecting tests...
Installing system database...

==============================================================================

TEST                                  WORKER RESULT   TIME (ms) or COMMENT
--------------------------------------------------------------------------

worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
worker[3] Using MTR_BUILD_THREAD 302, with reserved ports 16040..16059
worker[2] Using MTR_BUILD_THREAD 301, with reserved ports 16020..16039
worker[4] Using MTR_BUILD_THREAD 303, with reserved ports 16060..16079
encryption.innochecksum '16k,cbc,innodb,strict_crc32' w3 [ pass ]   5460
encryption.innochecksum '16k,cbc,innodb,strict_crc32' w2 [ pass ]   5418
encryption.innochecksum '16k,cbc,innodb,strict_crc32' w1 [ pass ]   9391
encryption.innochecksum '16k,cbc,innodb,strict_crc32' w3 [ pass ]   8682
encryption.innochecksum '16k,cbc,innodb,strict_crc32' w3 [ pass ]   3873
encryption.innochecksum '8k,cbc,innodb,strict_crc32' w1 [ pass ]   9133
encryption.innochecksum '4k,cbc,innodb,strict_crc32' w2 [ pass ]  11074
encryption.innochecksum '8k,cbc,innodb,strict_crc32' w1 [ pass ]   5253
encryption.innochecksum '16k,cbc,innodb,strict_full_crc32' w3 [ pass ]   4019
encryption.innochecksum '4k,cbc,innodb,strict_crc32' w2 [ pass ]   6318
encryption.innochecksum '16k,cbc,innodb,strict_full_crc32' w3 [ pass ]   6176
encryption.innochecksum '8k,cbc,innodb,strict_crc32' w1 [ pass ]   7305
encryption.innochecksum '16k,cbc,innodb,strict_full_crc32' w3 [ pass ]   4430
encryption.innochecksum '4k,cbc,innodb,strict_crc32' w2 [ pass ]  10005
encryption.innochecksum '8k,cbc,innodb,strict_crc32' w1 [ pass ]   6878
encryption.innochecksum '16k,cbc,innodb,strict_full_crc32' w3 [ pass ]   3613
encryption.innochecksum '16k,cbc,innodb,strict_full_crc32' w3 [ pass ]   3875
encryption.innochecksum '4k,cbc,innodb,strict_crc32' w2 [ pass ]   6612
encryption.innochecksum '8k,cbc,innodb,strict_crc32' w1 [ pass ]   4901
encryption.innochecksum '16k,cbc,innodb,strict_full_crc32' w3 [ pass ]   3853
encryption.innochecksum '8k,cbc,innodb,strict_crc32' w1 [ pass ]   5080
encryption.innochecksum '4k,cbc,innodb,strict_crc32' w2 [ pass ]   7072
encryption.innochecksum '4k,cbc,innodb,strict_crc32' w2 [ pass ]   6774
encryption.innochecksum '4k,cbc,innodb,strict_full_crc32' w3 [ pass ]   7037
encryption.innochecksum '8k,cbc,innodb,strict_full_crc32' w1 [ pass ]   4961
encryption.innochecksum '8k,cbc,innodb,strict_full_crc32' w1 [ pass ]   5692
encryption.innochecksum '4k,cbc,innodb,strict_full_crc32' w3 [ pass ]   8449
encryption.innochecksum '16k,ctr,innodb,strict_crc32' w2 [ pass ]   5515
encryption.innochecksum '8k,cbc,innodb,strict_full_crc32' w1 [ pass ]   5650
encryption.innochecksum '16k,ctr,innodb,strict_crc32' w2 [ pass ]   3722
encryption.innochecksum '4k,cbc,innodb,strict_full_crc32' w3 [ pass ]   6691
encryption.innochecksum '8k,cbc,innodb,strict_full_crc32' w1 [ pass ]   4611
encryption.innochecksum '16k,ctr,innodb,strict_crc32' w2 [ pass ]   4587
encryption.innochecksum '16k,ctr,innodb,strict_crc32' w2 [ pass ]   5465
encryption.innochecksum '8k,cbc,innodb,strict_full_crc32' w1 [ pass ]   6900
encryption.innochecksum '4k,cbc,innodb,strict_full_crc32' w3 [ pass ]   8333
encryption.innochecksum '16k,ctr,innodb,strict_crc32' w2 [ pass ]   4691
encryption.innochecksum '8k,cbc,innodb,strict_full_crc32' w1 [ pass ]   5077
encryption.innochecksum '4k,cbc,innodb,strict_full_crc32' w3 [ pass ]   6319
encryption.innochecksum '16k,ctr,innodb,strict_crc32' w2 [ pass ]   4590
encryption.innochecksum '4k,ctr,innodb,strict_crc32' w1 [ pass ]   9683
encryption.innochecksum '8k,ctr,innodb,strict_crc32' w2 [ pass ]   5404
encryption.innochecksum '4k,ctr,innodb,strict_crc32' w1 [ pass ]   6775
encryption.innochecksum '8k,ctr,innodb,strict_crc32' w2 [ pass ]   6190
encryption.innochecksum '4k,ctr,innodb,strict_crc32' w1 [ pass ]   9354
encryption.innochecksum '8k,ctr,innodb,strict_crc32' w2 [ pass ]   7734
encryption.innochecksum '8k,ctr,innodb,strict_crc32' w2 [ pass ]   4993
encryption.innochecksum '4k,ctr,innodb,strict_crc32' w1 [ pass ]   6280
encryption.innochecksum '8k,ctr,innodb,strict_crc32' w2 [ pass ]   4487
encryption.innochecksum '4k,ctr,innodb,strict_crc32' w1 [ pass ]   6971
encryption.innochecksum '8k,ctr,innodb,strict_crc32' w2 [ pass ]   5172
encryption.innochecksum '4k,ctr,innodb,strict_crc32' w1 [ pass ]   6317
encryption.innochecksum '16k,ctr,innodb,strict_full_crc32' w2 [ pass ]   3371
encryption.innochecksum '16k,ctr,innodb,strict_full_crc32' w2 [ pass ]   3472
encryption.innochecksum '16k,ctr,innodb,strict_full_crc32' w2 [ pass ]   6707
encryption.innochecksum '4k,ctr,innodb,strict_full_crc32' w1 [ pass ]   9337
encryption.innochecksum '16k,ctr,innodb,strict_full_crc32' w2 [ pass ]   9176
encryption.innochecksum '4k,ctr,innodb,strict_full_crc32' w1 [ pass ]  11817
encryption.innochecksum '16k,ctr,innodb,strict_full_crc32' w2 [ pass ]   3419
encryption.innochecksum '16k,ctr,innodb,strict_full_crc32' w2 [ pass ]   5256
encryption.innochecksum '4k,ctr,innodb,strict_full_crc32' w1 [ pass ]   9291
encryption.innochecksum '4k,ctr,innodb,strict_full_crc32' w1 [ pass ]   6508
encryption.innochecksum '4k,ctr,innodb,strict_full_crc32' w2 [ pass ]   6294
encryption.innochecksum '4k,ctr,innodb,strict_full_crc32' w1 [ pass ]   6327
encryption.innochecksum '8k,ctr,innodb,strict_full_crc32' w2 [ pass ]   4579
encryption.innochecksum '8k,ctr,innodb,strict_full_crc32' w1 [ pass ]   4764
encryption.innochecksum '8k,ctr,innodb,strict_full_crc32' w2 [ pass ]   4469
encryption.innochecksum '8k,ctr,innodb,strict_full_crc32' w1 [ pass ]   4677
encryption.innochecksum '8k,ctr,innodb,strict_full_crc32' w2 [ pass ]   4696
encryption.innochecksum '8k,ctr,innodb,strict_full_crc32' w1 [ pass ]   3898
encryption.innochecksum '4k,cbc,innodb,strict_full_crc32' w3 [ pass ]  127358
encryption.innochecksum '16k,cbc,innodb,strict_crc32' w4 [ fail ]
        Test ended at 2021-10-25 21:39:13

CURRENT_TEST: encryption.innochecksum
mysqltest: At line 41: query 'INSERT INTO t3 SELECT * FROM t1' failed:
<Unknown> (2013): Lost connection to server during query

The result from queries just before the failure was:
SET GLOBAL innodb_file_per_table = ON;
set global innodb_compression_algorithm = 1;
# Create and populate a tables
CREATE TABLE t1 (a INT AUTO_INCREMENT PRIMARY KEY, b TEXT)
ENGINE=InnoDB ENCRYPTED=YES ENCRYPTION_KEY_ID=4;
CREATE TABLE t2 (a INT AUTO_INCREMENT PRIMARY KEY, b TEXT)
ENGINE=InnoDB ROW_FORMAT=COMPRESSED ENCRYPTED=YES ENCRYPTION_KEY_ID=4;
CREATE TABLE t3 (a INT AUTO_INCREMENT PRIMARY KEY, b TEXT)
ENGINE=InnoDB ROW_FORMAT=COMPRESSED ENCRYPTED=NO;
CREATE TABLE t4 (a INT AUTO_INCREMENT PRIMARY KEY, b TEXT)
ENGINE=InnoDB PAGE_COMPRESSED=1;
CREATE TABLE t5 (a INT AUTO_INCREMENT PRIMARY KEY, b TEXT)
ENGINE=InnoDB PAGE_COMPRESSED=1 ENCRYPTED=YES ENCRYPTION_KEY_ID=4;
CREATE TABLE t6 (a INT AUTO_INCREMENT PRIMARY KEY, b TEXT) ENGINE=InnoDB;


Server [mysqld.1 - pid: 15380, winpid: 15380, exit: 256] failed during test run
Server log from this test:
----------SERVER LOG START-----------
$ /home/dan/repos/build-mariadb-server-10.6/sql/mariadbd
--defaults-group-suffix=.1
--defaults-file=/home/dan/repos/build-mariadb-server-10.6/mysql-test/var/4/my.cnf
--log-output=file --innodb-page-size=16K
--skip-innodb-read-only-compressed
--innodb-checksum-algorithm=strict_crc32 --innodb-flush-sync=OFF
--innodb --innodb-cmpmem --innodb-cmp-per-index --innodb-trx
--innodb-locks --innodb-lock-waits --innodb-metrics
--innodb-buffer-pool-stats --innodb-buffer-page
--innodb-buffer-page-lru --innodb-sys-columns --innodb-sys-fields
--innodb-sys-foreign --innodb-sys-foreign-cols --innodb-sys-indexes
--innodb-sys-tables --innodb-sys-virtual
--plugin-load-add=file_key_management.so --loose-file-key-management
--loose-file-key-management-filename=/home/dan/repos/mariadb-server-10.6/mysql-test/std_data/keys.txt
--file-key-management-encryption-algorithm=aes_cbc
--skip-innodb-read-only-compressed --core-file
--loose-debug-sync-timeout=300
2021-10-25 21:28:56 0 [Note]
/home/dan/repos/build-mariadb-server-10.6/sql/mariadbd (server
10.6.5-MariaDB-log) starting as process 15381 ...
2021-10-25 21:28:56 0 [Warning] Could not increase number of
max_open_files to more than 1024 (request: 32190)
2021-10-25 21:28:56 0 [Warning] Changed limits: max_open_files: 1024
max_connections: 151 (was 151)  table_cache: 421 (was 2000)
2021-10-25 21:28:56 0 [Note] Plugin 'partition' is disabled.
2021-10-25 21:28:56 0 [Note] Plugin 'SEQUENCE' is disabled.
2021-10-25 21:28:56 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2021-10-25 21:28:56 0 [Note] InnoDB: Number of pools: 1
2021-10-25 21:28:56 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2021-10-25 21:28:56 0 [Note] InnoDB: Using liburing
2021-10-25 21:28:56 0 [Note] InnoDB: Initializing buffer pool, total
size = 8388608, chunk size = 8388608
2021-10-25 21:28:56 0 [Note] InnoDB: Completed initialization of buffer pool
2021-10-25 21:28:56 0 [Note] InnoDB: 128 rollback segments are active.
2021-10-25 21:28:56 0 [Note] InnoDB: Creating shared tablespace for
temporary tables
2021-10-25 21:28:56 0 [Note] InnoDB: Setting file './ibtmp1' size to
12 MB. Physically writing the file full; Please wait ...
2021-10-25 21:28:56 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2021-10-25 21:28:56 0 [Note] InnoDB: 10.6.5 started; log sequence
number 43637; transaction id 17
2021-10-25 21:28:56 0 [Note] InnoDB: Loading buffer pool(s) from
/home/dan/repos/build-mariadb-server-10.6/mysql-test/var/4/mysqld.1/data/ib_buffer_pool
2021-10-25 21:28:56 0 [Note] Plugin 'INNODB_FT_CONFIG' is disabled.
2021-10-25 21:28:56 0 [Note] Plugin 'INNODB_SYS_TABLESTATS' is disabled.
2021-10-25 21:28:56 0 [Note] Plugin 'INNODB_FT_DELETED' is disabled.
2021-10-25 21:28:56 0 [Note] Plugin 'INNODB_CMP' is disabled.
2021-10-25 21:28:56 0 [Note] Plugin 'THREAD_POOL_WAITS' is disabled.
2021-10-25 21:28:56 0 [Note] Plugin 'INNODB_CMP_RESET' is disabled.
2021-10-25 21:28:56 0 [Note] Plugin 'THREAD_POOL_QUEUES' is disabled.
2021-10-25 21:28:56 0 [Note] Plugin 'FEEDBACK' is disabled.
2021-10-25 21:28:56 0 [Note] Plugin 'INNODB_FT_INDEX_TABLE' is disabled.
2021-10-25 21:28:56 0 [Note] Plugin 'THREAD_POOL_GROUPS' is disabled.
2021-10-25 21:28:56 0 [Note] Plugin 'INNODB_CMP_PER_INDEX_RESET' is disabled.
2021-10-25 21:28:56 0 [Note] Plugin 'INNODB_FT_INDEX_CACHE' is disabled.
2021-10-25 21:28:56 0 [Note] Plugin 'INNODB_FT_BEING_DELETED' is disabled.
2021-10-25 21:28:56 0 [Note] Plugin 'INNODB_CMPMEM_RESET' is disabled.
2021-10-25 21:28:56 0 [Note] Plugin 'INNODB_FT_DEFAULT_STOPWORD' is disabled.
2021-10-25 21:28:56 0 [Note] Plugin 'INNODB_SYS_TABLESPACES' is disabled.
2021-10-25 21:28:56 0 [Note] Plugin 'user_variables' is disabled.
2021-10-25 21:28:56 0 [Note] Plugin 'INNODB_TABLESPACES_ENCRYPTION' is disabled.
2021-10-25 21:28:56 0 [Note] Plugin 'THREAD_POOL_STATS' is disabled.
2021-10-25 21:28:56 0 [Note] Plugin 'unix_socket' is disabled.
2021-10-25 21:28:56 0 [Warning]
/home/dan/repos/build-mariadb-server-10.6/sql/mariadbd: unknown
variable 'loose-feedback-debug-startup-interval=20'
2021-10-25 21:28:56 0 [Warning]
/home/dan/repos/build-mariadb-server-10.6/sql/mariadbd: unknown
variable 'loose-feedback-debug-first-interval=60'
2021-10-25 21:28:56 0 [Warning]
/home/dan/repos/build-mariadb-server-10.6/sql/mariadbd: unknown
variable 'loose-feedback-debug-interval=60'
2021-10-25 21:28:56 0 [Warning]
/home/dan/repos/build-mariadb-server-10.6/sql/mariadbd: unknown option
'--loose-pam-debug'
2021-10-25 21:28:56 0 [Warning]
/home/dan/repos/build-mariadb-server-10.6/sql/mariadbd: unknown option
'--loose-aria'
2021-10-25 21:28:56 0 [Warning]
/home/dan/repos/build-mariadb-server-10.6/sql/mariadbd: unknown
variable 'loose-debug-sync-timeout=300'
2021-10-25 21:28:56 0 [Note] Server socket created on IP: '127.0.0.1'.
2021-10-25 21:28:56 0 [Note]
/home/dan/repos/build-mariadb-server-10.6/sql/mariadbd: ready for
connections.
Version: '10.6.5-MariaDB-log'  socket:
'/home/dan/repos/build-mariadb-server-10.6/mysql-test/var/tmp/4/mysqld.1.sock'
 port: 16060  Source distribution
2021-10-25 21:28:56 0 [Note] InnoDB: Buffer pool(s) load completed at
211025 21:28:56
2021-10-25 21:39:11 0 [ERROR] [FATAL] InnoDB:
innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch


* Re: uring regression - lost write request
  2021-10-25 11:09     ` Daniel Black
@ 2021-10-25 11:25       ` Pavel Begunkov
  2021-10-30  7:30         ` Salvatore Bonaccorso
  0 siblings, 1 reply; 36+ messages in thread
From: Pavel Begunkov @ 2021-10-25 11:25 UTC (permalink / raw)
  To: Daniel Black; +Cc: linux-block, io-uring

On 10/25/21 12:09, Daniel Black wrote:
> On Mon, Oct 25, 2021 at 8:59 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> On 10/22/21 10:10, Pavel Begunkov wrote:
>>> On 10/22/21 04:12, Daniel Black wrote:
>>>> Sometime after 5.11 and is fixed in 5.15-rcX (rc6 extensively tested
>>>> over last few days) is a kernel regression we are tracing in
>>>> https://jira.mariadb.org/browse/MDEV-26674 and
>>>> https://jira.mariadb.org/browse/MDEV-26555
>>>> 5.10 and early across many distros and hardware appear not to have a problem.
>>>>
>>>> I'd appreciate some help identifying a 5.14 linux stable patch
>>>> suitable as I observe the fault in mainline 5.14.14 (built
>>>
>>> Cc: io-uring@vger.kernel.org
>>>
>>> Let me try to remember anything relevant from 5.15,
>>> Thanks for letting know
>>
>> Daniel, following the links I found this:
>>
>> "From: Daniel Black <daniel@mariadb.org>
>> ...
>> The good news is I've validated that the linux mainline 5.14.14 build
>> from https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.14.14/ has
>> actually fixed this problem."
>>
>> To be clear, is the mainline 5.14 kernel affected with the issue?
>> Or does the problem exists only in debian/etc. kernel trees?
> 
> Thanks Pavel for looking.
> 
> I'm retesting https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.14.14/
> in earnest. I did get some assertions, but they may have been
> unrelated. The testing continues...

Thanks for the work on pinpointing it. I'll wait for your conclusion
then; it'll give us an idea of what we should look for.


> The problem with debian trees on 5.14.12 (as
> linux-image-5.14.0-3-amd64_5.14.12-1_amd64.deb) was quite real
> https://jira.mariadb.org/browse/MDEV-26674?focusedCommentId=203155&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-203155
> 
> 
> What is concrete is the fc34 package of 5.14.14 (which obviously does
> have a Red Hat delta
> https://src.fedoraproject.org/rpms/kernel/blob/f34/f/patch-5.14-redhat.patch),
> but unsure of significance. Output below:
> 
> https://koji.fedoraproject.org/koji/buildinfo?buildID=1847210
> 
> $ uname -a
> Linux localhost.localdomain 5.14.14-200.fc34.x86_64 #1 SMP Wed Oct 20
> 16:15:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> 
> ~/repos/mariadb-server-10.6 10.6
> $ bd
> 
> ~/repos/build-mariadb-server-10.6
> $ mysql-test/mtr  --parallel=4 encryption.innochecksum{,,,,,}
> Logging: /home/dan/repos/mariadb-server-10.6/mysql-test/mariadb-test-run.pl
>   --parallel=4 encryption.innochecksum encryption.innochecksum
> encryption.innochecksum encryption.innochecksum
> encryption.innochecksum encryption.innochecksum
> vardir: /home/dan/repos/build-mariadb-server-10.6/mysql-test/var
> Removing old var directory...
>   - WARNING: Using the 'mysql-test/var' symlink
> The destination for symlink
> /home/dan/repos/build-mariadb-server-10.6/mysql-test/var does not
> exist; Removing it and creating a new var directory
> Creating var directory
> '/home/dan/repos/build-mariadb-server-10.6/mysql-test/var'...
> Checking supported features...
> MariaDB Version 10.6.5-MariaDB
>   - SSL connections supported
>   - binaries built with wsrep patch
> Collecting tests...
> Installing system database...
> 
> ==============================================================================
> 
> TEST                                  WORKER RESULT   TIME (ms) or COMMENT
> --------------------------------------------------------------------------
> 
> worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
> worker[3] Using MTR_BUILD_THREAD 302, with reserved ports 16040..16059
> worker[2] Using MTR_BUILD_THREAD 301, with reserved ports 16020..16039
> worker[4] Using MTR_BUILD_THREAD 303, with reserved ports 16060..16079
> encryption.innochecksum '16k,cbc,innodb,strict_crc32' w3 [ pass ]   5460
> encryption.innochecksum '16k,cbc,innodb,strict_crc32' w2 [ pass ]   5418
> encryption.innochecksum '16k,cbc,innodb,strict_crc32' w1 [ pass ]   9391
> encryption.innochecksum '16k,cbc,innodb,strict_crc32' w3 [ pass ]   8682
> encryption.innochecksum '16k,cbc,innodb,strict_crc32' w3 [ pass ]   3873
> encryption.innochecksum '8k,cbc,innodb,strict_crc32' w1 [ pass ]   9133
> encryption.innochecksum '4k,cbc,innodb,strict_crc32' w2 [ pass ]  11074
> encryption.innochecksum '8k,cbc,innodb,strict_crc32' w1 [ pass ]   5253
> encryption.innochecksum '16k,cbc,innodb,strict_full_crc32' w3 [ pass ]   4019
> encryption.innochecksum '4k,cbc,innodb,strict_crc32' w2 [ pass ]   6318
> encryption.innochecksum '16k,cbc,innodb,strict_full_crc32' w3 [ pass ]   6176
> encryption.innochecksum '8k,cbc,innodb,strict_crc32' w1 [ pass ]   7305
> encryption.innochecksum '16k,cbc,innodb,strict_full_crc32' w3 [ pass ]   4430
> encryption.innochecksum '4k,cbc,innodb,strict_crc32' w2 [ pass ]  10005
> encryption.innochecksum '8k,cbc,innodb,strict_crc32' w1 [ pass ]   6878
> encryption.innochecksum '16k,cbc,innodb,strict_full_crc32' w3 [ pass ]   3613
> encryption.innochecksum '16k,cbc,innodb,strict_full_crc32' w3 [ pass ]   3875
> encryption.innochecksum '4k,cbc,innodb,strict_crc32' w2 [ pass ]   6612
> encryption.innochecksum '8k,cbc,innodb,strict_crc32' w1 [ pass ]   4901
> encryption.innochecksum '16k,cbc,innodb,strict_full_crc32' w3 [ pass ]   3853
> encryption.innochecksum '8k,cbc,innodb,strict_crc32' w1 [ pass ]   5080
> encryption.innochecksum '4k,cbc,innodb,strict_crc32' w2 [ pass ]   7072
> encryption.innochecksum '4k,cbc,innodb,strict_crc32' w2 [ pass ]   6774
> encryption.innochecksum '4k,cbc,innodb,strict_full_crc32' w3 [ pass ]   7037
> encryption.innochecksum '8k,cbc,innodb,strict_full_crc32' w1 [ pass ]   4961
> encryption.innochecksum '8k,cbc,innodb,strict_full_crc32' w1 [ pass ]   5692
> encryption.innochecksum '4k,cbc,innodb,strict_full_crc32' w3 [ pass ]   8449
> encryption.innochecksum '16k,ctr,innodb,strict_crc32' w2 [ pass ]   5515
> encryption.innochecksum '8k,cbc,innodb,strict_full_crc32' w1 [ pass ]   5650
> encryption.innochecksum '16k,ctr,innodb,strict_crc32' w2 [ pass ]   3722
> encryption.innochecksum '4k,cbc,innodb,strict_full_crc32' w3 [ pass ]   6691
> encryption.innochecksum '8k,cbc,innodb,strict_full_crc32' w1 [ pass ]   4611
> encryption.innochecksum '16k,ctr,innodb,strict_crc32' w2 [ pass ]   4587
> encryption.innochecksum '16k,ctr,innodb,strict_crc32' w2 [ pass ]   5465
> encryption.innochecksum '8k,cbc,innodb,strict_full_crc32' w1 [ pass ]   6900
> encryption.innochecksum '4k,cbc,innodb,strict_full_crc32' w3 [ pass ]   8333
> encryption.innochecksum '16k,ctr,innodb,strict_crc32' w2 [ pass ]   4691
> encryption.innochecksum '8k,cbc,innodb,strict_full_crc32' w1 [ pass ]   5077
> encryption.innochecksum '4k,cbc,innodb,strict_full_crc32' w3 [ pass ]   6319
> encryption.innochecksum '16k,ctr,innodb,strict_crc32' w2 [ pass ]   4590
> encryption.innochecksum '4k,ctr,innodb,strict_crc32' w1 [ pass ]   9683
> encryption.innochecksum '8k,ctr,innodb,strict_crc32' w2 [ pass ]   5404
> encryption.innochecksum '4k,ctr,innodb,strict_crc32' w1 [ pass ]   6775
> encryption.innochecksum '8k,ctr,innodb,strict_crc32' w2 [ pass ]   6190
> encryption.innochecksum '4k,ctr,innodb,strict_crc32' w1 [ pass ]   9354
> encryption.innochecksum '8k,ctr,innodb,strict_crc32' w2 [ pass ]   7734
> encryption.innochecksum '8k,ctr,innodb,strict_crc32' w2 [ pass ]   4993
> encryption.innochecksum '4k,ctr,innodb,strict_crc32' w1 [ pass ]   6280
> encryption.innochecksum '8k,ctr,innodb,strict_crc32' w2 [ pass ]   4487
> encryption.innochecksum '4k,ctr,innodb,strict_crc32' w1 [ pass ]   6971
> encryption.innochecksum '8k,ctr,innodb,strict_crc32' w2 [ pass ]   5172
> encryption.innochecksum '4k,ctr,innodb,strict_crc32' w1 [ pass ]   6317
> encryption.innochecksum '16k,ctr,innodb,strict_full_crc32' w2 [ pass ]   3371
> encryption.innochecksum '16k,ctr,innodb,strict_full_crc32' w2 [ pass ]   3472
> encryption.innochecksum '16k,ctr,innodb,strict_full_crc32' w2 [ pass ]   6707
> encryption.innochecksum '4k,ctr,innodb,strict_full_crc32' w1 [ pass ]   9337
> encryption.innochecksum '16k,ctr,innodb,strict_full_crc32' w2 [ pass ]   9176
> encryption.innochecksum '4k,ctr,innodb,strict_full_crc32' w1 [ pass ]  11817
> encryption.innochecksum '16k,ctr,innodb,strict_full_crc32' w2 [ pass ]   3419
> encryption.innochecksum '16k,ctr,innodb,strict_full_crc32' w2 [ pass ]   5256
> encryption.innochecksum '4k,ctr,innodb,strict_full_crc32' w1 [ pass ]   9291
> encryption.innochecksum '4k,ctr,innodb,strict_full_crc32' w1 [ pass ]   6508
> encryption.innochecksum '4k,ctr,innodb,strict_full_crc32' w2 [ pass ]   6294
> encryption.innochecksum '4k,ctr,innodb,strict_full_crc32' w1 [ pass ]   6327
> encryption.innochecksum '8k,ctr,innodb,strict_full_crc32' w2 [ pass ]   4579
> encryption.innochecksum '8k,ctr,innodb,strict_full_crc32' w1 [ pass ]   4764
> encryption.innochecksum '8k,ctr,innodb,strict_full_crc32' w2 [ pass ]   4469
> encryption.innochecksum '8k,ctr,innodb,strict_full_crc32' w1 [ pass ]   4677
> encryption.innochecksum '8k,ctr,innodb,strict_full_crc32' w2 [ pass ]   4696
> encryption.innochecksum '8k,ctr,innodb,strict_full_crc32' w1 [ pass ]   3898
> encryption.innochecksum '4k,cbc,innodb,strict_full_crc32' w3 [ pass ]  127358
> encryption.innochecksum '16k,cbc,innodb,strict_crc32' w4 [ fail ]
>          Test ended at 2021-10-25 21:39:13
> 
> CURRENT_TEST: encryption.innochecksum
> mysqltest: At line 41: query 'INSERT INTO t3 SELECT * FROM t1' failed:
> <Unknown> (2013): Lost connection to server during query
> 
> The result from queries just before the failure was:
> SET GLOBAL innodb_file_per_table = ON;
> set global innodb_compression_algorithm = 1;
> # Create and populate a tables
> CREATE TABLE t1 (a INT AUTO_INCREMENT PRIMARY KEY, b TEXT)
> ENGINE=InnoDB ENCRYPTED=YES ENCRYPTION_KEY_ID=4;
> CREATE TABLE t2 (a INT AUTO_INCREMENT PRIMARY KEY, b TEXT)
> ENGINE=InnoDB ROW_FORMAT=COMPRESSED ENCRYPTED=YES ENCRYPTION_KEY_ID=4;
> CREATE TABLE t3 (a INT AUTO_INCREMENT PRIMARY KEY, b TEXT)
> ENGINE=InnoDB ROW_FORMAT=COMPRESSED ENCRYPTED=NO;
> CREATE TABLE t4 (a INT AUTO_INCREMENT PRIMARY KEY, b TEXT)
> ENGINE=InnoDB PAGE_COMPRESSED=1;
> CREATE TABLE t5 (a INT AUTO_INCREMENT PRIMARY KEY, b TEXT)
> ENGINE=InnoDB PAGE_COMPRESSED=1 ENCRYPTED=YES ENCRYPTION_KEY_ID=4;
> CREATE TABLE t6 (a INT AUTO_INCREMENT PRIMARY KEY, b TEXT) ENGINE=InnoDB;
> 
> 
> Server [mysqld.1 - pid: 15380, winpid: 15380, exit: 256] failed during test run
> Server log from this test:
> ----------SERVER LOG START-----------
> $ /home/dan/repos/build-mariadb-server-10.6/sql/mariadbd
> --defaults-group-suffix=.1
> --defaults-file=/home/dan/repos/build-mariadb-server-10.6/mysql-test/var/4/my.cnf
> --log-output=file --innodb-page-size=16K
> --skip-innodb-read-only-compressed
> --innodb-checksum-algorithm=strict_crc32 --innodb-flush-sync=OFF
> --innodb --innodb-cmpmem --innodb-cmp-per-index --innodb-trx
> --innodb-locks --innodb-lock-waits --innodb-metrics
> --innodb-buffer-pool-stats --innodb-buffer-page
> --innodb-buffer-page-lru --innodb-sys-columns --innodb-sys-fields
> --innodb-sys-foreign --innodb-sys-foreign-cols --innodb-sys-indexes
> --innodb-sys-tables --innodb-sys-virtual
> --plugin-load-add=file_key_management.so --loose-file-key-management
> --loose-file-key-management-filename=/home/dan/repos/mariadb-server-10.6/mysql-test/std_data/keys.txt
> --file-key-management-encryption-algorithm=aes_cbc
> --skip-innodb-read-only-compressed --core-file
> --loose-debug-sync-timeout=300
> 2021-10-25 21:28:56 0 [Note]
> /home/dan/repos/build-mariadb-server-10.6/sql/mariadbd (server
> 10.6.5-MariaDB-log) starting as process 15381 ...
> 2021-10-25 21:28:56 0 [Warning] Could not increase number of
> max_open_files to more than 1024 (request: 32190)
> 2021-10-25 21:28:56 0 [Warning] Changed limits: max_open_files: 1024
> max_connections: 151 (was 151)  table_cache: 421 (was 2000)
> 2021-10-25 21:28:56 0 [Note] Plugin 'partition' is disabled.
> 2021-10-25 21:28:56 0 [Note] Plugin 'SEQUENCE' is disabled.
> 2021-10-25 21:28:56 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
> 2021-10-25 21:28:56 0 [Note] InnoDB: Number of pools: 1
> 2021-10-25 21:28:56 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
> 2021-10-25 21:28:56 0 [Note] InnoDB: Using liburing
> 2021-10-25 21:28:56 0 [Note] InnoDB: Initializing buffer pool, total
> size = 8388608, chunk size = 8388608
> 2021-10-25 21:28:56 0 [Note] InnoDB: Completed initialization of buffer pool
> 2021-10-25 21:28:56 0 [Note] InnoDB: 128 rollback segments are active.
> 2021-10-25 21:28:56 0 [Note] InnoDB: Creating shared tablespace for
> temporary tables
> 2021-10-25 21:28:56 0 [Note] InnoDB: Setting file './ibtmp1' size to
> 12 MB. Physically writing the file full; Please wait ...
> 2021-10-25 21:28:56 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
> 2021-10-25 21:28:56 0 [Note] InnoDB: 10.6.5 started; log sequence
> number 43637; transaction id 17
> 2021-10-25 21:28:56 0 [Note] InnoDB: Loading buffer pool(s) from
> /home/dan/repos/build-mariadb-server-10.6/mysql-test/var/4/mysqld.1/data/ib_buffer_pool
> 2021-10-25 21:28:56 0 [Note] Plugin 'INNODB_FT_CONFIG' is disabled.
> 2021-10-25 21:28:56 0 [Note] Plugin 'INNODB_SYS_TABLESTATS' is disabled.
> 2021-10-25 21:28:56 0 [Note] Plugin 'INNODB_FT_DELETED' is disabled.
> 2021-10-25 21:28:56 0 [Note] Plugin 'INNODB_CMP' is disabled.
> 2021-10-25 21:28:56 0 [Note] Plugin 'THREAD_POOL_WAITS' is disabled.
> 2021-10-25 21:28:56 0 [Note] Plugin 'INNODB_CMP_RESET' is disabled.
> 2021-10-25 21:28:56 0 [Note] Plugin 'THREAD_POOL_QUEUES' is disabled.
> 2021-10-25 21:28:56 0 [Note] Plugin 'FEEDBACK' is disabled.
> 2021-10-25 21:28:56 0 [Note] Plugin 'INNODB_FT_INDEX_TABLE' is disabled.
> 2021-10-25 21:28:56 0 [Note] Plugin 'THREAD_POOL_GROUPS' is disabled.
> 2021-10-25 21:28:56 0 [Note] Plugin 'INNODB_CMP_PER_INDEX_RESET' is disabled.
> 2021-10-25 21:28:56 0 [Note] Plugin 'INNODB_FT_INDEX_CACHE' is disabled.
> 2021-10-25 21:28:56 0 [Note] Plugin 'INNODB_FT_BEING_DELETED' is disabled.
> 2021-10-25 21:28:56 0 [Note] Plugin 'INNODB_CMPMEM_RESET' is disabled.
> 2021-10-25 21:28:56 0 [Note] Plugin 'INNODB_FT_DEFAULT_STOPWORD' is disabled.
> 2021-10-25 21:28:56 0 [Note] Plugin 'INNODB_SYS_TABLESPACES' is disabled.
> 2021-10-25 21:28:56 0 [Note] Plugin 'user_variables' is disabled.
> 2021-10-25 21:28:56 0 [Note] Plugin 'INNODB_TABLESPACES_ENCRYPTION' is disabled.
> 2021-10-25 21:28:56 0 [Note] Plugin 'THREAD_POOL_STATS' is disabled.
> 2021-10-25 21:28:56 0 [Note] Plugin 'unix_socket' is disabled.
> 2021-10-25 21:28:56 0 [Warning]
> /home/dan/repos/build-mariadb-server-10.6/sql/mariadbd: unknown
> variable 'loose-feedback-debug-startup-interval=20'
> 2021-10-25 21:28:56 0 [Warning]
> /home/dan/repos/build-mariadb-server-10.6/sql/mariadbd: unknown
> variable 'loose-feedback-debug-first-interval=60'
> 2021-10-25 21:28:56 0 [Warning]
> /home/dan/repos/build-mariadb-server-10.6/sql/mariadbd: unknown
> variable 'loose-feedback-debug-interval=60'
> 2021-10-25 21:28:56 0 [Warning]
> /home/dan/repos/build-mariadb-server-10.6/sql/mariadbd: unknown option
> '--loose-pam-debug'
> 2021-10-25 21:28:56 0 [Warning]
> /home/dan/repos/build-mariadb-server-10.6/sql/mariadbd: unknown option
> '--loose-aria'
> 2021-10-25 21:28:56 0 [Warning]
> /home/dan/repos/build-mariadb-server-10.6/sql/mariadbd: unknown
> variable 'loose-debug-sync-timeout=300'
> 2021-10-25 21:28:56 0 [Note] Server socket created on IP: '127.0.0.1'.
> 2021-10-25 21:28:56 0 [Note]
> /home/dan/repos/build-mariadb-server-10.6/sql/mariadbd: ready for
> connections.
> Version: '10.6.5-MariaDB-log'  socket:
> '/home/dan/repos/build-mariadb-server-10.6/mysql-test/var/tmp/4/mysqld.1.sock'
>   port: 16060  Source distribution
> 2021-10-25 21:28:56 0 [Note] InnoDB: Buffer pool(s) load completed at
> 211025 21:28:56
> 2021-10-25 21:39:11 0 [ERROR] [FATAL] InnoDB:
> innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch
> 

-- 
Pavel Begunkov


* Re: uring regression - lost write request
  2021-10-25 11:25       ` Pavel Begunkov
@ 2021-10-30  7:30         ` Salvatore Bonaccorso
  2021-11-01  7:28           ` Daniel Black
  0 siblings, 1 reply; 36+ messages in thread
From: Salvatore Bonaccorso @ 2021-10-30  7:30 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: Daniel Black, linux-block, io-uring

Hi Daniel,

On Mon, Oct 25, 2021 at 12:25:01PM +0100, Pavel Begunkov wrote:
> On 10/25/21 12:09, Daniel Black wrote:
> > On Mon, Oct 25, 2021 at 8:59 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
> > > 
> > > On 10/22/21 10:10, Pavel Begunkov wrote:
> > > > On 10/22/21 04:12, Daniel Black wrote:
> > > > > Sometime after 5.11 and is fixed in 5.15-rcX (rc6 extensively tested
> > > > > over last few days) is a kernel regression we are tracing in
> > > > > https://jira.mariadb.org/browse/MDEV-26674 and
> > > > > https://jira.mariadb.org/browse/MDEV-26555
> > > > > 5.10 and early across many distros and hardware appear not to have a problem.
> > > > > 
> > > > > I'd appreciate some help identifying a 5.14 linux stable patch
> > > > > suitable as I observe the fault in mainline 5.14.14 (built
> > > > 
> > > > Cc: io-uring@vger.kernel.org
> > > > 
> > > > Let me try to remember anything relevant from 5.15,
> > > > Thanks for letting know
> > > 
> > > Daniel, following the links I found this:
> > > 
> > > "From: Daniel Black <daniel@mariadb.org>
> > > ...
> > > The good news is I've validated that the linux mainline 5.14.14 build
> > > from https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.14.14/ has
> > > actually fixed this problem."
> > > 
> > > To be clear, is the mainline 5.14 kernel affected with the issue?
> > > Or does the problem exists only in debian/etc. kernel trees?
> > 
> > Thanks Pavel for looking.
> > 
> > I'm retesting https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.14.14/
> > in earnest. I did get some assertions, but they may have been
> > unrelated. The testing continues...
> 
> Thanks for the work on pinpointing it. I'll wait for your conclusion
> then, it'll give us an idea what we should look for.

Were you able to pinpoint the issue?

Regards,
Salvatore


* Re: uring regression - lost write request
  2021-10-30  7:30         ` Salvatore Bonaccorso
@ 2021-11-01  7:28           ` Daniel Black
  2021-11-09 22:58             ` Daniel Black
  0 siblings, 1 reply; 36+ messages in thread
From: Daniel Black @ 2021-11-01  7:28 UTC (permalink / raw)
  To: Salvatore Bonaccorso; +Cc: Pavel Begunkov, linux-block, io-uring

[-- Attachment #1: Type: text/plain, Size: 1393 bytes --]

On Sat, Oct 30, 2021 at 6:30 PM Salvatore Bonaccorso <carnil@debian.org> wrote:

> > > I'm retesting https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.14.14/
> > > in earnest. I did get some assertions, but they may have been
> > > unrelated. The testing continues...
> >
> > Thanks for the work on pinpointing it. I'll wait for your conclusion
> > then, it'll give us an idea what we should look for.
>
> Were you able to pinpoint the issue?

Retesting on the Ubuntu mainline 5.14.14 and 5.14.15 kernels, I was unable
to reproduce the issue in a VM.

Using the Fedora 34 5.14.14 and 5.14.15 kernels I am reasonably reliably
able to reproduce this, and it is now reported as
https://bugzilla.redhat.com/show_bug.cgi?id=2018882.

I've so far been unable to reproduce this issue on 5.15.0-rc7 inside an
(Ubuntu 21.10) VM.

Marko, using another heavy-flushing sysbench script (a modified version
is attached - slightly lower specs, and it can be used on a distro
install), was able to see the fault (qps drops to 0) using Debian sid
userspace and the 5.15-rc6/5.15-rc7 Ubuntu mainline kernels:
https://jira.mariadb.org/browse/MDEV-26674?focusedCommentId=203645&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-203645

Note that if you are using mariadb-10.6.5 (not quite released), there is a
change of defaults to avoid this bug; the mtr options
--mysqld=--innodb_use_native_aio=1 --nowarnings will still exercise it,
however.

[-- Attachment #2: Mariarebench-MDEV-23855.sh --]
[-- Type: application/x-shellscript, Size: 3163 bytes --]


* Re: uring regression - lost write request
  2021-11-01  7:28           ` Daniel Black
@ 2021-11-09 22:58             ` Daniel Black
  2021-11-09 23:24               ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Daniel Black @ 2021-11-09 22:58 UTC (permalink / raw)
  To: Salvatore Bonaccorso; +Cc: Pavel Begunkov, linux-block, io-uring

> On Sat, Oct 30, 2021 at 6:30 PM Salvatore Bonaccorso <carnil@debian.org> wrote:
> > Were you able to pinpoint the issue?

While I have been unable to reproduce this on a single-CPU machine, Marko
can repeat a stall on a dual Broadwell chipset on these kernels:

* 5.15.1 - https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.15.1
* 5.14.16 - https://packages.debian.org/sid/linux-image-5.14.0-4-amd64

Detailed observations:
https://jira.mariadb.org/browse/MDEV-26674

The previous script has been adapted to use the MariaDB 10.6 packages and
sysbench to demonstrate a workload; I've changed Marko's script to
work with the distro packages and to use innodb_use_native_aio=1.

MariaDB packages:

https://mariadb.org/download/?t=repo-config
(this needs a distro that ships the liburing userspace library as standard)

Script:

https://jira.mariadb.org/secure/attachment/60358/Mariabench-MDEV-26674-io_uring-1

The failure state is reached either when the sysbench prepare stalls, or
when the tps printed at 5-second intervals falls to 0.


* Re: uring regression - lost write request
  2021-11-09 22:58             ` Daniel Black
@ 2021-11-09 23:24               ` Jens Axboe
  2021-11-10 18:01                 ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2021-11-09 23:24 UTC (permalink / raw)
  To: Daniel Black, Salvatore Bonaccorso; +Cc: Pavel Begunkov, linux-block, io-uring

On 11/9/21 3:58 PM, Daniel Black wrote:
>> On Sat, Oct 30, 2021 at 6:30 PM Salvatore Bonaccorso <carnil@debian.org> wrote:
>>> Were you able to pinpoint the issue?
> 
> While I have been unable to reproduce this on a single cpu, Marko can
> repeat a stall on a dual Broadwell chipset on kernels:
> 
> * 5.15.1 - https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.15.1
> * 5.14.16 - https://packages.debian.org/sid/linux-image-5.14.0-4-amd64
> 
> Detailed observations:
> https://jira.mariadb.org/browse/MDEV-26674
> 
> The previous script has been adapted to use MariaDB-10.6 package and
> sysbench to demonstrate a workload, I've changed Marko's script to
> work with the distro packages and use innodb_use_native_aio=1.
> 
> MariaDB packages:
> 
> https://mariadb.org/download/?t=repo-config
> (needs a distro that has liburing userspace libraries as standard support)
> 
> Script:
> 
> https://jira.mariadb.org/secure/attachment/60358/Mariabench-MDEV-26674-io_uring-1
> 
> The state is achieved either when the sysbench prepare stalls, or the
> tps printed at 5 second intervals falls to 0.

Thanks, this is most useful! I'll take a look at this.

-- 
Jens Axboe



* Re: uring regression - lost write request
  2021-11-09 23:24               ` Jens Axboe
@ 2021-11-10 18:01                 ` Jens Axboe
  2021-11-11  6:52                   ` Daniel Black
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2021-11-10 18:01 UTC (permalink / raw)
  To: Daniel Black, Salvatore Bonaccorso; +Cc: Pavel Begunkov, linux-block, io-uring

On 11/9/21 4:24 PM, Jens Axboe wrote:
> On 11/9/21 3:58 PM, Daniel Black wrote:
>>> On Sat, Oct 30, 2021 at 6:30 PM Salvatore Bonaccorso <carnil@debian.org> wrote:
>>>> Were you able to pinpoint the issue?
>>
>> While I have been unable to reproduce this on a single cpu, Marko can
>> repeat a stall on a dual Broadwell chipset on kernels:
>>
>> * 5.15.1 - https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.15.1
>> * 5.14.16 - https://packages.debian.org/sid/linux-image-5.14.0-4-amd64
>>
>> Detailed observations:
>> https://jira.mariadb.org/browse/MDEV-26674
>>
>> The previous script has been adapted to use MariaDB-10.6 package and
>> sysbench to demonstrate a workload, I've changed Marko's script to
>> work with the distro packages and use innodb_use_native_aio=1.
>>
>> MariaDB packages:
>>
>> https://mariadb.org/download/?t=repo-config
>> (needs a distro that has liburing userspace libraries as standard support)
>>
>> Script:
>>
>> https://jira.mariadb.org/secure/attachment/60358/Mariabench-MDEV-26674-io_uring-1
>>
>> The state is achieved either when the sysbench prepare stalls, or the
>> tps printed at 5 second intervals falls to 0.
> 
> Thanks, this is most useful! I'll take a look at this.

Would it be possible to turn this into a full reproducer script?
Something that someone that knows nothing about mysqld/mariadb can just
run and have it reproduce. If I install the 10.6 packages from above,
then it doesn't seem to use io_uring or be linked against liburing.

The script also seems to assume that various things are setup
appropriately, like SRCTREE, MDIR, etc.

-- 
Jens Axboe



* Re: uring regression - lost write request
  2021-11-10 18:01                 ` Jens Axboe
@ 2021-11-11  6:52                   ` Daniel Black
  2021-11-11 14:30                     ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Daniel Black @ 2021-11-11  6:52 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Salvatore Bonaccorso, Pavel Begunkov, linux-block, io-uring

> Would it be possible to turn this into a full reproducer script?
> Something that someone that knows nothing about mysqld/mariadb can just
> run and have it reproduce. If I install the 10.6 packages from above,
> then it doesn't seem to use io_uring or be linked against liburing.

Sorry Jens.

Hope containers are ok.

mkdir ~/mdbtest/

$ podman run -d -e MARIADB_ALLOW_EMPTY_ROOT_PASSWORD=1 -e
MARIADB_USER=sbtest -e MARIADB_PASSWORD=sbtest -e
MARIADB_DATABASE=sbtest  --name mdb10.6-uring_test -v
$HOME/mdbtest:/var/lib/mysql:Z  --security-opt seccomp=unconfined
quay.io/danielgblack/mariadb-test:10.6-impish-sysbench
--innodb_log_file_size=1G  --innodb_buffer_pool_size=50G
--innodb_io_capacity=5000  --innodb_io_capacity_max=9000
--innodb_flush_log_at_trx_commit=0   --innodb_adaptive_flushing_lwm=0
 --innodb-adaptive-flushing=1   --innodb_flush_neighbors=1
--innodb-use-native-aio=1   --innodb_file-per-table=1
--innodb-fast-shutdown=0   --innodb-flush-method=O_DIRECT
--innodb_lru_scan_depth=1024   --innodb_lru_flush_size=256


# Drop the 50G pool size down if you don't have that much memory; it is not
critical to reproduction. The IO capacity here should be roughly what the
hardware can actually do, otherwise gaps of 0 tps will appear without the
bug being the cause.

$ podman logs mdb10.6-uring_test
...
2021-11-11  6:06:49 0 [Warning] innodb_use_native_aio may cause hangs
with this kernel 5.15.0-0.rc7.20211028git1fc596a56b33.56.fc36.x86_64;
see https://jira.mariadb.org/browse/MDEV-26674
2021-11-11  6:06:49 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2021-11-11  6:06:49 0 [Note] InnoDB: Number of pools: 1
2021-11-11  6:06:49 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2021-11-11  6:06:49 0 [Note] mysqld: O_TMPFILE is not supported on
/tmp (disabling future attempts)
2021-11-11  6:06:49 0 [Note] InnoDB: Using liburing

The log should contain the first and last lines shown above.

$ podman exec  mdb10.6-uring_test sysbench
/usr/share/sysbench/oltp_update_index.lua --mysql-password=sbtest
--percentile=99  --tables=8 --table_size=2000000 prepare

Creating table 'sbtest1'...
Inserting 2000000 records into 'sbtest1'
Creating a secondary index on 'sbtest1'...
Creating table 'sbtest2'...
Inserting 2000000 records into 'sbtest2'
Creating a secondary index on 'sbtest2'...
Creating table 'sbtest3'...
Inserting 2000000 records into 'sbtest3'
Creating a secondary index on 'sbtest3'...
Creating table 'sbtest4'...
Inserting 2000000 records into 'sbtest4'
Creating a secondary index on 'sbtest4'...
Creating table 'sbtest5'...
Inserting 2000000 records into 'sbtest5'
Creating a secondary index on 'sbtest5'...
Creating table 'sbtest6'...
Inserting 2000000 records into 'sbtest6'
Creating a secondary index on 'sbtest6'...
Creating table 'sbtest7'...
Inserting 2000000 records into 'sbtest7'
Creating a secondary index on 'sbtest7'...
Creating table 'sbtest8'...
Inserting 2000000 records into 'sbtest8'
Creating a secondary index on 'sbtest8'...


# Adjust --threads to the number of hardware threads available; --time is
# the length of the test in seconds.

$ podman exec  mdb10.6-uring_test sysbench
/usr/share/sysbench/oltp_update_index.lua --mysql-password=sbtest
--percentile=99  --tables=8 --table_size=2000000 --rand-seed=42
--rand-type=uniform --max-requests=0 --time=600 --report-interval=5
--threads=64 run



Eventually, after the
https://mariadb.com/kb/en/innodb-system-variables/#innodb_fatal_semaphore_wait_threshold
of 600 seconds, podman logs mdb10.6-uring_test will contain an
error like:

2021-10-07 17:06:43 0 [ERROR] [FATAL] InnoDB:
innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch.
Please refer to
https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/
211007 17:06:43 [ERROR] mysqld got signal 6 ;


Restarting the container on the same populated ~/mdbtest volume can be
slow due to recovery time; remove its contents and repeat the prepare
step instead.

cleanup:

podman kill mdb10.6-uring_test
podman rm mdb10.6-uring_test
sudo rm -rf ~/mdbtest


* Re: uring regression - lost write request
  2021-11-11  6:52                   ` Daniel Black
@ 2021-11-11 14:30                     ` Jens Axboe
  2021-11-11 14:58                       ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2021-11-11 14:30 UTC (permalink / raw)
  To: Daniel Black; +Cc: Salvatore Bonaccorso, Pavel Begunkov, linux-block, io-uring

On 11/10/21 11:52 PM, Daniel Black wrote:
>> Would it be possible to turn this into a full reproducer script?
>> Something that someone that knows nothing about mysqld/mariadb can just
>> run and have it reproduce. If I install the 10.6 packages from above,
>> then it doesn't seem to use io_uring or be linked against liburing.
> 
> Sorry Jens.
> 
> Hope containers are ok.

I don't think I have a way to run that - I don't even know what podman is,
nor does my distro. I'll google a bit and see if I can get this
running.

I'm fine building from source and running from there, as long as I
know what to do. Would that make it any easier? It definitely would
for me :-)

-- 
Jens Axboe



* Re: uring regression - lost write request
  2021-11-11 14:30                     ` Jens Axboe
@ 2021-11-11 14:58                       ` Jens Axboe
  2021-11-11 15:29                         ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2021-11-11 14:58 UTC (permalink / raw)
  To: Daniel Black; +Cc: Salvatore Bonaccorso, Pavel Begunkov, linux-block, io-uring

On 11/11/21 7:30 AM, Jens Axboe wrote:
> On 11/10/21 11:52 PM, Daniel Black wrote:
>>> Would it be possible to turn this into a full reproducer script?
>>> Something that someone that knows nothing about mysqld/mariadb can just
>>> run and have it reproduce. If I install the 10.6 packages from above,
>>> then it doesn't seem to use io_uring or be linked against liburing.
>>
>> Sorry Jens.
>>
>> Hope containers are ok.
> 
> Don't think I have a way to run that, don't even know what podman is
> and nor does my distro. I'll google a bit and see if I can get this
> running.
> 
> I'm fine building from source and running from there, as long as I
> know what to do. Would that make it any easier? It definitely would
> for me :-)

The podman approach seemed to work, and I was able to run all three
steps. Didn't see any hangs. I'm going to try again dropping down
the innodb pool size (box only has 32G of RAM).

The storage can do a lot more than 5k IOPS; I'm going to try ramping
that up.

Does your reproducer box have multiple NUMA nodes, or is it a single
socket/node box?

-- 
Jens Axboe



* Re: uring regression - lost write request
  2021-11-11 14:58                       ` Jens Axboe
@ 2021-11-11 15:29                         ` Jens Axboe
  2021-11-11 16:19                           ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2021-11-11 15:29 UTC (permalink / raw)
  To: Daniel Black; +Cc: Salvatore Bonaccorso, Pavel Begunkov, linux-block, io-uring

On 11/11/21 7:58 AM, Jens Axboe wrote:
> On 11/11/21 7:30 AM, Jens Axboe wrote:
>> On 11/10/21 11:52 PM, Daniel Black wrote:
>>>> Would it be possible to turn this into a full reproducer script?
>>>> Something that someone that knows nothing about mysqld/mariadb can just
>>>> run and have it reproduce. If I install the 10.6 packages from above,
>>>> then it doesn't seem to use io_uring or be linked against liburing.
>>>
>>> Sorry Jens.
>>>
>>> Hope containers are ok.
>>
>> Don't think I have a way to run that, don't even know what podman is
>> and nor does my distro. I'll google a bit and see if I can get this
>> running.
>>
>> I'm fine building from source and running from there, as long as I
>> know what to do. Would that make it any easier? It definitely would
>> for me :-)
> 
> The podman approach seemed to work, and I was able to run all three
> steps. Didn't see any hangs. I'm going to try again dropping down
> the innodb pool size (box only has 32G of RAM).
> 
> The storage can do a lot more than 5k IOPS, I'm going to try ramping
> that up.
> 
> Does your reproducer box have multiple NUMA nodes, or is it a single
> socket/nod box?

Doesn't seem to reproduce for me on current -git. What file system are
you using?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: uring regression - lost write request
  2021-11-11 15:29                         ` Jens Axboe
@ 2021-11-11 16:19                           ` Jens Axboe
  2021-11-11 16:55                             ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2021-11-11 16:19 UTC (permalink / raw)
  To: Daniel Black; +Cc: Salvatore Bonaccorso, Pavel Begunkov, linux-block, io-uring

On 11/11/21 8:29 AM, Jens Axboe wrote:
> On 11/11/21 7:58 AM, Jens Axboe wrote:
>> On 11/11/21 7:30 AM, Jens Axboe wrote:
>>> On 11/10/21 11:52 PM, Daniel Black wrote:
>>>>> Would it be possible to turn this into a full reproducer script?
>>>>> Something that someone that knows nothing about mysqld/mariadb can just
>>>>> run and have it reproduce. If I install the 10.6 packages from above,
>>>>> then it doesn't seem to use io_uring or be linked against liburing.
>>>>
>>>> Sorry Jens.
>>>>
>>>> Hope containers are ok.
>>>
>>> Don't think I have a way to run that, don't even know what podman is
>>> and nor does my distro. I'll google a bit and see if I can get this
>>> running.
>>>
>>> I'm fine building from source and running from there, as long as I
>>> know what to do. Would that make it any easier? It definitely would
>>> for me :-)
>>
>> The podman approach seemed to work, and I was able to run all three
>> steps. Didn't see any hangs. I'm going to try again dropping down
>> the innodb pool size (box only has 32G of RAM).
>>
>> The storage can do a lot more than 5k IOPS, I'm going to try ramping
>> that up.
>>
>> Does your reproducer box have multiple NUMA nodes, or is it a single
>> socket/nod box?
> 
> Doesn't seem to reproduce for me on current -git. What file system are
> you using?

I seem to be able to hit it with ext4, guessing it has more cases that
punt to buffered IO. As I initially suspected, I think this is a race
with buffered file write hashing. I have a debug patch that just turns
a regular non-NUMA box into multiple nodes, which may or may not be
needed to hit this, but I definitely can now. Looks like this:

Node7 DUMP                                                                      
index=0, nr_w=1, max=128, r=0, f=1, h=0                                         
  w=ffff8f5e8b8470c0, hashed=1/0, flags=2                                       
  w=ffff8f5e95a9b8c0, hashed=1/0, flags=2                                       
index=1, nr_w=0, max=127877, r=0, f=0, h=0                                      
free_list                                                                       
  worker=ffff8f5eaf2e0540                                                       
all_list                                                                        
  worker=ffff8f5eaf2e0540

where we see node7 in this case having two work items pending, but the
worker state is stalled on hash.

The hash logic was rewritten as part of the io-wq worker threads being
changed for 5.11 iirc, which is why that was my initial suspicion here.
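
For readers not familiar with io-wq: buffered writes are queued as "hashed"
work, keyed roughly on the file's inode, so only one buffered write per file
runs at a time, and a worker that only finds hashed work for busy buckets
parks on a wait queue until the bucket owner clears it. Below is a
stand-alone, simplified model of that selection logic (user-space C with
invented names, not the kernel code):

/*
 * Hashed work for a bucket that is already running elsewhere is skipped;
 * if nothing else is runnable the caller parks on a wait queue until the
 * bucket is cleared.  The stall being chased here is a missed wake-up on
 * that wait queue.
 */
#include <stdbool.h>
#include <stddef.h>

#define HASH_BUCKETS 32

struct work {
	struct work *next;
	bool hashed;          /* e.g. a buffered write, serialized per file */
	unsigned int bucket;  /* hash bucket, < HASH_BUCKETS                */
};

static unsigned long busy_map;  /* one bit per currently-running bucket */
static struct work *work_list;  /* pending work, singly linked          */

/* Return the next runnable item, or NULL if everything left is hashed
 * to a busy bucket (the caller would then wait on the hash). */
struct work *get_next_work(void)
{
	struct work **pp, *w;

	for (pp = &work_list; (w = *pp) != NULL; pp = &w->next) {
		if (w->hashed && (busy_map & (1UL << w->bucket)))
			continue;                       /* someone else owns this file */
		if (w->hashed)
			busy_map |= 1UL << w->bucket;   /* claim the bucket */
		*pp = w->next;
		return w;
	}
	return NULL;
}

/* Completion side: release the bucket; the real code must then wake any
 * worker parked waiting for it, which is what the fix below is about. */
void finish_work(struct work *w)
{
	if (w->hashed)
		busy_map &= ~(1UL << w->bucket);
}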

I'll take a look at this and make a test patch. Looks like you are able
to test self-built kernels, is that correct?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: uring regression - lost write request
  2021-11-11 16:19                           ` Jens Axboe
@ 2021-11-11 16:55                             ` Jens Axboe
  2021-11-11 17:28                               ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2021-11-11 16:55 UTC (permalink / raw)
  To: Daniel Black; +Cc: Salvatore Bonaccorso, Pavel Begunkov, linux-block, io-uring

On 11/11/21 9:19 AM, Jens Axboe wrote:
> On 11/11/21 8:29 AM, Jens Axboe wrote:
>> On 11/11/21 7:58 AM, Jens Axboe wrote:
>>> On 11/11/21 7:30 AM, Jens Axboe wrote:
>>>> On 11/10/21 11:52 PM, Daniel Black wrote:
>>>>>> Would it be possible to turn this into a full reproducer script?
>>>>>> Something that someone that knows nothing about mysqld/mariadb can just
>>>>>> run and have it reproduce. If I install the 10.6 packages from above,
>>>>>> then it doesn't seem to use io_uring or be linked against liburing.
>>>>>
>>>>> Sorry Jens.
>>>>>
>>>>> Hope containers are ok.
>>>>
>>>> Don't think I have a way to run that, don't even know what podman is
>>>> and nor does my distro. I'll google a bit and see if I can get this
>>>> running.
>>>>
>>>> I'm fine building from source and running from there, as long as I
>>>> know what to do. Would that make it any easier? It definitely would
>>>> for me :-)
>>>
>>> The podman approach seemed to work, and I was able to run all three
>>> steps. Didn't see any hangs. I'm going to try again dropping down
>>> the innodb pool size (box only has 32G of RAM).
>>>
>>> The storage can do a lot more than 5k IOPS, I'm going to try ramping
>>> that up.
>>>
>>> Does your reproducer box have multiple NUMA nodes, or is it a single
>>> socket/nod box?
>>
>> Doesn't seem to reproduce for me on current -git. What file system are
>> you using?
> 
> I seem to be able to hit it with ext4, guessing it has more cases that
> punt to buffered IO. As I initially suspected, I think this is a race
> with buffered file write hashing. I have a debug patch that just turns
> a regular non-numa box into multi nodes, may or may not be needed be
> needed to hit this, but I definitely can now. Looks like this:
> 
> Node7 DUMP                                                                      
> index=0, nr_w=1, max=128, r=0, f=1, h=0                                         
>   w=ffff8f5e8b8470c0, hashed=1/0, flags=2                                       
>   w=ffff8f5e95a9b8c0, hashed=1/0, flags=2                                       
> index=1, nr_w=0, max=127877, r=0, f=0, h=0                                      
> free_list                                                                       
>   worker=ffff8f5eaf2e0540                                                       
> all_list                                                                        
>   worker=ffff8f5eaf2e0540
> 
> where we seed node7 in this case having two work items pending, but the
> worker state is stalled on hash.
> 
> The hash logic was rewritten as part of the io-wq worker threads being
> changed for 5.11 iirc, which is why that was my initial suspicion here.
> 
> I'll take a look at this and make a test patch. Looks like you are able
> to test self-built kernels, is that correct?

Can you try with this patch? It's against -git, but it will apply to
5.15 as well.


diff --git a/fs/io-wq.c b/fs/io-wq.c
index afd955d53db9..7917b8866dcc 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -423,9 +423,10 @@ static inline unsigned int io_get_work_hash(struct io_wq_work *work)
 	return work->flags >> IO_WQ_HASH_SHIFT;
 }
 
-static void io_wait_on_hash(struct io_wqe *wqe, unsigned int hash)
+static bool io_wait_on_hash(struct io_wqe *wqe, unsigned int hash)
 {
 	struct io_wq *wq = wqe->wq;
+	bool ret = false;
 
 	spin_lock_irq(&wq->hash->wait.lock);
 	if (list_empty(&wqe->wait.entry)) {
@@ -433,9 +434,11 @@ static void io_wait_on_hash(struct io_wqe *wqe, unsigned int hash)
 		if (!test_bit(hash, &wq->hash->map)) {
 			__set_current_state(TASK_RUNNING);
 			list_del_init(&wqe->wait.entry);
+			ret = true;
 		}
 	}
 	spin_unlock_irq(&wq->hash->wait.lock);
+	return ret;
 }
 
 static struct io_wq_work *io_get_next_work(struct io_wqe_acct *acct,
@@ -447,6 +450,7 @@ static struct io_wq_work *io_get_next_work(struct io_wqe_acct *acct,
 	unsigned int stall_hash = -1U;
 	struct io_wqe *wqe = worker->wqe;
 
+retry:
 	wq_list_for_each(node, prev, &acct->work_list) {
 		unsigned int hash;
 
@@ -475,14 +479,18 @@ static struct io_wq_work *io_get_next_work(struct io_wqe_acct *acct,
 	}
 
 	if (stall_hash != -1U) {
+		bool do_retry;
+
 		/*
 		 * Set this before dropping the lock to avoid racing with new
 		 * work being added and clearing the stalled bit.
 		 */
 		set_bit(IO_ACCT_STALLED_BIT, &acct->flags);
 		raw_spin_unlock(&wqe->lock);
-		io_wait_on_hash(wqe, stall_hash);
+		do_retry = io_wait_on_hash(wqe, stall_hash);
 		raw_spin_lock(&wqe->lock);
+		if (do_retry)
+			goto retry;
 	}
 
 	return NULL;

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: uring regression - lost write request
  2021-11-11 16:55                             ` Jens Axboe
@ 2021-11-11 17:28                               ` Jens Axboe
  2021-11-11 23:44                                 ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2021-11-11 17:28 UTC (permalink / raw)
  To: Daniel Black; +Cc: Salvatore Bonaccorso, Pavel Begunkov, linux-block, io-uring

On 11/11/21 9:55 AM, Jens Axboe wrote:
> On 11/11/21 9:19 AM, Jens Axboe wrote:
>> On 11/11/21 8:29 AM, Jens Axboe wrote:
>>> On 11/11/21 7:58 AM, Jens Axboe wrote:
>>>> On 11/11/21 7:30 AM, Jens Axboe wrote:
>>>>> On 11/10/21 11:52 PM, Daniel Black wrote:
>>>>>>> Would it be possible to turn this into a full reproducer script?
>>>>>>> Something that someone that knows nothing about mysqld/mariadb can just
>>>>>>> run and have it reproduce. If I install the 10.6 packages from above,
>>>>>>> then it doesn't seem to use io_uring or be linked against liburing.
>>>>>>
>>>>>> Sorry Jens.
>>>>>>
>>>>>> Hope containers are ok.
>>>>>
>>>>> Don't think I have a way to run that, don't even know what podman is
>>>>> and nor does my distro. I'll google a bit and see if I can get this
>>>>> running.
>>>>>
>>>>> I'm fine building from source and running from there, as long as I
>>>>> know what to do. Would that make it any easier? It definitely would
>>>>> for me :-)
>>>>
>>>> The podman approach seemed to work, and I was able to run all three
>>>> steps. Didn't see any hangs. I'm going to try again dropping down
>>>> the innodb pool size (box only has 32G of RAM).
>>>>
>>>> The storage can do a lot more than 5k IOPS, I'm going to try ramping
>>>> that up.
>>>>
>>>> Does your reproducer box have multiple NUMA nodes, or is it a single
>>>> socket/nod box?
>>>
>>> Doesn't seem to reproduce for me on current -git. What file system are
>>> you using?
>>
>> I seem to be able to hit it with ext4, guessing it has more cases that
>> punt to buffered IO. As I initially suspected, I think this is a race
>> with buffered file write hashing. I have a debug patch that just turns
>> a regular non-numa box into multi nodes, may or may not be needed be
>> needed to hit this, but I definitely can now. Looks like this:
>>
>> Node7 DUMP                                                                      
>> index=0, nr_w=1, max=128, r=0, f=1, h=0                                         
>>   w=ffff8f5e8b8470c0, hashed=1/0, flags=2                                       
>>   w=ffff8f5e95a9b8c0, hashed=1/0, flags=2                                       
>> index=1, nr_w=0, max=127877, r=0, f=0, h=0                                      
>> free_list                                                                       
>>   worker=ffff8f5eaf2e0540                                                       
>> all_list                                                                        
>>   worker=ffff8f5eaf2e0540
>>
>> where we seed node7 in this case having two work items pending, but the
>> worker state is stalled on hash.
>>
>> The hash logic was rewritten as part of the io-wq worker threads being
>> changed for 5.11 iirc, which is why that was my initial suspicion here.
>>
>> I'll take a look at this and make a test patch. Looks like you are able
>> to test self-built kernels, is that correct?
> 
> Can you try with this patch? It's against -git, but it will apply to
> 5.15 as well.

I think that one covered one potential gap, but I just managed to
reproduce a stall even with it. So hang on testing that one, I'll send
you something more complete when I have confidence in it.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: uring regression - lost write request
  2021-11-11 17:28                               ` Jens Axboe
@ 2021-11-11 23:44                                 ` Jens Axboe
  2021-11-12  6:25                                   ` Daniel Black
  2021-11-14 20:33                                   ` Daniel Black
  0 siblings, 2 replies; 36+ messages in thread
From: Jens Axboe @ 2021-11-11 23:44 UTC (permalink / raw)
  To: Daniel Black; +Cc: Salvatore Bonaccorso, Pavel Begunkov, linux-block, io-uring

On 11/11/21 10:28 AM, Jens Axboe wrote:
> On 11/11/21 9:55 AM, Jens Axboe wrote:
>> On 11/11/21 9:19 AM, Jens Axboe wrote:
>>> On 11/11/21 8:29 AM, Jens Axboe wrote:
>>>> On 11/11/21 7:58 AM, Jens Axboe wrote:
>>>>> On 11/11/21 7:30 AM, Jens Axboe wrote:
>>>>>> On 11/10/21 11:52 PM, Daniel Black wrote:
>>>>>>>> Would it be possible to turn this into a full reproducer script?
>>>>>>>> Something that someone that knows nothing about mysqld/mariadb can just
>>>>>>>> run and have it reproduce. If I install the 10.6 packages from above,
>>>>>>>> then it doesn't seem to use io_uring or be linked against liburing.
>>>>>>>
>>>>>>> Sorry Jens.
>>>>>>>
>>>>>>> Hope containers are ok.
>>>>>>
>>>>>> Don't think I have a way to run that, don't even know what podman is
>>>>>> and nor does my distro. I'll google a bit and see if I can get this
>>>>>> running.
>>>>>>
>>>>>> I'm fine building from source and running from there, as long as I
>>>>>> know what to do. Would that make it any easier? It definitely would
>>>>>> for me :-)
>>>>>
>>>>> The podman approach seemed to work, and I was able to run all three
>>>>> steps. Didn't see any hangs. I'm going to try again dropping down
>>>>> the innodb pool size (box only has 32G of RAM).
>>>>>
>>>>> The storage can do a lot more than 5k IOPS, I'm going to try ramping
>>>>> that up.
>>>>>
>>>>> Does your reproducer box have multiple NUMA nodes, or is it a single
>>>>> socket/nod box?
>>>>
>>>> Doesn't seem to reproduce for me on current -git. What file system are
>>>> you using?
>>>
>>> I seem to be able to hit it with ext4, guessing it has more cases that
>>> punt to buffered IO. As I initially suspected, I think this is a race
>>> with buffered file write hashing. I have a debug patch that just turns
>>> a regular non-numa box into multi nodes, may or may not be needed be
>>> needed to hit this, but I definitely can now. Looks like this:
>>>
>>> Node7 DUMP                                                                      
>>> index=0, nr_w=1, max=128, r=0, f=1, h=0                                         
>>>   w=ffff8f5e8b8470c0, hashed=1/0, flags=2                                       
>>>   w=ffff8f5e95a9b8c0, hashed=1/0, flags=2                                       
>>> index=1, nr_w=0, max=127877, r=0, f=0, h=0                                      
>>> free_list                                                                       
>>>   worker=ffff8f5eaf2e0540                                                       
>>> all_list                                                                        
>>>   worker=ffff8f5eaf2e0540
>>>
>>> where we seed node7 in this case having two work items pending, but the
>>> worker state is stalled on hash.
>>>
>>> The hash logic was rewritten as part of the io-wq worker threads being
>>> changed for 5.11 iirc, which is why that was my initial suspicion here.
>>>
>>> I'll take a look at this and make a test patch. Looks like you are able
>>> to test self-built kernels, is that correct?
>>
>> Can you try with this patch? It's against -git, but it will apply to
>> 5.15 as well.
> 
> I think that one covered one potential gap, but I just managed to
> reproduce a stall even with it. So hang on testing that one, I'll send
> you something more complete when I have confidence in it.

Alright, give this one a go if you can. Against -git, but will apply to
5.15 as well.


diff --git a/fs/io-wq.c b/fs/io-wq.c
index afd955d53db9..88202de519f6 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -423,9 +423,10 @@ static inline unsigned int io_get_work_hash(struct io_wq_work *work)
 	return work->flags >> IO_WQ_HASH_SHIFT;
 }
 
-static void io_wait_on_hash(struct io_wqe *wqe, unsigned int hash)
+static bool io_wait_on_hash(struct io_wqe *wqe, unsigned int hash)
 {
 	struct io_wq *wq = wqe->wq;
+	bool ret = false;
 
 	spin_lock_irq(&wq->hash->wait.lock);
 	if (list_empty(&wqe->wait.entry)) {
@@ -433,9 +434,11 @@ static void io_wait_on_hash(struct io_wqe *wqe, unsigned int hash)
 		if (!test_bit(hash, &wq->hash->map)) {
 			__set_current_state(TASK_RUNNING);
 			list_del_init(&wqe->wait.entry);
+			ret = true;
 		}
 	}
 	spin_unlock_irq(&wq->hash->wait.lock);
+	return ret;
 }
 
 static struct io_wq_work *io_get_next_work(struct io_wqe_acct *acct,
@@ -475,14 +478,21 @@ static struct io_wq_work *io_get_next_work(struct io_wqe_acct *acct,
 	}
 
 	if (stall_hash != -1U) {
+		bool unstalled;
+
 		/*
 		 * Set this before dropping the lock to avoid racing with new
 		 * work being added and clearing the stalled bit.
 		 */
 		set_bit(IO_ACCT_STALLED_BIT, &acct->flags);
 		raw_spin_unlock(&wqe->lock);
-		io_wait_on_hash(wqe, stall_hash);
+		unstalled = io_wait_on_hash(wqe, stall_hash);
 		raw_spin_lock(&wqe->lock);
+		if (unstalled) {
+			clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);
+			if (wq_has_sleeper(&wqe->wq->hash->wait))
+				wake_up(&wqe->wq->hash->wait);
+		}
 	}
 
 	return NULL;
@@ -564,8 +574,11 @@ static void io_worker_handle_work(struct io_worker *worker)
 				io_wqe_enqueue(wqe, linked);
 
 			if (hash != -1U && !next_hashed) {
+				/* serialize hash clear with wake_up() */
+				spin_lock_irq(&wq->hash->wait.lock);
 				clear_bit(hash, &wq->hash->map);
 				clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);
+				spin_unlock_irq(&wq->hash->wait.lock);
 				if (wq_has_sleeper(&wq->hash->wait))
 					wake_up(&wq->hash->wait);
 				raw_spin_lock(&wqe->lock);
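
For what it's worth, the last hunk is the classic lost-wakeup fix: the hash
bit is now cleared under the same lock the stalled worker takes to queue
itself and re-check the bit, so either the waiter sees the bit already clear,
or the clearing side sees the waiter on the queue and wakes it. A user-space
model of the resulting protocol (pthreads, invented names, not the kernel
code; a condition variable gives the queue-and-recheck atomicity that the
open-coded waitqueue needed the extra wait.lock for):

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wait_cond = PTHREAD_COND_INITIALIZER;
static bool hash_busy = true;   /* models one bit in wq->hash->map */

/* Stalled worker: queue up and re-check the bit under wait_lock,
 * the equivalent of io_wait_on_hash(). */
void wait_on_hash(void)
{
	pthread_mutex_lock(&wait_lock);
	while (hash_busy)
		pthread_cond_wait(&wait_cond, &wait_lock);  /* queue + sleep atomically */
	pthread_mutex_unlock(&wait_lock);
}

/* Worker finishing a hashed item: clear the bit under the same lock,
 * then wake anyone parked.  (The real code only calls wake_up() when
 * wq_has_sleeper() says there is someone to wake.) */
void clear_hash_and_wake(void)
{
	pthread_mutex_lock(&wait_lock);
	hash_busy = false;
	pthread_mutex_unlock(&wait_lock);
	pthread_cond_broadcast(&wait_cond);
}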

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: uring regression - lost write request
  2021-11-11 23:44                                 ` Jens Axboe
@ 2021-11-12  6:25                                   ` Daniel Black
  2021-11-12 19:19                                     ` Salvatore Bonaccorso
  2021-11-14 20:33                                   ` Daniel Black
  1 sibling, 1 reply; 36+ messages in thread
From: Daniel Black @ 2021-11-12  6:25 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Salvatore Bonaccorso, Pavel Begunkov, linux-block, io-uring

On Fri, Nov 12, 2021 at 10:44 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 11/11/21 10:28 AM, Jens Axboe wrote:
> > On 11/11/21 9:55 AM, Jens Axboe wrote:
> >> On 11/11/21 9:19 AM, Jens Axboe wrote:
> >>> On 11/11/21 8:29 AM, Jens Axboe wrote:
> >>>> On 11/11/21 7:58 AM, Jens Axboe wrote:
> >>>>> On 11/11/21 7:30 AM, Jens Axboe wrote:
> >>>>>> On 11/10/21 11:52 PM, Daniel Black wrote:
> >>>>>>>> Would it be possible to turn this into a full reproducer script?
> >>>>>>>> Something that someone that knows nothing about mysqld/mariadb can just
> >>>>>>>> run and have it reproduce. If I install the 10.6 packages from above,
> >>>>>>>> then it doesn't seem to use io_uring or be linked against liburing.
> >>>>>>>
> >>>>>>> Sorry Jens.
> >>>>>>>
> >>>>>>> Hope containers are ok.
> >>>>>>
> >>>>>> Don't think I have a way to run that, don't even know what podman is
> >>>>>> and nor does my distro. I'll google a bit and see if I can get this
> >>>>>> running.
> >>>>>>
> >>>>>> I'm fine building from source and running from there, as long as I
> >>>>>> know what to do. Would that make it any easier? It definitely would
> >>>>>> for me :-)
> >>>>>
> >>>>> The podman approach seemed to work,

Thanks for bearing with it.

> >>>>> and I was able to run all three
> >>>>> steps. Didn't see any hangs. I'm going to try again dropping down
> >>>>> the innodb pool size (box only has 32G of RAM).
> >>>>>
> >>>>> The storage can do a lot more than 5k IOPS, I'm going to try ramping
> >>>>> that up.

Good.

> >>>>>
> >>>>> Does your reproducer box have multiple NUMA nodes, or is it a single
> >>>>> socket/nod box?

It was NUMA. Pre 5.14.14 I could produce it on a simpler test on a single node.

> >>>>
> >>>> Doesn't seem to reproduce for me on current -git. What file system are
> >>>> you using?

Yes ext4.

> >>>
> >>> I seem to be able to hit it with ext4, guessing it has more cases that
> >>> punt to buffered IO. As I initially suspected, I think this is a race
> >>> with buffered file write hashing. I have a debug patch that just turns
> >>> a regular non-numa box into multi nodes, may or may not be needed be
> >>> needed to hit this, but I definitely can now. Looks like this:
> >>>
> >>> Node7 DUMP
> >>> index=0, nr_w=1, max=128, r=0, f=1, h=0
> >>>   w=ffff8f5e8b8470c0, hashed=1/0, flags=2
> >>>   w=ffff8f5e95a9b8c0, hashed=1/0, flags=2
> >>> index=1, nr_w=0, max=127877, r=0, f=0, h=0
> >>> free_list
> >>>   worker=ffff8f5eaf2e0540
> >>> all_list
> >>>   worker=ffff8f5eaf2e0540
> >>>
> >>> where we seed node7 in this case having two work items pending, but the
> >>> worker state is stalled on hash.
> >>>
> >>> The hash logic was rewritten as part of the io-wq worker threads being
> >>> changed for 5.11 iirc, which is why that was my initial suspicion here.
> >>>
> >>> I'll take a look at this and make a test patch. Looks like you are able
> >>> to test self-built kernels, is that correct?

I've been libreating prebuilt kernels, however on the path to self-built again.

Just searching for the holy penguin pee (from yaboot da(ze|ys)) to
peesign(sic) EFI kernels.
jk, working through docs:
https://docs.fedoraproject.org/en-US/quick-docs/kernel/build-custom-kernel/

> >> Can you try with this patch? It's against -git, but it will apply to
> >> 5.15 as well.
> >
> > I think that one covered one potential gap, but I just managed to
> > reproduce a stall even with it. So hang on testing that one, I'll send
> > you something more complete when I have confidence in it.
>
> Alright, give this one a go if you can. Against -git, but will apply to
> 5.15 as well.

Applied, built, attempting to boot....

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: uring regression - lost write request
  2021-11-12  6:25                                   ` Daniel Black
@ 2021-11-12 19:19                                     ` Salvatore Bonaccorso
  0 siblings, 0 replies; 36+ messages in thread
From: Salvatore Bonaccorso @ 2021-11-12 19:19 UTC (permalink / raw)
  To: Daniel Black; +Cc: Jens Axboe, Pavel Begunkov, linux-block, io-uring

Daniel,

On Fri, Nov 12, 2021 at 05:25:31PM +1100, Daniel Black wrote:
> On Fri, Nov 12, 2021 at 10:44 AM Jens Axboe <axboe@kernel.dk> wrote:
> >
> > On 11/11/21 10:28 AM, Jens Axboe wrote:
> > > On 11/11/21 9:55 AM, Jens Axboe wrote:
> > >> On 11/11/21 9:19 AM, Jens Axboe wrote:
> > >>> On 11/11/21 8:29 AM, Jens Axboe wrote:
> > >>>> On 11/11/21 7:58 AM, Jens Axboe wrote:
> > >>>>> On 11/11/21 7:30 AM, Jens Axboe wrote:
> > >>>>>> On 11/10/21 11:52 PM, Daniel Black wrote:
> > >>>>>>>> Would it be possible to turn this into a full reproducer script?
> > >>>>>>>> Something that someone that knows nothing about mysqld/mariadb can just
> > >>>>>>>> run and have it reproduce. If I install the 10.6 packages from above,
> > >>>>>>>> then it doesn't seem to use io_uring or be linked against liburing.
> > >>>>>>>
> > >>>>>>> Sorry Jens.
> > >>>>>>>
> > >>>>>>> Hope containers are ok.
> > >>>>>>
> > >>>>>> Don't think I have a way to run that, don't even know what podman is
> > >>>>>> and nor does my distro. I'll google a bit and see if I can get this
> > >>>>>> running.
> > >>>>>>
> > >>>>>> I'm fine building from source and running from there, as long as I
> > >>>>>> know what to do. Would that make it any easier? It definitely would
> > >>>>>> for me :-)
> > >>>>>
> > >>>>> The podman approach seemed to work,
> 
> Thanks for bearing with it.
> 
> > >>>>> and I was able to run all three
> > >>>>> steps. Didn't see any hangs. I'm going to try again dropping down
> > >>>>> the innodb pool size (box only has 32G of RAM).
> > >>>>>
> > >>>>> The storage can do a lot more than 5k IOPS, I'm going to try ramping
> > >>>>> that up.
> 
> Good.
> 
> > >>>>>
> > >>>>> Does your reproducer box have multiple NUMA nodes, or is it a single
> > >>>>> socket/nod box?
> 
> It was NUMA. Pre 5.14.14 I could produce it on a simpler test on a single node.
> 
> > >>>>
> > >>>> Doesn't seem to reproduce for me on current -git. What file system are
> > >>>> you using?
> 
> Yes ext4.
> 
> > >>>
> > >>> I seem to be able to hit it with ext4, guessing it has more cases that
> > >>> punt to buffered IO. As I initially suspected, I think this is a race
> > >>> with buffered file write hashing. I have a debug patch that just turns
> > >>> a regular non-numa box into multi nodes, may or may not be needed be
> > >>> needed to hit this, but I definitely can now. Looks like this:
> > >>>
> > >>> Node7 DUMP
> > >>> index=0, nr_w=1, max=128, r=0, f=1, h=0
> > >>>   w=ffff8f5e8b8470c0, hashed=1/0, flags=2
> > >>>   w=ffff8f5e95a9b8c0, hashed=1/0, flags=2
> > >>> index=1, nr_w=0, max=127877, r=0, f=0, h=0
> > >>> free_list
> > >>>   worker=ffff8f5eaf2e0540
> > >>> all_list
> > >>>   worker=ffff8f5eaf2e0540
> > >>>
> > >>> where we seed node7 in this case having two work items pending, but the
> > >>> worker state is stalled on hash.
> > >>>
> > >>> The hash logic was rewritten as part of the io-wq worker threads being
> > >>> changed for 5.11 iirc, which is why that was my initial suspicion here.
> > >>>
> > >>> I'll take a look at this and make a test patch. Looks like you are able
> > >>> to test self-built kernels, is that correct?
> 
> I've been libreating prebuilt kernels, however on the path to self-built again.
> 
> Just searching for the holy penguin pee (from yaboot da(ze|ys)) to
> peesign(sic) EFI kernels.
> jk, working through docs:
> https://docs.fedoraproject.org/en-US/quick-docs/kernel/build-custom-kernel/
> 
> > >> Can you try with this patch? It's against -git, but it will apply to
> > >> 5.15 as well.
> > >
> > > I think that one covered one potential gap, but I just managed to
> > > reproduce a stall even with it. So hang on testing that one, I'll send
> > > you something more complete when I have confidence in it.
> >
> > Alright, give this one a go if you can. Against -git, but will apply to
> > 5.15 as well.
> 
> Applied, built, attempting to boot....

If you want to do the same for Debian based system, the following
might help to get the package built:

https://kernel-team.pages.debian.net/kernel-handbook/ch-common-tasks.html#s4.2.2

Otherwise I might be able to provide you with a prebuilt package containing
the patch (unsigned, though; it is best if you build and test it directly).

Regards,
Salvatore

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: uring regression - lost write request
  2021-11-11 23:44                                 ` Jens Axboe
  2021-11-12  6:25                                   ` Daniel Black
@ 2021-11-14 20:33                                   ` Daniel Black
  2021-11-14 20:55                                     ` Jens Axboe
  1 sibling, 1 reply; 36+ messages in thread
From: Daniel Black @ 2021-11-14 20:33 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Salvatore Bonaccorso, Pavel Begunkov, linux-block, io-uring

On Fri, Nov 12, 2021 at 10:44 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> Alright, give this one a go if you can. Against -git, but will apply to
> 5.15 as well.


Works. Thank you very much.

https://jira.mariadb.org/browse/MDEV-26674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=205599#comment-205599

Tested-by: Marko Mäkelä <marko.makela@mariadb.com>


>
>
> diff --git a/fs/io-wq.c b/fs/io-wq.c
> index afd955d53db9..88202de519f6 100644
> --- a/fs/io-wq.c
> +++ b/fs/io-wq.c
> @@ -423,9 +423,10 @@ static inline unsigned int io_get_work_hash(struct io_wq_work *work)
>         return work->flags >> IO_WQ_HASH_SHIFT;
>  }
>
> -static void io_wait_on_hash(struct io_wqe *wqe, unsigned int hash)
> +static bool io_wait_on_hash(struct io_wqe *wqe, unsigned int hash)
>  {
>         struct io_wq *wq = wqe->wq;
> +       bool ret = false;
>
>         spin_lock_irq(&wq->hash->wait.lock);
>         if (list_empty(&wqe->wait.entry)) {
> @@ -433,9 +434,11 @@ static void io_wait_on_hash(struct io_wqe *wqe, unsigned int hash)
>                 if (!test_bit(hash, &wq->hash->map)) {
>                         __set_current_state(TASK_RUNNING);
>                         list_del_init(&wqe->wait.entry);
> +                       ret = true;
>                 }
>         }
>         spin_unlock_irq(&wq->hash->wait.lock);
> +       return ret;
>  }
>
>  static struct io_wq_work *io_get_next_work(struct io_wqe_acct *acct,
> @@ -475,14 +478,21 @@ static struct io_wq_work *io_get_next_work(struct io_wqe_acct *acct,
>         }
>
>         if (stall_hash != -1U) {
> +               bool unstalled;
> +
>                 /*
>                  * Set this before dropping the lock to avoid racing with new
>                  * work being added and clearing the stalled bit.
>                  */
>                 set_bit(IO_ACCT_STALLED_BIT, &acct->flags);
>                 raw_spin_unlock(&wqe->lock);
> -               io_wait_on_hash(wqe, stall_hash);
> +               unstalled = io_wait_on_hash(wqe, stall_hash);
>                 raw_spin_lock(&wqe->lock);
> +               if (unstalled) {
> +                       clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);
> +                       if (wq_has_sleeper(&wqe->wq->hash->wait))
> +                               wake_up(&wqe->wq->hash->wait);
> +               }
>         }
>
>         return NULL;
> @@ -564,8 +574,11 @@ static void io_worker_handle_work(struct io_worker *worker)
>                                 io_wqe_enqueue(wqe, linked);
>
>                         if (hash != -1U && !next_hashed) {
> +                               /* serialize hash clear with wake_up() */
> +                               spin_lock_irq(&wq->hash->wait.lock);
>                                 clear_bit(hash, &wq->hash->map);
>                                 clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);
> +                               spin_unlock_irq(&wq->hash->wait.lock);
>                                 if (wq_has_sleeper(&wq->hash->wait))
>                                         wake_up(&wq->hash->wait);
>                                 raw_spin_lock(&wqe->lock);
>
> --
> Jens Axboe
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: uring regression - lost write request
  2021-11-14 20:33                                   ` Daniel Black
@ 2021-11-14 20:55                                     ` Jens Axboe
  2021-11-14 21:02                                       ` Salvatore Bonaccorso
  2021-11-24  3:27                                       ` Daniel Black
  0 siblings, 2 replies; 36+ messages in thread
From: Jens Axboe @ 2021-11-14 20:55 UTC (permalink / raw)
  To: Daniel Black; +Cc: Salvatore Bonaccorso, Pavel Begunkov, linux-block, io-uring

On 11/14/21 1:33 PM, Daniel Black wrote:
> On Fri, Nov 12, 2021 at 10:44 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Alright, give this one a go if you can. Against -git, but will apply to
>> 5.15 as well.
> 
> 
> Works. Thank you very much.
> 
> https://jira.mariadb.org/browse/MDEV-26674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=205599#comment-205599
> 
> Tested-by: Marko Mäkelä <marko.makela@mariadb.com>

Awesome, thanks so much for reporting and testing. All bugs are shallow
when given a reproducer, that certainly helped a ton in figuring out
what this was and nailing a fix.

The patch is already upstream (and in the 5.15 stable queue), and I
provided 5.14 patches too.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: uring regression - lost write request
  2021-11-14 20:55                                     ` Jens Axboe
@ 2021-11-14 21:02                                       ` Salvatore Bonaccorso
  2021-11-14 21:03                                         ` Jens Axboe
  2021-11-24  3:27                                       ` Daniel Black
  1 sibling, 1 reply; 36+ messages in thread
From: Salvatore Bonaccorso @ 2021-11-14 21:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Daniel Black, Pavel Begunkov, linux-block, io-uring

Hi,

On Sun, Nov 14, 2021 at 01:55:20PM -0700, Jens Axboe wrote:
> On 11/14/21 1:33 PM, Daniel Black wrote:
> > On Fri, Nov 12, 2021 at 10:44 AM Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> Alright, give this one a go if you can. Against -git, but will apply to
> >> 5.15 as well.
> > 
> > 
> > Works. Thank you very much.
> > 
> > https://jira.mariadb.org/browse/MDEV-26674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=205599#comment-205599
> > 
> > Tested-by: Marko Mäkelä <marko.makela@mariadb.com>
> 
> Awesome, thanks so much for reporting and testing. All bugs are shallow
> when given a reproducer, that certainly helped a ton in figuring out
> what this was and nailing a fix.
> 
> The patch is already upstream (and in the 5.15 stable queue), and I
> provided 5.14 patches too.

FTR, I cherry-picked as well the respective commit for Debian's upload
of 5.15.2-1~exp1 to experimental as
https://salsa.debian.org/kernel-team/linux/-/commit/657413869fa29b97ec886cf62a420ab43b935fff
.

Regards,
Salvatore

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: uring regression - lost write request
  2021-11-14 21:02                                       ` Salvatore Bonaccorso
@ 2021-11-14 21:03                                         ` Jens Axboe
  0 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2021-11-14 21:03 UTC (permalink / raw)
  To: Salvatore Bonaccorso; +Cc: Daniel Black, Pavel Begunkov, linux-block, io-uring

On 11/14/21 2:02 PM, Salvatore Bonaccorso wrote:
> Hi,
> 
> On Sun, Nov 14, 2021 at 01:55:20PM -0700, Jens Axboe wrote:
>> On 11/14/21 1:33 PM, Daniel Black wrote:
>>> On Fri, Nov 12, 2021 at 10:44 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> Alright, give this one a go if you can. Against -git, but will apply to
>>>> 5.15 as well.
>>>
>>>
>>> Works. Thank you very much.
>>>
>>> https://jira.mariadb.org/browse/MDEV-26674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=205599#comment-205599
>>>
>>> Tested-by: Marko Mäkelä <marko.makela@mariadb.com>
>>
>> Awesome, thanks so much for reporting and testing. All bugs are shallow
>> when given a reproducer, that certainly helped a ton in figuring out
>> what this was and nailing a fix.
>>
>> The patch is already upstream (and in the 5.15 stable queue), and I
>> provided 5.14 patches too.
> 
> FTR, I cherry-picked as well the respective commit for Debian's upload
> of 5.15.2-1~exp1 to experimental as
> https://salsa.debian.org/kernel-team/linux/-/commit/657413869fa29b97ec886cf62a420ab43b935fff

Great thanks, you're beating stable :-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: uring regression - lost write request
  2021-11-14 20:55                                     ` Jens Axboe
  2021-11-14 21:02                                       ` Salvatore Bonaccorso
@ 2021-11-24  3:27                                       ` Daniel Black
  2021-11-24 15:28                                         ` Jens Axboe
  1 sibling, 1 reply; 36+ messages in thread
From: Daniel Black @ 2021-11-24  3:27 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Salvatore Bonaccorso, Pavel Begunkov, linux-block, io-uring

On Mon, Nov 15, 2021 at 7:55 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 11/14/21 1:33 PM, Daniel Black wrote:
> > On Fri, Nov 12, 2021 at 10:44 AM Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> Alright, give this one a go if you can. Against -git, but will apply to
> >> 5.15 as well.
> >
> >
> > Works. Thank you very much.
> >
> > https://jira.mariadb.org/browse/MDEV-26674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=205599#comment-205599
> >
> > Tested-by: Marko Mäkelä <marko.makela@mariadb.com>
>
> The patch is already upstream (and in the 5.15 stable queue), and I
> provided 5.14 patches too.

Jens,

I'm getting the same reproducer on 5.14.20
(https://bugzilla.redhat.com/show_bug.cgi?id=2018882#c3) though the
backport change logs indicate 5.14.19 has the patch.

Anything missing?

ext4 again (my mount is /dev/mapper/fedora_localhost--live-home on
/home type ext4 (rw,relatime,seclabel)).

The previous container should work, though a source option is also there:

build deps: liburing-dev, bison, libevent-dev, ncurses-dev, c++
libraries/compiler

git clone --branch 10.6 --single-branch \
    https://github.com/MariaDB/server mariadb-server
(cd mariadb-server; git submodule update --init --recursive)
mkdir build-mariadb-server
cd build-mariadb-server
cmake -DPLUGIN_{MROONGA,ROCKSDB,CONNECT,SPIDER,SPHINX,S3,COLUMNSTORE}=NO \
    ../mariadb-server
(ensure liburing userspace is picked up)
cmake --build . --parallel
mysql-test/mtr --mysqld=--innodb_use_native_aio=1 --nowarnings \
    --parallel=4 --force encryption.innochecksum{,,,,,}

Adding to mtr: --mysqld=--innodb_io_capacity=50000
--mysqld=--innodb_io_capacity_max=90000 will probably trip this
quicker.


5.15.3 is good (https://jira.mariadb.org/browse/MDEV-26674?focusedCommentId=206787&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-206787).

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: uring regression - lost write request
  2021-11-24  3:27                                       ` Daniel Black
@ 2021-11-24 15:28                                         ` Jens Axboe
  2021-11-24 16:10                                           ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2021-11-24 15:28 UTC (permalink / raw)
  To: Daniel Black; +Cc: Salvatore Bonaccorso, Pavel Begunkov, linux-block, io-uring

[-- Attachment #1: Type: text/plain, Size: 1119 bytes --]

On 11/23/21 8:27 PM, Daniel Black wrote:
> On Mon, Nov 15, 2021 at 7:55 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 11/14/21 1:33 PM, Daniel Black wrote:
>>> On Fri, Nov 12, 2021 at 10:44 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> Alright, give this one a go if you can. Against -git, but will apply to
>>>> 5.15 as well.
>>>
>>>
>>> Works. Thank you very much.
>>>
>>> https://jira.mariadb.org/browse/MDEV-26674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=205599#comment-205599
>>>
>>> Tested-by: Marko Mäkelä <marko.makela@mariadb.com>
>>
>> The patch is already upstream (and in the 5.15 stable queue), and I
>> provided 5.14 patches too.
> 
> Jens,
> 
> I'm getting the same reproducer on 5.14.20
> (https://bugzilla.redhat.com/show_bug.cgi?id=2018882#c3) though the
> backport change logs indicate 5.14.19 has the patch.
> 
> Anything missing?

We might also need another patch that isn't in stable, I'm attaching
it here. Any chance you can run 5.14.20/21 with this applied? If not,
I'll do some sanity checking here and push it to -stable.

-- 
Jens Axboe


[-- Attachment #2: 0001-io-wq-split-bounded-and-unbounded-work-into-separate.patch --]
[-- Type: text/x-patch, Size: 13384 bytes --]

From 99e6a29dbda79e5e050be1ffd38dd36622f61af5 Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe@kernel.dk>
Date: Wed, 24 Nov 2021 08:26:11 -0700
Subject: [PATCH] io-wq: split bounded and unbounded work into separate lists

commit f95dc207b93da9c88ddbb7741ec3730c6657b88e upstream.

We've got a few issues that all boil down to the fact that we have one
list of pending work items, yet two different types of workers to
serve them. This causes some oddities around workers switching type and
even hashed work vs regular work on the same bounded list.

Just separate them out cleanly, similarly to how we already do
accounting of what is running. That provides a clean separation and
removes some corner cases that can cause stalls when handling IO
that is punted to io-wq.

Fixes: ecc53c48c13d ("io-wq: check max_worker limits if a worker transitions bound state")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io-wq.c | 156 +++++++++++++++++++++++------------------------------
 1 file changed, 68 insertions(+), 88 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 0890d85ba285..7d63299b4776 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -32,7 +32,7 @@ enum {
 };
 
 enum {
-	IO_WQE_FLAG_STALLED	= 1,	/* stalled on hash */
+	IO_ACCT_STALLED_BIT	= 0,	/* stalled on hash */
 };
 
 /*
@@ -71,25 +71,24 @@ struct io_wqe_acct {
 	unsigned max_workers;
 	int index;
 	atomic_t nr_running;
+	struct io_wq_work_list work_list;
+	unsigned long flags;
 };
 
 enum {
 	IO_WQ_ACCT_BOUND,
 	IO_WQ_ACCT_UNBOUND,
+	IO_WQ_ACCT_NR,
 };
 
 /*
  * Per-node worker thread pool
  */
 struct io_wqe {
-	struct {
-		raw_spinlock_t lock;
-		struct io_wq_work_list work_list;
-		unsigned flags;
-	} ____cacheline_aligned_in_smp;
+	raw_spinlock_t lock;
+	struct io_wqe_acct acct[2];
 
 	int node;
-	struct io_wqe_acct acct[2];
 
 	struct hlist_nulls_head free_list;
 	struct list_head all_list;
@@ -195,11 +194,10 @@ static void io_worker_exit(struct io_worker *worker)
 	do_exit(0);
 }
 
-static inline bool io_wqe_run_queue(struct io_wqe *wqe)
-	__must_hold(wqe->lock)
+static inline bool io_acct_run_queue(struct io_wqe_acct *acct)
 {
-	if (!wq_list_empty(&wqe->work_list) &&
-	    !(wqe->flags & IO_WQE_FLAG_STALLED))
+	if (!wq_list_empty(&acct->work_list) &&
+	    !test_bit(IO_ACCT_STALLED_BIT, &acct->flags))
 		return true;
 	return false;
 }
@@ -208,7 +206,8 @@ static inline bool io_wqe_run_queue(struct io_wqe *wqe)
  * Check head of free list for an available worker. If one isn't available,
  * caller must create one.
  */
-static bool io_wqe_activate_free_worker(struct io_wqe *wqe)
+static bool io_wqe_activate_free_worker(struct io_wqe *wqe,
+					struct io_wqe_acct *acct)
 	__must_hold(RCU)
 {
 	struct hlist_nulls_node *n;
@@ -222,6 +221,10 @@ static bool io_wqe_activate_free_worker(struct io_wqe *wqe)
 	hlist_nulls_for_each_entry_rcu(worker, n, &wqe->free_list, nulls_node) {
 		if (!io_worker_get(worker))
 			continue;
+		if (io_wqe_get_acct(worker) != acct) {
+			io_worker_release(worker);
+			continue;
+		}
 		if (wake_up_process(worker->task)) {
 			io_worker_release(worker);
 			return true;
@@ -340,7 +343,7 @@ static void io_wqe_dec_running(struct io_worker *worker)
 	if (!(worker->flags & IO_WORKER_F_UP))
 		return;
 
-	if (atomic_dec_and_test(&acct->nr_running) && io_wqe_run_queue(wqe)) {
+	if (atomic_dec_and_test(&acct->nr_running) && io_acct_run_queue(acct)) {
 		atomic_inc(&acct->nr_running);
 		atomic_inc(&wqe->wq->worker_refs);
 		io_queue_worker_create(wqe, worker, acct);
@@ -355,29 +358,10 @@ static void __io_worker_busy(struct io_wqe *wqe, struct io_worker *worker,
 			     struct io_wq_work *work)
 	__must_hold(wqe->lock)
 {
-	bool worker_bound, work_bound;
-
-	BUILD_BUG_ON((IO_WQ_ACCT_UNBOUND ^ IO_WQ_ACCT_BOUND) != 1);
-
 	if (worker->flags & IO_WORKER_F_FREE) {
 		worker->flags &= ~IO_WORKER_F_FREE;
 		hlist_nulls_del_init_rcu(&worker->nulls_node);
 	}
-
-	/*
-	 * If worker is moving from bound to unbound (or vice versa), then
-	 * ensure we update the running accounting.
-	 */
-	worker_bound = (worker->flags & IO_WORKER_F_BOUND) != 0;
-	work_bound = (work->flags & IO_WQ_WORK_UNBOUND) == 0;
-	if (worker_bound != work_bound) {
-		int index = work_bound ? IO_WQ_ACCT_UNBOUND : IO_WQ_ACCT_BOUND;
-		io_wqe_dec_running(worker);
-		worker->flags ^= IO_WORKER_F_BOUND;
-		wqe->acct[index].nr_workers--;
-		wqe->acct[index ^ 1].nr_workers++;
-		io_wqe_inc_running(worker);
-	 }
 }
 
 /*
@@ -419,44 +403,23 @@ static bool io_wait_on_hash(struct io_wqe *wqe, unsigned int hash)
 	return ret;
 }
 
-/*
- * We can always run the work if the worker is currently the same type as
- * the work (eg both are bound, or both are unbound). If they are not the
- * same, only allow it if incrementing the worker count would be allowed.
- */
-static bool io_worker_can_run_work(struct io_worker *worker,
-				   struct io_wq_work *work)
-{
-	struct io_wqe_acct *acct;
-
-	if (!(worker->flags & IO_WORKER_F_BOUND) !=
-	    !(work->flags & IO_WQ_WORK_UNBOUND))
-		return true;
-
-	/* not the same type, check if we'd go over the limit */
-	acct = io_work_get_acct(worker->wqe, work);
-	return acct->nr_workers < acct->max_workers;
-}
-
-static struct io_wq_work *io_get_next_work(struct io_wqe *wqe,
+static struct io_wq_work *io_get_next_work(struct io_wqe_acct *acct,
 					   struct io_worker *worker)
 	__must_hold(wqe->lock)
 {
 	struct io_wq_work_node *node, *prev;
 	struct io_wq_work *work, *tail;
 	unsigned int stall_hash = -1U;
+	struct io_wqe *wqe = worker->wqe;
 
-	wq_list_for_each(node, prev, &wqe->work_list) {
+	wq_list_for_each(node, prev, &acct->work_list) {
 		unsigned int hash;
 
 		work = container_of(node, struct io_wq_work, list);
 
-		if (!io_worker_can_run_work(worker, work))
-			break;
-
 		/* not hashed, can run anytime */
 		if (!io_wq_is_hashed(work)) {
-			wq_list_del(&wqe->work_list, node, prev);
+			wq_list_del(&acct->work_list, node, prev);
 			return work;
 		}
 
@@ -467,7 +430,7 @@ static struct io_wq_work *io_get_next_work(struct io_wqe *wqe,
 		/* hashed, can run if not already running */
 		if (!test_and_set_bit(hash, &wqe->wq->hash->map)) {
 			wqe->hash_tail[hash] = NULL;
-			wq_list_cut(&wqe->work_list, &tail->list, prev);
+			wq_list_cut(&acct->work_list, &tail->list, prev);
 			return work;
 		}
 		if (stall_hash == -1U)
@@ -483,12 +446,12 @@ static struct io_wq_work *io_get_next_work(struct io_wqe *wqe,
 		 * Set this before dropping the lock to avoid racing with new
 		 * work being added and clearing the stalled bit.
 		 */
-		wqe->flags |= IO_WQE_FLAG_STALLED;
+		set_bit(IO_ACCT_STALLED_BIT, &acct->flags);
 		raw_spin_unlock(&wqe->lock);
 		unstalled = io_wait_on_hash(wqe, stall_hash);
 		raw_spin_lock(&wqe->lock);
 		if (unstalled) {
-			wqe->flags &= ~IO_WQE_FLAG_STALLED;
+			clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);
 			if (wq_has_sleeper(&wqe->wq->hash->wait))
 				wake_up(&wqe->wq->hash->wait);
 		}
@@ -525,6 +488,7 @@ static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work);
 static void io_worker_handle_work(struct io_worker *worker)
 	__releases(wqe->lock)
 {
+	struct io_wqe_acct *acct = io_wqe_get_acct(worker);
 	struct io_wqe *wqe = worker->wqe;
 	struct io_wq *wq = wqe->wq;
 	bool do_kill = test_bit(IO_WQ_BIT_EXIT, &wq->state);
@@ -539,7 +503,7 @@ static void io_worker_handle_work(struct io_worker *worker)
 		 * can't make progress, any work completion or insertion will
 		 * clear the stalled flag.
 		 */
-		work = io_get_next_work(wqe, worker);
+		work = io_get_next_work(acct, worker);
 		if (work)
 			__io_worker_busy(wqe, worker, work);
 
@@ -575,7 +539,7 @@ static void io_worker_handle_work(struct io_worker *worker)
 				/* serialize hash clear with wake_up() */
 				spin_lock_irq(&wq->hash->wait.lock);
 				clear_bit(hash, &wq->hash->map);
-				wqe->flags &= ~IO_WQE_FLAG_STALLED;
+				clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);
 				spin_unlock_irq(&wq->hash->wait.lock);
 				if (wq_has_sleeper(&wq->hash->wait))
 					wake_up(&wq->hash->wait);
@@ -594,6 +558,7 @@ static void io_worker_handle_work(struct io_worker *worker)
 static int io_wqe_worker(void *data)
 {
 	struct io_worker *worker = data;
+	struct io_wqe_acct *acct = io_wqe_get_acct(worker);
 	struct io_wqe *wqe = worker->wqe;
 	struct io_wq *wq = wqe->wq;
 	char buf[TASK_COMM_LEN];
@@ -609,7 +574,7 @@ static int io_wqe_worker(void *data)
 		set_current_state(TASK_INTERRUPTIBLE);
 loop:
 		raw_spin_lock_irq(&wqe->lock);
-		if (io_wqe_run_queue(wqe)) {
+		if (io_acct_run_queue(acct)) {
 			io_worker_handle_work(worker);
 			goto loop;
 		}
@@ -777,12 +742,13 @@ static void io_run_cancel(struct io_wq_work *work, struct io_wqe *wqe)
 
 static void io_wqe_insert_work(struct io_wqe *wqe, struct io_wq_work *work)
 {
+	struct io_wqe_acct *acct = io_work_get_acct(wqe, work);
 	unsigned int hash;
 	struct io_wq_work *tail;
 
 	if (!io_wq_is_hashed(work)) {
 append:
-		wq_list_add_tail(&work->list, &wqe->work_list);
+		wq_list_add_tail(&work->list, &acct->work_list);
 		return;
 	}
 
@@ -792,7 +758,7 @@ static void io_wqe_insert_work(struct io_wqe *wqe, struct io_wq_work *work)
 	if (!tail)
 		goto append;
 
-	wq_list_add_after(&work->list, &tail->list, &wqe->work_list);
+	wq_list_add_after(&work->list, &tail->list, &acct->work_list);
 }
 
 static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
@@ -814,10 +780,10 @@ static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
 
 	raw_spin_lock_irqsave(&wqe->lock, flags);
 	io_wqe_insert_work(wqe, work);
-	wqe->flags &= ~IO_WQE_FLAG_STALLED;
+	clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);
 
 	rcu_read_lock();
-	do_create = !io_wqe_activate_free_worker(wqe);
+	do_create = !io_wqe_activate_free_worker(wqe, acct);
 	rcu_read_unlock();
 
 	raw_spin_unlock_irqrestore(&wqe->lock, flags);
@@ -870,6 +836,7 @@ static inline void io_wqe_remove_pending(struct io_wqe *wqe,
 					 struct io_wq_work *work,
 					 struct io_wq_work_node *prev)
 {
+	struct io_wqe_acct *acct = io_work_get_acct(wqe, work);
 	unsigned int hash = io_get_work_hash(work);
 	struct io_wq_work *prev_work = NULL;
 
@@ -881,7 +848,7 @@ static inline void io_wqe_remove_pending(struct io_wqe *wqe,
 		else
 			wqe->hash_tail[hash] = NULL;
 	}
-	wq_list_del(&wqe->work_list, &work->list, prev);
+	wq_list_del(&acct->work_list, &work->list, prev);
 }
 
 static void io_wqe_cancel_pending_work(struct io_wqe *wqe,
@@ -890,22 +857,27 @@ static void io_wqe_cancel_pending_work(struct io_wqe *wqe,
 	struct io_wq_work_node *node, *prev;
 	struct io_wq_work *work;
 	unsigned long flags;
+	int i;
 
 retry:
 	raw_spin_lock_irqsave(&wqe->lock, flags);
-	wq_list_for_each(node, prev, &wqe->work_list) {
-		work = container_of(node, struct io_wq_work, list);
-		if (!match->fn(work, match->data))
-			continue;
-		io_wqe_remove_pending(wqe, work, prev);
-		raw_spin_unlock_irqrestore(&wqe->lock, flags);
-		io_run_cancel(work, wqe);
-		match->nr_pending++;
-		if (!match->cancel_all)
-			return;
+	for (i = 0; i < IO_WQ_ACCT_NR; i++) {
+		struct io_wqe_acct *acct = io_get_acct(wqe, i == 0);
 
-		/* not safe to continue after unlock */
-		goto retry;
+		wq_list_for_each(node, prev, &acct->work_list) {
+			work = container_of(node, struct io_wq_work, list);
+			if (!match->fn(work, match->data))
+				continue;
+			io_wqe_remove_pending(wqe, work, prev);
+			raw_spin_unlock_irqrestore(&wqe->lock, flags);
+			io_run_cancel(work, wqe);
+			match->nr_pending++;
+			if (!match->cancel_all)
+				return;
+
+			/* not safe to continue after unlock */
+			goto retry;
+		}
 	}
 	raw_spin_unlock_irqrestore(&wqe->lock, flags);
 }
@@ -966,18 +938,24 @@ static int io_wqe_hash_wake(struct wait_queue_entry *wait, unsigned mode,
 			    int sync, void *key)
 {
 	struct io_wqe *wqe = container_of(wait, struct io_wqe, wait);
+	int i;
 
 	list_del_init(&wait->entry);
 
 	rcu_read_lock();
-	io_wqe_activate_free_worker(wqe);
+	for (i = 0; i < IO_WQ_ACCT_NR; i++) {
+		struct io_wqe_acct *acct = &wqe->acct[i];
+
+		if (test_and_clear_bit(IO_ACCT_STALLED_BIT, &acct->flags))
+			io_wqe_activate_free_worker(wqe, acct);
+	}
 	rcu_read_unlock();
 	return 1;
 }
 
 struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 {
-	int ret, node;
+	int ret, node, i;
 	struct io_wq *wq;
 
 	if (WARN_ON_ONCE(!data->free_work || !data->do_work))
@@ -1012,18 +990,20 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 		cpumask_copy(wqe->cpu_mask, cpumask_of_node(node));
 		wq->wqes[node] = wqe;
 		wqe->node = alloc_node;
-		wqe->acct[IO_WQ_ACCT_BOUND].index = IO_WQ_ACCT_BOUND;
-		wqe->acct[IO_WQ_ACCT_UNBOUND].index = IO_WQ_ACCT_UNBOUND;
 		wqe->acct[IO_WQ_ACCT_BOUND].max_workers = bounded;
-		atomic_set(&wqe->acct[IO_WQ_ACCT_BOUND].nr_running, 0);
 		wqe->acct[IO_WQ_ACCT_UNBOUND].max_workers =
 					task_rlimit(current, RLIMIT_NPROC);
-		atomic_set(&wqe->acct[IO_WQ_ACCT_UNBOUND].nr_running, 0);
-		wqe->wait.func = io_wqe_hash_wake;
 		INIT_LIST_HEAD(&wqe->wait.entry);
+		wqe->wait.func = io_wqe_hash_wake;
+		for (i = 0; i < IO_WQ_ACCT_NR; i++) {
+			struct io_wqe_acct *acct = &wqe->acct[i];
+
+			acct->index = i;
+			atomic_set(&acct->nr_running, 0);
+			INIT_WQ_LIST(&acct->work_list);
+		}
 		wqe->wq = wq;
 		raw_spin_lock_init(&wqe->lock);
-		INIT_WQ_LIST(&wqe->work_list);
 		INIT_HLIST_NULLS_HEAD(&wqe->free_list, 0);
 		INIT_LIST_HEAD(&wqe->all_list);
 	}
-- 
2.34.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: uring regression - lost write request
  2021-11-24 15:28                                         ` Jens Axboe
@ 2021-11-24 16:10                                           ` Jens Axboe
  2021-11-24 16:18                                             ` Greg Kroah-Hartman
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2021-11-24 16:10 UTC (permalink / raw)
  To: Daniel Black
  Cc: Salvatore Bonaccorso, Pavel Begunkov, linux-block, io-uring,
	stable, Greg Kroah-Hartman

[-- Attachment #1: Type: text/plain, Size: 1265 bytes --]

On 11/24/21 8:28 AM, Jens Axboe wrote:
> On 11/23/21 8:27 PM, Daniel Black wrote:
>> On Mon, Nov 15, 2021 at 7:55 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> On 11/14/21 1:33 PM, Daniel Black wrote:
>>>> On Fri, Nov 12, 2021 at 10:44 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>
>>>>> Alright, give this one a go if you can. Against -git, but will apply to
>>>>> 5.15 as well.
>>>>
>>>>
>>>> Works. Thank you very much.
>>>>
>>>> https://jira.mariadb.org/browse/MDEV-26674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=205599#comment-205599
>>>>
>>>> Tested-by: Marko Mäkelä <marko.makela@mariadb.com>
>>>
>>> The patch is already upstream (and in the 5.15 stable queue), and I
>>> provided 5.14 patches too.
>>
>> Jens,
>>
>> I'm getting the same reproducer on 5.14.20
>> (https://bugzilla.redhat.com/show_bug.cgi?id=2018882#c3) though the
>> backport change logs indicate 5.14.19 has the patch.
>>
>> Anything missing?
> 
> We might also need another patch that isn't in stable, I'm attaching
> it here. Any chance you can run 5.14.20/21 with this applied? If not,
> I'll do some sanity checking here and push it to -stable.

Looks good to me - Greg, would you mind queueing this up for
5.14-stable?

-- 
Jens Axboe


[-- Attachment #2: 0001-io-wq-split-bounded-and-unbounded-work-into-separate.patch --]
[-- Type: text/x-patch, Size: 13384 bytes --]

From 99e6a29dbda79e5e050be1ffd38dd36622f61af5 Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe@kernel.dk>
Date: Wed, 24 Nov 2021 08:26:11 -0700
Subject: [PATCH] io-wq: split bounded and unbounded work into separate lists

commit f95dc207b93da9c88ddbb7741ec3730c6657b88e upstream.

We've got a few issues that all boil down to the fact that we have one
list of pending work items, yet two different types of workers to
serve them. This causes some oddities around workers switching type and
even hashed work vs regular work on the same bounded list.

Just separate them out cleanly, similarly to how we already do
accounting of what is running. That provides a clean separation and
removes some corner cases that can cause stalls when handling IO
that is punted to io-wq.

Fixes: ecc53c48c13d ("io-wq: check max_worker limits if a worker transitions bound state")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io-wq.c | 156 +++++++++++++++++++++++------------------------------
 1 file changed, 68 insertions(+), 88 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 0890d85ba285..7d63299b4776 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -32,7 +32,7 @@ enum {
 };
 
 enum {
-	IO_WQE_FLAG_STALLED	= 1,	/* stalled on hash */
+	IO_ACCT_STALLED_BIT	= 0,	/* stalled on hash */
 };
 
 /*
@@ -71,25 +71,24 @@ struct io_wqe_acct {
 	unsigned max_workers;
 	int index;
 	atomic_t nr_running;
+	struct io_wq_work_list work_list;
+	unsigned long flags;
 };
 
 enum {
 	IO_WQ_ACCT_BOUND,
 	IO_WQ_ACCT_UNBOUND,
+	IO_WQ_ACCT_NR,
 };
 
 /*
  * Per-node worker thread pool
  */
 struct io_wqe {
-	struct {
-		raw_spinlock_t lock;
-		struct io_wq_work_list work_list;
-		unsigned flags;
-	} ____cacheline_aligned_in_smp;
+	raw_spinlock_t lock;
+	struct io_wqe_acct acct[2];
 
 	int node;
-	struct io_wqe_acct acct[2];
 
 	struct hlist_nulls_head free_list;
 	struct list_head all_list;
@@ -195,11 +194,10 @@ static void io_worker_exit(struct io_worker *worker)
 	do_exit(0);
 }
 
-static inline bool io_wqe_run_queue(struct io_wqe *wqe)
-	__must_hold(wqe->lock)
+static inline bool io_acct_run_queue(struct io_wqe_acct *acct)
 {
-	if (!wq_list_empty(&wqe->work_list) &&
-	    !(wqe->flags & IO_WQE_FLAG_STALLED))
+	if (!wq_list_empty(&acct->work_list) &&
+	    !test_bit(IO_ACCT_STALLED_BIT, &acct->flags))
 		return true;
 	return false;
 }
@@ -208,7 +206,8 @@ static inline bool io_wqe_run_queue(struct io_wqe *wqe)
  * Check head of free list for an available worker. If one isn't available,
  * caller must create one.
  */
-static bool io_wqe_activate_free_worker(struct io_wqe *wqe)
+static bool io_wqe_activate_free_worker(struct io_wqe *wqe,
+					struct io_wqe_acct *acct)
 	__must_hold(RCU)
 {
 	struct hlist_nulls_node *n;
@@ -222,6 +221,10 @@ static bool io_wqe_activate_free_worker(struct io_wqe *wqe)
 	hlist_nulls_for_each_entry_rcu(worker, n, &wqe->free_list, nulls_node) {
 		if (!io_worker_get(worker))
 			continue;
+		if (io_wqe_get_acct(worker) != acct) {
+			io_worker_release(worker);
+			continue;
+		}
 		if (wake_up_process(worker->task)) {
 			io_worker_release(worker);
 			return true;
@@ -340,7 +343,7 @@ static void io_wqe_dec_running(struct io_worker *worker)
 	if (!(worker->flags & IO_WORKER_F_UP))
 		return;
 
-	if (atomic_dec_and_test(&acct->nr_running) && io_wqe_run_queue(wqe)) {
+	if (atomic_dec_and_test(&acct->nr_running) && io_acct_run_queue(acct)) {
 		atomic_inc(&acct->nr_running);
 		atomic_inc(&wqe->wq->worker_refs);
 		io_queue_worker_create(wqe, worker, acct);
@@ -355,29 +358,10 @@ static void __io_worker_busy(struct io_wqe *wqe, struct io_worker *worker,
 			     struct io_wq_work *work)
 	__must_hold(wqe->lock)
 {
-	bool worker_bound, work_bound;
-
-	BUILD_BUG_ON((IO_WQ_ACCT_UNBOUND ^ IO_WQ_ACCT_BOUND) != 1);
-
 	if (worker->flags & IO_WORKER_F_FREE) {
 		worker->flags &= ~IO_WORKER_F_FREE;
 		hlist_nulls_del_init_rcu(&worker->nulls_node);
 	}
-
-	/*
-	 * If worker is moving from bound to unbound (or vice versa), then
-	 * ensure we update the running accounting.
-	 */
-	worker_bound = (worker->flags & IO_WORKER_F_BOUND) != 0;
-	work_bound = (work->flags & IO_WQ_WORK_UNBOUND) == 0;
-	if (worker_bound != work_bound) {
-		int index = work_bound ? IO_WQ_ACCT_UNBOUND : IO_WQ_ACCT_BOUND;
-		io_wqe_dec_running(worker);
-		worker->flags ^= IO_WORKER_F_BOUND;
-		wqe->acct[index].nr_workers--;
-		wqe->acct[index ^ 1].nr_workers++;
-		io_wqe_inc_running(worker);
-	 }
 }
 
 /*
@@ -419,44 +403,23 @@ static bool io_wait_on_hash(struct io_wqe *wqe, unsigned int hash)
 	return ret;
 }
 
-/*
- * We can always run the work if the worker is currently the same type as
- * the work (eg both are bound, or both are unbound). If they are not the
- * same, only allow it if incrementing the worker count would be allowed.
- */
-static bool io_worker_can_run_work(struct io_worker *worker,
-				   struct io_wq_work *work)
-{
-	struct io_wqe_acct *acct;
-
-	if (!(worker->flags & IO_WORKER_F_BOUND) !=
-	    !(work->flags & IO_WQ_WORK_UNBOUND))
-		return true;
-
-	/* not the same type, check if we'd go over the limit */
-	acct = io_work_get_acct(worker->wqe, work);
-	return acct->nr_workers < acct->max_workers;
-}
-
-static struct io_wq_work *io_get_next_work(struct io_wqe *wqe,
+static struct io_wq_work *io_get_next_work(struct io_wqe_acct *acct,
 					   struct io_worker *worker)
 	__must_hold(wqe->lock)
 {
 	struct io_wq_work_node *node, *prev;
 	struct io_wq_work *work, *tail;
 	unsigned int stall_hash = -1U;
+	struct io_wqe *wqe = worker->wqe;
 
-	wq_list_for_each(node, prev, &wqe->work_list) {
+	wq_list_for_each(node, prev, &acct->work_list) {
 		unsigned int hash;
 
 		work = container_of(node, struct io_wq_work, list);
 
-		if (!io_worker_can_run_work(worker, work))
-			break;
-
 		/* not hashed, can run anytime */
 		if (!io_wq_is_hashed(work)) {
-			wq_list_del(&wqe->work_list, node, prev);
+			wq_list_del(&acct->work_list, node, prev);
 			return work;
 		}
 
@@ -467,7 +430,7 @@ static struct io_wq_work *io_get_next_work(struct io_wqe *wqe,
 		/* hashed, can run if not already running */
 		if (!test_and_set_bit(hash, &wqe->wq->hash->map)) {
 			wqe->hash_tail[hash] = NULL;
-			wq_list_cut(&wqe->work_list, &tail->list, prev);
+			wq_list_cut(&acct->work_list, &tail->list, prev);
 			return work;
 		}
 		if (stall_hash == -1U)
@@ -483,12 +446,12 @@ static struct io_wq_work *io_get_next_work(struct io_wqe *wqe,
 		 * Set this before dropping the lock to avoid racing with new
 		 * work being added and clearing the stalled bit.
 		 */
-		wqe->flags |= IO_WQE_FLAG_STALLED;
+		set_bit(IO_ACCT_STALLED_BIT, &acct->flags);
 		raw_spin_unlock(&wqe->lock);
 		unstalled = io_wait_on_hash(wqe, stall_hash);
 		raw_spin_lock(&wqe->lock);
 		if (unstalled) {
-			wqe->flags &= ~IO_WQE_FLAG_STALLED;
+			clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);
 			if (wq_has_sleeper(&wqe->wq->hash->wait))
 				wake_up(&wqe->wq->hash->wait);
 		}
@@ -525,6 +488,7 @@ static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work);
 static void io_worker_handle_work(struct io_worker *worker)
 	__releases(wqe->lock)
 {
+	struct io_wqe_acct *acct = io_wqe_get_acct(worker);
 	struct io_wqe *wqe = worker->wqe;
 	struct io_wq *wq = wqe->wq;
 	bool do_kill = test_bit(IO_WQ_BIT_EXIT, &wq->state);
@@ -539,7 +503,7 @@ static void io_worker_handle_work(struct io_worker *worker)
 		 * can't make progress, any work completion or insertion will
 		 * clear the stalled flag.
 		 */
-		work = io_get_next_work(wqe, worker);
+		work = io_get_next_work(acct, worker);
 		if (work)
 			__io_worker_busy(wqe, worker, work);
 
@@ -575,7 +539,7 @@ static void io_worker_handle_work(struct io_worker *worker)
 				/* serialize hash clear with wake_up() */
 				spin_lock_irq(&wq->hash->wait.lock);
 				clear_bit(hash, &wq->hash->map);
-				wqe->flags &= ~IO_WQE_FLAG_STALLED;
+				clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);
 				spin_unlock_irq(&wq->hash->wait.lock);
 				if (wq_has_sleeper(&wq->hash->wait))
 					wake_up(&wq->hash->wait);
@@ -594,6 +558,7 @@ static void io_worker_handle_work(struct io_worker *worker)
 static int io_wqe_worker(void *data)
 {
 	struct io_worker *worker = data;
+	struct io_wqe_acct *acct = io_wqe_get_acct(worker);
 	struct io_wqe *wqe = worker->wqe;
 	struct io_wq *wq = wqe->wq;
 	char buf[TASK_COMM_LEN];
@@ -609,7 +574,7 @@ static int io_wqe_worker(void *data)
 		set_current_state(TASK_INTERRUPTIBLE);
 loop:
 		raw_spin_lock_irq(&wqe->lock);
-		if (io_wqe_run_queue(wqe)) {
+		if (io_acct_run_queue(acct)) {
 			io_worker_handle_work(worker);
 			goto loop;
 		}
@@ -777,12 +742,13 @@ static void io_run_cancel(struct io_wq_work *work, struct io_wqe *wqe)
 
 static void io_wqe_insert_work(struct io_wqe *wqe, struct io_wq_work *work)
 {
+	struct io_wqe_acct *acct = io_work_get_acct(wqe, work);
 	unsigned int hash;
 	struct io_wq_work *tail;
 
 	if (!io_wq_is_hashed(work)) {
 append:
-		wq_list_add_tail(&work->list, &wqe->work_list);
+		wq_list_add_tail(&work->list, &acct->work_list);
 		return;
 	}
 
@@ -792,7 +758,7 @@ static void io_wqe_insert_work(struct io_wqe *wqe, struct io_wq_work *work)
 	if (!tail)
 		goto append;
 
-	wq_list_add_after(&work->list, &tail->list, &wqe->work_list);
+	wq_list_add_after(&work->list, &tail->list, &acct->work_list);
 }
 
 static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
@@ -814,10 +780,10 @@ static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
 
 	raw_spin_lock_irqsave(&wqe->lock, flags);
 	io_wqe_insert_work(wqe, work);
-	wqe->flags &= ~IO_WQE_FLAG_STALLED;
+	clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);
 
 	rcu_read_lock();
-	do_create = !io_wqe_activate_free_worker(wqe);
+	do_create = !io_wqe_activate_free_worker(wqe, acct);
 	rcu_read_unlock();
 
 	raw_spin_unlock_irqrestore(&wqe->lock, flags);
@@ -870,6 +836,7 @@ static inline void io_wqe_remove_pending(struct io_wqe *wqe,
 					 struct io_wq_work *work,
 					 struct io_wq_work_node *prev)
 {
+	struct io_wqe_acct *acct = io_work_get_acct(wqe, work);
 	unsigned int hash = io_get_work_hash(work);
 	struct io_wq_work *prev_work = NULL;
 
@@ -881,7 +848,7 @@ static inline void io_wqe_remove_pending(struct io_wqe *wqe,
 		else
 			wqe->hash_tail[hash] = NULL;
 	}
-	wq_list_del(&wqe->work_list, &work->list, prev);
+	wq_list_del(&acct->work_list, &work->list, prev);
 }
 
 static void io_wqe_cancel_pending_work(struct io_wqe *wqe,
@@ -890,22 +857,27 @@ static void io_wqe_cancel_pending_work(struct io_wqe *wqe,
 	struct io_wq_work_node *node, *prev;
 	struct io_wq_work *work;
 	unsigned long flags;
+	int i;
 
 retry:
 	raw_spin_lock_irqsave(&wqe->lock, flags);
-	wq_list_for_each(node, prev, &wqe->work_list) {
-		work = container_of(node, struct io_wq_work, list);
-		if (!match->fn(work, match->data))
-			continue;
-		io_wqe_remove_pending(wqe, work, prev);
-		raw_spin_unlock_irqrestore(&wqe->lock, flags);
-		io_run_cancel(work, wqe);
-		match->nr_pending++;
-		if (!match->cancel_all)
-			return;
+	for (i = 0; i < IO_WQ_ACCT_NR; i++) {
+		struct io_wqe_acct *acct = io_get_acct(wqe, i == 0);
 
-		/* not safe to continue after unlock */
-		goto retry;
+		wq_list_for_each(node, prev, &acct->work_list) {
+			work = container_of(node, struct io_wq_work, list);
+			if (!match->fn(work, match->data))
+				continue;
+			io_wqe_remove_pending(wqe, work, prev);
+			raw_spin_unlock_irqrestore(&wqe->lock, flags);
+			io_run_cancel(work, wqe);
+			match->nr_pending++;
+			if (!match->cancel_all)
+				return;
+
+			/* not safe to continue after unlock */
+			goto retry;
+		}
 	}
 	raw_spin_unlock_irqrestore(&wqe->lock, flags);
 }
@@ -966,18 +938,24 @@ static int io_wqe_hash_wake(struct wait_queue_entry *wait, unsigned mode,
 			    int sync, void *key)
 {
 	struct io_wqe *wqe = container_of(wait, struct io_wqe, wait);
+	int i;
 
 	list_del_init(&wait->entry);
 
 	rcu_read_lock();
-	io_wqe_activate_free_worker(wqe);
+	for (i = 0; i < IO_WQ_ACCT_NR; i++) {
+		struct io_wqe_acct *acct = &wqe->acct[i];
+
+		if (test_and_clear_bit(IO_ACCT_STALLED_BIT, &acct->flags))
+			io_wqe_activate_free_worker(wqe, acct);
+	}
 	rcu_read_unlock();
 	return 1;
 }
 
 struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 {
-	int ret, node;
+	int ret, node, i;
 	struct io_wq *wq;
 
 	if (WARN_ON_ONCE(!data->free_work || !data->do_work))
@@ -1012,18 +990,20 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 		cpumask_copy(wqe->cpu_mask, cpumask_of_node(node));
 		wq->wqes[node] = wqe;
 		wqe->node = alloc_node;
-		wqe->acct[IO_WQ_ACCT_BOUND].index = IO_WQ_ACCT_BOUND;
-		wqe->acct[IO_WQ_ACCT_UNBOUND].index = IO_WQ_ACCT_UNBOUND;
 		wqe->acct[IO_WQ_ACCT_BOUND].max_workers = bounded;
-		atomic_set(&wqe->acct[IO_WQ_ACCT_BOUND].nr_running, 0);
 		wqe->acct[IO_WQ_ACCT_UNBOUND].max_workers =
 					task_rlimit(current, RLIMIT_NPROC);
-		atomic_set(&wqe->acct[IO_WQ_ACCT_UNBOUND].nr_running, 0);
-		wqe->wait.func = io_wqe_hash_wake;
 		INIT_LIST_HEAD(&wqe->wait.entry);
+		wqe->wait.func = io_wqe_hash_wake;
+		for (i = 0; i < IO_WQ_ACCT_NR; i++) {
+			struct io_wqe_acct *acct = &wqe->acct[i];
+
+			acct->index = i;
+			atomic_set(&acct->nr_running, 0);
+			INIT_WQ_LIST(&acct->work_list);
+		}
 		wqe->wq = wq;
 		raw_spin_lock_init(&wqe->lock);
-		INIT_WQ_LIST(&wqe->work_list);
 		INIT_HLIST_NULLS_HEAD(&wqe->free_list, 0);
 		INIT_LIST_HEAD(&wqe->all_list);
 	}
-- 
2.34.0
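
For quick reference, a condensed (non-compilable) sketch of the layout the
patch above produces; fields not visible in the diff are elided. Pending
work and the stalled state move from the shared io_wqe into each per-type
io_wqe_acct:

	struct io_wqe_acct {
		unsigned max_workers;
		int index;				/* IO_WQ_ACCT_BOUND or IO_WQ_ACCT_UNBOUND */
		atomic_t nr_running;
		struct io_wq_work_list work_list;	/* new: per-type pending list */
		unsigned long flags;			/* new: holds IO_ACCT_STALLED_BIT */
		/* ... */
	};

	struct io_wqe {
		raw_spinlock_t lock;
		struct io_wqe_acct acct[2];		/* IO_WQ_ACCT_NR entries */
		int node;
		/* ... free_list, all_list, hash_tail[], wait, wq ... */
	};

Enqueue and dequeue then operate on acct->work_list (io_wqe_insert_work(),
io_get_next_work()), so a stall on one accounting type no longer holds up
pending work of the other type.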


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: uring regression - lost write request
  2021-11-24 16:10                                           ` Jens Axboe
@ 2021-11-24 16:18                                             ` Greg Kroah-Hartman
  2021-11-24 16:22                                               ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Greg Kroah-Hartman @ 2021-11-24 16:18 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Daniel Black, Salvatore Bonaccorso, Pavel Begunkov, linux-block,
	io-uring, stable

On Wed, Nov 24, 2021 at 09:10:25AM -0700, Jens Axboe wrote:
> On 11/24/21 8:28 AM, Jens Axboe wrote:
> > On 11/23/21 8:27 PM, Daniel Black wrote:
> >> On Mon, Nov 15, 2021 at 7:55 AM Jens Axboe <axboe@kernel.dk> wrote:
> >>>
> >>> On 11/14/21 1:33 PM, Daniel Black wrote:
> >>>> On Fri, Nov 12, 2021 at 10:44 AM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>
> >>>>> Alright, give this one a go if you can. Against -git, but will apply to
> >>>>> 5.15 as well.
> >>>>
> >>>>
> >>>> Works. Thank you very much.
> >>>>
> >>>> https://jira.mariadb.org/browse/MDEV-26674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=205599#comment-205599
> >>>>
> >>>> Tested-by: Marko Mäkelä <marko.makela@mariadb.com>
> >>>
> >>> The patch is already upstream (and in the 5.15 stable queue), and I
> >>> provided 5.14 patches too.
> >>
> >> Jens,
> >>
> >> I'm getting the same reproducer on 5.14.20
> >> (https://bugzilla.redhat.com/show_bug.cgi?id=2018882#c3) though the
> >> backport change logs indicate 5.14.19 has the patch.
> >>
> >> Anything missing?
> > 
> > We might also need another patch that isn't in stable, I'm attaching
> > it here. Any chance you can run 5.14.20/21 with this applied? If not,
> > I'll do some sanity checking here and push it to -stable.
> 
> Looks good to me - Greg, would you mind queueing this up for
> 5.14-stable?

5.14 is end-of-life and not getting any more releases (the front page of
kernel.org should show that.)

If this needs to go anywhere else, please let me know.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: uring regression - lost write request
  2021-11-24 16:18                                             ` Greg Kroah-Hartman
@ 2021-11-24 16:22                                               ` Jens Axboe
  2021-11-24 22:52                                                 ` Stefan Metzmacher
  2021-11-24 22:57                                                 ` Daniel Black
  0 siblings, 2 replies; 36+ messages in thread
From: Jens Axboe @ 2021-11-24 16:22 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Daniel Black, Salvatore Bonaccorso, Pavel Begunkov, linux-block,
	io-uring, stable

On 11/24/21 9:18 AM, Greg Kroah-Hartman wrote:
> On Wed, Nov 24, 2021 at 09:10:25AM -0700, Jens Axboe wrote:
>> On 11/24/21 8:28 AM, Jens Axboe wrote:
>>> On 11/23/21 8:27 PM, Daniel Black wrote:
>>>> On Mon, Nov 15, 2021 at 7:55 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>
>>>>> On 11/14/21 1:33 PM, Daniel Black wrote:
>>>>>> On Fri, Nov 12, 2021 at 10:44 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>
>>>>>>> Alright, give this one a go if you can. Against -git, but will apply to
>>>>>>> 5.15 as well.
>>>>>>
>>>>>>
>>>>>> Works. Thank you very much.
>>>>>>
>>>>>> https://jira.mariadb.org/browse/MDEV-26674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=205599#comment-205599
>>>>>>
>>>>>> Tested-by: Marko Mäkelä <marko.makela@mariadb.com>
>>>>>
>>>>> The patch is already upstream (and in the 5.15 stable queue), and I
>>>>> provided 5.14 patches too.
>>>>
>>>> Jens,
>>>>
>>>> I'm getting the same reproducer on 5.14.20
>>>> (https://bugzilla.redhat.com/show_bug.cgi?id=2018882#c3) though the
>>>> backport change logs indicate 5.14.19 has the patch.
>>>>
>>>> Anything missing?
>>>
>>> We might also need another patch that isn't in stable, I'm attaching
>>> it here. Any chance you can run 5.14.20/21 with this applied? If not,
>>> I'll do some sanity checking here and push it to -stable.
>>
>> Looks good to me - Greg, would you mind queueing this up for
>> 5.14-stable?
> 
> 5.14 is end-of-life and not getting any more releases (the front page of
> kernel.org should show that.)

Oh, well I guess that settles that...

> If this needs to go anywhere else, please let me know.

Should be fine, previous 5.10 isn't affected and 5.15 is fine too as it
already has the patch.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: uring regression - lost write request
  2021-11-24 16:22                                               ` Jens Axboe
@ 2021-11-24 22:52                                                 ` Stefan Metzmacher
  2021-11-25  0:58                                                   ` Jens Axboe
  2021-11-24 22:57                                                 ` Daniel Black
  1 sibling, 1 reply; 36+ messages in thread
From: Stefan Metzmacher @ 2021-11-24 22:52 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman
  Cc: Daniel Black, Salvatore Bonaccorso, Pavel Begunkov, linux-block,
	io-uring, stable

Hi Jens,

>>> Looks good to me - Greg, would you mind queueing this up for
>>> 5.14-stable?
>>
>> 5.14 is end-of-life and not getting any more releases (the front page of
>> kernel.org should show that.)
> 
> Oh, well I guess that settles that...
> 
>> If this needs to go anywhere else, please let me know.
> 
> Should be fine, previous 5.10 isn't affected and 5.15 is fine too as it
> already has the patch.

Are 5.11 and 5.13 also affected? These are HWE kernels for Ubuntu;
I may need to open a bug for them...

Thanks!
metze

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: uring regression - lost write request
  2021-11-24 16:22                                               ` Jens Axboe
  2021-11-24 22:52                                                 ` Stefan Metzmacher
@ 2021-11-24 22:57                                                 ` Daniel Black
  1 sibling, 0 replies; 36+ messages in thread
From: Daniel Black @ 2021-11-24 22:57 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Greg Kroah-Hartman, Salvatore Bonaccorso, Pavel Begunkov,
	linux-block, io-uring, stable

On Thu, Nov 25, 2021 at 3:22 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 11/24/21 9:18 AM, Greg Kroah-Hartman wrote:
> > On Wed, Nov 24, 2021 at 09:10:25AM -0700, Jens Axboe wrote:
> >> On 11/24/21 8:28 AM, Jens Axboe wrote:
> >>> On 11/23/21 8:27 PM, Daniel Black wrote:
> >>>> On Mon, Nov 15, 2021 at 7:55 AM Jens Axboe <axboe@kernel.dk> wrote:

> >>>> I'm getting the same reproducer on 5.14.20
> >>>> (https://bugzilla.redhat.com/show_bug.cgi?id=2018882#c3) though the
> >>>> backport change logs indicate 5.14.19 has the patch.
> >>>>
> >>>> Anything missing?
> >>>
> >>> We might also need another patch that isn't in stable, I'm attaching
> >>> it here. Any chance you can run 5.14.20/21 with this applied? If not,
> >>> I'll do some sanity checking here and push it to -stable.
> >>
> >> Looks good to me - Greg, would you mind queueing this up for
> >> 5.14-stable?
> >
> > 5.14 is end-of-life and not getting any more releases (the front page of
> > kernel.org should show that.)
>
> Oh, well I guess that settles that...

Certainly does. Thanks for looking and finding the patch.

> > If this needs to go anywhere else, please let me know.
>
> Should be fine, previous 5.10 isn't affected and 5.15 is fine too as it
> already has the patch.

Thank you

https://github.com/MariaDB/server/commit/de7db5517de11a58d57d2a41d0bc6f38b6f92dd8

On Thu, Nov 25, 2021 at 9:52 AM Stefan Metzmacher <metze@samba.org> wrote:
> Are 5.11 and 5.13 also affected?

Yes.

> These are HWE kernels for Ubuntu;
> I may need to open a bug for them...

Yes please.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: uring regression - lost write request
  2021-11-24 22:52                                                 ` Stefan Metzmacher
@ 2021-11-25  0:58                                                   ` Jens Axboe
  2021-11-25 16:35                                                     ` Stefan Metzmacher
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2021-11-25  0:58 UTC (permalink / raw)
  To: Stefan Metzmacher, Greg Kroah-Hartman
  Cc: Daniel Black, Salvatore Bonaccorso, Pavel Begunkov, linux-block,
	io-uring, stable

On 11/24/21 3:52 PM, Stefan Metzmacher wrote:
> Hi Jens,
> 
>>>> Looks good to me - Greg, would you mind queueing this up for
>>>> 5.14-stable?
>>>
>>> 5.14 is end-of-life and not getting any more releases (the front page of
>>> kernel.org should show that.)
>>
>> Oh, well I guess that settles that...
>>
>>> If this needs to go anywhere else, please let me know.
>>
>> Should be fine, previous 5.10 isn't affected and 5.15 is fine too as it
>> already has the patch.
> 
> Are 5.11 and 5.13 also affected? These are HWE kernels for Ubuntu;
> I may need to open a bug for them...

Please do, then we can help get the appropriate patches lined up for
5.11/13. They should need the same set, basically what ended up in 5.14
plus the one I posted today.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: uring regression - lost write request
  2021-11-25  0:58                                                   ` Jens Axboe
@ 2021-11-25 16:35                                                     ` Stefan Metzmacher
  2021-11-25 17:11                                                       ` Jens Axboe
  2022-02-09 23:01                                                       ` Stefan Metzmacher
  0 siblings, 2 replies; 36+ messages in thread
From: Stefan Metzmacher @ 2021-11-25 16:35 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman
  Cc: Daniel Black, Salvatore Bonaccorso, Pavel Begunkov, linux-block,
	io-uring, stable

Am 25.11.21 um 01:58 schrieb Jens Axboe:
> On 11/24/21 3:52 PM, Stefan Metzmacher wrote:
>> Hi Jens,
>>
>>>>> Looks good to me - Greg, would you mind queueing this up for
>>>>> 5.14-stable?
>>>>
>>>> 5.14 is end-of-life and not getting any more releases (the front page of
>>>> kernel.org should show that.)
>>>
>>> Oh, well I guess that settles that...
>>>
>>>> If this needs to go anywhere else, please let me know.
>>>
>>> Should be fine, previous 5.10 isn't affected and 5.15 is fine too as it
>>> already has the patch.
>>
>> Are 5.11 and 5.13 also affected? These are HWE kernels for Ubuntu;
>> I may need to open a bug for them...
> 
> Please do, then we can help get the appropriate patches lined up for
> 5.11/13. They should need the same set, basically what ended up in 5.14
> plus the one I posted today.

Ok, I've created https://bugs.launchpad.net/bugs/1952222

Let's see what happens...

metze


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: uring regression - lost write request
  2021-11-25 16:35                                                     ` Stefan Metzmacher
@ 2021-11-25 17:11                                                       ` Jens Axboe
  2022-02-09 23:01                                                       ` Stefan Metzmacher
  1 sibling, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2021-11-25 17:11 UTC (permalink / raw)
  To: Stefan Metzmacher, Greg Kroah-Hartman
  Cc: Daniel Black, Salvatore Bonaccorso, Pavel Begunkov, linux-block,
	io-uring, stable

On 11/25/21 9:35 AM, Stefan Metzmacher wrote:
> Am 25.11.21 um 01:58 schrieb Jens Axboe:
>> On 11/24/21 3:52 PM, Stefan Metzmacher wrote:
>>> Hi Jens,
>>>
>>>>>> Looks good to me - Greg, would you mind queueing this up for
>>>>>> 5.14-stable?
>>>>>
>>>>> 5.14 is end-of-life and not getting any more releases (the front page of
>>>>> kernel.org should show that.)
>>>>
>>>> Oh, well I guess that settles that...
>>>>
>>>>> If this needs to go anywhere else, please let me know.
>>>>
>>>> Should be fine, previous 5.10 isn't affected and 5.15 is fine too as it
>>>> already has the patch.
>>>
>>> Are 5.11 and 5.13 also affected? These are HWE kernels for Ubuntu;
>>> I may need to open a bug for them...
>>
>> Please do, then we can help get the appropriate patches lined up for
>> 5.11/13. They should need the same set, basically what ended up in 5.14
>> plus the one I posted today.
> 
> Ok, I've created https://bugs.launchpad.net/bugs/1952222
> 
> Let's see what happens...

Let me know if I can help. I should probably prepare a set for 5.11-stable
and 5.13-stable, but I don't know whether the above kernels already have some
patches applied past the last stable release of each...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: uring regression - lost write request
  2021-11-25 16:35                                                     ` Stefan Metzmacher
  2021-11-25 17:11                                                       ` Jens Axboe
@ 2022-02-09 23:01                                                       ` Stefan Metzmacher
  2022-02-10  0:10                                                         ` Daniel Black
  1 sibling, 1 reply; 36+ messages in thread
From: Stefan Metzmacher @ 2022-02-09 23:01 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman
  Cc: Daniel Black, Salvatore Bonaccorso, Pavel Begunkov, linux-block,
	io-uring, stable


Hi Jens,

>>>>>> Looks good to me - Greg, would you mind queueing this up for
>>>>>> 5.14-stable?
>>>>>
>>>>> 5.14 is end-of-life and not getting any more releases (the front page of
>>>>> kernel.org should show that.)
>>>>
>>>> Oh, well I guess that settles that...
>>>>
>>>>> If this needs to go anywhere else, please let me know.
>>>>
>>>> Should be fine, previous 5.10 isn't affected and 5.15 is fine too as it
>>>> already has the patch.
>>>
>>> Are 5.11 and 5.13 also affected? These are HWE kernels for Ubuntu;
>>> I may need to open a bug for them...
>>
>> Please do, then we can help get the appropriate patches lined up for
>> 5.11/13. They should need the same set, basically what ended up in 5.14
>> plus the one I posted today.
> 
> Ok, I've created https://bugs.launchpad.net/bugs/1952222

At least for 5.14 the patch is included in

https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-oem/+git/focal/log/?h=Ubuntu-oem-5.14-5.14.0-1023.25

https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-oem/+git/focal/commit/?h=Ubuntu-oem-5.14-5.14.0-1023.25&id=9e2b95e7c9dd103297e6a3ccd98a7bf11ef66921

apt-get install -V -t focal-proposed linux-oem-20.04d linux-tools-oem-20.04d
installs linux-image-5.14.0-1023-oem (5.14.0-1023.25)

Do we have any reproducer I can use to reproduce the problem
and demonstrate that the bug is fixed?

Thanks!
metze

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: uring regression - lost write request
  2022-02-09 23:01                                                       ` Stefan Metzmacher
@ 2022-02-10  0:10                                                         ` Daniel Black
  0 siblings, 0 replies; 36+ messages in thread
From: Daniel Black @ 2022-02-10  0:10 UTC (permalink / raw)
  To: Stefan Metzmacher
  Cc: Jens Axboe, Greg Kroah-Hartman, Salvatore Bonaccorso,
	Pavel Begunkov, linux-block, io-uring, stable

Stefan,

On Thu, Feb 10, 2022 at 10:01 AM Stefan Metzmacher <metze@samba.org> wrote:
> > Ok, I've created https://bugs.launchpad.net/bugs/1952222
>
> At least for 5.14 the patch is included in
>
> https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-oem/+git/focal/log/?h=Ubuntu-oem-5.14-5.14.0-1023.25
>
> https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-oem/+git/focal/commit/?h=Ubuntu-oem-5.14-5.14.0-1023.25&id=9e2b95e7c9dd103297e6a3ccd98a7bf11ef66921
>
> apt-get install -V -t focal-proposed linux-oem-20.04d linux-tools-oem-20.04d
> installs linux-image-5.14.0-1023-oem (5.14.0-1023.25)

Thanks!

> Do we have any reproducer I can use to reproduce the problem
> and demonstrate that the bug is fixed?
>

The original container and test from
https://lore.kernel.org/linux-block/CABVffEOpuViC9OyOuZg28sRfGK4GRc8cV0CnkOU2cM0RJyRhPw@mail.gmail.com/
will be sufficient.
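
Not a substitute for that test, but for context, here is a minimal liburing
sketch (an illustrative assumption, not code from this thread) of the
submit-then-wait pattern involved; the calls are standard liburing APIs, and
this trivial single write is not expected to trigger the race on its own:

	/* Illustrative only: one write submitted through io_uring, then a
	 * blocking wait for its completion.  Build with: gcc demo.c -luring
	 * (demo.c is a placeholder name). */
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <liburing.h>

	int main(void)
	{
		struct io_uring ring;
		struct io_uring_sqe *sqe;
		struct io_uring_cqe *cqe;
		char buf[4096];
		int fd, ret;

		memset(buf, 'x', sizeof(buf));
		fd = open("io_uring_demo.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
		if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
			return 1;

		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_write(sqe, fd, buf, sizeof(buf), 0);
		io_uring_submit(&ring);

		/* the wait that could hang on affected kernels if the
		 * completion for the punted write was lost */
		ret = io_uring_wait_cqe(&ring, &cqe);
		if (ret == 0) {
			printf("write completed, res=%d\n", cqe->res);
			io_uring_cqe_seen(&ring, cqe);
		}

		io_uring_queue_exit(&ring);
		close(fd);
		return 0;
	}

On an affected kernel the stress workload could leave a wait like this stuck
because the completion for a punted write was lost; with the fixes discussed
above applied, the wait returns and cqe->res reports the bytes written.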

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2022-02-10  1:56 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-22  3:12 uring regression - lost write request Daniel Black
2021-10-22  9:10 ` Pavel Begunkov
2021-10-25  9:57   ` Pavel Begunkov
2021-10-25 11:09     ` Daniel Black
2021-10-25 11:25       ` Pavel Begunkov
2021-10-30  7:30         ` Salvatore Bonaccorso
2021-11-01  7:28           ` Daniel Black
2021-11-09 22:58             ` Daniel Black
2021-11-09 23:24               ` Jens Axboe
2021-11-10 18:01                 ` Jens Axboe
2021-11-11  6:52                   ` Daniel Black
2021-11-11 14:30                     ` Jens Axboe
2021-11-11 14:58                       ` Jens Axboe
2021-11-11 15:29                         ` Jens Axboe
2021-11-11 16:19                           ` Jens Axboe
2021-11-11 16:55                             ` Jens Axboe
2021-11-11 17:28                               ` Jens Axboe
2021-11-11 23:44                                 ` Jens Axboe
2021-11-12  6:25                                   ` Daniel Black
2021-11-12 19:19                                     ` Salvatore Bonaccorso
2021-11-14 20:33                                   ` Daniel Black
2021-11-14 20:55                                     ` Jens Axboe
2021-11-14 21:02                                       ` Salvatore Bonaccorso
2021-11-14 21:03                                         ` Jens Axboe
2021-11-24  3:27                                       ` Daniel Black
2021-11-24 15:28                                         ` Jens Axboe
2021-11-24 16:10                                           ` Jens Axboe
2021-11-24 16:18                                             ` Greg Kroah-Hartman
2021-11-24 16:22                                               ` Jens Axboe
2021-11-24 22:52                                                 ` Stefan Metzmacher
2021-11-25  0:58                                                   ` Jens Axboe
2021-11-25 16:35                                                     ` Stefan Metzmacher
2021-11-25 17:11                                                       ` Jens Axboe
2022-02-09 23:01                                                       ` Stefan Metzmacher
2022-02-10  0:10                                                         ` Daniel Black
2021-11-24 22:57                                                 ` Daniel Black
