linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [BUG BISECT] NFSv4 client fails on Flush Journal to Persistent Storage
@ 2018-06-07 11:19 Krzysztof Kozlowski
  2018-06-07 11:22 ` Krzysztof Kozlowski
  2018-06-15 12:53 ` Sudeep Holla
  0 siblings, 2 replies; 11+ messages in thread
From: Krzysztof Kozlowski @ 2018-06-07 11:19 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, J. Bruce Fields, Jeff Layton,
	David S. Miller, linux-nfs, netdev, linux-kernel,
	linux-samsung-soc

[-- Attachment #1: Type: text/plain, Size: 1514 bytes --]

Hi,

When booting my boards under recent linux-next, I see failures of systemd:

[FAILED] Failed to start Flush Journal to Persistent Storage.
See 'systemctl status systemd-journal-flush.service' for details.
         Starting Create Volatile Files and Directories...
[**    ] A start job is running for Create V… [  223.209289] nfs:
server 192.168.1.10 not responding, still trying
[  223.209377] nfs: server 192.168.1.10 not responding, still trying

Effectively the boards fails to boot. Example is here:
https://krzk.eu/#/builders/1/builds/2157

This was bisected to:
commit 37ac86c3a76c113619b7d9afe0251bbfc04cb80a
Author: Chuck Lever <chuck.lever@oracle.com>
Date:   Fri May 4 15:34:53 2018 -0400

    SUNRPC: Initialize rpc_rqst outside of xprt->reserve_lock

    alloc_slot is a transport-specific op, but initializing an rpc_rqst
    is common to all transports. In addition, the only part of initial-
    izing an rpc_rqst that needs serialization is getting a fresh XID.

    Move rpc_rqst initialization to common code in preparation for
    adding a transport-specific alloc_slot to xprtrdma.

    Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
    Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>


Bisect log attached. Full configuration:
1. exynos_defconfig
2. ARMv7, octa-core, Exynos5422 and Exynos4412 (Odroid XU3, U3 and others)
3. NFSv4 client (from Raspberry Pi)

Let me know if you need any more information.

Best regards,
Krzysztof

[-- Attachment #2: log.txt --]
[-- Type: text/plain, Size: 5666 bytes --]

git bisect start
# good: [c64d4419a17cfb39a5b573f9016cd02ade4c9a64] mtd: cfi_cmdset_0002: Change erase one block to enable XIP once
git bisect good c64d4419a17cfb39a5b573f9016cd02ade4c9a64
# good: [56c6855c81c8a6828b5d65aa974cd50f4b67760c] mtd: spi-nor: Add Micron MT25QU02 support
git bisect good 56c6855c81c8a6828b5d65aa974cd50f4b67760c
# bad: [2b70a7bd4673b6dcf2763888d0b172a40dd49434] Merge remote-tracking branch 'block/for-next'
git bisect bad 2b70a7bd4673b6dcf2763888d0b172a40dd49434
# bad: [25350cbeef4a3ec943754c4aa4c8ac1aaa64c7a2] Merge remote-tracking branch 'nand/nand/next'
git bisect bad 25350cbeef4a3ec943754c4aa4c8ac1aaa64c7a2
# good: [311da4975894aab7a4bb94aa83f38f052d7ffda4] Merge branch 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm
git bisect good 311da4975894aab7a4bb94aa83f38f052d7ffda4
# bad: [7d470f95b4ccd3244759524f6cc9a76a2a34a9c6] Merge remote-tracking branch 'printk/for-next'
git bisect bad 7d470f95b4ccd3244759524f6cc9a76a2a34a9c6
# good: [8a3ab2f38f1669e3be6433a1f6b82a077b38c4c7] brcmfmac: trigger memory dump upon firmware halt signal
git bisect good 8a3ab2f38f1669e3be6433a1f6b82a077b38c4c7
# good: [7fa76d777ec53eeece1546b737a3b93b37639575] netdevsim: Add extack error message for devlink reload
git bisect good 7fa76d777ec53eeece1546b737a3b93b37639575
# bad: [1fe087da59fde2cce4cf8e0a1c78b3279fb7ea44] Merge remote-tracking branch 'fbdev/fbdev-for-next'
git bisect bad 1fe087da59fde2cce4cf8e0a1c78b3279fb7ea44
# bad: [9d85b2fee6616994759eb4599c886af510ded175] Merge remote-tracking branch 'pci/next'
git bisect bad 9d85b2fee6616994759eb4599c886af510ded175
# good: [13fbadcd512c225c907d6e8147fb48a88114bf03] Merge branch 'pci/sparc'
git bisect good 13fbadcd512c225c907d6e8147fb48a88114bf03
# good: [741f8e7ecc2c6414cff442ec8eb07dcfe4481533] Merge branch 'lorenzo/pci/hv'
git bisect good 741f8e7ecc2c6414cff442ec8eb07dcfe4481533
# good: [1c2bef0a3fd14287a66edd7ead57fd2e439485a2] Merge branch 'lorenzo/pci/rcar'
git bisect good 1c2bef0a3fd14287a66edd7ead57fd2e439485a2
# good: [e52d38f4abf49f8b63a6ad0ce21e5f495c15897f] Merge branch 'lorenzo/pci/rockchip'
git bisect good e52d38f4abf49f8b63a6ad0ce21e5f495c15897f
# good: [73144d77cb87d60b4bcab6992a62d6787b09dcf0] Merge branch 'lorenzo/pci/vmd'
git bisect good 73144d77cb87d60b4bcab6992a62d6787b09dcf0
# good: [82e1719c4cd65bd7f7847d6c02376cfca3d5e793] PCI: Clean up whitespace in quirks.c
git bisect good 82e1719c4cd65bd7f7847d6c02376cfca3d5e793
# good: [0ecda3a087462eb89c1d9227deea998d8cd014e8] Merge branch 'pci/kconfig'
git bisect good 0ecda3a087462eb89c1d9227deea998d8cd014e8
# good: [488ad6d3678beee65bcd74e6a9764bd7cee9d3d3] Merge branch 'pci/trivial'
git bisect good 488ad6d3678beee65bcd74e6a9764bd7cee9d3d3
# good: [885892fb378dc096693557ba4f2b875188619b36] mlx4_core: restore optimal ICM memory allocation
git bisect good 885892fb378dc096693557ba4f2b875188619b36
# good: [488ad6d3678beee65bcd74e6a9764bd7cee9d3d3] Merge branch 'pci/trivial'
git bisect good 488ad6d3678beee65bcd74e6a9764bd7cee9d3d3
# good: [0ecda3a087462eb89c1d9227deea998d8cd014e8] Merge branch 'pci/kconfig'
git bisect good 0ecda3a087462eb89c1d9227deea998d8cd014e8
# good: [82e1719c4cd65bd7f7847d6c02376cfca3d5e793] PCI: Clean up whitespace in quirks.c
git bisect good 82e1719c4cd65bd7f7847d6c02376cfca3d5e793
# good: [1c2bef0a3fd14287a66edd7ead57fd2e439485a2] Merge branch 'lorenzo/pci/rcar'
git bisect good 1c2bef0a3fd14287a66edd7ead57fd2e439485a2
# good: [87cb5ac9cece9f0f75d0532fa2afcbf871f6b72e] Merge remote-tracking branch 'arm-soc/for-next'
git bisect good 87cb5ac9cece9f0f75d0532fa2afcbf871f6b72e
# good: [64eec192d609531b0c173c5b4885832372fb2a4c] Merge remote-tracking branch 'powerpc/next'
git bisect good 64eec192d609531b0c173c5b4885832372fb2a4c
# good: [dd8c1fd2071de3a02ea60e3fa68be24c1e89945e] Merge remote-tracking branch 'jfs/jfs-next'
git bisect good dd8c1fd2071de3a02ea60e3fa68be24c1e89945e
# bad: [6125a3dab21f1939dae7e836105dea0c9c465db4] Merge remote-tracking branch 'orangefs/for-next'
git bisect bad 6125a3dab21f1939dae7e836105dea0c9c465db4
# good: [3f0b3cf46e0542ac4b4241c579b944b755d11b67] NFS: Filter cache invalidation when holding a delegation
git bisect good 3f0b3cf46e0542ac4b4241c579b944b755d11b67
# good: [28771950c592482ee86cb1c3b661688aec3c0d7d] svcrdma: Fix incorrect return value/type in svc_rdma_post_recvs
git bisect good 28771950c592482ee86cb1c3b661688aec3c0d7d
# bad: [8335640cf89faa0f4e39e73e314f3f3a22d776f3] xprtrdma: Add trace_xprtrdma_dma_map(mr)
git bisect bad 8335640cf89faa0f4e39e73e314f3f3a22d776f3
# bad: [0e0b854cfb3302b1907e9d3a927469b95710238f] xprtrdma: Clean up Receive trace points
git bisect bad 0e0b854cfb3302b1907e9d3a927469b95710238f
# bad: [0e0b854cfb3302b1907e9d3a927469b95710238f] xprtrdma: Clean up Receive trace points
git bisect bad 0e0b854cfb3302b1907e9d3a927469b95710238f
# good: [75bc37fefc4471e718ba8e651aa74673d4e0a9eb] Linux 4.17-rc4
git bisect good 75bc37fefc4471e718ba8e651aa74673d4e0a9eb
# bad: [0e0b854cfb3302b1907e9d3a927469b95710238f] xprtrdma: Clean up Receive trace points
git bisect bad 0e0b854cfb3302b1907e9d3a927469b95710238f
# good: [914fcad9873cbd46e3a4c3c31551b98b15a49079] xprtrdma: Fix max_send_wr computation
git bisect good 914fcad9873cbd46e3a4c3c31551b98b15a49079
# bad: [a9cde23ab7cdf5e4e93432dffd0e734267f2b745] SUNRPC: Add a ->free_slot transport callout
git bisect bad a9cde23ab7cdf5e4e93432dffd0e734267f2b745
# bad: [37ac86c3a76c113619b7d9afe0251bbfc04cb80a] SUNRPC: Initialize rpc_rqst outside of xprt->reserve_lock
git bisect bad 37ac86c3a76c113619b7d9afe0251bbfc04cb80a
# first bad commit: [37ac86c3a76c113619b7d9afe0251bbfc04cb80a] SUNRPC: Initialize rpc_rqst outside of xprt->reserve_lock

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [BUG BISECT] NFSv4 client fails on Flush Journal to Persistent Storage
  2018-06-07 11:19 [BUG BISECT] NFSv4 client fails on Flush Journal to Persistent Storage Krzysztof Kozlowski
@ 2018-06-07 11:22 ` Krzysztof Kozlowski
  2018-06-15 12:53 ` Sudeep Holla
  1 sibling, 0 replies; 11+ messages in thread
From: Krzysztof Kozlowski @ 2018-06-07 11:22 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, J. Bruce Fields, Jeff Layton,
	David S. Miller, linux-nfs, netdev, linux-kernel,
	linux-samsung-soc

On Thu, Jun 7, 2018 at 1:19 PM, Krzysztof Kozlowski <krzk@kernel.org> wrote:
> Hi,
>
> When booting my boards under recent linux-next, I see failures of systemd:
>
> [FAILED] Failed to start Flush Journal to Persistent Storage.
> See 'systemctl status systemd-journal-flush.service' for details.
>          Starting Create Volatile Files and Directories...
> [**    ] A start job is running for Create V… [  223.209289] nfs:
> server 192.168.1.10 not responding, still trying
> [  223.209377] nfs: server 192.168.1.10 not responding, still trying
>
> Effectively the boards fails to boot. Example is here:
> https://krzk.eu/#/builders/1/builds/2157
>
> This was bisected to:
> commit 37ac86c3a76c113619b7d9afe0251bbfc04cb80a
> Author: Chuck Lever <chuck.lever@oracle.com>
> Date:   Fri May 4 15:34:53 2018 -0400
>
>     SUNRPC: Initialize rpc_rqst outside of xprt->reserve_lock
>
>     alloc_slot is a transport-specific op, but initializing an rpc_rqst
>     is common to all transports. In addition, the only part of initial-
>     izing an rpc_rqst that needs serialization is getting a fresh XID.
>
>     Move rpc_rqst initialization to common code in preparation for
>     adding a transport-specific alloc_slot to xprtrdma.
>
>     Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>     Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
>
>
> Bisect log attached. Full configuration:
> 1. exynos_defconfig
> 2. ARMv7, octa-core, Exynos5422 and Exynos4412 (Odroid XU3, U3 and others)
> 3. NFSv4 client (from Raspberry Pi)
>
> Let me know if you need any more information.

Ah, I forgot maybe the most important information in reproducment -
client uses NFS root (NFSv4).

Best regards,
Krzysztof

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [BUG BISECT] NFSv4 client fails on Flush Journal to Persistent Storage
  2018-06-07 11:19 [BUG BISECT] NFSv4 client fails on Flush Journal to Persistent Storage Krzysztof Kozlowski
  2018-06-07 11:22 ` Krzysztof Kozlowski
@ 2018-06-15 12:53 ` Sudeep Holla
  2018-06-15 14:07   ` Krzysztof Kozlowski
  1 sibling, 1 reply; 11+ messages in thread
From: Sudeep Holla @ 2018-06-15 12:53 UTC (permalink / raw)
  To: Krzysztof Kozlowski, Trond Myklebust, linux-nfs
  Cc: Anna Schumaker, J. Bruce Fields, Jeff Layton, David S. Miller,
	netdev, open list, linux-samsung-soc, Sudeep Holla

Hi,

On Thu, Jun 7, 2018 at 12:19 PM, Krzysztof Kozlowski <krzk@kernel.org> wrote:
> Hi,
>
> When booting my boards under recent linux-next, I see failures of systemd:
>
> [FAILED] Failed to start Flush Journal to Persistent Storage.
> See 'systemctl status systemd-journal-flush.service' for details.
>          Starting Create Volatile Files and Directories...
> [**    ] A start job is running for Create V… [  223.209289] nfs:
> server 192.168.1.10 not responding, still trying
> [  223.209377] nfs: server 192.168.1.10 not responding, still trying
>
> Effectively the boards fails to boot. Example is here:
> https://krzk.eu/#/builders/1/builds/2157
>

I too encountered the same issue.

> This was bisected to:
> commit 37ac86c3a76c113619b7d9afe0251bbfc04cb80a
> Author: Chuck Lever <chuck.lever@oracle.com>
> Date:   Fri May 4 15:34:53 2018 -0400
>
>     SUNRPC: Initialize rpc_rqst outside of xprt->reserve_lock
>
>     alloc_slot is a transport-specific op, but initializing an rpc_rqst
>     is common to all transports. In addition, the only part of initial-
>     izing an rpc_rqst that needs serialization is getting a fresh XID.
>
>     Move rpc_rqst initialization to common code in preparation for
>     adding a transport-specific alloc_slot to xprtrdma.
>
>     Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>     Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
>

Unfortunately, spent time to bisect independently without seeing this
report and got the same culprit.

>
> Bisect log attached. Full configuration:
> 1. exynos_defconfig
> 2. ARMv7, octa-core, Exynos5422 and Exynos4412 (Odroid XU3, U3 and others)
> 3. NFSv4 client (from Raspberry Pi)
>

Yes the issue is seen only with NFSv4 client and with latest systemd I think.
My Ubuntu 16.04(32bit FS) is  boots fine while 18.04 has the above issue.
Passing nfsv3 in kernel command line makes it work again.

> Let me know if you need any more information.
>

Also I was observing this issue with Linus master branch from
the time the above patch was merged until today. The issue
is no longer seen since this morning however I just enabled lockdep
and got these messages.

---->8

DEBUG_LOCKS_WARN_ON(sem->owner != ((struct task_struct *)(1UL << 0)))
WARNING: CPU: 2 PID: 74 at kernel/locking/rwsem.c:217
up_read_non_owner+0x78/0x90
Modules linked in:
CPU: 2 PID: 74 Comm: kworker/2:1 Not tainted 4.17.0-10597-g4c5e8fc62d6a #40
Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno
Development Platform, BIOS EDK II Jun 14 2018
Workqueue: nfsiod rpc_async_release
pstate: 40000005 (nZcv daif -PAN -UAO)
pc : up_read_non_owner+0x78/0x90
lr : up_read_non_owner+0x78/0x90
Call trace:
 up_read_non_owner+0x78/0x90
 nfs_async_unlink_release+0x2c/0x88
 rpc_free_task+0x28/0x48
 rpc_async_release+0x10/0x18
 process_one_work+0x29c/0x7a0
 worker_thread+0x40/0x458
 kthread+0x12c/0x130
 ret_from_fork+0x10/0x18
irq event stamp: 64737
hardirqs last  enabled at (64737): _raw_spin_unlock_irq+0x2c/0x60
hardirqs last disabled at (64736): _raw_spin_lock_irq+0x1c/0x60
softirqs last  enabled at (64728): srcu_invoke_callbacks+0xec/0x1a8
softirqs last disabled at (64724): srcu_invoke_callbacks+0xec/0x1a8


DEBUG_LOCKS_WARN_ON(sem->owner != ((struct task_struct *)(1UL << 0)))
WARNING: CPU: 2 PID: 1517 at kernel/locking/rwsem.c:217
up_read_non_owner+0x78/0x90
Modules linked in:
CPU: 2 PID: 1517 Comm: kworker/2:2 Not tainted
4.17.0-10597-g4c5e8fc62d6a-dirty #42
Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno
Development Platform, BIOS EDK II Jun 14 2018
Workqueue: nfsiod rpc_async_release
pstate: 40000005 (nZcv daif -PAN -UAO)
pc : up_read_non_owner+0x78/0x90
lr : up_read_non_owner+0x78/0x90
Call trace:
 up_read_non_owner+0x78/0x90
 nfs_async_unlink_release+0x2c/0x88
 rpc_free_task+0x28/0x48
 rpc_async_release+0x10/0x18
 process_one_work+0x258/0x400
 worker_thread+0x40/0x448
 kthread+0x11c/0x120
 ret_from_fork+0x10/0x18
irq event stamp: 69059
hardirqs last  enabled at (69059): _raw_spin_unlock_irq+0x2c/0x60
hardirqs last disabled at (69058): _raw_spin_lock_irq+0x1c/0x60
softirqs last  enabled at (68194): css_release_work_fn+0x188/0x1c0
softirqs last disabled at (68192): css_release_work_fn+0x170/0x1c0

Regards,
Sudeep

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [BUG BISECT] NFSv4 client fails on Flush Journal to Persistent Storage
  2018-06-15 12:53 ` Sudeep Holla
@ 2018-06-15 14:07   ` Krzysztof Kozlowski
  2018-06-15 14:10     ` Krzysztof Kozlowski
  2018-06-15 14:23     ` Chuck Lever
  0 siblings, 2 replies; 11+ messages in thread
From: Krzysztof Kozlowski @ 2018-06-15 14:07 UTC (permalink / raw)
  To: Sudeep Holla
  Cc: Trond Myklebust, linux-nfs, Anna Schumaker, J. Bruce Fields,
	Jeff Layton, David S. Miller, netdev, open list,
	linux-samsung-soc

On Fri, Jun 15, 2018 at 2:53 PM, Sudeep Holla <sudeep.holla@arm.com> wrote:
> Hi,
>
> On Thu, Jun 7, 2018 at 12:19 PM, Krzysztof Kozlowski <krzk@kernel.org> wrote:
>> Hi,
>>
>> When booting my boards under recent linux-next, I see failures of systemd:
>>
>> [FAILED] Failed to start Flush Journal to Persistent Storage.
>> See 'systemctl status systemd-journal-flush.service' for details.
>>          Starting Create Volatile Files and Directories...
>> [**    ] A start job is running for Create V… [  223.209289] nfs:
>> server 192.168.1.10 not responding, still trying
>> [  223.209377] nfs: server 192.168.1.10 not responding, still trying
>>
>> Effectively the boards fails to boot. Example is here:
>> https://krzk.eu/#/builders/1/builds/2157
>>
>
> I too encountered the same issue.
>
>> This was bisected to:
>> commit 37ac86c3a76c113619b7d9afe0251bbfc04cb80a
>> Author: Chuck Lever <chuck.lever@oracle.com>
>> Date:   Fri May 4 15:34:53 2018 -0400
>>
>>     SUNRPC: Initialize rpc_rqst outside of xprt->reserve_lock
>>
>>     alloc_slot is a transport-specific op, but initializing an rpc_rqst
>>     is common to all transports. In addition, the only part of initial-
>>     izing an rpc_rqst that needs serialization is getting a fresh XID.
>>
>>     Move rpc_rqst initialization to common code in preparation for
>>     adding a transport-specific alloc_slot to xprtrdma.
>>
>>     Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>     Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
>>
>
> Unfortunately, spent time to bisect independently without seeing this
> report and got the same culprit.
>
>>
>> Bisect log attached. Full configuration:
>> 1. exynos_defconfig
>> 2. ARMv7, octa-core, Exynos5422 and Exynos4412 (Odroid XU3, U3 and others)
>> 3. NFSv4 client (from Raspberry Pi)
>>
>
> Yes the issue is seen only with NFSv4 client and with latest systemd I think.
> My Ubuntu 16.04(32bit FS) is  boots fine while 18.04 has the above issue.
> Passing nfsv3 in kernel command line makes it work again.

Thanks for reply!

I test it on systemd versions 236 and 238... and it fails on both.
However one board passes always - it is Odroid HC1 with same core
configuration as described before. Probably there is some different SW
package on it.

>> Let me know if you need any more information.
>>
>
> Also I was observing this issue with Linus master branch from
> the time the above patch was merged until today. The issue
> is no longer seen since this morning however I just enabled lockdep
> and got these messages.

All recent linux-next fail. Today's Linus' tree (4c5e8fc62d6a ("Merge
tag 'linux-kselftest-4.18-rc1-2' of
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest"))
managed to get up on one board but stuck on different board with the
same issue.

I am quite surprised that there is no response from the author of the
commit and this was just moved from next (while failing) to Linus'
tree... bringing the issue to mainline now.

Best regards,
Krzysztof

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [BUG BISECT] NFSv4 client fails on Flush Journal to Persistent Storage
  2018-06-15 14:07   ` Krzysztof Kozlowski
@ 2018-06-15 14:10     ` Krzysztof Kozlowski
  2018-06-15 14:23     ` Chuck Lever
  1 sibling, 0 replies; 11+ messages in thread
From: Krzysztof Kozlowski @ 2018-06-15 14:10 UTC (permalink / raw)
  To: Sudeep Holla
  Cc: Trond Myklebust, linux-nfs, Anna Schumaker, J. Bruce Fields,
	Jeff Layton, David S. Miller, netdev, open list,
	linux-samsung-soc

On Fri, Jun 15, 2018 at 4:07 PM, Krzysztof Kozlowski <krzk@kernel.org> wrote:
>
> I test it on systemd versions 236 and 238... and it fails on both.
> However one board passes always - it is Odroid HC1 with same core
> configuration as described before. Probably there is some different SW
> package on it.
>
>>> Let me know if you need any more information.
>>>
>>
>> Also I was observing this issue with Linus master branch from
>> the time the above patch was merged until today. The issue
>> is no longer seen since this morning however I just enabled lockdep
>> and got these messages.

About this lockdep error - I see it for very long time and it appears
always (see end of serial log here:
https://krzk.eu/#/builders/21/builds/707/steps/10/logs/serial0). It
started much before this issue came so I am not sure whether it is
related.

Best regards,
Krzysztof

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [BUG BISECT] NFSv4 client fails on Flush Journal to Persistent Storage
  2018-06-15 14:07   ` Krzysztof Kozlowski
  2018-06-15 14:10     ` Krzysztof Kozlowski
@ 2018-06-15 14:23     ` Chuck Lever
  2018-06-15 14:28       ` Krzysztof Kozlowski
  1 sibling, 1 reply; 11+ messages in thread
From: Chuck Lever @ 2018-06-15 14:23 UTC (permalink / raw)
  To: Krzysztof Kozlowski
  Cc: Sudeep Holla, Trond Myklebust, Linux NFS Mailing List,
	Anna Schumaker, Bruce Fields, Jeff Layton, David S. Miller,
	netdev, open list, linux-samsung-soc



> On Jun 15, 2018, at 10:07 AM, Krzysztof Kozlowski <krzk@kernel.org> wrote:
> 
> On Fri, Jun 15, 2018 at 2:53 PM, Sudeep Holla <sudeep.holla@arm.com> wrote:
>> Hi,
>> 
>> On Thu, Jun 7, 2018 at 12:19 PM, Krzysztof Kozlowski <krzk@kernel.org> wrote:
>>> Hi,
>>> 
>>> When booting my boards under recent linux-next, I see failures of systemd:
>>> 
>>> [FAILED] Failed to start Flush Journal to Persistent Storage.
>>> See 'systemctl status systemd-journal-flush.service' for details.
>>>         Starting Create Volatile Files and Directories...
>>> [**    ] A start job is running for Create V… [  223.209289] nfs:
>>> server 192.168.1.10 not responding, still trying
>>> [  223.209377] nfs: server 192.168.1.10 not responding, still trying
>>> 
>>> Effectively the boards fails to boot. Example is here:
>>> https://krzk.eu/#/builders/1/builds/2157
>>> 
>> 
>> I too encountered the same issue.
>> 
>>> This was bisected to:
>>> commit 37ac86c3a76c113619b7d9afe0251bbfc04cb80a
>>> Author: Chuck Lever <chuck.lever@oracle.com>
>>> Date:   Fri May 4 15:34:53 2018 -0400
>>> 
>>>    SUNRPC: Initialize rpc_rqst outside of xprt->reserve_lock
>>> 
>>>    alloc_slot is a transport-specific op, but initializing an rpc_rqst
>>>    is common to all transports. In addition, the only part of initial-
>>>    izing an rpc_rqst that needs serialization is getting a fresh XID.
>>> 
>>>    Move rpc_rqst initialization to common code in preparation for
>>>    adding a transport-specific alloc_slot to xprtrdma.
>>> 
>>>    Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>    Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
>>> 
>> 
>> Unfortunately, spent time to bisect independently without seeing this
>> report and got the same culprit.
>> 
>>> 
>>> Bisect log attached. Full configuration:
>>> 1. exynos_defconfig
>>> 2. ARMv7, octa-core, Exynos5422 and Exynos4412 (Odroid XU3, U3 and others)
>>> 3. NFSv4 client (from Raspberry Pi)
>>> 
>> 
>> Yes the issue is seen only with NFSv4 client and with latest systemd I think.
>> My Ubuntu 16.04(32bit FS) is  boots fine while 18.04 has the above issue.
>> Passing nfsv3 in kernel command line makes it work again.
> 
> Thanks for reply!
> 
> I test it on systemd versions 236 and 238... and it fails on both.
> However one board passes always - it is Odroid HC1 with same core
> configuration as described before. Probably there is some different SW
> package on it.
> 
>>> Let me know if you need any more information.
>>> 
>> 
>> Also I was observing this issue with Linus master branch from
>> the time the above patch was merged until today. The issue
>> is no longer seen since this morning however I just enabled lockdep
>> and got these messages.
> 
> All recent linux-next fail. Today's Linus' tree (4c5e8fc62d6a ("Merge
> tag 'linux-kselftest-4.18-rc1-2' of
> git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest"))
> managed to get up on one board but stuck on different board with the
> same issue.
> 
> I am quite surprised that there is no response from the author of the
> commit and this was just moved from next (while failing) to Linus'
> tree... bringing the issue to mainline now.

Sorry. This morning is the first time I've seen this report, which was
not To: or Cc'd to me.

Since I don't have access to this kind of hardware, I will have to ask
for your help to perform basic troubleshooting.

Can we start by capturing the network traffic that occurs while you
reproduce the problem? Use tshark or tcpdump on your NFS server, filter
on the IP of the client, and send me (or the list) the raw pcap file.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [BUG BISECT] NFSv4 client fails on Flush Journal to Persistent Storage
  2018-06-15 14:23     ` Chuck Lever
@ 2018-06-15 14:28       ` Krzysztof Kozlowski
       [not found]         ` <CAJKOXPf7V1z4pNZ2RsjkJmaVAgk34_rPOtQOc7J5KRMSRemrdg@mail.gmail.com>
  0 siblings, 1 reply; 11+ messages in thread
From: Krzysztof Kozlowski @ 2018-06-15 14:28 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Sudeep Holla, Trond Myklebust, Linux NFS Mailing List,
	Anna Schumaker, Bruce Fields, Jeff Layton, David S. Miller,
	netdev, open list, linux-samsung-soc

On Fri, Jun 15, 2018 at 4:23 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>
>
>> On Jun 15, 2018, at 10:07 AM, Krzysztof Kozlowski <krzk@kernel.org> wrote:
>>
>> On Fri, Jun 15, 2018 at 2:53 PM, Sudeep Holla <sudeep.holla@arm.com> wrote:
>>> Hi,
>>>
>>> On Thu, Jun 7, 2018 at 12:19 PM, Krzysztof Kozlowski <krzk@kernel.org> wrote:
>>>> Hi,
>>>>
>>>> When booting my boards under recent linux-next, I see failures of systemd:
>>>>
>>>> [FAILED] Failed to start Flush Journal to Persistent Storage.
>>>> See 'systemctl status systemd-journal-flush.service' for details.
>>>>         Starting Create Volatile Files and Directories...
>>>> [**    ] A start job is running for Create V… [  223.209289] nfs:
>>>> server 192.168.1.10 not responding, still trying
>>>> [  223.209377] nfs: server 192.168.1.10 not responding, still trying
>>>>
>>>> Effectively the boards fails to boot. Example is here:
>>>> https://krzk.eu/#/builders/1/builds/2157
>>>>
>>>
>>> I too encountered the same issue.
>>>
>>>> This was bisected to:
>>>> commit 37ac86c3a76c113619b7d9afe0251bbfc04cb80a
>>>> Author: Chuck Lever <chuck.lever@oracle.com>
>>>> Date:   Fri May 4 15:34:53 2018 -0400
>>>>
>>>>    SUNRPC: Initialize rpc_rqst outside of xprt->reserve_lock
>>>>
>>>>    alloc_slot is a transport-specific op, but initializing an rpc_rqst
>>>>    is common to all transports. In addition, the only part of initial-
>>>>    izing an rpc_rqst that needs serialization is getting a fresh XID.
>>>>
>>>>    Move rpc_rqst initialization to common code in preparation for
>>>>    adding a transport-specific alloc_slot to xprtrdma.
>>>>
>>>>    Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>>    Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
>>>>
>>>
>>> Unfortunately, spent time to bisect independently without seeing this
>>> report and got the same culprit.
>>>
>>>>
>>>> Bisect log attached. Full configuration:
>>>> 1. exynos_defconfig
>>>> 2. ARMv7, octa-core, Exynos5422 and Exynos4412 (Odroid XU3, U3 and others)
>>>> 3. NFSv4 client (from Raspberry Pi)
>>>>
>>>
>>> Yes the issue is seen only with NFSv4 client and with latest systemd I think.
>>> My Ubuntu 16.04(32bit FS) is  boots fine while 18.04 has the above issue.
>>> Passing nfsv3 in kernel command line makes it work again.
>>
>> Thanks for reply!
>>
>> I test it on systemd versions 236 and 238... and it fails on both.
>> However one board passes always - it is Odroid HC1 with same core
>> configuration as described before. Probably there is some different SW
>> package on it.
>>
>>>> Let me know if you need any more information.
>>>>
>>>
>>> Also I was observing this issue with Linus master branch from
>>> the time the above patch was merged until today. The issue
>>> is no longer seen since this morning however I just enabled lockdep
>>> and got these messages.
>>
>> All recent linux-next fail. Today's Linus' tree (4c5e8fc62d6a ("Merge
>> tag 'linux-kselftest-4.18-rc1-2' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest"))
>> managed to get up on one board but stuck on different board with the
>> same issue.
>>
>> I am quite surprised that there is no response from the author of the
>> commit and this was just moved from next (while failing) to Linus'
>> tree... bringing the issue to mainline now.
>
> Sorry. This morning is the first time I've seen this report, which was
> not To: or Cc'd to me.

D'oh! That's mine mistake. Apparently I missed to put you on CC list.
Sorry for that.


> Since I don't have access to this kind of hardware, I will have to ask
> for your help to perform basic troubleshooting.
>
> Can we start by capturing the network traffic that occurs while you
> reproduce the problem? Use tshark or tcpdump on your NFS server, filter
> on the IP of the client, and send me (or the list) the raw pcap file.

Sure, I'll send you tcpdump without Cc-ing list.

Best regards,
Krzysztof

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [BUG BISECT] NFSv4 client fails on Flush Journal to Persistent Storage
       [not found]                       ` <784782B0-A387-4D3B-A517-E85D851162B8@oracle.com>
@ 2018-07-25 13:27                         ` Krzysztof Kozlowski
  2018-07-25 14:31                           ` Chuck Lever
  0 siblings, 1 reply; 11+ messages in thread
From: Krzysztof Kozlowski @ 2018-07-25 13:27 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Trond Myklebust, sudeep.holla, Trond Myklebust, Anna Schumaker,
	J. Bruce Fields, Jeff Layton, David S. Miller, linux-nfs, netdev,
	linux-kernel, linux-samsung-soc

On 18 June 2018 at 18:20, Chuck Lever <chuck.lever@oracle.com> wrote:
>
> The extra serialization appears to have a reproducible performance
> impact on RDMA, which no longer takes the reserve_lock when allocating
> a slot.
>
> I could put an xprt_alloc_xid call in xprt_alloc_slot, but that would
> only work for socket-based transports. Would it be OK if RDMA had its
> own XID allocation mechanism?

Hi,

On recent next the issue appeared again. My boards with NFSv4 root
timeout on 80% of boots. This time my NFS server is faster - Pi3 B+
:).

Is this know? Should I start long bisect or maybe you can point me to
possible causes?

Best regards,
Krzysztof

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [BUG BISECT] NFSv4 client fails on Flush Journal to Persistent Storage
  2018-07-25 13:27                         ` Krzysztof Kozlowski
@ 2018-07-25 14:31                           ` Chuck Lever
  2018-07-26  8:46                             ` Krzysztof Kozlowski
  0 siblings, 1 reply; 11+ messages in thread
From: Chuck Lever @ 2018-07-25 14:31 UTC (permalink / raw)
  To: Krzysztof Kozlowski
  Cc: Trond Myklebust, sudeep.holla, Trond Myklebust, Anna Schumaker,
	Bruce Fields, Jeff Layton, David S. Miller,
	Linux NFS Mailing List, netdev, linux-kernel, linux-samsung-soc



> On Jul 25, 2018, at 9:27 AM, Krzysztof Kozlowski <krzk@kernel.org> wrote:
> 
> On 18 June 2018 at 18:20, Chuck Lever <chuck.lever@oracle.com> wrote:
>> 
>> The extra serialization appears to have a reproducible performance
>> impact on RDMA, which no longer takes the reserve_lock when allocating
>> a slot.
>> 
>> I could put an xprt_alloc_xid call in xprt_alloc_slot, but that would
>> only work for socket-based transports. Would it be OK if RDMA had its
>> own XID allocation mechanism?
> 
> Hi,
> 
> On recent next the issue appeared again. My boards with NFSv4 root
> timeout on 80% of boots. This time my NFS server is faster - Pi3 B+
> :).
> 
> Is this know? Should I start long bisect or maybe you can point me to
> possible causes?

Hi Krzysztof, I don't know of any recent changes. Bisecting would be
a good place to start.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [BUG BISECT] NFSv4 client fails on Flush Journal to Persistent Storage
  2018-07-25 14:31                           ` Chuck Lever
@ 2018-07-26  8:46                             ` Krzysztof Kozlowski
  2018-07-27  1:48                               ` Chuck Lever
  0 siblings, 1 reply; 11+ messages in thread
From: Krzysztof Kozlowski @ 2018-07-26  8:46 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Trond Myklebust, sudeep.holla, Trond Myklebust, Anna Schumaker,
	Bruce Fields, Jeff Layton, David S. Miller,
	Linux NFS Mailing List, netdev, linux-kernel, linux-samsung-soc

On 25 July 2018 at 16:31, Chuck Lever <chuck.lever@oracle.com> wrote:
>
>
>> On Jul 25, 2018, at 9:27 AM, Krzysztof Kozlowski <krzk@kernel.org> wrote:
>>
>> On 18 June 2018 at 18:20, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>
>>> The extra serialization appears to have a reproducible performance
>>> impact on RDMA, which no longer takes the reserve_lock when allocating
>>> a slot.
>>>
>>> I could put an xprt_alloc_xid call in xprt_alloc_slot, but that would
>>> only work for socket-based transports. Would it be OK if RDMA had its
>>> own XID allocation mechanism?
>>
>> Hi,
>>
>> On recent next the issue appeared again. My boards with NFSv4 root
>> timeout on 80% of boots. This time my NFS server is faster - Pi3 B+
>> :).
>>
>> Is this know? Should I start long bisect or maybe you can point me to
>> possible causes?
>
> Hi Krzysztof, I don't know of any recent changes. Bisecting would be
> a good place to start.

Hi,

That was my mistake because of missing part of NFS server
configuration on new board. I tested again recent releases and current
linux-next and everything works fine.

Sorry for the noise.

Best regards,
Krzysztof

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [BUG BISECT] NFSv4 client fails on Flush Journal to Persistent Storage
  2018-07-26  8:46                             ` Krzysztof Kozlowski
@ 2018-07-27  1:48                               ` Chuck Lever
  0 siblings, 0 replies; 11+ messages in thread
From: Chuck Lever @ 2018-07-27  1:48 UTC (permalink / raw)
  To: Krzysztof Kozlowski
  Cc: Trond Myklebust, sudeep.holla, Trond Myklebust, Anna Schumaker,
	Bruce Fields, Jeff Layton, David S. Miller,
	Linux NFS Mailing List, netdev, linux-kernel, linux-samsung-soc



> On Jul 26, 2018, at 4:46 AM, Krzysztof Kozlowski <krzk@kernel.org> wrote:
> 
> On 25 July 2018 at 16:31, Chuck Lever <chuck.lever@oracle.com> wrote:
>> 
>> 
>>> On Jul 25, 2018, at 9:27 AM, Krzysztof Kozlowski <krzk@kernel.org> wrote:
>>> 
>>> On 18 June 2018 at 18:20, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>> 
>>>> The extra serialization appears to have a reproducible performance
>>>> impact on RDMA, which no longer takes the reserve_lock when allocating
>>>> a slot.
>>>> 
>>>> I could put an xprt_alloc_xid call in xprt_alloc_slot, but that would
>>>> only work for socket-based transports. Would it be OK if RDMA had its
>>>> own XID allocation mechanism?
>>> 
>>> Hi,
>>> 
>>> On recent next the issue appeared again. My boards with NFSv4 root
>>> timeout on 80% of boots. This time my NFS server is faster - Pi3 B+
>>> :).
>>> 
>>> Is this know? Should I start long bisect or maybe you can point me to
>>> possible causes?
>> 
>> Hi Krzysztof, I don't know of any recent changes. Bisecting would be
>> a good place to start.
> 
> Hi,
> 
> That was my mistake because of missing part of NFS server
> configuration on new board. I tested again recent releases and current
> linux-next and everything works fine.
> 
> Sorry for the noise.

Thanks for the update.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-07-27  1:48 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-07 11:19 [BUG BISECT] NFSv4 client fails on Flush Journal to Persistent Storage Krzysztof Kozlowski
2018-06-07 11:22 ` Krzysztof Kozlowski
2018-06-15 12:53 ` Sudeep Holla
2018-06-15 14:07   ` Krzysztof Kozlowski
2018-06-15 14:10     ` Krzysztof Kozlowski
2018-06-15 14:23     ` Chuck Lever
2018-06-15 14:28       ` Krzysztof Kozlowski
     [not found]         ` <CAJKOXPf7V1z4pNZ2RsjkJmaVAgk34_rPOtQOc7J5KRMSRemrdg@mail.gmail.com>
     [not found]           ` <082848F5-6360-4523-BA95-601777E17CF2@oracle.com>
     [not found]             ` <CAJKOXPe4uqOZGd_zb_Mqq4Fspu63LHZky86O4tPaD3BZNCBS6w@mail.gmail.com>
     [not found]               ` <6AAC5897-8762-4209-8718-67291B87AA00@oracle.com>
     [not found]                 ` <d617f8a85c3414230cd755fe0e4470c0f8bd17ff.camel@hammerspace.com>
     [not found]                   ` <26995E49-E8DD-421D-BABF-9E672B9383BD@oracle.com>
     [not found]                     ` <30a1455b6817361f8228c022628aef5cdee70c3e.camel@hammerspace.com>
     [not found]                       ` <784782B0-A387-4D3B-A517-E85D851162B8@oracle.com>
2018-07-25 13:27                         ` Krzysztof Kozlowski
2018-07-25 14:31                           ` Chuck Lever
2018-07-26  8:46                             ` Krzysztof Kozlowski
2018-07-27  1:48                               ` Chuck Lever

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).