All of lore.kernel.org
 help / color / mirror / Atom feed
* [lustre-devel] xattr stat mismatch in sanityn test 73
@ 2021-10-08 13:04 Ellis Wilson via lustre-devel
  2021-10-12  1:31 ` Ellis Wilson via lustre-devel
  2021-10-15  6:18 ` Andreas Dilger via lustre-devel
  0 siblings, 2 replies; 5+ messages in thread
From: Ellis Wilson via lustre-devel @ 2021-10-08 13:04 UTC (permalink / raw)
  To: lustre-devel

Hi all,

I'm trying to get to the bottom of a failure I'm seeing while running the lustre unit tests, specifically here for this test in the sanityn suite:
= sanityn test 73: getxattr should not cause xattr lock cancellation =

I'm running stock lustre 2.14.0 on Ubuntu 18.04 with Linux Kernel 5.4.0.  I have six nodes total for the test: two clients, two mds, two oss.  The test is running from one of the clients.

The error I get is:
============
getfattr: Removing leading '/' from absolute path names
# file: mnt/lustre/f73.sanityn
user.attr1="value1"

ELLIS: expected 2, but got 5
 sanityn test_73: @@@@@@ FAIL: not cached in /mnt/lustre 
  Trace dump:
  = /usr/lib/lustre/tests/test-framework.sh:6273:error()
  = /usr/lib/lustre/tests/sanityn.sh:3557:test_73()
  = /usr/lib/lustre/tests/test-framework.sh:6581:run_one()
  = /usr/lib/lustre/tests/test-framework.sh:6628:run_one_logged()
  = /usr/lib/lustre/tests/test-framework.sh:6455:run_test()
  = /usr/lib/lustre/tests/sanityn.sh:3567:main()
============

I've instrumented the code to spit out the expected vs. discovered stat values.  The failure indicates the file in question wasn't cached, but in fact the inverse is occurring -- it's both cached and hit more often than expected.

The unadulterated test code follows:
================
3549     touch $DIR1/$tfile
3550     setfattr -n user.attr1 -v value1 $DIR1/$tfile ||
3551         error "setfattr1 failed"
3552     getfattr -n user.attr1 $DIR2/$tfile || error "getfattr1 failed"
3553     getfattr -n user.attr1 $DIR1/$tfile || error "getfattr2 failed"
3554     clear_stats llite.*.stats
3555     # PR lock should be cached by now on both clients
3556     getfattr -n user.attr1 $DIR1/$tfile || error "getfattr3 failed"
3557     # 2 hits for getfattr(0)+getfattr(size)
3558     [ $(calc_stats llite.*.stats getxattr_hits) -eq 2 ] ||
3559         error "not cached in $DIR1"
================

The failure occurs on line 3558.

Manually performing these actions validates that indeed the jump is by 5, not 2:
~# lctl get_param llite.*.stats | grep hits
getxattr_hits             85 samples [reqs]
getxattr_hits             4 samples [reqs]
~# getfattr -n user.attr1 /mnt/lustre/f73.sanityn
getfattr: Removing leading '/' from absolute path names # file: mnt/lustre/f73.sanityn user.attr1="value1"
~# lctl get_param llite.*.stats | grep hits
getxattr_hits             90 samples [reqs]
getxattr_hits             4 samples [reqs]

I straced getfattr as run in the test and found it issues the following:
23262 lstat("/mnt/lustre/f73.sanityn", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
23262 getxattr("/mnt/lustre/f73.sanityn", "user.attr1", NULL, 0) = 6
23262 getxattr("/mnt/lustre/f73.sanityn", "user.attr1", "value1", 256) = 6

I built a small C program to replicate just the above without all of the other fluff in getfattr, and I see 1 xattr cache hit occurring for the lstat, and two xattr cache hits occurring for each call of getxattr.  So it replicates the 5 xattr cache hits.  It is notable that if one does NOT specify "user.attr1" and instead just uses an empty string you only get a single hit on each getxattr.

I have a patch that revises the expected stat values from 2 to 5 and from 4 to 10, and while that works in my system I wanted to know:
1. Are these changes expected?  I don't know much about the xattr cache or when it's expected to be hit, but hitting twice for a single getxattr seemed high.
2. Is there any location online where I can look at release testing results for these unit tests?  I wanted to see if I was alone in hitting this many times, but couldn't locate such a repository of historical test results.

Thanks for any and all help!

Best,

ellis
_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [lustre-devel] xattr stat mismatch in sanityn test 73
  2021-10-08 13:04 [lustre-devel] xattr stat mismatch in sanityn test 73 Ellis Wilson via lustre-devel
@ 2021-10-12  1:31 ` Ellis Wilson via lustre-devel
  2021-10-15  6:18 ` Andreas Dilger via lustre-devel
  1 sibling, 0 replies; 5+ messages in thread
From: Ellis Wilson via lustre-devel @ 2021-10-12  1:31 UTC (permalink / raw)
  To: lustre-devel

Apologies if somebody responded and it got spammed or I missed it -- in any event I got to root cause and figured I'd answer my own questions in case somebody went looking.

The additional three getxattrs are caused by inode audit code in the kernel, which is triggered if there are any filesystem specific rules in-place (e.g., log to audit file if directory /etc/auditd is modified).

You can see the additional stack traces with eBPF stackcount:

  ll_xattr_cache_get
  ll_xattr_get_common
  __vfs_getxattr
  vfs_getxattr
  getxattr
  path_getxattr
  __x64_sys_getxattr
  do_syscall_64
  entry_SYSCALL_64_after_hwframe
  [unknown]
  [unknown]
    getfattr [19698]
    1  

  ll_xattr_cache_get
  ll_xattr_get_common
  __vfs_getxattr
  vfs_getxattr
  getxattr
  path_getxattr
  __x64_sys_getxattr
  do_syscall_64
  entry_SYSCALL_64_after_hwframe
  [unknown]
  [unknown]
    getfattr [19698]
    1  

  ll_xattr_cache_get
  ll_xattr_get_common
  __vfs_getxattr
  get_vfs_caps_from_disk
  audit_copy_inode
  __audit_inode
  filename_lookup
  user_path_at_empty
  vfs_statx
  __do_sys_newlstat
  __x64_sys_newlstat
  do_syscall_64
  entry_SYSCALL_64_after_hwframe
  [unknown]
    getfattr [19698]
    1  

  ll_xattr_cache_get
  ll_xattr_get_common
  __vfs_getxattr
  get_vfs_caps_from_disk
  audit_copy_inode
  __audit_inode
  filename_lookup
  user_path_at_empty
  path_getxattr
  __x64_sys_getxattr
  do_syscall_64
  entry_SYSCALL_64_after_hwframe
  [unknown]
  [unknown]
    getfattr [19698]
    2

Only the ones without "__audit_inode" in them are doing "real" work.  The remaining three are simply performing getxattrs against the path being operated on as part of filename_lookup for audit.  So long as auditctl shows enabled AND there is at least a single rule that relates to the filesystem (e.g., audit all modifications to /etc/audit), every single lookup will result in a get(x)attr to get VFS caps.  Relevant call path through the kernel is:

filename_lookup
audit_inode
__audit_inode
audit_copy_inode
audit_copy_fcaps
get_vfs_caps_from_disk

Disabling auditd completely via auditctl -e0 returns the total number of getxattrs to the expected value of 2.

Overall, the check in-place for this test is arguably too overfitted.  At some point I'll put together a PR such that the check is assuring there are at least 2 getxattr_hits rather than exactly 2 unless somebody objects to that solution.

Also, I found maloo and got an account so I could review upstream unit test results.

Best,

ellis


-----Original Message-----
From: Ellis Wilson 
Sent: Friday, October 8, 2021 9:04 AM
To: lustre-devel@lists.lustre.org
Subject: xattr stat mismatch in sanityn test 73

Hi all,

I'm trying to get to the bottom of a failure I'm seeing while running the lustre unit tests, specifically here for this test in the sanityn suite:
= sanityn test 73: getxattr should not cause xattr lock cancellation =

I'm running stock lustre 2.14.0 on Ubuntu 18.04 with Linux Kernel 5.4.0.  I have six nodes total for the test: two clients, two mds, two oss.  The test is running from one of the clients.

The error I get is:
============
getfattr: Removing leading '/' from absolute path names # file: mnt/lustre/f73.sanityn user.attr1="value1"

ELLIS: expected 2, but got 5
 sanityn test_73: @@@@@@ FAIL: not cached in /mnt/lustre
  Trace dump:
  = /usr/lib/lustre/tests/test-framework.sh:6273:error()
  = /usr/lib/lustre/tests/sanityn.sh:3557:test_73()
  = /usr/lib/lustre/tests/test-framework.sh:6581:run_one()
  = /usr/lib/lustre/tests/test-framework.sh:6628:run_one_logged()
  = /usr/lib/lustre/tests/test-framework.sh:6455:run_test()
  = /usr/lib/lustre/tests/sanityn.sh:3567:main()
============

I've instrumented the code to spit out the expected vs. discovered stat values.  The failure indicates the file in question wasn't cached, but in fact the inverse is occurring -- it's both cached and hit more often than expected.

The unadulterated test code follows:
================
3549     touch $DIR1/$tfile
3550     setfattr -n user.attr1 -v value1 $DIR1/$tfile ||
3551         error "setfattr1 failed"
3552     getfattr -n user.attr1 $DIR2/$tfile || error "getfattr1 failed"
3553     getfattr -n user.attr1 $DIR1/$tfile || error "getfattr2 failed"
3554     clear_stats llite.*.stats
3555     # PR lock should be cached by now on both clients
3556     getfattr -n user.attr1 $DIR1/$tfile || error "getfattr3 failed"
3557     # 2 hits for getfattr(0)+getfattr(size)
3558     [ $(calc_stats llite.*.stats getxattr_hits) -eq 2 ] ||
3559         error "not cached in $DIR1"
================

The failure occurs on line 3558.

Manually performing these actions validates that indeed the jump is by 5, not 2:
~# lctl get_param llite.*.stats | grep hits
getxattr_hits             85 samples [reqs]
getxattr_hits             4 samples [reqs]
~# getfattr -n user.attr1 /mnt/lustre/f73.sanityn
getfattr: Removing leading '/' from absolute path names # file: mnt/lustre/f73.sanityn user.attr1="value1"
~# lctl get_param llite.*.stats | grep hits
getxattr_hits             90 samples [reqs]
getxattr_hits             4 samples [reqs]

I straced getfattr as run in the test and found it issues the following:
23262 lstat("/mnt/lustre/f73.sanityn", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
23262 getxattr("/mnt/lustre/f73.sanityn", "user.attr1", NULL, 0) = 6
23262 getxattr("/mnt/lustre/f73.sanityn", "user.attr1", "value1", 256) = 6

I built a small C program to replicate just the above without all of the other fluff in getfattr, and I see 1 xattr cache hit occurring for the lstat, and two xattr cache hits occurring for each call of getxattr.  So it replicates the 5 xattr cache hits.  It is notable that if one does NOT specify "user.attr1" and instead just uses an empty string you only get a single hit on each getxattr.

I have a patch that revises the expected stat values from 2 to 5 and from 4 to 10, and while that works in my system I wanted to know:
1. Are these changes expected?  I don't know much about the xattr cache or when it's expected to be hit, but hitting twice for a single getxattr seemed high.
2. Is there any location online where I can look at release testing results for these unit tests?  I wanted to see if I was alone in hitting this many times, but couldn't locate such a repository of historical test results.

Thanks for any and all help!

Best,

ellis
_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [lustre-devel] xattr stat mismatch in sanityn test 73
  2021-10-08 13:04 [lustre-devel] xattr stat mismatch in sanityn test 73 Ellis Wilson via lustre-devel
  2021-10-12  1:31 ` Ellis Wilson via lustre-devel
@ 2021-10-15  6:18 ` Andreas Dilger via lustre-devel
  2021-10-15 14:25   ` Ellis Wilson via lustre-devel
  1 sibling, 1 reply; 5+ messages in thread
From: Andreas Dilger via lustre-devel @ 2021-10-15  6:18 UTC (permalink / raw)
  To: Ellis Wilson; +Cc: lustre-devel


[-- Attachment #1.1: Type: text/plain, Size: 4444 bytes --]

Ellis, the getfattr code has a poor "optimization", in that it first calls getxattr(buf=NULL, size=0) to have the kernel report the xattr size to allocate a buffer of exactly the right size.  However, IMHO while this avoids a "too large" allocation (maybe a few bytes vs. 64KB), it is inefficient by doubling the number of syscalls into the filesystem.

The other getxattr cache hit during lstat() is likely because of the kernel accessing an selinux/security xattr.

On Oct 8, 2021, at 07:04, Ellis Wilson via lustre-devel <lustre-devel@lists.lustre.org<mailto:lustre-devel@lists.lustre.org>> wrote:

Hi all,

I'm trying to get to the bottom of a failure I'm seeing while running the lustre unit tests, specifically here for this test in the sanityn suite:
= sanityn test 73: getxattr should not cause xattr lock cancellation =

I'm running stock lustre 2.14.0 on Ubuntu 18.04 with Linux Kernel 5.4.0.  I have six nodes total for the test: two clients, two mds, two oss.  The test is running from one of the clients.

The error I get is:
============
getfattr: Removing leading '/' from absolute path names
# file: mnt/lustre/f73.sanityn
user.attr1="value1"

ELLIS: expected 2, but got 5
sanityn test_73: @@@@@@ FAIL: not cached in /mnt/lustre
 Trace dump:
 = /usr/lib/lustre/tests/test-framework.sh:6273:error()
 = /usr/lib/lustre/tests/sanityn.sh:3557:test_73()
 = /usr/lib/lustre/tests/test-framework.sh:6581:run_one()
 = /usr/lib/lustre/tests/test-framework.sh:6628:run_one_logged()
 = /usr/lib/lustre/tests/test-framework.sh:6455:run_test()
 = /usr/lib/lustre/tests/sanityn.sh:3567:main()
============

I've instrumented the code to spit out the expected vs. discovered stat values.  The failure indicates the file in question wasn't cached, but in fact the inverse is occurring -- it's both cached and hit more often than expected.

The unadulterated test code follows:
================
3549     touch $DIR1/$tfile
3550     setfattr -n user.attr1 -v value1 $DIR1/$tfile ||
3551         error "setfattr1 failed"
3552     getfattr -n user.attr1 $DIR2/$tfile || error "getfattr1 failed"
3553     getfattr -n user.attr1 $DIR1/$tfile || error "getfattr2 failed"
3554     clear_stats llite.*.stats
3555     # PR lock should be cached by now on both clients
3556     getfattr -n user.attr1 $DIR1/$tfile || error "getfattr3 failed"
3557     # 2 hits for getfattr(0)+getfattr(size)
3558     [ $(calc_stats llite.*.stats getxattr_hits) -eq 2 ] ||
3559         error "not cached in $DIR1"
================

The failure occurs on line 3558.

Manually performing these actions validates that indeed the jump is by 5, not 2:
~# lctl get_param llite.*.stats | grep hits
getxattr_hits             85 samples [reqs]
getxattr_hits             4 samples [reqs]
~# getfattr -n user.attr1 /mnt/lustre/f73.sanityn
getfattr: Removing leading '/' from absolute path names # file: mnt/lustre/f73.sanityn user.attr1="value1"
~# lctl get_param llite.*.stats | grep hits
getxattr_hits             90 samples [reqs]
getxattr_hits             4 samples [reqs]

I straced getfattr as run in the test and found it issues the following:
23262 lstat("/mnt/lustre/f73.sanityn", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
23262 getxattr("/mnt/lustre/f73.sanityn", "user.attr1", NULL, 0) = 6
23262 getxattr("/mnt/lustre/f73.sanityn", "user.attr1", "value1", 256) = 6

I built a small C program to replicate just the above without all of the other fluff in getfattr, and I see 1 xattr cache hit occurring for the lstat, and two xattr cache hits occurring for each call of getxattr.  So it replicates the 5 xattr cache hits.  It is notable that if one does NOT specify "user.attr1" and instead just uses an empty string you only get a single hit on each getxattr.

I have a patch that revises the expected stat values from 2 to 5 and from 4 to 10, and while that works in my system I wanted to know:
1. Are these changes expected?  I don't know much about the xattr cache or when it's expected to be hit, but hitting twice for a single getxattr seemed high.
2. Is there any location online where I can look at release testing results for these unit tests?  I wanted to see if I was alone in hitting this many times, but couldn't locate such a repository of historical test results.

Thanks for any and all help!

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud








[-- Attachment #1.2: Type: text/html, Size: 8929 bytes --]

[-- Attachment #2: Type: text/plain, Size: 165 bytes --]

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [lustre-devel] xattr stat mismatch in sanityn test 73
  2021-10-15  6:18 ` Andreas Dilger via lustre-devel
@ 2021-10-15 14:25   ` Ellis Wilson via lustre-devel
  2021-10-16  6:33     ` Andreas Dilger via lustre-devel
  0 siblings, 1 reply; 5+ messages in thread
From: Ellis Wilson via lustre-devel @ 2021-10-15 14:25 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: lustre-devel


[-- Attachment #1.1: Type: text/plain, Size: 5395 bytes --]

Hi Andreas,

Yea - I noticed that.  What makes it even funnier is that it makes allocation calls in statically defined chunk sizes, so it's not even like it's exactly the right size in truth.  If you look at my follow-up email you'll see that I RC'd this to auditd mechanics in the kernel, which was enabled on my system.

I'm going to open a bug and submit a patch for the test to be more forgiving for test systems that have auditd enabled next week.  In short, it will assert that it saw at minimum two getxattr hits, rather than exactly two getxattrs hits.  That's behaved well for me since I revised it locally.  Let me know if you see any issue with this simple solution.

Best,

ellis

From: Andreas Dilger <adilger@whamcloud.com>
Sent: Friday, October 15, 2021 2:19 AM
To: Ellis Wilson <elliswilson@microsoft.com>
Cc: lustre-devel@lists.lustre.org
Subject: [EXTERNAL] Re: [lustre-devel] xattr stat mismatch in sanityn test 73

Ellis, the getfattr code has a poor "optimization", in that it first calls getxattr(buf=NULL, size=0) to have the kernel report the xattr size to allocate a buffer of exactly the right size.  However, IMHO while this avoids a "too large" allocation (maybe a few bytes vs. 64KB), it is inefficient by doubling the number of syscalls into the filesystem.

The other getxattr cache hit during lstat() is likely because of the kernel accessing an selinux/security xattr.


On Oct 8, 2021, at 07:04, Ellis Wilson via lustre-devel <lustre-devel@lists.lustre.org<mailto:lustre-devel@lists.lustre.org>> wrote:

Hi all,

I'm trying to get to the bottom of a failure I'm seeing while running the lustre unit tests, specifically here for this test in the sanityn suite:
= sanityn test 73: getxattr should not cause xattr lock cancellation =

I'm running stock lustre 2.14.0 on Ubuntu 18.04 with Linux Kernel 5.4.0.  I have six nodes total for the test: two clients, two mds, two oss.  The test is running from one of the clients.

The error I get is:
============
getfattr: Removing leading '/' from absolute path names
# file: mnt/lustre/f73.sanityn
user.attr1="value1"

ELLIS: expected 2, but got 5
sanityn test_73: @@@@@@ FAIL: not cached in /mnt/lustre
 Trace dump:
 = /usr/lib/lustre/tests/test-framework.sh:6273:error()
 = /usr/lib/lustre/tests/sanityn.sh:3557:test_73()
 = /usr/lib/lustre/tests/test-framework.sh:6581:run_one()
 = /usr/lib/lustre/tests/test-framework.sh:6628:run_one_logged()
 = /usr/lib/lustre/tests/test-framework.sh:6455:run_test()
 = /usr/lib/lustre/tests/sanityn.sh:3567:main()
============

I've instrumented the code to spit out the expected vs. discovered stat values.  The failure indicates the file in question wasn't cached, but in fact the inverse is occurring -- it's both cached and hit more often than expected.

The unadulterated test code follows:
================
3549     touch $DIR1/$tfile
3550     setfattr -n user.attr1 -v value1 $DIR1/$tfile ||
3551         error "setfattr1 failed"
3552     getfattr -n user.attr1 $DIR2/$tfile || error "getfattr1 failed"
3553     getfattr -n user.attr1 $DIR1/$tfile || error "getfattr2 failed"
3554     clear_stats llite.*.stats
3555     # PR lock should be cached by now on both clients
3556     getfattr -n user.attr1 $DIR1/$tfile || error "getfattr3 failed"
3557     # 2 hits for getfattr(0)+getfattr(size)
3558     [ $(calc_stats llite.*.stats getxattr_hits) -eq 2 ] ||
3559         error "not cached in $DIR1"
================

The failure occurs on line 3558.

Manually performing these actions validates that indeed the jump is by 5, not 2:
~# lctl get_param llite.*.stats | grep hits
getxattr_hits             85 samples [reqs]
getxattr_hits             4 samples [reqs]
~# getfattr -n user.attr1 /mnt/lustre/f73.sanityn
getfattr: Removing leading '/' from absolute path names # file: mnt/lustre/f73.sanityn user.attr1="value1"
~# lctl get_param llite.*.stats | grep hits
getxattr_hits             90 samples [reqs]
getxattr_hits             4 samples [reqs]

I straced getfattr as run in the test and found it issues the following:
23262 lstat("/mnt/lustre/f73.sanityn", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
23262 getxattr("/mnt/lustre/f73.sanityn", "user.attr1", NULL, 0) = 6
23262 getxattr("/mnt/lustre/f73.sanityn", "user.attr1", "value1", 256) = 6

I built a small C program to replicate just the above without all of the other fluff in getfattr, and I see 1 xattr cache hit occurring for the lstat, and two xattr cache hits occurring for each call of getxattr.  So it replicates the 5 xattr cache hits.  It is notable that if one does NOT specify "user.attr1" and instead just uses an empty string you only get a single hit on each getxattr.

I have a patch that revises the expected stat values from 2 to 5 and from 4 to 10, and while that works in my system I wanted to know:
1. Are these changes expected?  I don't know much about the xattr cache or when it's expected to be hit, but hitting twice for a single getxattr seemed high.
2. Is there any location online where I can look at release testing results for these unit tests?  I wanted to see if I was alone in hitting this many times, but couldn't locate such a repository of historical test results.

Thanks for any and all help!

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







[-- Attachment #1.2: Type: text/html, Size: 10078 bytes --]

[-- Attachment #2: Type: text/plain, Size: 165 bytes --]

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [lustre-devel] xattr stat mismatch in sanityn test 73
  2021-10-15 14:25   ` Ellis Wilson via lustre-devel
@ 2021-10-16  6:33     ` Andreas Dilger via lustre-devel
  0 siblings, 0 replies; 5+ messages in thread
From: Andreas Dilger via lustre-devel @ 2021-10-16  6:33 UTC (permalink / raw)
  To: Ellis Wilson; +Cc: lustre-devel


[-- Attachment #1.1: Type: text/plain, Size: 5839 bytes --]

The simple solution is fine with me.

Is there a way to make the kernel itself more efficient in this regard?
For example, limit the checks to a specific directory tree rather than
the whole filesystem?  Granted the cache avoids network RPCs, but it
would be even better if the VFS is skipping these unnecessary checks.

Cheers, Andreas

On Oct 15, 2021, at 08:25, Ellis Wilson <elliswilson@microsoft.com> wrote:


Hi Andreas,

Yea – I noticed that.  What makes it even funnier is that it makes allocation calls in statically defined chunk sizes, so it’s not even like it’s exactly the right size in truth.  If you look at my follow-up email you’ll see that I RC’d this to auditd mechanics in the kernel, which was enabled on my system.

I’m going to open a bug and submit a patch for the test to be more forgiving for test systems that have auditd enabled next week.  In short, it will assert that it saw at minimum two getxattr hits, rather than exactly two getxattrs hits.  That’s behaved well for me since I revised it locally.  Let me know if you see any issue with this simple solution.

Best,

ellis

From: Andreas Dilger <adilger@whamcloud.com>
Sent: Friday, October 15, 2021 2:19 AM
To: Ellis Wilson <elliswilson@microsoft.com>
Cc: lustre-devel@lists.lustre.org
Subject: [EXTERNAL] Re: [lustre-devel] xattr stat mismatch in sanityn test 73

Ellis, the getfattr code has a poor "optimization", in that it first calls getxattr(buf=NULL, size=0) to have the kernel report the xattr size to allocate a buffer of exactly the right size.  However, IMHO while this avoids a "too large" allocation (maybe a few bytes vs. 64KB), it is inefficient by doubling the number of syscalls into the filesystem.

The other getxattr cache hit during lstat() is likely because of the kernel accessing an selinux/security xattr.


On Oct 8, 2021, at 07:04, Ellis Wilson via lustre-devel <lustre-devel@lists.lustre.org<mailto:lustre-devel@lists.lustre.org>> wrote:

Hi all,

I'm trying to get to the bottom of a failure I'm seeing while running the lustre unit tests, specifically here for this test in the sanityn suite:
= sanityn test 73: getxattr should not cause xattr lock cancellation =

I'm running stock lustre 2.14.0 on Ubuntu 18.04 with Linux Kernel 5.4.0.  I have six nodes total for the test: two clients, two mds, two oss.  The test is running from one of the clients.

The error I get is:
============
getfattr: Removing leading '/' from absolute path names
# file: mnt/lustre/f73.sanityn
user.attr1="value1"

ELLIS: expected 2, but got 5
sanityn test_73: @@@@@@ FAIL: not cached in /mnt/lustre
 Trace dump:
 = /usr/lib/lustre/tests/test-framework.sh:6273:error()
 = /usr/lib/lustre/tests/sanityn.sh:3557:test_73()
 = /usr/lib/lustre/tests/test-framework.sh:6581:run_one()
 = /usr/lib/lustre/tests/test-framework.sh:6628:run_one_logged()
 = /usr/lib/lustre/tests/test-framework.sh:6455:run_test()
 = /usr/lib/lustre/tests/sanityn.sh:3567:main()
============

I've instrumented the code to spit out the expected vs. discovered stat values.  The failure indicates the file in question wasn't cached, but in fact the inverse is occurring -- it's both cached and hit more often than expected.

The unadulterated test code follows:
================
3549     touch $DIR1/$tfile
3550     setfattr -n user.attr1 -v value1 $DIR1/$tfile ||
3551         error "setfattr1 failed"
3552     getfattr -n user.attr1 $DIR2/$tfile || error "getfattr1 failed"
3553     getfattr -n user.attr1 $DIR1/$tfile || error "getfattr2 failed"
3554     clear_stats llite.*.stats
3555     # PR lock should be cached by now on both clients
3556     getfattr -n user.attr1 $DIR1/$tfile || error "getfattr3 failed"
3557     # 2 hits for getfattr(0)+getfattr(size)
3558     [ $(calc_stats llite.*.stats getxattr_hits) -eq 2 ] ||
3559         error "not cached in $DIR1"
================

The failure occurs on line 3558.

Manually performing these actions validates that indeed the jump is by 5, not 2:
~# lctl get_param llite.*.stats | grep hits
getxattr_hits             85 samples [reqs]
getxattr_hits             4 samples [reqs]
~# getfattr -n user.attr1 /mnt/lustre/f73.sanityn
getfattr: Removing leading '/' from absolute path names # file: mnt/lustre/f73.sanityn user.attr1="value1"
~# lctl get_param llite.*.stats | grep hits
getxattr_hits             90 samples [reqs]
getxattr_hits             4 samples [reqs]

I straced getfattr as run in the test and found it issues the following:
23262 lstat("/mnt/lustre/f73.sanityn", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
23262 getxattr("/mnt/lustre/f73.sanityn", "user.attr1", NULL, 0) = 6
23262 getxattr("/mnt/lustre/f73.sanityn", "user.attr1", "value1", 256) = 6

I built a small C program to replicate just the above without all of the other fluff in getfattr, and I see 1 xattr cache hit occurring for the lstat, and two xattr cache hits occurring for each call of getxattr.  So it replicates the 5 xattr cache hits.  It is notable that if one does NOT specify "user.attr1" and instead just uses an empty string you only get a single hit on each getxattr.

I have a patch that revises the expected stat values from 2 to 5 and from 4 to 10, and while that works in my system I wanted to know:
1. Are these changes expected?  I don't know much about the xattr cache or when it's expected to be hit, but hitting twice for a single getxattr seemed high.
2. Is there any location online where I can look at release testing results for these unit tests?  I wanted to see if I was alone in hitting this many times, but couldn't locate such a repository of historical test results.

Thanks for any and all help!

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







[-- Attachment #1.2: Type: text/html, Size: 10438 bytes --]

[-- Attachment #2: Type: text/plain, Size: 165 bytes --]

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2021-10-16  6:33 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-08 13:04 [lustre-devel] xattr stat mismatch in sanityn test 73 Ellis Wilson via lustre-devel
2021-10-12  1:31 ` Ellis Wilson via lustre-devel
2021-10-15  6:18 ` Andreas Dilger via lustre-devel
2021-10-15 14:25   ` Ellis Wilson via lustre-devel
2021-10-16  6:33     ` Andreas Dilger via lustre-devel

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.