* [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52
@ 2020-02-27 21:07 James Simmons
From: James Simmons @ 2020-02-27 21:07 UTC
  To: lustre-devel

These patches need to be applied to the lustre-backport branch
starting at commit a436653f641e4b3e2841f38113620535e918dd3f.
Combining the work of Neil and myself, this brings the Lustre
Linux client up to just before the landing of Direct I/O
(LU-4198) support. Testing shows this work is pretty stable.
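For anyone applying the series locally, a minimal sketch of the steps above might look like the following. The branch name and base commit come from this cover letter; the patch directory path is a hypothetical local export of the 622 mailed patches:

```shell
# Start from the lustre-backport branch at the base commit named
# in the cover letter (the new branch name here is arbitrary).
git checkout -b lustre-sync lustre-backport
git reset --hard a436653f641e4b3e2841f38113620535e918dd3f

# Apply all patches in series order; git am preserves the
# authorship and commit messages from the mailed patches.
# "./patches" is a hypothetical directory holding the exported series.
git am ./patches/*.patch
```

If a patch fails to apply, `git am --abort` restores the branch so the base commit can be re-checked before retrying.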

Alex Zhuravlev (19):
  lustre: ptlrpc: idle connections can disconnect
  lustre: osc: serialize access to idle_timeout vs cleanup
  lustre: protocol: MDT as a statfs proxy
  lustre: ptlrpc: new request vs disconnect race
  lustre: ldlm: pass preallocated env to methods
  lustre: mdc: use old statfs format
  lustre: osc: re-check target versus available grant
  lustre: ptlrpc: reset generation for old requests
  lustre: osc: propagate grant shrink interval immediately
  lustre: osc: grant shrink shouldn't account skipped OSC
  lnet: libcfs: poll fail_loc in cfs_fail_timeout_set()
  lustre: obdclass: put all service's env on the list
  lustre: obdclass: use RCU to release lu_env_item
  lustre: obd: add rmfid support
  lustre: mdc: polling mode for changelog reader
  lustre: llite: forget cached ACLs properly
  lustre: ptlrpc: return proper error code
  lustre: llite: statfs to use NODELAY with MDS
  lustre: ptlrpc: suppress connection restored message

Alexander Boyko (9):
  lustre: ldlm: fix l_last_activity usage
  lustre: ptlrpc: don't zero request handle
  lustre: mgc: don't proccess cld during stopping
  lustre: llog: add startcat for wrapped catalog
  lustre: llog: add synchronization for the last record
  lustre: mdc: don't use ACL at setattr
  lnet: adds checking msg len
  lustre: llite: prevent mulitple group locks
  lustre: obdclass: don't skip records for wrapped catalog

Alexander Zarochentsev (4):
  lustre: llite: ll_fault should fail for insane file offsets
  lustre: osc: don't re-enable grant shrink on reconnect
  lustre: ptlrpc: grammar fix.
  lustre: osc: glimpse and lock cancel race

Alexey Lyashkov (10):
  lustre: lu_object: improve debug message for lu_object_put()
  lnet: use right rtr address
  lnet: use right address for routing message
  lustre: mdc: reset lmm->lmm_stripe_offset in mdc_save_lovea
  lustre: obdecho: reuse an cl env cache for obdecho survey
  lustre: obdecho: avoid panic with partially object init
  lustre: mgc: config lock leak
  lnet: fix rspt counter
  lnet: lnet response entries leak
  lnet: avoid extra memory consumption

Alexey Zhuravlev (1):
  lustre: grant: prevent overflow of o_undirty

Amir Shehata (87):
  lnet: ko2iblnd: determine gaps correctly
  lnet: refactor lnet_select_pathway()
  lnet: add health value per ni
  lnet: add lnet_health_sensitivity
  lnet: add monitor thread
  lnet: handle local ni failure
  lnet: handle o2iblnd tx failure
  lnet: handle socklnd tx failure
  lnet: handle remote errors in LNet
  lnet: add retry count
  lnet: calculate the lnd timeout
  lnet: sysfs functions for module params
  lnet: timeout delayed REPLYs and ACKs
  lnet: remove duplicate timeout mechanism
  lnet: handle fatal device error
  lnet: reset health value
  lnet: add health statistics
  lnet: Add ioctl to get health stats
  lnet: remove obsolete health functions
  lnet: set health value from user space
  lnet: add global health statistics
  lnet: print recovery queues content
  lnet: health error simulation
  lnet: lnd: conditionally set health status
  lnet: router handling
  lnet: update logging
  lnet: lnd: Clean up logging
  lnet: unlink md if fail to send recovery
  lnet: set the health status correctly
  lnet: Decrement health on timeout
  lnet: properly error check sensitivity
  lnet: configure recovery interval
  lnet: separate ni state from recovery
  lnet: handle multi-md usage
  lnet: socklnd: improve scheduling algorithm
  lnet: lnd: increase CQ entries
  lnet: lnd: bring back concurrent_sends
  lnet: use number of wrs to calculate CQEs
  lnet: recovery event handling broken
  lnet: clean mt_eqh properly
  lnet: handle remote health error
  lnet: setup health timeout defaults
  lnet: fix cpt locking
  lnet: detach response tracker
  lnet: invalidate recovery ping mdh
  lnet: fix list corruption
  lnet: correct discovery LNetEQFree()
  lnet: verify msg is commited for send/recv
  lnet: select LO interface for sending
  lnet: remove route add restriction
  lnet: Discover routers on first use
  lnet: use peer for gateway
  lnet: lnet_add/del_route()
  lnet: Do not allow deleting of router nis
  lnet: router sensitivity
  lnet: cache ni status
  lnet: Cache the routing feature
  lnet: peer aliveness
  lnet: router aliveness
  lnet: simplify lnet_handle_local_failure()
  lnet: Cleanup rcd
  lnet: modify lnd notification mechanism
  lnet: use discovery for routing
  lnet: MR aware gateway selection
  lnet: consider alive_router_check_interval
  lnet: allow deleting router primary_nid
  lnet: transfer routers
  lnet: handle health for incoming messages
  lnet: misleading discovery seqno.
  lnet: drop all rule
  lnet: handle discovery off
  lnet: handle router health off
  lnet: push router interface updates
  lnet: net aliveness
  lnet: discover each gateway Net
  lnet: look up MR peers routes
  lnet: check peer timeout on a router
  lnet: prevent loop in LNetPrimaryNID()
  lnet: fix peer ref counting
  lnet: honor discovery setting
  lnet: warn if discovery is off
  lnet: handle unlink before send completes
  lnet: handle recursion in resend
  lnet: discovery off route state update
  lnet: o2iblnd: cache max_qp_wr
  lnet: fix peer_ni selection
  lnet: peer lookup handle shutdown

Andreas Dilger (55):
  lustre: llite: increase whole-file readahead to RPC size
  lustre: mdc: fix possible NULL pointer dereference
  lustre: obdclass: allow specifying complex jobids
  lustre: idl: remove obsolete directory split flags
  lustre: obdecho: use vmalloc for lnb
  lustre: mgc: remove obsolete IR swabbing workaround
  lustre: mds: remove obsolete MDS_VTX_BYPASS flag
  lustre: ptlrpc: fix return type of boolean functions
  lustre: ptlrpc: remove obsolete OBD RPC opcodes
  lustre: ptlrpc: assign specific values to MGS opcodes
  lustre: ptlrpc: remove obsolete LLOG_ORIGIN_* RPCs
  lustre: obdclass: remove unused ll_import_cachep
  lustre: ptlrpc: add debugging for idle connections
  lustre: mdc: move RPC semaphore code to lustre/osp
  lustre: misc: name open file handles as such
  lustre: osc: move obdo_cache to OSC code
  lustre: idl: remove obsolete RPC flags
  lustre: osc: clarify short_io_bytes is maximum value
  lustre: misc: quiet console messages at startup
  lustre: idl: use proper ATTR/MDS_ATTR/MDS_OPEN flags
  lustre: lov: add debugging info for statfs
  lustre: hsm: make changelog flag argument an enum
  lustre: uapi: fix warnings when lustre_user.h included
  lustre: ptlrpc: clean up rq_interpret_reply callbacks
  lustre: lov: quiet lov_dump_lmm_ console messages
  lustre: llite: remove cl_file_inode_init() LASSERT
  lnet: libcfs: allow file/func/line passed to CDEBUG()
  lustre: llite: enable flock mount option by default
  lustre: lmv: avoid gratuitous 64-bit modulus
  lustre: Ensure crc-t10pi is enabled.
  lustre: lov: avoid signed vs. unsigned comparison
  lustre: llite: limit statfs ffree if less than OST ffree
  lustre: misc: delete OBD_IOC_PING_TARGET ioctl
  lustre: misc: remove LIBCFS_IOC_DEBUG_MASK ioctl
  lustre: obdclass: improve llog config record message
  lustre: ptlrpc: allow stopping threads above threads_max
  lustre: llite: improve max_readahead console messages
  lustre: uapi: fix file heat support
  lustre: mdt: improve IBITS lock definitions
  lustre: obdclass: don't send multiple statfs RPCs
  lustre: uapi: add unused enum obd_statfs_state
  lustre: mdc: hold lock while walking changelog dev list
  lustre: ptlrpc: make DEBUG_REQ messages consistent
  lustre: obdclass: align to T10 sector size when generating guard
  lustre: ptlrpc: fix watchdog ratelimit logic
  lustre: llite: clear flock when using localflock
  lustre: llite: limit max xattr size by kernel value
  lustre: llite: report latency for filesystem ops
  lustre: osc: allow increasing osc.*.short_io_bytes
  lustre: ptlrpc: update wiretest for new values
  lustre: uapi: LU-12521 llapi: add separate fsname and instance API
  lustre: ptlrpc: show target name in req_history
  lustre: llite: proper names/types for offset/pages
  lustre: uapi: remove unused LUSTRE_DIRECTIO_FL
  lnet: use conservative health timeouts

Andrew Perepechko (5):
  lustre: build: armv7 client build fixes
  lustre: osc: speed up page cache cleanup during blocking ASTs
  lustre: ptlrpc: improve memory allocation for service RPCs
  lustre: llite: optimizations for not granted lock processing
  lnet: libcfs: crashes with certain cpu part numbers

Andriy Skulysh (16):
  lustre: ptlrpc: ptlrpc_register_bulk() LBUG on ENOMEM
  lustre: ptlrpc: Serialize procfs access to scp_hist_reqs using mutex
  lustre: ptlrpc: ASSERTION(!list_empty(imp->imp_replay_cursor))
  lustre: ptlrpc: connect vs import invalidate race
  lnet: o2iblnd: ibc_rxs is created and freed with different size
  lustre: ldlm: Lost lease lock on migrate error
  lnet: o2iblnd: kib_conn leak
  lustre: ptlrpc: Bulk assertion fails on -ENOMEM
  lustre: ptlrpc: ASSERTION (req_transno < next_transno) failed
  lustre: ptlrpc: ocd_connect_flags are wrong during reconnect
  lustre: ptlrpc: Add increasing XIDs CONNECT2 flag
  lustre: ptlrpc: don't reset lru_resize on idle reconnect
  lustre: ptlrpc: resend may corrupt the data
  lustre: ldlm: FLOCK request can be processed twice
  lustre: ldlm: signal vs CP callback race
  lustre: llite: eviction during ll_open_cleanup()

Ann Koehler (8):
  lustre: llite: yield cpu after call to ll_agl_trigger
  lustre: ptlrpc: Do not map unrecognized ELDLM errnos to EIO
  lustre: llite: Lock inode on tiny write if setuid/setgid set
  lustre: statahead: sa_handle_callback get lli_sa_lock earlier
  lustre: ptlrpc: Add jobid to rpctrace debug messages
  lnet: libcfs: Reduce memory frag due to HA debug msg
  lustre: llite: release active extent on sync write commit
  lustre: ptlrpc: ptlrpc_register_bulk LBUG on ENOMEM

Arshad Hussain (17):
  lustre: osc: truncate does not update blocks count on client
  lustre: lmv: Fix style issues for lmv_fld.c
  lustre: llite: Fix style issues for llite_nfs.c
  lustre: llite: Fix style issues for lcommon_misc.c
  lustre: llite: Fix style issues for symlink.c
  lustre: ptlrpc: Change static defines to use macro for sec_gc.c
  lustre: ldlm: Fix style issues for ldlm_lockd.c
  lustre: ldlm: Fix style issues for ldlm_request.c
  lustre: ptlrpc: Fix style issues for sec_bulk.c
  lustre: ldlm: Fix style issues for ptlrpcd.c
  lustre: ptlrpc: Fix style issues for sec_null.c
  lustre: ptlrpc: Fix style issues for service.c
  lustre: ldlm: Fix style issues for ldlm_resource.c
  lustre: ptlrpc: Fix style issues for sec_gc.c
  lustre: ptlrpc: Fix style issues for llog_client.c
  lnet: Change static defines to use macro for module.c
  lustre: ldlm: Fix style issues for ldlm_lib.c

Artem Blagodarenko (1):
  lnet: add fault injection for bulk transfers

Aurelien Degremont (1):
  lnet: support non-default network namespace

Ben Evans (1):
  lustre: headers: define pct(a,b) once

Bobi Jam (11):
  lustre: osc: depart grant shrinking from pinger
  lustre: osc: enable/disable OSC grant shrink
  lustre: flr: add 'nosync' flag for FLR mirrors
  lustre: mdc: grow lvb buffer to hold layout
  lustre: flr: add mirror write command
  lustre: llite: protect reading inode->i_data.nrpages
  lustre: osc: limit chunk number of write submit
  lustre: osc: prevent use after free
  lustre: llite: error handling of ll_och_fill()
  lustre: flr: avoid reading unhealthy mirror
  lustre: llite: file write pos mimatch

Bruno Faccini (6):
  lustre: obdclass: fix llog_cat_cleanup() usage on Client
  lustre: ptlrpc: fix test_req_buffer_pressure behavior
  lustre: ldlm: cleanup LVB handling
  lustre: security: return security context for metadata ops
  lustre: lov: new foreign LOV format
  lustre: lmv: new foreign LMV format

Chris Horn (28):
  lnet: Cleanup lnet_get_rtr_pool_cfg
  lnet: Fix NI status in debugfs for loopback ni
  lustre: ptlrpc: Add more flags to DEBUG_REQ_FLAGS macro
  lnet: Protect lp_dc_pendq manipulation with lp_lock
  lnet: Ensure md is detached when msg is not committed
  lnet: Do not allow gateways on remote nets
  lnet: Convert noisy timeout error to cdebug
  lnet: Misleading error from lnet_is_health_check
  lnet: Sync the start of discovery and monitor threads
  lnet: Deprecate live and dead router check params
  lnet: Detach rspt when md_threshold is infinite
  lnet: Return EHOSTUNREACH for unreachable gateway
  lnet: Defer rspt cleanup when MD queued for unlink
  lnet: Don't queue msg when discovery has completed
  lnet: Use alternate ping processing for non-mr peers
  lnet: o2ib: Record rc in debug log on startup failure
  lnet: o2ib: Reintroduce kiblnd_dev_search
  lnet: Optimize check for routing feature flag
  lnet: Wait for single discovery attempt of routers
  lnet: Prefer route specified by rtr_nid
  lnet: Add peer level aliveness information
  lnet: Refactor lnet_find_best_lpni_on_net
  lnet: Avoid comparing route to itself
  lnet: Avoid extra lnet_remotenet lookup
  lnet: Remove unused vars in lnet_find_route_locked
  lnet: Refactor lnet_compare_routes
  lnet: Fix source specified route selection
  lnet: Do not assume peers are MR capable

Christopher J. Morrone (1):
  lustre: ldlm: Make kvzalloc | kvfree use consistent

Di Wang (1):
  lustre: llite: handle ORPHAN/DEAD directories

Emoly Liu (4):
  lnet: fix nid range format '*@<net>' support
  lustre: checksum: enable/disable checksum correctly
  lustre: ptlrpc: check lm_bufcount and lm_buflen
  lustre: ptlrpc: check buffer length in lustre_msg_string()

Fan Yong (3):
  lustre: llite: return compatible fsid for statfs
  lustre: llite: decrease sa_running if fail to start statahead
  lustre: lfsck: layout LFSCK for mirrored file

Gu Zheng (3):
  lustre: osc: cancel osc_lock list traversal once found the lock is
    being used
  lustre: ldlm: always cancel aged locks regardless enabling or
    disabling lru resize
  lustre: uapi: fix building fail against Power9 little endian

Hongchao Zhang (8):
  lustre: mdc: resend quotactl if needed
  lustre: quota: add default quota setting support
  lustre: ptlrpc: race in AT early reply
  lustre: ptlrpc: always unregister bulk
  lustre: quota: protect quota flags at OSC
  lustre: quota: make overquota flag for old req
  lustre: fld: let's caller to retry FLD_QUERY
  lustre: mdc: hold obd while processing changelog

Jacek Tomaka (1):
  lustre: llite: Mark lustre_inode_cache as reclaimable

Jadhav Vikram (1):
  lustre: lov: protected ost pool count updation

James Nunez (1):
  lustre: llite: limit smallest max_cached_mb value

James Simmons (33):
  lustre: always enable special debugging, fhandles, and quota support.
  lustre: osc_cache: remove __might_sleep()
  lustre: uapi: remove enum hsm_progress_states
  lustre: uapi: sync enum obd_statfs_state
  lustre: obd: create ping sysfs file
  lustre: ldlm: change LDLM_POOL_ADD_VAR macro to inline function
  lustre: osc: fix idle_timeout handling
  lustre: ptlrpc: replace simple_strtol with kstrtol
  lustre: obd: use correct ip_compute_csum() version
  lustre: llite: create checksums to replace checksum_pages
  lustre: obd: use correct names for conn_uuid
  lustre: mgc: restore mgc binding for sptlrpc
  lustre: update version to 2.11.99
  lustre: obdclass: report all obd states for OBD_IOC_GETDEVICE
  lustre: sysfs: make ping sysfs file read and writable
  lustre: sptlrpc: split sptlrpc_process_config()
  lustre: clio: fix incorrect invariant in cl_io_iter_fini()
  lustre: obd: use ldo_process_config for mdc and osc layer
  lustre: obd: make health_check sysfs compliant
  lnet: properly cleanup lnet debugfs files
  lustre: obd: update udev event handling
  lustre: obd: replace class_uuid with linux kernel version.
  lustre: obd: round values to nearest MiB for *_mb syfs files
  lustre: obdclass: add comment for rcu handling in lu_env_remove
  lustre: ptlrpc: change IMPORT_SET_* macros into real functions
  lustre: obd: harden debugfs handling
  lustre: update version to 2.13.50
  lnet: timers: correctly offset mod_timer.
  lustre: obd: perform proper division
  lustre: llite: don't cache MDS_OPEN_LOCK for volatile files
  lnet: socklnd: rename struct ksock_peer to struct ksock_peer_ni
  lustre: sysfs: use string helper like functions for sysfs
  lustre: uapi: properly pack data structures

Jian Yu (4):
  lustre: mdt: revoke lease lock for truncate
  lustre: llite: swab LOV EA user data
  lustre: llite: swab LOV EA data in ll_getxattr_lov()
  lustre: llite: fetch default layout for a directory

Jinshan Xiong (4):
  lustre: llite: rename FSFILT_IOC_* to system flags
  lustre: llite: optimize read on open pages
  lustre: dne: performance improvement for file creation
  lustre: llite: do not cache write open lock for exec file

John L. Hammond (12):
  lustre: llite: reorganize variable and data structures
  lustre: hsm: ignore compound_id
  lustre: llog: remove obsolete llog handlers
  lustre: obd: keep dirty_max_pages a round number of MB
  lustre: llite: handle zero length xattr values correctly
  lustre: mdc: remove obsolete intent opcodes
  lustre: ldlm: correct logic in ldlm_prepare_lru_list()
  lustre: llite: zero lum for stripeless files
  lustre: mdc: move empty xattr handling to mdc layer
  lustre: obd: remove portals handle from OBD import
  lustre: llite: handle -ENODATA in ll_layout_fetch()
  lustre: ldlm: remove trace from ldlm_pool_count()

Kit Westneat (1):
  lnet: remove .nf_min_max handling

Lai Siyao (27):
  lustre: ptlrpc: add dir migration connect flag
  lustre: lmv: dir page is released while in use
  lustre: migrate: pack lmv ea in migrate rpc
  lustre: migrate: migrate striped directory
  lustre: lmv: support accessing migrating directory
  lustre: llite: add lock for dir layout data
  lustre: lmv: allocate fid on parent MDT in migrate
  lustre: obdclass: lu_dirent record length missing '0'
  lustre: uapi: reserve connect flag for plain layout
  lustre: dne: allow access to striped dir with broken layout
  lustre: dne: add new dir hash type "space"
  lustre: ptlrpc: intent_getattr fetches default LMV
  lustre: mdc: add async statfs
  lustre: lmv: mkdir with balanced space usage
  lustre: lmv: reuse object alloc QoS code from LOD
  lustre: obdclass: generate random u64 max correctly
  lustre: uapi: change "space" hash type to hash flag
  lustre: obdclass: 0-nlink race in lu_object_find_at()
  lustre: mdc: dir page ldp_hash_end mistakenly adjusted
  lustre: lmv: disable remote file statahead
  lustre: lmv: use lu_tgt_descs to manage tgts
  lustre: lmv: share object alloc QoS code with LMV
  lustre: obdclass: qos penalties miscalculated
  lustre: obdclass: lu_tgt_descs cleanup
  lustre: lmv: alloc dir stripes by QoS
  lustre: uapi: introduce OBD_CONNECT2_CRUSH
  lustre: llite: fix deadlock in ll_update_lsm_md()

Li Dongyang (9):
  lustre: ldlm: check double grant race after resource change
  lustre: clio: use pagevec_release for many pages
  lustre: osc: reduce atomic ops in osc_enter_cache_try
  lustre: osc: check if opg is in lru list without locking
  lustre: osc: don't check capability for every page
  lustre: osc: reduce lock contention in osc_unreserve_grant
  lustre: obdclass: protect imp_sec using rwlock_t
  lustre: llite: create obd_device with usercopy whitelist
  lustre: obdclass: remove assertion for imp_refcount

Li Xi (6):
  lustre: osc: add T10PI support for RPC checksum
  lustre: osc: wrong page offset for T10PI checksum
  lustre: llite: add file heat support
  lustre: llite: console message for disabled flock call
  lustre: llite: cleanup stats of LPROC_LL_*
  lustre: osc: add preferred checksum type support

Liang Zhen (2):
  lustre: ldlm: don't disable softirq for exp_rpc_lock
  lustre: obdclass: new wrapper to convert NID to string

Mike Marciniszyn (1):
  lnet: libcfs: remove unnecessary set_fs(KERNEL_DS)

Mikhail Pershin (28):
  lustre: mdc: deny layout swap for DoM file
  lustre: ldlm: expose dirty age limit for flush-on-glimpse
  lustre: ldlm: IBITS lock convert instead of cancel
  lustre: ptlrpc: add LOCK_CONVERT connection flag
  lustre: ldlm: handle lock converts in cancel handler
  lustre: ldlm: don't add canceling lock back to LRU
  lustre: mdt: read on open for DoM files
  lustre: llite: check truncate race for DOM pages
  lustre: ldlm: don't cancel DoM locks before replay
  lustre: ptlrpc: don't change buffer when signature is ready
  lustre: ldlm: update l_blocking_lock under lock
  lustre: ldlm: don't apply ELC to converting and DOM locks
  lustre: ldlm: don't skip bl_ast for local lock
  lustre: mdt: fix read-on-open for big PAGE_SIZE
  lustre: ldlm: don't convert wrong resource
  lustre: mdc: prevent glimpse lock count grow
  lustre: mdc: return DOM size on open resend
  lustre: osc: pass client page size during reconnect too
  lustre: mdt: fix mdt_dom_discard_data() timeouts
  lustre: dom: per-resource ELC for WRITE lock enqueue
  lustre: dom: mdc_lock_flush() improvement
  lustre: obdclass: remove unprotected access to lu_object
  lustre: llite: check correct size in ll_dom_finish_open()
  lustre: ptlrpc: fix reply buffers shrinking and growing
  lustre: dom: manual OST-to-DOM migration via mirroring
  lustre: ptlrpc: do lu_env_refill for any new request
  lustre: dom: check read-on-open buffer presents in reply
  lustre: llog: keep llog handle alive until last reference

Mr NeilBrown (47):
  lustre: obdclass: allow per-session jobids.
  lustre: fld: remove fci_no_shrink field.
  lustre: lustre: remove ldt_obd_type field of lu_device_type
  lustre: lustre: remove imp_no_timeout field
  lustre: llog: remove olg_cat_processing field.
  lustre: ptlrpc: remove struct ptlrpc_bulk_page
  lustre: ptlrpc: remove bd_import_generation field.
  lustre: ptlrpc: remove srv_threads from struct ptlrpc_service
  lustre: ptlrpc: remove scp_nthrs_stopping field.
  lustre: ldlm: remove unused ldlm_server_conn
  lustre: llite: remove lli_readdir_mutex
  lustre: llite: remove ll_umounting field
  lustre: llite: align field names in ll_sb_info
  lustre: llite: remove lti_iter field
  lustre: llite: remove ft_mtime field
  lustre: llite: remove sub_reenter field.
  lustre: osc: remove oti_descr oti_handle oti_plist
  lustre: osc: remove oe_next_page
  lnet: o2iblnd: remove some unused fields.
  lnet: socklnd: remove ksnp_sharecount
  lnet: change ln_mt_waitq to a completion.
  lustre: import: Fix missing spin_unlock()
  lustre: use simple sleep in some cases
  lustre: modules: Use LIST_HEAD for declaring list_heads
  lnet: remove pt_number from lnet_peer_table.
  lustre: obdclass: Allow read-ahead for write requests
  lnet: discard lnd_refcount
  lnet: change ksocknal_create_peer() to return pointer
  lnet: discard ksnn_lock
  lnet: discard LNetMEInsert
  lustre: all: prefer sizeof(*var) for alloc
  lnet: always check return of try_module_get()
  lnet: prepare to make lnet_lnd const.
  lnet: discard struct ksock_peer
  lnet: socklnd: initialize the_ksocklnd at compile-time.
  lnet: remove locking protection ln_testprotocompat
  lustre: handle: remove locking from class_handle2object()
  lustre: obdclass: convert waiting in cl_sync_io_wait().
  lnet: modules: use list_move were appropriate.
  lnet: fix small race in unloading klnd modules.
  lnet: me: discard struct lnet_handle_me
  lnet: socklnd: convert peers hash table to hashtable.h
  lustre: ptlrpc: simplify wait_event handling in unregister functions
  lustre: ptlrpc: use l_wait_event_abortable in ptlrpcd_add_reg()
  lnet: use LIST_HEAD() for local lists.
  lustre: lustre: use LIST_HEAD() for local lists.
  lnet: remove lnd_query interface.

Nathaniel Clark (1):
  lustre: lov: Correct bounds checking

NeilBrown (18):
  lustre: llite: Don't clear d_fsdata in ll_release()
  lustre: llite: move agl_thread cleanup out of thread.
  lustre/lnet: remove unnecessary use of msecs_to_jiffies()
  lnet: net_fault: don't pass struct member to do_div()
  lustre: obd: discard unused enum
  lustre: lov: use wait_event() in lov_subobject_kill()
  lustre: llite: use wait_event in cl_object_put_last()
  lustre: handle: move refcount into the lustre_handle.
  lustre: ldlm: separate buckets from ldlm hash table
  lustre: handle: discard OBD_FREE_RCU
  lnet: use list_move where appropriate.
  lustre: ldlm: add a counter to the per-namespace data
  lustre: rename ops to owner
  lustre: ldlm: simplify ldlm_ns_hash_defs[]
  lustre: u_object: factor out extra per-bucket data
  lustre: llite: replace lli_trunc_sem
  lustre: handle: use hlist for hash lists.
  lustre: handle: discard h_lock.

Olaf Faaland (2):
  lnet: create existing net returns EEXIST
  lustre: llite: Update mdc and lite stats on open|creat

Olaf Weber (1):
  lnet: use after free in lnet_discover_peer_locked()

Oleg Drokin (6):
  lustre: ptlrpc: Add WBC connect flag
  lustre: lov: Move lov_tgts_kobj init to lov_setup
  lustre: osc: increase default max_dirty_mb to 2G
  lustre: llite: Revalidate dentries in ll_intent_file_open
  lustre: llite: hash just created files if lock allows
  lustre: ptlrpc: Properly swab ll_fiemap_info_key

Patrick Farrell (30):
  lustre: osc: Do not request more than 2GiB grant
  lustre: ldlm: Reduce debug to console during eviction
  lustre: ptlrpc: Make CPU binding switchable
  lustre: osc: Do not walk full extent list
  lustre: ldlm: Adjust search_* functions
  lustre: mdc: Improve xattr buffer allocations
  lustre: llite: Initialize cl_dirty_max_pages
  lustre: llite: ll_fault fixes
  lustre: osd: Set max ea size to XATTR_SIZE_MAX
  lustre: lov: Remove unnecessary assert
  lustre: obd: Add overstriping CONNECT flag
  lustre: lov: Add overstriping support
  lustre: uapi: Add nonrotational flag to statfs
  lustre: llite: collect debug info for ll_fsync
  lustre: lu_object: Add missed qos_rr_init
  lustre: osc: Do not assert for first extent
  lustre: ptlrpc: Don't get jobid in body_v2
  lustre: lov: Correct write_intent end for trunc
  lustre: osc: Fix dom handling in weight_ast
  lustre: llite: Fix extents_stats
  lustre: ptlrpc: Stop sending ptlrpc_body_v2
  lustre: uapi: Remove unused CONNECT flag
  lustre: llite: Fix page count for unaligned reads
  lustre: llite: Improve readahead RPC issuance
  lustre: lov: Move page index to top level
  lustre: ptlrpc: Hold imp lock for idle reconnect
  lustre: osc: glimpse - search for active lock
  lnet: o2iblnd: Make credits hiw connection aware
  lustre: vvp: dirty pages with pagevec
  lustre: llite: Accept EBUSY for page unaligned read

Qian Yingjin (16):
  lustre: mdt: Lazy size on MDT
  lustre: uapi: add new changerec_type
  lustre: lsom: Add an OBD_CONNECT2_LSOM connect flag
  lustre: pcc: Reserve a new connection flag for PCC
  lustre: rpc: support maximum 64MB I/O RPC
  lustre: llite: Add persistent cache on client
  lustre: pcc: Non-blocking PCC caching
  lustre: pcc: security and permission for non-root user access
  lustre: llite: Rule based auto PCC caching when create files
  lustre: pcc: auto attach during open for valid cache
  lustre: pcc: change detach behavior and add keep option
  lustre: som: integrate LSOM with lfs find
  lustre: pcc: Auto attach for PCC during IO
  lustre: pcc: Incorrect size after re-attach
  lustre: pcc: auto attach not work after client cache clear
  lustre: pcc: Init saved dataset flags properly

Quentin Bouget (1):
  lustre: uapi: turn struct lustre_nfs_fid to userland fhandle

Rahul Deshmukh (1):
  lustre: obdecho: turn on async flag only for mode 3

Rob Latham (1):
  lustre: uapi: Make lustre_user.h c++-legal

Ryan Haasken (1):
  lustre: obdclass: Add lbug_on_eviction option

Sebastien Buisson (7):
  lustre: obd: check '-o network' and peer discovery conflict
  lustre: cfg: reserve flags for SELinux status checking
  lustre: sec: create new function sptlrpc_get_sepol()
  lnet: check for asymmetrical route messages
  lustre: ptlrpc: manage SELinux policy info at connect time
  lustre: ptlrpc: manage SELinux policy info for metadata ops
  lustre: sec: reserve flags for client side encryption

Sergey Cheremencev (1):
  lustre: ptlrpc: IR doesn't reconnect after EAGAIN

Shaun Tancheff (7):
  lustre: lov: return error if cl_env_get fails
  lustre: llite: MS_* flags and SB_* flags split
  lustre: clio: support custom csi_end_io handler
  lnet: Fix style issues for selftest/rpc.c
  lnet: Fix style issues for module.c conctl.c
  lnet: libcfs: provide an scnprintf and start using it
  lnet: libcfs: Cleanup use of bare printk

Sonia Sharma (6):
  lnet: Fix selftest backward compatibility post health
  lnet: socklnd: dynamically set LND parameters
  lnet: peer deletion code may hide error
  lnet: increase lnet transaction timeout
  lnet: Avoid lnet debugfs read/write if ctl_table does not exist
  lnet: check if current->nsproxy is NULL before using

Swapnil Pimpale (1):
  lustre: lustre: Reserve OST_FALLOCATE(fallocate) opcode

Tatsushi Takamura (1):
  lnet: handling device failure by IB event handler

Teddy Chan (1):
  lustre: ptlrpc: Add QoS for uid and gid in NRS-TBF

Teddy Zheng (2):
  lustre: hsm: add OBD_CONNECT2_ARCHIVE_ID_ARRAY to pass archive_id
    lists in array
  lustre: hsm: increase upper limit of maximum HSM backends registered
    with MDT

Vitaly Fertman (6):
  lustre: ptlrpc: Add more flags to DEBUG_REQ_FLAGS macro
  lustre: ldlm: layout lock fixes
  lustre: osc: layout and chunkbits alignment mismatch
  lustre: osc: wrong cache of LVB attrs
  lustre: osc: wrong cache of LVB attrs, part2
  lustre: ldlm: fix lock convert races

Vladimir Saveliev (6):
  lustre: obdclass: make mod rpc slot wait queue FIFO
  lustre: lov: fix lov_iocontrol for inactive OST case
  lnet: libcfs: do not calculate debug_mb if it is set
  lustre: llite: improve ll_dom_lock_cancel
  lustre: lov: check all entries in lov_flush_composite
  lustre: lmv: disable statahead for remote objects

Wang Shilong (24):
  lustre: llite: fix setstripe for specific osts upon dir
  lnet: libcfs: fix wrong check in libcfs_debug_vmsg2()
  lustre: quota: fix setattr project check
  lustre: llite: make sure name pack atomic
  lustre: ptlrpc: handle proper import states for recovery
  lustre: llite: switch to use ll_fsname directly
  lustre: llite: fill copied dentry name's ending char properly
  lustre: llite, readahead: fix to call ll_ras_enter() properly
  lnet: libcfs: fix panic for too large cpu partitions
  lustre: lov: fix wrong calculated length for fiemap
  lustre: push rcu_barrier() before destroying slab
  lustre: llite,readahead: don't always use max RPC size
  lustre: llite: improve single-thread read performance
  lustre: llite: fix deadloop with tiny write
  lustre: llite: make sure readahead cover current read
  lustre: llite: don't check vmpage refcount in ll_releasepage()
  lustre: osc: reserve lru pages for read in batch
  lustre: llite: don't miss every first stride page
  lustre: llite: extend readahead locks for striped file
  lustre: readahead: convert stride page index to byte
  lnet: eliminate uninitialized warning
  lustre: llite: support page unaligned stride readahead
  lustre: ptlrpc: always reset generation for idle reconnect
  lustre: lmv: fix to return correct MDT count

Yang Sheng (6):
  lustre: ldlm: speed up preparation for list of lock cancel
  lustre: ldlm: fix for l_lru usage
  lustre: class: use INIT_LIST_HEAD_RCU instead INIT_LIST_HEAD
  lustre: lov: cl_cache could miss initialize
  lustre: lov: remove KEY_CACHE_SET to simplify the code
  lustre: import: fix race between imp_state & imp_invalid

 fs/lustre/Kconfig                             |    5 +
 fs/lustre/fid/fid_request.c                   |    7 +
 fs/lustre/fld/fld_cache.c                     |   15 +-
 fs/lustre/fld/fld_internal.h                  |    1 -
 fs/lustre/fld/fld_request.c                   |   23 +-
 fs/lustre/include/cl_object.h                 |   73 +-
 fs/lustre/include/lprocfs_status.h            |   33 +-
 fs/lustre/include/lu_object.h                 |  238 +-
 fs/lustre/include/lustre_disk.h               |    1 +
 fs/lustre/include/lustre_dlm.h                |  155 +-
 fs/lustre/include/lustre_dlm_flags.h          |   33 +-
 fs/lustre/include/lustre_export.h             |   29 +-
 fs/lustre/include/lustre_ha.h                 |    2 +-
 fs/lustre/include/lustre_handles.h            |   21 +-
 fs/lustre/include/lustre_import.h             |   45 +-
 fs/lustre/include/lustre_lmv.h                |   82 +-
 fs/lustre/include/lustre_log.h                |    4 +-
 fs/lustre/include/lustre_mdc.h                |  120 -
 fs/lustre/include/lustre_net.h                |  159 +-
 fs/lustre/include/lustre_osc.h                |   27 +-
 fs/lustre/include/lustre_req_layout.h         |   15 +-
 fs/lustre/include/lustre_sec.h                |   12 +
 fs/lustre/include/lustre_swab.h               |    2 +
 fs/lustre/include/obd.h                       |  151 +-
 fs/lustre/include/obd_cksum.h                 |  130 +-
 fs/lustre/include/obd_class.h                 |  150 +-
 fs/lustre/include/obd_support.h               |   59 +-
 fs/lustre/ldlm/ldlm_extent.c                  |    2 +-
 fs/lustre/ldlm/ldlm_inodebits.c               |  111 +-
 fs/lustre/ldlm/ldlm_internal.h                |   42 +-
 fs/lustre/ldlm/ldlm_lib.c                     |   59 +-
 fs/lustre/ldlm/ldlm_lock.c                    |  359 +--
 fs/lustre/ldlm/ldlm_lockd.c                   |  224 +-
 fs/lustre/ldlm/ldlm_pool.c                    |   28 +-
 fs/lustre/ldlm/ldlm_request.c                 |  444 +++-
 fs/lustre/ldlm/ldlm_resource.c                |  196 +-
 fs/lustre/llite/Makefile                      |    2 +-
 fs/lustre/llite/dcache.c                      |    1 -
 fs/lustre/llite/dir.c                         |  731 ++++--
 fs/lustre/llite/file.c                        |  987 ++++++--
 fs/lustre/llite/glimpse.c                     |    1 +
 fs/lustre/llite/lcommon_cl.c                  |   35 +-
 fs/lustre/llite/lcommon_misc.c                |   47 +-
 fs/lustre/llite/llite_internal.h              |  444 +++-
 fs/lustre/llite/llite_lib.c                   |  719 ++++--
 fs/lustre/llite/llite_mmap.c                  |   63 +-
 fs/lustre/llite/llite_nfs.c                   |   59 +-
 fs/lustre/llite/lproc_llite.c                 |  465 +++-
 fs/lustre/llite/namei.c                       |  717 ++++--
 fs/lustre/llite/pcc.c                         | 2614 ++++++++++++++++++++
 fs/lustre/llite/pcc.h                         |  264 ++
 fs/lustre/llite/rw.c                          | 1091 ++++++---
 fs/lustre/llite/rw26.c                        |    4 -
 fs/lustre/llite/statahead.c                   |  182 +-
 fs/lustre/llite/super25.c                     |   23 +-
 fs/lustre/llite/symlink.c                     |   21 +-
 fs/lustre/llite/vvp_dev.c                     |    1 +
 fs/lustre/llite/vvp_internal.h                |   22 +-
 fs/lustre/llite/vvp_io.c                      |  176 +-
 fs/lustre/llite/vvp_object.c                  |   11 +-
 fs/lustre/llite/vvp_page.c                    |   19 +-
 fs/lustre/llite/xattr.c                       |  109 +-
 fs/lustre/llite/xattr_security.c              |   19 +
 fs/lustre/lmv/lmv_fld.c                       |   17 +-
 fs/lustre/lmv/lmv_intent.c                    |  201 +-
 fs/lustre/lmv/lmv_internal.h                  |  162 +-
 fs/lustre/lmv/lmv_obd.c                       | 2078 +++++++++-------
 fs/lustre/lmv/lproc_lmv.c                     |  143 +-
 fs/lustre/lov/Makefile                        |    2 +-
 fs/lustre/lov/lov_cl_internal.h               |   28 +-
 fs/lustre/lov/lov_ea.c                        |  117 +-
 fs/lustre/lov/lov_internal.h                  |   49 +-
 fs/lustre/lov/lov_io.c                        |   89 +-
 fs/lustre/lov/lov_obd.c                       |  162 +-
 fs/lustre/lov/lov_object.c                    |  159 +-
 fs/lustre/lov/lov_offset.c                    |    2 +
 fs/lustre/lov/lov_pack.c                      |   73 +-
 fs/lustre/lov/lov_page.c                      |   17 +-
 fs/lustre/lov/lov_pool.c                      |   19 +-
 fs/lustre/lov/lov_request.c                   |   29 +-
 fs/lustre/lov/lovsub_page.c                   |   68 -
 fs/lustre/lov/lproc_lov.c                     |    4 +-
 fs/lustre/mdc/lproc_mdc.c                     |   87 +-
 fs/lustre/mdc/mdc_changelog.c                 |  154 +-
 fs/lustre/mdc/mdc_dev.c                       |  171 +-
 fs/lustre/mdc/mdc_internal.h                  |   14 +-
 fs/lustre/mdc/mdc_lib.c                       |   86 +-
 fs/lustre/mdc/mdc_locks.c                     |  277 ++-
 fs/lustre/mdc/mdc_reint.c                     |  106 +-
 fs/lustre/mdc/mdc_request.c                   |  476 +++-
 fs/lustre/mgc/lproc_mgc.c                     |   12 +-
 fs/lustre/mgc/mgc_request.c                   |   86 +-
 fs/lustre/obdclass/Makefile                   |    2 +-
 fs/lustre/obdclass/cl_io.c                    |   49 +-
 fs/lustre/obdclass/cl_object.c                |   23 +-
 fs/lustre/obdclass/cl_page.c                  |   36 +-
 fs/lustre/obdclass/class_obd.c                |  151 +-
 fs/lustre/obdclass/genops.c                   |  139 +-
 fs/lustre/obdclass/integrity.c                |  273 +++
 fs/lustre/obdclass/jobid.c                    |  282 ++-
 fs/lustre/obdclass/llog.c                     |  126 +-
 fs/lustre/obdclass/llog_cat.c                 |   59 +-
 fs/lustre/obdclass/llog_internal.h            |    4 +-
 fs/lustre/obdclass/lprocfs_status.c           |  277 ++-
 fs/lustre/obdclass/lu_object.c                |  518 +++-
 fs/lustre/obdclass/lu_tgt_descs.c             |  682 ++++++
 fs/lustre/obdclass/lustre_handles.c           |   61 +-
 fs/lustre/obdclass/obd_cksum.c                |  151 ++
 fs/lustre/obdclass/obd_config.c               |   39 +-
 fs/lustre/obdclass/obd_mount.c                |   23 +-
 fs/lustre/obdclass/obd_sysfs.c                |  101 +-
 fs/lustre/obdclass/obdo.c                     |    7 +-
 fs/lustre/obdecho/echo_client.c               |   77 +-
 fs/lustre/osc/lproc_osc.c                     |  232 +-
 fs/lustre/osc/osc_cache.c                     |  172 +-
 fs/lustre/osc/osc_dev.c                       |   19 +-
 fs/lustre/osc/osc_internal.h                  |   48 +-
 fs/lustre/osc/osc_io.c                        |  115 +-
 fs/lustre/osc/osc_lock.c                      |  156 +-
 fs/lustre/osc/osc_object.c                    |   28 +-
 fs/lustre/osc/osc_page.c                      |   20 +-
 fs/lustre/osc/osc_quota.c                     |   18 +-
 fs/lustre/osc/osc_request.c                   |  619 +++--
 fs/lustre/ptlrpc/client.c                     |  282 ++-
 fs/lustre/ptlrpc/errno.c                      |   27 +
 fs/lustre/ptlrpc/events.c                     |   12 +-
 fs/lustre/ptlrpc/import.c                     |  507 ++--
 fs/lustre/ptlrpc/layout.c                     |  172 +-
 fs/lustre/ptlrpc/llog_client.c                |   15 +-
 fs/lustre/ptlrpc/lproc_ptlrpc.c               |   66 +-
 fs/lustre/ptlrpc/niobuf.c                     |  102 +-
 fs/lustre/ptlrpc/pack_generic.c               |  236 +-
 fs/lustre/ptlrpc/pinger.c                     |   50 +-
 fs/lustre/ptlrpc/ptlrpc_internal.h            |    3 +-
 fs/lustre/ptlrpc/ptlrpcd.c                    |   21 +-
 fs/lustre/ptlrpc/recover.c                    |   23 +-
 fs/lustre/ptlrpc/sec.c                        |  146 +-
 fs/lustre/ptlrpc/sec_bulk.c                   |   71 +-
 fs/lustre/ptlrpc/sec_config.c                 |   89 +-
 fs/lustre/ptlrpc/sec_gc.c                     |   16 +-
 fs/lustre/ptlrpc/sec_lproc.c                  |   74 +
 fs/lustre/ptlrpc/sec_null.c                   |   16 +-
 fs/lustre/ptlrpc/sec_plain.c                  |    7 +-
 fs/lustre/ptlrpc/service.c                    |  427 ++--
 fs/lustre/ptlrpc/wiretest.c                   |  342 ++-
 include/linux/libcfs/libcfs.h                 |    1 +
 include/linux/libcfs/libcfs_debug.h           |   69 +-
 include/linux/libcfs/libcfs_fail.h            |   46 +-
 include/linux/lnet/api.h                      |   34 +-
 include/linux/lnet/lib-lnet.h                 |  225 +-
 include/linux/lnet/lib-types.h                |  355 ++-
 include/uapi/linux/lnet/libcfs_debug.h        |    4 +-
 include/uapi/linux/lnet/libcfs_ioctl.h        |   13 +-
 include/uapi/linux/lnet/lnet-dlc.h            |   42 +
 include/uapi/linux/lnet/lnet-types.h          |   49 +-
 include/uapi/linux/lnet/lnetctl.h             |   23 +
 include/uapi/linux/lnet/nidstr.h              |    2 +
 include/uapi/linux/lustre/lustre_cfg.h        |    1 +
 include/uapi/linux/lustre/lustre_fid.h        |    7 +
 include/uapi/linux/lustre/lustre_idl.h        |  392 +--
 include/uapi/linux/lustre/lustre_ioctl.h      |    5 +-
 include/uapi/linux/lustre/lustre_kernelcomm.h |   15 +-
 include/uapi/linux/lustre/lustre_user.h       |  575 +++--
 include/uapi/linux/lustre/lustre_ver.h        |    6 +-
 mm/page-writeback.c                           |    1 +
 net/lnet/klnds/o2iblnd/o2iblnd.c              |  357 ++-
 net/lnet/klnds/o2iblnd/o2iblnd.h              |   63 +-
 net/lnet/klnds/o2iblnd/o2iblnd_cb.c           |  169 +-
 net/lnet/klnds/o2iblnd/o2iblnd_modparams.c    |   30 +-
 net/lnet/klnds/socklnd/socklnd.c              |  737 +++---
 net/lnet/klnds/socklnd/socklnd.h              |   95 +-
 net/lnet/klnds/socklnd/socklnd_cb.c           |  139 +-
 net/lnet/klnds/socklnd/socklnd_proto.c        |   24 +-
 net/lnet/libcfs/debug.c                       |    5 +-
 net/lnet/libcfs/fail.c                        |   15 +-
 net/lnet/libcfs/libcfs_cpu.c                  |   11 +-
 net/lnet/libcfs/libcfs_lock.c                 |    2 +-
 net/lnet/libcfs/linux-crypto.c                |    5 +-
 net/lnet/libcfs/module.c                      |   33 +-
 net/lnet/libcfs/tracefile.c                   |   50 +-
 net/lnet/lnet/acceptor.c                      |   27 +-
 net/lnet/lnet/api-ni.c                        |  828 +++++--
 net/lnet/lnet/config.c                        |   57 +-
 net/lnet/lnet/lib-eq.c                        |    4 +-
 net/lnet/lnet/lib-md.c                        |   23 +-
 net/lnet/lnet/lib-me.c                        |  135 +-
 net/lnet/lnet/lib-move.c                      | 3266 +++++++++++++++++++------
 net/lnet/lnet/lib-msg.c                       |  711 +++++-
 net/lnet/lnet/lib-ptl.c                       |    2 +-
 net/lnet/lnet/lib-socket.c                    |   17 +-
 net/lnet/lnet/lo.c                            |    1 -
 net/lnet/lnet/module.c                        |    8 +-
 net/lnet/lnet/net_fault.c                     |  126 +-
 net/lnet/lnet/nidstrings.c                    |  272 +-
 net/lnet/lnet/peer.c                          |  725 ++++--
 net/lnet/lnet/router.c                        | 1613 ++++++------
 net/lnet/lnet/router_proc.c                   |  212 +-
 net/lnet/selftest/conctl.c                    |    4 +-
 net/lnet/selftest/console.c                   |   10 +-
 net/lnet/selftest/framework.c                 |   28 +-
 net/lnet/selftest/module.c                    |    2 +-
 net/lnet/selftest/rpc.c                       |   43 +-
 net/lnet/selftest/rpc.h                       |   10 +-
 203 files changed, 25222 insertions(+), 10485 deletions(-)
 create mode 100644 fs/lustre/llite/pcc.c
 create mode 100644 fs/lustre/llite/pcc.h
 delete mode 100644 fs/lustre/lov/lovsub_page.c
 create mode 100644 fs/lustre/obdclass/integrity.c
 create mode 100644 fs/lustre/obdclass/lu_tgt_descs.c
 create mode 100644 fs/lustre/obdclass/obd_cksum.c

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 001/622] lustre: always enable special debugging, fhandles, and quota support.
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
@ 2020-02-27 21:07 ` James Simmons
  2020-02-27 21:07 ` [lustre-devel] [PATCH 002/622] lustre: osc_cache: remove __might_sleep() James Simmons
                   ` (621 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:07 UTC (permalink / raw)
  To: lustre-devel

Lustre heavily depends on fhandles for its FID handling and needs
quota always enabled.

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/Kconfig | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/lustre/Kconfig b/fs/lustre/Kconfig
index 2ea3f24..2eb7e45 100644
--- a/fs/lustre/Kconfig
+++ b/fs/lustre/Kconfig
@@ -9,6 +9,9 @@ config LUSTRE_FS
 	select CRYPTO_SHA1
 	select CRYPTO_SHA256
 	select CRYPTO_SHA512
+	select DEBUG_FS
+	select FHANDLE
+	select QUOTA
 	depends on MULTIUSER
 	help
 	  This option enables Lustre file system client support. Choose Y
@@ -43,6 +46,7 @@ config LUSTRE_FS_POSIX_ACL
 
 config LUSTRE_DEBUG_EXPENSIVE_CHECK
 	bool "Enable Lustre DEBUG checks"
+	select REFCOUNT_FULL
 	depends on LUSTRE_FS
 	help
 	  This option is mainly for debug purpose. It enables Lustre code to do
-- 
1.8.3.1


* [lustre-devel] [PATCH 002/622] lustre: osc_cache: remove __might_sleep()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
  2020-02-27 21:07 ` [lustre-devel] [PATCH 001/622] lustre: always enable special debugging, fhandles, and quota support James Simmons
@ 2020-02-27 21:07 ` James Simmons
  2020-02-27 21:07 ` [lustre-devel] [PATCH 003/622] lustre: uapi: remove enum hsm_progress_states James Simmons
                   ` (620 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:07 UTC (permalink / raw)
  To: lustre-devel

The patch 'simplify osc_wake_cache_waiters()' created a new
wrapper wait_event_idle_exclusive_timeout_cmd() which includes
a __might_sleep() test. This was causing the following back
trace:

kernel: BUG: sleeping function called from invalid context at fs/lustre/osc/osc_cache.c:1635
kernel: in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 19374, name: cp
kernel: INFO: lockdep is turned off.
kernel: Preemption disabled at:
kernel: [<0000000000000000>] 0x0
kernel: CPU: 11 PID: 19374 Comm: cp Tainted: G        W         5.4.0-rc5+ #1
kernel: Call Trace:
kernel: dump_stack+0x5e/0x8b
kernel: ___might_sleep+0x205/0x260
kernel: osc_queue_async_io+0x1104/0x1de0 [osc]
kernel: ? _raw_spin_unlock+0x2e/0x50
kernel: ? libcfs_debug_msg+0x6ab/0xc80 [libcfs]
kernel: ? vvp_io_setattr_start+0x200/0x200 [lustre]
kernel: osc_page_cache_add+0x2c/0xa0 [osc]
kernel: osc_io_commit_async+0x1a8/0x420 [osc]
kernel: cl_io_commit_async+0x58/0x80 [obdclass]
kernel: ? vvp_io_setattr_start+0x200/0x200 [lustre:1

This can be called from an atomic context, and examining the code
suggests we don't need __might_sleep(), so let's remove it.

Fixes: def8e96d4f3d ("lustre: osc_cache: simplify osc_wake_cache_waiters()")

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/osc_cache.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/lustre/osc/osc_cache.c b/fs/lustre/osc/osc_cache.c
index 3189eb3..2ed7ca2 100644
--- a/fs/lustre/osc/osc_cache.c
+++ b/fs/lustre/osc/osc_cache.c
@@ -1570,7 +1570,6 @@ static bool osc_enter_cache_try(struct client_obd *cli,
 					      cmd1, cmd2)		\
 ({									\
 	long __ret = timeout;						\
-	might_sleep();							\
 	if (!___wait_cond_timeout(condition))				\
 		__ret = __wait_event_idle_exclusive_timeout_cmd(	\
 			wq_head, condition, timeout, cmd1, cmd2);	\
-- 
1.8.3.1


* [lustre-devel] [PATCH 003/622] lustre: uapi: remove enum hsm_progress_states
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
  2020-02-27 21:07 ` [lustre-devel] [PATCH 001/622] lustre: always enable special debugging, fhandles, and quota support James Simmons
  2020-02-27 21:07 ` [lustre-devel] [PATCH 002/622] lustre: osc_cache: remove __might_sleep() James Simmons
@ 2020-02-27 21:07 ` James Simmons
  2020-02-27 21:07 ` [lustre-devel] [PATCH 004/622] lustre: uapi: sync enum obd_statfs_state James Simmons
                   ` (619 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:07 UTC (permalink / raw)
  To: lustre-devel

This enum is used only by server-side code.

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/uapi/linux/lustre/lustre_user.h | 21 ---------------------
 1 file changed, 21 deletions(-)

diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 0566afad..f5474c5 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -1532,27 +1532,6 @@ enum hsm_states {
  */
 #define HSM_FLAGS_MASK  (HSM_USER_MASK | HSM_STATUS_MASK)
 
-/**
- * HSM request progress state
- */
-enum hsm_progress_states {
-	HPS_WAITING	= 1,
-	HPS_RUNNING	= 2,
-	HPS_DONE	= 3,
-};
-
-#define HPS_NONE	0
-
-static inline const char *hsm_progress_state2name(enum hsm_progress_states s)
-{
-	switch  (s) {
-	case HPS_WAITING:	return "waiting";
-	case HPS_RUNNING:	return "running";
-	case HPS_DONE:		return "done";
-	default:		return "unknown";
-	}
-}
-
 struct hsm_extent {
 	__u64 offset;
 	__u64 length;
-- 
1.8.3.1


* [lustre-devel] [PATCH 004/622] lustre: uapi: sync enum obd_statfs_state
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (2 preceding siblings ...)
  2020-02-27 21:07 ` [lustre-devel] [PATCH 003/622] lustre: uapi: remove enum hsm_progress_states James Simmons
@ 2020-02-27 21:07 ` James Simmons
  2020-02-27 21:07 ` [lustre-devel] [PATCH 005/622] lustre: llite: return compatible fsid for statfs James Simmons
                   ` (618 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:07 UTC (permalink / raw)
  To: lustre-devel

Due to the drift between the OpenSFS branch and the Linux client,
various enum obd_statfs_state values that are transmitted over the
wire were dropped. Sync the values.

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/uapi/linux/lustre/lustre_user.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index f5474c5..27501a2 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -101,9 +101,9 @@
 enum obd_statfs_state {
 	OS_STATE_DEGRADED	= 0x00000001, /**< RAID degraded/rebuilding */
 	OS_STATE_READONLY	= 0x00000002, /**< filesystem is read-only */
-	OS_STATE_RDONLY_1	= 0x00000004, /**< obsolete 1.6, was EROFS=30 */
-	OS_STATE_RDONLY_2	= 0x00000008, /**< obsolete 1.6, was EROFS=30 */
-	OS_STATE_RDONLY_3	= 0x00000010, /**< obsolete 1.6, was EROFS=30 */
+	OS_STATE_NOPRECREATE	= 0x00000004, /**< no object precreation */
+	OS_STATE_ENOSPC		= 0x00000020, /**< not enough free space */
+	OS_STATE_ENOINO		= 0x00000040, /**< not enough inodes */
 };
 
 struct obd_statfs {
-- 
1.8.3.1


* [lustre-devel] [PATCH 005/622] lustre: llite: return compatible fsid for statfs
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (3 preceding siblings ...)
  2020-02-27 21:07 ` [lustre-devel] [PATCH 004/622] lustre: uapi: sync enum obd_statfs_state James Simmons
@ 2020-02-27 21:07 ` James Simmons
  2020-02-27 21:07 ` [lustre-devel] [PATCH 006/622] lustre: ldlm: Make kvzalloc | kvfree use consistent James Simmons
                   ` (617 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:07 UTC (permalink / raw)
  To: lustre-devel

From: Fan Yong <fan.yong@intel.com>

Lustre uses 64-bit inode numbers to identify objects on the client
side. When Lustre is re-exported via NFS, NFS detects via statfs()
whether the filesystem supports fsid. When it is not supported, NFS
only recognizes and packs the low 32 bits of the inode number into
the NFS handle. Such a handle cannot be used to locate the object
properly. To avoid patching the Linux kernel, the Lustre client
should generate an fsid and return it to the upper layer via statfs().

To be compatible with old Lustre clients (NFS servers), the fsid is
generated from super_block::s_dev.

WC-bug-id: https://jira.whamcloud.com/browse/LU-2904
Lustre-commit: abe4d83fab00 ("LU-2904 llite: return compatible fsid for statfs")
Signed-off-by: Fan Yong <fan.yong@intel.com>
Reviewed-on: http://review.whamcloud.com/7434
Reviewed-by: Bobi Jam <bobijam@whamcloud.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_internal.h |  3 ---
 fs/lustre/llite/llite_lib.c      |  8 ++++----
 fs/lustre/llite/llite_nfs.c      | 16 ----------------
 3 files changed, 4 insertions(+), 23 deletions(-)

diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index f0a50fc..3192340 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -538,8 +538,6 @@ struct ll_sb_info {
 	/* st_blksize returned by stat(2), when non-zero */
 	unsigned int		 ll_stat_blksize;
 
-	__kernel_fsid_t		 ll_fsid;
-
 	struct kset		ll_kset;	/* sysfs object */
 	struct completion	 ll_kobj_unregister;
 };
@@ -941,7 +939,6 @@ static inline ssize_t ll_lov_user_md_size(const struct lov_user_md *lum)
 /* llite/llite_nfs.c */
 extern const struct export_operations lustre_export_operations;
 u32 get_uuid2int(const char *name, int len);
-void get_uuid2fsid(const char *name, int len, __kernel_fsid_t *fsid);
 struct inode *search_inode_for_lustre(struct super_block *sb,
 				      const struct lu_fid *fid);
 int ll_dir_get_parent_fid(struct inode *dir, struct lu_fid *parent_fid);
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index a48d753..e1932ae 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -591,10 +591,8 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 	 * only a node-local comparison.
 	 */
 	uuid = obd_get_uuid(sbi->ll_md_exp);
-	if (uuid) {
+	if (uuid)
 		sb->s_dev = get_uuid2int(uuid->uuid, strlen(uuid->uuid));
-		get_uuid2fsid(uuid->uuid, strlen(uuid->uuid), &sbi->ll_fsid);
-	}
 
 	kfree(data);
 	kfree(osfs);
@@ -1775,6 +1773,7 @@ int ll_statfs(struct dentry *de, struct kstatfs *sfs)
 {
 	struct super_block *sb = de->d_sb;
 	struct obd_statfs osfs;
+	u64 fsid = huge_encode_dev(sb->s_dev);
 	int rc;
 
 	CDEBUG(D_VFSTRACE, "VFS Op: at %llu jiffies\n", get_jiffies_64());
@@ -1805,7 +1804,8 @@ int ll_statfs(struct dentry *de, struct kstatfs *sfs)
 	sfs->f_blocks = osfs.os_blocks;
 	sfs->f_bfree = osfs.os_bfree;
 	sfs->f_bavail = osfs.os_bavail;
-	sfs->f_fsid = ll_s2sbi(sb)->ll_fsid;
+	sfs->f_fsid.val[0] = (u32)fsid;
+	sfs->f_fsid.val[1] = (u32)(fsid >> 32);
 	return 0;
 }
 
diff --git a/fs/lustre/llite/llite_nfs.c b/fs/lustre/llite/llite_nfs.c
index d6643d0..434f92b 100644
--- a/fs/lustre/llite/llite_nfs.c
+++ b/fs/lustre/llite/llite_nfs.c
@@ -57,22 +57,6 @@ u32 get_uuid2int(const char *name, int len)
 	return (key0 << 1);
 }
 
-void get_uuid2fsid(const char *name, int len, __kernel_fsid_t *fsid)
-{
-	u64 key = 0, key0 = 0x12a3fe2d, key1 = 0x37abe8f9;
-
-	while (len--) {
-		key = key1 + (key0 ^ (*name++ * 7152373));
-		if (key & 0x8000000000000000ULL)
-			key -= 0x7fffffffffffffffULL;
-		key1 = key0;
-		key0 = key;
-	}
-
-	fsid->val[0] = key;
-	fsid->val[1] = key >> 32;
-}
-
 struct inode *search_inode_for_lustre(struct super_block *sb,
 				      const struct lu_fid *fid)
 {
-- 
1.8.3.1


* [lustre-devel] [PATCH 006/622] lustre: ldlm: Make kvzalloc | kvfree use consistent
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (4 preceding siblings ...)
  2020-02-27 21:07 ` [lustre-devel] [PATCH 005/622] lustre: llite: return compatible fsid for statfs James Simmons
@ 2020-02-27 21:07 ` James Simmons
  2020-02-27 21:07 ` [lustre-devel] [PATCH 007/622] lustre: llite: limit smallest max_cached_mb value James Simmons
                   ` (616 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:07 UTC (permalink / raw)
  To: lustre-devel

From: "Christopher J. Morrone" <morrone2@llnl.gov>

struct ldlm_lock's l_lvb_data field is freed in ldlm_lock_put()
using kfree(). However, some other code paths can attach
a buffer to l_lvb_data that was allocated using vmalloc().
This can lead to a kfree() of a vmalloc()ed buffer, which can
trigger a kernel Oops.

WC-bug-id: https://jira.whamcloud.com/browse/LU-4194
Lustre-commit: 9c4d506c5fea ("LU-4194 ldlm: Make OBD_[ALLOC|FREE]_LARGE use consistent")
Signed-off-by: Christopher J. Morrone <morrone2@llnl.gov>
Reviewed-on: http://review.whamcloud.com/8298
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Faccini Bruno <bruno.faccini@intel.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_lock.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/ldlm/ldlm_lock.c b/fs/lustre/ldlm/ldlm_lock.c
index 6eebf5f..7242cd1 100644
--- a/fs/lustre/ldlm/ldlm_lock.c
+++ b/fs/lustre/ldlm/ldlm_lock.c
@@ -185,7 +185,7 @@ void ldlm_lock_put(struct ldlm_lock *lock)
 			lock->l_export = NULL;
 		}
 
-		kfree(lock->l_lvb_data);
+		kvfree(lock->l_lvb_data);
 
 		lu_ref_fini(&lock->l_reference);
 		OBD_FREE_RCU(lock, sizeof(*lock), &lock->l_handle);
@@ -1548,7 +1548,7 @@ struct ldlm_lock *ldlm_lock_create(struct ldlm_namespace *ns,
 
 	if (lvb_len) {
 		lock->l_lvb_len = lvb_len;
-		lock->l_lvb_data = kzalloc(lvb_len, GFP_NOFS);
+		lock->l_lvb_data = kvzalloc(lvb_len, GFP_NOFS);
 		if (!lock->l_lvb_data) {
 			rc = -ENOMEM;
 			goto out;
-- 
1.8.3.1


* [lustre-devel] [PATCH 007/622] lustre: llite: limit smallest max_cached_mb value
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (5 preceding siblings ...)
  2020-02-27 21:07 ` [lustre-devel] [PATCH 006/622] lustre: ldlm: Make kvzalloc | kvfree use consistent James Simmons
@ 2020-02-27 21:07 ` James Simmons
  2020-02-27 21:07 ` [lustre-devel] [PATCH 008/622] lustre: obdecho: turn on async flag only for mode 3 James Simmons
                   ` (615 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:07 UTC (permalink / raw)
  To: lustre-devel

From: James Nunez <jnunez@whamcloud.com>

Currently, ost-survey hangs due to calling
'lfs setstripe' in an old (positional) style and
setting max_cached_mb to zero.

In ll_max_cached_mb_seq_write(), the number of
pages requested is raised to at least
PTLRPC_MAX_BRW_PAGES so that the client can make
well-formed RPCs.

WC-bug-id: https://jira.whamcloud.com/browse/LU-4768
Lustre-commit: 46bec835ac72 ("LU-4768 tests: Update ost-survey script")
Signed-off-by: James Nunez <jnunez@whamcloud.com>
Reviewed-on: http://review.whamcloud.com/11971
Reviewed-by: Nathaniel Clark <nclark@whamcloud.com>
Reviewed-by: Cliff White <cliff.white@intel.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/lproc_llite.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/lustre/llite/lproc_llite.c b/fs/lustre/llite/lproc_llite.c
index e108326..5ac6689 100644
--- a/fs/lustre/llite/lproc_llite.c
+++ b/fs/lustre/llite/lproc_llite.c
@@ -527,6 +527,8 @@ static ssize_t ll_max_cached_mb_seq_write(struct file *file,
 		       totalram_pages() >> (20 - PAGE_SHIFT));
 		return -ERANGE;
 	}
+	/* Allow enough cache so clients can make well-formed RPCs */
+	pages_number = max_t(long, pages_number, PTLRPC_MAX_BRW_PAGES);
 
 	spin_lock(&sbi->ll_lock);
 	diff = pages_number - cache->ccc_lru_max;
-- 
1.8.3.1


* [lustre-devel] [PATCH 008/622] lustre: obdecho: turn on async flag only for mode 3
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (6 preceding siblings ...)
  2020-02-27 21:07 ` [lustre-devel] [PATCH 007/622] lustre: llite: limit smallest max_cached_mb value James Simmons
@ 2020-02-27 21:07 ` James Simmons
  2020-02-27 21:07 ` [lustre-devel] [PATCH 009/622] lustre: llite: reorganize variable and data structures James Simmons
                   ` (614 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:07 UTC (permalink / raw)
  To: lustre-devel

From: Rahul Deshmukh <rahul_deshmukh@xyratex.com>

There are a couple of problems in obdfilter-survey:
- the brw test type, i.e. "g", was not followed by npages,
- the target netdisk was not set properly, and
- the async flag should be turned on only for mode 3.

This patch fixes the last problem, which is on the kernel side.

WC-bug-id: https://jira.whamcloud.com/browse/LU-5031
Lustre-commit: 9f38647a7b24 ("LU-5031 tests: obdfilter-survey fixes")
Signed-off-by: Rahul Deshmukh <rahul_deshmukh@xyratex.com>
Reviewed-on: http://review.whamcloud.com/10264
Reviewed-by: Cliff White <cliff.white@intel.com>
Reviewed-by: Bob Glossman <bob.glossman@intel.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdecho/echo_client.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/lustre/obdecho/echo_client.c b/fs/lustre/obdecho/echo_client.c
index ca963bb..3984cb4 100644
--- a/fs/lustre/obdecho/echo_client.c
+++ b/fs/lustre/obdecho/echo_client.c
@@ -1425,7 +1425,7 @@ static int echo_client_brw_ioctl(const struct lu_env *env, int rw,
 	struct obdo *oa = &data->ioc_obdo1;
 	struct echo_object *eco;
 	int rc;
-	int async = 1;
+	int async = 0;
 	long test_mode;
 
 	LASSERT(oa->o_valid & OBD_MD_FLGROUP);
@@ -1438,14 +1438,14 @@ static int echo_client_brw_ioctl(const struct lu_env *env, int rw,
 
 	/* OFD/obdfilter works only via prep/commit */
 	test_mode = (long)data->ioc_pbuf1;
-	if (test_mode == 1)
-		async = 0;
-
 	if (!ed->ed_next && test_mode != 3) {
 		test_mode = 3;
 		data->ioc_plen1 = data->ioc_count;
 	}
 
+	if (test_mode == 3)
+		async = 1;
+
 	/* Truncate batch size to maximum */
 	if (data->ioc_plen1 > PTLRPC_MAX_BRW_SIZE)
 		data->ioc_plen1 = PTLRPC_MAX_BRW_SIZE;
-- 
1.8.3.1


* [lustre-devel] [PATCH 009/622] lustre: llite: reorganize variable and data structures
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (7 preceding siblings ...)
  2020-02-27 21:07 ` [lustre-devel] [PATCH 008/622] lustre: obdecho: turn on async flag only for mode 3 James Simmons
@ 2020-02-27 21:07 ` James Simmons
  2020-02-27 21:07 ` [lustre-devel] [PATCH 010/622] lustre: llite: increase whole-file readahead to RPC size James Simmons
                   ` (613 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:07 UTC (permalink / raw)
  To: lustre-devel

From: "John L. Hammond" <jhammond@whamcloud.com>

This patch covers the bits missed in the patch series
"Lustre IO stack simplifications and cleanups" from the OpenSFS
branch for the LU-5971 work. Details of the original push can
be viewed at https://lore.kernel.org/patchwork/cover/662900.
No Fixes: tag is provided since the staging patch series was broken
up into a much larger patch set.

WC-bug-id: https://jira.whamcloud.com/browse/LU-5971
Lustre-commit: 6eda93c7b5f6 ("LU-5971 llite: reorganize variable and data structures")
Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
Signed-off-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Reviewed-on: http://review.whamcloud.com/13714
Reviewed-by: Bobi Jam <bobijam@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c           |  1 +
 fs/lustre/llite/glimpse.c        |  1 +
 fs/lustre/llite/lcommon_cl.c     |  5 ++---
 fs/lustre/llite/lcommon_misc.c   | 24 ++++++++++++------------
 fs/lustre/llite/llite_internal.h |  8 ++++----
 fs/lustre/llite/llite_lib.c      |  4 ++--
 fs/lustre/llite/super25.c        |  1 +
 fs/lustre/llite/vvp_dev.c        |  1 +
 fs/lustre/llite/vvp_internal.h   | 13 +++----------
 fs/lustre/llite/vvp_io.c         |  4 ++--
 10 files changed, 29 insertions(+), 33 deletions(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index fe4340d..fe965b1 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -49,6 +49,7 @@
 
 #include <cl_object.h>
 #include "llite_internal.h"
+#include "vvp_internal.h"
 
 struct split_param {
 	struct inode	*sp_inode;
diff --git a/fs/lustre/llite/glimpse.c b/fs/lustre/llite/glimpse.c
index de1a31f..3441904 100644
--- a/fs/lustre/llite/glimpse.c
+++ b/fs/lustre/llite/glimpse.c
@@ -47,6 +47,7 @@
 
 #include <cl_object.h>
 #include "llite_internal.h"
+#include "vvp_internal.h"
 
 static const struct cl_lock_descr whole_file = {
 	.cld_start = 0,
diff --git a/fs/lustre/llite/lcommon_cl.c b/fs/lustre/llite/lcommon_cl.c
index 988855b..978e05b 100644
--- a/fs/lustre/llite/lcommon_cl.c
+++ b/fs/lustre/llite/lcommon_cl.c
@@ -30,8 +30,6 @@
  * This file is part of Lustre, http://www.lustre.org/
  * Lustre is a trademark of Sun Microsystems, Inc.
  *
- * cl code used by vvp (and other Lustre clients in the future).
- *
  *   Author: Nikita Danilov <nikita.danilov@sun.com>
  */
 
@@ -63,6 +61,7 @@
  * Vvp device and device type functions.
  *
  */
+#include "vvp_internal.h"
 
 /**
  * An `emergency' environment used by cl_inode_fini() when cl_env_get()
@@ -282,7 +281,7 @@ u64 cl_fid_build_ino(const struct lu_fid *fid, bool api32)
 		return fid_flatten(fid);
 }
 
-/**
+/*
  * build inode generation from passed @fid.  If our FID overflows the 32-bit
  * inode number then return a non-zero generation to distinguish them.
  */
diff --git a/fs/lustre/llite/lcommon_misc.c b/fs/lustre/llite/lcommon_misc.c
index 29daf5b..48503d6 100644
--- a/fs/lustre/llite/lcommon_misc.c
+++ b/fs/lustre/llite/lcommon_misc.c
@@ -46,7 +46,7 @@
  * maximum-sized (= maximum striped) EA and cookie without having to
  * calculate this (via a call into the LOV + OSCs) each time we make an RPC.
  */
-int cl_init_ea_size(struct obd_export *md_exp, struct obd_export *dt_exp)
+static int cl_init_ea_size(struct obd_export *md_exp, struct obd_export *dt_exp)
 {
 	u32 val_size, max_easize, def_easize;
 	int rc;
@@ -115,7 +115,7 @@ int cl_ocd_update(struct obd_device *host, struct obd_device *watched,
 #define GROUPLOCK_SCOPE "grouplock"
 
 int cl_get_grouplock(struct cl_object *obj, unsigned long gid, int nonblock,
-		     struct ll_grouplock *cg)
+		     struct ll_grouplock *lg)
 {
 	struct lu_env	  *env;
 	struct cl_io	   *io;
@@ -160,22 +160,22 @@ int cl_get_grouplock(struct cl_object *obj, unsigned long gid, int nonblock,
 		return rc;
 	}
 
-	cg->lg_env  = env;
-	cg->lg_io   = io;
-	cg->lg_lock = lock;
-	cg->lg_gid  = gid;
+	lg->lg_env = env;
+	lg->lg_io = io;
+	lg->lg_lock = lock;
+	lg->lg_gid = gid;
 
 	return 0;
 }
 
-void cl_put_grouplock(struct ll_grouplock *cg)
+void cl_put_grouplock(struct ll_grouplock *lg)
 {
-	struct lu_env  *env  = cg->lg_env;
-	struct cl_io   *io   = cg->lg_io;
-	struct cl_lock *lock = cg->lg_lock;
+	struct lu_env *env  = lg->lg_env;
+	struct cl_io *io   = lg->lg_io;
+	struct cl_lock *lock = lg->lg_lock;
 
-	LASSERT(cg->lg_env);
-	LASSERT(cg->lg_gid);
+	LASSERT(lg->lg_env);
+	LASSERT(lg->lg_gid);
 
 	cl_lock_release(env, lock);
 	cl_io_fini(env, io);
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index 3192340..fbe93a4 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -707,7 +707,6 @@ static inline bool ll_sbi_has_tiny_write(struct ll_sb_info *sbi)
 void ll_ras_enter(struct file *f);
 
 /* llite/lcommon_misc.c */
-int cl_init_ea_size(struct obd_export *md_exp, struct obd_export *dt_exp);
 int cl_ocd_update(struct obd_device *host, struct obd_device *watched,
 		  enum obd_notify_event ev, void *owner);
 int cl_get_grouplock(struct cl_object *obj, unsigned long gid, int nonblock,
@@ -975,9 +974,9 @@ struct ll_cl_context {
 
 struct ll_thread_info {
 	struct iov_iter		lti_iter;
-	struct vvp_io_args   lti_args;
-	struct ra_io_arg     lti_ria;
-	struct ll_cl_context lti_io_ctx;
+	struct vvp_io_args	lti_args;
+	struct ra_io_arg	lti_ria;
+	struct ll_cl_context	lti_io_ctx;
 };
 
 extern struct lu_context_key ll_thread_key;
@@ -1165,6 +1164,7 @@ struct ll_statahead_info {
 blkcnt_t dirty_cnt(struct inode *inode);
 
 int __cl_glimpse_size(struct inode *inode, int agl);
+
 int cl_glimpse_lock(const struct lu_env *env, struct cl_io *io,
 		    struct inode *inode, struct cl_object *clob, int agl);
 
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index e1932ae..aaa8ad2 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -2542,7 +2542,7 @@ void ll_dirty_page_discard_warn(struct page *page, int ioret)
 {
 	char *buf, *path = NULL;
 	struct dentry *dentry = NULL;
-	struct vvp_object *obj = cl_inode2vvp(page->mapping->host);
+	struct inode *inode = page->mapping->host;
 
 	/* this can be called inside spin lock so use GFP_ATOMIC. */
 	buf = (char *)__get_free_page(GFP_ATOMIC);
@@ -2556,7 +2556,7 @@ void ll_dirty_page_discard_warn(struct page *page, int ioret)
 	       "%s: dirty page discard: %s/fid: " DFID "/%s may get corrupted (rc %d)\n",
 	       ll_get_fsname(page->mapping->host->i_sb, NULL, 0),
 	       s2lsi(page->mapping->host->i_sb)->lsi_lmd->lmd_dev,
-	       PFID(&obj->vob_header.coh_lu.loh_fid),
+	       PFID(ll_inode2fid(inode)),
 	       (path && !IS_ERR(path)) ? path : "", ioret);
 
 	if (dentry)
diff --git a/fs/lustre/llite/super25.c b/fs/lustre/llite/super25.c
index 2b65e2f..133fe2a 100644
--- a/fs/lustre/llite/super25.c
+++ b/fs/lustre/llite/super25.c
@@ -42,6 +42,7 @@
 #include <linux/fs.h>
 #include <lprocfs_status.h>
 #include "llite_internal.h"
+#include "vvp_internal.h"
 
 static struct kmem_cache *ll_inode_cachep;
 
diff --git a/fs/lustre/llite/vvp_dev.c b/fs/lustre/llite/vvp_dev.c
index 9f793e9..e1d87f9 100644
--- a/fs/lustre/llite/vvp_dev.c
+++ b/fs/lustre/llite/vvp_dev.c
@@ -93,6 +93,7 @@ static void *ll_thread_key_init(const struct lu_context *ctx,
 	info = kmem_cache_zalloc(ll_thread_kmem, GFP_NOFS);
 	if (!info)
 		info = ERR_PTR(-ENOMEM);
+
 	return info;
 }
 
diff --git a/fs/lustre/llite/vvp_internal.h b/fs/lustre/llite/vvp_internal.h
index 96f10d2..7a463cb 100644
--- a/fs/lustre/llite/vvp_internal.h
+++ b/fs/lustre/llite/vvp_internal.h
@@ -166,7 +166,7 @@ static inline struct cl_io *vvp_env_thread_io(const struct lu_env *env)
 }
 
 struct vvp_session {
-	struct vvp_io	cs_ios;
+	struct vvp_io	vs_ios;
 };
 
 static inline struct vvp_session *vvp_env_session(const struct lu_env *env)
@@ -181,11 +181,11 @@ static inline struct vvp_session *vvp_env_session(const struct lu_env *env)
 
 static inline struct vvp_io *vvp_env_io(const struct lu_env *env)
 {
-	return &vvp_env_session(env)->cs_ios;
+	return &vvp_env_session(env)->vs_ios;
 }
 
 /**
- * ccc-private object state.
+ * VPP-private object state.
  */
 struct vvp_object {
 	struct cl_object_header vob_header;
@@ -246,13 +246,6 @@ struct vvp_device {
 	struct cl_device	*vdv_next;
 };
 
-void *ccc_key_init(const struct lu_context *ctx,
-		   struct lu_context_key *key);
-void ccc_key_fini(const struct lu_context *ctx,
-		  struct lu_context_key *key, void *data);
-
-void ccc_umount(const struct lu_env *env, struct cl_device *dev);
-
 static inline struct lu_device *vvp2lu_dev(struct vvp_device *vdv)
 {
 	return &vdv->vdv_cl.cd_lu_dev;
diff --git a/fs/lustre/llite/vvp_io.c b/fs/lustre/llite/vvp_io.c
index 6145064..37bf942 100644
--- a/fs/lustre/llite/vvp_io.c
+++ b/fs/lustre/llite/vvp_io.c
@@ -416,10 +416,10 @@ static enum cl_lock_mode vvp_mode_from_vma(struct vm_area_struct *vma)
 static int vvp_mmap_locks(const struct lu_env *env,
 			  struct vvp_io *vio, struct cl_io *io)
 {
-	struct vvp_thread_info *cti = vvp_env_info(env);
+	struct vvp_thread_info *vti = vvp_env_info(env);
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct  *vma;
-	struct cl_lock_descr *descr = &cti->vti_descr;
+	struct cl_lock_descr *descr = &vti->vti_descr;
 	union ldlm_policy_data policy;
 	unsigned long addr;
 	ssize_t	count;
-- 
1.8.3.1


* [lustre-devel] [PATCH 010/622] lustre: llite: increase whole-file readahead to RPC size
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (8 preceding siblings ...)
  2020-02-27 21:07 ` [lustre-devel] [PATCH 009/622] lustre: llite: reorganize variable and data structures James Simmons
@ 2020-02-27 21:07 ` James Simmons
  2020-02-27 21:07 ` [lustre-devel] [PATCH 011/622] lustre: llite: handle ORPHAN/DEAD directories James Simmons
                   ` (612 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:07 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Increase the default whole-file readahead limit to match the current
RPC size.  That ensures that files smaller than the RPC size will be
read in a single round-trip instead of sending multiple smaller RPCs.
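A minimal userspace sketch of the default calculation described above, assuming a 4 KiB page size and a placeholder value for SBI_DEFAULT_READAHEAD_WHOLE_MAX (the real constant lives in the llite headers):

```c
#include <assert.h>

#define PAGE_SHIFT 12                       /* 4 KiB pages, for illustration */
/* Placeholder only: the actual SBI_DEFAULT_READAHEAD_WHOLE_MAX is defined
 * in fs/lustre/llite; 2 MiB worth of pages is assumed here. */
#define DEFAULT_WHOLE_MAX_PAGES (2UL << (20 - PAGE_SHIFT))

/* Model of the new default: whole-file readahead covers at least one full
 * RPC (ocd_brw_size bytes) unless the config log already set a value. */
static unsigned long ra_whole_pages(long configured, unsigned long brw_size)
{
	unsigned long rpc_pages = brw_size >> PAGE_SHIFT;

	if (configured != -1)
		return (unsigned long)configured;   /* honor the config log */
	return rpc_pages > DEFAULT_WHOLE_MAX_PAGES ?
	       rpc_pages : DEFAULT_WHOLE_MAX_PAGES;
}
```

With a 4 MiB ocd_brw_size this yields 1024 pages, so any file up to one RPC in size is fetched in a single round trip.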

WC-bug-id: https://jira.whamcloud.com/browse/LU-7990
Lustre-commit: 627d0133d9d7 ("LU-7990 llite: increase whole-file readahead to RPC size")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/26955
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_lib.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index aaa8ad2..12aafe0 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -465,6 +465,12 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 
 	sbi->ll_dt_exp->exp_connect_data = *data;
 
+	/* Don't change value if it was specified in the config log */
+	if (sbi->ll_ra_info.ra_max_read_ahead_whole_pages == -1)
+		sbi->ll_ra_info.ra_max_read_ahead_whole_pages =
+			max_t(unsigned long, SBI_DEFAULT_READAHEAD_WHOLE_MAX,
+			      (data->ocd_brw_size >> PAGE_SHIFT));
+
 	err = obd_fid_init(sbi->ll_dt_exp->exp_obd, sbi->ll_dt_exp,
 			   LUSTRE_SEQ_METADATA);
 	if (err) {
-- 
1.8.3.1


* [lustre-devel] [PATCH 011/622] lustre: llite: handle ORPHAN/DEAD directories
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (9 preceding siblings ...)
  2020-02-27 21:07 ` [lustre-devel] [PATCH 010/622] lustre: llite: increase whole-file readahead to RPC size James Simmons
@ 2020-02-27 21:07 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 012/622] lustre: lov: protected ost pool count updation James Simmons
                   ` (611 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:07 UTC (permalink / raw)
  To: lustre-devel

From: Di Wang <di.wang@intel.com>

Don't set the directory MDS striping if the parent is dead.
To make this behavior testable, add the OBD_FAIL_LLITE_NO_CHECK_DEAD
fault injection point.

WC-bug-id: https://jira.whamcloud.com/browse/LU-7579
Lustre-commit: 098fb363c39 ("LU-7579 osd: move ORPHAN/DEAD flag to OSD")
Signed-off-by: Di Wang <di.wang@intel.com>
Reviewed-on: http://review.whamcloud.com/18024
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_support.h | 1 +
 fs/lustre/llite/dir.c           | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index e10b372..653a456 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -442,6 +442,7 @@
 #define OBD_FAIL_LLITE_XATTR_ENOMEM			0x1405
 #define OBD_FAIL_MAKE_LOVEA_HOLE			0x1406
 #define OBD_FAIL_LLITE_LOST_LAYOUT			0x1407
+#define OBD_FAIL_LLITE_NO_CHECK_DEAD			0x1408
 #define OBD_FAIL_GETATTR_DELAY				0x1409
 #define OBD_FAIL_LLITE_CREATE_NODE_PAUSE		0x140c
 #define OBD_FAIL_LLITE_IMUTEX_SEC			0x140e
diff --git a/fs/lustre/llite/dir.c b/fs/lustre/llite/dir.c
index d3ef669..f21727b 100644
--- a/fs/lustre/llite/dir.c
+++ b/fs/lustre/llite/dir.c
@@ -433,6 +433,10 @@ static int ll_dir_setdirstripe(struct dentry *dparent, struct lmv_user_md *lump,
 	    !(exp_connect_flags(sbi->ll_md_exp) & OBD_CONNECT_DIR_STRIPE))
 		return -EINVAL;
 
+	if (IS_DEADDIR(parent) &&
+	    !OBD_FAIL_CHECK(OBD_FAIL_LLITE_NO_CHECK_DEAD))
+		return -ENOENT;
+
 	if (lump->lum_magic != cpu_to_le32(LMV_USER_MAGIC) &&
 	    lump->lum_magic != cpu_to_le32(LMV_USER_MAGIC_SPECIFIC))
 		lustre_swab_lmv_user_md(lump);
-- 
1.8.3.1


* [lustre-devel] [PATCH 012/622] lustre: lov: protected ost pool count updation
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (10 preceding siblings ...)
  2020-02-27 21:07 ` [lustre-devel] [PATCH 011/622] lustre: llite: handle ORPHAN/DEAD directories James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 013/622] lustre: obdclass: fix llog_cat_cleanup() usage on Client James Simmons
                   ` (610 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Jadhav Vikram <jadhav.vikram@seagate.com>

The assertion LASSERT(iter->lpi_idx <= (iter->lpi_pool)->pool_obds.op_count)
was triggered because reads of the OST pool count were not protected in
pool_proc_next() and pool_proc_show(); pool_proc_show() could be called
while op_count was zero.

Fix this by protecting the OST pool count: take the lock in the sequence
start function pool_proc_start() and release it in pool_proc_stop().
Rather than wrapping pool_proc_next() and pool_proc_show() in their own
down_read()/up_read() pairs, this change keeps the OST pool data
protected throughout the whole sequence operation.
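The resulting locking discipline can be sketched in userspace as below (names and types are hypothetical stand-ins for the seq_file hooks and the rw_semaphore, not the Lustre code itself; the semaphore is modeled by a reader counter):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model: acquire the pool's reader lock once in the seq_file
 * ->start hook and release it in ->stop, so ->next and ->show observe a
 * stable target count for the whole walk. */
struct pool {
	int readers;            /* stands in for pool_tgt_rw_sem held for read */
	int count;              /* pool_tgt_count(): may change when unlocked */
	const char *tgts[8];
};

struct pool_iter {
	struct pool *pool;
	int idx;
};

static void pool_seq_start(struct pool_iter *it, struct pool *p)
{
	it->pool = p;
	it->idx = 0;
	p->readers++;                   /* was: down_read(&pool_tgt_rw_sem(pool)) */
}

static int pool_seq_next(struct pool_iter *it)
{
	assert(it->pool->readers > 0);  /* lock held across the sequence */
	if (it->idx + 1 >= it->pool->count)
		return 0;               /* stay on the last entry */
	it->idx++;
	return 1;
}

static const char *pool_seq_show(struct pool_iter *it)
{
	assert(it->pool->readers > 0);  /* count cannot drop to 0 under us */
	return it->pool->tgts[it->idx];
}

static void pool_seq_stop(struct pool_iter *it)
{
	it->pool->readers--;            /* was: up_read(&pool_tgt_rw_sem(pool)) */
}
```

The point of the design is that ->next and ->show no longer take the lock themselves: one read-side critical section covers the entire start/next/show/stop sequence.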

Seagate-bug-id: MRP-3629
WC-bug-id: https://jira.whamcloud.com/browse/LU-9620
Lustre-commit: 61c803319b91 ("LU-9620 lod: protected ost pool count updation")
Signed-off-by: Jadhav Vikram <jadhav.vikram@seagate.com>
Reviewed-by: Ashish Purkar <ashish.purkar@seagate.com>
Reviewed-by: Vladimir Saveliev <vladimir.saveliev@seagate.com>
Reviewed-on: https://review.whamcloud.com/27506
Reviewed-by: Fan Yong <fan.yong@intel.com>
Reviewed-by: Niu Yawei <yawei.niu@intel.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/lov/lov_pool.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/lustre/lov/lov_pool.c b/fs/lustre/lov/lov_pool.c
index 60565b9..a0552fb 100644
--- a/fs/lustre/lov/lov_pool.c
+++ b/fs/lustre/lov/lov_pool.c
@@ -117,14 +117,11 @@ static void *pool_proc_next(struct seq_file *s, void *v, loff_t *pos)
 
 	/* iterate to find a non empty entry */
 	prev_idx = iter->idx;
-	down_read(&pool_tgt_rw_sem(iter->pool));
 	iter->idx++;
-	if (iter->idx == pool_tgt_count(iter->pool)) {
+	if (iter->idx >= pool_tgt_count(iter->pool)) {
 		iter->idx = prev_idx; /* we stay on the last entry */
-		up_read(&pool_tgt_rw_sem(iter->pool));
 		return NULL;
 	}
-	up_read(&pool_tgt_rw_sem(iter->pool));
 	(*pos)++;
 	/* return != NULL to continue */
 	return iter;
@@ -157,6 +154,7 @@ static void *pool_proc_start(struct seq_file *s, loff_t *pos)
 	 */
 	/* /!\ do not forget to restore it to pool before freeing it */
 	s->private = iter;
+	down_read(&pool_tgt_rw_sem(pool));
 	if (*pos > 0) {
 		loff_t i;
 		void *ptr;
@@ -179,6 +177,7 @@ static void pool_proc_stop(struct seq_file *s, void *v)
 	 * we have to free only if s->private is an iterator
 	 */
 	if ((iter) && (iter->magic == POOL_IT_MAGIC)) {
+		up_read(&pool_tgt_rw_sem(iter->pool));
 		/* we restore s->private so next call to pool_proc_start()
 		 * will work
 		 */
@@ -197,9 +196,7 @@ static int pool_proc_show(struct seq_file *s, void *v)
 	LASSERT(iter->pool);
 	LASSERT(iter->idx <= pool_tgt_count(iter->pool));
 
-	down_read(&pool_tgt_rw_sem(iter->pool));
 	tgt = pool_tgt(iter->pool, iter->idx);
-	up_read(&pool_tgt_rw_sem(iter->pool));
 	if (tgt)
 		seq_printf(s, "%s\n", obd_uuid2str(&tgt->ltd_uuid));
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 013/622] lustre: obdclass: fix llog_cat_cleanup() usage on Client
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (11 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 012/622] lustre: lov: protected ost pool count updation James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 014/622] lustre: mdc: fix possible NULL pointer dereference James Simmons
                   ` (609 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Bruno Faccini <bruno.faccini@intel.com>

With patch/commit 3a83b4b9 for LU-5195, the LLOG code was strengthened
against catalog inconsistency: when a referenced plain LLOG is found to
be missing, its associated entry is cleared by calling
llog_cat_cleanup(). That function now needs to handle the case where it
is executed on a client (i.e. cathandle->lgh_obj == NULL) and thus must
not attempt to update the on-disk catalog.

WC-bug-id: https://jira.whamcloud.com/browse/LU-6471
Lustre-commit: 485f3ba87433 ("LU-6471 obdclass: fix llog_cat_cleanup() usage on Client")
Signed-off-by: Bruno Faccini <bruno.faccini@intel.com>
Reviewed-on: http://review.whamcloud.com/14489
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/llog_cat.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/lustre/obdclass/llog_cat.c b/fs/lustre/obdclass/llog_cat.c
index 580d807..ca97e08 100644
--- a/fs/lustre/obdclass/llog_cat.c
+++ b/fs/lustre/obdclass/llog_cat.c
@@ -133,10 +133,8 @@ int llog_cat_close(const struct lu_env *env, struct llog_handle *cathandle)
 		list_del_init(&loghandle->u.phd.phd_entry);
 		llog_close(env, loghandle);
 	}
-	/* if handle was stored in ctxt, remove it too */
-	if (cathandle->lgh_ctxt->loc_handle == cathandle)
-		cathandle->lgh_ctxt->loc_handle = NULL;
-	return llog_close(env, cathandle);
+
+	return 0;
 }
 EXPORT_SYMBOL(llog_cat_close);
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 014/622] lustre: mdc: fix possible NULL pointer dereference
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (12 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 013/622] lustre: obdclass: fix llog_cat_cleanup() usage on Client James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 015/622] lustre: obdclass: allow specifying complex jobids James Simmons
                   ` (608 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Fix two static analysis errors.

fs/lustre/mdc/mdc_dev.c: in mdc_enqueue_send(), pointer 'matched' returned
    from call to function 'ldlm_handle2lock' at line 704 may be NULL
    and will be dereferenced at line 705.
If client is evicted between ldlm_lock_match() and ldlm_handle2lock()
the lock pointer could be NULL.

fs/lustre/lov/lov_dev.c:488 in lov_process_config, sscanf format
    specification '%d' expects type 'int' for 'd', but parameter 3
    has a different type '__u32'.
Converting to kstrtou32() requires changing the "index" variable type
from __u32 to u32, which is fine since it is only used internally. Also
fix up the few functions that pass "__u32 index" and the resulting
checkpatch.pl warnings.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10264
Lustre-commit: b89206476174 ("LU-10264 mdc: fix possible NULL pointer dereference")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/31621
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: Bob Glossman <bob.glossman@intel.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/lov/lov_obd.c | 45 ++++++++++++++++++++++++---------------------
 fs/lustre/mdc/mdc_dev.c |  2 +-
 2 files changed, 25 insertions(+), 22 deletions(-)

diff --git a/fs/lustre/lov/lov_obd.c b/fs/lustre/lov/lov_obd.c
index 1708fa9..26637bc 100644
--- a/fs/lustre/lov/lov_obd.c
+++ b/fs/lustre/lov/lov_obd.c
@@ -312,7 +312,8 @@ static int lov_disconnect(struct obd_export *exp)
 {
 	struct obd_device *obd = class_exp2obd(exp);
 	struct lov_obd *lov = &obd->u.lov;
-	int i, rc;
+	u32 index;
+	int rc;
 
 	if (!lov->lov_tgts)
 		goto out;
@@ -321,19 +322,19 @@ static int lov_disconnect(struct obd_export *exp)
 	lov->lov_connects--;
 	if (lov->lov_connects != 0) {
 		/* why should there be more than 1 connect? */
-		CERROR("disconnect #%d\n", lov->lov_connects);
+		CWARN("%s: unexpected disconnect #%d\n",
+		      obd->obd_name, lov->lov_connects);
 		goto out;
 	}
 
-	/* Let's hold another reference so lov_del_obd doesn't spin through
-	 * putref every time
-	 */
+	/* hold another ref so lov_del_obd() doesn't spin in putref each time */
 	lov_tgts_getref(obd);
 
-	for (i = 0; i < lov->desc.ld_tgt_count; i++) {
-		if (lov->lov_tgts[i] && lov->lov_tgts[i]->ltd_exp) {
-			/* Disconnection is the last we know about an obd */
-			lov_del_target(obd, i, NULL, lov->lov_tgts[i]->ltd_gen);
+	for (index = 0; index < lov->desc.ld_tgt_count; index++) {
+		if (lov->lov_tgts[index] && lov->lov_tgts[index]->ltd_exp) {
+			/* Disconnection is the last we know about an OBD */
+			lov_del_target(obd, index, NULL,
+				       lov->lov_tgts[index]->ltd_gen);
 		}
 	}
 
@@ -490,13 +491,12 @@ static int lov_add_target(struct obd_device *obd, struct obd_uuid *uuidp,
 	       uuidp->uuid, index, gen, active);
 
 	if (gen <= 0) {
-		CERROR("request to add OBD %s with invalid generation: %d\n",
-		       uuidp->uuid, gen);
+		CERROR("%s: request to add '%s' with invalid generation: %d\n",
+		       obd->obd_name, uuidp->uuid, gen);
 		return -EINVAL;
 	}
 
-	tgt_obd = class_find_client_obd(uuidp, LUSTRE_OSC_NAME,
-					&obd->obd_uuid);
+	tgt_obd = class_find_client_obd(uuidp, LUSTRE_OSC_NAME, &obd->obd_uuid);
 	if (!tgt_obd)
 		return -EINVAL;
 
@@ -504,10 +504,11 @@ static int lov_add_target(struct obd_device *obd, struct obd_uuid *uuidp,
 
 	if ((index < lov->lov_tgt_size) && lov->lov_tgts[index]) {
 		tgt = lov->lov_tgts[index];
-		CERROR("UUID %s already assigned at LOV target index %d\n",
-		       obd_uuid2str(&tgt->ltd_uuid), index);
+		rc = -EEXIST;
+		CERROR("%s: UUID %s already assigned at index %d: rc = %d\n",
+		       obd->obd_name, obd_uuid2str(&tgt->ltd_uuid), index, rc);
 		mutex_unlock(&lov->lov_lock);
-		return -EEXIST;
+		return rc;
 	}
 
 	if (index >= lov->lov_tgt_size) {
@@ -602,8 +603,8 @@ static int lov_add_target(struct obd_device *obd, struct obd_uuid *uuidp,
 
 out:
 	if (rc) {
-		CERROR("add failed (%d), deleting %s\n", rc,
-		       obd_uuid2str(&tgt->ltd_uuid));
+		CERROR("%s: add failed, deleting %s: rc = %d\n",
+		       obd->obd_name, obd_uuid2str(&tgt->ltd_uuid), rc);
 		lov_del_target(obd, index, NULL, 0);
 	}
 	lov_tgts_putref(obd);
@@ -860,6 +861,7 @@ int lov_process_config_base(struct obd_device *obd, struct lustre_cfg *lcfg,
 	case LCFG_LOV_DEL_OBD: {
 		u32 index;
 		int gen;
+
 		/* lov_modify_tgts add  0:lov_mdsA  1:ost1_UUID  2:0  3:1 */
 		if (LUSTRE_CFG_BUFLEN(lcfg, 1) > sizeof(obd_uuid.uuid)) {
 			rc = -EINVAL;
@@ -868,11 +870,11 @@ int lov_process_config_base(struct obd_device *obd, struct lustre_cfg *lcfg,
 
 		obd_str2uuid(&obd_uuid,  lustre_cfg_buf(lcfg, 1));
 
-		rc = kstrtoint(lustre_cfg_buf(lcfg, 2), 10, indexp);
-		if (rc < 0)
+		rc = kstrtou32(lustre_cfg_buf(lcfg, 2), 10, indexp);
+		if (rc)
 			goto out;
 		rc = kstrtoint(lustre_cfg_buf(lcfg, 3), 10, genp);
-		if (rc < 0)
+		if (rc)
 			goto out;
 		index = *indexp;
 		gen = *genp;
@@ -882,6 +884,7 @@ int lov_process_config_base(struct obd_device *obd, struct lustre_cfg *lcfg,
 			rc = lov_add_target(obd, &obd_uuid, index, gen, 0);
 		else
 			rc = lov_del_target(obd, index, &obd_uuid, gen);
+
 		goto out;
 	}
 	case LCFG_PARAM: {
diff --git a/fs/lustre/mdc/mdc_dev.c b/fs/lustre/mdc/mdc_dev.c
index ca0822d..80e3120 100644
--- a/fs/lustre/mdc/mdc_dev.c
+++ b/fs/lustre/mdc/mdc_dev.c
@@ -684,7 +684,7 @@ int mdc_enqueue_send(const struct lu_env *env, struct obd_export *exp,
 			return ELDLM_OK;
 
 		matched = ldlm_handle2lock(&lockh);
-		if (ldlm_is_kms_ignore(matched))
+		if (!matched || ldlm_is_kms_ignore(matched))
 			goto no_match;
 
 		if (mdc_set_dom_lock_data(env, matched, einfo->ei_cbdata)) {
-- 
1.8.3.1


* [lustre-devel] [PATCH 015/622] lustre: obdclass: allow specifying complex jobids
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (13 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 014/622] lustre: mdc: fix possible NULL pointer dereference James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 016/622] lustre: ldlm: don't disable softirq for exp_rpc_lock James Simmons
                   ` (607 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Allow specifying a format string for the jobid_name variable to create
a jobid for processes on the client.  The jobid_name is used when
jobid_var=nodelocal, if jobid_name contains "%j", or as a fallback if
getting the specified jobid_var from the environment fails.

The jobid_name string allows the following escape sequences:

    %e = executable name
    %g = group ID
    %h = hostname (system utsname)
    %j = jobid from jobid_var environment variable
    %p = process ID
    %u = user ID

Any unknown escape sequences are dropped. Other arbitrary characters
pass through unmodified, up to the maximum jobid string size of 32,
though whitespace within the jobid is not copied.

This allows, for example, specifying an arbitrary prefix, such as the
cluster name, in addition to the traditional "procname.uid" format,
to distinguish between jobs running on clients in different clusters:

    lctl set_param jobid_var=nodelocal jobid_name=cluster2.%e.%u
or
    lctl set_param jobid_var=SLURM_JOB_ID jobid_name=cluster2.%j.%e

To use an environment-specified JobID, if available, but fall back to
a static string for all processes that do not have a valid JobID:

    lctl set_param jobid_var=SLURM_JOB_ID jobid_name=unknown
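As a rough userspace model of the expansion described above (a hypothetical helper, not the actual obdclass/jobid.c code; the 32-byte cap, whitespace skipping, and dropped unknown escapes follow the description in this message):

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define JOBID_SIZE 32  /* matches LUSTRE_JOBID_SIZE, including trailing NUL */

/* Hypothetical sketch of jobid_name expansion: each %x escape is
 * substituted, unknown escapes are dropped, whitespace is not copied,
 * and the result is capped at JOBID_SIZE - 1 characters. */
static void jobid_expand(char *out, const char *fmt, const char *exe,
			 const char *host, const char *job,
			 unsigned int pid, unsigned int uid,
			 unsigned int gid)
{
	char tmp[16];
	size_t len = 0;

	while (*fmt && len < JOBID_SIZE - 1) {
		const char *sub = NULL;

		if (isspace((unsigned char)*fmt)) {	/* whitespace dropped */
			fmt++;
			continue;
		}
		if (*fmt != '%') {			/* literal character */
			out[len++] = *fmt++;
			continue;
		}
		fmt++;					/* skip the '%' */
		if (!*fmt)
			break;
		switch (*fmt++) {
		case 'e': sub = exe;  break;		/* executable name */
		case 'h': sub = host; break;		/* hostname */
		case 'j': sub = job;  break;		/* jobid_var value */
		case 'p': snprintf(tmp, sizeof(tmp), "%u", pid); sub = tmp; break;
		case 'u': snprintf(tmp, sizeof(tmp), "%u", uid); sub = tmp; break;
		case 'g': snprintf(tmp, sizeof(tmp), "%u", gid); sub = tmp; break;
		default:  break;			/* unknown escapes dropped */
		}
		while (sub && *sub && len < JOBID_SIZE - 1)
			out[len++] = *sub++;
	}
	out[len] = '\0';
}
```

For example, expanding "cluster2.%e.%u" for a "dd" process with uid 500 produces "cluster2.dd.500", matching the "procname.uid with a cluster prefix" case shown above.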

Implementation notes:

The LUSTRE_JOBID_SIZE includes a trailing NUL, so don't use
"LUSTRE_JOBID_SIZE + 1" anywhere, as that is misleading.

Rename the "obd_jobid_node" variable to "obd_jobid_name" to match
the sysfs "jobid_name" parameter name to avoid confusion.

Rename "struct jobid_to_pid_map" to "jobid_pid_map" since this is
not actually mapping from a jobid *to* a PID, but the reverse.
Save jobid length, and reorder fields to avoid holes in structure.

Consolidate PID->jobid cache handling in jobid_get_from_cache(),
which only does environment lookups and caches the results.
The fallback to using obd_jobid_name is handled by the caller.

Rename check_job_name() to jobid_name_is_valid(), since that makes
it clear to the reader a "true" return is a valid name.

In jobid_cache_init() there is no benefit for locking the jobid_hash
creation, since the spinlock is just initialized in this function,
so multiple callers of this function would already be broken.

Pass the buffer size from the callers (who know the buffer size) to
lustre_get_jobid() instead of assuming it is LUSTRE_JOBID_SIZE.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10698
Lustre-commit: 6488c0ec57de ("LU-10698 obdclass: allow specifying complex jobids")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/31691
Reviewed-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_class.h          |  4 +-
 fs/lustre/llite/llite_internal.h       |  4 +-
 fs/lustre/llite/llite_lib.c            |  2 +-
 fs/lustre/llite/vvp_io.c               |  2 +-
 fs/lustre/llite/vvp_object.c           |  3 +-
 fs/lustre/obdclass/jobid.c             | 95 +++++++++++++++++++++++++++++++---
 fs/lustre/obdclass/obd_sysfs.c         | 10 ++--
 fs/lustre/ptlrpc/pack_generic.c        |  4 +-
 include/uapi/linux/lustre/lustre_idl.h |  2 +-
 9 files changed, 105 insertions(+), 21 deletions(-)

diff --git a/fs/lustre/include/obd_class.h b/fs/lustre/include/obd_class.h
index 9e07853..146c37e 100644
--- a/fs/lustre/include/obd_class.h
+++ b/fs/lustre/include/obd_class.h
@@ -54,7 +54,7 @@
 /* OBD Operations Declarations */
 struct obd_device *class_exp2obd(struct obd_export *exp);
 int class_handle_ioctl(unsigned int cmd, unsigned long arg);
-int lustre_get_jobid(char *jobid);
+int lustre_get_jobid(char *jobid, size_t len);
 
 struct lu_device_type;
 
@@ -1672,7 +1672,7 @@ static inline void class_uuid_unparse(class_uuid_t uu, struct obd_uuid *out)
 int class_check_uuid(struct obd_uuid *uuid, u64 nid);
 
 /* class_obd.c */
-extern char obd_jobid_node[];
+extern char obd_jobid_name[];
 int class_procfs_init(void);
 int class_procfs_clean(void);
 
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index fbe93a4..d0a703d 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -195,11 +195,11 @@ struct ll_inode_info {
 			int				lli_async_rc;
 
 			/*
-			 * whenever a process try to read/write the file, the
+			 * Whenever a process try to read/write the file, the
 			 * jobid of the process will be saved here, and it'll
 			 * be packed into the write PRC when flush later.
 			 *
-			 * so the read/write statistics for jobid will not be
+			 * So the read/write statistics for jobid will not be
 			 * accurate if the file is shared by different jobs.
 			 */
 			char				lli_jobid[LUSTRE_JOBID_SIZE];
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 12aafe0..7580d57 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -937,7 +937,7 @@ void ll_lli_init(struct ll_inode_info *lli)
 		lli->lli_async_rc = 0;
 	}
 	mutex_init(&lli->lli_layout_mutex);
-	memset(lli->lli_jobid, 0, LUSTRE_JOBID_SIZE);
+	memset(lli->lli_jobid, 0, sizeof(lli->lli_jobid));
 }
 
 int ll_fill_super(struct super_block *sb)
diff --git a/fs/lustre/llite/vvp_io.c b/fs/lustre/llite/vvp_io.c
index 37bf942..85bb3e0 100644
--- a/fs/lustre/llite/vvp_io.c
+++ b/fs/lustre/llite/vvp_io.c
@@ -1419,7 +1419,7 @@ int vvp_io_init(const struct lu_env *env, struct cl_object *obj,
 		 * it's not accurate if the file is shared by different
 		 * jobs.
 		 */
-		lustre_get_jobid(lli->lli_jobid);
+		lustre_get_jobid(lli->lli_jobid, sizeof(lli->lli_jobid));
 	} else if (io->ci_type == CIT_SETATTR) {
 		if (!cl_io_is_trunc(io))
 			io->ci_lockreq = CILR_MANDATORY;
diff --git a/fs/lustre/llite/vvp_object.c b/fs/lustre/llite/vvp_object.c
index c750a80..24cde0d 100644
--- a/fs/lustre/llite/vvp_object.c
+++ b/fs/lustre/llite/vvp_object.c
@@ -212,7 +212,8 @@ static void vvp_req_attr_set(const struct lu_env *env, struct cl_object *obj,
 	obdo_set_parent_fid(oa, &ll_i2info(inode)->lli_fid);
 	if (OBD_FAIL_CHECK(OBD_FAIL_LFSCK_INVALID_PFID))
 		oa->o_parent_oid++;
-	memcpy(attr->cra_jobid, ll_i2info(inode)->lli_jobid, LUSTRE_JOBID_SIZE);
+	memcpy(attr->cra_jobid, ll_i2info(inode)->lli_jobid,
+	       sizeof(attr->cra_jobid));
 }
 
 static const struct cl_object_operations vvp_ops = {
diff --git a/fs/lustre/obdclass/jobid.c b/fs/lustre/obdclass/jobid.c
index 3655a2e..8bad859 100644
--- a/fs/lustre/obdclass/jobid.c
+++ b/fs/lustre/obdclass/jobid.c
@@ -32,17 +32,19 @@
  */
 
 #define DEBUG_SUBSYSTEM S_RPC
+#include <linux/ctype.h>
 #include <linux/user_namespace.h>
 #ifdef HAVE_UIDGID_HEADER
 #include <linux/uidgid.h>
 #endif
+#include <linux/utsname.h>
 
 #include <obd_support.h>
 #include <obd_class.h>
 #include <lustre_net.h>
 
 char obd_jobid_var[JOBSTATS_JOBID_VAR_MAX_LEN + 1] = JOBSTATS_DISABLE;
-char obd_jobid_node[LUSTRE_JOBID_SIZE + 1];
+char obd_jobid_name[LUSTRE_JOBID_SIZE] = "%e.%u";
 
 /* Get jobid of current process from stored variable or calculate
  * it from pid and user_id.
@@ -52,9 +54,89 @@
  * This is now deprecated.
  */
 
-int lustre_get_jobid(char *jobid)
+/*
+ * jobid_interpret_string()
+ *
+ * Interpret the jobfmt string to expand specified fields, like coredumps do:
+ *   %e = executable
+ *   %g = gid
+ *   %h = hostname
+ *   %j = jobid from environment
+ *   %p = pid
+ *   %u = uid
+ *
+ * Unknown escape strings are dropped.  Other characters are copied through,
+ * excluding whitespace (to avoid making jobid parsing difficult).
+ *
+ * Return: -EOVERFLOW if the expanded string does not fit within @joblen
+ *         0 for success
+ */
+static int jobid_interpret_string(const char *jobfmt, char *jobid,
+				  ssize_t joblen)
+{
+	char c;
+
+	while ((c = *jobfmt++) && joblen > 1) {
+		char f;
+		int l;
+
+		if (isspace(c)) /* Don't allow embedded spaces */
+			continue;
+
+		if (c != '%') {
+			*jobid = c;
+			joblen--;
+			jobid++;
+			continue;
+		}
+
+		switch ((f = *jobfmt++)) {
+		case 'e': /* executable name */
+			l = snprintf(jobid, joblen, "%s", current->comm);
+			break;
+		case 'g': /* group ID */
+			l = snprintf(jobid, joblen, "%u",
+				     from_kgid(&init_user_ns, current_fsgid()));
+			break;
+		case 'h': /* hostname */
+			l = snprintf(jobid, joblen, "%s",
+				     init_utsname()->nodename);
+			break;
+		case 'j': /* jobid requested by process
+			   * - currently not supported
+			   */
+			l = snprintf(jobid, joblen, "%s", "jobid");
+			break;
+		case 'p': /* process ID */
+			l = snprintf(jobid, joblen, "%u", current->pid);
+			break;
+		case 'u': /* user ID */
+			l = snprintf(jobid, joblen, "%u",
+				     from_kuid(&init_user_ns, current_fsuid()));
+			break;
+		case '\0': /* '%' at end of format string */
+			l = 0;
+			goto out;
+		default: /* drop unknown %x format strings */
+			l = 0;
+			break;
+		}
+		jobid += l;
+		joblen -= l;
+	}
+	/*
+	 * This points at the end of the buffer, so long as jobid is always
+	 * incremented the same amount as joblen is decremented.
+	 */
+out:
+	jobid[joblen - 1] = '\0';
+
+	return joblen < 0 ? -EOVERFLOW : 0;
+}
+
+int lustre_get_jobid(char *jobid, size_t joblen)
 {
-	char tmp_jobid[LUSTRE_JOBID_SIZE] = { 0 };
+	char tmp_jobid[LUSTRE_JOBID_SIZE] = "";
 
 	/* Jobstats isn't enabled */
 	if (strcmp(obd_jobid_var, JOBSTATS_DISABLE) == 0)
@@ -70,10 +152,11 @@ int lustre_get_jobid(char *jobid)
 
 	/* Whole node dedicated to single job */
 	if (strcmp(obd_jobid_var, JOBSTATS_NODELOCAL) == 0) {
-		strcpy(tmp_jobid, obd_jobid_node);
-		goto out_cache_jobid;
+		int rc2 = jobid_interpret_string(obd_jobid_name,
+						 tmp_jobid, joblen);
+		if (!rc2)
+			goto out_cache_jobid;
 	}
-
 	return -ENOENT;
 
 out_cache_jobid:
diff --git a/fs/lustre/obdclass/obd_sysfs.c b/fs/lustre/obdclass/obd_sysfs.c
index bac8e7c5..cd2917e 100644
--- a/fs/lustre/obdclass/obd_sysfs.c
+++ b/fs/lustre/obdclass/obd_sysfs.c
@@ -233,7 +233,7 @@ static ssize_t jobid_var_store(struct kobject *kobj, struct attribute *attr,
 static ssize_t jobid_name_show(struct kobject *kobj, struct attribute *attr,
 			       char *buf)
 {
-	return snprintf(buf, PAGE_SIZE, "%s\n", obd_jobid_node);
+	return snprintf(buf, PAGE_SIZE, "%s\n", obd_jobid_name);
 }
 
 static ssize_t jobid_name_store(struct kobject *kobj, struct attribute *attr,
@@ -243,13 +243,13 @@ static ssize_t jobid_name_store(struct kobject *kobj, struct attribute *attr,
 	if (!count || count > LUSTRE_JOBID_SIZE)
 		return -EINVAL;
 
-	memcpy(obd_jobid_node, buffer, count);
+	memcpy(obd_jobid_name, buffer, count);
 
-	obd_jobid_node[count] = 0;
+	obd_jobid_name[count] = 0;
 
 	/* Trim the trailing '\n' if any */
-	if (obd_jobid_node[count - 1] == '\n')
-		obd_jobid_node[count - 1] = 0;
+	if (obd_jobid_name[count - 1] == '\n')
+		obd_jobid_name[count - 1] = 0;
 
 	return count;
 }
diff --git a/fs/lustre/ptlrpc/pack_generic.c b/fs/lustre/ptlrpc/pack_generic.c
index b6a4fd8..bc5e513 100644
--- a/fs/lustre/ptlrpc/pack_generic.c
+++ b/fs/lustre/ptlrpc/pack_generic.c
@@ -1406,9 +1406,9 @@ void lustre_msg_set_jobid(struct lustre_msg *msg, char *jobid)
 		LASSERTF(pb, "invalid msg %p: no ptlrpc body!\n", msg);
 
 		if (jobid)
-			memcpy(pb->pb_jobid, jobid, LUSTRE_JOBID_SIZE);
+			memcpy(pb->pb_jobid, jobid, sizeof(pb->pb_jobid));
 		else if (pb->pb_jobid[0] == '\0')
-			lustre_get_jobid(pb->pb_jobid);
+			lustre_get_jobid(pb->pb_jobid, sizeof(pb->pb_jobid));
 		return;
 	}
 	default:
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 401f7ef..4e1605a2 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -635,7 +635,7 @@ struct ptlrpc_body_v3 {
 	__u64 pb_padding64_0;
 	__u64 pb_padding64_1;
 	__u64 pb_padding64_2;
-	char  pb_jobid[LUSTRE_JOBID_SIZE]; /* req: ASCII MPI jobid from env */
+	char  pb_jobid[LUSTRE_JOBID_SIZE]; /* req: ASCII jobid from env + NUL */
 };
 
 #define ptlrpc_body	ptlrpc_body_v3
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 016/622] lustre: ldlm: don't disable softirq for exp_rpc_lock
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (14 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 015/622] lustre: obdclass: allow specifying complex jobids James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 017/622] lustre: obdclass: new wrapper to convert NID to string James Simmons
                   ` (606 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Liang Zhen <liang.zhen@intel.com>

It is not necessary to call ldlm_lock_busy() in the context of the
timer callback; we can call it instead in the thread context of
expired_lock_main. With this change, we no longer need to disable
softirqs for exp_rpc_lock.

Instead of moving busy locks to the end of the waiting list one
at a time in the context of the timer callback, move any locks
that may be expired onto the expired list.  If these locks are
still being used by RPCs being processed, then put them back
onto the end of the waiting list instead of evicting the client.

For the linux client the impact of this change is change of
spin_lock_bh() to spin_lock() for the exp_rpc_lock.

WC-bug-id: https://jira.whamcloud.com/browse/LU-6032
Lustre-commit: 292aa42e0897 ("LU-6032 ldlm: don't disable softirq for exp_rpc_lock")
Signed-off-by: Liang Zhen <liang.zhen@intel.com>
Reviewed-on: https://review.whamcloud.com/12957
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/service.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/lustre/ptlrpc/service.c b/fs/lustre/ptlrpc/service.c
index d57df36..3c61e83 100644
--- a/fs/lustre/ptlrpc/service.c
+++ b/fs/lustre/ptlrpc/service.c
@@ -1307,9 +1307,9 @@ static int ptlrpc_server_hpreq_init(struct ptlrpc_service_part *svcpt,
 			LASSERT(rc <= 1);
 		}
 
-		spin_lock_bh(&req->rq_export->exp_rpc_lock);
+		spin_lock(&req->rq_export->exp_rpc_lock);
 		list_add(&req->rq_exp_list, &req->rq_export->exp_hp_rpcs);
-		spin_unlock_bh(&req->rq_export->exp_rpc_lock);
+		spin_unlock(&req->rq_export->exp_rpc_lock);
 	}
 
 	ptlrpc_nrs_req_initialize(svcpt, req, rc);
@@ -1327,9 +1327,9 @@ static void ptlrpc_server_hpreq_fini(struct ptlrpc_request *req)
 		if (req->rq_ops->hpreq_fini)
 			req->rq_ops->hpreq_fini(req);
 
-		spin_lock_bh(&req->rq_export->exp_rpc_lock);
+		spin_lock(&req->rq_export->exp_rpc_lock);
 		list_del_init(&req->rq_exp_list);
-		spin_unlock_bh(&req->rq_export->exp_rpc_lock);
+		spin_unlock(&req->rq_export->exp_rpc_lock);
 	}
 }
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 017/622] lustre: obdclass: new wrapper to convert NID to string
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (15 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 016/622] lustre: ldlm: don't disable softirq for exp_rpc_lock James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 018/622] lustre: ptlrpc: Add QoS for uid and gid in NRS-TBF James Simmons
                   ` (605 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Liang Zhen <liang.zhen@intel.com>

This patch includes a couple of changes:
- add new wrapper function obd_import_nid2str
- use obd_import_nid2str and obd_export_nid2str to replace all
  libcfs_nid2str conversions for NID of export/import connection

WC-bug-id: https://jira.whamcloud.com/browse/LU-6032
Lustre-commit: 61f9847a812f ("LU-6032 obdclass: new wrapper to convert NID to string")
Signed-off-by: Liang Zhen <liang.zhen@intel.com>
Reviewed-on: https://review.whamcloud.com/12956
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_class.h | 12 ++++++++++++
 fs/lustre/ldlm/ldlm_lock.c    |  4 ++--
 fs/lustre/ptlrpc/client.c     |  5 ++---
 fs/lustre/ptlrpc/import.c     |  6 +++---
 4 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/fs/lustre/include/obd_class.h b/fs/lustre/include/obd_class.h
index 146c37e..d896049 100644
--- a/fs/lustre/include/obd_class.h
+++ b/fs/lustre/include/obd_class.h
@@ -86,6 +86,18 @@ struct obd_device *class_devices_in_group(struct obd_uuid *grp_uuid,
 int obd_connect_flags2str(char *page, int count, u64 flags, u64 flags2,
 			  const char *sep);
 
+static inline char *obd_export_nid2str(struct obd_export *exp)
+{
+	return exp->exp_connection ?
+		libcfs_nid2str(exp->exp_connection->c_peer.nid) : "<unknown>";
+}
+
+static inline char *obd_import_nid2str(struct obd_import *imp)
+{
+	return imp->imp_connection ?
+		libcfs_nid2str(imp->imp_connection->c_peer.nid) : "<unknown>";
+}
+
 int obd_zombie_impexp_init(void);
 void obd_zombie_impexp_stop(void);
 void obd_zombie_barrier(void);
diff --git a/fs/lustre/ldlm/ldlm_lock.c b/fs/lustre/ldlm/ldlm_lock.c
index 7242cd1..aa19b89 100644
--- a/fs/lustre/ldlm/ldlm_lock.c
+++ b/fs/lustre/ldlm/ldlm_lock.c
@@ -1987,11 +1987,11 @@ void _ldlm_lock_debug(struct ldlm_lock *lock,
 	vaf.va = &args;
 
 	if (exp && exp->exp_connection) {
-		nid = libcfs_nid2str(exp->exp_connection->c_peer.nid);
+		nid = obd_export_nid2str(exp);
 	} else if (exp && exp->exp_obd) {
 		struct obd_import *imp = exp->exp_obd->u.cli.cl_import;
 
-		nid = libcfs_nid2str(imp->imp_connection->c_peer.nid);
+		nid = obd_import_nid2str(imp);
 	}
 
 	if (!resource) {
diff --git a/fs/lustre/ptlrpc/client.c b/fs/lustre/ptlrpc/client.c
index a533cbb..424db55 100644
--- a/fs/lustre/ptlrpc/client.c
+++ b/fs/lustre/ptlrpc/client.c
@@ -1605,8 +1605,7 @@ static int ptlrpc_send_new_req(struct ptlrpc_request *req)
 	       current->comm,
 	       imp->imp_obd->obd_uuid.uuid,
 	       lustre_msg_get_status(req->rq_reqmsg), req->rq_xid,
-	       libcfs_nid2str(imp->imp_connection->c_peer.nid),
-	       lustre_msg_get_opc(req->rq_reqmsg));
+	       obd_import_nid2str(imp), lustre_msg_get_opc(req->rq_reqmsg));
 
 	rc = ptl_send_rpc(req, 0);
 	if (rc == -ENOMEM) {
@@ -2017,7 +2016,7 @@ int ptlrpc_check_set(const struct lu_env *env, struct ptlrpc_request_set *set)
 			       current->comm, imp->imp_obd->obd_uuid.uuid,
 			       lustre_msg_get_status(req->rq_reqmsg),
 			       req->rq_xid,
-			       libcfs_nid2str(imp->imp_connection->c_peer.nid),
+			       obd_import_nid2str(imp),
 			       lustre_msg_get_opc(req->rq_reqmsg));
 
 		spin_lock(&imp->imp_lock);
diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index d032962..dca4aa0 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -171,13 +171,13 @@ int ptlrpc_set_import_discon(struct obd_import *imp, u32 conn_cnt)
 			LCONSOLE_WARN("%s: Connection to %.*s (at %s) was lost; in progress operations using this service will wait for recovery to complete\n",
 				      imp->imp_obd->obd_name,
 				      target_len, target_start,
-				      libcfs_nid2str(imp->imp_connection->c_peer.nid));
+				      obd_import_nid2str(imp));
 		} else {
 			LCONSOLE_ERROR_MSG(0x166,
 					   "%s: Connection to %.*s (at %s) was lost; in progress operations using this service will fail\n",
 					   imp->imp_obd->obd_name,
 					   target_len, target_start,
-					   libcfs_nid2str(imp->imp_connection->c_peer.nid));
+					   obd_import_nid2str(imp));
 		}
 		IMPORT_SET_STATE_NOLOCK(imp, LUSTRE_IMP_DISCON);
 		spin_unlock(&imp->imp_lock);
@@ -1461,7 +1461,7 @@ int ptlrpc_import_recovery_state_machine(struct obd_import *imp)
 		LCONSOLE_INFO("%s: Connection restored to %.*s (at %s)\n",
 			      imp->imp_obd->obd_name,
 			      target_len, target_start,
-			      libcfs_nid2str(imp->imp_connection->c_peer.nid));
+			      obd_import_nid2str(imp));
 	}
 
 	if (imp->imp_state == LUSTRE_IMP_FULL) {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 018/622] lustre: ptlrpc: Add QoS for uid and gid in NRS-TBF
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (16 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 017/622] lustre: obdclass: new wrapper to convert NID to string James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 019/622] lustre: hsm: ignore compound_id James Simmons
                   ` (604 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Teddy Chan <teddy@ddn.com>

This patch adds a new QoS feature to the TBF policy that can
limit the request rate based on uid or gid. The policy is able to
limit the rate on both the MDT and OSS side.

The command for this feature is like:
Start the tbf uid QoS on OST:
    lctl set_param ost.OSS.*.nrs_policies="tbf uid"
Limit the rate of ptlrpc requests of the uid 500
    lctl set_param ost.OSS.*.nrs_tbf_rule=
 "start tbf_name uid={500} rate=100"

Start the tbf gid QoS on OST:
    lctl set_param ost.OSS.*.nrs_policies="tbf gid"
Limit the rate of ptlrpc requests of the gid 500
    lctl set_param ost.OSS.*.nrs_tbf_rule=
 "start tbf_name gid={500} rate=100"

or use generic tbf rule to mix them on OST:
    lctl set_param ost.OSS.*.nrs_policies="tbf"
Limit the rate of ptlrpc requests of the uid 500 gid 500
    lctl set_param ost.OSS.*.nrs_tbf_rule=
 "start tbf_name uid={500}&gid={500} rate=100"

Also, you can use the following rule to control all reqs
to mds:
Start the tbf uid QoS on MDS:
    lctl set_param mds.MDS.*.nrs_policies="tbf uid"
Limit the rate of ptlrpc requests of the uid 500
    lctl set_param mds.MDS.*.nrs_tbf_rule=
 "start tbf_name uid={500} rate=100"

For the linux client we need to send the uid and gid
information to the NRS-TBF handling on the servers.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9658
Lustre-commit: e0cdde123c14 ("LU-9658 ptlrpc: Add QoS for uid and gid in NRS-TBF")
Signed-off-by: Teddy Chan <teddy@ddn.com>
Signed-off-by: Li Xi <lixi@ddn.com>
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Qian Yingjin <qian@ddn.com>
Reviewed-on: https://review.whamcloud.com/27608
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/vvp_object.c |  5 ++---
 fs/lustre/obdclass/obdo.c    |  5 +++++
 fs/lustre/osc/osc_request.c  | 10 ++++++++++
 3 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/llite/vvp_object.c b/fs/lustre/llite/vvp_object.c
index 24cde0d..eeb8823 100644
--- a/fs/lustre/llite/vvp_object.c
+++ b/fs/lustre/llite/vvp_object.c
@@ -196,7 +196,7 @@ static int vvp_object_glimpse(const struct lu_env *env,
 static void vvp_req_attr_set(const struct lu_env *env, struct cl_object *obj,
 			     struct cl_req_attr *attr)
 {
-	u64 valid_flags = OBD_MD_FLTYPE;
+	u64 valid_flags = OBD_MD_FLTYPE | OBD_MD_FLUID | OBD_MD_FLGID;
 	struct inode *inode;
 	struct obdo *oa;
 
@@ -204,8 +204,7 @@ static void vvp_req_attr_set(const struct lu_env *env, struct cl_object *obj,
 	inode = vvp_object_inode(obj);
 
 	if (attr->cra_type == CRT_WRITE) {
-		valid_flags |= OBD_MD_FLMTIME | OBD_MD_FLCTIME |
-			       OBD_MD_FLUID | OBD_MD_FLGID;
+		valid_flags |= OBD_MD_FLMTIME | OBD_MD_FLCTIME;
 		obdo_set_o_projid(oa, ll_i2info(inode)->lli_projid);
 	}
 	obdo_from_inode(oa, inode, valid_flags & attr->cra_flags);
diff --git a/fs/lustre/obdclass/obdo.c b/fs/lustre/obdclass/obdo.c
index 1926896..e5475f1 100644
--- a/fs/lustre/obdclass/obdo.c
+++ b/fs/lustre/obdclass/obdo.c
@@ -144,6 +144,11 @@ void lustre_set_wire_obdo(const struct obd_connect_data *ocd,
 	if (!ocd)
 		return;
 
+	if (!(wobdo->o_valid & OBD_MD_FLUID))
+		wobdo->o_uid = from_kuid(&init_user_ns, current_uid());
+	if (!(wobdo->o_valid & OBD_MD_FLGID))
+		wobdo->o_gid = from_kgid(&init_user_ns, current_gid());
+
 	if (unlikely(!(ocd->ocd_connect_flags & OBD_CONNECT_FID)) &&
 	    fid_seq_is_echo(ostid_seq(&lobdo->o_oi))) {
 		/*
diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index 300dee5..99c9620 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -1184,6 +1184,16 @@ static int osc_brw_prep_request(int cmd, struct client_obd *cli,
 
 	lustre_set_wire_obdo(&req->rq_import->imp_connect_data, &body->oa, oa);
 
+	/* For READ and WRITE, we can't fill o_uid and o_gid using from_kuid()
+	 * and from_kgid(), because they are asynchronous. Fortunately, variable
+	 * oa contains valid o_uid and o_gid in these two operations.
+	 * Besides, filling o_uid and o_gid is enough for nrs-tbf, see LU-9658.
+	 * OBD_MD_FLUID and OBD_MD_FLGID are not set in order to avoid breaking
+	 * other process logic
+	 */
+	body->oa.o_uid = oa->o_uid;
+	body->oa.o_gid = oa->o_gid;
+
 	obdo_to_ioobj(oa, ioobj);
 	ioobj->ioo_bufcnt = niocount;
 	/* The high bits of ioo_max_brw tells server _maximum_ number of bulks
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 019/622] lustre: hsm: ignore compound_id
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (17 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 018/622] lustre: ptlrpc: Add QoS for uid and gid in NRS-TBF James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 020/622] lnet: libcfs: remove unnecessary set_fs(KERNEL_DS) James Simmons
                   ` (603 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: "John L. Hammond" <jhammond@whamcloud.com>

Ignore request compound ids in the HSM coordinator. Compound ids
prevent batching of CDT to CT requests and degrade HSM
performance. Use CT/archive id compatibility when deciding which HSM
actions to put in a request.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10383
Lustre-commit: 9ee81f920bb3 ("LU-10383 hsm: ignore compound_id")
Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/30949
Reviewed-by: Quentin Bouget <quentin.bouget@cea.fr>
Reviewed-by: Faccini Bruno <bruno.faccini@intel.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/uapi/linux/lustre/lustre_idl.h  | 2 +-
 include/uapi/linux/lustre/lustre_user.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 4e1605a2..307feb3 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -2508,7 +2508,7 @@ struct llog_agent_req_rec {
 						 */
 	__u32			arr_archive_id;	/**< backend archive number */
 	__u64			arr_flags;	/**< req flags */
-	__u64			arr_compound_id;/**< compound cookie */
+	__u64			arr_compound_id;/**< compound cookie, ignored */
 	__u64			arr_req_create;	/**< req. creation time */
 	__u64			arr_req_change;	/**< req. status change time */
 	struct hsm_action_item	arr_hai;	/**< req. to the agent */
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 27501a2..5405e1b 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -1729,7 +1729,7 @@ static inline char *hai_dump_data_field(struct hsm_action_item *hai,
 struct hsm_action_list {
 	__u32 hal_version;
 	__u32 hal_count;	/* number of hai's to follow */
-	__u64 hal_compound_id;	/* returned by coordinator */
+	__u64 hal_compound_id;	/* returned by coordinator, ignored */
 	__u64 hal_flags;
 	__u32 hal_archive_id;	/* which archive backend */
 	__u32 padding1;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 020/622] lnet: libcfs: remove unnecessary set_fs(KERNEL_DS)
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (18 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 019/622] lustre: hsm: ignore compound_id James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 021/622] lustre: ptlrpc: ptlrpc_register_bulk() LBUG on ENOMEM James Simmons
                   ` (602 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Mike Marciniszyn <mike.marciniszyn@intel.com>

When we converted to using kernel_write(), we left some
set_fs() calls that are no longer necessary.
Remove them.

Original OpenSFS version of this patch, as mentioned below,
did the full conversion to kernel_write.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10560
Lustre-commit: b9a32054600a ("LU-10560 libcfs: Use kernel_write when appropriate")
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-on: https://review.whamcloud.com/31154
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/libcfs/tracefile.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/lnet/libcfs/tracefile.c b/net/lnet/libcfs/tracefile.c
index 3b29116..6e4cc31 100644
--- a/net/lnet/libcfs/tracefile.c
+++ b/net/lnet/libcfs/tracefile.c
@@ -807,7 +807,6 @@ int cfs_tracefile_dump_all_pages(char *filename)
 	struct cfs_trace_page *tage;
 	struct cfs_trace_page *tmp;
 	char *buf;
-	mm_segment_t __oldfs;
 	int rc;
 
 	down_write(&cfs_tracefile_sem);
@@ -828,8 +827,6 @@ int cfs_tracefile_dump_all_pages(char *filename)
 		rc = 0;
 		goto close;
 	}
-	__oldfs = get_fs();
-	set_fs(KERNEL_DS);
 
 	/* ok, for now, just write the pages.  in the future we'll be building
 	 * iobufs with the pages and calling generic_direct_IO
@@ -851,7 +848,7 @@ int cfs_tracefile_dump_all_pages(char *filename)
 		list_del(&tage->linkage);
 		cfs_tage_free(tage);
 	}
-	set_fs(__oldfs);
+
 	rc = vfs_fsync(filp, 1);
 	if (rc)
 		pr_err("sync returns %d\n", rc);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 021/622] lustre: ptlrpc: ptlrpc_register_bulk() LBUG on ENOMEM
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (19 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 020/622] lnet: libcfs: remove unnecessary set_fs(KERNEL_DS) James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 022/622] lustre: llite: yield cpu after call to ll_agl_trigger James Simmons
                   ` (601 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Andriy Skulysh <c17819@cray.com>

An assertion on !desc->bd_registered fails during a
retry after ENOMEM.

Drop the bd_registered flag and exit via cleanup_bulk
to ensure that the bulk is fully unregistered.

Cray-bug-id: MRP-4733
WC-bug-id: https://jira.whamcloud.com/browse/LU-10643
Lustre-commit: 4a81be263079 ("LU-10643 ptlrpc: ptlrpc_register_bulk() LBUG on ENOMEM")
Signed-off-by: Andriy Skulysh <c17819@cray.com>
Reviewed-on: https://review.whamcloud.com/31228
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Andrew Perepechko <c17827@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_support.h |  1 +
 fs/lustre/ptlrpc/niobuf.c       | 12 +++++++++---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index 653a456..67500b5 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -349,6 +349,7 @@
 #define OBD_FAIL_PTLRPC_DROP_BULK			0x51a
 #define OBD_FAIL_PTLRPC_LONG_REQ_UNLINK			0x51b
 #define OBD_FAIL_PTLRPC_LONG_BOTH_UNLINK		0x51c
+#define OBD_FAIL_PTLRPC_BULK_ATTACH      0x521
 
 #define OBD_FAIL_OBD_PING_NET				0x600
 #define OBD_FAIL_OBD_LOG_CANCEL_NET			0x601
diff --git a/fs/lustre/ptlrpc/niobuf.c b/fs/lustre/ptlrpc/niobuf.c
index 02ed373..2e866fe 100644
--- a/fs/lustre/ptlrpc/niobuf.c
+++ b/fs/lustre/ptlrpc/niobuf.c
@@ -179,8 +179,13 @@ static int ptlrpc_register_bulk(struct ptlrpc_request *req)
 			      LNET_MD_OP_GET : LNET_MD_OP_PUT);
 		ptlrpc_fill_bulk_md(&md, desc, posted_md);
 
-		rc = LNetMEAttach(desc->bd_portal, peer, mbits, 0,
-				  LNET_UNLINK, LNET_INS_AFTER, &me_h);
+		if (posted_md > 0 && posted_md + 1 == total_md &&
+		    OBD_FAIL_CHECK(OBD_FAIL_PTLRPC_BULK_ATTACH)) {
+			rc = -ENOMEM;
+		} else {
+			rc = LNetMEAttach(desc->bd_portal, peer, mbits, 0,
+					  LNET_UNLINK, LNET_INS_AFTER, &me_h);
+		}
 		if (rc != 0) {
 			CERROR("%s: LNetMEAttach failed x%llu/%d: rc = %d\n",
 			       desc->bd_import->imp_obd->obd_name, mbits,
@@ -209,6 +214,7 @@ static int ptlrpc_register_bulk(struct ptlrpc_request *req)
 		LASSERT(desc->bd_md_count >= 0);
 		mdunlink_iterate_helper(desc->bd_mds, desc->bd_md_max_brw);
 		req->rq_status = -ENOMEM;
+		desc->bd_registered = 0;
 		return -ENOMEM;
 	}
 
@@ -585,7 +591,7 @@ int ptl_send_rpc(struct ptlrpc_request *request, int noreply)
 	if (request->rq_bulk) {
 		rc = ptlrpc_register_bulk(request);
 		if (rc != 0)
-			goto out;
+			goto cleanup_bulk;
 		/*
 		 * All the mds in the request will have the same cpt
 		 * encoded in the cookie. So we can just get the first
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 022/622] lustre: llite: yield cpu after call to ll_agl_trigger
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (20 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 021/622] lustre: ptlrpc: ptlrpc_register_bulk() LBUG on ENOMEM James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 023/622] lustre: osc: Do not request more than 2GiB grant James Simmons
                   ` (600 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Ann Koehler <amk@cray.com>

The statahead and agl threads loop over all entries in the
directory without yielding the CPU. If the number of entries in
the directory is large enough then these threads may trigger
soft lockups. The fix is to add calls to cond_resched() after
calling ll_agl_trigger(), which gets the glimpse lock for a
file.

Cray-bug-id: LUS-2584
WC-bug-id: https://jira.whamcloud.com/browse/LU-10649
Lustre-commit: 031001f0d438 ("LU-10649 llite: yield cpu after call to ll_agl_trigger")
Signed-off-by: Ann Koehler <amk@cray.com>
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/31240
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Sergey Cheremencev <c17829@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/statahead.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/llite/statahead.c b/fs/lustre/llite/statahead.c
index 99b3fee..4a61dac 100644
--- a/fs/lustre/llite/statahead.c
+++ b/fs/lustre/llite/statahead.c
@@ -907,6 +907,7 @@ static int ll_agl_thread(void *arg)
 			list_del_init(&clli->lli_agl_list);
 			spin_unlock(&plli->lli_agl_lock);
 			ll_agl_trigger(&clli->lli_vfs_inode, sai);
+			cond_resched();
 		} else {
 			spin_unlock(&plli->lli_agl_lock);
 		}
@@ -1071,7 +1072,7 @@ static int ll_statahead_thread(void *arg)
 
 					ll_agl_trigger(&clli->lli_vfs_inode,
 						       sai);
-
+					cond_resched();
 					spin_lock(&lli->lli_agl_lock);
 				}
 				spin_unlock(&lli->lli_agl_lock);
-- 
1.8.3.1

* [lustre-devel] [PATCH 023/622] lustre: osc: Do not request more than 2GiB grant
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (21 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 022/622] lustre: llite: yield cpu after call to ll_agl_trigger James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 024/622] lustre: llite: rename FSFILT_IOC_* to system flags James Simmons
                   ` (599 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

The server enforces a grant limit of 2 GiB, which the
client must honor.  The existing client code, combined with
16 MiB RPCs, makes it possible for the client to ask for
more than this limit.

Make this limit explicit, and also fix an overflow bug in
o_undirty calculation in osc_announce_cached.  (o_undirty
is a 32 bit value and 16 MiB*256 rpcs_in_flight = 4 GiB.
4 GiB + extra grant components overflows o_undirty.)

Cray-bug-id: LUS-5750
WC-bug-id: https://jira.whamcloud.com/browse/LU-10776
Lustre-commit: c0246d887809 ("LU-10776 osc: Do not request more than 2GiB grant")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/31533
Reviewed-by: Nathaniel Clark <nclark@whamcloud.com>
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-by: Andrew Perepechko <c17827@cray.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/osc_request.c            | 10 ++++++++--
 include/uapi/linux/lustre/lustre_idl.h |  2 ++
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index 99c9620..c430239 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -664,11 +664,12 @@ static void osc_announce_cached(struct client_obd *cli, struct obdo *oa,
 		oa->o_undirty = 0;
 	} else {
 		unsigned long nrpages;
+		unsigned long undirty;
 
 		nrpages = cli->cl_max_pages_per_rpc;
 		nrpages *= cli->cl_max_rpcs_in_flight + 1;
 		nrpages = max(nrpages, cli->cl_dirty_max_pages);
-		oa->o_undirty = nrpages << PAGE_SHIFT;
+		undirty = nrpages << PAGE_SHIFT;
 		if (OCD_HAS_FLAG(&cli->cl_import->imp_connect_data,
 				 GRANT_PARAM)) {
 			int nrextents;
@@ -679,8 +680,13 @@ static void osc_announce_cached(struct client_obd *cli, struct obdo *oa,
 			 */
 			nrextents = DIV_ROUND_UP(nrpages,
 						 cli->cl_max_extent_pages);
-			oa->o_undirty += nrextents * cli->cl_grant_extent_tax;
+			undirty += nrextents * cli->cl_grant_extent_tax;
 		}
+		/* Do not ask for more than OBD_MAX_GRANT - a margin for server
+		 * to add extent tax, etc.
+		 */
+		oa->o_undirty = min(undirty, OBD_MAX_GRANT -
+				    (PTLRPC_MAX_BRW_PAGES << PAGE_SHIFT)*4UL);
 	}
 	oa->o_grant = cli->cl_avail_grant + cli->cl_reserved_grant;
 	oa->o_dropped = cli->cl_lost_grant;
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 307feb3..0bce63d 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -1213,6 +1213,8 @@ struct hsm_state_set {
 				      * it to sync quickly
 				      */
 
+#define OBD_MAX_GRANT 0x7fffffffUL /* Max grant allowed to one client: 2 GiB */
+
 #define OBD_OBJECT_EOF	LUSTRE_EOF
 
 #define OST_MIN_PRECREATE 32
-- 
1.8.3.1

* [lustre-devel] [PATCH 024/622] lustre: llite: rename FSFILT_IOC_* to system flags
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (22 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 023/622] lustre: osc: Do not request more than 2GiB grant James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 025/622] lnet: fix nid range format '*@<net>' support James Simmons
                   ` (598 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Jinshan Xiong <jinshan.xiong@gmail.com>

Those definitions were probably created for compatibility. Now that
FS_IOC_* have existed in the kernel for a long time, we should use
them to avoid confusion.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10779
Lustre-commit: 7e3fc106d6e7 ("LU-10779 llite: rename FSFILT_IOC_* to system flags")
Signed-off-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Reviewed-on: https://review.whamcloud.com/31546
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/dir.c       | 13 +++++++------
 fs/lustre/llite/file.c      | 19 ++++++++++---------
 fs/lustre/llite/llite_lib.c |  4 ++--
 3 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/fs/lustre/llite/dir.c b/fs/lustre/llite/dir.c
index f21727b..b006e32 100644
--- a/fs/lustre/llite/dir.c
+++ b/fs/lustre/llite/dir.c
@@ -1108,18 +1108,19 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 
 	ll_stats_ops_tally(ll_i2sbi(inode), LPROC_LL_IOCTL, 1);
 	switch (cmd) {
-	case FSFILT_IOC_GETFLAGS:
-	case FSFILT_IOC_SETFLAGS:
+	case FS_IOC_GETFLAGS:
+	case FS_IOC_SETFLAGS:
 		return ll_iocontrol(inode, file, cmd, arg);
-	case FSFILT_IOC_GETVERSION_OLD:
 	case FSFILT_IOC_GETVERSION:
+	case FS_IOC_GETVERSION:
 		return put_user(inode->i_generation, (int __user *)arg);
 	/* We need to special case any other ioctls we want to handle,
 	 * to send them to the MDS/OST as appropriate and to properly
 	 * network encode the arg field.
-	case FSFILT_IOC_SETVERSION_OLD:
-	case FSFILT_IOC_SETVERSION:
-	*/
+	 */
+	case FS_IOC_SETVERSION:
+		return -ENOTSUPP;
+
 	case LL_IOC_GET_MDTIDX: {
 		int mdtidx;
 
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index fe965b1..c3fb104b 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -3055,12 +3055,19 @@ static long ll_file_set_lease(struct file *file, struct ll_ioc_lease *ioc,
 	case LL_IOC_LOV_GETSTRIPE:
 	case LL_IOC_LOV_GETSTRIPE_NEW:
 		return ll_file_getstripe(inode, (void __user *)arg, 0);
-	case FSFILT_IOC_GETFLAGS:
-	case FSFILT_IOC_SETFLAGS:
+	case FS_IOC_GETFLAGS:
+	case FS_IOC_SETFLAGS:
 		return ll_iocontrol(inode, file, cmd, arg);
-	case FSFILT_IOC_GETVERSION_OLD:
 	case FSFILT_IOC_GETVERSION:
+	case FS_IOC_GETVERSION:
 		return put_user(inode->i_generation, (int __user *)arg);
+	/* We need to special case any other ioctls we want to handle,
+	 * to send them to the MDS/OST as appropriate and to properly
+	 * network encode the arg field.
+	 */
+	case FS_IOC_SETVERSION:
+		return -ENOTSUPP;
+
 	case LL_IOC_GROUP_LOCK:
 		return ll_get_grouplock(inode, file, arg);
 	case LL_IOC_GROUP_UNLOCK:
@@ -3068,12 +3075,6 @@ static long ll_file_set_lease(struct file *file, struct ll_ioc_lease *ioc,
 	case IOC_OBD_STATFS:
 		return ll_obd_statfs(inode, (void __user *)arg);
 
-	/* We need to special case any other ioctls we want to handle,
-	 * to send them to the MDS/OST as appropriate and to properly
-	 * network encode the arg field.
-	case FSFILT_IOC_SETVERSION_OLD:
-	case FSFILT_IOC_SETVERSION:
-	*/
 	case LL_IOC_FLUSHCTX:
 		return ll_flush_ctx(inode);
 	case LL_IOC_PATH2FID: {
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 7580d57..e2c7a4d 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -2037,7 +2037,7 @@ int ll_iocontrol(struct inode *inode, struct file *file,
 	int rc, flags = 0;
 
 	switch (cmd) {
-	case FSFILT_IOC_GETFLAGS: {
+	case FS_IOC_GETFLAGS: {
 		struct mdt_body *body;
 		struct md_op_data *op_data;
 
@@ -2065,7 +2065,7 @@ int ll_iocontrol(struct inode *inode, struct file *file,
 
 		return put_user(flags, (int __user *)arg);
 	}
-	case FSFILT_IOC_SETFLAGS: {
+	case FS_IOC_SETFLAGS: {
 		struct md_op_data *op_data;
 		struct cl_object *obj;
 		struct iattr *attr;
-- 
1.8.3.1

* [lustre-devel] [PATCH 025/622] lnet: fix nid range format '*@<net>' support
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (23 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 024/622] lustre: llite: rename FSFILT_IOC_* to system flags James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 026/622] lustre: ptlrpc: fix test_req_buffer_pressure behavior James Simmons
                   ` (597 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Emoly Liu <emoly@whamcloud.com>

In cfs_ip_min_max(), (nidrange->nr_all == 1) means the nid range
is a full IP address range (*.*.*.*). In this case, we don't need
to compare it to any other nid range; we can set min_nid to 0.0.0.0
and max_nid to 255.255.255.255 directly.

WC-bug-id: https://jira.whamcloud.com/browse/LU-8913
Lustre-commit: 230266326f49 ("LU-8913 nodemap: fix nodemap range format '*@<net>' support")
Signed-off-by: Emoly Liu <emoly@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/31684
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Fan Yong <fan.yong@intel.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/nidstrings.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/lnet/lnet/nidstrings.c b/net/lnet/lnet/nidstrings.c
index b4e38e5..13338d0 100644
--- a/net/lnet/lnet/nidstrings.c
+++ b/net/lnet/lnet/nidstrings.c
@@ -680,6 +680,12 @@ static int cfs_ip_min_max(struct list_head *nidlist, u32 *min_nid,
 		if (nidlist_count > 0)
 			return -EINVAL;
 
+		if (nr->nr_all) {
+			min_ip_addr = 0;
+			max_ip_addr = 0xffffffff;
+			break;
+		}
+
 		list_for_each_entry(ar, &nr->nr_addrranges, ar_link) {
 			rc = cfs_ip_ar_min_max(ar, &tmp_min_ip_addr,
 					       &tmp_max_ip_addr);
-- 
1.8.3.1

* [lustre-devel] [PATCH 026/622] lustre: ptlrpc: fix test_req_buffer_pressure behavior
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (24 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 025/622] lnet: fix nid range format '*@<net>' support James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 027/622] lustre: lu_object: improve debug message for lu_object_put() James Simmons
                   ` (596 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Bruno Faccini <bruno.faccini@intel.com>

In the second patch for LU-9372, which allows limiting the number
of rqbd buffers, a wrong and unnecessary test was added to the
test_req_buffer_pressure feature.
This patch fixes the issue by removing that test.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10826
Lustre-commit: 040eca67f8d5 ("LU-10826 ptlrpc: fix test_req_buffer_pressure behavior")
Signed-off-by: Bruno Faccini <bruno.faccini@intel.com>
Reviewed-on: https://review.whamcloud.com/31690
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Li Dongyang <dongyangli@ddn.com>
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/service.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/lustre/ptlrpc/service.c b/fs/lustre/ptlrpc/service.c
index 3c61e83..8dae21a 100644
--- a/fs/lustre/ptlrpc/service.c
+++ b/fs/lustre/ptlrpc/service.c
@@ -150,8 +150,7 @@
 		/* NB: another thread might have recycled enough rqbds, we
 		 * need to make sure it wouldn't over-allocate, see LU-1212.
 		 */
-		if (test_req_buffer_pressure ||
-		    svcpt->scp_nrqbds_posted >= svc->srv_nbuf_per_group ||
+		if (svcpt->scp_nrqbds_posted >= svc->srv_nbuf_per_group ||
 		    (svc->srv_nrqbds_max != 0 &&
 		     svcpt->scp_nrqbds_total > svc->srv_nrqbds_max))
 			break;
-- 
1.8.3.1

* [lustre-devel] [PATCH 027/622] lustre: lu_object: improve debug message for lu_object_put()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (25 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 026/622] lustre: ptlrpc: fix test_req_buffer_pressure behavior James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 028/622] lustre: idl: remove obsolete directory split flags James Simmons
                   ` (595 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Alexey Lyashkov <c17817@cray.com>

Use the top-level object in the debug message in lu_object_put()
to match lu_object_get().

WC-bug-id: https://jira.whamcloud.com/browse/LU-10877
Lustre-commit: fd669eba1921 ("LU-10877 lu: fix reference leak")
Signed-off-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-on: https://review.whamcloud.com/31870
Reviewed-by: Andrew Perepechko <c17827@cray.com>
Reviewed-by: Sergey Cheremencev <c17829@cray.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Mikhal Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/lu_object.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/obdclass/lu_object.c b/fs/lustre/obdclass/lu_object.c
index d8dfc721..2ab4977 100644
--- a/fs/lustre/obdclass/lu_object.c
+++ b/fs/lustre/obdclass/lu_object.c
@@ -184,8 +184,8 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 		LASSERT(list_empty(&top->loh_lru));
 		list_add_tail(&top->loh_lru, &bkt->lsb_lru);
 		percpu_counter_inc(&site->ls_lru_len_counter);
-		CDEBUG(D_INODE, "Add %p to site lru. hash: %p, bkt: %p\n",
-		       o, site->ls_obj_hash, bkt);
+		CDEBUG(D_INODE, "Add %p/%p to site lru. hash: %p, bkt: %p\n",
+		       orig, top, site->ls_obj_hash, bkt);
 		cfs_hash_bd_unlock(site->ls_obj_hash, &bd, 1);
 		return;
 	}
-- 
1.8.3.1

* [lustre-devel] [PATCH 028/622] lustre: idl: remove obsolete directory split flags
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (26 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 027/622] lustre: lu_object: improve debug message for lu_object_put() James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 029/622] lustre: mdc: resend quotactl if needed James Simmons
                   ` (594 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

The directory split functionality from the old CMD (pre-DNE)
feature was never usable in production, and was removed before
the DNE 2.4 release.  Remove old flags relating to this feature.

WC-bug-id: https://jira.whamcloud.com/browse/LU-1187
Lustre-commit: 5c53c353fd82 ("LU-1187 idl: remove obsolete directory split flags")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/31700
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/mdc/mdc_lib.c                | 2 --
 fs/lustre/ptlrpc/wiretest.c            | 4 ----
 include/uapi/linux/lustre/lustre_idl.h | 4 ++--
 3 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/lustre/mdc/mdc_lib.c b/fs/lustre/mdc/mdc_lib.c
index d4b2bb9..467503c 100644
--- a/fs/lustre/mdc/mdc_lib.c
+++ b/fs/lustre/mdc/mdc_lib.c
@@ -520,8 +520,6 @@ void mdc_getattr_pack(struct ptlrpc_request *req, u64 valid, u32 flags,
 						    &RMF_MDT_BODY);
 
 	b->mbo_valid = valid;
-	if (op_data->op_bias & MDS_CHECK_SPLIT)
-		b->mbo_valid |= OBD_MD_FLCKSPLIT;
 	if (op_data->op_bias & MDS_CROSS_REF)
 		b->mbo_valid |= OBD_MD_FLCROSSREF;
 	b->mbo_eadatasize = ea_size;
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index 21698cc..bcd0229 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1341,8 +1341,6 @@ void lustre_assert_wire_constants(void)
 		 OBD_MD_FLMDSCAPA);
 	LASSERTF(OBD_MD_FLOSSCAPA == (0x0000040000000000ULL), "found 0x%.16llxULL\n",
 		 OBD_MD_FLOSSCAPA);
-	LASSERTF(OBD_MD_FLCKSPLIT == (0x0000080000000000ULL), "found 0x%.16llxULL\n",
-		 OBD_MD_FLCKSPLIT);
 	LASSERTF(OBD_MD_FLCROSSREF == (0x0000100000000000ULL), "found 0x%.16llxULL\n",
 		 OBD_MD_FLCROSSREF);
 	LASSERTF(OBD_MD_FLGETATTRLOCK == (0x0000200000000000ULL), "found 0x%.16llxULL\n",
@@ -1866,8 +1864,6 @@ void lustre_assert_wire_constants(void)
 	LASSERTF((int)sizeof(((struct ll_fid *)0)->f_type) == 4, "found %lld\n",
 		 (long long)(int)sizeof(((struct ll_fid *)0)->f_type));
 
-	LASSERTF(MDS_CHECK_SPLIT == 0x00000001UL, "found 0x%.8xUL\n",
-		(unsigned int)MDS_CHECK_SPLIT);
 	LASSERTF(MDS_CROSS_REF == 0x00000002UL, "found 0x%.8xUL\n",
 		(unsigned int)MDS_CROSS_REF);
 	LASSERTF(MDS_VTX_BYPASS == 0x00000004UL, "found 0x%.8xUL\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 0bce63d..589bb81 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -1131,7 +1131,7 @@ static inline __u32 lov_mds_md_size(__u16 stripes, __u32 lmm_magic)
 /*	OBD_MD_FLRMTPERM	(0x0000010000000000ULL) remote perm, obsolete */
 #define OBD_MD_FLMDSCAPA	(0x0000020000000000ULL) /* MDS capability */
 #define OBD_MD_FLOSSCAPA	(0x0000040000000000ULL) /* OSS capability */
-#define OBD_MD_FLCKSPLIT	(0x0000080000000000ULL) /* Check split on server */
+/*	OBD_MD_FLCKSPLIT	(0x0000080000000000ULL) obsolete 2.3.58*/
 #define OBD_MD_FLCROSSREF	(0x0000100000000000ULL) /* Cross-ref case */
 #define OBD_MD_FLGETATTRLOCK	(0x0000200000000000ULL) /* Get IOEpoch attributes
 							 * under lock; for xattr
@@ -1640,7 +1640,7 @@ struct mdt_rec_setattr {
 #define MDS_ATTR_PROJID		0x10000ULL /* = 65536 */
 
 enum mds_op_bias {
-	MDS_CHECK_SPLIT		= 1 << 0,
+/*	MDS_CHECK_SPLIT		= 1 << 0, obsolete before 2.3.58 */
 	MDS_CROSS_REF		= 1 << 1,
 	MDS_VTX_BYPASS		= 1 << 2,
 	MDS_PERM_BYPASS		= 1 << 3,
-- 
1.8.3.1

* [lustre-devel] [PATCH 029/622] lustre: mdc: resend quotactl if needed
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (27 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 028/622] lustre: idl: remove obsolete directory split flags James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 030/622] lustre: obd: create ping sysfs file James Simmons
                   ` (593 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Hongchao Zhang <hongchao@whamcloud.com>

In mdc_quotactl, it is better to resend the quotactl request
if reconnection or failover is triggered during the process.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10368
Lustre-commit: d511918e8eb7 ("LU-10368 mdc: resend quotactl if needed")
Signed-off-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/31773
Reviewed-by: Fan Yong <fan.yong@intel.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/mdc/mdc_request.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index 5718db2..feac374 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -1867,7 +1867,7 @@ static int mdc_ioc_hsm_ct_start(struct obd_export *exp,
 				struct lustre_kernelcomm *lk);
 
 static int mdc_quotactl(struct obd_device *unused, struct obd_export *exp,
-			struct obd_quotactl *oqctl)
+                        struct obd_quotactl *oqctl)
 {
 	struct ptlrpc_request *req;
 	struct obd_quotactl *oqc;
@@ -1884,7 +1884,6 @@ static int mdc_quotactl(struct obd_device *unused, struct obd_export *exp,
 
 	ptlrpc_request_set_replen(req);
 	ptlrpc_at_set_req_timeout(req);
-	req->rq_no_resend = 1;
 
 	rc = ptlrpc_queue_wait(req);
 	if (rc)
-- 
1.8.3.1

* [lustre-devel] [PATCH 030/622] lustre: obd: create ping sysfs file
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (28 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 029/622] lustre: mdc: resend quotactl if needed James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 031/622] lustre: ldlm: change LDLM_POOL_ADD_VAR macro to inline function James Simmons
                   ` (592 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

We have ping in the lustre debugfs tree. It's a perfect
fit for sysfs. Create a sysfs equivalent so that, in time,
we can remove the debugfs file.

WC-bug-id: https://jira.hpdd.intel.com/browse/LU-8066
Lustre-commit: 0100ab268c31 ("LU-8066 obd: final pieces for sysfs/debugfs support")
Signed-off-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-on: https://review.whamcloud.com/28108
Lustre-commit: 6bbae72c6900 ("LU-8066 sysfs: make ping sysfs file read and writable")
Signed-off-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-on: https://review.whamcloud.com/33776
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lprocfs_status.h |  6 ++++--
 fs/lustre/mdc/lproc_mdc.c          |  7 +++----
 fs/lustre/mgc/lproc_mgc.c          |  7 +++----
 fs/lustre/osc/lproc_osc.c          |  7 +++----
 fs/lustre/ptlrpc/lproc_ptlrpc.c    | 18 ++++++++----------
 5 files changed, 21 insertions(+), 24 deletions(-)

diff --git a/fs/lustre/include/lprocfs_status.h b/fs/lustre/include/lprocfs_status.h
index 965f8a1..32d43fb 100644
--- a/fs/lustre/include/lprocfs_status.h
+++ b/fs/lustre/include/lprocfs_status.h
@@ -457,8 +457,10 @@ int lprocfs_wr_uint(struct file *file, const char __user *buffer,
 struct adaptive_timeout;
 int lprocfs_at_hist_helper(struct seq_file *m, struct adaptive_timeout *at);
 int lprocfs_rd_timeouts(struct seq_file *m, void *data);
-int lprocfs_wr_ping(struct file *file, const char __user *buffer,
-		    size_t count, loff_t *off);
+
+ssize_t ping_show(struct kobject *kobj, struct attribute *attr,
+		  char *buffer);
+
 int lprocfs_wr_import(struct file *file, const char __user *buffer,
 		      size_t count, loff_t *off);
 int lprocfs_rd_pinger_recov(struct seq_file *m, void *n);
diff --git a/fs/lustre/mdc/lproc_mdc.c b/fs/lustre/mdc/lproc_mdc.c
index f09292e..6b87e76 100644
--- a/fs/lustre/mdc/lproc_mdc.c
+++ b/fs/lustre/mdc/lproc_mdc.c
@@ -306,6 +306,8 @@ static ssize_t max_mod_rpcs_in_flight_store(struct kobject *kobj,
 #define mdc_conn_uuid_show conn_uuid_show
 LUSTRE_RO_ATTR(mdc_conn_uuid);
 
+LUSTRE_RO_ATTR(ping);
+
 static ssize_t mdc_rpc_stats_seq_write(struct file *file,
 				       const char __user *buf,
 				       size_t len, loff_t *off)
@@ -454,8 +456,6 @@ static ssize_t mdc_stats_seq_write(struct file *file,
 }
 LPROC_SEQ_FOPS(mdc_stats);
 
-LPROC_SEQ_FOPS_WR_ONLY(mdc, ping);
-
 LPROC_SEQ_FOPS_RO_TYPE(mdc, connect_flags);
 LPROC_SEQ_FOPS_RO_TYPE(mdc, server_uuid);
 LPROC_SEQ_FOPS_RO_TYPE(mdc, timeouts);
@@ -465,8 +465,6 @@ static ssize_t mdc_stats_seq_write(struct file *file,
 LPROC_SEQ_FOPS_RW_TYPE(mdc, pinger_recov);
 
 static struct lprocfs_vars lprocfs_mdc_obd_vars[] = {
-	{ .name	=	"ping",
-	  .fops	=	&mdc_ping_fops			},
 	{ .name	=	"connect_flags",
 	  .fops	=	&mdc_connect_flags_fops		},
 	{ .name	=	"mds_server_uuid",
@@ -500,6 +498,7 @@ static ssize_t mdc_stats_seq_write(struct file *file,
 	&lustre_attr_max_mod_rpcs_in_flight.attr,
 	&lustre_attr_max_pages_per_rpc.attr,
 	&lustre_attr_mdc_conn_uuid.attr,
+	&lustre_attr_ping.attr,
 	NULL,
 };
 
diff --git a/fs/lustre/mgc/lproc_mgc.c b/fs/lustre/mgc/lproc_mgc.c
index d977d51..4c276f9 100644
--- a/fs/lustre/mgc/lproc_mgc.c
+++ b/fs/lustre/mgc/lproc_mgc.c
@@ -45,8 +45,6 @@
 
 LPROC_SEQ_FOPS_RO_TYPE(mgc, state);
 
-LPROC_SEQ_FOPS_WR_ONLY(mgc, ping);
-
 static int mgc_ir_state_seq_show(struct seq_file *m, void *v)
 {
 	return lprocfs_mgc_rd_ir_state(m, m->private);
@@ -55,8 +53,6 @@ static int mgc_ir_state_seq_show(struct seq_file *m, void *v)
 LPROC_SEQ_FOPS_RO(mgc_ir_state);
 
 struct lprocfs_vars lprocfs_mgc_obd_vars[] = {
-	{ .name	=	"ping",
-	  .fops =	&mgc_ping_fops		},
 	{ .name =	"connect_flags",
 	  .fops =	&mgc_connect_flags_fops	},
 	{ .name =	"mgs_server_uuid",
@@ -73,8 +69,11 @@ struct lprocfs_vars lprocfs_mgc_obd_vars[] = {
 #define mgs_conn_uuid_show conn_uuid_show
 LUSTRE_RO_ATTR(mgs_conn_uuid);
 
+LUSTRE_RO_ATTR(ping);
+
 static struct attribute *mgc_attrs[] = {
 	&lustre_attr_mgs_conn_uuid.attr,
+	&lustre_attr_ping.attr,
 	NULL,
 };
 
diff --git a/fs/lustre/osc/lproc_osc.c b/fs/lustre/osc/lproc_osc.c
index df48138..605a236 100644
--- a/fs/lustre/osc/lproc_osc.c
+++ b/fs/lustre/osc/lproc_osc.c
@@ -176,6 +176,8 @@ static ssize_t max_dirty_mb_store(struct kobject *kobj,
 #define ost_conn_uuid_show conn_uuid_show
 LUSTRE_RO_ATTR(ost_conn_uuid);
 
+LUSTRE_RO_ATTR(ping);
+
 static int osc_cached_mb_seq_show(struct seq_file *m, void *v)
 {
 	struct obd_device *dev = m->private;
@@ -601,14 +603,10 @@ static int osc_unstable_stats_seq_show(struct seq_file *m, void *v)
 LPROC_SEQ_FOPS_RO_TYPE(osc, timeouts);
 LPROC_SEQ_FOPS_RO_TYPE(osc, state);
 
-LPROC_SEQ_FOPS_WR_ONLY(osc, ping);
-
 LPROC_SEQ_FOPS_RW_TYPE(osc, import);
 LPROC_SEQ_FOPS_RW_TYPE(osc, pinger_recov);
 
 static struct lprocfs_vars lprocfs_osc_obd_vars[] = {
-	{ .name	=	"ping",
-	  .fops =	&osc_ping_fops			},
 	{ .name	=	"connect_flags",
 	  .fops	=	&osc_connect_flags_fops		},
 	{ .name	=	"ost_server_uuid",
@@ -812,6 +810,7 @@ void lproc_osc_attach_seqstat(struct obd_device *dev)
 	&lustre_attr_short_io_bytes.attr,
 	&lustre_attr_resend_count.attr,
 	&lustre_attr_ost_conn_uuid.attr,
+	&lustre_attr_ping.attr,
 	NULL,
 };
 
diff --git a/fs/lustre/ptlrpc/lproc_ptlrpc.c b/fs/lustre/ptlrpc/lproc_ptlrpc.c
index 3dc99d4..e48a4e8 100644
--- a/fs/lustre/ptlrpc/lproc_ptlrpc.c
+++ b/fs/lustre/ptlrpc/lproc_ptlrpc.c
@@ -1227,13 +1227,11 @@ void ptlrpc_lprocfs_unregister_obd(struct obd_device *obd)
 }
 EXPORT_SYMBOL(ptlrpc_lprocfs_unregister_obd);
 
-#undef BUFLEN
-
-int lprocfs_wr_ping(struct file *file, const char __user *buffer,
-		    size_t count, loff_t *off)
+ssize_t ping_show(struct kobject *kobj, struct attribute *attr,
+		  char *buffer)
 {
-	struct seq_file *m = file->private_data;
-	struct obd_device *obd = m->private;
+	struct obd_device *obd = container_of(kobj, struct obd_device,
+					      obd_kset.kobj);
 	struct ptlrpc_request *req;
 	int rc;
 
@@ -1249,13 +1247,13 @@ int lprocfs_wr_ping(struct file *file, const char __user *buffer,
 	req->rq_send_state = LUSTRE_IMP_FULL;
 
 	rc = ptlrpc_queue_wait(req);
-
 	ptlrpc_req_finished(req);
-	if (rc >= 0)
-		return count;
+
 	return rc;
 }
-EXPORT_SYMBOL(lprocfs_wr_ping);
+EXPORT_SYMBOL(ping_show);
+
+#undef BUFLEN
 
 /* Write the connection UUID to this file to attempt to connect to that node.
  * The connection UUID is a node's primary NID. For example,
-- 
1.8.3.1

* [lustre-devel] [PATCH 031/622] lustre: ldlm: change LDLM_POOL_ADD_VAR macro to inline function
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (29 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 030/622] lustre: obd: create ping sysfs file James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 032/622] lustre: obdecho: use vmalloc for lnb James Simmons
                   ` (591 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

Simple cleanup to replace the LDLM_POOL_ADD_VAR() macro with the
inline function ldlm_add_var().

WC-bug-id: https://jira.hpdd.intel.com/browse/LU-8066
Lustre-commit: 05a36534ba2d ("LU-8066 ldlm: move all remaining files from procfs to debugfs")
Signed-off-by: Dmitry Eremin <dmiter4ever@gmail.com>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-on: https://review.whamcloud.com/29255
WC-bug-id: https://jira.hpdd.intel.com/browse/LU-3319
Lustre-commit: 4ad445ccd54 ("LU-3319 procfs: move ldlm proc handling over to seq_file")
Reviewed-on: http://review.whamcloud.com/7293
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Peng Tao <bergwolf@gmail.com>
Reviewed-by: Bob Glossman <bob.glossman@intel.com>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_internal.h | 10 ++++++++++
 fs/lustre/ldlm/ldlm_pool.c     | 11 ++---------
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/fs/lustre/ldlm/ldlm_internal.h b/fs/lustre/ldlm/ldlm_internal.h
index 6e54521..96dff1d 100644
--- a/fs/lustre/ldlm/ldlm_internal.h
+++ b/fs/lustre/ldlm/ldlm_internal.h
@@ -292,6 +292,16 @@ enum ldlm_policy_res {
 	}								    \
 	struct __##var##__dummy_write {; } /* semicolon catcher */
 
+static inline void
+ldlm_add_var(struct lprocfs_vars *vars, struct dentry *debugfs_entry,
+	     const char *name, void *data, const struct file_operations *ops)
+{
+	vars->name = name;
+	vars->data = data;
+	vars->fops = ops;
+	ldebugfs_add_vars(debugfs_entry, vars, NULL);
+}
+
 static inline int is_granted_or_cancelled(struct ldlm_lock *lock)
 {
 	int ret = 0;
diff --git a/fs/lustre/ldlm/ldlm_pool.c b/fs/lustre/ldlm/ldlm_pool.c
index 04bf5de..d2149a6 100644
--- a/fs/lustre/ldlm/ldlm_pool.c
+++ b/fs/lustre/ldlm/ldlm_pool.c
@@ -504,14 +504,6 @@ static ssize_t grant_speed_show(struct kobject *kobj, struct attribute *attr,
 LDLM_POOL_SYSFS_WRITER_NOLOCK_STORE(lock_volume_factor, atomic);
 LUSTRE_RW_ATTR(lock_volume_factor);
 
-#define LDLM_POOL_ADD_VAR(_name, var, ops)			\
-	do {							\
-		pool_vars[0].name = #_name;			\
-		pool_vars[0].data = var;			\
-		pool_vars[0].fops = ops;			\
-		ldebugfs_add_vars(pl->pl_debugfs_entry, pool_vars, NULL);\
-	} while (0)
-
 /* These are for pools in /sys/fs/lustre/ldlm/namespaces/.../pool */
 static struct attribute *ldlm_pl_attrs[] = {
 	&lustre_attr_grant_speed.attr,
@@ -571,7 +563,8 @@ static int ldlm_pool_debugfs_init(struct ldlm_pool *pl)
 
 	memset(pool_vars, 0, sizeof(pool_vars));
 
-	LDLM_POOL_ADD_VAR(state, pl, &lprocfs_pool_state_fops);
+	ldlm_add_var(&pool_vars[0], pl->pl_debugfs_entry, "state", pl,
+		     &lprocfs_pool_state_fops);
 
 	pl->pl_stats = lprocfs_alloc_stats(LDLM_POOL_LAST_STAT -
 					   LDLM_POOL_FIRST_STAT, 0);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 032/622] lustre: obdecho: use vmalloc for lnb
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (30 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 031/622] lustre: ldlm: change LDLM_POOL_ADD_VAR macro to inline function James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 033/622] lustre: mdc: deny layout swap for DoM file James Simmons
                   ` (590 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

When allocating the niobuf_local array, if there are a large number
of (potential) fragments this allocation can be quite large. Use
kvmalloc_array() and kvfree() to avoid allocation errors and
console noise. This was causing sanity test_180c to fail in a
VM on occasion, and could also be a problem in real use.
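
The switch helps because kvmalloc_array() first tries a physically
contiguous kmalloc and transparently falls back to vmalloc when that
fails; it also guards the count * size multiplication against overflow.
A minimal userspace sketch of that overflow guard (the helper name is
ours, not a kernel API):

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace sketch of the overflow check that *_array allocators such as
 * kvmalloc_array() perform before multiplying element count by element
 * size. Returns 0 on overflow, which a caller treats as an allocation
 * failure instead of silently allocating a too-small buffer. */
static size_t checked_array_size(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return 0;	/* n * size would overflow size_t */
	return n * size;
}
```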

WC-bug-id: https://jira.whamcloud.com/browse/LU-10903
Lustre-commit: 8878bab7ae5f ("LU-10903 obdecho: use OBD_ALLOC_LARGE for lnb")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/31964
Reviewed-by: Emoly Liu <emoly@whamcloud.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdecho/echo_client.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/obdecho/echo_client.c b/fs/lustre/obdecho/echo_client.c
index 3984cb4..0735a5a 100644
--- a/fs/lustre/obdecho/echo_client.c
+++ b/fs/lustre/obdecho/echo_client.c
@@ -1343,7 +1343,8 @@ static int echo_client_prep_commit(const struct lu_env *env,
 	npages = batch >> PAGE_SHIFT;
 	tot_pages = count >> PAGE_SHIFT;
 
-	lnb = kcalloc(npages, sizeof(struct niobuf_local), GFP_NOFS);
+	lnb = kvmalloc_array(npages, sizeof(struct niobuf_local),
+			     GFP_NOFS | __GFP_ZERO);
 	if (!lnb) {
 		ret = -ENOMEM;
 		goto out;
@@ -1411,7 +1412,7 @@ static int echo_client_prep_commit(const struct lu_env *env,
 	}
 
 out:
-	kfree(lnb);
+	kvfree(lnb);
 	return ret;
 }
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 033/622] lustre: mdc: deny layout swap for DoM file
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (31 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 032/622] lustre: obdecho: use vmalloc for lnb James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 034/622] lustre: mgc: remove obsolete IR swabbing workaround James Simmons
                   ` (589 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

Layout swap is prohibited for DoM files until LU-10177 is
implemented. The only exception is a new layout having the same
DoM component.
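
The client-side hunk widens the set of lock modes that
mdc_dom_lock_match() accepts. LDLM lock modes are single-bit flags, so
widening the match is just widening a mask; an illustrative standalone
sketch (mode values mirror enum ldlm_mode in the Lustre headers, the
helper name is ours):

```c
/* Lock modes as single-bit flags, mirroring enum ldlm_mode. */
enum ldlm_mode_sketch {
	LCK_EX		= 0x01,
	LCK_PW		= 0x02,
	LCK_PR		= 0x04,
	LCK_CW		= 0x08,
	LCK_CR		= 0x10,
	LCK_NL		= 0x20,
	LCK_GROUP	= 0x40,
};

/* Non-zero when the granted mode is one of the modes the caller accepts;
 * adding LCK_GROUP to match_mask is exactly what the patch does. */
static int lock_mode_matches(int granted, int match_mask)
{
	return (granted & match_mask) != 0;
}
```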

WC-bug-id: https://jira.whamcloud.com/browse/LU-10910
Lustre-commit: 51c11d7cfaff ("LU-10910 mdd: deny layout swap for DoM file")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32044
Reviewed-by: Fan Yong <fan.yong@intel.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/mdc/mdc_dev.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/mdc/mdc_dev.c b/fs/lustre/mdc/mdc_dev.c
index 80e3120..21dc83e 100644
--- a/fs/lustre/mdc/mdc_dev.c
+++ b/fs/lustre/mdc/mdc_dev.c
@@ -149,7 +149,8 @@ struct ldlm_lock *mdc_dlmlock_at_pgoff(const struct lu_env *env,
 	 * writers can share a single PW lock.
 	 */
 	mode = mdc_dom_lock_match(env, osc_export(obj), resname, LDLM_IBITS,
-				  policy, LCK_PR | LCK_PW, &flags, obj, &lockh,
+				  policy, LCK_PR | LCK_PW | LCK_GROUP, &flags,
+				  obj, &lockh,
 				  dap_flags & OSC_DAP_FL_CANCELING);
 	if (mode) {
 		lock = ldlm_handle2lock(&lockh);
-- 
1.8.3.1


* [lustre-devel] [PATCH 034/622] lustre: mgc: remove obsolete IR swabbing workaround
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (32 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 033/622] lustre: mdc: deny layout swap for DoM file James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 035/622] lustre: ptlrpc: add dir migration connect flag James Simmons
                   ` (588 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

The OBD_CONNECT_MNE_SWAB check was added to the MGC for compatibility
with servers in the 2.2.0-2.2.55 range (in 2012) with big-endian
clients.  2.2 was not an LTS release and is no longer being used.

Remove the checks on the client for OBD_CONNECT_MNE_SWAB being set,
and assume that the server does not have this bug.  This will allow
the removal of the rest of this workaround from the server code once
there are no more clients depending on the presence of this flag.
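
The removed workaround can be pictured as a conditional double swab: the
client normally byte-swaps a reply field only when the wire endianness
differs, and for buggy servers it toggled that decision to undo the
server's extra swab. A rough standalone model (function names and the
boolean flags are ours, not Lustre APIs):

```c
#include <stdbool.h>
#include <stdint.h>

/* Byte-swap a 32-bit value, like the kernel's swab32(). */
static uint32_t swab32(uint32_t v)
{
	return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
	       ((v & 0x00ff0000u) >> 8)  | ((v & 0xff000000u) >> 24);
}

/* Model of the removed logic: need_swab says the wire endianness differs
 * from the host; pre-LU-1252 servers swabbed IR MNE records once more,
 * so the client compensated by toggling need_swab. */
static uint32_t decode_field(uint32_t wire, bool need_swab,
			     bool server_extra_swab)
{
	if (server_extra_swab)	/* buggy 2.2.0-2.2.55 servers */
		need_swab = !need_swab;
	return need_swab ? swab32(wire) : wire;
}
```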

WC-bug-id: https://jira.whamcloud.com/browse/LU-1644
Lustre-commit: a0c644fde340 ("LU-1644 mgc: remove obsolete IR swabbing workaround")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32087
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_import.h |  4 ----
 fs/lustre/mgc/mgc_request.c       |  9 +--------
 fs/lustre/ptlrpc/import.c         | 21 ---------------------
 3 files changed, 1 insertion(+), 33 deletions(-)

diff --git a/fs/lustre/include/lustre_import.h b/fs/lustre/include/lustre_import.h
index 522e5b7..0d7bb0f 100644
--- a/fs/lustre/include/lustre_import.h
+++ b/fs/lustre/include/lustre_import.h
@@ -289,10 +289,6 @@ struct obd_import {
 					imp_resend_replay:1,
 					/* disable normal recovery, for test only. */
 					imp_no_pinger_recover:1,
-#if OBD_OCD_VERSION(3, 0, 53, 0) > LUSTRE_VERSION_CODE
-					/* need IR MNE swab */
-					imp_need_mne_swab:1,
-#endif
 					/* import must be reconnected instead of
 					 * chosing new connection
 					 */
diff --git a/fs/lustre/mgc/mgc_request.c b/fs/lustre/mgc/mgc_request.c
index ca4b8a9..c114aa8 100644
--- a/fs/lustre/mgc/mgc_request.c
+++ b/fs/lustre/mgc/mgc_request.c
@@ -1436,14 +1436,7 @@ static int mgc_process_recover_log(struct obd_device *obd,
 		goto out;
 	}
 
-	mne_swab = !!ptlrpc_rep_need_swab(req);
-#if OBD_OCD_VERSION(3, 0, 53, 0) > LUSTRE_VERSION_CODE
-	/* This import flag means the server did an extra swab of IR MNE
-	 * records (fixed in LU-1252), reverse it here if needed. LU-1644
-	 */
-	if (unlikely(req->rq_import->imp_need_mne_swab))
-		mne_swab = !mne_swab;
-#endif
+	mne_swab = ptlrpc_rep_need_swab(req);
 
 	for (i = 0; i < nrpages && ealen > 0; i++) {
 		int rc2;
diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index dca4aa0..f69b907 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -780,27 +780,6 @@ static int ptlrpc_connect_set_flags(struct obd_import *imp,
 		warned = true;
 	}
 
-#if LUSTRE_VERSION_CODE < OBD_OCD_VERSION(3, 0, 53, 0)
-	/*
-	 * Check if server has LU-1252 fix applied to not always swab
-	 * the IR MNE entries. Do this only once per connection.  This
-	 * fixup is version-limited, because we don't want to carry the
-	 * OBD_CONNECT_MNE_SWAB flag around forever, just so long as we
-	 * need interop with unpatched 2.2 servers.  For newer servers,
-	 * the client will do MNE swabbing only as needed.  LU-1644
-	 */
-	if (unlikely((ocd->ocd_connect_flags & OBD_CONNECT_VERSION) &&
-		     !(ocd->ocd_connect_flags & OBD_CONNECT_MNE_SWAB) &&
-		     OBD_OCD_VERSION_MAJOR(ocd->ocd_version) == 2 &&
-		     OBD_OCD_VERSION_MINOR(ocd->ocd_version) == 2 &&
-		     OBD_OCD_VERSION_PATCH(ocd->ocd_version) < 55 &&
-		     !strcmp(imp->imp_obd->obd_type->typ_name,
-			     LUSTRE_MGC_NAME)))
-		imp->imp_need_mne_swab = 1;
-	else /* clear if server was upgraded since last connect */
-		imp->imp_need_mne_swab = 0;
-#endif
-
 	if (ocd->ocd_connect_flags & OBD_CONNECT_CKSUM) {
 		/*
 		 * We sent to the server ocd_cksum_types with bits set
-- 
1.8.3.1


* [lustre-devel] [PATCH 035/622] lustre: ptlrpc: add dir migration connect flag
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (33 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 034/622] lustre: mgc: remove obsolete IR swabbing workaround James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 036/622] lustre: mds: remove obsolete MDS_VTX_BYPASS flag James Simmons
                   ` (587 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

Add a dir migration connect flag to prevent collisions with other
features. Though dir migration code exists, it will be reworked,
and the new RPC protocol won't be compatible with the current one.

Also handle the previously-added OBD_CONNECT2_FLR flag.
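
The lprocfs names table touched by this patch is indexed by bit
position, which is why the unused 0x08 and 0x10 bits need "unknown"
placeholders: skipping them would shift every later name. A standalone
sketch of that lookup (the helper name is ours; the table contents
mirror the flags2 names in the hunk):

```c
#include <stddef.h>
#include <string.h>

/* Names indexed by bit position; unused bits keep a placeholder so the
 * index of every later flag stays aligned with its bit. */
static const char *flags2_names[] = {
	"file_secctx",	/* 0x01 */
	"lockaheadv2",	/* 0x02 */
	"dir_migrate",	/* 0x04 */
	"unknown",	/* 0x08 */
	"unknown",	/* 0x10 */
	"flr",		/* 0x20 */
};

/* Name for a single-bit flag value, or NULL if out of range. */
static const char *flag2_name(unsigned long long flag)
{
	size_t i;

	for (i = 0; i < sizeof(flags2_names) / sizeof(flags2_names[0]); i++)
		if (flag == (1ULL << i))
			return flags2_names[i];
	return NULL;
}
```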

WC-bug-id: https://jira.whamcloud.com/browse/LU-4684
Lustre-commit: 14b98596fa24 ("LU-4684 ptlrpc: add dir migration connect flag")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/31914
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/lprocfs_status.c    | 8 ++++++--
 fs/lustre/ptlrpc/wiretest.c            | 4 ++++
 include/uapi/linux/lustre/lustre_idl.h | 2 ++
 3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/obdclass/lprocfs_status.c b/fs/lustre/obdclass/lprocfs_status.c
index 33c76c1..66d2679 100644
--- a/fs/lustre/obdclass/lprocfs_status.c
+++ b/fs/lustre/obdclass/lprocfs_status.c
@@ -111,8 +111,12 @@
 	"compact_obdo",
 	"second_flags",
 	/* flags2 names */
-	"file_secctx",
-	"lockaheadv2",
+	"file_secctx",	/* 0x01 */
+	"lockaheadv2",	/* 0x02 */
+	"dir_migrate",	/* 0x04 */
+	"unknown",	/* 0x08 */
+	"unknown",	/* 0x10 */
+	"flr",		/* 0x20 */
 	NULL
 };
 
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index bcd0229..46d5e74 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1111,6 +1111,10 @@ void lustre_assert_wire_constants(void)
 		 OBD_CONNECT2_FILE_SECCTX);
 	LASSERTF(OBD_CONNECT2_LOCKAHEAD == 0x2ULL, "found 0x%.16llxULL\n",
 		 OBD_CONNECT2_LOCKAHEAD);
+	LASSERTF(OBD_CONNECT2_DIR_MIGRATE == 0x4ULL, "found 0x%.16llxULL\n",
+		 OBD_CONNECT2_DIR_MIGRATE);
+	LASSERTF(OBD_CONNECT2_FLR == 0x20ULL, "found 0x%.16llxULL\n",
+		 OBD_CONNECT2_FLR);
 	LASSERTF(OBD_CKSUM_CRC32 == 0x00000001UL, "found 0x%.8xUL\n",
 		 (unsigned int)OBD_CKSUM_CRC32);
 	LASSERTF(OBD_CKSUM_ADLER == 0x00000002UL, "found 0x%.8xUL\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 589bb81..e898e67 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -791,6 +791,8 @@ struct ptlrpc_body_v2 {
 #define OBD_CONNECT2_LOCKAHEAD		0x2ULL		/* ladvise lockahead
 							 * v2
 							 */
+#define OBD_CONNECT2_DIR_MIGRATE	0x4ULL		/* migrate striped dir
+							 */
 #define OBD_CONNECT2_FLR		0x20ULL		/* FLR support */
 
 /* XXX README XXX:
-- 
1.8.3.1


* [lustre-devel] [PATCH 036/622] lustre: mds: remove obsolete MDS_VTX_BYPASS flag
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (34 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 035/622] lustre: ptlrpc: add dir migration connect flag James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 037/622] lustre: ldlm: expose dirty age limit for flush-on-glimpse James Simmons
                   ` (586 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

The MDS_VTX_BYPASS flag is only set and never checked.  This is true
since 2.3.53-66-g54fe979 "LU-2216 mdt: remove obsolete DNE code", but
it was already obsolete for a long time before that.

WC-bug-id: https://jira.whamcloud.com/browse/LU-6349
Lustre-commit: b99344dda425 ("LU-6349 mds: remove obsolete MDS_VTX_BYPASS flag")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/31984
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/wiretest.c            | 2 --
 include/uapi/linux/lustre/lustre_idl.h | 4 ++--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index 46d5e74..c92663b 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1870,8 +1870,6 @@ void lustre_assert_wire_constants(void)
 
 	LASSERTF(MDS_CROSS_REF == 0x00000002UL, "found 0x%.8xUL\n",
 		(unsigned int)MDS_CROSS_REF);
-	LASSERTF(MDS_VTX_BYPASS == 0x00000004UL, "found 0x%.8xUL\n",
-		(unsigned int)MDS_VTX_BYPASS);
 	LASSERTF(MDS_PERM_BYPASS == 0x00000008UL, "found 0x%.8xUL\n",
 		(unsigned int)MDS_PERM_BYPASS);
 	LASSERTF(MDS_QUOTA_IGNORE == 0x00000020UL, "found 0x%.8xUL\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index e898e67..794e6d6 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -1644,11 +1644,11 @@ struct mdt_rec_setattr {
 enum mds_op_bias {
 /*	MDS_CHECK_SPLIT		= 1 << 0, obsolete before 2.3.58 */
 	MDS_CROSS_REF		= 1 << 1,
-	MDS_VTX_BYPASS		= 1 << 2,
+/*	MDS_VTX_BYPASS		= 1 << 2, obsolete since 2.3.54 */
 	MDS_PERM_BYPASS		= 1 << 3,
 /*	MDS_SOM			= 1 << 4, obsolete since 2.8.0 */
 	MDS_QUOTA_IGNORE	= 1 << 5,
-	MDS_CLOSE_CLEANUP	= 1 << 6,
+/*	MDS_CLOSE_CLEANUP	= 1 << 6, obsolete since 2.3.51 */
 	MDS_KEEP_ORPHAN		= 1 << 7,
 	MDS_RECOV_OPEN		= 1 << 8,
 	MDS_DATA_MODIFIED	= 1 << 9,
-- 
1.8.3.1


* [lustre-devel] [PATCH 037/622] lustre: ldlm: expose dirty age limit for flush-on-glimpse
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (35 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 036/622] lustre: mds: remove obsolete MDS_VTX_BYPASS flag James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 038/622] lustre: ldlm: IBITS lock convert instead of cancel James Simmons
                   ` (585 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

A glimpse request may cancel an old lock and cause a data flush;
that helps the client cache stat results locally early. The time
limit was hardcoded to 10s and is now exposed as the
ns_dirty_age_limit namespace value, which can be set/checked via
/sys/fs/lustre/ldlm/namespaces/<namespace>/dirty_age_limit
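
The tunable only changes the threshold in the existing idle-time
comparison in ldlm_handle_gl_callback(); roughly, with names of our own
and times modeled in nanoseconds like ktime_t:

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Sketch of the flush-on-glimpse age test: an unused lock may be
 * cancelled in response to a conflicting glimpse only once it has been
 * idle longer than the namespace's dirty_age_limit (in seconds). */
static bool dirty_age_exceeded(int64_t now_ns, int64_t last_used_ns,
			       int64_t dirty_age_limit_sec)
{
	return now_ns > last_used_ns + dirty_age_limit_sec * NSEC_PER_SEC;
}
```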

WC-bug-id: https://jira.whamcloud.com/browse/LU-10413
Lustre-commit: 69727e45b4c0 ("LU-10413 ldlm: expose dirty age limit for flush-on-glimpse")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32113
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_dlm.h | 12 +++++++++++-
 fs/lustre/ldlm/ldlm_lockd.c    |  2 +-
 fs/lustre/ldlm/ldlm_resource.c | 28 ++++++++++++++++++++++++++++
 3 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/include/lustre_dlm.h b/fs/lustre/include/lustre_dlm.h
index b1a37f0..8dea9ab 100644
--- a/fs/lustre/include/lustre_dlm.h
+++ b/fs/lustre/include/lustre_dlm.h
@@ -60,6 +60,10 @@
 
 #define LDLM_DEFAULT_LRU_SIZE (100 * num_online_cpus())
 #define LDLM_DEFAULT_MAX_ALIVE (64 * 60)	/* 65 min */
+/* if client lock is unused for that time it can be cancelled if any other
+ * client shows interest in that lock, e.g. a glimpse occurs.
+ */
+#define LDLM_DIRTY_AGE_LIMIT (10)
 #define LDLM_DEFAULT_PARALLEL_AST_LIMIT 1024
 
 /**
@@ -412,7 +416,13 @@ struct ldlm_namespace {
 
 	/** Maximum allowed age (last used time) for locks in the LRU */
 	ktime_t			ns_max_age;
-
+	/**
+	 * Number of seconds since the lock was last used. The client may
+	 * cancel the lock, limited by this age, and flush related data
+	 * if any other client shows interest in it via a glimpse request.
+	 * This allows stat data to be cached locally for such files early.
+	 */
+	time64_t		ns_dirty_age_limit;
 	/**
 	 * Used to rate-limit ldlm_namespace_dump calls.
 	 * \see ldlm_namespace_dump. Increased by 10 seconds every time
diff --git a/fs/lustre/ldlm/ldlm_lockd.c b/fs/lustre/ldlm/ldlm_lockd.c
index 84d73e6..481719b 100644
--- a/fs/lustre/ldlm/ldlm_lockd.c
+++ b/fs/lustre/ldlm/ldlm_lockd.c
@@ -305,7 +305,7 @@ static void ldlm_handle_gl_callback(struct ptlrpc_request *req,
 	    !lock->l_readers && !lock->l_writers &&
 	    ktime_after(ktime_get(),
 			ktime_add(lock->l_last_used,
-				  ktime_set(10, 0)))) {
+				  ktime_set(ns->ns_dirty_age_limit, 0)))) {
 		unlock_res_and_lock(lock);
 		if (ldlm_bl_to_thread_lock(ns, NULL, lock))
 			ldlm_handle_bl_callback(ns, NULL, lock);
diff --git a/fs/lustre/ldlm/ldlm_resource.c b/fs/lustre/ldlm/ldlm_resource.c
index 4e3c6e7..5e0dd53 100644
--- a/fs/lustre/ldlm/ldlm_resource.c
+++ b/fs/lustre/ldlm/ldlm_resource.c
@@ -327,6 +327,32 @@ static ssize_t early_lock_cancel_store(struct kobject *kobj,
 }
 LUSTRE_RW_ATTR(early_lock_cancel);
 
+static ssize_t dirty_age_limit_show(struct kobject *kobj,
+				    struct attribute *attr, char *buf)
+{
+	struct ldlm_namespace *ns = container_of(kobj, struct ldlm_namespace,
+						 ns_kobj);
+
+	return sprintf(buf, "%llu\n", ns->ns_dirty_age_limit);
+}
+
+static ssize_t dirty_age_limit_store(struct kobject *kobj,
+				     struct attribute *attr,
+				     const char *buffer, size_t count)
+{
+	struct ldlm_namespace *ns = container_of(kobj, struct ldlm_namespace,
+						 ns_kobj);
+	unsigned long long tmp;
+
+	if (kstrtoull(buffer, 10, &tmp))
+		return -EINVAL;
+
+	ns->ns_dirty_age_limit = tmp;
+
+	return count;
+}
+LUSTRE_RW_ATTR(dirty_age_limit);
+
 /* These are for namespaces in /sys/fs/lustre/ldlm/namespaces/ */
 static struct attribute *ldlm_ns_attrs[] = {
 	&lustre_attr_resource_count.attr,
@@ -335,6 +361,7 @@ static ssize_t early_lock_cancel_store(struct kobject *kobj,
 	&lustre_attr_lru_size.attr,
 	&lustre_attr_lru_max_age.attr,
 	&lustre_attr_early_lock_cancel.attr,
+	&lustre_attr_dirty_age_limit.attr,
 	NULL,
 };
 
@@ -653,6 +680,7 @@ struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
 	ns->ns_max_age = ktime_set(LDLM_DEFAULT_MAX_ALIVE, 0);
 	ns->ns_orig_connect_flags = 0;
 	ns->ns_connect_flags = 0;
+	ns->ns_dirty_age_limit = LDLM_DIRTY_AGE_LIMIT;
 	ns->ns_stopping = 0;
 
 	rc = ldlm_namespace_sysfs_register(ns);
-- 
1.8.3.1


* [lustre-devel] [PATCH 038/622] lustre: ldlm: IBITS lock convert instead of cancel
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (36 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 037/622] lustre: ldlm: expose dirty age limit for flush-on-glimpse James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 039/622] lustre: ptlrpc: fix return type of boolean functions James Simmons
                   ` (584 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

For an IBITS lock it is possible to drop just the conflicting
bits and keep the lock itself instead of cancelling it. A lock
convert is only a bits downgrade, first on the client and then
on the server.
This patch implements lock convert during the blocking AST.
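
The core of the convert path in ldlm_cli_dropbits() is simple bit
arithmetic; a standalone sketch under the assumption that only the two
inodebits fields matter (the struct and helper below are ours, a
stand-in for the relevant ldlm_lock fields):

```c
#include <stdbool.h>

/* Stand-in for the l_inodebits policy fields of an ldlm_lock. */
struct ibits_sketch {
	unsigned long long bits;	/* bits currently held */
	unsigned long long cancel_bits;	/* bits dropped by the convert */
};

/* Drop only the bits that are both held and conflicting, record them in
 * cancel_bits so the blocking AST can tell a convert from a full cancel,
 * and refuse when nothing would remain so the caller falls back to a
 * normal cancel. */
static bool ibits_try_downgrade(struct ibits_sketch *l,
				unsigned long long drop)
{
	if (!(l->bits & ~drop))
		return false;	/* every held bit conflicts: cancel */
	drop &= l->bits;	/* ignore bits we do not hold */
	l->cancel_bits = drop;
	l->bits &= ~drop;
	return true;
}
```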

WC-bug-id: https://jira.whamcloud.com/browse/LU-10175
Lustre-commit: 37932c4beb98 ("LU-10175 ldlm: IBITS lock convert instead of cancel")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/30202
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_dlm.h         |   6 +
 fs/lustre/include/lustre_dlm_flags.h   |  16 +-
 fs/lustre/ldlm/ldlm_inodebits.c        |  92 +++++++-
 fs/lustre/ldlm/ldlm_internal.h         |   2 +
 fs/lustre/ldlm/ldlm_lock.c             |  13 +-
 fs/lustre/ldlm/ldlm_lockd.c            |  18 ++
 fs/lustre/ldlm/ldlm_request.c          | 198 ++++++++++++++++-
 fs/lustre/llite/namei.c                | 383 ++++++++++++++++++++-------------
 fs/lustre/ptlrpc/wiretest.c            |   2 +-
 include/uapi/linux/lustre/lustre_idl.h |   1 +
 10 files changed, 569 insertions(+), 162 deletions(-)

diff --git a/fs/lustre/include/lustre_dlm.h b/fs/lustre/include/lustre_dlm.h
index 8dea9ab..66608a9 100644
--- a/fs/lustre/include/lustre_dlm.h
+++ b/fs/lustre/include/lustre_dlm.h
@@ -544,6 +544,7 @@ enum ldlm_cancel_flags {
 	LCF_BL_AST     = 0x4, /* Cancel locks marked as LDLM_FL_BL_AST
 			       * in the same RPC
 			       */
+	LCF_CONVERT    = 0x8, /* Try to convert IBITS lock before cancel */
 };
 
 struct ldlm_flock {
@@ -1306,6 +1307,7 @@ int ldlm_cli_enqueue_fini(struct obd_export *exp, struct ptlrpc_request *req,
 			  enum ldlm_mode mode,
 			  u64 *flags, void *lvb, u32 lvb_len,
 			  const struct lustre_handle *lockh, int rc);
+int ldlm_cli_convert(struct ldlm_lock *lock, u32 *flags);
 int ldlm_cli_update_pool(struct ptlrpc_request *req);
 int ldlm_cli_cancel(const struct lustre_handle *lockh,
 		    enum ldlm_cancel_flags cancel_flags);
@@ -1330,6 +1332,10 @@ int ldlm_cli_cancel_list(struct list_head *head, int count,
 			 enum ldlm_cancel_flags flags);
 /** @} ldlm_cli_api */
 
+int ldlm_inodebits_drop(struct ldlm_lock *lock, u64 to_drop);
+int ldlm_cli_dropbits(struct ldlm_lock *lock, u64 drop_bits);
+int ldlm_cli_dropbits_list(struct list_head *converts, u64 drop_bits);
+
 /* mds/handler.c */
 /* This has to be here because recursive inclusion sucks. */
 int intent_disposition(struct ldlm_reply *rep, int flag);
diff --git a/fs/lustre/include/lustre_dlm_flags.h b/fs/lustre/include/lustre_dlm_flags.h
index 22fb595..c8667c8 100644
--- a/fs/lustre/include/lustre_dlm_flags.h
+++ b/fs/lustre/include/lustre_dlm_flags.h
@@ -26,10 +26,10 @@
  */
 #ifndef LDLM_ALL_FLAGS_MASK
 
-/** l_flags bits marked as "all_flags" bits */
-#define LDLM_FL_ALL_FLAGS_MASK		0x00FFFFFFC08F932FULL
+/* l_flags bits marked as "all_flags" bits */
+#define LDLM_FL_ALL_FLAGS_MASK		0x00FFFFFFC28F932FULL
 
-/** extent, mode, or resource changed */
+/* extent, mode, or resource changed */
 #define LDLM_FL_LOCK_CHANGED		0x0000000000000001ULL /* bit 0 */
 #define ldlm_is_lock_changed(_l)	LDLM_TEST_FLAG((_l), 1ULL <<  0)
 #define ldlm_set_lock_changed(_l)	LDLM_SET_FLAG((_l), 1ULL <<  0)
@@ -146,6 +146,16 @@
 #define ldlm_clear_cancel_on_block(_l)	LDLM_CLEAR_FLAG((_l), 1ULL << 23)
 
 /**
+ * Flag indicates that lock is being converted (downgraded) during the blocking
+ * AST instead of cancelling. Used for IBITS locks now and drops conflicting
+ * bits only, keeping the others.
+ */
+#define LDLM_FL_CONVERTING		0x0000000002000000ULL /* bit 25 */
+#define ldlm_is_converting(_l)		LDLM_TEST_FLAG((_l), 1ULL << 25)
+#define ldlm_set_converting(_l)		LDLM_SET_FLAG((_l), 1ULL << 25)
+#define ldlm_clear_converting(_l)	LDLM_CLEAR_FLAG((_l), 1ULL << 25)
+
+/*
  * Part of original lockahead implementation, OBD_CONNECT_LOCKAHEAD_OLD.
  * Reserved temporarily to allow those implementations to keep working.
  * Will be removed after 2.12 release.
diff --git a/fs/lustre/ldlm/ldlm_inodebits.c b/fs/lustre/ldlm/ldlm_inodebits.c
index ea63d9d..e74928e 100644
--- a/fs/lustre/ldlm/ldlm_inodebits.c
+++ b/fs/lustre/ldlm/ldlm_inodebits.c
@@ -68,7 +68,14 @@ void ldlm_ibits_policy_local_to_wire(const union ldlm_policy_data *lpolicy,
 	wpolicy->l_inodebits.bits = lpolicy->l_inodebits.bits;
 }
 
-int ldlm_inodebits_drop(struct ldlm_lock *lock,  __u64 to_drop)
+/**
+ * Attempt to convert an already granted IBITS lock with several bits
+ * set to a lock with fewer bits (downgrade).
+ *
+ * Such lock conversion is used to keep lock with non-blocking bits instead of
+ * cancelling it, introduced for better support of DoM files.
+ */
+int ldlm_inodebits_drop(struct ldlm_lock *lock, u64 to_drop)
 {
 	check_res_locked(lock->l_resource);
 
@@ -89,3 +96,86 @@ int ldlm_inodebits_drop(struct ldlm_lock *lock,  __u64 to_drop)
 	return 0;
 }
 EXPORT_SYMBOL(ldlm_inodebits_drop);
+
+/* convert single lock */
+int ldlm_cli_dropbits(struct ldlm_lock *lock, u64 drop_bits)
+{
+	struct lustre_handle lockh;
+	u32 flags = 0;
+	int rc;
+
+	LASSERT(drop_bits);
+	LASSERT(!lock->l_readers && !lock->l_writers);
+
+	LDLM_DEBUG(lock, "client lock convert START");
+
+	ldlm_lock2handle(lock, &lockh);
+	lock_res_and_lock(lock);
+	/* check if all bits are cancelled */
+	if (!(lock->l_policy_data.l_inodebits.bits & ~drop_bits)) {
+		unlock_res_and_lock(lock);
+		/* return error to continue with cancel */
+		rc = -EINVAL;
+		goto exit;
+	}
+
+	/* check if there is race with cancel */
+	if (ldlm_is_canceling(lock) || ldlm_is_cancel(lock)) {
+		unlock_res_and_lock(lock);
+		rc = -EINVAL;
+		goto exit;
+	}
+
+	/* clear cbpending flag early, it is safe to match lock right after
+	 * client convert because it is downgrade always.
+	 */
+	ldlm_clear_cbpending(lock);
+	ldlm_clear_bl_ast(lock);
+
+	/* If lock is being converted already, check drop bits first */
+	if (ldlm_is_converting(lock)) {
+		/* raced lock convert, lock inodebits are remaining bits
+		 * so check if they are conflicting with new convert or not.
+		 */
+		if (!(lock->l_policy_data.l_inodebits.bits & drop_bits)) {
+			unlock_res_and_lock(lock);
+			rc = 0;
+			goto exit;
+		}
+		/* Otherwise drop new conflicting bits in new convert */
+	}
+	ldlm_set_converting(lock);
+	/* from all bits of blocking lock leave only conflicting */
+	drop_bits &= lock->l_policy_data.l_inodebits.bits;
+	/* save them in cancel_bits, so l_blocking_ast will know
+	 * which bits from the current lock were dropped.
+	 */
+	lock->l_policy_data.l_inodebits.cancel_bits = drop_bits;
+	/* Finally clear these bits in lock ibits */
+	ldlm_inodebits_drop(lock, drop_bits);
+	unlock_res_and_lock(lock);
+	/* Finally call cancel callback for remaining bits only.
+	 * It is important to have converting flag during that
+	 * so blocking_ast callback can distinguish convert from
+	 * cancels.
+	 */
+	if (lock->l_blocking_ast)
+		lock->l_blocking_ast(lock, NULL, lock->l_ast_data,
+				     LDLM_CB_CANCELING);
+
+	/* now notify server about convert */
+	rc = ldlm_cli_convert(lock, &flags);
+	if (rc) {
+		lock_res_and_lock(lock);
+		ldlm_clear_converting(lock);
+		ldlm_set_cbpending(lock);
+		ldlm_set_bl_ast(lock);
+		unlock_res_and_lock(lock);
+		LASSERT(list_empty(&lock->l_lru));
+		goto exit;
+	}
+
+exit:
+	LDLM_DEBUG(lock, "client lock convert END");
+	return rc;
+}
diff --git a/fs/lustre/ldlm/ldlm_internal.h b/fs/lustre/ldlm/ldlm_internal.h
index 96dff1d..ec68713 100644
--- a/fs/lustre/ldlm/ldlm_internal.h
+++ b/fs/lustre/ldlm/ldlm_internal.h
@@ -153,7 +153,9 @@ int ldlm_run_ast_work(struct ldlm_namespace *ns, struct list_head *rpc_list,
 #define ldlm_lock_remove_from_lru(lock) \
 		ldlm_lock_remove_from_lru_check(lock, ktime_set(0, 0))
 int ldlm_lock_remove_from_lru_nolock(struct ldlm_lock *lock);
+void ldlm_lock_add_to_lru_nolock(struct ldlm_lock *lock);
 void ldlm_lock_destroy_nolock(struct ldlm_lock *lock);
+void ldlm_grant_lock_with_skiplist(struct ldlm_lock *lock);
 
 /* ldlm_lockd.c */
 int ldlm_bl_to_thread_lock(struct ldlm_namespace *ns, struct ldlm_lock_desc *ld,
diff --git a/fs/lustre/ldlm/ldlm_lock.c b/fs/lustre/ldlm/ldlm_lock.c
index aa19b89..9847c43 100644
--- a/fs/lustre/ldlm/ldlm_lock.c
+++ b/fs/lustre/ldlm/ldlm_lock.c
@@ -241,7 +241,7 @@ int ldlm_lock_remove_from_lru_check(struct ldlm_lock *lock, ktime_t last_use)
 /**
  * Adds LDLM lock @lock to namespace LRU. Assumes LRU is already locked.
  */
-static void ldlm_lock_add_to_lru_nolock(struct ldlm_lock *lock)
+void ldlm_lock_add_to_lru_nolock(struct ldlm_lock *lock)
 {
 	struct ldlm_namespace *ns = ldlm_lock_to_ns(lock);
 
@@ -791,7 +791,8 @@ void ldlm_lock_decref_internal(struct ldlm_lock *lock, enum ldlm_mode mode)
 		    ldlm_bl_to_thread_lock(ns, NULL, lock) != 0)
 			ldlm_handle_bl_callback(ns, NULL, lock);
 	} else if (!lock->l_readers && !lock->l_writers &&
-		   !ldlm_is_no_lru(lock) && !ldlm_is_bl_ast(lock)) {
+		   !ldlm_is_no_lru(lock) && !ldlm_is_bl_ast(lock) &&
+		   !ldlm_is_converting(lock)) {
 		LDLM_DEBUG(lock, "add lock into lru list");
 
 		/* If this is a client-side namespace and this was the last
@@ -1648,6 +1649,13 @@ enum ldlm_error ldlm_lock_enqueue(struct ldlm_namespace *ns,
 	unlock_res_and_lock(lock);
 
 	ldlm_lock2desc(lock->l_blocking_lock, &d);
+	/* copy blocking lock ibits in cancel_bits as well,
+	 * new client may use them for lock convert and it is
+	 * important to use new field to convert locks from
+	 * new servers only
+	 */
+	d.l_policy_data.l_inodebits.cancel_bits =
+		lock->l_blocking_lock->l_policy_data.l_inodebits.bits;
 
 	rc = lock->l_blocking_ast(lock, &d, (void *)arg, LDLM_CB_BLOCKING);
 	LDLM_LOCK_RELEASE(lock->l_blocking_lock);
@@ -1896,6 +1904,7 @@ void ldlm_lock_cancel(struct ldlm_lock *lock)
 	 */
 	if (lock->l_readers || lock->l_writers) {
 		LDLM_ERROR(lock, "lock still has references");
+		unlock_res_and_lock(lock);
 		LBUG();
 	}
 
diff --git a/fs/lustre/ldlm/ldlm_lockd.c b/fs/lustre/ldlm/ldlm_lockd.c
index 481719b..b50a3f7 100644
--- a/fs/lustre/ldlm/ldlm_lockd.c
+++ b/fs/lustre/ldlm/ldlm_lockd.c
@@ -118,6 +118,24 @@ void ldlm_handle_bl_callback(struct ldlm_namespace *ns,
 	LDLM_DEBUG(lock, "client blocking AST callback handler");
 
 	lock_res_and_lock(lock);
+
+	/* set bits to cancel for this lock for possible lock convert */
+	if (lock->l_resource->lr_type == LDLM_IBITS) {
+	/* The lock description contains the blocking lock's policy,
+	 * and its cancel_bits is used to pass the conflicting bits.
+	 * NOTE: ld can be NULL, or non-NULL but zeroed, if passed
+	 * from ldlm_bl_thread_blwi(); the check below uses the bits
+	 * in ld to make sure it is a valid description.
+	 */
+		if (ld && ld->l_policy_data.l_inodebits.bits)
+			lock->l_policy_data.l_inodebits.cancel_bits =
+				ld->l_policy_data.l_inodebits.cancel_bits;
+	/* if there is no valid ld and the lock is already cbpending,
+	 * then cancel_bits should be kept; otherwise it is zeroed.
+	 */
+		else if (!ldlm_is_cbpending(lock))
+			lock->l_policy_data.l_inodebits.cancel_bits = 0;
+	}
 	ldlm_set_cbpending(lock);
 
 	if (ldlm_is_cancel_on_block(lock))
diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index 92e4f69..5ec0da5 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -818,6 +818,177 @@ int ldlm_cli_enqueue(struct obd_export *exp, struct ptlrpc_request **reqp,
 EXPORT_SYMBOL(ldlm_cli_enqueue);
 
 /**
+ * Client-side lock convert reply handling.
+ *
+ * Finishes client-side lock conversion, checks for concurrent converts,
+ * and clears the 'converting' flag so the lock can be placed back into LRU.
+ */
+static int lock_convert_interpret(const struct lu_env *env,
+				  struct ptlrpc_request *req,
+				  struct ldlm_async_args *aa, int rc)
+{
+	struct ldlm_lock *lock;
+	struct ldlm_reply *reply;
+
+	lock = ldlm_handle2lock(&aa->lock_handle);
+	if (!lock) {
+		LDLM_DEBUG_NOLOCK("convert ACK for unknown local cookie %#llx",
+			aa->lock_handle.cookie);
+		return -ESTALE;
+	}
+
+	LDLM_DEBUG(lock, "CONVERTED lock:");
+
+	if (rc != ELDLM_OK)
+		goto out;
+
+	reply = req_capsule_server_get(&req->rq_pill, &RMF_DLM_REP);
+	if (!reply) {
+		rc = -EPROTO;
+		goto out;
+	}
+
+	if (reply->lock_handle.cookie != aa->lock_handle.cookie) {
+		LDLM_ERROR(lock,
+			   "convert ACK with wrong lock cookie %#llx but cookie %#llx from server %s id %s\n",
+			   aa->lock_handle.cookie, reply->lock_handle.cookie,
+			   req->rq_export->exp_client_uuid.uuid,
+			   libcfs_id2str(req->rq_peer));
+		rc = -ESTALE;
+		goto out;
+	}
+
+	lock_res_and_lock(lock);
+	/* A lock convert is sent for any new bits to drop; the converting
+	 * flag is dropped when the ibits on the server are the same as on
+	 * the client. Meanwhile, a later convert may be replied to first
+	 * and clear the converting flag, so in case of such a race just
+	 * exit here if the lock no longer has the converting flag.
+	 */
+	if (!ldlm_is_converting(lock)) {
+		LDLM_DEBUG(lock,
+			   "convert ACK for lock without converting flag, reply ibits %#llx",
+			   reply->lock_desc.l_policy_data.l_inodebits.bits);
+	} else if (reply->lock_desc.l_policy_data.l_inodebits.bits !=
+		   lock->l_policy_data.l_inodebits.bits) {
+		/* Compare the server-returned lock ibits with the local
+		 * ones; if they are the same, the conversion is done,
+		 * otherwise more converts are in flight and we keep the
+		 * converting flag.
+		 */
+		LDLM_DEBUG(lock, "convert ACK with ibits %#llx\n",
+			   reply->lock_desc.l_policy_data.l_inodebits.bits);
+	} else {
+		ldlm_clear_converting(lock);
+
+		/* Concurrent BL AST has arrived, it may cause another convert
+		 * or cancel so just exit here.
+		 */
+		if (!ldlm_is_bl_ast(lock)) {
+			struct ldlm_namespace *ns = ldlm_lock_to_ns(lock);
+
+			/* Drop cancel_bits since there are no more converts
+			 * and put lock into LRU if it is not there yet.
+			 */
+			lock->l_policy_data.l_inodebits.cancel_bits = 0;
+			spin_lock(&ns->ns_lock);
+			if (!list_empty(&lock->l_lru))
+				ldlm_lock_remove_from_lru_nolock(lock);
+			ldlm_lock_add_to_lru_nolock(lock);
+			spin_unlock(&ns->ns_lock);
+		}
+	}
+	unlock_res_and_lock(lock);
+out:
+	if (rc) {
+		lock_res_and_lock(lock);
+		if (ldlm_is_converting(lock)) {
+			LASSERT(list_empty(&lock->l_lru));
+			ldlm_clear_converting(lock);
+			ldlm_set_cbpending(lock);
+			ldlm_set_bl_ast(lock);
+		}
+		unlock_res_and_lock(lock);
+	}
+
+	LDLM_LOCK_PUT(lock);
+	return rc;
+}
+
+/**
+ * Client-side IBITS lock convert.
+ *
+ * Informs the server that the lock has been converted instead of canceled.
+ * The server finishes the convert on its side and reprocesses the queue
+ * to grant all related waiting locks.
+ *
+ * Since a convert means only ibits downgrading, the client does not need
+ * to wait for the server reply to finish the local converting process,
+ * so this request is made asynchronous.
+ *
+ */
+int ldlm_cli_convert(struct ldlm_lock *lock, u32 *flags)
+{
+	struct ldlm_request *body;
+	struct ptlrpc_request *req;
+	struct ldlm_async_args *aa;
+	struct obd_export *exp = lock->l_conn_export;
+
+	if (!exp) {
+		LDLM_ERROR(lock, "convert must not be called on local locks.");
+		return -EINVAL;
+	}
+
+	if (lock->l_resource->lr_type != LDLM_IBITS) {
+		LDLM_ERROR(lock, "convert works with IBITS locks only.");
+		return -EINVAL;
+	}
+
+	LDLM_DEBUG(lock, "client-side convert");
+
+	req = ptlrpc_request_alloc_pack(class_exp2cliimp(exp),
+					&RQF_LDLM_CONVERT, LUSTRE_DLM_VERSION,
+					LDLM_CONVERT);
+	if (!req)
+		return -ENOMEM;
+
+	body = req_capsule_client_get(&req->rq_pill, &RMF_DLM_REQ);
+	body->lock_handle[0] = lock->l_remote_handle;
+
+	body->lock_desc.l_req_mode = lock->l_req_mode;
+	body->lock_desc.l_granted_mode = lock->l_granted_mode;
+
+	body->lock_desc.l_policy_data.l_inodebits.bits =
+					lock->l_policy_data.l_inodebits.bits;
+	body->lock_desc.l_policy_data.l_inodebits.cancel_bits = 0;
+
+	body->lock_flags = ldlm_flags_to_wire(*flags);
+	body->lock_count = 1;
+
+	ptlrpc_request_set_replen(req);
+
+	/* It could be useful to use the cancel portals for convert as well
+	 * as high-priority handling. This will require changes in
+	 * ldlm_cancel_handler to understand convert RPC as well.
+	 *
+	 * req->rq_request_portal = LDLM_CANCEL_REQUEST_PORTAL;
+	 * req->rq_reply_portal = LDLM_CANCEL_REPLY_PORTAL;
+	 */
+	ptlrpc_at_set_req_timeout(req);
+
+	if (exp->exp_obd->obd_svc_stats)
+		lprocfs_counter_incr(exp->exp_obd->obd_svc_stats,
+				     LDLM_CONVERT - LDLM_FIRST_OPC);
+
+	aa = ptlrpc_req_async_args(aa, req);
+	ldlm_lock2handle(lock, &aa->lock_handle);
+	req->rq_interpret_reply = (ptlrpc_interpterer_t)lock_convert_interpret;
+
+	ptlrpcd_add_req(req);
+	return 0;
+}
+
+/**
  * Cancel locks locally.
  *
  * Returns:	LDLM_FL_LOCAL_ONLY if there is no need for a CANCEL RPC
@@ -1057,6 +1228,19 @@ int ldlm_cli_cancel(const struct lustre_handle *lockh,
 		return 0;
 	}
 
+	/* Convert lock bits instead of cancel for IBITS locks */
+	if (cancel_flags & LCF_CONVERT) {
+		LASSERT(lock->l_resource->lr_type == LDLM_IBITS);
+		LASSERT(lock->l_policy_data.l_inodebits.cancel_bits != 0);
+
+		rc = ldlm_cli_dropbits(lock,
+				lock->l_policy_data.l_inodebits.cancel_bits);
+		if (rc == 0) {
+			LDLM_LOCK_RELEASE(lock);
+			return 0;
+		}
+	}
+
 	lock_res_and_lock(lock);
 	/* Lock is being canceled and the caller doesn't want to wait */
 	if (ldlm_is_canceling(lock)) {
@@ -1069,6 +1253,15 @@ int ldlm_cli_cancel(const struct lustre_handle *lockh,
 		return 0;
 	}
 
+	/* Lock is being converted, cancel it immediately.
+	 * When the convert ends, it releases the lock and it will be gone.
+	 */
+	if (ldlm_is_converting(lock)) {
+		/* set back flags removed by convert */
+		ldlm_set_cbpending(lock);
+		ldlm_set_bl_ast(lock);
+	}
+
 	ldlm_set_canceling(lock);
 	unlock_res_and_lock(lock);
 
@@ -1439,7 +1632,8 @@ static int ldlm_prepare_lru_list(struct ldlm_namespace *ns,
 			/* Somebody is already doing CANCEL. No need for this
 			 * lock in LRU, do not traverse it again.
 			 */
-			if (!ldlm_is_canceling(lock))
+			if (!ldlm_is_canceling(lock) ||
+			    !ldlm_is_converting(lock))
 				break;
 
 			ldlm_lock_remove_from_lru_nolock(lock);
@@ -1483,7 +1677,7 @@ static int ldlm_prepare_lru_list(struct ldlm_namespace *ns,
 
 		lock_res_and_lock(lock);
 		/* Check flags again under the lock. */
-		if (ldlm_is_canceling(lock) ||
+		if (ldlm_is_canceling(lock) || ldlm_is_converting(lock) ||
 		    (ldlm_lock_remove_from_lru_check(lock, last_use) == 0)) {
 			/* Another thread is removing lock from LRU, or
 			 * somebody is already doing CANCEL, or there
diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index 1b5e270..8b1a1ca 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -213,184 +213,261 @@ int ll_dom_lock_cancel(struct inode *inode, struct ldlm_lock *lock)
 	return rc;
 }
 
-int ll_md_blocking_ast(struct ldlm_lock *lock, struct ldlm_lock_desc *desc,
-		       void *data, int flag)
+void ll_lock_cancel_bits(struct ldlm_lock *lock, u64 to_cancel)
 {
-	struct lustre_handle lockh;
+	struct inode *inode = ll_inode_from_resource_lock(lock);
+	u64 bits = to_cancel;
 	int rc;
 
-	switch (flag) {
-	case LDLM_CB_BLOCKING:
-		ldlm_lock2handle(lock, &lockh);
-		rc = ldlm_cli_cancel(&lockh, LCF_ASYNC);
-		if (rc < 0) {
-			CDEBUG(D_INODE, "ldlm_cli_cancel: rc = %d\n", rc);
-			return rc;
-		}
-		break;
-	case LDLM_CB_CANCELING: {
-		struct inode *inode = ll_inode_from_resource_lock(lock);
-		u64 bits = lock->l_policy_data.l_inodebits.bits;
+	if (!inode)
+		return;
 
-		if (!inode)
-			break;
+	if (!fid_res_name_eq(ll_inode2fid(inode),
+			     &lock->l_resource->lr_name)) {
+		LDLM_ERROR(lock,
+			   "data mismatch with object " DFID "(%p)",
+			   PFID(ll_inode2fid(inode)), inode);
+		LBUG();
+	}
 
-		/* Invalidate all dentries associated with this inode */
-		LASSERT(ldlm_is_canceling(lock));
+	if (bits & MDS_INODELOCK_XATTR) {
+		if (S_ISDIR(inode->i_mode))
+			ll_i2info(inode)->lli_def_stripe_offset = -1;
+		ll_xattr_cache_destroy(inode);
+		bits &= ~MDS_INODELOCK_XATTR;
+	}
 
-		if (!fid_res_name_eq(ll_inode2fid(inode),
-				     &lock->l_resource->lr_name)) {
-			LDLM_ERROR(lock,
-				   "data mismatch with object " DFID "(%p)",
-				   PFID(ll_inode2fid(inode)), inode);
+	/* For OPEN locks we differentiate between lock modes
+	 * LCK_CR, LCK_CW, LCK_PR - bug 22891
+	 */
+	if (bits & MDS_INODELOCK_OPEN)
+		ll_have_md_lock(inode, &bits, lock->l_req_mode);
+
+	if (bits & MDS_INODELOCK_OPEN) {
+		fmode_t fmode;
+
+		switch (lock->l_req_mode) {
+		case LCK_CW:
+			fmode = FMODE_WRITE;
+			break;
+		case LCK_PR:
+			fmode = FMODE_EXEC;
+			break;
+		case LCK_CR:
+			fmode = FMODE_READ;
+			break;
+		default:
+			LDLM_ERROR(lock, "bad lock mode for OPEN lock");
 			LBUG();
 		}
 
-		if (bits & MDS_INODELOCK_XATTR) {
-			if (S_ISDIR(inode->i_mode))
-				ll_i2info(inode)->lli_def_stripe_offset = -1;
-			ll_xattr_cache_destroy(inode);
-			bits &= ~MDS_INODELOCK_XATTR;
-		}
+		ll_md_real_close(inode, fmode);
 
-		/* For OPEN locks we differentiate between lock modes
-		 * LCK_CR, LCK_CW, LCK_PR - bug 22891
-		 */
-		if (bits & MDS_INODELOCK_OPEN)
-			ll_have_md_lock(inode, &bits, lock->l_req_mode);
-
-		if (bits & MDS_INODELOCK_OPEN) {
-			fmode_t fmode;
-
-			switch (lock->l_req_mode) {
-			case LCK_CW:
-				fmode = FMODE_WRITE;
-				break;
-			case LCK_PR:
-				fmode = FMODE_EXEC;
-				break;
-			case LCK_CR:
-				fmode = FMODE_READ;
-				break;
-			default:
-				LDLM_ERROR(lock, "bad lock mode for OPEN lock");
-				LBUG();
-			}
+		bits &= ~MDS_INODELOCK_OPEN;
+	}
 
-			ll_md_real_close(inode, fmode);
-		}
+	if (bits & (MDS_INODELOCK_LOOKUP | MDS_INODELOCK_UPDATE |
+		    MDS_INODELOCK_LAYOUT | MDS_INODELOCK_PERM |
+		    MDS_INODELOCK_DOM))
+		ll_have_md_lock(inode, &bits, LCK_MINMODE);
+
+	if (bits & MDS_INODELOCK_DOM) {
+		rc = ll_dom_lock_cancel(inode, lock);
+		if (rc < 0)
+			CDEBUG(D_INODE, "cannot flush DoM data "
+			       DFID": rc = %d\n",
+			       PFID(ll_inode2fid(inode)), rc);
+		lock_res_and_lock(lock);
+		ldlm_set_kms_ignore(lock);
+		unlock_res_and_lock(lock);
+	}
 
-		if (bits & (MDS_INODELOCK_LOOKUP | MDS_INODELOCK_UPDATE |
-			    MDS_INODELOCK_LAYOUT | MDS_INODELOCK_PERM |
-			    MDS_INODELOCK_DOM))
-			ll_have_md_lock(inode, &bits, LCK_MINMODE);
-
-		if (bits & MDS_INODELOCK_DOM) {
-			rc =  ll_dom_lock_cancel(inode, lock);
-			if (rc < 0)
-				CDEBUG(D_INODE, "cannot flush DoM data "
-				       DFID": rc = %d\n",
-				       PFID(ll_inode2fid(inode)), rc);
-			lock_res_and_lock(lock);
-			ldlm_set_kms_ignore(lock);
-			unlock_res_and_lock(lock);
-			bits &= ~MDS_INODELOCK_DOM;
-		}
+	if (bits & MDS_INODELOCK_LAYOUT) {
+		struct cl_object_conf conf = {
+			.coc_opc = OBJECT_CONF_INVALIDATE,
+			.coc_inode = inode,
+		};
 
-		if (bits & MDS_INODELOCK_LAYOUT) {
-			struct cl_object_conf conf = {
-				.coc_opc = OBJECT_CONF_INVALIDATE,
-				.coc_inode = inode,
-			};
-
-			rc = ll_layout_conf(inode, &conf);
-			if (rc < 0)
-				CDEBUG(D_INODE, "cannot invalidate layout of "
-				       DFID ": rc = %d\n",
-				       PFID(ll_inode2fid(inode)), rc);
-		}
+		rc = ll_layout_conf(inode, &conf);
+		if (rc < 0)
+			CDEBUG(D_INODE, "cannot invalidate layout of "
+			       DFID ": rc = %d\n",
+			       PFID(ll_inode2fid(inode)), rc);
+	}
 
-		if (bits & MDS_INODELOCK_UPDATE) {
-			set_bit(LLIF_UPDATE_ATIME,
-				&ll_i2info(inode)->lli_flags);
-		}
+	if (bits & MDS_INODELOCK_UPDATE)
+		set_bit(LLIF_UPDATE_ATIME,
+			&ll_i2info(inode)->lli_flags);
 
-		if ((bits & MDS_INODELOCK_UPDATE) && S_ISDIR(inode->i_mode)) {
-			struct ll_inode_info *lli = ll_i2info(inode);
+	if ((bits & MDS_INODELOCK_UPDATE) && S_ISDIR(inode->i_mode)) {
+		struct ll_inode_info *lli = ll_i2info(inode);
 
-			CDEBUG(D_INODE,
-			       "invalidating inode " DFID " lli = %p, pfid  = " DFID "\n",
-			       PFID(ll_inode2fid(inode)), lli,
-			       PFID(&lli->lli_pfid));
+		CDEBUG(D_INODE,
+		       "invalidating inode "DFID" lli = %p, pfid  = "DFID"\n",
+		       PFID(ll_inode2fid(inode)),
+		       lli, PFID(&lli->lli_pfid));
+		truncate_inode_pages(inode->i_mapping, 0);
 
-			truncate_inode_pages(inode->i_mapping, 0);
+		if (unlikely(!fid_is_zero(&lli->lli_pfid))) {
+			struct inode *master_inode = NULL;
+			unsigned long hash;
 
-			if (unlikely(!fid_is_zero(&lli->lli_pfid))) {
-				struct inode *master_inode = NULL;
-				unsigned long hash;
+			/*
+			 * This is a slave inode; since all of the child
+			 * dentries are connected to the master inode, we
+			 * must invalidate the negative children on the master
+			 */
+			CDEBUG(D_INODE,
+			       "Invalidate s" DFID " m" DFID "\n",
+			       PFID(ll_inode2fid(inode)), PFID(&lli->lli_pfid));
 
-				/*
-				 * This is slave inode, since all of the child
-				 * dentry is connected on the master inode, so
-				 * we have to invalidate the negative children
-				 * on master inode
-				 */
-				CDEBUG(D_INODE,
-				       "Invalidate s" DFID " m" DFID "\n",
-				       PFID(ll_inode2fid(inode)),
-				       PFID(&lli->lli_pfid));
-
-				hash = cl_fid_build_ino(&lli->lli_pfid,
-							ll_need_32bit_api(ll_i2sbi(inode)));
-				/*
-				 * Do not lookup the inode with ilookup5,
-				 * otherwise it will cause dead lock,
-				 *
-				 * 1. Client1 send chmod req to the MDT0, then
-				 * on MDT0, it enqueues master and all of its
-				 * slaves lock, (mdt_attr_set() ->
-				 * mdt_lock_slaves()), after gets master and
-				 * stripe0 lock, it will send the enqueue req
-				 * (for stripe1) to MDT1, then MDT1 finds the
-				 * lock has been granted to client2. Then MDT1
-				 * sends blocking ast to client2.
-				 *
-				 * 2. At the same time, client2 tries to unlink
-				 * the striped dir (rm -rf striped_dir), and
-				 * during lookup, it will hold the master inode
-				 * of the striped directory, whose inode state
-				 * is NEW, then tries to revalidate all of its
-				 * slaves, (ll_prep_inode()->ll_iget()->
-				 * ll_read_inode2()-> ll_update_inode().). And
-				 * it will be blocked on the server side because
-				 * of 1.
-				 *
-				 * 3. Then the client get the blocking_ast req,
-				 * cancel the lock, but being blocked if using
-				 * ->ilookup5()), because master inode state is
-				 *  NEW.
-				 */
-				master_inode = ilookup5_nowait(inode->i_sb,
-							       hash,
-							       ll_test_inode_by_fid,
-							       (void *)&lli->lli_pfid);
-				if (master_inode) {
-					ll_invalidate_negative_children(master_inode);
-					iput(master_inode);
-				}
-			} else {
-				ll_invalidate_negative_children(inode);
+			hash = cl_fid_build_ino(&lli->lli_pfid,
+						ll_need_32bit_api(
+							ll_i2sbi(inode)));
+			/*
+			 * Do not lookup the inode with ilookup5, otherwise
+			 * it will cause dead lock,
+			 * 1. Client1 send chmod req to the MDT0, then on MDT0,
+			 * it enqueues master and all of its slaves lock,
+			 * (mdt_attr_set() -> mdt_lock_slaves()), after gets
+			 * master and stripe0 lock, it will send the enqueue
+			 * req (for stripe1) to MDT1, then MDT1 finds the lock
+			 * has been granted to client2. Then MDT1 sends blocking
+			 * ast to client2.
+			 * 2. At the same time, client2 tries to unlink
+			 * the striped dir (rm -rf striped_dir), and during
+			 * lookup, it will hold the master inode of the striped
+			 * directory, whose inode state is NEW, then tries to
+			 * revalidate all of its slaves, (ll_prep_inode()->
+			 * ll_iget()->ll_read_inode2()-> ll_update_inode().).
+			 * And it will be blocked on the server side because
+			 * of 1.
+			 * 3. Then the client gets the blocking_ast req and
+			 * cancels the lock, but is blocked if using
+			 * ->ilookup5(), because the master inode state is NEW.
+			 */
+			master_inode = ilookup5_nowait(inode->i_sb, hash,
+							ll_test_inode_by_fid,
+							(void *)&lli->lli_pfid);
+			if (master_inode) {
+				ll_invalidate_negative_children(master_inode);
+				iput(master_inode);
 			}
+		} else {
+			ll_invalidate_negative_children(inode);
 		}
+	}
 
-		if ((bits & (MDS_INODELOCK_LOOKUP | MDS_INODELOCK_PERM)) &&
-		    inode->i_sb->s_root &&
-		    !is_root_inode(inode))
-			ll_invalidate_aliases(inode);
+	if ((bits & (MDS_INODELOCK_LOOKUP | MDS_INODELOCK_PERM)) &&
+	    inode->i_sb->s_root &&
+	    !is_root_inode(inode))
+		ll_invalidate_aliases(inode);
 
-		iput(inode);
+	iput(inode);
+}
+
+/* Check whether the given lock may be downgraded instead of canceled and
+ * whether a convert is really needed.
+ */
+int ll_md_need_convert(struct ldlm_lock *lock)
+{
+	struct inode *inode;
+	u64 wanted = lock->l_policy_data.l_inodebits.cancel_bits;
+	u64 bits = lock->l_policy_data.l_inodebits.bits & ~wanted;
+	enum ldlm_mode mode = LCK_MINMODE;
+
+	if (!wanted || !bits || ldlm_is_cancel(lock))
+		return 0;
+
+	/* do not convert locks other than DOM for now */
+	if (!((bits | wanted) & MDS_INODELOCK_DOM))
+		return 0;
+
+	/* The remaining bits may already be covered by some other lock,
+	 * so a convert would just leave us an extra lock for the same bits.
+	 * Check whether the client has another lock with the same bits and
+	 * the same or lower mode, and don't convert if so.
+	 */
+	switch (lock->l_req_mode) {
+	case LCK_PR:
+		mode = LCK_PR;
+		/* fall-through */
+	case LCK_PW:
+		mode |= LCK_CR;
+		break;
+	case LCK_CW:
+		mode = LCK_CW;
+		/* fall-through */
+	case LCK_CR:
+		mode |= LCK_CR;
 		break;
+	default:
+		/* do not convert other modes */
+		return 0;
 	}
+
+	/* is the lock too old to be converted? */
+	lock_res_and_lock(lock);
+	if (ktime_after(ktime_get(),
+			ktime_add(lock->l_last_used,
+				  ktime_set(10, 0)))) {
+		unlock_res_and_lock(lock);
+		return 0;
+	}
+	unlock_res_and_lock(lock);
+
+	inode = ll_inode_from_resource_lock(lock);
+	ll_have_md_lock(inode, &bits, mode);
+	iput(inode);
+	return !!(bits);
+}
+
+int ll_md_blocking_ast(struct ldlm_lock *lock, struct ldlm_lock_desc *desc,
+		       void *data, int flag)
+{
+	struct lustre_handle lockh;
+	u64 bits = lock->l_policy_data.l_inodebits.bits;
+	int rc;
+
+	switch (flag) {
+	case LDLM_CB_BLOCKING:
+	{
+		u64 cancel_flags = LCF_ASYNC;
+
+		if (ll_md_need_convert(lock)) {
+			cancel_flags |= LCF_CONVERT;
+			/* For lock convert some cancel actions may require
+			 * this lock with non-dropped canceled bits, e.g. page
+			 * flush for DOM lock. So call ll_lock_cancel_bits()
+			 * here while canceled bits are still set.
+			 */
+			bits = lock->l_policy_data.l_inodebits.cancel_bits;
+			if (bits & MDS_INODELOCK_DOM)
+				ll_lock_cancel_bits(lock, MDS_INODELOCK_DOM);
+		}
+		ldlm_lock2handle(lock, &lockh);
+		rc = ldlm_cli_cancel(&lockh, cancel_flags);
+		if (rc < 0) {
+			CDEBUG(D_INODE, "ldlm_cli_cancel: rc = %d\n", rc);
+			return rc;
+		}
+		break;
+	}
+	case LDLM_CB_CANCELING:
+		if (ldlm_is_converting(lock)) {
+			/* this is called on an already converted lock, so
+			 * ibits holds only the remaining bits and
+			 * cancel_bits holds the bits that were dropped.
+			 * Note that the DOM lock is handled prior to lock
+			 * convert and is excluded here.
+			 */
+			bits = lock->l_policy_data.l_inodebits.cancel_bits &
+				~MDS_INODELOCK_DOM;
+		} else {
+			LASSERT(ldlm_is_canceling(lock));
+		}
+		ll_lock_cancel_bits(lock, bits);
+		break;
 	default:
 		LBUG();
 	}
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index c92663b..b14d301c 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -3027,7 +3027,7 @@ void lustre_assert_wire_constants(void)
 		 (long long)(int)sizeof(((struct ldlm_extent *)0)->gid));
 
 	/* Checks for struct ldlm_inodebits */
-	LASSERTF((int)sizeof(struct ldlm_inodebits) == 8, "found %lld\n",
+	LASSERTF((int)sizeof(struct ldlm_inodebits) == 16, "found %lld\n",
 		 (long long)(int)sizeof(struct ldlm_inodebits));
 	LASSERTF((int)offsetof(struct ldlm_inodebits, bits) == 0, "found %lld\n",
 		 (long long)(int)offsetof(struct ldlm_inodebits, bits));
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 794e6d6..2403b89 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -2120,6 +2120,7 @@ static inline bool ldlm_extent_equal(const struct ldlm_extent *ex1,
 
 struct ldlm_inodebits {
 	__u64 bits;
+	__u64 cancel_bits; /* for lock convert */
 };
 
 struct ldlm_flock_wire {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 039/622] lustre: ptlrpc: fix return type of boolean functions
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (37 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 038/622] lustre: ldlm: IBITS lock convert instead of cancel James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 040/622] lustre: llite: decrease sa_running if fail to start statahead James Simmons
                   ` (583 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Some functions are returning type int with values 0 or 1 when
they could be returning bool.  Fix up the return type of:

   lustre_req_swabbed()
   lustre_rep_swabbed()
   ptlrpc_req_need_swab()
   ptlrpc_rep_need_swab()
   ptlrpc_buf_need_swab()

WC-bug-id: https://jira.whamcloud.com/browse/LU-1644
Lustre-commit: e2cac9fb9baf ("LU-1644 ptlrpc: fix return type of boolean functions")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32088
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_net.h  | 20 ++++++++++----------
 fs/lustre/ptlrpc/pack_generic.c |  9 ++++-----
 fs/lustre/ptlrpc/sec_plain.c    |  7 +++----
 3 files changed, 17 insertions(+), 19 deletions(-)

diff --git a/fs/lustre/include/lustre_net.h b/fs/lustre/include/lustre_net.h
index 961b8cb..0231011 100644
--- a/fs/lustre/include/lustre_net.h
+++ b/fs/lustre/include/lustre_net.h
@@ -953,35 +953,35 @@ static inline bool ptlrpc_nrs_req_can_move(struct ptlrpc_request *req)
 /** @} nrs */
 
 /**
- * Returns 1 if request buffer at offset @index was already swabbed
+ * Returns true if request buffer at offset @index was already swabbed
  */
-static inline int lustre_req_swabbed(struct ptlrpc_request *req, size_t index)
+static inline bool lustre_req_swabbed(struct ptlrpc_request *req, size_t index)
 {
 	LASSERT(index < sizeof(req->rq_req_swab_mask) * 8);
 	return req->rq_req_swab_mask & (1 << index);
 }
 
 /**
- * Returns 1 if request reply buffer at offset @index was already swabbed
+ * Returns true if request reply buffer at offset @index was already swabbed
  */
-static inline int lustre_rep_swabbed(struct ptlrpc_request *req, size_t index)
+static inline bool lustre_rep_swabbed(struct ptlrpc_request *req, size_t index)
 {
 	LASSERT(index < sizeof(req->rq_rep_swab_mask) * 8);
 	return req->rq_rep_swab_mask & (1 << index);
 }
 
 /**
- * Returns 1 if request needs to be swabbed into local cpu byteorder
+ * Returns true if request needs to be swabbed into local cpu byteorder
  */
-static inline int ptlrpc_req_need_swab(struct ptlrpc_request *req)
+static inline bool ptlrpc_req_need_swab(struct ptlrpc_request *req)
 {
 	return lustre_req_swabbed(req, MSG_PTLRPC_HEADER_OFF);
 }
 
 /**
- * Returns 1 if request reply needs to be swabbed into local cpu byteorder
+ * Returns true if request reply needs to be swabbed into local cpu byteorder
  */
-static inline int ptlrpc_rep_need_swab(struct ptlrpc_request *req)
+static inline bool ptlrpc_rep_need_swab(struct ptlrpc_request *req)
 {
 	return lustre_rep_swabbed(req, MSG_PTLRPC_HEADER_OFF);
 }
@@ -1999,8 +1999,8 @@ struct ptlrpc_service *ptlrpc_register_service(struct ptlrpc_service_conf *conf,
  *
  * @{
  */
-int ptlrpc_buf_need_swab(struct ptlrpc_request *req, const int inout,
-			 u32 index);
+bool ptlrpc_buf_need_swab(struct ptlrpc_request *req, const int inout,
+			  u32 index);
 void ptlrpc_buf_set_swabbed(struct ptlrpc_request *req, const int inout,
 			    u32 index);
 int ptlrpc_unpack_rep_msg(struct ptlrpc_request *req, int len);
diff --git a/fs/lustre/ptlrpc/pack_generic.c b/fs/lustre/ptlrpc/pack_generic.c
index bc5e513..9cea826 100644
--- a/fs/lustre/ptlrpc/pack_generic.c
+++ b/fs/lustre/ptlrpc/pack_generic.c
@@ -78,15 +78,14 @@ void ptlrpc_buf_set_swabbed(struct ptlrpc_request *req, const int inout,
 		lustre_set_rep_swabbed(req, index);
 }
 
-int ptlrpc_buf_need_swab(struct ptlrpc_request *req, const int inout,
-			 u32 index)
+bool ptlrpc_buf_need_swab(struct ptlrpc_request *req, const int inout,
+			  u32 index)
 {
 	if (inout)
 		return (ptlrpc_req_need_swab(req) &&
 			!lustre_req_swabbed(req, index));
-	else
-		return (ptlrpc_rep_need_swab(req) &&
-			!lustre_rep_swabbed(req, index));
+
+	return (ptlrpc_rep_need_swab(req) && !lustre_rep_swabbed(req, index));
 }
 
 /* early reply size */
diff --git a/fs/lustre/ptlrpc/sec_plain.c b/fs/lustre/ptlrpc/sec_plain.c
index 2358c3f..93a9a17 100644
--- a/fs/lustre/ptlrpc/sec_plain.c
+++ b/fs/lustre/ptlrpc/sec_plain.c
@@ -217,7 +217,7 @@ int plain_ctx_verify(struct ptlrpc_cli_ctx *ctx, struct ptlrpc_request *req)
 	struct lustre_msg *msg = req->rq_repdata;
 	struct plain_header *phdr;
 	u32 cksum;
-	int swabbed;
+	bool swabbed;
 
 	if (msg->lm_bufcount != PLAIN_PACK_SEGMENTS) {
 		CERROR("unexpected reply buf count %u\n", msg->lm_bufcount);
@@ -715,12 +715,11 @@ int plain_enlarge_reqbuf(struct ptlrpc_sec *sec,
 	.sc_policy	= &plain_policy,
 };
 
-static
-int plain_accept(struct ptlrpc_request *req)
+static int plain_accept(struct ptlrpc_request *req)
 {
 	struct lustre_msg *msg = req->rq_reqbuf;
 	struct plain_header *phdr;
-	int swabbed;
+	bool swabbed;
 
 	LASSERT(SPTLRPC_FLVR_POLICY(req->rq_flvr.sf_rpc) ==
 		SPTLRPC_POLICY_PLAIN);
-- 
1.8.3.1


* [lustre-devel] [PATCH 040/622] lustre: llite: decrease sa_running if fail to start statahead
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (38 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 039/622] lustre: ptlrpc: fix return type of boolean functions James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 041/622] lustre: lmv: dir page is released while in use James Simmons
                   ` (582 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Fan Yong <fan.yong@intel.com>

Otherwise the counter ll_sb_info::ll_sa_running will leak, and
the umount process will be blocked forever.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10992
Lustre-commit: 6b8638bf7920 ("LU-10992 llite: decrease sa_running if fail to start statahead")
Signed-off-by: Fan Yong <fan.yong@intel.com>
Reviewed-on: https://review.whamcloud.com/32287
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/statahead.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/lustre/llite/statahead.c b/fs/lustre/llite/statahead.c
index 4a61dac..122b9d8 100644
--- a/fs/lustre/llite/statahead.c
+++ b/fs/lustre/llite/statahead.c
@@ -1566,6 +1566,7 @@ static int start_statahead_thread(struct inode *dir, struct dentry *dentry)
 		spin_lock(&lli->lli_sa_lock);
 		lli->lli_sai = NULL;
 		spin_unlock(&lli->lli_sa_lock);
+		atomic_dec(&ll_i2sbi(parent->d_inode)->ll_sa_running);
 		rc = PTR_ERR(task);
 		CERROR("can't start ll_sa thread, rc : %d\n", rc);
 		goto out;
-- 
1.8.3.1


* [lustre-devel] [PATCH 041/622] lustre: lmv: dir page is released while in use
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (39 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 040/622] lustre: llite: decrease sa_running if fail to start statahead James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 042/622] lustre: ldlm: speed up preparation for list of lock cancel James Simmons
                   ` (581 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

When popping a stripe dirent, if it reaches the page end,
stripe_dirent_next() releases the current page and then reads the next
one, but the current dirent is still in use, which causes wrong values
to be used and triggers an assertion.

This patch changes the code to not read the next page upon reaching the
end, but to leave that to the next dirent read.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9857
Lustre-commit: b51e8d6b53a3 ("LU-9857 lmv: dir page is released while in use")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32180
Reviewed-by: Fan Yong <fan.yong@intel.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/lmv/lmv_obd.c | 123 +++++++++++++++++++++++-------------------------
 1 file changed, 60 insertions(+), 63 deletions(-)

diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index d0f626f..c7bf8c7 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -2016,7 +2016,7 @@ struct lmv_dir_ctxt {
 	struct stripe_dirent	 ldc_stripes[0];
 };
 
-static inline void put_stripe_dirent(struct stripe_dirent *stripe)
+static inline void stripe_dirent_unload(struct stripe_dirent *stripe)
 {
 	if (stripe->sd_page) {
 		kunmap(stripe->sd_page);
@@ -2031,62 +2031,77 @@ static inline void put_lmv_dir_ctxt(struct lmv_dir_ctxt *ctxt)
 	int i;
 
 	for (i = 0; i < ctxt->ldc_count; i++)
-		put_stripe_dirent(&ctxt->ldc_stripes[i]);
+		stripe_dirent_unload(&ctxt->ldc_stripes[i]);
 }
 
-static struct lu_dirent *stripe_dirent_next(struct lmv_dir_ctxt *ctxt,
+/* if @ent is dummy, or . .., get next */
+static struct lu_dirent *stripe_dirent_get(struct lmv_dir_ctxt *ctxt,
+					   struct lu_dirent *ent,
+					   int stripe_index)
+{
+	for (; ent; ent = lu_dirent_next(ent)) {
+		/* Skip dummy entry */
+		if (le16_to_cpu(ent->lde_namelen) == 0)
+			continue;
+
+		/* skip . and .. for other stripes */
+		if (stripe_index &&
+		    (strncmp(ent->lde_name, ".",
+			     le16_to_cpu(ent->lde_namelen)) == 0 ||
+		     strncmp(ent->lde_name, "..",
+			     le16_to_cpu(ent->lde_namelen)) == 0))
+			continue;
+
+		if (le64_to_cpu(ent->lde_hash) >= ctxt->ldc_hash)
+			break;
+	}
+
+	return ent;
+}
+
+static struct lu_dirent *stripe_dirent_load(struct lmv_dir_ctxt *ctxt,
 					    struct stripe_dirent *stripe,
 					    int stripe_index)
 {
+	struct md_op_data *op_data = ctxt->ldc_op_data;
+	struct lmv_oinfo *oinfo;
+	struct lu_fid fid = op_data->op_fid1;
+	struct inode *inode = op_data->op_data;
+	struct lmv_tgt_desc *tgt;
 	struct lu_dirent *ent = stripe->sd_ent;
 	u64 hash = ctxt->ldc_hash;
-	u64 end;
 	int rc = 0;
 
 	LASSERT(stripe == &ctxt->ldc_stripes[stripe_index]);
-
-	if (stripe->sd_eof)
-		return NULL;
-
-	if (ent) {
-		ent = lu_dirent_next(ent);
-		if (!ent) {
-check_eof:
-			end = le64_to_cpu(stripe->sd_dp->ldp_hash_end);
-
-			LASSERTF(hash <= end, "hash %llx end %llx\n",
-				 hash, end);
+	LASSERT(!ent);
+
+	do {
+		if (stripe->sd_page) {
+			u64 end = le64_to_cpu(stripe->sd_dp->ldp_hash_end);
+
+			/* @hash should be the last dirent hash */
+			LASSERTF(hash <= end,
+				 "ctxt@%p stripe@%p hash %llx end %llx\n",
+				 ctxt, stripe, hash, end);
+			/* unload last page */
+			stripe_dirent_unload(stripe);
+			/* eof */
 			if (end == MDS_DIR_END_OFF) {
 				stripe->sd_ent = NULL;
 				stripe->sd_eof = true;
-				return NULL;
+				break;
 			}
-
-			put_stripe_dirent(stripe);
 			hash = end;
 		}
-	}
-
-	if (!ent) {
-		struct md_op_data *op_data = ctxt->ldc_op_data;
-		struct lmv_oinfo *oinfo;
-		struct lu_fid fid = op_data->op_fid1;
-		struct inode *inode = op_data->op_data;
-		struct lmv_tgt_desc *tgt;
-
-		LASSERT(!stripe->sd_page);
 
 		oinfo = &op_data->op_mea1->lsm_md_oinfo[stripe_index];
 		tgt = lmv_get_target(ctxt->ldc_lmv, oinfo->lmo_mds, NULL);
 		if (IS_ERR(tgt)) {
 			rc = PTR_ERR(tgt);
-			goto out;
+			break;
 		}
 
-		/*
-		 * op_data will be shared by each stripe, so we need
-		 * reset these value for each stripe
-		 */
+		/* op_data is shared by stripes, reset after use */
 		op_data->op_fid1 = oinfo->lmo_fid;
 		op_data->op_fid2 = oinfo->lmo_fid;
 		op_data->op_data = oinfo->lmo_root;
@@ -2099,42 +2114,24 @@ static struct lu_dirent *stripe_dirent_next(struct lmv_dir_ctxt *ctxt,
 		op_data->op_data = inode;
 
 		if (rc)
-			goto out;
-
-		stripe->sd_dp = page_address(stripe->sd_page);
-		ent = lu_dirent_start(stripe->sd_dp);
-	}
-
-	for (; ent; ent = lu_dirent_next(ent)) {
-		/* Skip dummy entry */
-		if (!le16_to_cpu(ent->lde_namelen))
-			continue;
-
-		/* skip . and .. for other stripes */
-		if (stripe_index &&
-		    (strncmp(ent->lde_name, ".",
-			     le16_to_cpu(ent->lde_namelen)) == 0 ||
-		     strncmp(ent->lde_name, "..",
-			     le16_to_cpu(ent->lde_namelen)) == 0))
-			continue;
-
-		if (le64_to_cpu(ent->lde_hash) >= hash)
 			break;
-	}
 
-	if (!ent)
-		goto check_eof;
+		stripe->sd_dp = page_address(stripe->sd_page);
+		ent = stripe_dirent_get(ctxt, lu_dirent_start(stripe->sd_dp),
+					stripe_index);
+		/* in case a page filled with ., .. and dummy, read next */
+	} while (!ent);
 
-out:
 	stripe->sd_ent = ent;
-	/* treat error as eof, so dir can be partially accessed */
 	if (rc) {
-		put_stripe_dirent(stripe);
+		LASSERT(!ent);
+		/* treat error as eof, so dir can be partially accessed */
 		stripe->sd_eof = true;
 		LCONSOLE_WARN("dir " DFID " stripe %d readdir failed: %d, directory is partially accessed!\n",
 			      PFID(&ctxt->ldc_op_data->op_fid1), stripe_index,
 			      rc);
 	}
+
 	return ent;
 }
 
@@ -2186,8 +2183,7 @@ static struct lu_dirent *lmv_dirent_next(struct lmv_dir_ctxt *ctxt)
 			continue;
 
 		if (!stripe->sd_ent) {
-			/* locate starting entry */
-			stripe_dirent_next(ctxt, stripe, i);
+			stripe_dirent_load(ctxt, stripe, i);
 			if (!stripe->sd_ent) {
 				LASSERT(stripe->sd_eof);
 				continue;
@@ -2208,7 +2204,8 @@ static struct lu_dirent *lmv_dirent_next(struct lmv_dir_ctxt *ctxt)
 		stripe = &ctxt->ldc_stripes[min];
 		ent = stripe->sd_ent;
 		/* pop found dirent */
-		stripe_dirent_next(ctxt, stripe, min);
+		stripe->sd_ent = stripe_dirent_get(ctxt, lu_dirent_next(ent),
+						   min);
 	}
 
 	return ent;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 042/622] lustre: ldlm: speed up preparation for list of lock cancel
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (40 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 041/622] lustre: lmv: dir page is released while in use James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 043/622] lustre: checksum: enable/disable checksum correctly James Simmons
                   ` (580 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Yang Sheng <ys@whamcloud.com>

Keeping the skipped locks in the LRU list causes serious
contention on ns_lock, since we have to traverse them
every time in ldlm_prepare_lru_list(). So we use a
cursor to record the position of the last accessed
lock in the LRU list.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9230
Lustre-commit: 651f2cdd2d8d ("LU-9230 ldlm: speed up preparation for list of lock cancel")
Signed-off-by: Yang Sheng <ys@whamcloud.com>
Signed-off-by: Sergey Cheremencev <c17829@cray.com>
Reviewed-on: https://review.whamcloud.com/26327
Reviewed-by: Fan Yong <fan.yong@intel.com>
Reviewed-by: Vitaly Fertman <c17818@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_dlm.h       |  1 +
 fs/lustre/include/lustre_dlm_flags.h |  9 -----
 fs/lustre/ldlm/ldlm_lock.c           |  3 +-
 fs/lustre/ldlm/ldlm_request.c        | 72 ++++++++++++++++--------------------
 fs/lustre/ldlm/ldlm_resource.c       |  1 +
 5 files changed, 35 insertions(+), 51 deletions(-)

diff --git a/fs/lustre/include/lustre_dlm.h b/fs/lustre/include/lustre_dlm.h
index 66608a9..1a19b35 100644
--- a/fs/lustre/include/lustre_dlm.h
+++ b/fs/lustre/include/lustre_dlm.h
@@ -406,6 +406,7 @@ struct ldlm_namespace {
 	struct list_head	ns_unused_list;
 	/** Number of locks in the LRU list above */
 	int			ns_nr_unused;
+	struct list_head	*ns_last_pos;
 
 	/**
 	 * Maximum number of locks permitted in the LRU. If 0, means locks
diff --git a/fs/lustre/include/lustre_dlm_flags.h b/fs/lustre/include/lustre_dlm_flags.h
index c8667c8..3d69c49 100644
--- a/fs/lustre/include/lustre_dlm_flags.h
+++ b/fs/lustre/include/lustre_dlm_flags.h
@@ -200,15 +200,6 @@
 #define ldlm_set_fail_loc(_l)		LDLM_SET_FLAG((_l), 1ULL << 32)
 #define ldlm_clear_fail_loc(_l)		LDLM_CLEAR_FLAG((_l), 1ULL << 32)
 
-/**
- * Used while processing the unused list to know that we have already
- * handled this lock and decided to skip it.
- */
-#define LDLM_FL_SKIPPED			0x0000000200000000ULL /* bit 33 */
-#define ldlm_is_skipped(_l)		LDLM_TEST_FLAG((_l), 1ULL << 33)
-#define ldlm_set_skipped(_l)		LDLM_SET_FLAG((_l), 1ULL << 33)
-#define ldlm_clear_skipped(_l)		LDLM_CLEAR_FLAG((_l), 1ULL << 33)
-
 /** this lock is being destroyed */
 #define LDLM_FL_CBPENDING		0x0000000400000000ULL /* bit 34 */
 #define ldlm_is_cbpending(_l)		LDLM_TEST_FLAG((_l), 1ULL << 34)
diff --git a/fs/lustre/ldlm/ldlm_lock.c b/fs/lustre/ldlm/ldlm_lock.c
index 9847c43..894b99b 100644
--- a/fs/lustre/ldlm/ldlm_lock.c
+++ b/fs/lustre/ldlm/ldlm_lock.c
@@ -204,6 +204,8 @@ int ldlm_lock_remove_from_lru_nolock(struct ldlm_lock *lock)
 		struct ldlm_namespace *ns = ldlm_lock_to_ns(lock);
 
 		LASSERT(lock->l_resource->lr_type != LDLM_FLOCK);
+		if (ns->ns_last_pos == &lock->l_lru)
+			ns->ns_last_pos = lock->l_lru.prev;
 		list_del_init(&lock->l_lru);
 		LASSERT(ns->ns_nr_unused > 0);
 		ns->ns_nr_unused--;
@@ -249,7 +251,6 @@ void ldlm_lock_add_to_lru_nolock(struct ldlm_lock *lock)
 	LASSERT(list_empty(&lock->l_lru));
 	LASSERT(lock->l_resource->lr_type != LDLM_FLOCK);
 	list_add_tail(&lock->l_lru, &ns->ns_unused_list);
-	ldlm_clear_skipped(lock);
 	LASSERT(ns->ns_nr_unused >= 0);
 	ns->ns_nr_unused++;
 }
diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index 5ec0da5..dd4d958 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -1368,9 +1368,6 @@ int ldlm_cli_cancel_list_local(struct list_head *cancels, int count,
 		/* fall through */
 	default:
 		result = LDLM_POLICY_SKIP_LOCK;
-		lock_res_and_lock(lock);
-		ldlm_set_skipped(lock);
-		unlock_res_and_lock(lock);
 		break;
 	}
 
@@ -1592,54 +1589,47 @@ static int ldlm_prepare_lru_list(struct ldlm_namespace *ns,
 				 int flags)
 {
 	ldlm_cancel_lru_policy_t pf;
-	struct ldlm_lock *lock, *next;
-	int added = 0, unused, remained;
+	int added = 0;
 	int no_wait = flags & LDLM_LRU_FLAG_NO_WAIT;
 
-	spin_lock(&ns->ns_lock);
-	unused = ns->ns_nr_unused;
-	remained = unused;
-
 	if (!ns_connect_lru_resize(ns))
-		count += unused - ns->ns_max_unused;
+		count += ns->ns_nr_unused - ns->ns_max_unused;
 
 	pf = ldlm_cancel_lru_policy(ns, flags);
 	LASSERT(pf);
 
-	while (!list_empty(&ns->ns_unused_list)) {
+	/* For any flags, stop scanning if @max is reached. */
+	while (!list_empty(&ns->ns_unused_list) && (max == 0 || added < max)) {
+		struct ldlm_lock *lock;
+		struct list_head *item, *next;
 		enum ldlm_policy_res result;
 		ktime_t last_use = ktime_set(0, 0);
 
-		/* all unused locks */
-		if (remained-- <= 0)
-			break;
-
-		/* For any flags, stop scanning if @max is reached. */
-		if (max && added >= max)
-			break;
+		spin_lock(&ns->ns_lock);
+		item = no_wait ? ns->ns_last_pos : &ns->ns_unused_list;
+		for (item = item->next, next = item->next;
+		     item != &ns->ns_unused_list;
+		     item = next, next = item->next) {
+			lock = list_entry(item, struct ldlm_lock, l_lru);
 
-		list_for_each_entry_safe(lock, next, &ns->ns_unused_list,
-					 l_lru) {
 			/* No locks which got blocking requests. */
 			LASSERT(!ldlm_is_bl_ast(lock));
 
-			if (no_wait && ldlm_is_skipped(lock))
-				/* already processed */
-				continue;
-
-			last_use = lock->l_last_used;
-
-			/* Somebody is already doing CANCEL. No need for this
-			 * lock in LRU, do not traverse it again.
-			 */
 			if (!ldlm_is_canceling(lock) ||
 			    !ldlm_is_converting(lock))
 				break;
 
+			/* Somebody is already doing CANCEL. No need for this
+			 * lock in LRU, do not traverse it again.
+			 */
 			ldlm_lock_remove_from_lru_nolock(lock);
 		}
-		if (&lock->l_lru == &ns->ns_unused_list)
+		if (item == &ns->ns_unused_list) {
+			spin_unlock(&ns->ns_lock);
 			break;
+		}
+
+		last_use = lock->l_last_used;
 
 		LDLM_LOCK_GET(lock);
 		spin_unlock(&ns->ns_lock);
@@ -1659,19 +1649,23 @@ static int ldlm_prepare_lru_list(struct ldlm_namespace *ns,
 		 * their weight. Big extent locks will stay in
 		 * the cache.
 		 */
-		result = pf(ns, lock, unused, added, count);
+		result = pf(ns, lock, ns->ns_nr_unused, added, count);
 		if (result == LDLM_POLICY_KEEP_LOCK) {
-			lu_ref_del(&lock->l_reference,
-				   __func__, current);
+			lu_ref_del(&lock->l_reference, __func__, current);
 			LDLM_LOCK_RELEASE(lock);
-			spin_lock(&ns->ns_lock);
 			break;
 		}
+
 		if (result == LDLM_POLICY_SKIP_LOCK) {
-			lu_ref_del(&lock->l_reference,
-				   __func__, current);
+			lu_ref_del(&lock->l_reference, __func__, current);
 			LDLM_LOCK_RELEASE(lock);
-			spin_lock(&ns->ns_lock);
+			if (no_wait) {
+				spin_lock(&ns->ns_lock);
+				if (!list_empty(&lock->l_lru) &&
+				    lock->l_lru.prev == ns->ns_last_pos)
+					ns->ns_last_pos = &lock->l_lru;
+				spin_unlock(&ns->ns_lock);
+			}
 			continue;
 		}
 
@@ -1690,7 +1684,6 @@ static int ldlm_prepare_lru_list(struct ldlm_namespace *ns,
 			lu_ref_del(&lock->l_reference,
 				   __func__, current);
 			LDLM_LOCK_RELEASE(lock);
-			spin_lock(&ns->ns_lock);
 			continue;
 		}
 		LASSERT(!lock->l_readers && !lock->l_writers);
@@ -1728,11 +1721,8 @@ static int ldlm_prepare_lru_list(struct ldlm_namespace *ns,
 		list_add(&lock->l_bl_ast, cancels);
 		unlock_res_and_lock(lock);
 		lu_ref_del(&lock->l_reference, __func__, current);
-		spin_lock(&ns->ns_lock);
 		added++;
-		unused--;
 	}
-	spin_unlock(&ns->ns_lock);
 	return added;
 }
 
diff --git a/fs/lustre/ldlm/ldlm_resource.c b/fs/lustre/ldlm/ldlm_resource.c
index 5e0dd53..7fe8a8b 100644
--- a/fs/lustre/ldlm/ldlm_resource.c
+++ b/fs/lustre/ldlm/ldlm_resource.c
@@ -682,6 +682,7 @@ struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
 	ns->ns_connect_flags = 0;
 	ns->ns_dirty_age_limit = LDLM_DIRTY_AGE_LIMIT;
 	ns->ns_stopping = 0;
+	ns->ns_last_pos = &ns->ns_unused_list;
 
 	rc = ldlm_namespace_sysfs_register(ns);
 	if (rc != 0) {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 043/622] lustre: checksum: enable/disable checksum correctly
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (41 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 042/622] lustre: ldlm: speed up preparation for list of lock cancel James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 044/622] lustre: build: armv7 client build fixes James Simmons
                   ` (579 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Emoly Liu <emoly@whamcloud.com>

There are three ways to set checksum support in Lustre. Their
order during client mount is:
- 1. configure --enable/disable-checksum; this (ENABLE_CHECKSUM)
  only affects the default mount option and is set in function
  client_obd_setup().
- 2. lctl set_param -P osc.*.checksums=0/1, when processing llog,
  this value will be set by osc_checksum_seq_write().
- 3. mount option checksum/nochecksum, this will be checked in
  ll_options() and be set in client_common_fill_super()->
  obd_set_info_async().

This patch fixes an issue in 3: if the mount option
"-o checksum/nochecksum" is specified, the checksum setting is
changed accordingly, no matter what was set by "set_param -P" or
the default option; if no mount option is specified, the value
set by "set_param -P" is kept. Also, test_77k is added to
sanity.sh to verify this patch.

In addition, a minor initialization issue with cl_supp_cksum_types
is fixed: cl_supp_cksum_types should always be initialized,
regardless of whether checksum is enabled.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10906
Lustre-commit: e9b13cd1daf9 ("LU-10906 checksum: enable/disable checksum correctly")
Signed-off-by: Emoly Liu <emoly@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32095
Reviewed-by: Yingjin Qian <qian@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_lib.c        |  5 +++--
 fs/lustre/llite/llite_internal.h |  3 ++-
 fs/lustre/llite/llite_lib.c      | 23 ++++++++++++++---------
 3 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/fs/lustre/ldlm/ldlm_lib.c b/fs/lustre/ldlm/ldlm_lib.c
index 7bc1d10..2c0fad3 100644
--- a/fs/lustre/ldlm/ldlm_lib.c
+++ b/fs/lustre/ldlm/ldlm_lib.c
@@ -355,6 +355,8 @@ int client_obd_setup(struct obd_device *obddev, struct lustre_cfg *lcfg)
 
 	init_waitqueue_head(&cli->cl_destroy_waitq);
 	atomic_set(&cli->cl_destroy_in_flight, 0);
+
+	cli->cl_supp_cksum_types = OBD_CKSUM_CRC32;
 	/* Turn on checksumming by default. */
 	cli->cl_checksum = 1;
 	/*
@@ -362,8 +364,7 @@ int client_obd_setup(struct obd_device *obddev, struct lustre_cfg *lcfg)
 	 * Set cl_chksum* to CRC32 for now to avoid returning screwed info
 	 * through procfs.
 	 */
-	cli->cl_cksum_type = OBD_CKSUM_CRC32;
-	cli->cl_supp_cksum_types = OBD_CKSUM_CRC32;
+	cli->cl_cksum_type = cli->cl_supp_cksum_types;
 	atomic_set(&cli->cl_resends, OSC_DEFAULT_RESENDS);
 
 	/*
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index d0a703d..6bdbf28 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -479,7 +479,8 @@ struct ll_sb_info {
 	unsigned int		  ll_umounting:1,
 				  ll_xattr_cache_enabled:1,
 				ll_xattr_cache_set:1, /* already set to 0/1 */
-				  ll_client_common_fill_super_succeeded:1;
+				  ll_client_common_fill_super_succeeded:1,
+				  ll_checksum_set:1;
 
 	struct lustre_client_ocd  ll_lco;
 
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index e2c7a4d..eb29064 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -560,13 +560,15 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 	}
 
 	checksum = sbi->ll_flags & LL_SBI_CHECKSUM;
-	err = obd_set_info_async(NULL, sbi->ll_dt_exp, sizeof(KEY_CHECKSUM),
-				 KEY_CHECKSUM, sizeof(checksum), &checksum,
-				 NULL);
-	if (err) {
-		CERROR("%s: Set checksum failed: rc = %d\n",
-		       sbi->ll_dt_exp->exp_obd->obd_name, err);
-		goto out_root;
+	if (sbi->ll_checksum_set) {
+		err = obd_set_info_async(NULL, sbi->ll_dt_exp,
+					 sizeof(KEY_CHECKSUM), KEY_CHECKSUM,
+					 sizeof(checksum), &checksum, NULL);
+		if (err) {
+			CERROR("%s: Set checksum failed: rc = %d\n",
+			       sbi->ll_dt_exp->exp_obd->obd_name, err);
+			goto out_root;
+		}
 	}
 	cl_sb_init(sb);
 
@@ -763,10 +765,11 @@ static inline int ll_set_opt(const char *opt, char *data, int fl)
 }
 
 /* non-client-specific mount options are parsed in lmd_parse */
-static int ll_options(char *options, int *flags)
+static int ll_options(char *options, struct ll_sb_info *sbi)
 {
 	int tmp;
 	char *s1 = options, *s2;
+	int *flags = &sbi->ll_flags;
 
 	if (!options)
 		return 0;
@@ -832,11 +835,13 @@ static int ll_options(char *options, int *flags)
 		tmp = ll_set_opt("checksum", s1, LL_SBI_CHECKSUM);
 		if (tmp) {
 			*flags |= tmp;
+			sbi->ll_checksum_set = 1;
 			goto next;
 		}
 		tmp = ll_set_opt("nochecksum", s1, LL_SBI_CHECKSUM);
 		if (tmp) {
 			*flags &= ~tmp;
+			sbi->ll_checksum_set = 1;
 			goto next;
 		}
 		tmp = ll_set_opt("lruresize", s1, LL_SBI_LRU_RESIZE);
@@ -971,7 +976,7 @@ int ll_fill_super(struct super_block *sb)
 		goto out_free;
 	}
 
-	err = ll_options(lsi->lsi_lmd->lmd_opts, &sbi->ll_flags);
+	err = ll_options(lsi->lsi_lmd->lmd_opts, sbi);
 	if (err)
 		goto out_free;
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 044/622] lustre: build: armv7 client build fixes
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (42 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 043/622] lustre: checksum: enable/disable checksum correctly James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 045/622] lustre: ldlm: fix l_last_activity usage James Simmons
                   ` (578 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Andrew Perepechko <c17827@cray.com>

This commit fixes the armv7 Lustre client build; the changes
are mostly related to 64-bit division.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10964
Lustre-commit: 0300a6efd226 ("LU-10964 build: armv7 client build fixes")
Signed-off-by: Andrew Perepechko <c17827@cray.com>
Reviewed-on: https://review.whamcloud.com/32194
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Alexander Zarochentsev <c17826@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_request.c | 3 ++-
 fs/lustre/ptlrpc/import.c     | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index dd4d958..3991a8f 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -1408,7 +1408,8 @@ static enum ldlm_policy_res ldlm_cancel_lrur_policy(struct ldlm_namespace *ns,
 
 	slv = ldlm_pool_get_slv(pl);
 	lvf = ldlm_pool_get_lvf(pl);
-	la = ktime_to_ns(ktime_sub(cur, lock->l_last_used)) / NSEC_PER_SEC;
+	la = div_u64(ktime_to_ns(ktime_sub(cur, lock->l_last_used)),
+		     NSEC_PER_SEC);
 	lv = lvf * la * unused;
 
 	/* Inform pool about current CLV to see it via debugfs. */
diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index f69b907..5d6546d 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -289,7 +289,7 @@ void ptlrpc_invalidate_import(struct obd_import *imp)
 		 */
 		if (!OBD_FAIL_CHECK(OBD_FAIL_PTLRPC_LONG_REPL_UNLINK)) {
 			timeout = ptlrpc_inflight_timeout(imp);
-			timeout += timeout / 3;
+			timeout += div_u64(timeout, 3);
 
 			if (timeout == 0)
 				timeout = obd_timeout;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 045/622] lustre: ldlm: fix l_last_activity usage
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (43 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 044/622] lustre: build: armv7 client build fixes James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 046/622] lustre: ptlrpc: Add WBC connect flag James Simmons
                   ` (577 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Alexander Boyko <c17825@cray.com>

When a race happens between ldlm_server_blocking_ast() and
ldlm_request_cancel(), at_measured() is called with a wrong
value equal to the current time. Even worse, ldlm_bl_timeout()
can return current_time * 1.5.
Before the time functions were fixed for 64-bit by LU-9019
(fdeeed2fb), this race led to ETIMEDOUT at
ptlrpc_import_delay_req() and client eviction while sending the
blocking AST. The wrong type conversion takes place in
ptlrpc_send_limit_expired() at cfs_time_seconds().

We should not take cancels into account if the BLAST was not
sent, because then last_activity is not properly initialised -
that destroys the AT completely.
The patch divides l_last_activity into the client-side l_activity
and the server-side l_blast_sent for better clarity. l_blast_sent
is used only for the blocking AST, to measure the time between the
BLAST and the cancel request.

For example:
 server cancels blocked lock after 1518731697s
 waiting_locks_callback()) ### lock callback timer expired after 0s:
 evicting client

WC-bug-id: https://jira.whamcloud.com/browse/LU-10945
Lustre-commit: e09d273cb5f2 ("LU-10945 ldlm: fix l_last_activity usage")
Signed-off-by: Alexander Boyko <c17825@cray.com>
Cray-bug-id: LUS-5736
Reviewed-on: https://review.whamcloud.com/32133
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Vitaly Fertman <c17818@cray.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Mikhal Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_dlm.h | 13 +++++++------
 fs/lustre/ldlm/ldlm_lock.c     |  1 +
 fs/lustre/ldlm/ldlm_request.c  | 14 +++++++-------
 3 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/lustre/include/lustre_dlm.h b/fs/lustre/include/lustre_dlm.h
index 1a19b35..6ad12a3 100644
--- a/fs/lustre/include/lustre_dlm.h
+++ b/fs/lustre/include/lustre_dlm.h
@@ -708,12 +708,6 @@ struct ldlm_lock {
 	wait_queue_head_t		l_waitq;
 
 	/**
-	 * Seconds. It will be updated if there is any activity related to
-	 * the lock, e.g. enqueue the lock or send blocking AST.
-	 */
-	time64_t			l_last_activity;
-
-	/**
 	 * Time, in nanoseconds, last used by e.g. being matched by lock match.
 	 */
 	ktime_t				l_last_used;
@@ -735,6 +729,13 @@ struct ldlm_lock {
 
 	/** Private storage for lock user. Opaque to LDLM. */
 	void				*l_ast_data;
+
+	/**
+	 * Seconds. It will be updated if there is any activity related to
+	 * the lock at client, e.g. enqueue the lock.
+	 */
+	time64_t			l_activity;
+
 	/* Separate ost_lvb used mostly by Data-on-MDT for now.
 	 * It is introduced to don't mix with layout lock data.
 	 */
diff --git a/fs/lustre/ldlm/ldlm_lock.c b/fs/lustre/ldlm/ldlm_lock.c
index 894b99b..1bf387a 100644
--- a/fs/lustre/ldlm/ldlm_lock.c
+++ b/fs/lustre/ldlm/ldlm_lock.c
@@ -420,6 +420,7 @@ static struct ldlm_lock *ldlm_lock_new(struct ldlm_resource *resource)
 	lu_ref_init(&lock->l_reference);
 	lu_ref_add(&lock->l_reference, "hash", lock);
 	lock->l_callback_timeout = 0;
+	lock->l_activity = 0;
 
 #if LUSTRE_TRACKS_LOCK_EXP_REFS
 	INIT_LIST_HEAD(&lock->l_exp_refs_link);
diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index 3991a8f..67c23fc 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -114,9 +114,9 @@ static void ldlm_expired_completion_wait(struct ldlm_lock *lock, u32 conn_cnt)
 
 		LDLM_ERROR(lock,
 			   "lock timed out (enqueued at %lld, %llds ago); not entering recovery in server code, just going back to sleep",
-			   (s64)lock->l_last_activity,
+			   (s64)lock->l_activity,
 			   (s64)(ktime_get_real_seconds() -
-				 lock->l_last_activity));
+				 lock->l_activity));
 		if (ktime_get_seconds() > next_dump) {
 			last_dump = next_dump;
 			next_dump = ktime_get_seconds() + 300;
@@ -133,8 +133,8 @@ static void ldlm_expired_completion_wait(struct ldlm_lock *lock, u32 conn_cnt)
 	ptlrpc_fail_import(imp, conn_cnt);
 	LDLM_ERROR(lock,
 		   "lock timed out (enqueued at %lld, %llds ago), entering recovery for %s@%s",
-		   (s64)lock->l_last_activity,
-		   (s64)(ktime_get_real_seconds() - lock->l_last_activity),
+		   (s64)lock->l_activity,
+		   (s64)(ktime_get_real_seconds() - lock->l_activity),
 		   obd2cli_tgt(obd), imp->imp_connection->c_remote_uuid.uuid);
 }
 
@@ -182,7 +182,7 @@ static int ldlm_completion_tail(struct ldlm_lock *lock, void *data)
 		LDLM_DEBUG(lock, "client-side enqueue: granted");
 	} else {
 		/* Take into AT only CP RPC, not immediately granted locks */
-		delay = ktime_get_real_seconds() - lock->l_last_activity;
+		delay = ktime_get_real_seconds() - lock->l_activity;
 		LDLM_DEBUG(lock, "client-side enqueue: granted after %lds",
 			   delay);
 
@@ -245,7 +245,7 @@ int ldlm_completion_ast(struct ldlm_lock *lock, u64 flags, void *data)
 
 	timeout = ldlm_cp_timeout(lock);
 
-	lock->l_last_activity = ktime_get_real_seconds();
+	lock->l_activity = ktime_get_real_seconds();
 
 	if (imp) {
 		spin_lock(&imp->imp_lock);
@@ -725,7 +725,7 @@ int ldlm_cli_enqueue(struct obd_export *exp, struct ptlrpc_request **reqp,
 	lock->l_export = NULL;
 	lock->l_blocking_ast = einfo->ei_cb_bl;
 	lock->l_flags |= (*flags & (LDLM_FL_NO_LRU | LDLM_FL_EXCL));
-	lock->l_last_activity = ktime_get_real_seconds();
+	lock->l_activity = ktime_get_real_seconds();
 
 	/* lock not sent to server yet */
 	if (!reqp || !*reqp) {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 046/622] lustre: ptlrpc: Add WBC connect flag
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (44 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 045/622] lustre: ldlm: fix l_last_activity usage James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 047/622] lustre: llog: remove obsolete llog handlers James Simmons
                   ` (576 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Oleg Drokin <green@whamcloud.com>

It denotes the ability of the node to understand additional
types of intent requests, exclusive metadata locks issued
to clients, and server operations performed under such
locks while they are still held by clients.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10938
Lustre-commit: f024aabf8bbf ("LU-10938 ptlrpc: Add WBC connect flag")
Signed-off-by: Oleg Drokin <green@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32241
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Mikhal Pershin <mpershin@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/lprocfs_status.c    | 1 +
 fs/lustre/ptlrpc/wiretest.c            | 2 ++
 include/uapi/linux/lustre/lustre_idl.h | 5 +++++
 3 files changed, 8 insertions(+)

diff --git a/fs/lustre/obdclass/lprocfs_status.c b/fs/lustre/obdclass/lprocfs_status.c
index 66d2679..e2575b4 100644
--- a/fs/lustre/obdclass/lprocfs_status.c
+++ b/fs/lustre/obdclass/lprocfs_status.c
@@ -117,6 +117,7 @@
 	"unknown",	/* 0x08 */
 	"unknown",	/* 0x10 */
 	"flr",		/* 0x20 */
+	"wbc",		/* 0x40 */
 	NULL
 };
 
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index b14d301c..c566dea 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1115,6 +1115,8 @@ void lustre_assert_wire_constants(void)
 		 OBD_CONNECT2_DIR_MIGRATE);
 	LASSERTF(OBD_CONNECT2_FLR == 0x20ULL, "found 0x%.16llxULL\n",
 		 OBD_CONNECT2_FLR);
+	LASSERTF(OBD_CONNECT2_WBC_INTENTS == 0x40ULL, "found 0x%.16llxULL\n",
+		 OBD_CONNECT2_WBC_INTENTS);
 	LASSERTF(OBD_CKSUM_CRC32 == 0x00000001UL, "found 0x%.8xUL\n",
 		 (unsigned int)OBD_CKSUM_CRC32);
 	LASSERTF(OBD_CKSUM_ADLER == 0x00000002UL, "found 0x%.8xUL\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 2403b89..f437614 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -794,6 +794,11 @@ struct ptlrpc_body_v2 {
 #define OBD_CONNECT2_DIR_MIGRATE	0x4ULL		/* migrate striped dir
 							 */
 #define OBD_CONNECT2_FLR		0x20ULL		/* FLR support */
+#define OBD_CONNECT2_WBC_INTENTS	0x40ULL /* create/unlink/... intents
+						 * for wbc, also operations
+						 * under client-held parent
+						 * locks
+						 */
 
 /* XXX README XXX:
  * Please DO NOT add flag values here before first ensuring that this same
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 047/622] lustre: llog: remove obsolete llog handlers
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (45 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 046/622] lustre: ptlrpc: Add WBC connect flag James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 048/622] lustre: ldlm: fix for l_lru usage James Simmons
                   ` (575 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: "John L. Hammond" <jhammond@whamcloud.com>

Remove the obsolete llog RPC handling for cancel, close, and
destroy. Remove llog handling from ldlm_callback_handler(). Remove the
unused client side method llog_client_destroy().

WC-bug-id: https://jira.whamcloud.com/browse/LU-10855
Lustre-commit: 85011d372dfb ("LU-10855 llog: remove obsolete llog handlers")
Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32202
Reviewed-by: Mikhal Pershin <mpershin@whamcloud.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_req_layout.h  |  3 ---
 fs/lustre/ptlrpc/layout.c              | 26 --------------------------
 include/uapi/linux/lustre/lustre_idl.h | 12 ++++++------
 3 files changed, 6 insertions(+), 35 deletions(-)

diff --git a/fs/lustre/include/lustre_req_layout.h b/fs/lustre/include/lustre_req_layout.h
index 2348569..2737240 100644
--- a/fs/lustre/include/lustre_req_layout.h
+++ b/fs/lustre/include/lustre_req_layout.h
@@ -212,13 +212,10 @@ void req_capsule_shrink(struct req_capsule *pill,
 extern struct req_format RQF_LDLM_GL_CALLBACK;
 extern struct req_format RQF_LDLM_GL_CALLBACK_DESC;
 /* LOG req_format */
-extern struct req_format RQF_LOG_CANCEL;
 extern struct req_format RQF_LLOG_ORIGIN_HANDLE_CREATE;
-extern struct req_format RQF_LLOG_ORIGIN_HANDLE_DESTROY;
 extern struct req_format RQF_LLOG_ORIGIN_HANDLE_NEXT_BLOCK;
 extern struct req_format RQF_LLOG_ORIGIN_HANDLE_PREV_BLOCK;
 extern struct req_format RQF_LLOG_ORIGIN_HANDLE_READ_HEADER;
-extern struct req_format RQF_LLOG_ORIGIN_CONNECT;
 
 extern struct req_format RQF_CONNECT;
 
diff --git a/fs/lustre/ptlrpc/layout.c b/fs/lustre/ptlrpc/layout.c
index 4909b30..8fe661d 100644
--- a/fs/lustre/ptlrpc/layout.c
+++ b/fs/lustre/ptlrpc/layout.c
@@ -88,11 +88,6 @@
 	&RMF_MGS_CONFIG_RES
 };
 
-static const struct req_msg_field *log_cancel_client[] = {
-	&RMF_PTLRPC_BODY,
-	&RMF_LOGCOOKIES
-};
-
 static const struct req_msg_field *mdt_body_only[] = {
 	&RMF_PTLRPC_BODY,
 	&RMF_MDT_BODY
@@ -547,11 +542,6 @@
 	&RMF_LLOG_LOG_HDR
 };
 
-static const struct req_msg_field *llogd_conn_body_only[] = {
-	&RMF_PTLRPC_BODY,
-	&RMF_LLOGD_CONN_BODY
-};
-
 static const struct req_msg_field *llog_origin_handle_next_block_server[] = {
 	&RMF_PTLRPC_BODY,
 	&RMF_LLOGD_BODY,
@@ -766,13 +756,10 @@
 	&RQF_LDLM_INTENT_CREATE,
 	&RQF_LDLM_INTENT_UNLINK,
 	&RQF_LDLM_INTENT_GETXATTR,
-	&RQF_LOG_CANCEL,
 	&RQF_LLOG_ORIGIN_HANDLE_CREATE,
-	&RQF_LLOG_ORIGIN_HANDLE_DESTROY,
 	&RQF_LLOG_ORIGIN_HANDLE_NEXT_BLOCK,
 	&RQF_LLOG_ORIGIN_HANDLE_PREV_BLOCK,
 	&RQF_LLOG_ORIGIN_HANDLE_READ_HEADER,
-	&RQF_LLOG_ORIGIN_CONNECT,
 	&RQF_CONNECT,
 };
 
@@ -1254,10 +1241,6 @@ struct req_format RQF_FLD_READ =
 	DEFINE_REQ_FMT0("FLD_READ", fld_read_client, fld_read_server);
 EXPORT_SYMBOL(RQF_FLD_READ);
 
-struct req_format RQF_LOG_CANCEL =
-	DEFINE_REQ_FMT0("OBD_LOG_CANCEL", log_cancel_client, empty);
-EXPORT_SYMBOL(RQF_LOG_CANCEL);
-
 struct req_format RQF_MDS_QUOTACTL =
 	DEFINE_REQ_FMT0("MDS_QUOTACTL", quotactl_only, quotactl_only);
 EXPORT_SYMBOL(RQF_MDS_QUOTACTL);
@@ -1511,11 +1494,6 @@ struct req_format RQF_LLOG_ORIGIN_HANDLE_CREATE =
 			llog_origin_handle_create_client, llogd_body_only);
 EXPORT_SYMBOL(RQF_LLOG_ORIGIN_HANDLE_CREATE);
 
-struct req_format RQF_LLOG_ORIGIN_HANDLE_DESTROY =
-	DEFINE_REQ_FMT0("LLOG_ORIGIN_HANDLE_DESTROY",
-			llogd_body_only, llogd_body_only);
-EXPORT_SYMBOL(RQF_LLOG_ORIGIN_HANDLE_DESTROY);
-
 struct req_format RQF_LLOG_ORIGIN_HANDLE_NEXT_BLOCK =
 	DEFINE_REQ_FMT0("LLOG_ORIGIN_HANDLE_NEXT_BLOCK",
 			llogd_body_only, llog_origin_handle_next_block_server);
@@ -1531,10 +1509,6 @@ struct req_format RQF_LLOG_ORIGIN_HANDLE_READ_HEADER =
 			llogd_body_only, llog_log_hdr_only);
 EXPORT_SYMBOL(RQF_LLOG_ORIGIN_HANDLE_READ_HEADER);
 
-struct req_format RQF_LLOG_ORIGIN_CONNECT =
-	DEFINE_REQ_FMT0("LLOG_ORIGIN_CONNECT", llogd_conn_body_only, empty);
-EXPORT_SYMBOL(RQF_LLOG_ORIGIN_CONNECT);
-
 struct req_format RQF_CONNECT =
 	DEFINE_REQ_FMT0("CONNECT", obd_connect_client, obd_connect_server);
 EXPORT_SYMBOL(RQF_CONNECT);
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index f437614..7cf7307 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -2312,7 +2312,7 @@ struct cfg_marker {
 
 enum obd_cmd {
 	OBD_PING = 400,
-	OBD_LOG_CANCEL,
+	OBD_LOG_CANCEL,	/* Obsolete since 1.5. */
 	OBD_QC_CALLBACK, /* not used since 2.4 */
 	OBD_IDX_READ,
 	OBD_LAST_OPC
@@ -2624,12 +2624,12 @@ enum llogd_rpc_ops {
 	LLOG_ORIGIN_HANDLE_CREATE	= 501,
 	LLOG_ORIGIN_HANDLE_NEXT_BLOCK	= 502,
 	LLOG_ORIGIN_HANDLE_READ_HEADER	= 503,
-	LLOG_ORIGIN_HANDLE_WRITE_REC	= 504,
-	LLOG_ORIGIN_HANDLE_CLOSE	= 505,
-	LLOG_ORIGIN_CONNECT		= 506,
-	LLOG_CATINFO			= 507,  /* deprecated */
+	LLOG_ORIGIN_HANDLE_WRITE_REC	= 504,	/* Obsolete by 2.1. */
+	LLOG_ORIGIN_HANDLE_CLOSE	= 505,	/* Obsolete by 1.8. */
+	LLOG_ORIGIN_CONNECT		= 506,	/* Obsolete by 2.4. */
+	LLOG_CATINFO			= 507,  /* Obsolete by 2.3. */
 	LLOG_ORIGIN_HANDLE_PREV_BLOCK	= 508,
-	LLOG_ORIGIN_HANDLE_DESTROY	= 509,  /* for destroy llog object*/
+	LLOG_ORIGIN_HANDLE_DESTROY	= 509,  /* Obsolete. */
 	LLOG_LAST_OPC,
 	LLOG_FIRST_OPC			= LLOG_ORIGIN_HANDLE_CREATE
 };
-- 
1.8.3.1


* [lustre-devel] [PATCH 048/622] lustre: ldlm: fix for l_lru usage
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (46 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 047/622] lustre: llog: remove obsolete llog handlers James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 049/622] lustre: lov: Move lov_tgts_kobj init to lov_setup James Simmons
                   ` (574 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Yang Sheng <ys@whamcloud.com>

Fixes for the lock convert code to prevent a false assertion and
busy locks in the LRU:
- ensure there are no l_readers or l_writers when adding a lock to
  the LRU after a convert.
- don't check l_lru without holding ns_lock.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11003
Lustre-commit: 2a77dd3bee76 ("LU-11003 ldlm: fix for l_lru usage")
Signed-off-by: Yang Sheng <ys@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32309
Reviewed-by: Fan Yong <fan.yong@intel.com>
Reviewed-by: Mikhal Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_inodebits.c |  1 -
 fs/lustre/ldlm/ldlm_request.c   | 19 +++++++++++--------
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/fs/lustre/ldlm/ldlm_inodebits.c b/fs/lustre/ldlm/ldlm_inodebits.c
index e74928e..ddbf8d4 100644
--- a/fs/lustre/ldlm/ldlm_inodebits.c
+++ b/fs/lustre/ldlm/ldlm_inodebits.c
@@ -171,7 +171,6 @@ int ldlm_cli_dropbits(struct ldlm_lock *lock, u64 drop_bits)
 		ldlm_set_cbpending(lock);
 		ldlm_set_bl_ast(lock);
 		unlock_res_and_lock(lock);
-		LASSERT(list_empty(&lock->l_lru));
 		goto exit;
 	}
 
diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index 67c23fc..5833f59 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -881,21 +881,25 @@ static int lock_convert_interpret(const struct lu_env *env,
 	} else {
 		ldlm_clear_converting(lock);
 
-		/* Concurrent BL AST has arrived, it may cause another convert
-		 * or cancel so just exit here.
+		/* Concurrent BL AST may arrive and cause another convert
+		 * or cancel so just do nothing here if bl_ast is set,
+		 * finish with convert otherwise.
 		 */
 		if (!ldlm_is_bl_ast(lock)) {
 			struct ldlm_namespace *ns = ldlm_lock_to_ns(lock);
 
 			/* Drop cancel_bits since there are no more converts
-			 * and put lock into LRU if it is not there yet.
+			 * and put lock into LRU if it is still not used and
+			 * is not there yet.
 			 */
 			lock->l_policy_data.l_inodebits.cancel_bits = 0;
-			spin_lock(&ns->ns_lock);
-			if (!list_empty(&lock->l_lru))
+			if (!lock->l_readers && !lock->l_writers) {
+				spin_lock(&ns->ns_lock);
+				/* there is check for list_empty() inside */
 				ldlm_lock_remove_from_lru_nolock(lock);
-			ldlm_lock_add_to_lru_nolock(lock);
-			spin_unlock(&ns->ns_lock);
+				ldlm_lock_add_to_lru_nolock(lock);
+				spin_unlock(&ns->ns_lock);
+			}
 		}
 	}
 	unlock_res_and_lock(lock);
@@ -903,7 +907,6 @@ static int lock_convert_interpret(const struct lu_env *env,
 	if (rc) {
 		lock_res_and_lock(lock);
 		if (ldlm_is_converting(lock)) {
-			LASSERT(list_empty(&lock->l_lru));
 			ldlm_clear_converting(lock);
 			ldlm_set_cbpending(lock);
 			ldlm_set_bl_ast(lock);
-- 
1.8.3.1


* [lustre-devel] [PATCH 049/622] lustre: lov: Move lov_tgts_kobj init to lov_setup
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (47 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 048/622] lustre: ldlm: fix for l_lru usage James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 050/622] lustre: osc: add T10PI support for RPC checksum James Simmons
                   ` (573 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Oleg Drokin <green@whamcloud.com>

and free it in lov_cleanup.
This is a more robust solution than doing it in lov_putref,
especially since we know the refcount there crosses 0 repeatedly,
confusing things.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11015
Lustre-commit: 313ac16698db ("LU-11015 lov: Move lov_tgts_kobj init to lov_setup")
Signed-off-by: Oleg Drokin <green@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32367
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/lov/lov_obd.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/lustre/lov/lov_obd.c b/fs/lustre/lov/lov_obd.c
index 26637bc..9449aa9 100644
--- a/fs/lustre/lov/lov_obd.c
+++ b/fs/lustre/lov/lov_obd.c
@@ -110,10 +110,6 @@ void lov_tgts_putref(struct obd_device *obd)
 			/* Disconnect */
 			__lov_del_obd(obd, tgt);
 		}
-
-		if (lov->lov_tgts_kobj)
-			kobject_put(lov->lov_tgts_kobj);
-
 	} else {
 		mutex_unlock(&lov->lov_lock);
 	}
@@ -235,9 +231,6 @@ static int lov_connect(const struct lu_env *env,
 
 	lov_tgts_getref(obd);
 
-	lov->lov_tgts_kobj = kobject_create_and_add("target_obds",
-						    &obd->obd_kset.kobj);
-
 	for (i = 0; i < lov->desc.ld_tgt_count; i++) {
 		tgt = lov->lov_tgts[i];
 		if (!tgt || obd_uuid_empty(&tgt->ltd_uuid))
@@ -784,6 +777,9 @@ int lov_setup(struct obd_device *obd, struct lustre_cfg *lcfg)
 	if (rc)
 		goto out_tunables;
 
+	lov->lov_tgts_kobj = kobject_create_and_add("target_obds",
+						    &obd->obd_kset.kobj);
+
 	return 0;
 
 out_tunables:
@@ -799,6 +795,11 @@ static int lov_cleanup(struct obd_device *obd)
 	struct lov_obd *lov = &obd->u.lov;
 	struct pool_desc *pool, *tmp;
 
+	if (lov->lov_tgts_kobj) {
+		kobject_put(lov->lov_tgts_kobj);
+		lov->lov_tgts_kobj = NULL;
+	}
+
 	list_for_each_entry_safe(pool, tmp, &lov->lov_pool_list, pool_list) {
 		/* free pool structs */
 		CDEBUG(D_INFO, "delete pool %p\n", pool);
-- 
1.8.3.1


* [lustre-devel] [PATCH 050/622] lustre: osc: add T10PI support for RPC checksum
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (48 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 049/622] lustre: lov: Move lov_tgts_kobj init to lov_setup James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 051/622] lustre: ldlm: Reduce debug to console during eviction James Simmons
                   ` (572 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Li Xi <lixi@ddn.com>

T10 Protection Information (T10 PI), previously known as Data
Integrity Field (DIF), is a standard for end-to-end data integrity
validation. T10 PI prevents silent data corruption, ensuring that
incomplete and incorrect data cannot overwrite good data.

The Lustre file system already supports an RPC-level checksum which
validates the data in bulk RPCs when writing/reading data to/from
objects on OSTs. The RPC-level checksum can detect data corruption
that happens while an RPC is transferred over the wire. However, it
cannot catch silent data corruption that happens under other
conditions, for example memory corruption while data sits in the
page cache, and the existing checksum mechanism only provides
disjoint protection coverage. Thus, in order to provide end-to-end
data protection, T10PI support should be added to Lustre.
In order to provide end-to-end data integrity validation, the T10 PI
checksum of the data in each sector needs to be calculated on the
Lustre client side and validated later on the Lustre OSS side. The
T10 protection information should be sent together with the data in
the RPC. However, to avoid significant performance degradation, the
new T10PI feature is integrated with the existing bulk RPC checksum
feature instead of sending the original guard tags for every sector
in a bulk RPC.

When the OST starts, the necessary T10PI information is extracted
from storage: the T10PI DIF type and the sector size. The DIF type
can be one of TYPE1_IP, TYPE1_CRC, TYPE3_IP, and TYPE3_CRC, and the
sector size can be either 512 bytes or 4KB.

When an OSC connects to an OST, the OSC and OST negotiate the
checksum types. New checksum types are added for T10PI support:
OBD_CKSUM_T10IP512, OBD_CKSUM_T10IP4K, OBD_CKSUM_T10CRC512, and
OBD_CKSUM_T10CRC4K. If the OST storage has T10PI support, the only
selectable T10PI checksum type is the one matching the T10PI type of
the hardware. The other existing checksum types (crc32, crc32c,
adler32) remain valid options for the RPC checksum type.

When calculating the T10PI RPC checksum, the T10PI checksums of all
sectors are calculated first using the T10PI checksum type, i.e. a
16-bit CRC or IP checksum, and then the RPC checksum is calculated
over all of the T10PI checksums. The RPC checksum type used in this
step is always adler32. Considering that the checksum-of-checksums is
only computed on a 4KB chunk of GRD tags for a 1MB RPC with 512B
sectors, or 16KB of GRD tags for 16MB of 4KB sectors, this is only
1/256 or 1/1024 of the total data being checksummed, so the checksum
type used here should not affect overall system performance
noticeably.

obdfilter.*.enforce_t10pi_cksum can be used to tune whether or not
the T10-PI checksum is enforced.

If the OST supports the T10-PI feature and the T10-PI checksum is
enforced, clients have no choice of RPC checksum type other than the
T10-PI checksum type. This is useful for enforcing end-to-end
integrity in the whole system.

If the OST doesn't support the T10-PI feature but the T10-PI checksum
is enforced, all the T10-PI checksum types (t10ip512, t10ip4K,
t10crc512, t10crc4K) are added to the available checksum types
alongside the other checksums with reasonably good speeds (e.g.
crc32, crc32c, adler), regardless of the speeds of the T10-PI
checksums. This is useful for testing the T10-PI RPC checksums.

If the OST supports the T10-PI feature and the T10-PI checksum is NOT
enforced, the corresponding T10-PI checksum type is added to the
checksum type list, regardless of its speed. This gives clients the
flexibility to choose whether or not to enable end-to-end integrity.

If the OST does NOT support the T10-PI feature and the T10-PI
checksum is NOT enforced, all the T10-PI checksum types with good
speeds are added to the checksum type list alongside the other
checksums with reasonably good speeds. Note that a T10-PI checksum
type slower than half of adler is NOT added as an option. In this
case, the T10-PI checksum types behave the same as the other normal
checksum types.

Clients that have no T10-PI RPC checksum support are not affected by
the above logic, and the logic is only applied to clients that
connect after obdfilter.*.enforce_t10pi_cksum is changed on an OST.

The following are the speeds of the different checksum types on a
server with an Intel(R) Xeon(R) E5-2650 CPU @ 2.00GHz:

crc: 1575 MB/s
crc32c: 9763 MB/s
adler: 1255 MB/s
t10ip512: 6151 MB/s
t10ip4k: 7935 MB/s
t10crc512: 1119 MB/s
t10crc4k: 1531 MB/s

WC-bug-id: https://jira.whamcloud.com/browse/LU-10472
Lustre-commit: b1e7be00cb6e ("LU-10472 osc: add T10PI support for RPC checksum")
Signed-off-by: Li Xi <lixi@ddn.com>
Reviewed-on: https://review.whamcloud.com/30980
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Faccini Bruno <bruno.faccini@intel.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_cksum.h          | 123 +++++++++------
 fs/lustre/include/obd_class.h          |   1 -
 fs/lustre/llite/llite_lib.c            |   4 +-
 fs/lustre/obdclass/Makefile            |   2 +-
 fs/lustre/obdclass/integrity.c         | 273 +++++++++++++++++++++++++++++++++
 fs/lustre/obdclass/obd_cksum.c         | 151 ++++++++++++++++++
 fs/lustre/osc/osc_request.c            | 214 +++++++++++++++++++++++---
 fs/lustre/ptlrpc/import.c              |   8 +-
 fs/lustre/ptlrpc/wiretest.c            |  17 +-
 include/uapi/linux/lustre/lustre_idl.h |  48 ++++--
 net/lnet/libcfs/linux-crypto.c         |   3 +
 11 files changed, 753 insertions(+), 91 deletions(-)
 create mode 100644 fs/lustre/obdclass/integrity.c
 create mode 100644 fs/lustre/obdclass/obd_cksum.c

diff --git a/fs/lustre/include/obd_cksum.h b/fs/lustre/include/obd_cksum.h
index 26a9555..cc47c44 100644
--- a/fs/lustre/include/obd_cksum.h
+++ b/fs/lustre/include/obd_cksum.h
@@ -35,6 +35,9 @@
 #include <linux/libcfs/libcfs_crypto.h>
 #include <uapi/linux/lustre/lustre_idl.h>
 
+int obd_t10_cksum_speed(const char *obd_name,
+			enum cksum_type cksum_type);
+
 static inline unsigned char cksum_obd2cfs(enum cksum_type cksum_type)
 {
 	switch (cksum_type) {
@@ -51,59 +54,23 @@ static inline unsigned char cksum_obd2cfs(enum cksum_type cksum_type)
 	return 0;
 }
 
-/* The OBD_FL_CKSUM_* flags is packed into 5 bits of o_flags, since there can
- * only be a single checksum type per RPC.
- *
- * The OBD_CHECKSUM_* type bits passed in ocd_cksum_types are a 32-bit bitmask
- * since they need to represent the full range of checksum algorithms that
- * both the client and server can understand.
- *
- * In case of an unsupported types/flags we fall back to ADLER
- * because that is supported by all clients since 1.8
- *
- * In case multiple algorithms are supported the best one is used.
- */
-static inline u32 cksum_type_pack(enum cksum_type cksum_type)
-{
-	unsigned int performance = 0, tmp;
-	u32 flag = OBD_FL_CKSUM_ADLER;
-
-	if (cksum_type & OBD_CKSUM_CRC32) {
-		tmp = cfs_crypto_hash_speed(cksum_obd2cfs(OBD_CKSUM_CRC32));
-		if (tmp > performance) {
-			performance = tmp;
-			flag = OBD_FL_CKSUM_CRC32;
-		}
-	}
-	if (cksum_type & OBD_CKSUM_CRC32C) {
-		tmp = cfs_crypto_hash_speed(cksum_obd2cfs(OBD_CKSUM_CRC32C));
-		if (tmp > performance) {
-			performance = tmp;
-			flag = OBD_FL_CKSUM_CRC32C;
-		}
-	}
-	if (cksum_type & OBD_CKSUM_ADLER) {
-		tmp = cfs_crypto_hash_speed(cksum_obd2cfs(OBD_CKSUM_ADLER));
-		if (tmp > performance) {
-			performance = tmp;
-			flag = OBD_FL_CKSUM_ADLER;
-		}
-	}
-	if (unlikely(cksum_type && !(cksum_type & (OBD_CKSUM_CRC32C |
-						   OBD_CKSUM_CRC32 |
-						   OBD_CKSUM_ADLER))))
-		CWARN("unknown cksum type %x\n", cksum_type);
-
-	return flag;
-}
+u32 obd_cksum_type_pack(const char *obd_name, enum cksum_type cksum_type);
 
-static inline enum cksum_type cksum_type_unpack(u32 o_flags)
+static inline enum cksum_type obd_cksum_type_unpack(u32 o_flags)
 {
 	switch (o_flags & OBD_FL_CKSUM_ALL) {
 	case OBD_FL_CKSUM_CRC32C:
 		return OBD_CKSUM_CRC32C;
 	case OBD_FL_CKSUM_CRC32:
 		return OBD_CKSUM_CRC32;
+	case OBD_FL_CKSUM_T10IP512:
+		return OBD_CKSUM_T10IP512;
+	case OBD_FL_CKSUM_T10IP4K:
+		return OBD_CKSUM_T10IP4K;
+	case OBD_FL_CKSUM_T10CRC512:
+		return OBD_CKSUM_T10CRC512;
+	case OBD_FL_CKSUM_T10CRC4K:
+		return OBD_CKSUM_T10CRC4K;
 	default:
 		break;
 	}
@@ -115,7 +82,7 @@ static inline enum cksum_type cksum_type_unpack(u32 o_flags)
  * 1.8 supported ADLER it is base and not depend on hw
  * Client uses all available local algos
  */
-static inline enum cksum_type cksum_types_supported_client(void)
+static inline enum cksum_type obd_cksum_types_supported_client(void)
 {
 	enum cksum_type ret = OBD_CKSUM_ADLER;
 
@@ -128,6 +95,8 @@ static inline enum cksum_type cksum_types_supported_client(void)
 		ret |= OBD_CKSUM_CRC32C;
 	if (cfs_crypto_hash_speed(cksum_obd2cfs(OBD_CKSUM_CRC32)) > 0)
 		ret |= OBD_CKSUM_CRC32;
+	/* Client support all kinds of T10 checksum */
+	ret |= OBD_CKSUM_T10_ALL;
 
 	return ret;
 }
@@ -140,14 +109,68 @@ static inline enum cksum_type cksum_types_supported_client(void)
  * Caution is advised, however, since what is fastest on a single client may
  * not be the fastest or most efficient algorithm on the server.
  */
-static inline enum cksum_type cksum_type_select(enum cksum_type cksum_types)
+static inline enum cksum_type
+obd_cksum_type_select(const char *obd_name, enum cksum_type cksum_types)
 {
-	return cksum_type_unpack(cksum_type_pack(cksum_types));
+	u32 flag = obd_cksum_type_pack(obd_name, cksum_types);
+
+	return obd_cksum_type_unpack(flag);
 }
 
 /* Checksum algorithm names. Must be defined in the same order as the
  * OBD_CKSUM_* flags.
  */
-#define DECLARE_CKSUM_NAME char *cksum_name[] = {"crc32", "adler", "crc32c"}
+#define DECLARE_CKSUM_NAME const char *cksum_name[] = {"crc32", "adler", \
+	"crc32c", "reserved", "t10ip512", "t10ip4K", "t10crc512", "t10crc4K"}
+
+typedef u16 (obd_dif_csum_fn) (void *, unsigned int);
+
+u16 obd_dif_crc_fn(void *data, unsigned int len);
+u16 obd_dif_ip_fn(void *data, unsigned int len);
+int obd_page_dif_generate_buffer(const char *obd_name, struct page *page,
+				 u32 offset, u32 length,
+				 u16 *guard_start, int guard_number,
+				 int *used_number, int sector_size,
+				 obd_dif_csum_fn *fn);
+/*
+ * If checksum type is one T10 checksum types, init the csum_fn and sector
+ * size. Otherwise, init them to NULL/zero.
+ */
+static inline void obd_t10_cksum2dif(enum cksum_type cksum_type,
+				     obd_dif_csum_fn **fn, int *sector_size)
+{
+	*fn = NULL;
+	*sector_size = 0;
+
+	switch (cksum_type) {
+	case OBD_CKSUM_T10IP512:
+		*fn = obd_dif_ip_fn;
+		*sector_size = 512;
+		break;
+	case OBD_CKSUM_T10IP4K:
+		*fn = obd_dif_ip_fn;
+		*sector_size = 4096;
+		break;
+	case OBD_CKSUM_T10CRC512:
+		*fn = obd_dif_crc_fn;
+		*sector_size = 512;
+		break;
+	case OBD_CKSUM_T10CRC4K:
+		*fn = obd_dif_crc_fn;
+		*sector_size = 4096;
+		break;
+	default:
+		break;
+	}
+}
+
+enum obd_t10_cksum_type {
+	OBD_T10_CKSUM_UNKNOWN = 0,
+	OBD_T10_CKSUM_IP512,
+	OBD_T10_CKSUM_IP4K,
+	OBD_T10_CKSUM_CRC512,
+	OBD_T10_CKSUM_CRC4K,
+	OBD_T10_CKSUM_MAX
+};
 
 #endif /* __OBD_H */
diff --git a/fs/lustre/include/obd_class.h b/fs/lustre/include/obd_class.h
index d896049..0153c50 100644
--- a/fs/lustre/include/obd_class.h
+++ b/fs/lustre/include/obd_class.h
@@ -1687,7 +1687,6 @@ static inline void class_uuid_unparse(class_uuid_t uu, struct obd_uuid *out)
 extern char obd_jobid_name[];
 int class_procfs_init(void);
 int class_procfs_clean(void);
-
 /* prng.c */
 #define ll_generate_random_uuid(uuid_out) \
 	get_random_bytes(uuid_out, sizeof(class_uuid_t))
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index eb29064..dff349f 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -218,7 +218,7 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 				   OBD_CONNECT_LARGE_ACL;
 #endif
 
-	data->ocd_cksum_types = cksum_types_supported_client();
+	data->ocd_cksum_types = obd_cksum_types_supported_client();
 
 	if (OBD_FAIL_CHECK(OBD_FAIL_MDC_LIGHTWEIGHT))
 		/* flag mdc connection as lightweight, only used for test
@@ -432,7 +432,7 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 	if (OBD_FAIL_CHECK(OBD_FAIL_OSC_CKSUM_ADLER_ONLY))
 		data->ocd_cksum_types = OBD_CKSUM_ADLER;
 	else
-		data->ocd_cksum_types = cksum_types_supported_client();
+		data->ocd_cksum_types = obd_cksum_types_supported_client();
 
 	data->ocd_connect_flags |= OBD_CONNECT_LRU_RESIZE;
 
diff --git a/fs/lustre/obdclass/Makefile b/fs/lustre/obdclass/Makefile
index 96fce1b..25d2e1d 100644
--- a/fs/lustre/obdclass/Makefile
+++ b/fs/lustre/obdclass/Makefile
@@ -8,4 +8,4 @@ obdclass-y := llog.o llog_cat.o llog_obd.o llog_swab.o class_obd.o \
 	      lustre_handles.o lustre_peer.o statfs_pack.o linkea.o \
 	      obdo.o obd_config.o obd_mount.o lu_object.o lu_ref.o \
 	      cl_object.o cl_page.o cl_lock.o cl_io.o kernelcomm.o \
-	      jobid.o
+	      jobid.o integrity.o obd_cksum.o
diff --git a/fs/lustre/obdclass/integrity.c b/fs/lustre/obdclass/integrity.c
new file mode 100644
index 0000000..8348b16
--- /dev/null
+++ b/fs/lustre/obdclass/integrity.c
@@ -0,0 +1,273 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * GPL HEADER START
+ *
+ * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 only,
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License version 2 for more details (a copy is included
+ * in the LICENSE file that accompanied this code).
+ *
+ * You should have received a copy of the GNU General Public License
+ * version 2 along with this program; If not, see
+ * http://www.gnu.org/licenses/gpl-2.0.html
+ *
+ * GPL HEADER END
+ */
+/*
+ * Copyright (c) 2018, DataDirect Networks Storage.
+ * Author: Li Xi.
+ *
+ * General data integrity functions
+ */
+#include <linux/blkdev.h>
+#include <linux/crc-t10dif.h>
+#include <asm-generic/checksum.h>
+#include <obd_class.h>
+#include <obd_cksum.h>
+
+u16 obd_dif_crc_fn(void *data, unsigned int len)
+{
+	return cpu_to_be16(crc_t10dif(data, len));
+}
+EXPORT_SYMBOL(obd_dif_crc_fn);
+
+u16 obd_dif_ip_fn(void *data, unsigned int len)
+{
+	return ip_compute_csum(data, len);
+}
+EXPORT_SYMBOL(obd_dif_ip_fn);
+
+int obd_page_dif_generate_buffer(const char *obd_name, struct page *page,
+				 u32 offset, u32 length,
+				 u16 *guard_start, int guard_number,
+				 int *used_number, int sector_size,
+				 obd_dif_csum_fn *fn)
+{
+	unsigned int i;
+	char *data_buf;
+	u16 *guard_buf = guard_start;
+	unsigned int data_size;
+	int used = 0;
+
+	data_buf = kmap(page) + offset;
+	for (i = 0; i < length; i += sector_size) {
+		if (used >= guard_number) {
+			CERROR("%s: unexpected used guard number of DIF %u/%u, data length %u, sector size %u: rc = %d\n",
+			       obd_name, used, guard_number, length,
+			       sector_size, -E2BIG);
+			return -E2BIG;
+		}
+		data_size = length - i;
+		if (data_size > sector_size)
+			data_size = sector_size;
+		*guard_buf = fn(data_buf, data_size);
+		guard_buf++;
+		data_buf += data_size;
+		used++;
+	}
+	kunmap(page);
+	*used_number = used;
+
+	return 0;
+}
+EXPORT_SYMBOL(obd_page_dif_generate_buffer);
+
+static int __obd_t10_performance_test(const char *obd_name,
+				      enum cksum_type cksum_type,
+				      struct page *data_page,
+				      int repeat_number)
+{
+	unsigned char cfs_alg = cksum_obd2cfs(OBD_CKSUM_T10_TOP);
+	struct ahash_request *hdesc;
+	obd_dif_csum_fn *fn = NULL;
+	unsigned int bufsize;
+	unsigned char *buffer;
+	struct page *__page;
+	u16 *guard_start;
+	int guard_number;
+	int used_number = 0;
+	int sector_size = 0;
+	u32 cksum;
+	int rc = 0;
+	int rc2;
+	int used;
+	int i;
+
+	obd_t10_cksum2dif(cksum_type, &fn, &sector_size);
+	if (!fn)
+		return -EINVAL;
+
+	__page = alloc_page(GFP_KERNEL);
+	if (!__page)
+		return -ENOMEM;
+
+	hdesc = cfs_crypto_hash_init(cfs_alg, NULL, 0);
+	if (IS_ERR(hdesc)) {
+		rc = PTR_ERR(hdesc);
+		CERROR("%s: unable to initialize checksum hash %s: rc = %d\n",
+		       obd_name, cfs_crypto_hash_name(cfs_alg), rc);
+		goto out;
+	}
+
+	buffer = kmap(__page);
+	guard_start = (u16 *)buffer;
+	guard_number = PAGE_SIZE / sizeof(*guard_start);
+	for (i = 0; i < repeat_number; i++) {
+		/*
+		 * The left guard number should be able to hold checksums of a
+		 * whole page
+		 */
+		rc = obd_page_dif_generate_buffer(obd_name, data_page, 0,
+						  PAGE_SIZE,
+						  guard_start + used_number,
+						  guard_number - used_number,
+						  &used, sector_size, fn);
+		if (rc)
+			break;
+
+		used_number += used;
+		if (used_number == guard_number) {
+			cfs_crypto_hash_update_page(hdesc, __page, 0,
+				used_number * sizeof(*guard_start));
+			used_number = 0;
+		}
+	}
+	kunmap(__page);
+	if (rc)
+		goto out_final;
+
+	if (used_number != 0)
+		cfs_crypto_hash_update_page(hdesc, __page, 0,
+			used_number * sizeof(*guard_start));
+
+	bufsize = sizeof(cksum);
+out_final:
+	rc2 = cfs_crypto_hash_final(hdesc, (unsigned char *)&cksum, &bufsize);
+	rc = rc ? rc : rc2;
+out:
+	__free_page(__page);
+
+	return rc;
+}
+
+/**
+ *  Array of T10PI checksum algorithm speed in MByte per second
+ */
+static int obd_t10_cksum_speeds[OBD_T10_CKSUM_MAX];
+
+static enum obd_t10_cksum_type
+obd_t10_cksum2type(enum cksum_type cksum_type)
+{
+	switch (cksum_type) {
+	case OBD_CKSUM_T10IP512:
+		return OBD_T10_CKSUM_IP512;
+	case OBD_CKSUM_T10IP4K:
+		return OBD_T10_CKSUM_IP4K;
+	case OBD_CKSUM_T10CRC512:
+		return OBD_T10_CKSUM_CRC512;
+	case OBD_CKSUM_T10CRC4K:
+		return OBD_T10_CKSUM_CRC4K;
+	default:
+		return OBD_T10_CKSUM_UNKNOWN;
+	}
+}
+
+static const char *obd_t10_cksum_name(enum obd_t10_cksum_type index)
+{
+	DECLARE_CKSUM_NAME;
+
+	/* Need to skip "crc32", "adler", "crc32c", "reserved" */
+	return cksum_name[3 + index];
+}
+
+/**
+ * Compute the speed of specified T10PI checksum type
+ *
+ * Run a speed test on the given T10PI checksum on buffer using a 1MB buffer
+ * size. This is a reasonable buffer size for Lustre RPCs, even if the actual
+ * RPC size is larger or smaller.
+ *
+ * The speed is stored internally in the obd_t10_cksum_speeds[] array, and
+ * is available through the obd_t10_cksum_speed() function.
+ *
+ * This function needs to stay the same as cfs_crypto_performance_test() so
+ * that the speeds are comparable. And this function should reflect the real
+ * cost of the checksum calculation.
+ *
+ * \param[in] obd_name		name of the OBD device
+ * \param[in] cksum_type	checksum type (OBD_CKSUM_T10*)
+ */
+static void obd_t10_performance_test(const char *obd_name,
+				     enum cksum_type cksum_type)
+{
+	enum obd_t10_cksum_type index = obd_t10_cksum2type(cksum_type);
+	const int buf_len = max(PAGE_SIZE, 1048576UL);
+	unsigned long bcount;
+	unsigned long start;
+	unsigned long end;
+	struct page *page;
+	int rc = 0;
+	void *buf;
+
+	page = alloc_page(GFP_KERNEL);
+	if (!page) {
+		rc = -ENOMEM;
+		goto out;
+	}
+
+	buf = kmap(page);
+	memset(buf, 0xAD, PAGE_SIZE);
+	kunmap(page);
+
+	for (start = jiffies, end = start + msecs_to_jiffies(MSEC_PER_SEC / 4),
+	     bcount = 0; time_before(jiffies, end) && rc == 0; bcount++) {
+		rc = __obd_t10_performance_test(obd_name, cksum_type, page,
+						buf_len / PAGE_SIZE);
+		if (rc)
+			break;
+	}
+	end = jiffies;
+	__free_page(page);
+out:
+	if (rc) {
+		obd_t10_cksum_speeds[index] = rc;
+		CDEBUG(D_INFO,
+		       "%s: T10 checksum algorithm %s test error: rc = %d\n",
+		       obd_name, obd_t10_cksum_name(index), rc);
+	} else {
+		unsigned long tmp;
+
+		tmp = ((bcount * buf_len / jiffies_to_msecs(end - start)) *
+		       1000) / (1024 * 1024);
+		obd_t10_cksum_speeds[index] = (int)tmp;
+		CDEBUG(D_CONFIG,
+		       "%s: T10 checksum algorithm %s speed = %d MB/s\n",
+		       obd_name, obd_t10_cksum_name(index),
+		       obd_t10_cksum_speeds[index]);
+	}
+}
+
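The MB/s computation above packs several unit conversions into one expression; a standalone sketch (illustrative userspace code, not the kernel function) makes the arithmetic easy to check:

```c
#include <assert.h>

/* bcount passes over a buf_len-byte buffer in elapsed_ms milliseconds,
 * scaled to whole MB/s -- the same expression used in the patch above. */
static int speed_mb_per_s(unsigned long bcount, unsigned long buf_len,
			  unsigned long elapsed_ms)
{
	return (int)(((bcount * buf_len / elapsed_ms) * 1000) / (1024 * 1024));
}
```

For example, 250 passes over a 1 MB buffer in 250 ms works out to 1000 MB/s.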
+int obd_t10_cksum_speed(const char *obd_name,
+			enum cksum_type cksum_type)
+{
+	enum obd_t10_cksum_type index = obd_t10_cksum2type(cksum_type);
+
+	if (unlikely(obd_t10_cksum_speeds[index] == 0)) {
+		static DEFINE_MUTEX(obd_t10_cksum_speed_mutex);
+
+		mutex_lock(&obd_t10_cksum_speed_mutex);
+		if (obd_t10_cksum_speeds[index] == 0)
+			obd_t10_performance_test(obd_name, cksum_type);
+		mutex_unlock(&obd_t10_cksum_speed_mutex);
+	}
+
+	return obd_t10_cksum_speeds[index];
+}
+EXPORT_SYMBOL(obd_t10_cksum_speed);
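The fast-path-then-recheck locking in obd_t10_cksum_speed() is classic double-checked lazy initialization: the expensive benchmark runs at most once, and the common path never takes the mutex. A userspace sketch with a POSIX mutex (all names here are hypothetical stand-ins):

```c
#include <assert.h>
#include <pthread.h>

static int cksum_speed;		/* 0 means "not yet measured" */
static pthread_mutex_t speed_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the expensive benchmark; runs at most once. */
static int run_speed_test(void)
{
	return 1200;		/* pretend the algorithm does 1200 MB/s */
}

static int get_cksum_speed(void)
{
	if (cksum_speed == 0) {			/* unlocked fast path */
		pthread_mutex_lock(&speed_mutex);
		if (cksum_speed == 0)		/* re-check under the lock */
			cksum_speed = run_speed_test();
		pthread_mutex_unlock(&speed_mutex);
	}
	return cksum_speed;
}
```

The unlocked first check is safe here because a stale read of 0 only sends the caller through the locked slow path, where the value is checked again.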
diff --git a/fs/lustre/obdclass/obd_cksum.c b/fs/lustre/obdclass/obd_cksum.c
new file mode 100644
index 0000000..601feb7
--- /dev/null
+++ b/fs/lustre/obdclass/obd_cksum.c
@@ -0,0 +1,151 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * GPL HEADER START
+ *
+ * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 only,
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License version 2 for more details (a copy is included
+ * in the LICENSE file that accompanied this code).
+ *
+ * You should have received a copy of the GNU General Public License
+ * version 2 along with this program; If not, see
+ * http://www.gnu.org/licenses/gpl-2.0.html
+ *
+ * GPL HEADER END
+ */
+/*
+ * Copyright (c) 2018, DataDirect Networks Storage.
+ * Author: Li Xi.
+ *
+ * Checksum functions
+ */
+#include <obd_class.h>
+#include <obd_cksum.h>
+
+/* Server uses algos that perform at 50% or better of the Adler */
+enum cksum_type obd_cksum_types_supported_server(const char *obd_name)
+{
+	enum cksum_type ret = OBD_CKSUM_ADLER;
+	int base_speed;
+
+	CDEBUG(D_INFO,
+	       "%s: checksum speed: crc %d, crc32c %d, adler %d, t10ip512 %d, t10ip4k %d, t10crc512 %d, t10crc4k %d\n",
+	       obd_name,
+	       cfs_crypto_hash_speed(cksum_obd2cfs(OBD_CKSUM_CRC32)),
+	       cfs_crypto_hash_speed(cksum_obd2cfs(OBD_CKSUM_CRC32C)),
+	       cfs_crypto_hash_speed(cksum_obd2cfs(OBD_CKSUM_ADLER)),
+	       obd_t10_cksum_speed(obd_name, OBD_CKSUM_T10IP512),
+	       obd_t10_cksum_speed(obd_name, OBD_CKSUM_T10IP4K),
+	       obd_t10_cksum_speed(obd_name, OBD_CKSUM_T10CRC512),
+	       obd_t10_cksum_speed(obd_name, OBD_CKSUM_T10CRC4K));
+
+	base_speed = cfs_crypto_hash_speed(cksum_obd2cfs(OBD_CKSUM_ADLER)) / 2;
+
+	if (cfs_crypto_hash_speed(cksum_obd2cfs(OBD_CKSUM_CRC32C)) >=
+	    base_speed)
+		ret |= OBD_CKSUM_CRC32C;
+
+	if (cfs_crypto_hash_speed(cksum_obd2cfs(OBD_CKSUM_CRC32)) >=
+	    base_speed)
+		ret |= OBD_CKSUM_CRC32;
+
+	if (obd_t10_cksum_speed(obd_name, OBD_CKSUM_T10IP512) >= base_speed)
+		ret |= OBD_CKSUM_T10IP512;
+
+	if (obd_t10_cksum_speed(obd_name, OBD_CKSUM_T10IP4K) >= base_speed)
+		ret |= OBD_CKSUM_T10IP4K;
+
+	if (obd_t10_cksum_speed(obd_name, OBD_CKSUM_T10CRC512) >= base_speed)
+		ret |= OBD_CKSUM_T10CRC512;
+
+	if (obd_t10_cksum_speed(obd_name, OBD_CKSUM_T10CRC4K) >= base_speed)
+		ret |= OBD_CKSUM_T10CRC4K;
+
+	return ret;
+}
+EXPORT_SYMBOL(obd_cksum_types_supported_server);
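The server-side rule above — ADLER is always offered, and any other algorithm qualifies if it benchmarks at 50% or better of the ADLER speed — can be sketched with stand-in speed values (the real code reads measured speeds from the crypto layer):

```c
#include <assert.h>

enum { CK_CRC32 = 1, CK_ADLER = 2, CK_CRC32C = 4 };

/* Advertise ADLER unconditionally, plus any algorithm running at
 * >= 50% of the ADLER speed.  Speeds are illustrative MB/s figures. */
static unsigned int server_supported(int crc32_mbs, int adler_mbs,
				     int crc32c_mbs)
{
	unsigned int ret = CK_ADLER;
	int base = adler_mbs / 2;

	if (crc32c_mbs >= base)
		ret |= CK_CRC32C;
	if (crc32_mbs >= base)
		ret |= CK_CRC32;
	return ret;
}
```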
+
+/* The OBD_FL_CKSUM_* flag is packed into 5 bits of o_flags, since there can
+ * only be a single checksum type per RPC.
+ *
+ * The OBD_CKSUM_* type bits passed in ocd_cksum_types are a 32-bit bitmask
+ * since they need to represent the full range of checksum algorithms that
+ * both the client and server can understand.
+ *
+ * In case of unsupported types/flags we fall back to ADLER
+ * because that is supported by all clients since 1.8
+ *
+ * In case multiple algorithms are supported the best one is used.
+ */
+u32 obd_cksum_type_pack(const char *obd_name, enum cksum_type cksum_type)
+{
+	unsigned int performance = 0, tmp;
+	u32 flag = OBD_FL_CKSUM_ADLER;
+
+	if (cksum_type & OBD_CKSUM_CRC32) {
+		tmp = cfs_crypto_hash_speed(cksum_obd2cfs(OBD_CKSUM_CRC32));
+		if (tmp > performance) {
+			performance = tmp;
+			flag = OBD_FL_CKSUM_CRC32;
+		}
+	}
+	if (cksum_type & OBD_CKSUM_CRC32C) {
+		tmp = cfs_crypto_hash_speed(cksum_obd2cfs(OBD_CKSUM_CRC32C));
+		if (tmp > performance) {
+			performance = tmp;
+			flag = OBD_FL_CKSUM_CRC32C;
+		}
+	}
+	if (cksum_type & OBD_CKSUM_ADLER) {
+		tmp = cfs_crypto_hash_speed(cksum_obd2cfs(OBD_CKSUM_ADLER));
+		if (tmp > performance) {
+			performance = tmp;
+			flag = OBD_FL_CKSUM_ADLER;
+		}
+	}
+
+	if (cksum_type & OBD_CKSUM_T10IP512) {
+		tmp = obd_t10_cksum_speed(obd_name, OBD_CKSUM_T10IP512);
+		if (tmp > performance) {
+			performance = tmp;
+			flag = OBD_FL_CKSUM_T10IP512;
+		}
+	}
+
+	if (cksum_type & OBD_CKSUM_T10IP4K) {
+		tmp = obd_t10_cksum_speed(obd_name, OBD_CKSUM_T10IP4K);
+		if (tmp > performance) {
+			performance = tmp;
+			flag = OBD_FL_CKSUM_T10IP4K;
+		}
+	}
+
+	if (cksum_type & OBD_CKSUM_T10CRC512) {
+		tmp = obd_t10_cksum_speed(obd_name, OBD_CKSUM_T10CRC512);
+		if (tmp > performance) {
+			performance = tmp;
+			flag = OBD_FL_CKSUM_T10CRC512;
+		}
+	}
+
+	if (cksum_type & OBD_CKSUM_T10CRC4K) {
+		tmp = obd_t10_cksum_speed(obd_name, OBD_CKSUM_T10CRC4K);
+		if (tmp > performance) {
+			performance = tmp;
+			flag = OBD_FL_CKSUM_T10CRC4K;
+		}
+	}
+
+	if (unlikely(cksum_type && !(cksum_type & OBD_CKSUM_ALL)))
+		CWARN("%s: unknown cksum type %x\n", obd_name, cksum_type);
+
+	return flag;
+}
+EXPORT_SYMBOL(obd_cksum_type_pack);
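The packing logic above walks the offered types and keeps the flag of the fastest one, defaulting to ADLER. A minimal sketch with three of the types and illustrative speed figures (the kernel code reads real measurements):

```c
#include <assert.h>

enum cksum_type { CK_CRC32 = 1, CK_ADLER = 2, CK_CRC32C = 4 };

/* Illustrative MB/s figures; the real code reads measured speeds. */
static int speed_of(enum cksum_type t)
{
	switch (t) {
	case CK_CRC32:	return 800;
	case CK_ADLER:	return 1000;
	case CK_CRC32C:	return 1800;
	default:	return 0;
	}
}

/* Among the types offered in mask, pick the fastest; fall back to
 * ADLER, which every client has supported since 1.8. */
static enum cksum_type pick_fastest(unsigned int mask)
{
	enum cksum_type all[] = { CK_CRC32, CK_ADLER, CK_CRC32C };
	enum cksum_type best = CK_ADLER;
	int best_speed = 0;
	unsigned int i;

	for (i = 0; i < sizeof(all) / sizeof(all[0]); i++) {
		if ((mask & all[i]) && speed_of(all[i]) > best_speed) {
			best_speed = speed_of(all[i]);
			best = all[i];
		}
	}
	return best;
}
```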
diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index c430239..9ac9c84 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -1030,6 +1030,105 @@ static inline int can_merge_pages(struct brw_page *p1, struct brw_page *p2)
 	return (p1->off + p1->count == p2->off);
 }
 
+static int osc_checksum_bulk_t10pi(const char *obd_name, int nob,
+				   size_t pg_count, struct brw_page **pga,
+				   int opc, obd_dif_csum_fn *fn,
+				   int sector_size,
+				   u32 *check_sum)
+{
+	struct ahash_request *hdesc;
+	/* Use Adler as the default checksum type on top of DIF tags */
+	unsigned char cfs_alg = cksum_obd2cfs(OBD_CKSUM_T10_TOP);
+	struct page *__page;
+	unsigned char *buffer;
+	u16 *guard_start;
+	unsigned int bufsize;
+	int guard_number;
+	int used_number = 0;
+	int used;
+	u32 cksum;
+	int rc = 0;
+	int i = 0;
+
+	LASSERT(pg_count > 0);
+
+	__page = alloc_page(GFP_KERNEL);
+	if (!__page)
+		return -ENOMEM;
+
+	hdesc = cfs_crypto_hash_init(cfs_alg, NULL, 0);
+	if (IS_ERR(hdesc)) {
+		rc = PTR_ERR(hdesc);
+		CERROR("%s: unable to initialize checksum hash %s: rc = %d\n",
+		       obd_name, cfs_crypto_hash_name(cfs_alg), rc);
+		goto out;
+	}
+
+	buffer = kmap(__page);
+	guard_start = (u16 *)buffer;
+	guard_number = PAGE_SIZE / sizeof(*guard_start);
+	while (nob > 0 && pg_count > 0) {
+		unsigned int count = pga[i]->count > nob ? nob : pga[i]->count;
+
+		/* corrupt the data before we compute the checksum, to
+		 * simulate an OST->client data error
+		 */
+		if (unlikely(i == 0 && opc == OST_READ &&
+			     OBD_FAIL_CHECK(OBD_FAIL_OSC_CHECKSUM_RECEIVE))) {
+			unsigned char *ptr = kmap(pga[i]->pg);
+			int off = pga[i]->off & ~PAGE_MASK;
+
+			memcpy(ptr + off, "bad1", min_t(typeof(nob), 4, nob));
+			kunmap(pga[i]->pg);
+		}
+
+		/*
+		 * The remaining guard slots should be able to hold the
+		 * checksums of a whole page
+		 */
+		rc = obd_page_dif_generate_buffer(obd_name, pga[i]->pg, 0,
+						  count,
+						  guard_start + used_number,
+						  guard_number - used_number,
+						  &used, sector_size,
+						  fn);
+		if (rc)
+			break;
+
+		used_number += used;
+		if (used_number == guard_number) {
+			cfs_crypto_hash_update_page(hdesc, __page, 0,
+				used_number * sizeof(*guard_start));
+			used_number = 0;
+		}
+
+		nob -= pga[i]->count;
+		pg_count--;
+		i++;
+	}
+	kunmap(__page);
+	if (rc)
+		goto out;
+
+	if (used_number != 0)
+		cfs_crypto_hash_update_page(hdesc, __page, 0,
+			used_number * sizeof(*guard_start));
+
+	bufsize = sizeof(cksum);
+	cfs_crypto_hash_final(hdesc, (unsigned char *)&cksum, &bufsize);
+
+	/* For sending we only compute the wrong checksum instead
+	 * of corrupting the data so it is still correct on a redo
+	 */
+	if (opc == OST_WRITE && OBD_FAIL_CHECK(OBD_FAIL_OSC_CHECKSUM_SEND))
+		cksum++;
+
+	*check_sum = cksum;
+out:
+	__free_page(__page);
+	return rc;
+}
+
 static int osc_checksum_bulk(int nob, u32 pg_count,
 			     struct brw_page **pga, int opc,
 			     enum cksum_type cksum_type,
@@ -1090,6 +1189,28 @@ static int osc_checksum_bulk(int nob, u32 pg_count,
 	return 0;
 }
 
+static int osc_checksum_bulk_rw(const char *obd_name,
+				enum cksum_type cksum_type,
+				int nob, size_t pg_count,
+				struct brw_page **pga, int opc,
+				u32 *check_sum)
+{
+	obd_dif_csum_fn *fn = NULL;
+	int sector_size = 0;
+	int rc;
+
+	obd_t10_cksum2dif(cksum_type, &fn, &sector_size);
+
+	if (fn)
+		rc = osc_checksum_bulk_t10pi(obd_name, nob, pg_count, pga,
+					     opc, fn, sector_size, check_sum);
+	else
+		rc = osc_checksum_bulk(nob, pg_count, pga, opc, cksum_type,
+				       check_sum);
+
+	return rc;
+}
+
 static int osc_brw_prep_request(int cmd, struct client_obd *cli,
 				struct obdo *oa, u32 page_count,
 				struct brw_page **pga,
@@ -1107,6 +1228,7 @@ static int osc_brw_prep_request(int cmd, struct client_obd *cli,
 	struct req_capsule *pill;
 	struct brw_page *pg_prev;
 	void *short_io_buf;
+	const char *obd_name = cli->cl_import->imp_obd->obd_name;
 
 	if (OBD_FAIL_CHECK(OBD_FAIL_OSC_BRW_PREP_REQ))
 		return -ENOMEM; /* Recoverable */
@@ -1306,12 +1428,14 @@ static int osc_brw_prep_request(int cmd, struct client_obd *cli,
 			if ((body->oa.o_valid & OBD_MD_FLFLAGS) == 0)
 				body->oa.o_flags = 0;
 
-			body->oa.o_flags |= cksum_type_pack(cksum_type);
+			body->oa.o_flags |= obd_cksum_type_pack(obd_name,
+								cksum_type);
 			body->oa.o_valid |= OBD_MD_FLCKSUM | OBD_MD_FLFLAGS;
 
-			rc = osc_checksum_bulk(requested_nob, page_count,
-					       pga, OST_WRITE, cksum_type,
-					       &body->oa.o_cksum);
+			rc = osc_checksum_bulk_rw(obd_name, cksum_type,
+						  requested_nob, page_count,
+						  pga, OST_WRITE,
+						  &body->oa.o_cksum);
 			if (rc < 0) {
 				CDEBUG(D_PAGE, "failed to checksum, rc = %d\n",
 				       rc);
@@ -1322,7 +1446,8 @@ static int osc_brw_prep_request(int cmd, struct client_obd *cli,
 
 			/* save this in 'oa', too, for later checking */
 			oa->o_valid |= OBD_MD_FLCKSUM | OBD_MD_FLFLAGS;
-			oa->o_flags |= cksum_type_pack(cksum_type);
+			oa->o_flags |= obd_cksum_type_pack(obd_name,
+							   cksum_type);
 		} else {
 			/* clear out the checksum flag, in case this is a
 			 * resend but cl_checksum is no longer set. b=11238
@@ -1338,7 +1463,8 @@ static int osc_brw_prep_request(int cmd, struct client_obd *cli,
 		    !sptlrpc_flavor_has_bulk(&req->rq_flvr)) {
 			if ((body->oa.o_valid & OBD_MD_FLFLAGS) == 0)
 				body->oa.o_flags = 0;
-			body->oa.o_flags |= cksum_type_pack(cli->cl_cksum_type);
+			body->oa.o_flags |= obd_cksum_type_pack(obd_name,
+				cli->cl_cksum_type);
 			body->oa.o_valid |= OBD_MD_FLCKSUM | OBD_MD_FLFLAGS;
 		}
 
@@ -1441,6 +1567,10 @@ static int check_write_checksum(struct obdo *oa,
 				u32 client_cksum, u32 server_cksum,
 				struct osc_brw_async_args *aa)
 {
+	const char *obd_name = aa->aa_cli->cl_import->imp_obd->obd_name;
+	obd_dif_csum_fn *fn = NULL;
+	int sector_size = 0;
+	bool t10pi = false;
 	u32 new_cksum;
 	char *msg;
 	enum cksum_type cksum_type;
@@ -1455,15 +1585,50 @@ static int check_write_checksum(struct obdo *oa,
 		dump_all_bulk_pages(oa, aa->aa_page_count, aa->aa_ppga,
 				    server_cksum, client_cksum);
 
-	cksum_type = cksum_type_unpack(oa->o_valid & OBD_MD_FLFLAGS ?
-				       oa->o_flags : 0);
-	rc = osc_checksum_bulk(aa->aa_requested_nob, aa->aa_page_count,
-			       aa->aa_ppga, OST_WRITE, cksum_type,
-			       &new_cksum);
+	cksum_type = obd_cksum_type_unpack(oa->o_valid & OBD_MD_FLFLAGS ?
+					   oa->o_flags : 0);
+
+	switch (cksum_type) {
+	case OBD_CKSUM_T10IP512:
+		t10pi = true;
+		fn = obd_dif_ip_fn;
+		sector_size = 512;
+		break;
+	case OBD_CKSUM_T10IP4K:
+		t10pi = true;
+		fn = obd_dif_ip_fn;
+		sector_size = 4096;
+		break;
+	case OBD_CKSUM_T10CRC512:
+		t10pi = true;
+		fn = obd_dif_crc_fn;
+		sector_size = 512;
+		break;
+	case OBD_CKSUM_T10CRC4K:
+		t10pi = true;
+		fn = obd_dif_crc_fn;
+		sector_size = 4096;
+		break;
+	default:
+		break;
+	}
+
+	if (t10pi)
+		rc = osc_checksum_bulk_t10pi(obd_name, aa->aa_requested_nob,
+					     aa->aa_page_count,
+					     aa->aa_ppga,
+					     OST_WRITE,
+					     fn,
+					     sector_size,
+					     &new_cksum);
+	else
+		rc = osc_checksum_bulk(aa->aa_requested_nob, aa->aa_page_count,
+				       aa->aa_ppga, OST_WRITE, cksum_type,
+				       &new_cksum);
 
 	if (rc < 0)
 		msg = "failed to calculate the client write checksum";
-	else if (cksum_type != cksum_type_unpack(aa->aa_oa->o_flags))
+	else if (cksum_type != obd_cksum_type_unpack(aa->aa_oa->o_flags))
 		msg = "the server did not use the checksum type specified in the original request - likely a protocol problem";
 	else if (new_cksum == server_cksum)
 		msg = "changed on the client after we checksummed it - likely false positive due to mmap IO (bug 11742)";
@@ -1474,15 +1639,15 @@ static int check_write_checksum(struct obdo *oa,
 
 	LCONSOLE_ERROR_MSG(0x132,
 			   "%s: BAD WRITE CHECKSUM: %s: from %s inode " DFID " object " DOSTID " extent [%llu-%llu], original client csum %x (type %x), server csum %x (type %x), client csum now %x\n",
-			   aa->aa_cli->cl_import->imp_obd->obd_name,
-			   msg, libcfs_nid2str(peer->nid),
+			   obd_name, msg, libcfs_nid2str(peer->nid),
 			   oa->o_valid & OBD_MD_FLFID ? oa->o_parent_seq : (u64)0,
 			   oa->o_valid & OBD_MD_FLFID ? oa->o_parent_oid : 0,
 			   oa->o_valid & OBD_MD_FLFID ? oa->o_parent_ver : 0,
 			   POSTID(&oa->o_oi), aa->aa_ppga[0]->off,
 			   aa->aa_ppga[aa->aa_page_count - 1]->off +
 			   aa->aa_ppga[aa->aa_page_count - 1]->count - 1,
-			   client_cksum, cksum_type_unpack(aa->aa_oa->o_flags),
+			   client_cksum,
+			   obd_cksum_type_unpack(aa->aa_oa->o_flags),
 			   server_cksum, cksum_type, new_cksum);
 
 	return 1;
@@ -1495,6 +1660,7 @@ static int osc_brw_fini_request(struct ptlrpc_request *req, int rc)
 	const struct lnet_process_id *peer =
 			&req->rq_import->imp_connection->c_peer;
 	struct client_obd *cli = aa->aa_cli;
+	const char *obd_name = cli->cl_import->imp_obd->obd_name;
 	struct ost_body *body;
 	u32 client_cksum = 0;
 
@@ -1619,17 +1785,17 @@ static int osc_brw_fini_request(struct ptlrpc_request *req, int rc)
 		char *via = "";
 		char *router = "";
 		enum cksum_type cksum_type;
+		u32 o_flags = body->oa.o_valid & OBD_MD_FLFLAGS ?
+			body->oa.o_flags : 0;
 
-		cksum_type = cksum_type_unpack(body->oa.o_valid & OBD_MD_FLFLAGS ?
-					       body->oa.o_flags : 0);
+		cksum_type = obd_cksum_type_unpack(o_flags);
 
-		rc = osc_checksum_bulk(rc, aa->aa_page_count, aa->aa_ppga,
-				       OST_READ, cksum_type, &client_cksum);
-		if (rc < 0) {
-			CDEBUG(D_PAGE,
-			       "failed to calculate checksum, rc = %d\n", rc);
+		rc = osc_checksum_bulk_rw(obd_name, cksum_type, rc,
+					  aa->aa_page_count, aa->aa_ppga,
+					  OST_READ, &client_cksum);
+		if (rc < 0)
 			goto out;
-		}
+
 		if (req->rq_bulk &&
 		    peer->nid != req->rq_bulk->bd_sender) {
 			via = " via ";
@@ -1652,7 +1818,7 @@ static int osc_brw_fini_request(struct ptlrpc_request *req, int rc)
 				"%s: BAD READ CHECKSUM: from %s%s%s inode " DFID
 				" object " DOSTID
 				" extent [%llu-%llu], client %x, server %x, cksum_type %x\n",
-				req->rq_import->imp_obd->obd_name,
+				obd_name,
 				libcfs_nid2str(peer->nid),
 				via, router,
 				clbody->oa.o_valid & OBD_MD_FLFID ?
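The switch in check_write_checksum() that maps a T10 checksum type to its guard-tag function and sector size is effectively a small lookup table. A sketch with shortened enum names and the guard function represented by a string (both are illustrative, not the kernel types):

```c
#include <assert.h>
#include <stddef.h>

enum t10_type { T10IP512, T10IP4K, T10CRC512, T10CRC4K, T10_NONE };

struct t10_params {
	int sector_size;
	const char *guard_fn;	/* "ip" or "crc" guard-tag function */
};

/* Mirrors the switch in check_write_checksum(): each T10 type fixes
 * both the guard-tag function and the sector size it covers. */
static struct t10_params t10_lookup(enum t10_type t)
{
	switch (t) {
	case T10IP512:	return (struct t10_params){ 512, "ip" };
	case T10IP4K:	return (struct t10_params){ 4096, "ip" };
	case T10CRC512:	return (struct t10_params){ 512, "crc" };
	case T10CRC4K:	return (struct t10_params){ 4096, "crc" };
	default:	return (struct t10_params){ 0, NULL };
	}
}
```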
diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index 5d6546d..019648b 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -786,11 +786,12 @@ static int ptlrpc_connect_set_flags(struct obd_import *imp,
 		 * for algorithms we understand. The server masked off
 		 * the checksum types it doesn't support
 		 */
-		if (!(ocd->ocd_cksum_types & cksum_types_supported_client())) {
+		if (!(ocd->ocd_cksum_types &
+		      obd_cksum_types_supported_client())) {
 			LCONSOLE_ERROR("The negotiation of the checksum algorithm to use with server %s failed (%x/%x), disabling checksums\n",
 				      obd2cli_tgt(imp->imp_obd),
 				      ocd->ocd_cksum_types,
-				      cksum_types_supported_client());
+				      obd_cksum_types_supported_client());
 			return -EPROTO;
 		}
 		cli->cl_supp_cksum_types = ocd->ocd_cksum_types;
@@ -801,7 +802,8 @@ static int ptlrpc_connect_set_flags(struct obd_import *imp,
 		 */
 		cli->cl_supp_cksum_types = OBD_CKSUM_ADLER;
 	}
-	cli->cl_cksum_type = cksum_type_select(cli->cl_supp_cksum_types);
+	cli->cl_cksum_type = obd_cksum_type_select(imp->imp_obd->obd_name,
+						   cli->cl_supp_cksum_types);
 
 	if (ocd->ocd_connect_flags & OBD_CONNECT_BRW_SIZE)
 		cli->cl_max_pages_per_rpc =
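The connect-time negotiation in the import.c hunk reduces to a mask intersection: the server returns the checksum types it supports masked against what the client offered, and an empty intersection is a protocol error. A hedged userspace sketch (names and the error constant are stand-ins):

```c
#include <assert.h>

#define MY_EPROTO 71	/* stand-in for the kernel's EPROTO */

/* server_mask: types the server left enabled in ocd_cksum_types;
 * client_mask: types this client understands.  On success the agreed
 * set is the intersection of the two. */
static int negotiate_cksum(unsigned int server_mask,
			   unsigned int client_mask,
			   unsigned int *agreed)
{
	if (!(server_mask & client_mask))
		return -MY_EPROTO;	/* no common algorithm: refuse */
	*agreed = server_mask & client_mask;
	return 0;
}
```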
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index c566dea..01ddbee 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1123,6 +1123,18 @@ void lustre_assert_wire_constants(void)
 		 (unsigned int)OBD_CKSUM_ADLER);
 	LASSERTF(OBD_CKSUM_CRC32C == 0x00000004UL, "found 0x%.8xUL\n",
 		 (unsigned int)OBD_CKSUM_CRC32C);
+	LASSERTF(OBD_CKSUM_RESERVED == 0x00000008UL, "found 0x%.8xUL\n",
+		(unsigned int)OBD_CKSUM_RESERVED);
+	LASSERTF(OBD_CKSUM_T10IP512 == 0x00000010UL, "found 0x%.8xUL\n",
+		(unsigned int)OBD_CKSUM_T10IP512);
+	LASSERTF(OBD_CKSUM_T10IP4K == 0x00000020UL, "found 0x%.8xUL\n",
+		(unsigned int)OBD_CKSUM_T10IP4K);
+	LASSERTF(OBD_CKSUM_T10CRC512 == 0x00000040UL, "found 0x%.8xUL\n",
+		(unsigned int)OBD_CKSUM_T10CRC512);
+	LASSERTF(OBD_CKSUM_T10CRC4K == 0x00000080UL, "found 0x%.8xUL\n",
+		(unsigned int)OBD_CKSUM_T10CRC4K);
+	LASSERTF(OBD_CKSUM_T10_TOP == 0x00000002UL, "found 0x%.8xUL\n",
+		(unsigned int)OBD_CKSUM_T10_TOP);
 
 	/* Checks for struct ost_layout */
 	LASSERTF((int)sizeof(struct ost_layout) == 28, "found %lld\n",
@@ -1372,7 +1384,10 @@ void lustre_assert_wire_constants(void)
 	BUILD_BUG_ON(OBD_FL_CKSUM_CRC32 != 0x00001000);
 	BUILD_BUG_ON(OBD_FL_CKSUM_ADLER != 0x00002000);
 	BUILD_BUG_ON(OBD_FL_CKSUM_CRC32C != 0x00004000);
-	BUILD_BUG_ON(OBD_FL_CKSUM_RSVD2 != 0x00008000);
+	BUILD_BUG_ON(OBD_FL_CKSUM_T10IP512 != 0x00005000);
+	BUILD_BUG_ON(OBD_FL_CKSUM_T10IP4K != 0x00006000);
+	BUILD_BUG_ON(OBD_FL_CKSUM_T10CRC512 != 0x00007000);
+	BUILD_BUG_ON(OBD_FL_CKSUM_T10CRC4K != 0x00008000);
 	BUILD_BUG_ON(OBD_FL_CKSUM_RSVD3 != 0x00010000);
 	BUILD_BUG_ON(OBD_FL_SHRINK_GRANT != 0x00020000);
 	BUILD_BUG_ON(OBD_FL_MMAP != 0x00040000);
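The LASSERTF checks above pin wire-protocol constants at module load; in plain C11 the same pinning can be done at compile time with _Static_assert (values copied from the OBD_CKSUM_* enum in lustre_idl.h):

```c
#include <assert.h>

enum {
	CK_T10IP512	= 0x00000010,
	CK_T10IP4K	= 0x00000020,
	CK_T10CRC512	= 0x00000040,
	CK_T10CRC4K	= 0x00000080,
};

/* A change to any wire value now fails the build, not the boot. */
_Static_assert(CK_T10IP512 == 0x10, "T10IP512 wire value changed");
_Static_assert(CK_T10CRC4K == 0x80, "T10CRC4K wire value changed");
```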
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 7cf7307..11df7b4 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -883,15 +883,37 @@ struct obd_connect_data {
 /*
  * Supported checksum algorithms. Up to 32 checksum types are supported.
  * (32-bit mask stored in obd_connect_data::ocd_cksum_types)
- * Please update DECLARE_CKSUM_NAME/OBD_CKSUM_ALL in obd.h when adding a new
- * algorithm and also the OBD_FL_CKSUM* flags.
+ * Please update DECLARE_CKSUM_NAME in obd_cksum.h when adding a new
+ * algorithm and also the OBD_FL_CKSUM* flags, OBD_CKSUM_ALL flag,
+ * OBD_FL_CKSUM_ALL flag and potentially OBD_CKSUM_T10_ALL flag.
  */
 enum cksum_type {
-	OBD_CKSUM_CRC32  = 0x00000001,
-	OBD_CKSUM_ADLER  = 0x00000002,
-	OBD_CKSUM_CRC32C = 0x00000004,
+	OBD_CKSUM_CRC32		= 0x00000001,
+	OBD_CKSUM_ADLER		= 0x00000002,
+	OBD_CKSUM_CRC32C	= 0x00000004,
+	OBD_CKSUM_RESERVED	= 0x00000008,
+	OBD_CKSUM_T10IP512	= 0x00000010,
+	OBD_CKSUM_T10IP4K	= 0x00000020,
+	OBD_CKSUM_T10CRC512	= 0x00000040,
+	OBD_CKSUM_T10CRC4K	= 0x00000080,
 };
 
+#define OBD_CKSUM_T10_ALL (OBD_CKSUM_T10IP512 | OBD_CKSUM_T10IP4K | \
+	OBD_CKSUM_T10CRC512 | OBD_CKSUM_T10CRC4K)
+
+#define OBD_CKSUM_ALL (OBD_CKSUM_CRC32 | OBD_CKSUM_ADLER | OBD_CKSUM_CRC32C | \
+		       OBD_CKSUM_T10_ALL)
+
+/*
+ * The default checksum algorithm used on top of T10PI GRD tags for RPC.
+ * Considering that the checksum-of-checksums is only computing CRC32 on a
+ * 4KB chunk of GRD tags for a 1MB RPC for 512B sectors, or 16KB of GRD
+ * tags for 16MB of 4KB sectors, this is only 1/256 or 1/1024 of the
+ * total data being checksummed, so the checksum type used here should not
+ * affect overall system performance noticeably.
+ */
+#define OBD_CKSUM_T10_TOP OBD_CKSUM_ADLER
+
 /*
  *   OST requests: OBDO & OBD request records
  */
@@ -940,7 +962,10 @@ enum obdo_flags {
 	OBD_FL_CKSUM_CRC32	= 0x00001000, /* CRC32 checksum type */
 	OBD_FL_CKSUM_ADLER	= 0x00002000, /* ADLER checksum type */
 	OBD_FL_CKSUM_CRC32C	= 0x00004000, /* CRC32C checksum type */
-	OBD_FL_CKSUM_RSVD2	= 0x00008000, /* for future cksum types */
+	OBD_FL_CKSUM_T10IP512	= 0x00005000, /* T10PI IP cksum, 512B sector */
+	OBD_FL_CKSUM_T10IP4K	= 0x00006000, /* T10PI IP cksum, 4KB sector */
+	OBD_FL_CKSUM_T10CRC512	= 0x00007000, /* T10PI CRC cksum, 512B sector */
+	OBD_FL_CKSUM_T10CRC4K	= 0x00008000, /* T10PI CRC cksum, 4KB sector */
 	OBD_FL_CKSUM_RSVD3	= 0x00010000, /* for future cksum types */
 	OBD_FL_SHRINK_GRANT	= 0x00020000, /* object shrink the grant */
 	OBD_FL_MMAP		= 0x00040000, /* object is mmapped on the client.
@@ -953,11 +978,16 @@ enum obdo_flags {
 	OBD_FL_SHORT_IO		= 0x00400000, /* short io request */
 	/* OBD_FL_LOCAL_MASK = 0xF0000000, was local-only flags until 2.10 */
 
-	/* Note that while these checksum values are currently separate bits,
-	 * in 2.x we can actually allow all values from 1-31 if we wanted.
+	/*
+	 * Note that while the original checksum values were separate bits,
+	 * in 2.x we can actually allow all values from 1-31. T10-PI checksum
+	 * types already use values which are not separate bits.
 	 */
 	OBD_FL_CKSUM_ALL	= (OBD_FL_CKSUM_CRC32 | OBD_FL_CKSUM_ADLER |
-				   OBD_FL_CKSUM_CRC32C),
+				   OBD_FL_CKSUM_CRC32C | OBD_FL_CKSUM_T10IP512 |
+				   OBD_FL_CKSUM_T10IP4K |
+				   OBD_FL_CKSUM_T10CRC512 |
+				   OBD_FL_CKSUM_T10CRC4K),
 };
 
 /*
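The 1/256 figure in the OBD_CKSUM_T10_TOP comment above is easy to verify: a 1 MB RPC of 512-byte sectors carries 2048 two-byte guard tags, i.e. 4 KB of tag data. A quick check:

```c
#include <assert.h>

/* Bytes of 16-bit T10 guard tags covering an rpc_bytes transfer made
 * of sector_size-byte sectors. */
static unsigned long guard_tag_bytes(unsigned long rpc_bytes,
				     unsigned int sector_size)
{
	return (rpc_bytes / sector_size) * sizeof(unsigned short);
}
```

So for 512-byte sectors the checksum-of-checksums only covers 1/256 of the data actually being transferred.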
diff --git a/net/lnet/libcfs/linux-crypto.c b/net/lnet/libcfs/linux-crypto.c
index 53285c2..532fab4 100644
--- a/net/lnet/libcfs/linux-crypto.c
+++ b/net/lnet/libcfs/linux-crypto.c
@@ -318,6 +318,9 @@ int cfs_crypto_hash_final(struct ahash_request *req,
  * The speed is stored internally in the cfs_crypto_hash_speeds[] array, and
  * is available through the cfs_crypto_hash_speed() function.
  *
+ * This function needs to stay the same as obd_t10_performance_test() so that
+ * the speeds are comparable.
+ *
  * @hash_alg	hash algorithm id (CFS_HASH_ALG_*)
  * @buf		data buffer on which to compute the hash
  * @buf_len	length of @buf on which to compute hash
-- 
1.8.3.1

* [lustre-devel] [PATCH 051/622] lustre: ldlm: Reduce debug to console during eviction
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

During an eviction, Lustre calls ldlm_namespace_cleanup,
and it will sometimes end up dumping all of the locks on a
particular resource to the console log
(ldlm_resource_complain), which is very wasteful and only
rarely helpful.

Move the debug level for this to D_NETERROR since it is in the
default debug mask.

Cray-bug-id: LUS-1418
WC-bug-id: https://jira.whamcloud.com/browse/LU-10648
Lustre-commit: f92fcb863cb9 ("LU-10648 ldlm: Reduce debug to console during eviction")
Signed-off-by: Chris Horn <hornc@cray.com>
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/31237
Reviewed-by: Sergey Cheremencev <c17829@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_resource.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/ldlm/ldlm_resource.c b/fs/lustre/ldlm/ldlm_resource.c
index 7fe8a8b..5d73132 100644
--- a/fs/lustre/ldlm/ldlm_resource.c
+++ b/fs/lustre/ldlm/ldlm_resource.c
@@ -819,7 +819,8 @@ static int ldlm_resource_complain(struct cfs_hash *hs, struct cfs_hash_bd *bd,
 	       ldlm_ns_name(ldlm_res_to_ns(res)), PLDLMRES(res), res,
 	       atomic_read(&res->lr_refcount) - 1);
 
-	ldlm_resource_dump(D_ERROR, res);
+	/* Use D_NETERROR since it is in the default mask */
+	ldlm_resource_dump(D_NETERROR, res);
 	unlock_res(res);
 	return 0;
 }
-- 
1.8.3.1

* [lustre-devel] [PATCH 052/622] lustre: ptlrpc: idle connections can disconnect
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Alex Zhuravlev <bzzz@whamcloud.com>

 - when a new request is being allocated, ptlrpc initiates
   the connection if it is not connected yet
 - if the import is idle (no locks, no active RPCs, no
   non-PING reply for last osc_idle_timeout seconds),
   then pinger tries to disconnect asynchronously
 - currently only client-to-OST connections can be idle
 - lctl set_param osc.*.idle_timeout=N controls the new feature:
   N=0 - disable
   N>0 - seconds to idle before disconnect
 - lctl set_param osc.*.idle_connect=N to reconnect if idle
   (N is positive number)
 - the OSC module parameter osc_idle_timeout controls the default
   idle timeout and is set to 20 seconds by default

WC-bug-id: https://jira.whamcloud.com/browse/LU-7236
Lustre-commit: 5a6ceb664f07 ("LU-7236 ptlrpc: idle connections can disconnect")
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/16682
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_import.h |  17 +++--
 fs/lustre/include/lustre_net.h    |   1 +
 fs/lustre/lov/lov_ea.c            |   3 +-
 fs/lustre/lov/lov_obd.c           |   8 ++-
 fs/lustre/lov/lov_request.c       |  25 ++++++--
 fs/lustre/osc/lproc_osc.c         |  66 +++++++++++++++++++
 fs/lustre/osc/osc_request.c       |   3 +
 fs/lustre/ptlrpc/client.c         |  32 +++++++++-
 fs/lustre/ptlrpc/events.c         |   3 +-
 fs/lustre/ptlrpc/import.c         | 130 ++++++++++++++++++++++++++++++--------
 fs/lustre/ptlrpc/pinger.c         |  30 +++++++++
 11 files changed, 275 insertions(+), 43 deletions(-)

diff --git a/fs/lustre/include/lustre_import.h b/fs/lustre/include/lustre_import.h
index 0d7bb0f..c4452e1 100644
--- a/fs/lustre/include/lustre_import.h
+++ b/fs/lustre/include/lustre_import.h
@@ -96,6 +96,8 @@ enum lustre_imp_state {
 	LUSTRE_IMP_RECOVER	= 8,
 	LUSTRE_IMP_FULL		= 9,
 	LUSTRE_IMP_EVICTED	= 10,
+	LUSTRE_IMP_IDLE		= 11,
+	LUSTRE_IMP_LAST
 };
 
 /** Returns test string representation of numeric import state @state */
@@ -104,10 +106,10 @@ static inline char *ptlrpc_import_state_name(enum lustre_imp_state state)
 	static char *import_state_names[] = {
 		"<UNKNOWN>", "CLOSED",  "NEW", "DISCONN",
 		"CONNECTING", "REPLAY", "REPLAY_LOCKS", "REPLAY_WAIT",
-		"RECOVER", "FULL", "EVICTED",
+		"RECOVER", "FULL", "EVICTED", "IDLE",
 	};
 
-	LASSERT(state <= LUSTRE_IMP_EVICTED);
+	LASSERT(state < LUSTRE_IMP_LAST);
 	return import_state_names[state];
 }
 
@@ -226,12 +228,14 @@ struct obd_import {
 	int				imp_state_hist_idx;
 	/** Current import generation. Incremented on every reconnect */
 	int				imp_generation;
+	/* Idle connection initiated at this generation */
+	int				imp_initiated_at;
 	/** Incremented every time we send reconnection request */
 	u32				imp_conn_cnt;
-       /**
-	* \see ptlrpc_free_committed remembers imp_generation value here
-	* after a check to save on unnecessary replay list iterations
-	*/
+	/*
+	 * \see ptlrpc_free_committed remembers imp_generation value here
+	 * after a check to save on unnecessary replay list iterations
+	 */
 	int				imp_last_generation_checked;
 	/** Last transno we replayed */
 	u64				imp_last_replay_transno;
@@ -299,6 +303,7 @@ struct obd_import {
 					imp_connected:1;
 
 	u32				imp_connect_op;
+	u32				imp_idle_timeout;
 	struct obd_connect_data		imp_connect_data;
 	u64				imp_connect_flags_orig;
 	u64				imp_connect_flags2_orig;
diff --git a/fs/lustre/include/lustre_net.h b/fs/lustre/include/lustre_net.h
index 0231011..674803c 100644
--- a/fs/lustre/include/lustre_net.h
+++ b/fs/lustre/include/lustre_net.h
@@ -1988,6 +1988,7 @@ struct ptlrpc_service *ptlrpc_register_service(struct ptlrpc_service_conf *conf,
 int ptlrpc_connect_import(struct obd_import *imp);
 int ptlrpc_init_import(struct obd_import *imp);
 int ptlrpc_disconnect_import(struct obd_import *imp, int noclose);
+int ptlrpc_disconnect_and_idle_import(struct obd_import *imp);
 int ptlrpc_import_recovery_state_machine(struct obd_import *imp);
 
 /* ptlrpc/pack_generic.c */
diff --git a/fs/lustre/lov/lov_ea.c b/fs/lustre/lov/lov_ea.c
index 41308d3..edca3b0 100644
--- a/fs/lustre/lov/lov_ea.c
+++ b/fs/lustre/lov/lov_ea.c
@@ -70,7 +70,8 @@ static loff_t lov_tgt_maxbytes(struct lov_tgt_desc *tgt)
 		return maxbytes;
 
 	spin_lock(&imp->imp_lock);
-	if (imp->imp_state == LUSTRE_IMP_FULL &&
+	if ((imp->imp_state == LUSTRE_IMP_FULL ||
+	    imp->imp_state == LUSTRE_IMP_IDLE) &&
 	    (imp->imp_connect_data.ocd_connect_flags & OBD_CONNECT_MAXBYTES) &&
 	     imp->imp_connect_data.ocd_maxbytes > 0)
 		maxbytes = imp->imp_connect_data.ocd_maxbytes;
diff --git a/fs/lustre/lov/lov_obd.c b/fs/lustre/lov/lov_obd.c
index 9449aa9..35eaa1f 100644
--- a/fs/lustre/lov/lov_obd.c
+++ b/fs/lustre/lov/lov_obd.c
@@ -977,17 +977,21 @@ static int lov_iocontrol(unsigned int cmd, struct obd_export *exp, int len,
 		struct obd_ioctl_data *data = karg;
 		struct obd_device *osc_obd;
 		struct obd_statfs stat_buf = { 0 };
+		struct obd_import *imp;
 		u32 index;
 		u32 flags;
 
-		memcpy(&index, data->ioc_inlbuf2, sizeof(u32));
+		memcpy(&index, data->ioc_inlbuf2, sizeof(index));
 		if (index >= count)
 			return -ENODEV;
 
 		if (!lov->lov_tgts[index])
 			/* Try again with the next index */
 			return -EAGAIN;
-		if (!lov->lov_tgts[index]->ltd_active)
+
+		imp = lov->lov_tgts[index]->ltd_exp->exp_obd->u.cli.cl_import;
+		if (!lov->lov_tgts[index]->ltd_active &&
+		    imp->imp_state != LUSTRE_IMP_IDLE)
 			return -ENODATA;
 
 		osc_obd = class_exp2obd(lov->lov_tgts[index]->ltd_exp);
diff --git a/fs/lustre/lov/lov_request.c b/fs/lustre/lov/lov_request.c
index 864e410..added19 100644
--- a/fs/lustre/lov/lov_request.c
+++ b/fs/lustre/lov/lov_request.c
@@ -99,6 +99,7 @@ static int lov_check_and_wait_active(struct lov_obd *lov, int ost_idx)
 {
 	int cnt = 0;
 	struct lov_tgt_desc *tgt;
+	struct obd_import *imp = NULL;
 	int rc = 0;
 
 	mutex_lock(&lov->lov_lock);
@@ -115,7 +116,13 @@ static int lov_check_and_wait_active(struct lov_obd *lov, int ost_idx)
 		goto out;
 	}
 
-	if (tgt->ltd_exp && class_exp2cliimp(tgt->ltd_exp)->imp_connect_tried) {
+	if (tgt->ltd_exp)
+		imp = class_exp2cliimp(tgt->ltd_exp);
+	if (imp && imp->imp_connect_tried) {
+		rc = 0;
+		goto out;
+	}
+	if (imp && imp->imp_state == LUSTRE_IMP_IDLE) {
 		rc = 0;
 		goto out;
 	}
@@ -302,11 +309,10 @@ int lov_prep_statfs_set(struct obd_device *obd, struct obd_info *oinfo,
 
 	/* We only get block data from the OBD */
 	for (i = 0; i < lov->desc.ld_tgt_count; i++) {
+		struct lov_tgt_desc *ltd = lov->lov_tgts[i];
 		struct lov_request *req;
 
-		if (!lov->lov_tgts[i] ||
-		    (oinfo->oi_flags & OBD_STATFS_NODELAY &&
-		     !lov->lov_tgts[i]->ltd_active)) {
+		if (!ltd) {
 			CDEBUG(D_HA, "lov idx %d inactive\n", i);
 			continue;
 		}
@@ -314,13 +320,20 @@ int lov_prep_statfs_set(struct obd_device *obd, struct obd_info *oinfo,
 		/* skip targets that have been explicitly disabled by the
 		 * administrator
 		 */
-		if (!lov->lov_tgts[i]->ltd_exp) {
+		if (!ltd->ltd_exp) {
 			CDEBUG(D_HA,
 			       "lov idx %d administratively disabled\n", i);
 			continue;
 		}
 
-		if (!lov->lov_tgts[i]->ltd_active)
+		if (oinfo->oi_flags & OBD_STATFS_NODELAY &&
+		    class_exp2cliimp(ltd->ltd_exp)->imp_state !=
+		    LUSTRE_IMP_IDLE && !ltd->ltd_active) {
+			CDEBUG(D_HA, "lov idx %d inactive\n", i);
+			continue;
+		}
+
+		if (!ltd->ltd_active)
 			lov_check_and_wait_active(lov, i);
 
 		req = kzalloc(sizeof(*req), GFP_NOFS);
diff --git a/fs/lustre/osc/lproc_osc.c b/fs/lustre/osc/lproc_osc.c
index 605a236..fd84393 100644
--- a/fs/lustre/osc/lproc_osc.c
+++ b/fs/lustre/osc/lproc_osc.c
@@ -598,6 +598,68 @@ static int osc_unstable_stats_seq_show(struct seq_file *m, void *v)
 
 LPROC_SEQ_FOPS_RO(osc_unstable_stats);
 
+static int osc_idle_timeout_seq_show(struct seq_file *m, void *v)
+{
+	struct obd_device *obd = m->private;
+	struct client_obd *cli = &obd->u.cli;
+
+	seq_printf(m, "%u\n", cli->cl_import->imp_idle_timeout);
+	return 0;
+}
+
+static ssize_t osc_idle_timeout_seq_write(struct file *f,
+					  const char __user *buffer,
+					  size_t count, loff_t *off)
+{
+	struct obd_device *obd = ((struct seq_file *)f->private_data)->private;
+	struct client_obd *cli = &obd->u.cli;
+	struct ptlrpc_request *req;
+	unsigned int val;
+	int rc;
+
+	rc = kstrtouint_from_user(buffer, count, 0, &val);
+	if (rc)
+		return rc;
+
+	if (val > CONNECTION_SWITCH_MAX)
+		return -ERANGE;
+
+	cli->cl_import->imp_idle_timeout = val;
+
+	/* to initiate the connection if it's in IDLE state */
+	if (!val) {
+		req = ptlrpc_request_alloc(cli->cl_import, &RQF_OST_STATFS);
+		if (req)
+			ptlrpc_req_finished(req);
+	}
+
+	return count;
+}
+LPROC_SEQ_FOPS(osc_idle_timeout);
+
+static int osc_idle_connect_seq_show(struct seq_file *m, void *v)
+{
+	return 0;
+}
+
+static ssize_t osc_idle_connect_seq_write(struct file *f,
+					  const char __user *buffer,
+					  size_t count, loff_t *off)
+{
+	struct obd_device *dev = ((struct seq_file *)f->private_data)->private;
+	struct client_obd *cli = &dev->u.cli;
+	struct ptlrpc_request *req;
+
+	/* to initiate the connection if it's in IDLE state */
+	req = ptlrpc_request_alloc(cli->cl_import, &RQF_OST_STATFS);
+	if (req)
+		ptlrpc_req_finished(req);
+	ptlrpc_pinger_force(cli->cl_import);
+
+	return count;
+}
+LPROC_SEQ_FOPS(osc_idle_connect);
+
 LPROC_SEQ_FOPS_RO_TYPE(osc, connect_flags);
 LPROC_SEQ_FOPS_RO_TYPE(osc, server_uuid);
 LPROC_SEQ_FOPS_RO_TYPE(osc, timeouts);
@@ -625,6 +687,10 @@ static int osc_unstable_stats_seq_show(struct seq_file *m, void *v)
 	  .fops	=	&osc_pinger_recov_fops		},
 	{ .name	=	"unstable_stats",
 	  .fops	=	&osc_unstable_stats_fops	},
+	{ .name	=	"idle_timeout",
+	  .fops	=	&osc_idle_timeout_fops		},
+	{ .name	=	"idle_connect",
+	  .fops	=	&osc_idle_connect_fops		},
 	{ NULL }
 };
 
diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index 9ac9c84..e341fcc 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -61,6 +61,8 @@
 /* max memory used for request pool, unit is MB */
 static unsigned int osc_reqpool_mem_max = 5;
 module_param(osc_reqpool_mem_max, uint, 0444);
+static int osc_idle_timeout = 20;
+module_param(osc_idle_timeout, uint, 0644);
 
 struct osc_async_args {
 	struct obd_info		*aa_oi;
@@ -3214,6 +3216,7 @@ int osc_setup(struct obd_device *obd, struct lustre_cfg *lcfg)
 	spin_lock(&osc_shrink_lock);
 	list_add_tail(&cli->cl_shrink_list, &osc_shrink_list);
 	spin_unlock(&osc_shrink_lock);
+	cli->cl_import->imp_idle_timeout = osc_idle_timeout;
 
 	return rc;
 
diff --git a/fs/lustre/ptlrpc/client.c b/fs/lustre/ptlrpc/client.c
index 424db55..9b41c12 100644
--- a/fs/lustre/ptlrpc/client.c
+++ b/fs/lustre/ptlrpc/client.c
@@ -885,6 +885,28 @@ struct ptlrpc_request *__ptlrpc_request_alloc(struct obd_import *imp,
 			      const struct req_format *format)
 {
 	struct ptlrpc_request *request;
+	int connect = 0;
+
+	if (unlikely(imp->imp_state == LUSTRE_IMP_IDLE)) {
+		int rc;
+
+		CDEBUG(D_INFO, "%s: connect at new req\n",
+		       imp->imp_obd->obd_name);
+		spin_lock(&imp->imp_lock);
+		if (imp->imp_state == LUSTRE_IMP_IDLE) {
+			imp->imp_generation++;
+			imp->imp_initiated_at = imp->imp_generation;
+			imp->imp_state =  LUSTRE_IMP_NEW;
+			connect = 1;
+		}
+		spin_unlock(&imp->imp_lock);
+		if (connect) {
+			rc = ptlrpc_connect_import(imp);
+			if (rc < 0)
+				return NULL;
+			ptlrpc_pinger_add_import(imp);
+		}
+	}
 
 	request = __ptlrpc_request_alloc(imp, pool);
 	if (!request)
@@ -1075,6 +1097,7 @@ void ptlrpc_set_add_req(struct ptlrpc_request_set *set,
 		return;
 	}
 
+	LASSERT(req->rq_import->imp_state != LUSTRE_IMP_IDLE);
 	LASSERT(list_empty(&req->rq_set_chain));
 
 	/* The set takes over the caller's request reference */
@@ -1183,7 +1206,9 @@ static int ptlrpc_import_delay_req(struct obd_import *imp,
 		if (atomic_read(&imp->imp_inval_count) != 0) {
 			DEBUG_REQ(D_ERROR, req, "invalidate in flight");
 			*status = -EIO;
-		} else if (req->rq_no_delay) {
+		} else if (req->rq_no_delay &&
+			   imp->imp_generation != imp->imp_initiated_at) {
+			/* ignore nodelay for requests initiating connections */
 			*status = -EWOULDBLOCK;
 		} else if (req->rq_allow_replay &&
 			  (imp->imp_state == LUSTRE_IMP_REPLAY ||
@@ -1842,8 +1867,11 @@ int ptlrpc_check_set(const struct lu_env *env, struct ptlrpc_request_set *set)
 					spin_unlock(&imp->imp_lock);
 					goto interpret;
 				}
+				/* ignore on just initiated connections */
 				if (ptlrpc_no_resend(req) &&
-				    !req->rq_wait_ctx) {
+				    !req->rq_wait_ctx &&
+				    imp->imp_generation !=
+				    imp->imp_initiated_at) {
 					req->rq_status = -ENOTCONN;
 					ptlrpc_rqphase_move(req,
 							    RQ_PHASE_INTERPRET);
diff --git a/fs/lustre/ptlrpc/events.c b/fs/lustre/ptlrpc/events.c
index 93a59b8..87c0ab7 100644
--- a/fs/lustre/ptlrpc/events.c
+++ b/fs/lustre/ptlrpc/events.c
@@ -164,7 +164,8 @@ void reply_in_callback(struct lnet_event *ev)
 			  ev->mlength, ev->offset, req->rq_replen);
 	}
 
-	req->rq_import->imp_last_reply_time = ktime_get_real_seconds();
+	if (lustre_msg_get_opc(req->rq_reqmsg) != OBD_PING)
+		req->rq_import->imp_last_reply_time = ktime_get_real_seconds();
 
 out_wake:
 	/* NB don't unlock till after wakeup; req can disappear under us
diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index 019648b..b90f78c 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -925,6 +925,21 @@ static int ptlrpc_connect_interpret(const struct lu_env *env,
 	}
 
 	if (rc) {
+		struct ptlrpc_request *free_req;
+		struct ptlrpc_request *tmp;
+
+		/* abort all delayed requests initiated connection */
+		list_for_each_entry_safe(free_req, tmp, &imp->imp_delayed_list,
+					 rq_list) {
+			spin_lock(&free_req->rq_lock);
+			if (free_req->rq_no_resend) {
+				free_req->rq_err = 1;
+				free_req->rq_status = -EIO;
+				ptlrpc_client_wake_req(free_req);
+			}
+			spin_unlock(&free_req->rq_lock);
+		}
+
 		/* if this reconnect to busy export - not need select new target
 		 * for connecting
 		 */
@@ -1454,14 +1469,11 @@ int ptlrpc_import_recovery_state_machine(struct obd_import *imp)
 	return rc;
 }
 
-int ptlrpc_disconnect_import(struct obd_import *imp, int noclose)
+static struct ptlrpc_request *ptlrpc_disconnect_prep_req(struct obd_import *imp)
 {
 	struct ptlrpc_request *req;
 	int rq_opc, rc = 0;
 
-	if (imp->imp_obd->obd_force)
-		goto set_state;
-
 	switch (imp->imp_connect_op) {
 	case OST_CONNECT:
 		rq_opc = OST_DISCONNECT;
@@ -1477,9 +1489,47 @@ int ptlrpc_disconnect_import(struct obd_import *imp, int noclose)
 		CERROR("%s: don't know how to disconnect from %s (connect_op %d): rc = %d\n",
 		       imp->imp_obd->obd_name, obd2cli_tgt(imp->imp_obd),
 		       imp->imp_connect_op, rc);
-		return rc;
+		return ERR_PTR(rc);
 	}
 
+	req = ptlrpc_request_alloc_pack(imp, &RQF_MDS_DISCONNECT,
+					LUSTRE_OBD_VERSION, rq_opc);
+	if (!req)
+		return NULL;
+
+	/* We are disconnecting, do not retry a failed DISCONNECT rpc if
+	 * it fails.  We can get through the above with a down server
+	 * if the client doesn't know the server is gone yet.
+	 */
+	req->rq_no_resend = 1;
+
+	/* We want client umounts to happen quickly, no matter the
+	 * server state...
+	 */
+	req->rq_timeout = min_t(int, req->rq_timeout,
+				INITIAL_CONNECT_TIMEOUT);
+
+	IMPORT_SET_STATE(imp, LUSTRE_IMP_CONNECTING);
+	req->rq_send_state =  LUSTRE_IMP_CONNECTING;
+	ptlrpc_request_set_replen(req);
+
+	return req;
+}
+
+int ptlrpc_disconnect_import(struct obd_import *imp, int noclose)
+{
+	struct ptlrpc_request *req;
+	int rc = 0;
+
+	if (imp->imp_obd->obd_force)
+		goto set_state;
+
+	/* probably the import has been disconnected already being idle */
+	spin_lock(&imp->imp_lock);
+	if (imp->imp_state == LUSTRE_IMP_IDLE)
+		goto out;
+	spin_unlock(&imp->imp_lock);
+
 	if (ptlrpc_import_in_recovery(imp)) {
 		long timeout_jiffies;
 		time64_t timeout;
@@ -1512,27 +1562,13 @@ int ptlrpc_disconnect_import(struct obd_import *imp, int noclose)
 		goto out;
 	spin_unlock(&imp->imp_lock);
 
-	req = ptlrpc_request_alloc_pack(imp, &RQF_MDS_DISCONNECT,
-					LUSTRE_OBD_VERSION, rq_opc);
-	if (req) {
-		/* We are disconnecting, do not retry a failed DISCONNECT rpc if
-		 * it fails.  We can get through the above with a down server
-		 * if the client doesn't know the server is gone yet.
-		 */
-		req->rq_no_resend = 1;
-
-		/* We want client umounts to happen quickly, no matter the
-		 * server state...
-		 */
-		req->rq_timeout = min_t(int, req->rq_timeout,
-					INITIAL_CONNECT_TIMEOUT);
-
-		IMPORT_SET_STATE(imp, LUSTRE_IMP_CONNECTING);
-		req->rq_send_state = LUSTRE_IMP_CONNECTING;
-		ptlrpc_request_set_replen(req);
-		rc = ptlrpc_queue_wait(req);
-		ptlrpc_req_finished(req);
+	req = ptlrpc_disconnect_prep_req(imp);
+	if (IS_ERR(req)) {
+		rc = PTR_ERR(req);
+		goto set_state;
 	}
+	rc = ptlrpc_queue_wait(req);
+	ptlrpc_req_finished(req);
 
 set_state:
 	spin_lock(&imp->imp_lock);
@@ -1551,6 +1587,50 @@ int ptlrpc_disconnect_import(struct obd_import *imp, int noclose)
 }
 EXPORT_SYMBOL(ptlrpc_disconnect_import);
 
+static int ptlrpc_disconnect_idle_interpret(const struct lu_env *env,
+					    struct ptlrpc_request *req,
+					    void *data, int rc)
+{
+	struct obd_import *imp = req->rq_import;
+
+	LASSERT(imp->imp_state == LUSTRE_IMP_CONNECTING);
+	spin_lock(&imp->imp_lock);
+	IMPORT_SET_STATE_NOLOCK(imp, LUSTRE_IMP_IDLE);
+	memset(&imp->imp_remote_handle, 0, sizeof(imp->imp_remote_handle));
+	spin_unlock(&imp->imp_lock);
+
+	return 0;
+}
+
+int ptlrpc_disconnect_and_idle_import(struct obd_import *imp)
+{
+	struct ptlrpc_request *req;
+
+	if (imp->imp_obd->obd_force)
+		return 0;
+
+	if (ptlrpc_import_in_recovery(imp))
+		return 0;
+
+	spin_lock(&imp->imp_lock);
+	if (imp->imp_state != LUSTRE_IMP_FULL) {
+		spin_unlock(&imp->imp_lock);
+		return 0;
+	}
+	spin_unlock(&imp->imp_lock);
+
+	req = ptlrpc_disconnect_prep_req(imp);
+	if (IS_ERR(req))
+		return PTR_ERR(req);
+
+	CDEBUG(D_INFO, "%s: disconnect\n", imp->imp_obd->obd_name);
+	req->rq_interpret_reply = ptlrpc_disconnect_idle_interpret;
+	ptlrpcd_add_req(req);
+
+	return 0;
+}
+EXPORT_SYMBOL(ptlrpc_disconnect_and_idle_import);
+
 /* Adaptive Timeout utils */
 
 /*
diff --git a/fs/lustre/ptlrpc/pinger.c b/fs/lustre/ptlrpc/pinger.c
index 762fd0e..c565e2d 100644
--- a/fs/lustre/ptlrpc/pinger.c
+++ b/fs/lustre/ptlrpc/pinger.c
@@ -79,10 +79,40 @@ int ptlrpc_obd_ping(struct obd_device *obd)
 }
 EXPORT_SYMBOL(ptlrpc_obd_ping);
 
+static bool ptlrpc_check_import_is_idle(struct obd_import *imp)
+{
+	struct ldlm_namespace *ns = imp->imp_obd->obd_namespace;
+	time64_t now;
+
+	if (!imp->imp_idle_timeout)
+		return false;
+	/* 4 comes from:
+	 *  - client_obd_setup() - hashed import
+	 *  - ptlrpcd_alloc_work()
+	 *  - ptlrpcd_alloc_work()
+	 *  - ptlrpc_pinger_add_import
+	 */
+	if (atomic_read(&imp->imp_refcount) > 4)
+		return false;
+
+	/* any lock increases ns_bref being a resource holder */
+	if (ns && atomic_read(&ns->ns_bref) > 0)
+		return false;
+
+	now = ktime_get_real_seconds();
+	if (now - imp->imp_last_reply_time < imp->imp_idle_timeout)
+		return false;
+
+	return true;
+}
+
 static int ptlrpc_ping(struct obd_import *imp)
 {
 	struct ptlrpc_request *req;
 
+	if (ptlrpc_check_import_is_idle(imp))
+		return ptlrpc_disconnect_and_idle_import(imp);
+
 	req = ptlrpc_prep_ping(imp);
 	if (!req) {
 		CERROR("OOM trying to ping %s->%s\n",
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 053/622] lustre: osc: truncate does not update blocks count on client
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (51 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 052/622] lustre: ptlrpc: idle connections can disconnect James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 054/622] lustre: ptlrpc: add LOCK_CONVERT connection flag James Simmons
                   ` (569 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Arshad Hussain <arshad.super@gmail.com>

The 'truncate' call correctly updates the server side with
the new size and blocks count. On the client side, however,
all of the metadata is correctly updated except the blocks
count, which still reflects the old value from before the
truncate call. This patch fixes the issue on the client by
modifying osc_io_setattr_end() to update attr with the new
block count.

A new test case is added under sanity to verify that the
blocks count is correctly updated after a truncate call.

Co-authored-by: Abrarahmed Momin <abrar.momin@gmail.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-10370
Lustre-commit: 6115eb7fd55a ("LU-10370 ofd: truncate does not update blocks count on client")
Signed-off-by: Abrarahmed Momin <abrar.momin@gmail.com>
Signed-off-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-on: https://review.whamcloud.com/31073
Reviewed-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/osc_io.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/lustre/osc/osc_io.c b/fs/lustre/osc/osc_io.c
index 970e8a7..1485962 100644
--- a/fs/lustre/osc/osc_io.c
+++ b/fs/lustre/osc/osc_io.c
@@ -588,6 +588,9 @@ void osc_io_setattr_end(const struct lu_env *env,
 	struct osc_io *oio = cl2osc_io(env, slice);
 	struct cl_object *obj = slice->cis_obj;
 	struct osc_async_cbargs *cbargs = &oio->oi_cbarg;
+	struct cl_attr  *attr = &osc_env_info(env)->oti_attr;
+	struct obdo *oa = &oio->oi_oa;
+	unsigned int cl_valid = 0;
 	int result = 0;
 
 	if (cbargs->opc_rpc_sent) {
@@ -609,6 +612,14 @@ void osc_io_setattr_end(const struct lu_env *env,
 	if (cl_io_is_trunc(io)) {
 		u64 size = io->u.ci_setattr.sa_attr.lvb_size;
 
+		cl_object_attr_lock(obj);
+		if (oa->o_valid & OBD_MD_FLBLOCKS) {
+			attr->cat_blocks = oa->o_blocks;
+			cl_valid |= CAT_BLOCKS;
+		}
+
+		cl_object_attr_update(env, obj, attr, cl_valid);
+		cl_object_attr_unlock(obj);
 		osc_trunc_check(env, io, oio, size);
 		osc_cache_truncate_end(env, oio->oi_trunc);
 		oio->oi_trunc = NULL;
-- 
1.8.3.1


* [lustre-devel] [PATCH 054/622] lustre: ptlrpc: add LOCK_CONVERT connection flag
@ 2020-02-27 21:08 ` James Simmons
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

Add the LOCK_CONVERT connection flag so that the lock
convert feature is not used with old servers that lack
support for it.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10175
Lustre-commit: 44a2092f08ca ("LU-10175 ptlrpc: add LOCK_CONVERT connection flag")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32593
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/lprocfs_status.c    | 1 +
 fs/lustre/ptlrpc/wiretest.c            | 2 ++
 include/uapi/linux/lustre/lustre_idl.h | 1 +
 3 files changed, 4 insertions(+)

diff --git a/fs/lustre/obdclass/lprocfs_status.c b/fs/lustre/obdclass/lprocfs_status.c
index e2575b4..385359f 100644
--- a/fs/lustre/obdclass/lprocfs_status.c
+++ b/fs/lustre/obdclass/lprocfs_status.c
@@ -118,6 +118,7 @@
 	"unknown",	/* 0x10 */
 	"flr",		/* 0x20 */
 	"wbc",		/* 0x40 */
+	"lock_convert",	/* 0x80 */
 	NULL
 };
 
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index 01ddbee..202c5ab 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1117,6 +1117,8 @@ void lustre_assert_wire_constants(void)
 		 OBD_CONNECT2_FLR);
 	LASSERTF(OBD_CONNECT2_WBC_INTENTS == 0x40ULL, "found 0x%.16llxULL\n",
 		 OBD_CONNECT2_WBC_INTENTS);
+	LASSERTF(OBD_CONNECT2_LOCK_CONVERT == 0x80ULL, "found 0x%.16llxULL\n",
+		 OBD_CONNECT2_LOCK_CONVERT);
 	LASSERTF(OBD_CKSUM_CRC32 == 0x00000001UL, "found 0x%.8xUL\n",
 		 (unsigned int)OBD_CKSUM_CRC32);
 	LASSERTF(OBD_CKSUM_ADLER == 0x00000002UL, "found 0x%.8xUL\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 11df7b4..798aa57 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -799,6 +799,7 @@ struct ptlrpc_body_v2 {
 						 * under client-held parent
 						 * locks
 						 */
+#define OBD_CONNECT2_LOCK_CONVERT	0x80ULL /* IBITS lock convert support */
 
 /* XXX README XXX:
  * Please DO NOT add flag values here before first ensuring that this same
-- 
1.8.3.1


* [lustre-devel] [PATCH 055/622] lustre: ldlm: handle lock converts in cancel handler
@ 2020-02-27 21:08 ` James Simmons
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

- Use cancel portals and high-priority handling for lock
  converts. Update ldlm_cancel_handler to understand the
  LDLM_CONVERT RPC for that.
- Use ns_dirty_age_limit for lock convert - don't convert
  locks that are too old.
- Check for empty converts and skip them

WC-bug-id: https://jira.whamcloud.com/browse/LU-10175
Lustre-commit: 541902a3f934 ("LU-10175 ldlm: handle lock converts in cancel handler")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32314
Reviewed-by: Fan Yong <fan.yong@intel.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_export.h |  6 ++++++
 fs/lustre/ldlm/ldlm_inodebits.c   | 19 ++++++++++++++-----
 fs/lustre/ldlm/ldlm_request.c     | 39 +++++++++++++++++++++++++++++++--------
 fs/lustre/llite/llite_lib.c       |  2 +-
 fs/lustre/llite/namei.c           |  7 ++++++-
 5 files changed, 58 insertions(+), 15 deletions(-)

diff --git a/fs/lustre/include/lustre_export.h b/fs/lustre/include/lustre_export.h
index de3b109..57cf68b 100644
--- a/fs/lustre/include/lustre_export.h
+++ b/fs/lustre/include/lustre_export.h
@@ -269,9 +269,15 @@ static inline int exp_connect_flr(struct obd_export *exp)
 	return !!(exp_connect_flags2(exp) & OBD_CONNECT2_FLR);
 }
 
+static inline int exp_connect_lock_convert(struct obd_export *exp)
+{
+	return !!(exp_connect_flags2(exp) & OBD_CONNECT2_LOCK_CONVERT);
+}
+
 struct obd_export *class_conn2export(struct lustre_handle *conn);
 
 #define KKUC_CT_DATA_MAGIC	0x092013cea
+
 struct kkuc_ct_data {
 	u32			kcd_magic;
 	u32			kcd_archive;
diff --git a/fs/lustre/ldlm/ldlm_inodebits.c b/fs/lustre/ldlm/ldlm_inodebits.c
index ddbf8d4..9cf3c5f 100644
--- a/fs/lustre/ldlm/ldlm_inodebits.c
+++ b/fs/lustre/ldlm/ldlm_inodebits.c
@@ -81,7 +81,7 @@ int ldlm_inodebits_drop(struct ldlm_lock *lock, u64 to_drop)
 
 	/* Just return if there are no conflicting bits */
 	if ((lock->l_policy_data.l_inodebits.bits & to_drop) == 0) {
-		LDLM_WARN(lock, "try to drop unset bits %#llx/%#llx\n",
+		LDLM_WARN(lock, "try to drop unset bits %#llx/%#llx",
 			  lock->l_policy_data.l_inodebits.bits, to_drop);
 		/* nothing to do */
 		return 0;
@@ -111,7 +111,7 @@ int ldlm_cli_dropbits(struct ldlm_lock *lock, u64 drop_bits)
 
 	ldlm_lock2handle(lock, &lockh);
 	lock_res_and_lock(lock);
-	/* check if all bits are cancelled */
+	/* check if all bits are blocked */
 	if (!(lock->l_policy_data.l_inodebits.bits & ~drop_bits)) {
 		unlock_res_and_lock(lock);
 		/* return error to continue with cancel */
@@ -119,6 +119,13 @@ int ldlm_cli_dropbits(struct ldlm_lock *lock, u64 drop_bits)
 		goto exit;
 	}
 
+	/* check if no common bits, consider this as successful convert */
+	if (!(lock->l_policy_data.l_inodebits.bits & drop_bits)) {
+		unlock_res_and_lock(lock);
+		rc = 0;
+		goto exit;
+	}
+
 	/* check if there is race with cancel */
 	if (ldlm_is_canceling(lock) || ldlm_is_cancel(lock)) {
 		unlock_res_and_lock(lock);
@@ -167,9 +174,11 @@ int ldlm_cli_dropbits(struct ldlm_lock *lock, u64 drop_bits)
 	rc = ldlm_cli_convert(lock, &flags);
 	if (rc) {
 		lock_res_and_lock(lock);
-		ldlm_clear_converting(lock);
-		ldlm_set_cbpending(lock);
-		ldlm_set_bl_ast(lock);
+		if (ldlm_is_converting(lock)) {
+			ldlm_clear_converting(lock);
+			ldlm_set_cbpending(lock);
+			ldlm_set_bl_ast(lock);
+		}
 		unlock_res_and_lock(lock);
 		goto exit;
 	}
diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index 5833f59..ad54bd2 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -854,7 +854,7 @@ static int lock_convert_interpret(const struct lu_env *env,
 			   aa->lock_handle.cookie, reply->lock_handle.cookie,
 			   req->rq_export->exp_client_uuid.uuid,
 			   libcfs_id2str(req->rq_peer));
-		rc = -ESTALE;
+		rc = ELDLM_NO_LOCK_DATA;
 		goto out;
 	}
 
@@ -905,15 +905,30 @@ static int lock_convert_interpret(const struct lu_env *env,
 	unlock_res_and_lock(lock);
 out:
 	if (rc) {
+		int flag;
+
 		lock_res_and_lock(lock);
 		if (ldlm_is_converting(lock)) {
 			ldlm_clear_converting(lock);
 			ldlm_set_cbpending(lock);
 			ldlm_set_bl_ast(lock);
+			lock->l_policy_data.l_inodebits.cancel_bits = 0;
 		}
 		unlock_res_and_lock(lock);
-	}
 
+		/* fallback to normal lock cancel. If rc means there is no
+		 * valid lock on server, do only local cancel
+		 */
+		if (rc == ELDLM_NO_LOCK_DATA)
+			flag = LCF_LOCAL;
+		else
+			flag = LCF_ASYNC;
+
+		rc = ldlm_cli_cancel(&aa->lock_handle, flag);
+		if (rc < 0)
+			LDLM_DEBUG(lock, "failed to cancel lock: rc = %d\n",
+				   rc);
+	}
 	LDLM_LOCK_PUT(lock);
 	return rc;
 }
@@ -942,6 +957,15 @@ int ldlm_cli_convert(struct ldlm_lock *lock, u32 *flags)
 		return -EINVAL;
 	}
 
+	/* this is better to check earlier and it is done so already,
+	 * but this check is kept too as final one to issue an error
+	 * if any new code will miss such check.
+	 */
+	if (!exp_connect_lock_convert(exp)) {
+		LDLM_ERROR(lock, "server doesn't support lock convert\n");
+		return -EPROTO;
+	}
+
 	if (lock->l_resource->lr_type != LDLM_IBITS) {
 		LDLM_ERROR(lock, "convert works with IBITS locks only.");
 		return -EINVAL;
@@ -970,13 +994,12 @@ int ldlm_cli_convert(struct ldlm_lock *lock, u32 *flags)
 
 	ptlrpc_request_set_replen(req);
 
-	/* That could be useful to use cancel portals for convert as well
-	 * as high-priority handling. This will require changes in
-	 * ldlm_cancel_handler to understand convert RPC as well.
-	 *
-	 * req->rq_request_portal = LDLM_CANCEL_REQUEST_PORTAL;
-	 * req->rq_reply_portal = LDLM_CANCEL_REPLY_PORTAL;
+	/*
+	 * Use cancel portals for convert as well as high-priority handling.
 	 */
+	req->rq_request_portal = LDLM_CANCEL_REQUEST_PORTAL;
+	req->rq_reply_portal = LDLM_CANCEL_REPLY_PORTAL;
+
 	ptlrpc_at_set_req_timeout(req);
 
 	if (exp->exp_obd->obd_svc_stats)
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index dff349f..0844318 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -209,7 +209,7 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 				  OBD_CONNECT_GRANT_PARAM |
 				  OBD_CONNECT_SHORTIO | OBD_CONNECT_FLAGS2;
 
-	data->ocd_connect_flags2 = OBD_CONNECT2_FLR;
+	data->ocd_connect_flags2 = OBD_CONNECT2_FLR | OBD_CONNECT2_LOCK_CONVERT;
 
 	if (sbi->ll_flags & LL_SBI_LRU_RESIZE)
 		data->ocd_connect_flags |= OBD_CONNECT_LRU_RESIZE;
diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index 8b1a1ca..f835abb 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -371,11 +371,16 @@ void ll_lock_cancel_bits(struct ldlm_lock *lock, u64 to_cancel)
  */
 int ll_md_need_convert(struct ldlm_lock *lock)
 {
+	struct ldlm_namespace *ns = ldlm_lock_to_ns(lock);
 	struct inode *inode;
 	u64 wanted = lock->l_policy_data.l_inodebits.cancel_bits;
 	u64 bits = lock->l_policy_data.l_inodebits.bits & ~wanted;
 	enum ldlm_mode mode = LCK_MINMODE;
 
+	if (!lock->l_conn_export ||
+	    !exp_connect_lock_convert(lock->l_conn_export))
+		return 0;
+
 	if (!wanted || !bits || ldlm_is_cancel(lock))
 		return 0;
 
@@ -410,7 +415,7 @@ int ll_md_need_convert(struct ldlm_lock *lock)
 	lock_res_and_lock(lock);
 	if (ktime_after(ktime_get(),
 			ktime_add(lock->l_last_used,
-				  ktime_set(10, 0)))) {
+				  ktime_set(ns->ns_dirty_age_limit, 0)))) {
 		unlock_res_and_lock(lock);
 		return 0;
 	}
-- 
1.8.3.1


* [lustre-devel] [PATCH 056/622] lustre: ptlrpc: Serialize procfs access to scp_hist_reqs using mutex
@ 2020-02-27 21:08 ` James Simmons
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Andriy Skulysh <c17819@cray.com>

The scp_hist_reqs list can be quite long, so many userland
processes can waste CPU power spinning on the lock.

Cray-bug-id: LUS-5833
WC-bug-id: https://jira.whamcloud.com/browse/LU-11004
Lustre-commit: 413a738a37d7 ("LU-11004 ptlrpc: Serialize procfs access to scp_hist_reqs using mutex")
Signed-off-by: Andriy Skulysh <c17819@cray.com>
Reviewed-by: Andrew Perepechko <c17827@cray.com>
Reviewed-by: Alexander Boyko <c17825@cray.com>
Reviewed-on: https://review.whamcloud.com/32307
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_net.h  | 2 ++
 fs/lustre/ptlrpc/lproc_ptlrpc.c | 7 +++++++
 fs/lustre/ptlrpc/service.c      | 1 +
 3 files changed, 10 insertions(+)

diff --git a/fs/lustre/include/lustre_net.h b/fs/lustre/include/lustre_net.h
index 674803c..cf13555 100644
--- a/fs/lustre/include/lustre_net.h
+++ b/fs/lustre/include/lustre_net.h
@@ -1543,6 +1543,8 @@ struct ptlrpc_service_part {
 	 * threads starting & stopping are also protected by this lock.
 	 */
 	spinlock_t			scp_lock __cfs_cacheline_aligned;
+	/* userland serialization */
+	struct mutex			scp_mutex;
 	/** total # req buffer descs allocated */
 	int				scp_nrqbds_total;
 	/** # posted request buffers for receiving */
diff --git a/fs/lustre/ptlrpc/lproc_ptlrpc.c b/fs/lustre/ptlrpc/lproc_ptlrpc.c
index e48a4e8..0efbcfc 100644
--- a/fs/lustre/ptlrpc/lproc_ptlrpc.c
+++ b/fs/lustre/ptlrpc/lproc_ptlrpc.c
@@ -869,10 +869,12 @@ struct ptlrpc_srh_iterator {
 		if (i > cpt) /* make up the lowest position for this CPT */
 			*pos = PTLRPC_REQ_CPT2POS(svc, i);
 
+		mutex_lock(&svcpt->scp_mutex);
 		spin_lock(&svcpt->scp_lock);
 		rc = ptlrpc_lprocfs_svc_req_history_seek(svcpt, srhi,
 				PTLRPC_REQ_POS2SEQ(svc, *pos));
 		spin_unlock(&svcpt->scp_lock);
+		mutex_unlock(&svcpt->scp_mutex);
 		if (rc == 0) {
 			*pos = PTLRPC_REQ_SEQ2POS(svc, srhi->srhi_seq);
 			srhi->srhi_idx = i;
@@ -914,9 +916,11 @@ struct ptlrpc_srh_iterator {
 			seq = srhi->srhi_seq + (1 << svc->srv_cpt_bits);
 		}
 
+		mutex_lock(&svcpt->scp_mutex);
 		spin_lock(&svcpt->scp_lock);
 		rc = ptlrpc_lprocfs_svc_req_history_seek(svcpt, srhi, seq);
 		spin_unlock(&svcpt->scp_lock);
+		mutex_unlock(&svcpt->scp_mutex);
 		if (rc == 0) {
 			*pos = PTLRPC_REQ_SEQ2POS(svc, srhi->srhi_seq);
 			srhi->srhi_idx = i;
@@ -940,6 +944,7 @@ static int ptlrpc_lprocfs_svc_req_history_show(struct seq_file *s, void *iter)
 
 	svcpt = svc->srv_parts[srhi->srhi_idx];
 
+	mutex_lock(&svcpt->scp_mutex);
 	spin_lock(&svcpt->scp_lock);
 
 	rc = ptlrpc_lprocfs_svc_req_history_seek(svcpt, srhi, srhi->srhi_seq);
@@ -980,6 +985,8 @@ static int ptlrpc_lprocfs_svc_req_history_show(struct seq_file *s, void *iter)
 	}
 
 	spin_unlock(&svcpt->scp_lock);
+	mutex_unlock(&svcpt->scp_mutex);
+
 	return rc;
 }
 
diff --git a/fs/lustre/ptlrpc/service.c b/fs/lustre/ptlrpc/service.c
index 8dae21a..cf920ae 100644
--- a/fs/lustre/ptlrpc/service.c
+++ b/fs/lustre/ptlrpc/service.c
@@ -471,6 +471,7 @@ static void ptlrpc_at_timer(struct timer_list *t)
 
 	/* rqbd and incoming request queue */
 	spin_lock_init(&svcpt->scp_lock);
+	mutex_init(&svcpt->scp_mutex);
 	INIT_LIST_HEAD(&svcpt->scp_rqbd_idle);
 	INIT_LIST_HEAD(&svcpt->scp_rqbd_posted);
 	INIT_LIST_HEAD(&svcpt->scp_req_incoming);
-- 
1.8.3.1

* [lustre-devel] [PATCH 057/622] lustre: ldlm: don't add canceling lock back to LRU
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (55 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 056/622] lustre: ptlrpc: Serialize procfs access to scp_hist_reqs using mutex James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 058/622] lustre: quota: add default quota setting support James Simmons
                   ` (565 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

When a lock is converted, check that it is not being canceled
before adding it back to the LRU.

Lustre-commit: ad52f394bd82 ("LU-11003 ldlm: don't add canceling lock back to LRU")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32692
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_request.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index ad54bd2..bc441f0 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -893,7 +893,8 @@ static int lock_convert_interpret(const struct lu_env *env,
 			 * is not there yet.
 			 */
 			lock->l_policy_data.l_inodebits.cancel_bits = 0;
-			if (!lock->l_readers && !lock->l_writers) {
+			if (!lock->l_readers && !lock->l_writers &&
+			    !ldlm_is_canceling(lock)) {
 				spin_lock(&ns->ns_lock);
 				/* there is check for list_empty() inside */
 				ldlm_lock_remove_from_lru_nolock(lock);
-- 
1.8.3.1

* [lustre-devel] [PATCH 058/622] lustre: quota: add default quota setting support
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Hongchao Zhang <hongchao@whamcloud.com>

Add default quota setting support. This feature is motivated by
GPFS and is friendly for cluster administrators managing quota.

Lazy default quota setting support; here is the basic idea:

A default quota setting is a global quota setting for user, group,
and project quotas. If a default quota is set for one quota type,
newly created users/groups/projects will inherit this setting
automatically. Since Lustre itself has no idea when new users are
created, it can only learn of them when they try to acquire space
from Lustre.

So we implement lazy default quota inheritance: a slave first
checks whether a default quota setting exists; if it does, the
slave is forced to acquire quota from the master, and the master
detects whether a default quota is set, applies it, and returns
the proper grant space to the slave.

To implement this and reuse the existing quota APIs, we manage the
default quota in the quota record of ID 0, and enforce the quota
check when reading the quota record from disk.

In the current Lustre implementation, the grace time is either the
time or the timestamp to be used after some quota ID exceeds the
soft limit, so 48 bits should be enough for it; its high 16 bits
can be used for various quota flags, and this patch uses one of
them as the default quota flag.

The global quota record used by the default quota will have its
soft and hard limits set to zero, and its grace time will contain
the default flag.

Use lfs setquota -U/-G/-P <mnt> to set default quota.
Use lfs setquota -u/-g/-p foo -d <mnt> to set foo to use default quota
Use lfs quota -U/-G/-P <mnt> to show default quota.

WC-bug-id: https://jira.whamcloud.com/browse/LU-7816
Lustre-commit: 530881fe4ee2 ("LU-7816 quota: add default quota setting support")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32306
Reviewed-by: Fan Yong <fan.yong@intel.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/dir.c                   |  4 +++-
 include/uapi/linux/lustre/lustre_user.h | 22 ++++++++++++++++++++++
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/llite/dir.c b/fs/lustre/llite/dir.c
index b006e32..c0c3bf0 100644
--- a/fs/lustre/llite/dir.c
+++ b/fs/lustre/llite/dir.c
@@ -949,10 +949,12 @@ static int quotactl_ioctl(struct ll_sb_info *sbi, struct if_quotactl *qctl)
 	switch (cmd) {
 	case Q_SETQUOTA:
 	case Q_SETINFO:
+	case LUSTRE_Q_SETDEFAULT:
 		if (!capable(CAP_SYS_ADMIN))
 			return -EPERM;
 		break;
 	case Q_GETQUOTA:
+	case LUSTRE_Q_GETDEFAULT:
 		if (check_owner(type, id) && !capable(CAP_SYS_ADMIN))
 			return -EPERM;
 		break;
@@ -960,7 +962,7 @@ static int quotactl_ioctl(struct ll_sb_info *sbi, struct if_quotactl *qctl)
 		break;
 	default:
 		CERROR("unsupported quotactl op: %#x\n", cmd);
-		return -ENOTTY;
+		return -ENOTSUPP;
 	}
 
 	if (valid != QC_GENERAL) {
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 5405e1b..5956f33 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -728,6 +728,28 @@ static inline void obd_uuid2fsname(char *buf, char *uuid, int buflen)
 /* lustre-specific control commands */
 #define LUSTRE_Q_INVALIDATE	0x80000b	/* deprecated as of 2.4 */
 #define LUSTRE_Q_FINVALIDATE	0x80000c	/* deprecated as of 2.4 */
+#define LUSTRE_Q_GETDEFAULT	0x80000d	/* get default quota */
+#define LUSTRE_Q_SETDEFAULT	0x80000e	/* set default quota */
+
+/* In the current Lustre implementation, the grace time is either the time
+ * or the timestamp to be used after some quota ID exceeds the soft limt,
+ * 48 bits should be enough, its high 16 bits can be used as quota flags.
+ */
+#define LQUOTA_GRACE_BITS	48
+#define LQUOTA_GRACE_MASK	((1ULL << LQUOTA_GRACE_BITS) - 1)
+#define LQUOTA_GRACE_MAX	LQUOTA_GRACE_MASK
+#define LQUOTA_GRACE(t)		(t & LQUOTA_GRACE_MASK)
+#define LQUOTA_FLAG(t)		(t >> LQUOTA_GRACE_BITS)
+#define LQUOTA_GRACE_FLAG(t, f)	((__u64)t | (__u64)f << LQUOTA_GRACE_BITS)
+
+/* different quota flags */
+
+/* the default quota flag, the corresponding quota ID will use the default
+ * quota setting, the hardlimit and softlimit of its quota record in the global
+ * quota file will be set to 0, the low 48 bits of the grace will be set to 0
+ * and high 16 bits will contain this flag (see above comment).
+ */
+#define LQUOTA_FLAG_DEFAULT	0x0001
 
 #define ALLQUOTA 255	/* set all quota */
 
-- 
1.8.3.1

* [lustre-devel] [PATCH 059/622] lustre: ptlrpc: don't zero request handle
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Alexander Boyko <c17825@cray.com>

LNet can retransmit a request at any time if it hasn't been
replied to. ptlrpc_resend_req() zeroes the request handle and
ptlrpc_send_rpc() sets it. If a retransmission happens with a
zeroed handle, the client can't find a valid export by handle, so
it sets rq_export to NULL and replies with ENOTCONN. The server
then evicts the client on this error:

client (nid x.x.x.x at tcp) returned error from blocking AST
(req status -107 rc -107), evict it

WC-bug-id: https://jira.whamcloud.com/browse/LU-11117
Lustre-commit: 00c72ab6bb43 ("LU-11117 ptlrpc: don't zero request handle")
Signed-off-by: Alexander Boyko <c17825@cray.com>
Cray-bug-id: LUS-6037
Reviewed-on: https://review.whamcloud.com/32781
Reviewed-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/client.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/lustre/ptlrpc/client.c b/fs/lustre/ptlrpc/client.c
index 9b41c12..d28a9cd 100644
--- a/fs/lustre/ptlrpc/client.c
+++ b/fs/lustre/ptlrpc/client.c
@@ -2728,7 +2728,6 @@ void ptlrpc_resend_req(struct ptlrpc_request *req)
 		return;
 	}
 
-	lustre_msg_set_handle(req->rq_reqmsg, &(struct lustre_handle){ 0 });
 	req->rq_status = -EAGAIN;
 
 	req->rq_resend = 1;
-- 
1.8.3.1

* [lustre-devel] [PATCH 060/622] lnet: ko2iblnd: determine gaps correctly
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

We're allowed to start at a non-aligned page offset in the first
fragment and end at a non-aligned page offset in the last fragment.

When checking the iovec, exclude both the first and the last
fragment from the tx_gaps check.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11064
Lustre-commit: e40ea6fd4494 ("LU-11064 lnd: determine gaps correctly")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32586
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/o2iblnd/o2iblnd_cb.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
index c2ce3b9..60706b4 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -737,6 +737,7 @@ static int kiblnd_map_tx(struct lnet_ni *ni, struct kib_tx *tx,
 	struct kib_net *net = ni->ni_data;
 	struct scatterlist *sg;
 	int fragnob;
+	int max_nkiov;
 
 	CDEBUG(D_NET, "niov %d offset %d nob %d\n", nkiov, offset, nob);
 
@@ -751,16 +752,24 @@ static int kiblnd_map_tx(struct lnet_ni *ni, struct kib_tx *tx,
 		LASSERT(nkiov > 0);
 	}
 
+	max_nkiov = nkiov;
+
 	sg = tx->tx_frags;
 	do {
 		LASSERT(nkiov > 0);
 
 		fragnob = min((int)(kiov->bv_len - offset), nob);
 
-		if ((fragnob < (int)(kiov->bv_len - offset)) && nkiov > 1) {
+		/* We're allowed to start at a non-aligned page offset in
+		 * the first fragment and end at a non-aligned page offset
+		 * in the last fragment.
+		 */
+		if ((fragnob < (int)(kiov->bv_len - offset)) &&
+		    nkiov < max_nkiov && nob > fragnob) {
 			CDEBUG(D_NET,
-			       "fragnob %d < available page %d: with remaining %d kiovs\n",
-			       fragnob, (int)(kiov->bv_len - offset), nkiov);
+			       "fragnob %d < available page %d: with remaining %d kiovs with %d nob left\n",
+			       fragnob, (int)(kiov->bv_len - offset),
+			       nkiov, nob);
 			tx->tx_gaps = true;
 		}
 
-- 
1.8.3.1

* [lustre-devel] [PATCH 061/622] lustre: osc: increase default max_dirty_mb to 2G
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Oleg Drokin <green@whamcloud.com>

While ideally we want to move away from the max_dirty_mb setting
completely and let the grant code take over most of its role,
Andreas raises a somewhat valid point that for certain system
configurations with high-latency links, system administrators
might want the ability to limit the amount of dirty pages just
for those OSCs, to bound how long it might take to flush that
dirty data.

So a good compromise is to lift the max_dirty_mb default value
first while we work out the current grant code's deficiencies.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10990
Lustre-commit: 92e2b514e06c ("LU-10990 osc: increase default max_dirty_mb to 2G")
Signed-off-by: Oleg Drokin <green@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32288
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index 99577e4..d2bd234 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -127,7 +127,7 @@ struct timeout_item {
 #define OBD_MAX_RIF_DEFAULT	8
 #define OBD_MAX_RIF_MAX		512
 #define OSC_MAX_RIF_MAX		256
-#define OSC_MAX_DIRTY_DEFAULT	(OBD_MAX_RIF_DEFAULT * 4)
+#define OSC_MAX_DIRTY_DEFAULT	2000	/* Arbitrary large value */
 #define OSC_MAX_DIRTY_MB_MAX	2048	/* arbitrary, but < MAX_LONG bytes */
 #define OSC_DEFAULT_RESENDS	10
 
-- 
1.8.3.1

* [lustre-devel] [PATCH 062/622] lustre: ptlrpc: remove obsolete OBD RPC opcodes
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Remove the obsolete OBD_LOG_CANCEL (since Lustre 1.5) and
OBD_QC_CALLBACK (since Lustre 2.4) RPC opcodes.

Assign  OBD_IDX_READ an explicit opcode (as should be done with all
enums in lustre_idl.h) so that the value does not change if some
prior field is removed.

Also remove the OBD_FAIL checks that were used to test them.
The setting in conf_sanity.sh test_58 was unused for many years.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10855
Lustre-commit: 7d89a5b8aefc ("LU-10855 ptlrpc: remove obsolete OBD RPC opcodes")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32651
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_support.h        |  6 +++---
 fs/lustre/ptlrpc/lproc_ptlrpc.c        |  4 ++--
 fs/lustre/ptlrpc/wiretest.c            |  4 ----
 include/uapi/linux/lustre/lustre_idl.h | 12 ++++++------
 4 files changed, 11 insertions(+), 15 deletions(-)

diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index 67500b5..99b4f1f 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -352,12 +352,12 @@
 #define OBD_FAIL_PTLRPC_BULK_ATTACH      0x521
 
 #define OBD_FAIL_OBD_PING_NET				0x600
-#define OBD_FAIL_OBD_LOG_CANCEL_NET			0x601
+/*	OBD_FAIL_OBD_LOG_CANCEL_NET	0x601 obsolete since 1.5 */
 #define OBD_FAIL_OBD_LOGD_NET				0x602
-/*	OBD_FAIL_OBD_QC_CALLBACK_NET     0x603 obsolete since 2.4 */
+/*	OBD_FAIL_OBD_QC_CALLBACK_NET	0x603 obsolete since 2.4 */
 #define OBD_FAIL_OBD_DQACQ				0x604
 #define OBD_FAIL_OBD_LLOG_SETUP				0x605
-#define OBD_FAIL_OBD_LOG_CANCEL_REP			0x606
+/*	OBD_FAIL_OBD_LOG_CANCEL_REP	0x606 obsolete since 1.5 */
 #define OBD_FAIL_OBD_IDX_READ_NET			0x607
 #define OBD_FAIL_OBD_IDX_READ_BREAK			0x608
 #define OBD_FAIL_OBD_NO_LRU				0x609
diff --git a/fs/lustre/ptlrpc/lproc_ptlrpc.c b/fs/lustre/ptlrpc/lproc_ptlrpc.c
index 0efbcfc..b70a1c7 100644
--- a/fs/lustre/ptlrpc/lproc_ptlrpc.c
+++ b/fs/lustre/ptlrpc/lproc_ptlrpc.c
@@ -111,8 +111,8 @@
 	{ MGS_SET_INFO,				"mgs_set_info" },
 	{ MGS_CONFIG_READ,			"mgs_config_read" },
 	{ OBD_PING,				"obd_ping" },
-	{ OBD_LOG_CANCEL,			"llog_cancel" },
-	{ OBD_QC_CALLBACK,			"obd_quota_callback" },
+	{ 401, /* was OBD_LOG_CANCEL */		"llog_cancel" },
+	{ 402, /* was OBD_QC_CALLBACK */	"obd_quota_callback" },
 	{ OBD_IDX_READ,				"dt_index_read" },
 	{ LLOG_ORIGIN_HANDLE_CREATE,		 "llog_origin_handle_open" },
 	{ LLOG_ORIGIN_HANDLE_NEXT_BLOCK,	"llog_origin_handle_next_block" },
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index 202c5ab..015c5bd 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -326,10 +326,6 @@ void lustre_assert_wire_constants(void)
 	BUILD_BUG_ON(LUSTRE_RES_ID_HSH_OFF != 3);
 	LASSERTF(OBD_PING == 400, "found %lld\n",
 		 (long long)OBD_PING);
-	LASSERTF(OBD_LOG_CANCEL == 401, "found %lld\n",
-		 (long long)OBD_LOG_CANCEL);
-	LASSERTF(OBD_QC_CALLBACK == 402, "found %lld\n",
-		 (long long)OBD_QC_CALLBACK);
 	LASSERTF(OBD_IDX_READ == 403, "found %lld\n",
 		 (long long)OBD_IDX_READ);
 	LASSERTF(OBD_LAST_OPC == 404, "found %lld\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 798aa57..adaa994 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -2342,13 +2342,13 @@ struct cfg_marker {
  */
 
 enum obd_cmd {
-	OBD_PING = 400,
-	OBD_LOG_CANCEL,	/* Obsolete since 1.5. */
-	OBD_QC_CALLBACK, /* not used since 2.4 */
-	OBD_IDX_READ,
-	OBD_LAST_OPC
+	OBD_PING	= 400,
+/*	OBD_LOG_CANCEL	= 401, Obsolete since 1.5 */
+/*	OBD_QC_CALLBACK	= 402, not used since 2.4 */
+	OBD_IDX_READ	= 403,
+	OBD_LAST_OPC,
+	OBD_FIRST_OPC = OBD_PING
 };
-#define OBD_FIRST_OPC OBD_PING
 
 /**
  * llog contexts indices.
-- 
1.8.3.1

* [lustre-devel] [PATCH 063/622] lustre: ptlrpc: assign specific values to MGS opcodes
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Assign specific values to all of the MGS opcodes in enum mgs_cmd
so that these values do not change if a new item is added or an
old one is removed in the future. These opcodes are part of the
wire protocol and need to remain constant.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10855
Lustre-commit: 12c5a26609f1 ("LU-10855 ptlrpc: assign specific values to MGS opcodes")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32653
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/wiretest.c            |  2 ++
 include/uapi/linux/lustre/lustre_idl.h | 20 ++++++++++----------
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index 015c5bd..ef07975 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -348,6 +348,8 @@ void lustre_assert_wire_constants(void)
 		 (long long)MGS_TARGET_DEL);
 	LASSERTF(MGS_SET_INFO == 255, "found %lld\n",
 		 (long long)MGS_SET_INFO);
+	LASSERTF(MGS_CONFIG_READ == 256, "found %lld\n",
+		 (long long)MGS_CONFIG_READ);
 	LASSERTF(MGS_LAST_OPC == 257, "found %lld\n",
 		 (long long)MGS_LAST_OPC);
 	LASSERTF(SEC_CTX_INIT == 801, "found %lld\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index adaa994..1b5794a 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -2247,16 +2247,16 @@ struct ldlm_reply {
  * Opcodes for mountconf (mgs and mgc)
  */
 enum mgs_cmd {
-	MGS_CONNECT = 250,
-	MGS_DISCONNECT,
-	MGS_EXCEPTION,		/* node died, etc. */
-	MGS_TARGET_REG,		/* whenever target starts up */
-	MGS_TARGET_DEL,
-	MGS_SET_INFO,
-	MGS_CONFIG_READ,
-	MGS_LAST_OPC
-};
-#define MGS_FIRST_OPC MGS_CONNECT
+	MGS_CONNECT	= 250,
+	MGS_DISCONNECT	= 251,
+	MGS_EXCEPTION	= 252,	/* node died, etc. */
+	MGS_TARGET_REG	= 253,	/* whenever target starts up */
+	MGS_TARGET_DEL	= 254,
+	MGS_SET_INFO	= 255,
+	MGS_CONFIG_READ	= 256,
+	MGS_LAST_OPC,
+	MGS_FIRST_OPC	= MGS_CONNECT
+};
 
 #define MGS_PARAM_MAXLEN 1024
 #define KEY_SET_INFO "set_info"
-- 
1.8.3.1

* [lustre-devel] [PATCH 064/622] lustre: ptlrpc: remove obsolete LLOG_ORIGIN_* RPCs
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Remove the obsolete RPC opcodes LLOG_ORIGIN_HANDLE_WRITE_REC,
LLOG_ORIGIN_HANDLE_CLOSE, LLOG_ORIGIN_CONNECT, LLOG_CATINFO
along with their unused OBD_FAIL counterparts.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10855
Lustre-commit: 830ce1b10f3a ("LU-10855 ptlrpc: remove obsolete LLOG_ORIGIN_* RPCs")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32654
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_support.h        | 10 +++++-----
 fs/lustre/ptlrpc/lproc_ptlrpc.c        |  8 ++++----
 fs/lustre/ptlrpc/wiretest.c            |  5 -----
 include/uapi/linux/lustre/lustre_idl.h | 10 +++++-----
 4 files changed, 14 insertions(+), 19 deletions(-)

diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index 99b4f1f..28becfa 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -423,15 +423,15 @@
 #define OBD_FAIL_SEC_CTX_HDL_PAUSE			0x1204
 
 #define OBD_FAIL_LLOG					0x1300
-#define OBD_FAIL_LLOG_ORIGIN_CONNECT_NET		0x1301
+/* was	OBD_FAIL_LLOG_ORIGIN_CONNECT_NET		0x1301 until 2.4 */
 #define OBD_FAIL_LLOG_ORIGIN_HANDLE_CREATE_NET		0x1302
-#define OBD_FAIL_LLOG_ORIGIN_HANDLE_DESTROY_NET		0x1303
+/* was	OBD_FAIL_LLOG_ORIGIN_HANDLE_DESTROY_NET		0x1303 until 2.11 */
 #define OBD_FAIL_LLOG_ORIGIN_HANDLE_READ_HEADER_NET	0x1304
 #define OBD_FAIL_LLOG_ORIGIN_HANDLE_NEXT_BLOCK_NET	0x1305
 #define OBD_FAIL_LLOG_ORIGIN_HANDLE_PREV_BLOCK_NET	0x1306
-#define OBD_FAIL_LLOG_ORIGIN_HANDLE_WRITE_REC_NET	0x1307
-#define OBD_FAIL_LLOG_ORIGIN_HANDLE_CLOSE_NET		0x1308
-#define OBD_FAIL_LLOG_CATINFO_NET			0x1309
+/* was	OBD_FAIL_LLOG_ORIGIN_HANDLE_WRITE_REC_NET	0x1307 until 2.1 */
+/* was	OBD_FAIL_LLOG_ORIGIN_HANDLE_CLOSE_NET		0x1308 until 1.8 */
+/* was	OBD_FAIL_LLOG_CATINFO_NET			0x1309 until 2.3 */
 #define OBD_FAIL_MDS_SYNC_CAPA_SL			0x1310
 #define OBD_FAIL_SEQ_ALLOC				0x1311
 
diff --git a/fs/lustre/ptlrpc/lproc_ptlrpc.c b/fs/lustre/ptlrpc/lproc_ptlrpc.c
index b70a1c7..6af3384 100644
--- a/fs/lustre/ptlrpc/lproc_ptlrpc.c
+++ b/fs/lustre/ptlrpc/lproc_ptlrpc.c
@@ -117,10 +117,10 @@
 	{ LLOG_ORIGIN_HANDLE_CREATE,		 "llog_origin_handle_open" },
 	{ LLOG_ORIGIN_HANDLE_NEXT_BLOCK,	"llog_origin_handle_next_block" },
 	{ LLOG_ORIGIN_HANDLE_READ_HEADER,	"llog_origin_handle_read_header" },
-	{ LLOG_ORIGIN_HANDLE_WRITE_REC,		"llog_origin_handle_write_rec" },
-	{ LLOG_ORIGIN_HANDLE_CLOSE,		"llog_origin_handle_close" },
-	{ LLOG_ORIGIN_CONNECT,			"llog_origin_connect" },
-	{ LLOG_CATINFO,				"llog_catinfo" },
+	{ 504, /*LLOG_ORIGIN_HANDLE_WRITE_REC*/	"llog_origin_handle_write_rec" },
+	{ 505, /* was LLOG_ORIGIN_HANDLE_CLOSE */"llog_origin_handle_close" },
+	{ 506, /* was LLOG_ORIGIN_CONNECT */	"llog_origin_connect" },
+	{ 507, /* was LLOG_CATINFO */		"llog_catinfo" },
 	{ LLOG_ORIGIN_HANDLE_PREV_BLOCK,	"llog_origin_handle_prev_block" },
 	{ LLOG_ORIGIN_HANDLE_DESTROY,		"llog_origin_handle_destroy" },
 	{ QUOTA_DQACQ,				"quota_acquire" },
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index ef07975..7b6ea86 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -3757,12 +3757,7 @@ void lustre_assert_wire_constants(void)
 	BUILD_BUG_ON(LLOG_ORIGIN_HANDLE_CREATE != 501);
 	BUILD_BUG_ON(LLOG_ORIGIN_HANDLE_NEXT_BLOCK != 502);
 	BUILD_BUG_ON(LLOG_ORIGIN_HANDLE_READ_HEADER != 503);
-	BUILD_BUG_ON(LLOG_ORIGIN_HANDLE_WRITE_REC != 504);
-	BUILD_BUG_ON(LLOG_ORIGIN_HANDLE_CLOSE != 505);
-	BUILD_BUG_ON(LLOG_ORIGIN_CONNECT != 506);
-	BUILD_BUG_ON(LLOG_CATINFO != 507);
 	BUILD_BUG_ON(LLOG_ORIGIN_HANDLE_PREV_BLOCK != 508);
-	BUILD_BUG_ON(LLOG_ORIGIN_HANDLE_DESTROY != 509);
 	BUILD_BUG_ON(LLOG_FIRST_OPC != 501);
 	BUILD_BUG_ON(LLOG_LAST_OPC != 510);
 	BUILD_BUG_ON(LLOG_CONFIG_ORIG_CTXT != 0);
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 1b5794a..5db742f 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -2655,12 +2655,12 @@ enum llogd_rpc_ops {
 	LLOG_ORIGIN_HANDLE_CREATE	= 501,
 	LLOG_ORIGIN_HANDLE_NEXT_BLOCK	= 502,
 	LLOG_ORIGIN_HANDLE_READ_HEADER	= 503,
-	LLOG_ORIGIN_HANDLE_WRITE_REC	= 504,	/* Obsolete by 2.1. */
-	LLOG_ORIGIN_HANDLE_CLOSE	= 505,	/* Obsolete by 1.8. */
-	LLOG_ORIGIN_CONNECT		= 506,	/* Obsolete by 2.4. */
-	LLOG_CATINFO			= 507,  /* Obsolete by 2.3. */
+/*	LLOG_ORIGIN_HANDLE_WRITE_REC	= 504,	Obsolete by 2.1. */
+/*	LLOG_ORIGIN_HANDLE_CLOSE	= 505,	Obsolete by 1.8. */
+/*	LLOG_ORIGIN_CONNECT		= 506,	Obsolete by 2.4. */
+/*	LLOG_CATINFO			= 507,  Obsolete by 2.3. */
 	LLOG_ORIGIN_HANDLE_PREV_BLOCK	= 508,
-	LLOG_ORIGIN_HANDLE_DESTROY	= 509,  /* Obsolete. */
+	LLOG_ORIGIN_HANDLE_DESTROY	= 509,  /* Obsolete by 2.11. */
 	LLOG_LAST_OPC,
 	LLOG_FIRST_OPC			= LLOG_ORIGIN_HANDLE_CREATE
 };
-- 
1.8.3.1

* [lustre-devel] [PATCH 065/622] lustre: osc: fix idle_timeout handling
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

The patch that landed for LU-7236 introduced new procfs entries
which were done wrong.

1) For idle_timeout it returns -ERANGE for any value passed in
   except setting idle_timeout to zero. This does not match what
   the commit message said for LU-7236, so change
   lprocfs_str_with_units_to_s64() into kstrtouint(), since a
   signed 64-bit timeout is not needed. Using kstrtouint() ensures
   that negative values are not possible, and also cap the value
   at CONNECTION_SWITCH_MAX, since a maximum of 4 billion seconds
   is overkill.

2) The procfs entry idle_connect is really a write-only file, but
   it was treated as both readable and writable. There is no need
   for the osc_idle_connect_seq_show() function.

3) Lastly, no more stuffing new entries into proc or debugfs: this
   patch converts these new entries to sysfs. Since this is a
   common occurrence, also add LPROC_SEQ_* to spelling.txt so that
   checkpatch will complain about any use of LPROC_SEQ_*, which
   will go away.

WC-bug-id: https://jira.whamcloud.com/browse/LU-8066
Lustre-commit: 406cd8a74d84 ("LU-8066 osc: fix idle_timeout handling")
Signed-off-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-on: https://review.whamcloud.com/32719
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/lproc_osc.c | 42 ++++++++++++++++++------------------------
 1 file changed, 18 insertions(+), 24 deletions(-)

diff --git a/fs/lustre/osc/lproc_osc.c b/fs/lustre/osc/lproc_osc.c
index fd84393..0a12079 100644
--- a/fs/lustre/osc/lproc_osc.c
+++ b/fs/lustre/osc/lproc_osc.c
@@ -598,26 +598,27 @@ static int osc_unstable_stats_seq_show(struct seq_file *m, void *v)
 
 LPROC_SEQ_FOPS_RO(osc_unstable_stats);
 
-static int osc_idle_timeout_seq_show(struct seq_file *m, void *v)
+static ssize_t idle_timeout_show(struct kobject *kobj, struct attribute *attr,
+				 char *buf)
 {
-	struct obd_device *obd = m->private;
+	struct obd_device *obd = container_of(kobj, struct obd_device,
+					      obd_kset.kobj);
 	struct client_obd *cli = &obd->u.cli;
 
-	seq_printf(m, "%u\n", cli->cl_import->imp_idle_timeout);
-	return 0;
+	return sprintf(buf, "%u\n", cli->cl_import->imp_idle_timeout);
 }
 
-static ssize_t osc_idle_timeout_seq_write(struct file *f,
-					  const char __user *buffer,
-					  size_t count, loff_t *off)
+static ssize_t idle_timeout_store(struct kobject *kobj, struct attribute *attr,
+				  const char *buffer, size_t count)
 {
-	struct obd_device *obd = ((struct seq_file *)f->private_data)->private;
+	struct obd_device *obd = container_of(kobj, struct obd_device,
+					      obd_kset.kobj);
 	struct client_obd *cli = &obd->u.cli;
 	struct ptlrpc_request *req;
 	unsigned int val;
 	int rc;
 
-	rc = kstrtouint_from_user(buffer, count, 0, &val);
+	rc = kstrtouint(buffer, 0, &val);
 	if (rc)
 		return rc;
 
@@ -635,18 +636,13 @@ static ssize_t osc_idle_timeout_seq_write(struct file *f,
 
 	return count;
 }
-LPROC_SEQ_FOPS(osc_idle_timeout);
+LUSTRE_RW_ATTR(idle_timeout);
 
-static int osc_idle_connect_seq_show(struct seq_file *m, void *v)
+static ssize_t idle_connect_store(struct kobject *kobj, struct attribute *attr,
+				  const char *buffer, size_t count)
 {
-	return 0;
-}
-
-static ssize_t osc_idle_connect_seq_write(struct file *f,
-					  const char __user *buffer,
-					  size_t count, loff_t *off)
-{
-	struct obd_device *dev = ((struct seq_file *)f->private_data)->private;
+	struct obd_device *dev = container_of(kobj, struct obd_device,
+					      obd_kset.kobj);
 	struct client_obd *cli = &dev->u.cli;
 	struct ptlrpc_request *req;
 
@@ -658,7 +654,7 @@ static ssize_t osc_idle_connect_seq_write(struct file *f,
 
 	return count;
 }
-LPROC_SEQ_FOPS(osc_idle_connect);
+LUSTRE_WO_ATTR(idle_connect);
 
 LPROC_SEQ_FOPS_RO_TYPE(osc, connect_flags);
 LPROC_SEQ_FOPS_RO_TYPE(osc, server_uuid);
@@ -687,10 +683,6 @@ static ssize_t osc_idle_connect_seq_write(struct file *f,
 	  .fops	=	&osc_pinger_recov_fops		},
 	{ .name	=	"unstable_stats",
 	  .fops	=	&osc_unstable_stats_fops	},
-	{ .name	=	"idle_timeout",
-	  .fops	=	&osc_idle_timeout_fops		},
-	{ .name	=	"idle_connect",
-	  .fops	=	&osc_idle_connect_fops		},
 	{ NULL }
 };
 
@@ -877,6 +869,8 @@ void lproc_osc_attach_seqstat(struct obd_device *dev)
 	&lustre_attr_resend_count.attr,
 	&lustre_attr_ost_conn_uuid.attr,
 	&lustre_attr_ping.attr,
+	&lustre_attr_idle_timeout.attr,
+	&lustre_attr_idle_connect.attr,
 	NULL,
 };
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 066/622] lustre: ptlrpc: ASSERTION(!list_empty(imp->imp_replay_cursor))
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (64 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 065/622] lustre: osc: fix idle_timeout handling James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 067/622] lustre: obd: keep dirty_max_pages a round number of MB James Simmons
                   ` (556 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Andriy Skulysh <c17819@cray.com>

This is a race between ptlrpc_replay_next() and close:
ll_close_inode_openhandle() calls
mdc_free_open()->ptlrpc_request_committed()->ptlrpc_free_request().

We need to reset imp_replay_cursor when dropping a request from the
replay list.
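
As a userspace illustration (not Lustre code), the fix amounts to
advancing the cursor past a list entry before that entry is removed,
so the cursor never points at freed memory. The structures and names
below are a minimal model of the kernel's circular list_head, not the
real ptlrpc types:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace model of the kernel's circular doubly-linked
 * list, illustrating the fix: if the replay cursor points at the
 * request being freed, advance it first so it never dangles.
 */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n;
	n->prev = n;
}

/* Mirrors the patched ptlrpc_request_committed() logic: move the
 * cursor to the next entry before deleting the one it points at.
 */
static void drop_entry(struct list_head **cursor, struct list_head *entry)
{
	if (*cursor == entry)
		*cursor = entry->next;
	list_del(entry);
}
```

With the cursor on the entry being dropped, it ends up on the next
entry (or back on the list head once the list empties), which is
exactly the invariant the assertion in the subject line checks.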

Cray-bug-id: LUS-2455
WC-bug-id: https://jira.whamcloud.com/browse/LU-11098
Lustre-commit: d69d488e1778 ("LU-11098 ptlrpc: ASSERTION(!list_empty(imp->imp_replay_cursor))")
Signed-off-by: Andriy Skulysh <c17819@cray.com>
Reviewed-on: https://review.whamcloud.com/32727
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Vladimir Saveliev <c17830@cray.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/client.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/ptlrpc/client.c b/fs/lustre/ptlrpc/client.c
index d28a9cd..57b08de 100644
--- a/fs/lustre/ptlrpc/client.c
+++ b/fs/lustre/ptlrpc/client.c
@@ -2613,8 +2613,11 @@ void ptlrpc_request_committed(struct ptlrpc_request *req, int force)
 		return;
 	}
 
-	if (force || req->rq_transno <= imp->imp_peer_committed_transno)
+	if (force || req->rq_transno <= imp->imp_peer_committed_transno) {
+		if (imp->imp_replay_cursor == &req->rq_replay_list)
+			imp->imp_replay_cursor = req->rq_replay_list.next;
 		ptlrpc_free_request(req);
+	}
 
 	spin_unlock(&imp->imp_lock);
 }
-- 
1.8.3.1


* [lustre-devel] [PATCH 067/622] lustre: obd: keep dirty_max_pages a round number of MB
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (65 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 066/622] lustre: ptlrpc: ASSERTION(!list_empty(imp->imp_replay_cursor)) James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 068/622] lustre: osc: depart grant shrinking from pinger James Simmons
                   ` (555 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: "John L. Hammond" <jhammond@whamcloud.com>

In client_adjust_max_dirty() ensure that the dirty pages limit is
always divisible by 256 so that it can be faithfully represented in
MB, as is the case when the max_dirty_mb parameter is used.
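
As a userspace sketch (illustrative names, not Lustre code), the
rounding in the patch is the kernel's power-of-two round_up(): with
PAGE_SHIFT == 12 (4 KiB pages), 1 << (20 - PAGE_SHIFT) == 256 pages
per MiB, so rounding up to that multiple keeps the limit a whole
number of MiB:

```c
#include <assert.h>

/* Assume 4 KiB pages for this illustration. */
#define PAGE_SHIFT 12

/* Power-of-two round up, as the kernel's round_up() macro does. */
static unsigned long round_up_pow2(unsigned long x, unsigned long align)
{
	return (x + align - 1) & ~(align - 1);
}

/* Sketch of the final step of client_adjust_max_dirty(): round the
 * page count up to a whole number of MiB worth of pages.
 */
static unsigned long adjust_dirty_pages(unsigned long pages)
{
	return round_up_pow2(pages, 1UL << (20 - PAGE_SHIFT));
}
```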

WC-bug-id: https://jira.whamcloud.com/browse/LU-11157
Lustre-commit: d3f88d376c49 ("LU-11157 obd: keep dirty_max_pages a round number of MB")
Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32831
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd.h | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index d2bd234..5656eb0 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -1106,7 +1106,7 @@ static inline int cli_brw_size(struct obd_device *obd)
 }
 
 /*
- * when RPC size or the max RPCs in flight is increased, the max dirty pages
+ * When RPC size or the max RPCs in flight is increased, the max dirty pages
  * of the client should be increased accordingly to avoid sending fragmented
  * RPCs over the network when the client runs out of the maximum dirty space
  * when so many RPCs are being generated.
@@ -1114,10 +1114,10 @@ static inline int cli_brw_size(struct obd_device *obd)
 static inline void client_adjust_max_dirty(struct client_obd *cli)
 {
 	/* initializing */
-	if (cli->cl_dirty_max_pages <= 0)
+	if (cli->cl_dirty_max_pages <= 0) {
 		cli->cl_dirty_max_pages =
 			(OSC_MAX_DIRTY_DEFAULT * 1024 * 1024) >> PAGE_SHIFT;
-	else {
+	} else {
 		unsigned long dirty_max = cli->cl_max_rpcs_in_flight *
 					  cli->cl_max_pages_per_rpc;
 
@@ -1127,6 +1127,13 @@ static inline void client_adjust_max_dirty(struct client_obd *cli)
 
 	if (cli->cl_dirty_max_pages > totalram_pages() / 8)
 		cli->cl_dirty_max_pages = totalram_pages() / 8;
+
+	/* This value is exported to userspace through the max_dirty_mb
+	 * parameter.  So we round up the number of pages to make it a round
+	 * number of MBs.
+	 */
+	cli->cl_dirty_max_pages = round_up(cli->cl_dirty_max_pages,
+					   1 << (20 - PAGE_SHIFT));
 }
 
 #endif /* __OBD_H */
-- 
1.8.3.1


* [lustre-devel] [PATCH 068/622] lustre: osc: depart grant shrinking from pinger
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (66 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 067/622] lustre: obd: keep dirty_max_pages a round number of MB James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 069/622] lustre: mdt: Lazy size on MDT James Simmons
                   ` (554 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Bobi Jam <bobijam@hotmail.com>

* Move the grant shrinking code out of the pinger and use a delayed
  workqueue to drive the grant shrink timer.
* Enable OSC grant shrinking by default.

bugzilla: 19507
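
The rescheduling decision in osc_grant_work_handler() can be sketched
in userspace as a pure function: scan the clients' next-shrink
deadlines, take the first client's deadline as the starting point,
and prefer any later client whose deadline is earlier but still in
the future. The array and names below stand in for the kernel list
and are illustrative only:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative default; the real value lives in the Lustre headers. */
#define GRANT_SHRINK_INTERVAL 1200 /* seconds */

/* Returns the next wakeup time: if the result is <= now, the caller
 * should re-queue the work immediately, otherwise sleep until then.
 * Mirrors the init_next_shrink / next_shrink loop in the patch.
 */
static long next_wakeup(const long *deadlines, int n, long now)
{
	long next = now + GRANT_SHRINK_INTERVAL;
	int init = 1, i;

	for (i = 0; i < n; i++) {
		if (init) {
			init = 0;
			next = deadlines[i];
		} else if (deadlines[i] < next && deadlines[i] > now) {
			next = deadlines[i];
		}
	}
	return next;
}
```

This shows why a stale (past) deadline on the first client causes an
immediate re-queue rather than a long sleep.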

WC-bug-id: https://jira.whamcloud.com/browse/LU-8708
Lustre-commit: fc915a43786e ("LU-8708 osc: depart grant shrinking from pinger")
Signed-off-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-on: https://review.whamcloud.com/23202
Reviewed-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_lib.c   |   1 +
 fs/lustre/llite/llite_lib.c |   2 +-
 fs/lustre/osc/osc_request.c | 155 ++++++++++++++++++++++++++++++--------------
 3 files changed, 110 insertions(+), 48 deletions(-)

diff --git a/fs/lustre/ldlm/ldlm_lib.c b/fs/lustre/ldlm/ldlm_lib.c
index 2c0fad3..838ddb3 100644
--- a/fs/lustre/ldlm/ldlm_lib.c
+++ b/fs/lustre/ldlm/ldlm_lib.c
@@ -349,6 +349,7 @@ int client_obd_setup(struct obd_device *obddev, struct lustre_cfg *lcfg)
 	spin_lock_init(&cli->cl_lru_list_lock);
 	atomic_long_set(&cli->cl_unstable_count, 0);
 	INIT_LIST_HEAD(&cli->cl_shrink_list);
+	INIT_LIST_HEAD(&cli->cl_grant_chain);
 
 	INIT_LIST_HEAD(&cli->cl_flight_waiters);
 	cli->cl_rpcs_in_flight = 0;
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 0844318..56624e8 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -399,7 +399,7 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 				  OBD_CONNECT_LAYOUTLOCK  |
 				  OBD_CONNECT_PINGLESS	| OBD_CONNECT_LFSCK |
 				  OBD_CONNECT_BULK_MBITS  | OBD_CONNECT_SHORTIO |
-				  OBD_CONNECT_FLAGS2;
+				  OBD_CONNECT_FLAGS2 | OBD_CONNECT_GRANT_SHRINK;
 
 	/* The client currently advertises support for OBD_CONNECT_LOCKAHEAD_OLD
 	 * so it can interoperate with an older version of lockahead which was
diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index e341fcc..1a9ed8d 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -33,6 +33,7 @@
 
 #define DEBUG_SUBSYSTEM S_OSC
 
+#include <linux/workqueue.h>
 #include <linux/highmem.h>
 #include <linux/libcfs/libcfs_hash.h>
 #include <linux/sched/mm.h>
@@ -721,6 +722,16 @@ static void osc_update_grant(struct client_obd *cli, struct ost_body *body)
 	}
 }
 
+/**
+ * grant thread data for shrinking space.
+ */
+struct grant_thread_data {
+	struct list_head	gtd_clients;
+	struct mutex		gtd_mutex;
+	unsigned long		gtd_stopped:1;
+};
+static struct grant_thread_data client_gtd;
+
 static int osc_shrink_grant_interpret(const struct lu_env *env,
 				      struct ptlrpc_request *req,
 				      void *aa, int rc)
@@ -823,6 +834,9 @@ static int osc_should_shrink_grant(struct client_obd *client)
 {
 	time64_t next_shrink = client->cl_next_shrink_grant;
 
+	if (!client->cl_import)
+		return 0;
+
 	if ((client->cl_import->imp_connect_data.ocd_connect_flags &
 	     OBD_CONNECT_GRANT_SHRINK) == 0)
 		return 0;
@@ -843,38 +857,83 @@ static int osc_should_shrink_grant(struct client_obd *client)
 	return 0;
 }
 
-static int osc_grant_shrink_grant_cb(struct timeout_item *item, void *data)
-{
-	struct client_obd *client;
+#define GRANT_SHRINK_RPC_BATCH	100
+
+static void osc_grant_work_handler(struct work_struct *data);
+static DECLARE_DELAYED_WORK(work, osc_grant_work_handler);
 
-	list_for_each_entry(client, &item->ti_obd_list, cl_grant_shrink_list) {
-		if (osc_should_shrink_grant(client))
-			osc_shrink_grant(client);
+static void osc_grant_work_handler(struct work_struct *data)
+{
+	struct client_obd *cli;
+	int rpc_sent;
+	bool init_next_shrink = true;
+	time64_t next_shrink = ktime_get_seconds() + GRANT_SHRINK_INTERVAL;
+
+	rpc_sent = 0;
+	mutex_lock(&client_gtd.gtd_mutex);
+	list_for_each_entry(cli, &client_gtd.gtd_clients,
+			    cl_grant_chain) {
+		if (++rpc_sent < GRANT_SHRINK_RPC_BATCH &&
+		    osc_should_shrink_grant(cli))
+			osc_shrink_grant(cli);
+
+		if (!init_next_shrink) {
+			if (cli->cl_next_shrink_grant < next_shrink &&
+			    cli->cl_next_shrink_grant > ktime_get_seconds())
+				next_shrink = cli->cl_next_shrink_grant;
+		} else {
+			init_next_shrink = false;
+			next_shrink = cli->cl_next_shrink_grant;
+		}
 	}
-	return 0;
+	mutex_unlock(&client_gtd.gtd_mutex);
+
+	if (client_gtd.gtd_stopped == 1)
+		return;
+
+	if (next_shrink > ktime_get_seconds())
+		schedule_delayed_work(&work, msecs_to_jiffies(
+					(next_shrink - ktime_get_seconds()) *
+					MSEC_PER_SEC));
+	else
+		schedule_work(&work.work);
 }
 
-static int osc_add_shrink_grant(struct client_obd *client)
+/**
+ * Start grant thread for returing grant to server for idle clients.
+ */
+static int osc_start_grant_work(void)
 {
-	int rc;
+	client_gtd.gtd_stopped = 0;
+	mutex_init(&client_gtd.gtd_mutex);
+	INIT_LIST_HEAD(&client_gtd.gtd_clients);
+
+	schedule_work(&work.work);
 
-	rc = ptlrpc_add_timeout_client(client->cl_grant_shrink_interval,
-				       TIMEOUT_GRANT,
-				       osc_grant_shrink_grant_cb, NULL,
-				       &client->cl_grant_shrink_list);
-	if (rc) {
-		CERROR("add grant client %s error %d\n", cli_name(client), rc);
-		return rc;
-	}
-	CDEBUG(D_CACHE, "add grant client %s\n", cli_name(client));
-	osc_update_next_shrink(client);
 	return 0;
 }
 
-static int osc_del_shrink_grant(struct client_obd *client)
+static void osc_stop_grant_work(void)
+{
+	client_gtd.gtd_stopped = 1;
+	cancel_delayed_work_sync(&work);
+}
+
+static void osc_add_grant_list(struct client_obd *client)
 {
-	return ptlrpc_del_timeout_client(&client->cl_grant_shrink_list,
-					 TIMEOUT_GRANT);
+	mutex_lock(&client_gtd.gtd_mutex);
+	list_add(&client->cl_grant_chain, &client_gtd.gtd_clients);
+	mutex_unlock(&client_gtd.gtd_mutex);
+}
+
+static void osc_del_grant_list(struct client_obd *client)
+{
+	if (list_empty(&client->cl_grant_chain))
+		return;
+
+	mutex_lock(&client_gtd.gtd_mutex);
+	list_del_init(&client->cl_grant_chain);
+	mutex_unlock(&client_gtd.gtd_mutex);
 }
 
 void osc_init_grant(struct client_obd *cli, struct obd_connect_data *ocd)
@@ -929,9 +988,8 @@ void osc_init_grant(struct client_obd *cli, struct obd_connect_data *ocd)
 	       cli_name(cli), cli->cl_avail_grant, cli->cl_lost_grant,
 	       cli->cl_chunkbits, cli->cl_max_extent_pages);
 
-	if (ocd->ocd_connect_flags & OBD_CONNECT_GRANT_SHRINK &&
-	    list_empty(&cli->cl_grant_shrink_list))
-		osc_add_shrink_grant(cli);
+	if (OCD_HAS_FLAG(ocd, GRANT_SHRINK) && list_empty(&cli->cl_grant_chain))
+		osc_add_grant_list(cli);
 }
 EXPORT_SYMBOL(osc_init_grant);
 
@@ -2971,15 +3029,12 @@ int osc_disconnect(struct obd_export *exp)
 	 *				     osc_disconnect
 	 *				     del_shrink_grant
 	 *   ptlrpc_connect_interrupt
-	 *     init_grant_shrink
+	 *     osc_init_grant
 	 *   add this client to shrink list
-	 *				      cleanup_osc
-	 * Bang! pinger trigger the shrink.
-	 * So the osc should be disconnected from the shrink list, after we
-	 * are sure the import has been destroyed. BUG18662
+	 *				     cleanup_osc
+	 * Bang! grant shrink thread trigger the shrink. BUG18662
 	 */
-	if (!obd->u.cli.cl_import)
-		osc_del_shrink_grant(&obd->u.cli);
+	osc_del_grant_list(&obd->u.cli);
 	return rc;
 }
 EXPORT_SYMBOL(osc_disconnect);
@@ -3159,8 +3214,8 @@ int osc_setup_common(struct obd_device *obd, struct lustre_cfg *lcfg)
 		goto out_ptlrpcd_work;
 
 	cli->cl_grant_shrink_interval = GRANT_SHRINK_INTERVAL;
+	osc_update_next_shrink(cli);
 
-	INIT_LIST_HEAD(&cli->cl_grant_shrink_list);
 	return 0;
 
 out_ptlrpcd_work:
@@ -3210,7 +3265,6 @@ int osc_setup(struct obd_device *obd, struct lustre_cfg *lcfg)
 		atomic_add(added, &osc_pool_req_count);
 	}
 
-	INIT_LIST_HEAD(&cli->cl_grant_shrink_list);
 	ns_register_cancel(obd->obd_namespace, osc_cancel_weight);
 
 	spin_lock(&osc_shrink_lock);
@@ -3356,14 +3410,19 @@ static int __init osc_init(void)
 	if (rc)
 		return rc;
 
+	rc = class_register_type(&osc_obd_ops, NULL,
+				 LUSTRE_OSC_NAME, &osc_device_type);
+	if (rc)
+		goto out_kmem;
+
 	rc = register_shrinker(&osc_cache_shrinker);
 	if (rc)
-		goto err;
+		goto out_type;
 
 	/* This is obviously too much memory, only prevent overflow here */
 	if (osc_reqpool_mem_max >= 1 << 12 || osc_reqpool_mem_max == 0) {
 		rc = -EINVAL;
-		goto err;
+		goto out_shrinker;
 	}
 
 	reqpool_size = osc_reqpool_mem_max << 20;
@@ -3383,29 +3442,31 @@ static int __init osc_init(void)
 	atomic_set(&osc_pool_req_count, 0);
 	osc_rq_pool = ptlrpc_init_rq_pool(0, OST_MAXREQSIZE,
 					  ptlrpc_add_rqs_to_pool);
+	if (!osc_rq_pool) {
+		rc = -ENOMEM;
+		goto out_shrinker;
+	}
 
-	rc = -ENOMEM;
-
-	if (!osc_rq_pool)
-		goto err;
-
-	rc = class_register_type(&osc_obd_ops, NULL,
-				 LUSTRE_OSC_NAME, &osc_device_type);
+	rc = osc_start_grant_work();
 	if (rc)
-		goto err;
+		goto out_req_pool;
 
 	return rc;
 
-err:
-	if (osc_rq_pool)
-		ptlrpc_free_rq_pool(osc_rq_pool);
+out_req_pool:
+	ptlrpc_free_rq_pool(osc_rq_pool);
+out_type:
+	class_unregister_type(LUSTRE_OSC_NAME);
+out_shrinker:
 	unregister_shrinker(&osc_cache_shrinker);
+out_kmem:
 	lu_kmem_fini(osc_caches);
 	return rc;
 }
 
 static void /*__exit*/ osc_exit(void)
 {
+	osc_stop_grant_work();
 	unregister_shrinker(&osc_cache_shrinker);
 	class_unregister_type(LUSTRE_OSC_NAME);
 	lu_kmem_fini(osc_caches);
-- 
1.8.3.1


* [lustre-devel] [PATCH 069/622] lustre: mdt: Lazy size on MDT
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (67 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 068/622] lustre: osc: depart grant shrinking from pinger James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 070/622] lustre: lfsck: layout LFSCK for mirrored file James Simmons
                   ` (553 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Qian Yingjin <qian@ddn.com>

The design of Lazy Size on MDT (LSOM) does not guarantee
accuracy. A file that is held open for a long time may have an
inaccurate LSOM for a very long time, and eviction or a crash of
a client may leave the close of a file incomplete, which can also
leave the LSOM inaccurate. A precise LSOM can only be read from
the MDT when 1) all possible corruption and inconsistency caused
by client eviction or client/server crashes have been fixed by
LFSCK, and 2) the file is not currently open for write.

In this first step of implementing LSOM, LSOM is not accessible
from the client. LSOM values can only be accessed on the MDT, so
no interface or logic is added on the client side to enable
access to LSOM from there.

The LSOM data is saved as an EA value on the MDT and includes
both the apparent size and the disk usage of the file. Whenever a
file is truncated, the LSOM of the file on the MDT is updated.
Whenever a client closes a file, ll_prepare_close() sends the
size and blocks to the MDS, and the MDS updates the LSOM of the
file if the file size or blocks count has increased.
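
As a userspace sketch, the wiretest checks added by this patch pin
down the on-wire layout of struct lustre_som_attrs (size 24;
lsa_valid at offset 0, lsa_reserved at 2, lsa_size at 8, lsa_blocks
at 16). The field types below are inferred from those sizes and may
differ in spelling from the real headers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model of struct lustre_som_attrs matching the offsets
 * and sizes asserted in wiretest.c; types inferred from the checks.
 */
struct som_attrs {
	uint16_t lsa_valid;       /* enum lustre_som_flags */
	uint16_t lsa_reserved[3];
	uint64_t lsa_size;        /* apparent file size */
	uint64_t lsa_blocks;      /* disk usage */
};
```

Natural alignment already gives the required layout here, which is
why the wiretest assertions catch any accidental field reordering or
padding change that would break the wire protocol.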

WC-bug-id: https://jira.whamcloud.com/browse/LU-9538
Lustre-commit: f1ebf88aef21 ("LU-9538 mdt: Lazy size on MDT")
Signed-off-by: Qian Yingjin <qian@ddn.com>
Reviewed-on: https://review.whamcloud.com/29960
Reviewed-by: Vitaly Fertman <c17818@cray.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd.h                 |  4 +++-
 fs/lustre/llite/file.c                  |  5 +++++
 fs/lustre/mdc/mdc_lib.c                 |  4 ++++
 fs/lustre/ptlrpc/wiretest.c             | 24 ++++++++++++++++++++++++
 include/uapi/linux/lustre/lustre_idl.h  |  2 ++
 include/uapi/linux/lustre/lustre_user.h | 17 +++++++++++++++--
 6 files changed, 53 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index 5656eb0..c712979 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -204,7 +204,7 @@ struct client_obd {
 	long			cl_reserved_grant;
 	wait_queue_head_t	cl_cache_waiters;	/* waiting for cache/grant */
 	time64_t		cl_next_shrink_grant;	/* seconds */
-	struct list_head	cl_grant_shrink_list;	/* Timeout event list */
+	struct list_head	cl_grant_chain;
 	time64_t		cl_grant_shrink_interval; /* seconds */
 
 	/* A chunk is an optimal size used by osc_extent to determine
@@ -670,6 +670,8 @@ enum op_xvalid {
 	OP_XVALID_OWNEROVERRIDE	= BIT(2),	/* 0x0004 */
 	OP_XVALID_FLAGS		= BIT(3),	/* 0x0008 */
 	OP_XVALID_PROJID	= BIT(4),	/* 0x0010 */
+	OP_XVALID_LAZYSIZE	= BIT(5),	/* 0x0020 */
+	OP_XVALID_LAZYBLOCKS	= BIT(6),	/* 0x0040 */
 };
 
 struct lu_context;
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index c3fb104b..837add1 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -207,6 +207,11 @@ static int ll_close_inode_openhandle(struct inode *inode,
 		break;
 	}
 
+	if (!(op_data->op_attr.ia_valid & ATTR_SIZE))
+		op_data->op_xvalid |= OP_XVALID_LAZYSIZE;
+	if (!(op_data->op_xvalid & OP_XVALID_BLOCKS))
+		op_data->op_xvalid |= OP_XVALID_LAZYBLOCKS;
+
 	rc = md_close(md_exp, op_data, och->och_mod, &req);
 	if (rc && rc != -EINTR) {
 		CERROR("%s: inode " DFID " mdc close failed: rc = %d\n",
diff --git a/fs/lustre/mdc/mdc_lib.c b/fs/lustre/mdc/mdc_lib.c
index 467503c..e2f1a49 100644
--- a/fs/lustre/mdc/mdc_lib.c
+++ b/fs/lustre/mdc/mdc_lib.c
@@ -317,6 +317,10 @@ static inline u64 attr_pack(unsigned int ia_valid, enum op_xvalid ia_xvalid)
 		sa_valid |= MDS_OPEN_OWNEROVERRIDE;
 	if (ia_xvalid & OP_XVALID_PROJID)
 		sa_valid |= MDS_ATTR_PROJID;
+	if (ia_xvalid & OP_XVALID_LAZYSIZE)
+		sa_valid |= MDS_ATTR_LSIZE;
+	if (ia_xvalid & OP_XVALID_LAZYBLOCKS)
+		sa_valid |= MDS_ATTR_LBLOCKS;
 	return sa_valid;
 }
 
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index 7b6ea86..b4bb30d 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -258,6 +258,10 @@ void lustre_assert_wire_constants(void)
 	LASSERTF(MDS_ATTR_PROJID == 0x0000000000010000ULL, "found 0x%.16llxULL\n",
 		 (long long)MDS_ATTR_PROJID);
 
+	LASSERTF(MDS_ATTR_LSIZE == 0x0000000000020000ULL, "found 0x%.16llxULL\n",
+		 (long long)MDS_ATTR_LSIZE);
+	LASSERTF(MDS_ATTR_LBLOCKS == 0x0000000000040000ULL, "found 0x%.16llxULL\n",
+		 (long long)MDS_ATTR_LBLOCKS);
 	LASSERTF(FLD_QUERY == 900, "found %lld\n",
 		 (long long)FLD_QUERY);
 	LASSERTF(FLD_FIRST_OPC == 900, "found %lld\n",
@@ -390,6 +394,26 @@ void lustre_assert_wire_constants(void)
 	LASSERTF(LU_SEQ_RANGE_OST == 1, "found %lld\n",
 		 (long long)LU_SEQ_RANGE_OST);
 
+	/* Checks for struct lustre_som_attrs */
+	LASSERTF((int)sizeof(struct lustre_som_attrs) == 24, "found %lld\n",
+		 (long long)(int)sizeof(struct lustre_som_attrs));
+	LASSERTF((int)offsetof(struct lustre_som_attrs, lsa_valid) == 0, "found %lld\n",
+		 (long long)(int)offsetof(struct lustre_som_attrs, lsa_valid));
+	LASSERTF((int)sizeof(((struct lustre_som_attrs *)0)->lsa_valid) == 2, "found %lld\n",
+		 (long long)(int)sizeof(((struct lustre_som_attrs *)0)->lsa_valid));
+	LASSERTF((int)offsetof(struct lustre_som_attrs, lsa_reserved) == 2, "found %lld\n",
+		 (long long)(int)offsetof(struct lustre_som_attrs, lsa_reserved));
+	LASSERTF((int)sizeof(((struct lustre_som_attrs *)0)->lsa_reserved) == 6, "found %lld\n",
+		 (long long)(int)sizeof(((struct lustre_som_attrs *)0)->lsa_reserved));
+	LASSERTF((int)offsetof(struct lustre_som_attrs, lsa_size) == 8, "found %lld\n",
+		 (long long)(int)offsetof(struct lustre_som_attrs, lsa_size));
+	LASSERTF((int)sizeof(((struct lustre_som_attrs *)0)->lsa_size) == 8, "found %lld\n",
+		 (long long)(int)sizeof(((struct lustre_som_attrs *)0)->lsa_size));
+	LASSERTF((int)offsetof(struct lustre_som_attrs, lsa_blocks) == 16, "found %lld\n",
+		 (long long)(int)offsetof(struct lustre_som_attrs, lsa_blocks));
+	LASSERTF((int)sizeof(((struct lustre_som_attrs *)0)->lsa_blocks) == 8, "found %lld\n",
+		 (long long)(int)sizeof(((struct lustre_som_attrs *)0)->lsa_blocks));
+
 	/* Checks for struct lustre_mdt_attrs */
 	LASSERTF((int)sizeof(struct lustre_mdt_attrs) == 24, "found %lld\n",
 		 (long long)(int)sizeof(struct lustre_mdt_attrs));
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 5db742f..9f8d65d 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -1676,6 +1676,8 @@ struct mdt_rec_setattr {
 					   */
 #define MDS_ATTR_BLOCKS		0x8000ULL  /* = 32768 */
 #define MDS_ATTR_PROJID		0x10000ULL /* = 65536 */
+#define MDS_ATTR_LSIZE		0x20000ULL /* = 131072 */
+#define MDS_ATTR_LBLOCKS	0x40000ULL /* = 262144 */
 
 enum mds_op_bias {
 /*	MDS_CHECK_SPLIT		= 1 << 0, obsolete before 2.3.58 */
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 5956f33..b2f5b57 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -202,8 +202,19 @@ struct lustre_mdt_attrs {
  */
 #define LMA_OLD_SIZE (sizeof(struct lustre_mdt_attrs) + 5 * sizeof(__u64))
 
-enum {
-	LSOM_FL_VALID = 1 << 0,
+enum lustre_som_flags {
+	/* Unknown or no SoM data, must get size from OSTs. */
+	SOM_FL_UNKNOWN	= 0x0000,
+	/* Known strictly correct, FLR or DoM file (SoM guaranteed). */
+	SOM_FL_STRICT	= 0x0001,
+	/* Known stale - was right@some point in the past, but it is
+	 * known (or likely) to be incorrect now (e.g. opened for write).
+	 */
+	SOM_FL_STALE	= 0x0002,
+	/* Approximate, may never have been strictly correct,
+	 * need to sync SOM data to achieve eventual consistency.
+	 */
+	SOM_FL_LAZY	= 0x0004,
 };
 
 struct lustre_som_attrs {
@@ -882,6 +893,8 @@ enum la_valid {
 	LA_KILL_SGID	= 1 << 14,
 	LA_PROJID	= 1 << 15,
 	LA_LAYOUT_VERSION = 1 << 16,
+	LA_LSIZE	= 1 << 17,
+	LA_LBLOCKS	= 1 << 18,
 	/**
 	 * Attributes must be transmitted to OST objects
 	 */
-- 
1.8.3.1


* [lustre-devel] [PATCH 070/622] lustre: lfsck: layout LFSCK for mirrored file
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (68 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 069/622] lustre: mdt: Lazy size on MDT James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:08 ` [lustre-devel] [PATCH 071/622] lustre: mdt: read on open for DoM files James Simmons
                   ` (552 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Fan Yong <fan.yong@intel.com>

This patch makes the layout LFSCK support mirrored files as
follows:

1. Verify a mirrored file's LOV EA and PFID EA, covering all the
   kinds of inconsistency that a non-mirrored file may hit.

2. Rebuild a mirrored file's LOV EA from orphan OST-objects,
   recovering each component's pre-crash status/flags:
   init, stale, and so on.

3. For a mirrored file with a dangling reference (OST-object), it
   does NOT rebuild the lost OST-object from another replica;
   instead, it either reports the corruption or re-creates an
   empty OST-object, following the same rules as the non-mirrored
   case.

Some code cleanup and new test cases for LFSCK against mirrored
files.

For the Linux client we want to keep the wire protocol in sync.
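
On the wire-protocol side, the patch extends
lustre_swab_lov_comp_md_v1() to byte-swap the new lcme_layout_gen
field. As a userspace sketch, the in-place 32-bit swap that the
kernel's __swab32s() performs looks like this (illustrative name,
not the kernel helper itself):

```c
#include <assert.h>
#include <stdint.h>

/* In-place 32-bit byte swap, as applied to fields such as
 * lcme_layout_gen when the on-wire endianness differs from the
 * host's.
 */
static void swab32s(uint32_t *x)
{
	uint32_t v = *x;

	*x = ((v & 0x000000ffU) << 24) |
	     ((v & 0x0000ff00U) << 8)  |
	     ((v & 0x00ff0000U) >> 8)  |
	     ((v & 0xff000000U) >> 24);
}
```

Forgetting to swab a newly added field is exactly the kind of bug
the BUILD_BUG_ON() padding checks in the patch are meant to flush
out when the struct layout changes.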

WC-bug-id: https://jira.whamcloud.com/browse/LU-10288
Lustre-commit: 36ba989752c6 ("LU-10288 lfsck: layout LFSCK for mirrored file")
Signed-off-by: Fan Yong <fan.yong@intel.com>
Reviewed-on: https://review.whamcloud.com/32705
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/pack_generic.c         |  4 +++-
 fs/lustre/ptlrpc/wiretest.c             | 16 ++++++++++++----
 include/uapi/linux/lustre/lustre_user.h |  4 +++-
 3 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/fs/lustre/ptlrpc/pack_generic.c b/fs/lustre/ptlrpc/pack_generic.c
index 9cea826..d09cf3f 100644
--- a/fs/lustre/ptlrpc/pack_generic.c
+++ b/fs/lustre/ptlrpc/pack_generic.c
@@ -2066,7 +2066,9 @@ void lustre_swab_lov_comp_md_v1(struct lov_comp_md_v1 *lum)
 		__swab64s(&ent->lcme_extent.e_end);
 		__swab32s(&ent->lcme_offset);
 		__swab32s(&ent->lcme_size);
-		BUILD_BUG_ON(offsetof(typeof(*ent), lcme_padding) == 0);
+		__swab32s(&ent->lcme_layout_gen);
+		BUILD_BUG_ON(offsetof(typeof(*ent), lcme_padding_1) == 0);
+		BUILD_BUG_ON(offsetof(typeof(*ent), lcme_padding_2) == 0);
 
 		v1 = (struct lov_user_md_v1 *)((char *)lum + off);
 		stripe_count = v1->lmm_stripe_count;
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index b4bb30d..e22f8f8 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1536,10 +1536,18 @@ void lustre_assert_wire_constants(void)
 		 (long long)(int)offsetof(struct lov_comp_md_entry_v1, lcme_size));
 	LASSERTF((int)sizeof(((struct lov_comp_md_entry_v1 *)0)->lcme_size) == 4, "found %lld\n",
 		 (long long)(int)sizeof(((struct lov_comp_md_entry_v1 *)0)->lcme_size));
-	LASSERTF((int)offsetof(struct lov_comp_md_entry_v1, lcme_padding) == 32, "found %lld\n",
-		 (long long)(int)offsetof(struct lov_comp_md_entry_v1, lcme_padding));
-	LASSERTF((int)sizeof(((struct lov_comp_md_entry_v1 *)0)->lcme_padding) == 16, "found %lld\n",
-		 (long long)(int)sizeof(((struct lov_comp_md_entry_v1 *)0)->lcme_padding));
+	LASSERTF((int)offsetof(struct lov_comp_md_entry_v1, lcme_layout_gen) == 32, "found %lld\n",
+		 (long long)(int)offsetof(struct lov_comp_md_entry_v1, lcme_layout_gen));
+	LASSERTF((int)sizeof(((struct lov_comp_md_entry_v1 *)0)->lcme_layout_gen) == 4, "found %lld\n",
+		 (long long)(int)sizeof(((struct lov_comp_md_entry_v1 *)0)->lcme_layout_gen));
+	LASSERTF((int)offsetof(struct lov_comp_md_entry_v1, lcme_padding_1) == 36, "found %lld\n",
+		 (long long)(int)offsetof(struct lov_comp_md_entry_v1, lcme_padding_1));
+	LASSERTF((int)sizeof(((struct lov_comp_md_entry_v1 *)0)->lcme_padding_1) == 4, "found %lld\n",
+		 (long long)(int)sizeof(((struct lov_comp_md_entry_v1 *)0)->lcme_padding_1));
+	LASSERTF((int)offsetof(struct lov_comp_md_entry_v1, lcme_padding_2) == 40, "found %lld\n",
+		 (long long)(int)offsetof(struct lov_comp_md_entry_v1, lcme_padding_2));
+	LASSERTF((int)sizeof(((struct lov_comp_md_entry_v1 *)0)->lcme_padding_2) == 8, "found %lld\n",
+		 (long long)(int)sizeof(((struct lov_comp_md_entry_v1 *)0)->lcme_padding_2));
 	LASSERTF(LCME_FL_INIT == 0x00000010UL, "found 0x%.8xUL\n",
 		 (unsigned int)LCME_FL_INIT);
 	LASSERTF(LCME_FL_NEG == 0x80000000UL, "found 0x%.8xUL\n",
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index b2f5b57..8fd5b26 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -517,7 +517,9 @@ struct lov_comp_md_entry_v1 {
 						 * start from lov_comp_md_v1
 						 */
 	__u32			lcme_size;	/* size of component blob */
-	__u64			lcme_padding[2];
+	__u32			lcme_layout_gen;
+	__u32			lcme_padding_1;
+	__u64			lcme_padding_2;
 } __packed;
 
 #define SEQ_ID_MAX		0x0000FFFF
-- 
1.8.3.1


* [lustre-devel] [PATCH 071/622] lustre: mdt: read on open for DoM files
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (69 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 070/622] lustre: lfsck: layout LFSCK for mirrored file James Simmons
@ 2020-02-27 21:08 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 072/622] lustre: migrate: pack lmv ea in migrate rpc James Simmons
                   ` (551 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:08 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

Read file data upon open and return it in the reply. This works
only for files with a Data-on-MDT layout and no initialized OST
components. Three cases may occur:
1) The file data fits in the already-allocated reply buffer (~9K)
   and is returned in that buffer in the OPEN reply.
2) The file fits in the maximum reply buffer (128K); the reply is
   returned with a larger size to the client, causing a resend
   with a re-allocated buffer.
3) The file doesn't fit in the reply buffer, but its tail fills a
   page partially; that tail is returned. This can be useful for
   the append case.
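
The three cases above can be sketched as a small classifier; this
helper is purely illustrative (it is not part of the patch), with the
~9K/128K limits from the description passed in as parameters:

```c
#include <stddef.h>

/* Hypothetical helper, not taken from the patch: decide which of the
 * three read-on-open cases applies for a DoM file of the given size.
 */
enum dom_open_case {
	DOM_CASE_INLINE = 1,	/* case 1: data fits in the default reply buffer */
	DOM_CASE_RESEND = 2,	/* case 2: client must resend with a larger buffer */
	DOM_CASE_TAIL   = 3,	/* case 3: only the partial tail page is returned */
};

static enum dom_open_case dom_classify(size_t file_size,
				       size_t def_repbuf,
				       size_t max_repbuf)
{
	if (file_size <= def_repbuf)
		return DOM_CASE_INLINE;
	if (file_size <= max_repbuf)
		return DOM_CASE_RESEND;
	return DOM_CASE_TAIL;
}
```

In the actual patch the default-buffer case is handled entirely in the
OPEN reply path, while the resend case is driven by the server
returning a larger reply size to the client.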

WC-bug-id: https://jira.whamcloud.com/browse/LU-10181
Lustre-commit: 13372d6c243c ("LU-10181 mdt: read on open for DoM files")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/23011
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_req_layout.h |   1 +
 fs/lustre/include/obd.h               |  11 +++
 fs/lustre/llite/file.c                | 131 +++++++++++++++++++++++++++++++++-
 fs/lustre/llite/llite_internal.h      |   3 +
 fs/lustre/llite/namei.c               |   3 +
 fs/lustre/mdc/lproc_mdc.c             |  32 +++++++++
 fs/lustre/mdc/mdc_internal.h          |   4 ++
 fs/lustre/mdc/mdc_locks.c             |  28 +++++++-
 fs/lustre/mdc/mdc_request.c           |   2 +
 fs/lustre/ptlrpc/layout.c             |  11 ++-
 fs/lustre/ptlrpc/niobuf.c             |   5 ++
 11 files changed, 227 insertions(+), 4 deletions(-)

diff --git a/fs/lustre/include/lustre_req_layout.h b/fs/lustre/include/lustre_req_layout.h
index 2737240..807d080 100644
--- a/fs/lustre/include/lustre_req_layout.h
+++ b/fs/lustre/include/lustre_req_layout.h
@@ -291,6 +291,7 @@ void req_capsule_shrink(struct req_capsule *pill,
 extern struct req_msg_field RMF_OBD_ID;
 extern struct req_msg_field RMF_FID;
 extern struct req_msg_field RMF_NIOBUF_REMOTE;
+extern struct req_msg_field RMF_NIOBUF_INLINE;
 extern struct req_msg_field RMF_RCS;
 extern struct req_msg_field RMF_FIEMAP_KEY;
 extern struct req_msg_field RMF_FIEMAP_VAL;
diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index c712979..de9642f 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -184,6 +184,17 @@ struct client_obd {
 	 */
 	u32			 cl_max_mds_easize;
 
+	/* Data-on-MDT specific value to set larger reply buffer for possible
+	 * data read along with open/stat requests. By default it tries to use
+	 * unused space in reply buffer.
+	 * This value is used to ensure that reply buffer has at least as
+	 * much free space as value indicates. That free space is gained from
+	 * LOV EA buffer which is small for DoM files and on big systems can
+	 * provide up to 32KB of extra space in reply buffer.
+	 * Default value is 8K now.
+	 */
+	u32			 cl_dom_min_inline_repsize;
+
 	enum lustre_sec_part     cl_sp_me;
 	enum lustre_sec_part     cl_sp_to;
 	struct sptlrpc_flavor    cl_flvr_mgc;   /* fixed flavor of mgc->mgs */
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 837add1..7657c79 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -393,6 +393,132 @@ int ll_file_release(struct inode *inode, struct file *file)
 	return rc;
 }
 
+static inline int ll_dom_readpage(void *data, struct page *page)
+{
+	struct niobuf_local *lnb = data;
+	void *kaddr;
+
+	kaddr = kmap_atomic(page);
+	memcpy(kaddr, lnb->lnb_data, lnb->lnb_len);
+	if (lnb->lnb_len < PAGE_SIZE)
+		memset(kaddr + lnb->lnb_len, 0,
+		       PAGE_SIZE - lnb->lnb_len);
+	flush_dcache_page(page);
+	SetPageUptodate(page);
+	kunmap_atomic(kaddr);
+	unlock_page(page);
+
+	return 0;
+}
+
+void ll_dom_finish_open(struct inode *inode, struct ptlrpc_request *req,
+			struct lookup_intent *it)
+{
+	struct ll_inode_info *lli = ll_i2info(inode);
+	struct cl_object *obj = lli->lli_clob;
+	struct address_space *mapping = inode->i_mapping;
+	struct page *vmpage;
+	struct niobuf_remote *rnb;
+	char *data;
+	struct lu_env *env;
+	struct cl_io *io;
+	u16 refcheck;
+	struct lustre_handle lockh;
+	struct ldlm_lock *lock;
+	unsigned long index, start;
+	struct niobuf_local lnb;
+	int rc;
+	bool dom_lock = false;
+
+	if (!obj)
+		return;
+
+	if (it->it_lock_mode != 0) {
+		lockh.cookie = it->it_lock_handle;
+		lock = ldlm_handle2lock(&lockh);
+		if (lock)
+			dom_lock = ldlm_has_dom(lock);
+		LDLM_LOCK_PUT(lock);
+	}
+
+	if (!dom_lock)
+		return;
+
+	env = cl_env_get(&refcheck);
+	if (IS_ERR(env))
+		return;
+
+	if (!req_capsule_has_field(&req->rq_pill, &RMF_NIOBUF_INLINE,
+				   RCL_SERVER)) {
+		rc = -ENODATA;
+		goto out_env;
+	}
+
+	rnb = req_capsule_server_get(&req->rq_pill, &RMF_NIOBUF_INLINE);
+	data = (char *)rnb + sizeof(*rnb);
+
+	if (!rnb || rnb->rnb_len == 0) {
+		rc = 0;
+		goto out_env;
+	}
+
+	CDEBUG(D_INFO, "Get data buffer along with open, len %i, i_size %llu\n",
+	       rnb->rnb_len, i_size_read(inode));
+
+	io = vvp_env_thread_io(env);
+	io->ci_obj = obj;
+	io->ci_ignore_layout = 1;
+	rc = cl_io_init(env, io, CIT_MISC, obj);
+	if (rc)
+		goto out_io;
+
+	lnb.lnb_file_offset = rnb->rnb_offset;
+	start = lnb.lnb_file_offset / PAGE_SIZE;
+	index = 0;
+	LASSERT(lnb.lnb_file_offset % PAGE_SIZE == 0);
+	lnb.lnb_page_offset = 0;
+	do {
+		struct cl_page *clp;
+
+		lnb.lnb_data = data + (index << PAGE_SHIFT);
+		lnb.lnb_len = rnb->rnb_len - (index << PAGE_SHIFT);
+		if (lnb.lnb_len > PAGE_SIZE)
+			lnb.lnb_len = PAGE_SIZE;
+
+		vmpage = read_cache_page(mapping, index + start,
+					 ll_dom_readpage, &lnb);
+		if (IS_ERR(vmpage)) {
+			CWARN("%s: cannot fill page %lu for "DFID
+			      " with data: rc = %li\n",
+			      ll_get_fsname(inode->i_sb, NULL, 0),
+			      index + start, PFID(lu_object_fid(&obj->co_lu)),
+			      PTR_ERR(vmpage));
+			break;
+		}
+		lock_page(vmpage);
+		clp = cl_page_find(env, obj, vmpage->index, vmpage,
+				   CPT_CACHEABLE);
+		if (IS_ERR(clp)) {
+			unlock_page(vmpage);
+			put_page(vmpage);
+			rc = PTR_ERR(clp);
+			goto out_io;
+		}
+
+		/* export page */
+		cl_page_export(env, clp, 1);
+		cl_page_put(env, clp);
+		unlock_page(vmpage);
+		put_page(vmpage);
+		index++;
+	} while (rnb->rnb_len > (index << PAGE_SHIFT));
+	rc = 0;
+out_io:
+	cl_io_fini(env, io);
+out_env:
+	cl_env_put(env, &refcheck);
+}
+
 static int ll_intent_file_open(struct dentry *de, void *lmm, int lmmsize,
 			       struct lookup_intent *itp)
 {
@@ -450,8 +576,11 @@ static int ll_intent_file_open(struct dentry *de, void *lmm, int lmmsize,
 	}
 
 	rc = ll_prep_inode(&inode, req, NULL, itp);
-	if (!rc && itp->it_lock_mode)
+
+	if (!rc && itp->it_lock_mode) {
+		ll_dom_finish_open(d_inode(de), req, itp);
 		ll_set_lock_data(sbi->ll_md_exp, inode, itp, NULL);
+	}
 
 out:
 	ptlrpc_req_finished(req);
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index 6bdbf28..7491397 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -916,6 +916,9 @@ struct md_op_data *ll_prep_md_op_data(struct md_op_data *op_data,
 ssize_t ll_copy_user_md(const struct lov_user_md __user *md,
 			struct lov_user_md **kbuf);
 
+void ll_dom_finish_open(struct inode *inode, struct ptlrpc_request *req,
+			struct lookup_intent *it);
+
 /* Compute expected user md size when passing in a md from user space */
 static inline ssize_t ll_lov_user_md_size(const struct lov_user_md *lum)
 {
diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index f835abb..4ac62b2 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -600,6 +600,9 @@ static int ll_lookup_it_finish(struct ptlrpc_request *request,
 		if (rc)
 			return rc;
 
+		if (it->it_op & IT_OPEN)
+			ll_dom_finish_open(inode, request, it);
+
 		ll_set_lock_data(ll_i2sbi(parent)->ll_md_exp, inode, it, &bits);
 
 		/* We used to query real size from OSTs here, but actually
diff --git a/fs/lustre/mdc/lproc_mdc.c b/fs/lustre/mdc/lproc_mdc.c
index 6b87e76..0c52bcf 100644
--- a/fs/lustre/mdc/lproc_mdc.c
+++ b/fs/lustre/mdc/lproc_mdc.c
@@ -456,6 +456,36 @@ static ssize_t mdc_stats_seq_write(struct file *file,
 }
 LPROC_SEQ_FOPS(mdc_stats);
 
+static int mdc_dom_min_repsize_seq_show(struct seq_file *m, void *v)
+{
+	struct obd_device *dev = m->private;
+
+	seq_printf(m, "%u\n", dev->u.cli.cl_dom_min_inline_repsize);
+
+	return 0;
+}
+
+static ssize_t mdc_dom_min_repsize_seq_write(struct file *file,
+					     const char __user *buffer,
+					     size_t count, loff_t *off)
+{
+	struct obd_device *dev;
+	unsigned int val;
+	int rc;
+
+	dev =  ((struct seq_file *)file->private_data)->private;
+	rc = kstrtouint_from_user(buffer, count, 0, &val);
+	if (rc)
+		return rc;
+
+	if (val > MDC_DOM_MAX_INLINE_REPSIZE)
+		return -ERANGE;
+
+	dev->u.cli.cl_dom_min_inline_repsize = val;
+	return count;
+}
+LPROC_SEQ_FOPS(mdc_dom_min_repsize);
+
 LPROC_SEQ_FOPS_RO_TYPE(mdc, connect_flags);
 LPROC_SEQ_FOPS_RO_TYPE(mdc, server_uuid);
 LPROC_SEQ_FOPS_RO_TYPE(mdc, timeouts);
@@ -489,6 +519,8 @@ static ssize_t mdc_stats_seq_write(struct file *file,
 	  .fops	=	&mdc_unstable_stats_fops	},
 	{ .name	=	"mdc_stats",
 	  .fops	=	&mdc_stats_fops			},
+	{ .name	=	"mdc_dom_min_repsize",
+	  .fops	=	&mdc_dom_min_repsize_fops	},
 	{ NULL }
 };
 
diff --git a/fs/lustre/mdc/mdc_internal.h b/fs/lustre/mdc/mdc_internal.h
index 079539d..6cfa79c 100644
--- a/fs/lustre/mdc/mdc_internal.h
+++ b/fs/lustre/mdc/mdc_internal.h
@@ -159,4 +159,8 @@ int mdc_ldlm_blocking_ast(struct ldlm_lock *dlmlock,
 			  struct ldlm_lock_desc *new, void *data, int flag);
 int mdc_ldlm_glimpse_ast(struct ldlm_lock *dlmlock, void *data);
 int mdc_fill_lvb(struct ptlrpc_request *req, struct ost_lvb *lvb);
+
+#define MDC_DOM_DEF_INLINE_REPSIZE 8192
+#define MDC_DOM_MAX_INLINE_REPSIZE XATTR_SIZE_MAX
+
 #endif
diff --git a/fs/lustre/mdc/mdc_locks.c b/fs/lustre/mdc/mdc_locks.c
index 2e4a5c6..abbc908 100644
--- a/fs/lustre/mdc/mdc_locks.c
+++ b/fs/lustre/mdc/mdc_locks.c
@@ -254,8 +254,9 @@ static int mdc_save_lovea(struct ptlrpc_request *req,
 	u32 lmmsize = op_data->op_data_size;
 	LIST_HEAD(cancels);
 	int count = 0;
-	int mode;
+	enum ldlm_mode mode;
 	int rc;
+	int repsize;
 
 	it->it_create_mode = (it->it_create_mode & ~S_IFMT) | S_IFREG;
 
@@ -336,7 +337,32 @@ static int mdc_save_lovea(struct ptlrpc_request *req,
 			     obddev->u.cli.cl_max_mds_easize);
 	req_capsule_set_size(&req->rq_pill, &RMF_ACL, RCL_SERVER, acl_bufsize);
 
+	/**
+	 * Inline buffer for possible data from Data-on-MDT files.
+	 */
+	req_capsule_set_size(&req->rq_pill, &RMF_NIOBUF_INLINE, RCL_SERVER,
+			     sizeof(struct niobuf_remote));
 	ptlrpc_request_set_replen(req);
+
+	/* Get real repbuf allocated size as rounded up power of 2 */
+	repsize = size_roundup_power2(req->rq_replen +
+				      lustre_msg_early_size());
+
+	/* Estimate free space for DoM files in repbuf */
+	repsize -= req->rq_replen - obddev->u.cli.cl_max_mds_easize +
+		   sizeof(struct lov_comp_md_v1) +
+		   sizeof(struct lov_comp_md_entry_v1) +
+		   lov_mds_md_size(0, LOV_MAGIC_V3);
+
+	if (repsize < obddev->u.cli.cl_dom_min_inline_repsize) {
+		repsize = obddev->u.cli.cl_dom_min_inline_repsize - repsize;
+		req_capsule_set_size(&req->rq_pill, &RMF_NIOBUF_INLINE,
+				     RCL_SERVER,
+				     sizeof(struct niobuf_remote) + repsize);
+		ptlrpc_request_set_replen(req);
+		CDEBUG(D_INFO, "Increase repbuf by %d bytes, total: %d\n",
+		       repsize, req->rq_replen);
+	}
 	return req;
 }
 
diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index feac374..b173937 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -2551,6 +2551,8 @@ int mdc_setup(struct obd_device *obd, struct lustre_cfg *cfg)
 	if (rc)
 		goto err_osc_cleanup;
 
+	obd->u.cli.cl_dom_min_inline_repsize = MDC_DOM_DEF_INLINE_REPSIZE;
+
 	ns_register_cancel(obd->obd_namespace, mdc_cancel_weight);
 
 	obd->obd_namespace->ns_lvbo = &inode_lvbo;
diff --git a/fs/lustre/ptlrpc/layout.c b/fs/lustre/ptlrpc/layout.c
index 8fe661d..c11b1b0 100644
--- a/fs/lustre/ptlrpc/layout.c
+++ b/fs/lustre/ptlrpc/layout.c
@@ -414,7 +414,8 @@
 	&RMF_MDT_MD,
 	&RMF_ACL,
 	&RMF_CAPA1,
-	&RMF_CAPA2
+	&RMF_CAPA2,
+	&RMF_NIOBUF_INLINE,
 };
 
 static const struct req_msg_field *ldlm_intent_getattr_client[] = {
@@ -1065,8 +1066,14 @@ struct req_msg_field RMF_NIOBUF_REMOTE =
 		    dump_rniobuf);
 EXPORT_SYMBOL(RMF_NIOBUF_REMOTE);
 
+struct req_msg_field RMF_NIOBUF_INLINE =
+	DEFINE_MSGF("niobuf_inline", RMF_F_NO_SIZE_CHECK,
+		    sizeof(struct niobuf_remote), lustre_swab_niobuf_remote,
+		    dump_rniobuf);
+EXPORT_SYMBOL(RMF_NIOBUF_INLINE);
+
 struct req_msg_field RMF_RCS =
-	DEFINE_MSGF("niobuf_remote", RMF_F_STRUCT_ARRAY, sizeof(u32),
+	DEFINE_MSGF("niobuf_rcs", RMF_F_STRUCT_ARRAY, sizeof(u32),
 		    lustre_swab_generic_32s, dump_rcs);
 EXPORT_SYMBOL(RMF_RCS);
 
diff --git a/fs/lustre/ptlrpc/niobuf.c b/fs/lustre/ptlrpc/niobuf.c
index 2e866fe..e8ba57b 100644
--- a/fs/lustre/ptlrpc/niobuf.c
+++ b/fs/lustre/ptlrpc/niobuf.c
@@ -617,6 +617,11 @@ int ptl_send_rpc(struct ptlrpc_request *request, int noreply)
 				request->rq_status = rc;
 				goto cleanup_bulk;
 			}
+			/* Use real allocated value in lm_repsize,
+			 * so the server may use whole reply buffer
+			 * without resends where it is needed.
+			 */
+			request->rq_reqmsg->lm_repsize = request->rq_repbuf_len;
 		} else {
 			request->rq_repdata = NULL;
 			request->rq_repmsg = NULL;
-- 
1.8.3.1


* [lustre-devel] [PATCH 072/622] lustre: migrate: pack lmv ea in migrate rpc
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (70 preceding siblings ...)
  2020-02-27 21:08 ` [lustre-devel] [PATCH 071/622] lustre: mdt: read on open for DoM files James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 073/622] lustre: hsm: add OBD_CONNECT2_ARCHIVE_ID_ARRAY to pass archive_id lists in array James Simmons
                   ` (550 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

To support striped directory migration, pack the lmv_user_md in the
migrate RPC. Add 'mdt-count' and 'mdt-hash' arguments to
'lfs migrate'.

Temporarily disable the directory-migration-related tests; they will
be re-enabled in the last patch of this set.
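
The reworked lmv_user_md_size() in this patch only accounts for a
per-stripe array when the "specific" magic is used. A standalone sketch
of that sizing rule, with stand-in structures and placeholder magic
values rather than the real lustre_user.h definitions:

```c
#include <stddef.h>

/* Minimal stand-ins for the real lustre_user.h structures; the field
 * layouts and magic values here are illustrative only. */
struct lmv_user_mds_data_sk {
	unsigned long long lum_fid[3];
	unsigned int	   lum_mds;
};

struct lmv_user_md_sk {
	unsigned int lum_magic;
	unsigned int lum_stripe_count;
	unsigned int lum_stripe_offset;
	unsigned int lum_hash_type;
};

#define LMV_USER_MAGIC_SK	   0x0CD30CD0U	/* placeholder value */
#define LMV_USER_MAGIC_SPECIFIC_SK 0x0CD40CD0U	/* placeholder value */

/* Mirrors the reworked helper: only the "specific" magic carries a
 * per-stripe lmv_user_mds_data array after the fixed-size header. */
static size_t lmv_user_md_size_sk(int stripes, unsigned int magic)
{
	size_t size = sizeof(struct lmv_user_md_sk);

	if (magic == LMV_USER_MAGIC_SPECIFIC_SK)
		size += (size_t)stripes * sizeof(struct lmv_user_mds_data_sk);

	return size;
}
```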

WC-bug-id: https://jira.whamcloud.com/browse/LU-4684
Lustre-commit: 470bdeec6ca5 ("LU-4684 migrate: pack lmv ea in migrate rpc")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/31424
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Fan Yong <fan.yong@intel.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/dir.c                   | 19 ++++++----
 fs/lustre/llite/file.c                  | 67 +++++++++++++++++----------------
 fs/lustre/llite/llite_internal.h        |  4 +-
 fs/lustre/llite/llite_lib.c             |  4 +-
 fs/lustre/mdc/mdc_lib.c                 | 21 +++++++----
 fs/lustre/mdc/mdc_reint.c               | 20 ++--------
 fs/lustre/ptlrpc/layout.c               |  3 +-
 include/uapi/linux/lustre/lustre_idl.h  |  2 +-
 include/uapi/linux/lustre/lustre_user.h |  8 +++-
 9 files changed, 77 insertions(+), 71 deletions(-)

diff --git a/fs/lustre/llite/dir.c b/fs/lustre/llite/dir.c
index c0c3bf0..751d0183 100644
--- a/fs/lustre/llite/dir.c
+++ b/fs/lustre/llite/dir.c
@@ -1322,7 +1322,8 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 			goto finish_req;
 		}
 
-		lum_size = lmv_user_md_size(stripe_count, LMV_MAGIC_V1);
+		lum_size = lmv_user_md_size(stripe_count,
+					    LMV_USER_MAGIC_SPECIFIC);
 		tmp = kzalloc(lum_size, GFP_NOFS);
 		if (!tmp) {
 			rc = -ENOMEM;
@@ -1655,14 +1656,14 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		return rc;
 	}
 	case LL_IOC_MIGRATE: {
-		const char *filename;
+		struct lmv_user_md *lum;
+		char *filename;
 		int namelen = 0;
 		int len;
 		int rc;
-		int mdtidx;
 
 		rc = obd_ioctl_getdata(&data, &len, (void __user *)arg);
-		if (rc < 0)
+		if (rc)
 			return rc;
 
 		if (!data->ioc_inlbuf1 || !data->ioc_inlbuf2 ||
@@ -1674,17 +1675,21 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		filename = data->ioc_inlbuf1;
 		namelen = data->ioc_inllen1;
 		if (namelen < 1 || namelen != strlen(filename) + 1) {
+			CDEBUG(D_INFO, "IOC_MDC_LOOKUP missing filename\n");
 			rc = -EINVAL;
 			goto migrate_free;
 		}
 
-		if (data->ioc_inllen2 != sizeof(mdtidx)) {
+		lum = (struct lmv_user_md *)data->ioc_inlbuf2;
+		if (lum->lum_magic != LMV_USER_MAGIC &&
+		    lum->lum_magic != LMV_USER_MAGIC_SPECIFIC) {
 			rc = -EINVAL;
+			CERROR("%s: wrong lum magic %x: rc = %d\n",
+			       filename, lum->lum_magic, rc);
 			goto migrate_free;
 		}
-		mdtidx = *(int *)data->ioc_inlbuf2;
 
-		rc = ll_migrate(inode, file, mdtidx, filename, namelen - 1);
+		rc = ll_migrate(inode, file, lum, filename);
 migrate_free:
 		kvfree(data);
 
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 7657c79..68fb623 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -3785,8 +3785,8 @@ int ll_get_fid_by_name(struct inode *parent, const char *name,
 	return rc;
 }
 
-int ll_migrate(struct inode *parent, struct file *file, int mdtidx,
-	       const char *name, int namelen)
+int ll_migrate(struct inode *parent, struct file *file, struct lmv_user_md *lum,
+	       const char *name)
 {
 	struct ptlrpc_request *request = NULL;
 	struct obd_client_handle *och = NULL;
@@ -3795,16 +3795,18 @@ int ll_migrate(struct inode *parent, struct file *file, int mdtidx,
 	struct md_op_data *op_data;
 	struct mdt_body *body;
 	u64 data_version = 0;
+	size_t namelen = strlen(name);
+	int lumlen = lmv_user_md_size(lum->lum_stripe_count, lum->lum_magic);
 	struct qstr qstr;
 	int rc;
 
-	CDEBUG(D_VFSTRACE, "migrate %s under " DFID " to MDT%d\n",
-	       name, PFID(ll_inode2fid(parent)), mdtidx);
+	CDEBUG(D_VFSTRACE, "migrate " DFID "/%s to MDT%d stripe count %d\n",
+	       PFID(ll_inode2fid(parent)), name,
+	       lum->lum_stripe_offset, lum->lum_stripe_count);
 
-	op_data = ll_prep_md_op_data(NULL, parent, NULL, name, namelen,
-				     0, LUSTRE_OPC_ANY, NULL);
-	if (IS_ERR(op_data))
-		return PTR_ERR(op_data);
+	if (lum->lum_magic != cpu_to_le32(LMV_USER_MAGIC) &&
+	    lum->lum_magic != cpu_to_le32(LMV_USER_MAGIC_SPECIFIC))
+		lustre_swab_lmv_user_md(lum);
 
 	/* Get child FID first */
 	qstr.hash = full_name_hash(file_dentry(file), name, namelen);
@@ -3818,16 +3820,14 @@ int ll_migrate(struct inode *parent, struct file *file, int mdtidx,
 	}
 
 	if (!child_inode) {
-		rc = ll_get_fid_by_name(parent, name, namelen,
-					&op_data->op_fid3, &child_inode);
+		rc = ll_get_fid_by_name(parent, name, namelen, NULL,
+					&child_inode);
 		if (rc)
-			goto out_free;
+			return rc;
 	}
 
-	if (!child_inode) {
-		rc = -EINVAL;
-		goto out_free;
-	}
+	if (!child_inode)
+		return -ENOENT;
 
 	/*
 	 * lfs migrate command needs to be blocked on the client
@@ -3839,6 +3839,13 @@ int ll_migrate(struct inode *parent, struct file *file, int mdtidx,
 		goto out_iput;
 	}
 
+	op_data = ll_prep_md_op_data(NULL, parent, NULL, name, namelen,
+				     child_inode->i_mode, LUSTRE_OPC_ANY, NULL);
+	if (IS_ERR(op_data)) {
+		rc = PTR_ERR(op_data);
+		goto out_iput;
+	}
+
 	inode_lock(child_inode);
 	op_data->op_fid3 = *ll_inode2fid(child_inode);
 	if (!fid_is_sane(&op_data->op_fid3)) {
@@ -3849,16 +3856,10 @@ int ll_migrate(struct inode *parent, struct file *file, int mdtidx,
 		goto out_unlock;
 	}
 
-	rc = ll_get_mdt_idx_by_fid(ll_i2sbi(parent), &op_data->op_fid3);
-	if (rc < 0)
-		goto out_unlock;
+	op_data->op_cli_flags |= CLI_MIGRATE | CLI_SET_MEA;
+	op_data->op_data = lum;
+	op_data->op_data_size = lumlen;
 
-	if (rc == mdtidx) {
-		CDEBUG(D_INFO, "%s: " DFID " is already on MDT%d.\n", name,
-		       PFID(&op_data->op_fid3), mdtidx);
-		rc = 0;
-		goto out_unlock;
-	}
 again:
 	if (S_ISREG(child_inode->i_mode)) {
 		och = ll_lease_open(child_inode, NULL, FMODE_WRITE, 0);
@@ -3874,16 +3875,17 @@ int ll_migrate(struct inode *parent, struct file *file, int mdtidx,
 			goto out_close;
 
 		op_data->op_handle = och->och_fh;
-		op_data->op_data = och->och_mod;
 		op_data->op_data_version = data_version;
 		op_data->op_lease_handle = och->och_lease_handle;
-		op_data->op_bias |= MDS_RENAME_MIGRATE;
+		op_data->op_bias |= MDS_CLOSE_MIGRATE;
+
+		spin_lock(&och->och_mod->mod_open_req->rq_lock);
+		och->och_mod->mod_open_req->rq_replay = 0;
+		spin_unlock(&och->och_mod->mod_open_req->rq_lock);
 	}
 
-	op_data->op_mds = mdtidx;
-	op_data->op_cli_flags = CLI_MIGRATE;
-	rc = md_rename(ll_i2sbi(parent)->ll_md_exp, op_data, name,
-		       namelen, name, namelen, &request);
+	rc = md_rename(ll_i2sbi(parent)->ll_md_exp, op_data, name, namelen,
+		       name, namelen, &request);
 	if (!rc) {
 		LASSERT(request);
 		ll_update_times(request, parent);
@@ -3915,16 +3917,15 @@ int ll_migrate(struct inode *parent, struct file *file, int mdtidx,
 		goto again;
 
 out_close:
-	if (och) /* close the file */
+	if (och)
 		ll_lease_close(och, child_inode, NULL);
 	if (!rc)
 		clear_nlink(child_inode);
 out_unlock:
 	inode_unlock(child_inode);
+	ll_finish_md_op_data(op_data);
 out_iput:
 	iput(child_inode);
-out_free:
-	ll_finish_md_op_data(op_data);
 	return rc;
 }
 
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index 7491397..edb5f2a 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -824,8 +824,8 @@ int ll_getattr(const struct path *path, struct kstat *stat,
 #define ll_set_acl NULL
 #endif /* CONFIG_LUSTRE_FS_POSIX_ACL */
 
-int ll_migrate(struct inode *parent, struct file *file, int mdtidx,
-	       const char *name, int namelen);
+int ll_migrate(struct inode *parent, struct file *file,
+	       struct lmv_user_md *lum, const char *name);
 int ll_get_fid_by_name(struct inode *parent, const char *name,
 		       int namelen, struct lu_fid *fid, struct inode **inode);
 int ll_inode_permission(struct inode *inode, int mask);
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 56624e8..c04146f 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -209,7 +209,9 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 				  OBD_CONNECT_GRANT_PARAM |
 				  OBD_CONNECT_SHORTIO | OBD_CONNECT_FLAGS2;
 
-	data->ocd_connect_flags2 = OBD_CONNECT2_FLR | OBD_CONNECT2_LOCK_CONVERT;
+	data->ocd_connect_flags2 = OBD_CONNECT2_FLR |
+				   OBD_CONNECT2_LOCK_CONVERT |
+				   OBD_CONNECT2_DIR_MIGRATE;
 
 	if (sbi->ll_flags & LL_SBI_LRU_RESIZE)
 		data->ocd_connect_flags |= OBD_CONNECT_LRU_RESIZE;
diff --git a/fs/lustre/mdc/mdc_lib.c b/fs/lustre/mdc/mdc_lib.c
index e2f1a49..1d38574 100644
--- a/fs/lustre/mdc/mdc_lib.c
+++ b/fs/lustre/mdc/mdc_lib.c
@@ -443,7 +443,7 @@ static void mdc_close_intent_pack(struct ptlrpc_request *req,
 	struct close_data *data;
 	struct ldlm_lock *lock;
 
-	if (!(bias & (MDS_CLOSE_INTENT | MDS_RENAME_MIGRATE)))
+	if (!(bias & (MDS_CLOSE_INTENT | MDS_CLOSE_MIGRATE)))
 		return;
 
 	data = req_capsule_client_get(&req->rq_pill, &RMF_CLOSE_DATA);
@@ -507,13 +507,20 @@ void mdc_rename_pack(struct ptlrpc_request *req, struct md_op_data *op_data,
 	if (new)
 		mdc_pack_name(req, &RMF_SYMTGT, new, newlen);
 
-	if (op_data->op_cli_flags & CLI_MIGRATE &&
-	    op_data->op_bias & MDS_RENAME_MIGRATE) {
-		struct mdt_ioepoch *epoch;
+	if (op_data->op_cli_flags & CLI_MIGRATE) {
+		char *tmp;
 
-		mdc_close_intent_pack(req, op_data);
-		epoch = req_capsule_client_get(&req->rq_pill, &RMF_MDT_EPOCH);
-		mdc_ioepoch_pack(epoch, op_data);
+		if (op_data->op_bias & MDS_CLOSE_MIGRATE) {
+			struct mdt_ioepoch *epoch;
+
+			mdc_close_intent_pack(req, op_data);
+			epoch = req_capsule_client_get(&req->rq_pill,
+							&RMF_MDT_EPOCH);
+			mdc_ioepoch_pack(epoch, op_data);
+		}
+
+		tmp = req_capsule_client_get(&req->rq_pill, &RMF_EADATA);
+		memcpy(tmp, op_data->op_data, op_data->op_data_size);
 	}
 }
 
diff --git a/fs/lustre/mdc/mdc_reint.c b/fs/lustre/mdc/mdc_reint.c
index d326962..030c247 100644
--- a/fs/lustre/mdc/mdc_reint.c
+++ b/fs/lustre/mdc/mdc_reint.c
@@ -390,6 +390,9 @@ int mdc_rename(struct obd_export *exp, struct md_op_data *op_data,
 	req_capsule_set_size(&req->rq_pill, &RMF_NAME, RCL_CLIENT, oldlen + 1);
 	req_capsule_set_size(&req->rq_pill, &RMF_SYMTGT, RCL_CLIENT,
 			     newlen + 1);
+	if (op_data->op_cli_flags & CLI_MIGRATE)
+		req_capsule_set_size(&req->rq_pill, &RMF_EADATA, RCL_CLIENT,
+				     op_data->op_data_size);
 
 	rc = mdc_prep_elc_req(exp, req, MDS_REINT, &cancels, count);
 	if (rc) {
@@ -397,23 +400,6 @@ int mdc_rename(struct obd_export *exp, struct md_op_data *op_data,
 		return rc;
 	}
 
-	if (op_data->op_cli_flags & CLI_MIGRATE && op_data->op_data) {
-		struct md_open_data *mod = op_data->op_data;
-
-		LASSERTF(mod->mod_open_req &&
-			 mod->mod_open_req->rq_type != LI_POISON,
-			 "POISONED open %p!\n", mod->mod_open_req);
-
-		DEBUG_REQ(D_HA, mod->mod_open_req, "matched open");
-		/*
-		 * We no longer want to preserve this open for replay even
-		 * though the open was committed. b=3632, b=3633
-		 */
-		spin_lock(&mod->mod_open_req->rq_lock);
-		mod->mod_open_req->rq_replay = 0;
-		spin_unlock(&mod->mod_open_req->rq_lock);
-	}
-
 	if (exp_connect_cancelset(exp) && req)
 		ldlm_cli_cancel_list(&cancels, count, req, 0);
 
diff --git a/fs/lustre/ptlrpc/layout.c b/fs/lustre/ptlrpc/layout.c
index c11b1b0..ae573a2 100644
--- a/fs/lustre/ptlrpc/layout.c
+++ b/fs/lustre/ptlrpc/layout.c
@@ -263,7 +263,8 @@
 	&RMF_SYMTGT,
 	&RMF_DLM_REQ,
 	&RMF_MDT_EPOCH,
-	&RMF_CLOSE_DATA
+	&RMF_CLOSE_DATA,
+	&RMF_EADATA
 };
 
 static const struct req_msg_field *mds_last_unlink_server[] = {
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 9f8d65d..75326c0 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -1693,7 +1693,7 @@ enum mds_op_bias {
 	MDS_CREATE_VOLATILE	= 1 << 10,
 	MDS_OWNEROVERRIDE	= 1 << 11,
 	MDS_HSM_RELEASE		= 1 << 12,
-	MDS_RENAME_MIGRATE	= 1 << 13,
+	MDS_CLOSE_MIGRATE	= 1 << 13,
 	MDS_CLOSE_LAYOUT_SWAP	= 1 << 14,
 	MDS_CLOSE_LAYOUT_MERGE	= 1 << 15,
 	MDS_CLOSE_RESYNC_DONE	= 1 << 16,
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 8fd5b26..421c977 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -632,8 +632,12 @@ struct lmv_user_md_v1 {
 
 static inline int lmv_user_md_size(int stripes, int lmm_magic)
 {
-	return sizeof(struct lmv_user_md) +
-		      stripes * sizeof(struct lmv_user_mds_data);
+	int size = sizeof(struct lmv_user_md);
+
+	if (lmm_magic == LMV_USER_MAGIC_SPECIFIC)
+		size += stripes * sizeof(struct lmv_user_mds_data);
+
+	return size;
 }
 
 struct ll_recreate_obj {
-- 
1.8.3.1


* [lustre-devel] [PATCH 073/622] lustre: hsm: add OBD_CONNECT2_ARCHIVE_ID_ARRAY to pass archive_id lists in array
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (71 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 072/622] lustre: migrate: pack lmv ea in migrate rpc James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 074/622] lustre: llite: handle zero length xattr values correctly James Simmons
                   ` (549 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Teddy Zheng <teddy@ddn.com>

Clients that register with the MDS with OBD_CONNECT2_ARCHIVE_ID_ARRAY
set will use an array to pass archive IDs, while clients without it
still use a bitmap. This flag allows old clients to connect to new
MDSs.
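
The practical difference between the two encodings can be shown with a
short illustration (this is not Lustre code): a 32-bit bitmap can only
represent archive IDs 1..32, while the array form has no such cap.

```c
#include <stdbool.h>

/* Bitmap encoding: archive ID n is bit (n - 1); anything above the
 * bit width simply cannot be represented. */
static bool bitmap_has_archive(unsigned int bitmap, unsigned int id)
{
	if (id < 1 || id > 32)
		return false;	/* out of range for the bitmap encoding */
	return (bitmap & (1U << (id - 1))) != 0;
}

/* Array encoding: IDs are stored as plain u32 values, so arbitrarily
 * large archive IDs are fine. */
static bool array_has_archive(const unsigned int *ids, int count,
			      unsigned int id)
{
	int i;

	for (i = 0; i < count; i++)
		if (ids[i] == id)
			return true;
	return false;
}
```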

WC-bug-id: https://jira.whamcloud.com/browse/LU-10114
Lustre-commit: 1c7e7d1243f7 ("LU-10114 hsm: add OBD_CONNECT2_ARCHIVE_ID_ARRAY to pass archive_id lists in array")
Signed-off-by: Teddy Zheng <teddy@ddn.com>
Signed-off-by: Li Xi <lixi@ddn.com>
Reviewed-on: https://review.whamcloud.com/32806
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/lprocfs_status.c    | 1 +
 fs/lustre/ptlrpc/wiretest.c            | 2 ++
 include/uapi/linux/lustre/lustre_idl.h | 1 +
 3 files changed, 4 insertions(+)

diff --git a/fs/lustre/obdclass/lprocfs_status.c b/fs/lustre/obdclass/lprocfs_status.c
index 385359f..fbd46df 100644
--- a/fs/lustre/obdclass/lprocfs_status.c
+++ b/fs/lustre/obdclass/lprocfs_status.c
@@ -119,6 +119,7 @@
 	"flr",		/* 0x20 */
 	"wbc",		/* 0x40 */
 	"lock_convert",	/* 0x80 */
+	"archive_id_array",	/* 0x100 */
 	NULL
 };
 
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index e22f8f8..1afbb41 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1141,6 +1141,8 @@ void lustre_assert_wire_constants(void)
 		 OBD_CONNECT2_WBC_INTENTS);
 	LASSERTF(OBD_CONNECT2_LOCK_CONVERT == 0x80ULL, "found 0x%.16llxULL\n",
 		 OBD_CONNECT2_LOCK_CONVERT);
+	LASSERTF(OBD_CONNECT2_ARCHIVE_ID_ARRAY == 0x100ULL, "found 0x%.16llxULL\n",
+		 OBD_CONNECT2_ARCHIVE_ID_ARRAY);
 	LASSERTF(OBD_CKSUM_CRC32 == 0x00000001UL, "found 0x%.8xUL\n",
 		 (unsigned int)OBD_CKSUM_CRC32);
 	LASSERTF(OBD_CKSUM_ADLER == 0x00000002UL, "found 0x%.8xUL\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 75326c0..dc9872cf3 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -800,6 +800,7 @@ struct ptlrpc_body_v2 {
 						 * locks
 						 */
 #define OBD_CONNECT2_LOCK_CONVERT	0x80ULL /* IBITS lock convert support */
+#define OBD_CONNECT2_ARCHIVE_ID_ARRAY  0x100ULL	/* store HSM archive_id in array */
 
 /* XXX README XXX:
  * Please DO NOT add flag values here before first ensuring that this same
-- 
1.8.3.1


* [lustre-devel] [PATCH 074/622] lustre: llite: handle zero length xattr values correctly
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (72 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 073/622] lustre: hsm: add OBD_CONNECT2_ARCHIVE_ID_ARRAY to pass archive_id lists in array James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 075/622] lnet: refactor lnet_select_pathway() James Simmons
                   ` (548 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: "John L. Hammond" <jhammond@whamcloud.com>

In mdt_getxattr(), set OBD_MD_FLXATTR in mbo_valid of the reply's MDT
body so that the client can distinguish between nonexistent extended
attributes and zero length values. In ll_xattr_list() and
ll_getxattr_common() test for OBD_MD_FLXATTR and return 0 rather than
-ENODATA in the appropriate cases. Add sanity test_102t() to test that
zero length values are handled correctly.
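
The client-side decision for a zero-length reply body can be condensed
into a small table; the sketch below uses placeholder flag values, not
the real OBD_MD_* constants, and writes -ENODATA out as -61:

```c
/* Decision for body->mbo_eadatasize == 0, mirroring the logic this
 * patch adds to ll_xattr_list(). Flag values are placeholders. */
#define MD_FLXATTR_SK	0x1ULL	/* stands in for OBD_MD_FLXATTR */
#define MD_FLXATTRLS_SK	0x2ULL	/* stands in for OBD_MD_FLXATTRLS */

static int empty_xattr_rc(unsigned long long body_valid,
			  unsigned long long requested)
{
	if (body_valid & MD_FLXATTR_SK)
		return 0;	/* new MDT: a zero-length value really exists */
	if (requested == MD_FLXATTR_SK)
		return -61;	/* old MDT + getxattr(): keep -ENODATA */
	return 0;		/* listxattr(): an empty list is valid */
}
```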

Lustre-commit: 1e4164a1254d ("LU-11109 mdt: handle zero length xattr values correctly")
Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32755
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/xattr.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/llite/xattr.c b/fs/lustre/llite/xattr.c
index f25ae59..636334e 100644
--- a/fs/lustre/llite/xattr.c
+++ b/fs/lustre/llite/xattr.c
@@ -363,6 +363,11 @@ int ll_xattr_list(struct inode *inode, const char *name, int type, void *buffer,
 
 		/* only detect the xattr size */
 		if (size == 0) {
+			/* LU-11109: Older MDTs do not distinguish
+			 * between nonexistent xattrs and zero length
+			 * values in this case. Newer MDTs will return
+			 * -ENODATA or set OBD_MD_FLXATTR.
+			 */
 			rc = body->mbo_eadatasize;
 			goto out;
 		}
@@ -375,7 +380,22 @@ int ll_xattr_list(struct inode *inode, const char *name, int type, void *buffer,
 		}
 
 		if (body->mbo_eadatasize == 0) {
-			rc = -ENODATA;
+			/* LU-11109: Newer MDTs set OBD_MD_FLXATTR on
+			 * success so that we can distinguish between
+			 * zero length value and nonexistent xattr.
+			 *
+			 * If OBD_MD_FLXATTR is not set then we keep
+			 * the old behavior and return -ENODATA for
+			 * getxattr() when mbo_eadatasize is 0. But
+			 * -ENODATA only makes sense for getxattr()
+			 * and not for listxattr().
+			 */
+			if (body->mbo_valid & OBD_MD_FLXATTR)
+				rc = 0;
+			else if (valid == OBD_MD_FLXATTR)
+				rc = -ENODATA;
+			else
+				rc = 0;
 			goto out;
 		}
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 075/622] lnet: refactor lnet_select_pathway()
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

lnet_select_pathway() is a complex monolithic function which handles
many send cases. Break lnet_select_pathway() down into multiple
functions, each of which handles a different send case. This will
make it easier to add handling for the different health cases in
future patches.
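The case split can be illustrated with the flag bits this patch adds to lib-move.c; the classifier below is a hypothetical sketch of the idea, not code from the patch:

```c
/* Flag values matching those the patch adds to net/lnet/lnet/lib-move.c */
#define SRC_SPEC	0x0001
#define SRC_ANY		0x0002
#define LOCAL_DST	0x0004
#define REMOTE_DST	0x0008
#define MR_DST		0x0010
#define NMR_DST		0x0020

/* Hypothetical classifier: every send maps to exactly one composite
 * case, and lnet_select_pathway() dispatches to a dedicated handler
 * for that case instead of branching inline. */
static unsigned int classify_send(int src_specified, int dst_is_local,
				  int peer_is_mr)
{
	return (src_specified ? SRC_SPEC : SRC_ANY) |
	       (dst_is_local ? LOCAL_DST : REMOTE_DST) |
	       (peer_is_mr ? MR_DST : NMR_DST);
}
```

With this encoding, a source-specified send to a local multi-rail peer is (SRC_SPEC | LOCAL_DST | MR_DST), which is the SRC_SPEC_LOCAL_MR_DST case the patch dispatches on.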

WC-bug-id: https://jira.whamcloud.com/browse/LU-9120
Lustre-commit: 4e48761a5719 ("LU-9120 lnet: refactor lnet_select_pathway()")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32760
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h |   13 +
 net/lnet/lnet/lib-move.c      | 1398 ++++++++++++++++++++++++++---------------
 2 files changed, 911 insertions(+), 500 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index 22c6152..20b4660 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -827,6 +827,19 @@ int lnet_get_peer_ni_info(u32 peer_index, u64 *nid,
 	return false;
 }
 
+static inline struct lnet_peer_net *
+lnet_find_peer_net_locked(struct lnet_peer *peer, u32 net_id)
+{
+	struct lnet_peer_net *peer_net;
+
+	list_for_each_entry(peer_net, &peer->lp_peer_nets, lpn_peer_nets) {
+		if (peer_net->lpn_net_id == net_id)
+			return peer_net;
+	}
+
+	return NULL;
+}
+
 static inline void
 lnet_peer_set_alive(struct lnet_peer_ni *lp)
 {
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index cab830a..10aa753 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -45,6 +45,23 @@
 module_param(local_nid_dist_zero, int, 0444);
 MODULE_PARM_DESC(local_nid_dist_zero, "Reserved");
 
+struct lnet_send_data {
+	struct lnet_ni		*sd_best_ni;
+	struct lnet_peer_ni	*sd_best_lpni;
+	struct lnet_peer_ni	*sd_final_dst_lpni;
+	struct lnet_peer	*sd_peer;
+	struct lnet_peer	*sd_gw_peer;
+	struct lnet_peer_ni	*sd_gw_lpni;
+	struct lnet_peer_net	*sd_peer_net;
+	struct lnet_msg		*sd_msg;
+	lnet_nid_t		sd_dst_nid;
+	lnet_nid_t		sd_src_nid;
+	lnet_nid_t		sd_rtr_nid;
+	int			sd_cpt;
+	int			sd_md_cpt;
+	u32			sd_send_case;
+};
+
 static inline struct lnet_comm_count *
 get_stats_counts(struct lnet_element_stats *stats,
 		 enum lnet_stats_type stats_type)
@@ -1188,7 +1205,7 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 }
 
 static struct lnet_peer_ni *
-lnet_find_route_locked(struct lnet_net *net, lnet_nid_t target,
+lnet_find_route_locked(struct lnet_net *net, u32 remote_net,
 		       lnet_nid_t rtr_nid)
 {
 	struct lnet_remotenet *rnet;
@@ -1203,7 +1220,7 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	 * If @rtr_nid is not LNET_NID_ANY, return the gateway with
 	 * rtr_nid nid, otherwise find the best gateway I can use
 	 */
-	rnet = lnet_find_rnet_locked(LNET_NIDNET(target));
+	rnet = lnet_find_rnet_locked(remote_net);
 	if (!rnet)
 		return NULL;
 
@@ -1252,13 +1269,20 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 }
 
 static struct lnet_ni *
-lnet_get_best_ni(struct lnet_net *local_net, struct lnet_ni *cur_ni,
+lnet_get_best_ni(struct lnet_net *local_net, struct lnet_ni *best_ni,
+		 struct lnet_peer *peer, struct lnet_peer_net *peer_net,
 		 int md_cpt)
 {
-	struct lnet_ni *ni = NULL, *best_ni = cur_ni;
+	struct lnet_ni *ni = NULL;
 	unsigned int shortest_distance;
 	int best_credits;
 
+	/* If there is no peer_ni that we can send to on this network,
+	 * then there is no point in looking for a new best_ni here.
+	 */
+	if (!lnet_get_next_peer_ni_locked(peer, peer_net, NULL))
+		return best_ni;
+
 	if (!best_ni) {
 		shortest_distance = UINT_MAX;
 		best_credits = INT_MIN;
@@ -1286,6 +1310,13 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 					    md_cpt,
 					    ni->ni_dev_cpt);
 
+		CDEBUG(D_NET,
+		       "compare ni %s [c:%d, d:%d, s:%d] with best_ni %s [c:%d, d:%d, s:%d]\n",
+		       libcfs_nid2str(ni->ni_nid), ni_credits, distance,
+		       ni->ni_seq, (best_ni) ? libcfs_nid2str(best_ni->ni_nid)
+			: "not selected", best_credits, shortest_distance,
+			(best_ni) ? best_ni->ni_seq : 0);
+
 		/*
 		 * All distances smaller than the NUMA range
 		 * are treated equally.
@@ -1311,6 +1342,9 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 		best_credits = ni_credits;
 	}
 
+	CDEBUG(D_NET, "selected best_ni %s\n",
+	       (best_ni) ? libcfs_nid2str(best_ni->ni_nid) : "no selection");
+
 	return best_ni;
 }
 
@@ -1335,421 +1369,140 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	return false;
 }
 
+#define SRC_SPEC	0x0001
+#define SRC_ANY		0x0002
+#define LOCAL_DST	0x0004
+#define REMOTE_DST	0x0008
+#define MR_DST		0x0010
+#define NMR_DST		0x0020
+#define SND_RESP	0x0040
+
+/* The following two defines are used for return codes */
+#define REPEAT_SEND	0x1000
+#define PASS_THROUGH	0x2000
+
+/* The different cases lnet_select pathway needs to handle */
+#define SRC_SPEC_LOCAL_MR_DST	(SRC_SPEC | LOCAL_DST | MR_DST)
+#define SRC_SPEC_ROUTER_MR_DST	(SRC_SPEC | REMOTE_DST | MR_DST)
+#define SRC_SPEC_LOCAL_NMR_DST	(SRC_SPEC | LOCAL_DST | NMR_DST)
+#define SRC_SPEC_ROUTER_NMR_DST	(SRC_SPEC | REMOTE_DST | NMR_DST)
+#define SRC_ANY_LOCAL_MR_DST	(SRC_ANY | LOCAL_DST | MR_DST)
+#define SRC_ANY_ROUTER_MR_DST	(SRC_ANY | REMOTE_DST | MR_DST)
+#define SRC_ANY_LOCAL_NMR_DST	(SRC_ANY | LOCAL_DST | NMR_DST)
+#define SRC_ANY_ROUTER_NMR_DST	(SRC_ANY | REMOTE_DST | NMR_DST)
+
 static int
-lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
-		    struct lnet_msg *msg, lnet_nid_t rtr_nid)
+lnet_handle_send(struct lnet_send_data *sd)
 {
-	struct lnet_ni *best_ni = NULL;
-	struct lnet_peer_ni *best_lpni = NULL;
-	struct lnet_peer_ni *best_gw = NULL;
-	struct lnet_peer_ni *lpni;
-	struct lnet_peer_ni *final_dst;
-	struct lnet_peer *peer;
-	struct lnet_peer_net *peer_net;
-	struct lnet_net *local_net;
-	int cpt, cpt2, rc;
-	bool routing;
-	bool routing2;
-	bool ni_is_pref;
-	bool preferred;
-	bool local_found;
-	int best_lpni_credits;
-	int md_cpt;
-
-	/*
-	 * get an initial CPT to use for locking. The idea here is not to
-	 * serialize the calls to select_pathway, so that as many
-	 * operations can run concurrently as possible. To do that we use
-	 * the CPT where this call is being executed. Later on when we
-	 * determine the CPT to use in lnet_message_commit, we switch the
-	 * lock and check if there was any configuration change.  If none,
-	 * then we proceed, if there is, then we restart the operation.
-	 */
-	cpt = lnet_net_lock_current();
-
-	md_cpt = lnet_cpt_of_md(msg->msg_md, msg->msg_offset);
-	if (md_cpt == CFS_CPT_ANY)
-		md_cpt = cpt;
-
-again:
-	best_ni = NULL;
-	best_lpni = NULL;
-	best_gw = NULL;
-	final_dst = NULL;
-	local_net = NULL;
-	routing = false;
-	routing2 = false;
-	local_found = false;
-
-	/*
-	 * lnet_nid2peerni_locked() is the path that will find an
-	 * existing peer_ni, or create one and mark it as having been
-	 * created due to network traffic.
-	 */
-	lpni = lnet_nid2peerni_locked(dst_nid, LNET_NID_ANY, cpt);
-	if (IS_ERR(lpni)) {
-		lnet_net_unlock(cpt);
-		return PTR_ERR(lpni);
-	}
+	struct lnet_ni *best_ni = sd->sd_best_ni;
+	struct lnet_peer_ni *best_lpni = sd->sd_best_lpni;
+	struct lnet_peer_ni *final_dst_lpni = sd->sd_final_dst_lpni;
+	struct lnet_msg *msg = sd->sd_msg;
+	int cpt2;
+	u32 send_case = sd->sd_send_case;
+	int rc;
+	u32 routing = send_case & REMOTE_DST;
 
-	/* If we're being asked to send to the loopback interface, there
-	 * is no need to go through any selection. We can just shortcut
-	 * the entire process and send over lolnd
+	/* Increment sequence number of the selected peer so that we
+	 * pick the next one in Round Robin.
 	 */
-	if (LNET_NETTYP(LNET_NIDNET(dst_nid)) == LOLND) {
-		lnet_peer_ni_decref_locked(lpni);
-		best_ni = the_lnet.ln_loni;
-		goto send;
-	}
+	best_lpni->lpni_seq++;
 
-	/*
-	 * Now that we have a peer_ni, check if we want to discover
-	 * the peer. Traffic to the LNET_RESERVED_PORTAL should not
-	 * trigger discovery.
+	/* grab a reference on the peer_ni so it sticks around even if
+	 * we need to drop and relock the lnet_net_lock below.
 	 */
-	peer = lpni->lpni_peer_net->lpn_peer;
-	if (lnet_msg_discovery(msg) && !lnet_peer_is_uptodate(peer)) {
-		rc = lnet_discover_peer_locked(lpni, cpt, false);
-		if (rc) {
-			lnet_peer_ni_decref_locked(lpni);
-			lnet_net_unlock(cpt);
-			return rc;
-		}
-		/* The peer may have changed. */
-		peer = lpni->lpni_peer_net->lpn_peer;
-		/* queue message and return */
-		msg->msg_src_nid_param = src_nid;
-		msg->msg_rtr_nid_param = rtr_nid;
-		msg->msg_sending = 0;
-		list_add_tail(&msg->msg_list, &peer->lp_dc_pendq);
-		CDEBUG(D_NET, "%s pending discovery\n",
-		       libcfs_nid2str(peer->lp_primary_nid));
-		lnet_peer_ni_decref_locked(lpni);
-		lnet_net_unlock(cpt);
-
-		return LNET_DC_WAIT;
-	}
-	lnet_peer_ni_decref_locked(lpni);
-
-	/* If peer is not healthy then can not send anything to it */
-	if (!lnet_is_peer_healthy_locked(peer)) {
-		lnet_net_unlock(cpt);
-		return -EHOSTUNREACH;
-	}
+	lnet_peer_ni_addref_locked(best_lpni);
 
-	/*
-	 * STEP 1: first jab at determining best_ni
-	 * if src_nid is explicitly specified, then best_ni is already
-	 * pre-determiend for us. Otherwise we need to select the best
-	 * one to use later on
+	/* Use lnet_cpt_of_nid() to determine the CPT used to commit the
+	 * message. This ensures that we get a CPT that is correct for
+	 * the NI when the NI has been restricted to a subset of all CPTs.
+	 * If the selected CPT differs from the one currently locked, we
+	 * must unlock and relock the lnet_net_lock(), and then check whether
+	 * the configuration has changed. We don't have a hold on the best_ni
+	 * yet, and it may have vanished.
 	 */
-	if (src_nid != LNET_NID_ANY) {
-		best_ni = lnet_nid2ni_locked(src_nid, cpt);
-		if (!best_ni) {
-			lnet_net_unlock(cpt);
-			LCONSOLE_WARN("Can't send to %s: src %s is not a local nid\n",
-				      libcfs_nid2str(dst_nid),
-				      libcfs_nid2str(src_nid));
-			return -EINVAL;
-		}
-	}
+	cpt2 = lnet_cpt_of_nid_locked(best_lpni->lpni_nid, best_ni);
+	if (sd->sd_cpt != cpt2) {
+		u32 seq = lnet_get_dlc_seq_locked();
 
-	if (msg->msg_type == LNET_MSG_REPLY ||
-	    msg->msg_type == LNET_MSG_ACK ||
-	    !lnet_peer_is_multi_rail(peer) ||
-	    best_ni) {
-		/*
-		 * for replies we want to respond on the same peer_ni we
-		 * received the message on if possible. If not, then pick
-		 * a peer_ni to send to
-		 *
-		 * if the peer is non-multi-rail then you want to send to
-		 * the dst_nid provided as well.
-		 *
-		 * If the best_ni has already been determined, IE the
-		 * src_nid has been specified, then use the
-		 * destination_nid provided as well, since we're
-		 * continuing a series of related messages for the same
-		 * RPC.
-		 *
-		 * It is expected to find the lpni using dst_nid, since we
-		 * created it earlier.
-		 */
-		best_lpni = lnet_find_peer_ni_locked(dst_nid);
-		if (best_lpni)
+		lnet_net_unlock(sd->sd_cpt);
+		sd->sd_cpt = cpt2;
+		lnet_net_lock(sd->sd_cpt);
+		if (seq != lnet_get_dlc_seq_locked()) {
 			lnet_peer_ni_decref_locked(best_lpni);
-
-		if (best_lpni && !lnet_get_net_locked(LNET_NIDNET(dst_nid))) {
-			/*
-			 * this lpni is not on a local network so we need
-			 * to route this reply.
-			 */
-			best_gw = lnet_find_route_locked(NULL,
-							 best_lpni->lpni_nid,
-							 rtr_nid);
-			if (best_gw) {
-				/*
-				 * RULE: Each node considers only the next-hop
-				 *
-				 * We're going to route the message,
-				 * so change the peer to the router.
-				 */
-				LASSERT(best_gw->lpni_peer_net);
-				LASSERT(best_gw->lpni_peer_net->lpn_peer);
-				peer = best_gw->lpni_peer_net->lpn_peer;
-
-				/*
-				 * if the router is not multi-rail
-				 * then use the best_gw found to send
-				 * the message to
-				 */
-				if (!lnet_peer_is_multi_rail(peer))
-					best_lpni = best_gw;
-				else
-					best_lpni = NULL;
-
-				routing = true;
-			} else {
-				best_lpni = NULL;
-			}
-		} else if (!best_lpni) {
-			lnet_net_unlock(cpt);
-			CERROR("unable to send msg_type %d to originating %s. Destination NID not in DB\n",
-			       msg->msg_type, libcfs_nid2str(dst_nid));
-			return -EINVAL;
-		}
-	}
-
-	/*
-	 * We must use a consistent source address when sending to a
-	 * non-MR peer. However, a non-MR peer can have multiple NIDs
-	 * on multiple networks, and we may even need to talk to this
-	 * peer on multiple networks -- certain types of
-	 * load-balancing configuration do this.
-	 *
-	 * So we need to pick the NI the peer prefers for this
-	 * particular network.
-	 */
-	if (!lnet_peer_is_multi_rail(peer)) {
-		if (!best_lpni) {
-			lnet_net_unlock(cpt);
-			CERROR("no route to %s\n",
-			       libcfs_nid2str(dst_nid));
-			return -EHOSTUNREACH;
-		}
-
-		/* best ni is already set if src_nid was provided */
-		if (!best_ni) {
-			/* Get the target peer_ni */
-			peer_net = lnet_peer_get_net_locked(
-				peer, LNET_NIDNET(best_lpni->lpni_nid));
-			list_for_each_entry(lpni, &peer_net->lpn_peer_nis,
-					    lpni_peer_nis) {
-				if (lpni->lpni_pref_nnids == 0)
-					continue;
-				LASSERT(lpni->lpni_pref_nnids == 1);
-				best_ni = lnet_nid2ni_locked(
-					lpni->lpni_pref.nid, cpt);
-				break;
-			}
+			return REPEAT_SEND;
 		}
-		/* if best_ni is still not set just pick one */
-		if (!best_ni) {
-			best_ni = lnet_net2ni_locked(
-				best_lpni->lpni_net->net_id, cpt);
-			/* If there is no best_ni we don't have a route */
-			if (!best_ni) {
-				CERROR("no path to %s from net %s\n",
-				       libcfs_nid2str(best_lpni->lpni_nid),
-				       libcfs_net2str(best_lpni->lpni_net->net_id));
-				lnet_net_unlock(cpt);
-				return -EHOSTUNREACH;
-			}
-			lpni = list_first_entry(&peer_net->lpn_peer_nis,
-						struct lnet_peer_ni,
-					  lpni_peer_nis);
-		}
-		/* Set preferred NI if necessary. */
-		if (lpni->lpni_pref_nnids == 0)
-			lnet_peer_ni_set_non_mr_pref_nid(lpni, best_ni->ni_nid);
 	}
 
-	/*
-	 * if we already found a best_ni because src_nid is specified and
-	 * best_lpni because we are replying to a message then just send
-	 * the message
+	/* store the best_lpni in the message right away to avoid having
+	 * to do the same operation under different conditions
 	 */
-	if (best_ni && best_lpni)
-		goto send;
+	msg->msg_txpeer = best_lpni;
+	msg->msg_txni = best_ni;
 
-	/*
-	 * If we already found a best_ni because src_nid is specified then
-	 * pick the peer then send the message
+	/* grab a reference for the best_ni since now it's in use in this
+	 * send. The reference will be dropped in lnet_finalize()
 	 */
-	if (best_ni)
-		goto pick_peer;
+	lnet_ni_addref_locked(msg->msg_txni, sd->sd_cpt);
 
-	/*
-	 * pick the best_ni by going through all the possible networks of
-	 * that peer and see which local NI is best suited to talk to that
-	 * peer.
-	 *
-	 * Locally connected networks will always be preferred over
-	 * a routed network. If there are only routed paths to the peer,
-	 * then the best route is chosen. If all routes are equal then
-	 * they are used in round robin.
+	/* Always set the target.nid to the best peer picked. Either the
+	 * NID will be one of the peer NIDs selected, or the same NID as
+	 * what was originally set in the target or it will be the NID of
+	 * a router if this message should be routed
 	 */
-	list_for_each_entry(peer_net, &peer->lp_peer_nets, lpn_peer_nets) {
-		if (!lnet_is_peer_net_healthy_locked(peer_net))
-			continue;
-
-		local_net = lnet_get_net_locked(peer_net->lpn_net_id);
-		if (!local_net && !routing && !local_found) {
-			struct lnet_peer_ni *net_gw;
-
-			lpni = list_first_entry(&peer_net->lpn_peer_nis,
-						struct lnet_peer_ni,
-						lpni_peer_nis);
-
-			net_gw = lnet_find_route_locked(NULL,
-							lpni->lpni_nid,
-							rtr_nid);
-			if (!net_gw)
-				continue;
-
-			if (best_gw) {
-				/*
-				 * lnet_find_route_locked() call
-				 * will return the best_Gw on the
-				 * lpni->lpni_nid network.
-				 * However, best_gw and net_gw can
-				 * be on different networks.
-				 * Therefore need to compare them
-				 * to pick the better of either.
-				 */
-				if (lnet_compare_peers(best_gw, net_gw) > 0)
-					continue;
-				if (best_gw->lpni_gw_seq <= net_gw->lpni_gw_seq)
-					continue;
-			}
-			best_gw = net_gw;
-			final_dst = lpni;
-
-			routing2 = true;
-		} else {
-			best_gw = NULL;
-			final_dst = NULL;
-			routing2 = false;
-			local_found = true;
-		}
-
-		/*
-		 * a gw on this network is found, but there could be
-		 * other better gateways on other networks. So don't pick
-		 * the best_ni until we determine the best_gw.
-		 */
-		if (best_gw)
-			continue;
-
-		/* if no local_net found continue */
-		if (!local_net)
-			continue;
-
-		/*
-		 * Iterate through the NIs in this local Net and select
-		 * the NI to send from. The selection is determined by
-		 * these 3 criterion in the following priority:
-		 *	1. NUMA
-		 *	2. NI available credits
-		 *	3. Round Robin
-		 */
-		best_ni = lnet_get_best_ni(local_net, best_ni, md_cpt);
-	}
-
-	if (!best_ni && !best_gw) {
-		lnet_net_unlock(cpt);
-		LCONSOLE_WARN("No local ni found to send from to %s\n",
-			      libcfs_nid2str(dst_nid));
-		return -EINVAL;
-	}
-
-	if (!best_ni) {
-		best_ni = lnet_get_best_ni(best_gw->lpni_net, best_ni, md_cpt);
-		LASSERT(best_gw && best_ni);
-
-		/*
-		 * We're going to route the message, so change the peer to
-		 * the router.
-		 */
-		LASSERT(best_gw->lpni_peer_net);
-		LASSERT(best_gw->lpni_peer_net->lpn_peer);
-		best_gw->lpni_gw_seq++;
-		peer = best_gw->lpni_peer_net->lpn_peer;
-	}
+	msg->msg_target.nid = msg->msg_txpeer->lpni_nid;
 
-	/*
-	 * Now that we selected the NI to use increment its sequence
-	 * number so the Round Robin algorithm will detect that it has
-	 * been used and pick the next NI.
+	/* lnet_msg_commit assigns the correct cpt to the message, which
+	 * is used to decrement the correct refcount on the ni when it's
+	 * time to return the credits
 	 */
-	best_ni->ni_seq++;
+	lnet_msg_commit(msg, sd->sd_cpt);
 
-pick_peer:
-	/*
-	 * At this point the best_ni is on a local network on which
-	 * the peer has a peer_ni as well
-	 */
-	peer_net = lnet_peer_get_net_locked(peer,
-					    best_ni->ni_net->net_id);
-	/*
-	 * peer_net is not available or the src_nid is explicitly defined
-	 * and the peer_net for that src_nid is unhealthy. find a route to
-	 * the destination nid.
+	/* If we are routing the message then we keep the src_nid that was
+	 * set by the originator. If we are not routing then we are the
+	 * originator and set it here.
 	 */
-	if (!peer_net ||
-	    (src_nid != LNET_NID_ANY &&
-	     !lnet_is_peer_net_healthy_locked(peer_net))) {
-		best_gw = lnet_find_route_locked(best_ni->ni_net,
-						 dst_nid,
-						 rtr_nid);
-		/*
-		 * if no route is found for that network then
-		 * move onto the next peer_ni in the peer
-		 */
-		if (!best_gw) {
-			LCONSOLE_WARN("No route to peer from %s\n",
-				      libcfs_nid2str(best_ni->ni_nid));
-			lnet_net_unlock(cpt);
-			return -EHOSTUNREACH;
-		}
-
-		CDEBUG(D_NET, "Best route to %s via %s for %s %d\n",
-			libcfs_nid2str(dst_nid),
-			libcfs_nid2str(best_gw->lpni_nid),
-			lnet_msgtyp2str(msg->msg_type), msg->msg_len);
+	if (!msg->msg_routing)
+		msg->msg_hdr.src_nid = cpu_to_le64(msg->msg_txni->ni_nid);
 
-		routing2 = true;
-		/*
-		 * RULE: Each node considers only the next-hop
+	if (routing) {
+		msg->msg_target_is_router = 1;
+		msg->msg_target.pid = LNET_PID_LUSTRE;
+		/* since we're routing we want to ensure that the
+		 * msg_hdr.dest_nid is set to the final destination. When
+		 * the router receives this message it knows how to route
+		 * it.
 		 *
-		 * We're going to route the message, so change the peer to
-		 * the router.
+		 * final_dst_lpni is set at the beginning of the
+		 * lnet_select_pathway() function and is never changed.
+		 * It's safe to use it here.
 		 */
-		LASSERT(best_gw->lpni_peer_net);
-		LASSERT(best_gw->lpni_peer_net->lpn_peer);
-		peer = best_gw->lpni_peer_net->lpn_peer;
-	} else if (!lnet_is_peer_net_healthy_locked(peer_net)) {
-		/*
-		 * this peer_net is unhealthy but we still have an opportunity
-		 * to find another peer_net that we can use
+		msg->msg_hdr.dest_nid = cpu_to_le64(final_dst_lpni->lpni_nid);
+	} else {
+		/* if we're not routing set the dest_nid to the best peer
+		 * ni NID that we picked earlier in the algorithm.
 		 */
-		u32 net_id = peer_net->lpn_net_id;
-
-		LCONSOLE_WARN("peer net %s unhealthy\n",
-			      libcfs_net2str(net_id));
-		goto again;
+		msg->msg_hdr.dest_nid = cpu_to_le64(msg->msg_txpeer->lpni_nid);
 	}
 
+	rc = lnet_post_send_locked(msg, 0);
+	if (!rc)
+		CDEBUG(D_NET, "TRACE: %s(%s:%s) -> %s(%s:%s) : %s\n",
+		       libcfs_nid2str(msg->msg_hdr.src_nid),
+		       libcfs_nid2str(msg->msg_txni->ni_nid),
+		       libcfs_nid2str(sd->sd_src_nid),
+		       libcfs_nid2str(msg->msg_hdr.dest_nid),
+		       libcfs_nid2str(sd->sd_dst_nid),
+		       libcfs_nid2str(msg->msg_txpeer->lpni_nid),
+		       lnet_msgtyp2str(msg->msg_type));
+
+	return rc;
+}
+
+static struct lnet_peer_ni *
+lnet_select_peer_ni(struct lnet_send_data *sd, struct lnet_peer *peer,
+		    struct lnet_peer_net *peer_net)
+{
 	/*
 	 * Look at the peer NIs for the destination peer that connect
 	 * to the chosen net. If a peer_ni is preferred when using the
@@ -1758,20 +1511,30 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	 * the available transmit credits are used. If the transmit
 	 * credits are equal, we round-robin over the peer_ni.
 	 */
-	lpni = NULL;
-	best_lpni_credits = INT_MIN;
-	preferred = false;
-	best_lpni = NULL;
+	struct lnet_peer_ni *lpni = NULL;
+	struct lnet_peer_ni *best_lpni = NULL;
+	struct lnet_ni *best_ni = sd->sd_best_ni;
+	lnet_nid_t dst_nid = sd->sd_dst_nid;
+	int best_lpni_credits = INT_MIN;
+	bool preferred = false;
+	bool ni_is_pref;
+
 	while ((lpni = lnet_get_next_peer_ni_locked(peer, peer_net, lpni))) {
-		/*
-		 * if this peer ni is not healthy just skip it, no point in
-		 * examining it further
+		/* if the best_ni we've chosen already has this lpni
+		 * preferred, then let's use it
 		 */
-		if (!lnet_is_peer_ni_healthy_locked(lpni))
-			continue;
 		ni_is_pref = lnet_peer_is_pref_nid_locked(lpni,
 							  best_ni->ni_nid);
 
+		CDEBUG(D_NET, "%s ni_is_pref = %d\n",
+		       libcfs_nid2str(best_ni->ni_nid), ni_is_pref);
+
+		if (best_lpni)
+			CDEBUG(D_NET, "%s c:[%d, %d], s:[%d, %d]\n",
+			       libcfs_nid2str(lpni->lpni_nid),
+			       lpni->lpni_txcredits, best_lpni_credits,
+			       lpni->lpni_seq, best_lpni->lpni_seq);
+
 		/* if this is a preferred peer use it */
 		if (!preferred && ni_is_pref) {
 			preferred = true;
@@ -1810,131 +1573,766 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 		u32 net_id = peer_net ? peer_net->lpn_net_id :
 					LNET_NIDNET(dst_nid);
 
-		lnet_net_unlock(cpt);
-		LCONSOLE_WARN("no peer_ni found on peer net %s\n",
-			      libcfs_net2str(net_id));
-		return -EHOSTUNREACH;
+		CDEBUG(D_NET, "no peer_ni found on peer net %s\n",
+		       libcfs_net2str(net_id));
+		return NULL;
 	}
 
-send:
-	/* Shortcut for loopback. */
-	if (best_ni == the_lnet.ln_loni) {
-		/* No send credit hassles with LOLND */
-		lnet_ni_addref_locked(best_ni, cpt);
-		msg->msg_hdr.dest_nid = cpu_to_le64(best_ni->ni_nid);
-		if (!msg->msg_routing)
-			msg->msg_hdr.src_nid = cpu_to_le64(best_ni->ni_nid);
-		msg->msg_target.nid = best_ni->ni_nid;
-		lnet_msg_commit(msg, cpt);
-		msg->msg_txni = best_ni;
-		lnet_net_unlock(cpt);
-
-		return LNET_CREDIT_OK;
-	}
+	CDEBUG(D_NET, "sd_best_lpni = %s\n",
+	       libcfs_nid2str(best_lpni->lpni_nid));
 
-	routing = routing || routing2;
+	return best_lpni;
+}
 
-	/*
-	 * Increment sequence number of the peer selected so that we
-	 * pick the next one in Round Robin.
-	 */
-	best_lpni->lpni_seq++;
+/* Prerequisite: the best_ni should already be set in the sd
+ */
+static inline struct lnet_peer_ni *
+lnet_find_best_lpni_on_net(struct lnet_send_data *sd, struct lnet_peer *peer,
+			   u32 net_id)
+{
+	struct lnet_peer_net *peer_net;
 
-	/*
-	 * grab a reference on the peer_ni so it sticks around even if
-	 * we need to drop and relock the lnet_net_lock below.
+	/* The gateway is Multi-Rail capable so now we must select the
+	 * proper peer_ni
 	 */
-	lnet_peer_ni_addref_locked(best_lpni);
+	peer_net = lnet_peer_get_net_locked(peer, net_id);
 
-	/*
-	 * Use lnet_cpt_of_nid() to determine the CPT used to commit the
-	 * message. This ensures that we get a CPT that is correct for
-	 * the NI when the NI has been restricted to a subset of all CPTs.
-	 * If the selected CPT differs from the one currently locked, we
-	 * must unlock and relock the lnet_net_lock(), and then check whether
-	 * the configuration has changed. We don't have a hold on the best_ni
-	 * yet, and it may have vanished.
+	if (!peer_net) {
+		CERROR("gateway peer %s has no NI on net %s\n",
+		       libcfs_nid2str(peer->lp_primary_nid),
+		       libcfs_net2str(net_id));
+		return NULL;
+	}
+
+	return lnet_select_peer_ni(sd, peer, peer_net);
+}
+
+static inline void
+lnet_set_non_mr_pref_nid(struct lnet_send_data *sd)
+{
+	if (sd->sd_send_case & NMR_DST &&
+	    sd->sd_msg->msg_type != LNET_MSG_REPLY &&
+	    sd->sd_msg->msg_type != LNET_MSG_ACK &&
+	    sd->sd_best_lpni->lpni_pref_nnids == 0) {
+		CDEBUG(D_NET, "Setting preferred local NID %s on NMR peer %s\n",
+		       libcfs_nid2str(sd->sd_best_ni->ni_nid),
+		       libcfs_nid2str(sd->sd_best_lpni->lpni_nid));
+		lnet_peer_ni_set_non_mr_pref_nid(sd->sd_best_lpni,
+						 sd->sd_best_ni->ni_nid);
+	}
+}
+
+/* Source Specified
+ * Local Destination
+ * non-mr peer
+ *
+ * use the source and destination NIDs as the pathway
+ */
+static int
+lnet_handle_spec_local_nmr_dst(struct lnet_send_data *sd)
+{
+	/* the destination lpni is set before we get here. */
+
+	/* find local NI */
+	sd->sd_best_ni = lnet_nid2ni_locked(sd->sd_src_nid, sd->sd_cpt);
+	if (!sd->sd_best_ni) {
+		CERROR("Can't send to %s: src %s is not a local nid\n",
+		       libcfs_nid2str(sd->sd_dst_nid),
+		       libcfs_nid2str(sd->sd_src_nid));
+		return -EINVAL;
+	}
+
+	/* the preferred NID will only be set for NMR peers
 	 */
-	cpt2 = lnet_cpt_of_nid_locked(best_lpni->lpni_nid, best_ni);
-	if (cpt != cpt2) {
-		u32 seq = lnet_get_dlc_seq_locked();
-		lnet_net_unlock(cpt);
-		cpt = cpt2;
-		lnet_net_lock(cpt);
-		if (seq != lnet_get_dlc_seq_locked()) {
-			lnet_peer_ni_decref_locked(best_lpni);
-			goto again;
-		}
+	lnet_set_non_mr_pref_nid(sd);
+
+	return lnet_handle_send(sd);
+}
+
+/* Source Specified
+ * Local Destination
+ * MR Peer
+ *
+ * Run the selection algorithm on the peer NIs unless we're sending
+ * a response, in this case just send to the destination
+ */
+static int
+lnet_handle_spec_local_mr_dst(struct lnet_send_data *sd)
+{
+	sd->sd_best_ni = lnet_nid2ni_locked(sd->sd_src_nid, sd->sd_cpt);
+	if (!sd->sd_best_ni) {
+		CERROR("Can't send to %s: src %s is not a local nid\n",
+		       libcfs_nid2str(sd->sd_dst_nid),
+		       libcfs_nid2str(sd->sd_src_nid));
+		return -EINVAL;
 	}
 
-	/*
-	 * store the best_lpni in the message right away to avoid having
-	 * to do the same operation under different conditions
+	/* only run the selection algorithm to pick the peer_ni if we're
+	 * sending a GET or a PUT. Responses are sent to the same
+	 * destination NID provided.
 	 */
-	msg->msg_txpeer = best_lpni;
-	msg->msg_txni = best_ni;
+	if (!(sd->sd_send_case & SND_RESP)) {
+		sd->sd_best_lpni =
+		  lnet_find_best_lpni_on_net(sd, sd->sd_peer,
+					     sd->sd_best_ni->ni_net->net_id);
+	}
 
-	/*
-	 * grab a reference for the best_ni since now it's in use in this
-	 * send. the reference will need to be dropped when the message is
-	 * finished in lnet_finalize()
+	if (sd->sd_best_lpni)
+		return lnet_handle_send(sd);
+
+	CERROR("can't send to %s. no NI on %s\n",
+	       libcfs_nid2str(sd->sd_dst_nid),
+	       libcfs_net2str(sd->sd_best_ni->ni_net->net_id));
+
+	return -EHOSTUNREACH;
+}
+
+struct lnet_ni *
+lnet_find_best_ni_on_spec_net(struct lnet_ni *cur_best_ni,
+			      struct lnet_peer *peer,
+			      struct lnet_peer_net *peer_net,
+			      int cpt,
+			      bool incr_seq)
+{
+	struct lnet_net *local_net;
+	struct lnet_ni *best_ni;
+
+	local_net = lnet_get_net_locked(peer_net->lpn_net_id);
+	if (!local_net)
+		return NULL;
+
+	/* Iterate through the NIs in this local Net and select
+	 * the NI to send from. The selection is determined by
+	 * these 3 criterion in the following priority:
+	 *	1. NUMA
+	 *	2. NI available credits
+	 *	3. Round Robin
 	 */
-	lnet_ni_addref_locked(msg->msg_txni, cpt);
+	best_ni = lnet_get_best_ni(local_net, cur_best_ni,
+				   peer, peer_net, cpt);
 
-	/*
-	 * Always set the target.nid to the best peer picked. Either the
-	 * nid will be one of the preconfigured NIDs, or the same NID as
-	 * what was originally set in the target or it will be the NID of
-	 * a router if this message should be routed
+	if (incr_seq && best_ni)
+		best_ni->ni_seq++;
+
+	return best_ni;
+}
+
+static int
+lnet_handle_find_routed_path(struct lnet_send_data *sd,
+			     lnet_nid_t dst_nid,
+			     struct lnet_peer_ni **gw_lpni,
+			     struct lnet_peer **gw_peer)
+{
+	struct lnet_peer_ni *gw;
+	lnet_nid_t src_nid = sd->sd_src_nid;
+
+	gw = lnet_find_route_locked(NULL, LNET_NIDNET(dst_nid),
+				    sd->sd_rtr_nid);
+	if (!gw) {
+		CERROR("no route to %s from %s\n",
+		       libcfs_nid2str(dst_nid), libcfs_nid2str(src_nid));
+		return -EHOSTUNREACH;
+	}
+
+	/* get the peer of the gw_ni */
+	LASSERT(gw->lpni_peer_net);
+	LASSERT(gw->lpni_peer_net->lpn_peer);
+
+	*gw_peer = gw->lpni_peer_net->lpn_peer;
+
+	if (!sd->sd_best_ni)
+		sd->sd_best_ni =
+			lnet_find_best_ni_on_spec_net(NULL, *gw_peer,
+						      gw->lpni_peer_net,
+						      sd->sd_md_cpt,
+						      true);
+
+	if (!sd->sd_best_ni) {
+		CERROR("Internal Error. Expected local ni on %s but none found: %s\n",
+		       libcfs_net2str(gw->lpni_peer_net->lpn_net_id),
+		       libcfs_nid2str(sd->sd_src_nid));
+		return -EFAULT;
+	}
+
+	/* if gw is MR let's find its best peer_ni
 	 */
-	msg->msg_target.nid = msg->msg_txpeer->lpni_nid;
+	if (lnet_peer_is_multi_rail(*gw_peer)) {
+		gw = lnet_find_best_lpni_on_net(sd, *gw_peer,
+						sd->sd_best_ni->ni_net->net_id);
+		/* We've already verified that the gw has an NI on that
+		 * desired net, but we're not finding it. Something is
+		 * wrong.
+		 */
+		if (!gw) {
+			CERROR("Internal Error. Route expected to %s from %s\n",
+			       libcfs_nid2str(dst_nid),
+			       libcfs_nid2str(src_nid));
+			return -EFAULT;
+		}
+	}
 
-	/*
-	 * lnet_msg_commit assigns the correct cpt to the message, which
-	 * is used to decrement the correct refcount on the ni when it's
-	 * time to return the credits
+	*gw_lpni = gw;
+
+	return 0;
+}
+
+/* Handle two cases:
+ *
+ * Case 1:
+ *  Source specified
+ *  Remote destination
+ *  Non-MR destination
+ *
+ * Case 2:
+ *  Source specified
+ *  Remote destination
+ *  MR destination
+ *
+ * The handling of these two cases is similar. Even though the destination
+ * can be MR or non-MR, we'll deal directly with the router.
+ */
+static int
+lnet_handle_spec_router_dst(struct lnet_send_data *sd)
+{
+	int rc;
+	struct lnet_peer_ni *gw_lpni = NULL;
+	struct lnet_peer *gw_peer = NULL;
+
+	/* find local NI */
+	sd->sd_best_ni = lnet_nid2ni_locked(sd->sd_src_nid, sd->sd_cpt);
+	if (!sd->sd_best_ni) {
+		CERROR("Can't send to %s: src %s is not a local nid\n",
+		       libcfs_nid2str(sd->sd_dst_nid),
+		       libcfs_nid2str(sd->sd_src_nid));
+		return -EINVAL;
+	}
+
+	rc = lnet_handle_find_routed_path(sd, sd->sd_dst_nid, &gw_lpni,
+					  &gw_peer);
+	if (rc < 0)
+		return rc;
+
+	if (sd->sd_send_case & NMR_DST)
+		/* since the final destination is non-MR let's set its preferred
+		 * NID before we send
+		 */
+		lnet_set_non_mr_pref_nid(sd);
+
+	/* We're going to send to the gw found so let's set its
+	 * info
 	 */
-	lnet_msg_commit(msg, cpt);
+	sd->sd_peer = gw_peer;
+	sd->sd_best_lpni = gw_lpni;
 
-	/*
-	 * If we are routing the message then we don't need to overwrite
-	 * the src_nid since it would've been set at the origin. Otherwise
-	 * we are the originator so we need to set it.
+	return lnet_handle_send(sd);
+}
+
+struct lnet_ni *
+lnet_find_best_ni_on_local_net(struct lnet_peer *peer, int md_cpt)
+{
+	struct lnet_peer_net *peer_net = NULL;
+	struct lnet_ni *best_ni = NULL;
+
+	/* The peer can have multiple interfaces, some of them can be on
+	 * the local network and others on a routed network. We should
+	 * prefer the local network. However if the local network is not
+	 * available then we need to try the routed network
 	 */
-	if (!msg->msg_routing)
-		msg->msg_hdr.src_nid = cpu_to_le64(msg->msg_txni->ni_nid);
 
-	if (routing) {
-		msg->msg_target_is_router = 1;
-		msg->msg_target.pid = LNET_PID_LUSTRE;
-		/*
-		 * since we're routing we want to ensure that the
-		 * msg_hdr.dest_nid is set to the final destination. When
-		 * the router receives this message it knows how to route
-		 * it.
-		 */
-		msg->msg_hdr.dest_nid =
-			cpu_to_le64(final_dst ? final_dst->lpni_nid : dst_nid);
-	} else {
-		/*
-		 * if we're not routing set the dest_nid to the best peer
-		 * ni that we picked earlier in the algorithm.
+	/* go through all the peer nets and find the best_ni */
+	list_for_each_entry(peer_net, &peer->lp_peer_nets, lpn_peer_nets) {
+		/* The peer's list of nets can contain non-local nets. We
+		 * want to only examine the local ones.
 		 */
-		msg->msg_hdr.dest_nid = cpu_to_le64(msg->msg_txpeer->lpni_nid);
+		if (!lnet_get_net_locked(peer_net->lpn_net_id))
+			continue;
+		best_ni = lnet_find_best_ni_on_spec_net(best_ni, peer,
+							peer_net, md_cpt,
+							false);
 	}
 
-	rc = lnet_post_send_locked(msg, 0);
+	if (best_ni)
+		/* increment sequence number so we can round robin */
+		best_ni->ni_seq++;
+
+	return best_ni;
+}
+
+static struct lnet_ni *
+lnet_find_existing_preferred_best_ni(struct lnet_send_data *sd)
+{
+	struct lnet_ni *best_ni = NULL;
+	struct lnet_peer_net *peer_net;
+	struct lnet_peer *peer = sd->sd_peer;
+	struct lnet_peer_ni *best_lpni = sd->sd_best_lpni;
+	struct lnet_peer_ni *lpni;
+	int cpt = sd->sd_cpt;
+
+	/* We must use a consistent source address when sending to a
+	 * non-MR peer. However, a non-MR peer can have multiple NIDs
+	 * on multiple networks, and we may even need to talk to this
+	 * peer on multiple networks -- certain types of
+	 * load-balancing configuration do this.
+	 *
+	 * So we need to pick the NI the peer prefers for this
+	 * particular network.
+	 */
+
+	/* Get the target peer_ni */
+	peer_net = lnet_peer_get_net_locked(peer,
+					    LNET_NIDNET(best_lpni->lpni_nid));
+	LASSERT(peer_net);
+	list_for_each_entry(lpni, &peer_net->lpn_peer_nis,
+			    lpni_peer_nis) {
+		if (lpni->lpni_pref_nnids == 0)
+			continue;
+		LASSERT(lpni->lpni_pref_nnids == 1);
+		best_ni = lnet_nid2ni_locked(lpni->lpni_pref.nid, cpt);
+		break;
+	}
+
+	return best_ni;
+}
+
+/* Prerequisite: sd->sd_peer and sd->sd_best_lpni should be set */
+static int
+lnet_select_preferred_best_ni(struct lnet_send_data *sd)
+{
+	struct lnet_ni *best_ni = NULL;
+	struct lnet_peer_ni *best_lpni = sd->sd_best_lpni;
+
+	/* We must use a consistent source address when sending to a
+	 * non-MR peer. However, a non-MR peer can have multiple NIDs
+	 * on multiple networks, and we may even need to talk to this
+	 * peer on multiple networks -- certain types of
+	 * load-balancing configuration do this.
+	 *
+	 * So we need to pick the NI the peer prefers for this
+	 * particular network.
+	 */
+
+	best_ni = lnet_find_existing_preferred_best_ni(sd);
+
+	/* if best_ni is still not set just pick one */
+	if (!best_ni) {
+		best_ni =
+			lnet_find_best_ni_on_spec_net(NULL, sd->sd_peer,
+						      sd->sd_best_lpni->lpni_peer_net,
+						      sd->sd_md_cpt, true);
+		/* If there is no best_ni we don't have a route */
+		if (!best_ni) {
+			CERROR("no path to %s from net %s\n",
+			       libcfs_nid2str(best_lpni->lpni_nid),
+			       libcfs_net2str(best_lpni->lpni_net->net_id));
+			return -EHOSTUNREACH;
+		}
+	}
+
+	sd->sd_best_ni = best_ni;
+
+	/* Set preferred NI if necessary. */
+	lnet_set_non_mr_pref_nid(sd);
+
+	return 0;
+}
+
+/* Source not specified
+ * Local destination
+ * Non-MR Peer
+ *
+ * always use the same source NID for NMR peers
+ * If we've talked to that peer before then we already have a preferred
+ * source NI associated with it. Otherwise, we select a preferred local NI
+ * and store it in the peer
+ */
+static int
+lnet_handle_any_local_nmr_dst(struct lnet_send_data *sd)
+{
+	int rc;
+
+	/* sd->sd_best_lpni is already set to the final destination */
+
+	/* At this point we should've created the peer ni and peer. If we
+	 * can't find it, then something went wrong. Instead of asserting,
+	 * output a relevant message and fail the send
+	 */
+	if (!sd->sd_best_lpni) {
+		CERROR("Internal fault. Unable to send msg %s to %s. NID not known\n",
+		       lnet_msgtyp2str(sd->sd_msg->msg_type),
+		       libcfs_nid2str(sd->sd_dst_nid));
+		return -EFAULT;
+	}
+
+	rc = lnet_select_preferred_best_ni(sd);
 	if (!rc)
-		CDEBUG(D_NET, "TRACE: %s(%s:%s) -> %s(%s:%s) : %s\n",
-		       libcfs_nid2str(msg->msg_hdr.src_nid),
-		       libcfs_nid2str(msg->msg_txni->ni_nid),
-		       libcfs_nid2str(src_nid),
-		       libcfs_nid2str(msg->msg_hdr.dest_nid),
-		       libcfs_nid2str(dst_nid),
-		       libcfs_nid2str(msg->msg_txpeer->lpni_nid),
-		       lnet_msgtyp2str(msg->msg_type));
+		rc = lnet_handle_send(sd);
 
-	lnet_net_unlock(cpt);
+	return rc;
+}
+
+static int
+lnet_handle_any_mr_dsta(struct lnet_send_data *sd)
+{
+	/* NOTE we've already handled the remote peer case. So we only
+	 * need to worry about the local case here.
+	 *
+	 * if we're sending a response, ACK or reply, we need to send it
+	 * to the destination NID given to us. At this point we already
+	 * have the peer_ni we're supposed to send to, so just find the
+	 * best_ni on the peer net and use that. Since we're sending to an
+	 * MR peer then we can just run the selection algorithm on our
+	 * local NIs and pick the best one.
+	 */
+	if (sd->sd_send_case & SND_RESP) {
+		sd->sd_best_ni =
+		  lnet_find_best_ni_on_spec_net(NULL, sd->sd_peer,
+						sd->sd_best_lpni->lpni_peer_net,
+						sd->sd_md_cpt, true);
+
+		if (!sd->sd_best_ni) {
+			/* We're not going to deal with not able to send
+			 * a response to the provided final destination
+			 */
+			CERROR("Can't send response to %s. No local NI available\n",
+			       libcfs_nid2str(sd->sd_dst_nid));
+			return -EHOSTUNREACH;
+		}
+
+		return lnet_handle_send(sd);
+	}
+
+	/* If we get here that means we're sending a fresh request, PUT or
+	 * GET, so we need to run our standard selection algorithm.
+	 * First find the best local interface that's on any of the peer's
+	 * networks.
+	 */
+	sd->sd_best_ni = lnet_find_best_ni_on_local_net(sd->sd_peer,
+							sd->sd_md_cpt);
+	if (sd->sd_best_ni) {
+		sd->sd_best_lpni =
+		  lnet_find_best_lpni_on_net(sd, sd->sd_peer,
+					     sd->sd_best_ni->ni_net->net_id);
+
+		/* if we're successful in selecting a peer_ni on the local
+		 * network, then send to it. Otherwise fall through and
+		 * try and see if we can reach it over another routed
+		 * network
+		 */
+		if (sd->sd_best_lpni) {
+			/* in case we initially started with a routed
+			 * destination, let's reset to local
+			 */
+			sd->sd_send_case &= ~REMOTE_DST;
+			sd->sd_send_case |= LOCAL_DST;
+			return lnet_handle_send(sd);
+		}
+
+		CERROR("Internal Error. Expected to have a best_lpni: %s -> %s\n",
+		       libcfs_nid2str(sd->sd_src_nid),
+		       libcfs_nid2str(sd->sd_dst_nid));
+
+		return -EFAULT;
+	}
+
+	/* Peer doesn't have a local network. Let's see if there is
+	 * a remote network we can reach it on.
+	 */
+	return PASS_THROUGH;
+}
+
+/* Case 1:
+ *	Source NID not specified
+ *	Local destination
+ *	MR peer
+ *
+ * Case 2:
+ *	Source NID not specified
+ *	Remote destination
+ *	MR peer
+ *
+ * In both of these cases if we're sending a response, ACK or REPLY, then
+ * we need to send to the destination NID provided.
+ *
+ * In the remote case let's deal with MR routers.
+ *
+ */
+static int
+lnet_handle_any_mr_dst(struct lnet_send_data *sd)
+{
+	int rc = 0;
+	struct lnet_peer *gw_peer = NULL;
+	struct lnet_peer_ni *gw_lpni = NULL;
+
+	/* handle sending a response to a remote peer here so we don't
+	 * have to worry about it if we hit lnet_handle_any_mr_dsta()
+	 */
+	if (sd->sd_send_case & REMOTE_DST &&
+	    sd->sd_send_case & SND_RESP) {
+		struct lnet_peer_ni *gw;
+		struct lnet_peer *gw_peer;
+
+		rc = lnet_handle_find_routed_path(sd, sd->sd_dst_nid, &gw,
+						  &gw_peer);
+		if (rc < 0) {
+			CERROR("Can't send response to %s. No route available\n",
+			       libcfs_nid2str(sd->sd_dst_nid));
+			return -EHOSTUNREACH;
+		}
+
+		sd->sd_best_lpni = gw;
+		sd->sd_peer = gw_peer;
+
+		return lnet_handle_send(sd);
+	}
+
+	/* Even though the NID for the peer might not be on a local network,
+	 * since the peer is MR there could be other interfaces on the
+	 * local network. In that case we'd still like to prefer the local
+	 * network over the routed network. If we're unable to do that
+	 * then we select the best router among the different routed networks,
+	 * and if the router is MR then we can deal with it as such.
+	 */
+	rc = lnet_handle_any_mr_dsta(sd);
+	if (rc != PASS_THROUGH)
+		return rc;
+
+	/* TODO: One possible enhancement is to run the selection
+	 * algorithm on the peer. However for remote peers the credits are
+	 * not decremented, so we'll be basically going over the peer NIs
+	 * in round robin. An MR router will run the selection algorithm
+	 * on the next-hop interfaces.
+	 */
+	rc = lnet_handle_find_routed_path(sd, sd->sd_dst_nid, &gw_lpni,
+					  &gw_peer);
+	if (rc < 0)
+		return rc;
+
+	sd->sd_send_case &= ~LOCAL_DST;
+	sd->sd_send_case |= REMOTE_DST;
+
+	sd->sd_peer = gw_peer;
+	sd->sd_best_lpni = gw_lpni;
+
+	return lnet_handle_send(sd);
+}
+
+/* Source not specified
+ * Remote destination
+ * Non-MR peer
+ *
+ * Must send to the specified peer NID using the same source NID that
+ * we've used before. If it's the first time to talk to that peer then
+ * find the source NI and assign it as preferred to that peer
+ */
+static int
+lnet_handle_any_router_nmr_dst(struct lnet_send_data *sd)
+{
+	int rc;
+	struct lnet_peer_ni *gw_lpni = NULL;
+	struct lnet_peer *gw_peer = NULL;
+
+	/* Let's see if we have a preferred NI to talk to this NMR peer
+	 */
+	sd->sd_best_ni = lnet_find_existing_preferred_best_ni(sd);
+
+	/* find the router and that'll find the best NI if we didn't find
+	 * it already.
+	 */
+	rc = lnet_handle_find_routed_path(sd, sd->sd_dst_nid, &gw_lpni,
+					  &gw_peer);
+	if (rc < 0)
+		return rc;
+
+	/* set the best_ni we've chosen as the preferred one for
+	 * this peer
+	 */
+	lnet_set_non_mr_pref_nid(sd);
+
+	/* we'll be sending to the gw */
+	sd->sd_best_lpni = gw_lpni;
+	sd->sd_peer = gw_peer;
+
+	return lnet_handle_send(sd);
+}
+
+static int
+lnet_handle_send_case_locked(struct lnet_send_data *sd)
+{
+	/* Turn off the SND_RESP bit.
+	 * It will be checked in the case handling
+	 */
+	u32 send_case = sd->sd_send_case &= ~SND_RESP;
+
+	CDEBUG(D_NET, "Source %s%s to %s %s %s destination\n",
+	       (send_case & SRC_SPEC) ? "Specified: " : "ANY",
+	       (send_case & SRC_SPEC) ? libcfs_nid2str(sd->sd_src_nid) : "",
+	       (send_case & MR_DST) ? "MR: " : "NMR: ",
+	       libcfs_nid2str(sd->sd_dst_nid),
+	       (send_case & LOCAL_DST) ? "local" : "routed");
+
+	switch (send_case) {
+	/* For all cases where the source is specified, we should always
+	 * use the destination NID, whether it's an MR destination or not,
+	 * since we're continuing a series of related messages for the
+	 * same RPC
+	 */
+	case SRC_SPEC_LOCAL_NMR_DST:
+		return lnet_handle_spec_local_nmr_dst(sd);
+	case SRC_SPEC_LOCAL_MR_DST:
+		return lnet_handle_spec_local_mr_dst(sd);
+	case SRC_SPEC_ROUTER_NMR_DST:
+	case SRC_SPEC_ROUTER_MR_DST:
+		return lnet_handle_spec_router_dst(sd);
+	case SRC_ANY_LOCAL_NMR_DST:
+		return lnet_handle_any_local_nmr_dst(sd);
+	case SRC_ANY_LOCAL_MR_DST:
+	case SRC_ANY_ROUTER_MR_DST:
+		return lnet_handle_any_mr_dst(sd);
+	case SRC_ANY_ROUTER_NMR_DST:
+		return lnet_handle_any_router_nmr_dst(sd);
+	default:
+		CERROR("Unknown send case\n");
+		return -1;
+	}
+}
+
+static int
+lnet_select_pathway(lnet_nid_t src_nid, lnet_nid_t dst_nid,
+		    struct lnet_msg *msg, lnet_nid_t rtr_nid)
+{
+	struct lnet_peer_ni *lpni;
+	struct lnet_peer *peer;
+	struct lnet_send_data send_data;
+	int cpt, rc;
+	int md_cpt;
+	u32 send_case = 0;
+
+	memset(&send_data, 0, sizeof(send_data));
+
+	/* get an initial CPT to use for locking. The idea here is not to
+	 * serialize the calls to select_pathway, so that as many
+	 * operations can run concurrently as possible. To do that we use
+	 * the CPT where this call is being executed. Later on when we
+	 * determine the CPT to use in lnet_message_commit, we switch the
+	 * lock and check if there was any configuration change. If none,
+	 * then we proceed; if there was, we restart the operation.
+	 */
+	cpt = lnet_net_lock_current();
+
+	md_cpt = lnet_cpt_of_md(msg->msg_md, msg->msg_offset);
+	if (md_cpt == CFS_CPT_ANY)
+		md_cpt = cpt;
+
+again:
+	/* If we're being asked to send to the loopback interface, there
+	 * is no need to go through any selection. We can just shortcut
+	 * the entire process and send over lolnd
+	 */
+	if (LNET_NETTYP(LNET_NIDNET(dst_nid)) == LOLND) {
+		/* No send credit hassles with LOLND */
+		lnet_ni_addref_locked(the_lnet.ln_loni, cpt);
+		msg->msg_hdr.dest_nid = cpu_to_le64(the_lnet.ln_loni->ni_nid);
+		if (!msg->msg_routing)
+			msg->msg_hdr.src_nid =
+				cpu_to_le64(the_lnet.ln_loni->ni_nid);
+		msg->msg_target.nid = the_lnet.ln_loni->ni_nid;
+		lnet_msg_commit(msg, cpt);
+		msg->msg_txni = the_lnet.ln_loni;
+		lnet_net_unlock(cpt);
+
+		return LNET_CREDIT_OK;
+	}
+
+	/* find an existing peer_ni, or create one and mark it as having been
+	 * created due to network traffic. This call will create the
+	 * peer->peer_net->peer_ni tree.
+	 */
+	lpni = lnet_nid2peerni_locked(dst_nid, LNET_NID_ANY, cpt);
+	if (IS_ERR(lpni)) {
+		lnet_net_unlock(cpt);
+		return PTR_ERR(lpni);
+	}
+
+	/* Now that we have a peer_ni, check if we want to discover
+	 * the peer. Traffic to the LNET_RESERVED_PORTAL should not
+	 * trigger discovery.
+	 */
+	peer = lpni->lpni_peer_net->lpn_peer;
+	if (lnet_msg_discovery(msg) && !lnet_peer_is_uptodate(peer)) {
+		lnet_nid_t primary_nid;
+
+		rc = lnet_discover_peer_locked(lpni, cpt, false);
+		if (rc) {
+			lnet_peer_ni_decref_locked(lpni);
+			lnet_net_unlock(cpt);
+			return rc;
+		}
+		/* The peer may have changed. */
+		peer = lpni->lpni_peer_net->lpn_peer;
+		/* queue message and return */
+		msg->msg_src_nid_param = src_nid;
+		msg->msg_rtr_nid_param = rtr_nid;
+		msg->msg_sending = 0;
+		list_add_tail(&msg->msg_list, &peer->lp_dc_pendq);
+		lnet_peer_ni_decref_locked(lpni);
+		primary_nid = peer->lp_primary_nid;
+		lnet_net_unlock(cpt);
+
+		CDEBUG(D_NET, "%s pending discovery\n",
+		       libcfs_nid2str(primary_nid));
+
+		return LNET_DC_WAIT;
+	}
+	lnet_peer_ni_decref_locked(lpni);
+
+	/* If the peer is not healthy then we can't send anything to it */
+	if (!lnet_is_peer_healthy_locked(peer)) {
+		lnet_net_unlock(cpt);
+		return -EHOSTUNREACH;
+	}
+
+	/* Identify the different send cases
+	 */
+	if (src_nid == LNET_NID_ANY)
+		send_case |= SRC_ANY;
+	else
+		send_case |= SRC_SPEC;
+
+	if (lnet_get_net_locked(LNET_NIDNET(dst_nid)))
+		send_case |= LOCAL_DST;
+	else
+		send_case |= REMOTE_DST;
+
+	if (!lnet_peer_is_multi_rail(peer))
+		send_case |= NMR_DST;
+	else
+		send_case |= MR_DST;
+
+	if (msg->msg_type == LNET_MSG_REPLY ||
+	    msg->msg_type == LNET_MSG_ACK)
+		send_case |= SND_RESP;
+
+	/* assign parameters to the send_data */
+	send_data.sd_msg = msg;
+	send_data.sd_rtr_nid = rtr_nid;
+	send_data.sd_src_nid = src_nid;
+	send_data.sd_dst_nid = dst_nid;
+	send_data.sd_best_lpni = lpni;
+	/* keep a pointer to the final destination in case we're going to
+	 * route, so we'll need to access it later
+	 */
+	send_data.sd_final_dst_lpni = lpni;
+	send_data.sd_peer = peer;
+	send_data.sd_md_cpt = md_cpt;
+	send_data.sd_cpt = cpt;
+	send_data.sd_send_case = send_case;
+
+	rc = lnet_handle_send_case_locked(&send_data);
+
+	if (rc == REPEAT_SEND)
+		goto again;
+
+	lnet_net_unlock(send_data.sd_cpt);
 
 	return rc;
 }
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 076/622] lnet: add health value per ni
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (74 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 075/622] lnet: refactor lnet_select_pathway() James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 077/622] lnet: add lnet_health_sensitivity James Simmons
                   ` (546 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Add a health value per local network interface. The health value
reflects the health of the NI. It is initialized to 1000, a value
chosen so that the health value can be decremented granularly on error.

If the NI is absolutely not healthy that will be indicated by an
LND event, which will flag that the NI is down and should never
be used.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9120
Lustre-commit: d54afb86116c ("LU-9120 lnet: add health value per ni")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32761
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-types.h | 15 +++++++++++++++
 net/lnet/lnet/api-ni.c         |  1 +
 net/lnet/lnet/lib-move.c       | 17 +++++++++++------
 3 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index e9560a9..0ed325a 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -52,6 +52,12 @@
 
 #define LNET_MAX_IOV		(LNET_MAX_PAYLOAD >> PAGE_SHIFT)
 
+/*
+ * This is the maximum health value.
+ * All local and peer NIs created have their health default to this value.
+ */
+#define LNET_MAX_HEALTH_VALUE 1000
+
 /* forward refs */
 struct lnet_libmd;
 
@@ -388,6 +394,15 @@ struct lnet_ni {
 	u32			ni_seq;
 
 	/*
+	 * health value
+	 *	initialized to LNET_MAX_HEALTH_VALUE
+	 * Value is decremented every time we fail to send a message over
+	 * this NI because of a NI specific failure.
+	 * Value is incremented if we successfully send a message.
+	 */
+	atomic_t		ni_healthv;
+
+	/*
 	 * equivalent interfaces to use
 	 * This is an array because socklnd bonding can still be configured
 	 */
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 8be3354..4e83fa8 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -1817,6 +1817,7 @@ static void lnet_push_target_fini(void)
 
 	atomic_set(&ni->ni_tx_credits,
 		   lnet_ni_tq_credits(ni) * ni->ni_ncpts);
+	atomic_set(&ni->ni_healthv, LNET_MAX_HEALTH_VALUE);
 
 	CDEBUG(D_LNI, "Added LNI %s [%d/%d/%d/%d]\n",
 	       libcfs_nid2str(ni->ni_nid),
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 10aa753..ab32c6f 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -1276,6 +1276,7 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	struct lnet_ni *ni = NULL;
 	unsigned int shortest_distance;
 	int best_credits;
+	int best_healthv;
 
 	/* If there is no peer_ni that we can send to on this network,
 	 * then there is no point in looking for a new best_ni here.
@@ -1286,20 +1287,21 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	if (!best_ni) {
 		shortest_distance = UINT_MAX;
 		best_credits = INT_MIN;
+		best_healthv = 0;
 	} else {
 		shortest_distance = cfs_cpt_distance(lnet_cpt_table(), md_cpt,
 						     best_ni->ni_dev_cpt);
 		best_credits = atomic_read(&best_ni->ni_tx_credits);
+		best_healthv = atomic_read(&best_ni->ni_healthv);
 	}
 
 	while ((ni = lnet_get_next_ni_locked(local_net, ni))) {
 		unsigned int distance;
 		int ni_credits;
-
-		if (!lnet_is_ni_healthy_locked(ni))
-			continue;
+		int ni_healthv;
 
 		ni_credits = atomic_read(&ni->ni_tx_credits);
+		ni_healthv = atomic_read(&ni->ni_healthv);
 
 		/*
 		 * calculate the distance from the CPT on which
@@ -1325,21 +1327,24 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 			distance = lnet_numa_range;
 
 		/*
-		 * Select on shorter distance, then available
+		 * Select on health, shorter distance, available
 		 * credits, then round-robin.
 		 */
-		if (distance > shortest_distance) {
+		if (ni_healthv < best_healthv) {
+			continue;
+		} else if (distance > shortest_distance) {
 			continue;
 		} else if (distance < shortest_distance) {
 			shortest_distance = distance;
 		} else if (ni_credits < best_credits) {
 			continue;
 		} else if (ni_credits == best_credits) {
-			if (best_ni && (best_ni)->ni_seq <= ni->ni_seq)
+			if (best_ni && best_ni->ni_seq <= ni->ni_seq)
 				continue;
 		}
 		best_ni = ni;
 		best_credits = ni_credits;
+		best_healthv = ni_healthv;
 	}
 
 	CDEBUG(D_NET, "selected best_ni %s\n",
-- 
1.8.3.1


* [lustre-devel] [PATCH 077/622] lnet: add lnet_health_sensitivity
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (75 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 076/622] lnet: add health value per ni James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 078/622] lnet: add monitor thread James Simmons
                   ` (545 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Add the lnet_health_sensitivity value. This value determines the amount
the NI health value is decremented by. The value defaults to 0,
which turns off the health feature by default. The user needs
to explicitly turn on this feature. The assumption is that many sites
will only have one interface in their nodes. In this case the
health feature will not increase the resiliency of their system.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9120
Lustre-commit: 63cf744d0fdf ("LU-9120 lnet: add lnet_health_sensitivity")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32762
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h |  1 +
 net/lnet/lnet/api-ni.c        | 52 +++++++++++++++++++++++++++++++++++++++++++
 net/lnet/lnet/lib-move.c      | 11 ++++++++-
 3 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index 20b4660..5e13d32 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -479,6 +479,7 @@ struct lnet_ni *
 
 extern unsigned int lnet_transaction_timeout;
 extern unsigned int lnet_numa_range;
+extern unsigned int lnet_health_sensitivity;
 extern unsigned int lnet_peer_discovery_disabled;
 extern int portal_rotor;
 
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 4e83fa8..9d68434 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -78,6 +78,23 @@ struct lnet the_lnet = {
 MODULE_PARM_DESC(lnet_numa_range,
 		 "NUMA range to consider during Multi-Rail selection");
 
+/* lnet_health_sensitivity determines by how much we decrement the health
+ * value on sending error. The value defaults to 0, which means health
+ * checking is turned off by default.
+ */
+unsigned int lnet_health_sensitivity;
+static int sensitivity_set(const char *val, const struct kernel_param *kp);
+static struct kernel_param_ops param_ops_health_sensitivity = {
+	.set = sensitivity_set,
+	.get = param_get_int,
+};
+
+#define param_check_health_sensitivity(name, p) \
+	__param_check(name, p, int)
+module_param(lnet_health_sensitivity, health_sensitivity, 0644);
+MODULE_PARM_DESC(lnet_health_sensitivity,
+		 "Value to decrement the health value by on error");
+
 static int lnet_interfaces_max = LNET_INTERFACES_MAX_DEFAULT;
 static int intf_max_set(const char *val, const struct kernel_param *kp);
 module_param_call(lnet_interfaces_max, intf_max_set, param_get_int,
@@ -115,6 +132,41 @@ static int lnet_discover(struct lnet_process_id id, u32 force,
 			 struct lnet_process_id __user *ids, int n_ids);
 
 static int
+sensitivity_set(const char *val, const struct kernel_param *kp)
+{
+	int rc;
+	unsigned int *sensitivity = (unsigned int *)kp->arg;
+	unsigned long value;
+
+	rc = kstrtoul(val, 0, &value);
+	if (rc) {
+		CERROR("Invalid module parameter value for 'lnet_health_sensitivity'\n");
+		return rc;
+	}
+
+	/* The purpose of locking the api_mutex here is to ensure that
+	 * the correct value ends up stored properly.
+	 */
+	mutex_lock(&the_lnet.ln_api_mutex);
+
+	if (the_lnet.ln_state != LNET_STATE_RUNNING) {
+		mutex_unlock(&the_lnet.ln_api_mutex);
+		return 0;
+	}
+
+	if (value == *sensitivity) {
+		mutex_unlock(&the_lnet.ln_api_mutex);
+		return 0;
+	}
+
+	*sensitivity = value;
+
+	mutex_unlock(&the_lnet.ln_api_mutex);
+
+	return 0;
+}
+
+static int
 discovery_set(const char *val, const struct kernel_param *kp)
 {
 	int rc;
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index ab32c6f..38815fd 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -1332,6 +1332,16 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 		 */
 		if (ni_healthv < best_healthv) {
 			continue;
+		} else if (ni_healthv > best_healthv) {
+			best_healthv = ni_healthv;
+			/* If we're going to prefer this ni because it's
+			 * the healthiest, then we should set the
+			 * shortest_distance in the algorithm in case
+			 * there are multiple NIs with the same health but
+			 * different distances.
+			 */
+			if (distance < shortest_distance)
+				shortest_distance = distance;
 		} else if (distance > shortest_distance) {
 			continue;
 		} else if (distance < shortest_distance) {
@@ -1344,7 +1354,6 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 		}
 		best_ni = ni;
 		best_credits = ni_credits;
-		best_healthv = ni_healthv;
 	}
 
 	CDEBUG(D_NET, "selected best_ni %s\n",
-- 
1.8.3.1


* [lustre-devel] [PATCH 078/622] lnet: add monitor thread
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (76 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 077/622] lnet: add lnet_health_sensitivity James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 079/622] lnet: handle local ni failure James Simmons
                   ` (544 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Refactored the router checker thread to be the monitor thread.
The monitor thread will check router aliveness, expire messages
on the active list, recover local and remote NIs, and resend messages.

In this patch it only checks router aliveness.

A deadline on the message is also added to keep track of when this
message should expire.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9120
Lustre-commit: b01e6fce1c98 ("LU-9120 lnet: add monitor thread")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32763
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h  |  11 ++-
 include/linux/lnet/lib-types.h |  27 +++----
 net/lnet/lnet/api-ni.c         |  12 ++--
 net/lnet/lnet/lib-move.c       |  98 ++++++++++++++++++++++++++
 net/lnet/lnet/lib-msg.c        |   9 ++-
 net/lnet/lnet/router.c         | 156 +++++++++++++----------------------------
 6 files changed, 185 insertions(+), 128 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index 5e13d32..2c3f665 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -714,8 +714,15 @@ int lnet_sock_connect(struct socket **sockp, int *fatal,
 int lnet_peers_start_down(void);
 int lnet_peer_buffer_credits(struct lnet_net *net);
 
-int lnet_router_checker_start(void);
-void lnet_router_checker_stop(void);
+int lnet_monitor_thr_start(void);
+void lnet_monitor_thr_stop(void);
+
+bool lnet_router_checker_active(void);
+void lnet_check_routers(void);
+int lnet_router_pre_mt_start(void);
+void lnet_router_post_mt_start(void);
+void lnet_prune_rc_data(int wait_unlink);
+void lnet_router_cleanup(void);
 void lnet_router_ni_update_locked(struct lnet_peer_ni *gw, u32 net);
 void lnet_swap_pinginfo(struct lnet_ping_buffer *pbuf);
 
diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index 0ed325a..e1a56a1 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -79,6 +79,12 @@ struct lnet_msg {
 	lnet_nid_t		msg_src_nid_param;
 	lnet_nid_t		msg_rtr_nid_param;
 
+	/*
+	 * Deadline for the message after which it will be finalized if it
+	 * has not completed.
+	 */
+	ktime_t			msg_deadline;
+
 	/* committed for sending */
 	unsigned int		msg_tx_committed:1;
 	/* CPT # this message committed for sending */
@@ -905,9 +911,9 @@ struct lnet_msg_container {
 
 /* Router Checker states */
 enum lnet_rc_state {
-	LNET_RC_STATE_SHUTDOWN,	/* not started */
-	LNET_RC_STATE_RUNNING,	/* started up OK */
-	LNET_RC_STATE_STOPPING,	/* telling thread to stop */
+	LNET_MT_STATE_SHUTDOWN,	/* not started */
+	LNET_MT_STATE_RUNNING,	/* started up OK */
+	LNET_MT_STATE_STOPPING,	/* telling thread to stop */
 };
 
 /* LNet states */
@@ -1014,8 +1020,8 @@ struct lnet {
 	/* discovery startup/shutdown state */
 	int				ln_dc_state;
 
-	/* router checker startup/shutdown state */
-	enum lnet_rc_state		ln_rc_state;
+	/* monitor thread startup/shutdown state */
+	enum lnet_rc_state		ln_mt_state;
 	/* router checker's event queue */
 	struct lnet_handle_eq		ln_rc_eqh;
 	/* rcd still pending on net */
@@ -1023,7 +1029,7 @@ struct lnet {
 	/* rcd ready for free */
 	struct list_head		ln_rcd_zombie;
 	/* serialise startup/shutdown */
-	struct completion		ln_rc_signal;
+	struct completion		ln_mt_signal;
 
 	struct mutex			ln_api_mutex;
 	struct mutex			ln_lnd_mutex;
@@ -1053,13 +1059,10 @@ struct lnet {
 	 */
 	bool				ln_nis_from_mod_params;
 
-	/*
-	 * waitq for router checker.  As long as there are no routes in
-	 * the list, the router checker will sleep on this queue.  when
-	 * routes are added the thread will wake up
+	/* waitq for the monitor thread. The monitor thread takes care of
+	 * checking routes, timedout messages and resending messages.
 	 */
-	wait_queue_head_t		ln_rc_waitq;
-
+	wait_queue_head_t		ln_mt_waitq;
 };
 
 #endif
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 9d68434..418d65e 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -309,7 +309,7 @@ static int lnet_discover(struct lnet_process_id id, u32 force,
 	spin_lock_init(&the_lnet.ln_eq_wait_lock);
 	spin_lock_init(&the_lnet.ln_msg_resend_lock);
 	init_waitqueue_head(&the_lnet.ln_eq_waitq);
-	init_waitqueue_head(&the_lnet.ln_rc_waitq);
+	init_waitqueue_head(&the_lnet.ln_mt_waitq);
 	mutex_init(&the_lnet.ln_lnd_mutex);
 }
 
@@ -2281,13 +2281,13 @@ void lnet_lib_exit(void)
 
 	lnet_ping_target_update(pbuf, ping_mdh);
 
-	rc = lnet_router_checker_start();
+	rc = lnet_monitor_thr_start();
 	if (rc)
 		goto err_stop_ping;
 
 	rc = lnet_push_target_init();
 	if (rc != 0)
-		goto err_stop_router_checker;
+		goto err_stop_monitor_thr;
 
 	rc = lnet_peer_discovery_start();
 	if (rc != 0)
@@ -2302,8 +2302,8 @@ void lnet_lib_exit(void)
 
 err_destroy_push_target:
 	lnet_push_target_fini();
-err_stop_router_checker:
-	lnet_router_checker_stop();
+err_stop_monitor_thr:
+	lnet_monitor_thr_stop();
 err_stop_ping:
 	lnet_ping_target_fini();
 err_acceptor_stop:
@@ -2353,7 +2353,7 @@ void lnet_lib_exit(void)
 		lnet_router_debugfs_fini();
 		lnet_peer_discovery_stop();
 		lnet_push_target_fini();
-		lnet_router_checker_stop();
+		lnet_monitor_thr_stop();
 		lnet_ping_target_fini();
 
 		/* Teardown fns that use my own API functions BEFORE here */
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 38815fd..418e3ad 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -818,6 +818,9 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 		}
 	}
 
+	/* unset the msg_tx_delayed flag as we're going to send it now */
+	msg->msg_tx_delayed = 0;
+
 	if (do_send) {
 		lnet_net_unlock(cpt);
 		lnet_ni_send(ni, msg);
@@ -914,6 +917,9 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	msg->msg_niov = rbp->rbp_npages;
 	msg->msg_kiov = &rb->rb_kiov[0];
 
+	/* unset the msg_rx_delayed flag since we're receiving the message */
+	msg->msg_rx_delayed = 0;
+
 	if (do_recv) {
 		int cpt = msg->msg_rx_cpt;
 
@@ -2383,6 +2389,98 @@ struct lnet_ni *
 	return 0;
 }
 
+static int
+lnet_monitor_thread(void *arg)
+{
+	/* The monitor thread takes care of the following:
+	 *  1. Checks the aliveness of routers
+	 *  2. Checks if there are messages on the resend queue to resend
+	 *     them.
+	 *  3. Check if there are any NIs on the local recovery queue and
+	 *     pings them
+	 *  4. Checks if there are any NIs on the remote recovery queue
+	 *     and pings them.
+	 */
+	while (the_lnet.ln_mt_state == LNET_MT_STATE_RUNNING) {
+		if (lnet_router_checker_active())
+			lnet_check_routers();
+
+		/* TODO do we need to check if we should sleep without
+		 * timeout?  Technically, an active system will always
+		 * have messages in flight so this check will always
+		 * evaluate to false. And on an idle system do we care
+		 * if we wake up every 1 second? Although, we've seen
+		 * cases where we get a complaint that an idle thread
+		 * is waking up unnecessarily.
+		 */
+		wait_event_interruptible_timeout(the_lnet.ln_mt_waitq,
+						 false, HZ);
+	}
+
+	/* clean up the router checker */
+	lnet_prune_rc_data(1);
+
+	/* Shutting down */
+	the_lnet.ln_mt_state = LNET_MT_STATE_SHUTDOWN;
+
+	/* signal that the monitor thread is exiting */
+	complete(&the_lnet.ln_mt_signal);
+
+	return 0;
+}
+
+int lnet_monitor_thr_start(void)
+{
+	int rc;
+	struct task_struct *task;
+
+	LASSERT(the_lnet.ln_mt_state == LNET_MT_STATE_SHUTDOWN);
+
+	init_completion(&the_lnet.ln_mt_signal);
+
+	/* Pre monitor thread start processing */
+	rc = lnet_router_pre_mt_start();
+	if (!rc)
+		return rc;
+
+	the_lnet.ln_mt_state = LNET_MT_STATE_RUNNING;
+	task = kthread_run(lnet_monitor_thread, NULL, "monitor_thread");
+	if (IS_ERR(task)) {
+		rc = PTR_ERR(task);
+		CERROR("Can't start monitor thread: %d\n", rc);
+		/* block until event callback signals exit */
+		wait_for_completion(&the_lnet.ln_mt_signal);
+
+		/* clean up */
+		lnet_router_cleanup();
+		the_lnet.ln_mt_state = LNET_MT_STATE_SHUTDOWN;
+		return -ENOMEM;
+	}
+
+	/* post monitor thread start processing */
+	lnet_router_post_mt_start();
+
+	return 0;
+}
+
+void lnet_monitor_thr_stop(void)
+{
+	if (the_lnet.ln_mt_state == LNET_MT_STATE_SHUTDOWN)
+		return;
+
+	LASSERT(the_lnet.ln_mt_state == LNET_MT_STATE_RUNNING);
+	the_lnet.ln_mt_state = LNET_MT_STATE_STOPPING;
+
+	/* tell the monitor thread that we're shutting down */
+	wake_up(&the_lnet.ln_mt_waitq);
+
+	/* block until monitor thread signals that it's done */
+	wait_for_completion(&the_lnet.ln_mt_signal);
+	LASSERT(the_lnet.ln_mt_state == LNET_MT_STATE_SHUTDOWN);
+
+	lnet_router_cleanup();
+}
+
 void
 lnet_drop_message(struct lnet_ni *ni, int cpt, void *private, unsigned int nob,
 		  u32 msg_type)
diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index a7062f6..7869b96 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -141,13 +141,17 @@
 {
 	struct lnet_msg_container *container = the_lnet.ln_msg_containers[cpt];
 	struct lnet_counters *counters = the_lnet.ln_counters[cpt];
+	s64 timeout_ns;
+
+	/* set the message deadline */
+	timeout_ns = lnet_transaction_timeout * NSEC_PER_SEC;
+	msg->msg_deadline = ktime_add_ns(ktime_get(), timeout_ns);
 
 	/* routed message can be committed for both receiving and sending */
 	LASSERT(!msg->msg_tx_committed);
 
 	if (msg->msg_sending) {
 		LASSERT(!msg->msg_receiving);
-
 		msg->msg_tx_cpt = cpt;
 		msg->msg_tx_committed = 1;
 		if (msg->msg_rx_committed) { /* routed message REPLY */
@@ -161,8 +165,9 @@
 	}
 
 	LASSERT(!msg->msg_onactivelist);
+
 	msg->msg_onactivelist = 1;
-	list_add(&msg->msg_activelist, &container->msc_active);
+	list_add_tail(&msg->msg_activelist, &container->msc_active);
 
 	counters->msgs_alloc++;
 	if (counters->msgs_alloc > counters->msgs_max)
diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index 278807d..3f9d8c5 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -70,9 +70,6 @@
 	return net->net_tunables.lct_peer_tx_credits;
 }
 
-/* forward ref's */
-static int lnet_router_checker(void *);
-
 static int check_routers_before_use;
 module_param(check_routers_before_use, int, 0444);
 MODULE_PARM_DESC(check_routers_before_use, "Assume routers are down and ping them before use");
@@ -423,8 +420,8 @@ static void lnet_shuffle_seed(void)
 	if (rnet != rnet2)
 		kfree(rnet);
 
-	/* indicate to startup the router checker if configured */
-	wake_up(&the_lnet.ln_rc_waitq);
+	/* kick start the monitor thread to handle the added route */
+	wake_up(&the_lnet.ln_mt_waitq);
 
 	return rc;
 }
@@ -809,7 +806,7 @@ int lnet_get_rtr_pool_cfg(int idx, struct lnet_ioctl_pool_cfg *pool_cfg)
 	struct lnet_peer_ni *rtr;
 	int all_known;
 
-	LASSERT(the_lnet.ln_rc_state == LNET_RC_STATE_RUNNING);
+	LASSERT(the_lnet.ln_mt_state == LNET_MT_STATE_RUNNING);
 
 	for (;;) {
 		int cpt = lnet_net_lock_current();
@@ -1038,7 +1035,7 @@ int lnet_get_rtr_pool_cfg(int idx, struct lnet_ioctl_pool_cfg *pool_cfg)
 	lnet_ni_notify_locked(ni, rtr);
 
 	if (!lnet_isrouter(rtr) ||
-	    the_lnet.ln_rc_state != LNET_RC_STATE_RUNNING) {
+	    the_lnet.ln_mt_state != LNET_MT_STATE_RUNNING) {
 		/* router table changed or router checker is shutting down */
 		lnet_peer_ni_decref_locked(rtr);
 		return;
@@ -1092,14 +1089,9 @@ int lnet_get_rtr_pool_cfg(int idx, struct lnet_ioctl_pool_cfg *pool_cfg)
 	lnet_peer_ni_decref_locked(rtr);
 }
 
-int
-lnet_router_checker_start(void)
+int lnet_router_pre_mt_start(void)
 {
-	struct task_struct *task;
 	int rc;
-	int eqsz = 0;
-
-	LASSERT(the_lnet.ln_rc_state == LNET_RC_STATE_SHUTDOWN);
 
 	if (check_routers_before_use &&
 	    dead_router_check_interval <= 0) {
@@ -1107,27 +1099,17 @@ int lnet_get_rtr_pool_cfg(int idx, struct lnet_ioctl_pool_cfg *pool_cfg)
 		return -EINVAL;
 	}
 
-	init_completion(&the_lnet.ln_rc_signal);
-
 	rc = LNetEQAlloc(0, lnet_router_checker_event, &the_lnet.ln_rc_eqh);
 	if (rc) {
-		CERROR("Can't allocate EQ(%d): %d\n", eqsz, rc);
+		CERROR("Can't allocate EQ(0): %d\n", rc);
 		return -ENOMEM;
 	}
 
-	the_lnet.ln_rc_state = LNET_RC_STATE_RUNNING;
-	task = kthread_run(lnet_router_checker, NULL, "router_checker");
-	if (IS_ERR(task)) {
-		rc = PTR_ERR(task);
-		CERROR("Can't start router checker thread: %d\n", rc);
-		/* block until event callback signals exit */
-		wait_for_completion(&the_lnet.ln_rc_signal);
-		rc = LNetEQFree(the_lnet.ln_rc_eqh);
-		LASSERT(!rc);
-		the_lnet.ln_rc_state = LNET_RC_STATE_SHUTDOWN;
-		return -ENOMEM;
-	}
+	return 0;
+}
 
+void lnet_router_post_mt_start(void)
+{
 	if (check_routers_before_use) {
 		/*
 		 * Note that a helpful side-effect of pinging all known routers
@@ -1136,33 +1118,17 @@ int lnet_get_rtr_pool_cfg(int idx, struct lnet_ioctl_pool_cfg *pool_cfg)
 		 */
 		lnet_wait_known_routerstate();
 	}
-
-	return 0;
 }
 
-void
-lnet_router_checker_stop(void)
+void lnet_router_cleanup(void)
 {
 	int rc;
 
-	if (the_lnet.ln_rc_state == LNET_RC_STATE_SHUTDOWN)
-		return;
-
-	LASSERT(the_lnet.ln_rc_state == LNET_RC_STATE_RUNNING);
-	the_lnet.ln_rc_state = LNET_RC_STATE_STOPPING;
-	/* wakeup the RC thread if it's sleeping */
-	wake_up(&the_lnet.ln_rc_waitq);
-
-	/* block until event callback signals exit */
-	wait_for_completion(&the_lnet.ln_rc_signal);
-	LASSERT(the_lnet.ln_rc_state == LNET_RC_STATE_SHUTDOWN);
-
 	rc = LNetEQFree(the_lnet.ln_rc_eqh);
-	LASSERT(!rc);
+	LASSERT(rc == 0);
 }
 
-static void
-lnet_prune_rc_data(int wait_unlink)
+void lnet_prune_rc_data(int wait_unlink)
 {
 	struct lnet_rc_data *rcd;
 	struct lnet_rc_data *tmp;
@@ -1170,7 +1136,7 @@ int lnet_get_rtr_pool_cfg(int idx, struct lnet_ioctl_pool_cfg *pool_cfg)
 	struct list_head head;
 	int i = 2;
 
-	if (likely(the_lnet.ln_rc_state == LNET_RC_STATE_RUNNING &&
+	if (likely(the_lnet.ln_mt_state == LNET_MT_STATE_RUNNING &&
 		   list_empty(&the_lnet.ln_rcd_deathrow) &&
 		   list_empty(&the_lnet.ln_rcd_zombie)))
 		return;
@@ -1179,7 +1145,7 @@ int lnet_get_rtr_pool_cfg(int idx, struct lnet_ioctl_pool_cfg *pool_cfg)
 
 	lnet_net_lock(LNET_LOCK_EX);
 
-	if (the_lnet.ln_rc_state != LNET_RC_STATE_RUNNING) {
+	if (the_lnet.ln_mt_state != LNET_MT_STATE_RUNNING) {
 		/* router checker is stopping, prune all */
 		list_for_each_entry(lp, &the_lnet.ln_routers,
 				    lpni_rtr_list) {
@@ -1242,18 +1208,12 @@ int lnet_get_rtr_pool_cfg(int idx, struct lnet_ioctl_pool_cfg *pool_cfg)
 }
 
 /*
- * This function is called to check if the RC should block indefinitely.
- * It's called from lnet_router_checker() as well as being passed to
- * wait_event_interruptible() to avoid the lost wake_up problem.
- *
- * When it's called from wait_event_interruptible() it is necessary to
- * also not sleep if the rc state is not running to avoid a deadlock
- * when the system is shutting down
+ * This function is called from the monitor thread to check if there are
+ * any active routers that need to be checked.
  */
-static inline bool
-lnet_router_checker_active(void)
+bool lnet_router_checker_active(void)
 {
-	if (the_lnet.ln_rc_state != LNET_RC_STATE_RUNNING)
+	if (the_lnet.ln_mt_state != LNET_MT_STATE_RUNNING)
 		return true;
 
 	/*
@@ -1263,70 +1223,54 @@ int lnet_get_rtr_pool_cfg(int idx, struct lnet_ioctl_pool_cfg *pool_cfg)
 	if (the_lnet.ln_routing)
 		return true;
 
+	/* if there are routers that need to be cleaned up then do so */
+	if (!list_empty(&the_lnet.ln_rcd_deathrow) ||
+	    !list_empty(&the_lnet.ln_rcd_zombie))
+		return true;
+
 	return !list_empty(&the_lnet.ln_routers) &&
 		(live_router_check_interval > 0 ||
 		 dead_router_check_interval > 0);
 }
 
-static int
-lnet_router_checker(void *arg)
+void
+lnet_check_routers(void)
 {
 	struct lnet_peer_ni *rtr;
+	u64 version;
+	int cpt;
+	int cpt2;
 
-	while (the_lnet.ln_rc_state == LNET_RC_STATE_RUNNING) {
-		u64 version;
-		int cpt;
-		int cpt2;
-
-		cpt = lnet_net_lock_current();
+	cpt = lnet_net_lock_current();
 rescan:
-		version = the_lnet.ln_routers_version;
+	version = the_lnet.ln_routers_version;
 
-		list_for_each_entry(rtr, &the_lnet.ln_routers, lpni_rtr_list) {
-			cpt2 = rtr->lpni_cpt;
-			if (cpt != cpt2) {
-				lnet_net_unlock(cpt);
-				cpt = cpt2;
-				lnet_net_lock(cpt);
-				/* the routers list has changed */
-				if (version != the_lnet.ln_routers_version)
-					goto rescan;
-			}
-
-			lnet_ping_router_locked(rtr);
-
-			/* NB dropped lock */
-			if (version != the_lnet.ln_routers_version) {
-				/* the routers list has changed */
+	list_for_each_entry(rtr, &the_lnet.ln_routers, lpni_rtr_list) {
+		cpt2 = rtr->lpni_cpt;
+		if (cpt != cpt2) {
+			lnet_net_unlock(cpt);
+			cpt = cpt2;
+			lnet_net_lock(cpt);
+			/* the routers list has changed */
+			if (version != the_lnet.ln_routers_version)
 				goto rescan;
-			}
 		}
 
-		if (the_lnet.ln_routing)
-			lnet_update_ni_status_locked();
-
-		lnet_net_unlock(cpt);
-
-		lnet_prune_rc_data(0); /* don't wait for UNLINK */
+		lnet_ping_router_locked(rtr);
 
-		/*
-		 * if there are any routes then wakeup every second.  If
-		 * there are no routes then sleep indefinitely until woken
-		 * up by a user adding a route
-		 */
-		if (!lnet_router_checker_active())
-			wait_event_idle(the_lnet.ln_rc_waitq,
-					lnet_router_checker_active());
-		else
-			schedule_timeout_idle(HZ);
+		/* NB dropped lock */
+		if (version != the_lnet.ln_routers_version) {
+			/* the routers list has changed */
+			goto rescan;
+		}
 	}
 
-	lnet_prune_rc_data(1); /* wait for UNLINK */
+	if (the_lnet.ln_routing)
+		lnet_update_ni_status_locked();
 
-	the_lnet.ln_rc_state = LNET_RC_STATE_SHUTDOWN;
-	complete(&the_lnet.ln_rc_signal);
-	/* The unlink event callback will signal final completion */
-	return 0;
+	lnet_net_unlock(cpt);
+
+	lnet_prune_rc_data(0); /* don't wait for UNLINK */
 }
 
 void
-- 
1.8.3.1


* [lustre-devel] [PATCH 079/622] lnet: handle local ni failure
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Added an enumerated type listing the different errors which
the LND can propagate up to LNet for further handling.

All local timeout errors will trigger a resend if the
system is configured for resends. Remote errors will
not trigger a resend, to avoid creating a duplicate-message
scenario on the receiving end. If a transmit error is encountered
where we're sure the message wasn't received by the remote end,
we will attempt a resend.

This patch adds LNet-level logic to handle local NI failure. When
the LND finalizes a message, lnet_finalize() checks whether the
message completed successfully. If so, it increments the healthv of
the local NI, but not beyond the maximum; if the message failed, it
decrements the healthv, but not below 0, and puts the message on the
resend queue.

On local NI failure the local NI is placed on a recovery queue.

The monitor thread will wake up and resend all the messages pending.
The selection algorithm will properly select the local and remote NIs
based on the new healthv.

The monitor thread will ping each NI on the local recovery queue. On
reply it checks whether the NI's healthv is back to maximum; if so,
it removes the NI from the recovery queue, otherwise the NI stays
there until it is fully recovered.
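
The clamped healthv adjustment described above can be sketched as below. The helper names, the step size, and the maximum value are illustrative assumptions, not the actual LNet symbols; they only show the increment-but-cap / decrement-but-floor behavior.

```c
#include <assert.h>

#define MAX_HEALTH_VALUE 1000

/* on successful completion: raise healthv, never beyond the max */
static int healthv_inc(int healthv, int step)
{
	healthv += step;
	return healthv > MAX_HEALTH_VALUE ? MAX_HEALTH_VALUE : healthv;
}

/* on failure: lower healthv, never below zero */
static int healthv_dec(int healthv, int step)
{
	healthv -= step;
	return healthv < 0 ? 0 : healthv;
}
```

An NI whose healthv has returned to MAX_HEALTH_VALUE would be taken off the recovery queue; one below it keeps getting pinged.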

WC-bug-id: https://jira.whamcloud.com/browse/LU-9120
Lustre-commit: 70616605dd44 ("LU-9120 lnet: handle local ni failure")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32764
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/api.h       |   3 +-
 include/linux/lnet/lib-lnet.h  |   3 +
 include/linux/lnet/lib-types.h |  54 +++--
 net/lnet/lnet/api-ni.c         |  30 ++-
 net/lnet/lnet/config.c         |   3 +-
 net/lnet/lnet/lib-move.c       | 516 +++++++++++++++++++++++++++++++++++++++--
 net/lnet/lnet/lib-msg.c        | 281 +++++++++++++++++++++-
 net/lnet/lnet/peer.c           |  57 ++---
 net/lnet/lnet/router.c         |   2 +-
 net/lnet/selftest/rpc.c        |   2 +-
 10 files changed, 862 insertions(+), 89 deletions(-)

diff --git a/include/linux/lnet/api.h b/include/linux/lnet/api.h
index 7cc1d04..a57ecc8 100644
--- a/include/linux/lnet/api.h
+++ b/include/linux/lnet/api.h
@@ -195,7 +195,8 @@ int LNetGet(lnet_nid_t self,
 	    struct lnet_process_id target_in,
 	    unsigned int portal_in,
 	    u64	match_bits_in,
-	    unsigned int offset_in);
+	    unsigned int offset_in,
+	    bool recovery);
 /** @} lnet_data */
 
 /** \defgroup lnet_misc Miscellaneous operations.
diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index 2c3f665..965fc5f 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -536,6 +536,8 @@ void lnet_prep_send(struct lnet_msg *msg, int type,
 		    struct lnet_process_id target, unsigned int offset,
 		    unsigned int len);
 int lnet_send(lnet_nid_t nid, struct lnet_msg *msg, lnet_nid_t rtr_nid);
+int lnet_send_ping(lnet_nid_t dest_nid, struct lnet_handle_md *mdh, int nnis,
+		   void *user_ptr, struct lnet_handle_eq eqh, bool recovery);
 void lnet_return_tx_credits_locked(struct lnet_msg *msg);
 void lnet_return_rx_credits_locked(struct lnet_msg *msg);
 void lnet_schedule_blocked_locked(struct lnet_rtrbufpool *rbp);
@@ -623,6 +625,7 @@ void lnet_drop_message(struct lnet_ni *ni, int cpt, void *private,
 void lnet_msg_containers_destroy(void);
 int lnet_msg_containers_create(void);
 
+char *lnet_health_error2str(enum lnet_msg_hstatus hstatus);
 char *lnet_msgtyp2str(int type);
 void lnet_print_hdr(struct lnet_hdr *hdr);
 int lnet_fail_nid(lnet_nid_t nid, unsigned int threshold);
diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index e1a56a1..8c3bf34 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -61,6 +61,20 @@
 /* forward refs */
 struct lnet_libmd;
 
+enum lnet_msg_hstatus {
+	LNET_MSG_STATUS_OK = 0,
+	LNET_MSG_STATUS_LOCAL_INTERRUPT,
+	LNET_MSG_STATUS_LOCAL_DROPPED,
+	LNET_MSG_STATUS_LOCAL_ABORTED,
+	LNET_MSG_STATUS_LOCAL_NO_ROUTE,
+	LNET_MSG_STATUS_LOCAL_ERROR,
+	LNET_MSG_STATUS_LOCAL_TIMEOUT,
+	LNET_MSG_STATUS_REMOTE_ERROR,
+	LNET_MSG_STATUS_REMOTE_DROPPED,
+	LNET_MSG_STATUS_REMOTE_TIMEOUT,
+	LNET_MSG_STATUS_NETWORK_TIMEOUT
+};
+
 struct lnet_msg {
 	struct list_head	msg_activelist;
 	struct list_head	msg_list;	/* Q for credits/MD */
@@ -85,6 +99,13 @@ struct lnet_msg {
 	 */
 	ktime_t			msg_deadline;
 
+	/* The message health status. */
+	enum lnet_msg_hstatus	msg_health_status;
+	/* This is a recovery message */
+	bool			msg_recovery;
+	/* flag to indicate that we do not want to resend this message */
+	bool			msg_no_resend;
+
 	/* committed for sending */
 	unsigned int		msg_tx_committed:1;
 	/* CPT # this message committed for sending */
@@ -277,18 +298,11 @@ struct lnet_tx_queue {
 	struct list_head	tq_delayed;	/* delayed TXs */
 };
 
-enum lnet_ni_state {
-	/* set when NI block is allocated */
-	LNET_NI_STATE_INIT	= 0,
-	/* set when NI is started successfully */
-	LNET_NI_STATE_ACTIVE,
-	/* set when LND notifies NI failed */
-	LNET_NI_STATE_FAILED,
-	/* set when LND notifies NI degraded */
-	LNET_NI_STATE_DEGRADED,
-	/* set when shuttding down NI */
-	LNET_NI_STATE_DELETING
-};
+#define LNET_NI_STATE_INIT		(1 << 0)
+#define LNET_NI_STATE_ACTIVE		(1 << 1)
+#define LNET_NI_STATE_FAILED		(1 << 2)
+#define LNET_NI_STATE_RECOVERY_PENDING	(1 << 3)
+#define LNET_NI_STATE_DELETING		(1 << 4)
 
 enum lnet_stats_type {
 	LNET_STATS_TYPE_SEND	= 0,
@@ -351,6 +365,12 @@ struct lnet_ni {
 	/* chain on the lnet_net structure */
 	struct list_head	ni_netlist;
 
+	/* chain on the recovery queue */
+	struct list_head	ni_recovery;
+
+	/* MD handle for recovery ping */
+	struct lnet_handle_md	ni_ping_mdh;
+
 	/* number of CPTs */
 	int			ni_ncpts;
 
@@ -382,7 +402,7 @@ struct lnet_ni {
 	struct lnet_ni_status	*ni_status;
 
 	/* NI FSM */
-	enum lnet_ni_state	ni_state;
+	u32			ni_state;
 
 	/* per NI LND tunables */
 	struct lnet_lnd_tunables ni_lnd_tunables;
@@ -1063,6 +1083,14 @@ struct lnet {
 	 * checking routes, timedout messages and resending messages.
 	 */
 	wait_queue_head_t		ln_mt_waitq;
+
+	/* per-cpt resend queues */
+	struct list_head		**ln_mt_resendqs;
+	/* local NIs to recover */
+	struct list_head		ln_mt_localNIRecovq;
+	/* recovery eq handler */
+	struct lnet_handle_eq		ln_mt_eqh;
+
 };
 
 #endif
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 418d65e..deef404 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -831,6 +831,7 @@ struct lnet_libhandle *
 	INIT_LIST_HEAD(&the_lnet.ln_dc_request);
 	INIT_LIST_HEAD(&the_lnet.ln_dc_working);
 	INIT_LIST_HEAD(&the_lnet.ln_dc_expired);
+	INIT_LIST_HEAD(&the_lnet.ln_mt_localNIRecovq);
 	init_waitqueue_head(&the_lnet.ln_dc_waitq);
 
 	rc = lnet_descriptor_setup();
@@ -1072,8 +1073,7 @@ struct lnet_net *
 bool
 lnet_is_ni_healthy_locked(struct lnet_ni *ni)
 {
-	if (ni->ni_state == LNET_NI_STATE_ACTIVE ||
-	    ni->ni_state == LNET_NI_STATE_DEGRADED)
+	if (ni->ni_state & LNET_NI_STATE_ACTIVE)
 		return true;
 
 	return false;
@@ -1650,7 +1650,7 @@ static void lnet_push_target_fini(void)
 		list_del_init(&ni->ni_netlist);
 		/* the ni should be in deleting state. If it's not it's
 		 * a bug */
-		LASSERT(ni->ni_state == LNET_NI_STATE_DELETING);
+		LASSERT(ni->ni_state & LNET_NI_STATE_DELETING);
 		cfs_percpt_for_each(ref, j, ni->ni_refs) {
 			if (!*ref)
 				continue;
@@ -1697,7 +1697,10 @@ static void lnet_push_target_fini(void)
 	struct lnet_net *net = ni->ni_net;
 
 	lnet_net_lock(LNET_LOCK_EX);
-	ni->ni_state = LNET_NI_STATE_DELETING;
+	lnet_ni_lock(ni);
+	ni->ni_state |= LNET_NI_STATE_DELETING;
+	ni->ni_state &= ~LNET_NI_STATE_ACTIVE;
+	lnet_ni_unlock(ni);
 	lnet_ni_unlink_locked(ni);
 	lnet_incr_dlc_seq();
 	lnet_net_unlock(LNET_LOCK_EX);
@@ -1789,6 +1792,7 @@ static void lnet_push_target_fini(void)
 
 	list_for_each_entry_safe(msg, tmp, &resend, msg_list) {
 		list_del_init(&msg->msg_list);
+		msg->msg_no_resend = true;
 		lnet_finalize(msg, -ECANCELED);
 	}
 
@@ -1827,7 +1831,10 @@ static void lnet_push_target_fini(void)
 		goto failed0;
 	}
 
-	ni->ni_state = LNET_NI_STATE_ACTIVE;
+	lnet_ni_lock(ni);
+	ni->ni_state |= LNET_NI_STATE_ACTIVE;
+	ni->ni_state &= ~LNET_NI_STATE_INIT;
+	lnet_ni_unlock(ni);
 
 	/* We keep a reference on the loopback net through the loopback NI */
 	if (net->net_lnd->lnd_type == LOLND) {
@@ -2554,11 +2561,17 @@ struct lnet_ni *
 	struct lnet_ni *ni;
 	struct lnet_net *net = mynet;
 
+	/* It is possible that the net has been cleaned out while there is
+	 * a message being sent. This function accessed the net without
+	 * checking if the list is empty
+	 */
 	if (!prev) {
 		if (!net)
 			net = list_first_entry(&the_lnet.ln_nets,
 					       struct lnet_net,
 					       net_list);
+		if (list_empty(&net->net_ni_list))
+			return NULL;
 		ni = list_first_entry(&net->net_ni_list, struct lnet_ni,
 				      ni_netlist);
 
@@ -2580,6 +2593,8 @@ struct lnet_ni *
 		/* get the next net */
 		net = list_first_entry(&prev->ni_net->net_list, struct lnet_net,
 				       net_list);
+		if (list_empty(&net->net_ni_list))
+			return NULL;
 		/* get the ni on it */
 		ni = list_first_entry(&net->net_ni_list, struct lnet_ni,
 				      ni_netlist);
@@ -2587,6 +2602,9 @@ struct lnet_ni *
 		return ni;
 	}
 
+	if (list_empty(&prev->ni_netlist))
+		return NULL;
+
 	/* there are more nis left */
 	ni = list_first_entry(&prev->ni_netlist, struct lnet_ni, ni_netlist);
 
@@ -3571,7 +3589,7 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
 
 	rc = LNetGet(LNET_NID_ANY, mdh, id,
 		     LNET_RESERVED_PORTAL,
-		     LNET_PROTO_PING_MATCHBITS, 0);
+		     LNET_PROTO_PING_MATCHBITS, 0, false);
 	if (rc) {
 		/* Don't CERROR; this could be deliberate! */
 		rc2 = LNetMDUnlink(mdh);
diff --git a/net/lnet/lnet/config.c b/net/lnet/lnet/config.c
index 0560215..ea62d36 100644
--- a/net/lnet/lnet/config.c
+++ b/net/lnet/lnet/config.c
@@ -442,6 +442,7 @@ struct lnet_net *
 
 	spin_lock_init(&ni->ni_lock);
 	INIT_LIST_HEAD(&ni->ni_netlist);
+	INIT_LIST_HEAD(&ni->ni_recovery);
 	ni->ni_refs = cfs_percpt_alloc(lnet_cpt_table(),
 				       sizeof(*ni->ni_refs[0]));
 	if (!ni->ni_refs)
@@ -466,7 +467,7 @@ struct lnet_net *
 		ni->ni_net_ns = NULL;
 
 	ni->ni_last_alive = ktime_get_real_seconds();
-	ni->ni_state = LNET_NI_STATE_INIT;
+	ni->ni_state |= LNET_NI_STATE_INIT;
 	list_add_tail(&ni->ni_netlist, &net->net_ni_added);
 
 	/*
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 418e3ad..f3f4b84 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -579,8 +579,10 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 		(msg->msg_txcredit && msg->msg_peertxcredit));
 
 	rc = ni->ni_net->net_lnd->lnd_send(ni, priv, msg);
-	if (rc < 0)
+	if (rc < 0) {
+		msg->msg_no_resend = true;
 		lnet_finalize(msg, rc);
+	}
 }
 
 static int
@@ -759,8 +761,10 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 
 		CNETERR("Dropping message for %s: peer not alive\n",
 			libcfs_id2str(msg->msg_target));
-		if (do_send)
+		if (do_send) {
+			msg->msg_health_status = LNET_MSG_STATUS_LOCAL_DROPPED;
 			lnet_finalize(msg, -EHOSTUNREACH);
+		}
 
 		lnet_net_lock(cpt);
 		return -EHOSTUNREACH;
@@ -772,8 +776,10 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 
 		CNETERR("Aborting message for %s: LNetM[DE]Unlink() already called on the MD/ME.\n",
 			libcfs_id2str(msg->msg_target));
-		if (do_send)
+		if (do_send) {
+			msg->msg_no_resend = true;
 			lnet_finalize(msg, -ECANCELED);
+		}
 
 		lnet_net_lock(cpt);
 		return -ECANCELED;
@@ -1059,6 +1065,7 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 		lnet_ni_recv(msg->msg_rxni, msg->msg_private, NULL,
 			     0, 0, 0, msg->msg_hdr.payload_length);
 		list_del_init(&msg->msg_list);
+		msg->msg_no_resend = true;
 		lnet_finalize(msg, -ECANCELED);
 	}
 
@@ -2273,6 +2280,14 @@ struct lnet_ni *
 		return PTR_ERR(lpni);
 	}
 
+	/* Cache the original src_nid. If we need to resend the message
+	 * then we'll need to know whether the src_nid was originally
+	 * specified for this message. If it was originally specified,
+	 * then we need to keep using the same src_nid since it's
+	 * continuing the same sequence of messages.
+	 */
+	msg->msg_src_nid_param = src_nid;
+
 	/* Now that we have a peer_ni, check if we want to discover
 	 * the peer. Traffic to the LNET_RESERVED_PORTAL should not
 	 * trigger discovery.
@@ -2290,7 +2305,6 @@ struct lnet_ni *
 		/* The peer may have changed. */
 		peer = lpni->lpni_peer_net->lpn_peer;
 		/* queue message and return */
-		msg->msg_src_nid_param = src_nid;
 		msg->msg_rtr_nid_param = rtr_nid;
 		msg->msg_sending = 0;
 		list_add_tail(&msg->msg_list, &peer->lp_dc_pendq);
@@ -2323,7 +2337,11 @@ struct lnet_ni *
 	else
 		send_case |= REMOTE_DST;
 
-	if (!lnet_peer_is_multi_rail(peer))
+	/* if this is a non-MR peer or if we're recovering a peer ni then
+	 * let's consider this an NMR case so we can hit the destination
+	 * NID.
+	 */
+	if (!lnet_peer_is_multi_rail(peer) || msg->msg_recovery)
 		send_case |= NMR_DST;
 	else
 		send_case |= MR_DST;
@@ -2370,6 +2388,7 @@ struct lnet_ni *
 	 */
 	/* NB: !ni == interface pre-determined (ACK/REPLY) */
 	LASSERT(!msg->msg_txpeer);
+	LASSERT(!msg->msg_txni);
 	LASSERT(!msg->msg_sending);
 	LASSERT(!msg->msg_target_is_router);
 	LASSERT(!msg->msg_receiving);
@@ -2389,6 +2408,314 @@ struct lnet_ni *
 	return 0;
 }
 
+static void
+lnet_resend_pending_msgs_locked(struct list_head *resendq, int cpt)
+{
+	struct lnet_msg *msg;
+
+	while (!list_empty(resendq)) {
+		struct lnet_peer_ni *lpni;
+
+		msg = list_entry(resendq->next, struct lnet_msg,
+				 msg_list);
+
+		list_del_init(&msg->msg_list);
+
+		lpni = lnet_find_peer_ni_locked(msg->msg_hdr.dest_nid);
+		if (!lpni) {
+			lnet_net_unlock(cpt);
+			CERROR("Expected that a peer is already created for %s\n",
+			       libcfs_nid2str(msg->msg_hdr.dest_nid));
+			msg->msg_no_resend = true;
+			lnet_finalize(msg, -EFAULT);
+			lnet_net_lock(cpt);
+		} else {
+			struct lnet_peer *peer;
+			int rc;
+			lnet_nid_t src_nid = LNET_NID_ANY;
+
+			/* if this message is not being routed and the
+			 * peer is non-MR then we must use the same
+			 * src_nid that was used in the original send.
+			 * Otherwise if we're routing the message (IE
+			 * we're a router) then we can use any of our
+			 * local interfaces. It doesn't matter to the
+			 * final destination.
+			 */
+			peer = lpni->lpni_peer_net->lpn_peer;
+			if (!msg->msg_routing &&
+			    !lnet_peer_is_multi_rail(peer))
+				src_nid = le64_to_cpu(msg->msg_hdr.src_nid);
+
+			/* If we originally specified a src NID, then we
+			 * must attempt to reuse it in the resend as well.
+			 */
+			if (msg->msg_src_nid_param != LNET_NID_ANY)
+				src_nid = msg->msg_src_nid_param;
+			lnet_peer_ni_decref_locked(lpni);
+
+			lnet_net_unlock(cpt);
+			rc = lnet_send(src_nid, msg, LNET_NID_ANY);
+			if (rc) {
+				CERROR("Error sending %s to %s: %d\n",
+				       lnet_msgtyp2str(msg->msg_type),
+				       libcfs_id2str(msg->msg_target), rc);
+				msg->msg_no_resend = true;
+				lnet_finalize(msg, rc);
+			}
+			lnet_net_lock(cpt);
+		}
+	}
+}
+
+static void
+lnet_resend_pending_msgs(void)
+{
+	int i;
+
+	cfs_cpt_for_each(i, lnet_cpt_table()) {
+		lnet_net_lock(i);
+		lnet_resend_pending_msgs_locked(the_lnet.ln_mt_resendqs[i], i);
+		lnet_net_unlock(i);
+	}
+}
+
+/* called with cpt and ni_lock held */
+static void
+lnet_unlink_ni_recovery_mdh_locked(struct lnet_ni *ni, int cpt)
+{
+	struct lnet_handle_md recovery_mdh;
+
+	LNetInvalidateMDHandle(&recovery_mdh);
+
+	if (ni->ni_state & LNET_NI_STATE_RECOVERY_PENDING) {
+		recovery_mdh = ni->ni_ping_mdh;
+		LNetInvalidateMDHandle(&ni->ni_ping_mdh);
+	}
+	lnet_ni_unlock(ni);
+	lnet_net_unlock(cpt);
+	if (!LNetMDHandleIsInvalid(recovery_mdh))
+		LNetMDUnlink(recovery_mdh);
+	lnet_net_lock(cpt);
+	lnet_ni_lock(ni);
+}
+
+static void
+lnet_recover_local_nis(void)
+{
+	struct list_head processed_list;
+	struct list_head local_queue;
+	struct lnet_handle_md mdh;
+	struct lnet_ni *tmp;
+	struct lnet_ni *ni;
+	lnet_nid_t nid;
+	int healthv;
+	int rc;
+
+	INIT_LIST_HEAD(&local_queue);
+	INIT_LIST_HEAD(&processed_list);
+
+	/* splice the recovery queue on a local queue. We will iterate
+	 * through the local queue and update it as needed. Once we're
+	 * done with the traversal, we'll splice the local queue back on
+	 * the head of the ln_mt_localNIRecovq. Any newly added local NIs
+	 * will be traversed in the next iteration.
+	 */
+	lnet_net_lock(0);
+	list_splice_init(&the_lnet.ln_mt_localNIRecovq,
+			 &local_queue);
+	lnet_net_unlock(0);
+
+	list_for_each_entry_safe(ni, tmp, &local_queue, ni_recovery) {
+		/* if an NI is being deleted or it is now healthy, there
+		 * is no need to keep it around in the recovery queue.
+		 * The monitor thread is the only thread responsible for
+		 * removing the NI from the recovery queue.
+		 * Multiple threads can be adding NIs to the recovery
+		 * queue.
+		 */
+		healthv = atomic_read(&ni->ni_healthv);
+
+		lnet_net_lock(0);
+		lnet_ni_lock(ni);
+		if (!(ni->ni_state & LNET_NI_STATE_ACTIVE) ||
+		    healthv == LNET_MAX_HEALTH_VALUE) {
+			list_del_init(&ni->ni_recovery);
+			lnet_unlink_ni_recovery_mdh_locked(ni, 0);
+			lnet_ni_unlock(ni);
+			lnet_ni_decref_locked(ni, 0);
+			lnet_net_unlock(0);
+			continue;
+		}
+		lnet_ni_unlock(ni);
+		lnet_net_unlock(0);
+
+		/* protect the ni->ni_state field. Once we call the
+		 * lnet_send_ping function it's possible we receive
+		 * a response before we check the rc. The lock ensures
+		 * a stable value for the ni_state RECOVERY_PENDING bit
+		 */
+		lnet_ni_lock(ni);
+		if (!(ni->ni_state & LNET_NI_STATE_RECOVERY_PENDING)) {
+			ni->ni_state |= LNET_NI_STATE_RECOVERY_PENDING;
+			lnet_ni_unlock(ni);
+			mdh = ni->ni_ping_mdh;
+			/* Invalidate the ni mdh in case it's deleted.
+			 * We'll unlink the mdh in this case below.
+			 */
+			LNetInvalidateMDHandle(&ni->ni_ping_mdh);
+			nid = ni->ni_nid;
+
+			/* remove the NI from the local queue and drop the
+			 * reference count to it while we're recovering
+			 * it. The reason for this is that the NI could
+			 * be deleted, and the way the code is structured,
+			 * if we don't drop the NI then the deletion
+			 * code will enter a loop waiting for the
+			 * reference count to be removed while holding the
+			 * ln_mutex_lock(). When we look up the peer to
+			 * send to in lnet_select_pathway() we will try to
+			 * lock the ln_mutex_lock() as well, leading to
+			 * a deadlock. By dropping the refcount and
+			 * removing it from the list, we allow for the NI
+			 * to be removed, then we use the cached NID to
+			 * look it up again. If it's gone, then we just
+			 * continue examining the rest of the queue.
+			 */
+			lnet_net_lock(0);
+			list_del_init(&ni->ni_recovery);
+			lnet_ni_decref_locked(ni, 0);
+			lnet_net_unlock(0);
+
+			rc = lnet_send_ping(nid, &mdh,
+					    LNET_INTERFACES_MIN, (void *)nid,
+					    the_lnet.ln_mt_eqh, true);
+			/* lookup the nid again */
+			lnet_net_lock(0);
+			ni = lnet_nid2ni_locked(nid, 0);
+			if (!ni) {
+				/* the NI has been deleted when we dropped
+				 * the ref count
+				 */
+				lnet_net_unlock(0);
+				LNetMDUnlink(mdh);
+				continue;
+			}
+			/* Same note as in lnet_recover_peer_nis(). When
+			 * we're sending the ping, the NI is free to be
+			 * deleted or manipulated. By this point it
+			 * could've been added back on the recovery queue,
+			 * and a refcount taken on it.
+			 * So we can't just add it blindly again or we'll
+			 * corrupt the queue. We must check under lock if
+			 * it's not on any list and if not then add it
+			 * to the processed list, which will eventually be
+			 * spliced back on to the recovery queue.
+			 */
+			ni->ni_ping_mdh = mdh;
+			if (list_empty(&ni->ni_recovery)) {
+				list_add_tail(&ni->ni_recovery,
+					      &processed_list);
+				lnet_ni_addref_locked(ni, 0);
+			}
+			lnet_net_unlock(0);
+
+			lnet_ni_lock(ni);
+			if (rc)
+				ni->ni_state &= ~LNET_NI_STATE_RECOVERY_PENDING;
+		}
+		lnet_ni_unlock(ni);
+	}
+
+	/* put back the remaining NIs on the ln_mt_localNIRecovq to be
+	 * reexamined in the next iteration.
+	 */
+	list_splice_init(&processed_list, &local_queue);
+	lnet_net_lock(0);
+	list_splice(&local_queue, &the_lnet.ln_mt_localNIRecovq);
+	lnet_net_unlock(0);
+}
+
+static struct list_head **
+lnet_create_array_of_queues(void)
+{
+	struct list_head **qs;
+	struct list_head *q;
+	int i;
+
+	qs = cfs_percpt_alloc(lnet_cpt_table(),
+			      sizeof(struct list_head));
+	if (!qs) {
+		CERROR("Failed to allocate queues\n");
+		return NULL;
+	}
+
+	cfs_percpt_for_each(q, i, qs)
+		INIT_LIST_HEAD(q);
+
+	return qs;
+}
+
+static int
+lnet_resendqs_create(void)
+{
+	struct list_head **resendqs;
+
+	resendqs = lnet_create_array_of_queues();
+	if (!resendqs)
+		return -ENOMEM;
+
+	lnet_net_lock(LNET_LOCK_EX);
+	the_lnet.ln_mt_resendqs = resendqs;
+	lnet_net_unlock(LNET_LOCK_EX);
+
+	return 0;
+}
+
+static void
+lnet_clean_local_ni_recoveryq(void)
+{
+	struct lnet_ni *ni;
+
+	/* This is only called when the monitor thread has stopped */
+	lnet_net_lock(0);
+
+	while (!list_empty(&the_lnet.ln_mt_localNIRecovq)) {
+		ni = list_entry(the_lnet.ln_mt_localNIRecovq.next,
+				struct lnet_ni, ni_recovery);
+		list_del_init(&ni->ni_recovery);
+		lnet_ni_lock(ni);
+		lnet_unlink_ni_recovery_mdh_locked(ni, 0);
+		lnet_ni_unlock(ni);
+		lnet_ni_decref_locked(ni, 0);
+	}
+
+	lnet_net_unlock(0);
+}
+
+static void
+lnet_clean_resendqs(void)
+{
+	struct lnet_msg *msg, *tmp;
+	struct list_head msgs;
+	int i;
+
+	INIT_LIST_HEAD(&msgs);
+
+	cfs_cpt_for_each(i, lnet_cpt_table()) {
+		lnet_net_lock(i);
+		list_splice_init(the_lnet.ln_mt_resendqs[i], &msgs);
+		lnet_net_unlock(i);
+		list_for_each_entry_safe(msg, tmp, &msgs, msg_list) {
+			list_del_init(&msg->msg_list);
+			msg->msg_no_resend = true;
+			lnet_finalize(msg, -ESHUTDOWN);
+		}
+	}
+
+	cfs_percpt_free(the_lnet.ln_mt_resendqs);
+}
+
 static int
 lnet_monitor_thread(void *arg)
 {
@@ -2405,6 +2732,10 @@ struct lnet_ni *
 		if (lnet_router_checker_active())
 			lnet_check_routers();
 
+		lnet_resend_pending_msgs();
+
+		lnet_recover_local_nis();
+
 		/* TODO do we need to check if we should sleep without
 		 * timeout?  Technically, an active system will always
 		 * have messages in flight so this check will always
@@ -2429,42 +2760,180 @@ struct lnet_ni *
 	return 0;
 }
 
-int lnet_monitor_thr_start(void)
+/* lnet_send_ping
+ * Sends a ping.
+ * Returns == 0 if success
+ * Returns > 0 if LNetMDBind or prior fails
+ * Returns < 0 if LNetGet fails
+ */
+int
+lnet_send_ping(lnet_nid_t dest_nid,
+	       struct lnet_handle_md *mdh, int nnis,
+	       void *user_data, struct lnet_handle_eq eqh, bool recovery)
 {
+	struct lnet_md md = { NULL };
+	struct lnet_process_id id;
+	struct lnet_ping_buffer *pbuf;
 	int rc;
+
+	if (dest_nid == LNET_NID_ANY) {
+		rc = -EHOSTUNREACH;
+		goto fail_error;
+	}
+
+	pbuf = lnet_ping_buffer_alloc(nnis, GFP_NOFS);
+	if (!pbuf) {
+		rc = ENOMEM;
+		goto fail_error;
+	}
+
+	/* initialize md content */
+	md.start = &pbuf->pb_info;
+	md.length = LNET_PING_INFO_SIZE(nnis);
+	md.threshold = 2; /* GET/REPLY */
+	md.max_size = 0;
+	md.options = LNET_MD_TRUNCATE;
+	md.user_ptr = user_data;
+	md.eq_handle = eqh;
+
+	rc = LNetMDBind(md, LNET_UNLINK, mdh);
+	if (rc) {
+		lnet_ping_buffer_decref(pbuf);
+		CERROR("Can't bind MD: %d\n", rc);
+		rc = -rc; /* change the rc to positive */
+		goto fail_error;
+	}
+	id.pid = LNET_PID_LUSTRE;
+	id.nid = dest_nid;
+
+	rc = LNetGet(LNET_NID_ANY, *mdh, id,
+		     LNET_RESERVED_PORTAL,
+		     LNET_PROTO_PING_MATCHBITS, 0, recovery);
+	if (rc)
+		goto fail_unlink_md;
+
+	return 0;
+
+fail_unlink_md:
+	LNetMDUnlink(*mdh);
+	LNetInvalidateMDHandle(mdh);
+fail_error:
+	return rc;
+}
+
+static void
+lnet_mt_event_handler(struct lnet_event *event)
+{
+	lnet_nid_t nid = (lnet_nid_t)event->md.user_ptr;
+	struct lnet_ni *ni;
+	struct lnet_ping_buffer *pbuf;
+
+	/* TODO: remove assert */
+	LASSERT(event->type == LNET_EVENT_REPLY ||
+		event->type == LNET_EVENT_SEND ||
+		event->type == LNET_EVENT_UNLINK);
+
+	CDEBUG(D_NET, "Received event: %d status: %d\n", event->type,
+	       event->status);
+
+	switch (event->type) {
+	case LNET_EVENT_REPLY:
+		/* If the NI has been restored completely then remove from
+		 * the recovery queue
+		 */
+		lnet_net_lock(0);
+		ni = lnet_nid2ni_locked(nid, 0);
+		if (!ni) {
+			lnet_net_unlock(0);
+			break;
+		}
+		lnet_ni_lock(ni);
+		ni->ni_state &= ~LNET_NI_STATE_RECOVERY_PENDING;
+		lnet_ni_unlock(ni);
+		lnet_net_unlock(0);
+		break;
+	case LNET_EVENT_SEND:
+		CDEBUG(D_NET, "%s recovery message sent %s:%d\n",
+		       libcfs_nid2str(nid),
+		       (event->status) ? "unsuccessfully" :
+		       "successfully", event->status);
+		break;
+	case LNET_EVENT_UNLINK:
+		/* nothing to do */
+		CDEBUG(D_NET, "%s recovery ping unlinked\n",
+		       libcfs_nid2str(nid));
+		break;
+	default:
+		CERROR("Unexpected event: %d\n", event->type);
+		return;
+	}
+	if (event->unlinked) {
+		pbuf = LNET_PING_INFO_TO_BUFFER(event->md.start);
+		lnet_ping_buffer_decref(pbuf);
+	}
+}
+
+int lnet_monitor_thr_start(void)
+{
+	int rc = 0;
 	struct task_struct *task;
 
-	LASSERT(the_lnet.ln_mt_state == LNET_MT_STATE_SHUTDOWN);
+	if (the_lnet.ln_mt_state != LNET_MT_STATE_SHUTDOWN)
+		return -EALREADY;
 
-	init_completion(&the_lnet.ln_mt_signal);
+	rc = lnet_resendqs_create();
+	if (rc)
+		return rc;
+
+	rc = LNetEQAlloc(0, lnet_mt_event_handler, &the_lnet.ln_mt_eqh);
+	if (rc != 0) {
+		CERROR("Can't allocate monitor thread EQ: %d\n", rc);
+		goto clean_queues;
+	}
 
 	/* Pre monitor thread start processing */
 	rc = lnet_router_pre_mt_start();
-	if (!rc)
-		return rc;
+	if (rc)
+		goto free_mem;
+
+	init_completion(&the_lnet.ln_mt_signal);
 
 	the_lnet.ln_mt_state = LNET_MT_STATE_RUNNING;
 	task = kthread_run(lnet_monitor_thread, NULL, "monitor_thread");
 	if (IS_ERR(task)) {
 		rc = PTR_ERR(task);
 		CERROR("Can't start monitor thread: %d\n", rc);
-		/* block until event callback signals exit */
-		wait_for_completion(&the_lnet.ln_mt_signal);
-
-		/* clean up */
-		lnet_router_cleanup();
-		the_lnet.ln_mt_state = LNET_MT_STATE_SHUTDOWN;
-		return -ENOMEM;
+		goto clean_thread;
 	}
 
 	/* post monitor thread start processing */
 	lnet_router_post_mt_start();
 
 	return 0;
+
+clean_thread:
+	the_lnet.ln_mt_state = LNET_MT_STATE_STOPPING;
+	/* block until event callback signals exit */
+	wait_for_completion(&the_lnet.ln_mt_signal);
+	/* clean up */
+	lnet_router_cleanup();
+free_mem:
+	the_lnet.ln_mt_state = LNET_MT_STATE_SHUTDOWN;
+	lnet_clean_resendqs();
+	lnet_clean_local_ni_recoveryq();
+	LNetEQFree(the_lnet.ln_mt_eqh);
+	LNetInvalidateEQHandle(&the_lnet.ln_mt_eqh);
+	return rc;
+clean_queues:
+	lnet_clean_resendqs();
+	lnet_clean_local_ni_recoveryq();
+	return rc;
 }
 
 void lnet_monitor_thr_stop(void)
 {
+	int rc;
+
 	if (the_lnet.ln_mt_state == LNET_MT_STATE_SHUTDOWN)
 		return;
 
@@ -2478,7 +2947,12 @@ void lnet_monitor_thr_stop(void)
 	wait_for_completion(&the_lnet.ln_mt_signal);
 	LASSERT(the_lnet.ln_mt_state == LNET_MT_STATE_SHUTDOWN);
 
+	/* perform cleanup tasks */
 	lnet_router_cleanup();
+	lnet_clean_resendqs();
+	lnet_clean_local_ni_recoveryq();
+	rc = LNetEQFree(the_lnet.ln_mt_eqh);
+	LASSERT(rc == 0);
 }
 
 void
@@ -3173,6 +3647,8 @@ void lnet_monitor_thr_stop(void)
 		lnet_drop_message(msg->msg_rxni, msg->msg_rx_cpt,
 				  msg->msg_private, msg->msg_len,
 				  msg->msg_type);
+
+		msg->msg_no_resend = true;
 		/*
 		 * NB: message will not generate event because w/o attached MD,
 		 * but we still should give error code so lnet_msg_decommit()
@@ -3338,6 +3814,7 @@ void lnet_monitor_thr_stop(void)
 	if (rc) {
 		CNETERR("Error sending PUT to %s: %d\n",
 			libcfs_id2str(target), rc);
+		msg->msg_no_resend = true;
 		lnet_finalize(msg, rc);
 	}
 
@@ -3476,7 +3953,7 @@ struct lnet_msg *
 int
 LNetGet(lnet_nid_t self, struct lnet_handle_md mdh,
 	struct lnet_process_id target, unsigned int portal,
-	u64 match_bits, unsigned int offset)
+	u64 match_bits, unsigned int offset, bool recovery)
 {
 	struct lnet_msg *msg;
 	struct lnet_libmd *md;
@@ -3499,6 +3976,8 @@ struct lnet_msg *
 		return -ENOMEM;
 	}
 
+	msg->msg_recovery = recovery;
+
 	cpt = lnet_cpt_of_cookie(mdh.cookie);
 	lnet_res_lock(cpt);
 
@@ -3542,6 +4021,7 @@ struct lnet_msg *
 	if (rc < 0) {
 		CNETERR("Error sending GET to %s: %d\n",
 			libcfs_id2str(target), rc);
+		msg->msg_no_resend = true;
 		lnet_finalize(msg, rc);
 	}
 
diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index 7869b96..e7f7469 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -469,6 +469,234 @@
 	return 0;
 }
 
+static void
+lnet_dec_healthv_locked(atomic_t *healthv)
+{
+	int h = atomic_read(healthv);
+
+	if (h < lnet_health_sensitivity) {
+		atomic_set(healthv, 0);
+	} else {
+		h -= lnet_health_sensitivity;
+		atomic_set(healthv, h);
+	}
+}
+
+static inline void
+lnet_inc_healthv(atomic_t *healthv)
+{
+	atomic_add_unless(healthv, 1, LNET_MAX_HEALTH_VALUE);
+}
+
+static void
+lnet_handle_local_failure(struct lnet_msg *msg)
+{
+	struct lnet_ni *local_ni;
+
+	local_ni = msg->msg_txni;
+
+	/* the lnet_net_lock(0) is used to protect the addref on the ni
+	 * and the recovery queue.
+	 */
+	lnet_net_lock(0);
+	/* the mt could've shutdown and cleaned up the queues */
+	if (the_lnet.ln_mt_state != LNET_MT_STATE_RUNNING) {
+		lnet_net_unlock(0);
+		return;
+	}
+
+	lnet_dec_healthv_locked(&local_ni->ni_healthv);
+	/* add the NI to the recovery queue if it's not already there
+	 * and its health value is actually below the maximum. It's
+	 * possible that the sensitivity might be set to 0, and the health
+	 * value will not be reduced. In this case, there is no reason to
+	 * invoke recovery
+	 */
+	if (list_empty(&local_ni->ni_recovery) &&
+	    atomic_read(&local_ni->ni_healthv) < LNET_MAX_HEALTH_VALUE) {
+		CERROR("ni %s added to recovery queue. Health = %d\n",
+		       libcfs_nid2str(local_ni->ni_nid),
+		       atomic_read(&local_ni->ni_healthv));
+		list_add_tail(&local_ni->ni_recovery,
+			      &the_lnet.ln_mt_localNIRecovq);
+		lnet_ni_addref_locked(local_ni, 0);
+	}
+	lnet_net_unlock(0);
+}
+
+/* Do a health check on the message:
+ * return -1 if we're not going to handle the error
+ *   success case will return -1 as well
+ * return 0 if the message is requeued for send
+ */
+static int
+lnet_health_check(struct lnet_msg *msg)
+{
+	enum lnet_msg_hstatus hstatus = msg->msg_health_status;
+
+	/* TODO: lnet_incr_hstats(hstatus); */
+
+	LASSERT(msg->msg_txni);
+
+	if (hstatus != LNET_MSG_STATUS_OK &&
+	    ktime_compare(ktime_get(), msg->msg_deadline) >= 0)
+		return -1;
+
+	/* if we're shutting down no point in handling health. */
+	if (the_lnet.ln_state != LNET_STATE_RUNNING)
+		return -1;
+
+	switch (hstatus) {
+	case LNET_MSG_STATUS_OK:
+		lnet_inc_healthv(&msg->msg_txni->ni_healthv);
+		/* we can finalize this message */
+		return -1;
+	case LNET_MSG_STATUS_LOCAL_INTERRUPT:
+	case LNET_MSG_STATUS_LOCAL_DROPPED:
+	case LNET_MSG_STATUS_LOCAL_ABORTED:
+	case LNET_MSG_STATUS_LOCAL_NO_ROUTE:
+	case LNET_MSG_STATUS_LOCAL_TIMEOUT:
+		lnet_handle_local_failure(msg);
+		/* add to the re-send queue */
+		goto resend;
+
+		/* TODO: since the remote dropped the message we can
+		 * attempt a resend safely.
+		 */
+	case LNET_MSG_STATUS_REMOTE_DROPPED:
+		break;
+
+		/* These errors will not trigger a resend so simply
+		 * finalize the message
+		 */
+	case LNET_MSG_STATUS_LOCAL_ERROR:
+		lnet_handle_local_failure(msg);
+		return -1;
+	case LNET_MSG_STATUS_REMOTE_ERROR:
+	case LNET_MSG_STATUS_REMOTE_TIMEOUT:
+	case LNET_MSG_STATUS_NETWORK_TIMEOUT:
+		return -1;
+	}
+
+resend:
+	/* don't resend recovery messages */
+	if (msg->msg_recovery)
+		return -1;
+
+	/* if we explicitly indicated we don't want to resend then just
+	 * return
+	 */
+	if (msg->msg_no_resend)
+		return -1;
+
+	lnet_net_lock(msg->msg_tx_cpt);
+
+	/* remove message from the active list and reset it in preparation
+	 * for a resend. Two exceptions to this:
+	 *
+	 * 1. the router case, when a message is committed for rx when
+	 * received, then tx when it is sent. When committed to both tx and
+	 * rx we don't want to remove it from the active list.
+	 *
+	 * 2. The REPLY case since it uses the same msg block for the GET
+	 * that was received.
+	 */
+	if (!msg->msg_routing && msg->msg_type != LNET_MSG_REPLY) {
+		list_del_init(&msg->msg_activelist);
+		msg->msg_onactivelist = 0;
+	}
+
+	/* The msg_target.nid which was originally set
+	 * when calling LNetGet() or LNetPut() might've
+	 * been overwritten if we're routing this message.
+	 * Call lnet_return_tx_credits_locked() to return
+	 * the credit this message consumed. The message will
+	 * consume another credit when it gets resent.
+	 */
+	msg->msg_target.nid = msg->msg_hdr.dest_nid;
+	lnet_msg_decommit_tx(msg, -EAGAIN);
+	msg->msg_sending = 0;
+	msg->msg_receiving = 0;
+	msg->msg_target_is_router = 0;
+
+	CDEBUG(D_NET, "%s->%s:%s:%s - queuing for resend\n",
+	       libcfs_nid2str(msg->msg_hdr.src_nid),
+	       libcfs_nid2str(msg->msg_hdr.dest_nid),
+	       lnet_msgtyp2str(msg->msg_type),
+	       lnet_health_error2str(hstatus));
+
+	list_add_tail(&msg->msg_list, the_lnet.ln_mt_resendqs[msg->msg_tx_cpt]);
+	lnet_net_unlock(msg->msg_tx_cpt);
+
+	wake_up(&the_lnet.ln_mt_waitq);
+	return 0;
+}
+
+static void
+lnet_detach_md(struct lnet_msg *msg, int status)
+{
+	int cpt = lnet_cpt_of_cookie(msg->msg_md->md_lh.lh_cookie);
+
+	lnet_res_lock(cpt);
+	lnet_msg_detach_md(msg, status);
+	lnet_res_unlock(cpt);
+}
+
+static bool
+lnet_is_health_check(struct lnet_msg *msg)
+{
+	bool hc;
+	int status = msg->msg_ev.status;
+
+	/* perform a health check for any message committed for transmit */
+	hc = msg->msg_tx_committed;
+
+	/* Check for status inconsistencies */
+	if (hc &&
+	    ((!status && msg->msg_health_status != LNET_MSG_STATUS_OK) ||
+	     (status && msg->msg_health_status == LNET_MSG_STATUS_OK))) {
+		CERROR("Msg is in inconsistent state, don't perform health checking (%d, %d)\n",
+		       status, msg->msg_health_status);
+		hc = false;
+	}
+
+	CDEBUG(D_NET, "health check = %d, status = %d, hstatus = %d\n",
+	       hc, status, msg->msg_health_status);
+
+	return hc;
+}
+
+char *
+lnet_health_error2str(enum lnet_msg_hstatus hstatus)
+{
+	switch (hstatus) {
+	case LNET_MSG_STATUS_LOCAL_INTERRUPT:
+		return "LOCAL_INTERRUPT";
+	case LNET_MSG_STATUS_LOCAL_DROPPED:
+		return "LOCAL_DROPPED";
+	case LNET_MSG_STATUS_LOCAL_ABORTED:
+		return "LOCAL_ABORTED";
+	case LNET_MSG_STATUS_LOCAL_NO_ROUTE:
+		return "LOCAL_NO_ROUTE";
+	case LNET_MSG_STATUS_LOCAL_TIMEOUT:
+		return "LOCAL_TIMEOUT";
+	case LNET_MSG_STATUS_LOCAL_ERROR:
+		return "LOCAL_ERROR";
+	case LNET_MSG_STATUS_REMOTE_DROPPED:
+		return "REMOTE_DROPPED";
+	case LNET_MSG_STATUS_REMOTE_ERROR:
+		return "REMOTE_ERROR";
+	case LNET_MSG_STATUS_REMOTE_TIMEOUT:
+		return "REMOTE_TIMEOUT";
+	case LNET_MSG_STATUS_NETWORK_TIMEOUT:
+		return "NETWORK_TIMEOUT";
+	case LNET_MSG_STATUS_OK:
+		return "OK";
+	default:
+		return "<UNKNOWN>";
+	}
+}
+
 void
 lnet_finalize(struct lnet_msg *msg, int status)
 {
@@ -477,6 +705,7 @@
 	int cpt;
 	int rc;
 	int i;
+	bool hc;
 
 	LASSERT(!in_interrupt());
 
@@ -485,15 +714,27 @@
 
 	msg->msg_ev.status = status;
 
-	if (msg->msg_md) {
-		cpt = lnet_cpt_of_cookie(msg->msg_md->md_lh.lh_cookie);
-
-		lnet_res_lock(cpt);
-		lnet_msg_detach_md(msg, status);
-		lnet_res_unlock(cpt);
-	}
+	/* if the message is successfully sent, no need to keep the MD around */
+	if (msg->msg_md && !status)
+		lnet_detach_md(msg, status);
 
 again:
+	hc = lnet_is_health_check(msg);
+
+	/* the MD would've been detached from the message if it was
+	 * successfully sent. However, if it wasn't successfully sent the
+	 * MD would be around. And since we recalculate whether to
+	 * health check or not, it's possible that we change our minds and
+	 * we don't want to health check this message. In this case also
+	 * free the MD.
+	 *
+	 * If the message is successful we're going to
+	 * go through the lnet_health_check() function, but that'll just
+	 * increment the appropriate health value and return.
+	 */
+	if (msg->msg_md && !hc)
+		lnet_detach_md(msg, status);
+
 	rc = 0;
 	if (!msg->msg_tx_committed && !msg->msg_rx_committed) {
 		/* not committed to network yet */
@@ -502,6 +743,28 @@
 		return;
 	}
 
+	if (hc) {
+		/* Check the health status of the message. If it has one
+		 * of the errors that we're supposed to handle, and it has
+		 * not timed out, then
+		 *	1. Decrement the appropriate health_value
+		 *	2. queue the message on the resend queue
+		 *
+		 * If the message send succeeded, timed out, or failed the
+		 * health check for any reason, then we'll just finalize the
+		 * message. Otherwise just return since the message has been
+		 * put on the resend queue.
+		 */
+		if (!lnet_health_check(msg))
+			return;
+
+		/* if we get here then we need to clean up the md because we're
+		 * finalizing the message.
+		 */
+		if (msg->msg_md)
+			lnet_detach_md(msg, status);
+	}
+
 	/*
 	 * NB: routed message can be committed for both receiving and sending,
 	 * we should finalize in LIFO order and keep counters correct.
@@ -536,7 +799,7 @@
 	while ((msg = list_first_entry_or_null(&container->msc_finalizing,
 					       struct lnet_msg,
 					       msg_list)) != NULL) {
-		list_del(&msg->msg_list);
+		list_del_init(&msg->msg_list);
 
 		/*
 		 * NB drops and regains the lnet lock if it actually does
@@ -575,7 +838,7 @@
 					       msg_activelist)) != NULL) {
 		LASSERT(msg->msg_onactivelist);
 		msg->msg_onactivelist = 0;
-		list_del(&msg->msg_activelist);
+		list_del_init(&msg->msg_activelist);
 		kfree(msg);
 		count++;
 	}
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 1534ab2..121876e 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -2713,9 +2713,7 @@ static lnet_nid_t lnet_peer_select_nid(struct lnet_peer *lp)
 static int lnet_peer_send_ping(struct lnet_peer *lp)
 __must_hold(&lp->lp_lock)
 {
-	struct lnet_md md = { NULL };
-	struct lnet_process_id id;
-	struct lnet_ping_buffer *pbuf;
+	lnet_nid_t pnid;
 	int nnis;
 	int rc;
 	int cpt;
@@ -2724,54 +2722,35 @@ static int lnet_peer_send_ping(struct lnet_peer *lp)
 	lp->lp_state &= ~LNET_PEER_FORCE_PING;
 	spin_unlock(&lp->lp_lock);
 
-	nnis = max_t(int, lp->lp_data_nnis, LNET_INTERFACES_MIN);
-	pbuf = lnet_ping_buffer_alloc(nnis, GFP_NOFS);
-	if (!pbuf) {
-		rc = -ENOMEM;
-		goto fail_error;
-	}
-
-	/* initialize md content */
-	md.start = &pbuf->pb_info;
-	md.length = LNET_PING_INFO_SIZE(nnis);
-	md.threshold = 2; /* GET/REPLY */
-	md.max_size = 0;
-	md.options = LNET_MD_TRUNCATE;
-	md.user_ptr = lp;
-	md.eq_handle = the_lnet.ln_dc_eqh;
-
-	rc = LNetMDBind(md, LNET_UNLINK, &lp->lp_ping_mdh);
-	if (rc != 0) {
-		lnet_ping_buffer_decref(pbuf);
-		CERROR("Can't bind MD: %d\n", rc);
-		goto fail_error;
-	}
 	cpt = lnet_net_lock_current();
 	/* Refcount for MD. */
 	lnet_peer_addref_locked(lp);
-	id.pid = LNET_PID_LUSTRE;
-	id.nid = lnet_peer_select_nid(lp);
+	pnid = lnet_peer_select_nid(lp);
 	lnet_net_unlock(cpt);
 
-	if (id.nid == LNET_NID_ANY) {
-		rc = -EHOSTUNREACH;
-		goto fail_unlink_md;
-	}
+	nnis = max_t(int, lp->lp_data_nnis, LNET_INTERFACES_MIN);
 
-	rc = LNetGet(LNET_NID_ANY, lp->lp_ping_mdh, id,
-		     LNET_RESERVED_PORTAL,
-		     LNET_PROTO_PING_MATCHBITS, 0);
-	if (rc)
-		goto fail_unlink_md;
+	rc = lnet_send_ping(pnid, &lp->lp_ping_mdh, nnis, lp,
+			    the_lnet.ln_dc_eqh, false);
+	/* if LNetMDBind in lnet_send_ping fails we need to decrement the
+	 * refcount on the peer, otherwise LNetMDUnlink will be called
+	 * which will eventually do that.
+	 */
+	if (rc > 0) {
+		lnet_net_lock(cpt);
+		lnet_peer_decref_locked(lp);
+		lnet_net_unlock(cpt);
+		rc = -rc; /* change the rc to negative value */
+		goto fail_error;
+	} else if (rc < 0) {
+		goto fail_error;
+	}
 
 	CDEBUG(D_NET, "peer %s\n", libcfs_nid2str(lp->lp_primary_nid));
 
 	spin_lock(&lp->lp_lock);
 	return 0;
 
-fail_unlink_md:
-	LNetMDUnlink(lp->lp_ping_mdh);
-	LNetInvalidateMDHandle(&lp->lp_ping_mdh);
 fail_error:
 	CDEBUG(D_NET, "peer %s: %d\n", libcfs_nid2str(lp->lp_primary_nid), rc);
 	/*
diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index 3f9d8c5..7c3bbd8 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -1079,7 +1079,7 @@ int lnet_get_rtr_pool_cfg(int idx, struct lnet_ioctl_pool_cfg *pool_cfg)
 		lnet_net_unlock(rtr->lpni_cpt);
 
 		rc = LNetGet(LNET_NID_ANY, mdh, id, LNET_RESERVED_PORTAL,
-			     LNET_PROTO_PING_MATCHBITS, 0);
+			     LNET_PROTO_PING_MATCHBITS, 0, false);
 
 		lnet_net_lock(rtr->lpni_cpt);
 		if (rc)
diff --git a/net/lnet/selftest/rpc.c b/net/lnet/selftest/rpc.c
index 295d704..a5941e4 100644
--- a/net/lnet/selftest/rpc.c
+++ b/net/lnet/selftest/rpc.c
@@ -425,7 +425,7 @@ struct srpc_bulk *
 	} else {
 		LASSERT(options & LNET_MD_OP_GET);
 
-		rc = LNetGet(self, *mdh, peer, portal, matchbits, 0);
+		rc = LNetGet(self, *mdh, peer, portal, matchbits, 0, false);
 	}
 
 	if (rc) {
-- 
1.8.3.1

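The resend decision implemented by lnet_health_check() in the hunk above can be restated as a small decision table: local failures and REMOTE_DROPPED are retried unless the message is a recovery ping, was flagged no-resend, or is past its deadline; hard errors and remote/network timeouts go straight to finalization. The sketch below is illustrative only — the standalone health_action() helper and the action enum are hypothetical; only the status names mirror the patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Local re-declaration of the health statuses introduced by this
 * series (enum lnet_msg_hstatus); values here are illustrative. */
enum lnet_msg_hstatus {
	LNET_MSG_STATUS_OK,
	LNET_MSG_STATUS_LOCAL_INTERRUPT,
	LNET_MSG_STATUS_LOCAL_DROPPED,
	LNET_MSG_STATUS_LOCAL_ABORTED,
	LNET_MSG_STATUS_LOCAL_NO_ROUTE,
	LNET_MSG_STATUS_LOCAL_TIMEOUT,
	LNET_MSG_STATUS_LOCAL_ERROR,
	LNET_MSG_STATUS_REMOTE_DROPPED,
	LNET_MSG_STATUS_REMOTE_ERROR,
	LNET_MSG_STATUS_REMOTE_TIMEOUT,
	LNET_MSG_STATUS_NETWORK_TIMEOUT,
};

enum action { FINALIZE, RESEND };	/* hypothetical helper enum */

/* Hypothetical pure-function restatement of lnet_health_check():
 * returns RESEND only for the retryable statuses, and only while
 * none of the "don't retry" conditions hold. */
static enum action
health_action(enum lnet_msg_hstatus hstatus, bool recovery,
	      bool no_resend, bool past_deadline)
{
	/* a failed message past its deadline is always finalized */
	if (hstatus != LNET_MSG_STATUS_OK && past_deadline)
		return FINALIZE;

	switch (hstatus) {
	case LNET_MSG_STATUS_LOCAL_INTERRUPT:
	case LNET_MSG_STATUS_LOCAL_DROPPED:
	case LNET_MSG_STATUS_LOCAL_ABORTED:
	case LNET_MSG_STATUS_LOCAL_NO_ROUTE:
	case LNET_MSG_STATUS_LOCAL_TIMEOUT:
	case LNET_MSG_STATUS_REMOTE_DROPPED:
		/* candidate for a resend */
		return (recovery || no_resend) ? FINALIZE : RESEND;
	default:
		/* OK, hard local/remote errors, remote/network timeouts */
		return FINALIZE;
	}
}
```

In the real code the RESEND branch corresponds to queuing the message on ln_mt_resendqs and waking the monitor thread, which replays it via lnet_resend_pending_msgs().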
* [lustre-devel] [PATCH 080/622] lnet: handle o2iblnd tx failure
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
From: James Simmons @ 2020-02-27 21:09 UTC
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Monitor the different types of failures that might occur on the
transmit path and flag the type of failure to be propagated to
LNet, which will handle it either by attempting a resend or by
simply finalizing the message and propagating the failure to the
ULP.

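As a rough sketch of the classification this patch performs when a connection's tx lists are aborted (see the o2iblnd_cb.c hunk below): a tx still queued locally never reached the wire and gets a LOCAL_* status, a tx on the active list that we were only waiting on timed out at the remote, and one mid-send timed out on the network. The standalone classify_aborted_tx() helper below is hypothetical; only the status names and the -ETIMEDOUT distinction come from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <errno.h>

/* Subset of enum lnet_msg_hstatus used by this patch; re-declared
 * locally for the sketch. */
enum lnet_msg_hstatus {
	LNET_MSG_STATUS_OK,
	LNET_MSG_STATUS_LOCAL_TIMEOUT,
	LNET_MSG_STATUS_LOCAL_ERROR,
	LNET_MSG_STATUS_REMOTE_TIMEOUT,
	LNET_MSG_STATUS_NETWORK_TIMEOUT,
};

/* Hypothetical restatement of the logic added to the tx-abort path.
 * Returns OK where the patch leaves tx_hstatus untouched (an active
 * tx whose connection failed for a reason other than a timeout). */
static enum lnet_msg_hstatus
classify_aborted_tx(bool on_active_list, bool waiting, bool sending,
		    int comms_error)
{
	if (on_active_list) {
		if (comms_error == -ETIMEDOUT) {
			if (waiting && !sending)
				return LNET_MSG_STATUS_REMOTE_TIMEOUT;
			if (sending)
				return LNET_MSG_STATUS_NETWORK_TIMEOUT;
		}
		return LNET_MSG_STATUS_OK;	/* hstatus left as-is */
	}
	/* tx was still queued, never handed to the wire */
	return comms_error == -ETIMEDOUT ?
		LNET_MSG_STATUS_LOCAL_TIMEOUT : LNET_MSG_STATUS_LOCAL_ERROR;
}
```

The resulting tx_hstatus is copied into msg_health_status in kiblnd_tx_done(), which is how LNet's health machinery (previous patches in this series) sees the failure.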
WC-bug-id: https://jira.whamcloud.com/browse/LU-9120
Lustre-commit: 8cf835e425d8 ("LU-9120 lnet: handle o2iblnd tx failure")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32765
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/o2iblnd/o2iblnd.c    |  2 +-
 net/lnet/klnds/o2iblnd/o2iblnd.h    |  4 ++-
 net/lnet/klnds/o2iblnd/o2iblnd_cb.c | 59 ++++++++++++++++++++++++++++++++-----
 3 files changed, 55 insertions(+), 10 deletions(-)

diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.c b/net/lnet/klnds/o2iblnd/o2iblnd.c
index 825fe30..017fe5f 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.c
@@ -519,7 +519,7 @@ static int kiblnd_del_peer(struct lnet_ni *ni, lnet_nid_t nid)
 
 	write_unlock_irqrestore(&kiblnd_data.kib_global_lock, flags);
 
-	kiblnd_txlist_done(&zombies, -EIO);
+	kiblnd_txlist_done(&zombies, -EIO, LNET_MSG_STATUS_LOCAL_ERROR);
 
 	return rc;
 }
diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.h b/net/lnet/klnds/o2iblnd/o2iblnd.h
index 9021051..999b58d 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.h
@@ -515,6 +515,7 @@ struct kib_tx {					/* transmit message */
 	short			tx_queued;	/* queued for sending */
 	short			tx_waiting;	/* waiting for peer_ni */
 	int			tx_status;	/* LNET completion status */
+	enum lnet_msg_hstatus	tx_hstatus;	/* health status of the transmit */
 	ktime_t			tx_deadline;	/* completion deadline */
 	u64			tx_cookie;	/* completion cookie */
 	struct lnet_msg	       *tx_lntmsg[2];	/* lnet msgs to finalize on completion */
@@ -1027,7 +1028,8 @@ struct kib_conn *kiblnd_create_conn(struct kib_peer_ni *peer_ni,
 void kiblnd_close_conn_locked(struct kib_conn *conn, int error);
 
 void kiblnd_launch_tx(struct lnet_ni *ni, struct kib_tx *tx, lnet_nid_t nid);
-void kiblnd_txlist_done(struct list_head *txlist, int status);
+void kiblnd_txlist_done(struct list_head *txlist, int status,
+			enum lnet_msg_hstatus hstatus);
 
 void kiblnd_qp_event(struct ib_event *event, void *arg);
 void kiblnd_cq_event(struct ib_event *event, void *arg);
diff --git a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
index 60706b4..007058a 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -89,12 +89,17 @@ static int kiblnd_init_rdma(struct kib_conn *conn, struct kib_tx *tx, int type,
 		if (!lntmsg[i])
 			continue;
 
+		/* propagate health status to LNet for requests */
+		if (i == 0 && lntmsg[i])
+			lntmsg[i]->msg_health_status = tx->tx_hstatus;
+
 		lnet_finalize(lntmsg[i], rc);
 	}
 }
 
 void
-kiblnd_txlist_done(struct list_head *txlist, int status)
+kiblnd_txlist_done(struct list_head *txlist, int status,
+		   enum lnet_msg_hstatus hstatus)
 {
 	struct kib_tx *tx;
 
@@ -105,6 +110,7 @@ static int kiblnd_init_rdma(struct kib_conn *conn, struct kib_tx *tx, int type,
 		/* complete now */
 		tx->tx_waiting = 0;
 		tx->tx_status = status;
+		tx->tx_hstatus = hstatus;
 		kiblnd_tx_done(tx);
 	}
 }
@@ -134,6 +140,7 @@ static int kiblnd_init_rdma(struct kib_conn *conn, struct kib_tx *tx, int type,
 	LASSERT(!tx->tx_nfrags);
 
 	tx->tx_gaps = false;
+	tx->tx_hstatus = LNET_MSG_STATUS_OK;
 
 	return tx;
 }
@@ -265,10 +272,12 @@ static int kiblnd_init_rdma(struct kib_conn *conn, struct kib_tx *tx, int type,
 	}
 
 	if (!tx->tx_status) {		/* success so far */
-		if (status < 0) /* failed? */
+		if (status < 0) {	/* failed? */
 			tx->tx_status = status;
-		else if (txtype == IBLND_MSG_GET_REQ)
+			tx->tx_hstatus = LNET_MSG_STATUS_REMOTE_ERROR;
+		} else if (txtype == IBLND_MSG_GET_REQ) {
 			lnet_set_reply_msg_len(ni, tx->tx_lntmsg[1], status);
+		}
 	}
 
 	tx->tx_waiting = 0;
@@ -846,6 +855,7 @@ static int kiblnd_map_tx(struct lnet_ni *ni, struct kib_tx *tx,
 		 * posted NOOPs complete
 		 */
 		spin_unlock(&conn->ibc_lock);
+		tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_ERROR;
 		kiblnd_tx_done(tx);
 		spin_lock(&conn->ibc_lock);
 		CDEBUG(D_NET, "%s(%d): redundant or enough NOOP\n",
@@ -1045,6 +1055,7 @@ static int kiblnd_map_tx(struct lnet_ni *ni, struct kib_tx *tx,
 		conn->ibc_noops_posted--;
 
 	if (failed) {
+		tx->tx_hstatus = LNET_MSG_STATUS_REMOTE_DROPPED;
 		tx->tx_waiting = 0;	/* don't wait for peer_ni */
 		tx->tx_status = -EIO;
 	}
@@ -1393,7 +1404,8 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 
 	CWARN("Abort reconnection of %s: %s\n",
 	      libcfs_nid2str(peer_ni->ibp_nid), reason);
-	kiblnd_txlist_done(&txs, -ECONNABORTED);
+	kiblnd_txlist_done(&txs, -ECONNABORTED,
+			   LNET_MSG_STATUS_LOCAL_ABORTED);
 	return false;
 }
 
@@ -1471,6 +1483,7 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 		if (tx) {
 			tx->tx_status = -EHOSTUNREACH;
 			tx->tx_waiting = 0;
+			tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_ERROR;
 			kiblnd_tx_done(tx);
 		}
 		return;
@@ -1607,6 +1620,7 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 		if (rc) {
 			CERROR("Can't setup GET sink for %s: %d\n",
 			       libcfs_nid2str(target.nid), rc);
+			tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_ERROR;
 			kiblnd_tx_done(tx);
 			return -EIO;
 		}
@@ -1757,6 +1771,7 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 	return;
 
 failed_1:
+	tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_ERROR;
 	kiblnd_tx_done(tx);
 failed_0:
 	lnet_finalize(lntmsg, -EIO);
@@ -1839,6 +1854,7 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 		if (rc) {
 			CERROR("Can't setup PUT sink for %s: %d\n",
 			       libcfs_nid2str(conn->ibc_peer->ibp_nid), rc);
+			tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_ERROR;
 			kiblnd_tx_done(tx);
 			/* tell peer_ni it's over */
 			kiblnd_send_completion(rx->rx_conn, IBLND_MSG_PUT_NAK,
@@ -2050,13 +2066,34 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 		if (txs == &conn->ibc_active_txs) {
 			LASSERT(!tx->tx_queued);
 			LASSERT(tx->tx_waiting || tx->tx_sending);
+			if (conn->ibc_comms_error == -ETIMEDOUT) {
+				if (tx->tx_waiting && !tx->tx_sending)
+					tx->tx_hstatus =
+					  LNET_MSG_STATUS_REMOTE_TIMEOUT;
+				else if (tx->tx_sending)
+					tx->tx_hstatus =
+					  LNET_MSG_STATUS_NETWORK_TIMEOUT;
+			}
 		} else {
 			LASSERT(tx->tx_queued);
+			if (conn->ibc_comms_error == -ETIMEDOUT)
+				tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_TIMEOUT;
+			else
+				tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_ERROR;
 		}
 
 		tx->tx_status = -ECONNABORTED;
 		tx->tx_waiting = 0;
 
+		/* TODO: This makes an assumption that
+		 * kiblnd_tx_complete() will be called for each tx. If
+		 * that event is dropped we could end up with stale
+		 * connections floating around. We'd like to deal with
+		 * that in a better way.
+		 *
+		 * Also that means we can exceed the timeout by many
+		 * seconds.
+		 */
 		if (!tx->tx_sending) {
 			tx->tx_queued = 0;
 			list_del(&tx->tx_list);
@@ -2066,7 +2103,10 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 
 	spin_unlock(&conn->ibc_lock);
 
-	kiblnd_txlist_done(&zombies, -ECONNABORTED);
+	/* aborting transmits occurs when finalizing the connection.
+	 * The connection is finalized on error
+	 */
+	kiblnd_txlist_done(&zombies, -ECONNABORTED, -1);
 }
 
 static void
@@ -2147,7 +2187,8 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 	CNETERR("Deleting messages for %s: connection failed\n",
 		libcfs_nid2str(peer_ni->ibp_nid));
 
-	kiblnd_txlist_done(&zombies, -EHOSTUNREACH);
+	kiblnd_txlist_done(&zombies, error,
+			   LNET_MSG_STATUS_LOCAL_DROPPED);
 }
 
 static void
@@ -2223,7 +2264,8 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 		kiblnd_close_conn_locked(conn, -ECONNABORTED);
 		write_unlock_irqrestore(&kiblnd_data.kib_global_lock, flags);
 
-		kiblnd_txlist_done(&txs, -ECONNABORTED);
+		kiblnd_txlist_done(&txs, -ECONNABORTED,
+				   LNET_MSG_STATUS_LOCAL_ERROR);
 
 		return;
 	}
@@ -3300,7 +3342,8 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 	write_unlock_irqrestore(&kiblnd_data.kib_global_lock, flags);
 
 	if (!list_empty(&timedout_txs))
-		kiblnd_txlist_done(&timedout_txs, -ETIMEDOUT);
+		kiblnd_txlist_done(&timedout_txs, -ETIMEDOUT,
+				   LNET_MSG_STATUS_LOCAL_TIMEOUT);
 
 	/*
 	 * Handle timeout by closing the whole
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 081/622] lnet: handle socklnd tx failure
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (79 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 080/622] lnet: handle o2iblnd tx failure James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 082/622] lnet: handle remote errors in LNet James Simmons
                   ` (541 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Update the socklnd to propagate the health status up to
LNet for handling.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9120
Lustre-commit: 25c1cb2c4d6f ("LU-9120 lnet: handle socklnd tx failure")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32766
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/socklnd/socklnd.h    |  1 +
 net/lnet/klnds/socklnd/socklnd_cb.c | 49 ++++++++++++++++++++++++++++++++++---
 2 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/net/lnet/klnds/socklnd/socklnd.h b/net/lnet/klnds/socklnd/socklnd.h
index 04381a0..48884cf 100644
--- a/net/lnet/klnds/socklnd/socklnd.h
+++ b/net/lnet/klnds/socklnd/socklnd.h
@@ -289,6 +289,7 @@ struct ksock_tx {				/* transmit packet */
 	time64_t		tx_deadline;	/* when (in secs) tx times out */
 	struct ksock_msg	tx_msg;		/* socklnd message buffer */
 	int			tx_desc_size;	/* size of this descriptor */
+	enum lnet_msg_hstatus	tx_hstatus;	/* health status of tx */
 	union {
 		struct {
 			struct kvec	iov;	/* virt hdr */
diff --git a/net/lnet/klnds/socklnd/socklnd_cb.c b/net/lnet/klnds/socklnd/socklnd_cb.c
index 5b75ea6..d50e0d2 100644
--- a/net/lnet/klnds/socklnd/socklnd_cb.c
+++ b/net/lnet/klnds/socklnd/socklnd_cb.c
@@ -56,6 +56,7 @@ struct ksock_tx *
 	tx->tx_zc_aborted = 0;
 	tx->tx_zc_capable = 0;
 	tx->tx_zc_checked = 0;
+	tx->tx_hstatus = LNET_MSG_STATUS_OK;
 	tx->tx_desc_size = size;
 
 	atomic_inc(&ksocknal_data.ksnd_nactive_txs);
@@ -328,18 +329,26 @@ struct ksock_tx *
 ksocknal_tx_done(struct lnet_ni *ni, struct ksock_tx *tx, int rc)
 {
 	struct lnet_msg *lnetmsg = tx->tx_lnetmsg;
+	enum lnet_msg_hstatus hstatus = tx->tx_hstatus;
 
 	LASSERT(ni || tx->tx_conn);
 
-	if (!rc && (tx->tx_resid != 0 || tx->tx_zc_aborted))
+	if (!rc && (tx->tx_resid != 0 || tx->tx_zc_aborted)) {
 		rc = -EIO;
+		hstatus = LNET_MSG_STATUS_LOCAL_ERROR;
+	}
 
 	if (tx->tx_conn)
 		ksocknal_conn_decref(tx->tx_conn);
 
 	ksocknal_free_tx(tx);
-	if (lnetmsg) /* KSOCK_MSG_NOOP go without lnetmsg */
+	if (lnetmsg) { /* KSOCK_MSG_NOOP go without lnetmsg */
+		if (rc)
+			CERROR("tx failure rc = %d, hstatus = %d\n", rc,
+			       hstatus);
+		lnetmsg->msg_health_status = hstatus;
 		lnet_finalize(lnetmsg, rc);
+	}
 }
 
 void
@@ -362,6 +371,20 @@ struct ksock_tx *
 
 		list_del(&tx->tx_list);
 
+		if (tx->tx_hstatus == LNET_MSG_STATUS_OK) {
+			if (error == -ETIMEDOUT)
+				tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_TIMEOUT;
+			else if (error == -ENETDOWN ||
+				 error == -EHOSTUNREACH ||
+				 error == -ENETUNREACH)
+				tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_DROPPED;
+			/* for all other errors we don't want to
+			 * retransmit
+			 */
+			else if (error)
+				tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_ERROR;
+		}
+
 		LASSERT(atomic_read(&tx->tx_refcount) == 1);
 		ksocknal_tx_done(ni, tx, error);
 	}
@@ -481,12 +504,25 @@ struct ksock_tx *
 			wake_up(&ksocknal_data.ksnd_reaper_waitq);
 
 		spin_unlock_bh(&ksocknal_data.ksnd_reaper_lock);
+
+		/* set the health status of the message which determines
+		 * whether we should retry the transmit
+		 */
+		tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_ERROR;
 		return rc;
 	}
 
 	/* Actual error */
 	LASSERT(rc < 0);
 
+	/* set the health status of the message which determines
+	 * whether we should retry the transmit
+	 */
+	if (rc == -ETIMEDOUT)
+		tx->tx_hstatus = LNET_MSG_STATUS_REMOTE_TIMEOUT;
+	else
+		tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_ERROR;
+
 	if (!conn->ksnc_closing) {
 		switch (rc) {
 		case -ECONNRESET:
@@ -509,7 +545,7 @@ struct ksock_tx *
 		ksocknal_uncheck_zc_req(tx);
 
 	/* it's not an error if conn is being closed */
-	ksocknal_close_conn_and_siblings(conn, (conn->ksnc_closing) ? 0 : rc);
+	ksocknal_close_conn_and_siblings(conn, conn->ksnc_closing ? 0 : rc);
 
 	return rc;
 }
@@ -2167,6 +2203,7 @@ void ksocknal_write_callback(struct ksock_conn *conn)
 {
 	/* We're called with a shared lock on ksnd_global_lock */
 	struct ksock_conn *conn;
+	struct ksock_tx *tx;
 
 	list_for_each_entry(conn, &peer_ni->ksnp_conns, ksnc_list) {
 		int error;
@@ -2229,6 +2266,10 @@ void ksocknal_write_callback(struct ksock_conn *conn)
 			 * buffered in the socket's send buffer
 			 */
 			ksocknal_conn_addref(conn);
+			list_for_each_entry(tx, &conn->ksnc_tx_queue,
+					    tx_list)
+				tx->tx_hstatus =
+					LNET_MSG_STATUS_LOCAL_TIMEOUT;
 			CNETERR("Timeout sending data to %s (%pI4h:%d) the network or that node may be down.\n",
 				libcfs_id2str(peer_ni->ksnp_id),
 				&conn->ksnc_ipaddr,
@@ -2255,6 +2296,8 @@ void ksocknal_write_callback(struct ksock_conn *conn)
 		if (ktime_get_seconds() < tx->tx_deadline)
 			break;
 
+		tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_TIMEOUT;
+
 		list_del(&tx->tx_list);
 		list_add_tail(&tx->tx_list, &stale_txs);
 	}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 082/622] lnet: handle remote errors in LNet
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (80 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 081/622] lnet: handle socklnd " James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 083/622] lnet: add retry count James Simmons
                   ` (540 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Add a health value to the peer NI structure. Decrement the
value whenever there is an error sending to the peer.
Modify the selection algorithm to look at the peer NI health
value when selecting the best peer NI to send to.

Put the peer NI on the recovery queue whenever there is
an error sending to it. Attempt a resend only on REMOTE
DROPPED, since we're sure the message was never received
by the peer. For other errors, finalize the message.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9120
Lustre-commit: 76fad19c2dea ("LU-9120 lnet: handle remote errors in LNet")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32767
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h  |   6 +
 include/linux/lnet/lib-types.h |  12 ++
 net/lnet/lnet/api-ni.c         |   1 +
 net/lnet/lnet/lib-move.c       | 311 +++++++++++++++++++++++++++++++++++------
 net/lnet/lnet/lib-msg.c        |  87 ++++++++++--
 net/lnet/lnet/peer.c           |   9 ++
 6 files changed, 368 insertions(+), 58 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index 965fc5f..b8ca114 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -894,6 +894,12 @@ int lnet_get_peer_ni_info(u32 peer_index, u64 *nid,
 	return false;
 }
 
+static inline void
+lnet_inc_healthv(atomic_t *healthv)
+{
+	atomic_add_unless(healthv, 1, LNET_MAX_HEALTH_VALUE);
+}
+
 void lnet_incr_stats(struct lnet_element_stats *stats,
 		     enum lnet_msg_type msg_type,
 		     enum lnet_stats_type stats_type);
diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index 8c3bf34..19b83a4 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -478,6 +478,8 @@ struct lnet_peer_ni {
 	struct list_head	 lpni_peer_nis;
 	/* chain on remote peer list */
 	struct list_head	 lpni_on_remote_peer_ni_list;
+	/* chain on recovery queue */
+	struct list_head	 lpni_recovery;
 	/* chain on peer hash */
 	struct list_head	 lpni_hashlist;
 	/* messages blocking for tx credits */
@@ -529,6 +531,10 @@ struct lnet_peer_ni {
 	lnet_nid_t		 lpni_nid;
 	/* # refs */
 	atomic_t		 lpni_refcount;
+	/* health value for the peer */
+	atomic_t		 lpni_healthv;
+	/* recovery ping mdh */
+	struct lnet_handle_md	 lpni_recovery_ping_mdh;
 	/* CPT this peer attached on */
 	int			 lpni_cpt;
 	/* state flags -- protected by lpni_lock */
@@ -558,6 +564,10 @@ struct lnet_peer_ni {
 
 /* Preferred path added due to traffic on non-MR peer_ni */
 #define LNET_PEER_NI_NON_MR_PREF	BIT(0)
+/* peer is being recovered. */
+#define LNET_PEER_NI_RECOVERY_PENDING	BIT(1)
+/* peer is being deleted */
+#define LNET_PEER_NI_DELETING		BIT(2)
 
 struct lnet_peer {
 	/* chain on pt_peer_list */
@@ -1088,6 +1098,8 @@ struct lnet {
 	struct list_head		**ln_mt_resendqs;
 	/* local NIs to recover */
 	struct list_head		ln_mt_localNIRecovq;
+	/* peer NIs to recover */
+	struct list_head		ln_mt_peerNIRecovq;
 	/* recovery eq handler */
 	struct lnet_handle_eq		ln_mt_eqh;
 
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index deef404..97d9be5 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -832,6 +832,7 @@ struct lnet_libhandle *
 	INIT_LIST_HEAD(&the_lnet.ln_dc_working);
 	INIT_LIST_HEAD(&the_lnet.ln_dc_expired);
 	INIT_LIST_HEAD(&the_lnet.ln_mt_localNIRecovq);
+	INIT_LIST_HEAD(&the_lnet.ln_mt_peerNIRecovq);
 	init_waitqueue_head(&the_lnet.ln_dc_waitq);
 
 	rc = lnet_descriptor_setup();
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index f3f4b84..5224490 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -1025,15 +1025,6 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	}
 
 	if (txpeer) {
-		/*
-		 * TODO:
-		 * Once the patch for the health comes in we need to set
-		 * the health of the peer ni to bad when we fail to send
-		 * a message.
-		 * int status = msg->msg_ev.status;
-		 * if (status != 0)
-		 *	lnet_set_peer_ni_health_locked(txpeer, false)
-		 */
 		msg->msg_txpeer = NULL;
 		lnet_peer_ni_decref_locked(txpeer);
 	}
@@ -1545,6 +1536,8 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	int best_lpni_credits = INT_MIN;
 	bool preferred = false;
 	bool ni_is_pref;
+	int best_lpni_healthv = 0;
+	int lpni_healthv;
 
 	while ((lpni = lnet_get_next_peer_ni_locked(peer, peer_net, lpni))) {
		/* if the best_ni we've chosen already has this lpni
@@ -1553,6 +1546,8 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 		ni_is_pref = lnet_peer_is_pref_nid_locked(lpni,
 							  best_ni->ni_nid);
 
+		lpni_healthv = atomic_read(&lpni->lpni_healthv);
+
 		CDEBUG(D_NET, "%s ni_is_pref = %d\n",
 		       libcfs_nid2str(best_ni->ni_nid), ni_is_pref);
 
@@ -1562,8 +1557,13 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 			       lpni->lpni_txcredits, best_lpni_credits,
 			       lpni->lpni_seq, best_lpni->lpni_seq);
 
+		/* pick the healthiest peer ni */
+		if (lpni_healthv < best_lpni_healthv) {
+			continue;
+		} else if (lpni_healthv > best_lpni_healthv) {
+			best_lpni_healthv = lpni_healthv;
 		/* if this is a preferred peer use it */
-		if (!preferred && ni_is_pref) {
+		} else if (!preferred && ni_is_pref) {
 			preferred = true;
 		} else if (preferred && !ni_is_pref) {
 			/*
@@ -2408,6 +2408,16 @@ struct lnet_ni *
 	return 0;
 }
 
+enum lnet_mt_event_type {
+	MT_TYPE_LOCAL_NI = 0,
+	MT_TYPE_PEER_NI
+};
+
+struct lnet_mt_event_info {
+	enum lnet_mt_event_type mt_type;
+	lnet_nid_t mt_nid;
+};
+
 static void
 lnet_resend_pending_msgs_locked(struct list_head *resendq, int cpt)
 {
@@ -2503,6 +2513,7 @@ struct lnet_ni *
 static void
 lnet_recover_local_nis(void)
 {
+	struct lnet_mt_event_info *ev_info;
 	struct list_head processed_list;
 	struct list_head local_queue;
 	struct lnet_handle_md mdh;
@@ -2550,15 +2561,24 @@ struct lnet_ni *
 		lnet_ni_unlock(ni);
 		lnet_net_unlock(0);
 
-		/* protect the ni->ni_state field. Once we call the
-		 * lnet_send_ping function it's possible we receive
-		 * a response before we check the rc. The lock ensures
-		 * a stable value for the ni_state RECOVERY_PENDING bit
-		 */
+		CDEBUG(D_NET, "attempting to recover local ni: %s\n",
+		       libcfs_nid2str(ni->ni_nid));
+
 		lnet_ni_lock(ni);
 		if (!(ni->ni_state & LNET_NI_STATE_RECOVERY_PENDING)) {
 			ni->ni_state |= LNET_NI_STATE_RECOVERY_PENDING;
 			lnet_ni_unlock(ni);
+
+			ev_info = kzalloc(sizeof(*ev_info), GFP_NOFS);
+			if (!ev_info) {
+				CERROR("out of memory. Can't recover %s\n",
+				       libcfs_nid2str(ni->ni_nid));
+				lnet_ni_lock(ni);
+				ni->ni_state &= ~LNET_NI_STATE_RECOVERY_PENDING;
+				lnet_ni_unlock(ni);
+				continue;
+			}
+
 			mdh = ni->ni_ping_mdh;
 			/* Invalidate the ni mdh in case it's deleted.
 			 * We'll unlink the mdh in this case below.
@@ -2587,9 +2607,10 @@ struct lnet_ni *
 			lnet_ni_decref_locked(ni, 0);
 			lnet_net_unlock(0);
 
-			rc = lnet_send_ping(nid, &mdh,
-					    LNET_INTERFACES_MIN, (void *)nid,
-					    the_lnet.ln_mt_eqh, true);
+			ev_info->mt_type = MT_TYPE_LOCAL_NI;
+			ev_info->mt_nid = nid;
+			rc = lnet_send_ping(nid, &mdh, LNET_INTERFACES_MIN,
+					    ev_info, the_lnet.ln_mt_eqh, true);
 			/* lookup the nid again */
 			lnet_net_lock(0);
 			ni = lnet_nid2ni_locked(nid, 0);
@@ -2694,6 +2715,44 @@ struct lnet_ni *
 }
 
 static void
+lnet_unlink_lpni_recovery_mdh_locked(struct lnet_peer_ni *lpni, int cpt)
+{
+	struct lnet_handle_md recovery_mdh;
+
+	LNetInvalidateMDHandle(&recovery_mdh);
+
+	if (lpni->lpni_state & LNET_PEER_NI_RECOVERY_PENDING) {
+		recovery_mdh = lpni->lpni_recovery_ping_mdh;
+		LNetInvalidateMDHandle(&lpni->lpni_recovery_ping_mdh);
+	}
+	spin_unlock(&lpni->lpni_lock);
+	lnet_net_unlock(cpt);
+	if (!LNetMDHandleIsInvalid(recovery_mdh))
+		LNetMDUnlink(recovery_mdh);
+	lnet_net_lock(cpt);
+	spin_lock(&lpni->lpni_lock);
+}
+
+static void
+lnet_clean_peer_ni_recoveryq(void)
+{
+	struct lnet_peer_ni *lpni, *tmp;
+
+	lnet_net_lock(LNET_LOCK_EX);
+
+	list_for_each_entry_safe(lpni, tmp, &the_lnet.ln_mt_peerNIRecovq,
+				 lpni_recovery) {
+		list_del_init(&lpni->lpni_recovery);
+		spin_lock(&lpni->lpni_lock);
+		lnet_unlink_lpni_recovery_mdh_locked(lpni, LNET_LOCK_EX);
+		spin_unlock(&lpni->lpni_lock);
+		lnet_peer_ni_decref_locked(lpni);
+	}
+
+	lnet_net_unlock(LNET_LOCK_EX);
+}
+
+static void
 lnet_clean_resendqs(void)
 {
 	struct lnet_msg *msg, *tmp;
@@ -2716,6 +2775,128 @@ struct lnet_ni *
 	cfs_percpt_free(the_lnet.ln_mt_resendqs);
 }
 
+static void
+lnet_recover_peer_nis(void)
+{
+	struct lnet_mt_event_info *ev_info;
+	struct list_head processed_list;
+	struct list_head local_queue;
+	struct lnet_handle_md mdh;
+	struct lnet_peer_ni *lpni;
+	struct lnet_peer_ni *tmp;
+	lnet_nid_t nid;
+	int healthv;
+	int rc;
+
+	INIT_LIST_HEAD(&local_queue);
+	INIT_LIST_HEAD(&processed_list);
+
+	/* Always use cpt 0 for locking across all interactions with
+	 * ln_mt_peerNIRecovq
+	 */
+	lnet_net_lock(0);
+	list_splice_init(&the_lnet.ln_mt_peerNIRecovq,
+			 &local_queue);
+	lnet_net_unlock(0);
+
+	list_for_each_entry_safe(lpni, tmp, &local_queue,
+				 lpni_recovery) {
+		/* The same protection strategy is used here as is in the
+		 * local recovery case.
+		 */
+		lnet_net_lock(0);
+		healthv = atomic_read(&lpni->lpni_healthv);
+		spin_lock(&lpni->lpni_lock);
+		if (lpni->lpni_state & LNET_PEER_NI_DELETING ||
+		    healthv == LNET_MAX_HEALTH_VALUE) {
+			list_del_init(&lpni->lpni_recovery);
+			lnet_unlink_lpni_recovery_mdh_locked(lpni, 0);
+			spin_unlock(&lpni->lpni_lock);
+			lnet_peer_ni_decref_locked(lpni);
+			lnet_net_unlock(0);
+			continue;
+		}
+		spin_unlock(&lpni->lpni_lock);
+		lnet_net_unlock(0);
+
+		/* NOTE: we're racing with peer deletion from user space.
+		 * It's possible that a peer is deleted after we check its
+		 * state. In this case the recovery can create a new peer
+		 */
+		spin_lock(&lpni->lpni_lock);
+		if (!(lpni->lpni_state & LNET_PEER_NI_RECOVERY_PENDING) &&
+		    !(lpni->lpni_state & LNET_PEER_NI_DELETING)) {
+			lpni->lpni_state |= LNET_PEER_NI_RECOVERY_PENDING;
+			spin_unlock(&lpni->lpni_lock);
+
+			ev_info = kzalloc(sizeof(*ev_info), GFP_NOFS);
+			if (!ev_info) {
+				CERROR("out of memory. Can't recover %s\n",
+				       libcfs_nid2str(lpni->lpni_nid));
+				spin_lock(&lpni->lpni_lock);
+				lpni->lpni_state &=
+					~LNET_PEER_NI_RECOVERY_PENDING;
+				spin_unlock(&lpni->lpni_lock);
+				continue;
+			}
+
+			/* look at the comments in lnet_recover_local_nis() */
+			mdh = lpni->lpni_recovery_ping_mdh;
+			LNetInvalidateMDHandle(&lpni->lpni_recovery_ping_mdh);
+			nid = lpni->lpni_nid;
+			lnet_net_lock(0);
+			list_del_init(&lpni->lpni_recovery);
+			lnet_peer_ni_decref_locked(lpni);
+			lnet_net_unlock(0);
+
+			ev_info->mt_type = MT_TYPE_PEER_NI;
+			ev_info->mt_nid = nid;
+			rc = lnet_send_ping(nid, &mdh, LNET_INTERFACES_MIN,
+					    ev_info, the_lnet.ln_mt_eqh, true);
+			lnet_net_lock(0);
+			/* lnet_find_peer_ni_locked() grabs a refcount for
+			 * us. No need to take it explicitly.
+			 */
+			lpni = lnet_find_peer_ni_locked(nid);
+			if (!lpni) {
+				lnet_net_unlock(0);
+				LNetMDUnlink(mdh);
+				continue;
+			}
+
+			lpni->lpni_recovery_ping_mdh = mdh;
+			/* While we're unlocked the lpni could've been
+			 * readded on the recovery queue. In this case we
+			 * don't need to add it to the local queue, since
+			 * it's already on there and the thread that added
+			 * it would've incremented the refcount on the
+			 * peer, which means we need to decref the refcount
+			 * that was implicitly grabbed by find_peer_ni_locked.
+			 * Otherwise, if the lpni is still not on
+			 * the recovery queue, then we'll add it to the
+			 * processed list.
+			 */
+			if (list_empty(&lpni->lpni_recovery))
+				list_add_tail(&lpni->lpni_recovery,
+					      &processed_list);
+			else
+				lnet_peer_ni_decref_locked(lpni);
+			lnet_net_unlock(0);
+
+			spin_lock(&lpni->lpni_lock);
+			if (rc)
+				lpni->lpni_state &=
+					~LNET_PEER_NI_RECOVERY_PENDING;
+		}
+		spin_unlock(&lpni->lpni_lock);
+	}
+
+	list_splice_init(&processed_list, &local_queue);
+	lnet_net_lock(0);
+	list_splice(&local_queue, &the_lnet.ln_mt_peerNIRecovq);
+	lnet_net_unlock(0);
+}
+
 static int
 lnet_monitor_thread(void *arg)
 {
@@ -2736,6 +2917,8 @@ struct lnet_ni *
 
 		lnet_recover_local_nis();
 
+		lnet_recover_peer_nis();
+
 		/* TODO do we need to check if we should sleep without
 		 * timeout?  Technically, an active system will always
 		 * have messages in flight so this check will always
@@ -2822,10 +3005,61 @@ struct lnet_ni *
 }
 
 static void
+lnet_handle_recovery_reply(struct lnet_mt_event_info *ev_info,
+			   int status)
+{
+	lnet_nid_t nid = ev_info->mt_nid;
+
+	if (ev_info->mt_type == MT_TYPE_LOCAL_NI) {
+		struct lnet_ni *ni;
+
+		lnet_net_lock(0);
+		ni = lnet_nid2ni_locked(nid, 0);
+		if (!ni) {
+			lnet_net_unlock(0);
+			return;
+		}
+		lnet_ni_lock(ni);
+		ni->ni_state &= ~LNET_NI_STATE_RECOVERY_PENDING;
+		lnet_ni_unlock(ni);
+		lnet_net_unlock(0);
+
+		if (status != 0) {
+			CERROR("local NI recovery failed with %d\n", status);
+			return;
+		}
+		/* need to increment healthv for the ni here, because in
+		 * the lnet_finalize() path we don't have access to this
+		 * NI. And in order to get access to it, we'll need to
+		 * carry forward too much information.
+		 * In the peer case, it'll naturally be incremented
+		 */
+		lnet_inc_healthv(&ni->ni_healthv);
+	} else {
+		struct lnet_peer_ni *lpni;
+		int cpt;
+
+		cpt = lnet_net_lock_current();
+		lpni = lnet_find_peer_ni_locked(nid);
+		if (!lpni) {
+			lnet_net_unlock(cpt);
+			return;
+		}
+		spin_lock(&lpni->lpni_lock);
+		lpni->lpni_state &= ~LNET_PEER_NI_RECOVERY_PENDING;
+		spin_unlock(&lpni->lpni_lock);
+		lnet_peer_ni_decref_locked(lpni);
+		lnet_net_unlock(cpt);
+
+		if (status != 0)
+			CERROR("peer NI recovery failed with %d\n", status);
+	}
+}
+
+static void
 lnet_mt_event_handler(struct lnet_event *event)
 {
-	lnet_nid_t nid = (lnet_nid_t)event->md.user_ptr;
-	struct lnet_ni *ni;
+	struct lnet_mt_event_info *ev_info = event->md.user_ptr;
 	struct lnet_ping_buffer *pbuf;
 
 	/* TODO: remove assert */
@@ -2837,37 +3071,25 @@ struct lnet_ni *
 	       event->status);
 
 	switch (event->type) {
+	case LNET_EVENT_UNLINK:
+		CDEBUG(D_NET, "%s recovery ping unlinked\n",
+		       libcfs_nid2str(ev_info->mt_nid));
+		/* fall-through */
 	case LNET_EVENT_REPLY:
-		/* If the NI has been restored completely then remove from
-		 * the recovery queue
-		 */
-		lnet_net_lock(0);
-		ni = lnet_nid2ni_locked(nid, 0);
-		if (!ni) {
-			lnet_net_unlock(0);
-			break;
-		}
-		lnet_ni_lock(ni);
-		ni->ni_state &= ~LNET_NI_STATE_RECOVERY_PENDING;
-		lnet_ni_unlock(ni);
-		lnet_net_unlock(0);
+		lnet_handle_recovery_reply(ev_info, event->status);
 		break;
 	case LNET_EVENT_SEND:
 		CDEBUG(D_NET, "%s recovery message sent %s:%d\n",
-		       libcfs_nid2str(nid),
+		       libcfs_nid2str(ev_info->mt_nid),
 		       (event->status) ? "unsuccessfully" :
 		       "successfully", event->status);
 		break;
-	case LNET_EVENT_UNLINK:
-		/* nothing to do */
-		CDEBUG(D_NET, "%s recovery ping unlinked\n",
-		       libcfs_nid2str(nid));
-		break;
 	default:
 		CERROR("Unexpected event: %d\n", event->type);
-		return;
+		break;
 	}
 	if (event->unlinked) {
+		kfree(ev_info);
 		pbuf = LNET_PING_INFO_TO_BUFFER(event->md.start);
 		lnet_ping_buffer_decref(pbuf);
 	}
@@ -2919,14 +3141,16 @@ int lnet_monitor_thr_start(void)
 	lnet_router_cleanup();
 free_mem:
 	the_lnet.ln_mt_state = LNET_MT_STATE_SHUTDOWN;
-	lnet_clean_resendqs();
 	lnet_clean_local_ni_recoveryq();
+	lnet_clean_peer_ni_recoveryq();
+	lnet_clean_resendqs();
 	LNetEQFree(the_lnet.ln_mt_eqh);
 	LNetInvalidateEQHandle(&the_lnet.ln_mt_eqh);
 	return rc;
 clean_queues:
-	lnet_clean_resendqs();
 	lnet_clean_local_ni_recoveryq();
+	lnet_clean_peer_ni_recoveryq();
+	lnet_clean_resendqs();
 	return rc;
 }
 
@@ -2949,8 +3173,9 @@ void lnet_monitor_thr_stop(void)
 
 	/* perform cleanup tasks */
 	lnet_router_cleanup();
-	lnet_clean_resendqs();
 	lnet_clean_local_ni_recoveryq();
+	lnet_clean_peer_ni_recoveryq();
+	lnet_clean_resendqs();
 	rc = LNetEQFree(the_lnet.ln_mt_eqh);
 	LASSERT(rc == 0);
 }
diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index e7f7469..046923b 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -482,12 +482,6 @@
 	}
 }
 
-static inline void
-lnet_inc_healthv(atomic_t *healthv)
-{
-	atomic_add_unless(healthv, 1, LNET_MAX_HEALTH_VALUE);
-}
-
 static void
 lnet_handle_local_failure(struct lnet_msg *msg)
 {
@@ -524,6 +518,43 @@
 	lnet_net_unlock(0);
 }
 
+static void
+lnet_handle_remote_failure(struct lnet_msg *msg)
+{
+	struct lnet_peer_ni *lpni;
+
+	lpni = msg->msg_txpeer;
+
+	/* lpni could be NULL if we're in the LOLND case */
+	if (!lpni)
+		return;
+
+	lnet_net_lock(0);
+	/* the mt could've shutdown and cleaned up the queues */
+	if (the_lnet.ln_mt_state != LNET_MT_STATE_RUNNING) {
+		lnet_net_unlock(0);
+		return;
+	}
+
+	lnet_dec_healthv_locked(&lpni->lpni_healthv);
+	/* add the peer NI to the recovery queue if it's not already there
+	 * and it's health value is actually below the maximum. It's
+	 * possible that the sensitivity might be set to 0, and the health
+	 * value will not be reduced. In this case, there is no reason to
+	 * invoke recovery
+	 */
+	if (list_empty(&lpni->lpni_recovery) &&
+	    atomic_read(&lpni->lpni_healthv) < LNET_MAX_HEALTH_VALUE) {
+		CERROR("lpni %s added to recovery queue. Health = %d\n",
+		       libcfs_nid2str(lpni->lpni_nid),
+		       atomic_read(&lpni->lpni_healthv));
+		list_add_tail(&lpni->lpni_recovery,
+			      &the_lnet.ln_mt_peerNIRecovq);
+		lnet_peer_ni_addref_locked(lpni);
+	}
+	lnet_net_unlock(0);
+}
+
 /* Do a health check on the message:
  * return -1 if we're not going to handle the error
  *   success case will return -1 as well
@@ -533,11 +564,20 @@
 lnet_health_check(struct lnet_msg *msg)
 {
 	enum lnet_msg_hstatus hstatus = msg->msg_health_status;
+	bool lo = false;
 
 	/* TODO: lnet_incr_hstats(hstatus); */
 
 	LASSERT(msg->msg_txni);
 
+	/* if we're sending to the LOLND then the msg_txpeer will not be
+	 * set. So no need to sanity check it.
+	 */
+	if (LNET_NETTYP(LNET_NIDNET(msg->msg_txni->ni_nid)) != LOLND)
+		LASSERT(msg->msg_txpeer);
+	else
+		lo = true;
+
 	if (hstatus != LNET_MSG_STATUS_OK &&
 	    ktime_compare(ktime_get(), msg->msg_deadline) >= 0)
 		return -1;
@@ -546,9 +586,21 @@
 	if (the_lnet.ln_state != LNET_STATE_RUNNING)
 		return -1;
 
+	CDEBUG(D_NET, "health check: %s->%s: %s: %s\n",
+	       libcfs_nid2str(msg->msg_txni->ni_nid),
+	       (lo) ? "self" : libcfs_nid2str(msg->msg_txpeer->lpni_nid),
+	       lnet_msgtyp2str(msg->msg_type),
+	       lnet_health_error2str(hstatus));
+
 	switch (hstatus) {
 	case LNET_MSG_STATUS_OK:
 		lnet_inc_healthv(&msg->msg_txni->ni_healthv);
+		/* It's possible msg_txpeer is NULL in the LOLND
+		 * case.
+		 */
+		if (msg->msg_txpeer)
+			lnet_inc_healthv(&msg->msg_txpeer->lpni_healthv);
+
 		/* we can finalize this message */
 		return -1;
 	case LNET_MSG_STATUS_LOCAL_INTERRUPT:
@@ -560,22 +612,27 @@
 		/* add to the re-send queue */
 		goto resend;
 
-		/* TODO: since the remote dropped the message we can
-		 * attempt a resend safely.
-		 */
-	case LNET_MSG_STATUS_REMOTE_DROPPED:
-		break;
-
-		/* These errors will not trigger a resend so simply
-		 * finalize the message
-		 */
+	/* These errors will not trigger a resend so simply
+	 * finalize the message
+	 */
 	case LNET_MSG_STATUS_LOCAL_ERROR:
 		lnet_handle_local_failure(msg);
 		return -1;
+
+	/* TODO: since the remote dropped the message we can
+	 * attempt a resend safely.
+	 */
+	case LNET_MSG_STATUS_REMOTE_DROPPED:
+		lnet_handle_remote_failure(msg);
+		goto resend;
+
 	case LNET_MSG_STATUS_REMOTE_ERROR:
 	case LNET_MSG_STATUS_REMOTE_TIMEOUT:
 	case LNET_MSG_STATUS_NETWORK_TIMEOUT:
+		lnet_handle_remote_failure(msg);
 		return -1;
+	default:
+		LBUG();
 	}
 
 resend:
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 121876e..4a62f9a 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -124,6 +124,7 @@
 	INIT_LIST_HEAD(&lpni->lpni_routes);
 	INIT_LIST_HEAD(&lpni->lpni_hashlist);
 	INIT_LIST_HEAD(&lpni->lpni_peer_nis);
+	INIT_LIST_HEAD(&lpni->lpni_recovery);
 	INIT_LIST_HEAD(&lpni->lpni_on_remote_peer_ni_list);
 
 	spin_lock_init(&lpni->lpni_lock);
@@ -133,6 +134,7 @@
 	lpni->lpni_ping_feats = LNET_PING_FEAT_INVAL;
 	lpni->lpni_nid = nid;
 	lpni->lpni_cpt = cpt;
+	atomic_set(&lpni->lpni_healthv, LNET_MAX_HEALTH_VALUE);
 	lnet_set_peer_ni_health_locked(lpni, true);
 
 	net = lnet_get_net_locked(LNET_NIDNET(nid));
@@ -331,6 +333,13 @@
 	/* remove peer ni from the hash list. */
 	list_del_init(&lpni->lpni_hashlist);
 
+	/* indicate the peer is being deleted so the monitor thread can
+	 * remove it from the recovery queue.
+	 */
+	spin_lock(&lpni->lpni_lock);
+	lpni->lpni_state |= LNET_PEER_NI_DELETING;
+	spin_unlock(&lpni->lpni_lock);
+
 	/* decrement the ref count on the peer table */
 	ptable = the_lnet.ln_peer_tables[lpni->lpni_cpt];
 	LASSERT(atomic_read(&ptable->pt_number) > 0);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 083/622] lnet: add retry count
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (81 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 082/622] lnet: handle remote errors in LNet James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 084/622] lnet: calculate the lnd timeout James Simmons
                   ` (539 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Added a module parameter that defines the number of times a message
may be retried. It defaults to 0, which means no retries will be
attempted. Each message keeps track of the number of times it has
been retransmitted; when a message is queued on the resend queue,
its retry count is checked, and if the limit has been exceeded the
message is finalized.
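
The resend-time check described above can be sketched in plain C. This is a
minimal userspace model with hypothetical names (`retry_count_limit`,
`health_check_resend`); the actual patch operates on `struct lnet_msg` against
the `lnet_retry_count` module parameter.

```c
#include <assert.h>

/* Global retry limit; 0 means resends are disabled (the default). */
static unsigned int retry_count_limit;

struct msg {
	int retry_count;	/* times this message has been retransmitted */
	int no_resend;		/* set when the message must not be resent */
};

/* Return -1 to finalize the message, 0 to requeue it for resend. */
static int health_check_resend(struct msg *m)
{
	if (m->no_resend)
		return -1;

	/* finalize once the retry budget is exhausted */
	if (m->retry_count >= retry_count_limit)
		return -1;

	m->retry_count++;
	return 0;
}
```

With the default limit of 0, the first failed transmission already exceeds the
budget, so every message is finalized on its first error.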

WC-bug-id: https://jira.whamcloud.com/browse/LU-9120
Lustre-commit: 20e23980eae2 ("LU-9120 lnet: add retry count")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32769
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h  | 1 +
 include/linux/lnet/lib-types.h | 2 ++
 net/lnet/lnet/api-ni.c         | 5 +++++
 net/lnet/lnet/lib-msg.c        | 8 +++++++-
 4 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index b8ca114..ace0d51 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -478,6 +478,7 @@ struct lnet_ni *
 struct lnet_net *lnet_get_net_locked(u32 net_id);
 
 extern unsigned int lnet_transaction_timeout;
+extern unsigned int lnet_retry_count;
 extern unsigned int lnet_numa_range;
 extern unsigned int lnet_health_sensitivity;
 extern unsigned int lnet_peer_discovery_disabled;
diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index 19b83a4..1108e3b 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -103,6 +103,8 @@ struct lnet_msg {
 	enum lnet_msg_hstatus	msg_health_status;
 	/* This is a recovery message */
 	bool			msg_recovery;
+	/* the number of times a transmission has been retried */
+	int			msg_retry_count;
 	/* flag to indicate that we do not want to resend this message */
 	bool			msg_no_resend;
 
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 97d9be5..a54fe2c 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -116,6 +116,11 @@ struct lnet the_lnet = {
 MODULE_PARM_DESC(lnet_transaction_timeout,
 		 "Time in seconds to wait for a REPLY or an ACK");
 
+unsigned int lnet_retry_count;
+module_param(lnet_retry_count, uint, 0444);
+MODULE_PARM_DESC(lnet_retry_count,
+		 "Maximum number of times to retry transmitting a message");
+
 /*
  * This sequence number keeps track of how many times DLC was used to
  * update the local NIs. It is incremented when a NI is added or
diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index 046923b..9841e14 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -556,7 +556,8 @@
 }
 
 /* Do a health check on the message:
- * return -1 if we're not going to handle the error
+ * return -1 if we're not going to handle the error or
+ *   if we've reached the maximum number of retries.
  *   success case will return -1 as well
  * return 0 if it the message is requeued for send
  */
@@ -646,6 +647,11 @@
 	if (msg->msg_no_resend)
 		return -1;
 
+	/* check if the message has exceeded the number of retries */
+	if (msg->msg_retry_count >= lnet_retry_count)
+		return -1;
+	msg->msg_retry_count++;
+
 	lnet_net_lock(msg->msg_tx_cpt);
 
 	/* remove message from the active list and reset it in preparation
-- 
1.8.3.1


* [lustre-devel] [PATCH 084/622] lnet: calculate the lnd timeout
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (82 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 083/622] lnet: add retry count James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 085/622] lnet: sysfs functions for module params James Simmons
                   ` (538 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Calculate the LND timeout based on the transaction timeout and the
retry count. Both of these are user-defined values; whenever either
is set, the LND timeout is recalculated. The LNDs use this derived
timeout instead of their own timeout module parameters.

Retry count can be set to 0, which means no retries. In that case the
LND timeout will default to 5 seconds, which is the same as the
default transaction timeout.
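
The derivation described above is a simple division, sketched here as a pure
function (the function name is hypothetical; the patch stores the result in
the `lnet_lnd_timeout` global):

```c
#include <assert.h>

/* Matches the default transaction timeout of 5 seconds. */
#define LND_DEFAULT_TIMEOUT 5

/* Split the transaction timeout budget across the allowed retries.
 * With retries disabled (retry_count == 0) a single attempt gets the
 * whole budget, so the LND timeout equals the transaction timeout.
 */
static unsigned int lnd_timeout(unsigned int transaction_timeout,
				unsigned int retry_count)
{
	if (retry_count == 0)
		return transaction_timeout;
	return transaction_timeout / retry_count;
}
```

For example, a 50-second transaction timeout with 10 retries gives each
transmission attempt a 5-second LND timeout.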

WC-bug-id: https://jira.whamcloud.com/browse/LU-9120
Lustre-commit: 84f3af43c4bd ("LU-9120 lnet: calculate the lnd timeout")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32770
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h       |  2 ++
 net/lnet/klnds/o2iblnd/o2iblnd_cb.c | 20 +++++++++++---------
 net/lnet/klnds/socklnd/socklnd.c    |  6 +++---
 net/lnet/klnds/socklnd/socklnd_cb.c | 22 ++++++++++++----------
 net/lnet/lnet/api-ni.c              |  9 +++++++++
 5 files changed, 37 insertions(+), 22 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index ace0d51..5500e3f 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -85,6 +85,7 @@
 extern struct kmem_cache *lnet_small_mds_cachep; /* <= LNET_SMALL_MD_SIZE bytes
 						  * MDs kmem_cache
 						  */
+#define LNET_LND_DEFAULT_TIMEOUT 5
 
 static inline int lnet_is_route_alive(struct lnet_route *route)
 {
@@ -676,6 +677,7 @@ void lnet_copy_kiov2iter(struct iov_iter *to,
 struct page *lnet_kvaddr_to_page(unsigned long vaddr);
 int lnet_cpt_of_md(struct lnet_libmd *md, unsigned int offset);
 
+unsigned int lnet_get_lnd_timeout(void);
 void lnet_register_lnd(struct lnet_lnd *lnd);
 void lnet_unregister_lnd(struct lnet_lnd *lnd);
 
diff --git a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
index 007058a..c6e8e73 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -1205,7 +1205,7 @@ static int kiblnd_map_tx(struct lnet_ni *ni, struct kib_tx *tx,
 	LASSERT(!tx->tx_queued);	/* not queued for sending already */
 	LASSERT(conn->ibc_state >= IBLND_CONN_ESTABLISHED);
 
-	timeout_ns = *kiblnd_tunables.kib_timeout * NSEC_PER_SEC;
+	timeout_ns = lnet_get_lnd_timeout() * NSEC_PER_SEC;
 	tx->tx_queued = 1;
 	tx->tx_deadline = ktime_add_ns(ktime_get(), timeout_ns);
 
@@ -1333,14 +1333,14 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 
 	if (*kiblnd_tunables.kib_use_priv_port) {
 		rc = kiblnd_resolve_addr(cmid, &srcaddr, &dstaddr,
-					 *kiblnd_tunables.kib_timeout * 1000);
+					 lnet_get_lnd_timeout() * 1000);
 	} else {
 		rc = rdma_resolve_addr(cmid,
 				       (struct sockaddr *)&srcaddr,
 				       (struct sockaddr *)&dstaddr,
-				       *kiblnd_tunables.kib_timeout * 1000);
+				       lnet_get_lnd_timeout() * 1000);
 	}
-	if (rc) {
+	if (rc != 0) {
 		/* Can't initiate address resolution:  */
 		CERROR("Can't resolve addr for %s: %d\n",
 		       libcfs_nid2str(peer_ni->ibp_nid), rc);
@@ -3097,8 +3097,8 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 				event->status);
 			rc = event->status;
 		} else {
-			rc = rdma_resolve_route(
-				cmid, *kiblnd_tunables.kib_timeout * 1000);
+			rc = rdma_resolve_route(cmid,
+						lnet_get_lnd_timeout() * 1000);
 			if (!rc) {
 				struct kib_net *net = peer_ni->ibp_ni->ni_data;
 				struct kib_dev *dev = net->ibn_dev;
@@ -3499,6 +3499,7 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 			const int n = 4;
 			const int p = 1;
 			int chunk = kiblnd_data.kib_peer_hash_size;
+			unsigned int lnd_timeout;
 
 			spin_unlock_irqrestore(lock, flags);
 			dropped_lock = 1;
@@ -3512,9 +3513,10 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 			 * connection within (n+1)/n times the timeout
 			 * interval.
 			 */
-			if (*kiblnd_tunables.kib_timeout > n * p)
-				chunk = (chunk * n * p) /
-					*kiblnd_tunables.kib_timeout;
+
+			lnd_timeout = lnet_get_lnd_timeout();
+			if (lnd_timeout > n * p)
+				chunk = (chunk * n * p) / lnd_timeout;
 			if (!chunk)
 				chunk = 1;
 
diff --git a/net/lnet/klnds/socklnd/socklnd.c b/net/lnet/klnds/socklnd/socklnd.c
index 03fa706..891d3bd 100644
--- a/net/lnet/klnds/socklnd/socklnd.c
+++ b/net/lnet/klnds/socklnd/socklnd.c
@@ -1284,7 +1284,7 @@ struct ksock_peer *
 	/* Set the deadline for the outgoing HELLO to drain */
 	conn->ksnc_tx_bufnob = sock->sk->sk_wmem_queued;
 	conn->ksnc_tx_deadline = ktime_get_seconds() +
-				 *ksocknal_tunables.ksnd_timeout;
+				 lnet_get_lnd_timeout();
 	mb();   /* order with adding to peer_ni's conn list */
 
 	list_add(&conn->ksnc_list, &peer_ni->ksnp_conns);
@@ -1674,7 +1674,7 @@ struct ksock_peer *
 	switch (conn->ksnc_rx_state) {
 	case SOCKNAL_RX_LNET_PAYLOAD:
 		last_rcv = conn->ksnc_rx_deadline -
-			   *ksocknal_tunables.ksnd_timeout;
+			   lnet_get_lnd_timeout();
 		CERROR("Completing partial receive from %s[%d], ip %pI4h:%d, with error, wanted: %zd, left: %d, last alive is %lld secs ago\n",
 		       libcfs_id2str(conn->ksnc_peer->ksnp_id), conn->ksnc_type,
 		       &conn->ksnc_ipaddr, conn->ksnc_port,
@@ -1849,7 +1849,7 @@ struct ksock_peer *
 			if (bufnob < conn->ksnc_tx_bufnob) {
 				/* something got ACKed */
 				conn->ksnc_tx_deadline = ktime_get_seconds() +
-							 *ksocknal_tunables.ksnd_timeout;
+							 lnet_get_lnd_timeout();
 				peer_ni->ksnp_last_alive = now;
 				conn->ksnc_tx_bufnob = bufnob;
 			}
diff --git a/net/lnet/klnds/socklnd/socklnd_cb.c b/net/lnet/klnds/socklnd/socklnd_cb.c
index d50e0d2..8bc23d2 100644
--- a/net/lnet/klnds/socklnd/socklnd_cb.c
+++ b/net/lnet/klnds/socklnd/socklnd_cb.c
@@ -222,7 +222,7 @@ struct ksock_tx *
 			 * something got ACKed
 			 */
 			conn->ksnc_tx_deadline = ktime_get_seconds() +
-						 *ksocknal_tunables.ksnd_timeout;
+						 lnet_get_lnd_timeout();
 			conn->ksnc_peer->ksnp_last_alive = ktime_get_seconds();
 			conn->ksnc_tx_bufnob = bufnob;
 			mb();
@@ -268,7 +268,7 @@ struct ksock_tx *
 
 	conn->ksnc_peer->ksnp_last_alive = ktime_get_seconds();
 	conn->ksnc_rx_deadline = ktime_get_seconds() +
-				 *ksocknal_tunables.ksnd_timeout;
+				 lnet_get_lnd_timeout();
 	mb();		/* order with setting rx_started */
 	conn->ksnc_rx_started = 1;
 
@@ -423,7 +423,7 @@ struct ksock_tx *
 
 	/* ZC_REQ is going to be pinned to the peer_ni */
 	tx->tx_deadline = ktime_get_seconds() +
-			  *ksocknal_tunables.ksnd_timeout;
+			  lnet_get_lnd_timeout();
 
 	LASSERT(!tx->tx_msg.ksm_zc_cookies[0]);
 
@@ -705,7 +705,7 @@ struct ksock_conn *
 	if (list_empty(&conn->ksnc_tx_queue) && !bufnob) {
 		/* First packet starts the timeout */
 		conn->ksnc_tx_deadline = ktime_get_seconds() +
-					 *ksocknal_tunables.ksnd_timeout;
+					 lnet_get_lnd_timeout();
 		if (conn->ksnc_tx_bufnob > 0) /* something got ACKed */
 			conn->ksnc_peer->ksnp_last_alive = ktime_get_seconds();
 		conn->ksnc_tx_bufnob = 0;
@@ -881,7 +881,7 @@ struct ksock_route *
 	    ksocknal_find_connecting_route_locked(peer_ni)) {
 		/* the message is going to be pinned to the peer_ni */
 		tx->tx_deadline = ktime_get_seconds() +
-				  *ksocknal_tunables.ksnd_timeout;
+				  lnet_get_lnd_timeout();
 
 		/* Queue the message until a connection is established */
 		list_add_tail(&tx->tx_list, &peer_ni->ksnp_tx_queue);
@@ -1663,7 +1663,7 @@ void ksocknal_write_callback(struct ksock_conn *conn)
 	/* socket type set on active connections - not set on passive */
 	LASSERT(!active == !(conn->ksnc_type != SOCKLND_CONN_NONE));
 
-	timeout = active ? *ksocknal_tunables.ksnd_timeout :
+	timeout = active ? lnet_get_lnd_timeout() :
 			    lnet_acceptor_timeout();
 
 	rc = lnet_sock_read(sock, &hello->kshm_magic,
@@ -1801,7 +1801,7 @@ void ksocknal_write_callback(struct ksock_conn *conn)
 	int retry_later = 0;
 	int rc = 0;
 
-	deadline = ktime_get_seconds() + *ksocknal_tunables.ksnd_timeout;
+	deadline = ktime_get_seconds() + lnet_get_lnd_timeout();
 
 	write_lock_bh(&ksocknal_data.ksnd_global_lock);
 
@@ -2552,6 +2552,7 @@ void ksocknal_write_callback(struct ksock_conn *conn)
 			const int n = 4;
 			const int p = 1;
 			int chunk = ksocknal_data.ksnd_peer_hash_size;
+			unsigned int lnd_timeout;
 
 			/*
 			 * Time to check for timeouts on a few more peers: I do
@@ -2561,9 +2562,10 @@ void ksocknal_write_callback(struct ksock_conn *conn)
 			 * timeout on any connection within (n+1)/n times the
 			 * timeout interval.
 			 */
-			if (*ksocknal_tunables.ksnd_timeout > n * p)
-				chunk = (chunk * n * p) /
-					*ksocknal_tunables.ksnd_timeout;
+
+			lnd_timeout = lnet_get_lnd_timeout();
+			if (lnd_timeout > n * p)
+				chunk = (chunk * n * p) / lnd_timeout;
 			if (!chunk)
 				chunk = 1;
 
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index a54fe2c..e467d64 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -121,6 +121,8 @@ struct lnet the_lnet = {
 MODULE_PARM_DESC(lnet_retry_count,
 		 "Maximum number of times to retry transmitting a message");
 
+unsigned int lnet_lnd_timeout = LNET_LND_DEFAULT_TIMEOUT;
+
 /*
  * This sequence number keeps track of how many times DLC was used to
  * update the local NIs. It is incremented when a NI is added or
@@ -570,6 +572,13 @@ static void lnet_assert_wire_constants(void)
 	return NULL;
 }
 
+unsigned int
+lnet_get_lnd_timeout(void)
+{
+	return lnet_lnd_timeout;
+}
+EXPORT_SYMBOL(lnet_get_lnd_timeout);
+
 void
 lnet_register_lnd(struct lnet_lnd *lnd)
 {
-- 
1.8.3.1


* [lustre-devel] [PATCH 085/622] lnet: sysfs functions for module params
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (83 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 084/622] lnet: calculate the lnd timeout James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 086/622] lnet: timeout delayed REPLYs and ACKs James Simmons
                   ` (537 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Allow transaction timeout and retry count module parameters to be
set and shown via sysfs.
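
The two setters enforce a mutual constraint (the timeout must exceed the retry
count) and keep the LND timeout in sync. A userspace sketch of that validation
logic, with hypothetical names standing in for `transaction_to_set()` and
`retry_count_set()` and without the api_mutex locking the real code needs:

```c
#include <assert.h>
#include <errno.h>

static unsigned int transaction_timeout = 5;
static unsigned int retry_count;
static unsigned int lnd_timeout_v = 5;

static void recompute_lnd_timeout(void)
{
	lnd_timeout_v = retry_count ? transaction_timeout / retry_count
				    : transaction_timeout;
}

/* Mirrors transaction_to_set(): reject 0 and any value smaller than
 * the current retry count. */
static int set_transaction_timeout(unsigned int value)
{
	if (value == 0 || value < retry_count)
		return -EINVAL;
	transaction_timeout = value;
	recompute_lnd_timeout();
	return 0;
}

/* Mirrors retry_count_set(): must not exceed the transaction timeout. */
static int set_retry_count(unsigned int value)
{
	if (value > transaction_timeout)
		return -EINVAL;
	retry_count = value;
	recompute_lnd_timeout();
	return 0;
}
```

Either setter rejects a value that would invert the relationship, so the pair
can never be driven into an inconsistent state regardless of update order.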

WC-bug-id: https://jira.whamcloud.com/browse/LU-9120
Lustre-commit: 5169827bf790 ("LU-9120 lnet: sysfs functions for module params")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32861
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/api-ni.c | 84 +++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 77 insertions(+), 7 deletions(-)

diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index e467d64..38e35bb 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -111,13 +111,27 @@ struct lnet the_lnet = {
 
 unsigned int lnet_transaction_timeout = 5;
 static int transaction_to_set(const char *val, const struct kernel_param *kp);
-module_param_call(lnet_transaction_timeout, transaction_to_set, param_get_int,
-		  &lnet_transaction_timeout, 0444);
+static struct kernel_param_ops param_ops_transaction_timeout = {
+	.set = transaction_to_set,
+	.get = param_get_int,
+};
+
+#define param_check_transaction_timeout(name, p) \
+		__param_check(name, p, int)
+module_param(lnet_transaction_timeout, transaction_timeout, 0644);
 MODULE_PARM_DESC(lnet_transaction_timeout,
-		 "Time in seconds to wait for a REPLY or an ACK");
+		 "Maximum number of seconds to wait for a peer response.");
 
 unsigned int lnet_retry_count;
-module_param(lnet_retry_count, uint, 0444);
+static int retry_count_set(const char *val, const struct kernel_param *kp);
+static struct kernel_param_ops param_ops_retry_count = {
+	.set = retry_count_set,
+	.get = param_get_int,
+};
+
+#define param_check_retry_count(name, p) \
+		__param_check(name, p, int)
+module_param(lnet_retry_count, retry_count, 0644);
 MODULE_PARM_DESC(lnet_retry_count,
 		 "Maximum number of times to retry transmitting a message");
 
@@ -241,10 +255,15 @@ static int lnet_discover(struct lnet_process_id id, u32 force,
 	 */
 	mutex_lock(&the_lnet.ln_api_mutex);
 
-	if (value == 0) {
+	if (the_lnet.ln_state != LNET_STATE_RUNNING) {
+		mutex_unlock(&the_lnet.ln_api_mutex);
+		return 0;
+	}
+
+	if (value < lnet_retry_count || value == 0) {
 		mutex_unlock(&the_lnet.ln_api_mutex);
-		CERROR("Invalid value for lnet_transaction_timeout (%lu).\n",
-		       value);
+		CERROR("Invalid value for lnet_transaction_timeout (%lu). Has to be greater than lnet_retry_count (%u)\n",
+		       value, lnet_retry_count);
 		return -EINVAL;
 	}
 
@@ -254,6 +273,57 @@ static int lnet_discover(struct lnet_process_id id, u32 force,
 	}
 
 	*transaction_to = value;
+	if (lnet_retry_count == 0)
+		lnet_lnd_timeout = value;
+	else
+		lnet_lnd_timeout = value / lnet_retry_count;
+
+	mutex_unlock(&the_lnet.ln_api_mutex);
+
+	return 0;
+}
+
+static int
+retry_count_set(const char *val, const struct kernel_param *kp)
+{
+	int rc;
+	unsigned int *retry_count = (unsigned int *)kp->arg;
+	unsigned long value;
+
+	rc = kstrtoul(val, 0, &value);
+	if (rc) {
+		CERROR("Invalid module parameter value for 'lnet_retry_count'\n");
+		return rc;
+	}
+
+	/* The purpose of locking the api_mutex here is to ensure that
+	 * the correct value ends up stored properly.
+	 */
+	mutex_lock(&the_lnet.ln_api_mutex);
+
+	if (the_lnet.ln_state != LNET_STATE_RUNNING) {
+		mutex_unlock(&the_lnet.ln_api_mutex);
+		return 0;
+	}
+
+	if (value > lnet_transaction_timeout) {
+		mutex_unlock(&the_lnet.ln_api_mutex);
+		CERROR("Invalid value for lnet_retry_count (%lu). Has to be smaller than lnet_transaction_timeout (%u)\n",
+		       value, lnet_transaction_timeout);
+		return -EINVAL;
+	}
+
+	if (value == *retry_count) {
+		mutex_unlock(&the_lnet.ln_api_mutex);
+		return 0;
+	}
+
+	*retry_count = value;
+
+	if (value == 0)
+		lnet_lnd_timeout = lnet_transaction_timeout;
+	else
+		lnet_lnd_timeout = lnet_transaction_timeout / value;
 
 	mutex_unlock(&the_lnet.ln_api_mutex);
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 086/622] lnet: timeout delayed REPLYs and ACKs
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (84 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 085/622] lnet: sysfs functions for module params James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 087/622] lnet: remove duplicate timeout mechanism James Simmons
                   ` (536 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

When a GET, or a PUT that requires an ACK, is sent, a response
tracker block is added to a per-CPT queue. When the REPLY/ACK is
received, the block is removed from that queue. The monitor thread
wakes up periodically to check whether any of the blocks have
expired; if so, it sends a timeout event to the ULP, flags the MD
as stale, and unlinks it.
WC-bug-id: https://jira.whamcloud.com/browse/LU-9120
Lustre-commit: a57fa1176e74 ("LU-9120 lnet: timeout delayed REPLYs and ACKs")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32771
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h  |  20 ++++
 include/linux/lnet/lib-types.h |  20 ++++
 net/lnet/lnet/lib-move.c       | 210 ++++++++++++++++++++++++++++++++++++++++-
 net/lnet/lnet/lib-msg.c        |   9 ++
 4 files changed, 258 insertions(+), 1 deletion(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index 5500e3f..c2191e5 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -438,6 +438,25 @@ void lnet_res_lh_initialize(struct lnet_res_container *rec,
 	lnet_net_unlock(0);
 }
 
+static inline struct lnet_rsp_tracker *
+lnet_rspt_alloc(int cpt)
+{
+	struct lnet_rsp_tracker *rspt;
+
+	rspt = kzalloc(sizeof(*rspt), GFP_NOFS);
+	lnet_net_lock(cpt);
+	lnet_net_unlock(cpt);
+	return rspt;
+}
+
+static inline void
+lnet_rspt_free(struct lnet_rsp_tracker *rspt, int cpt)
+{
+	kfree(rspt);
+	lnet_net_lock(cpt);
+	lnet_net_unlock(cpt);
+}
+
 void lnet_ni_free(struct lnet_ni *ni);
 void lnet_net_free(struct lnet_net *net);
 
@@ -614,6 +633,7 @@ struct lnet_msg *lnet_create_reply_msg(struct lnet_ni *ni,
 				       struct lnet_msg *get_msg);
 void lnet_set_reply_msg_len(struct lnet_ni *ni, struct lnet_msg *msg,
 			    unsigned int len);
+void lnet_detach_rsp_tracker(struct lnet_libmd *md, int cpt);
 
 void lnet_finalize(struct lnet_msg *msg, int rc);
 
diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index 1108e3b..d815a87 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -75,6 +75,17 @@ enum lnet_msg_hstatus {
 	LNET_MSG_STATUS_NETWORK_TIMEOUT
 };
 
+struct lnet_rsp_tracker {
+	/* chain on the waiting list */
+	struct list_head rspt_on_list;
+	/* cpt to lock */
+	int rspt_cpt;
+	/* deadline of the REPLY/ACK */
+	ktime_t rspt_deadline;
+	/* parent MD */
+	struct lnet_handle_md rspt_mdh;
+};
+
 struct lnet_msg {
 	struct list_head	msg_activelist;
 	struct list_head	msg_list;	/* Q for credits/MD */
@@ -201,6 +212,7 @@ struct lnet_libmd {
 	unsigned int		 md_flags;
 	unsigned int		 md_niov;	/* # frags at end of struct */
 	void			*md_user_ptr;
+	struct lnet_rsp_tracker	*md_rspt_ptr;
 	struct lnet_eq		*md_eq;
 	struct lnet_handle_md	 md_bulk_handle;
 	union {
@@ -1102,6 +1114,14 @@ struct lnet {
 	struct list_head		ln_mt_localNIRecovq;
 	/* local NIs to recover */
 	struct list_head		ln_mt_peerNIRecovq;
+	/*
+	 * An array of queues for GET/PUT waiting for REPLY/ACK respectively.
+	 * There are CPT number of queues. Since response trackers will be
+	 * added on the fast path we can't afford to grab the exclusive
+	 * net lock to protect these queues. The CPT will be calculated
+	 * based on the mdh cookie.
+	 */
+	struct list_head		**ln_mt_rstq;
 	/* recovery eq handler */
 	struct lnet_handle_eq		ln_mt_eqh;
 
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 5224490..55cbf57 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -2418,6 +2418,110 @@ struct lnet_mt_event_info {
 	lnet_nid_t mt_nid;
 };
 
+void
+lnet_detach_rsp_tracker(struct lnet_libmd *md, int cpt)
+{
+	struct lnet_rsp_tracker *rspt;
+
+	/* msg has a refcount on the MD so the MD is not going away.
+	 * The rspt queue for the cpt is protected by
+	 * the lnet_net_lock(cpt). cpt is the cpt of the MD cookie.
+	 */
+	lnet_res_lock(cpt);
+	if (!md->md_rspt_ptr) {
+		lnet_res_unlock(cpt);
+		return;
+	}
+	rspt = md->md_rspt_ptr;
+	md->md_rspt_ptr = NULL;
+
+	/* debug code */
+	LASSERT(rspt->rspt_cpt == cpt);
+
+	/* invalidate the handle to indicate that a response has been
+	 * received, which will then lead the monitor thread to clean up
+	 * the rspt block.
+	 */
+	LNetInvalidateMDHandle(&rspt->rspt_mdh);
+	lnet_res_unlock(cpt);
+}
+
+static void
+lnet_finalize_expired_responses(bool force)
+{
+	struct lnet_libmd *md;
+	struct list_head local_queue;
+	struct lnet_rsp_tracker *rspt, *tmp;
+	int i;
+
+	if (!the_lnet.ln_mt_rstq)
+		return;
+
+	cfs_cpt_for_each(i, lnet_cpt_table()) {
+		INIT_LIST_HEAD(&local_queue);
+
+		lnet_net_lock(i);
+		if (!the_lnet.ln_mt_rstq[i]) {
+			lnet_net_unlock(i);
+			continue;
+		}
+		list_splice_init(the_lnet.ln_mt_rstq[i], &local_queue);
+		lnet_net_unlock(i);
+
+		list_for_each_entry_safe(rspt, tmp, &local_queue,
+					 rspt_on_list) {
+			/* The rspt mdh will be invalidated when a response
+			 * is received or whenever we want to discard the
+			 * block the monitor thread will walk the queue
+			 * and clean up any rsts with an invalid mdh.
+			 * The monitor thread will walk the queue until
+			 * the first unexpired rspt block. This means that
+			 * some rspt blocks which received their
+			 * corresponding responses will linger in the
+			 * queue until they are cleaned up eventually.
+			 */
+			lnet_res_lock(i);
+			if (LNetMDHandleIsInvalid(rspt->rspt_mdh)) {
+				lnet_res_unlock(i);
+				list_del_init(&rspt->rspt_on_list);
+				lnet_rspt_free(rspt, i);
+				continue;
+			}
+
+			if (ktime_compare(ktime_get(),
+					  rspt->rspt_deadline) >= 0 ||
+			    force) {
+				md = lnet_handle2md(&rspt->rspt_mdh);
+				if (!md) {
+					LNetInvalidateMDHandle(&rspt->rspt_mdh);
+					lnet_res_unlock(i);
+					list_del_init(&rspt->rspt_on_list);
+					lnet_rspt_free(rspt, i);
+					continue;
+				}
+				LASSERT(md->md_rspt_ptr == rspt);
+				md->md_rspt_ptr = NULL;
+				lnet_res_unlock(i);
+
+				list_del_init(&rspt->rspt_on_list);
+
+				CDEBUG(D_NET,
+				       "Response timed out: md = %p\n", md);
+				LNetMDUnlink(rspt->rspt_mdh);
+				lnet_rspt_free(rspt, i);
+			} else {
+				lnet_res_unlock(i);
+				break;
+			}
+		}
+
+		lnet_net_lock(i);
+		if (!list_empty(&local_queue))
+			list_splice(&local_queue, the_lnet.ln_mt_rstq[i]);
+		lnet_net_unlock(i);
+	}
+}
+
 static void
 lnet_resend_pending_msgs_locked(struct list_head *resendq, int cpt)
 {
@@ -2900,6 +3004,8 @@ struct lnet_mt_event_info {
 static int
 lnet_monitor_thread(void *arg)
 {
+	int wakeup_counter = 0;
+
 	/* The monitor thread takes care of the following:
 	 *  1. Checks the aliveness of routers
 	 *  2. Checks if there are messages on the resend queue to resend
@@ -2915,6 +3021,12 @@ struct lnet_mt_event_info {
 
 		lnet_resend_pending_msgs();
 
+		wakeup_counter++;
+		if (wakeup_counter >= lnet_transaction_timeout / 2) {
+			lnet_finalize_expired_responses(false);
+			wakeup_counter = 0;
+		}
+
 		lnet_recover_local_nis();
 
 		lnet_recover_peer_nis();
@@ -3095,6 +3207,29 @@ struct lnet_mt_event_info {
 	}
 }
 
+static int
+lnet_rsp_tracker_create(void)
+{
+	struct list_head **rstqs;
+
+	rstqs = lnet_create_array_of_queues();
+	if (!rstqs)
+		return -ENOMEM;
+
+	the_lnet.ln_mt_rstq = rstqs;
+
+	return 0;
+}
+
+static void
+lnet_rsp_tracker_clean(void)
+{
+	lnet_finalize_expired_responses(true);
+
+	cfs_percpt_free(the_lnet.ln_mt_rstq);
+	the_lnet.ln_mt_rstq = NULL;
+}
+
 int lnet_monitor_thr_start(void)
 {
 	int rc = 0;
@@ -3107,6 +3242,10 @@ int lnet_monitor_thr_start(void)
 	if (rc)
 		return rc;
 
+	rc = lnet_rsp_tracker_create();
+	if (rc)
+		goto clean_queues;
+
 	rc = LNetEQAlloc(0, lnet_mt_event_handler, &the_lnet.ln_mt_eqh);
 	if (rc != 0) {
 		CERROR("Can't allocate monitor thread EQ: %d\n", rc);
@@ -3141,6 +3280,7 @@ int lnet_monitor_thr_start(void)
 	lnet_router_cleanup();
 free_mem:
 	the_lnet.ln_mt_state = LNET_MT_STATE_SHUTDOWN;
+	lnet_rsp_tracker_clean();
 	lnet_clean_local_ni_recoveryq();
 	lnet_clean_peer_ni_recoveryq();
 	lnet_clean_resendqs();
@@ -3148,6 +3288,7 @@ int lnet_monitor_thr_start(void)
 	LNetInvalidateEQHandle(&the_lnet.ln_mt_eqh);
 	return rc;
 clean_queues:
+	lnet_rsp_tracker_clean();
 	lnet_clean_local_ni_recoveryq();
 	lnet_clean_peer_ni_recoveryq();
 	lnet_clean_resendqs();
@@ -3173,6 +3314,7 @@ void lnet_monitor_thr_stop(void)
 
 	/* perform cleanup tasks */
 	lnet_router_cleanup();
+	lnet_rsp_tracker_clean();
 	lnet_clean_local_ni_recoveryq();
 	lnet_clean_peer_ni_recoveryq();
 	lnet_clean_resendqs();
@@ -3917,6 +4059,41 @@ void lnet_monitor_thr_stop(void)
 	}
 }
 
+static void
+lnet_attach_rsp_tracker(struct lnet_rsp_tracker *rspt, int cpt,
+			struct lnet_libmd *md, struct lnet_handle_md mdh)
+{
+	s64 timeout_ns;
+
+	/* MD has a refcount taken by message so it's not going away.
+	 * The MD however can be looked up. We need to secure the access
+	 * to the md_rspt_ptr by taking the res_lock.
+	 * The rspt can be accessed without protection up to when it gets
+	 * added to the list.
+	 */
+
+	/* debug code */
+	LASSERT(!md->md_rspt_ptr);
+
+	/* we'll use that same event in case we never get a response  */
+	rspt->rspt_mdh = mdh;
+	rspt->rspt_cpt = cpt;
+	timeout_ns = lnet_transaction_timeout * NSEC_PER_SEC;
+	rspt->rspt_deadline = ktime_add_ns(ktime_get(), timeout_ns);
+
+	lnet_res_lock(cpt);
+	/* store the rspt so we can access it when we get the REPLY */
+	md->md_rspt_ptr = rspt;
+	lnet_res_unlock(cpt);
+
+	/* add to the list of tracked responses. It's added to tail of the
+	 * list in order to expire all the older entries first.
+	 */
+	lnet_net_lock(cpt);
+	list_add_tail(&rspt->rspt_on_list, the_lnet.ln_mt_rstq[cpt]);
+	lnet_net_unlock(cpt);
+}
+
 /**
  * Initiate an asynchronous PUT operation.
  *
@@ -3968,6 +4145,7 @@ void lnet_monitor_thr_stop(void)
 	u64 match_bits, unsigned int offset,
 	u64 hdr_data)
 {
+	struct lnet_rsp_tracker *rspt = NULL;
 	struct lnet_msg *msg;
 	struct lnet_libmd *md;
 	int cpt;
@@ -3991,6 +4169,17 @@ void lnet_monitor_thr_stop(void)
 	msg->msg_vmflush = !!(current->flags & PF_MEMALLOC);
 
 	cpt = lnet_cpt_of_cookie(mdh.cookie);
+
+	if (ack == LNET_ACK_REQ) {
+		rspt = lnet_rspt_alloc(cpt);
+		if (!rspt) {
+			CERROR("Dropping PUT to %s: ENOMEM on response tracker\n",
+			       libcfs_id2str(target));
+			return -ENOMEM;
+		}
+		INIT_LIST_HEAD(&rspt->rspt_on_list);
+	}
+
 	lnet_res_lock(cpt);
 
 	md = lnet_handle2md(&mdh);
@@ -4003,6 +4192,7 @@ void lnet_monitor_thr_stop(void)
 			       md->md_me->me_portal);
 		lnet_res_unlock(cpt);
 
+		kfree(rspt);
 		kfree(msg);
 		return -ENOENT;
 	}
@@ -4035,11 +4225,15 @@ void lnet_monitor_thr_stop(void)
 
 	lnet_build_msg_event(msg, LNET_EVENT_SEND);
 
+	if (ack == LNET_ACK_REQ)
+		lnet_attach_rsp_tracker(rspt, cpt, md, mdh);
+
 	rc = lnet_send(self, msg, LNET_NID_ANY);
 	if (rc) {
 		CNETERR("Error sending PUT to %s: %d\n",
 			libcfs_id2str(target), rc);
 		msg->msg_no_resend = true;
+		lnet_detach_rsp_tracker(msg->msg_md, cpt);
 		lnet_finalize(msg, rc);
 	}
 
@@ -4180,6 +4374,7 @@ struct lnet_msg *
 	struct lnet_process_id target, unsigned int portal,
 	u64 match_bits, unsigned int offset, bool recovery)
 {
+	struct lnet_rsp_tracker *rspt;
 	struct lnet_msg *msg;
 	struct lnet_libmd *md;
 	int cpt;
@@ -4201,9 +4396,18 @@ struct lnet_msg *
 		return -ENOMEM;
 	}
 
+	cpt = lnet_cpt_of_cookie(mdh.cookie);
+
+	rspt = lnet_rspt_alloc(cpt);
+	if (!rspt) {
+		CERROR("Dropping GET to %s: ENOMEM on response tracker\n",
+		       libcfs_id2str(target));
+		return -ENOMEM;
+	}
+	INIT_LIST_HEAD(&rspt->rspt_on_list);
+
 	msg->msg_recovery = recovery;
 
-	cpt = lnet_cpt_of_cookie(mdh.cookie);
 	lnet_res_lock(cpt);
 
 	md = lnet_handle2md(&mdh);
@@ -4218,6 +4422,7 @@ struct lnet_msg *
 		lnet_res_unlock(cpt);
 
 		kfree(msg);
+		kfree(rspt);
 		return -ENOENT;
 	}
 
@@ -4242,11 +4447,14 @@ struct lnet_msg *
 
 	lnet_build_msg_event(msg, LNET_EVENT_SEND);
 
+	lnet_attach_rsp_tracker(rspt, cpt, md, mdh);
+
 	rc = lnet_send(self, msg, LNET_NID_ANY);
 	if (rc < 0) {
 		CNETERR("Error sending GET to %s: %d\n",
 			libcfs_id2str(target), rc);
 		msg->msg_no_resend = true;
+		lnet_detach_rsp_tracker(msg->msg_md, cpt);
 		lnet_finalize(msg, rc);
 	}
 
diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index 9841e14..5046648 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -777,6 +777,15 @@
 
 	msg->msg_ev.status = status;
 
+	/* if this is an ACK or a REPLY then make sure to remove the
+	 * response tracker.
+	 */
+	if (msg->msg_ev.type == LNET_EVENT_REPLY ||
+	    msg->msg_ev.type == LNET_EVENT_ACK) {
+		cpt = lnet_cpt_of_cookie(msg->msg_md->md_lh.lh_cookie);
+		lnet_detach_rsp_tracker(msg->msg_md, cpt);
+	}
+
 	/* if the message is successfully sent, no need to keep the MD around */
 	if (msg->msg_md && !status)
 		lnet_detach_md(msg, status);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 087/622] lnet: remove duplicate timeout mechanism
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (85 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 086/622] lnet: timeout delayed REPLYs and ACKs James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 088/622] lnet: handle fatal device error James Simmons
                   ` (535 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Remove the duplicate GET/PUT timeout mechanism currently implemented
for discovery, as it has been replaced by a more generic timeout
mechanism for all GET/PUT messages.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9120
Lustre-commit: 0b1947d14188 ("LU-9120 lnet: remove duplicate timeout mechanism")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32992
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/peer.c | 39 ---------------------------------------
 1 file changed, 39 deletions(-)

diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 4a62f9a..ca9b90b 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -2925,25 +2925,6 @@ static int lnet_peer_rediscover(struct lnet_peer *lp)
 }
 
 /*
- * Returns the first peer on the ln_dc_working queue if its timeout
- * has expired. Takes the current time as an argument so as to not
- * obsessively re-check the clock. The oldest discovery request will
- * be at the head of the queue.
- */
-static struct lnet_peer *lnet_peer_get_dc_timed_out(time64_t now)
-{
-	struct lnet_peer *lp;
-
-	if (list_empty(&the_lnet.ln_dc_working))
-		return NULL;
-	lp = list_first_entry(&the_lnet.ln_dc_working,
-			      struct lnet_peer, lp_dc_list);
-	if (now < lp->lp_last_queued + lnet_transaction_timeout)
-		return NULL;
-	return lp;
-}
-
-/*
  * Discovering this peer is taking too long. Cancel any Ping or Push
  * that discovery is waiting on by unlinking the relevant MDs. The
  * lnet_discovery_event_handler() will proceed from here and complete
@@ -2998,8 +2979,6 @@ static int lnet_peer_discovery_wait_for_work(void)
 			break;
 		if (!list_empty(&the_lnet.ln_msg_resend))
 			break;
-		if (lnet_peer_get_dc_timed_out(ktime_get_real_seconds()))
-			break;
 		lnet_net_unlock(cpt);
 
 		/*
@@ -3068,7 +3047,6 @@ static void lnet_resend_msgs(void)
 static int lnet_peer_discovery(void *arg)
 {
 	struct lnet_peer *lp;
-	time64_t now;
 	int rc;
 
 	CDEBUG(D_NET, "started\n");
@@ -3159,23 +3137,6 @@ static int lnet_peer_discovery(void *arg)
 				break;
 		}
 
-		/*
-		 * Now that the ln_dc_request queue has been emptied
-		 * check the ln_dc_working queue for peers that are
-		 * taking too long. Move all that are found to the
-		 * ln_dc_expired queue and time out any pending
-		 * Ping or Push. We have to drop the lnet_net_lock
-		 * in the loop because lnet_peer_cancel_discovery()
-		 * calls LNetMDUnlink().
-		 */
-		now = ktime_get_real_seconds();
-		while ((lp = lnet_peer_get_dc_timed_out(now)) != NULL) {
-			list_move(&lp->lp_dc_list, &the_lnet.ln_dc_expired);
-			lnet_net_unlock(LNET_LOCK_EX);
-			lnet_peer_cancel_discovery(lp);
-			lnet_net_lock(LNET_LOCK_EX);
-		}
-
 		lnet_net_unlock(LNET_LOCK_EX);
 	}
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 088/622] lnet: handle fatal device error
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (86 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 087/622] lnet: remove duplicate timeout mechanism James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 089/622] lnet: reset health value James Simmons
                   ` (534 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

The o2iblnd can receive device status notifications through the QP
event handler. Three events in particular are handled in this patch:
IB_EVENT_DEVICE_FATAL
IB_EVENT_PORT_ERR
IB_EVENT_PORT_ACTIVE
For DEVICE_FATAL and PORT_ERR the NI associated with the QP is put
into fatal error mode and will no longer be selected when sending
messages. When PORT_ACTIVE is received the fatal error is cleared on
the NI associated with the QP and future messages can use that NI.
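The flag-and-gate mechanism described above can be modeled in userspace as
follows. This is an illustrative sketch, not the kernel code: the struct and
function names are invented stand-ins for `struct lnet_ni`,
`ni_fatal_error_on`, and the selection path in lib-move.c.

```c
/* Userspace model of the mechanism described above: an atomic
 * fatal-error flag on an NI that the event handler flips and the
 * selection code consults. All names here are illustrative.
 */
#include <assert.h>
#include <stdatomic.h>

struct ni {
	atomic_int fatal_error_on;	/* 1 after DEVICE_FATAL/PORT_ERR */
};

/* QP event handler analogue: set the flag on fatal device events */
static void on_fatal_event(struct ni *ni)
{
	atomic_store(&ni->fatal_error_on, 1);
}

/* ... and clear it again when the port comes back */
static void on_port_active(struct ni *ni)
{
	atomic_store(&ni->fatal_error_on, 0);
}

/* Selection analogue: an NI in fatal state is never chosen */
static int ni_selectable(struct ni *ni)
{
	return atomic_load(&ni->fatal_error_on) == 0;
}
```

Using an atomic flag keeps the event handler lock-free; the selection loop
only reads the flag, matching the `atomic_read()` added in lib-move.c.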

WC-bug-id: https://jira.whamcloud.com/browse/LU-9120
Lustre-commit: 6b1571209a99 ("LU-9120 lnet: handle fatal device error")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32772
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-types.h      |  7 +++++++
 net/lnet/klnds/o2iblnd/o2iblnd_cb.c | 13 +++++++++++++
 net/lnet/lnet/lib-move.c            |  6 +++++-
 3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index d815a87..2b3e76a 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -443,6 +443,13 @@ struct lnet_ni {
 	atomic_t		ni_healthv;
 
 	/*
+	 * Set to 1 by the LND when it receives an event telling it the device
+	 * has gone into a fatal state. Set to 0 when the LND receives an
+	 * event telling it the device is back online.
+	 */
+	atomic_t		ni_fatal_error_on;
+
+	/*
 	 * equivalent interfaces to use
 	 * This is an array because socklnd bonding can still be configured
 	 */
diff --git a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
index c6e8e73..293a859 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -3567,6 +3567,19 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 		rdma_notify(conn->ibc_cmid, IB_EVENT_COMM_EST);
 		return;
 
+	case IB_EVENT_PORT_ERR:
+	case IB_EVENT_DEVICE_FATAL:
+		CERROR("Fatal device error for NI %s\n",
+		       libcfs_nid2str(conn->ibc_peer->ibp_ni->ni_nid));
+		atomic_set(&conn->ibc_peer->ibp_ni->ni_fatal_error_on, 1);
+		return;
+
+	case IB_EVENT_PORT_ACTIVE:
+		CERROR("Port reactivated for NI %s\n",
+		       libcfs_nid2str(conn->ibc_peer->ibp_ni->ni_nid));
+		atomic_set(&conn->ibc_peer->ibp_ni->ni_fatal_error_on, 0);
+		return;
+
 	default:
 		CERROR("%s: Async QP event type %d\n",
 		       libcfs_nid2str(conn->ibc_peer->ibp_nid), event->event);
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 55cbf57..8d5f1e5 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -1303,9 +1303,11 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 		unsigned int distance;
 		int ni_credits;
 		int ni_healthv;
+		int ni_fatal;
 
 		ni_credits = atomic_read(&ni->ni_tx_credits);
 		ni_healthv = atomic_read(&ni->ni_healthv);
+		ni_fatal = atomic_read(&ni->ni_fatal_error_on);
 
 		/*
 		 * calculate the distance from the CPT on which
@@ -1334,7 +1336,9 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 		 * Select on health, shorter distance, available
 		 * credits, then round-robin.
 		 */
-		if (ni_healthv < best_healthv) {
+		if (ni_fatal) {
+			continue;
+		} else if (ni_healthv < best_healthv) {
 			continue;
 		} else if (ni_healthv > best_healthv) {
 			best_healthv = ni_healthv;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 089/622] lnet: reset health value
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (87 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 088/622] lnet: handle fatal device error James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 090/622] lnet: add health statistics James Simmons
                   ` (533 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Add an ioctl to set the health value of a local or peer NI.
This is useful for debugging, since the selection algorithm and
recovery mechanism can be tested by reducing the health of an
interface.

If the value specified is -1, the health value is reset to the
maximum. This is useful for resetting the system once a network
issue has been resolved, without having to wait for each interface
to return to full health on its own.
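The -1 handling reduces to a simple clamp, sketched below. The function name
is illustrative, and the value 1000 for LNET_MAX_HEALTH_VALUE is an
assumption; the patch only shows that out-of-range requests map to the
maximum.

```c
#include <assert.h>

/* Sketch of the ioctl's value handling: any out-of-range request,
 * including the documented -1, resets health to the maximum. The
 * constant 1000 is an assumed value for LNET_MAX_HEALTH_VALUE.
 */
#define LNET_MAX_HEALTH_VALUE 1000

static int healthv_from_request(int requested)
{
	/* -1 (or any out-of-range value) means "reset to maximum" */
	if (requested < 0 || requested > LNET_MAX_HEALTH_VALUE)
		return LNET_MAX_HEALTH_VALUE;
	return requested;
}
```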

WC-bug-id: https://jira.whamcloud.com/browse/LU-9120
Lustre-commit: 2f5a6d1233ac ("LU-9120 lnet: reset health value")
Lustre-commit: b04c35874dca ("LU-11283 lnet: fix setting health value manually")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32773
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h          |  2 ++
 include/uapi/linux/lnet/libcfs_ioctl.h |  3 +-
 include/uapi/linux/lnet/lnet-dlc.h     | 14 ++++++++
 net/lnet/lnet/api-ni.c                 | 51 +++++++++++++++++++++++++++
 net/lnet/lnet/lib-msg.c                | 16 +--------
 net/lnet/lnet/peer.c                   | 64 ++++++++++++++++++++++++++++++++++
 6 files changed, 134 insertions(+), 16 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index c2191e5..bd6ea90 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -524,6 +524,8 @@ struct lnet_ni *lnet_get_next_ni_locked(struct lnet_net *mynet,
 struct lnet_ni *lnet_get_ni_idx_locked(int idx);
 int lnet_get_peer_list(u32 *countp, u32 *sizep,
 		       struct lnet_process_id __user *ids);
+extern void lnet_peer_ni_set_healthv(lnet_nid_t nid, int value, bool all);
+extern void lnet_peer_ni_add_to_recoveryq_locked(struct lnet_peer_ni *lpni);
 
 void lnet_router_debugfs_init(void);
 void lnet_router_debugfs_fini(void);
diff --git a/include/uapi/linux/lnet/libcfs_ioctl.h b/include/uapi/linux/lnet/libcfs_ioctl.h
index 4396d26..458a634 100644
--- a/include/uapi/linux/lnet/libcfs_ioctl.h
+++ b/include/uapi/linux/lnet/libcfs_ioctl.h
@@ -148,6 +148,7 @@ struct libcfs_debug_ioctl_data {
 #define IOC_LIBCFS_GET_NUMA_RANGE	_IOWR(IOC_LIBCFS_TYPE, 99, IOCTL_CONFIG_SIZE)
 #define IOC_LIBCFS_GET_PEER_LIST	_IOWR(IOC_LIBCFS_TYPE, 100, IOCTL_CONFIG_SIZE)
 #define IOC_LIBCFS_GET_LOCAL_NI_MSG_STATS  _IOWR(IOC_LIBCFS_TYPE, 101, IOCTL_CONFIG_SIZE)
-#define IOC_LIBCFS_MAX_NR		101
+#define IOC_LIBCFS_SET_HEALHV		_IOWR(IOC_LIBCFS_TYPE, 102, IOCTL_CONFIG_SIZE)
+#define IOC_LIBCFS_MAX_NR		102
 
 #endif /* __LIBCFS_IOCTL_H__ */
diff --git a/include/uapi/linux/lnet/lnet-dlc.h b/include/uapi/linux/lnet/lnet-dlc.h
index 484435d..2d3aad8 100644
--- a/include/uapi/linux/lnet/lnet-dlc.h
+++ b/include/uapi/linux/lnet/lnet-dlc.h
@@ -230,6 +230,20 @@ struct lnet_ioctl_peer_cfg {
 	void __user *prcfg_bulk;
 };
 
+
+enum lnet_health_type {
+	LNET_HEALTH_TYPE_LOCAL_NI = 0,
+	LNET_HEALTH_TYPE_PEER_NI,
+};
+
+struct lnet_ioctl_reset_health_cfg {
+	struct libcfs_ioctl_hdr rh_hdr;
+	enum lnet_health_type rh_type;
+	bool rh_all;
+	int rh_value;
+	lnet_nid_t rh_nid;
+};
+
 struct lnet_ioctl_set_value {
 	struct libcfs_ioctl_hdr sv_hdr;
 	__u32 sv_value;
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 38e35bb..0cadb2a 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -3163,6 +3163,35 @@ u32 lnet_get_dlc_seq_locked(void)
 	return atomic_read(&lnet_dlc_seq_no);
 }
 
+static void
+lnet_ni_set_healthv(lnet_nid_t nid, int value, bool all)
+{
+	struct lnet_net *net;
+	struct lnet_ni *ni;
+
+	lnet_net_lock(LNET_LOCK_EX);
+	list_for_each_entry(net, &the_lnet.ln_nets, net_list) {
+		list_for_each_entry(ni, &net->net_ni_list, ni_netlist) {
+			if (ni->ni_nid == nid || all) {
+				atomic_set(&ni->ni_healthv, value);
+				if (list_empty(&ni->ni_recovery) &&
+				    value < LNET_MAX_HEALTH_VALUE) {
+					CERROR("manually adding local NI %s to recovery\n",
+					       libcfs_nid2str(ni->ni_nid));
+					list_add_tail(&ni->ni_recovery,
+						      &the_lnet.ln_mt_localNIRecovq);
+					lnet_ni_addref_locked(ni, 0);
+				}
+				if (!all) {
+					lnet_net_unlock(LNET_LOCK_EX);
+					return;
+				}
+			}
+		}
+	}
+	lnet_net_unlock(LNET_LOCK_EX);
+}
+
 /**
  * LNet ioctl handler.
  *
@@ -3446,6 +3475,28 @@ u32 lnet_get_dlc_seq_locked(void)
 		return rc;
 	}
 
+	case IOC_LIBCFS_SET_HEALHV: {
+		struct lnet_ioctl_reset_health_cfg *cfg = arg;
+		int value;
+
+		if (cfg->rh_hdr.ioc_len < sizeof(*cfg))
+			return -EINVAL;
+		if (cfg->rh_value < 0 ||
+		    cfg->rh_value > LNET_MAX_HEALTH_VALUE)
+			value = LNET_MAX_HEALTH_VALUE;
+		else
+			value = cfg->rh_value;
+		mutex_lock(&the_lnet.ln_api_mutex);
+		if (cfg->rh_type == LNET_HEALTH_TYPE_LOCAL_NI)
+			lnet_ni_set_healthv(cfg->rh_nid, value,
+					    cfg->rh_all);
+		else
+			lnet_peer_ni_set_healthv(cfg->rh_nid, value,
+						 cfg->rh_all);
+		mutex_unlock(&the_lnet.ln_api_mutex);
+		return 0;
+	}
+
 	case IOC_LIBCFS_NOTIFY_ROUTER: {
 		time64_t deadline = ktime_get_real_seconds() - data->ioc_u64[0];
 
diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index 5046648..32d49e9 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -530,12 +530,6 @@
 		return;
 
 	lnet_net_lock(0);
-	/* the mt could've shutdown and cleaned up the queues */
-	if (the_lnet.ln_mt_state != LNET_MT_STATE_RUNNING) {
-		lnet_net_unlock(0);
-		return;
-	}
-
 	lnet_dec_healthv_locked(&lpni->lpni_healthv);
 	/* add the peer NI to the recovery queue if it's not already there
 	 * and it's health value is actually below the maximum. It's
@@ -543,15 +537,7 @@
 	 * value will not be reduced. In this case, there is no reason to
 	 * invoke recovery
 	 */
-	if (list_empty(&lpni->lpni_recovery) &&
-	    atomic_read(&lpni->lpni_healthv) < LNET_MAX_HEALTH_VALUE) {
-		CERROR("lpni %s added to recovery queue. Health = %d\n",
-		       libcfs_nid2str(lpni->lpni_nid),
-		       atomic_read(&lpni->lpni_healthv));
-		list_add_tail(&lpni->lpni_recovery,
-			      &the_lnet.ln_mt_peerNIRecovq);
-		lnet_peer_ni_addref_locked(lpni);
-	}
+	lnet_peer_ni_add_to_recoveryq_locked(lpni);
 	lnet_net_unlock(0);
 }
 
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index ca9b90b..9dbb3bd4 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -3437,3 +3437,67 @@ int lnet_get_peer_info(struct lnet_ioctl_peer_cfg *cfg, void __user *bulk)
 out:
 	return rc;
 }
+
+void
+lnet_peer_ni_add_to_recoveryq_locked(struct lnet_peer_ni *lpni)
+{
+	/* the mt could've shutdown and cleaned up the queues */
+	if (the_lnet.ln_mt_state != LNET_MT_STATE_RUNNING)
+		return;
+
+	if (list_empty(&lpni->lpni_recovery) &&
+	    atomic_read(&lpni->lpni_healthv) < LNET_MAX_HEALTH_VALUE) {
+		CERROR("lpni %s added to recovery queue. Health = %d\n",
+		       libcfs_nid2str(lpni->lpni_nid),
+		       atomic_read(&lpni->lpni_healthv));
+		list_add_tail(&lpni->lpni_recovery,
+			      &the_lnet.ln_mt_peerNIRecovq);
+		lnet_peer_ni_addref_locked(lpni);
+	}
+}
+
+/* Call with the ln_api_mutex held */
+void
+lnet_peer_ni_set_healthv(lnet_nid_t nid, int value, bool all)
+{
+	struct lnet_peer_table *ptable;
+	struct lnet_peer *lp;
+	struct lnet_peer_net *lpn;
+	struct lnet_peer_ni *lpni;
+	int lncpt;
+	int cpt;
+
+	if (the_lnet.ln_state != LNET_STATE_RUNNING)
+		return;
+
+	if (!all) {
+		lnet_net_lock(LNET_LOCK_EX);
+		lpni = lnet_find_peer_ni_locked(nid);
+		atomic_set(&lpni->lpni_healthv, value);
+		lnet_peer_ni_add_to_recoveryq_locked(lpni);
+		lnet_peer_ni_decref_locked(lpni);
+		lnet_net_unlock(LNET_LOCK_EX);
+		return;
+	}
+
+	lncpt = cfs_percpt_number(the_lnet.ln_peer_tables);
+
+	/* Walk all the peers and reset the health value for each one
+	 * to the maximum.
+	 */
+	lnet_net_lock(LNET_LOCK_EX);
+	for (cpt = 0; cpt < lncpt; cpt++) {
+		ptable = the_lnet.ln_peer_tables[cpt];
+		list_for_each_entry(lp, &ptable->pt_peer_list, lp_peer_list) {
+			list_for_each_entry(lpn, &lp->lp_peer_nets,
+					    lpn_peer_nets) {
+				list_for_each_entry(lpni, &lpn->lpn_peer_nis,
+						    lpni_peer_nis) {
+					atomic_set(&lpni->lpni_healthv, value);
+					lnet_peer_ni_add_to_recoveryq_locked(lpni);
+				}
+			}
+		}
+	}
+	lnet_net_unlock(LNET_LOCK_EX);
+}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 090/622] lnet: add health statistics
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (88 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 089/622] lnet: reset health value James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 091/622] lnet: Add ioctl to get health stats James Simmons
                   ` (532 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Add a health statistics block for each local and peer NI. These
statistics are incremented when processing errors reported by
lnet_finalize().
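The accounting amounts to one atomic counter per failure class, bumped at
finalize time. A trimmed-down model with only two of the categories follows;
the enum and field names are illustrative, not the kernel's
`lnet_msg_hstatus` values.

```c
#include <assert.h>
#include <stdatomic.h>

/* Trimmed model of the per-NI health counters: one atomic counter
 * per failure class, incremented when a message finalizes with that
 * status. Names are illustrative; only two categories are shown.
 */
enum hstatus {
	H_STATUS_OK,
	H_STATUS_LOCAL_TIMEOUT,
	H_STATUS_REMOTE_DROPPED,
};

struct hstats {
	atomic_int local_timeout;
	atomic_int remote_dropped;
};

static void incr_hstats(struct hstats *st, enum hstatus hs)
{
	switch (hs) {
	case H_STATUS_LOCAL_TIMEOUT:
		atomic_fetch_add(&st->local_timeout, 1);
		break;
	case H_STATUS_REMOTE_DROPPED:
		atomic_fetch_add(&st->remote_dropped, 1);
		break;
	case H_STATUS_OK:
		break;	/* success: nothing to count */
	}
}
```

As in the patch, a successful status increments nothing, and the switch makes
any unhandled category an explicit decision rather than a silent fallthrough.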

WC-bug-id: https://jira.whamcloud.com/browse/LU-9120
Lustre-commit: 67908ab34371 ("LU-9120 lnet: add health statistics")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32775
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-types.h | 18 +++++++++++++++
 net/lnet/lnet/lib-msg.c        | 52 ++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index 2b3e76a..e5d4128 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -338,6 +338,22 @@ struct lnet_element_stats {
 	struct lnet_comm_count	el_drop_stats;
 };
 
+struct lnet_health_local_stats {
+	atomic_t hlt_local_interrupt;
+	atomic_t hlt_local_dropped;
+	atomic_t hlt_local_aborted;
+	atomic_t hlt_local_no_route;
+	atomic_t hlt_local_timeout;
+	atomic_t hlt_local_error;
+};
+
+struct lnet_health_remote_stats {
+	atomic_t hlt_remote_dropped;
+	atomic_t hlt_remote_timeout;
+	atomic_t hlt_remote_error;
+	atomic_t hlt_network_timeout;
+};
+
 struct lnet_net {
 	/* chain on the ln_nets */
 	struct list_head	net_list;
@@ -426,6 +442,7 @@ struct lnet_ni {
 
 	/* NI statistics */
 	struct lnet_element_stats ni_stats;
+	struct lnet_health_local_stats ni_hstats;
 
 	/* physical device CPT */
 	int			ni_dev_cpt;
@@ -511,6 +528,7 @@ struct lnet_peer_ni {
 	struct list_head	 lpni_rtr_list;
 	/* statistics kept on each peer NI */
 	struct lnet_element_stats lpni_stats;
+	struct lnet_health_remote_stats lpni_hstats;
 	/* spin lock protecting credits and lpni_txq / lpni_rtrq */
 	spinlock_t		 lpni_lock;
 	/* # tx credits available */
diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index 32d49e9..dc51a17 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -541,6 +541,54 @@
 	lnet_net_unlock(0);
 }
 
+static void
+lnet_incr_hstats(struct lnet_msg *msg, enum lnet_msg_hstatus hstatus)
+{
+	struct lnet_ni *ni = msg->msg_txni;
+	struct lnet_peer_ni *lpni = msg->msg_txpeer;
+
+	switch (hstatus) {
+	case LNET_MSG_STATUS_LOCAL_INTERRUPT:
+		atomic_inc(&ni->ni_hstats.hlt_local_interrupt);
+		break;
+	case LNET_MSG_STATUS_LOCAL_DROPPED:
+		atomic_inc(&ni->ni_hstats.hlt_local_dropped);
+		break;
+	case LNET_MSG_STATUS_LOCAL_ABORTED:
+		atomic_inc(&ni->ni_hstats.hlt_local_aborted);
+		break;
+	case LNET_MSG_STATUS_LOCAL_NO_ROUTE:
+		atomic_inc(&ni->ni_hstats.hlt_local_no_route);
+		break;
+	case LNET_MSG_STATUS_LOCAL_TIMEOUT:
+		atomic_inc(&ni->ni_hstats.hlt_local_timeout);
+		break;
+	case LNET_MSG_STATUS_LOCAL_ERROR:
+		atomic_inc(&ni->ni_hstats.hlt_local_error);
+		break;
+	case LNET_MSG_STATUS_REMOTE_DROPPED:
+		if (lpni)
+			atomic_inc(&lpni->lpni_hstats.hlt_remote_dropped);
+		break;
+	case LNET_MSG_STATUS_REMOTE_ERROR:
+		if (lpni)
+			atomic_inc(&lpni->lpni_hstats.hlt_remote_error);
+		break;
+	case LNET_MSG_STATUS_REMOTE_TIMEOUT:
+		if (lpni)
+			atomic_inc(&lpni->lpni_hstats.hlt_remote_timeout);
+		break;
+	case LNET_MSG_STATUS_NETWORK_TIMEOUT:
+		if (lpni)
+			atomic_inc(&lpni->lpni_hstats.hlt_network_timeout);
+		break;
+	case LNET_MSG_STATUS_OK:
+		break;
+	default:
+		LBUG();
+	}
+}
+
 /* Do a health check on the message:
  * return -1 if we're not going to handle the error or
  *   if we've reached the maximum number of retries.
@@ -553,8 +601,6 @@
 	enum lnet_msg_hstatus hstatus = msg->msg_health_status;
 	bool lo = false;
 
-	/* TODO: lnet_incr_hstats(hstatus); */
-
 	LASSERT(msg->msg_txni);
 
 	/* if we're sending to the LOLND then the msg_txpeer will not be
@@ -565,6 +611,8 @@
 	else
 		lo = true;
 
+	lnet_incr_hstats(msg, hstatus);
+
 	if (hstatus != LNET_MSG_STATUS_OK &&
 	    ktime_compare(ktime_get(), msg->msg_deadline) >= 0)
 		return -1;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 091/622] lnet: Add ioctl to get health stats
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (89 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 090/622] lnet: add health statistics James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 092/622] lnet: remove obsolete health functions James Simmons
                   ` (531 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

At the time of this patch the sysfs statistics feature is still in
development, so an ioctl is used to get the stats from LNet.
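The bulk-buffer handling in lnet_get_peer_info() follows a common
size-negotiation pattern: compute the total record size for all NIs, and if
the caller's buffer is too small, report the required size so userspace can
retry with a larger buffer. A minimal sketch, where the record size and the
-E2BIG return are assumptions (the hunk shows the size computation but not
the error code):

```c
#include <assert.h>
#include <stddef.h>
#include <errno.h>

/* Stand-in for the summed per-NI record: nid + credit info + stats
 * + msg stats + health stats, as in the size computation above. */
#define RECORD_SIZE 64

/* Returns 0 when the buffer is big enough, -E2BIG otherwise,
 * updating *bufsize to the required size so the caller can retry. */
static int check_bulk_size(size_t *bufsize, int nnis)
{
	size_t need = (size_t)nnis * RECORD_SIZE;

	if (need > *bufsize) {
		*bufsize = need;	/* tell userspace what to allocate */
		return -E2BIG;
	}
	return 0;
}
```

This mirrors how the patch grows the record by `sizeof(*lpni_hstats)` per NI
while leaving the retry protocol with userspace unchanged.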

WC-bug-id: https://jira.whamcloud.com/browse/LU-9120
Lustre-commit: 10958cac798d ("LU-9120 lnet: Add ioctl to get health stats")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32776
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h          |  1 +
 include/uapi/linux/lnet/libcfs_ioctl.h |  3 ++-
 include/uapi/linux/lnet/lnet-dlc.h     | 31 ++++++++++++++++-----
 net/lnet/lnet/api-ni.c                 | 49 ++++++++++++++++++++++++++++++++++
 net/lnet/lnet/peer.c                   | 29 ++++++++++++++++----
 5 files changed, 101 insertions(+), 12 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index bd6ea90..ba237df 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -823,6 +823,7 @@ int lnet_get_peer_ni_info(u32 peer_index, u64 *nid,
 			  u32 *ni_peer_tx_credits, u32 *peer_tx_credits,
 			  u32 *peer_rtr_credits, u32 *peer_min_rtr_credtis,
 			  u32 *peer_tx_qnob);
+int lnet_get_peer_ni_hstats(struct lnet_ioctl_peer_ni_hstats *stats);
 
 static inline bool
 lnet_is_peer_ni_healthy_locked(struct lnet_peer_ni *lpni)
diff --git a/include/uapi/linux/lnet/libcfs_ioctl.h b/include/uapi/linux/lnet/libcfs_ioctl.h
index 458a634..683d508 100644
--- a/include/uapi/linux/lnet/libcfs_ioctl.h
+++ b/include/uapi/linux/lnet/libcfs_ioctl.h
@@ -149,6 +149,7 @@ struct libcfs_debug_ioctl_data {
 #define IOC_LIBCFS_GET_PEER_LIST	_IOWR(IOC_LIBCFS_TYPE, 100, IOCTL_CONFIG_SIZE)
 #define IOC_LIBCFS_GET_LOCAL_NI_MSG_STATS  _IOWR(IOC_LIBCFS_TYPE, 101, IOCTL_CONFIG_SIZE)
 #define IOC_LIBCFS_SET_HEALHV		_IOWR(IOC_LIBCFS_TYPE, 102, IOCTL_CONFIG_SIZE)
-#define IOC_LIBCFS_MAX_NR		102
+#define IOC_LIBCFS_GET_LOCAL_HSTATS	_IOWR(IOC_LIBCFS_TYPE, 103, IOCTL_CONFIG_SIZE)
+#define IOC_LIBCFS_MAX_NR		103
 
 #endif /* __LIBCFS_IOCTL_H__ */
diff --git a/include/uapi/linux/lnet/lnet-dlc.h b/include/uapi/linux/lnet/lnet-dlc.h
index 2d3aad8..8e9850c 100644
--- a/include/uapi/linux/lnet/lnet-dlc.h
+++ b/include/uapi/linux/lnet/lnet-dlc.h
@@ -163,6 +163,31 @@ struct lnet_ioctl_element_stats {
 	__u32 iel_drop_count;
 };
 
+enum lnet_health_type {
+	LNET_HEALTH_TYPE_LOCAL_NI = 0,
+	LNET_HEALTH_TYPE_PEER_NI,
+};
+
+struct lnet_ioctl_local_ni_hstats {
+	struct libcfs_ioctl_hdr hlni_hdr;
+	lnet_nid_t hlni_nid;
+	__u32 hlni_local_interrupt;
+	__u32 hlni_local_dropped;
+	__u32 hlni_local_aborted;
+	__u32 hlni_local_no_route;
+	__u32 hlni_local_timeout;
+	__u32 hlni_local_error;
+	__s32 hlni_health_value;
+};
+
+struct lnet_ioctl_peer_ni_hstats {
+	__u32 hlpni_remote_dropped;
+	__u32 hlpni_remote_timeout;
+	__u32 hlpni_remote_error;
+	__u32 hlpni_network_timeout;
+	__s32 hlpni_health_value;
+};
+
 struct lnet_ioctl_element_msg_stats {
 	struct libcfs_ioctl_hdr im_hdr;
 	__u32 im_idx;
@@ -230,12 +255,6 @@ struct lnet_ioctl_peer_cfg {
 	void __user *prcfg_bulk;
 };
 
-
-enum lnet_health_type {
-	LNET_HEALTH_TYPE_LOCAL_NI = 0,
-	LNET_HEALTH_TYPE_PEER_NI,
-};
-
 struct lnet_ioctl_reset_health_cfg {
 	struct libcfs_ioctl_hdr rh_hdr;
 	enum lnet_health_type rh_type;
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 0cadb2a..14a8f2c 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -3192,6 +3192,42 @@ u32 lnet_get_dlc_seq_locked(void)
 	lnet_net_unlock(LNET_LOCK_EX);
 }
 
+static int
+lnet_get_local_ni_hstats(struct lnet_ioctl_local_ni_hstats *stats)
+{
+	int cpt, rc = 0;
+	struct lnet_ni *ni;
+	lnet_nid_t nid = stats->hlni_nid;
+
+	cpt = lnet_net_lock_current();
+	ni = lnet_nid2ni_locked(nid, cpt);
+
+	if (!ni) {
+		rc = -ENOENT;
+		goto unlock;
+	}
+
+	stats->hlni_local_interrupt =
+		atomic_read(&ni->ni_hstats.hlt_local_interrupt);
+	stats->hlni_local_dropped =
+		atomic_read(&ni->ni_hstats.hlt_local_dropped);
+	stats->hlni_local_aborted =
+		atomic_read(&ni->ni_hstats.hlt_local_aborted);
+	stats->hlni_local_no_route =
+		atomic_read(&ni->ni_hstats.hlt_local_no_route);
+	stats->hlni_local_timeout =
+		atomic_read(&ni->ni_hstats.hlt_local_timeout);
+	stats->hlni_local_error =
+		atomic_read(&ni->ni_hstats.hlt_local_error);
+	stats->hlni_health_value =
+		atomic_read(&ni->ni_healthv);
+
+unlock:
+	lnet_net_unlock(cpt);
+
+	return rc;
+}
+
 /**
  * LNet ioctl handler.
  *
@@ -3399,6 +3435,19 @@ u32 lnet_get_dlc_seq_locked(void)
 		return rc;
 	}
 
+	case IOC_LIBCFS_GET_LOCAL_HSTATS: {
+		struct lnet_ioctl_local_ni_hstats *stats = arg;
+
+		if (stats->hlni_hdr.ioc_len < sizeof(*stats))
+			return -EINVAL;
+
+		mutex_lock(&the_lnet.ln_api_mutex);
+		rc = lnet_get_local_ni_hstats(stats);
+		mutex_unlock(&the_lnet.ln_api_mutex);
+
+		return rc;
+	}
+
 	case IOC_LIBCFS_ADD_PEER_NI: {
 		struct lnet_ioctl_peer_cfg *cfg = arg;
 
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 9dbb3bd4..4a38ca6 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -3339,6 +3339,7 @@ int lnet_get_peer_info(struct lnet_ioctl_peer_cfg *cfg, void __user *bulk)
 {
 	struct lnet_ioctl_element_stats *lpni_stats;
 	struct lnet_ioctl_element_msg_stats *lpni_msg_stats;
+	struct lnet_ioctl_peer_ni_hstats *lpni_hstats;
 	struct lnet_peer_ni_credit_info *lpni_info;
 	struct lnet_peer_ni *lpni;
 	struct lnet_peer *lp;
@@ -3354,7 +3355,7 @@ int lnet_get_peer_info(struct lnet_ioctl_peer_cfg *cfg, void __user *bulk)
 	}
 
 	size = sizeof(nid) + sizeof(*lpni_info) + sizeof(*lpni_stats) +
-	       sizeof(*lpni_msg_stats);
+	       sizeof(*lpni_msg_stats) + sizeof(*lpni_hstats);
 	size *= lp->lp_nnis;
 	if (size > cfg->prcfg_size) {
 		cfg->prcfg_size = size;
@@ -3380,6 +3381,9 @@ int lnet_get_peer_info(struct lnet_ioctl_peer_cfg *cfg, void __user *bulk)
 	lpni_msg_stats = kzalloc(sizeof(*lpni_msg_stats), GFP_KERNEL);
 	if (!lpni_msg_stats)
 		goto out_free_stats;
+	lpni_hstats = kzalloc(sizeof(*lpni_hstats), GFP_NOFS);
+	if (!lpni_hstats)
+		goto out_free_msg_stats;
 
 
 	lpni = NULL;
@@ -3387,7 +3391,7 @@ int lnet_get_peer_info(struct lnet_ioctl_peer_cfg *cfg, void __user *bulk)
 	while ((lpni = lnet_get_next_peer_ni_locked(lp, NULL, lpni)) != NULL) {
 		nid = lpni->lpni_nid;
 		if (copy_to_user(bulk, &nid, sizeof(nid)))
-			goto out_free_msg_stats;
+			goto out_free_hstats;
 		bulk += sizeof(nid);
 
 		memset(lpni_info, 0, sizeof(*lpni_info));
@@ -3406,7 +3410,7 @@ int lnet_get_peer_info(struct lnet_ioctl_peer_cfg *cfg, void __user *bulk)
 		lpni_info->cr_peer_min_tx_credits = lpni->lpni_mintxcredits;
 		lpni_info->cr_peer_tx_qnob = lpni->lpni_txqnob;
 		if (copy_to_user(bulk, lpni_info, sizeof(*lpni_info)))
-			goto out_free_msg_stats;
+			goto out_free_hstats;
 		bulk += sizeof(*lpni_info);
 
 		memset(lpni_stats, 0, sizeof(*lpni_stats));
@@ -3417,15 +3421,30 @@ int lnet_get_peer_info(struct lnet_ioctl_peer_cfg *cfg, void __user *bulk)
 		lpni_stats->iel_drop_count =
 			lnet_sum_stats(&lpni->lpni_stats, LNET_STATS_TYPE_DROP);
 		if (copy_to_user(bulk, lpni_stats, sizeof(*lpni_stats)))
-			goto out_free_msg_stats;
+			goto out_free_hstats;
 		bulk += sizeof(*lpni_stats);
 		lnet_usr_translate_stats(lpni_msg_stats, &lpni->lpni_stats);
 		if (copy_to_user(bulk, lpni_msg_stats, sizeof(*lpni_msg_stats)))
-			goto out_free_msg_stats;
+			goto out_free_hstats;
 		bulk += sizeof(*lpni_msg_stats);
+		lpni_hstats->hlpni_network_timeout =
+			atomic_read(&lpni->lpni_hstats.hlt_network_timeout);
+		lpni_hstats->hlpni_remote_dropped =
+			atomic_read(&lpni->lpni_hstats.hlt_remote_dropped);
+		lpni_hstats->hlpni_remote_timeout =
+			atomic_read(&lpni->lpni_hstats.hlt_remote_timeout);
+		lpni_hstats->hlpni_remote_error =
+			atomic_read(&lpni->lpni_hstats.hlt_remote_error);
+		lpni_hstats->hlpni_health_value =
+			atomic_read(&lpni->lpni_healthv);
+		if (copy_to_user(bulk, lpni_hstats, sizeof(*lpni_hstats)))
+			goto out_free_hstats;
+		bulk += sizeof(*lpni_hstats);
 	}
 	rc = 0;
 
+out_free_hstats:
+	kfree(lpni_hstats);
 out_free_msg_stats:
 	kfree(lpni_msg_stats);
 out_free_stats:
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 092/622] lnet: remove obsolete health functions
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (90 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 091/622] lnet: Add ioctl to get health stats James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 093/622] lnet: set health value from user space James Simmons
                   ` (530 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Remove obsolete health functions that were originally added during
the Multi-Rail project. They relied on assumptions about the health
implementation that are no longer true.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9120
Lustre-commit: ba05b3a98a0c ("LU-9120 lnet: remove obsolete health functions")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32862
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h | 40 ----------------------------------------
 net/lnet/lnet/api-ni.c        |  9 ---------
 net/lnet/lnet/lib-move.c      |  6 ------
 net/lnet/lnet/peer.c          |  8 --------
 4 files changed, 63 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index ba237df..74660d3 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -494,7 +494,6 @@ struct lnet_ni *
 struct lnet_ni *lnet_nid2ni_addref(lnet_nid_t nid);
 struct lnet_ni *lnet_net2ni_locked(u32 net, int cpt);
 struct lnet_ni *lnet_net2ni_addref(u32 net);
-bool lnet_is_ni_healthy_locked(struct lnet_ni *ni);
 struct lnet_net *lnet_get_net_locked(u32 net_id);
 
 extern unsigned int lnet_transaction_timeout;
@@ -825,45 +824,6 @@ int lnet_get_peer_ni_info(u32 peer_index, u64 *nid,
 			  u32 *peer_tx_qnob);
 int lnet_get_peer_ni_hstats(struct lnet_ioctl_peer_ni_hstats *stats);
 
-static inline bool
-lnet_is_peer_ni_healthy_locked(struct lnet_peer_ni *lpni)
-{
-	return lpni->lpni_healthy;
-}
-
-static inline void
-lnet_set_peer_ni_health_locked(struct lnet_peer_ni *lpni, bool health)
-{
-	lpni->lpni_healthy = health;
-}
-
-static inline bool
-lnet_is_peer_net_healthy_locked(struct lnet_peer_net *peer_net)
-{
-	struct lnet_peer_ni *lpni;
-
-	list_for_each_entry(lpni, &peer_net->lpn_peer_nis,
-			    lpni_peer_nis) {
-		if (lnet_is_peer_ni_healthy_locked(lpni))
-			return true;
-	}
-
-	return false;
-}
-
-static inline bool
-lnet_is_peer_healthy_locked(struct lnet_peer *peer)
-{
-	struct lnet_peer_net *peer_net;
-
-	list_for_each_entry(peer_net, &peer->lp_peer_nets, lpn_peer_nets) {
-		if (lnet_is_peer_net_healthy_locked(peer_net))
-			return true;
-	}
-
-	return false;
-}
-
 static inline struct lnet_peer_net *
 lnet_find_peer_net_locked(struct lnet_peer *peer, u32 net_id)
 {
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 14a8f2c..1ee24c7 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -1155,15 +1155,6 @@ struct lnet_net *
 	return !!net;
 }
 
-bool
-lnet_is_ni_healthy_locked(struct lnet_ni *ni)
-{
-	if (ni->ni_state & LNET_NI_STATE_ACTIVE)
-		return true;
-
-	return false;
-}
-
 struct lnet_ni *
 lnet_nid2ni_locked(lnet_nid_t nid, int cpt)
 {
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 8d5f1e5..c33cf8d 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -2323,12 +2323,6 @@ struct lnet_ni *
 	}
 	lnet_peer_ni_decref_locked(lpni);
 
-	/* If peer is not healthy then can not send anything to it */
-	if (!lnet_is_peer_healthy_locked(peer)) {
-		lnet_net_unlock(cpt);
-		return -EHOSTUNREACH;
-	}
-
 	/* Identify the different send cases
 	 */
 	if (src_nid == LNET_NID_ANY)
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 4a38ca6..b20230b 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -135,7 +135,6 @@
 	lpni->lpni_nid = nid;
 	lpni->lpni_cpt = cpt;
 	atomic_set(&lpni->lpni_healthv, LNET_MAX_HEALTH_VALUE);
-	lnet_set_peer_ni_health_locked(lpni, true);
 
 	net = lnet_get_net_locked(LNET_NIDNET(nid));
 	lpni->lpni_net = net;
@@ -2694,8 +2693,6 @@ static lnet_nid_t lnet_peer_select_nid(struct lnet_peer *lp)
 	/* Look for a direct-connected NID for this peer. */
 	lpni = NULL;
 	while ((lpni = lnet_get_next_peer_ni_locked(lp, NULL, lpni)) != NULL) {
-		if (!lnet_is_peer_ni_healthy_locked(lpni))
-			continue;
 		if (!lnet_get_net_locked(lpni->lpni_peer_net->lpn_net_id))
 			continue;
 		break;
@@ -2706,8 +2703,6 @@ static lnet_nid_t lnet_peer_select_nid(struct lnet_peer *lp)
 	/* Look for a routed-connected NID for this peer. */
 	lpni = NULL;
 	while ((lpni = lnet_get_next_peer_ni_locked(lp, NULL, lpni)) != NULL) {
-		if (!lnet_is_peer_ni_healthy_locked(lpni))
-			continue;
 		if (!lnet_find_rnet_locked(lpni->lpni_peer_net->lpn_net_id))
 			continue;
 		break;
@@ -3082,9 +3077,6 @@ static int lnet_peer_discovery(void *arg)
 			 * forever, in case the GET message (for ping)
 			 * doesn't get a REPLY or the PUT message (for
 			 * push) doesn't get an ACK.
-			 *
-			 * TODO: LNet Health will deal with this scenario
-			 * in a generic way.
 			 */
 			lp->lp_last_queued = ktime_get_real_seconds();
 			lnet_net_unlock(LNET_LOCK_EX);
-- 
1.8.3.1


* [lustre-devel] [PATCH 093/622] lnet: set health value from user space
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (91 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 092/622] lnet: remove obsolete health functions James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 094/622] lnet: add global health statistics James Simmons
                   ` (529 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Collect debugging information for the ioctl that manually sets the
health value. Also test that a peer is actually returned by
lnet_find_peer_ni_locked() when lnet_get_peer_info() is called. The
missing check was discovered when the userland tools were updated
for setting the health value.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9120
Lustre-commit: c0ad398fd716 ("LU-9120 lnet: set health value from user space")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32863
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/api-ni.c | 6 ++++++
 net/lnet/lnet/peer.c   | 4 ++++
 2 files changed, 10 insertions(+)

diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 1ee24c7..82703dd 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -3526,6 +3526,12 @@ u32 lnet_get_dlc_seq_locked(void)
 			value = LNET_MAX_HEALTH_VALUE;
 		else
 			value = cfg->rh_value;
+		CDEBUG(D_NET,
+		       "Manually setting healthv to %d for %s:%s. all = %d\n",
+		       value,
+		       (cfg->rh_type == LNET_HEALTH_TYPE_LOCAL_NI) ?
+		       "local" : "peer",
+		       libcfs_nid2str(cfg->rh_nid), cfg->rh_all);
 		mutex_lock(&the_lnet.ln_api_mutex);
 		if (cfg->rh_type == LNET_HEALTH_TYPE_LOCAL_NI)
 			lnet_ni_set_healthv(cfg->rh_nid, value,
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index b20230b..2fc5dfc 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -3484,6 +3484,10 @@ int lnet_get_peer_info(struct lnet_ioctl_peer_cfg *cfg, void __user *bulk)
 	if (!all) {
 		lnet_net_lock(LNET_LOCK_EX);
 		lpni = lnet_find_peer_ni_locked(nid);
+		if (!lpni) {
+			lnet_net_unlock(LNET_LOCK_EX);
+			return;
+		}
 		atomic_set(&lpni->lpni_healthv, value);
 		lnet_peer_ni_add_to_recoveryq_locked(lpni);
 		lnet_peer_ni_decref_locked(lpni);
-- 
1.8.3.1


* [lustre-devel] [PATCH 094/622] lnet: add global health statistics
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (92 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 093/622] lnet: set health value from user space James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 095/622] lnet: print recovery queues content James Simmons
                   ` (528 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Added global health statistics, which can be printed from lnetctl:

lnetctl stats show

lnet_selftest passes the statistics block over the wire. This,
unfortunately, creates an unnecessary backwards-compatibility
dependency for lnet_selftest which shouldn't be there. This patch
breaks that compatibility, which means lnet_selftest will not work
with older selftest modules.
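
The aggregation behind "lnetctl stats show" simply sums each CPT
partition's counter block into one global view. Below is a minimal
userspace sketch of that pattern; the struct and function names are
illustrative stand-ins, not the actual LNet types (the real code
iterates the_lnet.ln_counters with cfs_percpt_for_each() under the
net lock):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-CPT counter block; the real
 * structure is struct lnet_counters in lnet-types.h. */
struct hstats {
	unsigned int resend_count;
	unsigned int response_timeout_count;
	unsigned int local_timeout_count;
};

/* Sum each partition's counters into one global result, mirroring
 * what the cfs_percpt_for_each() loop in the patch does. */
static void hstats_aggregate(struct hstats *total,
			     const struct hstats *percpt, size_t ncpts)
{
	size_t i;

	for (i = 0; i < ncpts; i++) {
		total->resend_count += percpt[i].resend_count;
		total->response_timeout_count +=
			percpt[i].response_timeout_count;
		total->local_timeout_count += percpt[i].local_timeout_count;
	}
}
```

Per-CPT counters avoid cache-line contention on the hot send/receive
paths; the cost is this summation step at query time.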

WC-bug-id: https://jira.whamcloud.com/browse/LU-9120
Lustre-commit: 15020fd977af ("LU-9120 lnet: add global health statistics")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32949
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h        |  2 ++
 include/uapi/linux/lnet/lnet-types.h | 13 +++++++++++++
 net/lnet/lnet/api-ni.c               | 13 +++++++++++++
 net/lnet/lnet/lib-move.c             | 11 +++++++++++
 net/lnet/lnet/lib-msg.c              | 28 +++++++++++++++++++++++-----
 5 files changed, 62 insertions(+), 5 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index 74660d3..e4d9ccc 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -445,6 +445,7 @@ void lnet_res_lh_initialize(struct lnet_res_container *rec,
 
 	rspt = kzalloc(sizeof(*rspt), GFP_NOFS);
 	lnet_net_lock(cpt);
+	the_lnet.ln_counters[cpt]->rst_alloc++;
 	lnet_net_unlock(cpt);
 	return rspt;
 }
@@ -454,6 +455,7 @@ void lnet_res_lh_initialize(struct lnet_res_container *rec,
 {
 	kfree(rspt);
 	lnet_net_lock(cpt);
+	the_lnet.ln_counters[cpt]->rst_alloc--;
 	lnet_net_unlock(cpt);
 }
 
diff --git a/include/uapi/linux/lnet/lnet-types.h b/include/uapi/linux/lnet/lnet-types.h
index 2afdd83..1da72c4 100644
--- a/include/uapi/linux/lnet/lnet-types.h
+++ b/include/uapi/linux/lnet/lnet-types.h
@@ -278,11 +278,24 @@ struct lnet_ping_info {
 struct lnet_counters {
 	__u32	msgs_alloc;
 	__u32	msgs_max;
+	__u32	rst_alloc;
 	__u32	errors;
 	__u32	send_count;
 	__u32	recv_count;
 	__u32	route_count;
 	__u32	drop_count;
+	__u32	resend_count;
+	__u32	response_timeout_count;
+	__u32	local_interrupt_count;
+	__u32	local_dropped_count;
+	__u32	local_aborted_count;
+	__u32	local_no_route_count;
+	__u32	local_timeout_count;
+	__u32	local_error_count;
+	__u32	remote_dropped_count;
+	__u32	remote_error_count;
+	__u32	remote_timeout_count;
+	__u32	network_timeout_count;
 	__u64	send_length;
 	__u64	recv_length;
 	__u64	route_length;
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 82703dd..d58006d 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -694,7 +694,20 @@ static void lnet_assert_wire_constants(void)
 	cfs_percpt_for_each(ctr, i, the_lnet.ln_counters) {
 		counters->msgs_max += ctr->msgs_max;
 		counters->msgs_alloc += ctr->msgs_alloc;
+		counters->rst_alloc += ctr->rst_alloc;
 		counters->errors += ctr->errors;
+		counters->resend_count += ctr->resend_count;
+		counters->response_timeout_count += ctr->response_timeout_count;
+		counters->local_interrupt_count += ctr->local_interrupt_count;
+		counters->local_dropped_count += ctr->local_dropped_count;
+		counters->local_aborted_count += ctr->local_aborted_count;
+		counters->local_no_route_count += ctr->local_no_route_count;
+		counters->local_timeout_count += ctr->local_timeout_count;
+		counters->local_error_count += ctr->local_error_count;
+		counters->remote_dropped_count += ctr->remote_dropped_count;
+		counters->remote_error_count += ctr->remote_error_count;
+		counters->remote_timeout_count += ctr->remote_timeout_count;
+		counters->network_timeout_count += ctr->network_timeout_count;
 		counters->send_count += ctr->send_count;
 		counters->recv_count += ctr->recv_count;
 		counters->route_count += ctr->route_count;
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index c33cf8d..6a3704d 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -2501,6 +2501,10 @@ struct lnet_mt_event_info {
 				md->md_rspt_ptr = NULL;
 				lnet_res_unlock(i);
 
+				lnet_net_lock(i);
+				the_lnet.ln_counters[i]->response_timeout_count++;
+				lnet_net_unlock(i);
+
 				list_del_init(&rspt->rspt_on_list);
 
 				CDEBUG(D_NET,
@@ -2567,6 +2571,11 @@ struct lnet_mt_event_info {
 			lnet_peer_ni_decref_locked(lpni);
 
 			lnet_net_unlock(cpt);
+			CDEBUG(D_NET, "resending %s->%s: %s recovery %d\n",
+			       libcfs_nid2str(src_nid),
+			       libcfs_id2str(msg->msg_target),
+			       lnet_msgtyp2str(msg->msg_type),
+			       msg->msg_recovery);
 			rc = lnet_send(src_nid, msg, LNET_NID_ANY);
 			if (rc) {
 				CERROR("Error sending %s to %s: %d\n",
@@ -2576,6 +2585,8 @@ struct lnet_mt_event_info {
 				lnet_finalize(msg, rc);
 			}
 			lnet_net_lock(cpt);
+			if (!rc)
+				the_lnet.ln_counters[cpt]->resend_count++;
 		}
 	}
 }
diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index dc51a17..70decc7 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -546,41 +546,52 @@
 {
 	struct lnet_ni *ni = msg->msg_txni;
 	struct lnet_peer_ni *lpni = msg->msg_txpeer;
+	struct lnet_counters *counters = the_lnet.ln_counters[0];
 
 	switch (hstatus) {
 	case LNET_MSG_STATUS_LOCAL_INTERRUPT:
 		atomic_inc(&ni->ni_hstats.hlt_local_interrupt);
+		counters->local_interrupt_count++;
 		break;
 	case LNET_MSG_STATUS_LOCAL_DROPPED:
 		atomic_inc(&ni->ni_hstats.hlt_local_dropped);
+		counters->local_dropped_count++;
 		break;
 	case LNET_MSG_STATUS_LOCAL_ABORTED:
 		atomic_inc(&ni->ni_hstats.hlt_local_aborted);
+		counters->local_aborted_count++;
 		break;
 	case LNET_MSG_STATUS_LOCAL_NO_ROUTE:
 		atomic_inc(&ni->ni_hstats.hlt_local_no_route);
+		counters->local_no_route_count++;
 		break;
 	case LNET_MSG_STATUS_LOCAL_TIMEOUT:
 		atomic_inc(&ni->ni_hstats.hlt_local_timeout);
+		counters->local_timeout_count++;
 		break;
 	case LNET_MSG_STATUS_LOCAL_ERROR:
 		atomic_inc(&ni->ni_hstats.hlt_local_error);
+		counters->local_error_count++;
 		break;
 	case LNET_MSG_STATUS_REMOTE_DROPPED:
 		if (lpni)
 			atomic_inc(&lpni->lpni_hstats.hlt_remote_dropped);
+		counters->remote_dropped_count++;
 		break;
 	case LNET_MSG_STATUS_REMOTE_ERROR:
 		if (lpni)
 			atomic_inc(&lpni->lpni_hstats.hlt_remote_error);
+		counters->remote_error_count++;
 		break;
 	case LNET_MSG_STATUS_REMOTE_TIMEOUT:
 		if (lpni)
 			atomic_inc(&lpni->lpni_hstats.hlt_remote_timeout);
+		counters->remote_timeout_count++;
 		break;
 	case LNET_MSG_STATUS_NETWORK_TIMEOUT:
 		if (lpni)
 			atomic_inc(&lpni->lpni_hstats.hlt_network_timeout);
+		counters->network_timeout_count++;
 		break;
 	case LNET_MSG_STATUS_OK:
 		break;
@@ -601,6 +612,10 @@
 	enum lnet_msg_hstatus hstatus = msg->msg_health_status;
 	bool lo = false;
 
+	/* if we're shutting down no point in handling health. */
+	if (the_lnet.ln_state != LNET_STATE_RUNNING)
+		return -1;
+
 	LASSERT(msg->msg_txni);
 
 	/* if we're sending to the LOLND then the msg_txpeer will not be
@@ -611,15 +626,18 @@
 	else
 		lo = true;
 
-	lnet_incr_hstats(msg, hstatus);
-
 	if (hstatus != LNET_MSG_STATUS_OK &&
 	    ktime_compare(ktime_get(), msg->msg_deadline) >= 0)
 		return -1;
 
-	/* if we're shutting down no point in handling health. */
-	if (the_lnet.ln_state != LNET_STATE_RUNNING)
-		return -1;
+	/* stats are only incremented for errors so avoid wasting time
+	 * incrementing statistics if there is no error.
+	 */
+	if (hstatus != LNET_MSG_STATUS_OK) {
+		lnet_net_lock(0);
+		lnet_incr_hstats(msg, hstatus);
+		lnet_net_unlock(0);
+	}
 
 	CDEBUG(D_NET, "health check: %s->%s: %s: %s\n",
 	       libcfs_nid2str(msg->msg_txni->ni_nid),
-- 
1.8.3.1


* [lustre-devel] [PATCH 095/622] lnet: print recovery queues content
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (93 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 094/622] lnet: add global health statistics James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 096/622] lnet: health error simulation James Simmons
                   ` (527 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Add commands to lnetctl to print the contents of the recovery
queues from user space.

The associated code to handle the new ioctl is added in the LNet
module.

for local NIs:
lnetctl debug recovery --local

for peer NIs:
lnetctl debug recovery --peer
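
On the kernel side, each command walks the corresponding recovery
queue under the net lock and copies at most LNET_MAX_SHOW_NUM_NID
NIDs into the fixed-size ioctl reply array. A rough userspace sketch
of that capped-copy pattern, using a plain array in place of the
kernel linked list (names hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SHOW_NIDS 128	/* mirrors LNET_MAX_SHOW_NUM_NID */

/* Copy at most MAX_SHOW_NIDS entries from the recovery queue into
 * the reply buffer and report how many were copied.  The kernel code
 * does the same bounded walk with list_for_each_entry() under
 * lnet_net_lock(); this sketch uses an array as the queue. */
static int recovery_list_fill(const uint64_t *queue, size_t queue_len,
			      uint64_t *out, int *num_nids)
{
	size_t i;

	for (i = 0; i < queue_len && i < MAX_SHOW_NIDS; i++)
		out[i] = queue[i];
	*num_nids = (int)i;
	return 0;
}
```

The fixed cap keeps the ioctl reply size bounded; a longer queue is
silently truncated to the first 128 entries, as in the patch.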

WC-bug-id: https://jira.whamcloud.com/browse/LU-9120
Lustre-commit: 826ea19c077b ("LU-9120 lnet: print recovery queues content")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32950
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/uapi/linux/lnet/libcfs_ioctl.h |  3 +-
 include/uapi/linux/lnet/lnet-dlc.h     |  8 +++++
 net/lnet/lnet/api-ni.c                 | 53 ++++++++++++++++++++++++++++++++++
 3 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/lnet/libcfs_ioctl.h b/include/uapi/linux/lnet/libcfs_ioctl.h
index 683d508..dfb73f7 100644
--- a/include/uapi/linux/lnet/libcfs_ioctl.h
+++ b/include/uapi/linux/lnet/libcfs_ioctl.h
@@ -150,6 +150,7 @@ struct libcfs_debug_ioctl_data {
 #define IOC_LIBCFS_GET_LOCAL_NI_MSG_STATS  _IOWR(IOC_LIBCFS_TYPE, 101, IOCTL_CONFIG_SIZE)
 #define IOC_LIBCFS_SET_HEALHV		_IOWR(IOC_LIBCFS_TYPE, 102, IOCTL_CONFIG_SIZE)
 #define IOC_LIBCFS_GET_LOCAL_HSTATS	_IOWR(IOC_LIBCFS_TYPE, 103, IOCTL_CONFIG_SIZE)
-#define IOC_LIBCFS_MAX_NR		103
+#define IOC_LIBCFS_GET_RECOVERY_QUEUE	_IOWR(IOC_LIBCFS_TYPE, 104, IOCTL_CONFIG_SIZE)
+#define IOC_LIBCFS_MAX_NR		104
 
 #endif /* __LIBCFS_IOCTL_H__ */
diff --git a/include/uapi/linux/lnet/lnet-dlc.h b/include/uapi/linux/lnet/lnet-dlc.h
index 8e9850c..87f7680 100644
--- a/include/uapi/linux/lnet/lnet-dlc.h
+++ b/include/uapi/linux/lnet/lnet-dlc.h
@@ -35,6 +35,7 @@
 #define MAX_NUM_SHOW_ENTRIES	32
 #define LNET_MAX_STR_LEN	128
 #define LNET_MAX_SHOW_NUM_CPT	128
+#define LNET_MAX_SHOW_NUM_NID	128
 #define LNET_UNDEFINED_HOPS	((__u32)(-1))
 
 /*
@@ -263,6 +264,13 @@ struct lnet_ioctl_reset_health_cfg {
 	lnet_nid_t rh_nid;
 };
 
+struct lnet_ioctl_recovery_list {
+	struct libcfs_ioctl_hdr rlst_hdr;
+	enum lnet_health_type rlst_type;
+	int rlst_num_nids;
+	lnet_nid_t rlst_nid_array[LNET_MAX_SHOW_NUM_NID];
+};
+
 struct lnet_ioctl_set_value {
 	struct libcfs_ioctl_hdr sv_hdr;
 	__u32 sv_value;
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index d58006d..07bc29f 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -3232,6 +3232,44 @@ u32 lnet_get_dlc_seq_locked(void)
 	return rc;
 }
 
+static int
+lnet_get_local_ni_recovery_list(struct lnet_ioctl_recovery_list *list)
+{
+	struct lnet_ni *ni;
+	int i = 0;
+
+	lnet_net_lock(LNET_LOCK_EX);
+	list_for_each_entry(ni, &the_lnet.ln_mt_localNIRecovq, ni_recovery) {
+		list->rlst_nid_array[i] = ni->ni_nid;
+		i++;
+		if (i >= LNET_MAX_SHOW_NUM_NID)
+			break;
+	}
+	lnet_net_unlock(LNET_LOCK_EX);
+	list->rlst_num_nids = i;
+
+	return 0;
+}
+
+static int
+lnet_get_peer_ni_recovery_list(struct lnet_ioctl_recovery_list *list)
+{
+	struct lnet_peer_ni *lpni;
+	int i = 0;
+
+	lnet_net_lock(LNET_LOCK_EX);
+	list_for_each_entry(lpni, &the_lnet.ln_mt_peerNIRecovq, lpni_recovery) {
+		list->rlst_nid_array[i] = lpni->lpni_nid;
+		i++;
+		if (i >= LNET_MAX_SHOW_NUM_NID)
+			break;
+	}
+	lnet_net_unlock(LNET_LOCK_EX);
+	list->rlst_num_nids = i;
+
+	return 0;
+}
+
 /**
  * LNet ioctl handler.
  *
@@ -3452,6 +3490,21 @@ u32 lnet_get_dlc_seq_locked(void)
 		return rc;
 	}
 
+	case IOC_LIBCFS_GET_RECOVERY_QUEUE: {
+		struct lnet_ioctl_recovery_list *list = arg;
+
+		if (list->rlst_hdr.ioc_len < sizeof(*list))
+			return -EINVAL;
+
+		mutex_lock(&the_lnet.ln_api_mutex);
+		if (list->rlst_type == LNET_HEALTH_TYPE_LOCAL_NI)
+			rc = lnet_get_local_ni_recovery_list(list);
+		else
+			rc = lnet_get_peer_ni_recovery_list(list);
+		mutex_unlock(&the_lnet.ln_api_mutex);
+		return rc;
+	}
+
 	case IOC_LIBCFS_ADD_PEER_NI: {
 		struct lnet_ioctl_peer_cfg *cfg = arg;
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 096/622] lnet: health error simulation
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (94 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 095/622] lnet: print recovery queues content James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 097/622] lustre: ptlrpc: replace simple_strtol with kstrtol James Simmons
                   ` (526 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Modified the error simulation code to simulate health errors for
testing purposes. The specific error to inject can be set; if
multiple errors are configured, one is chosen at random from the set.

EX:
lctl net_drop_add -s *@tcp -d *@tcp -m GET -i 1 -e local_interrupt

The -e option can be repeated multiple times to specify different
errors to simulate. The available values are:
        local_interrupt
        local_dropped
        local_aborted
        local_no_route
        local_error
        local_timeout
        remote_error
        remote_dropped
        remote_timeout
        network_timeout
        random

A -n ("--random") option has been added to randomize error
generation for drop rules. It relies on an interval value provided
via -i: a random number no bigger than the interval is generated,
and if that number is smaller than half of the interval the rule
isn't matched, otherwise it is.

This is needed because drop matching can happen multiple times in
the path of sending a message, and time-based or rate-based matching
would not generate errors evenly across those multiple calls.
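
The random matching described above reduces to: draw a number in
[0, interval) and match the rule only when the draw lands in the
upper half. A small sketch with the random draw factored out so the
decision itself is testable (function name hypothetical; the kernel
draws via prandom_u32_max(da_interval) inside drop_rule_match()):

```c
#include <assert.h>
#include <stdbool.h>

/* Decide whether a drop rule with --random set matches, given the
 * random draw already made.  Draws in the upper half of the interval
 * match the rule (message is dropped / error injected); draws in the
 * lower half do not. */
static bool random_drop_match(unsigned int draw, unsigned int interval)
{
	return draw >= interval / 2;
}
```

With a uniform draw this matches roughly half the time per call,
independent of how many times the rule is consulted on one message's
send path, which is exactly the property the commit message asks for.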

WC-bug-id: https://jira.whamcloud.com/browse/LU-9120
Lustre-commit: 5c17777d97bd ("LU-9120 lnet: health error simulation")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32951
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h       |  4 +-
 include/linux/lnet/lib-types.h      |  3 +-
 include/uapi/linux/lnet/lnetctl.h   | 17 +++++++++
 net/lnet/klnds/o2iblnd/o2iblnd_cb.c |  6 ++-
 net/lnet/klnds/socklnd/socklnd_cb.c | 27 ++++++++++----
 net/lnet/lnet/lib-move.c            |  2 +-
 net/lnet/lnet/lib-msg.c             | 24 ++++++++++++
 net/lnet/lnet/net_fault.c           | 73 ++++++++++++++++++++++++++++++++++---
 8 files changed, 138 insertions(+), 18 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index e4d9ccc..4915a87 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -639,6 +639,8 @@ void lnet_set_reply_msg_len(struct lnet_ni *ni, struct lnet_msg *msg,
 void lnet_detach_rsp_tracker(struct lnet_libmd *md, int cpt);
 
 void lnet_finalize(struct lnet_msg *msg, int rc);
+bool lnet_send_error_simulation(struct lnet_msg *msg,
+				enum lnet_msg_hstatus *hstatus);
 
 void lnet_drop_message(struct lnet_ni *ni, int cpt, void *private,
 		       unsigned int nob, u32 msg_type);
@@ -661,7 +663,7 @@ void lnet_drop_message(struct lnet_ni *ni, int cpt, void *private,
 int lnet_fault_init(void);
 void lnet_fault_fini(void);
 
-bool lnet_drop_rule_match(struct lnet_hdr *hdr);
+bool lnet_drop_rule_match(struct lnet_hdr *hdr, enum lnet_msg_hstatus *hstatus);
 
 int lnet_delay_rule_add(struct lnet_fault_attr *attr);
 int lnet_delay_rule_del(lnet_nid_t src, lnet_nid_t dst, bool shutdown);
diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index e5d4128..f82ebb6 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -72,7 +72,8 @@ enum lnet_msg_hstatus {
 	LNET_MSG_STATUS_REMOTE_ERROR,
 	LNET_MSG_STATUS_REMOTE_DROPPED,
 	LNET_MSG_STATUS_REMOTE_TIMEOUT,
-	LNET_MSG_STATUS_NETWORK_TIMEOUT
+	LNET_MSG_STATUS_NETWORK_TIMEOUT,
+	LNET_MSG_STATUS_END,
 };
 
 struct lnet_rsp_tracker {
diff --git a/include/uapi/linux/lnet/lnetctl.h b/include/uapi/linux/lnet/lnetctl.h
index 191689c..2eb9c82 100644
--- a/include/uapi/linux/lnet/lnetctl.h
+++ b/include/uapi/linux/lnet/lnetctl.h
@@ -41,6 +41,19 @@ enum {
 #define LNET_GET_BIT		(1 << 2)
 #define LNET_REPLY_BIT		(1 << 3)
 
+#define HSTATUS_END			11
+#define HSTATUS_LOCAL_INTERRUPT_BIT	(1 << 1)
+#define HSTATUS_LOCAL_DROPPED_BIT	(1 << 2)
+#define HSTATUS_LOCAL_ABORTED_BIT	(1 << 3)
+#define HSTATUS_LOCAL_NO_ROUTE_BIT	(1 << 4)
+#define HSTATUS_LOCAL_ERROR_BIT		(1 << 5)
+#define HSTATUS_LOCAL_TIMEOUT_BIT	(1 << 6)
+#define HSTATUS_REMOTE_ERROR_BIT	(1 << 7)
+#define HSTATUS_REMOTE_DROPPED_BIT	(1 << 8)
+#define HSTATUS_REMOTE_TIMEOUT_BIT	(1 << 9)
+#define HSTATUS_NETWORK_TIMEOUT_BIT	(1 << 10)
+#define HSTATUS_RANDOM			0xffffffff
+
 /** ioctl parameter for LNet fault simulation */
 struct lnet_fault_attr {
 	/**
@@ -78,6 +91,10 @@ struct lnet_fault_attr {
 			 * with da_rate
 			 */
 			__u32			da_interval;
+			/** error type mask */
+			__u32			da_health_error_mask;
+			/** randomize error generation */
+			bool			da_random;
 		} drop;
 		/** message latency simulation */
 		struct {
diff --git a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
index 293a859..5680f2a 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -912,7 +912,11 @@ static int kiblnd_map_tx(struct lnet_ni *ni, struct kib_tx *tx,
 			 bad->wr_id, bad->opcode, bad->send_flags,
 			 libcfs_nid2str(conn->ibc_peer->ibp_nid));
 		bad = NULL;
-		rc = ib_post_send(conn->ibc_cmid->qp, wrq, &bad);
+		if (lnet_send_error_simulation(tx->tx_lntmsg[0],
+					       &tx->tx_hstatus))
+			rc = -EINVAL;
+		else
+			rc = ib_post_send(conn->ibc_cmid->qp, wrq, &bad);
 	}
 
 	conn->ibc_last_send = ktime_get();
diff --git a/net/lnet/klnds/socklnd/socklnd_cb.c b/net/lnet/klnds/socklnd/socklnd_cb.c
index 8bc23d2..057c7f3 100644
--- a/net/lnet/klnds/socklnd/socklnd_cb.c
+++ b/net/lnet/klnds/socklnd/socklnd_cb.c
@@ -335,7 +335,8 @@ struct ksock_tx *
 
 	if (!rc && (tx->tx_resid != 0 || tx->tx_zc_aborted)) {
 		rc = -EIO;
-		hstatus = LNET_MSG_STATUS_LOCAL_ERROR;
+		if (hstatus == LNET_MSG_STATUS_OK)
+			hstatus = LNET_MSG_STATUS_LOCAL_ERROR;
 	}
 
 	if (tx->tx_conn)
@@ -467,6 +468,13 @@ struct ksock_tx *
 ksocknal_process_transmit(struct ksock_conn *conn, struct ksock_tx *tx)
 {
 	int rc;
+	bool error_sim = false;
+
+	if (lnet_send_error_simulation(tx->tx_lnetmsg, &tx->tx_hstatus)) {
+		error_sim = true;
+		rc = -EINVAL;
+		goto simulate_error;
+	}
 
 	if (tx->tx_zc_capable && !tx->tx_zc_checked)
 		ksocknal_check_zc_req(tx);
@@ -512,16 +520,19 @@ struct ksock_tx *
 		return rc;
 	}
 
+simulate_error:
 	/* Actual error */
 	LASSERT(rc < 0);
 
-	/* set the health status of the message which determines
-	 * whether we should retry the transmit
-	 */
-	if (rc == -ETIMEDOUT)
-		tx->tx_hstatus = LNET_MSG_STATUS_REMOTE_TIMEOUT;
-	else
-		tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_ERROR;
+	if (!error_sim) {
+		/* set the health status of the message which determines
+		 * whether we should retry the transmit
+		 */
+		if (rc == -ETIMEDOUT)
+			tx->tx_hstatus = LNET_MSG_STATUS_REMOTE_TIMEOUT;
+		else
+			tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_ERROR;
+	}
 
 	if (!conn->ksnc_closing) {
 		switch (rc) {
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 6a3704d..eb0b48d 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -3875,7 +3875,7 @@ void lnet_monitor_thr_stop(void)
 	}
 
 	if (!list_empty(&the_lnet.ln_drop_rules) &&
-	    lnet_drop_rule_match(hdr)) {
+	    lnet_drop_rule_match(hdr, NULL)) {
 		CDEBUG(D_NET, "%s, src %s, dst %s: Dropping %s to simulate silent message loss\n",
 		       libcfs_nid2str(from_nid), libcfs_nid2str(src_nid),
 		       libcfs_nid2str(dest_nid), lnet_msgtyp2str(type));
diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index 70decc7..5072238 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -812,6 +812,30 @@
 	}
 }
 
+bool
+lnet_send_error_simulation(struct lnet_msg *msg,
+			   enum lnet_msg_hstatus *hstatus)
+{
+	if (!msg)
+		return false;
+
+	if (list_empty(&the_lnet.ln_drop_rules))
+		return false;
+
+	/* match only health rules */
+	if (!lnet_drop_rule_match(&msg->msg_hdr, hstatus))
+		return false;
+
+	CDEBUG(D_NET, "src %s, dst %s: %s simulate health error: %s\n",
+	       libcfs_nid2str(msg->msg_hdr.src_nid),
+	       libcfs_nid2str(msg->msg_hdr.dest_nid),
+	       lnet_msgtyp2str(msg->msg_type),
+	       lnet_health_error2str(*hstatus));
+
+	return true;
+}
+EXPORT_SYMBOL(lnet_send_error_simulation);
+
 void
 lnet_finalize(struct lnet_msg *msg, int status)
 {
diff --git a/net/lnet/lnet/net_fault.c b/net/lnet/lnet/net_fault.c
index 4589b17..becb709 100644
--- a/net/lnet/lnet/net_fault.c
+++ b/net/lnet/lnet/net_fault.c
@@ -292,13 +292,56 @@ struct lnet_drop_rule {
 	lnet_net_unlock(cpt);
 }
 
+static void
+lnet_fault_match_health(enum lnet_msg_hstatus *hstatus, __u32 mask)
+{
+	int choice;
+	int delta;
+	int best_delta;
+	int i;
+
+	/* assign a random failure */
+	choice = prandom_u32_max(LNET_MSG_STATUS_END - LNET_MSG_STATUS_OK);
+	if (choice == 0)
+		choice++;
+
+	if (mask == HSTATUS_RANDOM) {
+		*hstatus = choice;
+		return;
+	}
+
+	if (mask & (1 << choice)) {
+		*hstatus = choice;
+		return;
+	}
+
+	/* round to the closest ON bit */
+	i = HSTATUS_END;
+	best_delta = HSTATUS_END;
+	while (i > 0) {
+		if (mask & (1 << i)) {
+			delta = choice - i;
+			if (delta < 0)
+				delta *= -1;
+			if (delta < best_delta) {
+				best_delta = delta;
+				choice = i;
+			}
+		}
+		i--;
+	}
+
+	*hstatus = choice;
+}
+
 /**
  * check source/destination NID, portal, message type and drop rate,
  * decide whether should drop this message or not
  */
 static bool
 drop_rule_match(struct lnet_drop_rule *rule, lnet_nid_t src,
-		lnet_nid_t dst, unsigned int type, unsigned int portal)
+		lnet_nid_t dst, unsigned int type, unsigned int portal,
+		enum lnet_msg_hstatus *hstatus)
 {
 	struct lnet_fault_attr *attr = &rule->dr_attr;
 	bool drop;
@@ -306,9 +349,23 @@ struct lnet_drop_rule {
 	if (!lnet_fault_attr_match(attr, src, dst, type, portal))
 		return false;
 
+	/* if we're trying to match a health status error but it hasn't
+	 * been set in the rule, then don't match
+	 */
+	if ((hstatus && !attr->u.drop.da_health_error_mask) ||
+	    (!hstatus && attr->u.drop.da_health_error_mask))
+		return false;
+
 	/* match this rule, check drop rate now */
 	spin_lock(&rule->dr_lock);
-	if (rule->dr_drop_time) { /* time based drop */
+	if (attr->u.drop.da_random) {
+		int value = prandom_u32_max(attr->u.drop.da_interval);
+
+		if (value >= (attr->u.drop.da_interval / 2))
+			drop = true;
+		else
+			drop = false;
+	} else if (rule->dr_drop_time) { /* time based drop */
 		time64_t now = ktime_get_seconds();
 
 		rule->dr_stat.fs_count++;
@@ -340,6 +397,9 @@ struct lnet_drop_rule {
 	}
 
 	if (drop) { /* drop this message, update counters */
+		if (hstatus)
+			lnet_fault_match_health(hstatus,
+						attr->u.drop.da_health_error_mask);
 		lnet_fault_stat_inc(&rule->dr_stat, type);
 		rule->dr_stat.u.drop.ds_dropped++;
 	}
@@ -352,12 +412,12 @@ struct lnet_drop_rule {
  * Check if message from @src to @dst can match any existed drop rule
  */
 bool
-lnet_drop_rule_match(struct lnet_hdr *hdr)
+lnet_drop_rule_match(struct lnet_hdr *hdr, enum lnet_msg_hstatus *hstatus)
 {
-	struct lnet_drop_rule *rule;
 	lnet_nid_t src = le64_to_cpu(hdr->src_nid);
 	lnet_nid_t dst = le64_to_cpu(hdr->dest_nid);
 	unsigned int typ = le32_to_cpu(hdr->type);
+	struct lnet_drop_rule *rule;
 	unsigned int ptl = -1;
 	bool drop = false;
 	int cpt;
@@ -373,12 +433,13 @@ struct lnet_drop_rule {
 
 	cpt = lnet_net_lock_current();
 	list_for_each_entry(rule, &the_lnet.ln_drop_rules, dr_link) {
-		drop = drop_rule_match(rule, src, dst, typ, ptl);
+		drop = drop_rule_match(rule, src, dst, typ, ptl,
+				       hstatus);
 		if (drop)
 			break;
 	}
-
 	lnet_net_unlock(cpt);
+
 	return drop;
 }
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 097/622] lustre: ptlrpc: replace simple_strtol with kstrtol
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (95 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 096/622] lnet: health error simulation James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 098/622] lustre: obd: use correct ip_compute_csum() version James Simmons
                   ` (525 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

Eventually simple_strtol() will be removed, so replace its use in
ptlrpc with the kstrtoXXX() class of functions.
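
The behavioral difference between the two interfaces can be sketched in
plain userspace C. parse_uint() below is only an illustrative stand-in
for the kernel's kstrtouint(), built on strtoul(); it is not the kernel
implementation:

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Userspace analogue of kstrtouint(): unlike simple_strtoul(),
 * it rejects trailing garbage and out-of-range values, returning
 * 0 on success or a negative errno, so callers no longer need a
 * separate endptr check. */
static int parse_uint(const char *s, unsigned int base, unsigned int *res)
{
	char *end;
	unsigned long val;

	errno = 0;
	val = strtoul(s, &end, base);
	if (end == s || *end != '\0')
		return -EINVAL;	/* no digits, or trailing junk */
	if (errno == ERANGE || val > UINT_MAX)
		return -ERANGE;	/* does not fit an unsigned int */
	*res = (unsigned int)val;
	return 0;
}
```

This mirrors why the patch can drop the `*endptr` test: the error path
collapses into a single `if (rc)` check.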

WC-bug-id: https://jira.whamcloud.com/browse/LU-9325
Lustre-commit: 8f37d64b6bc9 ("LU-9325 ptlrpc: replace simple_strtol with kstrtol")
Signed-off-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-on: https://review.whamcloud.com/32785
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Nikitas Angelinas <nikitas.angelinas@gmail.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/lproc_ptlrpc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/ptlrpc/lproc_ptlrpc.c b/fs/lustre/ptlrpc/lproc_ptlrpc.c
index 6af3384..eb0ecc0 100644
--- a/fs/lustre/ptlrpc/lproc_ptlrpc.c
+++ b/fs/lustre/ptlrpc/lproc_ptlrpc.c
@@ -1303,13 +1303,13 @@ int lprocfs_wr_import(struct file *file, const char __user *buffer,
 	ptr = strstr(uuid, "::");
 	if (ptr) {
 		u32 inst;
-		char *endptr;
+		int rc;
 
 		*ptr = 0;
 		do_reconn = 0;
 		ptr += strlen("::");
-		inst = simple_strtoul(ptr, &endptr, 10);
-		if (*endptr) {
+		rc = kstrtouint(ptr, 10, &inst);
+		if (rc) {
 			CERROR("config: wrong instance # %s\n", ptr);
 		} else if (inst != imp->imp_connect_data.ocd_instance) {
 			CDEBUG(D_INFO,
-- 
1.8.3.1

* [lustre-devel] [PATCH 098/622] lustre: obd: use correct ip_compute_csum() version
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (96 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 097/622] lustre: ptlrpc: replace simple_strtol with kstrtol James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 099/622] lustre: osc: serialize access to idle_timeout vs cleanup James Simmons
                   ` (524 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

The linux kernel provides a generic platform-independent version
of ip_compute_csum() as well as platform-optimized versions. Some
platforms disable the generic version in favor of the optimized
one. If the generic version is disabled and the checksum.h header
from asm-generic is used, then we end up with an undefined symbol
error when loading the obdclass module. The solution is to use the
platform-specific checksum.h header, which selects the generic or
optimized version for us. As a bonus we get better performance
with the right kernel configuration.
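
For reference, every version of ip_compute_csum() computes the same
value: the standard RFC 1071 Internet checksum. A minimal userspace
sketch of that algorithm (not the kernel code, which the optimized
assembler versions replace) looks like:

```c
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum: sum 16-bit words with end-around
 * carry folding, then take the one's complement.  The generic and
 * platform-optimized kernel versions all return this value; only
 * which header provides the symbol differs. */
static uint16_t inet_csum(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint32_t sum = 0;

	while (len > 1) {
		sum += (uint32_t)p[0] << 8 | p[1];
		p += 2;
		len -= 2;
	}
	if (len)			/* odd trailing byte */
		sum += (uint32_t)p[0] << 8;
	while (sum >> 16)		/* fold carries back in */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```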

WC-bug-id: https://jira.whamcloud.com/browse/LU-11224
Lustre-commit: 82fe90a1d07d ("LU-11224 obd: use correct ip_compute_csum() version")
Signed-off-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-on: https://review.whamcloud.com/32953
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Li Dongyang <dongyangli@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/integrity.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/lustre/obdclass/integrity.c b/fs/lustre/obdclass/integrity.c
index 8348b16..5cb9a25 100644
--- a/fs/lustre/obdclass/integrity.c
+++ b/fs/lustre/obdclass/integrity.c
@@ -28,7 +28,7 @@
  */
 #include <linux/blkdev.h>
 #include <linux/crc-t10dif.h>
-#include <asm-generic/checksum.h>
+#include <asm/checksum.h>
 #include <obd_class.h>
 #include <obd_cksum.h>
 
-- 
1.8.3.1

* [lustre-devel] [PATCH 099/622] lustre: osc: serialize access to idle_timeout vs cleanup
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (97 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 098/622] lustre: obd: use correct ip_compute_csum() version James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 100/622] lustre: mdc: remove obsolete intent opcodes James Simmons
                   ` (523 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Alex Zhuravlev <bzzz@whamcloud.com>

Use lprocfs_climp_check() and up_read(), as cl_import
can disappear due to umount.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11175
Lustre-commit: 5874da0b670b ("LU-11175 osc: serialize access to idle_timeout vs cleanup")
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32883
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/lproc_osc.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/osc/lproc_osc.c b/fs/lustre/osc/lproc_osc.c
index 0a12079..efb4998 100644
--- a/fs/lustre/osc/lproc_osc.c
+++ b/fs/lustre/osc/lproc_osc.c
@@ -604,8 +604,15 @@ static ssize_t idle_timeout_show(struct kobject *kobj, struct attribute *attr,
 	struct obd_device *obd = container_of(kobj, struct obd_device,
 					      obd_kset.kobj);
 	struct client_obd *cli = &obd->u.cli;
+	int ret;
 
-	return sprintf(buf, "%u\n", cli->cl_import->imp_idle_timeout);
+	ret = lprocfs_climp_check(obd);
+	if (ret)
+		return ret;
+	ret = sprintf(buf, "%u\n", cli->cl_import->imp_idle_timeout);
+	up_read(&obd->u.cli.cl_sem);
+
+	return ret;
 }
 
 static ssize_t idle_timeout_store(struct kobject *kobj, struct attribute *attr,
@@ -625,6 +632,10 @@ static ssize_t idle_timeout_store(struct kobject *kobj, struct attribute *attr,
 	if (val > CONNECTION_SWITCH_MAX)
 		return -ERANGE;
 
+	rc = lprocfs_climp_check(obd);
+	if (rc)
+		return rc;
+
 	cli->cl_import->imp_idle_timeout = val;
 
 	/* to initiate the connection if it's in IDLE state */
@@ -633,6 +644,7 @@ static ssize_t idle_timeout_store(struct kobject *kobj, struct attribute *attr,
 		if (req)
 			ptlrpc_req_finished(req);
 	}
+	up_read(&obd->u.cli.cl_sem);
 
 	return count;
 }
@@ -645,12 +657,18 @@ static ssize_t idle_connect_store(struct kobject *kobj, struct attribute *attr,
 					      obd_kset.kobj);
 	struct client_obd *cli = &dev->u.cli;
 	struct ptlrpc_request *req;
+	int rc;
+
+	rc = lprocfs_climp_check(dev);
+	if (rc)
+		return rc;
 
 	/* to initiate the connection if it's in IDLE state */
 	req = ptlrpc_request_alloc(cli->cl_import, &RQF_OST_STATFS);
 	if (req)
 		ptlrpc_req_finished(req);
 	ptlrpc_pinger_force(cli->cl_import);
+	up_read(&dev->u.cli.cl_sem);
 
 	return count;
 }
-- 
1.8.3.1

* [lustre-devel] [PATCH 100/622] lustre: mdc: remove obsolete intent opcodes
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (98 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 099/622] lustre: osc: serialize access to idle_timeout vs cleanup James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 101/622] lustre: llite: fix setstripe for specific osts upon dir James Simmons
                   ` (522 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: "John L. Hammond" <jhammond@whamcloud.com>

In enum ldlm_intent_flags, remove the obsolete constants IT_UNLINK,
IT_TRUNC, IT_EXEC, IT_PIN, IT_SETXATTR. Remove any handling code for
these opcodes.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11014
Lustre-commit: 511ea5850f25 ("LU-11014 mdc: remove obsolete intent opcodes")
Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32361
Reviewed-by: Fan Yong <fan.yong@intel.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_req_layout.h  |  1 -
 fs/lustre/include/obd.h                |  4 +---
 fs/lustre/ldlm/ldlm_lock.c             |  2 --
 fs/lustre/mdc/mdc_locks.c              | 44 +++-------------------------------
 fs/lustre/ptlrpc/layout.c              | 15 ------------
 include/uapi/linux/lustre/lustre_idl.h | 14 +++++------
 6 files changed, 11 insertions(+), 69 deletions(-)

diff --git a/fs/lustre/include/lustre_req_layout.h b/fs/lustre/include/lustre_req_layout.h
index 807d080..ed4fc42 100644
--- a/fs/lustre/include/lustre_req_layout.h
+++ b/fs/lustre/include/lustre_req_layout.h
@@ -203,7 +203,6 @@ void req_capsule_shrink(struct req_capsule *pill,
 extern struct req_format RQF_LDLM_INTENT_GETATTR;
 extern struct req_format RQF_LDLM_INTENT_OPEN;
 extern struct req_format RQF_LDLM_INTENT_CREATE;
-extern struct req_format RQF_LDLM_INTENT_UNLINK;
 extern struct req_format RQF_LDLM_INTENT_GETXATTR;
 extern struct req_format RQF_LDLM_CANCEL;
 extern struct req_format RQF_LDLM_CALLBACK;
diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index de9642f..175a99f 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -700,8 +700,6 @@ static inline int it_to_lock_mode(struct lookup_intent *it)
 		return LCK_PR;
 	else if (it->it_op &  IT_GETXATTR)
 		return LCK_PR;
-	else if (it->it_op &  IT_SETXATTR)
-		return LCK_PW;
 
 	LASSERTF(0, "Invalid it_op: %d\n", it->it_op);
 	return -EINVAL;
@@ -730,7 +728,7 @@ enum md_cli_flags {
  */
 static inline bool it_has_reply_body(const struct lookup_intent *it)
 {
-	return it->it_op & (IT_OPEN | IT_UNLINK | IT_LOOKUP | IT_GETATTR);
+	return it->it_op & (IT_OPEN | IT_LOOKUP | IT_GETATTR);
 }
 
 struct md_op_data {
diff --git a/fs/lustre/ldlm/ldlm_lock.c b/fs/lustre/ldlm/ldlm_lock.c
index 1bf387a..4f746ad 100644
--- a/fs/lustre/ldlm/ldlm_lock.c
+++ b/fs/lustre/ldlm/ldlm_lock.c
@@ -123,8 +123,6 @@ const char *ldlm_it2str(enum ldlm_intent_flags it)
 		return "getattr";
 	case IT_LOOKUP:
 		return "lookup";
-	case IT_UNLINK:
-		return "unlink";
 	case IT_GETXATTR:
 		return "getxattr";
 	case IT_LAYOUT:
diff --git a/fs/lustre/mdc/mdc_locks.c b/fs/lustre/mdc/mdc_locks.c
index abbc908..80f2e10 100644
--- a/fs/lustre/mdc/mdc_locks.c
+++ b/fs/lustre/mdc/mdc_locks.c
@@ -430,42 +430,6 @@ static int mdc_save_lovea(struct ptlrpc_request *req,
 	return req;
 }
 
-static struct ptlrpc_request *mdc_intent_unlink_pack(struct obd_export *exp,
-						     struct lookup_intent *it,
-						     struct md_op_data *op_data)
-{
-	struct ptlrpc_request *req;
-	struct obd_device *obddev = class_exp2obd(exp);
-	struct ldlm_intent *lit;
-	int rc;
-
-	req = ptlrpc_request_alloc(class_exp2cliimp(exp),
-				   &RQF_LDLM_INTENT_UNLINK);
-	if (!req)
-		return ERR_PTR(-ENOMEM);
-
-	req_capsule_set_size(&req->rq_pill, &RMF_NAME, RCL_CLIENT,
-			     op_data->op_namelen + 1);
-
-	rc = ldlm_prep_enqueue_req(exp, req, NULL, 0);
-	if (rc) {
-		ptlrpc_request_free(req);
-		return ERR_PTR(rc);
-	}
-
-	/* pack the intent */
-	lit = req_capsule_client_get(&req->rq_pill, &RMF_LDLM_INTENT);
-	lit->opc = (u64)it->it_op;
-
-	/* pack the intended request */
-	mdc_unlink_pack(req, op_data);
-
-	req_capsule_set_size(&req->rq_pill, &RMF_MDT_MD, RCL_SERVER,
-			     obddev->u.cli.cl_default_mds_easize);
-	ptlrpc_request_set_replen(req);
-	return req;
-}
-
 static struct ptlrpc_request *
 mdc_intent_getattr_pack(struct obd_export *exp, struct lookup_intent *it,
 			struct md_op_data *op_data, u32 acl_bufsize)
@@ -820,18 +784,18 @@ int mdc_enqueue_base(struct obd_export *exp, struct ldlm_enqueue_info *einfo,
 		LASSERT(!policy);
 
 		saved_flags |= LDLM_FL_HAS_INTENT;
-		if (it->it_op & (IT_UNLINK | IT_GETATTR | IT_READDIR))
+		if (it->it_op & (IT_GETATTR | IT_READDIR))
 			policy = &update_policy;
 		else if (it->it_op & IT_LAYOUT)
 			policy = &layout_policy;
-		else if (it->it_op & (IT_GETXATTR | IT_SETXATTR))
+		else if (it->it_op & IT_GETXATTR)
 			policy = &getxattr_policy;
 		else
 			policy = &lookup_policy;
 	}
 
 	generation = obddev->u.cli.cl_import->imp_generation;
-	if (!it || (it->it_op & (IT_CREAT | IT_OPEN_CREAT)))
+	if (!it || (it->it_op & (IT_OPEN | IT_CREAT)))
 		acl_bufsize = imp->imp_connect_data.ocd_max_easize;
 	else
 		acl_bufsize = LUSTRE_POSIX_ACL_MAX_SIZE_OLD;
@@ -845,8 +809,6 @@ int mdc_enqueue_base(struct obd_export *exp, struct ldlm_enqueue_info *einfo,
 		res_id.name[3] = LDLM_FLOCK;
 	} else if (it->it_op & IT_OPEN) {
 		req = mdc_intent_open_pack(exp, it, op_data, acl_bufsize);
-	} else if (it->it_op & IT_UNLINK) {
-		req = mdc_intent_unlink_pack(exp, it, op_data);
 	} else if (it->it_op & (IT_GETATTR | IT_LOOKUP)) {
 		req = mdc_intent_getattr_pack(exp, it, op_data, acl_bufsize);
 	} else if (it->it_op & IT_READDIR) {
diff --git a/fs/lustre/ptlrpc/layout.c b/fs/lustre/ptlrpc/layout.c
index ae573a2..70344b9 100644
--- a/fs/lustre/ptlrpc/layout.c
+++ b/fs/lustre/ptlrpc/layout.c
@@ -462,15 +462,6 @@
 	&RMF_FILE_SECCTX
 };
 
-static const struct req_msg_field *ldlm_intent_unlink_client[] = {
-	&RMF_PTLRPC_BODY,
-	&RMF_DLM_REQ,
-	&RMF_LDLM_INTENT,
-	&RMF_REC_REINT,    /* coincides with mds_reint_unlink_client[] */
-	&RMF_CAPA1,
-	&RMF_NAME
-};
-
 static const struct req_msg_field *ldlm_intent_getxattr_client[] = {
 	&RMF_PTLRPC_BODY,
 	&RMF_DLM_REQ,
@@ -756,7 +747,6 @@
 	&RQF_LDLM_INTENT_GETATTR,
 	&RQF_LDLM_INTENT_OPEN,
 	&RQF_LDLM_INTENT_CREATE,
-	&RQF_LDLM_INTENT_UNLINK,
 	&RQF_LDLM_INTENT_GETXATTR,
 	&RQF_LLOG_ORIGIN_HANDLE_CREATE,
 	&RQF_LLOG_ORIGIN_HANDLE_NEXT_BLOCK,
@@ -1431,11 +1421,6 @@ struct req_format RQF_LDLM_INTENT_CREATE =
 			ldlm_intent_create_client, ldlm_intent_getattr_server);
 EXPORT_SYMBOL(RQF_LDLM_INTENT_CREATE);
 
-struct req_format RQF_LDLM_INTENT_UNLINK =
-	DEFINE_REQ_FMT0("LDLM_INTENT_UNLINK",
-			ldlm_intent_unlink_client, ldlm_intent_server);
-EXPORT_SYMBOL(RQF_LDLM_INTENT_UNLINK);
-
 struct req_format RQF_LDLM_INTENT_GETXATTR =
 	DEFINE_REQ_FMT0("LDLM_INTENT_GETXATTR",
 			ldlm_intent_getxattr_client,
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index dc9872cf3..249a3d5 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -2190,19 +2190,19 @@ struct ldlm_flock_wire {
 enum ldlm_intent_flags {
 	IT_OPEN		= 0x00000001,
 	IT_CREAT	= 0x00000002,
-	IT_OPEN_CREAT	= 0x00000003,
-	IT_READDIR	= 0x00000004,
+	IT_OPEN_CREAT	= IT_OPEN | IT_CREAT, /* To allow case label. */
+	IT_READDIR	= 0x00000004, /* Used by mdc, not put on the wire. */
 	IT_GETATTR	= 0x00000008,
 	IT_LOOKUP	= 0x00000010,
-	IT_UNLINK	= 0x00000020,
-	IT_TRUNC	= 0x00000040,
+/*	IT_UNLINK	= 0x00000020, Obsolete. */
+/*	IT_TRUNC	= 0x00000040, Obsolete. */
 	IT_GETXATTR	= 0x00000080,
-	IT_EXEC		= 0x00000100,
-	IT_PIN		= 0x00000200,
+/*	IT_EXEC		= 0x00000100, Obsolete. */
+/*	IT_PIN		= 0x00000200, Obsolete. */
 	IT_LAYOUT	= 0x00000400,
 	IT_QUOTA_DQACQ	= 0x00000800,
 	IT_QUOTA_CONN	= 0x00001000,
-	IT_SETXATTR	= 0x00002000,
+/*	IT_SETXATTR	= 0x00002000, Obsolete. */
 	IT_GLIMPSE     = 0x00004000,
 	IT_BRW	       = 0x00008000,
 };
-- 
1.8.3.1

* [lustre-devel] [PATCH 101/622] lustre: llite: fix setstripe for specific osts upon dir
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (99 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 100/622] lustre: mdc: remove obsolete intent opcodes James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 102/622] lustre: osc: enable/disable OSC grant shrink James Simmons
                   ` (521 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

The LOV_USER_MAGIC_SPECIFIC function is broken and was not
available for setting on a directory.

1) llite did not handle the LOV_USER_MAGIC_SPECIFIC case
   properly for dir {set,get}_stripe, and the ioctl
   LL_IOC_LOV_SETSTRIPE did not allocate a large enough
   buffer to copy the OST lists from userspace.

2) lod_get_default_lov_striping() did not handle the
   LOV_USER_MAGIC_SPECIFIC type, so newly created
   files/dirs would not inherit the parent setting.

3) there was no test case covering the lfs setstripe
   '-o' interface, which made it hard to figure out
   when this function was broken.
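
The buffer fix in ll_dir_ioctl() follows the common two-stage copy
pattern: read the fixed-size v1 header first, compute the real size
from it, then allocate and copy the full variable-length layout. A
userspace sketch with illustrative names (memcpy() stands in for
copy_from_user(), and the structs are simplified stand-ins for
lov_user_md):

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct lum_hdr {		/* stand-in for lov_user_md_v1 */
	uint32_t magic;
	uint16_t stripe_count;
};

struct lum_full {		/* header plus per-OST object array */
	struct lum_hdr hdr;
	uint16_t objects[];
};

/* Two-stage copy: peek at the header to learn the real size, then
 * allocate and copy the whole thing.  Copying only sizeof(header)
 * bytes, as the old code effectively did, truncates the OST list
 * of a LOV_USER_MAGIC_SPECIFIC layout. */
static struct lum_full *copy_lum(const void *user, int *rc)
{
	struct lum_hdr hdr;
	struct lum_full *lum;
	size_t size;

	memcpy(&hdr, user, sizeof(hdr));	/* stage 1: header only */
	if (hdr.stripe_count > 2000) {		/* LOV_MAX_STRIPE_COUNT */
		*rc = -EINVAL;
		return NULL;
	}
	size = sizeof(hdr) + hdr.stripe_count * sizeof(uint16_t);
	lum = malloc(size);
	if (!lum) {
		*rc = -ENOMEM;
		return NULL;
	}
	memcpy(lum, user, size);		/* stage 2: full buffer */
	*rc = 0;
	return lum;
}
```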

WC-bug-id: https://jira.whamcloud.com/browse/LU-11146
Lustre-commit: 083d62ee6de5 ("LU-11146 lustre: fix setstripe for specific osts upon dir")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/32814
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/dir.c | 71 ++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 56 insertions(+), 15 deletions(-)

diff --git a/fs/lustre/llite/dir.c b/fs/lustre/llite/dir.c
index 751d0183..06f7bd3 100644
--- a/fs/lustre/llite/dir.c
+++ b/fs/lustre/llite/dir.c
@@ -541,6 +541,21 @@ int ll_dir_setstripe(struct inode *inode, struct lov_user_md *lump,
 			lum_size = sizeof(struct lmv_user_md);
 			break;
 		}
+		case LOV_USER_MAGIC_SPECIFIC: {
+			struct lov_user_md_v3 *v3 =
+					(struct lov_user_md_v3 *)lump;
+			if (v3->lmm_stripe_count > LOV_MAX_STRIPE_COUNT)
+				return -EINVAL;
+			if (lump->lmm_magic !=
+			    cpu_to_le32(LOV_USER_MAGIC_SPECIFIC)) {
+				lustre_swab_lov_user_md_v3(v3);
+				lustre_swab_lov_user_md_objects(v3->lmm_objects,
+						v3->lmm_stripe_count);
+			}
+			lum_size = lov_user_md_size(v3->lmm_stripe_count,
+						    LOV_USER_MAGIC_SPECIFIC);
+			break;
+		}
 		default: {
 			CDEBUG(D_IOCTL,
 			       "bad userland LOV MAGIC: %#08x != %#08x nor %#08x\n",
@@ -695,6 +710,16 @@ int ll_dir_getstripe(struct inode *inode, void **plmm, int *plmm_size,
 		if (cpu_to_le32(LMV_USER_MAGIC) != LMV_USER_MAGIC)
 			lustre_swab_lmv_user_md((struct lmv_user_md *)lmm);
 		break;
+	case LOV_USER_MAGIC_SPECIFIC: {
+		struct lov_user_md_v3 *v3 = (struct lov_user_md_v3 *)lmm;
+
+		if (cpu_to_le32(LOV_MAGIC) != LOV_MAGIC) {
+			lustre_swab_lov_user_md_v3(v3);
+			lustre_swab_lov_user_md_objects(v3->lmm_objects,
+							v3->lmm_stripe_count);
+			}
+		}
+		break;
 	default:
 		CERROR("unknown magic: %lX\n", (unsigned long)lmm->lmm_magic);
 		rc = -EPROTO;
@@ -1230,35 +1255,51 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	}
 	case LL_IOC_LOV_SETSTRIPE_NEW:
 	case LL_IOC_LOV_SETSTRIPE: {
-		struct lov_user_md_v3 lumv3;
-		struct lov_user_md_v1 *lumv1 = (struct lov_user_md_v1 *)&lumv3;
+		struct lov_user_md_v3 *lumv3 = NULL;
+		struct lov_user_md_v1 lumv1;
+		struct lov_user_md_v1 *lumv1_ptr = &lumv1;
 		struct lov_user_md_v1 __user *lumv1p = (void __user *)arg;
 		struct lov_user_md_v3 __user *lumv3p = (void __user *)arg;
+		int lum_size;
 
 		int set_default = 0;
 
 		BUILD_BUG_ON(sizeof(struct lov_user_md_v3) <=
 			     sizeof(struct lov_comp_md_v1));
-		BUILD_BUG_ON(sizeof(lumv3) != sizeof(*lumv3p));
-		BUILD_BUG_ON(sizeof(lumv3.lmm_objects[0]) !=
-			     sizeof(lumv3p->lmm_objects[0]));
+		BUILD_BUG_ON(sizeof(*lumv3) != sizeof(*lumv3p));
 		/* first try with v1 which is smaller than v3 */
-		if (copy_from_user(lumv1, lumv1p, sizeof(*lumv1)))
+		if (copy_from_user(&lumv1, lumv1p, sizeof(lumv1)))
 			return -EFAULT;
 
-		if (lumv1->lmm_magic == LOV_USER_MAGIC_V3) {
-			if (copy_from_user(&lumv3, lumv3p, sizeof(lumv3)))
-				return -EFAULT;
-			if (lumv3.lmm_magic != LOV_USER_MAGIC_V3)
-				return -EINVAL;
-		}
-
 		if (is_root_inode(inode))
 			set_default = 1;
 
-		/* in v1 and v3 cases lumv1 points to data */
-		rc = ll_dir_setstripe(inode, lumv1, set_default);
+		switch (lumv1.lmm_magic) {
+		case LOV_USER_MAGIC_V3:
+		case LOV_USER_MAGIC_SPECIFIC:
+			lum_size = ll_lov_user_md_size(&lumv1);
+			if (lum_size < 0)
+				return lum_size;
+			lumv3 = kzalloc(lum_size, GFP_NOFS);
+			if (!lumv3)
+				return -ENOMEM;
+			if (copy_from_user(lumv3, lumv3p, lum_size)) {
+				rc = -EFAULT;
+				goto out;
+			}
+			lumv1_ptr = (struct lov_user_md_v1 *)lumv3;
+			break;
+		case LOV_USER_MAGIC_V1:
+			break;
+		default:
+			rc = -ENOTSUPP;
+			goto out;
+		}
 
+		/* in v1 and v3 cases lumv1 points to data */
+		rc = ll_dir_setstripe(inode, lumv1_ptr, set_default);
+out:
+		kfree(lumv3);
 		return rc;
 	}
 	case LL_IOC_LMV_GETSTRIPE: {
-- 
1.8.3.1

* [lustre-devel] [PATCH 102/622] lustre: osc: enable/disable OSC grant shrink
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (100 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 101/622] lustre: llite: fix setstripe for specific osts upon dir James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 103/622] lustre: protocol: MDT as a statfs proxy James Simmons
                   ` (520 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Bobi Jam <bobijam@whamcloud.com>

Add an OSC sysfs interface to enable/disable client's grant shrink
feature.

lctl get_param osc.*.grant_shrink
lctl set_param osc.*.grant_shrink={0,1}
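
The store side of the new attribute reduces to parsing a boolean and
toggling one connect flag, with the rule that the flag may only be
re-enabled if it was negotiated at connect time. A userspace sketch of
that rule, with an illustrative flag value:

```c
#include <stdint.h>

#define CONNECT_GRANT_SHRINK	0x1ULL	/* illustrative bit value */

/* Mirror of the store logic: clearing is always allowed, but
 * re-enabling is only permitted when the client originally
 * negotiated the feature with the server (imp_connect_flags_orig
 * in the real code). */
static uint64_t toggle_grant_shrink(uint64_t flags, uint64_t orig,
				    int enable)
{
	if (!enable)
		flags &= ~CONNECT_GRANT_SHRINK;
	else if (orig & CONNECT_GRANT_SHRINK)
		flags |= CONNECT_GRANT_SHRINK;
	return flags;
}
```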

WC-bug-id: https://jira.whamcloud.com/browse/LU-8708
Lustre-commit: 3e070e30a98d ("LU-8708 osc: enable/disable OSC grant shrink")
Signed-off-by: Bobi Jam <bobijam@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/23203
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/lproc_osc.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/fs/lustre/osc/lproc_osc.c b/fs/lustre/osc/lproc_osc.c
index efb4998..16de266 100644
--- a/fs/lustre/osc/lproc_osc.c
+++ b/fs/lustre/osc/lproc_osc.c
@@ -674,6 +674,72 @@ static ssize_t idle_connect_store(struct kobject *kobj, struct attribute *attr,
 }
 LUSTRE_WO_ATTR(idle_connect);
 
+static ssize_t grant_shrink_show(struct kobject *kobj, struct attribute *attr,
+				 char *buf)
+{
+	struct obd_device *obd = container_of(kobj, struct obd_device,
+					      obd_kset.kobj);
+	struct client_obd *cli = &obd->u.cli;
+	struct obd_connect_data *ocd;
+	ssize_t len;
+
+	len = lprocfs_climp_check(obd);
+	if (len)
+		return len;
+
+	ocd = &cli->cl_import->imp_connect_data;
+
+	len = snprintf(buf, PAGE_SIZE, "%d\n",
+		       !!OCD_HAS_FLAG(ocd, GRANT_SHRINK));
+	up_read(&obd->u.cli.cl_sem);
+
+	return len;
+}
+
+static ssize_t grant_shrink_store(struct kobject *kobj, struct attribute *attr,
+				  const char *buffer, size_t count)
+{
+	struct obd_device *dev = container_of(kobj, struct obd_device,
+					      obd_kset.kobj);
+	struct client_obd *cli = &dev->u.cli;
+	struct obd_connect_data *ocd;
+	bool val;
+	int rc;
+
+	if (!dev)
+		return 0;
+
+	rc = kstrtobool(buffer, &val);
+	if (rc)
+		return rc;
+
+	rc = lprocfs_climp_check(dev);
+	if (rc)
+		return rc;
+
+	ocd = &cli->cl_import->imp_connect_data;
+
+	if (!val) {
+		if (OCD_HAS_FLAG(ocd, GRANT_SHRINK))
+			ocd->ocd_connect_flags &= ~OBD_CONNECT_GRANT_SHRINK;
+	} else {
+		/**
+		 * server replied obd_connect_data is always bigger, so
+		 * client's imp_connect_flags_orig are always supported
+		 * by the server
+		 */
+		if (!OCD_HAS_FLAG(ocd, GRANT_SHRINK) &&
+		    cli->cl_import->imp_connect_flags_orig &
+		    OBD_CONNECT_GRANT_SHRINK)
+			ocd->ocd_connect_flags |= OBD_CONNECT_GRANT_SHRINK;
+	}
+
+	up_read(&dev->u.cli.cl_sem);
+
+	return count;
+}
+LUSTRE_RW_ATTR(grant_shrink);
+
 LPROC_SEQ_FOPS_RO_TYPE(osc, connect_flags);
 LPROC_SEQ_FOPS_RO_TYPE(osc, server_uuid);
 LPROC_SEQ_FOPS_RO_TYPE(osc, timeouts);
@@ -889,6 +955,7 @@ void lproc_osc_attach_seqstat(struct obd_device *dev)
 	&lustre_attr_ping.attr,
 	&lustre_attr_idle_timeout.attr,
 	&lustre_attr_idle_connect.attr,
+	&lustre_attr_grant_shrink.attr,
 	NULL,
 };
 
-- 
1.8.3.1

* [lustre-devel] [PATCH 103/622] lustre: protocol: MDT as a statfs proxy
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (101 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 102/622] lustre: osc: enable/disable OSC grant shrink James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 104/622] lustre: ldlm: correct logic in ldlm_prepare_lru_list() James Simmons
                   ` (519 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Alex Zhuravlev <bzzz@whamcloud.com>

The MDT can act as a proxy for statfs data. This should make
df faster (RTT vs RTT*(#MDTs+1)) and enables idling connections,
so that clients don't need to connect to each OST just to report
statfs data. The protocol has been changed slightly to let the
MDT differentiate between its own and aggregated statfs.

Also, obd_statfs has a new field "granted" where the OST reports
how much space has been granted to the requesting MDT, so that
space can be added to the available space.

The client's NID is used to distribute MDS_STATFS requests among
MDTs.
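
The per-client selection in lmv_select_statfs_mdt() reduces to the NID
taken modulo the MDT count. A userspace sketch with illustrative names
and flag values (the real code iterates LNet interfaces via LNetGetId()
to find a non-loopback NID):

```c
#include <stdint.h>

#define STATFS_FOR_MDT0	0x4	/* illustrative OBD_STATFS_FOR_MDT0 */

/* Each client picks a starting MDT for statfs from its NID, so
 * statfs load spreads across MDTs instead of every client asking
 * MDT0.  Mount-time statfs forces index 0, since clients can mount
 * as long as MDT0 is in service. */
static uint32_t select_statfs_mdt(uint64_t nid, uint32_t tgt_count,
				  uint32_t flags)
{
	if ((flags & STATFS_FOR_MDT0) || tgt_count == 1)
		return 0;
	return (uint32_t)(nid % tgt_count);
}
```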

WC-bug-id: https://jira.whamcloud.com/browse/LU-10018
Lustre-commit: b500d5193360 ("LU-10018 protocol: MDT as a statfs proxy")
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/29136
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd.h                 |  1 +
 fs/lustre/include/obd_class.h           |  7 +++-
 fs/lustre/include/obd_support.h         |  2 +
 fs/lustre/llite/llite_lib.c             |  9 ++++-
 fs/lustre/lmv/lmv_obd.c                 | 65 ++++++++++++++++++++++++++-------
 fs/lustre/mdc/mdc_request.c             | 13 +++++++
 fs/lustre/ptlrpc/layout.c               |  2 +-
 fs/lustre/ptlrpc/pack_generic.c         |  2 +-
 fs/lustre/ptlrpc/wiretest.c             |  8 ++--
 include/uapi/linux/lustre/lustre_idl.h  |  3 +-
 include/uapi/linux/lustre/lustre_user.h |  7 ++--
 11 files changed, 92 insertions(+), 27 deletions(-)

diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index 175a99f..9286755 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -442,6 +442,7 @@ struct lmv_obd {
 
 	u32			tgts_size; /* size of tgts array */
 	struct lmv_tgt_desc	**tgts;
+	int			lmv_statfs_start;
 
 	struct obd_connect_data	conn_data;
 	struct kobject		*lmv_tgts_kobj;
diff --git a/fs/lustre/include/obd_class.h b/fs/lustre/include/obd_class.h
index 0153c50..a3ef5d5 100644
--- a/fs/lustre/include/obd_class.h
+++ b/fs/lustre/include/obd_class.h
@@ -47,6 +47,8 @@
 #define OBD_STATFS_FROM_CACHE   0x0002
 /* the statfs is only for retrieving information from MDT0 */
 #define OBD_STATFS_FOR_MDT0	0x0004
+/* get aggregated statfs from MDT */
+#define OBD_STATFS_SUM		0x0008
 
 /* OBD Device Declarations */
 extern rwlock_t obd_dev_lock;
@@ -947,7 +949,10 @@ static inline int obd_statfs(const struct lu_env *env, struct obd_export *exp,
 
 	CDEBUG(D_SUPER, "osfs %lld, max_age %lld\n",
 	       obd->obd_osfs_age, max_age);
-	if (obd->obd_osfs_age < max_age) {
+	/* ignore cache if aggregated isn't expected */
+	if (obd->obd_osfs_age < max_age ||
+	    ((obd->obd_osfs.os_state & OS_STATE_SUM) &&
+	     !(flags & OBD_STATFS_SUM))) {
 		rc = OBP(obd, statfs)(env, exp, osfs, max_age, flags);
 		if (rc == 0) {
 			spin_lock(&obd->obd_osfs_lock);
diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index 28becfa..3d14723 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -137,7 +137,9 @@
 #define OBD_FAIL_MDS_GET_ROOT_NET			0x11b
 #define OBD_FAIL_MDS_GET_ROOT_PACK			0x11c
 #define OBD_FAIL_MDS_STATFS_PACK			0x11d
+#define OBD_FAIL_MDS_STATFS_SUM_PACK			0x11d
 #define OBD_FAIL_MDS_STATFS_NET				0x11e
+#define OBD_FAIL_MDS_STATFS_SUM_NET			0x11e
 #define OBD_FAIL_MDS_GETATTR_NAME_NET			0x11f
 #define OBD_FAIL_MDS_PIN_NET				0x120
 #define OBD_FAIL_MDS_UNPIN_NET				0x121
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index c04146f..8b3e2a3 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -211,7 +211,8 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 
 	data->ocd_connect_flags2 = OBD_CONNECT2_FLR |
 				   OBD_CONNECT2_LOCK_CONVERT |
-				   OBD_CONNECT2_DIR_MIGRATE;
+				   OBD_CONNECT2_DIR_MIGRATE |
+				   OBD_CONNECT2_SUM_STATFS;
 
 	if (sbi->ll_flags & LL_SBI_LRU_RESIZE)
 		data->ocd_connect_flags |= OBD_CONNECT_LRU_RESIZE;
@@ -1751,6 +1752,9 @@ int ll_statfs_internal(struct ll_sb_info *sbi, struct obd_statfs *osfs,
 	       osfs->os_bavail, osfs->os_blocks, osfs->os_ffree,
 	       osfs->os_files);
 
+	if (osfs->os_state & OS_STATE_SUM)
+		goto out;
+
 	if (sbi->ll_flags & LL_SBI_LAZYSTATFS)
 		flags |= OBD_STATFS_NODELAY;
 
@@ -1779,6 +1783,7 @@ int ll_statfs_internal(struct ll_sb_info *sbi, struct obd_statfs *osfs,
 		osfs->os_ffree = obd_osfs.os_ffree;
 	}
 
+out:
 	return rc;
 }
 
@@ -1793,7 +1798,7 @@ int ll_statfs(struct dentry *de, struct kstatfs *sfs)
 	ll_stats_ops_tally(ll_s2sbi(sb), LPROC_LL_STAFS, 1);
 
 	/* Some amount of caching on the client is allowed */
-	rc = ll_statfs_internal(ll_s2sbi(sb), &osfs, 0);
+	rc = ll_statfs_internal(ll_s2sbi(sb), &osfs, OBD_STATFS_SUM);
 	if (rc)
 		return rc;
 
diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index c7bf8c7..90a46c4 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -1325,6 +1325,33 @@ static int lmv_process_config(struct obd_device *obd, u32 len, void *buf)
 	return rc;
 }
 
+static int lmv_select_statfs_mdt(struct lmv_obd *lmv, u32 flags)
+{
+	int i;
+
+	if (flags & OBD_STATFS_FOR_MDT0)
+		return 0;
+
+	if (lmv->lmv_statfs_start || lmv->desc.ld_tgt_count == 1)
+		return lmv->lmv_statfs_start;
+
+	/* choose initial MDT for this client */
+	for (i = 0;; i++) {
+		struct lnet_process_id lnet_id;
+
+		if (LNetGetId(i, &lnet_id) == -ENOENT)
+			break;
+
+		if (LNET_NETTYP(LNET_NIDNET(lnet_id.nid)) != LOLND) {
+			lmv->lmv_statfs_start =
+				lnet_id.nid % lmv->desc.ld_tgt_count;
+			break;
+		}
+	}
+
+	return lmv->lmv_statfs_start;
+}
+
 static int lmv_statfs(const struct lu_env *env, struct obd_export *exp,
 		      struct obd_statfs *osfs, time64_t max_age, u32 flags)
 {
@@ -1332,41 +1359,51 @@ static int lmv_statfs(const struct lu_env *env, struct obd_export *exp,
 	struct lmv_obd *lmv = &obd->u.lmv;
 	struct obd_statfs *temp;
 	int rc = 0;
-	u32 i;
+	u32 i, idx;
 
 	temp = kzalloc(sizeof(*temp), GFP_NOFS);
 	if (!temp)
 		return -ENOMEM;
 
-	for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
-		if (!lmv->tgts[i] || !lmv->tgts[i]->ltd_exp)
+	/* distribute statfs among MDTs */
+	idx = lmv_select_statfs_mdt(lmv, flags);
+
+	for (i = 0; i < lmv->desc.ld_tgt_count; i++, idx++) {
+		idx = idx % lmv->desc.ld_tgt_count;
+		if (!lmv->tgts[idx] || !lmv->tgts[idx]->ltd_exp)
 			continue;
 
-		rc = obd_statfs(env, lmv->tgts[i]->ltd_exp, temp,
+		rc = obd_statfs(env, lmv->tgts[idx]->ltd_exp, temp,
 				max_age, flags);
 		if (rc) {
 			CERROR("can't stat MDS #%d (%s), error %d\n", i,
-			       lmv->tgts[i]->ltd_exp->exp_obd->obd_name,
+			       lmv->tgts[idx]->ltd_exp->exp_obd->obd_name,
 			       rc);
 			goto out_free_temp;
 		}
 
+		if (temp->os_state & OS_STATE_SUM ||
+		    flags == OBD_STATFS_FOR_MDT0) {
+			/* Reset to the last aggregated values
+			 * and don't sum with non-aggregated data.
+			 * If the statfs is from mount, it needs to retrieve
+			 * necessary information from MDT0. i.e. mount does
+			 * not need the merged osfs from all of MDT. Also
+			 * clients can be mounted as long as MDT0 is in
+			 * service
+			 */
+			*osfs = *temp;
+			break;
+		}
+
 		if (i == 0) {
 			*osfs = *temp;
-			/* If the statfs is from mount, it will needs
-			 * retrieve necessary information from MDT0.
-			 * i.e. mount does not need the merged osfs
-			 * from all of MDT.
-			 * And also clients can be mounted as long as
-			 * MDT0 is in service
-			 */
-			if (flags & OBD_STATFS_FOR_MDT0)
-				goto out_free_temp;
 		} else {
 			osfs->os_bavail += temp->os_bavail;
 			osfs->os_blocks += temp->os_blocks;
 			osfs->os_ffree += temp->os_ffree;
 			osfs->os_files += temp->os_files;
+			osfs->os_granted += temp->os_granted;
 		}
 	}
 
diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index b173937..3341761 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -1495,6 +1495,19 @@ static int mdc_statfs(const struct lu_env *env,
 		goto output;
 	}
 
+	if ((flags & OBD_STATFS_SUM) &&
+	    (exp_connect_flags2(exp) & OBD_CONNECT2_SUM_STATFS)) {
+		/* request aggregated states */
+		struct mdt_body *body;
+
+		body = req_capsule_client_get(&req->rq_pill, &RMF_MDT_BODY);
+		if (!body) {
+			rc = -EPROTO;
+			goto out;
+		}
+		body->mbo_valid = OBD_MD_FLAGSTATFS;
+	}
+
 	ptlrpc_request_set_replen(req);
 
 	if (flags & OBD_STATFS_NODELAY) {
diff --git a/fs/lustre/ptlrpc/layout.c b/fs/lustre/ptlrpc/layout.c
index 70344b9..225a73e 100644
--- a/fs/lustre/ptlrpc/layout.c
+++ b/fs/lustre/ptlrpc/layout.c
@@ -1252,7 +1252,7 @@ struct req_format RQF_MDS_GET_ROOT =
 EXPORT_SYMBOL(RQF_MDS_GET_ROOT);
 
 struct req_format RQF_MDS_STATFS =
-	DEFINE_REQ_FMT0("MDS_STATFS", empty, obd_statfs_server);
+	DEFINE_REQ_FMT0("MDS_STATFS", mdt_body_only, obd_statfs_server);
 EXPORT_SYMBOL(RQF_MDS_STATFS);
 
 struct req_format RQF_MDS_SYNC =
diff --git a/fs/lustre/ptlrpc/pack_generic.c b/fs/lustre/ptlrpc/pack_generic.c
index d09cf3f..e71f79d 100644
--- a/fs/lustre/ptlrpc/pack_generic.c
+++ b/fs/lustre/ptlrpc/pack_generic.c
@@ -1645,7 +1645,7 @@ void lustre_swab_obd_statfs(struct obd_statfs *os)
 	__swab32s(&os->os_state);
 	__swab32s(&os->os_fprecreated);
 	BUILD_BUG_ON(offsetof(typeof(*os), os_fprecreated) == 0);
-	BUILD_BUG_ON(offsetof(typeof(*os), os_spare2) == 0);
+	__swab32s(&os->os_granted);
 	BUILD_BUG_ON(offsetof(typeof(*os), os_spare3) == 0);
 	BUILD_BUG_ON(offsetof(typeof(*os), os_spare4) == 0);
 	BUILD_BUG_ON(offsetof(typeof(*os), os_spare5) == 0);
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index 1afbb41..30083c2 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1696,10 +1696,10 @@ void lustre_assert_wire_constants(void)
 		 (long long)(int)offsetof(struct obd_statfs, os_fprecreated));
 	LASSERTF((int)sizeof(((struct obd_statfs *)0)->os_fprecreated) == 4, "found %lld\n",
 		 (long long)(int)sizeof(((struct obd_statfs *)0)->os_fprecreated));
-	LASSERTF((int)offsetof(struct obd_statfs, os_spare2) == 112, "found %lld\n",
-		 (long long)(int)offsetof(struct obd_statfs, os_spare2));
-	LASSERTF((int)sizeof(((struct obd_statfs *)0)->os_spare2) == 4, "found %lld\n",
-		 (long long)(int)sizeof(((struct obd_statfs *)0)->os_spare2));
+	LASSERTF((int)offsetof(struct obd_statfs, os_granted) == 112, "found %lld\n",
+		 (long long)(int)offsetof(struct obd_statfs, os_granted));
+	LASSERTF((int)sizeof(((struct obd_statfs *)0)->os_granted) == 4, "found %lld\n",
+		 (long long)(int)sizeof(((struct obd_statfs *)0)->os_granted));
 	LASSERTF((int)offsetof(struct obd_statfs, os_spare3) == 116, "found %lld\n",
 		 (long long)(int)offsetof(struct obd_statfs, os_spare3));
 	LASSERTF((int)sizeof(((struct obd_statfs *)0)->os_spare3) == 4, "found %lld\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 249a3d5..c65663a 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -793,6 +793,7 @@ struct ptlrpc_body_v2 {
 							 */
 #define OBD_CONNECT2_DIR_MIGRATE	0x4ULL		/* migrate striped dir
 							 */
+#define OBD_CONNECT2_SUM_STATFS		0x8ULL /* MDT return aggregated stats */
 #define OBD_CONNECT2_FLR		0x20ULL		/* FLR support */
 #define OBD_CONNECT2_WBC_INTENTS	0x40ULL /* create/unlink/... intents
 						 * for wbc, also operations
@@ -1167,7 +1168,7 @@ static inline __u32 lov_mds_md_size(__u16 stripes, __u32 lmm_magic)
 #define OBD_MD_FLXATTRLS	(0x0000002000000000ULL) /* xattr list */
 #define OBD_MD_FLXATTRRM	(0x0000004000000000ULL) /* xattr remove */
 #define OBD_MD_FLACL		(0x0000008000000000ULL) /* ACL */
-/*	OBD_MD_FLRMTPERM	(0x0000010000000000ULL) remote perm, obsolete */
+#define OBD_MD_FLAGSTATFS	(0x0000010000000000ULL) /* aggregated statfs */
 #define OBD_MD_FLMDSCAPA	(0x0000020000000000ULL) /* MDS capability */
 #define OBD_MD_FLOSSCAPA	(0x0000040000000000ULL) /* OSS capability */
 /*	OBD_MD_FLCKSPLIT	(0x0000080000000000ULL) obsolete 2.3.58*/
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 421c977..f25bb9b 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -104,6 +104,7 @@ enum obd_statfs_state {
 	OS_STATE_NOPRECREATE	= 0x00000004, /**< no object precreation */
 	OS_STATE_ENOSPC		= 0x00000020, /**< not enough free space */
 	OS_STATE_ENOINO		= 0x00000040, /**< not enough inodes */
+	OS_STATE_SUM		= 0x00000100, /**< aggregated for all targets */
 };
 
 struct obd_statfs {
@@ -121,9 +122,9 @@ struct obd_statfs {
 	__u32	os_fprecreated;	/* objs available now to the caller
 				 * used in QoS code to find preferred OSTs
 				 */
-	__u32	os_spare2;	/* Unused padding fields.  Remember */
-	__u32	os_spare3;	/* to fix lustre_swab_obd_statfs() */
-	__u32	os_spare4;
+	__u32	os_granted;	/* space granted for MDS */
+	__u32	os_spare3;	/* Unused padding fields.  Remember */
+	__u32	os_spare4;	/* to fix lustre_swab_obd_statfs() */
 	__u32	os_spare5;
 	__u32	os_spare6;
 	__u32	os_spare7;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 104/622] lustre: ldlm: correct logic in ldlm_prepare_lru_list()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (102 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 103/622] lustre: protocol: MDT as a statfs proxy James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 105/622] lustre: llite: check truncate race for DOM pages James Simmons
                   ` (518 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: "John L. Hammond" <jhammond@whamcloud.com>

In ldlm_prepare_lru_list() fix an (x != a || x != b) type error and
correct a use-after-free.
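
As a minimal userspace sketch, the difference between the two
conditions can be shown with stand-in helpers (the bool parameters
stand in for ldlm_is_canceling()/ldlm_is_converting(); this is not
the ldlm API):

```c
#include <assert.h>
#include <stdbool.h>

/* With ||, the predicate is true whenever at least one flag is clear,
 * so a lock that is canceling but not converting would wrongly hit
 * the break.  With &&, it is true only when the lock is neither
 * canceling nor converting, which is the intended test. */
static bool should_break_buggy(bool canceling, bool converting)
{
	return !canceling || !converting;
}

static bool should_break_fixed(bool canceling, bool converting)
{
	return !canceling && !converting;
}
```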

WC-bug-id: https://jira.whamcloud.com/browse/LU-11075
Lustre-commit: aecafb57d5b6 ("LU-11075 ldlm: correct logic in ldlm_prepare_lru_list()")
Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32660
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_request.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index bc441f0..f045d30 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -1643,7 +1643,7 @@ static int ldlm_prepare_lru_list(struct ldlm_namespace *ns,
 			/* No locks which got blocking requests. */
 			LASSERT(!ldlm_is_bl_ast(lock));
 
-			if (!ldlm_is_canceling(lock) ||
+			if (!ldlm_is_canceling(lock) &&
 			    !ldlm_is_converting(lock))
 				break;
 
@@ -1686,7 +1686,6 @@ static int ldlm_prepare_lru_list(struct ldlm_namespace *ns,
 
 		if (result == LDLM_POLICY_SKIP_LOCK) {
 			lu_ref_del(&lock->l_reference, __func__, current);
-			LDLM_LOCK_RELEASE(lock);
 			if (no_wait) {
 				spin_lock(&ns->ns_lock);
 				if (!list_empty(&lock->l_lru) &&
@@ -1694,6 +1693,8 @@ static int ldlm_prepare_lru_list(struct ldlm_namespace *ns,
 					ns->ns_last_pos = &lock->l_lru;
 				spin_unlock(&ns->ns_lock);
 			}
+
+			LDLM_LOCK_RELEASE(lock);
 			continue;
 		}
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 105/622] lustre: llite: check truncate race for DOM pages
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (103 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 104/622] lustre: ldlm: correct logic in ldlm_prepare_lru_list() James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 106/622] lnet: lnd: conditionally set health status James Simmons
                   ` (517 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

In ll_dom_finish_open(), check that the vmpage mapping still
exists after locking the page, and exit otherwise. This can
happen if the page has been truncated concurrently.
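
A userspace sketch of the check pattern (the struct and error value
are simplified stand-ins, not the kernel API): after locking a page,
its ->mapping must be re-tested, because truncation clears it and
continuing would wrap stale data.

```c
#include <assert.h>
#include <stddef.h>

#define SK_ENODATA 61	/* stand-in for the kernel's ENODATA */

struct sketch_page {
	void *mapping;	/* NULL once the page has been truncated */
};

static int dom_page_usable(int truncated)
{
	static int anchor;
	struct sketch_page pg = { .mapping = truncated ? NULL : &anchor };

	/* lock_page(vmpage) would precede this check in the kernel */
	if (pg.mapping == NULL)
		return -SK_ENODATA;	/* page was truncated concurrently */
	return 0;
}
```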

WC-bug-id: https://jira.whamcloud.com/browse/LU-11275
Lustre-commit: 0f7d7b200b58 ("LU-11275 llite: check truncate race for DOM pages")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33087
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 68fb623..ae39b2c 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -496,6 +496,13 @@ void ll_dom_finish_open(struct inode *inode, struct ptlrpc_request *req,
 			break;
 		}
 		lock_page(vmpage);
+		if (!vmpage->mapping) {
+			unlock_page(vmpage);
+			put_page(vmpage);
+			/* page was truncated */
+			rc = -ENODATA;
+			goto out_io;
+		}
 		clp = cl_page_find(env, obj, vmpage->index, vmpage,
 				   CPT_CACHEABLE);
 		if (IS_ERR(clp)) {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 106/622] lnet: lnd: conditionally set health status
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (104 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 105/622] lustre: llite: check truncate race for DOM pages James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 107/622] lnet: router handling James Simmons
                   ` (516 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

For specific error scenarios, a more accurate health status is set
per transmit. These statuses shouldn't be overwritten in
kiblnd_txlist_done().

WC-bug-id: https://jira.whamcloud.com/browse/LU-11271
Lustre-commit: cf3cc2c72e6e ("LU-11271 lnd: conditionally set health status")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33042
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/o2iblnd/o2iblnd_cb.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
index 5680f2a..68ab7d5 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -110,7 +110,8 @@ static int kiblnd_init_rdma(struct kib_conn *conn, struct kib_tx *tx, int type,
 		/* complete now */
 		tx->tx_waiting = 0;
 		tx->tx_status = status;
-		tx->tx_hstatus = hstatus;
+		if (hstatus != LNET_MSG_STATUS_OK)
+			tx->tx_hstatus = hstatus;
 		kiblnd_tx_done(tx);
 	}
 }
@@ -2108,9 +2109,11 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 	spin_unlock(&conn->ibc_lock);
 
 	/* aborting transmits occurs when finalizing the connection.
-	 * The connection is finalized on error
+	 * The connection is finalized on error.
+	 * Passing LNET_MSG_STATUS_OK to txlist_done() will not
+	 * override the value already set in tx->tx_hstatus above.
 	 */
-	kiblnd_txlist_done(&zombies, -ECONNABORTED, -1);
+	kiblnd_txlist_done(&zombies, -ECONNABORTED, LNET_MSG_STATUS_OK);
 }
 
 static void
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 107/622] lnet: router handling
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (105 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 106/622] lnet: lnd: conditionally set health status James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 108/622] lustre: obd: check '-o network' and peer discovery conflict James Simmons
                   ` (515 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Re-create the md and mdh if the router checker ping times out.
When re-transmitting a message, do so even if the peer is marked
down, in order to fulfill the message's retry quota.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11272
Lustre-commit: 05becd69bc0c ("LU-11272 lnet: router handling")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33043
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-move.c | 12 ++++++++++--
 net/lnet/lnet/router.c   |  8 +++++++-
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index eb0b48d..3cab970 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -678,7 +678,8 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
  *     may drop the lnet_net_lock
  */
 static int
-lnet_peer_alive_locked(struct lnet_ni *ni, struct lnet_peer_ni *lp)
+lnet_peer_alive_locked(struct lnet_ni *ni, struct lnet_peer_ni *lp,
+		       struct lnet_msg *msg)
 {
 	time64_t now = ktime_get_seconds();
 
@@ -689,6 +690,13 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 		return 1;
 
 	/*
+	 * If we're resending a message, let's attempt to send it even if
+	 * the peer is down to fulfill our resend quota on the message
+	 */
+	if (msg->msg_retry_count > 0)
+		return 1;
+
+	/*
 	 * Peer appears dead, but we should avoid frequent NI queries (at
 	 * most once per lnet_queryinterval seconds).
 	 */
@@ -746,7 +754,7 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 
 	/* NB 'lp' is always the next hop */
 	if (!(msg->msg_target.pid & LNET_PID_USERFLAG) &&
-	    !lnet_peer_alive_locked(ni, lp)) {
+	    !lnet_peer_alive_locked(ni, lp, msg)) {
 		the_lnet.ln_counters[cpt]->drop_count++;
 		the_lnet.ln_counters[cpt]->drop_length += msg->msg_len;
 		lnet_net_unlock(cpt);
diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index 7c3bbd8..66a116c 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -1042,7 +1042,13 @@ int lnet_get_rtr_pool_cfg(int idx, struct lnet_ioctl_pool_cfg *pool_cfg)
 	}
 
 	rcd = rtr->lpni_rcd;
-	if (!rcd || rcd->rcd_nnis > rcd->rcd_pingbuffer->pb_nnis)
+
+	/* The response to the router checker ping could've timed out and
+	 * the mdh might've been invalidated, so we need to update it
+	 * again.
+	 */
+	if (!rcd || rcd->rcd_nnis > rcd->rcd_pingbuffer->pb_nnis ||
+	    LNetMDHandleIsInvalid(rcd->rcd_mdh))
 		rcd = lnet_update_rc_data_locked(rtr);
 	if (!rcd)
 		return;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 108/622] lustre: obd: check '-o network' and peer discovery conflict
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (106 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 107/622] lnet: router handling James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 109/622] lnet: update logging James Simmons
                   ` (514 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Sebastien Buisson <sbuisson@ddn.com>

"-o network=net" client mount option is not taken into account
when LNet dynamic peer discovery is active.
Check if LNet dynamic peer discovery is active on local node. If it
is, return error if "-o network=net" option is specified.

This patch will have to be reverted when the incompatibility between
"-o network=net" client mount option and LNet dynamic peer discovery
is resolved.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11057
Lustre-commit: 2269d27e07cb ("LU-11057 obd: check '-o network' and peer discovery conflict")
Signed-off-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-on: https://review.whamcloud.com/32562
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/obd_mount.c |  7 +++++++
 include/linux/lnet/api.h       |  1 +
 net/lnet/lnet/api-ni.c         | 13 +++++++++++++
 3 files changed, 21 insertions(+)

diff --git a/fs/lustre/obdclass/obd_mount.c b/fs/lustre/obdclass/obd_mount.c
index 5cf404c..d143112 100644
--- a/fs/lustre/obdclass/obd_mount.c
+++ b/fs/lustre/obdclass/obd_mount.c
@@ -1169,6 +1169,13 @@ int lmd_parse(char *options, struct lustre_mount_data *lmd)
 			rc = lmd_parse_network(lmd, s1 + 8);
 			if (rc)
 				goto invalid;
+
+			/* check if LNet dynamic peer discovery is activated */
+			if (LNetGetPeerDiscoveryStatus()) {
+				CERROR("LNet Dynamic Peer Discovery is enabled on this node. 'network' mount option cannot be taken into account.\n");
+				goto invalid;
+			}
+
 			clear++;
 		}
 
diff --git a/include/linux/lnet/api.h b/include/linux/lnet/api.h
index a57ecc8..4b152c8 100644
--- a/include/linux/lnet/api.h
+++ b/include/linux/lnet/api.h
@@ -207,6 +207,7 @@ int LNetGet(lnet_nid_t self,
 int LNetClearLazyPortal(int portal);
 int LNetCtl(unsigned int cmd, void *arg);
 void LNetDebugPeer(struct lnet_process_id id);
+int LNetGetPeerDiscoveryStatus(void);
 
 /** @} lnet_misc */
 
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 07bc29f..c81f46f 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -4038,3 +4038,16 @@ static int lnet_ping(struct lnet_process_id id, signed long timeout,
 	kfree(buf);
 	return rc;
 }
+
+/**
+ * Retrieve peer discovery status.
+ *
+ * Return	1 if lnet_peer_discovery_disabled is 0
+ *		0 if lnet_peer_discovery_disabled is 1
+ */
+int
+LNetGetPeerDiscoveryStatus(void)
+{
+	return !lnet_peer_discovery_disabled;
+}
+EXPORT_SYMBOL(LNetGetPeerDiscoveryStatus);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 109/622] lnet: update logging
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (107 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 108/622] lustre: obd: check '-o network' and peer discovery conflict James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 110/622] lustre: ldlm: don't cancel DoM locks before replay James Simmons
                   ` (513 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Add the retry count when logging message sending/resending.
Make timed-out responses visible on net error.
Log cases when a message is not resent.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11273
Lustre-commit: b9523f474346 ("LU-11273 lnet: update logging")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33044
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-move.c | 13 +++++++------
 net/lnet/lnet/lib-msg.c  | 21 ++++++++++++++++++---
 2 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 3cab970..84a30e0 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -1517,14 +1517,14 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 
 	rc = lnet_post_send_locked(msg, 0);
 	if (!rc)
-		CDEBUG(D_NET, "TRACE: %s(%s:%s) -> %s(%s:%s) : %s\n",
+		CDEBUG(D_NET, "TRACE: %s(%s:%s) -> %s(%s:%s) : %s try# %d\n",
 		       libcfs_nid2str(msg->msg_hdr.src_nid),
 		       libcfs_nid2str(msg->msg_txni->ni_nid),
 		       libcfs_nid2str(sd->sd_src_nid),
 		       libcfs_nid2str(msg->msg_hdr.dest_nid),
 		       libcfs_nid2str(sd->sd_dst_nid),
 		       libcfs_nid2str(msg->msg_txpeer->lpni_nid),
-		       lnet_msgtyp2str(msg->msg_type));
+		       lnet_msgtyp2str(msg->msg_type), msg->msg_retry_count);
 
 	return rc;
 }
@@ -2515,8 +2515,7 @@ struct lnet_mt_event_info {
 
 				list_del_init(&rspt->rspt_on_list);
 
-				CDEBUG(D_NET,
-				       "Response timed out: md = %p\n", md);
+				CNETERR("Response timed out: md = %p\n", md);
 				LNetMDUnlink(rspt->rspt_mdh);
 				lnet_rspt_free(rspt, i);
 			} else {
@@ -2579,11 +2578,13 @@ struct lnet_mt_event_info {
 			lnet_peer_ni_decref_locked(lpni);
 
 			lnet_net_unlock(cpt);
-			CDEBUG(D_NET, "resending %s->%s: %s recovery %d\n",
+			CDEBUG(D_NET,
+			       "resending %s->%s: %s recovery %d try# %d\n",
 			       libcfs_nid2str(src_nid),
 			       libcfs_id2str(msg->msg_target),
 			       lnet_msgtyp2str(msg->msg_type),
-			       msg->msg_recovery);
+			       msg->msg_recovery,
+			       msg->msg_retry_count);
 			rc = lnet_send(src_nid, msg, LNET_NID_ANY);
 			if (rc) {
 				CERROR("Error sending %s to %s: %d\n",
diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index 5072238..9b52549 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -690,18 +690,33 @@
 
 resend:
 	/* don't resend recovery messages */
-	if (msg->msg_recovery)
+	if (msg->msg_recovery) {
+		CDEBUG(D_NET, "msg %s->%s is a recovery ping. retry# %d\n",
+		       libcfs_nid2str(msg->msg_from),
+		       libcfs_nid2str(msg->msg_target.nid),
+		       msg->msg_retry_count);
 		return -1;
+	}
 
 	/* if we explicitly indicated we don't want to resend then just
 	 * return
 	 */
-	if (msg->msg_no_resend)
+	if (msg->msg_no_resend) {
+		CDEBUG(D_NET, "msg %s->%s requested no resend. retry# %d\n",
+		       libcfs_nid2str(msg->msg_from),
+		       libcfs_nid2str(msg->msg_target.nid),
+		       msg->msg_retry_count);
 		return -1;
+	}
 
 	/* check if the message has exceeded the number of retries */
-	if (msg->msg_retry_count >= lnet_retry_count)
+	if (msg->msg_retry_count >= lnet_retry_count) {
+		CNETERR("msg %s->%s exceeded retry count %d\n",
+			libcfs_nid2str(msg->msg_from),
+			libcfs_nid2str(msg->msg_target.nid),
+			msg->msg_retry_count);
 		return -1;
+	}
 	msg->msg_retry_count++;
 
 	lnet_net_lock(msg->msg_tx_cpt);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 110/622] lustre: ldlm: don't cancel DoM locks before replay
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (108 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 109/622] lnet: update logging James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 111/622] lnet: lnd: Clean up logging James Simmons
                   ` (512 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

Weigh DoM locks before lock replay, as is done for OSC EXTENT
locks, and don't cancel locks with data.

Add DoM replay tests for file creation and write cases.
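
The cancel-weight rule this adds to mdc_cancel_weight() can be
sketched as below (plain ints stand in for the ldlm_lock fields and
the osc_ldlm_weigh_ast() result; this is not the kernel API):

```c
#include <assert.h>

/* A DoM lock is only eligible for early cancel (return 1) when it is
 * granted and carries no cached pages; otherwise it is kept (return 0)
 * so its data survives until replay. */
static int dom_cancel_weight(int is_dom, int granted, unsigned long page_weight)
{
	if (is_dom && (!granted || page_weight != 0))
		return 0;	/* keep the lock for replay */
	return 1;		/* unused and granted: safe to cancel */
}
```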

WC-bug-id: https://jira.whamcloud.com/browse/LU-10961
Lustre-commit: b44b1ff8c7fc ("LU-10961 ldlm: don't cancel DoM locks before replay")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32791
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_osc.h |  1 +
 fs/lustre/mdc/mdc_request.c    |  6 ++++++
 fs/lustre/osc/osc_lock.c       | 22 ++++++++++++++--------
 3 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/fs/lustre/include/lustre_osc.h b/fs/lustre/include/lustre_osc.h
index 5ba4f97..dc8071a 100644
--- a/fs/lustre/include/lustre_osc.h
+++ b/fs/lustre/include/lustre_osc.h
@@ -714,6 +714,7 @@ void osc_lock_cancel(const struct lu_env *env,
 		     const struct cl_lock_slice *slice);
 void osc_lock_fini(const struct lu_env *env, struct cl_lock_slice *slice);
 int osc_ldlm_glimpse_ast(struct ldlm_lock *dlmlock, void *data);
+unsigned long osc_ldlm_weigh_ast(struct ldlm_lock *dlmlock);
 
 /****************************************************************************
  *
diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index 3341761..0ee42dd 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -2510,6 +2510,12 @@ static int mdc_cancel_weight(struct ldlm_lock *lock)
 	if (lock->l_policy_data.l_inodebits.bits & MDS_INODELOCK_OPEN)
 		return 0;
 
+	/* Special case for DoM locks, cancel only unused and granted locks */
+	if (ldlm_has_dom(lock) &&
+	    (lock->l_granted_mode != lock->l_req_mode ||
+	     osc_ldlm_weigh_ast(lock) != 0))
+		return 0;
+
 	return 1;
 }
 
diff --git a/fs/lustre/osc/osc_lock.c b/fs/lustre/osc/osc_lock.c
index b7b33fb..1a2b0bd 100644
--- a/fs/lustre/osc/osc_lock.c
+++ b/fs/lustre/osc/osc_lock.c
@@ -608,8 +608,8 @@ static bool weigh_cb(const struct lu_env *env, struct cl_io *io,
 	struct cl_page *page = ops->ops_cl.cpl_page;
 
 	if (cl_page_is_vmlocked(env, page) ||
-	    PageDirty(page->cp_vmpage) || PageWriteback(page->cp_vmpage)
-	   )
+	    PageDirty(page->cp_vmpage) ||
+	    PageWriteback(page->cp_vmpage))
 		return false;
 
 	*(pgoff_t *)cbdata = osc_index(ops) + 1;
@@ -618,7 +618,7 @@ static bool weigh_cb(const struct lu_env *env, struct cl_io *io,
 
 static unsigned long osc_lock_weight(const struct lu_env *env,
 				     struct osc_object *oscobj,
-				     struct ldlm_extent *extent)
+				     loff_t start, loff_t end)
 {
 	struct cl_io *io = osc_env_thread_io(env);
 	struct cl_object *obj = cl_object_top(&oscobj->oo_cl);
@@ -631,11 +631,10 @@ static unsigned long osc_lock_weight(const struct lu_env *env,
 	if (result != 0)
 		return result;
 
-	page_index = cl_index(obj, extent->start);
+	page_index = cl_index(obj, start);
 
 	if (!osc_page_gang_lookup(env, io, oscobj,
-				 page_index,
-				 cl_index(obj, extent->end),
+				 page_index, cl_index(obj, end),
 				 weigh_cb, (void *)&page_index))
 		result = 1;
 	cl_io_fini(env, io);
@@ -668,7 +667,8 @@ unsigned long osc_ldlm_weigh_ast(struct ldlm_lock *dlmlock)
 		/* Mostly because lack of memory, do not eliminate this lock */
 		return 1;
 
-	LASSERT(dlmlock->l_resource->lr_type == LDLM_EXTENT);
+	LASSERT(dlmlock->l_resource->lr_type == LDLM_EXTENT ||
+		ldlm_has_dom(dlmlock));
 	lock_res_and_lock(dlmlock);
 	obj = dlmlock->l_ast_data;
 	if (obj)
@@ -695,7 +695,12 @@ unsigned long osc_ldlm_weigh_ast(struct ldlm_lock *dlmlock)
 		goto out;
 	}
 
-	weight = osc_lock_weight(env, obj, &dlmlock->l_policy_data.l_extent);
+	if (ldlm_has_dom(dlmlock))
+		weight = osc_lock_weight(env, obj, 0, OBD_OBJECT_EOF);
+	else
+		weight = osc_lock_weight(env, obj,
+					 dlmlock->l_policy_data.l_extent.start,
+					 dlmlock->l_policy_data.l_extent.end);
 
 out:
 	if (obj)
@@ -704,6 +709,7 @@ unsigned long osc_ldlm_weigh_ast(struct ldlm_lock *dlmlock)
 	cl_env_put(env, &refcheck);
 	return weight;
 }
+EXPORT_SYMBOL(osc_ldlm_weigh_ast);
 
 static void osc_lock_build_einfo(const struct lu_env *env,
 				 const struct cl_lock *lock,
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 111/622] lnet: lnd: Clean up logging
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (109 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 110/622] lustre: ldlm: don't cancel DoM locks before replay James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 112/622] lustre: mdt: revoke lease lock for truncate James Simmons
                   ` (511 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

There is no need to output an error in ksocknal_tx_done(), as the
error is already tracked in LNet. There is also no need to keep a
cookie in the connection; it always points to the message. This
allows us to set the message's health status properly before
calling lnet_finalize().
WC-bug-id: https://jira.whamcloud.com/browse/LU-11309
Lustre-commit: cdf462b19345 ("LU-11309 lnd: Clean up logging")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33096
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/socklnd/socklnd.c    |  5 ++++-
 net/lnet/klnds/socklnd/socklnd.h    |  3 +--
 net/lnet/klnds/socklnd/socklnd_cb.c | 10 +++++-----
 3 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/net/lnet/klnds/socklnd/socklnd.c b/net/lnet/klnds/socklnd/socklnd.c
index 891d3bd..72ecf80 100644
--- a/net/lnet/klnds/socklnd/socklnd.c
+++ b/net/lnet/klnds/socklnd/socklnd.c
@@ -1680,7 +1680,10 @@ struct ksock_peer *
 		       &conn->ksnc_ipaddr, conn->ksnc_port,
 		       iov_iter_count(&conn->ksnc_rx_to), conn->ksnc_rx_nob_left,
 		       ktime_get_seconds() - last_rcv);
-		lnet_finalize(conn->ksnc_cookie, -EIO);
+		if (conn->ksnc_lnet_msg)
+			conn->ksnc_lnet_msg->msg_health_status =
+				LNET_MSG_STATUS_REMOTE_ERROR;
+		lnet_finalize(conn->ksnc_lnet_msg, -EIO);
 		break;
 	case SOCKNAL_RX_LNET_HEADER:
 		if (conn->ksnc_rx_started)
diff --git a/net/lnet/klnds/socklnd/socklnd.h b/net/lnet/klnds/socklnd/socklnd.h
index 48884cf..c8d8acf 100644
--- a/net/lnet/klnds/socklnd/socklnd.h
+++ b/net/lnet/klnds/socklnd/socklnd.h
@@ -355,8 +355,7 @@ struct ksock_conn {
 	u32			ksnc_rx_csum;		/* partial checksum for incoming
 							 * data
 							 */
-	void		       *ksnc_cookie;		/* rx lnet_finalize passthru arg
-							 */
+	struct lnet_msg	       *ksnc_lnet_msg;		/* rx lnet_finalize arg */
 	struct ksock_msg	ksnc_msg;		/* incoming message buffer:
 							 * V2.x message takes the
 							 * whole struct
diff --git a/net/lnet/klnds/socklnd/socklnd_cb.c b/net/lnet/klnds/socklnd/socklnd_cb.c
index 057c7f3..10a1934 100644
--- a/net/lnet/klnds/socklnd/socklnd_cb.c
+++ b/net/lnet/klnds/socklnd/socklnd_cb.c
@@ -344,9 +344,6 @@ struct ksock_tx *
 
 	ksocknal_free_tx(tx);
 	if (lnetmsg) { /* KSOCK_MSG_NOOP go without lnetmsg */
-		if (rc)
-			CERROR("tx failure rc = %d, hstatus = %d\n", rc,
-			       hstatus);
 		lnetmsg->msg_health_status = hstatus;
 		lnet_finalize(lnetmsg, rc);
 	}
@@ -1266,7 +1263,10 @@ struct ksock_route *
 					le64_to_cpu(lhdr->src_nid) != id->nid);
 		}
 
-		lnet_finalize(conn->ksnc_cookie, rc);
+		if (rc && conn->ksnc_lnet_msg)
+			conn->ksnc_lnet_msg->msg_health_status =
+				LNET_MSG_STATUS_REMOTE_ERROR;
+		lnet_finalize(conn->ksnc_lnet_msg, rc);
 
 		if (rc) {
 			ksocknal_new_packet(conn, 0);
@@ -1300,7 +1300,7 @@ struct ksock_route *
 	LASSERT(iov_iter_count(to) <= rlen);
 	LASSERT(to->nr_segs <= LNET_MAX_IOV);
 
-	conn->ksnc_cookie = msg;
+	conn->ksnc_lnet_msg = msg;
 	conn->ksnc_rx_nob_left = rlen;
 
 	conn->ksnc_rx_to = *to;
-- 
1.8.3.1


* [lustre-devel] [PATCH 112/622] lustre: mdt: revoke lease lock for truncate
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (110 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 111/622] lnet: lnd: Clean up logging James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 113/622] lustre: ptlrpc: race in AT early reply James Simmons
                   ` (510 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Jian Yu <yujian@whamcloud.com>

A Lustre lease lock is usually used to protect file data
against concurrent access; the open lock used on the MDT side
serves this purpose. However, truncate changes file data
yet does not revoke the lease lock.

This patch fixes the issue by acquiring the open semaphore,
checking the lease count, and revoking the lease if there is
any pending lease on the file.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10660
Lustre-commit: e4c168165df2 ("LU-10660 mdt: revoke lease lock for truncate")
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33093
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_lib.c            | 7 +++++++
 include/uapi/linux/lustre/lustre_idl.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 8b3e2a3..37558a8 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -1616,6 +1616,13 @@ int ll_setattr_raw(struct dentry *dentry, struct iattr *attr,
 		clear_bit(LLIF_DATA_MODIFIED, &lli->lli_flags);
 	}
 
+	if (attr->ia_valid & ATTR_FILE) {
+		struct ll_file_data *fd = LUSTRE_FPRIVATE(attr->ia_file);
+
+		if (fd->fd_lease_och)
+			op_data->op_bias |= MDS_TRUNC_KEEP_LEASE;
+	}
+
 	op_data->op_attr = *attr;
 	op_data->op_xvalid = xvalid;
 
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index c65663a..7f857be 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -1700,6 +1700,7 @@ enum mds_op_bias {
 	MDS_CLOSE_LAYOUT_MERGE	= 1 << 15,
 	MDS_CLOSE_RESYNC_DONE	= 1 << 16,
 	MDS_CLOSE_LAYOUT_SPLIT	= 1 << 17,
+	MDS_TRUNC_KEEP_LEASE	= 1 << 18,
 };
 
 #define MDS_CLOSE_INTENT (MDS_HSM_RELEASE | MDS_CLOSE_LAYOUT_SWAP |         \
-- 
1.8.3.1


* [lustre-devel] [PATCH 113/622] lustre: ptlrpc: race in AT early reply
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (111 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 112/622] lustre: mdt: revoke lease lock for truncate James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 114/622] lustre: migrate: migrate striped directory James Simmons
                   ` (509 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Hongchao Zhang <hongchao@whamcloud.com>

In ptlrpc_at_check_timed(), the refcount of the request may
already have dropped to zero, in which case
ptlrpc_server_drop_request() can proceed without holding the
"scp_at_lock" and free the request, poisoning the memory with
0x5a5a5a5a5a5a5a5a. The following
"atomic_inc_not_zero(&rq->rq_refcount)" will then return nonzero
and cause the freed request to be used in
ptlrpc_at_send_early_reply().
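
The fix hinges on atomic_inc_not_zero() only succeeding while the
refcount is still live. As a hedged user-space sketch of that
take-a-reference-only-if-nonzero pattern (hypothetical names, C11
atomics standing in for the kernel's atomic_t):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space analogue of the kernel's
 * atomic_inc_not_zero(): take a reference only while the object is
 * still live (refcount > 0).  Returns false once the count has hit
 * zero, i.e. once the final put may already be freeing the object. */
static bool ref_get_unless_zero(atomic_int *refcount)
{
	int old = atomic_load(refcount);

	while (old != 0) {
		/* on failure the CAS reloads 'old', so we simply retry */
		if (atomic_compare_exchange_weak(refcount, &old, old + 1))
			return true;
	}
	return false;
}
```

As the commit message notes, this check is only safe while the memory
is still valid: once the request has been freed and poisoned, the
"refcount" reads as garbage and the check can falsely succeed, which is
why the ordering of the check relative to the list removal matters.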

WC-bug-id: https://jira.whamcloud.com/browse/LU-11281
Lustre-commit: 48e409e65edd ("LU-11281 ptlrpc: race in AT early reply")
Signed-off-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33071
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/service.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/ptlrpc/service.c b/fs/lustre/ptlrpc/service.c
index cf920ae..a9155b2 100644
--- a/fs/lustre/ptlrpc/service.c
+++ b/fs/lustre/ptlrpc/service.c
@@ -1224,14 +1224,18 @@ static void ptlrpc_at_check_timed(struct ptlrpc_service_part *svcpt)
 				break;
 			}
 
-			ptlrpc_at_remove_timed(rq);
 			/**
 			 * ptlrpc_server_drop_request() may drop
 			 * refcount to 0 already. Let's check this and
 			 * don't add entry to work_list
 			 */
-			if (likely(atomic_inc_not_zero(&rq->rq_refcount)))
+			if (likely(atomic_inc_not_zero(&rq->rq_refcount))) {
+				ptlrpc_at_remove_timed(rq);
 				list_add(&rq->rq_timed_list, &work_list);
+			} else {
+				ptlrpc_at_remove_timed(rq);
+			}
+
 			counter++;
 		}
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 114/622] lustre: migrate: migrate striped directory
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (112 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 113/622] lustre: ptlrpc: race in AT early reply James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 115/622] lustre: obdclass: remove unused ll_import_cachep James Simmons
                   ` (508 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

Migrate a striped directory in the following steps:
1. create the target object if needed: if the source is a
   directory, a target object is always created; otherwise, if
   the source is already located on the target MDT, or still has
   a link on the source MDT, creation is skipped.
        a) if the source is a directory, detach the source
           stripes and attach them to the target.
        b) migrate source xattrs to the target.
        c) if the source is a regular file, update PFID to the
           target fid.
        d) update the fid to the target for all links of the
           source.
2. update namespace
        a) migrate the dirent from the source parent to the
           target parent.
        b) update the linkea parent fid to the target parent.
        c) destroy the source object.

This implementation improves on the following points:
1. all involved objects are locked to avoid races.
2. directory migration doesn't migrate the dir entries itself;
   instead this is done during each sub-file migration, which
   avoids timeouts when migrating the entries of a large
   directory and also avoids touching dir entries without a
   lock.
3. each file/dir is migrated in one transaction, so migrate
   recovery works the same as for other operations.
4. a directory being migrated can be accessed (and modified)
   like a normal directory.
5. if migration of sub files under a directory fails, the user
   can rerun migrate to finish migration of this directory.
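
In the lmv changes below, a directory under migration keeps its old
stripes appended after the new ones: lookups of existing entries hash
with the old hash type over the tail stripes, while target placement
hashes with the new type over the head, as lsm_name_to_stripe_info()
and lmv_migrate() do. A hedged sketch of that index selection (toy
hash function and illustrative flag value, not the real LMV hash
functions or on-disk constants):

```c
#include <stdint.h>

/* Illustrative value only; the real flag lives in lustre_idl.h. */
#define TOY_HASH_FLAG_MIGRATION	0x80000000u

/* Toy stand-in for the real LMV hash functions (all-chars/fnv_1a_64);
 * any deterministic hash works to illustrate the index selection. */
static uint32_t toy_hash(const char *name, int namelen, uint32_t count)
{
	uint32_t h = 0;
	int i;

	for (i = 0; i < namelen; i++)
		h = h * 31 + (unsigned char)name[i];
	return h % count;
}

/* Target stripe (where the migrated entry lands): hash over the first
 * migrate_offset stripes, the new ones. */
static uint32_t target_stripe_index(uint32_t hash_type,
				    uint32_t stripe_count,
				    uint32_t migrate_offset,
				    const char *name, int namelen)
{
	if (hash_type & TOY_HASH_FLAG_MIGRATION)
		stripe_count = migrate_offset;
	return toy_hash(name, namelen, stripe_count);
}

/* Source stripe (where the old entry still lives): hash over the tail
 * [migrate_offset, stripe_count).  The real code also switches to
 * lsm_md_migrate_hash for this computation. */
static uint32_t source_stripe_index(uint32_t hash_type,
				    uint32_t stripe_count,
				    uint32_t migrate_offset,
				    const char *name, int namelen)
{
	if (!(hash_type & TOY_HASH_FLAG_MIGRATION))
		return toy_hash(name, namelen, stripe_count);
	return migrate_offset +
	       toy_hash(name, namelen, stripe_count - migrate_offset);
}
```

This mirrors the design choice in the patch: a name always resolves to
exactly one old stripe and one new stripe, so each sub-file can be
moved independently while the directory stays usable.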

WC-bug-id: https://jira.whamcloud.com/browse/LU-4684
Lustre-commit: 169738e30a7e ("LU-4684 migrate: migrate striped directory")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/31427
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Fan Yong <fan.yong@intel.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lu_object.h          |  24 ++-
 fs/lustre/include/lustre_lmv.h         |  18 +-
 fs/lustre/llite/file.c                 |  11 +
 fs/lustre/llite/llite_lib.c            |  90 +++++----
 fs/lustre/lmv/lmv_internal.h           |  15 +-
 fs/lustre/lmv/lmv_obd.c                | 357 ++++++++++++++++++++++-----------
 fs/lustre/mdc/mdc_internal.h           |   2 +
 fs/lustre/mdc/mdc_lib.c                |  45 +++--
 fs/lustre/mdc/mdc_reint.c              |   5 +-
 fs/lustre/ptlrpc/wiretest.c            |  16 +-
 include/uapi/linux/lustre/lustre_idl.h |  16 +-
 11 files changed, 403 insertions(+), 196 deletions(-)

diff --git a/fs/lustre/include/lu_object.h b/fs/lustre/include/lu_object.h
index e49954c..a709ad7 100644
--- a/fs/lustre/include/lu_object.h
+++ b/fs/lustre/include/lu_object.h
@@ -1229,6 +1229,26 @@ struct lu_name {
 	int		ln_namelen;
 };
 
+static inline bool name_is_dot_or_dotdot(const char *name, int namelen)
+{
+	return name[0] == '.' &&
+	       (namelen == 1 || (namelen == 2 && name[1] == '.'));
+}
+
+static inline bool lu_name_is_dot_or_dotdot(const struct lu_name *lname)
+{
+	return name_is_dot_or_dotdot(lname->ln_name, lname->ln_namelen);
+}
+
+static inline bool lu_name_is_valid_len(const char *name, size_t name_len)
+{
+	return name &&
+	       name_len > 0 &&
+	       name_len < INT_MAX &&
+	       strlen(name) == name_len &&
+	       memchr(name, '/', name_len) == NULL;
+}
+
 /**
  * Validate names (path components)
  *
@@ -1240,9 +1260,7 @@ struct lu_name {
  */
 static inline bool lu_name_is_valid_2(const char *name, size_t name_len)
 {
-	return name && name_len > 0 && name_len < INT_MAX &&
-	       name[name_len] == '\0' && strlen(name) == name_len &&
-	       !memchr(name, '/', name_len);
+	return lu_name_is_valid_len(name, name_len) && name[name_len] == '\0';
 }
 
 /**
diff --git a/fs/lustre/include/lustre_lmv.h b/fs/lustre/include/lustre_lmv.h
index 5e15c62..ff279e1 100644
--- a/fs/lustre/include/lustre_lmv.h
+++ b/fs/lustre/include/lustre_lmv.h
@@ -47,6 +47,8 @@ struct lmv_stripe_md {
 	u32	lsm_md_master_mdt_index;
 	u32	lsm_md_hash_type;
 	u32	lsm_md_layout_version;
+	u32	lsm_md_migrate_offset;
+	u32	lsm_md_migrate_hash;
 	u32	lsm_md_default_count;
 	u32	lsm_md_default_index;
 	char	lsm_md_pool_name[LOV_MAXPOOLNAME + 1];
@@ -63,6 +65,10 @@ struct lmv_stripe_md {
 	    lsm1->lsm_md_master_mdt_index != lsm2->lsm_md_master_mdt_index ||
 	    lsm1->lsm_md_hash_type != lsm2->lsm_md_hash_type ||
 	    lsm1->lsm_md_layout_version != lsm2->lsm_md_layout_version ||
+	    lsm1->lsm_md_migrate_offset !=
+				lsm2->lsm_md_migrate_offset ||
+	    lsm1->lsm_md_migrate_hash !=
+				lsm2->lsm_md_migrate_hash ||
 	    strcmp(lsm1->lsm_md_pool_name, lsm2->lsm_md_pool_name) != 0)
 		return false;
 
@@ -137,18 +143,14 @@ static inline int lmv_name_to_stripe_index(u32 lmv_hash_type,
 					   unsigned int stripe_count,
 					   const char *name, int namelen)
 {
-	u32 hash_type = lmv_hash_type & LMV_HASH_TYPE_MASK;
 	int idx;
 
 	LASSERT(namelen > 0);
-	if (stripe_count <= 1)
-		return 0;
 
-	/* for migrating object, always start from 0 stripe */
-	if (lmv_hash_type & LMV_HASH_FLAG_MIGRATION)
+	if (stripe_count <= 1)
 		return 0;
 
-	switch (hash_type) {
+	switch (lmv_hash_type & LMV_HASH_TYPE_MASK) {
 	case LMV_HASH_TYPE_ALL_CHARS:
 		idx = lmv_hash_all_chars(stripe_count, name, namelen);
 		break;
@@ -159,8 +161,8 @@ static inline int lmv_name_to_stripe_index(u32 lmv_hash_type,
 		idx = -EBADFD;
 		break;
 	}
-	CDEBUG(D_INFO, "name %.*s hash_type %d idx %d\n", namelen, name,
-	       hash_type, idx);
+	CDEBUG(D_INFO, "name %.*s hash_type %#x idx %d/%u\n", namelen, name,
+	       lmv_hash_type, idx, stripe_count);
 
 	return idx;
 }
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index ae39b2c..fd39948 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -3836,6 +3836,17 @@ int ll_migrate(struct inode *parent, struct file *file, struct lmv_user_md *lum,
 	if (!child_inode)
 		return -ENOENT;
 
+	if (!(exp_connect_flags2(ll_i2sbi(parent)->ll_md_exp) &
+	      OBD_CONNECT2_DIR_MIGRATE)) {
+		if (le32_to_cpu(lum->lum_stripe_count) > 1 ||
+		    ll_i2info(child_inode)->lli_lsm_md) {
+			CERROR("%s: MDT doesn't support stripe directory migration!\n",
+			       ll_get_fsname(parent->i_sb, NULL, 0));
+			rc = -EOPNOTSUPP;
+			goto out_iput;
+		}
+	}
+
 	/*
 	 * lfs migrate command needs to be blocked on the client
 	 * by checking the migrate FID against the FID of the
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 37558a8..636ddf8 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -1254,14 +1254,8 @@ static int ll_init_lsm_md(struct inode *inode, struct lustre_md *md)
 		 * different, so it reset lsm_md to NULL to avoid
 		 * initializing lsm for slave inode.
 		 */
-		/* For migrating inode, master stripe and master object will
-		 * be same, so we only need assign this inode
-		 */
-		if (lsm->lsm_md_hash_type & LMV_HASH_FLAG_MIGRATION && !i)
-			lsm->lsm_md_oinfo[i].lmo_root = inode;
-		else
-			lsm->lsm_md_oinfo[i].lmo_root =
-				ll_iget_anon_dir(inode->i_sb, fid, md);
+		lsm->lsm_md_oinfo[i].lmo_root =
+			ll_iget_anon_dir(inode->i_sb, fid, md);
 		if (IS_ERR(lsm->lsm_md_oinfo[i].lmo_root)) {
 			int rc = PTR_ERR(lsm->lsm_md_oinfo[i].lmo_root);
 
@@ -1273,20 +1267,6 @@ static int ll_init_lsm_md(struct inode *inode, struct lustre_md *md)
 	return 0;
 }
 
-static inline int lli_lsm_md_eq(const struct lmv_stripe_md *lsm_md1,
-				const struct lmv_stripe_md *lsm_md2)
-{
-	return lsm_md1->lsm_md_magic == lsm_md2->lsm_md_magic &&
-	       lsm_md1->lsm_md_stripe_count == lsm_md2->lsm_md_stripe_count &&
-	       lsm_md1->lsm_md_master_mdt_index ==
-			lsm_md2->lsm_md_master_mdt_index &&
-	       lsm_md1->lsm_md_hash_type == lsm_md2->lsm_md_hash_type &&
-	       lsm_md1->lsm_md_layout_version ==
-			lsm_md2->lsm_md_layout_version &&
-	       !strcmp(lsm_md1->lsm_md_pool_name,
-		       lsm_md2->lsm_md_pool_name);
-}
-
 static int ll_update_lsm_md(struct inode *inode, struct lustre_md *md)
 {
 	struct ll_inode_info *lli = ll_i2info(inode);
@@ -1297,27 +1277,53 @@ static int ll_update_lsm_md(struct inode *inode, struct lustre_md *md)
 	CDEBUG(D_INODE, "update lsm %p of " DFID "\n", lli->lli_lsm_md,
 	       PFID(ll_inode2fid(inode)));
 
-	/* no striped information from request. */
-	if (!lsm) {
-		if (!lli->lli_lsm_md) {
-			return 0;
-		} else if (lli->lli_lsm_md->lsm_md_hash_type &
-			   LMV_HASH_FLAG_MIGRATION) {
-			/*
-			 * migration is done, the temporay MIGRATE layout has
-			 * been removed
-			 */
-			CDEBUG(D_INODE, DFID " finish migration.\n",
-			       PFID(ll_inode2fid(inode)));
-			lmv_free_memmd(lli->lli_lsm_md);
-			lli->lli_lsm_md = NULL;
-			return 0;
-		}
-		/*
-		 * The lustre_md from req does not include stripeEA,
-		 * see ll_md_setattr
-		 */
+	/*
+	 * no striped information from request, lustre_md from req does not
+	 * include stripeEA, see ll_md_setattr()
+	 */
+	if (!lsm)
 		return 0;
+
+	/* Compare the old and new stripe information */
+	if (lli->lli_lsm_md && !lsm_md_eq(lli->lli_lsm_md, lsm)) {
+		struct lmv_stripe_md *old_lsm = lli->lli_lsm_md;
+		bool layout_changed = lsm->lsm_md_layout_version >
+				      old_lsm->lsm_md_layout_version;
+		int mask = layout_changed ? D_INODE : D_ERROR;
+		int idx;
+
+		CDEBUG(mask,
+		       "%s: inode@%p "DFID" lmv layout %s magic %#x/%#x stripe count %d/%d master_mdt %d/%d hash_type %#x/%#x version %d/%d migrate offset %d/%d  migrate hash %#x/%#x pool %s/%s\n",
+		       ll_get_fsname(inode->i_sb, NULL, 0), inode,
+		       PFID(&lli->lli_fid),
+		       layout_changed ? "changed" : "mismatch",
+		       lsm->lsm_md_magic, old_lsm->lsm_md_magic,
+		       lsm->lsm_md_stripe_count,
+		       old_lsm->lsm_md_stripe_count,
+		       lsm->lsm_md_master_mdt_index,
+		       old_lsm->lsm_md_master_mdt_index,
+		       lsm->lsm_md_hash_type, old_lsm->lsm_md_hash_type,
+		       lsm->lsm_md_layout_version,
+		       old_lsm->lsm_md_layout_version,
+		       lsm->lsm_md_migrate_offset,
+		       old_lsm->lsm_md_migrate_offset,
+		       lsm->lsm_md_migrate_hash,
+		       old_lsm->lsm_md_migrate_hash,
+		       lsm->lsm_md_pool_name,
+		       old_lsm->lsm_md_pool_name);
+
+		for (idx = 0; idx < old_lsm->lsm_md_stripe_count; idx++)
+			CDEBUG(mask, "old stripe[%d] "DFID"\n",
+			       idx, PFID(&old_lsm->lsm_md_oinfo[idx].lmo_fid));
+
+		for (idx = 0; idx < lsm->lsm_md_stripe_count; idx++)
+			CDEBUG(mask, "new stripe[%d] "DFID"\n",
+			       idx, PFID(&lsm->lsm_md_oinfo[idx].lmo_fid));
+
+		if (!layout_changed)
+			return -EINVAL;
+
+		ll_dir_clear_lsm_md(inode);
 	}
 
 	/* set the directory layout */
diff --git a/fs/lustre/lmv/lmv_internal.h b/fs/lustre/lmv/lmv_internal.h
index 6794f11..c4a2fb8 100644
--- a/fs/lustre/lmv/lmv_internal.h
+++ b/fs/lustre/lmv/lmv_internal.h
@@ -123,18 +123,21 @@ static inline int lmv_stripe_md_size(int stripe_count)
 	return sizeof(*lsm) + stripe_count * sizeof(lsm->lsm_md_oinfo[0]);
 }
 
-int lmv_name_to_stripe_index(enum lmv_hash_type hashtype,
-			     unsigned int max_mdt_index,
-			     const char *name, int namelen);
-
+/* for file under migrating directory, return the target stripe info */
 static inline const struct lmv_oinfo *
 lsm_name_to_stripe_info(const struct lmv_stripe_md *lsm, const char *name,
 			int namelen)
 {
+	u32 hash_type = lsm->lsm_md_hash_type;
+	u32 stripe_count = lsm->lsm_md_stripe_count;
 	int stripe_index;
 
-	stripe_index = lmv_name_to_stripe_index(lsm->lsm_md_hash_type,
-						lsm->lsm_md_stripe_count,
+	if (hash_type & LMV_HASH_FLAG_MIGRATION) {
+		hash_type &= ~LMV_HASH_FLAG_MIGRATION;
+		stripe_count = lsm->lsm_md_migrate_offset;
+	}
+
+	stripe_index = lmv_name_to_stripe_index(hash_type, stripe_count,
 						name, namelen);
 	if (stripe_index < 0)
 		return ERR_PTR(stripe_index);
diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index 90a46c4..3ddffd8 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -1836,154 +1836,284 @@ static int lmv_link(struct obd_export *exp, struct md_op_data *op_data,
 	return md_link(tgt->ltd_exp, op_data, request);
 }
 
-static int lmv_rename(struct obd_export *exp, struct md_op_data *op_data,
-		      const char *old, size_t oldlen,
-		      const char *new, size_t newlen,
-		      struct ptlrpc_request **request)
+static int lmv_migrate(struct obd_export *exp, struct md_op_data *op_data,
+			const char *name, size_t namelen,
+			struct ptlrpc_request **request)
 {
 	struct obd_device *obd = exp->exp_obd;
 	struct lmv_obd *lmv = &obd->u.lmv;
-	struct obd_export *target_exp;
-	struct lmv_tgt_desc *src_tgt;
-	struct lmv_tgt_desc *tgt_tgt;
-	struct mdt_body *body;
+	struct lmv_stripe_md *lsm = op_data->op_mea1;
+	struct lmv_tgt_desc *parent_tgt;
+	struct lmv_tgt_desc *sp_tgt;
+	struct lmv_tgt_desc *tp_tgt = NULL;
+	struct lmv_tgt_desc *child_tgt;
+	struct lmv_tgt_desc *tgt;
+	struct lu_fid target_fid;
 	int rc;
 
-	LASSERT(oldlen != 0);
+	LASSERT(op_data->op_cli_flags & CLI_MIGRATE);
+	LASSERTF(fid_is_sane(&op_data->op_fid3), "invalid FID "DFID"\n",
+		 PFID(&op_data->op_fid3));
 
-	CDEBUG(D_INODE, "RENAME %.*s in " DFID ":%d to %.*s in " DFID ":%d\n",
-	       (int)oldlen, old, PFID(&op_data->op_fid1),
-	       op_data->op_mea1 ? op_data->op_mea1->lsm_md_stripe_count : 0,
-	       (int)newlen, new, PFID(&op_data->op_fid2),
-	       op_data->op_mea2 ? op_data->op_mea2->lsm_md_stripe_count : 0);
+	CDEBUG(D_INODE, "MIGRATE "DFID"/%.*s\n",
+	       PFID(&op_data->op_fid1), (int)namelen, name);
 
 	op_data->op_fsuid = from_kuid(&init_user_ns, current_fsuid());
 	op_data->op_fsgid = from_kgid(&init_user_ns, current_fsgid());
 	op_data->op_cap = current_cap();
 
-	if (op_data->op_cli_flags & CLI_MIGRATE) {
-		LASSERTF(fid_is_sane(&op_data->op_fid3),
-			 "invalid FID " DFID "\n",
-			 PFID(&op_data->op_fid3));
-
-		if (op_data->op_mea1) {
-			struct lmv_stripe_md *lsm = op_data->op_mea1;
-			struct lmv_tgt_desc *tmp;
-
-			/* Fix the parent fid for striped dir */
-			tmp = lmv_locate_target_for_name(lmv, lsm, old,
-							 oldlen,
-							 &op_data->op_fid1,
-							 NULL);
-			if (IS_ERR(tmp))
-				return PTR_ERR(tmp);
+	parent_tgt = lmv_find_target(lmv, &op_data->op_fid1);
+	if (IS_ERR(parent_tgt))
+		return PTR_ERR(parent_tgt);
+
+	if (lsm) {
+		u32 hash_type = lsm->lsm_md_hash_type;
+		u32 stripe_count = lsm->lsm_md_stripe_count;
+
+		/*
+		 * old stripes are appended after new stripes for migrating
+		 * directory.
+		 */
+		if (lsm->lsm_md_hash_type & LMV_HASH_FLAG_MIGRATION) {
+			hash_type = lsm->lsm_md_migrate_hash;
+			stripe_count -= lsm->lsm_md_migrate_offset;
 		}
 
-		rc = lmv_fid_alloc(NULL, exp, &op_data->op_fid2, op_data);
-		if (rc)
+		rc = lmv_name_to_stripe_index(hash_type, stripe_count, name,
+					      namelen);
+		if (rc < 0)
 			return rc;
-		src_tgt = lmv_find_target(lmv, &op_data->op_fid3);
-		if (IS_ERR(src_tgt))
-			return PTR_ERR(src_tgt);
 
-		target_exp = src_tgt->ltd_exp;
-	} else {
-		if (op_data->op_mea1) {
-			struct lmv_stripe_md *lsm = op_data->op_mea1;
+		if (lsm->lsm_md_hash_type & LMV_HASH_FLAG_MIGRATION)
+			rc += lsm->lsm_md_migrate_offset;
 
-			src_tgt = lmv_locate_target_for_name(lmv, lsm, old,
-							     oldlen,
-							     &op_data->op_fid1,
-							     &op_data->op_mds);
-		} else {
-			src_tgt = lmv_find_target(lmv, &op_data->op_fid1);
-		}
-		if (IS_ERR(src_tgt))
-			return PTR_ERR(src_tgt);
+		/* save it in fid4 temporarily for early cancel */
+		op_data->op_fid4 = lsm->lsm_md_oinfo[rc].lmo_fid;
+		sp_tgt = lmv_get_target(lmv, lsm->lsm_md_oinfo[rc].lmo_mds,
+					NULL);
+		if (IS_ERR(sp_tgt))
+			return PTR_ERR(sp_tgt);
 
-		if (op_data->op_mea2) {
-			struct lmv_stripe_md *lsm = op_data->op_mea2;
-
-			tgt_tgt = lmv_locate_target_for_name(lmv, lsm, new,
-							     newlen,
-							     &op_data->op_fid2,
-							     &op_data->op_mds);
-		} else {
-			tgt_tgt = lmv_find_target(lmv, &op_data->op_fid2);
+		/*
+		 * if parent is being migrated too, fill op_fid2 with target
+		 * stripe fid, otherwise the target stripe is not created yet.
+		 */
+		if (lsm->lsm_md_hash_type & LMV_HASH_FLAG_MIGRATION) {
+			hash_type = lsm->lsm_md_hash_type &
+				    ~LMV_HASH_FLAG_MIGRATION;
+			stripe_count = lsm->lsm_md_migrate_offset;
+
+			rc = lmv_name_to_stripe_index(hash_type, stripe_count,
+						      name, namelen);
+			if (rc < 0)
+				return rc;
+
+			op_data->op_fid2 = lsm->lsm_md_oinfo[rc].lmo_fid;
+			tp_tgt = lmv_get_target(lmv,
+						lsm->lsm_md_oinfo[rc].lmo_mds,
+						NULL);
+			if (IS_ERR(tp_tgt))
+				return PTR_ERR(tp_tgt);
 		}
-		if (IS_ERR(tgt_tgt))
-			return PTR_ERR(tgt_tgt);
-
-		target_exp = tgt_tgt->ltd_exp;
+	} else {
+		sp_tgt = parent_tgt;
 	}
 
-	/*
-	 * LOOKUP lock on src child (fid3) should also be cancelled for
-	 * src_tgt in mdc_rename.
-	 */
-	op_data->op_flags |= MF_MDC_CANCEL_FID1 | MF_MDC_CANCEL_FID3;
+	child_tgt = lmv_find_target(lmv, &op_data->op_fid3);
+	if (IS_ERR(child_tgt))
+		return PTR_ERR(child_tgt);
 
-	/*
-	 * Cancel UPDATE locks on tgt parent (fid2), tgt_tgt is its
-	 * own target.
-	 */
-	rc = lmv_early_cancel(exp, NULL, op_data, src_tgt->ltd_idx,
-			      LCK_EX, MDS_INODELOCK_UPDATE,
-			      MF_MDC_CANCEL_FID2);
+	rc = lmv_fid_alloc(NULL, exp, &target_fid, op_data);
 	if (rc)
 		return rc;
+
 	/*
-	 * Cancel LOOKUP locks on source child (fid3) for parent tgt_tgt.
+	 * for directory, send migrate request to the MDT where the object will
+	 * be migrated to, because we can't create a striped directory remotely.
+	 *
+	 * otherwise, send to the MDT where source is located because regular
+	 * file may open lease.
+	 *
+	 * NB. if MDT doesn't support DIR_MIGRATE, send to source MDT too for
+	 * backward compatibility.
 	 */
-	if (fid_is_sane(&op_data->op_fid3)) {
-		struct lmv_tgt_desc *tgt;
-
-		tgt = lmv_find_target(lmv, &op_data->op_fid1);
+	if (S_ISDIR(op_data->op_mode) &&
+	    (exp_connect_flags2(exp) & OBD_CONNECT2_DIR_MIGRATE)) {
+		tgt = lmv_find_target(lmv, &target_fid);
 		if (IS_ERR(tgt))
 			return PTR_ERR(tgt);
+	} else {
+		tgt = child_tgt;
+	}
 
-		/* Cancel LOOKUP lock on its parent */
-		rc = lmv_early_cancel(exp, tgt, op_data, src_tgt->ltd_idx,
-				      LCK_EX, MDS_INODELOCK_LOOKUP,
-				      MF_MDC_CANCEL_FID3);
+	/* cancel UPDATE lock of parent master object */
+	rc = lmv_early_cancel(exp, parent_tgt, op_data, tgt->ltd_idx, LCK_EX,
+			      MDS_INODELOCK_UPDATE, MF_MDC_CANCEL_FID1);
+	if (rc)
+		return rc;
+
+	/* cancel UPDATE lock of source parent */
+	if (sp_tgt != parent_tgt) {
+		/*
+		 * migrate RPC packs master object FID, because we can only pack
+		 * two FIDs in reint RPC, but MDS needs to know both source
+		 * parent and target parent, and it will obtain them from master
+		 * FID and LMV, the other FID in RPC is kept for target.
+		 *
+		 * since this FID is not passed to MDC, cancel it anyway.
+		 */
+		rc = lmv_early_cancel(exp, sp_tgt, op_data, -1, LCK_EX,
+				      MDS_INODELOCK_UPDATE, MF_MDC_CANCEL_FID4);
 		if (rc)
 			return rc;
 
-		rc = lmv_early_cancel(exp, NULL, op_data, src_tgt->ltd_idx,
-				      LCK_EX, MDS_INODELOCK_ELC,
+		op_data->op_flags &= ~MF_MDC_CANCEL_FID4;
+	}
+	op_data->op_fid4 = target_fid;
+
+	/* cancel UPDATE locks of target parent */
+	rc = lmv_early_cancel(exp, tp_tgt, op_data, tgt->ltd_idx, LCK_EX,
+			      MDS_INODELOCK_UPDATE, MF_MDC_CANCEL_FID2);
+	if (rc)
+		return rc;
+
+	/* cancel LOOKUP lock of source if source is remote object */
+	if (child_tgt != sp_tgt) {
+		rc = lmv_early_cancel(exp, sp_tgt, op_data, tgt->ltd_idx,
+				      LCK_EX, MDS_INODELOCK_LOOKUP,
 				      MF_MDC_CANCEL_FID3);
 		if (rc)
 			return rc;
 	}
 
-retry_rename:
+	/* cancel ELC locks of source */
+	rc = lmv_early_cancel(exp, child_tgt, op_data, tgt->ltd_idx, LCK_EX,
+			      MDS_INODELOCK_ELC, MF_MDC_CANCEL_FID3);
+	if (rc)
+		return rc;
+
+	rc = md_rename(tgt->ltd_exp, op_data, name, namelen, NULL, 0, request);
+
+	return rc;
+}
+
+static int lmv_rename(struct obd_export *exp, struct md_op_data *op_data,
+		      const char *old, size_t oldlen,
+		      const char *new, size_t newlen,
+		      struct ptlrpc_request **request)
+{
+	struct obd_device *obd = exp->exp_obd;
+	struct lmv_obd *lmv = &obd->u.lmv;
+	struct lmv_stripe_md *lsm = op_data->op_mea1;
+	struct lmv_tgt_desc *sp_tgt;
+	struct lmv_tgt_desc *tp_tgt = NULL;
+	struct lmv_tgt_desc *tgt;
+	struct mdt_body *body;
+	int rc;
+
+	LASSERT(oldlen != 0);
+
+	if (op_data->op_cli_flags & CLI_MIGRATE) {
+		rc = lmv_migrate(exp, op_data, old, oldlen, request);
+		return rc;
+	}
+
+	op_data->op_fsuid = from_kuid(&init_user_ns, current_fsuid());
+	op_data->op_fsgid = from_kgid(&init_user_ns, current_fsgid());
+	op_data->op_cap = current_cap();
+
+	CDEBUG(D_INODE, "RENAME "DFID"/%.*s to "DFID"/%.*s\n",
+		PFID(&op_data->op_fid1), (int)oldlen, old,
+		PFID(&op_data->op_fid2), (int)newlen, new);
+
+	if (lsm)
+		sp_tgt = lmv_locate_target_for_name(lmv, lsm, old, oldlen,
+						    &op_data->op_fid1,
+						    &op_data->op_mds);
+	else
+		sp_tgt = lmv_find_target(lmv, &op_data->op_fid1);
+	if (IS_ERR(sp_tgt))
+		return PTR_ERR(sp_tgt);
+
+	lsm = op_data->op_mea2;
+	if (lsm)
+		tp_tgt = lmv_locate_target_for_name(lmv, lsm, new, newlen,
+						    &op_data->op_fid2,
+						    &op_data->op_mds);
+	else
+		tp_tgt = lmv_find_target(lmv, &op_data->op_fid2);
+	if (IS_ERR(tp_tgt))
+		return PTR_ERR(tp_tgt);
+
 	/*
-	 * Cancel all the locks on tgt child (fid4).
+	 * Since the target child might be destroyed, and it might
+	 * become orphan, and we can only check orphan on the local
+	 * MDT right now, so we send rename request to the MDT where
+	 * target child is located. If target child does not exist,
+	 * then it will send the request to the target parent
 	 */
 	if (fid_is_sane(&op_data->op_fid4)) {
-		struct lmv_tgt_desc *tgt;
-
-		rc = lmv_early_cancel(exp, NULL, op_data, src_tgt->ltd_idx,
-				      LCK_EX, MDS_INODELOCK_ELC,
-				      MF_MDC_CANCEL_FID4);
-		if (rc)
-			return rc;
-
 		tgt = lmv_find_target(lmv, &op_data->op_fid4);
 		if (IS_ERR(tgt))
 			return PTR_ERR(tgt);
+	} else {
+		tgt = tp_tgt;
+	}
 
-		/*
-		 * Since the target child might be destroyed, and it might
-		 * become orphan, and we can only check orphan on the local
-		 * MDT right now, so we send rename request to the MDT where
-		 * target child is located. If target child does not exist,
-		 * then it will send the request to the target parent
-		 */
-		target_exp = tgt->ltd_exp;
+	op_data->op_flags |= MF_MDC_CANCEL_FID4;
+
+	/* cancel UPDATE locks of source parent */
+	rc = lmv_early_cancel(exp, sp_tgt, op_data, tgt->ltd_idx, LCK_EX,
+			      MDS_INODELOCK_UPDATE, MF_MDC_CANCEL_FID1);
+	if (rc != 0)
+		return rc;
+
+	/* cancel UPDATE locks of target parent */
+	rc = lmv_early_cancel(exp, tp_tgt, op_data, tgt->ltd_idx, LCK_EX,
+			      MDS_INODELOCK_UPDATE, MF_MDC_CANCEL_FID2);
+	if (rc != 0)
+		return rc;
+
+	if (fid_is_sane(&op_data->op_fid3)) {
+		struct lmv_tgt_desc *src_tgt;
+
+		src_tgt = lmv_find_target(lmv, &op_data->op_fid3);
+		if (IS_ERR(src_tgt))
+			return PTR_ERR(src_tgt);
+
+		/* cancel LOOKUP lock of source on source parent */
+		if (src_tgt != sp_tgt) {
+			rc = lmv_early_cancel(exp, sp_tgt, op_data,
+					      tgt->ltd_idx, LCK_EX,
+					      MDS_INODELOCK_LOOKUP,
+					      MF_MDC_CANCEL_FID3);
+			if (rc != 0)
+				return rc;
+		}
+
+		/* cancel ELC locks of source */
+		rc = lmv_early_cancel(exp, src_tgt, op_data, tgt->ltd_idx,
+				      LCK_EX, MDS_INODELOCK_ELC,
+				      MF_MDC_CANCEL_FID3);
+		if (rc != 0)
+			return rc;
+	}
+
+retry_rename:
+	if (fid_is_sane(&op_data->op_fid4)) {
+		/* cancel LOOKUP lock of target on target parent */
+		if (tgt != tp_tgt) {
+			rc = lmv_early_cancel(exp, tp_tgt, op_data,
+					      tgt->ltd_idx, LCK_EX,
+					      MDS_INODELOCK_LOOKUP,
+					      MF_MDC_CANCEL_FID4);
+			if (rc != 0)
+				return rc;
+		}
 	}
 
-	rc = md_rename(target_exp, op_data, old, oldlen, new, newlen, request);
+	rc = md_rename(tgt->ltd_exp, op_data, old, oldlen, new, newlen,
+		       request);
 	if (rc && rc != -EXDEV)
 		return rc;
 
@@ -2001,6 +2131,11 @@ static int lmv_rename(struct obd_export *exp, struct md_op_data *op_data,
 	op_data->op_fid4 = body->mbo_fid1;
 	ptlrpc_req_finished(*request);
 	*request = NULL;
+
+	tgt = lmv_find_target(lmv, &op_data->op_fid4);
+	if (IS_ERR(tgt))
+		return PTR_ERR(tgt);
+
 	goto retry_rename;
 }
 
@@ -2743,6 +2878,8 @@ static int lmv_unpack_md_v1(struct obd_export *exp, struct lmv_stripe_md *lsm,
 	else
 		lsm->lsm_md_hash_type = le32_to_cpu(lmm1->lmv_hash_type);
 	lsm->lsm_md_layout_version = le32_to_cpu(lmm1->lmv_layout_version);
+	lsm->lsm_md_migrate_offset = le32_to_cpu(lmm1->lmv_migrate_offset);
+	lsm->lsm_md_migrate_hash = le32_to_cpu(lmm1->lmv_migrate_hash);
 	cplen = strlcpy(lsm->lsm_md_pool_name, lmm1->lmv_pool_name,
 			sizeof(lsm->lsm_md_pool_name));
 
@@ -2750,7 +2887,7 @@ static int lmv_unpack_md_v1(struct obd_export *exp, struct lmv_stripe_md *lsm,
 		return -E2BIG;
 
 	CDEBUG(D_INFO,
-	       "unpack lsm count %d, master %d hash_type %d layout_version %d\n",
+	       "unpack lsm count %d, master %d hash_type %#x  layout_version %d\n",
 	       lsm->lsm_md_stripe_count, lsm->lsm_md_master_mdt_index,
 	       lsm->lsm_md_hash_type, lsm->lsm_md_layout_version);
 
@@ -2783,16 +2920,8 @@ static int lmv_unpackmd(struct obd_export *exp, struct lmv_stripe_md **lsmp,
 	if (lsm && !lmm) {
 		int i;
 
-		for (i = 0; i < lsm->lsm_md_stripe_count; i++) {
-			/*
-			 * For migrating inode, the master stripe and master
-			 * object will be the same, so do not need iput, see
-			 * ll_update_lsm_md
-			 */
-			if (!(lsm->lsm_md_hash_type & LMV_HASH_FLAG_MIGRATION &&
-			      !i))
-				iput(lsm->lsm_md_oinfo[i].lmo_root);
-		}
+		for (i = 0; i < lsm->lsm_md_stripe_count; i++)
+			iput(lsm->lsm_md_oinfo[i].lmo_root);
 
 		kvfree(lsm);
 		*lsmp = NULL;
diff --git a/fs/lustre/mdc/mdc_internal.h b/fs/lustre/mdc/mdc_internal.h
index 6cfa79c..b4af9778 100644
--- a/fs/lustre/mdc/mdc_internal.h
+++ b/fs/lustre/mdc/mdc_internal.h
@@ -63,6 +63,8 @@ void mdc_file_secctx_pack(struct ptlrpc_request *req,
 void mdc_rename_pack(struct ptlrpc_request *req, struct md_op_data *op_data,
 		     const char *old, size_t oldlen,
 		     const char *new, size_t newlen);
+void mdc_migrate_pack(struct ptlrpc_request *req, struct md_op_data *op_data,
+			const char *name, size_t namelen);
 void mdc_close_pack(struct ptlrpc_request *req, struct md_op_data *op_data);
 
 /* mdc/mdc_locks.c */
diff --git a/fs/lustre/mdc/mdc_lib.c b/fs/lustre/mdc/mdc_lib.c
index 1d38574..5b1691e 100644
--- a/fs/lustre/mdc/mdc_lib.c
+++ b/fs/lustre/mdc/mdc_lib.c
@@ -489,8 +489,7 @@ void mdc_rename_pack(struct ptlrpc_request *req, struct md_op_data *op_data,
 	rec = req_capsule_client_get(&req->rq_pill, &RMF_REC_REINT);
 
 	/* XXX do something about time, uid, gid */
-	rec->rn_opcode = op_data->op_cli_flags & CLI_MIGRATE ?
-			 REINT_MIGRATE : REINT_RENAME;
+	rec->rn_opcode = REINT_RENAME;
 	rec->rn_fsuid = op_data->op_fsuid;
 	rec->rn_fsgid = op_data->op_fsgid;
 	rec->rn_cap = op_data->op_cap.cap[0];
@@ -506,22 +505,42 @@ void mdc_rename_pack(struct ptlrpc_request *req, struct md_op_data *op_data,
 
 	if (new)
 		mdc_pack_name(req, &RMF_SYMTGT, new, newlen);
+}
 
-	if (op_data->op_cli_flags & CLI_MIGRATE) {
-		char *tmp;
+void mdc_migrate_pack(struct ptlrpc_request *req, struct md_op_data *op_data,
+		      const char *name, size_t namelen)
+{
+	struct mdt_rec_rename *rec;
+	char *ea;
 
-		if (op_data->op_bias & MDS_CLOSE_MIGRATE) {
-			struct mdt_ioepoch *epoch;
+	BUILD_BUG_ON(sizeof(struct mdt_rec_reint) !=
+		     sizeof(struct mdt_rec_rename));
+	rec = req_capsule_client_get(&req->rq_pill, &RMF_REC_REINT);
 
-			mdc_close_intent_pack(req, op_data);
-			epoch = req_capsule_client_get(&req->rq_pill,
-							&RMF_MDT_EPOCH);
-			mdc_ioepoch_pack(epoch, op_data);
-		}
+	rec->rn_opcode	 = REINT_MIGRATE;
+	rec->rn_fsuid	 = op_data->op_fsuid;
+	rec->rn_fsgid	 = op_data->op_fsgid;
+	rec->rn_cap	 = op_data->op_cap.cap[0];
+	rec->rn_suppgid1 = op_data->op_suppgids[0];
+	rec->rn_suppgid2 = op_data->op_suppgids[1];
+	rec->rn_fid1	 = op_data->op_fid1;
+	rec->rn_fid2	 = op_data->op_fid4;
+	rec->rn_time	 = op_data->op_mod_time;
+	rec->rn_mode	 = op_data->op_mode;
+	rec->rn_bias	 = op_data->op_bias;
 
-		tmp = req_capsule_client_get(&req->rq_pill, &RMF_EADATA);
-		memcpy(tmp, op_data->op_data, op_data->op_data_size);
+	mdc_pack_name(req, &RMF_NAME, name, namelen);
+
+	if (op_data->op_bias & MDS_CLOSE_MIGRATE) {
+		struct mdt_ioepoch *epoch;
+
+		mdc_close_intent_pack(req, op_data);
+		epoch = req_capsule_client_get(&req->rq_pill, &RMF_MDT_EPOCH);
+		mdc_ioepoch_pack(epoch, op_data);
 	}
+
+	ea = req_capsule_client_get(&req->rq_pill, &RMF_EADATA);
+	memcpy(ea, op_data->op_data, op_data->op_data_size);
 }
 
 void mdc_getattr_pack(struct ptlrpc_request *req, u64 valid, u32 flags,
diff --git a/fs/lustre/mdc/mdc_reint.c b/fs/lustre/mdc/mdc_reint.c
index 030c247..355cee1 100644
--- a/fs/lustre/mdc/mdc_reint.c
+++ b/fs/lustre/mdc/mdc_reint.c
@@ -403,7 +403,10 @@ int mdc_rename(struct obd_export *exp, struct md_op_data *op_data,
 	if (exp_connect_cancelset(exp) && req)
 		ldlm_cli_cancel_list(&cancels, count, req, 0);
 
-	mdc_rename_pack(req, op_data, old, oldlen, new, newlen);
+	if (op_data->op_cli_flags & CLI_MIGRATE)
+		mdc_migrate_pack(req, op_data, old, oldlen);
+	else
+		mdc_rename_pack(req, op_data, old, oldlen, new, newlen);
 
 	req_capsule_set_size(&req->rq_pill, &RMF_MDT_MD, RCL_SERVER,
 			     obd->u.cli.cl_default_mds_easize);
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index 30083c2..4095767 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1627,13 +1627,17 @@ void lustre_assert_wire_constants(void)
 		 (long long)(int)offsetof(struct lmv_mds_md_v1, lmv_layout_version));
 	LASSERTF((int)sizeof(((struct lmv_mds_md_v1 *)0)->lmv_layout_version) == 4, "found %lld\n",
 		 (long long)(int)sizeof(((struct lmv_mds_md_v1 *)0)->lmv_layout_version));
-	LASSERTF((int)offsetof(struct lmv_mds_md_v1, lmv_padding1) == 20, "found %lld\n",
-		 (long long)(int)offsetof(struct lmv_mds_md_v1, lmv_padding1));
-	LASSERTF((int)sizeof(((struct lmv_mds_md_v1 *)0)->lmv_padding1) == 4, "found %lld\n",
-		 (long long)(int)sizeof(((struct lmv_mds_md_v1 *)0)->lmv_padding1));
-	LASSERTF((int)offsetof(struct lmv_mds_md_v1, lmv_padding2) == 24, "found %lld\n",
+	LASSERTF((int)offsetof(struct lmv_mds_md_v1, lmv_migrate_offset) == 20, "found %lld\n",
+		 (long long)(int)offsetof(struct lmv_mds_md_v1, lmv_migrate_offset));
+	LASSERTF((int)sizeof(((struct lmv_mds_md_v1 *)0)->lmv_migrate_offset) == 4, "found %lld\n",
+		 (long long)(int)sizeof(((struct lmv_mds_md_v1 *)0)->lmv_migrate_offset));
+	LASSERTF((int)offsetof(struct lmv_mds_md_v1, lmv_migrate_hash) == 24, "found %lld\n",
+		 (long long)(int)offsetof(struct lmv_mds_md_v1, lmv_migrate_hash));
+	LASSERTF((int)sizeof(((struct lmv_mds_md_v1 *)0)->lmv_migrate_hash) == 4, "found %lld\n",
+		 (long long)(int)sizeof(((struct lmv_mds_md_v1 *)0)->lmv_migrate_hash));
+	LASSERTF((int)offsetof(struct lmv_mds_md_v1, lmv_padding2) == 28, "found %lld\n",
 		 (long long)(int)offsetof(struct lmv_mds_md_v1, lmv_padding2));
-	LASSERTF((int)sizeof(((struct lmv_mds_md_v1 *)0)->lmv_padding2) == 8, "found %lld\n",
+	LASSERTF((int)sizeof(((struct lmv_mds_md_v1 *)0)->lmv_padding2) == 4, "found %lld\n",
 		 (long long)(int)sizeof(((struct lmv_mds_md_v1 *)0)->lmv_padding2));
 	LASSERTF((int)offsetof(struct lmv_mds_md_v1, lmv_padding3) == 32, "found %lld\n",
 		 (long long)(int)offsetof(struct lmv_mds_md_v1, lmv_padding3));
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 7f857be..522bd52 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -1941,9 +1941,19 @@ struct lmv_mds_md_v1 {
 					 * be used to mark the object status,
 					 * for example migrating or dead.
 					 */
-	__u32 lmv_layout_version;	/* Used for directory restriping */
-	__u32 lmv_padding1;
-	__u64 lmv_padding2;
+	__u32 lmv_layout_version;	/* increased each time layout changed,
+					 * by directory migration, restripe
+					 * and LFSCK.
+					 */
+	__u32 lmv_migrate_offset;	/* once this is set, this directory
+					 * is being migrated: stripes before
+					 * this offset belong to the target,
+					 * and those from it on to the source.
+					 */
+	__u32 lmv_migrate_hash;		/* hash type of source stripes of
+					 * migrating directory
+					 */
+	__u32 lmv_padding2;
 	__u64 lmv_padding3;
 	char lmv_pool_name[LOV_MAXPOOLNAME + 1];/* pool name */
 	struct lu_fid lmv_stripe_fids[0];	/* FIDs for each stripe */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 115/622] lustre: obdclass: remove unused ll_import_cachep
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (113 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 114/622] lustre: migrate: migrate striped directory James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 116/622] lustre: ptlrpc: add debugging for idle connections James Simmons
                   ` (507 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

The ll_import_cache slab cache is not used anywhere and can be removed.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10899
Lustre-commit: e23250110729 ("LU-10899 obdclass: remove unused ll_import_cachep")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33119
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/genops.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/fs/lustre/obdclass/genops.c b/fs/lustre/obdclass/genops.c
index fc50aba..a122332 100644
--- a/fs/lustre/obdclass/genops.c
+++ b/fs/lustre/obdclass/genops.c
@@ -48,7 +48,6 @@
 static struct kmem_cache *obd_device_cachep;
 struct kmem_cache *obdo_cachep;
 EXPORT_SYMBOL(obdo_cachep);
-static struct kmem_cache *import_cachep;
 
 static struct kobj_type class_ktype;
 static struct workqueue_struct *zombie_wq;
@@ -648,8 +647,6 @@ void obd_cleanup_caches(void)
 	obd_device_cachep = NULL;
 	kmem_cache_destroy(obdo_cachep);
 	obdo_cachep = NULL;
-	kmem_cache_destroy(import_cachep);
-	import_cachep = NULL;
 }
 
 int obd_init_caches(void)
@@ -667,13 +664,6 @@ int obd_init_caches(void)
 	if (!obdo_cachep)
 		goto out;
 
-	LASSERT(!import_cachep);
-	import_cachep = kmem_cache_create("ll_import_cache",
-					  sizeof(struct obd_import),
-					  0, 0, NULL);
-	if (!import_cachep)
-		goto out;
-
 	return 0;
 out:
 	obd_cleanup_caches();
-- 
1.8.3.1


* [lustre-devel] [PATCH 116/622] lustre: ptlrpc: add debugging for idle connections
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (114 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 115/622] lustre: obdclass: remove unused ll_import_cachep James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 117/622] lustre: obdclass: Add lbug_on_eviction option James Simmons
                   ` (506 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Add a "debug" parameter for the idle client disconnection so that
it can log disconnect/reconnect events to the console.

Print the idle time in the "import" file.

Enable the connection debugging for all test runs.
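
As a usage sketch (the accepted values follow this patch; the lctl
parameter paths and device patterns are assumptions):

```shell
# Log idle disconnect/reconnect events to the console for all OSC devices
lctl set_param osc.*.idle_timeout=debug

# Revert to the quieter default (D_HA) debug mask
lctl set_param osc.*.idle_timeout=nodebug

# A plain number still sets the idle timeout in seconds; 0 disables
# idling and immediately re-establishes the connection
lctl set_param osc.*.idle_timeout=0

# The import file now also reports how long the import has been idle
lctl get_param osc.*.import | grep idle
```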

WC-bug-id: https://jira.whamcloud.com/browse/LU-11128
Lustre-commit: 0aa58d26f5df ("LU-11128 ptlrpc: add debugging for idle connections")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33168
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Nathaniel Clark <nclark@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_import.h   |  1 +
 fs/lustre/obdclass/lprocfs_status.c |  6 ++++--
 fs/lustre/osc/lproc_osc.c           | 34 ++++++++++++++++++++++------------
 fs/lustre/osc/osc_request.c         |  1 +
 fs/lustre/ptlrpc/client.c           |  6 ++++--
 fs/lustre/ptlrpc/import.c           |  4 +++-
 6 files changed, 35 insertions(+), 17 deletions(-)

diff --git a/fs/lustre/include/lustre_import.h b/fs/lustre/include/lustre_import.h
index c4452e1..1fd6246 100644
--- a/fs/lustre/include/lustre_import.h
+++ b/fs/lustre/include/lustre_import.h
@@ -304,6 +304,7 @@ struct obd_import {
 
 	u32				imp_connect_op;
 	u32				imp_idle_timeout;
+	u32				imp_idle_debug;
 	struct obd_connect_data		imp_connect_data;
 	u64				imp_connect_flags_orig;
 	u64				imp_connect_flags2_orig;
diff --git a/fs/lustre/obdclass/lprocfs_status.c b/fs/lustre/obdclass/lprocfs_status.c
index fbd46df..747baff 100644
--- a/fs/lustre/obdclass/lprocfs_status.c
+++ b/fs/lustre/obdclass/lprocfs_status.c
@@ -802,11 +802,13 @@ int lprocfs_rd_import(struct seq_file *m, void *data)
 		   "       current_connection: %s\n"
 		   "       connection_attempts: %u\n"
 		   "       generation: %u\n"
-		   "       in-progress_invalidations: %u\n",
+		   "       in-progress_invalidations: %u\n"
+		   "       idle: %lld sec\n",
 		   nidstr,
 		   imp->imp_conn_cnt,
 		   imp->imp_generation,
-		   atomic_read(&imp->imp_inval_count));
+		   atomic_read(&imp->imp_inval_count),
+		   ktime_get_real_seconds() - imp->imp_last_reply_time);
 	spin_unlock(&imp->imp_lock);
 
 	if (!obd->obd_svc_stats)
diff --git a/fs/lustre/osc/lproc_osc.c b/fs/lustre/osc/lproc_osc.c
index 16de266..f025275 100644
--- a/fs/lustre/osc/lproc_osc.c
+++ b/fs/lustre/osc/lproc_osc.c
@@ -622,27 +622,37 @@ static ssize_t idle_timeout_store(struct kobject *kobj, struct attribute *attr,
 					      obd_kset.kobj);
 	struct client_obd *cli = &obd->u.cli;
 	struct ptlrpc_request *req;
+	unsigned int idle_debug = 0;
 	unsigned int val;
 	int rc;
 
-	rc = kstrtouint(buffer, 0, &val);
-	if (rc)
-		return rc;
+	if (strncmp(buffer, "debug", 5) == 0) {
+		idle_debug = D_CONSOLE;
+	} else if (strncmp(buffer, "nodebug", 7) == 0) {
+		idle_debug = D_HA;
+	} else {
+		rc = kstrtouint(buffer, 0, &val);
+		if (rc)
+			return rc;
 
-	if (val > CONNECTION_SWITCH_MAX)
-		return -ERANGE;
+		if (val > CONNECTION_SWITCH_MAX)
+			return -ERANGE;
+	}
 
 	rc = lprocfs_climp_check(obd);
 	if (rc)
 		return rc;
 
-	cli->cl_import->imp_idle_timeout = val;
-
-	/* to initiate the connection if it's in IDLE state */
-	if (!val) {
-		req = ptlrpc_request_alloc(cli->cl_import, &RQF_OST_STATFS);
-		if (req)
-			ptlrpc_req_finished(req);
+	if (idle_debug) {
+		cli->cl_import->imp_idle_debug = idle_debug;
+	} else {
+		/* to initiate the connection if it's in IDLE state */
+		if (!val) {
+			req = ptlrpc_request_alloc(cli->cl_import,
+						   &RQF_OST_STATFS);
+			if (req)
+				ptlrpc_req_finished(req);
+		}
+		cli->cl_import->imp_idle_timeout = val;
 	}
 	up_read(&obd->u.cli.cl_sem);
 
diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index 1a9ed8d..2784e1e 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -3271,6 +3271,7 @@ int osc_setup(struct obd_device *obd, struct lustre_cfg *lcfg)
 	list_add_tail(&cli->cl_shrink_list, &osc_shrink_list);
 	spin_unlock(&osc_shrink_lock);
 	cli->cl_import->imp_idle_timeout = osc_idle_timeout;
+	cli->cl_import->imp_idle_debug = D_HA;
 
 	return rc;
 
diff --git a/fs/lustre/ptlrpc/client.c b/fs/lustre/ptlrpc/client.c
index 57b08de..691df1a 100644
--- a/fs/lustre/ptlrpc/client.c
+++ b/fs/lustre/ptlrpc/client.c
@@ -890,8 +890,10 @@ struct ptlrpc_request *__ptlrpc_request_alloc(struct obd_import *imp,
 	if (unlikely(imp->imp_state == LUSTRE_IMP_IDLE)) {
 		int rc;
 
-		CDEBUG(D_INFO, "%s: connect at new req\n",
-		       imp->imp_obd->obd_name);
+		CDEBUG_LIMIT(imp->imp_idle_debug,
+			     "%s: reconnect after %llds idle\n",
+			     imp->imp_obd->obd_name, ktime_get_real_seconds() -
+						     imp->imp_last_reply_time);
 		spin_lock(&imp->imp_lock);
 		if (imp->imp_state == LUSTRE_IMP_IDLE) {
 			imp->imp_generation++;
diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index b90f78c..b11bb2f 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -1623,7 +1623,9 @@ int ptlrpc_disconnect_and_idle_import(struct obd_import *imp)
 	if (IS_ERR(req))
 		return PTR_ERR(req);
 
-	CDEBUG(D_INFO, "%s: disconnect\n", imp->imp_obd->obd_name);
+	CDEBUG_LIMIT(imp->imp_idle_debug, "%s: disconnect after %llus idle\n",
+		     imp->imp_obd->obd_name,
+		     ktime_get_real_seconds() - imp->imp_last_reply_time);
 	req->rq_interpret_reply = ptlrpc_disconnect_idle_interpret;
 	ptlrpcd_add_req(req);
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 117/622] lustre: obdclass: Add lbug_on_eviction option
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (115 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 116/622] lustre: ptlrpc: add debugging for idle connections James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 118/622] lustre: lmv: support accessing migrating directory James Simmons
                   ` (505 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Ryan Haasken <haasken@cray.com>

Add an lbug_on_eviction sysfs interface.  When it is set to a non-zero
value on a client, it causes the client to LBUG whenever it is evicted
by a server. Note that an MDS is a client to the OSTs, and every
server is a client of the MGS; thus it is probably desirable to leave
this set to zero on servers.
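
A hedged usage sketch follows; the tunable is registered alongside the
other static obdclass attributes, but the exact sysfs path shown is an
assumption:

```shell
# Crash (LBUG) this node when any of its imports is evicted, so a
# crash dump is captured at the moment of eviction:
echo 1 > /sys/fs/lustre/lbug_on_eviction

# Default behaviour: evictions are only logged
echo 0 > /sys/fs/lustre/lbug_on_eviction
```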

Cray-bug-id: LUS-2591
WC-bug-id: https://jira.whamcloud.com/browse/LU-5026
Lustre-commit: 97381ffc9231 ("LU-5026 obdclass: Add lbug_on_eviction option")
Signed-off-by: Ryan Haasken <haasken@cray.com>
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/10257
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_support.h | 1 +
 fs/lustre/obdclass/class_obd.c  | 2 ++
 fs/lustre/obdclass/obd_sysfs.c  | 2 ++
 fs/lustre/ptlrpc/import.c       | 1 +
 4 files changed, 6 insertions(+)

diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index 3d14723..04ef76f 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -43,6 +43,7 @@
 extern unsigned int obd_debug_peer_on_timeout;
 extern unsigned int obd_dump_on_timeout;
 extern unsigned int obd_dump_on_eviction;
+extern unsigned int obd_lbug_on_eviction;
 /* obd_timeout should only be used for recovery, not for
  * networking / disk / timings affected by load (use Adaptive Timeouts)
  */
diff --git a/fs/lustre/obdclass/class_obd.c b/fs/lustre/obdclass/class_obd.c
index 7e436af..4ef9cca 100644
--- a/fs/lustre/obdclass/class_obd.c
+++ b/fs/lustre/obdclass/class_obd.c
@@ -56,6 +56,8 @@
 EXPORT_SYMBOL(obd_dump_on_timeout);
 unsigned int obd_dump_on_eviction;
 EXPORT_SYMBOL(obd_dump_on_eviction);
+unsigned int obd_lbug_on_eviction;
+EXPORT_SYMBOL(obd_lbug_on_eviction);
 unsigned long obd_max_dirty_pages;
 EXPORT_SYMBOL(obd_max_dirty_pages);
 atomic_long_t obd_dirty_pages;
diff --git a/fs/lustre/obdclass/obd_sysfs.c b/fs/lustre/obdclass/obd_sysfs.c
index cd2917e..73e44e7 100644
--- a/fs/lustre/obdclass/obd_sysfs.c
+++ b/fs/lustre/obdclass/obd_sysfs.c
@@ -118,6 +118,7 @@ static ssize_t static_uintvalue_store(struct kobject *kobj,
 LUSTRE_STATIC_UINT_ATTR(at_extra, &at_extra);
 LUSTRE_STATIC_UINT_ATTR(at_early_margin, &at_early_margin);
 LUSTRE_STATIC_UINT_ATTR(at_history, &at_history);
+LUSTRE_STATIC_UINT_ATTR(lbug_on_eviction, &obd_lbug_on_eviction);
 
 static ssize_t max_dirty_mb_show(struct kobject *kobj, struct attribute *attr,
 				 char *buf)
@@ -280,6 +281,7 @@ static ssize_t jobid_name_store(struct kobject *kobj, struct attribute *attr,
 	&lustre_sattr_at_extra.u.attr,
 	&lustre_sattr_at_early_margin.u.attr,
 	&lustre_sattr_at_history.u.attr,
+	&lustre_sattr_lbug_on_eviction.u.attr,
 	NULL,
 };
 
diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index b11bb2f..73a345f 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -1385,6 +1385,7 @@ int ptlrpc_import_recovery_state_machine(struct obd_import *imp)
 					   "%s: This client was evicted by %.*s; in progress operations using this service will fail.\n",
 					   imp->imp_obd->obd_name, target_len,
 					   target_start);
+			LASSERTF(!obd_lbug_on_eviction, "LBUG upon eviction");
 		}
 		CDEBUG(D_HA, "evicted from %s@%s; invalidating\n",
 		       obd2cli_tgt(imp->imp_obd),
-- 
1.8.3.1


* [lustre-devel] [PATCH 118/622] lustre: lmv: support accessing migrating directory
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (116 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 117/622] lustre: obdclass: Add lbug_on_eviction option James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 119/622] lustre: mdc: move RPC semaphore code to lustre/osp James Simmons
                   ` (504 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

A migrating directory contains stripes of both the old and the new
layout, and its sub-files may be located on either one. To avoid races
between accesses and new creations, there are 4 rules for accessing a
migrating directory:
1. always create new files under the new layout.
2. any operation that tries to create a new file under the old layout
   will be rejected, e.g. for 'mv a <migrating_dir>/b', if b exists and
   is under the old layout, the rename should fail with -EBUSY.
3. operations that access a file by name should try the old layout
   first; if the file doesn't exist there, retry under the new layout.
   Such operations include: lookup, getattr_name, unlink, open-by-name,
   link, rename.
4. according to rule 1, open(O_CREAT | O_EXCL) and create() will create
   new files under the new layout, but they should check for an existing
   file in one transaction. Since this can't be done for the old layout,
   check for an existing file under the old layout on the client side,
   then issue the open/create request against the new layout.

Disable sanity test 230d for the ZFS backend because it triggers lots
of sync operations, which may cause the system to hang.
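
The old-versus-new layout choice behind rules 1-3 can be sketched as a
simplified Python model of lsm_name_to_stripe_info() from this patch;
the byte-sum hash and the LMV_HASH_FLAG_MIGRATION value are
illustrative assumptions, not the real Lustre hash functions:

```python
# Pick the stripe for a name in a (possibly migrating) directory.
LMV_HASH_FLAG_MIGRATION = 0x80000000  # assumed value, for illustration

def name_to_stripe_index(hash_type, stripe_count, name):
    # stand-in for lmv_name_to_stripe_index(); ignores the hash type
    return sum(name.encode()) % stripe_count

def stripe_for_name(lsm, name, post_migrate):
    """Old-layout lookup unless post_migrate is set, mirroring rule 3."""
    hash_type = lsm["hash_type"]
    stripe_count = lsm["stripe_count"]
    migrating = bool(hash_type & LMV_HASH_FLAG_MIGRATION)
    if migrating:
        if post_migrate:
            # new layout occupies stripes [0, migrate_offset)
            hash_type &= ~LMV_HASH_FLAG_MIGRATION
            stripe_count = lsm["migrate_offset"]
        else:
            # old layout occupies stripes [migrate_offset, stripe_count)
            hash_type = lsm["migrate_hash"]
            stripe_count -= lsm["migrate_offset"]
    index = name_to_stripe_index(hash_type, stripe_count, name)
    if migrating and not post_migrate:
        index += lsm["migrate_offset"]
    return index
```

With post_migrate unset, a name hashes into the old-layout stripes at
the tail of the stripe array; after a negative lookup the client sets
op_post_migrate and retries against the new-layout stripes at the head,
which is the retry loop this patch adds to lmv_intent_lookup().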

WC-bug-id: https://jira.whamcloud.com/browse/LU-4684
Lustre-commit: 976b609abcdf ("LU-4684 lmv: support accessing migrating directory")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/31504
Reviewed-by: Fan Yong <fan.yong@intel.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd.h      |  12 ++
 fs/lustre/lmv/lmv_intent.c   | 132 +++++++------
 fs/lustre/lmv/lmv_internal.h |  75 +++++--
 fs/lustre/lmv/lmv_obd.c      | 453 ++++++++++++++++++++++---------------------
 4 files changed, 381 insertions(+), 291 deletions(-)

diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index 9286755..b404391 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -787,6 +787,18 @@ struct md_op_data {
 	u32			op_projid;
 
 	u16			op_mirror_id;
+
+	/*
+	 * used to access a migrating dir: if set, assume migration is
+	 * finished and use the new layout to access the dir, otherwise use
+	 * the old layout. By default it's not set: new files are created
+	 * under the new layout, so if a name is found under neither the old
+	 * nor the new layout, the file surely doesn't exist; checking in the
+	 * reverse order could race with creation by others.
+	 */
+	bool			op_post_migrate;
+	/* used to access dir with bad hash */
+	u32			op_stripe_index;
 };
 
 struct md_callback {
diff --git a/fs/lustre/lmv/lmv_intent.c b/fs/lustre/lmv/lmv_intent.c
index 355a2af..3f51032 100644
--- a/fs/lustre/lmv/lmv_intent.c
+++ b/fs/lustre/lmv/lmv_intent.c
@@ -191,7 +191,7 @@ int lmv_revalidate_slaves(struct obd_export *exp,
 		op_data->op_fid1 = fid;
 		op_data->op_fid2 = fid;
 
-		tgt = lmv_locate_mds(lmv, op_data, &fid);
+		tgt = lmv_get_target(lmv, lsm->lsm_md_oinfo[i].lmo_mds, NULL);
 		if (IS_ERR(tgt)) {
 			rc = PTR_ERR(tgt);
 			goto cleanup;
@@ -269,8 +269,52 @@ static int lmv_intent_open(struct obd_export *exp, struct md_op_data *op_data,
 	struct lmv_obd *lmv = &obd->u.lmv;
 	struct lmv_tgt_desc *tgt;
 	struct mdt_body	*body;
+	u64 flags = it->it_flags;
 	int rc;
 
+	if ((it->it_op & IT_CREAT) && !(flags & MDS_OPEN_BY_FID)) {
+		/* don't allow create under dir with bad hash */
+		if (lmv_is_dir_bad_hash(op_data->op_mea1))
+			return -EBADF;
+
+		if (lmv_is_dir_migrating(op_data->op_mea1)) {
+			if (flags & O_EXCL) {
+				/*
+				 * open(O_CREAT | O_EXCL) needs to check
+				 * existing name, which should be done on both
+				 * old and new layout, to avoid creating new
+				 * file under old layout, check old layout on
+				 * client side.
+				 */
+				tgt = lmv_locate_tgt(lmv, op_data,
+						     &op_data->op_fid1);
+				if (IS_ERR(tgt))
+					return PTR_ERR(tgt);
+
+				rc = md_getattr_name(tgt->ltd_exp, op_data,
+						     reqp);
+				if (!rc) {
+					ptlrpc_req_finished(*reqp);
+					*reqp = NULL;
+					return -EEXIST;
+				}
+
+				if (rc != -ENOENT)
+					return rc;
+
+				op_data->op_post_migrate = true;
+			} else {
+				/*
+				 * open(O_CREAT) will be sent to MDT in old
+				 * layout first, to avoid creating new file
+				 * under old layout, clear O_CREAT.
+				 */
+				it->it_flags &= ~O_CREAT;
+			}
+		}
+	}
+
+retry:
 	if (it->it_flags & MDS_OPEN_BY_FID) {
 		LASSERT(fid_is_sane(&op_data->op_fid2));
 
@@ -292,7 +336,7 @@ static int lmv_intent_open(struct obd_export *exp, struct md_op_data *op_data,
 		LASSERT(fid_is_zero(&op_data->op_fid2));
 		LASSERT(op_data->op_name);
 
-		tgt = lmv_locate_mds(lmv, op_data, &op_data->op_fid1);
+		tgt = lmv_locate_tgt(lmv, op_data, &op_data->op_fid1);
 		if (IS_ERR(tgt))
 			return PTR_ERR(tgt);
 	}
@@ -325,8 +369,21 @@ static int lmv_intent_open(struct obd_export *exp, struct md_op_data *op_data,
 	 */
 	if ((it->it_disposition & DISP_LOOKUP_NEG) &&
 	    !(it->it_disposition & DISP_OPEN_CREATE) &&
-	    !(it->it_disposition & DISP_OPEN_OPEN))
+	    !(it->it_disposition & DISP_OPEN_OPEN)) {
+		if (!(it->it_flags & MDS_OPEN_BY_FID) &&
+		    lmv_dir_retry_check_update(op_data)) {
+			ptlrpc_req_finished(*reqp);
+			it->it_request = NULL;
+			it->it_disposition = 0;
+			*reqp = NULL;
+
+			it->it_flags = flags;
+			fid_zero(&op_data->op_fid2);
+			goto retry;
+		}
+
 		return rc;
+	}
 
 	body = req_capsule_server_get(&(*reqp)->rq_pill, &RMF_MDT_BODY);
 	if (!body)
@@ -357,43 +414,25 @@ static int lmv_intent_lookup(struct obd_export *exp,
 			     ldlm_blocking_callback cb_blocking,
 			     u64 extra_lock_flags)
 {
-	struct lmv_stripe_md *lsm = op_data->op_mea1;
 	struct obd_device *obd = exp->exp_obd;
 	struct lmv_obd *lmv = &obd->u.lmv;
 	struct lmv_tgt_desc *tgt = NULL;
 	struct mdt_body	*body;
-	int rc = 0;
+	int rc;
 
-	/*
-	 * If it returns ERR_PTR(-EBADFD) then it is an unknown hash type
-	 * it will try all stripes to locate the object
-	 */
-	tgt = lmv_locate_mds(lmv, op_data, &op_data->op_fid1);
-	if (IS_ERR(tgt) && (PTR_ERR(tgt) != -EBADFD))
+retry:
+	tgt = lmv_locate_tgt(lmv, op_data, &op_data->op_fid1);
+	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
-	/*
-	 * Both migrating dir and unknown hash dir need to try
-	 * all of sub-stripes
-	 */
-	if (lsm && !lmv_is_known_hash_type(lsm->lsm_md_hash_type)) {
-		struct lmv_oinfo *oinfo = &lsm->lsm_md_oinfo[0];
-
-		op_data->op_fid1 = oinfo->lmo_fid;
-		op_data->op_mds = oinfo->lmo_mds;
-		tgt = lmv_get_target(lmv, oinfo->lmo_mds, NULL);
-		if (IS_ERR(tgt))
-			return PTR_ERR(tgt);
-	}
-
 	if (!fid_is_sane(&op_data->op_fid2))
 		fid_zero(&op_data->op_fid2);
 
 	CDEBUG(D_INODE,
-	       "LOOKUP_INTENT with fid1=" DFID ", fid2=" DFID ", name='%s' -> mds #%u lsm=%p lsm_magic=%x\n",
+	       "LOOKUP_INTENT with fid1=" DFID ", fid2=" DFID ", name='%s' -> mds #%u\n",
 	       PFID(&op_data->op_fid1), PFID(&op_data->op_fid2),
 	       op_data->op_name ? op_data->op_name : "<NULL>",
-	       tgt->ltd_idx, lsm, !lsm ? -1 : lsm->lsm_md_magic);
+	       tgt->ltd_idx);
 
 	op_data->op_bias &= ~MDS_CROSS_REF;
 
@@ -415,39 +454,14 @@ static int lmv_intent_lookup(struct obd_export *exp,
 				return rc;
 		}
 		return rc;
-	} else if (it_disposition(it, DISP_LOOKUP_NEG) && lsm &&
-		   lmv_need_try_all_stripes(lsm)) {
-		/*
-		 * For migrating and unknown hash type directory, it will
-		 * try to target the entry on other stripes
-		 */
-		int stripe_index;
-
-		for (stripe_index = 1;
-		     stripe_index < lsm->lsm_md_stripe_count &&
-		     it_disposition(it, DISP_LOOKUP_NEG); stripe_index++) {
-			struct lmv_oinfo *oinfo;
-
-			/* release the previous request */
-			ptlrpc_req_finished(*reqp);
-			it->it_request = NULL;
-			*reqp = NULL;
-
-			oinfo = &lsm->lsm_md_oinfo[stripe_index];
-			tgt = lmv_find_target(lmv, &oinfo->lmo_fid);
-			if (IS_ERR(tgt))
-				return PTR_ERR(tgt);
-
-			CDEBUG(D_INODE, "Try other stripes " DFID "\n",
-			       PFID(&oinfo->lmo_fid));
+	} else if (it_disposition(it, DISP_LOOKUP_NEG) &&
+		   lmv_dir_retry_check_update(op_data)) {
+		ptlrpc_req_finished(*reqp);
+		it->it_request = NULL;
+		it->it_disposition = 0;
+		*reqp = NULL;
 
-			op_data->op_fid1 = oinfo->lmo_fid;
-			it->it_disposition &= ~DISP_ENQ_COMPLETE;
-			rc = md_intent_lock(tgt->ltd_exp, op_data, it, reqp,
-					    cb_blocking, extra_lock_flags);
-			if (rc)
-				return rc;
-		}
+		goto retry;
 	}
 
 	if (!it_has_reply_body(it))
diff --git a/fs/lustre/lmv/lmv_internal.h b/fs/lustre/lmv/lmv_internal.h
index c4a2fb8..e434919 100644
--- a/fs/lustre/lmv/lmv_internal.h
+++ b/fs/lustre/lmv/lmv_internal.h
@@ -58,6 +58,9 @@ int lmv_revalidate_slaves(struct obd_export *exp,
 			  ldlm_blocking_callback cb_blocking,
 			  int extra_lock_flags);
 
+int lmv_getattr_name(struct obd_export *exp, struct md_op_data *op_data,
+		     struct ptlrpc_request **preq);
+
 static inline struct obd_device *lmv2obd_dev(struct lmv_obd *lmv)
 {
 	return container_of_safe(lmv, struct obd_device, u.lmv);
@@ -126,15 +129,20 @@ static inline int lmv_stripe_md_size(int stripe_count)
 /* for file under migrating directory, return the target stripe info */
 static inline const struct lmv_oinfo *
 lsm_name_to_stripe_info(const struct lmv_stripe_md *lsm, const char *name,
-			int namelen)
+			int namelen, bool post_migrate)
 {
 	u32 hash_type = lsm->lsm_md_hash_type;
 	u32 stripe_count = lsm->lsm_md_stripe_count;
 	int stripe_index;
 
 	if (hash_type & LMV_HASH_FLAG_MIGRATION) {
-		hash_type &= ~LMV_HASH_FLAG_MIGRATION;
-		stripe_count = lsm->lsm_md_migrate_offset;
+		if (post_migrate) {
+			hash_type &= ~LMV_HASH_FLAG_MIGRATION;
+			stripe_count = lsm->lsm_md_migrate_offset;
+		} else {
+			hash_type = lsm->lsm_md_migrate_hash;
+			stripe_count -= lsm->lsm_md_migrate_offset;
+		}
 	}
 
 	stripe_index = lmv_name_to_stripe_index(hash_type, stripe_count,
@@ -142,23 +150,64 @@ static inline int lmv_stripe_md_size(int stripe_count)
 	if (stripe_index < 0)
 		return ERR_PTR(stripe_index);
 
-	LASSERTF(stripe_index < lsm->lsm_md_stripe_count,
-		 "stripe_index = %d, stripe_count = %d hash_type = %x name = %.*s\n",
-		 stripe_index, lsm->lsm_md_stripe_count,
-		 lsm->lsm_md_hash_type, namelen, name);
+	if ((lsm->lsm_md_hash_type & LMV_HASH_FLAG_MIGRATION) && !post_migrate)
+		stripe_index += lsm->lsm_md_migrate_offset;
+
+	if (stripe_index >= lsm->lsm_md_stripe_count) {
+		CERROR("stripe_index %d stripe_count %d hash_type %#x migrate_offset %d migrate_hash %#x name %.*s\n",
+		       stripe_index, lsm->lsm_md_stripe_count,
+		       lsm->lsm_md_hash_type, lsm->lsm_md_migrate_offset,
+		       lsm->lsm_md_migrate_hash, namelen, name);
+		return ERR_PTR(-EBADF);
+	}
 
 	return &lsm->lsm_md_oinfo[stripe_index];
 }
 
-static inline bool lmv_need_try_all_stripes(const struct lmv_stripe_md *lsm)
+static inline bool lmv_is_dir_migrating(const struct lmv_stripe_md *lsm)
+{
+	return lsm ? lsm->lsm_md_hash_type & LMV_HASH_FLAG_MIGRATION : false;
+}
+
+static inline bool lmv_is_dir_bad_hash(const struct lmv_stripe_md *lsm)
+{
+	if (!lsm)
+		return false;
+
+	if (lmv_is_dir_migrating(lsm)) {
+		if (lsm->lsm_md_stripe_count - lsm->lsm_md_migrate_offset > 1)
+			return !lmv_is_known_hash_type(
+					lsm->lsm_md_migrate_hash);
+		return false;
+	}
+
+	return !lmv_is_known_hash_type(lsm->lsm_md_hash_type);
+}
+
+static inline bool lmv_dir_retry_check_update(struct md_op_data *op_data)
 {
-	return !lmv_is_known_hash_type(lsm->lsm_md_hash_type) ||
-	       lsm->lsm_md_hash_type & LMV_HASH_FLAG_MIGRATION;
+	const struct lmv_stripe_md *lsm = op_data->op_mea1;
+
+	if (!lsm)
+		return false;
+
+	if (lmv_is_dir_migrating(lsm) && !op_data->op_post_migrate) {
+		op_data->op_post_migrate = true;
+		return true;
+	}
+
+	if (lmv_is_dir_bad_hash(lsm) &&
+	    op_data->op_stripe_index < lsm->lsm_md_stripe_count - 1) {
+		op_data->op_stripe_index++;
+		return true;
+	}
+
+	return false;
 }
 
-struct lmv_tgt_desc
-*lmv_locate_mds(struct lmv_obd *lmv, struct md_op_data *op_data,
-		struct lu_fid *fid);
+struct lmv_tgt_desc *lmv_locate_tgt(struct lmv_obd *lmv,
+				    struct md_op_data *op_data,
+				    struct lu_fid *fid);
 /* lproc_lmv.c */
 int lmv_tunables_init(struct obd_device *obd);
 
diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index 3ddffd8..0da9269 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -1141,7 +1141,7 @@ static int lmv_placement_policy(struct obd_device *obd,
 	 * 1. See if the stripe offset is specified by lum.
 	 * 2. Then check if there is default stripe offset.
 	 * 3. Finally choose MDS by name hash if the parent
-	 *    is striped directory. (see lmv_locate_mds()).
+	 *    is striped directory. (see lmv_locate_tgt()).
 	 */
 	if (op_data->op_cli_flags & CLI_SET_MEA && lum &&
 	    le32_to_cpu(lum->lum_stripe_offset) != (u32)-1) {
@@ -1511,26 +1511,31 @@ static int lmv_close(struct obd_export *exp, struct md_op_data *op_data,
 	return md_close(tgt->ltd_exp, op_data, mod, request);
 }
 
-/**
- * Choosing the MDT by name or FID in @op_data.
- * For non-striped directory, it will locate MDT by fid.
- * For striped-directory, it will locate MDT by name. And also
- * it will reset op_fid1 with the FID of the chosen stripe.
- **/
-static struct lmv_tgt_desc *
-lmv_locate_target_for_name(struct lmv_obd *lmv, struct lmv_stripe_md *lsm,
-			   const char *name, int namelen, struct lu_fid *fid,
-			   u32 *mds)
+struct lmv_tgt_desc*
+__lmv_locate_tgt(struct lmv_obd *lmv, struct lmv_stripe_md *lsm,
+		 const char *name, int namelen, struct lu_fid *fid, u32 *mds,
+		 bool post_migrate)
 {
 	const struct lmv_oinfo *oinfo;
 	struct lmv_tgt_desc *tgt;
 
+	if (!lsm || namelen == 0) {
+		tgt = lmv_find_target(lmv, fid);
+		if (IS_ERR(tgt))
+			return tgt;
+
+		LASSERT(mds);
+		*mds = tgt->ltd_idx;
+		return tgt;
+	}
+
 	if (OBD_FAIL_CHECK(OBD_FAIL_LFSCK_BAD_NAME_HASH)) {
 		if (cfs_fail_val >= lsm->lsm_md_stripe_count)
 			return ERR_PTR(-EBADF);
 		oinfo = &lsm->lsm_md_oinfo[cfs_fail_val];
 	} else {
-		oinfo = lsm_name_to_stripe_info(lsm, name, namelen);
+		oinfo = lsm_name_to_stripe_info(lsm, name, namelen,
+						post_migrate);
 		if (IS_ERR(oinfo))
 			return ERR_CAST(oinfo);
 	}
@@ -1544,16 +1549,17 @@ static int lmv_close(struct obd_export *exp, struct md_op_data *op_data,
 
 	CDEBUG(D_INFO, "locate on mds %u " DFID "\n", oinfo->lmo_mds,
 	       PFID(&oinfo->lmo_fid));
+
 	return tgt;
 }
 
 /**
- * Locate mds by fid or name
+ * Locate the MDT by FID or name
  *
- * For striped directory (lsm != NULL), it will locate the stripe
- * by name hash (see lsm_name_to_stripe_info()). Note: if the hash_type
- * is unknown, it will return -EBADFD, and lmv_intent_lookup might need
- * walk through all of stripes to locate the entry.
+ * For a striped directory, it will locate the stripe by name hash; if the
+ * hash type is unknown, it will return the stripe specified by
+ * 'op_data->op_stripe_index', which is set by the caller.  If the dir is
+ * migrating, 'op_data->op_post_migrate' selects the old or new layout.
  *
 * For normal directory, it will locate MDS by FID directly.
  *
@@ -1566,10 +1572,11 @@ static int lmv_close(struct obd_export *exp, struct md_op_data *op_data,
  *		ERR_PTR(errno) if failed.
  */
 struct lmv_tgt_desc*
-lmv_locate_mds(struct lmv_obd *lmv, struct md_op_data *op_data,
+lmv_locate_tgt(struct lmv_obd *lmv, struct md_op_data *op_data,
 	       struct lu_fid *fid)
 {
 	struct lmv_stripe_md *lsm = op_data->op_mea1;
+	struct lmv_oinfo *oinfo;
 	struct lmv_tgt_desc *tgt;
 
 	/*
@@ -1579,17 +1586,15 @@ struct lmv_tgt_desc*
 	 */
 	if (op_data->op_bias & MDS_CREATE_VOLATILE &&
 	    (int)op_data->op_mds != -1) {
-		int i;
-
 		tgt = lmv_get_target(lmv, op_data->op_mds, NULL);
 		if (IS_ERR(tgt))
 			return tgt;
 
 		if (lsm) {
+			int i;
+
 			/* refill the right parent fid */
 			for (i = 0; i < lsm->lsm_md_stripe_count; i++) {
-				struct lmv_oinfo *oinfo;
-
 				oinfo = &lsm->lsm_md_oinfo[i];
 				if (oinfo->lmo_mds == op_data->op_mds) {
 					*fid = oinfo->lmo_fid;
@@ -1600,23 +1605,22 @@ struct lmv_tgt_desc*
 			if (i == lsm->lsm_md_stripe_count)
 				*fid = lsm->lsm_md_oinfo[0].lmo_fid;
 		}
+	} else if (lmv_is_dir_bad_hash(lsm)) {
+		LASSERT(op_data->op_stripe_index < lsm->lsm_md_stripe_count);
+		oinfo = &lsm->lsm_md_oinfo[op_data->op_stripe_index];
 
-		return tgt;
-	}
-
-	if (!lsm || !op_data->op_namelen) {
-		tgt = lmv_find_target(lmv, fid);
-		if (IS_ERR(tgt))
-			return tgt;
-
-		op_data->op_mds = tgt->ltd_idx;
+		*fid = oinfo->lmo_fid;
+		op_data->op_mds = oinfo->lmo_mds;
 
-		return tgt;
+		tgt = lmv_get_target(lmv, oinfo->lmo_mds, NULL);
+	} else {
+		tgt = __lmv_locate_tgt(lmv, lsm, op_data->op_name,
+				       op_data->op_namelen, fid,
+				       &op_data->op_mds,
+				       op_data->op_post_migrate);
 	}
 
-	return lmv_locate_target_for_name(lmv, lsm, op_data->op_name,
-					  op_data->op_namelen, fid,
-					  &op_data->op_mds);
+	return tgt;
 }
 
 static int lmv_create(struct obd_export *exp, struct md_op_data *op_data,
@@ -1632,7 +1636,33 @@ static int lmv_create(struct obd_export *exp, struct md_op_data *op_data,
 	if (!lmv->desc.ld_active_tgt_count)
 		return -EIO;
 
-	tgt = lmv_locate_mds(lmv, op_data, &op_data->op_fid1);
+	if (lmv_is_dir_bad_hash(op_data->op_mea1))
+		return -EBADF;
+
+	if (lmv_is_dir_migrating(op_data->op_mea1)) {
+		/*
+	 * If the parent is migrating, create() needs to look up the
+	 * existing name to avoid creating a new file under the old
+	 * layout of the migrating directory, so check the old layout here.
+		 */
+		tgt = lmv_locate_tgt(lmv, op_data, &op_data->op_fid1);
+		if (IS_ERR(tgt))
+			return PTR_ERR(tgt);
+
+		rc = md_getattr_name(tgt->ltd_exp, op_data, request);
+		if (!rc) {
+			ptlrpc_req_finished(*request);
+			*request = NULL;
+			return -EEXIST;
+		}
+
+		if (rc != -ENOENT)
+			return rc;
+
+		op_data->op_post_migrate = true;
+	}
+
+	tgt = lmv_locate_tgt(lmv, op_data, &op_data->op_fid1);
 	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
@@ -1685,7 +1715,7 @@ static int lmv_create(struct obd_export *exp, struct md_op_data *op_data,
 
 	CDEBUG(D_INODE, "ENQUEUE on " DFID "\n", PFID(&op_data->op_fid1));
 
-	tgt = lmv_locate_mds(lmv, op_data, &op_data->op_fid1);
+	tgt = lmv_find_target(lmv, &op_data->op_fid1);
 	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
@@ -1696,18 +1726,18 @@ static int lmv_create(struct obd_export *exp, struct md_op_data *op_data,
 			  extra_lock_flags);
 }
 
-static int
+int
 lmv_getattr_name(struct obd_export *exp, struct md_op_data *op_data,
 		 struct ptlrpc_request **preq)
 {
-	struct ptlrpc_request *req = NULL;
 	struct obd_device *obd = exp->exp_obd;
 	struct lmv_obd *lmv = &obd->u.lmv;
 	struct lmv_tgt_desc *tgt;
 	struct mdt_body	*body;
 	int rc;
 
-	tgt = lmv_locate_mds(lmv, op_data, &op_data->op_fid1);
+retry:
+	tgt = lmv_locate_tgt(lmv, op_data, &op_data->op_fid1);
 	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
@@ -1716,30 +1746,26 @@ static int lmv_create(struct obd_export *exp, struct md_op_data *op_data,
 	       PFID(&op_data->op_fid1), tgt->ltd_idx);
 
 	rc = md_getattr_name(tgt->ltd_exp, op_data, preq);
-	if (rc != 0)
+	if (rc == -ENOENT && lmv_dir_retry_check_update(op_data)) {
+		ptlrpc_req_finished(*preq);
+		*preq = NULL;
+		goto retry;
+	}
+
+	if (rc)
 		return rc;
 
 	body = req_capsule_server_get(&(*preq)->rq_pill, &RMF_MDT_BODY);
 	if (body->mbo_valid & OBD_MD_MDS) {
-		struct lu_fid rid = body->mbo_fid1;
-
-		CDEBUG(D_INODE, "Request attrs for " DFID "\n",
-		       PFID(&rid));
-
-		tgt = lmv_find_target(lmv, &rid);
-		if (IS_ERR(tgt)) {
-			ptlrpc_req_finished(*preq);
-			*preq = NULL;
-			return PTR_ERR(tgt);
-		}
-
-		op_data->op_fid1 = rid;
+		op_data->op_fid1 = body->mbo_fid1;
 		op_data->op_valid |= OBD_MD_FLCROSSREF;
 		op_data->op_namelen = 0;
 		op_data->op_name = NULL;
-		rc = md_getattr_name(tgt->ltd_exp, op_data, &req);
+
 		ptlrpc_req_finished(*preq);
-		*preq = req;
+		*preq = NULL;
+
+		goto retry;
 	}
 
 	return rc;
@@ -1808,19 +1834,40 @@ static int lmv_link(struct obd_export *exp, struct md_op_data *op_data,
 	op_data->op_fsuid = from_kuid(&init_user_ns, current_fsuid());
 	op_data->op_fsgid = from_kgid(&init_user_ns, current_fsgid());
 	op_data->op_cap = current_cap();
-	if (op_data->op_mea2) {
-		struct lmv_stripe_md *lsm = op_data->op_mea2;
-		const struct lmv_oinfo *oinfo;
 
-		oinfo = lsm_name_to_stripe_info(lsm, op_data->op_name,
-						op_data->op_namelen);
-		if (IS_ERR(oinfo))
-			return PTR_ERR(oinfo);
+	if (lmv_is_dir_migrating(op_data->op_mea2)) {
+		struct lu_fid fid1 = op_data->op_fid1;
+		struct lmv_stripe_md *lsm1 = op_data->op_mea1;
 
-		op_data->op_fid2 = oinfo->lmo_fid;
+		/*
+	 * Avoid creating a new file under the old layout of the
+	 * migrating directory; check it here.
+		 */
+		tgt = __lmv_locate_tgt(lmv, op_data->op_mea2, op_data->op_name,
+				       op_data->op_namelen, &op_data->op_fid2,
+				       &op_data->op_mds, false);
+		tgt = lmv_locate_tgt(lmv, op_data, &op_data->op_fid1);
+		if (IS_ERR(tgt))
+			return PTR_ERR(tgt);
+
+		op_data->op_fid1 = op_data->op_fid2;
+		op_data->op_mea1 = op_data->op_mea2;
+		rc = md_getattr_name(tgt->ltd_exp, op_data, request);
+		op_data->op_fid1 = fid1;
+		op_data->op_mea1 = lsm1;
+		if (!rc) {
+			ptlrpc_req_finished(*request);
+			*request = NULL;
+			return -EEXIST;
+		}
+
+		if (rc != -ENOENT)
+			return rc;
 	}
 
-	tgt = lmv_locate_mds(lmv, op_data, &op_data->op_fid2);
+	tgt = __lmv_locate_tgt(lmv, op_data->op_mea2, op_data->op_name,
+			       op_data->op_namelen, &op_data->op_fid2,
+			       &op_data->op_mds, true);
 	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
@@ -2004,9 +2051,9 @@ static int lmv_rename(struct obd_export *exp, struct md_op_data *op_data,
 {
 	struct obd_device *obd = exp->exp_obd;
 	struct lmv_obd *lmv = &obd->u.lmv;
-	struct lmv_stripe_md *lsm = op_data->op_mea1;
 	struct lmv_tgt_desc *sp_tgt;
 	struct lmv_tgt_desc *tp_tgt = NULL;
+	struct lmv_tgt_desc *src_tgt = NULL;
 	struct lmv_tgt_desc *tgt;
 	struct mdt_body *body;
 	int rc;
@@ -2022,26 +2069,44 @@ static int lmv_rename(struct obd_export *exp, struct md_op_data *op_data,
 	op_data->op_fsgid = from_kgid(&init_user_ns, current_fsgid());
 	op_data->op_cap = current_cap();
 
-	CDEBUG(D_INODE, "RENAME "DFID"/%.*s to "DFID"/%.*s\n",
-		PFID(&op_data->op_fid1), (int)oldlen, old,
-		PFID(&op_data->op_fid2), (int)newlen, new);
+	if (lmv_is_dir_migrating(op_data->op_mea2)) {
+		struct lu_fid fid1 = op_data->op_fid1;
+		struct lmv_stripe_md *lsm1 = op_data->op_mea1;
 
-	if (lsm)
-		sp_tgt = lmv_locate_target_for_name(lmv, lsm, old, oldlen,
-						    &op_data->op_fid1,
-						    &op_data->op_mds);
-	else
-		sp_tgt = lmv_find_target(lmv, &op_data->op_fid1);
-	if (IS_ERR(sp_tgt))
-		return PTR_ERR(sp_tgt);
+		/*
+		 * We avoid creating a new file under the old layout of a
+		 * migrating directory.  If a file with the new name already
+		 * exists under the old layout, we can't unlink it and rename
+		 * to the new layout in one transaction, so return -EBUSY here.
+		 */
+		tgt = __lmv_locate_tgt(lmv, op_data->op_mea2, new, newlen,
+				       &op_data->op_fid2, &op_data->op_mds,
+				       false);
+		if (IS_ERR(tgt))
+			return PTR_ERR(tgt);
 
-	lsm = op_data->op_mea2;
-	if (lsm)
-		tp_tgt = lmv_locate_target_for_name(lmv, lsm, new, newlen,
-						    &op_data->op_fid2,
-						    &op_data->op_mds);
-	else
-		tp_tgt = lmv_find_target(lmv, &op_data->op_fid2);
+		op_data->op_fid1 = op_data->op_fid2;
+		op_data->op_mea1 = op_data->op_mea2;
+		op_data->op_name = new;
+		op_data->op_namelen = newlen;
+		rc = md_getattr_name(tgt->ltd_exp, op_data, request);
+		op_data->op_fid1 = fid1;
+		op_data->op_mea1 = lsm1;
+		op_data->op_name = NULL;
+		op_data->op_namelen = 0;
+		if (!rc) {
+			ptlrpc_req_finished(*request);
+			*request = NULL;
+			return -EBUSY;
+		}
+
+		if (rc != -ENOENT)
+			return rc;
+	}
+
+	/* rename to new layout for migrating directory */
+	tp_tgt = __lmv_locate_tgt(lmv, op_data->op_mea2, new, newlen,
+				  &op_data->op_fid2, &op_data->op_mds, true);
 	if (IS_ERR(tp_tgt))
 		return PTR_ERR(tp_tgt);
 
@@ -2062,34 +2127,28 @@ static int lmv_rename(struct obd_export *exp, struct md_op_data *op_data,
 
 	op_data->op_flags |= MF_MDC_CANCEL_FID4;
 
-	/* cancel UPDATE locks of source parent */
-	rc = lmv_early_cancel(exp, sp_tgt, op_data, tgt->ltd_idx, LCK_EX,
-			      MDS_INODELOCK_UPDATE, MF_MDC_CANCEL_FID1);
-	if (rc != 0)
-		return rc;
-
 	/* cancel UPDATE locks of target parent */
 	rc = lmv_early_cancel(exp, tp_tgt, op_data, tgt->ltd_idx, LCK_EX,
 			      MDS_INODELOCK_UPDATE, MF_MDC_CANCEL_FID2);
 	if (rc != 0)
 		return rc;
 
-	if (fid_is_sane(&op_data->op_fid3)) {
-		struct lmv_tgt_desc *src_tgt;
-
-		src_tgt = lmv_find_target(lmv, &op_data->op_fid3);
-		if (IS_ERR(src_tgt))
-			return PTR_ERR(src_tgt);
-
-		/* cancel LOOKUP lock of source on source parent */
-		if (src_tgt != sp_tgt) {
-			rc = lmv_early_cancel(exp, sp_tgt, op_data,
+	if (fid_is_sane(&op_data->op_fid4)) {
+		/* cancel LOOKUP lock of target on target parent */
+		if (tgt != tp_tgt) {
+			rc = lmv_early_cancel(exp, tp_tgt, op_data,
 					      tgt->ltd_idx, LCK_EX,
 					      MDS_INODELOCK_LOOKUP,
-					      MF_MDC_CANCEL_FID3);
+					      MF_MDC_CANCEL_FID4);
 			if (rc != 0)
 				return rc;
 		}
+	}
+
+	if (fid_is_sane(&op_data->op_fid3)) {
+		src_tgt = lmv_find_target(lmv, &op_data->op_fid3);
+		if (IS_ERR(src_tgt))
+			return PTR_ERR(src_tgt);
 
 		/* cancel ELC locks of source */
 		rc = lmv_early_cancel(exp, src_tgt, op_data, tgt->ltd_idx,
@@ -2099,21 +2158,44 @@ static int lmv_rename(struct obd_export *exp, struct md_op_data *op_data,
 			return rc;
 	}
 
-retry_rename:
-	if (fid_is_sane(&op_data->op_fid4)) {
-		/* cancel LOOKUP lock of target on target parent */
-		if (tgt != tp_tgt) {
-			rc = lmv_early_cancel(exp, tp_tgt, op_data,
+retry:
+	sp_tgt = __lmv_locate_tgt(lmv, op_data->op_mea1, old, oldlen,
+				  &op_data->op_fid1, &op_data->op_mds,
+				  op_data->op_post_migrate);
+	if (IS_ERR(sp_tgt))
+		return PTR_ERR(sp_tgt);
+
+	/* cancel UPDATE locks of source parent */
+	rc = lmv_early_cancel(exp, sp_tgt, op_data, tgt->ltd_idx, LCK_EX,
+			      MDS_INODELOCK_UPDATE, MF_MDC_CANCEL_FID1);
+	if (rc != 0)
+		return rc;
+
+	if (fid_is_sane(&op_data->op_fid3)) {
+		/* cancel LOOKUP lock of source on source parent */
+		if (src_tgt != sp_tgt) {
+			rc = lmv_early_cancel(exp, sp_tgt, op_data,
 					      tgt->ltd_idx, LCK_EX,
 					      MDS_INODELOCK_LOOKUP,
-					      MF_MDC_CANCEL_FID4);
+					      MF_MDC_CANCEL_FID3);
 			if (rc != 0)
 				return rc;
 		}
 	}
 
+rename:
+	CDEBUG(D_INODE, "RENAME " DFID "/%.*s to " DFID "/%.*s\n",
+	       PFID(&op_data->op_fid1), (int)oldlen, old,
+	       PFID(&op_data->op_fid2), (int)newlen, new);
+
 	rc = md_rename(tgt->ltd_exp, op_data, old, oldlen, new, newlen,
 		       request);
+	if (rc == -ENOENT && lmv_dir_retry_check_update(op_data)) {
+		ptlrpc_req_finished(*request);
+		*request = NULL;
+		goto retry;
+	}
+
 	if (rc && rc != -EXDEV)
 		return rc;
 
@@ -2125,10 +2207,8 @@ static int lmv_rename(struct obd_export *exp, struct md_op_data *op_data,
 	if (likely(!(body->mbo_valid & OBD_MD_MDS)))
 		return rc;
 
-	CDEBUG(D_INODE, "%s: try rename to another MDT for " DFID "\n",
-	       exp->exp_obd->obd_name, PFID(&body->mbo_fid1));
-
 	op_data->op_fid4 = body->mbo_fid1;
+
 	ptlrpc_req_finished(*request);
 	*request = NULL;
 
@@ -2136,7 +2216,19 @@ static int lmv_rename(struct obd_export *exp, struct md_op_data *op_data,
 	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
-	goto retry_rename;
+	if (fid_is_sane(&op_data->op_fid4)) {
+		/* cancel LOOKUP lock of target on target parent */
+		if (tgt != tp_tgt) {
+			rc = lmv_early_cancel(exp, tp_tgt, op_data,
+					      tgt->ltd_idx, LCK_EX,
+					      MDS_INODELOCK_LOOKUP,
+					      MF_MDC_CANCEL_FID4);
+			if (rc != 0)
+				return rc;
+		}
+	}
+
+	goto rename;
 }
 
 static int lmv_setattr(struct obd_export *exp, struct md_op_data *op_data,
@@ -2575,68 +2667,30 @@ static int lmv_read_page(struct obd_export *exp, struct md_op_data *op_data,
 static int lmv_unlink(struct obd_export *exp, struct md_op_data *op_data,
 		      struct ptlrpc_request **request)
 {
-	struct lmv_stripe_md *lsm = op_data->op_mea1;
 	struct obd_device *obd = exp->exp_obd;
 	struct lmv_obd *lmv = &obd->u.lmv;
-	struct lmv_tgt_desc *parent_tgt = NULL;
-	struct lmv_tgt_desc *tgt = NULL;
-	struct mdt_body	*body;
-	int stripe_index = 0;
+	struct lmv_tgt_desc *tgt;
+	struct lmv_tgt_desc *parent_tgt;
+	struct mdt_body *body;
 	int rc;
 
-retry_unlink:
-	/* For striped dir, we need to locate the parent as well */
-	if (lsm) {
-		struct lmv_tgt_desc *tmp;
-
-		LASSERT(op_data->op_name && op_data->op_namelen);
-
-		tmp = lmv_locate_target_for_name(lmv, lsm,
-						 op_data->op_name,
-						 op_data->op_namelen,
-						 &op_data->op_fid1,
-						 &op_data->op_mds);
-
-		/*
-		 * return -EBADFD means unknown hash type, might
-		 * need try all sub-stripe here
-		 */
-		if (IS_ERR(tmp) && PTR_ERR(tmp) != -EBADFD)
-			return PTR_ERR(tmp);
-
-		/*
-		 * Note: both migrating dir and unknown hash dir need to
-		 * try all of sub-stripes, so we need start search the
-		 * name from stripe 0, but migrating dir is already handled
-		 * inside lmv_locate_target_for_name(), so we only check
-		 * unknown hash type directory here
-		 */
-		if (!lmv_is_known_hash_type(lsm->lsm_md_hash_type)) {
-			struct lmv_oinfo *oinfo;
-
-			oinfo = &lsm->lsm_md_oinfo[stripe_index];
-
-			op_data->op_fid1 = oinfo->lmo_fid;
-			op_data->op_mds = oinfo->lmo_mds;
-		}
-	}
-
-try_next_stripe:
-	/* Send unlink requests to the MDT where the child is located */
-	if (likely(!fid_is_zero(&op_data->op_fid2)))
-		tgt = lmv_find_target(lmv, &op_data->op_fid2);
-	else if (lsm)
-		tgt = lmv_get_target(lmv, op_data->op_mds, NULL);
-	else
-		tgt = lmv_locate_mds(lmv, op_data, &op_data->op_fid1);
-
-	if (IS_ERR(tgt))
-		return PTR_ERR(tgt);
-
 	op_data->op_fsuid = from_kuid(&init_user_ns, current_fsuid());
 	op_data->op_fsgid = from_kgid(&init_user_ns, current_fsgid());
 	op_data->op_cap = current_cap();
 
+retry:
+	parent_tgt = lmv_locate_tgt(lmv, op_data, &op_data->op_fid1);
+	if (IS_ERR(parent_tgt))
+		return PTR_ERR(parent_tgt);
+
+	if (likely(!fid_is_zero(&op_data->op_fid2))) {
+		tgt = lmv_find_target(lmv, &op_data->op_fid2);
+		if (IS_ERR(tgt))
+			return PTR_ERR(tgt);
+	} else {
+		tgt = parent_tgt;
+	}
+
 	/*
 	 * If child's fid is given, cancel unused locks for it if it is from
 	 * another export than parent.
@@ -2646,50 +2700,29 @@ static int lmv_unlink(struct obd_export *exp, struct md_op_data *op_data,
 	 */
 	op_data->op_flags |= MF_MDC_CANCEL_FID1 | MF_MDC_CANCEL_FID3;
 
-	/*
-	 * Cancel FULL locks on child (fid3).
-	 */
-	parent_tgt = lmv_find_target(lmv, &op_data->op_fid1);
-	if (IS_ERR(parent_tgt))
-		return PTR_ERR(parent_tgt);
-
-	if (parent_tgt != tgt) {
+	if (parent_tgt != tgt)
 		rc = lmv_early_cancel(exp, parent_tgt, op_data, tgt->ltd_idx,
 				      LCK_EX, MDS_INODELOCK_LOOKUP,
 				      MF_MDC_CANCEL_FID3);
-	}
 
 	rc = lmv_early_cancel(exp, NULL, op_data, tgt->ltd_idx, LCK_EX,
 			      MDS_INODELOCK_ELC, MF_MDC_CANCEL_FID3);
-	if (rc != 0)
+	if (rc)
 		return rc;
 
 	CDEBUG(D_INODE, "unlink with fid=" DFID "/" DFID " -> mds #%u\n",
 	       PFID(&op_data->op_fid1), PFID(&op_data->op_fid2), tgt->ltd_idx);
 
 	rc = md_unlink(tgt->ltd_exp, op_data, request);
-	if (rc != 0 && rc != -EREMOTE  && rc != -ENOENT)
-		return rc;
-
-	/* Try next stripe if it is needed. */
-	if (rc == -ENOENT && lsm && lmv_need_try_all_stripes(lsm)) {
-		struct lmv_oinfo *oinfo;
-
-		stripe_index++;
-		if (stripe_index >= lsm->lsm_md_stripe_count)
-			return rc;
-
-		oinfo = &lsm->lsm_md_oinfo[stripe_index];
-
-		op_data->op_fid1 = oinfo->lmo_fid;
-		op_data->op_mds = oinfo->lmo_mds;
-
+	if (rc == -ENOENT && lmv_dir_retry_check_update(op_data)) {
 		ptlrpc_req_finished(*request);
 		*request = NULL;
-
-		goto try_next_stripe;
+		goto retry;
 	}
 
+	if (rc != -EREMOTE)
+		return rc;
+
 	body = req_capsule_server_get(&(*request)->rq_pill, &RMF_MDT_BODY);
 	if (!body)
 		return -EPROTO;
@@ -2698,34 +2731,16 @@ static int lmv_unlink(struct obd_export *exp, struct md_op_data *op_data,
 	if (likely(!(body->mbo_valid & OBD_MD_MDS)))
 		return rc;
 
-	CDEBUG(D_INODE, "%s: try unlink to another MDT for " DFID "\n",
-	       exp->exp_obd->obd_name, PFID(&body->mbo_fid1));
-
-	/* This is a remote object, try remote MDT, Note: it may
-	 * try more than 1 time here, Considering following case
-	 * /mnt/lustre is root on MDT0, remote1 is on MDT1
-	 * 1. Initially A does not know where remote1 is, it send
-	 *    unlink RPC to MDT0, MDT0 return -EREMOTE, it will
-	 *    resend unlink RPC to MDT1 (retry 1st time).
-	 *
-	 * 2. During the unlink RPC in flight,
-	 *    client B mv /mnt/lustre/remote1 /mnt/lustre/remote2
-	 *    and create new remote1, but on MDT0
-	 *
-	 * 3. MDT1 get unlink RPC(from A), then do remote lock on
-	 *    /mnt/lustre, then lookup get fid of remote1, and find
-	 *    it is remote dir again, and replay -EREMOTE again.
-	 *
-	 * 4. Then A will resend unlink RPC to MDT0. (retry 2nd times).
-	 *
-	 * In theory, it might try unlimited time here, but it should
-	 * be very rare case.
-	 */
+	/* This is a remote object, try remote MDT. */
 	op_data->op_fid2 = body->mbo_fid1;
 	ptlrpc_req_finished(*request);
 	*request = NULL;
 
-	goto retry_unlink;
+	tgt = lmv_find_target(lmv, &op_data->op_fid2);
+	if (IS_ERR(tgt))
+		return PTR_ERR(tgt);
+
+	goto retry;
 }
 
 static int lmv_precleanup(struct obd_device *obd)
@@ -3134,7 +3149,7 @@ static int lmv_intent_getattr_async(struct obd_export *exp,
 	if (!fid_is_sane(&op_data->op_fid2))
 		return -EINVAL;
 
-	tgt = lmv_locate_mds(lmv, op_data, &op_data->op_fid1);
+	tgt = lmv_find_target(lmv, &op_data->op_fid1);
 	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
@@ -3172,7 +3187,7 @@ static int lmv_revalidate_lock(struct obd_export *exp, struct lookup_intent *it,
 	const struct lmv_oinfo *oinfo;
 
 	LASSERT(lsm);
-	oinfo = lsm_name_to_stripe_info(lsm, name, namelen);
+	oinfo = lsm_name_to_stripe_info(lsm, name, namelen, false);
 	if (IS_ERR(oinfo))
 		return PTR_ERR(oinfo);
 
-- 
1.8.3.1
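
[Editor's note] The layout selection implemented by lsm_name_to_stripe_info() in
the patch above can be sketched in userspace as follows. The field names follow
the patch, but the struct and hash function are simplified stand-ins, not
Lustre code:

```c
#include <assert.h>

/* Userspace sketch (not Lustre source) of the layout selection in
 * lsm_name_to_stripe_info(): while a directory is migrating, stripes
 * [0, migrate_offset) hold the new layout and stripes
 * [migrate_offset, stripe_count) hold the old one, and the name is
 * hashed against whichever half 'post_migrate' selects. */
struct lmv_md {
	unsigned int stripe_count;	/* total stripes, new + old */
	unsigned int migrate_offset;	/* first stripe of the old layout */
};

/* toy stand-in for the real LMV hash functions */
unsigned int name_hash(const char *name, unsigned int count)
{
	unsigned int h = 0;

	while (*name)
		h = h * 31 + (unsigned char)*name++;
	return h % count;
}

/* returns the stripe index for 'name', or -1 (the -EBADF case) */
int name_to_stripe(const struct lmv_md *md, const char *name,
		   int post_migrate)
{
	unsigned int index;

	if (post_migrate)
		/* new layout: hash over the first migrate_offset stripes */
		index = name_hash(name, md->migrate_offset);
	else
		/* old layout: hash over the rest, then shift up */
		index = name_hash(name, md->stripe_count -
				  md->migrate_offset) + md->migrate_offset;

	return index < md->stripe_count ? (int)index : -1;
}
```

With stripe_count = 4 and migrate_offset = 2, post-migrate lookups land in
stripes 0-1 and pre-migrate lookups in stripes 2-3, which matches the retry in
lmv_dir_retry_check_update() that flips op_post_migrate to try the other half.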

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 119/622] lustre: mdc: move RPC semaphore code to lustre/osp
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (117 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 118/622] lustre: lmv: support accessing migrating directory James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 120/622] lnet: libcfs: fix wrong check in libcfs_debug_vmsg2() James Simmons
                   ` (503 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

The "MDC RPC semaphore" is no longer used by MDC code since patch
http://review.whamcloud.com/14374 "LU-5319 mdc: manage number of
modify RPCs in flight" landed.  Currently it is still used only by
the OSP in the OpenSFS branch. While there are plans to remove
this from the OSP as well, it makes sense to move all of this
code from MDC to OSP so that it will also be cleaned up when that
functionality lands.
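
The execute-once behaviour this code provides (described in the mdc_rpc_lock
comment being removed below) can be sketched like this; the struct and
function names are invented for illustration, not the Lustre API:

```c
#include <assert.h>

/* Illustrative sketch: the server keeps the last transaction id and
 * result per client (as the MDT does in its last_rcvd file), so a
 * resent request is answered from the saved slot instead of being
 * executed a second time. */
struct client_slot {
	unsigned long last_xid;		/* last executed request id */
	int last_result;		/* saved reply for reconstruction */
};

/* stand-in for the real MDT-modifying operation */
int execute(unsigned long xid)
{
	return (int)(xid * 2);
}

int server_handle(struct client_slot *slot, unsigned long xid)
{
	/* already executed: reconstruct the reply, don't redo the work */
	if (xid <= slot->last_xid)
		return slot->last_result;

	slot->last_xid = xid;
	slot->last_result = execute(xid);
	return slot->last_result;
}
```

Because there is only one slot per client, two modifying RPCs in flight at
once would overwrite each other's saved result, which is why the client
serialized them with this mutex until multiple modify-RPC slots were
introduced.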

WC-bug-id: https://jira.whamcloud.com/browse/LU-6864
Lustre-commit: 040ca57f2ebd ("LU-6864 mdc: move RPC semaphore code to lustre/osp")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32412
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_mdc.h  | 96 -----------------------------------------
 fs/lustre/include/obd.h         |  2 -
 fs/lustre/include/obd_support.h |  2 +-
 3 files changed, 1 insertion(+), 99 deletions(-)

diff --git a/fs/lustre/include/lustre_mdc.h b/fs/lustre/include/lustre_mdc.h
index 208989f..aecb6ee 100644
--- a/fs/lustre/include/lustre_mdc.h
+++ b/fs/lustre/include/lustre_mdc.h
@@ -60,102 +60,6 @@
 struct ptlrpc_request;
 struct obd_device;
 
-/**
- * Serializes in-flight MDT-modifying RPC requests to preserve idempotency.
- *
- * This mutex is used to implement execute-once semantics on the MDT.
- * The MDT stores the last transaction ID and result for every client in
- * its last_rcvd file. If the client doesn't get a reply, it can safely
- * resend the request and the MDT will reconstruct the reply being aware
- * that the request has already been executed. Without this lock,
- * execution status of concurrent in-flight requests would be
- * overwritten.
- *
- * This design limits the extent to which we can keep a full pipeline of
- * in-flight requests from a single client.  This limitation could be
- * overcome by allowing multiple slots per client in the last_rcvd file.
- */
-struct mdc_rpc_lock {
-	/** Lock protecting in-flight RPC concurrency. */
-	struct mutex		rpcl_mutex;
-	/** Intent associated with currently executing request. */
-	struct lookup_intent	*rpcl_it;
-	/** Used for MDS/RPC load testing purposes. */
-	int			rpcl_fakes;
-};
-
-#define MDC_FAKE_RPCL_IT ((void *)0x2c0012bfUL)
-
-static inline void mdc_init_rpc_lock(struct mdc_rpc_lock *lck)
-{
-	mutex_init(&lck->rpcl_mutex);
-	lck->rpcl_it = NULL;
-}
-
-static inline void mdc_get_rpc_lock(struct mdc_rpc_lock *lck,
-				    struct lookup_intent *it)
-{
-	if (it && (it->it_op == IT_GETATTR || it->it_op == IT_LOOKUP ||
-		   it->it_op == IT_LAYOUT || it->it_op == IT_READDIR))
-		return;
-
-	/* This would normally block until the existing request finishes.
-	 * If fail_loc is set it will block until the regular request is
-	 * done, then set rpcl_it to MDC_FAKE_RPCL_IT.  Once that is set
-	 * it will only be cleared when all fake requests are finished.
-	 * Only when all fake requests are finished can normal requests
-	 * be sent, to ensure they are recoverable again.
-	 */
-again:
-	mutex_lock(&lck->rpcl_mutex);
-
-	if (CFS_FAIL_CHECK_QUIET(OBD_FAIL_MDC_RPCS_SEM)) {
-		lck->rpcl_it = MDC_FAKE_RPCL_IT;
-		lck->rpcl_fakes++;
-		mutex_unlock(&lck->rpcl_mutex);
-		return;
-	}
-
-	/* This will only happen when the CFS_FAIL_CHECK() was
-	 * just turned off but there are still requests in progress.
-	 * Wait until they finish.  It doesn't need to be efficient
-	 * in this extremely rare case, just have low overhead in
-	 * the common case when it isn't true.
-	 */
-	while (unlikely(lck->rpcl_it == MDC_FAKE_RPCL_IT)) {
-		mutex_unlock(&lck->rpcl_mutex);
-		schedule_timeout_uninterruptible(HZ / 4);
-		goto again;
-	}
-
-	LASSERT(!lck->rpcl_it);
-	lck->rpcl_it = it;
-}
-
-static inline void mdc_put_rpc_lock(struct mdc_rpc_lock *lck,
-				    struct lookup_intent *it)
-{
-	if (it && (it->it_op == IT_GETATTR || it->it_op == IT_LOOKUP ||
-		   it->it_op == IT_LAYOUT || it->it_op == IT_READDIR))
-		return;
-
-	if (lck->rpcl_it == MDC_FAKE_RPCL_IT) { /* OBD_FAIL_MDC_RPCS_SEM */
-		mutex_lock(&lck->rpcl_mutex);
-
-		LASSERTF(lck->rpcl_fakes > 0, "%d\n", lck->rpcl_fakes);
-		lck->rpcl_fakes--;
-
-		if (lck->rpcl_fakes == 0)
-			lck->rpcl_it = NULL;
-
-	} else {
-		LASSERTF(it == lck->rpcl_it, "%p != %p\n", it, lck->rpcl_it);
-		lck->rpcl_it = NULL;
-	}
-
-	mutex_unlock(&lck->rpcl_mutex);
-}
-
 static inline void mdc_get_mod_rpc_slot(struct ptlrpc_request *req,
 					struct lookup_intent *it)
 {
diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index b404391..3910c10 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -304,8 +304,6 @@ struct client_obd {
 	atomic_t		cl_destroy_in_flight;
 	wait_queue_head_t	cl_destroy_waitq;
 
-	struct mdc_rpc_lock     *cl_rpc_lock;
-
 	/* modify rpcs in flight
 	 * currently used for metadata only
 	 */
diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index 04ef76f..c2db38f 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -385,7 +385,7 @@
 #define OBD_FAIL_MDC_ENQUEUE_PAUSE			0x801
 #define OBD_FAIL_MDC_OLD_EXT_FLAGS			0x802
 #define OBD_FAIL_MDC_GETATTR_ENQUEUE			0x803
-#define OBD_FAIL_MDC_RPCS_SEM				0x804
+#define OBD_FAIL_MDC_RPCS_SEM				0x804 /* deprecated */
 #define OBD_FAIL_MDC_LIGHTWEIGHT			0x805
 #define OBD_FAIL_MDC_CLOSE				0x806
 #define OBD_FAIL_MDC_MERGE				0x807
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 120/622] lnet: libcfs: fix wrong check in libcfs_debug_vmsg2()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (118 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 119/622] lustre: mdc: move RPC semaphore code to lustre/osp James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 121/622] lustre: ptlrpc: new request vs disconnect race James Simmons
                   ` (502 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

The logic here is that we should skip the output and increase
@cdls_count while the current time has not yet reached @cdls_next.
However, the check did it the opposite way:

1) libcfs_debug_vmsg2() is called again after a long time: the
current check succeeds, so we skip printing the message and
return; we will skip all later messages too.

2) libcfs_debug_vmsg2() is called frequently: the current check
fails every time, so the message is always printed; in the worst
case we never skip any messages.

Also fix the test case to cover this in the future; the test fix is
from Andreas:

The test_60a() llog test is being run on the MGS, while the
check in test_60b() to confirm that CDEBUG_LIMIT() works properly
is being run on the client.  There has been a breakage in
CDEBUG_LIMIT() that this test failed to catch, so now we need to
track it down.

Change test_60b to dump the dmesg logs on the MGS.
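
The wraparound-safe jiffies comparisons that this one-character change
relies on can be sketched in plain userspace C. This is a simplified
model of the macros in include/linux/jiffies.h (the kernel versions
also use typecheck()); `jiffies_t` and the sample values are
illustrative:

```c
/* Userspace model of the wraparound-safe jiffies comparisons.
 * "a after b" means the signed difference b - a is negative, which
 * stays correct even when the unsigned counter wraps around. */
typedef unsigned long jiffies_t;

#define time_after(a, b)	((long)((b) - (a)) < 0)
#define time_before(a, b)	time_after(b, a)

/* The fixed check skips a console message only while the current
 * time is strictly before the next allowed print time:
 *
 *	if (time_before(jiffies, cdls->cdls_next))
 *		cdls->cdls_count++;	// skipped
 */
```

Note that time_before() is strict: at jiffies == cdls_next the message
prints again, whereas the old !time_after() form would still skip it.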

Fixes: b49946b2e ("staging: lustre: libcfs: discard cfs_time_after()")
WC-bug-id: https://jira.whamcloud.com/browse/LU-11373
Lustre-commit: 4037c1462730 ("LU-11373 libcfs: fix wrong check in libcfs_debug_vmsg2()")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/33154
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/libcfs/tracefile.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/lnet/libcfs/tracefile.c b/net/lnet/libcfs/tracefile.c
index 6e4cc31..bda3523 100644
--- a/net/lnet/libcfs/tracefile.c
+++ b/net/lnet/libcfs/tracefile.c
@@ -544,7 +544,7 @@ int libcfs_debug_msg(struct libcfs_debug_msg_data *msgdata,
 	if (cdls) {
 		if (libcfs_console_ratelimit &&
 		    cdls->cdls_next &&		/* not first time ever */
-		    !time_after(jiffies, cdls->cdls_next)) {
+		    time_before(jiffies, cdls->cdls_next)) {
 			/* skipping a console message */
 			cdls->cdls_count++;
 			if (tcd)
-- 
1.8.3.1


* [lustre-devel] [PATCH 121/622] lustre: ptlrpc: new request vs disconnect race
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (119 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 120/622] lnet: libcfs: fix wrong check in libcfs_debug_vmsg2() James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 122/622] lustre: misc: name open file handles as such James Simmons
                   ` (501 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Alex Zhuravlev <bzzz@whamcloud.com>

A new request can race with the disconnect-by-idle process. The
disconnect code now detects this state and initiates a new connection.
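
The ordering that closes the race can be modeled in userspace. This is
a minimal sketch, not Lustre code (struct and function names here are
illustrative): the request path must register itself on the import
before looking at the connection state, and the teardown path aborts
when it sees any registered request:

```c
#include <stdatomic.h>

/* Minimal model of the race (illustrative names, not Lustre code).
 * state: 0 = connected (LUSTRE_IMP_FULL), 1 = idle (LUSTRE_IMP_IDLE). */
struct import {
	atomic_int inflight;
	atomic_int state;
};

static void request_start(struct import *imp)
{
	/* Fixed ordering: count the new request first (in the patch,
	 * __ptlrpc_request_alloc() takes the import reference)... */
	atomic_fetch_add(&imp->inflight, 1);
	/* ...then reconnect if the import went idle meanwhile
	 * (stands in for ptlrpc_connect_import()). */
	if (atomic_load(&imp->state) == 1)
		atomic_store(&imp->state, 0);
}

static void disconnect_idle(struct import *imp)
{
	/* Teardown aborts when any request registered itself; the
	 * real interpret callback checks imp_inflight > 1 because
	 * the DISCONNECT request itself is counted. */
	if (atomic_load(&imp->inflight) > 0)
		atomic_store(&imp->state, 0);	/* abort: reconnect */
	else
		atomic_store(&imp->state, 1);	/* really go idle */
}
```

With the old ordering (check the state before registering the
request), disconnect_idle() could observe zero requests in flight
after the state check but before the request was counted, leaving a
request queued against an idle import.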

WC-bug-id: https://jira.whamcloud.com/browse/LU-11128
Lustre-commit: 93d20d171c20 ("LU-11128 ptlrpc: new request vs disconnect race")
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32980
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/client.c | 15 ++++++++++-----
 fs/lustre/ptlrpc/import.c | 32 +++++++++++++++++++++++++++++---
 2 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/fs/lustre/ptlrpc/client.c b/fs/lustre/ptlrpc/client.c
index 691df1a..7be597c 100644
--- a/fs/lustre/ptlrpc/client.c
+++ b/fs/lustre/ptlrpc/client.c
@@ -887,6 +887,13 @@ struct ptlrpc_request *__ptlrpc_request_alloc(struct obd_import *imp,
 	struct ptlrpc_request *request;
 	int connect = 0;
 
+	request = __ptlrpc_request_alloc(imp, pool);
+	if (!request)
+		return NULL;
+
+	/* initiate connection if needed when the import has been
+	 * referenced by the new request to avoid races with disconnect
+	 */
 	if (unlikely(imp->imp_state == LUSTRE_IMP_IDLE)) {
 		int rc;
 
@@ -904,16 +911,14 @@ struct ptlrpc_request *__ptlrpc_request_alloc(struct obd_import *imp,
 		spin_unlock(&imp->imp_lock);
 		if (connect) {
 			rc = ptlrpc_connect_import(imp);
-			if (rc < 0)
+			if (rc < 0) {
+				ptlrpc_request_free(request);
 				return NULL;
+			}
 			ptlrpc_pinger_add_import(imp);
 		}
 	}
 
-	request = __ptlrpc_request_alloc(imp, pool);
-	if (!request)
-		return NULL;
-
 	req_capsule_init(&request->rq_pill, request, RCL_CLIENT);
 	req_capsule_set(&request->rq_pill, format);
 	return request;
diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index 73a345f..f59af80 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -1593,13 +1593,39 @@ static int ptlrpc_disconnect_idle_interpret(const struct lu_env *env,
 					    void *data, int rc)
 {
 	struct obd_import *imp = req->rq_import;
+	int connect = 0;
+
+	DEBUG_REQ(D_HA, req, "inflight=%d, refcount=%d: rc = %d\n",
+		  atomic_read(&imp->imp_inflight),
+		  atomic_read(&imp->imp_refcount), rc);
 
-	LASSERT(imp->imp_state == LUSTRE_IMP_CONNECTING);
 	spin_lock(&imp->imp_lock);
-	IMPORT_SET_STATE_NOLOCK(imp, LUSTRE_IMP_IDLE);
-	memset(&imp->imp_remote_handle, 0, sizeof(imp->imp_remote_handle));
+	/* DISCONNECT reply can be late and another connection can just
+	 * be initiated. so we have to abort disconnection.
+	 */
+	if (req->rq_import_generation == imp->imp_generation &&
+	    imp->imp_state != LUSTRE_IMP_CLOSED) {
+		LASSERTF(imp->imp_state == LUSTRE_IMP_CONNECTING,
+			 "%s\n", ptlrpc_import_state_name(imp->imp_state));
+		imp->imp_state = LUSTRE_IMP_IDLE;
+		memset(&imp->imp_remote_handle, 0,
+		       sizeof(imp->imp_remote_handle));
+		/* take our DISCONNECT into account */
+		if (atomic_read(&imp->imp_inflight) > 1) {
+			imp->imp_generation++;
+			imp->imp_initiated_at = imp->imp_generation;
+			IMPORT_SET_STATE_NOLOCK(imp, LUSTRE_IMP_NEW);
+			connect = 1;
+		}
+	}
 	spin_unlock(&imp->imp_lock);
 
+	if (connect) {
+		rc = ptlrpc_connect_import(imp);
+		if (rc >= 0)
+			ptlrpc_pinger_add_import(imp);
+	}
+
 	return 0;
 }
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 122/622] lustre: misc: name open file handles as such
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (120 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 121/622] lustre: ptlrpc: new request vs disconnect race James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 123/622] lustre: ldlm: cleanup LVB handling James Simmons
                   ` (500 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

In a number of places in the code, rename variables from "*_handle"
or "*_fh" to "*_open_handle" so that it is clear this is referencing
an open file handle rather than something else (e.g. a lock handle).

Also rename the confusingly-named mti_close_handle to mti_open_handle,
since this is referencing an open file handle, even if it is used at
close time to close the file.

   mfd_handle2mfd() -> mfd_open_handle2mfd()
   mdt_file_data.mfd_handle -> mfd_open_handle
   mdt_file_data.mfd_old_handle -> mfd_open_handle_old
   mdt_thread_info.mti_close_handle -> mti_open_handle
   mdt_body.mbo_handle -> mbo_open_handle
   mdt_io_epoch.mio_handle -> mio_open_handle
   md_op_data.op_handle -> op_open_handle
   mdt_rec_create.cr_old_handle -> cr_open_handle_old
   mdt_reint_record.rr_handle -> rr_open_handle
   obd_client_handle.och_fh -> och_open_handle

Change the resync code path to use a "lease_handle" to avoid confusion
with an open handle:

   mdt_rec_resync.rs_handle -> rs_lease_handle
   use md_op_data.op_lease_handle
   add mdt_reint_record.rr_lease_handle

WC-bug-id: https://jira.whamcloud.com/browse/LU-8174
Lustre-commit: ccb133fd2266 ("LU-8174 misc: name open file handles as such")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/26953
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd.h                |  4 ++--
 fs/lustre/llite/file.c                 | 22 +++++++++++-----------
 fs/lustre/llite/llite_lib.c            |  2 +-
 fs/lustre/mdc/mdc_lib.c                |  4 ++--
 fs/lustre/mdc/mdc_reint.c              |  4 ++--
 fs/lustre/mdc/mdc_request.c            | 26 ++++++++++++++------------
 fs/lustre/ptlrpc/pack_generic.c        |  2 +-
 fs/lustre/ptlrpc/wiretest.c            | 24 ++++++++++++------------
 include/uapi/linux/lustre/lustre_idl.h | 12 ++++++------
 9 files changed, 51 insertions(+), 49 deletions(-)

diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index 3910c10..7cf9745 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -737,7 +737,7 @@ struct md_op_data {
 	struct lu_fid		op_fid4; /* to the operation locks. */
 	u32			op_mds;  /* what mds server open will go to */
 	u32			op_mode;
-	struct lustre_handle	op_handle;
+	struct lustre_handle	op_open_handle;
 	s64			op_mod_time;
 	const char	       *op_name;
 	size_t			op_namelen;
@@ -933,7 +933,7 @@ struct md_open_data {
 };
 
 struct obd_client_handle {
-	struct lustre_handle		och_fh;
+	struct lustre_handle		och_open_handle;
 	struct lu_fid			och_fid;
 	struct md_open_data	       *och_mod;
 	struct lustre_handle		och_lease_handle; /* open lock for lease */
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index fd39948..a46f5d3 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -103,7 +103,7 @@ static void ll_prepare_close(struct inode *inode, struct md_op_data *op_data,
 	op_data->op_attr_flags = ll_inode_to_ext_flags(inode->i_flags);
 	if (test_bit(LLIF_PROJECT_INHERIT, &lli->lli_flags))
 		op_data->op_attr_flags |= LUSTRE_PROJINHERIT_FL;
-	op_data->op_handle = och->och_fh;
+	op_data->op_open_handle = och->och_open_handle;
 
 	/*
 	 * For HSM: if inode data has been modified, pack it so that
@@ -230,7 +230,7 @@ static int ll_close_inode_openhandle(struct inode *inode,
 
 out:
 	md_clear_open_replay_data(md_exp, och);
-	och->och_fh.cookie = DEAD_HANDLE_MAGIC;
+	och->och_open_handle.cookie = DEAD_HANDLE_MAGIC;
 	kfree(och);
 
 	ptlrpc_req_finished(req);
@@ -613,7 +613,7 @@ static int ll_och_fill(struct obd_export *md_exp, struct lookup_intent *it,
 	struct mdt_body *body;
 
 	body = req_capsule_server_get(&it->it_request->rq_pill, &RMF_MDT_BODY);
-	och->och_fh = body->mbo_handle;
+	och->och_open_handle = body->mbo_open_handle;
 	och->och_fid = body->mbo_fid1;
 	och->och_lease_handle.cookie = it->it_lock_handle;
 	och->och_magic = OBD_CLIENT_HANDLE_MAGIC;
@@ -903,7 +903,7 @@ static int ll_md_blocking_lease_ast(struct ldlm_lock *lock,
  * if it has an open lock in cache already.
  */
 static int ll_lease_och_acquire(struct inode *inode, struct file *file,
-				struct lustre_handle *old_handle)
+				struct lustre_handle *old_open_handle)
 {
 	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
 	struct ll_inode_info *lli = ll_i2info(inode);
@@ -939,7 +939,7 @@ static int ll_lease_och_acquire(struct inode *inode, struct file *file,
 		*och_p = NULL;
 	}
 
-	*old_handle = fd->fd_och->och_fh;
+	*old_open_handle = fd->fd_och->och_open_handle;
 
 out_unlock:
 	mutex_unlock(&lli->lli_och_mutex);
@@ -999,7 +999,7 @@ static int ll_lease_och_release(struct inode *inode, struct file *file)
 	struct ll_sb_info *sbi = ll_i2sbi(inode);
 	struct md_op_data *op_data;
 	struct ptlrpc_request *req = NULL;
-	struct lustre_handle old_handle = { 0 };
+	struct lustre_handle old_open_handle = { 0 };
 	struct obd_client_handle *och = NULL;
 	int rc;
 	int rc2;
@@ -1011,7 +1011,7 @@ static int ll_lease_och_release(struct inode *inode, struct file *file)
 		if (!(fmode & file->f_mode) || (file->f_mode & FMODE_EXEC))
 			return ERR_PTR(-EPERM);
 
-		rc = ll_lease_och_acquire(inode, file, &old_handle);
+		rc = ll_lease_och_acquire(inode, file, &old_open_handle);
 		if (rc)
 			return ERR_PTR(rc);
 	}
@@ -1028,7 +1028,7 @@ static int ll_lease_och_release(struct inode *inode, struct file *file)
 	}
 
 	/* To tell the MDT this openhandle is from the same owner */
-	op_data->op_handle = old_handle;
+	op_data->op_open_handle = old_open_handle;
 
 	it.it_flags = fmode | open_flags;
 	it.it_flags |= MDS_OPEN_LOCK | MDS_OPEN_BY_FID | MDS_OPEN_LEASE;
@@ -1230,7 +1230,7 @@ static int ll_lease_file_resync(struct obd_client_handle *och,
 	if (rc)
 		goto out;
 
-	op_data->op_handle = och->och_lease_handle;
+	op_data->op_lease_handle = och->och_lease_handle;
 	rc = md_file_resync(sbi->ll_md_exp, op_data);
 	if (rc)
 		goto out;
@@ -3892,7 +3892,7 @@ int ll_migrate(struct inode *parent, struct file *file, struct lmv_user_md *lum,
 		if (rc)
 			goto out_close;
 
-		op_data->op_handle = och->och_fh;
+		op_data->op_open_handle = och->och_open_handle;
 		op_data->op_data_version = data_version;
 		op_data->op_lease_handle = och->och_lease_handle;
 		op_data->op_bias |= MDS_CLOSE_MIGRATE;
@@ -3919,7 +3919,7 @@ int ll_migrate(struct inode *parent, struct file *file, struct lmv_user_md *lum,
 			obd_mod_put(och->och_mod);
 			md_clear_open_replay_data(ll_i2sbi(parent)->ll_md_exp,
 						  och);
-			och->och_fh.cookie = DEAD_HANDLE_MAGIC;
+			och->och_open_handle.cookie = DEAD_HANDLE_MAGIC;
 			kfree(och);
 			och = NULL;
 		}
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 636ddf8..be67652 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -2258,7 +2258,7 @@ void ll_open_cleanup(struct super_block *sb, struct ptlrpc_request *open_req)
 		return;
 
 	op_data->op_fid1 = body->mbo_fid1;
-	op_data->op_handle = body->mbo_handle;
+	op_data->op_open_handle = body->mbo_open_handle;
 	op_data->op_mod_time = get_seconds();
 	md_close(exp, op_data, NULL, &close_req);
 	ptlrpc_req_finished(close_req);
diff --git a/fs/lustre/mdc/mdc_lib.c b/fs/lustre/mdc/mdc_lib.c
index 5b1691e..00a6be4 100644
--- a/fs/lustre/mdc/mdc_lib.c
+++ b/fs/lustre/mdc/mdc_lib.c
@@ -254,7 +254,7 @@ void mdc_open_pack(struct ptlrpc_request *req, struct md_op_data *op_data,
 	rec->cr_suppgid2 = op_data->op_suppgids[1];
 	rec->cr_bias = op_data->op_bias;
 	rec->cr_umask = current_umask();
-	rec->cr_old_handle = op_data->op_handle;
+	rec->cr_open_handle_old = op_data->op_open_handle;
 
 	if (op_data->op_name) {
 		mdc_pack_name(req, &RMF_NAME, op_data->op_name,
@@ -359,7 +359,7 @@ static void mdc_setattr_pack_rec(struct mdt_rec_setattr *rec,
 static void mdc_ioepoch_pack(struct mdt_ioepoch *epoch,
 			     struct md_op_data *op_data)
 {
-	epoch->mio_handle = op_data->op_handle;
+	epoch->mio_open_handle = op_data->op_open_handle;
 	epoch->mio_unused1 = 0;
 	epoch->mio_unused2 = 0;
 	epoch->mio_padding = 0;
diff --git a/fs/lustre/mdc/mdc_reint.c b/fs/lustre/mdc/mdc_reint.c
index 355cee1..5d82449 100644
--- a/fs/lustre/mdc/mdc_reint.c
+++ b/fs/lustre/mdc/mdc_reint.c
@@ -456,9 +456,9 @@ int mdc_file_resync(struct obd_export *exp, struct md_op_data *op_data)
 	rec->rs_fid	= op_data->op_fid1;
 	rec->rs_bias	= op_data->op_bias;
 
-	lock = ldlm_handle2lock(&op_data->op_handle);
+	lock = ldlm_handle2lock(&op_data->op_lease_handle);
 	if (lock) {
-		rec->rs_handle = lock->l_remote_handle;
+		rec->rs_lease_handle = lock->l_remote_handle;
 		LDLM_LOCK_PUT(lock);
 	}
 
diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index 0ee42dd..15f94ea 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -593,7 +593,7 @@ void mdc_replay_open(struct ptlrpc_request *req)
 	struct md_open_data *mod = req->rq_cb_data;
 	struct ptlrpc_request *close_req;
 	struct obd_client_handle *och;
-	struct lustre_handle old;
+	struct lustre_handle old_open_handle = { };
 	struct mdt_body *body;
 
 	if (!mod) {
@@ -606,22 +606,22 @@ void mdc_replay_open(struct ptlrpc_request *req)
 
 	spin_lock(&req->rq_lock);
 	och = mod->mod_och;
-	if (och && och->och_fh.cookie)
+	if (och && och->och_open_handle.cookie)
 		req->rq_early_free_repbuf = 1;
 	else
 		req->rq_early_free_repbuf = 0;
 	spin_unlock(&req->rq_lock);
 
 	if (req->rq_early_free_repbuf) {
-		struct lustre_handle *file_fh;
+		struct lustre_handle *file_open_handle;
 
 		LASSERT(och->och_magic == OBD_CLIENT_HANDLE_MAGIC);
 
-		file_fh = &och->och_fh;
+		file_open_handle = &och->och_open_handle;
 		CDEBUG(D_HA, "updating handle from %#llx to %#llx\n",
-		       file_fh->cookie, body->mbo_handle.cookie);
-		old = *file_fh;
-		*file_fh = body->mbo_handle;
+		       file_open_handle->cookie, body->mbo_open_handle.cookie);
+		old_open_handle = *file_open_handle;
+		*file_open_handle = body->mbo_open_handle;
 	}
 
 	close_req = mod->mod_close_req;
@@ -635,10 +635,11 @@ void mdc_replay_open(struct ptlrpc_request *req)
 		LASSERT(epoch);
 
 		if (req->rq_early_free_repbuf)
-			LASSERT(!memcmp(&old, &epoch->mio_handle, sizeof(old)));
+			LASSERT(old_open_handle.cookie ==
+				epoch->mio_open_handle.cookie);
 
 		DEBUG_REQ(D_HA, close_req, "updating close body with new fh");
-		epoch->mio_handle = body->mbo_handle;
+		epoch->mio_open_handle = body->mbo_open_handle;
 	}
 }
 
@@ -722,11 +723,12 @@ int mdc_set_open_replay_data(struct obd_export *exp,
 	}
 
 	rec->cr_fid2 = body->mbo_fid1;
-	rec->cr_old_handle.cookie = body->mbo_handle.cookie;
+	rec->cr_open_handle_old = body->mbo_open_handle;
 	open_req->rq_replay_cb = mdc_replay_open;
 	if (!fid_is_sane(&body->mbo_fid1)) {
 		DEBUG_REQ(D_ERROR, open_req,
-			  "Saving replay request with insane fid");
+			  "saving replay request with insane FID " DFID,
+			  PFID(&body->mbo_fid1));
 		LBUG();
 	}
 
@@ -774,7 +776,7 @@ static int mdc_clear_open_replay_data(struct obd_export *exp,
 
 	spin_lock(&mod->mod_open_req->rq_lock);
 	if (mod->mod_och)
-		mod->mod_och->och_fh.cookie = 0;
+		mod->mod_och->och_open_handle.cookie = 0;
 	mod->mod_open_req->rq_early_free_repbuf = 0;
 	spin_unlock(&mod->mod_open_req->rq_lock);
 	mdc_free_open(mod);
diff --git a/fs/lustre/ptlrpc/pack_generic.c b/fs/lustre/ptlrpc/pack_generic.c
index e71f79d..653a8d7 100644
--- a/fs/lustre/ptlrpc/pack_generic.c
+++ b/fs/lustre/ptlrpc/pack_generic.c
@@ -1770,7 +1770,7 @@ void lustre_swab_mdt_body(struct mdt_body *b)
 void lustre_swab_mdt_ioepoch(struct mdt_ioepoch *b)
 {
 	/* handle is opaque */
-	/* mio_handle is opaque */
+	/* mio_open_handle is opaque */
 	BUILD_BUG_ON(!offsetof(typeof(*b), mio_unused1));
 	BUILD_BUG_ON(!offsetof(typeof(*b), mio_unused2));
 	BUILD_BUG_ON(!offsetof(typeof(*b), mio_padding));
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index 4095767..845aff4 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1961,10 +1961,10 @@ void lustre_assert_wire_constants(void)
 		 (long long)(int)offsetof(struct mdt_body, mbo_fid2));
 	LASSERTF((int)sizeof(((struct mdt_body *)0)->mbo_fid2) == 16, "found %lld\n",
 		 (long long)(int)sizeof(((struct mdt_body *)0)->mbo_fid2));
-	LASSERTF((int)offsetof(struct mdt_body, mbo_handle) == 32, "found %lld\n",
-		 (long long)(int)offsetof(struct mdt_body, mbo_handle));
-	LASSERTF((int)sizeof(((struct mdt_body *)0)->mbo_handle) == 8, "found %lld\n",
-		 (long long)(int)sizeof(((struct mdt_body *)0)->mbo_handle));
+	LASSERTF((int)offsetof(struct mdt_body, mbo_open_handle) == 32, "found %lld\n",
+		 (long long)(int)offsetof(struct mdt_body, mbo_open_handle));
+	LASSERTF((int)sizeof(((struct mdt_body *)0)->mbo_open_handle) == 8, "found %lld\n",
+		 (long long)(int)sizeof(((struct mdt_body *)0)->mbo_open_handle));
 	LASSERTF((int)offsetof(struct mdt_body, mbo_valid) == 40, "found %lld\n",
 		 (long long)(int)offsetof(struct mdt_body, mbo_valid));
 	LASSERTF((int)sizeof(((struct mdt_body *)0)->mbo_valid) == 8, "found %lld\n",
@@ -2162,10 +2162,10 @@ void lustre_assert_wire_constants(void)
 	/* Checks for struct mdt_ioepoch */
 	LASSERTF((int)sizeof(struct mdt_ioepoch) == 24, "found %lld\n",
 		 (long long)(int)sizeof(struct mdt_ioepoch));
-	LASSERTF((int)offsetof(struct mdt_ioepoch, mio_handle) == 0, "found %lld\n",
-		 (long long)(int)offsetof(struct mdt_ioepoch, mio_handle));
-	LASSERTF((int)sizeof(((struct mdt_ioepoch *)0)->mio_handle) == 8, "found %lld\n",
-		 (long long)(int)sizeof(((struct mdt_ioepoch *)0)->mio_handle));
+	LASSERTF((int)offsetof(struct mdt_ioepoch, mio_open_handle) == 0, "found %lld\n",
+		 (long long)(int)offsetof(struct mdt_ioepoch, mio_open_handle));
+	LASSERTF((int)sizeof(((struct mdt_ioepoch *)0)->mio_open_handle) == 8, "found %lld\n",
+		 (long long)(int)sizeof(((struct mdt_ioepoch *)0)->mio_open_handle));
 	LASSERTF((int)offsetof(struct mdt_ioepoch, mio_unused1) == 8, "found %lld\n",
 		 (long long)(int)offsetof(struct mdt_ioepoch, mio_unused1));
 	LASSERTF((int)sizeof(((struct mdt_ioepoch *)0)->mio_unused1) == 8, "found %lld\n",
@@ -2334,10 +2334,10 @@ void lustre_assert_wire_constants(void)
 		 (long long)(int)offsetof(struct mdt_rec_create, cr_fid2));
 	LASSERTF((int)sizeof(((struct mdt_rec_create *)0)->cr_fid2) == 16, "found %lld\n",
 		 (long long)(int)sizeof(((struct mdt_rec_create *)0)->cr_fid2));
-	LASSERTF((int)offsetof(struct mdt_rec_create, cr_old_handle) == 72, "found %lld\n",
-		 (long long)(int)offsetof(struct mdt_rec_create, cr_old_handle));
-	LASSERTF((int)sizeof(((struct mdt_rec_create *)0)->cr_old_handle) == 8, "found %lld\n",
-		 (long long)(int)sizeof(((struct mdt_rec_create *)0)->cr_old_handle));
+	LASSERTF((int)offsetof(struct mdt_rec_create, cr_open_handle_old) == 72, "found %lld\n",
+		 (long long)(int)offsetof(struct mdt_rec_create, cr_open_handle_old));
+	LASSERTF((int)sizeof(((struct mdt_rec_create *)0)->cr_open_handle_old) == 8, "found %lld\n",
+		 (long long)(int)sizeof(((struct mdt_rec_create *)0)->cr_open_handle_old));
 	LASSERTF((int)offsetof(struct mdt_rec_create, cr_time) == 80, "found %lld\n",
 		 (long long)(int)offsetof(struct mdt_rec_create, cr_time));
 	LASSERTF((int)sizeof(((struct mdt_rec_create *)0)->cr_time) == 8, "found %lld\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 522bd52..39f2d3b 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -1574,7 +1574,7 @@ enum md_transient_state {
 struct mdt_body {
 	struct lu_fid mbo_fid1;
 	struct lu_fid mbo_fid2;
-	struct lustre_handle mbo_handle;
+	struct lustre_handle mbo_open_handle;
 	__u64	mbo_valid;
 	__u64	mbo_size;	/* Offset, in the case of MDS_READPAGE */
 	__s64	mbo_mtime;
@@ -1612,7 +1612,7 @@ struct mdt_body {
 }; /* 216 */
 
 struct mdt_ioepoch {
-	struct lustre_handle mio_handle;
+	struct lustre_handle mio_open_handle;
 	__u64 mio_unused1; /* was ioepoch */
 	__u32 mio_unused2; /* was flags */
 	__u32 mio_padding;
@@ -1719,9 +1719,9 @@ struct mdt_rec_create {
 	__u32		cr_suppgid1_h;
 	__u32		cr_suppgid2;
 	__u32		cr_suppgid2_h;
-	struct lu_fid   cr_fid1;
-	struct lu_fid   cr_fid2;
-	struct lustre_handle cr_old_handle; /* handle in case of open replay */
+	struct lu_fid	cr_fid1;
+	struct lu_fid	cr_fid2;
+	struct lustre_handle cr_open_handle_old; /* in case of open replay */
 	__s64		cr_time;
 	__u64		cr_rdev;
 	__u64		cr_ioepoch;
@@ -1864,7 +1864,7 @@ struct mdt_rec_resync {
 	__u32           rs_suppgid2_h;
 	struct lu_fid   rs_fid;
 	__u8		rs_padding0[sizeof(struct lu_fid)];
-	struct lustre_handle rs_handle;	/* rr_mtime */
+	struct lustre_handle rs_lease_handle;	/* rr_mtime */
 	__s64		rs_padding1;	/* rr_atime */
 	__s64		rs_padding2;	/* rr_ctime */
 	__u64           rs_padding3;	/* rr_size */
-- 
1.8.3.1


* [lustre-devel] [PATCH 123/622] lustre: ldlm: cleanup LVB handling
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (121 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 122/622] lustre: misc: name open file handles as such James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 124/622] lustre: ldlm: pass preallocated env to methods James Simmons
                   ` (499 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Bruno Faccini <bruno.faccini@intel.com>

On the client side, LVB handling is barely used. In the OpenSFS
tree the lvbo handling was reworked for LU-5042. Merge those
changes and remove all of the server-related code.
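
The synchronization pattern being removed ("create the resource with
its LVB mutex held, unlock after lvbo_init()") reduces to a plain
lazy-init guard once LVB filling is delayed to first use. A userspace
sketch with illustrative names (not Lustre code; the 4096 value is a
stand-in for a real LVB init result):

```c
#include <pthread.h>

/* Illustrative lazy-init guard: the LVB is filled on first use
 * instead of at resource creation, so creators no longer hold the
 * mutex across initialization. */
struct resource {
	pthread_mutex_t lvb_mutex;
	int lvb_ready;
	long lvb_size;		/* stands in for real LVB data */
};

static void resource_init(struct resource *res)
{
	pthread_mutex_init(&res->lvb_mutex, NULL);
	res->lvb_ready = 0;	/* no lvbo_init() at creation any more */
	res->lvb_size = 0;
}

/* Every LVB consumer goes through an accessor, so the first access
 * performs the (possibly expensive) initialization exactly once. */
static long resource_lvb_get_size(struct resource *res)
{
	pthread_mutex_lock(&res->lvb_mutex);
	if (!res->lvb_ready) {
		res->lvb_size = 4096;	/* pretend lvbo_init() result */
		res->lvb_ready = 1;
	}
	pthread_mutex_unlock(&res->lvb_mutex);
	return res->lvb_size;
}
```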

WC-bug-id: https://jira.whamcloud.com/browse/LU-5042
Lustre-commit: 8739f13233e ("LU-5042 ldlm: delay filling resource's LVB upon replay")
Signed-off-by: Bruno Faccini <bruno.faccini@intel.com>
Reviewed-on: http://review.whamcloud.com/10845
Reviewed-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Reviewed-by: Niu Yawei <yawei.niu@intel.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_dlm.h | 62 ++----------------------------------------
 fs/lustre/ldlm/ldlm_resource.c | 39 ++++----------------------
 2 files changed, 8 insertions(+), 93 deletions(-)

diff --git a/fs/lustre/include/lustre_dlm.h b/fs/lustre/include/lustre_dlm.h
index 6ad12a3..1133e20 100644
--- a/fs/lustre/include/lustre_dlm.h
+++ b/fs/lustre/include/lustre_dlm.h
@@ -280,16 +280,12 @@ struct ldlm_pool {
  * Currently LVBs are used by:
  *  - OSC-OST code to maintain current object size/times
  *  - layout lock code to return the layout when the layout lock is granted
+ *
+ * To ensure delayed LVB initialization, it is highly recommended to use the set
+ * of ldlm_[res_]lvbo_[init,update,fill]() functions.
  */
 struct ldlm_valblock_ops {
-	int (*lvbo_init)(struct ldlm_resource *res);
-	int (*lvbo_update)(struct ldlm_resource *res, struct ldlm_lock *lock,
-			   struct ptlrpc_request *r,  int increase);
 	int (*lvbo_free)(struct ldlm_resource *res);
-	/* Return size of lvb data appropriate RPC size can be reserved */
-	int (*lvbo_size)(struct ldlm_lock *lock);
-	/* Called to fill in lvb data to RPC buffer @buf */
-	int (*lvbo_fill)(struct ldlm_lock *lock, void *buf, int buflen);
 };
 
 /**
@@ -922,36 +918,6 @@ static inline bool ldlm_has_dom(struct ldlm_lock *lock)
 	return &lock->l_resource->lr_ns_bucket->nsb_at_estimate;
 }
 
-static inline int ldlm_lvbo_init(struct ldlm_resource *res)
-{
-	struct ldlm_namespace *ns = ldlm_res_to_ns(res);
-
-	if (ns->ns_lvbo && ns->ns_lvbo->lvbo_init)
-		return ns->ns_lvbo->lvbo_init(res);
-
-	return 0;
-}
-
-static inline int ldlm_lvbo_size(struct ldlm_lock *lock)
-{
-	struct ldlm_namespace *ns = ldlm_lock_to_ns(lock);
-
-	if (ns->ns_lvbo && ns->ns_lvbo->lvbo_size)
-		return ns->ns_lvbo->lvbo_size(lock);
-
-	return 0;
-}
-
-static inline int ldlm_lvbo_fill(struct ldlm_lock *lock, void *buf, int len)
-{
-	struct ldlm_namespace *ns = ldlm_lock_to_ns(lock);
-
-	if (ns->ns_lvbo)
-		return ns->ns_lvbo->lvbo_fill(lock, buf, len);
-
-	return 0;
-}
-
 struct ldlm_ast_work {
 	struct ldlm_lock       *w_lock;
 	int			w_blocking;
@@ -1111,28 +1077,6 @@ static inline struct ldlm_lock *ldlm_handle2lock(const struct lustre_handle *h)
 	return lock;
 }
 
-/**
- * Update Lock Value Block Operations (LVBO) on a resource taking into account
- * data from request @r
- */
-static inline int ldlm_lvbo_update(struct ldlm_resource *res,
-				   struct ldlm_lock *lock,
-				   struct ptlrpc_request *req, int increase)
-{
-	struct ldlm_namespace *ns = ldlm_res_to_ns(res);
-
-	if (ns->ns_lvbo && ns->ns_lvbo->lvbo_update)
-		return ns->ns_lvbo->lvbo_update(res, lock, req, increase);
-
-	return 0;
-}
-
-static inline int ldlm_res_lvbo_update(struct ldlm_resource *res,
-				       struct ptlrpc_request *req, int increase)
-{
-	return ldlm_lvbo_update(res, NULL, req, increase);
-}
-
 int ldlm_error2errno(enum ldlm_error error);
 
 #if LUSTRE_TRACKS_LOCK_EXP_REFS
diff --git a/fs/lustre/ldlm/ldlm_resource.c b/fs/lustre/ldlm/ldlm_resource.c
index 5d73132..59b17b5 100644
--- a/fs/lustre/ldlm/ldlm_resource.c
+++ b/fs/lustre/ldlm/ldlm_resource.c
@@ -1062,11 +1062,10 @@ static struct ldlm_resource *ldlm_resource_new(enum ldlm_type ldlm_type)
 	spin_lock_init(&res->lr_lock);
 	lu_ref_init(&res->lr_reference);
 
-	/* The creator of the resource must unlock the mutex after LVB
-	 * initialization.
+	/* Since LVB init can be delayed now, there is no longer need to
+	 * immediately acquire mutex here.
 	 */
 	mutex_init(&res->lr_lvb_mutex);
-	mutex_lock(&res->lr_lvb_mutex);
 
 	return res;
 }
@@ -1087,7 +1086,6 @@ struct ldlm_resource *
 	struct cfs_hash_bd bd;
 	u64 version;
 	int ns_refcount = 0;
-	int rc;
 
 	LASSERT(!parent);
 	LASSERT(ns->ns_rs_hash);
@@ -1097,7 +1095,7 @@ struct ldlm_resource *
 	hnode = cfs_hash_bd_lookup_locked(ns->ns_rs_hash, &bd, (void *)name);
 	if (hnode) {
 		cfs_hash_bd_unlock(ns->ns_rs_hash, &bd, 0);
-		goto lvbo_init;
+		goto found;
 	}
 
 	version = cfs_hash_bd_version_get(&bd);
@@ -1125,25 +1123,12 @@ struct ldlm_resource *
 		cfs_hash_bd_unlock(ns->ns_rs_hash, &bd, 1);
 		/* Clean lu_ref for failed resource. */
 		lu_ref_fini(&res->lr_reference);
-		/* We have taken lr_lvb_mutex. Drop it. */
-		mutex_unlock(&res->lr_lvb_mutex);
 		if (res->lr_itree)
 			kmem_cache_free(ldlm_interval_tree_slab,
 					res->lr_itree);
 		kmem_cache_free(ldlm_resource_slab, res);
-lvbo_init:
+found:
 		res = hlist_entry(hnode, struct ldlm_resource, lr_hash);
-		/* Synchronize with regard to resource creation. */
-		if (ns->ns_lvbo && ns->ns_lvbo->lvbo_init) {
-			mutex_lock(&res->lr_lvb_mutex);
-			mutex_unlock(&res->lr_lvb_mutex);
-		}
-
-		if (unlikely(res->lr_lvb_len < 0)) {
-			rc = res->lr_lvb_len;
-			ldlm_resource_putref(res);
-			res = ERR_PTR(rc);
-		}
 		return res;
 	}
 	/* We won! Let's add the resource. */
@@ -1152,22 +1137,8 @@ struct ldlm_resource *
 		ns_refcount = ldlm_namespace_get_return(ns);
 
 	cfs_hash_bd_unlock(ns->ns_rs_hash, &bd, 1);
-	if (ns->ns_lvbo && ns->ns_lvbo->lvbo_init) {
-		OBD_FAIL_TIMEOUT(OBD_FAIL_LDLM_CREATE_RESOURCE, 2);
-		rc = ns->ns_lvbo->lvbo_init(res);
-		if (rc < 0) {
-			CERROR("%s: lvbo_init failed for resource %#llx:%#llx: rc = %d\n",
-			       ns->ns_obd->obd_name, name->name[0],
-			       name->name[1], rc);
-			res->lr_lvb_len = rc;
-			mutex_unlock(&res->lr_lvb_mutex);
-			ldlm_resource_putref(res);
-			return ERR_PTR(rc);
-		}
-	}
 
-	/* We create resource with locked lr_lvb_mutex. */
-	mutex_unlock(&res->lr_lvb_mutex);
+	OBD_FAIL_TIMEOUT(OBD_FAIL_LDLM_CREATE_RESOURCE, 2);
 
 	/* Let's see if we happened to be the very first resource in this
 	 * namespace. If so, and this is a client namespace, we need to move
-- 
1.8.3.1


* [lustre-devel] [PATCH 124/622] lustre: ldlm: pass preallocated env to methods
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (122 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 123/622] lustre: ldlm: cleanup LVB handling James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 125/622] lustre: osc: move obdo_cache to OSC code James Simmons
                   ` (498 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Alex Zhuravlev <bzzz@whamcloud.com>

Pass a preallocated env to the lvbo methods to save on per-call
env allocation.

Benchmarks by Shuichi Ihara demonstrated a 13% improvement
for small I/Os: 564k vs. 639k IOPS. The details can be found at
https://jira.whamcloud.com/browse/LU-11164.
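
The shape of the change (the caller allocates one execution
environment and threads it down the call chain, instead of each
callee allocating its own) can be sketched in userspace C; the names
below are illustrative, not the Lustre API:

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch (not the Lustre API): an "env" holds scratch
 * state that hot-path methods need. */
struct env {
	char scratch[64];
};

/* Before: every call pays for an allocation. */
static int method_alloc_each_call(void)
{
	struct env *env = malloc(sizeof(*env));

	if (!env)
		return -1;
	memset(env->scratch, 0, sizeof(env->scratch));
	free(env);
	return 0;
}

/* After: the caller passes its preallocated env down, so the hot
 * path performs no allocation at all. */
static int method_with_env(struct env *env)
{
	memset(env->scratch, 0, sizeof(env->scratch));
	return 0;
}
```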

Lustre-commit: e02cb40761ff8 ("LU-11164 ldlm: pass env to lvbo methods")
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32832
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_net.h |  2 +-
 fs/lustre/ldlm/ldlm_internal.h |  3 ++-
 fs/lustre/ldlm/ldlm_lock.c     |  6 ++++--
 fs/lustre/ldlm/ldlm_request.c  |  3 ++-
 fs/lustre/lov/lov_obd.c        |  4 ++--
 fs/lustre/ptlrpc/client.c      | 23 ++++++++++++++++++++---
 fs/lustre/ptlrpc/ptlrpcd.c     |  2 +-
 7 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/fs/lustre/include/lustre_net.h b/fs/lustre/include/lustre_net.h
index cf13555..cbd524c 100644
--- a/fs/lustre/include/lustre_net.h
+++ b/fs/lustre/include/lustre_net.h
@@ -1842,7 +1842,7 @@ struct ptlrpc_connection *ptlrpc_uuid_to_connection(struct obd_uuid *uuid,
 struct ptlrpc_request_set *ptlrpc_prep_fcset(int max, set_producer_func func,
 					     void *arg);
 int ptlrpc_check_set(const struct lu_env *env, struct ptlrpc_request_set *set);
-int ptlrpc_set_wait(struct ptlrpc_request_set *);
+int ptlrpc_set_wait(const struct lu_env *env, struct ptlrpc_request_set *set);
 void ptlrpc_set_destroy(struct ptlrpc_request_set *);
 void ptlrpc_set_add_req(struct ptlrpc_request_set *, struct ptlrpc_request *);
 #define PTLRPCD_SET ((struct ptlrpc_request_set *)1)
diff --git a/fs/lustre/ldlm/ldlm_internal.h b/fs/lustre/ldlm/ldlm_internal.h
index ec68713..df57c02 100644
--- a/fs/lustre/ldlm/ldlm_internal.h
+++ b/fs/lustre/ldlm/ldlm_internal.h
@@ -137,7 +137,8 @@ struct ldlm_lock *
 		 enum ldlm_type type, enum ldlm_mode mode,
 		 const struct ldlm_callback_suite *cbs,
 		 void *data, u32 lvb_len, enum lvb_type lvb_type);
-enum ldlm_error ldlm_lock_enqueue(struct ldlm_namespace *ns,
+enum ldlm_error ldlm_lock_enqueue(const struct lu_env *env,
+				  struct ldlm_namespace *ns,
 				  struct ldlm_lock **lock, void *cookie,
 				  u64 *flags);
 void ldlm_lock_addref_internal(struct ldlm_lock *lock, enum ldlm_mode mode);
diff --git a/fs/lustre/ldlm/ldlm_lock.c b/fs/lustre/ldlm/ldlm_lock.c
index 4f746ad..bdbbfec 100644
--- a/fs/lustre/ldlm/ldlm_lock.c
+++ b/fs/lustre/ldlm/ldlm_lock.c
@@ -1578,7 +1578,8 @@ struct ldlm_lock *ldlm_lock_create(struct ldlm_namespace *ns,
  * Does not block. As a result of enqueue the lock would be put
  * into granted or waiting list.
  */
-enum ldlm_error ldlm_lock_enqueue(struct ldlm_namespace *ns,
+enum ldlm_error ldlm_lock_enqueue(const struct lu_env *env,
+				  struct ldlm_namespace *ns,
 				  struct ldlm_lock **lockp,
 				  void *cookie, u64 *flags)
 {
@@ -1832,7 +1833,7 @@ int ldlm_run_ast_work(struct ldlm_namespace *ns, struct list_head *rpc_list,
 		goto out;
 	}
 
-	ptlrpc_set_wait(arg->set);
+	ptlrpc_set_wait(NULL, arg->set);
 	ptlrpc_set_destroy(arg->set);
 
 	rc = atomic_read(&arg->restart) ? -ERESTART : 0;
@@ -1945,6 +1946,7 @@ int ldlm_lock_set_data(const struct lustre_handle *lockh, void *data)
 EXPORT_SYMBOL(ldlm_lock_set_data);
 
 struct export_cl_data {
+	const struct lu_env	*ecl_env;
 	struct obd_export	*ecl_exp;
 	int			ecl_loop;
 };
diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index f045d30..9d3330c 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -343,6 +343,7 @@ int ldlm_cli_enqueue_fini(struct obd_export *exp, struct ptlrpc_request *req,
 			  const struct lustre_handle *lockh, int rc)
 {
 	struct ldlm_namespace *ns = exp->exp_obd->obd_namespace;
+	const struct lu_env *env = NULL;
 	int is_replay = *flags & LDLM_FL_REPLAY;
 	struct ldlm_lock *lock;
 	struct ldlm_reply *reply;
@@ -487,7 +488,7 @@ int ldlm_cli_enqueue_fini(struct obd_export *exp, struct ptlrpc_request *req,
 	}
 
 	if (!is_replay) {
-		rc = ldlm_lock_enqueue(ns, &lock, NULL, flags);
+		rc = ldlm_lock_enqueue(env, ns, &lock, NULL, flags);
 		if (lock->l_completion_ast) {
 			int err = lock->l_completion_ast(lock, *flags, NULL);
 
diff --git a/fs/lustre/lov/lov_obd.c b/fs/lustre/lov/lov_obd.c
index 35eaa1f..9a6ffe8 100644
--- a/fs/lustre/lov/lov_obd.c
+++ b/fs/lustre/lov/lov_obd.c
@@ -948,7 +948,7 @@ static int lov_statfs(const struct lu_env *env, struct obd_export *exp,
 			goto out_set;
 	}
 
-	rc = ptlrpc_set_wait(rqset);
+	rc = ptlrpc_set_wait(env, rqset);
 
 out_set:
 	if (rc < 0)
@@ -1249,7 +1249,7 @@ static int lov_set_info_async(const struct lu_env *env, struct obd_export *exp,
 
 	lov_tgts_putref(obddev);
 	if (no_set) {
-		err = ptlrpc_set_wait(set);
+		err = ptlrpc_set_wait(env, set);
 		if (rc == 0)
 			rc = err;
 		ptlrpc_set_destroy(set);
diff --git a/fs/lustre/ptlrpc/client.c b/fs/lustre/ptlrpc/client.c
index 7be597c..fabe675 100644
--- a/fs/lustre/ptlrpc/client.c
+++ b/fs/lustre/ptlrpc/client.c
@@ -2278,9 +2278,10 @@ time64_t ptlrpc_set_next_timeout(struct ptlrpc_request_set *set)
  * error or otherwise be interrupted).
  * Returns 0 on success or error code otherwise.
  */
-int ptlrpc_set_wait(struct ptlrpc_request_set *set)
+int ptlrpc_set_wait(const struct lu_env *env, struct ptlrpc_request_set *set)
 {
 	struct ptlrpc_request *req;
+	struct lu_env _env;
 	time64_t timeout;
 	int rc;
 
@@ -2295,6 +2296,19 @@ int ptlrpc_set_wait(struct ptlrpc_request_set *set)
 	if (list_empty(&set->set_requests))
 		return 0;
 
+	/* ideally we want the env provided by the caller all the time,
+	 * but at the moment that would mean a massive change in
+	 * LDLM while benefits would be close to zero, so just
+	 * initialize env here for those rare cases
+	 */
+	if (!env) {
+		/* XXX: skip on the client side? */
+		rc = lu_env_init(&_env, LCT_DT_THREAD);
+		if (rc)
+			return rc;
+		env = &_env;
+	}
+
 	do {
 		timeout = ptlrpc_set_next_timeout(set);
 
@@ -2313,7 +2327,7 @@ int ptlrpc_set_wait(struct ptlrpc_request_set *set)
 			 * so we allow interrupts during the timeout.
 			 */
 			rc = l_wait_event_abortable_timeout(set->set_waitq,
-							    ptlrpc_check_set(NULL, set),
+							    ptlrpc_check_set(env, set),
 							    HZ);
 			if (rc == 0) {
 				rc = -ETIMEDOUT;
@@ -2380,6 +2394,9 @@ int ptlrpc_set_wait(struct ptlrpc_request_set *set)
 			rc = req->rq_status;
 	}
 
+	if (env && env == &_env)
+		lu_env_fini(&_env);
+
 	return rc;
 }
 EXPORT_SYMBOL(ptlrpc_set_wait);
@@ -2841,7 +2858,7 @@ int ptlrpc_queue_wait(struct ptlrpc_request *req)
 	/* add a ref for the set (see comment in ptlrpc_set_add_req) */
 	ptlrpc_request_addref(req);
 	ptlrpc_set_add_req(set, req);
-	rc = ptlrpc_set_wait(set);
+	rc = ptlrpc_set_wait(NULL, set);
 	ptlrpc_set_destroy(set);
 
 	return rc;
diff --git a/fs/lustre/ptlrpc/ptlrpcd.c b/fs/lustre/ptlrpc/ptlrpcd.c
index c0b091c..e9c03ba 100644
--- a/fs/lustre/ptlrpc/ptlrpcd.c
+++ b/fs/lustre/ptlrpc/ptlrpcd.c
@@ -469,7 +469,7 @@ static int ptlrpcd(void *arg)
 	 * Wait for inflight requests to drain.
 	 */
 	if (!list_empty(&set->set_requests))
-		ptlrpc_set_wait(set);
+		ptlrpc_set_wait(&env, set);
 	lu_context_fini(&env.le_ctx);
 	lu_context_fini(env.le_ses);
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 125/622] lustre: osc: move obdo_cache to OSC code
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (123 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 124/622] lustre: ldlm: pass preallocated env to methods James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 126/622] lustre: llite: zero lum for stripeless files James Simmons
                   ` (497 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

The obdo_cache slab is only used by the OSC code today, so it does
not need to be allocated in obdclass on servers. Move it so that it
is only allocated when the OSC module is loaded.

Rename obdo_cachep to osc_obdo_kmem to match other slab caches
created by the OSC.
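
The mechanics of the move can be sketched in userspace C. struct
kmem_descr and caches_init() below are simplified, hypothetical
stand-ins for Lustre's lu_kmem_descr table and its init loop, with
malloc() standing in for kmem_cache_create(); the point is that each
module owns a NULL-terminated descriptor table, so adding the obdo
cache to the OSC table means it is created only at OSC module load:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for struct lu_kmem_descr: each module lists the
 * slab caches it owns in a NULL-terminated table. */
struct kmem_descr {
	void **cache;		/* where the created cache pointer goes */
	const char *name;
	size_t size;
};

void *osc_obdo_cache;		/* plays the role of osc_obdo_kmem */

struct kmem_descr osc_caches[] = {
	{ &osc_obdo_cache, "osc_obdo_kmem", 208 },  /* size is illustrative */
	{ NULL, NULL, 0 },
};

/* One init pass creates every cache in the table; malloc() stands in
 * for kmem_cache_create(). */
int caches_init(struct kmem_descr *d)
{
	for (; d->cache; d++) {
		*d->cache = malloc(d->size);
		if (!*d->cache)
			return -1;
	}
	return 0;
}
```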

WC-bug-id: https://jira.whamcloud.com/browse/LU-10899
Lustre-commit: 48df66be72c9 ("LU-10899 osc: move obdo_cache to OSC code")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33141
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_osc.h |  1 +
 fs/lustre/include/obd_class.h  |  3 ---
 fs/lustre/obdclass/genops.c    | 10 ----------
 fs/lustre/osc/osc_dev.c        |  8 ++++++--
 fs/lustre/osc/osc_request.c    | 11 +++++------
 5 files changed, 12 insertions(+), 21 deletions(-)

diff --git a/fs/lustre/include/lustre_osc.h b/fs/lustre/include/lustre_osc.h
index dc8071a..dabcee0 100644
--- a/fs/lustre/include/lustre_osc.h
+++ b/fs/lustre/include/lustre_osc.h
@@ -557,6 +557,7 @@ struct osc_brw_async_args {
 extern struct kmem_cache *osc_session_kmem;
 extern struct kmem_cache *osc_extent_kmem;
 extern struct kmem_cache *osc_quota_kmem;
+extern struct kmem_cache *osc_obdo_kmem;
 
 extern struct lu_context_key osc_key;
 extern struct lu_context_key osc_session_key;
diff --git a/fs/lustre/include/obd_class.h b/fs/lustre/include/obd_class.h
index a3ef5d5..01eb385 100644
--- a/fs/lustre/include/obd_class.h
+++ b/fs/lustre/include/obd_class.h
@@ -1651,9 +1651,6 @@ static inline int md_unpackmd(struct obd_export *exp,
 int obd_init_caches(void);
 void obd_cleanup_caches(void);
 
-/* support routines */
-extern struct kmem_cache *obdo_cachep;
-
 typedef int (*register_lwp_cb)(void *data);
 
 struct lwp_register_item {
diff --git a/fs/lustre/obdclass/genops.c b/fs/lustre/obdclass/genops.c
index a122332..e5e2f73 100644
--- a/fs/lustre/obdclass/genops.c
+++ b/fs/lustre/obdclass/genops.c
@@ -46,8 +46,6 @@
 static struct obd_device *obd_devs[MAX_OBD_DEVICES];
 
 static struct kmem_cache *obd_device_cachep;
-struct kmem_cache *obdo_cachep;
-EXPORT_SYMBOL(obdo_cachep);
 
 static struct kobj_type class_ktype;
 static struct workqueue_struct *zombie_wq;
@@ -645,8 +643,6 @@ void obd_cleanup_caches(void)
 {
 	kmem_cache_destroy(obd_device_cachep);
 	obd_device_cachep = NULL;
-	kmem_cache_destroy(obdo_cachep);
-	obdo_cachep = NULL;
 }
 
 int obd_init_caches(void)
@@ -658,12 +654,6 @@ int obd_init_caches(void)
 	if (!obd_device_cachep)
 		goto out;
 
-	LASSERT(!obdo_cachep);
-	obdo_cachep = kmem_cache_create("ll_obdo_cache", sizeof(struct obdo),
-					0, 0, NULL);
-	if (!obdo_cachep)
-		goto out;
-
 	return 0;
 out:
 	obd_cleanup_caches();
diff --git a/fs/lustre/osc/osc_dev.c b/fs/lustre/osc/osc_dev.c
index 3d0687a..b8bf75a 100644
--- a/fs/lustre/osc/osc_dev.c
+++ b/fs/lustre/osc/osc_dev.c
@@ -55,9 +55,8 @@
 struct kmem_cache *osc_thread_kmem;
 struct kmem_cache *osc_session_kmem;
 struct kmem_cache *osc_extent_kmem;
-EXPORT_SYMBOL(osc_extent_kmem);
 struct kmem_cache *osc_quota_kmem;
-EXPORT_SYMBOL(osc_quota_kmem);
+struct kmem_cache *osc_obdo_kmem;
 
 struct lu_kmem_descr osc_caches[] = {
 	{
@@ -91,6 +90,11 @@ struct lu_kmem_descr osc_caches[] = {
 		.ckd_size  = sizeof(struct osc_quota_info)
 	},
 	{
+		.ckd_cache = &osc_obdo_kmem,
+		.ckd_name  = "osc_obdo_kmem",
+		.ckd_size  = sizeof(struct obdo)
+	},
+	{
 		.ckd_cache = NULL
 	}
 };
diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index 2784e1e..e968360 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -749,7 +749,7 @@ static int osc_shrink_grant_interpret(const struct lu_env *env,
 	LASSERT(body);
 	osc_update_grant(cli, body);
 out:
-	kmem_cache_free(obdo_cachep, oa);
+	kmem_cache_free(osc_obdo_kmem, oa);
 	return rc;
 }
 
@@ -2115,7 +2115,7 @@ static int brw_interpret(const struct lu_env *env,
 			cl_object_attr_update(env, obj, attr, valid);
 		cl_object_attr_unlock(obj);
 	}
-	kmem_cache_free(obdo_cachep, aa->aa_oa);
+	kmem_cache_free(osc_obdo_kmem, aa->aa_oa);
 
 	if (lustre_msg_get_opc(req->rq_reqmsg) == OST_WRITE && rc == 0)
 		osc_inc_unstable_pages(req);
@@ -2223,7 +2223,7 @@ int osc_build_rpc(const struct lu_env *env, struct client_obd *cli,
 		goto out;
 	}
 
-	oa = kmem_cache_zalloc(obdo_cachep, GFP_NOFS);
+	oa = kmem_cache_zalloc(osc_obdo_kmem, GFP_NOFS);
 	if (!oa) {
 		rc = -ENOMEM;
 		goto out;
@@ -2349,8 +2349,7 @@ int osc_build_rpc(const struct lu_env *env, struct client_obd *cli,
 	if (rc != 0) {
 		LASSERT(!req);
 
-		if (oa)
-			kmem_cache_free(obdo_cachep, oa);
+		kmem_cache_free(osc_obdo_kmem, oa);
 		kfree(pga);
 		/* this should happen rarely and is pretty bad, it makes the
 		 * pending list not follow the dirty order
@@ -2960,7 +2959,7 @@ int osc_set_info_async(const struct lu_env *env, struct obd_export *exp,
 		struct obdo *oa;
 
 		aa = ptlrpc_req_async_args(aa, req);
-		oa = kmem_cache_zalloc(obdo_cachep, GFP_NOFS);
+		oa = kmem_cache_zalloc(osc_obdo_kmem, GFP_NOFS);
 		if (!oa) {
 			ptlrpc_req_finished(req);
 			return -ENOMEM;
-- 
1.8.3.1


* [lustre-devel] [PATCH 126/622] lustre: llite: zero lum for stripeless files
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (124 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 125/622] lustre: osc: move obdo_cache to OSC code James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 127/622] lustre: idl: remove obsolete RPC flags James Simmons
                   ` (496 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: "John L. Hammond" <jhammond@whamcloud.com>

In the IOC_MDC_GETFILEINFO/LL_IOC_MDC_GETINFO case of ll_dir_ioctl(),
if the file has no striping then zero out the lum buffer so that
userspace won't be confused by garbage.
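
The fix can be sketched in plain userspace C; fill_lum() below is a
hypothetical condensation of the ioctl reply path, with memset() and
memcpy() standing in for clear_user() and copy_to_user():

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical condensation of the IOC_MDC_GETFILEINFO reply path:
 * when the file has no striping (lmmsize == 0), zero the user's lum
 * buffer rather than leaving uninitialized data in it.  memset() and
 * memcpy() stand in for clear_user() and copy_to_user(). */
int fill_lum(char *lump, size_t lump_size, const char *lmm, size_t lmmsize)
{
	if (lmmsize == 0) {
		/* stripeless file: don't leave garbage in *lump */
		memset(lump, 0, lump_size);
		return 0;
	}
	if (lmmsize > lump_size)
		return -1;	/* the real code returns -EOVERFLOW */
	memcpy(lump, lmm, lmmsize);
	return 0;
}
```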

WC-bug-id: https://jira.whamcloud.com/browse/LU-11380
Lustre-commit: fab95b4345db ("LU-11380 llite: zero lum for stripeless files")
Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33172
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/dir.c | 26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/fs/lustre/llite/dir.c b/fs/lustre/llite/dir.c
index 06f7bd3..55a1efb 100644
--- a/fs/lustre/llite/dir.c
+++ b/fs/lustre/llite/dir.c
@@ -1442,15 +1442,14 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 			goto out_req;
 		}
 
-		if (rc < 0) {
-			if (rc == -ENODATA && (cmd == IOC_MDC_GETFILEINFO ||
-					       cmd == LL_IOC_MDC_GETINFO)) {
-				rc = 0;
-				goto skip_lmm;
-			}
+		if (rc == -ENODATA && (cmd == IOC_MDC_GETFILEINFO ||
+				       cmd == LL_IOC_MDC_GETINFO)) {
+			lmmsize = 0;
+			rc = 0;
+		}
 
+		if (rc < 0)
 			goto out_req;
-		}
 
 		if (cmd == IOC_MDC_GETFILESTRIPE ||
 		    cmd == LL_IOC_LOV_GETSTRIPE ||
@@ -1462,14 +1461,23 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 			lmdp = (struct lov_user_mds_data __user *)arg;
 			lump = &lmdp->lmd_lmm;
 		}
-		if (copy_to_user(lump, lmm, lmmsize)) {
+
+		if (lmmsize == 0) {
+			/* If the file has no striping then zero out *lump so
+			 * that the caller isn't confused by garbage.
+			 */
+			if (clear_user(lump, sizeof(*lump))) {
+				rc = -EFAULT;
+				goto out_req;
+			}
+		} else if (copy_to_user(lump, lmm, lmmsize)) {
 			if (copy_to_user(lump, lmm, sizeof(*lump))) {
 				rc = -EFAULT;
 				goto out_req;
 			}
 			rc = -EOVERFLOW;
 		}
-skip_lmm:
+
 		if (cmd == IOC_MDC_GETFILEINFO || cmd == LL_IOC_MDC_GETINFO) {
 			struct lov_user_mds_data __user *lmdp;
 			lstat_t st = { 0 };
-- 
1.8.3.1


* [lustre-devel] [PATCH 127/622] lustre: idl: remove obsolete RPC flags
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (125 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 126/622] lustre: llite: zero lum for stripeless files James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 128/622] lustre: flr: add 'nosync' flag for FLR mirrors James Simmons
                   ` (495 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Remove RPC flags that are no longer in use:
- OBD_MD_FLQOS          has never been used in master branch
- OBD_MD_FLEPOCH        unused since v2_7_50_0-38-gd5d5b349f2
- OBD_MD_REINT          unused since before 1.6
- OBD_MD_FLMDSCAPA      unused since v2_7_55_0-15-g353ef58b1d
- OBD_MD_FLOSSCAPA      unused since v2_7_55_0-15-g353ef58b1d

Rename OBD_MD_FLGENER to OBD_MD_FLPARENT to more accurately describe
that this flag is only used to mark the parent FID in the OST obdo.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11397
Lustre-commit: f63366a3c285 ("LU-11397 idl: remove obsolete RPC flags")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33202
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/obdo.c              |  2 +-
 fs/lustre/ptlrpc/layout.c              |  6 ++----
 fs/lustre/ptlrpc/pack_generic.c        |  5 +----
 fs/lustre/ptlrpc/wiretest.c            | 14 ++------------
 include/uapi/linux/lustre/lustre_idl.h | 18 ++++++++----------
 5 files changed, 14 insertions(+), 31 deletions(-)

diff --git a/fs/lustre/obdclass/obdo.c b/fs/lustre/obdclass/obdo.c
index e5475f1..8fd2922 100644
--- a/fs/lustre/obdclass/obdo.c
+++ b/fs/lustre/obdclass/obdo.c
@@ -48,7 +48,7 @@ void obdo_set_parent_fid(struct obdo *dst, const struct lu_fid *parent)
 	dst->o_parent_oid = fid_oid(parent);
 	dst->o_parent_seq = fid_seq(parent);
 	dst->o_parent_ver = fid_ver(parent);
-	dst->o_valid |= OBD_MD_FLGENER | OBD_MD_FLFID;
+	dst->o_valid |= OBD_MD_FLPARENT | OBD_MD_FLFID;
 }
 EXPORT_SYMBOL(obdo_set_parent_fid);
 
diff --git a/fs/lustre/ptlrpc/layout.c b/fs/lustre/ptlrpc/layout.c
index 225a73e..efbff69 100644
--- a/fs/lustre/ptlrpc/layout.c
+++ b/fs/lustre/ptlrpc/layout.c
@@ -1022,13 +1022,11 @@ struct req_msg_field RMF_LOGCOOKIES =
 EXPORT_SYMBOL(RMF_LOGCOOKIES);
 
 struct req_msg_field RMF_CAPA1 =
-	DEFINE_MSGF("capa", 0, sizeof(struct lustre_capa),
-		    lustre_swab_lustre_capa, NULL);
+	DEFINE_MSGF("capa", 0, 0, NULL, NULL);
 EXPORT_SYMBOL(RMF_CAPA1);
 
 struct req_msg_field RMF_CAPA2 =
-	DEFINE_MSGF("capa", 0, sizeof(struct lustre_capa),
-		    lustre_swab_lustre_capa, NULL);
+	DEFINE_MSGF("capa", 0, 0, NULL, NULL);
 EXPORT_SYMBOL(RMF_CAPA2);
 
 struct req_msg_field RMF_LAYOUT_INTENT =
diff --git a/fs/lustre/ptlrpc/pack_generic.c b/fs/lustre/ptlrpc/pack_generic.c
index 653a8d7..6da9aca 100644
--- a/fs/lustre/ptlrpc/pack_generic.c
+++ b/fs/lustre/ptlrpc/pack_generic.c
@@ -2242,12 +2242,9 @@ static void dump_obdo(struct obdo *oa)
 	else if (valid & OBD_MD_FLCKSUM)
 		CDEBUG(D_RPCTRACE, "obdo: o_checksum (o_nlink) = %u\n",
 		       oa->o_nlink);
-	if (valid & OBD_MD_FLGENER)
+	if (valid & OBD_MD_FLPARENT)
 		CDEBUG(D_RPCTRACE, "obdo: o_parent_oid = %x\n",
 		       oa->o_parent_oid);
-	if (valid & OBD_MD_FLEPOCH)
-		CDEBUG(D_RPCTRACE, "obdo: o_ioepoch = %lld\n",
-		       oa->o_ioepoch);
 	if (valid & OBD_MD_FLFID) {
 		CDEBUG(D_RPCTRACE, "obdo: o_stripe_idx = %u\n",
 		       oa->o_stripe_idx);
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index 845aff4..42af0b8 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1335,8 +1335,8 @@ void lustre_assert_wire_constants(void)
 		 OBD_MD_FLFLAGS);
 	LASSERTF(OBD_MD_FLNLINK == (0x00002000ULL), "found 0x%.16llxULL\n",
 		 OBD_MD_FLNLINK);
-	LASSERTF(OBD_MD_FLGENER == (0x00004000ULL), "found 0x%.16llxULL\n",
-		 OBD_MD_FLGENER);
+	LASSERTF(OBD_MD_FLPARENT == (0x00004000ULL), "found 0x%.16llxULL\n",
+		 OBD_MD_FLPARENT);
 	LASSERTF(OBD_MD_FLRDEV == (0x00010000ULL), "found 0x%.16llxULL\n",
 		 OBD_MD_FLRDEV);
 	LASSERTF(OBD_MD_FLEASIZE == (0x00020000ULL), "found 0x%.16llxULL\n",
@@ -1347,14 +1347,10 @@ void lustre_assert_wire_constants(void)
 		 OBD_MD_FLHANDLE);
 	LASSERTF(OBD_MD_FLCKSUM == (0x00100000ULL), "found 0x%.16llxULL\n",
 		 OBD_MD_FLCKSUM);
-	LASSERTF(OBD_MD_FLQOS == (0x00200000ULL), "found 0x%.16llxULL\n",
-		 OBD_MD_FLQOS);
 	LASSERTF(OBD_MD_FLGROUP == (0x01000000ULL), "found 0x%.16llxULL\n",
 		 OBD_MD_FLGROUP);
 	LASSERTF(OBD_MD_FLFID == (0x02000000ULL), "found 0x%.16llxULL\n",
 		 OBD_MD_FLFID);
-	LASSERTF(OBD_MD_FLEPOCH == (0x04000000ULL), "found 0x%.16llxULL\n",
-		 OBD_MD_FLEPOCH);
 	LASSERTF(OBD_MD_FLGRANT == (0x08000000ULL), "found 0x%.16llxULL\n",
 		 OBD_MD_FLGRANT);
 	LASSERTF(OBD_MD_FLDIREA == (0x10000000ULL), "found 0x%.16llxULL\n",
@@ -1367,8 +1363,6 @@ void lustre_assert_wire_constants(void)
 		 OBD_MD_FLMODEASIZE);
 	LASSERTF(OBD_MD_MDS == (0x0000000100000000ULL), "found 0x%.16llxULL\n",
 		 OBD_MD_MDS);
-	LASSERTF(OBD_MD_REINT == (0x0000000200000000ULL), "found 0x%.16llxULL\n",
-		 OBD_MD_REINT);
 	LASSERTF(OBD_MD_MEA == (0x0000000400000000ULL), "found 0x%.16llxULL\n",
 		 OBD_MD_MEA);
 	LASSERTF(OBD_MD_TSTATE == (0x0000000800000000ULL),
@@ -1381,10 +1375,6 @@ void lustre_assert_wire_constants(void)
 		 OBD_MD_FLXATTRRM);
 	LASSERTF(OBD_MD_FLACL == (0x0000008000000000ULL), "found 0x%.16llxULL\n",
 		 OBD_MD_FLACL);
-	LASSERTF(OBD_MD_FLMDSCAPA == (0x0000020000000000ULL), "found 0x%.16llxULL\n",
-		 OBD_MD_FLMDSCAPA);
-	LASSERTF(OBD_MD_FLOSSCAPA == (0x0000040000000000ULL), "found 0x%.16llxULL\n",
-		 OBD_MD_FLOSSCAPA);
 	LASSERTF(OBD_MD_FLCROSSREF == (0x0000100000000000ULL), "found 0x%.16llxULL\n",
 		 OBD_MD_FLCROSSREF);
 	LASSERTF(OBD_MD_FLGETATTRLOCK == (0x0000200000000000ULL), "found 0x%.16llxULL\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 39f2d3b..8002e046 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -1137,21 +1137,19 @@ static inline __u32 lov_mds_md_size(__u16 stripes, __u32 lmm_magic)
 #define OBD_MD_FLFLAGS		(0x00000800ULL) /* flags word */
 #define OBD_MD_DOM_SIZE		(0x00001000ULL) /* Data-on-MDT component size */
 #define OBD_MD_FLNLINK		(0x00002000ULL) /* link count */
-#define OBD_MD_FLGENER		(0x00004000ULL) /* generation number */
-#define OBD_MD_LAYOUT_VERSION	(0x00008000ULL) /* layout version for
-						 * OST objects
-						 */
+#define OBD_MD_FLPARENT		(0x00004000ULL) /* parent FID */
+#define OBD_MD_LAYOUT_VERSION	(0x00008000ULL) /* OST object layout version */
 #define OBD_MD_FLRDEV		(0x00010000ULL) /* device number */
 #define OBD_MD_FLEASIZE		(0x00020000ULL) /* extended attribute data */
 #define OBD_MD_LINKNAME		(0x00040000ULL) /* symbolic link target */
 #define OBD_MD_FLHANDLE		(0x00080000ULL) /* file/lock handle */
 #define OBD_MD_FLCKSUM		(0x00100000ULL) /* bulk data checksum */
-#define OBD_MD_FLQOS		(0x00200000ULL) /* quality of service stats */
+/*	OBD_MD_FLQOS		(0x00200000ULL) has never been used */
 #define OBD_MD_FLPRJQUOTA	(0x00400000ULL)	/* over quota flags sent from ost */
 /*	OBD_MD_FLCOOKIE		(0x00800000ULL) obsolete in 2.8 */
 #define OBD_MD_FLGROUP		(0x01000000ULL) /* group */
 #define OBD_MD_FLFID		(0x02000000ULL) /* ->ost write inline fid */
-#define OBD_MD_FLEPOCH		(0x04000000ULL) /* ->ost write with ioepoch */
+/*	OBD_MD_FLEPOCH		(0x04000000ULL) obsolete 2.7.50 */
 						/* ->mds if epoch opens or closes */
 #define OBD_MD_FLGRANT		(0x08000000ULL) /* ost preallocation space grant */
 #define OBD_MD_FLDIREA		(0x10000000ULL) /* dir's extended attribute data */
@@ -1160,7 +1158,7 @@ static inline __u32 lov_mds_md_size(__u16 stripes, __u32 lmm_magic)
 #define OBD_MD_FLMODEASIZE	(0x80000000ULL) /* EA size will be changed */
 
 #define OBD_MD_MDS		(0x0000000100000000ULL) /* where an inode lives on */
-#define OBD_MD_REINT		(0x0000000200000000ULL) /* reintegrate oa */
+/*	OBD_MD_REINT		(0x0000000200000000ULL) obsolete 1.8 */
 #define OBD_MD_MEA		(0x0000000400000000ULL) /* CMD split EA  */
 #define OBD_MD_TSTATE		(0x0000000800000000ULL) /* transient state field */
 
@@ -1169,8 +1167,8 @@ static inline __u32 lov_mds_md_size(__u16 stripes, __u32 lmm_magic)
 #define OBD_MD_FLXATTRRM	(0x0000004000000000ULL) /* xattr remove */
 #define OBD_MD_FLACL		(0x0000008000000000ULL) /* ACL */
 #define OBD_MD_FLAGSTATFS	(0x0000010000000000ULL) /* aggregated statfs */
-#define OBD_MD_FLMDSCAPA	(0x0000020000000000ULL) /* MDS capability */
-#define OBD_MD_FLOSSCAPA	(0x0000040000000000ULL) /* OSS capability */
+/*	OBD_MD_FLMDSCAPA	(0x0000020000000000ULL) obsolete 2.7.54 */
+/*	OBD_MD_FLOSSCAPA	(0x0000040000000000ULL) obsolete 2.7.54 */
 /*	OBD_MD_FLCKSPLIT	(0x0000080000000000ULL) obsolete 2.3.58*/
 #define OBD_MD_FLCROSSREF	(0x0000100000000000ULL) /* Cross-ref case */
 #define OBD_MD_FLGETATTRLOCK	(0x0000200000000000ULL) /* Get IOEpoch attributes
@@ -1202,7 +1200,7 @@ static inline __u32 lov_mds_md_size(__u16 stripes, __u32 lmm_magic)
 			  OBD_MD_FLCTIME | OBD_MD_FLSIZE  | OBD_MD_FLBLKSZ | \
 			  OBD_MD_FLMODE  | OBD_MD_FLTYPE  | OBD_MD_FLUID   | \
 			  OBD_MD_FLGID   | OBD_MD_FLFLAGS | OBD_MD_FLNLINK | \
-			  OBD_MD_FLGENER | OBD_MD_FLRDEV  | OBD_MD_FLGROUP | \
+			  OBD_MD_FLPARENT | OBD_MD_FLRDEV  | OBD_MD_FLGROUP | \
 			  OBD_MD_FLPROJID)
 
 #define OBD_MD_FLXATTRALL (OBD_MD_FLXATTR | OBD_MD_FLXATTRLS)
-- 
1.8.3.1


* [lustre-devel] [PATCH 128/622] lustre: flr: add 'nosync' flag for FLR mirrors
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (126 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 127/622] lustre: idl: remove obsolete RPC flags James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 129/622] lustre: llite: create checksums to replace checksum_pages James Simmons
                   ` (494 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Bobi Jam <bobijam@whamcloud.com>

This patch allows the 'nosync' flag to be set on FLR mirror
components, which makes 'lfs mirror resync' skip mirrors carrying
this flag unless those mirrors are explicitly requested to be
resynced.

This flag can be cleared by setting '^nosync' on any component of the
mirror.
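
The resulting resync policy can be sketched with the flag values the
patch defines in lustre_user.h; resync_skips_mirror() is a
hypothetical helper for illustration, not a function from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as defined in lustre_user.h by this patch. */
#define LCME_FL_INIT	0x00000010u	/* instantiated */
#define LCME_FL_NOSYNC	0x00000020u	/* FLR: no sync for the mirror */

/* Hypothetical helper: resync skips a mirror that carries the nosync
 * flag unless that mirror was explicitly requested. */
bool resync_skips_mirror(uint32_t lcme_flags, bool explicitly_requested)
{
	return (lcme_flags & LCME_FL_NOSYNC) && !explicitly_requested;
}
```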

WC-bug-id: https://jira.whamcloud.com/browse/LU-11400
Lustre-commit: 8a0554450eaa ("LU-11400 flr: add 'nosync' flag for FLR mirrors")
Signed-off-by: Bobi Jam <bobijam@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33205
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/lov/lov_ea.c                  |  3 +++
 fs/lustre/lov/lov_internal.h            |  1 +
 fs/lustre/lov/lov_pack.c                |  3 +++
 fs/lustre/ptlrpc/pack_generic.c         |  2 +-
 fs/lustre/ptlrpc/wiretest.c             | 18 +++++++++++++-----
 include/uapi/linux/lustre/lustre_user.h |  8 ++++++--
 6 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/fs/lustre/lov/lov_ea.c b/fs/lustre/lov/lov_ea.c
index edca3b0..31a18d0 100644
--- a/fs/lustre/lov/lov_ea.c
+++ b/fs/lustre/lov/lov_ea.c
@@ -478,6 +478,9 @@ static int lsm_verify_comp_md_v1(struct lov_comp_md_v1 *lcm,
 		lsm->lsm_entries[i] = lsme;
 		lsme->lsme_id = le32_to_cpu(lcme->lcme_id);
 		lsme->lsme_flags = le32_to_cpu(lcme->lcme_flags);
+		if (lsme->lsme_flags & LCME_FL_NOSYNC)
+			lsme->lsme_timestamp =
+				le64_to_cpu(lcme->lcme_timestamp);
 		lu_extent_le_to_cpu(&lsme->lsme_extent, &lcme->lcme_extent);
 
 		if (i == entry_count - 1) {
diff --git a/fs/lustre/lov/lov_internal.h b/fs/lustre/lov/lov_internal.h
index 5dba8d3..376ac52 100644
--- a/fs/lustre/lov/lov_internal.h
+++ b/fs/lustre/lov/lov_internal.h
@@ -50,6 +50,7 @@ struct lov_stripe_md_entry {
 	u32			lsme_magic;
 	u32			lsme_flags;
 	u32			lsme_pattern;
+	u64			lsme_timestamp;
 	u32			lsme_stripe_size;
 	u16			lsme_stripe_count;
 	u16			lsme_layout_gen;
diff --git a/fs/lustre/lov/lov_pack.c b/fs/lustre/lov/lov_pack.c
index 3dbc6aa..5f8b281 100644
--- a/fs/lustre/lov/lov_pack.c
+++ b/fs/lustre/lov/lov_pack.c
@@ -201,6 +201,9 @@ ssize_t lov_lsm_pack(const struct lov_stripe_md *lsm, void *buf,
 
 		lcme->lcme_id = cpu_to_le32(lsme->lsme_id);
 		lcme->lcme_flags = cpu_to_le32(lsme->lsme_flags);
+		if (lsme->lsme_flags & LCME_FL_NOSYNC)
+			lcme->lcme_timestamp =
+				cpu_to_le64(lsme->lsme_timestamp);
 		lcme->lcme_extent.e_start =
 			cpu_to_le64(lsme->lsme_extent.e_start);
 		lcme->lcme_extent.e_end =
diff --git a/fs/lustre/ptlrpc/pack_generic.c b/fs/lustre/ptlrpc/pack_generic.c
index 6da9aca..d93dbe1 100644
--- a/fs/lustre/ptlrpc/pack_generic.c
+++ b/fs/lustre/ptlrpc/pack_generic.c
@@ -2062,13 +2062,13 @@ void lustre_swab_lov_comp_md_v1(struct lov_comp_md_v1 *lum)
 		}
 		__swab32s(&ent->lcme_id);
 		__swab32s(&ent->lcme_flags);
+		__swab64s(&ent->lcme_timestamp);
 		__swab64s(&ent->lcme_extent.e_start);
 		__swab64s(&ent->lcme_extent.e_end);
 		__swab32s(&ent->lcme_offset);
 		__swab32s(&ent->lcme_size);
 		__swab32s(&ent->lcme_layout_gen);
 		BUILD_BUG_ON(offsetof(typeof(*ent), lcme_padding_1) == 0);
-		BUILD_BUG_ON(offsetof(typeof(*ent), lcme_padding_2) == 0);
 
 		v1 = (struct lov_user_md_v1 *)((char *)lum + off);
 		stripe_count = v1->lmm_stripe_count;
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index 42af0b8..c6dd256 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1532,14 +1532,14 @@ void lustre_assert_wire_constants(void)
 		 (long long)(int)offsetof(struct lov_comp_md_entry_v1, lcme_layout_gen));
 	LASSERTF((int)sizeof(((struct lov_comp_md_entry_v1 *)0)->lcme_layout_gen) == 4, "found %lld\n",
 		 (long long)(int)sizeof(((struct lov_comp_md_entry_v1 *)0)->lcme_layout_gen));
-	LASSERTF((int)offsetof(struct lov_comp_md_entry_v1, lcme_padding_1) == 36, "found %lld\n",
+	LASSERTF((int)offsetof(struct lov_comp_md_entry_v1, lcme_timestamp) == 36, "found %lld\n",
+		 (long long)(int)offsetof(struct lov_comp_md_entry_v1, lcme_timestamp));
+	LASSERTF((int)sizeof(((struct lov_comp_md_entry_v1 *)0)->lcme_timestamp) == 8, "found %lld\n",
+		 (long long)(int)sizeof(((struct lov_comp_md_entry_v1 *)0)->lcme_timestamp));
+	LASSERTF((int)offsetof(struct lov_comp_md_entry_v1, lcme_padding_1) == 44, "found %lld\n",
 		 (long long)(int)offsetof(struct lov_comp_md_entry_v1, lcme_padding_1));
 	LASSERTF((int)sizeof(((struct lov_comp_md_entry_v1 *)0)->lcme_padding_1) == 4, "found %lld\n",
 		 (long long)(int)sizeof(((struct lov_comp_md_entry_v1 *)0)->lcme_padding_1));
-	LASSERTF((int)offsetof(struct lov_comp_md_entry_v1, lcme_padding_2) == 40, "found %lld\n",
-		 (long long)(int)offsetof(struct lov_comp_md_entry_v1, lcme_padding_2));
-	LASSERTF((int)sizeof(((struct lov_comp_md_entry_v1 *)0)->lcme_padding_2) == 8, "found %lld\n",
-		 (long long)(int)sizeof(((struct lov_comp_md_entry_v1 *)0)->lcme_padding_2));
 	LASSERTF(LCME_FL_INIT == 0x00000010UL, "found 0x%.8xUL\n",
 		 (unsigned int)LCME_FL_INIT);
 	LASSERTF(LCME_FL_NEG == 0x80000000UL, "found 0x%.8xUL\n",
@@ -1666,6 +1666,10 @@ void lustre_assert_wire_constants(void)
 		 (long long)(int)offsetof(struct obd_statfs, os_bavail));
 	LASSERTF((int)sizeof(((struct obd_statfs *)0)->os_bavail) == 8, "found %lld\n",
 		 (long long)(int)sizeof(((struct obd_statfs *)0)->os_bavail));
+	LASSERTF((int)offsetof(struct obd_statfs, os_files) == 32, "found %lld\n",
+		 (long long)(int)offsetof(struct obd_statfs, os_files));
+	LASSERTF((int)sizeof(((struct obd_statfs *)0)->os_files) == 8, "found %lld\n",
+		 (long long)(int)sizeof(((struct obd_statfs *)0)->os_files));
 	LASSERTF((int)offsetof(struct obd_statfs, os_ffree) == 40, "found %lld\n",
 		 (long long)(int)offsetof(struct obd_statfs, os_ffree));
 	LASSERTF((int)sizeof(((struct obd_statfs *)0)->os_ffree) == 8, "found %lld\n",
@@ -1682,6 +1686,10 @@ void lustre_assert_wire_constants(void)
 		 (long long)(int)offsetof(struct obd_statfs, os_namelen));
 	LASSERTF((int)sizeof(((struct obd_statfs *)0)->os_namelen) == 4, "found %lld\n",
 		 (long long)(int)sizeof(((struct obd_statfs *)0)->os_namelen));
+	LASSERTF((int)offsetof(struct obd_statfs, os_maxbytes) == 96, "found %lld\n",
+		 (long long)(int)offsetof(struct obd_statfs, os_maxbytes));
+	LASSERTF((int)sizeof(((struct obd_statfs *)0)->os_maxbytes) == 8, "found %lld\n",
+		 (long long)(int)sizeof(((struct obd_statfs *)0)->os_maxbytes));
 	LASSERTF((int)offsetof(struct obd_statfs, os_state) == 104, "found %lld\n",
 		 (long long)(int)offsetof(struct obd_statfs, os_state));
 	LASSERTF((int)sizeof(((struct obd_statfs *)0)->os_state) == 4, "found %lld\n",
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index f25bb9b..bff6f76 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -483,16 +483,20 @@ enum lov_comp_md_entry_flags {
 	LCME_FL_PREF_RW		= LCME_FL_PREF_RD | LCME_FL_PREF_WR,
 	LCME_FL_OFFLINE		= 0x00000008,	/* Not used */
 	LCME_FL_INIT		= 0x00000010,	/* instantiated */
+	LCME_FL_NOSYNC		= 0x00000020,	/* FLR: no sync for the mirror */
 	LCME_FL_NEG		= 0x80000000,	/* used to indicate a negative
 						 * flag, won't be stored on disk
 						 */
 };
 
 #define LCME_KNOWN_FLAGS	(LCME_FL_NEG | LCME_FL_INIT | LCME_FL_STALE | \
-				 LCME_FL_PREF_RW)
+				 LCME_FL_PREF_RW | LCME_FL_NOSYNC)
 /* The flags can be set by users at mirror creation time. */
 #define LCME_USER_FLAGS		(LCME_FL_PREF_RW)
 
+/* The flags are for mirrors */
+#define LCME_MIRROR_FLAGS	(LCME_FL_NOSYNC)
+
 /* the highest bit in obdo::o_layout_version is used to mark if the file is
  * being resynced.
  */
@@ -519,8 +523,8 @@ struct lov_comp_md_entry_v1 {
 						 */
 	__u32			lcme_size;	/* size of component blob */
 	__u32			lcme_layout_gen;
+	__u64			lcme_timestamp;	/* snapshot time if applicable*/
 	__u32			lcme_padding_1;
-	__u64			lcme_padding_2;
 } __packed;
 
 #define SEQ_ID_MAX		0x0000FFFF
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 129/622] lustre: llite: create checksums to replace checksum_pages
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (127 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 128/622] lustre: flr: add 'nosync' flag for FLR mirrors James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 130/622] lustre: ptlrpc: don't change buffer when signature is ready James Simmons
                   ` (493 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

Create llite.*.checksums, which matches llite.*.checksum_pages in
functionality. Now the llite layer has something that matches
osc.*.checksums. In time we can retire checksum_pages and change
it to its original purpose of enabling per-page checksums (which
was not implemented in the CLIO development).

WC-bug-id: https://jira.whamcloud.com/browse/LU-10906
Lustre-commit: 123ee3cf96dd ("LU-10906 llite: create checksums to replace checksum_pages")
Signed-off-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-on: https://review.whamcloud.com/33222
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Emoly Liu <emoly@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/lproc_llite.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/lustre/llite/lproc_llite.c b/fs/lustre/llite/lproc_llite.c
index 5ac6689..5fc7705 100644
--- a/fs/lustre/llite/lproc_llite.c
+++ b/fs/lustre/llite/lproc_llite.c
@@ -599,8 +599,8 @@ static ssize_t ll_max_cached_mb_seq_write(struct file *file,
 
 LPROC_SEQ_FOPS(ll_max_cached_mb);
 
-static ssize_t checksum_pages_show(struct kobject *kobj, struct attribute *attr,
-				   char *buf)
+static ssize_t checksums_show(struct kobject *kobj, struct attribute *attr,
+			      char *buf)
 {
 	struct ll_sb_info *sbi = container_of(kobj, struct ll_sb_info,
 					      ll_kset.kobj);
@@ -608,10 +608,8 @@ static ssize_t checksum_pages_show(struct kobject *kobj, struct attribute *attr,
 	return sprintf(buf, "%u\n", (sbi->ll_flags & LL_SBI_CHECKSUM) ? 1 : 0);
 }
 
-static ssize_t checksum_pages_store(struct kobject *kobj,
-				    struct attribute *attr,
-				    const char *buffer,
-				    size_t count)
+static ssize_t checksums_store(struct kobject *kobj, struct attribute *attr,
+			       const char *buffer, size_t count)
 {
 	struct ll_sb_info *sbi = container_of(kobj, struct ll_sb_info,
 					      ll_kset.kobj);
@@ -642,7 +640,9 @@ static ssize_t checksum_pages_store(struct kobject *kobj,
 
 	return count;
 }
-LUSTRE_RW_ATTR(checksum_pages);
+LUSTRE_RW_ATTR(checksums);
+
+LUSTRE_ATTR(checksum_pages, 0644, checksums_show, checksums_store);
 
 static ssize_t ll_rd_track_id(struct kobject *kobj, char *buf,
 			      enum stats_track_type type)
@@ -1250,6 +1250,7 @@ static ssize_t ll_nosquash_nids_seq_write(struct file *file,
 	&lustre_attr_max_read_ahead_mb.attr,
 	&lustre_attr_max_read_ahead_per_file_mb.attr,
 	&lustre_attr_max_read_ahead_whole_mb.attr,
+	&lustre_attr_checksums.attr,
 	&lustre_attr_checksum_pages.attr,
 	&lustre_attr_stats_track_pid.attr,
 	&lustre_attr_stats_track_ppid.attr,
-- 
1.8.3.1


* [lustre-devel] [PATCH 130/622] lustre: ptlrpc: don't change buffer when signature is ready
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (128 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 129/622] lustre: llite: create checksums to replace checksum_pages James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:09 ` [lustre-devel] [PATCH 131/622] lustre: ldlm: update l_blocking_lock under lock James Simmons
                   ` (492 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

The lm_repsize field is part of the buffer used in signature calculation
and must not be changed after the calculation is done.

This patch reverts the related changes from commit 13372d6c and moves
the lm_repsize update into MDC, where the DOM read-on-open buffer is
prepared.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11414
Lustre-commit: cf503e047c7f ("LU-11414 ptlrpc: don't change buffer when signature is ready")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33223
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/mdc/mdc_locks.c | 30 +++++++++++++++++++++---------
 fs/lustre/ptlrpc/niobuf.c |  5 -----
 2 files changed, 21 insertions(+), 14 deletions(-)

diff --git a/fs/lustre/mdc/mdc_locks.c b/fs/lustre/mdc/mdc_locks.c
index 80f2e10..09f9bc5 100644
--- a/fs/lustre/mdc/mdc_locks.c
+++ b/fs/lustre/mdc/mdc_locks.c
@@ -256,7 +256,7 @@ static int mdc_save_lovea(struct ptlrpc_request *req,
 	int count = 0;
 	enum ldlm_mode mode;
 	int rc;
-	int repsize;
+	int repsize, repsize_estimate;
 
 	it->it_create_mode = (it->it_create_mode & ~S_IFMT) | S_IFREG;
 
@@ -347,22 +347,34 @@ static int mdc_save_lovea(struct ptlrpc_request *req,
 	/* Get real repbuf allocated size as rounded up power of 2 */
 	repsize = size_roundup_power2(req->rq_replen +
 				      lustre_msg_early_size());
-
 	/* Estimate free space for DoM files in repbuf */
-	repsize -= req->rq_replen - obddev->u.cli.cl_max_mds_easize +
-		   sizeof(struct lov_comp_md_v1) +
-		   sizeof(struct lov_comp_md_entry_v1) +
-		   lov_mds_md_size(0, LOV_MAGIC_V3);
-
-	if (repsize < obddev->u.cli.cl_dom_min_inline_repsize) {
-		repsize = obddev->u.cli.cl_dom_min_inline_repsize - repsize;
+	repsize_estimate = repsize - (req->rq_replen -
+			   obddev->u.cli.cl_max_mds_easize +
+			   sizeof(struct lov_comp_md_v1) +
+			   sizeof(struct lov_comp_md_entry_v1) +
+			   lov_mds_md_size(0, LOV_MAGIC_V3));
+
+	if (repsize_estimate < obddev->u.cli.cl_dom_min_inline_repsize) {
+		repsize = obddev->u.cli.cl_dom_min_inline_repsize -
+			  repsize_estimate + sizeof(struct niobuf_remote);
 		req_capsule_set_size(&req->rq_pill, &RMF_NIOBUF_INLINE,
 				     RCL_SERVER,
 				     sizeof(struct niobuf_remote) + repsize);
 		ptlrpc_request_set_replen(req);
 		CDEBUG(D_INFO, "Increase repbuf by %d bytes, total: %d\n",
 		       repsize, req->rq_replen);
+		repsize = size_roundup_power2(req->rq_replen +
+					      lustre_msg_early_size());
 	}
+	/* The only way to report real allocated repbuf size to the server
+	 * is the lm_repsize but it must be set prior buffer allocation itself
+	 * due to security reasons - it is part of buffer used in signature
+	 * calculation (see LU-11414). Therefore the saved size is predicted
+	 * value as rq_replen rounded to the next higher power of 2.
+	 * Such estimation is safe. Though the final allocated buffer might
+	 * be even larger, it is not possible to know that at this point.
+	 */
+	req->rq_reqmsg->lm_repsize = repsize;
 	return req;
 }
 
diff --git a/fs/lustre/ptlrpc/niobuf.c b/fs/lustre/ptlrpc/niobuf.c
index e8ba57b..2e866fe 100644
--- a/fs/lustre/ptlrpc/niobuf.c
+++ b/fs/lustre/ptlrpc/niobuf.c
@@ -617,11 +617,6 @@ int ptl_send_rpc(struct ptlrpc_request *request, int noreply)
 				request->rq_status = rc;
 				goto cleanup_bulk;
 			}
-			/* Use real allocated value in lm_repsize,
-			 * so the server may use whole reply buffer
-			 * without resends where it is needed.
-			 */
-			request->rq_reqmsg->lm_repsize = request->rq_repbuf_len;
 		} else {
 			request->rq_repdata = NULL;
 			request->rq_repmsg = NULL;
-- 
1.8.3.1


* [lustre-devel] [PATCH 131/622] lustre: ldlm: update l_blocking_lock under lock
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (129 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 130/622] lustre: ptlrpc: don't change buffer when signature is ready James Simmons
@ 2020-02-27 21:09 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 132/622] lustre: mgc: don't proccess cld during stopping James Simmons
                   ` (491 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:09 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

Update l_blocking_lock while holding the lock to prevent a race
between the lock_handle_convert0() and ldlm_work_bl_ast() code.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11287
Lustre-commit: 2a520282888d ("LU-11287 ldlm: update l_blocking_lock under lock")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33124
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_lock.c | 20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/fs/lustre/ldlm/ldlm_lock.c b/fs/lustre/ldlm/ldlm_lock.c
index bdbbfec..869d664 100644
--- a/fs/lustre/ldlm/ldlm_lock.c
+++ b/fs/lustre/ldlm/ldlm_lock.c
@@ -1639,16 +1639,7 @@ enum ldlm_error ldlm_lock_enqueue(const struct lu_env *env,
 
 	lock = list_first_entry(arg->list, struct ldlm_lock, l_bl_ast);
 
-	/* nobody should touch l_bl_ast */
-	lock_res_and_lock(lock);
-	list_del_init(&lock->l_bl_ast);
-
-	LASSERT(ldlm_is_ast_sent(lock));
-	LASSERT(lock->l_bl_ast_run == 0);
 	LASSERT(lock->l_blocking_lock);
-	lock->l_bl_ast_run++;
-	unlock_res_and_lock(lock);
-
 	ldlm_lock2desc(lock->l_blocking_lock, &d);
 	/* copy blocking lock ibits in cancel_bits as well,
 	 * new client may use them for lock convert and it is
@@ -1658,9 +1649,16 @@ enum ldlm_error ldlm_lock_enqueue(const struct lu_env *env,
 	d.l_policy_data.l_inodebits.cancel_bits =
 		lock->l_blocking_lock->l_policy_data.l_inodebits.bits;
 
+	/* nobody should touch l_bl_ast */
+	lock_res_and_lock(lock);
+	list_del_init(&lock->l_bl_ast);
+
+	LASSERT(ldlm_is_ast_sent(lock));
+	LASSERT(lock->l_bl_ast_run == 0);
+	lock->l_bl_ast_run++;
+	unlock_res_and_lock(lock);
+
 	rc = lock->l_blocking_ast(lock, &d, (void *)arg, LDLM_CB_BLOCKING);
-	LDLM_LOCK_RELEASE(lock->l_blocking_lock);
-	lock->l_blocking_lock = NULL;
 	LDLM_LOCK_RELEASE(lock);
 
 	return rc;
-- 
1.8.3.1


* [lustre-devel] [PATCH 132/622] lustre: mgc: don't proccess cld during stopping
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (130 preceding siblings ...)
  2020-02-27 21:09 ` [lustre-devel] [PATCH 131/622] lustre: ldlm: update l_blocking_lock under lock James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 133/622] lustre: obdclass: make mod rpc slot wait queue FIFO James Simmons
                   ` (490 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Alexander Boyko <c17825@cray.com>

The patch fixes config log processing during stopping. A general
protection fault was hit in mgc_process_cfg_log() when accessing lsi:
the lsi pointer held the bogus value 38323172756f6663, and all of
cld->cld_cfg.cfg_sb contained invalid data.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10595
Lustre-commit: bda43cbe369a ("LU-10595 mgc: don't proccess cld during stopping")
Signed-off-by: Alexander Boyko <c17825@cray.com>
Cray-bug-id: LUS-6199
Reviewed-on: https://review.whamcloud.com/33190
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/mgc/mgc_request.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/lustre/mgc/mgc_request.c b/fs/lustre/mgc/mgc_request.c
index c114aa8..785461b 100644
--- a/fs/lustre/mgc/mgc_request.c
+++ b/fs/lustre/mgc/mgc_request.c
@@ -1651,6 +1651,11 @@ int mgc_process_log(struct obd_device *mgc, struct config_llog_data *cld)
 				goto restart;
 			} else {
 				mutex_lock(&cld->cld_lock);
+				/* unlock/lock mutex, so check stopping again */
+				if (cld->cld_stopping) {
+					mutex_unlock(&cld->cld_lock);
+					return 0;
+				}
 				spin_lock(&config_list_lock);
 				cld->cld_lostlock = 1;
 				spin_unlock(&config_list_lock);
-- 
1.8.3.1


* [lustre-devel] [PATCH 133/622] lustre: obdclass: make mod rpc slot wait queue FIFO
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (131 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 132/622] lustre: mgc: don't proccess cld during stopping James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 134/622] lustre: mdc: use old statfs format James Simmons
                   ` (489 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Vladimir Saveliev <c17830@cray.com>

A relatively heavy load may cause a process to spin for a long time
without successfully grabbing a free mod rpc slot. A process has been
observed spinning for more than 100 seconds when 72 mdtest and 8 IOR
instances were running.

Make the mod rpc slot wait queue FIFO so that waiting threads get a
free mod rpc slot in the order they entered the queue.

Cray-bug-id: LUS-6380
WC-bug-id: https://jira.whamcloud.com/browse/LU-11441
Lustre-commit: 7fa0fd415770 ("LU-11441 obdclass: make mod rpc slot wait queue FIFO")
Signed-off-by: Alexander Zarochentsev <c17826@cray.com>
Signed-off-by: Vladimir Saveliev <c17830@cray.com>
Reviewed-on: https://review.whamcloud.com/33282
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/genops.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/obdclass/genops.c b/fs/lustre/obdclass/genops.c
index e5e2f73..da53572 100644
--- a/fs/lustre/obdclass/genops.c
+++ b/fs/lustre/obdclass/genops.c
@@ -1574,8 +1574,9 @@ u16 obd_get_mod_rpc_slot(struct client_obd *cli, u32 opc,
 		CDEBUG(D_RPCTRACE, "%s: sleeping for a modify RPC slot opc %u, max %hu\n",
 		       cli->cl_import->imp_obd->obd_name, opc, max);
 
-		wait_event_idle(cli->cl_mod_rpcs_waitq,
-				obd_mod_rpc_slot_avail(cli, close_req));
+		wait_event_idle_exclusive(cli->cl_mod_rpcs_waitq,
+					  obd_mod_rpc_slot_avail(cli,
+								 close_req));
 	} while (true);
 }
 EXPORT_SYMBOL(obd_get_mod_rpc_slot);
-- 
1.8.3.1


* [lustre-devel] [PATCH 134/622] lustre: mdc: use old statfs format
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (132 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 133/622] lustre: obdclass: make mod rpc slot wait queue FIFO James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 135/622] lnet: Fix selftest backward compatibility post health James Simmons
                   ` (488 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Alex Zhuravlev <bzzz@whamcloud.com>

Use the old statfs request format when the client talks to an old
server with no support for aggregated statfs.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11375
Lustre-commit: e70a6fd8a640 ("LU-11375 mdc: use old statfs format")
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33162
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_req_layout.h | 1 +
 fs/lustre/mdc/mdc_request.c           | 9 +++++++--
 fs/lustre/ptlrpc/layout.c             | 8 +++++++-
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/include/lustre_req_layout.h b/fs/lustre/include/lustre_req_layout.h
index ed4fc42..36656c6 100644
--- a/fs/lustre/include/lustre_req_layout.h
+++ b/fs/lustre/include/lustre_req_layout.h
@@ -133,6 +133,7 @@ void req_capsule_shrink(struct req_capsule *pill,
 extern struct req_format RQF_MDS_CONNECT;
 extern struct req_format RQF_MDS_DISCONNECT;
 extern struct req_format RQF_MDS_STATFS;
+extern struct req_format RQF_MDS_STATFS_NEW;
 extern struct req_format RQF_MDS_GET_ROOT;
 extern struct req_format RQF_MDS_SYNC;
 extern struct req_format RQF_MDS_GETXATTR;
diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index 15f94ea..5cc1e1f 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -1474,6 +1474,7 @@ static int mdc_statfs(const struct lu_env *env,
 		      time64_t max_age, u32 flags)
 {
 	struct obd_device *obd = class_exp2obd(exp);
+	struct req_format *fmt;
 	struct ptlrpc_request *req;
 	struct obd_statfs *msfs;
 	struct obd_import *imp = NULL;
@@ -1490,8 +1491,12 @@ static int mdc_statfs(const struct lu_env *env,
 	if (!imp)
 		return -ENODEV;
 
-	req = ptlrpc_request_alloc_pack(imp, &RQF_MDS_STATFS,
-					LUSTRE_MDS_VERSION, MDS_STATFS);
+	fmt = &RQF_MDS_STATFS;
+	if ((exp_connect_flags2(exp) & OBD_CONNECT2_SUM_STATFS) &&
+	    (flags & OBD_STATFS_SUM))
+		fmt = &RQF_MDS_STATFS_NEW;
+	req = ptlrpc_request_alloc_pack(imp, fmt, LUSTRE_MDS_VERSION,
+					MDS_STATFS);
 	if (!req) {
 		rc = -ENOMEM;
 		goto output;
diff --git a/fs/lustre/ptlrpc/layout.c b/fs/lustre/ptlrpc/layout.c
index efbff69..92d2fc2 100644
--- a/fs/lustre/ptlrpc/layout.c
+++ b/fs/lustre/ptlrpc/layout.c
@@ -683,6 +683,7 @@
 	&RQF_MDS_GET_INFO,
 	&RQF_MDS_GET_ROOT,
 	&RQF_MDS_STATFS,
+	&RQF_MDS_STATFS_NEW,
 	&RQF_MDS_GETATTR,
 	&RQF_MDS_GETATTR_NAME,
 	&RQF_MDS_GETXATTR,
@@ -1250,9 +1251,13 @@ struct req_format RQF_MDS_GET_ROOT =
 EXPORT_SYMBOL(RQF_MDS_GET_ROOT);
 
 struct req_format RQF_MDS_STATFS =
-	DEFINE_REQ_FMT0("MDS_STATFS", mdt_body_only, obd_statfs_server);
+	DEFINE_REQ_FMT0("MDS_STATFS", empty, obd_statfs_server);
 EXPORT_SYMBOL(RQF_MDS_STATFS);
 
+struct req_format RQF_MDS_STATFS_NEW =
+	DEFINE_REQ_FMT0("MDS_STATFS_NEW", mdt_body_only, obd_statfs_server);
+EXPORT_SYMBOL(RQF_MDS_STATFS_NEW);
+
 struct req_format RQF_MDS_SYNC =
 	DEFINE_REQ_FMT0("MDS_SYNC", mdt_body_capa, mdt_body_only);
 EXPORT_SYMBOL(RQF_MDS_SYNC);
@@ -2134,6 +2139,7 @@ u32 req_capsule_fmt_size(u32 magic, const struct req_format *fmt,
 			size += cfs_size_round(fmt->rf_fields[loc].d[i]->rmf_size);
 	return size;
 }
+EXPORT_SYMBOL(req_capsule_fmt_size);
 
 /**
  * Changes the format of an RPC.
-- 
1.8.3.1


* [lustre-devel] [PATCH 135/622] lnet: Fix selftest backward compatibility post health
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (133 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 134/622] lustre: mdc: use old statfs format James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 136/622] lustre: osc: clarify short_io_bytes is maximum value James Simmons
                   ` (487 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Sonia Sharma <sharmaso@whamcloud.com>

After the LNet health feature landed, lnet-selftest lost backward
compatibility. This patch fixes that by adding a new structure,
lnet_counters_common, similar to the pre-health version of
lnet_counters. lnet_counters_common is now the struct that selftest
depends on.

It also adds a struct lnet_counters_health specifically for the
health stats.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11422
Lustre-commit: 60f6f2b480b4 ("LU-11422 lnet: Fix selftest backward compatibility post health")
Signed-off-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33242
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h        |  5 ++-
 include/uapi/linux/lnet/lnet-types.h | 58 +++++++++++++++------------
 net/lnet/lnet/api-ni.c               | 78 +++++++++++++++++++++++++-----------
 net/lnet/lnet/lib-move.c             | 18 +++++----
 net/lnet/lnet/lib-msg.c              | 57 +++++++++++++-------------
 net/lnet/lnet/router_proc.c          | 14 ++++---
 net/lnet/selftest/framework.c        | 28 ++++++-------
 net/lnet/selftest/rpc.h              | 10 ++---
 8 files changed, 157 insertions(+), 111 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index 4915a87..a1dad9f 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -445,7 +445,7 @@ void lnet_res_lh_initialize(struct lnet_res_container *rec,
 
 	rspt = kzalloc(sizeof(*rspt), GFP_NOFS);
 	lnet_net_lock(cpt);
-	the_lnet.ln_counters[cpt]->rst_alloc++;
+	the_lnet.ln_counters[cpt]->lct_health.lch_rst_alloc++;
 	lnet_net_unlock(cpt);
 	return rspt;
 }
@@ -455,7 +455,7 @@ void lnet_res_lh_initialize(struct lnet_res_container *rec,
 {
 	kfree(rspt);
 	lnet_net_lock(cpt);
-	the_lnet.ln_counters[cpt]->rst_alloc--;
+	the_lnet.ln_counters[cpt]->lct_health.lch_rst_alloc--;
 	lnet_net_unlock(cpt);
 }
 
@@ -675,6 +675,7 @@ int lnet_delay_rule_list(int pos, struct lnet_fault_attr *attr,
 
 /** @} lnet_fault_simulation */
 
+void lnet_counters_get_common(struct lnet_counters_common *common);
 void lnet_counters_get(struct lnet_counters *counters);
 void lnet_counters_reset(void);
 
diff --git a/include/uapi/linux/lnet/lnet-types.h b/include/uapi/linux/lnet/lnet-types.h
index 1da72c4..cf263b9 100644
--- a/include/uapi/linux/lnet/lnet-types.h
+++ b/include/uapi/linux/lnet/lnet-types.h
@@ -275,33 +275,41 @@ struct lnet_ping_info {
 #define LNET_PING_INFO_LONI(PINFO)	((PINFO)->pi_ni[0].ns_nid)
 #define LNET_PING_INFO_SEQNO(PINFO)	((PINFO)->pi_ni[0].ns_status)
 
-struct lnet_counters {
-	__u32	msgs_alloc;
-	__u32	msgs_max;
-	__u32	rst_alloc;
-	__u32	errors;
-	__u32	send_count;
-	__u32	recv_count;
-	__u32	route_count;
-	__u32	drop_count;
-	__u32	resend_count;
-	__u32	response_timeout_count;
-	__u32	local_interrupt_count;
-	__u32	local_dropped_count;
-	__u32	local_aborted_count;
-	__u32	local_no_route_count;
-	__u32	local_timeout_count;
-	__u32	local_error_count;
-	__u32	remote_dropped_count;
-	__u32	remote_error_count;
-	__u32	remote_timeout_count;
-	__u32	network_timeout_count;
-	__u64	send_length;
-	__u64	recv_length;
-	__u64	route_length;
-	__u64	drop_length;
+struct lnet_counters_common {
+	__u32	lcc_msgs_alloc;
+	__u32	lcc_msgs_max;
+	__u32	lcc_errors;
+	__u32	lcc_send_count;
+	__u32	lcc_recv_count;
+	__u32	lcc_route_count;
+	__u32	lcc_drop_count;
+	__u64	lcc_send_length;
+	__u64	lcc_recv_length;
+	__u64	lcc_route_length;
+	__u64	lcc_drop_length;
 } __packed;
 
+struct lnet_counters_health {
+	__u32	lch_rst_alloc;
+	__u32	lch_resend_count;
+	__u32	lch_response_timeout_count;
+	__u32	lch_local_interrupt_count;
+	__u32	lch_local_dropped_count;
+	__u32	lch_local_aborted_count;
+	__u32	lch_local_no_route_count;
+	__u32	lch_local_timeout_count;
+	__u32	lch_local_error_count;
+	__u32	lch_remote_dropped_count;
+	__u32	lch_remote_error_count;
+	__u32	lch_remote_timeout_count;
+	__u32	lch_network_timeout_count;
+};
+
+struct lnet_counters {
+	struct lnet_counters_common lct_common;
+	struct lnet_counters_health lct_health;
+};
+
 #define LNET_NI_STATUS_UP	0x15aac0de
 #define LNET_NI_STATUS_DOWN	0xdeadface
 #define LNET_NI_STATUS_INVALID	0x00000000
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index c81f46f..21e0175 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -682,40 +682,70 @@ static void lnet_assert_wire_constants(void)
 EXPORT_SYMBOL(lnet_unregister_lnd);
 
 void
+lnet_counters_get_common(struct lnet_counters_common *common)
+{
+	struct lnet_counters *ctr;
+	int i;
+
+	memset(common, 0, sizeof(*common));
+
+	lnet_net_lock(LNET_LOCK_EX);
+
+	cfs_percpt_for_each(ctr, i, the_lnet.ln_counters) {
+		common->lcc_msgs_max	 += ctr->lct_common.lcc_msgs_max;
+		common->lcc_msgs_alloc   += ctr->lct_common.lcc_msgs_alloc;
+		common->lcc_errors       += ctr->lct_common.lcc_errors;
+		common->lcc_send_count   += ctr->lct_common.lcc_send_count;
+		common->lcc_recv_count   += ctr->lct_common.lcc_recv_count;
+		common->lcc_route_count  += ctr->lct_common.lcc_route_count;
+		common->lcc_drop_count   += ctr->lct_common.lcc_drop_count;
+		common->lcc_send_length  += ctr->lct_common.lcc_send_length;
+		common->lcc_recv_length  += ctr->lct_common.lcc_recv_length;
+		common->lcc_route_length += ctr->lct_common.lcc_route_length;
+		common->lcc_drop_length  += ctr->lct_common.lcc_drop_length;
+	}
+	lnet_net_unlock(LNET_LOCK_EX);
+}
+EXPORT_SYMBOL(lnet_counters_get_common);
+
+void
 lnet_counters_get(struct lnet_counters *counters)
 {
 	struct lnet_counters *ctr;
+	struct lnet_counters_health *health = &counters->lct_health;
 	int i;
 
 	memset(counters, 0, sizeof(*counters));
 
+	lnet_counters_get_common(&counters->lct_common);
+
 	lnet_net_lock(LNET_LOCK_EX);
 
 	cfs_percpt_for_each(ctr, i, the_lnet.ln_counters) {
-		counters->msgs_max += ctr->msgs_max;
-		counters->msgs_alloc += ctr->msgs_alloc;
-		counters->rst_alloc += ctr->rst_alloc;
-		counters->errors += ctr->errors;
-		counters->resend_count += ctr->resend_count;
-		counters->response_timeout_count += ctr->response_timeout_count;
-		counters->local_interrupt_count += ctr->local_interrupt_count;
-		counters->local_dropped_count += ctr->local_dropped_count;
-		counters->local_aborted_count += ctr->local_aborted_count;
-		counters->local_no_route_count += ctr->local_no_route_count;
-		counters->local_timeout_count += ctr->local_timeout_count;
-		counters->local_error_count += ctr->local_error_count;
-		counters->remote_dropped_count += ctr->remote_dropped_count;
-		counters->remote_error_count += ctr->remote_error_count;
-		counters->remote_timeout_count += ctr->remote_timeout_count;
-		counters->network_timeout_count += ctr->network_timeout_count;
-		counters->send_count += ctr->send_count;
-		counters->recv_count += ctr->recv_count;
-		counters->route_count += ctr->route_count;
-		counters->drop_count += ctr->drop_count;
-		counters->send_length += ctr->send_length;
-		counters->recv_length += ctr->recv_length;
-		counters->route_length += ctr->route_length;
-		counters->drop_length += ctr->drop_length;
+		health->lch_rst_alloc += ctr->lct_health.lch_rst_alloc;
+		health->lch_resend_count += ctr->lct_health.lch_resend_count;
+		health->lch_response_timeout_count +=
+				ctr->lct_health.lch_response_timeout_count;
+		health->lch_local_interrupt_count +=
+				ctr->lct_health.lch_local_interrupt_count;
+		health->lch_local_dropped_count +=
+				ctr->lct_health.lch_local_dropped_count;
+		health->lch_local_aborted_count +=
+				ctr->lct_health.lch_local_aborted_count;
+		health->lch_local_no_route_count +=
+				ctr->lct_health.lch_local_no_route_count;
+		health->lch_local_timeout_count +=
+				ctr->lct_health.lch_local_timeout_count;
+		health->lch_local_error_count +=
+				ctr->lct_health.lch_local_error_count;
+		health->lch_remote_dropped_count +=
+				ctr->lct_health.lch_remote_dropped_count;
+		health->lch_remote_error_count +=
+				ctr->lct_health.lch_remote_error_count;
+		health->lch_remote_timeout_count +=
+				ctr->lct_health.lch_remote_timeout_count;
+		health->lch_network_timeout_count +=
+				ctr->lct_health.lch_network_timeout_count;
 	}
 	lnet_net_unlock(LNET_LOCK_EX);
 }
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 84a30e0..38ee970 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -755,8 +755,9 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	/* NB 'lp' is always the next hop */
 	if (!(msg->msg_target.pid & LNET_PID_USERFLAG) &&
 	    !lnet_peer_alive_locked(ni, lp, msg)) {
-		the_lnet.ln_counters[cpt]->drop_count++;
-		the_lnet.ln_counters[cpt]->drop_length += msg->msg_len;
+		the_lnet.ln_counters[cpt]->lct_common.lcc_drop_count++;
+		the_lnet.ln_counters[cpt]->lct_common.lcc_drop_length +=
+			msg->msg_len;
 		lnet_net_unlock(cpt);
 		if (msg->msg_txpeer)
 			lnet_incr_stats(&msg->msg_txpeer->lpni_stats,
@@ -2510,7 +2511,7 @@ struct lnet_mt_event_info {
 				lnet_res_unlock(i);
 
 				lnet_net_lock(i);
-				the_lnet.ln_counters[i]->response_timeout_count++;
+				the_lnet.ln_counters[i]->lct_health.lch_response_timeout_count++;
 				lnet_net_unlock(i);
 
 				list_del_init(&rspt->rspt_on_list);
@@ -2595,7 +2596,7 @@ struct lnet_mt_event_info {
 			}
 			lnet_net_lock(cpt);
 			if (!rc)
-				the_lnet.ln_counters[cpt]->resend_count++;
+				the_lnet.ln_counters[cpt]->lct_health.lch_resend_count++;
 		}
 	}
 }
@@ -3346,8 +3347,8 @@ void lnet_monitor_thr_stop(void)
 {
 	lnet_net_lock(cpt);
 	lnet_incr_stats(&ni->ni_stats, msg_type, LNET_STATS_TYPE_DROP);
-	the_lnet.ln_counters[cpt]->drop_count++;
-	the_lnet.ln_counters[cpt]->drop_length += nob;
+	the_lnet.ln_counters[cpt]->lct_common.lcc_drop_count++;
+	the_lnet.ln_counters[cpt]->lct_common.lcc_drop_length += nob;
 	lnet_net_unlock(cpt);
 
 	lnet_ni_recv(ni, private, NULL, 0, 0, 0, nob);
@@ -4329,8 +4330,9 @@ struct lnet_msg *
 
 	lnet_net_lock(cpt);
 	lnet_incr_stats(&ni->ni_stats, LNET_MSG_GET, LNET_STATS_TYPE_DROP);
-	the_lnet.ln_counters[cpt]->drop_count++;
-	the_lnet.ln_counters[cpt]->drop_length += getmd->md_length;
+	the_lnet.ln_counters[cpt]->lct_common.lcc_drop_count++;
+	the_lnet.ln_counters[cpt]->lct_common.lcc_drop_length +=
+		getmd->md_length;
 	lnet_net_unlock(cpt);
 
 	kfree(msg);
diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index 9b52549..433401f 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -140,7 +140,7 @@
 lnet_msg_commit(struct lnet_msg *msg, int cpt)
 {
 	struct lnet_msg_container *container = the_lnet.ln_msg_containers[cpt];
-	struct lnet_counters *counters = the_lnet.ln_counters[cpt];
+	struct lnet_counters_common *common;
 	s64 timeout_ns;
 
 	/* set the message deadline */
@@ -169,30 +169,31 @@
 	msg->msg_onactivelist = 1;
 	list_add_tail(&msg->msg_activelist, &container->msc_active);
 
-	counters->msgs_alloc++;
-	if (counters->msgs_alloc > counters->msgs_max)
-		counters->msgs_max = counters->msgs_alloc;
+	common = &the_lnet.ln_counters[cpt]->lct_common;
+	common->lcc_msgs_alloc++;
+	if (common->lcc_msgs_alloc > common->lcc_msgs_max)
+		common->lcc_msgs_max = common->lcc_msgs_alloc;
 }
 
 static void
 lnet_msg_decommit_tx(struct lnet_msg *msg, int status)
 {
-	struct lnet_counters *counters;
+	struct lnet_counters_common *common;
 	struct lnet_event *ev = &msg->msg_ev;
 
 	LASSERT(msg->msg_tx_committed);
 	if (status)
 		goto out;
 
-	counters = the_lnet.ln_counters[msg->msg_tx_cpt];
+	common = &the_lnet.ln_counters[msg->msg_tx_cpt]->lct_common;
 	switch (ev->type) {
 	default: /* routed message */
 		LASSERT(msg->msg_routing);
 		LASSERT(msg->msg_rx_committed);
 		LASSERT(!ev->type);
 
-		counters->route_length += msg->msg_len;
-		counters->route_count++;
+		common->lcc_route_length += msg->msg_len;
+		common->lcc_route_count++;
 		goto incr_stats;
 
 	case LNET_EVENT_PUT:
@@ -206,7 +207,7 @@
 	case LNET_EVENT_SEND:
 		LASSERT(!msg->msg_rx_committed);
 		if (msg->msg_type == LNET_MSG_PUT)
-			counters->send_length += msg->msg_len;
+			common->lcc_send_length += msg->msg_len;
 		break;
 
 	case LNET_EVENT_GET:
@@ -220,7 +221,7 @@
 		break;
 	}
 
-	counters->send_count++;
+	common->lcc_send_count++;
 
 incr_stats:
 	if (msg->msg_txpeer)
@@ -239,7 +240,7 @@
 static void
 lnet_msg_decommit_rx(struct lnet_msg *msg, int status)
 {
-	struct lnet_counters *counters;
+	struct lnet_counters_common *common;
 	struct lnet_event *ev = &msg->msg_ev;
 
 	LASSERT(!msg->msg_tx_committed); /* decommitted or never committed */
@@ -248,7 +249,7 @@
 	if (status)
 		goto out;
 
-	counters = the_lnet.ln_counters[msg->msg_rx_cpt];
+	common = &the_lnet.ln_counters[msg->msg_rx_cpt]->lct_common;
 	switch (ev->type) {
 	default:
 		LASSERT(!ev->type);
@@ -268,7 +269,7 @@
 		 */
 		LASSERT(msg->msg_type == LNET_MSG_REPLY ||
 			msg->msg_type == LNET_MSG_GET);
-		counters->send_length += msg->msg_wanted;
+		common->lcc_send_length += msg->msg_wanted;
 		break;
 
 	case LNET_EVENT_PUT:
@@ -285,7 +286,7 @@
 		break;
 	}
 
-	counters->recv_count++;
+	common->lcc_recv_count++;
 
 incr_stats:
 	if (msg->msg_rxpeer)
@@ -297,7 +298,7 @@
 				msg->msg_type,
 				LNET_STATS_TYPE_RECV);
 	if (ev->type == LNET_EVENT_PUT || ev->type == LNET_EVENT_REPLY)
-		counters->recv_length += msg->msg_wanted;
+		common->lcc_recv_length += msg->msg_wanted;
 
 out:
 	lnet_return_rx_credits_locked(msg);
@@ -330,7 +331,7 @@
 	list_del(&msg->msg_activelist);
 	msg->msg_onactivelist = 0;
 
-	the_lnet.ln_counters[cpt2]->msgs_alloc--;
+	the_lnet.ln_counters[cpt2]->lct_common.lcc_msgs_alloc--;
 
 	if (cpt2 != cpt) {
 		lnet_net_unlock(cpt2);
@@ -546,52 +547,54 @@
 {
 	struct lnet_ni *ni = msg->msg_txni;
 	struct lnet_peer_ni *lpni = msg->msg_txpeer;
-	struct lnet_counters *counters = the_lnet.ln_counters[0];
+	struct lnet_counters_health *health;
+
+	health = &the_lnet.ln_counters[0]->lct_health;
 
 	switch (hstatus) {
 	case LNET_MSG_STATUS_LOCAL_INTERRUPT:
 		atomic_inc(&ni->ni_hstats.hlt_local_interrupt);
-		counters->local_interrupt_count++;
+		health->lch_local_interrupt_count++;
 		break;
 	case LNET_MSG_STATUS_LOCAL_DROPPED:
 		atomic_inc(&ni->ni_hstats.hlt_local_dropped);
-		counters->local_dropped_count++;
+		health->lch_local_dropped_count++;
 		break;
 	case LNET_MSG_STATUS_LOCAL_ABORTED:
 		atomic_inc(&ni->ni_hstats.hlt_local_aborted);
-		counters->local_aborted_count++;
+		health->lch_local_aborted_count++;
 		break;
 	case LNET_MSG_STATUS_LOCAL_NO_ROUTE:
 		atomic_inc(&ni->ni_hstats.hlt_local_no_route);
-		counters->local_no_route_count++;
+		health->lch_local_no_route_count++;
 		break;
 	case LNET_MSG_STATUS_LOCAL_TIMEOUT:
 		atomic_inc(&ni->ni_hstats.hlt_local_timeout);
-		counters->local_timeout_count++;
+		health->lch_local_timeout_count++;
 		break;
 	case LNET_MSG_STATUS_LOCAL_ERROR:
 		atomic_inc(&ni->ni_hstats.hlt_local_error);
-		counters->local_error_count++;
+		health->lch_local_error_count++;
 		break;
 	case LNET_MSG_STATUS_REMOTE_DROPPED:
 		if (lpni)
 			atomic_inc(&lpni->lpni_hstats.hlt_remote_dropped);
-		counters->remote_dropped_count++;
+		health->lch_remote_dropped_count++;
 		break;
 	case LNET_MSG_STATUS_REMOTE_ERROR:
 		if (lpni)
 			atomic_inc(&lpni->lpni_hstats.hlt_remote_error);
-		counters->remote_error_count++;
+		health->lch_remote_error_count++;
 		break;
 	case LNET_MSG_STATUS_REMOTE_TIMEOUT:
 		if (lpni)
 			atomic_inc(&lpni->lpni_hstats.hlt_remote_timeout);
-		counters->remote_timeout_count++;
+		health->lch_remote_timeout_count++;
 		break;
 	case LNET_MSG_STATUS_NETWORK_TIMEOUT:
 		if (lpni)
 			atomic_inc(&lpni->lpni_hstats.hlt_network_timeout);
-		counters->network_timeout_count++;
+		health->lch_network_timeout_count++;
 		break;
 	case LNET_MSG_STATUS_OK:
 		break;
diff --git a/net/lnet/lnet/router_proc.c b/net/lnet/lnet/router_proc.c
index ebe7993..45abcfb 100644
--- a/net/lnet/lnet/router_proc.c
+++ b/net/lnet/lnet/router_proc.c
@@ -79,6 +79,7 @@ static int proc_lnet_stats(struct ctl_table *table, int write,
 {
 	int rc;
 	struct lnet_counters *ctrs;
+	struct lnet_counters_common common;
 	size_t nob = *lenp;
 	loff_t pos = *ppos;
 	int len;
@@ -102,15 +103,16 @@ static int proc_lnet_stats(struct ctl_table *table, int write,
 	}
 
 	lnet_counters_get(ctrs);
+	common = ctrs->lct_common;
 
 	len = snprintf(tmpstr, tmpsiz,
 		       "%u %u %u %u %u %u %u %llu %llu %llu %llu",
-		       ctrs->msgs_alloc, ctrs->msgs_max,
-		       ctrs->errors,
-		       ctrs->send_count, ctrs->recv_count,
-		       ctrs->route_count, ctrs->drop_count,
-		       ctrs->send_length, ctrs->recv_length,
-		       ctrs->route_length, ctrs->drop_length);
+		       common.lcc_msgs_alloc, common.lcc_msgs_max,
+		       common.lcc_errors,
+		       common.lcc_send_count, common.lcc_recv_count,
+		       common.lcc_route_count, common.lcc_drop_count,
+		       common.lcc_send_length, common.lcc_recv_length,
+		       common.lcc_route_length, common.lcc_drop_length);
 
 	if (pos >= min_t(int, len, strlen(tmpstr)))
 		rc = 0;
diff --git a/net/lnet/selftest/framework.c b/net/lnet/selftest/framework.c
index c8c42b9..00e7363 100644
--- a/net/lnet/selftest/framework.c
+++ b/net/lnet/selftest/framework.c
@@ -82,19 +82,19 @@
 	__swab64s(&(rc).bulk_put);	\
 } while (0)
 
-#define sfw_unpack_lnet_counters(lc)	\
-do {					\
-	__swab32s(&(lc).errors);	\
-	__swab32s(&(lc).msgs_max);	\
-	__swab32s(&(lc).msgs_alloc);	\
-	__swab32s(&(lc).send_count);	\
-	__swab32s(&(lc).recv_count);	\
-	__swab32s(&(lc).drop_count);	\
-	__swab32s(&(lc).route_count);	\
-	__swab64s(&(lc).send_length);	\
-	__swab64s(&(lc).recv_length);	\
-	__swab64s(&(lc).drop_length);	\
-	__swab64s(&(lc).route_length);	\
+#define sfw_unpack_lnet_counters(lc)		\
+do {						\
+	__swab32s(&(lc).lcc_errors);		\
+	__swab32s(&(lc).lcc_msgs_max);		\
+	__swab32s(&(lc).lcc_msgs_alloc);	\
+	__swab32s(&(lc).lcc_send_count);	\
+	__swab32s(&(lc).lcc_recv_count);	\
+	__swab32s(&(lc).lcc_drop_count);	\
+	__swab32s(&(lc).lcc_route_count);	\
+	__swab64s(&(lc).lcc_send_length);	\
+	__swab64s(&(lc).lcc_recv_length);	\
+	__swab64s(&(lc).lcc_drop_length);	\
+	__swab64s(&(lc).lcc_route_length);	\
 } while (0)
 
 #define sfw_test_active(t)	(atomic_read(&(t)->tsi_nactive))
@@ -377,7 +377,7 @@
 		return 0;
 	}
 
-	lnet_counters_get(&reply->str_lnet);
+	lnet_counters_get_common(&reply->str_lnet);
 	srpc_get_counters(&reply->str_rpc);
 
 	/*
diff --git a/net/lnet/selftest/rpc.h b/net/lnet/selftest/rpc.h
index 8ccae3a..6d07452 100644
--- a/net/lnet/selftest/rpc.h
+++ b/net/lnet/selftest/rpc.h
@@ -160,11 +160,11 @@ struct srpc_stat_reqst {
 } __packed;
 
 struct srpc_stat_reply {
-	u32		   str_status;
-	struct lst_sid	   str_sid;
-	struct sfw_counters	str_fw;
-	struct srpc_counters	str_rpc;
-	struct lnet_counters    str_lnet;
+	u32			    str_status;
+	struct lst_sid		    str_sid;
+	struct sfw_counters	    str_fw;
+	struct srpc_counters	    str_rpc;
+	struct lnet_counters_common str_lnet;
 } __packed;
 
 struct test_bulk_req {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 136/622] lustre: osc: clarify short_io_bytes is maximum value
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (134 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 135/622] lnet: Fix selftest backward compatibility post health James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 137/622] lustre: ptlrpc: Make CPU binding switchable James Simmons
                   ` (486 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Clarify in the code that the "osc.*.short_io_bytes" parameter is
the maximum IO size to pack into the request/reply, not the minimum.

Allow short_io to be disabled completely if it is set to zero.

It would be nice to also rename the sysfs functions in a similar
manner, but that would also change the sysfs tunable name (via the
LUSTRE_RW_ATTR() macro) and has compatibility implications for sites
that may have changed this value.

WC-bug-id: https://jira.whamcloud.com/browse/LU-1757
Lustre-commit: b90812a674f6 ("LU-1757 osc: clarify short_io_bytes is maximum value")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33173
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd.h             | 2 +-
 fs/lustre/ldlm/ldlm_lib.c           | 2 ++
 fs/lustre/obdclass/lprocfs_status.c | 6 +++---
 fs/lustre/osc/osc_request.c         | 7 +++----
 4 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index 7cf9745..2587136 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -252,7 +252,7 @@ struct client_obd {
 	atomic_t		cl_pending_r_pages;
 	u32			cl_max_pages_per_rpc;
 	u32			cl_max_rpcs_in_flight;
-	u32			cl_short_io_bytes;
+	u32			cl_max_short_io_bytes;
 	struct obd_histogram    cl_read_rpc_hist;
 	struct obd_histogram    cl_write_rpc_hist;
 	struct obd_histogram    cl_read_page_hist;
diff --git a/fs/lustre/ldlm/ldlm_lib.c b/fs/lustre/ldlm/ldlm_lib.c
index 838ddb3..5fe5711 100644
--- a/fs/lustre/ldlm/ldlm_lib.c
+++ b/fs/lustre/ldlm/ldlm_lib.c
@@ -374,6 +374,8 @@ int client_obd_setup(struct obd_device *obddev, struct lustre_cfg *lcfg)
 	 */
 	cli->cl_max_pages_per_rpc = PTLRPC_MAX_BRW_PAGES;
 
+	cli->cl_max_short_io_bytes = OBD_MAX_SHORT_IO_BYTES;
+
 	/*
 	 * set cl_chunkbits default value to PAGE_CACHE_SHIFT,
 	 * it will be updated at OSC connection time.
diff --git a/fs/lustre/obdclass/lprocfs_status.c b/fs/lustre/obdclass/lprocfs_status.c
index 747baff..b3dbe85 100644
--- a/fs/lustre/obdclass/lprocfs_status.c
+++ b/fs/lustre/obdclass/lprocfs_status.c
@@ -1896,7 +1896,7 @@ ssize_t short_io_bytes_show(struct kobject *kobj, struct attribute *attr,
 	int rc;
 
 	spin_lock(&cli->cl_loi_list_lock);
-	rc = sprintf(buf, "%d\n", cli->cl_short_io_bytes);
+	rc = sprintf(buf, "%d\n", cli->cl_max_short_io_bytes);
 	spin_unlock(&cli->cl_loi_list_lock);
 	return rc;
 }
@@ -1922,7 +1922,7 @@ ssize_t short_io_bytes_store(struct kobject *kobj, struct attribute *attr,
 	if (rc)
 		goto out;
 
-	if (val > OBD_MAX_SHORT_IO_BYTES || val < MIN_SHORT_IO_BYTES) {
+	if (val && (val < MIN_SHORT_IO_BYTES || val > OBD_MAX_SHORT_IO_BYTES)) {
 		rc = -ERANGE;
 		goto out;
 	}
@@ -1933,7 +1933,7 @@ ssize_t short_io_bytes_store(struct kobject *kobj, struct attribute *attr,
 	if (val > (cli->cl_max_pages_per_rpc << PAGE_SHIFT))
 		rc = -ERANGE;
 	else
-		cli->cl_short_io_bytes = val;
+		cli->cl_max_short_io_bytes = val;
 	spin_unlock(&cli->cl_loi_list_lock);
 
 out:
diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index e968360..4524a98 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -1321,9 +1321,9 @@ static int osc_brw_prep_request(int cmd, struct client_obd *cli,
 	for (i = 0; i < page_count; i++)
 		short_io_size += pga[i]->count;
 
-	/* Check if we can do a short io. */
-	if (!(short_io_size <= cli->cl_short_io_bytes && niocount == 1 &&
-	    imp_connect_shortio(cli->cl_import)))
+	/* Check if read/write is small enough to be a short io. */
+	if (short_io_size > cli->cl_max_short_io_bytes || niocount > 1 ||
+	    !imp_connect_shortio(cli->cl_import))
 		short_io_size = 0;
 
 	req_capsule_set_size(pill, &RMF_SHORT_IO, RCL_CLIENT,
@@ -1762,7 +1762,6 @@ static int osc_brw_fini_request(struct ptlrpc_request *req, int rc)
 			CERROR("Unexpected +ve rc %d\n", rc);
 			return -EPROTO;
 		}
-		LASSERT(req->rq_bulk->bd_nob == aa->aa_requested_nob);
 
 		if (req->rq_bulk &&
 		    sptlrpc_cli_unwrap_bulk_write(req, req->rq_bulk))
-- 
1.8.3.1


* [lustre-devel] [PATCH 137/622] lustre: ptlrpc: Make CPU binding switchable
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (135 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 136/622] lustre: osc: clarify short_io_bytes is maximum value James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 138/622] lustre: misc: quiet console messages at startup James Simmons
                   ` (485 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

LU-6325 added CPT binding to the ptlrpc worker threads on
the servers.  This is often desirable, especially where
NUMA latencies are high, but it is not always beneficial.

If NUMA latencies are low, there is little benefit, and
sometimes it can be quite costly:

In particular, if NID-CPT hashing with routers leads to an
unbalanced workload by CPT, it is easy to end up in a
situation where the CPUs in one CPT are maxed out but
others are idle.

To this end, we add module parameters to allow disabling
the strict binding behavior, allowing threads to use all
CPUs.

This is complicated a bit because we still want separate
service partitions - The existing "no affinity" behavior
places all service threads in a single service partition,
which gives only one queue for service wakeups.

So we separate binding behavior from CPT association,
allowing us to keep multiple service partitions where
desired.

Module parameters are added to ldlm, mdt, and ost, of the
form "servicename_cpu_bind", such as "mds_rdpg_cpu_bind".

Setting them to "0" will disable the strict CPU binding
behavior for the threads in that service.

Parameters were not added for certain minor services which
do not have any CPT affinity/binding behavior today.  (This
appears to be because they are not expected to be
performance sensitive.)

cray-bug-id: LUS-6518
WC-bug-id: https://jira.whamcloud.com/browse/LU-11454
Lustre-commit: 3eb7a1dfc3e7 ("LU-11454 ptlrpc: Make CPU binding switchable")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33262
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_net.h | 12 ++++++++----
 fs/lustre/ldlm/ldlm_lockd.c    |  8 +++++++-
 fs/lustre/ptlrpc/service.c     | 25 +++++++++++++++----------
 3 files changed, 30 insertions(+), 15 deletions(-)

diff --git a/fs/lustre/include/lustre_net.h b/fs/lustre/include/lustre_net.h
index cbd524c..81a6ac9 100644
--- a/fs/lustre/include/lustre_net.h
+++ b/fs/lustre/include/lustre_net.h
@@ -1480,14 +1480,16 @@ struct ptlrpc_service {
 	int				srv_watchdog_factor;
 	/** under unregister_service */
 	unsigned			srv_is_stopping:1;
+	/** Whether or not to restrict service threads to CPUs in this CPT */
+	unsigned			srv_cpt_bind:1;
 
 	/** max # request buffers */
 	int				srv_nrqbds_max;
 	/** max # request buffers in history per partition */
 	int				srv_hist_nrqbds_cpt_max;
-	/** number of CPTs this service bound on */
+	/** number of CPTs this service associated with */
 	int				srv_ncpts;
-	/** CPTs array this service bound on */
+	/** CPTs array this service associated with */
 	u32				*srv_cpts;
 	/** 2^srv_cptab_bits >= cfs_cpt_numbert(srv_cptable) */
 	int				srv_cpt_bits;
@@ -1934,8 +1936,8 @@ struct ptlrpc_service_thr_conf {
 	 * other members of this structure.
 	 */
 	unsigned int			tc_nthrs_user;
-	/* set NUMA node affinity for service threads */
-	unsigned int			tc_cpu_affinity;
+	/* bind service threads to only CPUs in their associated CPT */
+	unsigned int			tc_cpu_bind;
 	/* Tags for lu_context associated with service thread */
 	u32				tc_ctx_tags;
 };
@@ -1944,6 +1946,8 @@ struct ptlrpc_service_cpt_conf {
 	struct cfs_cpt_table		*cc_cptable;
 	/* string pattern to describe CPTs for a service */
 	char				*cc_pattern;
+	/* whether or not to have per-CPT service partitions */
+	bool				cc_affinity;
 };
 
 struct ptlrpc_service_conf {
diff --git a/fs/lustre/ldlm/ldlm_lockd.c b/fs/lustre/ldlm/ldlm_lockd.c
index b50a3f7..204b11b 100644
--- a/fs/lustre/ldlm/ldlm_lockd.c
+++ b/fs/lustre/ldlm/ldlm_lockd.c
@@ -49,6 +49,11 @@
 module_param(ldlm_num_threads, int, 0444);
 MODULE_PARM_DESC(ldlm_num_threads, "number of DLM service threads to start");
 
+static unsigned int ldlm_cpu_bind = 1;
+module_param(ldlm_cpu_bind, uint, 0444);
+MODULE_PARM_DESC(ldlm_cpu_bind,
+		 "bind DLM service threads to particular CPU partitions");
+
 static char *ldlm_cpts;
 module_param(ldlm_cpts, charp, 0444);
 MODULE_PARM_DESC(ldlm_cpts, "CPU partitions ldlm threads should run on");
@@ -1006,11 +1011,12 @@ static int ldlm_setup(void)
 			.tc_nthrs_base		= LDLM_NTHRS_BASE,
 			.tc_nthrs_max		= LDLM_NTHRS_MAX,
 			.tc_nthrs_user		= ldlm_num_threads,
-			.tc_cpu_affinity	= 1,
+			.tc_cpu_bind		= ldlm_cpu_bind,
 			.tc_ctx_tags		= LCT_MD_THREAD | LCT_DT_THREAD,
 		},
 		.psc_cpt		= {
 			.cc_pattern		= ldlm_cpts,
+			.cc_affinity		= true,
 		},
 		.psc_ops		= {
 			.so_req_handler		= ldlm_callback_handler,
diff --git a/fs/lustre/ptlrpc/service.c b/fs/lustre/ptlrpc/service.c
index a9155b2..b94ed6a 100644
--- a/fs/lustre/ptlrpc/service.c
+++ b/fs/lustre/ptlrpc/service.c
@@ -573,7 +573,13 @@ struct ptlrpc_service *
 	if (!cptable)
 		cptable = cfs_cpt_tab;
 
-	if (!conf->psc_thr.tc_cpu_affinity) {
+	if (conf->psc_thr.tc_cpu_bind > 1) {
+		CERROR("%s: Invalid cpu bind value %d, only 1 or 0 allowed\n",
+		       conf->psc_name, conf->psc_thr.tc_cpu_bind);
+		return ERR_PTR(-EINVAL);
+	}
+
+	if (!cconf->cc_affinity) {
 		ncpts = 1;
 	} else {
 		ncpts = cfs_cpt_number(cptable);
@@ -611,6 +617,7 @@ struct ptlrpc_service *
 	service->srv_cptable = cptable;
 	service->srv_cpts = cpts;
 	service->srv_ncpts = ncpts;
+	service->srv_cpt_bind = conf->psc_thr.tc_cpu_bind;
 
 	service->srv_cpt_bits = 0; /* it's zero already, easy to read... */
 	while ((1 << service->srv_cpt_bits) < cfs_cpt_number(cptable))
@@ -646,7 +653,7 @@ struct ptlrpc_service *
 	service->srv_ops = conf->psc_ops;
 
 	for (i = 0; i < ncpts; i++) {
-		if (!conf->psc_thr.tc_cpu_affinity)
+		if (!cconf->cc_affinity)
 			cpt = CFS_CPT_ANY;
 		else
 			cpt = cpts ? cpts[i] : i;
@@ -2105,14 +2112,12 @@ static int ptlrpc_main(void *arg)
 	thread->t_pid = current->pid;
 	unshare_fs_struct();
 
-	/* NB: we will call cfs_cpt_bind() for all threads, because we
-	 * might want to run lustre server only on a subset of system CPUs,
-	 * in that case ->scp_cpt is CFS_CPT_ANY
-	 */
-	rc = cfs_cpt_bind(svc->srv_cptable, svcpt->scp_cpt);
-	if (rc != 0) {
-		CWARN("%s: failed to bind %s on CPT %d\n",
-		      svc->srv_name, thread->t_name, svcpt->scp_cpt);
+	if (svc->srv_cpt_bind) {
+		rc = cfs_cpt_bind(svc->srv_cptable, svcpt->scp_cpt);
+		if (rc != 0) {
+			CWARN("%s: failed to bind %s on CPT %d\n",
+			      svc->srv_name, thread->t_name, svcpt->scp_cpt);
+		}
 	}
 
 	ginfo = groups_alloc(0);
-- 
1.8.3.1


* [lustre-devel] [PATCH 138/622] lustre: misc: quiet console messages at startup
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (136 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 137/622] lustre: ptlrpc: Make CPU binding switchable James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 139/622] lustre: ldlm: don't apply ELC to converting and DOM locks James Simmons
                   ` (484 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Some modules print less-than-useful messages on every load.
Turn these into internal debug messages to reduce noise.

The message in gss_init_svc_upcall() should also be quieted,
but it exposes that this function is waiting 1.5s on each module
load for lsvcgssd to start.  This should be fixed separately.

WC-bug-id: https://jira.whamcloud.com/browse/LU-1095
Lustre-commit: ed0c19d250f6 ("LU-1095 misc: quiet console messages at startup")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33281
Reviewed-by: Nathaniel Clark <nclark@whamcloud.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/lmv/lmv_obd.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index 0da9269..81b86a0 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -257,10 +257,12 @@ static int lmv_init_ea_size(struct obd_export *exp, u32 easize, u32 def_easize)
 	for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
 		struct lmv_tgt_desc *tgt = lmv->tgts[i];
 
-		if (!tgt || !tgt->ltd_exp || !tgt->ltd_active) {
+		if (!tgt || !tgt->ltd_exp) {
 			CWARN("%s: NULL export for %d\n", obd->obd_name, i);
 			continue;
 		}
+		if (!tgt->ltd_active)
+			continue;
 
 		rc = md_init_ea_size(tgt->ltd_exp, easize, def_easize);
 		if (rc) {
-- 
1.8.3.1


* [lustre-devel] [PATCH 139/622] lustre: ldlm: don't apply ELC to converting and DOM locks
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (137 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 138/622] lustre: misc: quiet console messages at startup James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 140/622] lustre: class: use INIT_LIST_HEAD_RCU instead INIT_LIST_HEAD James Simmons
                   ` (483 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

Prevent ELC for locks being converted and for locks with the DOM
bit set, to avoid unnecessary data flushes.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11276
Lustre-commit: 70a01a6c9c7c ("LU-11276 ldlm: don't apply ELC to converting and DOM locks")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33125
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_request.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index 9d3330c..1afe9a5 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -1823,7 +1823,8 @@ int ldlm_cancel_resource_local(struct ldlm_resource *res,
 		/* If somebody is already doing CANCEL, or blocking AST came,
 		 * skip this lock.
 		 */
-		if (ldlm_is_bl_ast(lock) || ldlm_is_canceling(lock))
+		if (ldlm_is_bl_ast(lock) || ldlm_is_canceling(lock) ||
+		    ldlm_is_converting(lock))
 			continue;
 
 		if (lockmode_compat(lock->l_granted_mode, mode))
@@ -1831,10 +1832,11 @@ int ldlm_cancel_resource_local(struct ldlm_resource *res,
 
 		/* If policy is given and this is IBITS lock, add to list only
 		 * those locks that match by policy.
+		 * Skip locks with DoM bit always to don't flush data.
 		 */
 		if (policy && (lock->l_resource->lr_type == LDLM_IBITS) &&
-		    !(lock->l_policy_data.l_inodebits.bits &
-		      policy->l_inodebits.bits))
+		    (!(lock->l_policy_data.l_inodebits.bits &
+		      policy->l_inodebits.bits) || ldlm_has_dom(lock)))
 			continue;
 
 		/* See CBPENDING comment in ldlm_cancel_lru */
-- 
1.8.3.1


* [lustre-devel] [PATCH 140/622] lustre: class: use INIT_LIST_HEAD_RCU instead INIT_LIST_HEAD
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (138 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 139/622] lustre: ldlm: don't apply ELC to converting and DOM locks James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 141/622] lustre: uapi: add new changerec_type James Simmons
                   ` (482 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Yang Sheng <ys@whamcloud.com>

Use INIT_LIST_HEAD_RCU to avoid excessive compiler optimization
in some cases.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11453
Lustre-commit: 68bc3984975b ("LU-11453 class: use INIT_LIST_HEAD_RCU instead INIT_LIST_HEAD")
Signed-off-by: Yang Sheng <ys@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33317
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/genops.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/obdclass/genops.c b/fs/lustre/obdclass/genops.c
index da53572..4465dd9 100644
--- a/fs/lustre/obdclass/genops.c
+++ b/fs/lustre/obdclass/genops.c
@@ -821,7 +821,7 @@ static struct obd_export *__class_new_export(struct obd_device *obd,
 	spin_lock_init(&export->exp_uncommitted_replies_lock);
 	INIT_LIST_HEAD(&export->exp_uncommitted_replies);
 	INIT_LIST_HEAD(&export->exp_req_replay_queue);
-	INIT_LIST_HEAD(&export->exp_handle.h_link);
+	INIT_LIST_HEAD_RCU(&export->exp_handle.h_link);
 	INIT_LIST_HEAD(&export->exp_hp_rpcs);
 	class_handle_hash(&export->exp_handle, &export_handle_ops);
 	spin_lock_init(&export->exp_lock);
@@ -1018,7 +1018,7 @@ struct obd_import *class_new_import(struct obd_device *obd)
 	atomic_set(&imp->imp_replay_inflight, 0);
 	atomic_set(&imp->imp_inval_count, 0);
 	INIT_LIST_HEAD(&imp->imp_conn_list);
-	INIT_LIST_HEAD(&imp->imp_handle.h_link);
+	INIT_LIST_HEAD_RCU(&imp->imp_handle.h_link);
 	class_handle_hash(&imp->imp_handle, &import_handle_ops);
 	init_imp_at(&imp->imp_at);
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 141/622] lustre: uapi: add new changerec_type
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (139 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 140/622] lustre: class: use INIT_LIST_HEAD_RCU instead INIT_LIST_HEAD James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 142/622] lustre: ldlm: check double grant race after resource change James Simmons
                   ` (481 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Qian Yingjin <qian@ddn.com>

The Lazy Size on MDT feature causes the trusted.som xattr to be
logged in the changelog whenever a file needs to update this xattr
data due to file open/close or truncate operations.

The original patch, landed on the OpenSFS branch, fixes this problem
to avoid logging this xattr for every file. This introduces a new
changelog_rec_type that the mdc changelog code needs to be aware
of.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11450
Lustre-commit: faf6f514c172 ("LU-11450 mdd: avoid logging trusted.som xattr in changelogs")
Signed-off-by: Qian Yingjin <qian@ddn.com>
Reviewed-on: https://review.whamcloud.com/33323
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/uapi/linux/lustre/lustre_user.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index bff6f76..844e50e 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -966,6 +966,7 @@ enum la_valid {
 /********* Changelogs **********/
 /** Changelog record types */
 enum changelog_rec_type {
+	CL_NONE		= -1,
 	CL_MARK		= 0,
 	CL_CREATE	= 1,  /* namespace */
 	CL_MKDIR	= 2,  /* namespace */
-- 
1.8.3.1


* [lustre-devel] [PATCH 142/622] lustre: ldlm: check double grant race after resource change
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (140 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 141/622] lustre: uapi: add new changerec_type James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 143/622] lustre: mdc: grow lvb buffer to hold layout James Simmons
                   ` (480 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Li Dongyang <dongyang.li@anu.edu.au>

In ldlm_handle_cp_callback(), we call lock_res_and_lock() and then
check if the ldlm lock has already been granted.
If the lock resource has changed, we release the lock, go ahead
allocating the new resource, then grab the lock again before calling
ldlm_grant_lock().
However, this gives another thread an opportunity to grab the lock
and pass the check while we change the resource. Eventually the
other thread calls ldlm_grant_lock() on the same ldlm lock and
triggers an LASSERT.

Fix the issue by performing the double grant race check after
changing the lock resource.

WC-bug-id: https://jira.whamcloud.com/browse/LU-8391
Lustre-commit: fef1020406a0 ("LU-8391 ldlm: check double grant race after resource change")
Signed-off-by: Li Dongyang <dongyang.li@anu.edu.au>
Reviewed-on: https://review.whamcloud.com/21275
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_lockd.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/fs/lustre/ldlm/ldlm_lockd.c b/fs/lustre/ldlm/ldlm_lockd.c
index 204b11b..6905ee5 100644
--- a/fs/lustre/ldlm/ldlm_lockd.c
+++ b/fs/lustre/ldlm/ldlm_lockd.c
@@ -214,6 +214,21 @@ static void ldlm_handle_cp_callback(struct ptlrpc_request *req,
 	}
 
 	lock_res_and_lock(lock);
+
+	if (!ldlm_res_eq(&dlm_req->lock_desc.l_resource.lr_name,
+			 &lock->l_resource->lr_name)) {
+		ldlm_resource_unlink_lock(lock);
+		unlock_res_and_lock(lock);
+		rc = ldlm_lock_change_resource(ns, lock,
+				&dlm_req->lock_desc.l_resource.lr_name);
+		if (rc < 0) {
+			LDLM_ERROR(lock, "Failed to allocate resource");
+			goto out;
+		}
+		LDLM_DEBUG(lock, "completion AST, new resource");
+		lock_res_and_lock(lock);
+	}
+
 	if (ldlm_is_destroyed(lock) ||
 	    lock->l_granted_mode == lock->l_req_mode) {
 		/* bug 11300: the lock has already been granted */
@@ -240,20 +255,6 @@ static void ldlm_handle_cp_callback(struct ptlrpc_request *req,
 	}
 
 	ldlm_resource_unlink_lock(lock);
-	if (memcmp(&dlm_req->lock_desc.l_resource.lr_name,
-		   &lock->l_resource->lr_name,
-		   sizeof(lock->l_resource->lr_name)) != 0) {
-		unlock_res_and_lock(lock);
-		rc = ldlm_lock_change_resource(ns, lock,
-					       &dlm_req->lock_desc.l_resource.lr_name);
-		if (rc < 0) {
-			LDLM_ERROR(lock, "Failed to allocate resource");
-			goto out;
-		}
-		LDLM_DEBUG(lock, "completion AST, new resource");
-		CERROR("change resource!\n");
-		lock_res_and_lock(lock);
-	}
 
 	if (dlm_req->lock_flags & LDLM_FL_AST_SENT) {
 		/* BL_AST locks are not needed in LRU.
-- 
1.8.3.1


* [lustre-devel] [PATCH 143/622] lustre: mdc: grow lvb buffer to hold layout
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Bobi Jam <bobijam@whamcloud.com>

A write intent RPC could generate a layout bigger than the initial
mdt_max_mdsize, so the new layout cannot be returned to the client.
This patch fixes the client side issue by:

* defining a new MAX_MD_SIZE to hold a reasonable composite layout,
  and keeping the old MAX_MD_SIZE as MAX_MD_SIZE_OLD.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11158
Lustre-commit: e5abcf83c057 ("LU-11158 mdt: grow lvb buffer to hold layout")
Signed-off-by: Bobi Jam <bobijam@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32847
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/mdc/mdc_locks.c              | 4 +++-
 include/uapi/linux/lustre/lustre_idl.h | 5 ++++-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/mdc/mdc_locks.c b/fs/lustre/mdc/mdc_locks.c
index 09f9bc5..f9d66a4 100644
--- a/fs/lustre/mdc/mdc_locks.c
+++ b/fs/lustre/mdc/mdc_locks.c
@@ -614,7 +614,7 @@ static int mdc_finish_enqueue(struct obd_export *exp,
 	    (!it_disposition(it, DISP_OPEN_OPEN) || it->it_status != 0))
 		mdc_clear_replay_flag(req, it->it_status);
 
-	DEBUG_REQ(D_RPCTRACE, req, "op: %d disposition: %x, status: %d",
+	DEBUG_REQ(D_RPCTRACE, req, "op: %x disposition: %x, status: %d",
 		  it->it_op, it->it_disposition, it->it_status);
 
 	/* We know what to expect, so we do any byte flipping required here */
@@ -680,6 +680,8 @@ static int mdc_finish_enqueue(struct obd_export *exp,
 		 * is packed into RMF_DLM_LVB of req
 		 */
 		lvb_len = req_capsule_get_size(pill, &RMF_DLM_LVB, RCL_SERVER);
+		CDEBUG(D_INFO, "%s: layout return lvb %d transno %lld\n",
+		       class_exp2obd(exp)->obd_name, lvb_len, req->rq_transno);
 		if (lvb_len > 0) {
 			lvb_data = req_capsule_server_sized_get(pill,
 								&RMF_DLM_LVB,
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 8002e046..2f15671 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -1049,8 +1049,11 @@ struct lov_mds_md_v1 {		/* LOV EA mds/wire data (little-endian) */
 	struct lov_ost_data_v1 lmm_objects[0]; /* per-stripe data */
 };
 
-#define MAX_MD_SIZE							\
+#define MAX_MD_SIZE_OLD							\
 	(sizeof(struct lov_mds_md) + 4 * sizeof(struct lov_ost_data))
+#define MAX_MD_SIZE							\
+	(sizeof(struct lov_comp_md_v1) +				\
+	 4 * (sizeof(struct lov_comp_md_entry_v1) + MAX_MD_SIZE_OLD))
 #define MIN_MD_SIZE							\
 	(sizeof(struct lov_mds_md) + 1 * sizeof(struct lov_ost_data))
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 144/622] lustre: osc: re-check target versus available grant
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Alex Zhuravlev <bzzz@whamcloud.com>

- re-check the target under the spinlock, otherwise it's possible
  that the available grant has changed since the target calculation
  and the bytes to shrink go negative.
- tgt_grant_alloc() should avoid negative grants

WC-bug-id: https://jira.whamcloud.com/browse/LU-11288
Lustre-commit: fcbd8c981239 ("LU-11288 osc: re-check target versus available grant")
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33226
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/osc_request.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index 4524a98..18b99a9 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -811,6 +811,12 @@ int osc_shrink_grant_to_target(struct client_obd *cli, u64 target_bytes)
 	osc_announce_cached(cli, &body->oa, 0);
 
 	spin_lock(&cli->cl_loi_list_lock);
+	if (target_bytes >= cli->cl_avail_grant) {
+		/* available grant has changed since target calculation */
+		spin_unlock(&cli->cl_loi_list_lock);
+		rc = 0;
+		goto out_free;
+	}
 	body->oa.o_grant = cli->cl_avail_grant - target_bytes;
 	cli->cl_avail_grant = target_bytes;
 	spin_unlock(&cli->cl_loi_list_lock);
@@ -826,6 +832,7 @@ int osc_shrink_grant_to_target(struct client_obd *cli, u64 target_bytes)
 				sizeof(*body), body, NULL);
 	if (rc != 0)
 		__osc_update_grant(cli, body->oa.o_grant);
+out_free:
 	kfree(body);
 	return rc;
 }
-- 
1.8.3.1


* [lustre-devel] [PATCH 145/622] lnet: unlink md if fail to send recovery
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

The MD for the recovery ping should be unlinked if we fail to send the GET.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11474
Lustre-commit: e0132e16df15 ("LU-11474 lnet: unlink md if fail to send recovery")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33306
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-types.h |  7 ++++--
 net/lnet/lnet/lib-move.c       | 48 +++++++++++++++++++++++++++++++++---------
 2 files changed, 43 insertions(+), 12 deletions(-)

diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index f82ebb6..b2159b0 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -317,7 +317,8 @@ struct lnet_tx_queue {
 #define LNET_NI_STATE_ACTIVE		(1 << 1)
 #define LNET_NI_STATE_FAILED		(1 << 2)
 #define LNET_NI_STATE_RECOVERY_PENDING	(1 << 3)
-#define LNET_NI_STATE_DELETING		(1 << 4)
+#define LNET_NI_STATE_RECOVERY_FAILED	BIT(4)
+#define LNET_NI_STATE_DELETING		BIT(5)
 
 enum lnet_stats_type {
 	LNET_STATS_TYPE_SEND	= 0,
@@ -606,8 +607,10 @@ struct lnet_peer_ni {
 #define LNET_PEER_NI_NON_MR_PREF	BIT(0)
 /* peer is being recovered. */
 #define LNET_PEER_NI_RECOVERY_PENDING	BIT(1)
+/* recovery ping failed */
+#define LNET_PEER_NI_RECOVERY_FAILED	BIT(2)
 /* peer is being deleted */
-#define LNET_PEER_NI_DELETING		BIT(2)
+#define LNET_PEER_NI_DELETING		BIT(3)
 
 struct lnet_peer {
 	/* chain on pt_peer_list */
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 38ee970..b54fbab 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -2615,13 +2615,13 @@ struct lnet_mt_event_info {
 
 /* called with cpt and ni_lock held */
 static void
-lnet_unlink_ni_recovery_mdh_locked(struct lnet_ni *ni, int cpt)
+lnet_unlink_ni_recovery_mdh_locked(struct lnet_ni *ni, int cpt, bool force)
 {
 	struct lnet_handle_md recovery_mdh;
 
 	LNetInvalidateMDHandle(&recovery_mdh);
 
-	if (ni->ni_state & LNET_NI_STATE_RECOVERY_PENDING) {
+	if (ni->ni_state & LNET_NI_STATE_RECOVERY_PENDING || force) {
 		recovery_mdh = ni->ni_ping_mdh;
 		LNetInvalidateMDHandle(&ni->ni_ping_mdh);
 	}
@@ -2675,12 +2675,22 @@ struct lnet_mt_event_info {
 		if (!(ni->ni_state & LNET_NI_STATE_ACTIVE) ||
 		    healthv == LNET_MAX_HEALTH_VALUE) {
 			list_del_init(&ni->ni_recovery);
-			lnet_unlink_ni_recovery_mdh_locked(ni, 0);
+			lnet_unlink_ni_recovery_mdh_locked(ni, 0, false);
 			lnet_ni_unlock(ni);
 			lnet_ni_decref_locked(ni, 0);
 			lnet_net_unlock(0);
 			continue;
 		}
+
+		/* if the local NI failed recovery we must unlink the md.
+		 * But we want to keep the local_ni on the recovery queue
+		 * so we can continue the attempts to recover it.
+		 */
+		if (ni->ni_state & LNET_NI_STATE_RECOVERY_FAILED) {
+			lnet_unlink_ni_recovery_mdh_locked(ni, 0, true);
+			ni->ni_state &= ~LNET_NI_STATE_RECOVERY_FAILED;
+		}
+
 		lnet_ni_unlock(ni);
 		lnet_net_unlock(0);
 
@@ -2829,7 +2839,7 @@ struct lnet_mt_event_info {
 				struct lnet_ni, ni_recovery);
 		list_del_init(&ni->ni_recovery);
 		lnet_ni_lock(ni);
-		lnet_unlink_ni_recovery_mdh_locked(ni, 0);
+		lnet_unlink_ni_recovery_mdh_locked(ni, 0, true);
 		lnet_ni_unlock(ni);
 		lnet_ni_decref_locked(ni, 0);
 	}
@@ -2838,13 +2848,14 @@ struct lnet_mt_event_info {
 }
 
 static void
-lnet_unlink_lpni_recovery_mdh_locked(struct lnet_peer_ni *lpni, int cpt)
+lnet_unlink_lpni_recovery_mdh_locked(struct lnet_peer_ni *lpni, int cpt,
+				     bool force)
 {
 	struct lnet_handle_md recovery_mdh;
 
 	LNetInvalidateMDHandle(&recovery_mdh);
 
-	if (lpni->lpni_state & LNET_PEER_NI_RECOVERY_PENDING) {
+	if (lpni->lpni_state & LNET_PEER_NI_RECOVERY_PENDING || force) {
 		recovery_mdh = lpni->lpni_recovery_ping_mdh;
 		LNetInvalidateMDHandle(&lpni->lpni_recovery_ping_mdh);
 	}
@@ -2867,7 +2878,7 @@ struct lnet_mt_event_info {
 				 lpni_recovery) {
 		list_del_init(&lpni->lpni_recovery);
 		spin_lock(&lpni->lpni_lock);
-		lnet_unlink_lpni_recovery_mdh_locked(lpni, LNET_LOCK_EX);
+		lnet_unlink_lpni_recovery_mdh_locked(lpni, LNET_LOCK_EX, true);
 		spin_unlock(&lpni->lpni_lock);
 		lnet_peer_ni_decref_locked(lpni);
 	}
@@ -2933,12 +2944,22 @@ struct lnet_mt_event_info {
 		if (lpni->lpni_state & LNET_PEER_NI_DELETING ||
 		    healthv == LNET_MAX_HEALTH_VALUE) {
 			list_del_init(&lpni->lpni_recovery);
-			lnet_unlink_lpni_recovery_mdh_locked(lpni, 0);
+			lnet_unlink_lpni_recovery_mdh_locked(lpni, 0, false);
 			spin_unlock(&lpni->lpni_lock);
 			lnet_peer_ni_decref_locked(lpni);
 			lnet_net_unlock(0);
 			continue;
 		}
+
+		/* If the peer NI has failed recovery we must unlink the
+		 * md. But we want to keep the peer ni on the recovery
+		 * queue so we can try to continue recovering it
+		 */
+		if (lpni->lpni_state & LNET_PEER_NI_RECOVERY_FAILED) {
+			lnet_unlink_lpni_recovery_mdh_locked(lpni, 0, true);
+			lpni->lpni_state &= ~LNET_PEER_NI_RECOVERY_FAILED;
+		}
+
 		spin_unlock(&lpni->lpni_lock);
 		lnet_net_unlock(0);
 
@@ -3152,11 +3173,14 @@ struct lnet_mt_event_info {
 		}
 		lnet_ni_lock(ni);
 		ni->ni_state &= ~LNET_NI_STATE_RECOVERY_PENDING;
+		if (status)
+			ni->ni_state |= LNET_NI_STATE_RECOVERY_FAILED;
 		lnet_ni_unlock(ni);
 		lnet_net_unlock(0);
 
 		if (status != 0) {
-			CERROR("local NI recovery failed with %d\n", status);
+			CERROR("local NI (%s) recovery failed with %d\n",
+			       libcfs_nid2str(nid), status);
 			return;
 		}
 		/* need to increment healthv for the ni here, because in
@@ -3178,12 +3202,15 @@ struct lnet_mt_event_info {
 		}
 		spin_lock(&lpni->lpni_lock);
 		lpni->lpni_state &= ~LNET_PEER_NI_RECOVERY_PENDING;
+		if (status)
+			lpni->lpni_state |= LNET_PEER_NI_RECOVERY_FAILED;
 		spin_unlock(&lpni->lpni_lock);
 		lnet_peer_ni_decref_locked(lpni);
 		lnet_net_unlock(cpt);
 
 		if (status != 0)
-			CERROR("peer NI recovery failed with %d\n", status);
+			CERROR("peer NI (%s) recovery failed with %d\n",
+			       libcfs_nid2str(nid), status);
 	}
 }
 
@@ -3214,6 +3241,7 @@ struct lnet_mt_event_info {
 		       libcfs_nid2str(ev_info->mt_nid),
 		       (event->status) ? "unsuccessfully" :
 		       "successfully", event->status);
+		lnet_handle_recovery_reply(ev_info, event->status);
 		break;
 	default:
 		CERROR("Unexpected event: %d\n", event->type);
-- 
1.8.3.1


* [lustre-devel] [PATCH 146/622] lustre: obd: use correct names for conn_uuid
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

The LUSTRE_R[OW]_ATTR() macros assume that the name of the sysfs
file to create matches the beginning of the function name. In
the case of LUSTRE_RO_ATTR(conn_uuid) this maps to the function
conn_uuid_show() and generates the sysfs file "conn_uuid". While it
makes sense to standardize this interface, we need to keep the
old xxx_conn_uuid files. We can create these xxx_conn_uuid sysfs
files by using the base sysfs attr macro LUSTRE_ATTR().

WC-bug-id: https://jira.whamcloud.com/browse/LU-8066
Lustre-commit: f2bf876ef77e ("LU-8066 obd: use correct names for conn_uuid")
Signed-off-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-on: https://review.whamcloud.com/33213
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/mdc/lproc_mdc.c           |  7 ++++---
 fs/lustre/mgc/lproc_mgc.c           |  5 +++--
 fs/lustre/obdclass/lprocfs_status.c | 24 ------------------------
 fs/lustre/osc/lproc_osc.c           |  5 +++--
 4 files changed, 10 insertions(+), 31 deletions(-)

diff --git a/fs/lustre/mdc/lproc_mdc.c b/fs/lustre/mdc/lproc_mdc.c
index 0c52bcf..746dd21 100644
--- a/fs/lustre/mdc/lproc_mdc.c
+++ b/fs/lustre/mdc/lproc_mdc.c
@@ -303,8 +303,8 @@ static ssize_t max_mod_rpcs_in_flight_store(struct kobject *kobj,
 
 LUSTRE_RW_ATTR(max_pages_per_rpc);
 
-#define mdc_conn_uuid_show conn_uuid_show
-LUSTRE_RO_ATTR(mdc_conn_uuid);
+LUSTRE_ATTR(mds_conn_uuid, 0444, conn_uuid_show, NULL);
+LUSTRE_RO_ATTR(conn_uuid);
 
 LUSTRE_RO_ATTR(ping);
 
@@ -529,7 +529,8 @@ static ssize_t mdc_dom_min_repsize_seq_write(struct file *file,
 	&lustre_attr_max_rpcs_in_flight.attr,
 	&lustre_attr_max_mod_rpcs_in_flight.attr,
 	&lustre_attr_max_pages_per_rpc.attr,
-	&lustre_attr_mdc_conn_uuid.attr,
+	&lustre_attr_mds_conn_uuid.attr,
+	&lustre_attr_conn_uuid.attr,
 	&lustre_attr_ping.attr,
 	NULL,
 };
diff --git a/fs/lustre/mgc/lproc_mgc.c b/fs/lustre/mgc/lproc_mgc.c
index 4c276f9..676d479 100644
--- a/fs/lustre/mgc/lproc_mgc.c
+++ b/fs/lustre/mgc/lproc_mgc.c
@@ -66,13 +66,14 @@ struct lprocfs_vars lprocfs_mgc_obd_vars[] = {
 	{ NULL }
 };
 
-#define mgs_conn_uuid_show conn_uuid_show
-LUSTRE_RO_ATTR(mgs_conn_uuid);
+LUSTRE_ATTR(mgs_conn_uuid, 0444, conn_uuid_show, NULL);
+LUSTRE_RO_ATTR(conn_uuid);
 
 LUSTRE_RO_ATTR(ping);
 
 static struct attribute *mgc_attrs[] = {
 	&lustre_attr_mgs_conn_uuid.attr,
+	&lustre_attr_conn_uuid.attr,
 	&lustre_attr_ping.attr,
 	NULL,
 };
diff --git a/fs/lustre/obdclass/lprocfs_status.c b/fs/lustre/obdclass/lprocfs_status.c
index b3dbe85..cce9bec 100644
--- a/fs/lustre/obdclass/lprocfs_status.c
+++ b/fs/lustre/obdclass/lprocfs_status.c
@@ -524,30 +524,6 @@ int lprocfs_rd_server_uuid(struct seq_file *m, void *data)
 }
 EXPORT_SYMBOL(lprocfs_rd_server_uuid);
 
-int lprocfs_rd_conn_uuid(struct seq_file *m, void *data)
-{
-	struct obd_device *obd = data;
-	struct ptlrpc_connection *conn;
-	int rc;
-
-	LASSERT(obd);
-
-	rc = lprocfs_climp_check(obd);
-	if (rc)
-		return rc;
-
-	conn = obd->u.cli.cl_import->imp_connection;
-	if (conn && obd->u.cli.cl_import)
-		seq_printf(m, "%s\n", conn->c_remote_uuid.uuid);
-	else
-		seq_puts(m, "<none>\n");
-
-	up_read(&obd->u.cli.cl_sem);
-
-	return 0;
-}
-EXPORT_SYMBOL(lprocfs_rd_conn_uuid);
-
 /**
  * Lock statistics structure for access, possibly only on this CPU.
  *
diff --git a/fs/lustre/osc/lproc_osc.c b/fs/lustre/osc/lproc_osc.c
index f025275..d9030b7 100644
--- a/fs/lustre/osc/lproc_osc.c
+++ b/fs/lustre/osc/lproc_osc.c
@@ -173,8 +173,8 @@ static ssize_t max_dirty_mb_store(struct kobject *kobj,
 }
 LUSTRE_RW_ATTR(max_dirty_mb);
 
-#define ost_conn_uuid_show conn_uuid_show
-LUSTRE_RO_ATTR(ost_conn_uuid);
+LUSTRE_ATTR(ost_conn_uuid, 0444, conn_uuid_show, NULL);
+LUSTRE_RO_ATTR(conn_uuid);
 
 LUSTRE_RO_ATTR(ping);
 
@@ -962,6 +962,7 @@ void lproc_osc_attach_seqstat(struct obd_device *dev)
 	&lustre_attr_short_io_bytes.attr,
 	&lustre_attr_resend_count.attr,
 	&lustre_attr_ost_conn_uuid.attr,
+	&lustre_attr_conn_uuid.attr,
 	&lustre_attr_ping.attr,
 	&lustre_attr_idle_timeout.attr,
 	&lustre_attr_idle_connect.attr,
-- 
1.8.3.1


* [lustre-devel] [PATCH 147/622] lustre: idl: use proper ATTR/MDS_ATTR/MDS_OPEN flags
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Add proper MDS_ATTR_* and MDS_OPEN_* flags for the different flag
namespaces.  MDS_OPEN_OWNEROVERRIDE was being mapped into
the MDS_ATTR_* flags in some cases.  This did not conflict yet, but
add separate ATTR_OVERRIDE and MDS_ATTR_OVERRIDE flags for this use
so they don't conflict in the future.

Remove the MDS_OPEN_CROSS flag, since this was only used internally
as a hack to pass open flags to mdd_permission(), but was truncating
the u64 open flags to a 32-bit int in the process.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10030
Lustre-commit: 9c2ffe39bd32 ("LU-10030 idl: use proper ATTR/MDS_ATTR/MDS_OPEN flags")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32107
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/wiretest.c             | 6 ++----
 include/uapi/linux/lustre/lustre_idl.h  | 1 +
 include/uapi/linux/lustre/lustre_user.h | 2 +-
 3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index c6dd256..f72e5fc 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -251,8 +251,6 @@ void lustre_assert_wire_constants(void)
 		 (long long)MDS_ATTR_KILL_SGID);
 	LASSERTF(MDS_ATTR_CTIME_SET == 0x0000000000002000ULL, "found 0x%.16llxULL\n",
 		 (long long)MDS_ATTR_CTIME_SET);
-	LASSERTF(MDS_ATTR_FROM_OPEN == 0x0000000000004000ULL, "found 0x%.16llxULL\n",
-		 (long long)MDS_ATTR_FROM_OPEN);
 	LASSERTF(MDS_ATTR_BLOCKS == 0x0000000000008000ULL, "found 0x%.16llxULL\n",
 		 (long long)MDS_ATTR_BLOCKS);
 	LASSERTF(MDS_ATTR_PROJID == 0x0000000000010000ULL, "found 0x%.16llxULL\n",
@@ -262,6 +260,8 @@ void lustre_assert_wire_constants(void)
 		 (long long)MDS_ATTR_LSIZE);
 	LASSERTF(MDS_ATTR_LBLOCKS == 0x0000000000040000ULL, "found 0x%.16llxULL\n",
 		 (long long)MDS_ATTR_LBLOCKS);
+	LASSERTF(MDS_ATTR_OVERRIDE == 0x0000000002000000ULL, "found 0x%.16llxULL\n",
+		 (long long)MDS_ATTR_OVERRIDE);
 	LASSERTF(FLD_QUERY == 900, "found %lld\n",
 		 (long long)FLD_QUERY);
 	LASSERTF(FLD_FIRST_OPC == 900, "found %lld\n",
@@ -2094,8 +2094,6 @@ void lustre_assert_wire_constants(void)
 		 MDS_FMODE_EXEC);
 	LASSERTF(MDS_OPEN_CREATED == 000000000010UL, "found 0%.11oUL\n",
 		 MDS_OPEN_CREATED);
-	LASSERTF(MDS_OPEN_CROSS == 000000000020UL, "found 0%.11oUL\n",
-		 MDS_OPEN_CROSS);
 	LASSERTF(MDS_OPEN_CREAT == 000000000100UL, "found 0%.11oUL\n",
 		 MDS_OPEN_CREAT);
 	LASSERTF(MDS_OPEN_EXCL == 000000000200UL, "found 0%.11oUL\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 2f15671..d46a921 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -1681,6 +1681,7 @@ struct mdt_rec_setattr {
 #define MDS_ATTR_PROJID		0x10000ULL /* = 65536 */
 #define MDS_ATTR_LSIZE		0x20000ULL /* = 131072 */
 #define MDS_ATTR_LBLOCKS	0x40000ULL /* = 262144 */
+#define MDS_ATTR_OVERRIDE	0x2000000ULL /* = 33554432 */
 
 enum mds_op_bias {
 /*	MDS_CHECK_SPLIT		= 1 << 0, obsolete before 2.3.58 */
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 844e50e..db751d8 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -922,7 +922,7 @@ enum la_valid {
 /*	MDS_FMODE_SOM		04000000 obsolete since 2.8.0 */
 
 #define MDS_OPEN_CREATED	00000010
-#define MDS_OPEN_CROSS		00000020
+/*	MDS_OPEN_CROSS		00000020 obsolete in 2.12, internal use only */
 
 #define MDS_OPEN_CREAT		00000100
 #define MDS_OPEN_EXCL		00000200
-- 
1.8.3.1


* [lustre-devel] [PATCH 148/622] lustre: llite: optimize read on open pages
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Jinshan Xiong <jinshan.xiong@uber.com>

The current read-on-open implementation allocates a cl_page after data
is piggybacked on the open request, which is expensive and not
necessary.

This patch improves the case by just adding the pages into the page
cache. Since those pages will be discarded at lock revocation, there
should be no concerns.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11427
Lustre-commit: 02e766f5ed95 ("LU-11427 llite: optimize read on open pages")
Signed-off-by: Jinshan Xiong <jinshan.xiong@uber.com>
Reviewed-on: https://review.whamcloud.com/33234
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c  | 58 +++++--------------------------------------------
 fs/lustre/llite/namei.c |  7 +++++-
 2 files changed, 11 insertions(+), 54 deletions(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index a46f5d3..2fd906f 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -420,14 +420,10 @@ void ll_dom_finish_open(struct inode *inode, struct ptlrpc_request *req,
 	struct page *vmpage;
 	struct niobuf_remote *rnb;
 	char *data;
-	struct lu_env *env;
-	struct cl_io *io;
-	u16 refcheck;
 	struct lustre_handle lockh;
 	struct ldlm_lock *lock;
 	unsigned long index, start;
 	struct niobuf_local lnb;
-	int rc;
 	bool dom_lock = false;
 
 	if (!obj)
@@ -440,37 +436,21 @@ void ll_dom_finish_open(struct inode *inode, struct ptlrpc_request *req,
 			dom_lock = ldlm_has_dom(lock);
 		LDLM_LOCK_PUT(lock);
 	}
-
 	if (!dom_lock)
 		return;
 
-	env = cl_env_get(&refcheck);
-	if (IS_ERR(env))
-		return;
-
 	if (!req_capsule_has_field(&req->rq_pill, &RMF_NIOBUF_INLINE,
-				   RCL_SERVER)) {
-		rc = -ENODATA;
-		goto out_env;
-	}
+				   RCL_SERVER))
+		return;
 
 	rnb = req_capsule_server_get(&req->rq_pill, &RMF_NIOBUF_INLINE);
-	data = (char *)rnb + sizeof(*rnb);
-
-	if (!rnb || rnb->rnb_len == 0) {
-		rc = 0;
-		goto out_env;
-	}
+	if (!rnb || rnb->rnb_len == 0)
+		return;
 
 	CDEBUG(D_INFO, "Get data buffer along with open, len %i, i_size %llu\n",
 	       rnb->rnb_len, i_size_read(inode));
 
-	io = vvp_env_thread_io(env);
-	io->ci_obj = obj;
-	io->ci_ignore_layout = 1;
-	rc = cl_io_init(env, io, CIT_MISC, obj);
-	if (rc)
-		goto out_io;
+	data = (char *)rnb + sizeof(*rnb);
 
 	lnb.lnb_file_offset = rnb->rnb_offset;
 	start = lnb.lnb_file_offset / PAGE_SIZE;
@@ -478,8 +458,6 @@ void ll_dom_finish_open(struct inode *inode, struct ptlrpc_request *req,
 	LASSERT(lnb.lnb_file_offset % PAGE_SIZE == 0);
 	lnb.lnb_page_offset = 0;
 	do {
-		struct cl_page *clp;
-
 		lnb.lnb_data = data + (index << PAGE_SHIFT);
 		lnb.lnb_len = rnb->rnb_len - (index << PAGE_SHIFT);
 		if (lnb.lnb_len > PAGE_SIZE)
@@ -495,35 +473,9 @@ void ll_dom_finish_open(struct inode *inode, struct ptlrpc_request *req,
 			      PTR_ERR(vmpage));
 			break;
 		}
-		lock_page(vmpage);
-		if (!vmpage->mapping) {
-			unlock_page(vmpage);
-			put_page(vmpage);
-			/* page was truncated */
-			rc = -ENODATA;
-			goto out_io;
-		}
-		clp = cl_page_find(env, obj, vmpage->index, vmpage,
-				   CPT_CACHEABLE);
-		if (IS_ERR(clp)) {
-			unlock_page(vmpage);
-			put_page(vmpage);
-			rc = PTR_ERR(clp);
-			goto out_io;
-		}
-
-		/* export page */
-		cl_page_export(env, clp, 1);
-		cl_page_put(env, clp);
-		unlock_page(vmpage);
 		put_page(vmpage);
 		index++;
 	} while (rnb->rnb_len > (index << PAGE_SHIFT));
-	rc = 0;
-out_io:
-	cl_io_fini(env, io);
-out_env:
-	cl_env_put(env, &refcheck);
 }
 
 static int ll_intent_file_open(struct dentry *de, void *lmm, int lmmsize,
diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index 4ac62b2..530c2df 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -185,8 +185,13 @@ int ll_dom_lock_cancel(struct inode *inode, struct ldlm_lock *lock)
 	int rc;
 	u16 refcheck;
 
-	if (!lli->lli_clob)
+	if (!lli->lli_clob) {
+		/* Due to DoM read on open, there may exist pages for Lustre
+		 * regular file even though cl_object is not set up yet.
+		 */
+		truncate_inode_pages(inode->i_mapping, 0);
 		return 0;
+	}
 
 	env = cl_env_get(&refcheck);
 	if (IS_ERR(env))
-- 
1.8.3.1


* [lustre-devel] [PATCH 149/622] lnet: set the health status correctly
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

There are cases where the health status wasn't set properly.
Most notably, in tx_done we need to deal with a specific
set of errnos: ENETDOWN, EHOSTUNREACH, ENETUNREACH, ECONNREFUSED,
and ECONNRESET. In all those cases we can try to resend to other
available peer NIs.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11476
Lustre-commit: 5d77f0d8dc74 ("LU-11476 lnet: set the health status correctly")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33307
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/socklnd/socklnd_cb.c | 8 ++++++--
 net/lnet/lnet/lib-move.c            | 5 ++---
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/lnet/klnds/socklnd/socklnd_cb.c b/net/lnet/klnds/socklnd/socklnd_cb.c
index 10a1934..abb3529 100644
--- a/net/lnet/klnds/socklnd/socklnd_cb.c
+++ b/net/lnet/klnds/socklnd/socklnd_cb.c
@@ -374,8 +374,10 @@ struct ksock_tx *
 				tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_TIMEOUT;
 			else if (error == -ENETDOWN ||
 				 error == -EHOSTUNREACH ||
-				 error == -ENETUNREACH)
-				tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_DROPPED;
+				 error == -ENETUNREACH ||
+				 error == -ECONNREFUSED ||
+				 error == -ECONNRESET)
+				tx->tx_hstatus = LNET_MSG_STATUS_REMOTE_DROPPED;
 			/* for all other errors we don't want to
 			 * retransmit
 			 */
@@ -901,6 +903,7 @@ struct ksock_route *
 
 	/* NB Routes may be ignored if connections to them failed recently */
 	CNETERR("No usable routes to %s\n", libcfs_id2str(id));
+	tx->tx_hstatus = LNET_MSG_STATUS_REMOTE_ERROR;
 	return -EHOSTUNREACH;
 }
 
@@ -986,6 +989,7 @@ struct ksock_route *
 	if (!rc)
 		return 0;
 
+	lntmsg->msg_health_status = tx->tx_hstatus;
 	ksocknal_free_tx(tx);
 	return -EIO;
 }
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index b54fbab..bbbcd8d 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -770,10 +770,9 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 
 		CNETERR("Dropping message for %s: peer not alive\n",
 			libcfs_id2str(msg->msg_target));
-		if (do_send) {
-			msg->msg_health_status = LNET_MSG_STATUS_LOCAL_DROPPED;
+		msg->msg_health_status = LNET_MSG_STATUS_LOCAL_DROPPED;
+		if (do_send)
 			lnet_finalize(msg, -EHOSTUNREACH);
-		}
 
 		lnet_net_lock(cpt);
 		return -EHOSTUNREACH;
-- 
1.8.3.1


* [lustre-devel] [PATCH 150/622] lustre: lov: add debugging info for statfs
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (148 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 149/622] lnet: set the health status correctly James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 151/622] lnet: Decrement health on timeout James Simmons
                   ` (472 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

In obd_statfs() print the device name in the debug logs for clarity.
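
The debug line sits next to the age check that decides whether the cached statfs result can be reused. A minimal sketch of that decision, with an illustrative helper name (obd_osfs_age holds the timestamp of the last refresh):

```c
#include <stdbool.h>

typedef long long time64_t;

/* A cached statfs taken before the caller's max_age cutoff is stale
 * and must be refreshed from the server. */
static bool statfs_cache_is_stale(time64_t osfs_age, time64_t max_age)
{
	return osfs_age < max_age;
}
```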

WC-bug-id: https://jira.whamcloud.com/browse/LU-7770
Lustre-commit: b917406a7f0a ("LU-7770 lov: fix statfs for conf-sanity test_50b")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33369
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_class.h | 14 +++++++-------
 fs/lustre/lov/lov_obd.c       |  4 +---
 2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/fs/lustre/include/obd_class.h b/fs/lustre/include/obd_class.h
index 01eb385..742e92a 100644
--- a/fs/lustre/include/obd_class.h
+++ b/fs/lustre/include/obd_class.h
@@ -891,8 +891,8 @@ static inline int obd_statfs_async(struct obd_export *exp,
 				   time64_t max_age,
 				   struct ptlrpc_request_set *rqset)
 {
-	int rc = 0;
 	struct obd_device *obd;
+	int rc = 0;
 
 	if (!exp || !exp->exp_obd)
 		return -EINVAL;
@@ -903,8 +903,8 @@ static inline int obd_statfs_async(struct obd_export *exp,
 		return -EOPNOTSUPP;
 	}
 
-	CDEBUG(D_SUPER, "%s: osfs %p age %lld, max_age %lld\n",
-	       obd->obd_name, &obd->obd_osfs, obd->obd_osfs_age, max_age);
+	CDEBUG(D_SUPER, "%s: age %lld, max_age %lld\n",
+	       obd->obd_name, obd->obd_osfs_age, max_age);
 	if (obd->obd_osfs_age < max_age) {
 		rc = OBP(obd, statfs_async)(exp, oinfo, max_age, rqset);
 	} else {
@@ -935,20 +935,20 @@ static inline int obd_statfs(const struct lu_env *env, struct obd_export *exp,
 	struct obd_device *obd = exp->exp_obd;
 	int rc = 0;
 
-	if (!obd)
+	if (unlikely(!obd))
 		return -EINVAL;
 
 	rc = obd_check_dev_active(obd);
 	if (rc)
 		return rc;
 
-	if (!obd->obd_type || !obd->obd_type->typ_dt_ops->statfs) {
+	if (unlikely(!obd->obd_type || !obd->obd_type->typ_dt_ops->statfs)) {
 		CERROR("%s: no %s operation\n", obd->obd_name, __func__);
 		return -EOPNOTSUPP;
 	}
 
-	CDEBUG(D_SUPER, "osfs %lld, max_age %lld\n",
-	       obd->obd_osfs_age, max_age);
+	CDEBUG(D_SUPER, "%s: age %lld, max_age %lld\n",
+	       obd->obd_name, obd->obd_osfs_age, max_age);
 	/* ignore cache if aggregated isn't expected */
 	if (obd->obd_osfs_age < max_age ||
 	    ((obd->obd_osfs.os_state & OS_STATE_SUM) &&
diff --git a/fs/lustre/lov/lov_obd.c b/fs/lustre/lov/lov_obd.c
index 9a6ffe8..a16c663 100644
--- a/fs/lustre/lov/lov_obd.c
+++ b/fs/lustre/lov/lov_obd.c
@@ -1122,9 +1122,7 @@ static int lov_iocontrol(unsigned int cmd, struct obd_export *exp, int len,
 			if (!lov->lov_tgts[i] || !lov->lov_tgts[i]->ltd_exp)
 				continue;
 
-			/* ll_umount_begin() sets force flag but for lov, not
-			 * osc. Let's pass it through
-			 */
+			/* ll_umount_begin() sets force on lov, pass to osc */
 			osc_obd = class_exp2obd(lov->lov_tgts[i]->ltd_exp);
 			osc_obd->obd_force = obddev->obd_force;
 			err = obd_iocontrol(cmd, lov->lov_tgts[i]->ltd_exp,
-- 
1.8.3.1


* [lustre-devel] [PATCH 151/622] lnet: Decrement health on timeout
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (149 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 150/622] lustre: lov: add debugging info for statfs James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 152/622] lustre: quota: fix setattr project check James Simmons
                   ` (471 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

When a response times out we want to decrement the health of the
immediate next hop peer ni, so we don't use that interface if there
are others available.

When sending a message if there is a response tracker associated
with the MD, store the next-hop-nid there. If the response times
out then we can look up the peer_ni using the cached NID, and
decrement its health value.
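
A minimal user-space model of that mechanism, with illustrative names rather than the real LNet structures: record the next hop at send time, then penalize exactly that peer NI on timeout.

```c
#include <stddef.h>

typedef unsigned long long lnet_nid_t;

/* Sketch of the response tracker: it remembers the next hop chosen
 * at send time so the timeout path can penalize that interface. */
struct rsp_tracker {
	lnet_nid_t rspt_next_hop_nid;
};

struct peer_ni {
	lnet_nid_t nid;
	int healthv;
};

/* Record the next hop when the message is posted. */
static void rspt_set_next_hop(struct rsp_tracker *rspt, lnet_nid_t nid)
{
	rspt->rspt_next_hop_nid = nid;
}

/* On response timeout, look up the peer by the cached NID and
 * decrement its health, clamping at zero the way
 * lnet_dec_healthv_locked() does. */
static void rspt_timeout(const struct rsp_tracker *rspt,
			 struct peer_ni *peers, size_t npeers)
{
	size_t i;

	for (i = 0; i < npeers; i++) {
		if (peers[i].nid == rspt->rspt_next_hop_nid) {
			if (peers[i].healthv > 0)
				peers[i].healthv--;
			return;
		}
	}
}
```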

WC-bug-id: https://jira.whamcloud.com/browse/LU-11472
Lustre-commit: 139d69141b73 ("LU-11472 lnet: Decrement health on timeout")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33308
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h  |  1 +
 include/linux/lnet/lib-types.h |  2 ++
 net/lnet/lnet/lib-move.c       | 33 ++++++++++++++++++++++++++++++++-
 net/lnet/lnet/lib-msg.c        | 24 +++++++++++++++---------
 4 files changed, 50 insertions(+), 10 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index a1dad9f..ecacd65 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -641,6 +641,7 @@ void lnet_set_reply_msg_len(struct lnet_ni *ni, struct lnet_msg *msg,
 void lnet_finalize(struct lnet_msg *msg, int rc);
 bool lnet_send_error_simulation(struct lnet_msg *msg,
 				enum lnet_msg_hstatus *hstatus);
+void lnet_handle_remote_failure_locked(struct lnet_peer_ni *lpni);
 
 void lnet_drop_message(struct lnet_ni *ni, int cpt, void *private,
 		       unsigned int nob, u32 msg_type);
diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index b2159b0..ce0caa9 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -81,6 +81,8 @@ struct lnet_rsp_tracker {
 	struct list_head rspt_on_list;
 	/* cpt to lock */
 	int rspt_cpt;
+	/* nid of next hop */
+	lnet_nid_t rspt_next_hop_nid;
 	/* deadline of the REPLY/ACK */
 	ktime_t rspt_deadline;
 	/* parent MD */
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index bbbcd8d..548ea88 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -1432,6 +1432,7 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	u32 send_case = sd->sd_send_case;
 	int rc;
 	u32 routing = send_case & REMOTE_DST;
+	struct lnet_rsp_tracker *rspt;
 
 	/* Increment sequence number of the selected peer so that we
 	 * pick the next one in Round Robin.
@@ -1515,6 +1516,18 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 		msg->msg_hdr.dest_nid = cpu_to_le64(msg->msg_txpeer->lpni_nid);
 	}
 
+	/* if we have response tracker block update it with the next hop
+	 * nid
+	 */
+	if (msg->msg_md) {
+		rspt = msg->msg_md->md_rspt_ptr;
+		if (rspt) {
+			rspt->rspt_next_hop_nid = msg->msg_txpeer->lpni_nid;
+			CDEBUG(D_NET, "rspt_next_hop_nid = %s\n",
+			       libcfs_nid2str(rspt->rspt_next_hop_nid));
+		}
+	}
+
 	rc = lnet_post_send_locked(msg, 0);
 	if (!rc)
 		CDEBUG(D_NET, "TRACE: %s(%s:%s) -> %s(%s:%s) : %s try# %d\n",
@@ -2497,6 +2510,9 @@ struct lnet_mt_event_info {
 			if (ktime_compare(ktime_get(),
 					  rspt->rspt_deadline) >= 0 ||
 			    force) {
+				struct lnet_peer_ni *lpni;
+				lnet_nid_t nid;
+
 				md = lnet_handle2md(&rspt->rspt_mdh);
 				if (!md) {
 					LNetInvalidateMDHandle(&rspt->rspt_mdh);
@@ -2515,9 +2531,24 @@ struct lnet_mt_event_info {
 
 				list_del_init(&rspt->rspt_on_list);
 
-				CNETERR("Response timed out: md = %p\n", md);
+				nid = rspt->rspt_next_hop_nid;
+
+				CNETERR("Response timed out: md = %p: nid = %s\n",
+					md, libcfs_nid2str(nid));
 				LNetMDUnlink(rspt->rspt_mdh);
 				lnet_rspt_free(rspt, i);
+
+				/* If there is a timeout on the response
+				 * from the next hop decrement its health
+				 * value so that we don't use it
+				 */
+				lnet_net_lock(0);
+				lpni = lnet_find_peer_ni_locked(nid);
+				if (lpni) {
+					lnet_handle_remote_failure_locked(lpni);
+					lnet_peer_ni_decref_locked(lpni);
+				}
+				lnet_net_unlock(0);
 			} else {
 				lnet_res_unlock(i);
 				break;
diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index 433401f..f626ca3 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -519,18 +519,13 @@
 	lnet_net_unlock(0);
 }
 
-static void
-lnet_handle_remote_failure(struct lnet_msg *msg)
+void
+lnet_handle_remote_failure_locked(struct lnet_peer_ni *lpni)
 {
-	struct lnet_peer_ni *lpni;
-
-	lpni = msg->msg_txpeer;
-
 	/* lpni could be NULL if we're in the LOLND case */
 	if (!lpni)
 		return;
 
-	lnet_net_lock(0);
 	lnet_dec_healthv_locked(&lpni->lpni_healthv);
 	/* add the peer NI to the recovery queue if it's not already there
 	 * and it's health value is actually below the maximum. It's
@@ -539,6 +534,17 @@
 	 * invoke recovery
 	 */
 	lnet_peer_ni_add_to_recoveryq_locked(lpni);
+}
+
+static void
+lnet_handle_remote_failure(struct lnet_peer_ni *lpni)
+{
+	/* lpni could be NULL if we're in the LOLND case */
+	if (!lpni)
+		return;
+
+	lnet_net_lock(0);
+	lnet_handle_remote_failure_locked(lpni);
 	lnet_net_unlock(0);
 }
 
@@ -679,13 +685,13 @@
 	 * attempt a resend safely.
 	 */
 	case LNET_MSG_STATUS_REMOTE_DROPPED:
-		lnet_handle_remote_failure(msg);
+		lnet_handle_remote_failure(msg->msg_txpeer);
 		goto resend;
 
 	case LNET_MSG_STATUS_REMOTE_ERROR:
 	case LNET_MSG_STATUS_REMOTE_TIMEOUT:
 	case LNET_MSG_STATUS_NETWORK_TIMEOUT:
-		lnet_handle_remote_failure(msg);
+		lnet_handle_remote_failure(msg->msg_txpeer);
 		return -1;
 	default:
 		LBUG();
-- 
1.8.3.1


* [lustre-devel] [PATCH 152/622] lustre: quota: fix setattr project check
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (150 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 151/622] lnet: Decrement health on timeout James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 153/622] lnet: socklnd: dynamically set LND parameters James Simmons
                   ` (470 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

This patch is motivated by the upstream patch:
ext4: fix setattr project check in fssetxattr ioctl

Currently, the project quota can be changed by the fssetxattr
ioctl, and the existing permission check inode_owner_or_capable()
is not sufficient: ordinary users could change the project ID of a
file, which would let them break project quota accounting easily.

This patch follows the same rule as xfs project quota:

"Project Quota ID state is only allowed to change from
within the init namespace. Enforce that restriction only
if we are trying to change the quota ID state.
Everything else is allowed in user namespaces."
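
The rule can be sketched as a pure function; the flag name and helper are illustrative stand-ins for FS_XFLAG_PROJINHERIT and the inode's LLIF_PROJECT_INHERIT bit:

```c
#include <errno.h>
#include <stdbool.h>

#define XFLAG_PROJINHERIT 0x1u  /* stand-in for FS_XFLAG_PROJINHERIT */

/* Outside the init user namespace, neither the project ID nor the
 * project-inherit flag may change; everything else is allowed. */
static int check_project(bool in_init_ns, unsigned int cur_projid,
			 bool cur_inherit, unsigned int new_projid,
			 unsigned int new_xflags)
{
	if (in_init_ns)
		return 0;

	if (cur_projid != new_projid)
		return -EINVAL;

	if (cur_inherit != !!(new_xflags & XFLAG_PROJINHERIT))
		return -EINVAL;

	return 0;
}
```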

WC-bug-id: https://jira.whamcloud.com/browse/LU-11101
Lustre-commit: 2d3bbce0c9f3 ("LU-11101 quota: fix setattr project check")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/32730
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c           | 42 ++++++++++++++++++++++++++++++----------
 fs/lustre/llite/llite_internal.h |  1 +
 fs/lustre/llite/llite_lib.c      |  9 +++++++++
 3 files changed, 42 insertions(+), 10 deletions(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 2fd906f..ed0470d 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -2780,6 +2780,30 @@ int ll_ioctl_fsgetxattr(struct inode *inode, unsigned int cmd,
 	return 0;
 }
 
+int ll_ioctl_check_project(struct inode *inode, struct fsxattr *fa)
+{
+	/*
+	 * Project Quota ID state is only allowed to change from within the init
+	 * namespace. Enforce that restriction only if we are trying to change
+	 * the quota ID state. Everything else is allowed in user namespaces.
+	 */
+	if (current_user_ns() == &init_user_ns)
+		return 0;
+
+	if (ll_i2info(inode)->lli_projid != fa->fsx_projid)
+		return -EINVAL;
+
+	if (test_bit(LLIF_PROJECT_INHERIT, &ll_i2info(inode)->lli_flags)) {
+		if (!(fa->fsx_xflags & FS_XFLAG_PROJINHERIT))
+			return -EINVAL;
+	} else {
+		if (fa->fsx_xflags & FS_XFLAG_PROJINHERIT)
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
 int ll_ioctl_fssetxattr(struct inode *inode, unsigned int cmd,
 			unsigned long arg)
 {
@@ -2791,22 +2815,20 @@ int ll_ioctl_fssetxattr(struct inode *inode, unsigned int cmd,
 	int rc = 0;
 	int flags;
 
-	/* only root could change project ID */
-	if (!capable(CAP_SYS_ADMIN))
-		return -EPERM;
+	if (copy_from_user(&fsxattr,
+			   (const struct fsxattr __user *)arg,
+			   sizeof(fsxattr)))
+		return -EFAULT;
+
+	rc = ll_ioctl_check_project(inode, &fsxattr);
+	if (rc)
+		return rc;
 
 	op_data = ll_prep_md_op_data(NULL, inode, NULL, NULL, 0, 0,
 				     LUSTRE_OPC_ANY, NULL);
 	if (IS_ERR(op_data))
 		return PTR_ERR(op_data);
 
-	if (copy_from_user(&fsxattr,
-			   (const struct fsxattr __user *)arg,
-			   sizeof(fsxattr))) {
-		rc = -EFAULT;
-		goto out_fsxattr;
-	}
-
 	flags = ll_xflags_to_inode_flags(fsxattr.fsx_xflags);
 	op_data->op_attr_flags = ll_inode_to_ext_flags(flags);
 	if (fsxattr.fsx_xflags & FS_XFLAG_PROJINHERIT)
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index edb5f2a..d6fc6a29 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -829,6 +829,7 @@ int ll_migrate(struct inode *parent, struct file *file,
 int ll_get_fid_by_name(struct inode *parent, const char *name,
 		       int namelen, struct lu_fid *fid, struct inode **inode);
 int ll_inode_permission(struct inode *inode, int mask);
+int ll_ioctl_check_project(struct inode *inode, struct fsxattr *fa);
 int ll_ioctl_fsgetxattr(struct inode *inode, unsigned int cmd,
 			unsigned long arg);
 int ll_ioctl_fssetxattr(struct inode *inode, unsigned int cmd,
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index be67652..859fdf4 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -2094,10 +2094,19 @@ int ll_iocontrol(struct inode *inode, struct file *file,
 		struct md_op_data *op_data;
 		struct cl_object *obj;
 		struct iattr *attr;
+		struct fsxattr fa = { 0 };
 
 		if (get_user(flags, (int __user *)arg))
 			return -EFAULT;
 
+		fa.fsx_projid = ll_i2info(inode)->lli_projid;
+		if (flags & LUSTRE_PROJINHERIT_FL)
+			fa.fsx_xflags = FS_XFLAG_PROJINHERIT;
+
+		rc = ll_ioctl_check_project(inode, &fa);
+		if (rc)
+			return rc;
+
 		op_data = ll_prep_md_op_data(NULL, inode, NULL, NULL, 0, 0,
 					     LUSTRE_OPC_ANY, NULL);
 		if (IS_ERR(op_data))
-- 
1.8.3.1


* [lustre-devel] [PATCH 153/622] lnet: socklnd: dynamically set LND parameters
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (151 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 152/622] lustre: quota: fix setattr project check James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 154/622] lustre: flr: add mirror write command James Simmons
                   ` (469 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Sonia Sharma <sharmaso@whamcloud.com>

Currently, the socklnd parameters cannot be set dynamically;
only the default values are used, and they cannot be changed
by deleting and re-adding the net with DLC.

This patch allows the socklnd parameters to be set dynamically.
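
The diff replaces the one-shot net_tunables_set flag with per-field defaulting, where -1 means "not configured via DLC, fall back to the module default". A sketch with simplified, illustrative field names:

```c
/* Simplified stand-in for lnet_ioctl_config_lnd_cmn_tunables. */
struct net_tunables {
	int peer_timeout;
	int max_tx_credits;
	int peer_tx_credits;
};

/* Fill in only the fields the user left unset (-1), and keep the
 * per-peer credit count within the global credit cap, as the patch
 * does for lct_peer_tx_credits. */
static void apply_defaults(struct net_tunables *t,
			   int def_timeout, int def_credits,
			   int def_peer_credits)
{
	if (t->peer_timeout == -1)
		t->peer_timeout = def_timeout;
	if (t->max_tx_credits == -1)
		t->max_tx_credits = def_credits;
	if (t->peer_tx_credits == -1)
		t->peer_tx_credits = def_peer_credits;
	if (t->peer_tx_credits > t->max_tx_credits)
		t->peer_tx_credits = t->max_tx_credits;
}
```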

WC-bug-id: https://jira.whamcloud.com/browse/LU-11371
Lustre-commit: 1d94072c63f5 ("LU-11371 socklnd: dynamically set LND parameters")
Signed-off-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33191
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/socklnd/socklnd.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/net/lnet/klnds/socklnd/socklnd.c b/net/lnet/klnds/socklnd/socklnd.c
index 72ecf80..ba5623a 100644
--- a/net/lnet/klnds/socklnd/socklnd.c
+++ b/net/lnet/klnds/socklnd/socklnd.c
@@ -2723,6 +2723,7 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 ksocknal_startup(struct lnet_ni *ni)
 {
 	struct ksock_net *net;
+	struct lnet_ioctl_config_lnd_cmn_tunables *net_tunables;
 	struct ksock_interface *ksi = NULL;
 	struct lnet_inetdev *ifaces = NULL;
 	int i = 0;
@@ -2745,17 +2746,28 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 	spin_lock_init(&net->ksnn_lock);
 	net->ksnn_incarnation = ktime_get_real_ns();
 	ni->ni_data = net;
-	if (!ni->ni_net->net_tunables_set) {
-		ni->ni_net->net_tunables.lct_peer_timeout =
+	net_tunables = &ni->ni_net->net_tunables;
+
+	if (net_tunables->lct_peer_timeout == -1)
+		net_tunables->lct_peer_timeout =
 			*ksocknal_tunables.ksnd_peertimeout;
-		ni->ni_net->net_tunables.lct_max_tx_credits =
+
+	if (net_tunables->lct_max_tx_credits == -1)
+		net_tunables->lct_max_tx_credits =
 			*ksocknal_tunables.ksnd_credits;
-		ni->ni_net->net_tunables.lct_peer_tx_credits =
+
+	if (net_tunables->lct_peer_tx_credits == -1)
+		net_tunables->lct_peer_tx_credits =
 			*ksocknal_tunables.ksnd_peertxcredits;
-		ni->ni_net->net_tunables.lct_peer_rtr_credits =
+
+	if (net_tunables->lct_peer_tx_credits >
+	    net_tunables->lct_max_tx_credits)
+		net_tunables->lct_peer_tx_credits =
+			net_tunables->lct_max_tx_credits;
+
+	if (net_tunables->lct_peer_rtr_credits == -1)
+		net_tunables->lct_peer_rtr_credits =
 			*ksocknal_tunables.ksnd_peerrtrcredits;
-		ni->ni_net->net_tunables_set = true;
-	}
 
 	rc = lnet_inet_enumerate(&ifaces);
 	if (rc < 0)
-- 
1.8.3.1


* [lustre-devel] [PATCH 154/622] lustre: flr: add mirror write command
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (152 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 153/622] lnet: socklnd: dynamically set LND parameters James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 155/622] lnet: properly error check sensitivity James Simmons
                   ` (468 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Bobi Jam <bobijam@whamcloud.com>

This change issues a RESYNC lease write lock to notify the MDS to
prepare the destination mirror for the write (instantiating the
components of that mirror); the client then copies data from a file
or from STDIN to the specified mirror of the mirrored file. After
the data copy, a RESYNC_DONE lease unlock is issued to the MDS to
update the layout of the mirrored file.
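
The wiretest changes in this patch verify that splitting the old 32-bit pad into a 16-bit mirror id plus 16 bits of remaining padding leaves the wire layout untouched. That invariant can be checked with plain sizeof/offsetof arithmetic over simplified stand-in structs (not the real mdt_rec_reint/mdt_rec_resync):

```c
#include <stddef.h>
#include <stdint.h>

/* Old record tail: a single 32-bit pad at offset 132. */
struct rec_old {
	uint8_t  head[132];  /* stand-in for the preceding fields */
	uint32_t padding9;
};

/* New record tail: the pad split into a 16-bit mirror id plus 16
 * bits of padding -- same size and offsets, so the on-wire format
 * stays compatible with older peers. */
struct rec_new {
	uint8_t  head[132];
	uint16_t mirror_id;
	uint16_t padding9;
};
```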

WC-bug-id: https://jira.whamcloud.com/browse/LU-10258
Lustre-commit: 14171e787dd0 ("LU-10258 lfs: lfs mirror write command")
Signed-off-by: Bobi Jam <bobijam@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33219
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c                  | 10 ++++++++--
 fs/lustre/mdc/mdc_reint.c               |  1 +
 fs/lustre/ptlrpc/pack_generic.c         |  1 +
 fs/lustre/ptlrpc/wiretest.c             | 16 ++++++++++++----
 include/uapi/linux/lustre/lustre_idl.h  |  6 ++++--
 include/uapi/linux/lustre/lustre_user.h | 10 ++++++++++
 6 files changed, 36 insertions(+), 8 deletions(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index ed0470d..9de37d2 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -1162,10 +1162,11 @@ static int ll_lease_close(struct obd_client_handle *och, struct inode *inode,
  * After lease is taken, send the RPC MDS_REINT_RESYNC to the MDT
  */
 static int ll_lease_file_resync(struct obd_client_handle *och,
-				struct inode *inode)
+				struct inode *inode, unsigned long arg)
 {
 	struct ll_sb_info *sbi = ll_i2sbi(inode);
 	struct md_op_data *op_data;
+	struct ll_ioc_lease_id ioc;
 	u64 data_version_unused;
 	int rc;
 
@@ -1174,6 +1175,10 @@ static int ll_lease_file_resync(struct obd_client_handle *och,
 	if (IS_ERR(op_data))
 		return PTR_ERR(op_data);
 
+	if (copy_from_user(&ioc, (struct ll_ioc_lease_id __user *)arg,
+			   sizeof(ioc)))
+		return -EFAULT;
+
 	/* before starting file resync, it's necessary to clean up page cache
 	 * in client memory, otherwise once the layout version is increased,
 	 * writing back cached data will be denied the OSTs.
@@ -1183,6 +1188,7 @@ static int ll_lease_file_resync(struct obd_client_handle *och,
 		goto out;
 
 	op_data->op_lease_handle = och->och_lease_handle;
+	op_data->op_mirror_id = ioc.lil_mirror_id;
 	rc = md_file_resync(sbi->ll_md_exp, op_data);
 	if (rc)
 		goto out;
@@ -3048,7 +3054,7 @@ static long ll_file_set_lease(struct file *file, struct ll_ioc_lease *ioc,
 		return PTR_ERR(och);
 
 	if (ioc->lil_flags & LL_LEASE_RESYNC) {
-		rc = ll_lease_file_resync(och, inode);
+		rc = ll_lease_file_resync(och, inode, arg);
 		if (rc) {
 			ll_lease_close(och, inode, NULL);
 			return rc;
diff --git a/fs/lustre/mdc/mdc_reint.c b/fs/lustre/mdc/mdc_reint.c
index 5d82449..062685c 100644
--- a/fs/lustre/mdc/mdc_reint.c
+++ b/fs/lustre/mdc/mdc_reint.c
@@ -455,6 +455,7 @@ int mdc_file_resync(struct obd_export *exp, struct md_op_data *op_data)
 	rec->rs_cap	= op_data->op_cap.cap[0];
 	rec->rs_fid	= op_data->op_fid1;
 	rec->rs_bias	= op_data->op_bias;
+	rec->rs_mirror_id = op_data->op_mirror_id;
 
 	lock = ldlm_handle2lock(&op_data->op_lease_handle);
 	if (lock) {
diff --git a/fs/lustre/ptlrpc/pack_generic.c b/fs/lustre/ptlrpc/pack_generic.c
index d93dbe1..231cb26 100644
--- a/fs/lustre/ptlrpc/pack_generic.c
+++ b/fs/lustre/ptlrpc/pack_generic.c
@@ -1917,6 +1917,7 @@ void lustre_swab_mdt_rec_reint (struct mdt_rec_reint *rr)
 	__swab32s(&rr->rr_flags);
 	__swab32s(&rr->rr_flags_h);
 	__swab32s(&rr->rr_umask);
+	__swab16s(&rr->rr_mirror_id);
 
 	BUILD_BUG_ON(offsetof(typeof(*rr), rr_padding_4) == 0);
 };
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index f72e5fc..66dce80 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -2854,9 +2854,13 @@ void lustre_assert_wire_constants(void)
 		 (long long)(int)offsetof(struct mdt_rec_resync, rs_padding8));
 	LASSERTF((int)sizeof(((struct mdt_rec_resync *)0)->rs_padding8) == 4, "found %lld\n",
 		 (long long)(int)sizeof(((struct mdt_rec_resync *)0)->rs_padding8));
-	LASSERTF((int)offsetof(struct mdt_rec_resync, rs_padding9) == 132, "found %lld\n",
+	LASSERTF((int)offsetof(struct mdt_rec_resync, rs_mirror_id) == 132, "found %lld\n",
+		 (long long)(int)offsetof(struct mdt_rec_resync, rs_mirror_id));
+	LASSERTF((int)sizeof(((struct mdt_rec_resync *)0)->rs_mirror_id) == 2, "found %lld\n",
+		 (long long)(int)sizeof(((struct mdt_rec_resync *)0)->rs_mirror_id));
+	LASSERTF((int)offsetof(struct mdt_rec_resync, rs_padding9) == 134, "found %lld\n",
 		 (long long)(int)offsetof(struct mdt_rec_resync, rs_padding9));
-	LASSERTF((int)sizeof(((struct mdt_rec_resync *)0)->rs_padding9) == 4, "found %lld\n",
+	LASSERTF((int)sizeof(((struct mdt_rec_resync *)0)->rs_padding9) == 2, "found %lld\n",
 		 (long long)(int)sizeof(((struct mdt_rec_resync *)0)->rs_padding9));
 
 	/* Checks for struct mdt_rec_reint */
@@ -2950,9 +2954,13 @@ void lustre_assert_wire_constants(void)
 		 (long long)(int)offsetof(struct mdt_rec_reint, rr_umask));
 	LASSERTF((int)sizeof(((struct mdt_rec_reint *)0)->rr_umask) == 4, "found %lld\n",
 		 (long long)(int)sizeof(((struct mdt_rec_reint *)0)->rr_umask));
-	LASSERTF((int)offsetof(struct mdt_rec_reint, rr_padding_4) == 132, "found %lld\n",
+	LASSERTF((int)offsetof(struct mdt_rec_reint, rr_mirror_id) == 132, "found %lld\n",
+		 (long long)(int)offsetof(struct mdt_rec_reint, rr_mirror_id));
+	LASSERTF((int)sizeof(((struct mdt_rec_reint *)0)->rr_mirror_id) == 2, "found %lld\n",
+		 (long long)(int)sizeof(((struct mdt_rec_reint *)0)->rr_mirror_id));
+	LASSERTF((int)offsetof(struct mdt_rec_reint, rr_padding_4) == 134, "found %lld\n",
 		 (long long)(int)offsetof(struct mdt_rec_reint, rr_padding_4));
-	LASSERTF((int)sizeof(((struct mdt_rec_reint *)0)->rr_padding_4) == 4, "found %lld\n",
+	LASSERTF((int)sizeof(((struct mdt_rec_reint *)0)->rr_padding_4) == 2, "found %lld\n",
 		 (long long)(int)sizeof(((struct mdt_rec_reint *)0)->rr_padding_4));
 
 	/* Checks for struct lmv_desc */
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index d46a921..8330fe1 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -1876,7 +1876,8 @@ struct mdt_rec_resync {
 	__u32           rs_padding6;	/* rr_flags */
 	__u32           rs_padding7;	/* rr_flags_h */
 	__u32           rs_padding8;	/* rr_umask */
-	__u32           rs_padding9;	/* rr_padding_4 */
+	__u16           rs_mirror_id;
+	__u16           rs_padding9;	/* rr_padding_4 */
 };
 
 /*
@@ -1910,7 +1911,8 @@ struct mdt_rec_reint {
 	__u32		rr_flags;
 	__u32		rr_flags_h;
 	__u32		rr_umask;
-	__u32		rr_padding_4; /* also fix lustre_swab_mdt_rec_reint */
+	__u16		rr_mirror_id;
+	__u16		rr_padding_4; /* also fix lustre_swab_mdt_rec_reint */
 };
 
 /* lmv structures */
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index db751d8..5551cbf 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -277,6 +277,16 @@ struct ll_ioc_lease {
 	__u32		lil_ids[0];
 };
 
+struct ll_ioc_lease_id {
+	__u32		lil_mode;
+	__u32		lil_flags;
+	__u32		lil_count;
+	__u16		lil_mirror_id;
+	__u16		lil_padding1;
+	__u64		lil_padding2;
+	__u32		lil_ids[0];
+};
+
 /*
  * The ioctl naming rules:
  * LL_*     - works on the currently opened filehandle instead of parent dir
-- 
1.8.3.1


* [lustre-devel] [PATCH 155/622] lnet: properly error check sensitivity
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (153 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 154/622] lustre: flr: add mirror write command James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 156/622] lustre: llite: add lock for dir layout data James Simmons
                   ` (467 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Reject setting health sensitivity greater than the maximum health
value.
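
A sketch of the bounds check the patch adds; the maximum used here is an assumed placeholder for LNET_MAX_HEALTH_VALUE, which is defined in the LNet headers:

```c
#include <errno.h>

#define MAX_HEALTH_VALUE 1000  /* assumed stand-in for LNET_MAX_HEALTH_VALUE */

/* Reject a sensitivity larger than the maximum health value instead
 * of silently accepting it; the old code only short-circuited when
 * the value was unchanged. */
static int set_health_sensitivity(unsigned long value,
				  unsigned int *sensitivity)
{
	if (value > MAX_HEALTH_VALUE)
		return -EINVAL;

	*sensitivity = (unsigned int)value;
	return 0;
}
```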

WC-bug-id: https://jira.whamcloud.com/browse/LU-11530
Lustre-commit: a5c1cd5ec240 ("LU-11530 lnet: properly error check sensitivity")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33392
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/api-ni.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 21e0175..a2c648e 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -175,9 +175,11 @@ static int lnet_discover(struct lnet_process_id id, u32 force,
 		return 0;
 	}
 
-	if (value == *sensitivity) {
+	if (value > LNET_MAX_HEALTH_VALUE) {
 		mutex_unlock(&the_lnet.ln_api_mutex);
-		return 0;
+		CERROR("Invalid health value. Maximum: %d value = %lu\n",
+		       LNET_MAX_HEALTH_VALUE, value);
+		return -EINVAL;
 	}
 
 	*sensitivity = value;
-- 
1.8.3.1


* [lustre-devel] [PATCH 156/622] lustre: llite: add lock for dir layout data
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (154 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 155/622] lnet: properly error check sensitivity James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 157/622] lnet: configure recovery interval James Simmons
                   ` (466 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

Directory layout data should be accessed under a lock, because
directory migration may change it; accessing it without a lock
can cause a crash.

Introduce an rw_semaphore 'lli_lsm_sem': any MD operation that uses
directory layout data takes the read lock, and ll_update_lsm_md()
takes the write lock when setting the lsm.

WC-bug-id: https://jira.whamcloud.com/browse/LU-4684
Lustre-commit: ae828cd3b092 ("LU-4684 llite: add lock for dir layout data")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32946
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_lmv.h   |  16 ++++
 fs/lustre/include/obd.h          |   2 +
 fs/lustre/llite/dir.c            |  29 +++----
 fs/lustre/llite/file.c           |   5 +-
 fs/lustre/llite/llite_internal.h |   3 +
 fs/lustre/llite/llite_lib.c      | 168 ++++++++++++++++++++-------------------
 fs/lustre/llite/namei.c          |   2 +
 fs/lustre/llite/statahead.c      | 137 ++++++++++++++++---------------
 fs/lustre/lmv/lmv_obd.c          |   2 -
 9 files changed, 199 insertions(+), 165 deletions(-)

diff --git a/fs/lustre/include/lustre_lmv.h b/fs/lustre/include/lustre_lmv.h
index ff279e1..1246c25 100644
--- a/fs/lustre/include/lustre_lmv.h
+++ b/fs/lustre/include/lustre_lmv.h
@@ -81,6 +81,22 @@ struct lmv_stripe_md {
 	return true;
 }
 
+static inline void lsm_md_dump(int mask, const struct lmv_stripe_md *lsm)
+{
+	int i;
+
+	CDEBUG(mask,
+	       "magic %#x stripe count %d master mdt %d hash type %#x version %d migrate offset %d migrate hash %#x pool %s\n",
+	       lsm->lsm_md_magic, lsm->lsm_md_stripe_count,
+	       lsm->lsm_md_master_mdt_index, lsm->lsm_md_hash_type,
+	       lsm->lsm_md_layout_version, lsm->lsm_md_migrate_offset,
+	       lsm->lsm_md_migrate_hash, lsm->lsm_md_pool_name);
+
+	for (i = 0; i < lsm->lsm_md_stripe_count; i++)
+		CDEBUG(mask, "stripe[%d] "DFID"\n",
+		       i, PFID(&lsm->lsm_md_oinfo[i].lmo_fid));
+}
+
 union lmv_mds_md;
 
 void lmv_free_memmd(struct lmv_stripe_md *lsm);
diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index 2587136..4829e11 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -741,6 +741,8 @@ struct md_op_data {
 	s64			op_mod_time;
 	const char	       *op_name;
 	size_t			op_namelen;
+	struct rw_semaphore	*op_mea1_sem;
+	struct rw_semaphore	*op_mea2_sem;
 	struct lmv_stripe_md   *op_mea1;
 	struct lmv_stripe_md   *op_mea2;
 	u32			op_suppgids[2];
diff --git a/fs/lustre/llite/dir.c b/fs/lustre/llite/dir.c
index 55a1efb..3da9d14 100644
--- a/fs/lustre/llite/dir.c
+++ b/fs/lustre/llite/dir.c
@@ -298,6 +298,7 @@ static int ll_readdir(struct file *filp, struct dir_context *ctx)
 	int hash64 = sbi->ll_flags & LL_SBI_64BIT_HASH;
 	bool api32 = ll_need_32bit_api(sbi);
 	struct md_op_data *op_data;
+	struct lu_fid pfid = { 0 };
 	int rc;
 
 	CDEBUG(D_VFSTRACE,
@@ -313,14 +314,7 @@ static int ll_readdir(struct file *filp, struct dir_context *ctx)
 		goto out;
 	}
 
-	op_data = ll_prep_md_op_data(NULL, inode, inode, NULL, 0, 0,
-				     LUSTRE_OPC_ANY, inode);
-	if (IS_ERR(op_data)) {
-		rc = PTR_ERR(op_data);
-		goto out;
-	}
-
-	if (unlikely(op_data->op_mea1)) {
+	if (unlikely(ll_i2info(inode)->lli_lsm_md)) {
 		/*
 		 * This is only needed for striped dir to fill ..,
 		 * see lmv_read_page
@@ -332,21 +326,28 @@ static int ll_readdir(struct file *filp, struct dir_context *ctx)
 
 			parent = file_dentry(filp)->d_parent->d_inode;
 			if (ll_have_md_lock(parent, &ibits, LCK_MINMODE))
-				op_data->op_fid3 = *ll_inode2fid(parent);
+				pfid = *ll_inode2fid(parent);
 		}
 
 		/*
 		 * If it can not find in cache, do lookup .. on the master
 		 * object
 		 */
-		if (fid_is_zero(&op_data->op_fid3)) {
-			rc = ll_dir_get_parent_fid(inode, &op_data->op_fid3);
-			if (rc) {
-				ll_finish_md_op_data(op_data);
+		if (fid_is_zero(&pfid)) {
+			rc = ll_dir_get_parent_fid(inode, &pfid);
+			if (rc)
 				return rc;
-			}
 		}
 	}
+
+	op_data = ll_prep_md_op_data(NULL, inode, inode, NULL, 0, 0,
+				     LUSTRE_OPC_ANY, inode);
+	if (IS_ERR(op_data)) {
+		rc = PTR_ERR(op_data);
+		goto out;
+	}
+	op_data->op_fid3 = pfid;
+
 	ctx->pos = pos;
 	rc = ll_dir_read(inode, &pos, op_data, ctx);
 	pos = ctx->pos;
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 9de37d2..e1fba1c 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -4080,12 +4080,15 @@ static int ll_inode_revalidate(struct dentry *dentry, enum ldlm_intent_flags op)
 
 static int ll_merge_md_attr(struct inode *inode)
 {
+	struct ll_inode_info *lli = ll_i2info(inode);
 	struct cl_attr attr = { 0 };
 	int rc;
 
-	LASSERT(ll_i2info(inode)->lli_lsm_md);
+	LASSERT(lli->lli_lsm_md);
+	down_read(&lli->lli_lsm_sem);
 	rc = md_merge_attr(ll_i2mdexp(inode), ll_i2info(inode)->lli_lsm_md,
 			   &attr, ll_md_blocking_ast);
+	up_read(&lli->lli_lsm_sem);
 	if (rc)
 		return rc;
 
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index d6fc6a29..d41531b 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -168,6 +168,8 @@ struct ll_inode_info {
 			unsigned int			lli_sa_enabled:1;
 			/* generation for statahead */
 			unsigned int			lli_sa_generation;
+			/* rw lock protects lli_lsm_md */
+			struct rw_semaphore		lli_lsm_sem;
 			/* directory stripe information */
 			struct lmv_stripe_md	       *lli_lsm_md;
 			/* default directory stripe offset.  This is extracted
@@ -905,6 +907,7 @@ enum {
 	LUSTRE_OPC_ANY		= 5,
 };
 
+void ll_unlock_md_op_lsm(struct md_op_data *op_data);
 struct md_op_data *ll_prep_md_op_data(struct md_op_data *op_data,
 				      struct inode *i1, struct inode *i2,
 				      const char *name, size_t namelen,
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 859fdf4..ed2d1c6 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -933,6 +933,7 @@ void ll_lli_init(struct ll_inode_info *lli)
 		lli->lli_opendir_pid = 0;
 		lli->lli_sa_enabled = 0;
 		lli->lli_def_stripe_offset = -1;
+		init_rwsem(&lli->lli_lsm_sem);
 	} else {
 		mutex_init(&lli->lli_size_mutex);
 		lli->lli_symlink_name = NULL;
@@ -1237,10 +1238,17 @@ static struct inode *ll_iget_anon_dir(struct super_block *sb,
 static int ll_init_lsm_md(struct inode *inode, struct lustre_md *md)
 {
 	struct lmv_stripe_md *lsm = md->lmv;
+	struct ll_inode_info *lli = ll_i2info(inode);
 	struct lu_fid *fid;
 	int i;
 
 	LASSERT(lsm);
+
+	CDEBUG(D_INODE, "%s: "DFID" set dir layout:\n",
+		ll_get_fsname(inode->i_sb, NULL, 0),
+		PFID(&lli->lli_fid));
+	lsm_md_dump(D_INODE, lsm);
+
 	/*
 	 * XXX sigh, this lsm_root initialization should be in
 	 * LMV layer, but it needs ll_iget right now, so we
@@ -1260,10 +1268,16 @@ static int ll_init_lsm_md(struct inode *inode, struct lustre_md *md)
 			int rc = PTR_ERR(lsm->lsm_md_oinfo[i].lmo_root);
 
 			lsm->lsm_md_oinfo[i].lmo_root = NULL;
+			while (i-- > 0) {
+				iput(lsm->lsm_md_oinfo[i].lmo_root);
+				lsm->lsm_md_oinfo[i].lmo_root = NULL;
+			}
 			return rc;
 		}
 	}
 
+	lli->lli_lsm_md = lsm;
+
 	return 0;
 }
 
@@ -1271,7 +1285,7 @@ static int ll_update_lsm_md(struct inode *inode, struct lustre_md *md)
 {
 	struct ll_inode_info *lli = ll_i2info(inode);
 	struct lmv_stripe_md *lsm = md->lmv;
-	int rc;
+	int rc = 0;
 
 	LASSERT(S_ISDIR(inode->i_mode));
 	CDEBUG(D_INODE, "update lsm %p of " DFID "\n", lli->lli_lsm_md,
@@ -1284,53 +1298,43 @@ static int ll_update_lsm_md(struct inode *inode, struct lustre_md *md)
 	if (!lsm)
 		return 0;
 
-	/* Compare the old and new stripe information */
+	/*
+	 * normally dir layout doesn't change, only take read lock to check
+	 * that to avoid blocking other MD operations.
+	 */
+	if (lli->lli_lsm_md)
+		down_read(&lli->lli_lsm_sem);
+	else
+		down_write(&lli->lli_lsm_sem);
+
+	/*
+	 * if dir layout mismatch, check whether version is increased, which
+	 * means layout is changed, this happens in dir migration and lfsck.
+	 */
 	if (lli->lli_lsm_md && !lsm_md_eq(lli->lli_lsm_md, lsm)) {
-		struct lmv_stripe_md *old_lsm = lli->lli_lsm_md;
-		bool layout_changed = lsm->lsm_md_layout_version >
-				      old_lsm->lsm_md_layout_version;
-		int mask = layout_changed ? D_INODE : D_ERROR;
-		int idx;
-
-		CDEBUG(mask,
-		       "%s: inode@%p "DFID" lmv layout %s magic %#x/%#x stripe count %d/%d master_mdt %d/%d hash_type %#x/%#x version %d/%d migrate offset %d/%d  migrate hash %#x/%#x pool %s/%s\n",
-		       ll_get_fsname(inode->i_sb, NULL, 0), inode,
-		       PFID(&lli->lli_fid),
-		       layout_changed ? "changed" : "mismatch",
-		       lsm->lsm_md_magic, old_lsm->lsm_md_magic,
-		       lsm->lsm_md_stripe_count,
-		       old_lsm->lsm_md_stripe_count,
-		       lsm->lsm_md_master_mdt_index,
-		       old_lsm->lsm_md_master_mdt_index,
-		       lsm->lsm_md_hash_type, old_lsm->lsm_md_hash_type,
-		       lsm->lsm_md_layout_version,
-		       old_lsm->lsm_md_layout_version,
-		       lsm->lsm_md_migrate_offset,
-		       old_lsm->lsm_md_migrate_offset,
-		       lsm->lsm_md_migrate_hash,
-		       old_lsm->lsm_md_migrate_hash,
-		       lsm->lsm_md_pool_name,
-		       old_lsm->lsm_md_pool_name);
-
-		for (idx = 0; idx < old_lsm->lsm_md_stripe_count; idx++)
-			CDEBUG(mask, "old stripe[%d] "DFID"\n",
-			       idx, PFID(&old_lsm->lsm_md_oinfo[idx].lmo_fid));
-
-		for (idx = 0; idx < lsm->lsm_md_stripe_count; idx++)
-			CDEBUG(mask, "new stripe[%d] "DFID"\n",
-			       idx, PFID(&lsm->lsm_md_oinfo[idx].lmo_fid));
-
-		if (!layout_changed)
-			return -EINVAL;
+		if (lsm->lsm_md_layout_version <=
+		    lli->lli_lsm_md->lsm_md_layout_version) {
+			CERROR("%s: " DFID " dir layout mismatch:\n",
+			       ll_get_fsname(inode->i_sb, NULL, 0),
+			       PFID(&lli->lli_fid));
+			lsm_md_dump(D_ERROR, lli->lli_lsm_md);
+			lsm_md_dump(D_ERROR, lsm);
+			rc = -EINVAL;
+			goto unlock;
+		}
 
+		/* layout changed, switch to write lock */
+		up_read(&lli->lli_lsm_sem);
+		down_write(&lli->lli_lsm_sem);
 		ll_dir_clear_lsm_md(inode);
 	}
 
-	/* set the directory layout */
+	/* set directory layout */
 	if (!lli->lli_lsm_md) {
 		struct cl_attr *attr;
 
 		rc = ll_init_lsm_md(inode, md);
+		up_write(&lli->lli_lsm_sem);
 		if (rc)
 			return rc;
 
@@ -1339,18 +1343,25 @@ static int ll_update_lsm_md(struct inode *inode, struct lustre_md *md)
 		 * will not free this lsm
 		 */
 		md->lmv = NULL;
-		lli->lli_lsm_md = lsm;
+
+		/*
+		 * md_merge_attr() may take long, since lsm is already set,
+		 * switch to read lock.
+		 */
+		down_read(&lli->lli_lsm_sem);
 
 		attr = kzalloc(sizeof(*attr), GFP_NOFS);
-		if (!attr)
-			return -ENOMEM;
+		if (!attr) {
+			rc = -ENOMEM;
+			goto unlock;
+		}
 
 		/* validate the lsm */
 		rc = md_merge_attr(ll_i2mdexp(inode), lsm, attr,
 				   ll_md_blocking_ast);
 		if (rc) {
 			kfree(attr);
-			return rc;
+			goto unlock;
 		}
 
 		if (md->body->mbo_valid & OBD_MD_FLNLINK)
@@ -1365,47 +1376,11 @@ static int ll_update_lsm_md(struct inode *inode, struct lustre_md *md)
 			md->body->mbo_mtime = attr->cat_mtime;
 
 		kfree(attr);
-
-		CDEBUG(D_INODE, "Set lsm %p magic %x to " DFID "\n", lsm,
-		       lsm->lsm_md_magic, PFID(ll_inode2fid(inode)));
-		return 0;
 	}
+unlock:
+	up_read(&lli->lli_lsm_sem);
 
-	/* Compare the old and new stripe information */
-	if (!lsm_md_eq(lli->lli_lsm_md, lsm)) {
-		struct lmv_stripe_md *old_lsm = lli->lli_lsm_md;
-		int idx;
-
-		CERROR("%s: inode " DFID "(%p)'s lmv layout mismatch (%p)/(%p) magic:0x%x/0x%x stripe count: %d/%d master_mdt: %d/%d hash_type:0x%x/0x%x layout: 0x%x/0x%x pool:%s/%s\n",
-		       ll_get_fsname(inode->i_sb, NULL, 0), PFID(&lli->lli_fid),
-		       inode, lsm, old_lsm,
-		       lsm->lsm_md_magic, old_lsm->lsm_md_magic,
-		       lsm->lsm_md_stripe_count,
-		       old_lsm->lsm_md_stripe_count,
-		       lsm->lsm_md_master_mdt_index,
-		       old_lsm->lsm_md_master_mdt_index,
-		       lsm->lsm_md_hash_type, old_lsm->lsm_md_hash_type,
-		       lsm->lsm_md_layout_version,
-		       old_lsm->lsm_md_layout_version,
-		       lsm->lsm_md_pool_name,
-		       old_lsm->lsm_md_pool_name);
-
-		for (idx = 0; idx < old_lsm->lsm_md_stripe_count; idx++) {
-			CERROR("%s: sub FIDs in old lsm idx %d, old: " DFID "\n",
-			       ll_get_fsname(inode->i_sb, NULL, 0), idx,
-			       PFID(&old_lsm->lsm_md_oinfo[idx].lmo_fid));
-		}
-
-		for (idx = 0; idx < lsm->lsm_md_stripe_count; idx++) {
-			CERROR("%s: sub FIDs in new lsm idx %d, new: " DFID "\n",
-			       ll_get_fsname(inode->i_sb, NULL, 0), idx,
-			       PFID(&lsm->lsm_md_oinfo[idx].lmo_fid));
-		}
-
-		return -EIO;
-	}
-
-	return 0;
+	return rc;
 }
 
 void ll_clear_inode(struct inode *inode)
@@ -2417,6 +2392,23 @@ int ll_obd_statfs(struct inode *inode, void __user *arg)
 	return rc;
 }
 
+/*
+ * this is normally called in ll_fini_md_op_data(), but sometimes it needs to
+ * be called early to avoid deadlock.
+ */
+void ll_unlock_md_op_lsm(struct md_op_data *op_data)
+{
+	if (op_data->op_mea2_sem) {
+		up_read(op_data->op_mea2_sem);
+		op_data->op_mea2_sem = NULL;
+	}
+
+	if (op_data->op_mea1_sem) {
+		up_read(op_data->op_mea1_sem);
+		op_data->op_mea1_sem = NULL;
+	}
+}
+
 /* this function prepares md_op_data hint for passing ot down to MD stack. */
 struct md_op_data *ll_prep_md_op_data(struct md_op_data *op_data,
 				      struct inode *i1, struct inode *i2,
@@ -2444,7 +2436,10 @@ struct md_op_data *ll_prep_md_op_data(struct md_op_data *op_data,
 	ll_i2gids(op_data->op_suppgids, i1, i2);
 	op_data->op_fid1 = *ll_inode2fid(i1);
 	op_data->op_default_stripe_offset = -1;
+
 	if (S_ISDIR(i1->i_mode)) {
+		down_read(&ll_i2info(i1)->lli_lsm_sem);
+		op_data->op_mea1_sem = &ll_i2info(i1)->lli_lsm_sem;
 		op_data->op_mea1 = ll_i2info(i1)->lli_lsm_md;
 		if (opc == LUSTRE_OPC_MKDIR)
 			op_data->op_default_stripe_offset =
@@ -2453,8 +2448,14 @@ struct md_op_data *ll_prep_md_op_data(struct md_op_data *op_data,
 
 	if (i2) {
 		op_data->op_fid2 = *ll_inode2fid(i2);
-		if (S_ISDIR(i2->i_mode))
+		if (S_ISDIR(i2->i_mode)) {
+			if (i2 != i1) {
+				down_read(&ll_i2info(i2)->lli_lsm_sem);
+				op_data->op_mea2_sem =
+						&ll_i2info(i2)->lli_lsm_sem;
+			}
 			op_data->op_mea2 = ll_i2info(i2)->lli_lsm_md;
+		}
 	} else {
 		fid_zero(&op_data->op_fid2);
 	}
@@ -2483,6 +2484,7 @@ struct md_op_data *ll_prep_md_op_data(struct md_op_data *op_data,
 
 void ll_finish_md_op_data(struct md_op_data *op_data)
 {
+	ll_unlock_md_op_lsm(op_data);
 	security_release_secctx(op_data->op_file_secctx,
 				op_data->op_file_secctx_size);
 	kfree(op_data);
diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index 530c2df..3e3fbd9 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -777,6 +777,8 @@ static struct dentry *ll_lookup_it(struct inode *parent, struct dentry *dentry,
 		goto out;
 	}
 
+	/* dir layout may change */
+	ll_unlock_md_op_lsm(op_data);
 	rc = ll_lookup_it_finish(req, it, parent, &dentry);
 	if (rc != 0) {
 		ll_intent_release(it);
diff --git a/fs/lustre/llite/statahead.c b/fs/lustre/llite/statahead.c
index 122b9d8..1de62b5 100644
--- a/fs/lustre/llite/statahead.c
+++ b/fs/lustre/llite/statahead.c
@@ -332,6 +332,58 @@ static void sa_put(struct ll_statahead_info *sai, struct sa_entry *entry,
 	return (index == sai->sai_index_wait);
 }
 
+/* finish async stat RPC arguments */
+static void sa_fini_data(struct md_enqueue_info *minfo)
+{
+	ll_unlock_md_op_lsm(&minfo->mi_data);
+	iput(minfo->mi_dir);
+	kfree(minfo);
+}
+
+static int ll_statahead_interpret(struct ptlrpc_request *req,
+				  struct md_enqueue_info *minfo, int rc);
+
+/*
+ * prepare arguments for async stat RPC.
+ */
+static struct md_enqueue_info *
+sa_prep_data(struct inode *dir, struct inode *child, struct sa_entry *entry)
+{
+	struct md_enqueue_info   *minfo;
+	struct ldlm_enqueue_info *einfo;
+	struct md_op_data        *op_data;
+
+	minfo = kzalloc(sizeof(*minfo), GFP_NOFS);
+	if (!minfo)
+		return ERR_PTR(-ENOMEM);
+
+	op_data = ll_prep_md_op_data(&minfo->mi_data, dir, child,
+				     entry->se_qstr.name, entry->se_qstr.len, 0,
+				     LUSTRE_OPC_ANY, NULL);
+	if (IS_ERR(op_data)) {
+		kfree(minfo);
+		return (struct md_enqueue_info *)op_data;
+	}
+
+	if (!child)
+		op_data->op_fid2 = entry->se_fid;
+
+	minfo->mi_it.it_op = IT_GETATTR;
+	minfo->mi_dir = igrab(dir);
+	minfo->mi_cb = ll_statahead_interpret;
+	minfo->mi_cbdata = entry;
+
+	einfo = &minfo->mi_einfo;
+	einfo->ei_type   = LDLM_IBITS;
+	einfo->ei_mode   = it_to_lock_mode(&minfo->mi_it);
+	einfo->ei_cb_bl  = ll_md_blocking_ast;
+	einfo->ei_cb_cp  = ldlm_completion_ast;
+	einfo->ei_cb_gl  = NULL;
+	einfo->ei_cbdata = NULL;
+
+	return minfo;
+}
+
 /*
  * release resources used in async stat RPC, update entry state and wakeup if
  * scanner process it waiting on this entry.
@@ -348,8 +400,7 @@ static void sa_put(struct ll_statahead_info *sai, struct sa_entry *entry,
 	if (minfo) {
 		entry->se_minfo = NULL;
 		ll_intent_release(&minfo->mi_it);
-		iput(minfo->mi_dir);
-		kfree(minfo);
+		sa_fini_data(minfo);
 	}
 
 	if (req) {
@@ -685,17 +736,16 @@ static int ll_statahead_interpret(struct ptlrpc_request *req,
 
 	if (rc) {
 		ll_intent_release(it);
-		iput(dir);
-		kfree(minfo);
+		sa_fini_data(minfo);
 	} else {
-		/*
-		 * release ibits lock ASAP to avoid deadlock when statahead
+		/* release ibits lock ASAP to avoid deadlock when statahead
 		 * thread enqueues lock on parent in readdir and another
 		 * process enqueues lock on child with parent lock held, eg.
 		 * unlink.
 		 */
 		handle = it->it_lock_handle;
 		ll_intent_drop_lock(it);
+		ll_unlock_md_op_lsm(&minfo->mi_data);
 	}
 
 	spin_lock(&lli->lli_sa_lock);
@@ -729,54 +779,6 @@ static int ll_statahead_interpret(struct ptlrpc_request *req,
 	return rc;
 }
 
-/* finish async stat RPC arguments */
-static void sa_fini_data(struct md_enqueue_info *minfo)
-{
-	iput(minfo->mi_dir);
-	kfree(minfo);
-}
-
-/**
- * prepare arguments for async stat RPC.
- */
-static struct md_enqueue_info *
-sa_prep_data(struct inode *dir, struct inode *child, struct sa_entry *entry)
-{
-	struct md_enqueue_info *minfo;
-	struct ldlm_enqueue_info *einfo;
-	struct md_op_data *op_data;
-
-	minfo = kzalloc(sizeof(*minfo), GFP_NOFS);
-	if (!minfo)
-		return ERR_PTR(-ENOMEM);
-
-	op_data = ll_prep_md_op_data(&minfo->mi_data, dir, child,
-				     entry->se_qstr.name, entry->se_qstr.len, 0,
-				     LUSTRE_OPC_ANY, NULL);
-	if (IS_ERR(op_data)) {
-		kfree(minfo);
-		return (struct md_enqueue_info *)op_data;
-	}
-
-	if (!child)
-		op_data->op_fid2 = entry->se_fid;
-
-	minfo->mi_it.it_op = IT_GETATTR;
-	minfo->mi_dir = igrab(dir);
-	minfo->mi_cb = ll_statahead_interpret;
-	minfo->mi_cbdata = entry;
-
-	einfo = &minfo->mi_einfo;
-	einfo->ei_type = LDLM_IBITS;
-	einfo->ei_mode = it_to_lock_mode(&minfo->mi_it);
-	einfo->ei_cb_bl = ll_md_blocking_ast;
-	einfo->ei_cb_cp = ldlm_completion_ast;
-	einfo->ei_cb_gl = NULL;
-	einfo->ei_cbdata = NULL;
-
-	return minfo;
-}
-
 /* async stat for file not found in dcache */
 static int sa_lookup(struct inode *dir, struct sa_entry *entry)
 {
@@ -818,22 +820,20 @@ static int sa_revalidate(struct inode *dir, struct sa_entry *entry,
 	if (d_mountpoint(dentry))
 		return 1;
 
+	minfo = sa_prep_data(dir, inode, entry);
+	if (IS_ERR(minfo))
+		return PTR_ERR(minfo);
+
 	entry->se_inode = igrab(inode);
 	rc = md_revalidate_lock(ll_i2mdexp(dir), &it, ll_inode2fid(inode),
 				NULL);
 	if (rc == 1) {
 		entry->se_handle = it.it_lock_handle;
 		ll_intent_release(&it);
+		sa_fini_data(minfo);
 		return 1;
 	}
 
-	minfo = sa_prep_data(dir, inode, entry);
-	if (IS_ERR(minfo)) {
-		entry->se_inode = NULL;
-		iput(inode);
-		return PTR_ERR(minfo);
-	}
-
 	rc = md_intent_getattr_async(ll_i2mdexp(dir), minfo);
 	if (rc) {
 		entry->se_inode = NULL;
@@ -982,10 +982,9 @@ static int ll_statahead_thread(void *arg)
 	CDEBUG(D_READA, "statahead thread starting: sai %p, parent %pd\n",
 	       sai, parent);
 
-	op_data = ll_prep_md_op_data(NULL, dir, dir, NULL, 0, 0,
-				     LUSTRE_OPC_ANY, dir);
-	if (IS_ERR(op_data)) {
-		rc = PTR_ERR(op_data);
+	op_data = kzalloc(sizeof(*op_data), GFP_NOFS);
+	if (!op_data) {
+		rc = -ENOMEM;
 		goto out;
 	}
 
@@ -993,8 +992,16 @@ static int ll_statahead_thread(void *arg)
 		struct lu_dirpage *dp;
 		struct lu_dirent *ent;
 
+		op_data = ll_prep_md_op_data(op_data, dir, dir, NULL, 0, 0,
+				     LUSTRE_OPC_ANY, dir);
+		if (IS_ERR(op_data)) {
+			rc = PTR_ERR(op_data);
+			break;
+		}
+
 		sai->sai_in_readpage = 1;
 		page = ll_get_dir_page(dir, op_data, pos);
+		ll_unlock_md_op_lsm(op_data);
 		sai->sai_in_readpage = 0;
 		if (IS_ERR(page)) {
 			rc = PTR_ERR(page);
diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index 81b86a0..e98f33d 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -1901,8 +1901,6 @@ static int lmv_migrate(struct obd_export *exp, struct md_op_data *op_data,
 	int rc;
 
 	LASSERT(op_data->op_cli_flags & CLI_MIGRATE);
-	LASSERTF(fid_is_sane(&op_data->op_fid3), "invalid FID "DFID"\n",
-		 PFID(&op_data->op_fid3));
 
 	CDEBUG(D_INODE, "MIGRATE "DFID"/%.*s\n",
 	       PFID(&op_data->op_fid1), (int)namelen, name);
-- 
1.8.3.1


* [lustre-devel] [PATCH 157/622] lnet: configure recovery interval
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (155 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 156/622] lustre: llite: add lock for dir layout data James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 158/622] lustre: osc: Do not walk full extent list James Simmons
                   ` (465 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Added a module parameter to configure the interval between recovery
pings. Some sites might not want to ping failed NIDs once a second
and might desire a longer interval. The interval defaults to
1 second.
The monitor thread now wakes up based on the smallest interval it
needs to service.
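The wakeup computation in the monitor thread can be sketched as below (a minimal model with illustrative names; the patch computes it inline as `min(lnet_recovery_interval, lnet_transaction_timeout / 2)`):

```c
#include <assert.h>

/* The monitor thread services two periodic jobs: response-timeout
 * finalization every transaction_timeout / 2 seconds, and NI recovery
 * every recovery_interval seconds.  It sleeps for the smaller of the
 * two so neither deadline is missed. */
static unsigned int monitor_sleep_interval(unsigned int recovery_interval,
					   unsigned int transaction_timeout)
{
	unsigned int rsp_interval = transaction_timeout / 2;

	return recovery_interval < rsp_interval ? recovery_interval
						: rsp_interval;
}
```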

WC-bug-id: https://jira.whamcloud.com/browse/LU-11468
Lustre-commit: dc1f5f08b420 ("LU-11468 lnet: configure recovery interval")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33309
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h |  1 +
 net/lnet/lnet/api-ni.c        | 52 +++++++++++++++++++++++++++++++++++++++++++
 net/lnet/lnet/lib-move.c      | 24 +++++++++++++-------
 3 files changed, 69 insertions(+), 8 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index ecacd65..26095a6 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -502,6 +502,7 @@ struct lnet_ni *
 extern unsigned int lnet_retry_count;
 extern unsigned int lnet_numa_range;
 extern unsigned int lnet_health_sensitivity;
+extern unsigned int lnet_recovery_interval;
 extern unsigned int lnet_peer_discovery_disabled;
 extern int portal_rotor;
 
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index a2c648e..c4f698d 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -95,6 +95,23 @@ struct lnet the_lnet = {
 MODULE_PARM_DESC(lnet_health_sensitivity,
 		 "Value to decrement the health value by on error");
 
+/* lnet_recovery_interval determines how often we should perform recovery
+ * on unhealthy interfaces.
+ */
+unsigned int lnet_recovery_interval = 1;
+static int recovery_interval_set(const char *val,
+				 const struct kernel_param *kp);
+static struct kernel_param_ops param_ops_recovery_interval = {
+	.set = recovery_interval_set,
+	.get = param_get_int,
+};
+
+#define param_check_recovery_interval(name, p) \
+		__param_check(name, p, int)
+module_param(lnet_recovery_interval, recovery_interval, 0644);
+MODULE_PARM_DESC(lnet_recovery_interval,
+		 "Interval to recover unhealthy interfaces in seconds");
+
 static int lnet_interfaces_max = LNET_INTERFACES_MAX_DEFAULT;
 static int intf_max_set(const char *val, const struct kernel_param *kp);
 module_param_call(lnet_interfaces_max, intf_max_set, param_get_int,
@@ -190,6 +207,41 @@ static int lnet_discover(struct lnet_process_id id, u32 force,
 }
 
 static int
+recovery_interval_set(const char *val, const struct kernel_param *kp)
+{
+	int rc;
+	unsigned int *interval = (unsigned int *)kp->arg;
+	unsigned long value;
+
+	rc = kstrtoul(val, 0, &value);
+	if (rc) {
+		CERROR("Invalid module parameter value for 'lnet_recovery_interval'\n");
+		return rc;
+	}
+
+	if (value < 1) {
+		CERROR("lnet_recovery_interval must be at least 1 second\n");
+		return -EINVAL;
+	}
+
+	/* The purpose of locking the api_mutex here is to ensure that
+	 * the correct value ends up stored properly.
+	 */
+	mutex_lock(&the_lnet.ln_api_mutex);
+
+	if (the_lnet.ln_state != LNET_STATE_RUNNING) {
+		mutex_unlock(&the_lnet.ln_api_mutex);
+		return 0;
+	}
+
+	*interval = value;
+
+	mutex_unlock(&the_lnet.ln_api_mutex);
+
+	return 0;
+}
+
+static int
 discovery_set(const char *val, const struct kernel_param *kp)
 {
 	int rc;
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 548ea88..434aa09 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -3074,7 +3074,10 @@ struct lnet_mt_event_info {
 static int
 lnet_monitor_thread(void *arg)
 {
-	int wakeup_counter = 0;
+	time64_t recovery_timeout = 0;
+	time64_t rsp_timeout = 0;
+	int interval;
+	time64_t now;
 
 	/* The monitor thread takes care of the following:
 	 *  1. Checks the aliveness of routers
@@ -3086,20 +3089,23 @@ struct lnet_mt_event_info {
 	 *     and pings them.
 	 */
 	while (the_lnet.ln_mt_state == LNET_MT_STATE_RUNNING) {
+		now = ktime_get_real_seconds();
+
 		if (lnet_router_checker_active())
 			lnet_check_routers();
 
 		lnet_resend_pending_msgs();
 
-		wakeup_counter++;
-		if (wakeup_counter >= lnet_transaction_timeout / 2) {
+		if (now >= rsp_timeout) {
 			lnet_finalize_expired_responses(false);
-			wakeup_counter = 0;
+			rsp_timeout = now + (lnet_transaction_timeout / 2);
 		}
 
-		lnet_recover_local_nis();
-
-		lnet_recover_peer_nis();
+		if (now >= recovery_timeout) {
+			lnet_recover_local_nis();
+			lnet_recover_peer_nis();
+			recovery_timeout = now + lnet_recovery_interval;
+		}
 
 		/* TODO do we need to check if we should sleep without
 		 * timeout?  Technically, an active system will always
@@ -3109,8 +3115,10 @@ struct lnet_mt_event_info {
 		 * cases where we get a complaint that an idle thread
 		 * is waking up unnecessarily.
 		 */
+		interval = min(lnet_recovery_interval,
+			       lnet_transaction_timeout / 2);
 		wait_event_interruptible_timeout(the_lnet.ln_mt_waitq,
-						 false, HZ);
+						 false, HZ * interval);
 	}
 
 	/* clean up the router checker */
-- 
1.8.3.1


* [lustre-devel] [PATCH 158/622] lustre: osc: Do not walk full extent list
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (156 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 157/622] lnet: configure recovery interval James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 159/622] lnet: separate ni state from recovery James Simmons
                   ` (464 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

It is only possible to merge with the extent immediately
before or immediately after the one we are trying to add,
so do not continue to walk the extent list after passing
that extent.
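The early-exit test the patch adds can be sketched as follows (a simplified model using chunk indices only; `walk_can_stop` is an illustrative name for the `chunk > ext_chk_end + 1 || chunk < ext_chk_start` condition):

```c
#include <assert.h>

/* A new chunk can only merge with an extent it overlaps or is adjacent
 * to.  Walking the sorted extent list, the walk can stop as soon as the
 * current extent is strictly before (end + 1 < chunk) or strictly after
 * (start > chunk) the chunk being added. */
static int walk_can_stop(unsigned long chunk, unsigned long ext_start,
			 unsigned long ext_end)
{
	return chunk > ext_end + 1 || chunk < ext_start;
}
```

For a sparse workload where most extents are distant, stopping at the first non-adjacent extent turns a full-list walk into a short one, which is where the reported ~15% improvement comes from.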

This has a significant impact when writing large sparse
files, where most writes create a new extent, and many
extents are too distant to be merged with their neighbors.

Writing 2 GiB of data randomly 4K at a time, we see an
improvement of about 15% with this patch.

mpirun -n 1 $IOR -w -t 4K -b 2G -o ./file -z
w/o patch:
write         285.86 MiB/s
w/patch:
write         324.03 MiB/s

Cray-bug-id: LUS-6523
WC-bug-id: https://jira.whamcloud.com/browse/LU-11423
Lustre-commit: 7f8143cf85b7 ("LU-11423 osc: Do not walk full extent list")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33227
Reviewed-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/osc_cache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/lustre/osc/osc_cache.c b/fs/lustre/osc/osc_cache.c
index 2ed7ca2..961fc6bf 100644
--- a/fs/lustre/osc/osc_cache.c
+++ b/fs/lustre/osc/osc_cache.c
@@ -746,7 +746,7 @@ static struct osc_extent *osc_extent_find(const struct lu_env *env,
 		pgoff_t ext_chk_end = ext->oe_end >> ppc_bits;
 
 		LASSERT(osc_extent_sanity_check_nolock(ext) == 0);
-		if (chunk > ext_chk_end + 1)
+		if (chunk > ext_chk_end + 1 || chunk < ext_chk_start)
 			break;
 
 		/* if covering by different locks, no chance to match */
-- 
1.8.3.1


* [lustre-devel] [PATCH 159/622] lnet: separate ni state from recovery
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (157 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 158/622] lustre: osc: Do not walk full extent list James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 160/622] lustre: mdc: move empty xattr handling to mdc layer James Simmons
                   ` (463 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

To make the code more readable we make the ni_state an
enumerated type, and create a separate bit field to track
the recovery state. Both of these are protected by the
lnet_ni_lock().
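The split can be sketched as below (a minimal model mirroring the patch's `enum lnet_ni_state` plus `ni_recovery_state`; the `struct ni` wrapper is illustrative):

```c
#include <assert.h>

/* The FSM state is mutually exclusive, so it becomes an enum; the
 * recovery flags can be set independently of the state, so they live
 * in a separate bitmask. */
enum ni_state {
	NI_STATE_INIT = 0,	/* initial state when NI is created */
	NI_STATE_ACTIVE,	/* set when NI is brought up */
	NI_STATE_DELETING,	/* set when NI is being shutdown */
};

#define NI_RECOVERY_PENDING	(1U << 0)
#define NI_RECOVERY_FAILED	(1U << 1)

struct ni {
	enum ni_state	state;		/* exactly one state at a time */
	unsigned int	recovery;	/* zero or more recovery flags */
};
```

With the old single bitmask, transitions had to both set and clear bits (`ni_state |= ACTIVE; ni_state &= ~INIT;`); with the enum, a transition is a single assignment and impossible states cannot be expressed.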

WC-bug-id: https://jira.whamcloud.com/browse/LU-11514
Lustre-commit: 2be10428ac22 ("LU-11514 lnet: separate ni state from recovery")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33361
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-types.h | 24 ++++++++++++++++--------
 net/lnet/lnet/api-ni.c         |  8 +++-----
 net/lnet/lnet/config.c         |  2 +-
 net/lnet/lnet/lib-move.c       | 23 +++++++++++++----------
 4 files changed, 33 insertions(+), 24 deletions(-)

diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index ce0caa9..b1a6f6a 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -315,12 +315,17 @@ struct lnet_tx_queue {
 	struct list_head	tq_delayed;	/* delayed TXs */
 };
 
-#define LNET_NI_STATE_INIT		(1 << 0)
-#define LNET_NI_STATE_ACTIVE		(1 << 1)
-#define LNET_NI_STATE_FAILED		(1 << 2)
-#define LNET_NI_STATE_RECOVERY_PENDING	(1 << 3)
-#define LNET_NI_STATE_RECOVERY_FAILED	BIT(4)
-#define LNET_NI_STATE_DELETING		BIT(5)
+enum lnet_ni_state {
+	/* initial state when NI is created */
+	LNET_NI_STATE_INIT = 0,
+	/* set when NI is brought up */
+	LNET_NI_STATE_ACTIVE,
+	/* set when NI is being shutdown */
+	LNET_NI_STATE_DELETING,
+};
+
+#define LNET_NI_RECOVERY_PENDING	BIT(0)
+#define LNET_NI_RECOVERY_FAILED		BIT(1)
 
 enum lnet_stats_type {
 	LNET_STATS_TYPE_SEND	= 0,
@@ -435,8 +440,11 @@ struct lnet_ni {
 	/* my health status */
 	struct lnet_ni_status	*ni_status;
 
-	/* NI FSM */
-	u32			ni_state;
+	/* NI FSM. Protected by lnet_ni_lock() */
+	enum lnet_ni_state	ni_state;
+
+	/* Recovery state. Protected by lnet_ni_lock() */
+	u32			ni_recovery_state;
 
 	/* per NI LND tunables */
 	struct lnet_lnd_tunables ni_lnd_tunables;
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index c4f698d..25592db 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -1823,7 +1823,7 @@ static void lnet_push_target_fini(void)
 		list_del_init(&ni->ni_netlist);
 		/* the ni should be in deleting state. If it's not it's
 		 * a bug */
-		LASSERT(ni->ni_state & LNET_NI_STATE_DELETING);
+		LASSERT(ni->ni_state == LNET_NI_STATE_DELETING);
 		cfs_percpt_for_each(ref, j, ni->ni_refs) {
 			if (!*ref)
 				continue;
@@ -1871,8 +1871,7 @@ static void lnet_push_target_fini(void)
 
 	lnet_net_lock(LNET_LOCK_EX);
 	lnet_ni_lock(ni);
-	ni->ni_state |= LNET_NI_STATE_DELETING;
-	ni->ni_state &= ~LNET_NI_STATE_ACTIVE;
+	ni->ni_state = LNET_NI_STATE_DELETING;
 	lnet_ni_unlock(ni);
 	lnet_ni_unlink_locked(ni);
 	lnet_incr_dlc_seq();
@@ -2005,8 +2004,7 @@ static void lnet_push_target_fini(void)
 	}
 
 	lnet_ni_lock(ni);
-	ni->ni_state |= LNET_NI_STATE_ACTIVE;
-	ni->ni_state &= ~LNET_NI_STATE_INIT;
+	ni->ni_state = LNET_NI_STATE_ACTIVE;
 	lnet_ni_unlock(ni);
 
 	/* We keep a reference on the loopback net through the loopback NI */
diff --git a/net/lnet/lnet/config.c b/net/lnet/lnet/config.c
index ea62d36..5e0831a 100644
--- a/net/lnet/lnet/config.c
+++ b/net/lnet/lnet/config.c
@@ -467,7 +467,7 @@ struct lnet_net *
 		ni->ni_net_ns = NULL;
 
 	ni->ni_last_alive = ktime_get_real_seconds();
-	ni->ni_state |= LNET_NI_STATE_INIT;
+	ni->ni_state = LNET_NI_STATE_INIT;
 	list_add_tail(&ni->ni_netlist, &net->net_ni_added);
 
 	/*
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 434aa09..eacda4c 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -2651,7 +2651,8 @@ struct lnet_mt_event_info {
 
 	LNetInvalidateMDHandle(&recovery_mdh);
 
-	if (ni->ni_state & LNET_NI_STATE_RECOVERY_PENDING || force) {
+	if (ni->ni_recovery_state & LNET_NI_RECOVERY_PENDING ||
+	    force) {
 		recovery_mdh = ni->ni_ping_mdh;
 		LNetInvalidateMDHandle(&ni->ni_ping_mdh);
 	}
@@ -2702,7 +2703,7 @@ struct lnet_mt_event_info {
 
 		lnet_net_lock(0);
 		lnet_ni_lock(ni);
-		if (!(ni->ni_state & LNET_NI_STATE_ACTIVE) ||
+		if (ni->ni_state != LNET_NI_STATE_ACTIVE ||
 		    healthv == LNET_MAX_HEALTH_VALUE) {
 			list_del_init(&ni->ni_recovery);
 			lnet_unlink_ni_recovery_mdh_locked(ni, 0, false);
@@ -2716,9 +2717,9 @@ struct lnet_mt_event_info {
 		 * But we want to keep the local_ni on the recovery queue
 		 * so we can continue the attempts to recover it.
 		 */
-		if (ni->ni_state & LNET_NI_STATE_RECOVERY_FAILED) {
+		if (ni->ni_recovery_state & LNET_NI_RECOVERY_FAILED) {
 			lnet_unlink_ni_recovery_mdh_locked(ni, 0, true);
-			ni->ni_state &= ~LNET_NI_STATE_RECOVERY_FAILED;
+			ni->ni_recovery_state &= ~LNET_NI_RECOVERY_FAILED;
 		}
 
 		lnet_ni_unlock(ni);
@@ -2728,8 +2729,8 @@ struct lnet_mt_event_info {
 		       libcfs_nid2str(ni->ni_nid));
 
 		lnet_ni_lock(ni);
-		if (!(ni->ni_state & LNET_NI_STATE_RECOVERY_PENDING)) {
-			ni->ni_state |= LNET_NI_STATE_RECOVERY_PENDING;
+		if (!(ni->ni_recovery_state & LNET_NI_RECOVERY_PENDING)) {
+			ni->ni_recovery_state |= LNET_NI_RECOVERY_PENDING;
 			lnet_ni_unlock(ni);
 
 			ev_info = kzalloc(sizeof(*ev_info), GFP_NOFS);
@@ -2737,7 +2738,8 @@ struct lnet_mt_event_info {
 				CERROR("out of memory. Can't recover %s\n",
 				       libcfs_nid2str(ni->ni_nid));
 				lnet_ni_lock(ni);
-				ni->ni_state &= ~LNET_NI_STATE_RECOVERY_PENDING;
+				ni->ni_recovery_state &=
+				  ~LNET_NI_RECOVERY_PENDING;
 				lnet_ni_unlock(ni);
 				continue;
 			}
@@ -2806,7 +2808,8 @@ struct lnet_mt_event_info {
 
 			lnet_ni_lock(ni);
 			if (rc)
-				ni->ni_state &= ~LNET_NI_STATE_RECOVERY_PENDING;
+				ni->ni_recovery_state &=
+					~LNET_NI_RECOVERY_PENDING;
 		}
 		lnet_ni_unlock(ni);
 	}
@@ -3210,9 +3213,9 @@ struct lnet_mt_event_info {
 			return;
 		}
 		lnet_ni_lock(ni);
-		ni->ni_state &= ~LNET_NI_STATE_RECOVERY_PENDING;
+		ni->ni_recovery_state &= ~LNET_NI_RECOVERY_PENDING;
 		if (status)
-			ni->ni_state |= LNET_NI_STATE_RECOVERY_FAILED;
+			ni->ni_recovery_state |= LNET_NI_RECOVERY_FAILED;
 		lnet_ni_unlock(ni);
 		lnet_net_unlock(0);
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 160/622] lustre: mdc: move empty xattr handling to mdc layer
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (158 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 159/622] lnet: separate ni state from recovery James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 161/622] lustre: obd: remove portals handle from OBD import James Simmons
                   ` (462 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: "John L. Hammond" <jhammond@whamcloud.com>

Extract duplicated logic around empty xattr handling from several
places in llite and consolidate it in mdc_getxattr().
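
The consolidated return convention can be illustrated with a small userspace sketch (the helper below is hypothetical; only the rc semantics — negative errno on failure, xattr value size on success, plus the LU-11109 empty-vs-missing distinction — come from the patch):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the consolidated mdc_getxattr() return
 * convention: rc < 0 is an errno, rc >= 0 is the xattr value size.
 * has_flxattr models OBD_MD_FLXATTR being set by a newer MDT, so an
 * existing-but-empty xattr can be told apart from a missing one. */
static int model_getxattr(const char *value, int has_flxattr, int is_getxattr)
{
	if (value == NULL || (*value == '\0' && !has_flxattr)) {
		/* Older MDTs cannot distinguish empty from missing;
		 * -ENODATA only makes sense for getxattr(), not for
		 * listxattr(). */
		if (is_getxattr)
			return -ENODATA;
		return 0;
	}
	return (int)strlen(value);
}
```

With this convention a caller like ll_xattr_list() only needs to check rc < 0 and otherwise use rc as the size, which is what the simplified llite code in the diff does.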

WC-bug-id: https://jira.whamcloud.com/browse/LU-11380
Lustre-commit: 0f42b388432c ("LU-11380 mdc: move empty xattr handling to mdc layer")
Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33198
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Emoly Liu <emoly@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c      | 16 +++++--------
 fs/lustre/llite/xattr.c     | 44 ++++------------------------------
 fs/lustre/mdc/mdc_request.c | 57 ++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 65 insertions(+), 52 deletions(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index e1fba1c..246d5de 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -4391,7 +4391,6 @@ static int ll_layout_fetch(struct inode *inode, struct ldlm_lock *lock)
 {
 	struct ll_sb_info *sbi = ll_i2sbi(inode);
 	struct ptlrpc_request *req;
-	struct mdt_body *body;
 	void *lvbdata;
 	void *lmm;
 	int lmmsize;
@@ -4411,19 +4410,16 @@ static int ll_layout_fetch(struct inode *inode, struct ldlm_lock *lock)
 	 * completion AST because it doesn't have a large enough buffer
 	 */
 	rc = ll_get_default_mdsize(sbi, &lmmsize);
-	if (rc == 0)
-		rc = md_getxattr(sbi->ll_md_exp, ll_inode2fid(inode),
-				 OBD_MD_FLXATTR, XATTR_NAME_LOV, lmmsize, &req);
 	if (rc < 0)
 		return rc;
 
-	body = req_capsule_server_get(&req->rq_pill, &RMF_MDT_BODY);
-	if (!body) {
-		rc = -EPROTO;
-		goto out;
-	}
+	rc = md_getxattr(sbi->ll_md_exp, ll_inode2fid(inode), OBD_MD_FLXATTR,
+			 XATTR_NAME_LOV, lmmsize, &req);
+	if (rc < 0)
+		return rc;
 
-	lmmsize = body->mbo_eadatasize;
+	lmmsize = rc;
+	rc = 0;
 	if (lmmsize == 0) /* empty layout */ {
 		rc = 0;
 		goto out;
diff --git a/fs/lustre/llite/xattr.c b/fs/lustre/llite/xattr.c
index 636334e..948aaf6 100644
--- a/fs/lustre/llite/xattr.c
+++ b/fs/lustre/llite/xattr.c
@@ -326,7 +326,6 @@ int ll_xattr_list(struct inode *inode, const char *name, int type, void *buffer,
 	struct ll_inode_info *lli = ll_i2info(inode);
 	struct ll_sb_info *sbi = ll_i2sbi(inode);
 	struct ptlrpc_request *req = NULL;
-	struct mdt_body *body;
 	void *xdata;
 	int rc;
 
@@ -358,57 +357,24 @@ int ll_xattr_list(struct inode *inode, const char *name, int type, void *buffer,
 		if (rc < 0)
 			goto out_xattr;
 
-		body = req_capsule_server_get(&req->rq_pill, &RMF_MDT_BODY);
-		LASSERT(body);
-
 		/* only detect the xattr size */
-		if (size == 0) {
-			/* LU-11109: Older MDTs do not distinguish
-			 * between nonexistent xattrs and zero length
-			 * values in this case. Newer MDTs will return
-			 * -ENODATA or set OBD_MD_FLXATTR.
-			 */
-			rc = body->mbo_eadatasize;
+		if (size == 0)
 			goto out;
-		}
 
-		if (size < body->mbo_eadatasize) {
-			CERROR("server bug: replied size %u > %u\n",
-			       body->mbo_eadatasize, (int)size);
+		if (size < rc) {
 			rc = -ERANGE;
 			goto out;
 		}
 
-		if (body->mbo_eadatasize == 0) {
-			/* LU-11109: Newer MDTs set OBD_MD_FLXATTR on
-			 * success so that we can distinguish between
-			 * zero length value and nonexistent xattr.
-			 *
-			 * If OBD_MD_FLXATTR is not set then we keep
-			 * the old behavior and return -ENODATA for
-			 * getxattr() when mbo_eadatasize is 0. But
-			 * -ENODATA only makes sense for getxattr()
-			 * and not for listxattr().
-			 */
-			if (body->mbo_valid & OBD_MD_FLXATTR)
-				rc = 0;
-			else if (valid == OBD_MD_FLXATTR)
-				rc = -ENODATA;
-			else
-				rc = 0;
-			goto out;
-		}
-
 		/* do not need swab xattr data */
 		xdata = req_capsule_server_sized_get(&req->rq_pill, &RMF_EADATA,
-						     body->mbo_eadatasize);
+						     rc);
 		if (!xdata) {
-			rc = -EFAULT;
+			rc = -EPROTO;
 			goto out;
 		}
 
-		memcpy(buffer, xdata, body->mbo_eadatasize);
-		rc = body->mbo_eadatasize;
+		memcpy(buffer, xdata, rc);
 	}
 
 out_xattr:
diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index 5cc1e1f..6934e57 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -432,12 +432,63 @@ static int mdc_getxattr(struct obd_export *exp, const struct lu_fid *fid,
 			u64 obd_md_valid, const char *name, size_t buf_size,
 			struct ptlrpc_request **req)
 {
+	struct mdt_body *body;
+	int rc;
+
 	LASSERT(obd_md_valid == OBD_MD_FLXATTR ||
 		obd_md_valid == OBD_MD_FLXATTRLS);
 
-	return mdc_xattr_common(exp, &RQF_MDS_GETXATTR, fid, MDS_GETXATTR,
-				obd_md_valid, name, NULL, 0, buf_size, 0, -1,
-				req);
+	rc = mdc_xattr_common(exp, &RQF_MDS_GETXATTR, fid, MDS_GETXATTR,
+			      obd_md_valid, name, NULL, 0, buf_size, 0, -1,
+			      req);
+	if (rc < 0)
+		goto out;
+
+	body = req_capsule_server_get(&(*req)->rq_pill, &RMF_MDT_BODY);
+	if (!body) {
+		rc = -EPROTO;
+		goto out;
+	}
+
+	/* only detect the xattr size */
+	if (buf_size == 0) {
+		/* LU-11109: Older MDTs do not distinguish
+		 * between nonexistent xattrs and zero length
+		 * values in this case. Newer MDTs will return
+		 * -ENODATA or set OBD_MD_FLXATTR.
+		 */
+		rc = body->mbo_eadatasize;
+		goto out;
+	}
+
+	if (body->mbo_eadatasize == 0) {
+		/* LU-11109: Newer MDTs set OBD_MD_FLXATTR on
+		 * success so that we can distinguish between
+		 * zero length value and nonexistent xattr.
+		 *
+		 * If OBD_MD_FLXATTR is not set then we keep
+		 * the old behavior and return -ENODATA for
+		 * getxattr() when mbo_eadatasize is 0. But
+		 * -ENODATA only makes sense for getxattr()
+		 * and not for listxattr().
+		 */
+		if (body->mbo_valid & OBD_MD_FLXATTR)
+			rc = 0;
+		else if (obd_md_valid == OBD_MD_FLXATTR)
+			rc = -ENODATA;
+		else
+			rc = 0;
+		goto out;
+	}
+
+	rc = body->mbo_eadatasize;
+out:
+	if (rc < 0) {
+		ptlrpc_req_finished(*req);
+		*req = NULL;
+	}
+
+	return rc;
 }
 
 #ifdef CONFIG_LUSTRE_FS_POSIX_ACL
-- 
1.8.3.1


* [lustre-devel] [PATCH 161/622] lustre: obd: remove portals handle from OBD import
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (159 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 160/622] lustre: mdc: move empty xattr handling to mdc layer James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 162/622] lustre: mgc: restore mgc binding for sptlrpc James Simmons
                   ` (461 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: "John L. Hammond" <jhammond@whamcloud.com>

OBD imports are never looked up using the portals handle (imp_handle)
they contain, so remove it. Also remove the unused functions
class_conn2obd() and class_conn2cliimp().

WC-bug-id: https://jira.whamcloud.com/browse/LU-11445
Lustre-commit: 59729e4c0867 ("LU-11445 obd: remove portals handle from OBD import")
Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33250
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_import.h | 10 +++++++---
 fs/lustre/obdclass/genops.c       | 21 +++------------------
 2 files changed, 10 insertions(+), 21 deletions(-)

diff --git a/fs/lustre/include/lustre_import.h b/fs/lustre/include/lustre_import.h
index 1fd6246..f16d621 100644
--- a/fs/lustre/include/lustre_import.h
+++ b/fs/lustre/include/lustre_import.h
@@ -43,9 +43,15 @@
  *
  * @{
  */
+#include <linux/atomic.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/spinlock.h>
+#include <linux/time.h>
+#include <linux/types.h>
+#include <linux/workqueue.h>
 
 #include <linux/libcfs/libcfs.h>
-#include <lustre_handles.h>
 #include <uapi/linux/lustre/lustre_idl.h>
 
 /**
@@ -154,8 +160,6 @@ struct import_state_hist {
  * Imports are representing client-side view to remote target.
  */
 struct obd_import {
-	/** Local handle (== id) for this import. */
-	struct portals_handle		imp_handle;
 	/** Reference counter */
 	atomic_t			imp_refcount;
 	struct lustre_handle		imp_dlm_handle; /* client's ldlm export */
diff --git a/fs/lustre/obdclass/genops.c b/fs/lustre/obdclass/genops.c
index 4465dd9..2254943 100644
--- a/fs/lustre/obdclass/genops.c
+++ b/fs/lustre/obdclass/genops.c
@@ -863,7 +863,6 @@ static struct obd_export *__class_new_export(struct obd_device *obd,
 
 exit_unlock:
 	spin_unlock(&obd->obd_dev_lock);
-	class_handle_unhash(&export->exp_handle);
 	obd_destroy_export(export);
 	kfree(export);
 	return ERR_PTR(rc);
@@ -903,7 +902,7 @@ void class_unlink_export(struct obd_export *exp)
 }
 
 /* Import management functions */
-static void class_import_destroy(struct obd_import *imp)
+static void obd_zombie_import_free(struct obd_import *imp)
 {
 	struct obd_import_conn *imp_conn;
 
@@ -924,19 +923,9 @@ static void class_import_destroy(struct obd_import *imp)
 
 	LASSERT(!imp->imp_sec);
 	class_decref(imp->imp_obd, "import", imp);
-	OBD_FREE_RCU(imp, sizeof(*imp), &imp->imp_handle);
+	kfree(imp);
 }
 
-static void import_handle_addref(void *import)
-{
-	class_import_get(import);
-}
-
-static struct portals_handle_ops import_handle_ops = {
-	.hop_addref	= import_handle_addref,
-	.hop_free	= NULL,
-};
-
 struct obd_import *class_import_get(struct obd_import *import)
 {
 	atomic_inc(&import->imp_refcount);
@@ -985,7 +974,7 @@ static void obd_zombie_imp_cull(struct work_struct *ws)
 	struct obd_import *import = container_of(ws, struct obd_import,
 						 imp_zombie_work);
 
-	class_import_destroy(import);
+	obd_zombie_import_free(import);
 }
 
 struct obd_import *class_new_import(struct obd_device *obd)
@@ -1018,8 +1007,6 @@ struct obd_import *class_new_import(struct obd_device *obd)
 	atomic_set(&imp->imp_replay_inflight, 0);
 	atomic_set(&imp->imp_inval_count, 0);
 	INIT_LIST_HEAD(&imp->imp_conn_list);
-	INIT_LIST_HEAD_RCU(&imp->imp_handle.h_link);
-	class_handle_hash(&imp->imp_handle, &import_handle_ops);
 	init_imp_at(&imp->imp_at);
 
 	/* the default magic is V2, will be used in connect RPC, and
@@ -1036,8 +1023,6 @@ void class_destroy_import(struct obd_import *import)
 	LASSERT(import);
 	LASSERT(import != LP_POISON);
 
-	class_handle_unhash(&import->imp_handle);
-
 	spin_lock(&import->imp_lock);
 	import->imp_generation++;
 	spin_unlock(&import->imp_lock);
-- 
1.8.3.1


* [lustre-devel] [PATCH 162/622] lustre: mgc: restore mgc binding for sptlrpc
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (160 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 161/622] lustre: obd: remove portals handle from OBD import James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 163/622] lnet: peer deletion code may hide error James Simmons
                   ` (460 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

The work for LU-9034 mapped config logs to separate mgc devices.
This change prevented the ability to configure sptlrpc. A
workaround was later introduced in LU-9567. Recently it was
reported that this workaround can cause an MGC failover
panic. This patch is the proper fix: the sptlrpc config is
now properly bound to an MGC device.

The sptlrpc config record expects 2 pieces of data:

  *  [0]: fs_name/target_name,
  *  [1]: rule string

When cfg_instance was set, it was used to create a new
instance name of the form fsname-%p, but sptlrpc expects the
bare fsname. The solution is to test whether the config
record is for sptlrpc and, in that case, keep the first
record field as is. With this change we can drop
cfg_obdname, which only sptlrpc used.
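
The naming rule can be sketched in plain C (the helper and the LCFG_SPTLRPC_CONF value here are illustrative, not the kernel's actual definitions; only the "fsname-%p vs. bare fsname" behavior comes from the patch):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LCFG_SPTLRPC_CONF 1	/* value illustrative only */

/* Hypothetical sketch of the first-record-field naming: ordinary
 * client records get "fsname-<instance>", while an sptlrpc record
 * must keep the bare fsname so the sptlrpc rule lookup works. */
static void cfg_record_name(char *buf, size_t len, const char *fsname,
			    const void *cfg_instance, int lcfg_command)
{
	if (cfg_instance && lcfg_command != LCFG_SPTLRPC_CONF)
		snprintf(buf, len, "%s-%p", fsname, cfg_instance);
	else
		snprintf(buf, len, "%s", fsname);
}
```

This mirrors the extra `lcfg->lcfg_command != LCFG_SPTLRPC_CONF` test added to class_config_llog_handler() in the diff.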

WC-bug-id: https://jira.whamcloud.com/browse/LU-10937
Lustre-commit: ca9300e53dc2 ("LU-10937 mgc: restore mgc binding for sptlrpc")
Signed-off-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-on: https://review.whamcloud.com/33311
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_class.h   | 1 -
 fs/lustre/mgc/mgc_request.c     | 7 +------
 fs/lustre/obdclass/obd_config.c | 5 ++++-
 3 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/fs/lustre/include/obd_class.h b/fs/lustre/include/obd_class.h
index 742e92a..434bb79 100644
--- a/fs/lustre/include/obd_class.h
+++ b/fs/lustre/include/obd_class.h
@@ -166,7 +166,6 @@ int class_config_llog_handler(const struct lu_env *env,
 
 /* Passed as data param to class_config_parse_llog */
 struct config_llog_instance {
-	char		       *cfg_obdname;
 	void		       *cfg_instance;
 	struct super_block     *cfg_sb;
 	struct obd_uuid		cfg_uuid;
diff --git a/fs/lustre/mgc/mgc_request.c b/fs/lustre/mgc/mgc_request.c
index 785461b..5bfa1b7 100644
--- a/fs/lustre/mgc/mgc_request.c
+++ b/fs/lustre/mgc/mgc_request.c
@@ -224,10 +224,8 @@ struct config_llog_data *do_config_log_add(struct obd_device *obd,
 	/* Keep the mgc around until we are done */
 	cld->cld_mgcexp = class_export_get(obd->obd_self_export);
 
-	if (cld_is_sptlrpc(cld)) {
+	if (cld_is_sptlrpc(cld))
 		sptlrpc_conf_log_start(logname);
-		cld->cld_cfg.cfg_obdname = obd->obd_name;
-	}
 
 	spin_lock(&config_list_lock);
 	list_add(&cld->cld_list_chain, &config_llog_list);
@@ -273,9 +271,6 @@ struct config_llog_data *do_config_log_add(struct obd_device *obd,
 
 	lcfg.cfg_instance = sb ? (void *)sb : (void *)obd;
 
-	if (type == CONFIG_T_SPTLRPC)
-		lcfg.cfg_instance = NULL;
-
 	cld = config_log_find(logname, &lcfg);
 	if (unlikely(cld))
 		return cld;
diff --git a/fs/lustre/obdclass/obd_config.c b/fs/lustre/obdclass/obd_config.c
index 550cee0..398f888 100644
--- a/fs/lustre/obdclass/obd_config.c
+++ b/fs/lustre/obdclass/obd_config.c
@@ -1357,6 +1357,7 @@ int class_config_llog_handler(const struct lu_env *env,
 		lustre_cfg_bufs_init(&bufs, lcfg);
 
 		if (clli && clli->cfg_instance &&
+		    lcfg->lcfg_command != LCFG_SPTLRPC_CONF &&
 		    LUSTRE_CFG_BUFLEN(lcfg, 0) > 0) {
 			inst_len = LUSTRE_CFG_BUFLEN(lcfg, 0) +
 				   sizeof(clli->cfg_instance) * 2 + 4;
@@ -1389,12 +1390,14 @@ int class_config_llog_handler(const struct lu_env *env,
 		 */
 		if (clli && !clli->cfg_instance &&
 		    lcfg->lcfg_command == LCFG_SPTLRPC_CONF) {
+			struct obd_device *obd = clli->cfg_instance;
+
 			lustre_cfg_bufs_set(&bufs, 2, bufs.lcfg_buf[1],
 					    bufs.lcfg_buflen[1]);
 			lustre_cfg_bufs_set(&bufs, 1, bufs.lcfg_buf[0],
 					    bufs.lcfg_buflen[0]);
 			lustre_cfg_bufs_set_string(&bufs, 0,
-						   clli->cfg_obdname);
+						   obd->obd_name);
 		}
 
 		/* Add net info to setup command
-- 
1.8.3.1


* [lustre-devel] [PATCH 163/622] lnet: peer deletion code may hide error
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (161 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 162/622] lustre: mgc: restore mgc binding for sptlrpc James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 164/622] lustre: hsm: make changelog flag argument an enum James Simmons
                   ` (459 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Sonia Sharma <sharmaso@whamcloud.com>

lnet_peer_ni_del_locked might return -EBUSY if the
NID to be deleted is a gateway.

Check for the return value of lnet_peer_ni_del_locked
in lnet_peer_del_nid.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10876
Lustre-commit: a3b6109705dc ("LU-10876 lnet: peer deletion code may hide error")
Signed-off-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/31861
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/peer.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 2fc5dfc..24a5cd3 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -494,7 +494,9 @@ void lnet_peer_uninit(void)
 	}
 
 	lnet_net_lock(LNET_LOCK_EX);
-	lnet_peer_ni_del_locked(lpni);
+
+	rc = lnet_peer_ni_del_locked(lpni);
+
 	lnet_net_unlock(LNET_LOCK_EX);
 
 out:
-- 
1.8.3.1


* [lustre-devel] [PATCH 164/622] lustre: hsm: make changelog flag argument an enum
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (162 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 163/622] lnet: peer deletion code may hide error James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 165/622] lustre: ldlm: don't skip bl_ast for local lock James Simmons
                   ` (458 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Since the changelog record flag is being stored on disk, pass it
around as an enum instead of a signed int.  Also make it clear at
the caller that only the low 12 bits of the flag are normally
being stored in the changelog records, since this isn't obvious
to the reader. For open and close records, the bottom 32 bits
of open flags are recorded.
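
The bit split described above can be sketched as follows (the enum values and masks are copied from the patch's lustre_user.h hunk; the extraction helper itself is illustrative):

```c
#include <assert.h>

/* Low 12 bits carry per-record data; the upper bits carry
 * record-format flags, as in the patched lustre_user.h. */
#define CLF_FLAGSHIFT	12

enum changelog_rec_flags {
	CLF_VERSION	= 0x1000,
	CLF_RENAME	= 0x2000,
	CLF_JOBID	= 0x4000,
	CLF_EXTRA_FLAGS	= 0x8000,
	CLF_FLAGMASK	= (1U << CLF_FLAGSHIFT) - 1,
	CLF_VERMASK	= ~CLF_FLAGMASK,
};

/* Illustrative helper: extract the 12 bits of per-record data. */
static unsigned int clf_record_data(enum changelog_rec_flags flags)
{
	return flags & CLF_FLAGMASK;
}
```

Passing the flags around as `enum changelog_rec_flags` instead of a signed int makes this layout visible at every call site.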

WC-bug-id: https://jira.whamcloud.com/browse/LU-10030
Lustre-commit: 2496089a0017 ("LU-10030 hsm: make changelog flag argument an enum")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32112
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/uapi/linux/lustre/lustre_user.h | 34 ++++++++++++++++++---------------
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 5551cbf..3bd6fc7 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -1019,16 +1019,17 @@ static inline const char *changelog_type2str(int type)
 	return NULL;
 }
 
-/* per-record flags */
+/* 12 bits of per-record data can be stored in the bottom of the flags */
 #define CLF_FLAGSHIFT   12
-#define CLF_FLAGMASK    ((1U << CLF_FLAGSHIFT) - 1)
-#define CLF_VERMASK     (~CLF_FLAGMASK)
 enum changelog_rec_flags {
 	CLF_VERSION	= 0x1000,
 	CLF_RENAME	= 0x2000,
 	CLF_JOBID	= 0x4000,
 	CLF_EXTRA_FLAGS = 0x8000,
-	CLF_SUPPORTED	= CLF_VERSION | CLF_RENAME | CLF_JOBID | CLF_EXTRA_FLAGS
+	CLF_SUPPORTED	= CLF_VERSION | CLF_RENAME | CLF_JOBID |
+			  CLF_EXTRA_FLAGS,
+	CLF_FLAGMASK	= (1U << CLF_FLAGSHIFT) - 1,
+	CLF_VERMASK	= ~CLF_FLAGMASK,
 };
 
 /* Anything under the flagmask may be per-type (if desired) */
@@ -1089,29 +1090,32 @@ static inline enum hsm_event hsm_get_cl_event(__u16 flags)
 	return CLF_GET_BITS(flags, CLF_HSM_EVENT_H, CLF_HSM_EVENT_L);
 }
 
-static inline void hsm_set_cl_event(int *flags, enum hsm_event he)
+static inline void hsm_set_cl_event(enum changelog_rec_flags *clf_flags,
+				    enum hsm_event he)
 {
-	*flags |= (he << CLF_HSM_EVENT_L);
+	*clf_flags |= (he << CLF_HSM_EVENT_L);
 }
 
-static inline __u16 hsm_get_cl_flags(int flags)
+static inline __u16 hsm_get_cl_flags(enum changelog_rec_flags clf_flags)
 {
-	return CLF_GET_BITS(flags, CLF_HSM_FLAG_H, CLF_HSM_FLAG_L);
+	return CLF_GET_BITS(clf_flags, CLF_HSM_FLAG_H, CLF_HSM_FLAG_L);
 }
 
-static inline void hsm_set_cl_flags(int *flags, int bits)
+static inline void hsm_set_cl_flags(enum changelog_rec_flags *clf_flags,
+				    unsigned int bits)
 {
-	*flags |= (bits << CLF_HSM_FLAG_L);
+	*clf_flags |= (bits << CLF_HSM_FLAG_L);
 }
 
-static inline int hsm_get_cl_error(int flags)
+static inline int hsm_get_cl_error(enum changelog_rec_flags clf_flags)
 {
-	return CLF_GET_BITS(flags, CLF_HSM_ERR_H, CLF_HSM_ERR_L);
+	return CLF_GET_BITS(clf_flags, CLF_HSM_ERR_H, CLF_HSM_ERR_L);
 }
 
-static inline void hsm_set_cl_error(int *flags, int error)
+static inline void hsm_set_cl_error(enum changelog_rec_flags *clf_flags,
+				    unsigned int error)
 {
-	*flags |= (error << CLF_HSM_ERR_L);
+	*clf_flags |= (error << CLF_HSM_ERR_L);
 }
 
 enum changelog_rec_extra_flags {
@@ -1198,7 +1202,7 @@ struct changelog_ext_nid {
 	__u32 padding;
 };
 
-/* Changelog extra extension to include OPEN mode. */
+/* Changelog extra extension to include low 32 bits of MDS_OPEN_* flags. */
 struct changelog_ext_openmode {
 	__u32 cr_openflags;
 };
-- 
1.8.3.1


* [lustre-devel] [PATCH 165/622] lustre: ldlm: don't skip bl_ast for local lock
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (163 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 164/622] lustre: hsm: make changelog flag argument an enum James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 166/622] lustre: clio: use pagevec_release for many pages James Simmons
                   ` (457 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

During downgrade to COS the lock renews its own blocking AST
state and starts reprocessing. Any new lock conflict will
cause a new blocking AST and a related async commit as needed.

For the Linux client we can remove server-specific code.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11102
Lustre-commit: 75a417fa0065 ("LU-11102 ldlm: don't skip bl_ast for local lock")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33458
Reviewed-by: Vitaly Fertman <c17818@cray.com>
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_lock.c | 130 ++++-----------------------------------------
 1 file changed, 9 insertions(+), 121 deletions(-)

diff --git a/fs/lustre/ldlm/ldlm_lock.c b/fs/lustre/ldlm/ldlm_lock.c
index 869d664..b9771ef 100644
--- a/fs/lustre/ldlm/ldlm_lock.c
+++ b/fs/lustre/ldlm/ldlm_lock.c
@@ -595,9 +595,15 @@ static void ldlm_add_bl_work_item(struct ldlm_lock *lock, struct ldlm_lock *new,
 		 */
 		if (ldlm_is_ast_discard_data(new))
 			ldlm_set_discard_data(lock);
-		LASSERT(list_empty(&lock->l_bl_ast));
-		list_add(&lock->l_bl_ast, work_list);
-		LDLM_LOCK_GET(lock);
+		/* Lock can be converted from a blocking state back to granted
+		 * after lock convert or COS downgrade but still be in an
+		 * older bl_list because it is controlled only by
+		 * ldlm_work_bl_ast_lock(), let it be processed there.
+		 */
+		if (list_empty(&lock->l_bl_ast)) {
+			list_add(&lock->l_bl_ast, work_list);
+			LDLM_LOCK_GET(lock);
+		}
 		LASSERT(!lock->l_blocking_lock);
 		lock->l_blocking_lock = LDLM_LOCK_GET(new);
 	}
@@ -1624,47 +1630,6 @@ enum ldlm_error ldlm_lock_enqueue(const struct lu_env *env,
 }
 
 /**
- * Process a call to blocking AST callback for a lock in ast_work list
- */
-static int
-ldlm_work_bl_ast_lock(struct ptlrpc_request_set *rqset, void *opaq)
-{
-	struct ldlm_cb_set_arg *arg = opaq;
-	struct ldlm_lock_desc d;
-	int rc;
-	struct ldlm_lock *lock;
-
-	if (list_empty(arg->list))
-		return -ENOENT;
-
-	lock = list_first_entry(arg->list, struct ldlm_lock, l_bl_ast);
-
-	LASSERT(lock->l_blocking_lock);
-	ldlm_lock2desc(lock->l_blocking_lock, &d);
-	/* copy blocking lock ibits in cancel_bits as well,
-	 * new client may use them for lock convert and it is
-	 * important to use new field to convert locks from
-	 * new servers only
-	 */
-	d.l_policy_data.l_inodebits.cancel_bits =
-		lock->l_blocking_lock->l_policy_data.l_inodebits.bits;
-
-	/* nobody should touch l_bl_ast */
-	lock_res_and_lock(lock);
-	list_del_init(&lock->l_bl_ast);
-
-	LASSERT(ldlm_is_ast_sent(lock));
-	LASSERT(lock->l_bl_ast_run == 0);
-	lock->l_bl_ast_run++;
-	unlock_res_and_lock(lock);
-
-	rc = lock->l_blocking_ast(lock, &d, (void *)arg, LDLM_CB_BLOCKING);
-	LDLM_LOCK_RELEASE(lock);
-
-	return rc;
-}
-
-/**
  * Process a call to completion AST callback for a lock in ast_work list
  */
 static int
@@ -1711,71 +1676,6 @@ enum ldlm_error ldlm_lock_enqueue(const struct lu_env *env,
 }
 
 /**
- * Process a call to revocation AST callback for a lock in ast_work list
- */
-static int
-ldlm_work_revoke_ast_lock(struct ptlrpc_request_set *rqset, void *opaq)
-{
-	struct ldlm_cb_set_arg *arg = opaq;
-	struct ldlm_lock_desc desc;
-	int rc;
-	struct ldlm_lock *lock;
-
-	if (list_empty(arg->list))
-		return -ENOENT;
-
-	lock = list_first_entry(arg->list, struct ldlm_lock, l_rk_ast);
-	list_del_init(&lock->l_rk_ast);
-
-	/* the desc just pretend to exclusive */
-	ldlm_lock2desc(lock, &desc);
-	desc.l_req_mode = LCK_EX;
-	desc.l_granted_mode = 0;
-
-	rc = lock->l_blocking_ast(lock, &desc, (void *)arg, LDLM_CB_BLOCKING);
-	LDLM_LOCK_RELEASE(lock);
-
-	return rc;
-}
-
-/**
- * Process a call to glimpse AST callback for a lock in ast_work list
- */
-static int ldlm_work_gl_ast_lock(struct ptlrpc_request_set *rqset, void *opaq)
-{
-	struct ldlm_cb_set_arg *arg = opaq;
-	struct ldlm_glimpse_work *gl_work;
-	struct ldlm_lock *lock;
-	int rc = 0;
-
-	if (list_empty(arg->list))
-		return -ENOENT;
-
-	gl_work = list_first_entry(arg->list, struct ldlm_glimpse_work,
-				   gl_list);
-	list_del_init(&gl_work->gl_list);
-
-	lock = gl_work->gl_lock;
-
-	/* transfer the glimpse descriptor to ldlm_cb_set_arg */
-	arg->gl_desc = gl_work->gl_desc;
-
-	/* invoke the actual glimpse callback */
-	if (lock->l_glimpse_ast(lock, (void *)arg) == 0)
-		rc = 1;
-
-	LDLM_LOCK_RELEASE(lock);
-
-	if (gl_work->gl_flags & LDLM_GL_WORK_SLAB_ALLOCATED)
-		kmem_cache_free(ldlm_glimpse_work_kmem, gl_work);
-	else
-		kfree(gl_work);
-	gl_work = NULL;
-
-	return rc;
-}
-
-/**
  * Process list of locks in need of ASTs being sent.
  *
  * Used on server to send multiple ASTs together instead of sending one by
@@ -1799,22 +1699,10 @@ int ldlm_run_ast_work(struct ldlm_namespace *ns, struct list_head *rpc_list,
 	arg->list = rpc_list;
 
 	switch (ast_type) {
-	case LDLM_WORK_BL_AST:
-		arg->type = LDLM_BL_CALLBACK;
-		work_ast_lock = ldlm_work_bl_ast_lock;
-		break;
 	case LDLM_WORK_CP_AST:
 		arg->type = LDLM_CP_CALLBACK;
 		work_ast_lock = ldlm_work_cp_ast_lock;
 		break;
-	case LDLM_WORK_REVOKE_AST:
-		arg->type = LDLM_BL_CALLBACK;
-		work_ast_lock = ldlm_work_revoke_ast_lock;
-		break;
-	case LDLM_WORK_GL_AST:
-		arg->type = LDLM_GL_CALLBACK;
-		work_ast_lock = ldlm_work_gl_ast_lock;
-		break;
 	default:
 		LBUG();
 	}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 166/622] lustre: clio: use pagevec_release for many pages
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (164 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 165/622] lustre: ldlm: don't skip bl_ast for local lock James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 167/622] lustre: lmv: allocate fid on parent MDT in migrate James Simmons
                   ` (456 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Li Dongyang <dongyangli@ddn.com>

When Lustre releases cached pages, it always uses
page_release, even when releasing many pages.

When clearing OST ldlm lock lrus in parallel with lots of
cached data, the ldlm_bl threads spend most of their time
contending for the zone lock taken by page_release.
Also, when osc_lru_reclaim kicks in because there are not
enough LRU slots during I/O, the contention on the zone lock kills
I/O performance.

Switching to pagevec when we expect to actually release the
pages (discard_pages, truncate, lru reclaim) brings
significant performance benefits as shown below.

This patch introduces cl_pagevec_put() to release pages
in batches using a pagevec, which essentially calls
release_pages().

  mpirun -np 48 ior -w -r -t 16m -b 16g -F -e -vv -o ... -i 1 [-B]

                mode         write (GB/s)    read (GB/s)
  master        O_DIRECT     20.8            21.8
  master+patch  O_DIRECT     20.7            22.2
  master        Buffered     11.6            12.3
  master+patch  Buffered     15.3            19.6

Also clean up the dead lovsub_page related code.
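The batching pattern the patch applies can be sketched in plain userspace C.
Everything below (the _sk types and names, PVEC_SIZE, demo_batched_put) is a
hypothetical stand-in for the kernel's struct pagevec machinery, not Lustre
code; it only illustrates why batching cuts lock traffic: the expensive
release work is paid once per full pagevec instead of once per page.

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel's pagevec machinery. */
#define PVEC_SIZE 15	/* mirrors PAGEVEC_SIZE of that kernel era */

struct page_sk { int refcount; };

struct pagevec_sk {
	unsigned int nr;
	struct page_sk *pages[PVEC_SIZE];
};

static int batched_releases;	/* how often the "zone lock" cost was paid */

/* Models release_pages(): the per-zone work happens once per batch. */
static void release_pages_sk(struct page_sk **pages, unsigned int nr)
{
	unsigned int i;

	for (i = 0; i < nr; i++)
		pages[i]->refcount--;
	batched_releases++;
}

static void pagevec_release_sk(struct pagevec_sk *pvec)
{
	if (pvec->nr) {
		release_pages_sk(pvec->pages, pvec->nr);
		pvec->nr = 0;
	}
}

/* Models pagevec_add(): returns space left, 0 when the vector fills. */
static unsigned int pagevec_add_sk(struct pagevec_sk *pvec, struct page_sk *p)
{
	pvec->pages[pvec->nr++] = p;
	return PVEC_SIZE - pvec->nr;
}

/* The cl_pagevec_put()/vvp_page_fini_common() shape from the patch:
 * queue the page, flush only when the pagevec fills up. */
static void put_page_batched(struct pagevec_sk *pvec, struct page_sk *p)
{
	if (!pagevec_add_sk(pvec, p))
		pagevec_release_sk(pvec);
}

/* Release 40 pages: with PVEC_SIZE == 15 the vector flushes twice while
 * filling and once at the end, so 3 batched releases instead of 40. */
static int demo_batched_put(void)
{
	struct pagevec_sk pvec = { 0 };
	struct page_sk pages[40] = { { 1 } };
	int i;

	batched_releases = 0;
	for (i = 0; i < 40; i++)
		put_page_batched(&pvec, &pages[i]);
	pagevec_release_sk(&pvec);	/* final flush of trailing pages */
	return batched_releases;
}
```

The final flush mirrors the patch's rule that callers of cl_pagevec_put()
must do a last pagevec_release() themselves to drop any trailing pages.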

WC-bug-id: https://jira.whamcloud.com/browse/LU-9906
Lustre-commit: b4a959eb61bc ("LU-9906 clio: use pagevec_release for many pages")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Signed-off-by: Li Dongyang <dongyangli@ddn.com>
Reviewed-on: https://review.whamcloud.com/28667
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/cl_object.h   |  7 ++++-
 fs/lustre/include/lustre_osc.h  |  1 +
 fs/lustre/llite/vvp_page.c      | 19 ++++++++----
 fs/lustre/lov/Makefile          |  2 +-
 fs/lustre/lov/lov_cl_internal.h | 13 --------
 fs/lustre/lov/lovsub_page.c     | 68 -----------------------------------------
 fs/lustre/obdclass/cl_page.c    | 36 +++++++++++++++-------
 fs/lustre/obdecho/echo_client.c |  3 +-
 fs/lustre/osc/osc_cache.c       | 14 +++++++--
 fs/lustre/osc/osc_page.c        |  5 ++-
 10 files changed, 64 insertions(+), 104 deletions(-)
 delete mode 100644 fs/lustre/lov/lovsub_page.c

diff --git a/fs/lustre/include/cl_object.h b/fs/lustre/include/cl_object.h
index c96a5b7..3337bbf 100644
--- a/fs/lustre/include/cl_object.h
+++ b/fs/lustre/include/cl_object.h
@@ -95,6 +95,7 @@
 #include <linux/radix-tree.h>
 #include <linux/spinlock.h>
 #include <linux/wait.h>
+#include <linux/pagevec.h>
 
 struct inode;
 
@@ -896,7 +897,8 @@ struct cl_page_operations {
 			   const struct cl_page_slice *slice);
 	/** Destructor. Frees resources and slice itself. */
 	void (*cpo_fini)(const struct lu_env *env,
-			 struct cl_page_slice *slice);
+			 struct cl_page_slice *slice,
+			 struct pagevec *pvec);
 	/**
 	 * Optional debugging helper. Prints given page slice.
 	 *
@@ -2147,6 +2149,9 @@ struct cl_page *cl_page_alloc(const struct lu_env *env,
 			      enum cl_page_type type);
 void cl_page_get(struct cl_page *page);
 void cl_page_put(const struct lu_env *env, struct cl_page *page);
+void cl_pagevec_put(const struct lu_env *env,
+		    struct cl_page *page,
+		    struct pagevec *pvec);
 void cl_page_print(const struct lu_env *env, void *cookie, lu_printer_t printer,
 		   const struct cl_page *pg);
 void cl_page_header_print(const struct lu_env *env, void *cookie,
diff --git a/fs/lustre/include/lustre_osc.h b/fs/lustre/include/lustre_osc.h
index dabcee0..aa3d4c3 100644
--- a/fs/lustre/include/lustre_osc.h
+++ b/fs/lustre/include/lustre_osc.h
@@ -179,6 +179,7 @@ struct osc_thread_info {
 	struct lustre_handle	oti_handle;
 	struct cl_page_list	oti_plist;
 	struct cl_io		oti_io;
+	struct pagevec		oti_pagevec;
 	void			*oti_pvec[OTI_PVEC_SIZE];
 	/*
 	 * Fields used by cl_lock_discard_pages().
diff --git a/fs/lustre/llite/vvp_page.c b/fs/lustre/llite/vvp_page.c
index 78a70b5..bd4ec85 100644
--- a/fs/lustre/llite/vvp_page.c
+++ b/fs/lustre/llite/vvp_page.c
@@ -54,16 +54,22 @@
  *
  */
 
-static void vvp_page_fini_common(struct vvp_page *vpg)
+static void vvp_page_fini_common(struct vvp_page *vpg, struct pagevec *pvec)
 {
 	struct page *vmpage = vpg->vpg_page;
 
 	LASSERT(vmpage);
-	put_page(vmpage);
+	if (pvec) {
+		if (!pagevec_add(pvec, vmpage))
+			pagevec_release(pvec);
+	} else {
+		put_page(vmpage);
+	}
 }
 
 static void vvp_page_fini(const struct lu_env *env,
-			  struct cl_page_slice *slice)
+			  struct cl_page_slice *slice,
+			  struct pagevec *pvec)
 {
 	struct vvp_page *vpg = cl2vvp_page(slice);
 	struct page *vmpage = vpg->vpg_page;
@@ -73,7 +79,7 @@ static void vvp_page_fini(const struct lu_env *env,
 	 * VPG_FREEING state.
 	 */
 	LASSERT((struct cl_page *)vmpage->private != slice->cpl_page);
-	vvp_page_fini_common(vpg);
+	vvp_page_fini_common(vpg, pvec);
 }
 
 static int vvp_page_own(const struct lu_env *env,
@@ -471,13 +477,14 @@ static int vvp_transient_page_is_vmlocked(const struct lu_env *env,
 }
 
 static void vvp_transient_page_fini(const struct lu_env *env,
-				    struct cl_page_slice *slice)
+				    struct cl_page_slice *slice,
+				    struct pagevec *pvec)
 {
 	struct vvp_page *vpg = cl2vvp_page(slice);
 	struct cl_page *clp = slice->cpl_page;
 	struct vvp_object *clobj = cl2vvp(clp->cp_obj);
 
-	vvp_page_fini_common(vpg);
+	vvp_page_fini_common(vpg, pvec);
 	atomic_dec(&clobj->vob_transient_pages);
 }
 
diff --git a/fs/lustre/lov/Makefile b/fs/lustre/lov/Makefile
index abdaac0..2f0b761 100644
--- a/fs/lustre/lov/Makefile
+++ b/fs/lustre/lov/Makefile
@@ -4,5 +4,5 @@ ccflags-y += -I$(srctree)/$(src)/../include
 obj-$(CONFIG_LUSTRE_FS) += lov.o
 lov-y := lov_obd.o lov_pack.o lov_offset.o lov_merge.o \
 	 lov_request.o lov_ea.o lov_dev.o lov_object.o lov_page.o  \
-	 lov_lock.o lov_io.o lovsub_dev.o lovsub_object.o lovsub_page.o      \
+	 lov_lock.o lov_io.o lovsub_dev.o lovsub_object.o \
 	 lov_pool.o lproc_lov.o
diff --git a/fs/lustre/lov/lov_cl_internal.h b/fs/lustre/lov/lov_cl_internal.h
index 875af37..e14567d 100644
--- a/fs/lustre/lov/lov_cl_internal.h
+++ b/fs/lustre/lov/lov_cl_internal.h
@@ -466,10 +466,6 @@ struct lov_sublock_env {
 	struct cl_io		*lse_io;
 };
 
-struct lovsub_page {
-	struct cl_page_slice	lsb_cl;
-};
-
 struct lov_thread_info {
 	struct cl_object_conf   lti_stripe_conf;
 	struct lu_fid		lti_fid;
@@ -626,8 +622,6 @@ struct lov_io_sub *lov_sub_get(const struct lu_env *env, struct lov_io *lio,
 
 int lov_page_init(const struct lu_env *env, struct cl_object *ob,
 		  struct cl_page *page, pgoff_t index);
-int lovsub_page_init(const struct lu_env *env, struct cl_object *ob,
-		     struct cl_page *page, pgoff_t index);
 int lov_page_init_empty(const struct lu_env *env, struct cl_object *obj,
 			struct cl_page *page, pgoff_t index);
 int lov_page_init_composite(const struct lu_env *env, struct cl_object *obj,
@@ -782,13 +776,6 @@ static inline struct lov_page *cl2lov_page(const struct cl_page_slice *slice)
 	return container_of(slice, struct lov_page, lps_cl);
 }
 
-static inline struct lovsub_page *
-cl2lovsub_page(const struct cl_page_slice *slice)
-{
-	LINVRNT(lovsub_is_object(&slice->cpl_obj->co_lu));
-	return container_of(slice, struct lovsub_page, lsb_cl);
-}
-
 static inline struct lov_io *cl2lov_io(const struct lu_env *env,
 				       const struct cl_io_slice *ios)
 {
diff --git a/fs/lustre/lov/lovsub_page.c b/fs/lustre/lov/lovsub_page.c
deleted file mode 100644
index a8aa583..0000000
--- a/fs/lustre/lov/lovsub_page.c
+++ /dev/null
@@ -1,68 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/*
- * GPL HEADER START
- *
- * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 only,
- * as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License version 2 for more details (a copy is included
- * in the LICENSE file that accompanied this code).
- *
- * You should have received a copy of the GNU General Public License
- * version 2 along with this program; If not, see
- * http://www.gnu.org/licenses/gpl-2.0.html
- *
- * GPL HEADER END
- */
-/*
- * Copyright (c) 2002, 2010, Oracle and/or its affiliates. All rights reserved.
- * Use is subject to license terms.
- */
-/*
- * This file is part of Lustre, http://www.lustre.org/
- * Lustre is a trademark of Sun Microsystems, Inc.
- *
- * Implementation of cl_page for LOVSUB layer.
- *
- *   Author: Nikita Danilov <nikita.danilov@sun.com>
- */
-
-#define DEBUG_SUBSYSTEM S_LOV
-
-#include "lov_cl_internal.h"
-
-/** \addtogroup lov
- *  @{
- */
-
-/*****************************************************************************
- *
- * Lovsub page operations.
- *
- */
-
-static void lovsub_page_fini(const struct lu_env *env,
-			     struct cl_page_slice *slice)
-{
-}
-
-static const struct cl_page_operations lovsub_page_ops = {
-	.cpo_fini	= lovsub_page_fini
-};
-
-int lovsub_page_init(const struct lu_env *env, struct cl_object *obj,
-		     struct cl_page *page, pgoff_t index)
-{
-	struct lovsub_page *lsb = cl_object_page_slice(obj, page);
-
-	cl_page_slice_add(page, &lsb->lsb_cl, obj, index, &lovsub_page_ops);
-	return 0;
-}
-
-/** @} lov */
diff --git a/fs/lustre/obdclass/cl_page.c b/fs/lustre/obdclass/cl_page.c
index 8dbd312..3076f8c 100644
--- a/fs/lustre/obdclass/cl_page.c
+++ b/fs/lustre/obdclass/cl_page.c
@@ -90,7 +90,8 @@ static void cl_page_get_trust(struct cl_page *page)
 	return NULL;
 }
 
-static void cl_page_free(const struct lu_env *env, struct cl_page *page)
+static void cl_page_free(const struct lu_env *env, struct cl_page *page,
+			 struct pagevec *pvec)
 {
 	struct cl_object *obj = page->cp_obj;
 	struct cl_page_slice *slice;
@@ -104,7 +105,7 @@ static void cl_page_free(const struct lu_env *env, struct cl_page *page)
 						 cpl_linkage)) != NULL) {
 		list_del_init(page->cp_layers.next);
 		if (unlikely(slice->cpl_ops->cpo_fini))
-			slice->cpl_ops->cpo_fini(env, slice);
+			slice->cpl_ops->cpo_fini(env, slice, pvec);
 	}
 	lu_object_ref_del_at(&obj->co_lu, &page->cp_obj_ref, "cl_page", page);
 	cl_object_put(env, obj);
@@ -152,7 +153,7 @@ struct cl_page *cl_page_alloc(const struct lu_env *env,
 								   page, ind);
 				if (result != 0) {
 					__cl_page_delete(env, page);
-					cl_page_free(env, page);
+					cl_page_free(env, page, NULL);
 					page = ERR_PTR(result);
 					break;
 				}
@@ -299,15 +300,13 @@ void cl_page_get(struct cl_page *page)
 EXPORT_SYMBOL(cl_page_get);
 
 /**
- * Releases a reference to a page.
+ * Releases a reference to a page, use the pagevec to release the pages
+ * in batch if provided.
  *
- * When last reference is released, page is returned to the cache, unless it
- * is in cl_page_state::CPS_FREEING state, in which case it is immediately
- * destroyed.
- *
- * \see cl_object_put(), cl_lock_put().
+ * Users need to do a final pagevec_release() to release any trailing pages.
  */
-void cl_page_put(const struct lu_env *env, struct cl_page *page)
+void cl_pagevec_put(const struct lu_env *env, struct cl_page *page,
+		  struct pagevec *pvec)
 {
 	CL_PAGE_HEADER(D_TRACE, env, page, "%d\n",
 		       refcount_read(&page->cp_ref));
@@ -322,9 +321,24 @@ void cl_page_put(const struct lu_env *env, struct cl_page *page)
 		 * Page is no longer reachable by other threads. Tear
 		 * it down.
 		 */
-		cl_page_free(env, page);
+		cl_page_free(env, page, pvec);
 	}
 }
+EXPORT_SYMBOL(cl_pagevec_put);
+
+/**
+ * Releases a reference to a page, wrapper to cl_pagevec_put
+ *
+ * When last reference is released, page is returned to the cache, unless it
+ * is in cl_page_state::CPS_FREEING state, in which case it is immediately
+ * destroyed.
+ *
+ * \see cl_object_put(), cl_lock_put().
+ */
+void cl_page_put(const struct lu_env *env, struct cl_page *page)
+{
+	cl_pagevec_put(env, page, NULL);
+}
 EXPORT_SYMBOL(cl_page_put);
 
 /**
diff --git a/fs/lustre/obdecho/echo_client.c b/fs/lustre/obdecho/echo_client.c
index 0735a5a..5ac4519 100644
--- a/fs/lustre/obdecho/echo_client.c
+++ b/fs/lustre/obdecho/echo_client.c
@@ -259,7 +259,8 @@ static void echo_page_completion(const struct lu_env *env,
 }
 
 static void echo_page_fini(const struct lu_env *env,
-			   struct cl_page_slice *slice)
+			   struct cl_page_slice *slice,
+			   struct pagevec *pvec)
 {
 	struct echo_object *eco = cl2echo_obj(slice->cpl_obj);
 
diff --git a/fs/lustre/osc/osc_cache.c b/fs/lustre/osc/osc_cache.c
index 961fc6bf..47aee99 100644
--- a/fs/lustre/osc/osc_cache.c
+++ b/fs/lustre/osc/osc_cache.c
@@ -985,6 +985,7 @@ static int osc_extent_truncate(struct osc_extent *ext, pgoff_t trunc_index,
 	struct client_obd *cli = osc_cli(obj);
 	struct osc_async_page *oap;
 	struct osc_async_page *tmp;
+	struct pagevec        *pvec;
 	int pages_in_chunk = 0;
 	int ppc_bits = cli->cl_chunkbits - PAGE_SHIFT;
 	u64 trunc_chunk = trunc_index >> ppc_bits;
@@ -1008,6 +1009,8 @@ static int osc_extent_truncate(struct osc_extent *ext, pgoff_t trunc_index,
 	io  = osc_env_thread_io(env);
 	io->ci_obj = cl_object_top(osc2cl(obj));
 	io->ci_ignore_layout = 1;
+	pvec = &osc_env_info(env)->oti_pagevec;
+	pagevec_init(pvec);
 	rc = cl_io_init(env, io, CIT_MISC, io->ci_obj);
 	if (rc < 0)
 		goto out;
@@ -1046,11 +1049,13 @@ static int osc_extent_truncate(struct osc_extent *ext, pgoff_t trunc_index,
 		}
 
 		lu_ref_del(&page->cp_reference, "truncate", current);
-		cl_page_put(env, page);
+		cl_pagevec_put(env, page, pvec);
 
 		--ext->oe_nr_pages;
 		++nr_pages;
 	}
+	pagevec_release(pvec);
+
 	EASSERTF(ergo(ext->oe_start >= trunc_index + !!partial,
 		      ext->oe_nr_pages == 0),
 		ext, "trunc_index %lu, partial %d\n", trunc_index, partial);
@@ -3030,6 +3035,7 @@ bool osc_page_gang_lookup(const struct lu_env *env, struct cl_io *io,
 			  osc_page_gang_cbt cb, void *cbdata)
 {
 	struct osc_page *ops;
+	struct pagevec	*pagevec;
 	void **pvec;
 	pgoff_t idx;
 	unsigned int nr;
@@ -3040,6 +3046,8 @@ bool osc_page_gang_lookup(const struct lu_env *env, struct cl_io *io,
 
 	idx = start;
 	pvec = osc_env_info(env)->oti_pvec;
+	pagevec = &osc_env_info(env)->oti_pagevec;
+	pagevec_init(pagevec);
 	spin_lock(&osc->oo_tree_lock);
 	while ((nr = radix_tree_gang_lookup(&osc->oo_tree, pvec,
 					    idx, OTI_PVEC_SIZE)) > 0) {
@@ -3086,8 +3094,10 @@ bool osc_page_gang_lookup(const struct lu_env *env, struct cl_io *io,
 
 			page = ops->ops_cl.cpl_page;
 			lu_ref_del(&page->cp_reference, "gang_lookup", current);
-			cl_page_put(env, page);
+			cl_pagevec_put(env, page, pagevec);
 		}
+		pagevec_release(pagevec);
+
 		if (nr < OTI_PVEC_SIZE || end_of_region)
 			break;
 
diff --git a/fs/lustre/osc/osc_page.c b/fs/lustre/osc/osc_page.c
index 9236e02..4dc6c18 100644
--- a/fs/lustre/osc/osc_page.c
+++ b/fs/lustre/osc/osc_page.c
@@ -506,8 +506,10 @@ static void osc_lru_use(struct client_obd *cli, struct osc_page *opg)
 static void discard_pagevec(const struct lu_env *env, struct cl_io *io,
 			    struct cl_page **pvec, int max_index)
 {
+	struct pagevec *pagevec = &osc_env_info(env)->oti_pagevec;
 	int i;
 
+	pagevec_init(pagevec);
 	for (i = 0; i < max_index; i++) {
 		struct cl_page *page = pvec[i];
 
@@ -515,10 +517,11 @@ static void discard_pagevec(const struct lu_env *env, struct cl_io *io,
 		cl_page_delete(env, page);
 		cl_page_discard(env, io, page);
 		cl_page_disown(env, io, page);
-		cl_page_put(env, page);
+		cl_pagevec_put(env, page, pagevec);
 
 		pvec[i] = NULL;
 	}
+	pagevec_release(pagevec);
 }
 
 /**
-- 
1.8.3.1


* [lustre-devel] [PATCH 167/622] lustre: lmv: allocate fid on parent MDT in migrate
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (165 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 166/622] lustre: clio: use pagevec_release for many pages James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 168/622] lustre: ptlrpc: Do not map unrecognized ELDLM errnos to EIO James Simmons
                   ` (455 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

During directory migration, if the migrated file is not a directory,
the target fid should be allocated on its parent MDT, not the user
specified MDT, because if its parent is striped the file should be
migrated to the MDT determined by its name hash, not the starting MDT
of its parent.

Add sanity 230k to check file data not changed after migration.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11642
Lustre-commit: a857446dc648 ("LU-11642 lmv: allocate fid on parent MDT in migrate")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33641
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/lmv/lmv_obd.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index e98f33d..428904c 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -1970,7 +1970,10 @@ static int lmv_migrate(struct obd_export *exp, struct md_op_data *op_data,
 	if (IS_ERR(child_tgt))
 		return PTR_ERR(child_tgt);
 
-	rc = lmv_fid_alloc(NULL, exp, &target_fid, op_data);
+	if (!S_ISDIR(op_data->op_mode) && tp_tgt)
+		rc = __lmv_fid_alloc(lmv, &target_fid, tp_tgt->ltd_idx);
+	else
+		rc = lmv_fid_alloc(NULL, exp, &target_fid, op_data);
 	if (rc)
 		return rc;
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 168/622] lustre: ptlrpc: Do not map unrecognized ELDLM errnos to EIO
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (166 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 167/622] lustre: lmv: allocate fid on parent MDT in migrate James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 169/622] lustre: llite: protect reading inode->i_data.nrpages James Simmons
                   ` (454 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Ann Koehler <amk@cray.com>

The lustre_errno_hton and lustre_errno_ntoh functions map between
host and network error numbers before they are sent over the network.
If an errno is unrecognized then it is mapped to EIO.

However, an optimization for the x86 and i386 architectures replaced
the functions with macros that simply return the original errno. The
result is that x86 and i386 return the original values for ELDLM
errnos while all other architectures return EIO. This difference is
known to break glimpse lock callback handling, which depends on
clients responding with ELDLM_NO_LOCK_DATA, and may cause other as
yet unidentified bugs.

The fix defines mappings for the ELDLM errors that leave their values
unchanged. Error numbers not found in the mapping tables are still
mapped to EIO.
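The table style can be sketched as below. The _SK constants and the
errno_hton_sk() name are illustrative stand-ins (the real ELDLM values come
from lustre_dlm.h and the real tables live in fs/lustre/ptlrpc/errno.c);
the point is the mechanism: a sparse array built with designated
initializers, identity entries for the ELDLM range, and an EIO fallback for
anything unlisted.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the wire-protocol ELDLM constants. */
enum {
	ELDLM_LOCK_CHANGED_SK	= 300,
	ELDLM_NO_LOCK_DATA_SK	= 303,
};

/* Miniature table in the style of lustre_errno_hton_mapping[]:
 * designated initializers index the array by errno, and the ELDLM
 * entries map to themselves so their wire values are preserved. */
static const unsigned int hton_map_sk[] = {
	[EPERM]			= EPERM,	/* LUSTRE_EPERM in the real table */
	[ELDLM_LOCK_CHANGED_SK]	= ELDLM_LOCK_CHANGED_SK,
	[ELDLM_NO_LOCK_DATA_SK]	= ELDLM_NO_LOCK_DATA_SK,
};

static unsigned int errno_hton_sk(unsigned int h)
{
	unsigned int n;

	if (h == 0)
		return 0;
	if (h < sizeof(hton_map_sk) / sizeof(hton_map_sk[0]) &&
	    hton_map_sk[h] != 0)
		n = hton_map_sk[h];
	else
		n = EIO;	/* anything unlisted still degrades to EIO */
	return n;
}
```

Unlisted errnos (a hole in the table, or a value past its end) still map to
EIO, which is exactly the behavior the patch keeps for everything except the
ELDLM range.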

Cray-bug-id: LUS-6057
WC-bug-id: https://jira.whamcloud.com/browse/LU-9793
Lustre-commit: 641e1d546742 ("LU-9793 ptlrpc: Do not map unrecognized ELDLM errnos to EIO")
Signed-off-by: Ann Koehler <amk@cray.com>
Reviewed-on: https://review.whamcloud.com/33471
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/errno.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/fs/lustre/ptlrpc/errno.c b/fs/lustre/ptlrpc/errno.c
index b904524..2975010 100644
--- a/fs/lustre/ptlrpc/errno.c
+++ b/fs/lustre/ptlrpc/errno.c
@@ -30,6 +30,7 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <lustre_errno.h>
+#include <lustre_dlm.h>
 
 /*
  * The two translation tables below must define a one-to-one mapping between
@@ -187,6 +188,19 @@
 	[EBADTYPE]		= LUSTRE_EBADTYPE,
 	[EJUKEBOX]		= LUSTRE_EJUKEBOX,
 	[EIOCBQUEUED]		= LUSTRE_EIOCBQUEUED,
+
+	/*
+	 * The ELDLM errors are Lustre specific errors whose ranges
+	 * lie in the middle of the above system errors. The ELDLM
+	 * numbers must be preserved to avoid LU-9793.
+	 */
+	[ELDLM_LOCK_CHANGED]		= ELDLM_LOCK_CHANGED,
+	[ELDLM_LOCK_ABORTED]		= ELDLM_LOCK_ABORTED,
+	[ELDLM_LOCK_REPLACED]		= ELDLM_LOCK_REPLACED,
+	[ELDLM_NO_LOCK_DATA]		= ELDLM_NO_LOCK_DATA,
+	[ELDLM_LOCK_WOULDBLOCK]		= ELDLM_LOCK_WOULDBLOCK,
+	[ELDLM_NAMESPACE_EXISTS]	= ELDLM_NAMESPACE_EXISTS,
+	[ELDLM_BAD_NAMESPACE]		= ELDLM_BAD_NAMESPACE,
 };
 
 static int lustre_errno_ntoh_mapping[] = {
@@ -333,6 +347,19 @@
 	[LUSTRE_EBADTYPE]		= EBADTYPE,
 	[LUSTRE_EJUKEBOX]		= EJUKEBOX,
 	[LUSTRE_EIOCBQUEUED]		= EIOCBQUEUED,
+
+	/*
+	 * The ELDLM errors are Lustre specific errors whose ranges
+	 * lie in the middle of the above system errors. The ELDLM
+	 * numbers must be preserved to avoid LU-9793.
+	 */
+	[ELDLM_LOCK_CHANGED]		= ELDLM_LOCK_CHANGED,
+	[ELDLM_LOCK_ABORTED]		= ELDLM_LOCK_ABORTED,
+	[ELDLM_LOCK_REPLACED]		= ELDLM_LOCK_REPLACED,
+	[ELDLM_NO_LOCK_DATA]		= ELDLM_NO_LOCK_DATA,
+	[ELDLM_LOCK_WOULDBLOCK]		= ELDLM_LOCK_WOULDBLOCK,
+	[ELDLM_NAMESPACE_EXISTS]	= ELDLM_NAMESPACE_EXISTS,
+	[ELDLM_BAD_NAMESPACE]		= ELDLM_BAD_NAMESPACE,
 };
 
 unsigned int lustre_errno_hton(unsigned int h)
-- 
1.8.3.1


* [lustre-devel] [PATCH 169/622] lustre: llite: protect reading inode->i_data.nrpages
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (167 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 168/622] lustre: ptlrpc: Do not map unrecognized ELDLM errnos to EIO James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 170/622] lustre: mdt: fix read-on-open for big PAGE_SIZE James Simmons
                   ` (453 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Bobi Jam <bobijam@whamcloud.com>

truncate_inode_pages() looks up pages in the radix tree without
holding the lock, and could miss pages being removed from the radix
tree by __remove_mapping(), so after calling truncate_inode_pages()
we need to read the nrpages of inode->i_data with the protection of
the tree lock.

This is because a page could still be inside the race window of
__remove_mapping()->__delete_from_page_cache()->
page_cache_tree_delete(), before nrpages is decreased.
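The read-racily-then-confirm-under-lock pattern the patch adds can be
sketched in userspace C. struct mapping_sk, stable_nrpages() and the
atomic_flag spinlock below are made-up stand-ins (the kernel code takes
xa_lock_irq on mapping->i_pages), not the actual llite code:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical model: mapping_sk stands in for address_space, the
 * atomic_flag for xa_lock_irq(&mapping->i_pages). */
struct mapping_sk {
	atomic_flag lock;
	unsigned long nrpages;
};

/* Trust a lockless read only when it is zero; a nonzero value may be
 * mid-deletion, so re-read it under the lock that deleters also hold. */
static unsigned long stable_nrpages(struct mapping_sk *m)
{
	unsigned long nr = m->nrpages;	/* racy first read */

	if (nr) {
		while (atomic_flag_test_and_set(&m->lock))
			;		/* spin until deleters are done */
		nr = m->nrpages;	/* settled value */
		atomic_flag_clear(&m->lock);
	}
	return nr;
}

static unsigned long demo_nrpages(unsigned long n)
{
	struct mapping_sk m = { ATOMIC_FLAG_INIT, n };

	return stable_nrpages(&m);
}
```

The zero fast path matches the patch: the lock is only taken when the
unlocked read is nonzero and therefore might be stale.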

WC-bug-id: https://jira.whamcloud.com/browse/LU-11582
Lustre-commit: 04c172b68676 ("LU-11582 llite: protect reading inode->i_data.nrpages")
Signed-off-by: Bobi Jam <bobijam@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33639
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_lib.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index ed2d1c6..b766402 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -2011,6 +2011,8 @@ int ll_read_inode2(struct inode *inode, void *opaque)
 void ll_delete_inode(struct inode *inode)
 {
 	struct ll_inode_info *lli = ll_i2info(inode);
+	struct address_space *mapping = &inode->i_data;
+	unsigned long nrpages;
 
 	if (S_ISREG(inode->i_mode) && lli->lli_clob)
 		/* discard all dirty pages before truncating them, required by
@@ -2019,11 +2021,26 @@ void ll_delete_inode(struct inode *inode)
 		cl_sync_file_range(inode, 0, OBD_OBJECT_EOF,
 				   CL_FSYNC_LOCAL, 1);
 
-	truncate_inode_pages_final(&inode->i_data);
+	truncate_inode_pages_final(mapping);
 
-	LASSERTF(!inode->i_data.nrpages,
-		 "inode=" DFID "(%p) nrpages=%lu, see http://jira.whamcloud.com/browse/LU-118\n",
-		 PFID(ll_inode2fid(inode)), inode, inode->i_data.nrpages);
+	/* Workaround for LU-118: Note nrpages may not be totally updated when
+	 * truncate_inode_pages() returns, as there can be a page in the process
+	 * of deletion (inside __delete_from_page_cache()) in the specified
+	 * range. Thus mapping->nrpages can be non-zero when this function
+	 * returns even after truncation of the whole mapping.  Only do this if
+	 * npages isn't already zero.
+	 */
+	nrpages = mapping->nrpages;
+	if (nrpages) {
+		xa_lock_irq(&mapping->i_pages);
+		nrpages = mapping->nrpages;
+		xa_unlock_irq(&mapping->i_pages);
+	} /* Workaround end */
+
+	LASSERTF(nrpages == 0,
+		 "%s: inode="DFID"(%p) nrpages=%lu, see https://jira.whamcloud.com/browse/LU-118\n",
+		 ll_get_fsname(inode->i_sb, NULL, 0),
+		 PFID(ll_inode2fid(inode)), inode, nrpages);
 
 	ll_clear_inode(inode);
 	clear_inode(inode);
-- 
1.8.3.1


* [lustre-devel] [PATCH 170/622] lustre: mdt: fix read-on-open for big PAGE_SIZE
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (168 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 169/622] lustre: llite: protect reading inode->i_data.nrpages James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 171/622] lustre: llite: handle -ENODATA in ll_layout_fetch() James Simmons
                   ` (452 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

The client PAGE_SIZE can be larger than the server's, so data
returned from the server along with OPEN can be misaligned on the
client.

The patch replaces the client-side assertion with a check and a
graceful exit, changes MDC_DOM_DEF_INLINE_REPSIZE to be at least
PAGE_SIZE, and updates mdt_dom_read_on_open() to return the file
tail for the maximum possible page size that can fit into the reply.
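The client-side acceptance checks can be condensed into one predicate.
dom_reply_usable() is a made-up name for illustration, not a Lustre
function: reply data is usable only if its offset is aligned to the client
page size and the buffer reaches the inode size (i.e. it is the whole file
or the file tail), which is what ll_dom_finish_open() verifies in the diff
below.

```c
#include <assert.h>

/* Sketch of the LU-11595 checks: off/len describe the niobuf returned
 * with OPEN, isize the inode size, client_page_size the client's
 * PAGE_SIZE. Returns 1 when the data can be cached, 0 when ignored. */
static int dom_reply_usable(unsigned long long off, unsigned int len,
			    unsigned long long isize,
			    unsigned long client_page_size)
{
	if (off % client_page_size)
		return 0;	/* server page size smaller than client's */
	if (off + len < isize)
		return 0;	/* neither whole file nor the file tail */
	return 1;
}
```

For example a tail at offset 4096 is fine for a 4 KiB-page client but is
silently dropped by a 64 KiB-page client, which is the graceful-exit case
the patch adds.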

WC-bug-id: https://jira.whamcloud.com/browse/LU-11595
Lustre-commit: 4d7b022e373d ("LU-11595 mdt: fix read-on-open for big PAGE_SIZE")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33606
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c       | 22 ++++++++++++++++++++--
 fs/lustre/mdc/mdc_internal.h |  3 ++-
 2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 246d5de..44337a2 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -447,8 +447,26 @@ void ll_dom_finish_open(struct inode *inode, struct ptlrpc_request *req,
 	if (!rnb || rnb->rnb_len == 0)
 		return;
 
-	CDEBUG(D_INFO, "Get data buffer along with open, len %i, i_size %llu\n",
-	       rnb->rnb_len, i_size_read(inode));
+	/* LU-11595: Server may return whole file and that is OK always or
+	 * it may return just file tail and its offset must be aligned with
+	 * client PAGE_SIZE to be used on that client, if server's PAGE_SIZE is
+	 * smaller then offset may be not aligned and that data is just ignored.
+	 */
+	if (rnb->rnb_offset % PAGE_SIZE)
+		return;
+
+	/* Server returns whole file or just file tail if it fills in
+	 * reply buffer, in both cases total size should be inode size.
+	 */
+	if (rnb->rnb_offset + rnb->rnb_len < i_size_read(inode)) {
+		CERROR("%s: server returns off/len %llu/%u < i_size %llu\n",
+		       ll_get_fsname(inode->i_sb, NULL, 0), rnb->rnb_offset,
+		       rnb->rnb_len, i_size_read(inode));
+		return;
+	}
+
+	CDEBUG(D_INFO, "Get data along with open at %llu len %i, i_size %llu\n",
+	       rnb->rnb_offset, rnb->rnb_len, i_size_read(inode));
 
 	data = (char *)rnb + sizeof(*rnb);
 
diff --git a/fs/lustre/mdc/mdc_internal.h b/fs/lustre/mdc/mdc_internal.h
index b4af9778..7a6ec81 100644
--- a/fs/lustre/mdc/mdc_internal.h
+++ b/fs/lustre/mdc/mdc_internal.h
@@ -162,7 +162,8 @@ int mdc_ldlm_blocking_ast(struct ldlm_lock *dlmlock,
 int mdc_ldlm_glimpse_ast(struct ldlm_lock *dlmlock, void *data);
 int mdc_fill_lvb(struct ptlrpc_request *req, struct ost_lvb *lvb);
 
-#define MDC_DOM_DEF_INLINE_REPSIZE 8192
+/* the minimum inline repsize should be PAGE_SIZE at least */
+#define MDC_DOM_DEF_INLINE_REPSIZE max(8192UL, PAGE_SIZE)
 #define MDC_DOM_MAX_INLINE_REPSIZE XATTR_SIZE_MAX
 
 #endif
-- 
1.8.3.1


* [lustre-devel] [PATCH 171/622] lustre: llite: handle -ENODATA in ll_layout_fetch()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (169 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 170/622] lustre: mdt: fix read-on-open for big PAGE_SIZE James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 172/622] lustre: hsm: increase upper limit of maximum HSM backends registered with MDT James Simmons
                   ` (451 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: "John L. Hammond" <jhammond@whamcloud.com>

In ll_layout_fetch() handle -ENODATA returns from mdc_getxattr(). This
is needed for interop and restores the behavior from before commit
0f42b388432c (LU-11380 mdc: move empty xattr to mdc layer) landed.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11662
Lustre-commit: e3f367f3660d ("LU-11662 llite: handle -ENODATA in ll_layout_fetch()")
Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33665
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 44337a2..25d7986 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -4433,8 +4433,13 @@ static int ll_layout_fetch(struct inode *inode, struct ldlm_lock *lock)
 
 	rc = md_getxattr(sbi->ll_md_exp, ll_inode2fid(inode), OBD_MD_FLXATTR,
 			 XATTR_NAME_LOV, lmmsize, &req);
-	if (rc < 0)
+	if (rc < 0) {
+		if (rc == -ENODATA) {
+			rc = 0;
+			goto out; /* empty layout */
+		}
 		return rc;
+	}
 
 	lmmsize = rc;
 	rc = 0;
-- 
1.8.3.1


* [lustre-devel] [PATCH 172/622] lustre: hsm: increase upper limit of maximum HSM backends registered with MDT
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (170 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 171/622] lustre: llite: handle -ENODATA in ll_layout_fetch() James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 173/622] lustre: osc: wrong page offset for T10PI checksum James Simmons
                   ` (450 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Teddy Zheng <teddy@ddn.com>

Lustre only supports at most 32 HSM backends, which limits how HSM can be
applied to other features, such as LPCC. This patch removes that limitation
by allowing the system to take any nonnegative integer as a valid archive-id.
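
The conversion the patch performs between the two wire formats can be sketched with the hypothetical helpers below: old clients send a 32-bit mask where bit N-1 stands for archive id N (as in the hweight32/BIT loop in the diff), while new clients send an explicit id array. Names here are illustrative, not the kernel's.

```c
#include <assert.h>
#include <stdint.h>

/* Old format -> new format: expand a 32-bit mask into an id array,
 * returning the number of ids written. */
static int mask_to_ids(uint32_t mask, uint32_t *ids)
{
	int count = 0;

	for (int i = 0; i < 32; i++)
		if (mask & (1U << i))
			ids[count++] = i + 1;
	return count;
}

/* New format -> old format: fold an id array back into a mask.
 * Id 0 means "all archives" and ids above 32 cannot be encoded;
 * both collapse to 0 here (the kernel rejects ids above 32). */
static uint32_t ids_to_mask(const uint32_t *ids, int count)
{
	uint32_t mask = 0;

	for (int i = 0; i < count; i++) {
		if (ids[i] == 0 || ids[i] > 32)
			return 0;
		mask |= 1U << (ids[i] - 1);
	}
	return mask;
}
```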

WC-bug-id: https://jira.whamcloud.com/browse/LU-10114
Lustre-commit: 3bfb6107ba4e ("LU-10114 hsm: increase upper limit of maximum HSM backends registered with MDT")
Signed-off-by: Teddy Zheng <teddy@ddn.com>
Signed-off-by: Li Xi <lixi@ddn.com>
Reviewed-on: https://review.whamcloud.com/32197
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_export.h             |  15 +++-
 fs/lustre/llite/dir.c                         | 115 +++++++++++++++++++++++---
 fs/lustre/llite/file.c                        |  15 ++--
 fs/lustre/llite/llite_lib.c                   |   3 +-
 fs/lustre/lmv/lmv_obd.c                       |  31 +++++--
 fs/lustre/mdc/mdc_request.c                   |  81 +++++++++++++-----
 fs/lustre/ptlrpc/layout.c                     |   2 +-
 include/uapi/linux/lustre/lustre_idl.h        |  10 ++-
 include/uapi/linux/lustre/lustre_kernelcomm.h |  15 +++-
 9 files changed, 235 insertions(+), 52 deletions(-)

diff --git a/fs/lustre/include/lustre_export.h b/fs/lustre/include/lustre_export.h
index 57cf68b..c94efb0 100644
--- a/fs/lustre/include/lustre_export.h
+++ b/fs/lustre/include/lustre_export.h
@@ -276,11 +276,22 @@ static inline int exp_connect_lock_convert(struct obd_export *exp)
 
 struct obd_export *class_conn2export(struct lustre_handle *conn);
 
-#define KKUC_CT_DATA_MAGIC	0x092013cea
+static inline int exp_connect_archive_id_array(struct obd_export *exp)
+{
+	return !!(exp_connect_flags2(exp) & OBD_CONNECT2_ARCHIVE_ID_ARRAY);
+}
+
+enum {
+	/* archive_ids in array format */
+	KKUC_CT_DATA_ARRAY_MAGIC	= 0x092013cea,
+	/* archive_ids in bitmap format */
+	KKUC_CT_DATA_BITMAP_MAGIC	= 0x082018cea,
+};
 
 struct kkuc_ct_data {
 	u32			kcd_magic;
-	u32			kcd_archive;
+	u32			kcd_nr_archives;
+	u32			kcd_archives[0];
 };
 
 /** @} export */
diff --git a/fs/lustre/llite/dir.c b/fs/lustre/llite/dir.c
index 3da9d14..f54987a 100644
--- a/fs/lustre/llite/dir.c
+++ b/fs/lustre/llite/dir.c
@@ -931,19 +931,114 @@ static int ll_ioc_copy_end(struct super_block *sb, struct hsm_copy *copy)
 	return rc ? rc : rc2;
 }
 
-static int copy_and_ioctl(int cmd, struct obd_export *exp,
-			  const void __user *data, size_t size)
+static int copy_and_ct_start(int cmd, struct obd_export *exp,
+			     const struct lustre_kernelcomm __user *data)
 {
-	void *copy;
+	struct lustre_kernelcomm *lk;
+	struct lustre_kernelcomm *tmp;
+	size_t size = sizeof(*lk);
+	size_t new_size;
 	int rc;
+	int i;
 
-	copy = memdup_user(data, size);
-	if (IS_ERR(copy))
-		return PTR_ERR(copy);
+	lk = memdup_user(data, size);
+	if (IS_ERR(lk)) {
+		rc = PTR_ERR(lk);
+		goto out_lk;
+	}
+
+	if (lk->lk_flags & LK_FLG_STOP)
+		goto do_ioctl;
+
+	if (!(lk->lk_flags & LK_FLG_DATANR)) {
+		u32 archive_mask = lk->lk_data_count;
+		int count;
+
+		/* old hsm agent to old MDS */
+		if (!exp_connect_archive_id_array(exp))
+			goto do_ioctl;
+
+		/* old hsm agent to new MDS */
+		lk->lk_flags |= LK_FLG_DATANR;
+
+		if (archive_mask == 0)
+			goto do_ioctl;
+
+		count = hweight32(archive_mask);
+		new_size = offsetof(struct lustre_kernelcomm, lk_data[count]);
+		tmp = kmalloc(new_size, GFP_KERNEL);
+		if (!tmp) {
+			rc = -ENOMEM;
+			goto out_lk;
+		}
+		memcpy(tmp, lk, size);
+		tmp->lk_data_count = count;
+		kfree(lk);
+		lk = tmp;
+		size = new_size;
+
+		count = 0;
+		for (i = 0; i < sizeof(archive_mask) * 8; i++) {
+			if (BIT(i) & archive_mask) {
+				lk->lk_data[count] = i + 1;
+				count++;
+			}
+		}
+		goto do_ioctl;
+	}
+
+	/* new hsm agent to new mds */
+	if (lk->lk_data_count > 0) {
+		new_size = offsetof(struct lustre_kernelcomm,
+				    lk_data[lk->lk_data_count]);
+		tmp = kmalloc(new_size, GFP_KERNEL);
+		if (!tmp) {
+			rc = -ENOMEM;
+			goto out_lk;
+		}
+
+		kfree(lk);
+		lk = tmp;
+		size = new_size;
+
+		if (copy_from_user(lk, data, size)) {
+			rc = -EFAULT;
+			goto out_lk;
+		}
+	}
+
+	/* new hsm agent to old MDS */
+	if (!exp_connect_archive_id_array(exp)) {
+		u32 archives = 0;
+
+		if (lk->lk_data_count > LL_HSM_ORIGIN_MAX_ARCHIVE) {
+			rc = -EINVAL;
+			goto out_lk;
+		}
+
+		for (i = 0; i < lk->lk_data_count; i++) {
+			if (lk->lk_data[i] > LL_HSM_ORIGIN_MAX_ARCHIVE) {
+				rc = -EINVAL;
+				CERROR("%s: archive id %d requested but only [0 - %zu] supported: rc = %d\n",
+				       exp->exp_obd->obd_name, lk->lk_data[i],
+				LL_HSM_ORIGIN_MAX_ARCHIVE, rc);
+				goto out_lk;
+			}
 
-	rc = obd_iocontrol(cmd, exp, size, copy, NULL);
-	kfree(copy);
+			if (lk->lk_data[i] == 0) {
+				archives = 0;
+				break;
+			}
 
+			archives |= BIT(lk->lk_data[i] - 1);
+		}
+		lk->lk_flags &= ~LK_FLG_DATANR;
+		lk->lk_data_count = archives;
+	}
+do_ioctl:
+	rc = obd_iocontrol(cmd, exp, size, lk, NULL);
+out_lk:
+	kfree(lk);
 	return rc;
 }
 
@@ -1671,8 +1766,8 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		if (!capable(CAP_SYS_ADMIN))
 			return -EPERM;
 
-		rc = copy_and_ioctl(cmd, sbi->ll_md_exp, (void __user *)arg,
-				    sizeof(struct lustre_kernelcomm));
+		rc = copy_and_ct_start(cmd, sbi->ll_md_exp,
+				       (struct lustre_kernelcomm __user *)arg);
 		return rc;
 
 	case LL_IOC_HSM_COPY_START: {
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 25d7986..7078734 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -2397,6 +2397,7 @@ static int ll_swap_layouts(struct file *file1, struct file *file2,
 
 int ll_hsm_state_set(struct inode *inode, struct hsm_state_set *hss)
 {
+	struct obd_export *exp = ll_i2mdexp(inode);
 	struct md_op_data *op_data;
 	int rc;
 
@@ -2411,18 +2412,20 @@ int ll_hsm_state_set(struct inode *inode, struct hsm_state_set *hss)
 	    !capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
-	/* Detect out-of range archive id */
-	if ((hss->hss_valid & HSS_ARCHIVE_ID) &&
-	    (hss->hss_archive_id > LL_HSM_MAX_ARCHIVE))
-		return -EINVAL;
+	if (!exp_connect_archive_id_array(exp)) {
+		/* Detect out-of range archive id */
+		if ((hss->hss_valid & HSS_ARCHIVE_ID) &&
+		    (hss->hss_archive_id > LL_HSM_ORIGIN_MAX_ARCHIVE))
+			return -EINVAL;
+	}
 
 	op_data = ll_prep_md_op_data(NULL, inode, NULL, NULL, 0, 0,
 				     LUSTRE_OPC_ANY, hss);
 	if (IS_ERR(op_data))
 		return PTR_ERR(op_data);
 
-	rc = obd_iocontrol(LL_IOC_HSM_STATE_SET, ll_i2mdexp(inode),
-			   sizeof(*op_data), op_data, NULL);
+	rc = obd_iocontrol(LL_IOC_HSM_STATE_SET, exp, sizeof(*op_data),
+			   op_data, NULL);
 
 	ll_finish_md_op_data(op_data);
 
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index b766402..4797ee9 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -212,7 +212,8 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 	data->ocd_connect_flags2 = OBD_CONNECT2_FLR |
 				   OBD_CONNECT2_LOCK_CONVERT |
 				   OBD_CONNECT2_DIR_MIGRATE |
-				   OBD_CONNECT2_SUM_STATFS;
+				   OBD_CONNECT2_SUM_STATFS |
+				   OBD_CONNECT2_ARCHIVE_ID_ARRAY;
 
 	if (sbi->ll_flags & LL_SBI_LRU_RESIZE)
 		data->ocd_connect_flags |= OBD_CONNECT_LRU_RESIZE;
diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index 428904c..9f9abd3 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -788,18 +788,39 @@ static int lmv_hsm_ct_register(struct obd_device *obd, unsigned int cmd,
 	u32 i, j;
 	int err;
 	bool any_set = false;
-	struct kkuc_ct_data kcd = {
-		.kcd_magic	= KKUC_CT_DATA_MAGIC,
-		.kcd_archive	= lk->lk_data,
-	};
+	struct kkuc_ct_data *kcd;
+	size_t kcd_size;
 	int rc = 0;
 
 	filp = fget(lk->lk_wfd);
 	if (!filp)
 		return -EBADF;
 
+	if (lk->lk_flags & LK_FLG_DATANR)
+		kcd_size = offsetof(struct kkuc_ct_data,
+				    kcd_archives[lk->lk_data_count]);
+	else
+		kcd_size = sizeof(*kcd);
+
+	kcd = kmalloc(kcd_size, GFP_KERNEL);
+	if (!kcd) {
+		rc = -ENOMEM;
+		goto err_fput;
+	}
+
+	kcd->kcd_nr_archives = lk->lk_data_count;
+	if (lk->lk_flags & LK_FLG_DATANR) {
+		kcd->kcd_magic = KKUC_CT_DATA_ARRAY_MAGIC;
+		if (lk->lk_data_count > 0)
+			memcpy(kcd->kcd_archives, lk->lk_data,
+			       sizeof(*kcd->kcd_archives) * lk->lk_data_count);
+	} else {
+		kcd->kcd_magic = KKUC_CT_DATA_BITMAP_MAGIC;
+	}
+
 	rc = libcfs_kkuc_group_add(filp, &obd->obd_uuid, lk->lk_uid,
-				   lk->lk_group, &kcd, sizeof(kcd));
+				   lk->lk_group, kcd, kcd_size);
+	kfree(kcd);
 	if (rc)
 		goto err_fput;
 
diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index 6934e57..d702fd1 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -1689,31 +1689,56 @@ static int mdc_ioc_hsm_progress(struct obd_export *exp,
 	return rc;
 }
 
-static int mdc_ioc_hsm_ct_register(struct obd_import *imp, u32 archives)
+/**
+ * Send hsm_ct_register to MDS
+ *
+ * @imp			import
+ * @ archive_count	if in bitmap format, it is the bitmap,
+ *			else it is the count of archive_ids
+ * @archives		if in bitmap format, it is NULL,
+ *			else it is archive_id lists
+ *
+ * Return:		0 on success, negated error code on failure.
+ */
+static int mdc_ioc_hsm_ct_register(struct obd_import *imp, u32 archive_count,
+				   u32 *archives)
 {
-	u32 *archive_mask;
+	u32 *archive_array;
 	struct ptlrpc_request *req;
+	size_t archives_size;
 	int rc;
 
-	req = ptlrpc_request_alloc_pack(imp, &RQF_MDS_HSM_CT_REGISTER,
-					LUSTRE_MDS_VERSION,
-					MDS_HSM_CT_REGISTER);
-	if (!req) {
-		rc = -ENOMEM;
-		goto out;
+	req = ptlrpc_request_alloc(imp, &RQF_MDS_HSM_CT_REGISTER);
+	if (!req)
+		return -ENOMEM;
+
+	if (archives)
+		archives_size = sizeof(*archive_array) * archive_count;
+	else
+		archives_size = sizeof(archive_count);
+
+	req_capsule_set_size(&req->rq_pill, &RMF_MDS_HSM_ARCHIVE,
+			     RCL_CLIENT, archives_size);
+
+	rc = ptlrpc_request_pack(req, LUSTRE_MDS_VERSION, MDS_HSM_CT_REGISTER);
+	if (rc) {
+		ptlrpc_request_free(req);
+		return -ENOMEM;
 	}
 
 	mdc_pack_body(req, NULL, 0, 0, -1, 0);
 
-	/* Copy hsm_progress struct */
-	archive_mask = req_capsule_client_get(&req->rq_pill,
-					      &RMF_MDS_HSM_ARCHIVE);
-	if (!archive_mask) {
+	archive_array = req_capsule_client_get(&req->rq_pill,
+					       &RMF_MDS_HSM_ARCHIVE);
+	if (!archive_array) {
 		rc = -EPROTO;
 		goto out;
 	}
 
-	*archive_mask = archives;
+	if (archives)
+		memcpy(archive_array, archives, archives_size);
+	else
+		*archive_array = archive_count;
 
 	ptlrpc_request_set_replen(req);
 
@@ -2249,7 +2274,6 @@ static int mdc_ioc_hsm_ct_start(struct obd_export *exp,
 				struct lustre_kernelcomm *lk)
 {
 	struct obd_import *imp = class_exp2cliimp(exp);
-	u32 archive = lk->lk_data;
 	int rc = 0;
 
 	if (lk->lk_group != KUC_GRP_HSM) {
@@ -2264,7 +2288,12 @@ static int mdc_ioc_hsm_ct_start(struct obd_export *exp,
 		/* Unregister with the coordinator */
 		rc = mdc_ioc_hsm_ct_unregister(imp);
 	} else {
-		rc = mdc_ioc_hsm_ct_register(imp, archive);
+		u32 *archives = NULL;
+
+		if ((lk->lk_flags & LK_FLG_DATANR) && lk->lk_data_count > 0)
+			archives = lk->lk_data;
+
+		rc = mdc_ioc_hsm_ct_register(imp, lk->lk_data_count, archives);
 	}
 
 	return rc;
@@ -2314,17 +2343,29 @@ static int mdc_hsm_copytool_send(const struct obd_uuid *uuid,
  */
 static int mdc_hsm_ct_reregister(void *data, void *cb_arg)
 {
-	struct kkuc_ct_data *kcd = data;
 	struct obd_import *imp = (struct obd_import *)cb_arg;
+	struct kkuc_ct_data *kcd = data;
+	u32 *archives = NULL;
 	int rc;
 
-	if (!kcd || kcd->kcd_magic != KKUC_CT_DATA_MAGIC)
+	if (!kcd ||
+	    (kcd->kcd_magic != KKUC_CT_DATA_ARRAY_MAGIC &&
+	     kcd->kcd_magic != KKUC_CT_DATA_BITMAP_MAGIC))
 		return -EPROTO;
 
-	CDEBUG(D_HA, "%s: recover copytool registration to MDT (archive=%#x)\n",
-	       imp->imp_obd->obd_name, kcd->kcd_archive);
-	rc = mdc_ioc_hsm_ct_register(imp, kcd->kcd_archive);
+	if (kcd->kcd_magic == KKUC_CT_DATA_BITMAP_MAGIC) {
+		CDEBUG(D_HA,
+		       "%s: recover copytool registration to MDT (archive=%#x)\n",
+		       imp->imp_obd->obd_name, kcd->kcd_nr_archives);
+	} else {
+		CDEBUG(D_HA,
+		       "%s: recover copytool registration to MDT (archive nr = %u)\n",
+		       imp->imp_obd->obd_name, kcd->kcd_nr_archives);
+		if (kcd->kcd_nr_archives != 0)
+			archives = kcd->kcd_archives;
+	}
 
+	rc = mdc_ioc_hsm_ct_register(imp, kcd->kcd_nr_archives, archives);
 	/* ignore error if the copytool is already registered */
 	return (rc == -EEXIST) ? 0 : rc;
 }
diff --git a/fs/lustre/ptlrpc/layout.c b/fs/lustre/ptlrpc/layout.c
index 92d2fc2..2e74ae1b 100644
--- a/fs/lustre/ptlrpc/layout.c
+++ b/fs/lustre/ptlrpc/layout.c
@@ -1127,7 +1127,7 @@ struct req_msg_field RMF_MDS_HSM_USER_ITEM =
 EXPORT_SYMBOL(RMF_MDS_HSM_USER_ITEM);
 
 struct req_msg_field RMF_MDS_HSM_ARCHIVE =
-	DEFINE_MSGF("hsm_archive", 0,
+	DEFINE_MSGF("hsm_archive", RMF_F_STRUCT_ARRAY,
 		    sizeof(u32), lustre_swab_generic_32s, NULL);
 EXPORT_SYMBOL(RMF_MDS_HSM_ARCHIVE);
 
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 8330fe1..599fe86 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -194,12 +194,14 @@ enum {
 	LUSTRE_FID_INIT_OID  = 1UL
 };
 
-/* copytool uses a 32b bitmask field to encode archive-Ids during register
- * with MDT thru kuc.
+/* copytool can use any nonnegative integer to represent archive-Ids during
+ * register with MDT thru kuc.
  * archive num = 0 => all
- * archive num from 1 to 32
+ * archive num from 1 to MAX_U32
  */
-#define LL_HSM_MAX_ARCHIVE (sizeof(__u32) * 8)
+#define LL_HSM_ORIGIN_MAX_ARCHIVE	(sizeof(__u32) * 8)
+/* the max count of archive ids that one agent can support */
+#define LL_HSM_MAX_ARCHIVES_PER_AGENT	1024
 
 /**
  * Different FID Format
diff --git a/include/uapi/linux/lustre/lustre_kernelcomm.h b/include/uapi/linux/lustre/lustre_kernelcomm.h
index d84a8fc..8c5dec7 100644
--- a/include/uapi/linux/lustre/lustre_kernelcomm.h
+++ b/include/uapi/linux/lustre/lustre_kernelcomm.h
@@ -75,17 +75,26 @@ enum kuc_generic_message_type {
 #define KUC_GRP_HSM	0x02
 #define KUC_GRP_MAX	KUC_GRP_HSM
 
-#define LK_FLG_STOP 0x01
+enum lk_flags {
+	LK_FLG_STOP	= 0x0001,
+	LK_FLG_DATANR	= 0x0002,
+};
 #define LK_NOFD -1U
 
-/* kernelcomm control structure, passed from userspace to kernel */
+/* kernelcomm control structure, passed from userspace to kernel.
+ * For compatibility with old copytools, users who pass ARCHIVE_IDs
+ * to kernel using lk_data_count and lk_data should fill lk_flags with
+ * LK_FLG_DATANR. Otherwise kernel will take lk_data_count as bitmap of
+ * ARCHIVE IDs.
+ */
 struct lustre_kernelcomm {
 	__u32 lk_wfd;
 	__u32 lk_rfd;
 	__u32 lk_uid;
 	__u32 lk_group;
-	__u32 lk_data;
+	__u32 lk_data_count;
 	__u32 lk_flags;
+	__u32 lk_data[0];
 } __packed;
 
 #endif	/* __UAPI_LUSTRE_KERNELCOMM_H__ */
-- 
1.8.3.1


* [lustre-devel] [PATCH 173/622] lustre: osc: wrong page offset for T10PI checksum
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (171 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 172/622] lustre: hsm: increase upper limit of maximum HSM backends registered with MDT James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 174/622] lnet: increase lnet transaction timeout James Simmons
                   ` (449 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Li Xi <lixi@ddn.com>

The page offset might be a non-zero value, so the correct in-page offset
must be used when calculating the T10PI checksum.
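
The offset computation the fix introduces is `off & ~PAGE_MASK`, i.e. the in-page part of a byte offset. A minimal sketch, assuming a 4096-byte page (the kernel uses the arch-defined PAGE_SIZE/PAGE_MASK):

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/* In-page offset of a byte offset, mirroring pga[i]->off & ~PAGE_MASK
 * from the diff. */
static unsigned long in_page_offset(uint64_t off)
{
	return off & ~SKETCH_PAGE_MASK;
}
```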

WC-bug-id: https://jira.whamcloud.com/browse/LU-11697
Lustre-commit: c1f052055446 ("LU-11697 osc: wrong page offset for T10PI checksum")
Signed-off-by: Li Xi <lixi@ddn.com>
Reviewed-on: https://review.whamcloud.com/33727
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Li Dongyang <dongyangli@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/osc_request.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index 18b99a9..1fc7a57 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -1153,7 +1153,8 @@ static int osc_checksum_bulk_t10pi(const char *obd_name, int nob,
 		 * The left guard number should be able to hold checksums of a
 		 * whole page
 		 */
-		rc = obd_page_dif_generate_buffer(obd_name, pga[i]->pg, 0,
+		rc = obd_page_dif_generate_buffer(obd_name, pga[i]->pg,
+						  pga[i]->off & ~PAGE_MASK,
 						  count,
 						  guard_start + used_number,
 						  guard_number - used_number,
-- 
1.8.3.1


* [lustre-devel] [PATCH 174/622] lnet: increase lnet transaction timeout
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (172 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 173/622] lustre: osc: wrong page offset for T10PI checksum James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 175/622] lnet: handle multi-md usage James Simmons
                   ` (448 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Sonia Sharma <sharmaso@whamcloud.com>

Increase the new LNet Health transaction timeout to the original
50s value, to avoid spurious lnet-selftest failures and false
timeouts that are expected under load.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11389
Lustre-commit: 73fdd1579d87 ("LU-11389 lnet: increase lnet transaction timeout")
Signed-off-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33231
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Nunez <jnunez@whamcloud.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/api-ni.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 25592db..3ee10da 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -126,7 +126,7 @@ static int recovery_interval_set(const char *val,
 MODULE_PARM_DESC(lnet_peer_discovery_disabled,
 		 "Set to 1 to disable peer discovery on this node.");
 
-unsigned int lnet_transaction_timeout = 5;
+unsigned int lnet_transaction_timeout = 50;
 static int transaction_to_set(const char *val, const struct kernel_param *kp);
 static struct kernel_param_ops param_ops_transaction_timeout = {
 	.set = transaction_to_set,
-- 
1.8.3.1


* [lustre-devel] [PATCH 175/622] lnet: handle multi-md usage
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (173 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 174/622] lnet: increase lnet transaction timeout James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 176/622] lustre: uapi: fix warnings when lustre_user.h included James Simmons
                   ` (447 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

The MD can be used multiple times. The response tracker needs to have
the same lifespan as the MD. If we re-use the MD and a response
tracker has already been attached to it, then we'll update the
deadline for the response tracker. This means the deadline on the MD
is for its last user.
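
The attach-or-refresh pattern described above can be modelled with the toy structures below (the names only echo the real ones): a reused MD keeps its existing tracker and merely gets a new deadline, so the deadline always reflects the MD's last user.

```c
#include <assert.h>
#include <stddef.h>

struct tracker {
	long deadline;
};

struct md_model {
	struct tracker *rspt;
};

static struct tracker *attach_or_refresh(struct md_model *md,
					 struct tracker *fresh,
					 long now, long timeout)
{
	struct tracker *t = md->rspt;

	if (!t) {
		md->rspt = fresh;	/* first user of this MD */
		t = fresh;
	}
	t->deadline = now + timeout;	/* latest user wins */
	return t;
}
```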

WC-bug-id: https://jira.whamcloud.com/browse/LU-11734
Lustre-commit: 8c249097e627 ("LU-11734 lnet: handle multi-md usage")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33794
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h |  1 -
 net/lnet/lnet/lib-move.c      | 47 +++++++++++++++++--------------
 net/lnet/lnet/lib-msg.c       | 64 +++++++++++++++++++++----------------------
 3 files changed, 57 insertions(+), 55 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index 26095a6..bbb678f 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -550,7 +550,6 @@ int lnet_get_peer_list(u32 *countp, u32 *sizep,
 
 void lnet_msg_attach_md(struct lnet_msg *msg, struct lnet_libmd *md,
 			unsigned int offset, unsigned int mlen);
-void lnet_msg_detach_md(struct lnet_msg *msg, int status);
 void lnet_build_unlink_event(struct lnet_libmd *md, struct lnet_event *ev);
 void lnet_build_msg_event(struct lnet_msg *msg, enum lnet_event_kind ev_type);
 void lnet_msg_commit(struct lnet_msg *msg, int cpt);
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index eacda4c..3bcac03 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -2437,6 +2437,7 @@ struct lnet_mt_event_info {
 	lnet_nid_t mt_nid;
 };
 
+/* called with res_lock held */
 void
 lnet_detach_rsp_tracker(struct lnet_libmd *md, int cpt)
 {
@@ -2446,11 +2447,9 @@ struct lnet_mt_event_info {
 	 * The rspt queue for the cpt is protected by
 	 * the lnet_net_lock(cpt). cpt is the cpt of the MD cookie.
 	 */
-	lnet_res_lock(cpt);
-	if (!md->md_rspt_ptr) {
-		lnet_res_unlock(cpt);
+	if (!md->md_rspt_ptr)
 		return;
-	}
+
 	rspt = md->md_rspt_ptr;
 	md->md_rspt_ptr = NULL;
 
@@ -2462,7 +2461,6 @@ struct lnet_mt_event_info {
 	 * the rspt block.
 	 */
 	LNetInvalidateMDHandle(&rspt->rspt_mdh);
-	lnet_res_unlock(cpt);
 }
 
 static void
@@ -4152,6 +4150,8 @@ void lnet_monitor_thr_stop(void)
 			struct lnet_libmd *md, struct lnet_handle_md mdh)
 {
 	s64 timeout_ns;
+	bool new_entry = true;
+	struct lnet_rsp_tracker *local_rspt;
 
 	/* MD has a refcount taken by message so it's not going away.
 	 * The MD however can be looked up. We need to secure the access
@@ -4159,27 +4159,34 @@ void lnet_monitor_thr_stop(void)
 	 * The rspt can be accessed without protection up to when it gets
 	 * added to the list.
 	 */
-
-	/* debug code */
-	LASSERT(!md->md_rspt_ptr);
-
-	/* we'll use that same event in case we never get a response  */
-	rspt->rspt_mdh = mdh;
-	rspt->rspt_cpt = cpt;
-	timeout_ns = lnet_transaction_timeout * NSEC_PER_SEC;
-	rspt->rspt_deadline = ktime_add_ns(ktime_get(), timeout_ns);
-
 	lnet_res_lock(cpt);
-	/* store the rspt so we can access it when we get the REPLY */
-	md->md_rspt_ptr = rspt;
-	lnet_res_unlock(cpt);
+	local_rspt = md->md_rspt_ptr;
+	timeout_ns = lnet_transaction_timeout * NSEC_PER_SEC;
+	if (local_rspt) {
+		/* we already have an rspt attached to the md, so we'll
+		 * update the deadline on that one.
+		 */
+		kfree(rspt);
+		new_entry = false;
+	} else {
+		/* new md */
+		rspt->rspt_mdh = mdh;
+		rspt->rspt_cpt = cpt;
+		/* store the rspt so we can access it when we get the REPLY */
+		md->md_rspt_ptr = rspt;
+		local_rspt = rspt;
+	}
+	local_rspt->rspt_deadline = ktime_add_ns(ktime_get(), timeout_ns);
 
 	/* add to the list of tracked responses. It's added to tail of the
 	 * list in order to expire all the older entries first.
 	 */
 	lnet_net_lock(cpt);
-	list_add_tail(&rspt->rspt_on_list, the_lnet.ln_mt_rstq[cpt]);
+	if (!new_entry && !list_empty(&local_rspt->rspt_on_list))
+		list_del_init(&local_rspt->rspt_on_list);
+	list_add_tail(&local_rspt->rspt_on_list, the_lnet.ln_mt_rstq[cpt]);
 	lnet_net_unlock(cpt);
+	lnet_res_unlock(cpt);
 }
 
 /**
@@ -4321,7 +4328,6 @@ void lnet_monitor_thr_stop(void)
 		CNETERR("Error sending PUT to %s: %d\n",
 			libcfs_id2str(target), rc);
 		msg->msg_no_resend = true;
-		lnet_detach_rsp_tracker(msg->msg_md, cpt);
 		lnet_finalize(msg, rc);
 	}
 
@@ -4543,7 +4549,6 @@ struct lnet_msg *
 		CNETERR("Error sending GET to %s: %d\n",
 			libcfs_id2str(target), rc);
 		msg->msg_no_resend = true;
-		lnet_detach_rsp_tracker(msg->msg_md, cpt);
 		lnet_finalize(msg, rc);
 	}
 
diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index f626ca3..af0675e 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -369,29 +369,6 @@
 	lnet_md_deconstruct(md, &msg->msg_ev.md);
 }
 
-void
-lnet_msg_detach_md(struct lnet_msg *msg, int status)
-{
-	struct lnet_libmd *md = msg->msg_md;
-	int unlink;
-
-	/* Now it's safe to drop my caller's ref */
-	md->md_refcount--;
-	LASSERT(md->md_refcount >= 0);
-
-	unlink = lnet_md_unlinkable(md);
-	if (md->md_eq) {
-		msg->msg_ev.status = status;
-		msg->msg_ev.unlinked = unlink;
-		lnet_eq_enqueue_event(md->md_eq, &msg->msg_ev);
-	}
-
-	if (unlink)
-		lnet_md_unlink(md);
-
-	msg->msg_md = NULL;
-}
-
 static int
 lnet_complete_msg_locked(struct lnet_msg *msg, int cpt)
 {
@@ -772,12 +749,42 @@
 }
 
 static void
+lnet_msg_detach_md(struct lnet_msg *msg, int cpt, int status)
+{
+	struct lnet_libmd *md = msg->msg_md;
+	int unlink;
+
+	/* Now it's safe to drop my caller's ref */
+	md->md_refcount--;
+	LASSERT(md->md_refcount >= 0);
+
+	unlink = lnet_md_unlinkable(md);
+	if (md->md_eq) {
+		msg->msg_ev.status = status;
+		msg->msg_ev.unlinked = unlink;
+		lnet_eq_enqueue_event(md->md_eq, &msg->msg_ev);
+	}
+
+	if (unlink) {
+		/* if this is an ACK or a REPLY then make sure to remove the
+		 * response tracker.
+		 */
+		if (msg->msg_ev.type == LNET_EVENT_REPLY ||
+		    msg->msg_ev.type == LNET_EVENT_ACK)
+			lnet_detach_rsp_tracker(msg->msg_md, cpt);
+		lnet_md_unlink(md);
+	}
+
+	msg->msg_md = NULL;
+}
+
+static void
 lnet_detach_md(struct lnet_msg *msg, int status)
 {
 	int cpt = lnet_cpt_of_cookie(msg->msg_md->md_lh.lh_cookie);
 
 	lnet_res_lock(cpt);
-	lnet_msg_detach_md(msg, status);
+	lnet_msg_detach_md(msg, cpt, status);
 	lnet_res_unlock(cpt);
 }
 
@@ -877,15 +884,6 @@
 
 	msg->msg_ev.status = status;
 
-	/* if this is an ACK or a REPLY then make sure to remove the
-	 * response tracker.
-	 */
-	if (msg->msg_ev.type == LNET_EVENT_REPLY ||
-	    msg->msg_ev.type == LNET_EVENT_ACK) {
-		cpt = lnet_cpt_of_cookie(msg->msg_md->md_lh.lh_cookie);
-		lnet_detach_rsp_tracker(msg->msg_md, cpt);
-	}
-
 	/* if the message is successfully sent, no need to keep the MD around */
 	if (msg->msg_md && !status)
 		lnet_detach_md(msg, status);
-- 
1.8.3.1


* [lustre-devel] [PATCH 176/622] lustre: uapi: fix warnings when lustre_user.h included
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (174 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 175/622] lnet: handle multi-md usage James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 177/622] lustre: obdclass: lu_dirent record length missing '0' James Simmons
                   ` (446 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Checking for lustre/lustre_user.h in a configure script
generates a warning because of the included <sys/quota.h>

  checking lustre/lustre_user.h usability... no
  checking lustre/lustre_user.h presence... yes
  WARNING: present but cannot be compiled
  WARNING: check for missing prerequisite headers?
  WARNING: see the Autoconf documentation
  WARNING: section "Present But Cannot Be Compiled"
  WARNING: proceeding with the preprocessor's result
  WARNING: in the future, the compiler will take precedence

Looking into config.log it shows:

  In file included from /usr/include/lustre/lustre_user.h:59,
                   from conftest.c:91:
  /usr/include/sys/quota.h:221: error: expected declaration
    specifiers or '...' before 'caddr_t'

Since we don't really need much from the <sys/quota.h> header,
just use the default linux UAPI quota header.

Fix an unused variable warning in ll_dir_ioctl().

Lustre-commit: db0592145574 ("LU-11783 utils: fix warnings when lustre_user.h included")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33876
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/dir.c                   | 2 +-
 include/uapi/linux/lustre/lustre_user.h | 3 +--
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/llite/dir.c b/fs/lustre/llite/dir.c
index f54987a..ef4fa36 100644
--- a/fs/lustre/llite/dir.c
+++ b/fs/lustre/llite/dir.c
@@ -1356,7 +1356,7 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		struct lov_user_md_v1 *lumv1_ptr = &lumv1;
 		struct lov_user_md_v1 __user *lumv1p = (void __user *)arg;
 		struct lov_user_md_v3 __user *lumv3p = (void __user *)arg;
-		int lum_size;
+		int lum_size = 0;
 
 		int set_default = 0;
 
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 3bd6fc7..649aeeb 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -44,10 +44,10 @@
 
 #include <linux/kernel.h>
 #include <linux/types.h>
+#include <linux/quota.h>
 
 #ifdef __KERNEL__
 # include <linux/fs.h>
-# include <linux/quota.h>
 # include <linux/sched/signal.h>
 # include <linux/string.h> /* snprintf() */
 # include <linux/version.h>
@@ -57,7 +57,6 @@
 # include <stdbool.h>
 # include <stdio.h> /* snprintf() */
 # include <string.h>
-# include <sys/quota.h>
 # include <sys/stat.h>
 #endif /* __KERNEL__ */
 #include <uapi/linux/lustre/lustre_fiemap.h>
-- 
1.8.3.1


* [lustre-devel] [PATCH 177/622] lustre: obdclass: lu_dirent record length missing '0'
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (175 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 176/622] lustre: uapi: fix warnings when lustre_user.h included James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 178/622] lustre: update version to 2.11.99 James Simmons
                   ` (445 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

In lu_dirent packing, a '0' is appended after the name, but it is
not counted in the size calculation, which may cause a crash.
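
The fixed arithmetic is easy to check in isolation. Below is a minimal sketch of the corrected lu_dirent_calc_size() logic, using hypothetical stand-in sizes for the real lu_dirent/luda_type structs:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real on-wire struct sizes. */
#define LU_DIRENT_HDR  32   /* sizeof(struct lu_dirent) */
#define LUDA_TYPE_SIZE 2    /* sizeof(struct luda_type) */
#define LUDA_TYPE      0x1

/* Mirrors the fixed lu_dirent_calc_size(): the +1 reserves room
 * for the '\0' appended after the name during packing. */
static size_t dirent_calc_size(size_t namelen, unsigned int attr)
{
	size_t size;

	if (attr & LUDA_TYPE) {
		const size_t align = LUDA_TYPE_SIZE - 1;

		size = (LU_DIRENT_HDR + namelen + 1 + align) & ~align;
		size += LUDA_TYPE_SIZE;
	} else {
		size = LU_DIRENT_HDR + namelen + 1;
	}

	/* records are 8-byte aligned */
	return (size + 7) & ~7UL;
}
```

For example, with an 8-byte name and no LUDA_TYPE the old formula returned 40, while the appended '\0' makes the record 41 bytes, which must round up to 48.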

WC-bug-id: https://jira.whamcloud.com/browse/LU-11753
Lustre-commit: 77f01308c509 ("LU-11753 obdclass: lu_dirent record length missing '0'")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33865
Reviewed-by: Stephan Thiell <sthiell@stanford.edu>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/uapi/linux/lustre/lustre_idl.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 599fe86..4236a43 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -480,10 +480,11 @@ static inline size_t lu_dirent_calc_size(size_t namelen, __u16 attr)
 	if (attr & LUDA_TYPE) {
 		const size_t align = sizeof(struct luda_type) - 1;
 
-		size = (sizeof(struct lu_dirent) + namelen + align) & ~align;
+		size = (sizeof(struct lu_dirent) + namelen + 1 + align) &
+		       ~align;
 		size += sizeof(struct luda_type);
 	} else {
-		size = sizeof(struct lu_dirent) + namelen;
+		size = sizeof(struct lu_dirent) + namelen + 1;
 	}
 
 	return (size + 7) & ~7;
-- 
1.8.3.1

* [lustre-devel] [PATCH 178/622] lustre: update version to 2.11.99
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (176 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 177/622] lustre: obdclass: lu_dirent record length missing '0' James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 179/622] lustre: osc: limit chunk number of write submit James Simmons
                   ` (444 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

With nearly all of the missing patches from the lustre 2.12
version merged upstream, it's time to update the upstream
client's version.
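
As context for the hunk below, the version components also feed the OBD_OCD_VERSION() packing already present in lustre_ver.h; a standalone sketch of that encoding:

```c
#include <assert.h>

/* Same packing as in lustre_ver.h: one byte per version component,
 * major in the top byte, so versions compare numerically. */
#define OBD_OCD_VERSION(major, minor, patch, fix)			\
	(((major) << 24) + ((minor) << 16) + ((patch) << 8) + (fix))
```

So 2.11.99.0 packs to 0x020B6300, and individual components can be recovered by shifting and masking.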

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/uapi/linux/lustre/lustre_ver.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/lustre/lustre_ver.h b/include/uapi/linux/lustre/lustre_ver.h
index d7c53c5..8ceb57d 100644
--- a/include/uapi/linux/lustre/lustre_ver.h
+++ b/include/uapi/linux/lustre/lustre_ver.h
@@ -2,10 +2,10 @@
 #define _LUSTRE_VER_H_
 
 #define LUSTRE_MAJOR 2
-#define LUSTRE_MINOR 10
+#define LUSTRE_MINOR 11
 #define LUSTRE_PATCH 99
 #define LUSTRE_FIX 0
-#define LUSTRE_VERSION_STRING "2.10.99"
+#define LUSTRE_VERSION_STRING "2.11.99"
 
 #define OBD_OCD_VERSION(major, minor, patch, fix)			\
 	(((major) << 24) + ((minor) << 16) + ((patch) << 8) + (fix))
-- 
1.8.3.1

* [lustre-devel] [PATCH 179/622] lustre: osc: limit chunk number of write submit
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (177 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 178/622] lustre: update version to 2.11.99 James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 180/622] lustre: osc: speed up page cache cleanup during blocking ASTs James Simmons
                   ` (443 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Bobi Jam <bobijam@whamcloud.com>

Don't queue too many pages in an extent for a write RPC; we need
to take care of the chunk limit in write submit as well (refer to
LU-8135 for more details).
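
A rough model of the new per-page check in osc_io_submit() is sketched below (the constants are illustrative; PTLRPC_MAX_BRW_SIZE is 16MB as noted in the osc_max_write_chunks() comment):

```c
#include <assert.h>
#include <stdbool.h>

#define PTLRPC_MAX_BRW_SIZE (16u << 20)	/* 16MB */

/* Mirrors osc_max_write_chunks(): max chunks one write RPC may span. */
static unsigned int max_write_chunks(unsigned int chunkbits)
{
	return PTLRPC_MAX_BRW_SIZE >> chunkbits;
}

/* Mirrors the new check: flush the queued pages when adding one
 * more page would push the RPC past the chunk limit. */
static bool must_flush(unsigned int queued, unsigned int chunkbits,
		       unsigned int page_shift)
{
	unsigned int ppc_bits = chunkbits - page_shift;
	unsigned int ppc = 1u << ppc_bits;	/* pages per chunk */
	unsigned int chunks = (queued + ppc - 1) >> ppc_bits;
	/* chunk count if one more page were added */
	unsigned int next_chunks = (queued + ppc) >> ppc_bits;

	return chunks == max_write_chunks(chunkbits) &&
	       next_chunks > chunks;
}
```

With 1MB chunks (chunkbits = 20) and 4KB pages, 16 chunks hold 4096 pages, so the queue is flushed exactly when the 4096th page is queued and the next page would start a 17th chunk.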

WC-bug-id: https://jira.whamcloud.com/browse/LU-10239
Lustre-commit: 93ef6e7863b4 ("LU-10239 osc: limit chunk number of write submit")
Signed-off-by: Bobi Jam <bobijam@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/30627
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/osc_cache.c    | 30 ------------------------------
 fs/lustre/osc/osc_internal.h | 30 ++++++++++++++++++++++++++++++
 fs/lustre/osc/osc_io.c       | 27 +++++++++++++++++++++++++--
 3 files changed, 55 insertions(+), 32 deletions(-)

diff --git a/fs/lustre/osc/osc_cache.c b/fs/lustre/osc/osc_cache.c
index 47aee99..1ff258c 100644
--- a/fs/lustre/osc/osc_cache.c
+++ b/fs/lustre/osc/osc_cache.c
@@ -1937,36 +1937,6 @@ static int try_to_add_extent_for_io(struct client_obd *cli,
 	return 1;
 }
 
-static inline unsigned int osc_max_write_chunks(const struct client_obd *cli)
-{
-	/*
-	 * LU-8135:
-	 *
-	 * The maximum size of a single transaction is about 64MB in ZFS.
-	 * #define DMU_MAX_ACCESS (64 * 1024 * 1024)
-	 *
-	 * Since ZFS is a copy-on-write file system, a single dirty page in
-	 * a chunk will result in the rewrite of the whole chunk, therefore
-	 * an RPC shouldn't be allowed to contain too many chunks otherwise
-	 * it will make transaction size much bigger than 64MB, especially
-	 * with big block size for ZFS.
-	 *
-	 * This piece of code is to make sure that OSC won't send write RPCs
-	 * with too many chunks. The maximum chunk size that an RPC can cover
-	 * is set to PTLRPC_MAX_BRW_SIZE, which is defined to 16MB. Ideally
-	 * OST should tell the client what the biggest transaction size is,
-	 * but it's good enough for now.
-	 *
-	 * This limitation doesn't apply to ldiskfs, which allows as many
-	 * chunks in one RPC as we want. However, it won't have any benefits
-	 * to have too many discontiguous pages in one RPC.
-	 *
-	 * An osc_extent won't cover over a RPC size, so the chunks in an
-	 * osc_extent won't bigger than PTLRPC_MAX_BRW_SIZE >> chunkbits.
-	 */
-	return PTLRPC_MAX_BRW_SIZE >> cli->cl_chunkbits;
-}
-
 /**
  * In order to prevent multiple ptlrpcd from breaking contiguous extents,
  * get_write_extent() takes all appropriate extents in atomic.
diff --git a/fs/lustre/osc/osc_internal.h b/fs/lustre/osc/osc_internal.h
index 3ba209f..2cb737b 100644
--- a/fs/lustre/osc/osc_internal.h
+++ b/fs/lustre/osc/osc_internal.h
@@ -162,6 +162,36 @@ unsigned long osc_cache_shrink_count(struct shrinker *sk,
 unsigned long osc_cache_shrink_scan(struct shrinker *sk,
 				    struct shrink_control *sc);
 
+static inline unsigned int osc_max_write_chunks(const struct client_obd *cli)
+{
+	/*
+	 * LU-8135:
+	 *
+	 * The maximum size of a single transaction is about 64MB in ZFS.
+	 * #define DMU_MAX_ACCESS (64 * 1024 * 1024)
+	 *
+	 * Since ZFS is a copy-on-write file system, a single dirty page in
+	 * a chunk will result in the rewrite of the whole chunk, therefore
+	 * an RPC shouldn't be allowed to contain too many chunks otherwise
+	 * it will make transaction size much bigger than 64MB, especially
+	 * with big block size for ZFS.
+	 *
+	 * This piece of code is to make sure that OSC won't send write RPCs
+	 * with too many chunks. The maximum chunk size that an RPC can cover
+	 * is set to PTLRPC_MAX_BRW_SIZE, which is defined to 16MB. Ideally
+	 * OST should tell the client what the biggest transaction size is,
+	 * but it's good enough for now.
+	 *
+	 * This limitation doesn't apply to ldiskfs, which allows as many
+	 * chunks in one RPC as we want. However, it won't have any benefits
+	 * to have too many discontiguous pages in one RPC.
+	 *
+	 * An osc_extent won't cover over a RPC size, so the chunks in an
+	 * osc_extent won't bigger than PTLRPC_MAX_BRW_SIZE >> chunkbits.
+	 */
+	return PTLRPC_MAX_BRW_SIZE >> cli->cl_chunkbits;
+}
+
 static inline void osc_set_io_portal(struct ptlrpc_request *req)
 {
 	struct obd_import *imp = req->rq_import;
diff --git a/fs/lustre/osc/osc_io.c b/fs/lustre/osc/osc_io.c
index 1485962..56f30cb 100644
--- a/fs/lustre/osc/osc_io.c
+++ b/fs/lustre/osc/osc_io.c
@@ -122,6 +122,9 @@ int osc_io_submit(const struct lu_env *env, const struct cl_io_slice *ios,
 	int result = 0;
 	int brw_flags;
 	unsigned int max_pages;
+	unsigned int ppc_bits; /* pages per chunk bits */
+	unsigned int ppc;
+	bool sync_queue = false;
 
 	LASSERT(qin->pl_nr > 0);
 
@@ -130,6 +133,8 @@ int osc_io_submit(const struct lu_env *env, const struct cl_io_slice *ios,
 	osc = cl2osc(ios->cis_obj);
 	cli = osc_cli(osc);
 	max_pages = cli->cl_max_pages_per_rpc;
+	ppc_bits = cli->cl_chunkbits - PAGE_SHIFT;
+	ppc = 1 << ppc_bits;
 
 	brw_flags = osc_io_srvlock(cl2osc_io(env, ios)) ? OBD_BRW_SRVLOCK : 0;
 	brw_flags |= crt == CRT_WRITE ? OBD_BRW_WRITE : OBD_BRW_READ;
@@ -186,12 +191,30 @@ int osc_io_submit(const struct lu_env *env, const struct cl_io_slice *ios,
 		else /* async IO */
 			cl_page_list_del(env, qin, page);
 
-		if (++queued == max_pages) {
-			queued = 0;
+		queued++;
+		if (queued == max_pages) {
+			sync_queue = true;
+		} else if (crt == CRT_WRITE) {
+			unsigned int chunks;
+			unsigned int next_chunks;
+
+			chunks = (queued + ppc - 1) >> ppc_bits;
+			/* chunk number if add another page */
+			next_chunks = (queued + ppc) >> ppc_bits;
+
+			/* next page will excceed write chunk limit */
+			if (chunks == osc_max_write_chunks(cli) &&
+			    next_chunks > chunks)
+				sync_queue = true;
+		}
+
+		if (sync_queue) {
 			result = osc_queue_sync_pages(env, io, osc, &list,
 						      brw_flags);
 			if (result < 0)
 				break;
+			queued = 0;
+			sync_queue = false;
 		}
 	}
 
-- 
1.8.3.1

* [lustre-devel] [PATCH 180/622] lustre: osc: speed up page cache cleanup during blocking ASTs
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (178 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 179/622] lustre: osc: limit chunk number of write submit James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 181/622] lustre: lmv: Fix style issues for lmv_fld.c James Simmons
                   ` (442 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Andrew Perepechko <c17827@cray.com>

While we are cleaning a write lock, we don't need to check if
page cache pages under this lock are covered by another lock.

If a client needs to give up its lock, cleaning gigabytes of
page cache can take quite a long time.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11296
Lustre-commit: b9ebb17277c7 ("LU-11296 osc: speed up page cache cleanup during blocking ASTs")
Signed-off-by: Andrew Perepechko <c17827@cray.com>
Cray-bug-id: LUS-6352
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Alexander Zarochentsev <c17826@cray.com>
Reviewed-on: https://review.whamcloud.com/33090
Reviewed-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/osc_lock.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/osc/osc_lock.c b/fs/lustre/osc/osc_lock.c
index 1a2b0bd..eccea37 100644
--- a/fs/lustre/osc/osc_lock.c
+++ b/fs/lustre/osc/osc_lock.c
@@ -372,7 +372,12 @@ static int osc_lock_flush(struct osc_object *obj, pgoff_t start, pgoff_t end,
 			rc = 0;
 	}
 
-	rc2 = osc_lock_discard_pages(env, obj, start, end, discard);
+	/*
+	 * Do not try to match other locks with CLM_WRITE since we already
+	 * know there're none
+	 */
+	rc2 = osc_lock_discard_pages(env, obj, start, end,
+				     mode == CLM_WRITE || discard);
 	if (rc == 0 && rc2 < 0)
 		rc = rc2;
 
-- 
1.8.3.1

* [lustre-devel] [PATCH 181/622] lustre: lmv: Fix style issues for lmv_fld.c
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (179 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 180/622] lustre: osc: speed up page cache cleanup during blocking ASTs James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 182/622] lustre: llite: Fix style issues for llite_nfs.c James Simmons
                   ` (441 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Arshad Hussain <arshad.super@gmail.com>

This patch fixes issues reported by checkpatch for file
fs/lustre/lmv/lmv_fld.c

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: 72ee63625055 ("LU-6142 lmv: Fix style issues for lmv_fld.c")
Signed-off-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-on: https://review.whamcloud.com/33566
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/lmv/lmv_fld.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/fs/lustre/lmv/lmv_fld.c b/fs/lustre/lmv/lmv_fld.c
index 00dc858..ef2c866 100644
--- a/fs/lustre/lmv/lmv_fld.c
+++ b/fs/lustre/lmv/lmv_fld.c
@@ -58,15 +58,17 @@ int lmv_fld_lookup(struct lmv_obd *lmv, const struct lu_fid *fid, u32 *mds)
 	 */
 	if (!fid_is_sane(fid) || !(fid_seq_in_fldb(fid_seq(fid)) ||
 				   fid_seq_is_local_file(fid_seq(fid)))) {
-		CERROR("%s: invalid FID " DFID "\n", obd->obd_name, PFID(fid));
-		return -EINVAL;
+		rc = -EINVAL;
+		CERROR("%s: invalid FID " DFID ": rc = %d\n", obd->obd_name,
+		       PFID(fid), rc);
+		return rc;
 	}
 
 	rc = fld_client_lookup(&lmv->lmv_fld, fid_seq(fid), mds,
 			       LU_SEQ_RANGE_MDT, NULL);
 	if (rc) {
-		CERROR("Error while looking for mds number. Seq %#llx, err = %d\n",
-		       fid_seq(fid), rc);
+		CERROR("%s: Error while looking for mds number. Seq %#llx: rc = %d\n",
+		       obd->obd_name, fid_seq(fid), rc);
 		return rc;
 	}
 
@@ -74,9 +76,10 @@ int lmv_fld_lookup(struct lmv_obd *lmv, const struct lu_fid *fid, u32 *mds)
 	       *mds, PFID(fid));
 
 	if (*mds >= lmv->desc.ld_tgt_count) {
-		CERROR("FLD lookup got invalid mds #%x (max: %x) for fid=" DFID "\n", *mds, lmv->desc.ld_tgt_count,
-		       PFID(fid));
 		rc = -EINVAL;
+		CERROR("%s: FLD lookup got invalid mds #%x (max: %x) for fid=" DFID ": rc = %d\n",
+		       obd->obd_name, *mds, lmv->desc.ld_tgt_count, PFID(fid),
+		       rc);
 	}
 	return rc;
 }
-- 
1.8.3.1

* [lustre-devel] [PATCH 182/622] lustre: llite: Fix style issues for llite_nfs.c
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (180 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 181/622] lustre: lmv: Fix style issues for lmv_fld.c James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 183/622] lustre: llite: Fix style issues for lcommon_misc.c James Simmons
                   ` (440 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Arshad Hussain <arshad.super@gmail.com>

This patch fixes issues reported by checkpatch
for file fs/lustre/llite/llite_nfs.c

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: c648f5ddc3e8 ("LU-6142 llite: Fix style issues for llite_nfs.c")
Signed-off-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-on: https://review.whamcloud.com/33809
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Ben Evans <bevans@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_nfs.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/fs/lustre/llite/llite_nfs.c b/fs/lustre/llite/llite_nfs.c
index 434f92b..de8f707 100644
--- a/fs/lustre/llite/llite_nfs.c
+++ b/fs/lustre/llite/llite_nfs.c
@@ -64,12 +64,11 @@ struct inode *search_inode_for_lustre(struct super_block *sb,
 	struct ptlrpc_request *req = NULL;
 	struct inode *inode = NULL;
 	int eadatalen = 0;
-	unsigned long hash = cl_fid_build_ino(fid,
-					      ll_need_32bit_api(sbi));
+	unsigned long hash = cl_fid_build_ino(fid, ll_need_32bit_api(sbi));
 	struct md_op_data *op_data;
 	int rc;
 
-	CDEBUG(D_INFO, "searching inode for:(%lu," DFID ")\n", hash, PFID(fid));
+	CDEBUG(D_INFO, "searching inode for:(%lu,"DFID")\n", hash, PFID(fid));
 
 	inode = ilookup5(sb, hash, ll_test_inode_by_fid, (void *)fid);
 	if (inode)
@@ -79,7 +78,8 @@ struct inode *search_inode_for_lustre(struct super_block *sb,
 	if (rc)
 		return ERR_PTR(rc);
 
-	/* Because inode is NULL, ll_prep_md_op_data can not
+	/*
+	 * Because inode is NULL, ll_prep_md_op_data can not
 	 * be used here. So we allocate op_data ourselves
 	 */
 	op_data = kzalloc(sizeof(*op_data), GFP_NOFS);
@@ -94,6 +94,10 @@ struct inode *search_inode_for_lustre(struct super_block *sb,
 	rc = md_getattr(sbi->ll_md_exp, op_data, &req);
 	kfree(op_data);
 	if (rc) {
+		/*
+		 * Suppress erroneous/confusing messages when NFS
+		 * is out of sync and requests old data.
+		 */
 		CDEBUG(D_INFO, "can't get object attrs, fid " DFID ", rc %d\n",
 		       PFID(fid), rc);
 		return ERR_PTR(rc);
@@ -107,8 +111,8 @@ struct inode *search_inode_for_lustre(struct super_block *sb,
 }
 
 struct lustre_nfs_fid {
-	struct lu_fid	lnf_child;
-	struct lu_fid	lnf_parent;
+	struct lu_fid lnf_child;
+	struct lu_fid lnf_parent;
 };
 
 static struct dentry *
-- 
1.8.3.1

* [lustre-devel] [PATCH 183/622] lustre: llite: Fix style issues for lcommon_misc.c
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (181 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 182/622] lustre: llite: Fix style issues for llite_nfs.c James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 184/622] lustre: llite: Fix style issues for symlink.c James Simmons
                   ` (439 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Arshad Hussain <arshad.super@gmail.com>

This patch fixes issues reported by checkpatch
for file fs/lustre/llite/lcommon_misc.c

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: aac46ee4f871 ("LU-6142 llite: Fix style issues for lcommon_misc.c")
Signed-off-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-on: https://review.whamcloud.com/33810
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Ben Evans <bevans@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/lcommon_misc.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/fs/lustre/llite/lcommon_misc.c b/fs/lustre/llite/lcommon_misc.c
index 48503d6..d833a16 100644
--- a/fs/lustre/llite/lcommon_misc.c
+++ b/fs/lustre/llite/lcommon_misc.c
@@ -41,14 +41,17 @@
 
 #include "llite_internal.h"
 
-/* Initialize the default and maximum LOV EA and cookie sizes.  This allows
+/*
+ * Initialize the default and maximum LOV EA and cookie sizes.  This allows
  * us to make MDS RPCs with large enough reply buffers to hold the
  * maximum-sized (= maximum striped) EA and cookie without having to
  * calculate this (via a call into the LOV + OSCs) each time we make an RPC.
  */
 static int cl_init_ea_size(struct obd_export *md_exp, struct obd_export *dt_exp)
 {
-	u32 val_size, max_easize, def_easize;
+	u32 val_size;
+	u32 max_easize;
+	u32 def_easize;
 	int rc;
 
 	val_size = sizeof(max_easize);
@@ -83,9 +86,9 @@ int cl_ocd_update(struct obd_device *host, struct obd_device *watched,
 		  enum obd_notify_event ev, void *owner)
 {
 	struct lustre_client_ocd *lco;
-	struct client_obd	*cli;
+	struct client_obd *cli;
 	u64 flags;
-	int   result;
+	int result;
 
 	if (!strcmp(watched->obd_type->typ_name, LUSTRE_OSC_NAME) &&
 	    watched->obd_set_up && !watched->obd_stopping) {
@@ -117,13 +120,13 @@ int cl_ocd_update(struct obd_device *host, struct obd_device *watched,
 int cl_get_grouplock(struct cl_object *obj, unsigned long gid, int nonblock,
 		     struct ll_grouplock *lg)
 {
-	struct lu_env	  *env;
-	struct cl_io	   *io;
-	struct cl_lock	 *lock;
-	struct cl_lock_descr   *descr;
-	u32		   enqflags;
+	struct lu_env *env;
+	struct cl_io *io;
+	struct cl_lock *lock;
+	struct cl_lock_descr *descr;
+	u32 enqflags;
 	u16 refcheck;
-	int		     rc;
+	int rc;
 
 	env = cl_env_get(&refcheck);
 	if (IS_ERR(env))
-- 
1.8.3.1

* [lustre-devel] [PATCH 184/622] lustre: llite: Fix style issues for symlink.c
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (182 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 183/622] lustre: llite: Fix style issues for lcommon_misc.c James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 185/622] lustre: headers: define pct(a, b) once James Simmons
                   ` (438 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Arshad Hussain <arshad.super@gmail.com>

This patch fixes issues reported by checkpatch
for file fs/lustre/llite/symlink.c

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: e486703b5278 ("LU-6142 llite: Fix style issues for symlink.c")
Signed-off-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-on: https://review.whamcloud.com/33811
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Ben Evans <bevans@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/symlink.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/lustre/llite/symlink.c b/fs/lustre/llite/symlink.c
index 0690fdb..d2922d1 100644
--- a/fs/lustre/llite/symlink.c
+++ b/fs/lustre/llite/symlink.c
@@ -53,7 +53,8 @@ static int ll_readlink_internal(struct inode *inode,
 		int print_limit = min_t(int, PAGE_SIZE - 128, symlen);
 
 		*symname = lli->lli_symlink_name;
-		/* If the total CDEBUG() size is larger than a page, it
+		/*
+		 * If the total CDEBUG() size is larger than a page, it
 		 * will print a warning to the console, avoid this by
 		 * printing just the last part of the symlink.
 		 */
@@ -97,11 +98,11 @@ static int ll_readlink_internal(struct inode *inode,
 	}
 
 	*symname = req_capsule_server_get(&(*request)->rq_pill, &RMF_MDT_MD);
-	if (!*symname ||
-	    strnlen(*symname, symlen) != symlen - 1) {
+	if (!*symname || strnlen(*symname, symlen) != symlen - 1) {
 		/* not full/NULL terminated */
-		CERROR("inode %lu: symlink not NULL terminated string of length %d\n",
-		       inode->i_ino, symlen - 1);
+		CERROR("%s: inode " DFID ": symlink not NULL terminated string of length %d\n",
+		       ll_get_fsname(inode->i_sb, NULL, 0),
+		       PFID(ll_inode2fid(inode)), symlen - 1);
 		rc = -EPROTO;
 		goto failed;
 	}
@@ -143,7 +144,8 @@ static const char *ll_get_link(struct dentry *dentry,
 		return ERR_PTR(rc);
 	}
 
-	/* symname may contain a pointer to the request message buffer,
+	/*
+	 * symname may contain a pointer to the request message buffer,
 	 * we delay request releasing then.
 	 */
 	set_delayed_call(done, ll_put_link, request);
-- 
1.8.3.1

* [lustre-devel] [PATCH 185/622] lustre: headers: define pct(a, b) once
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (183 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 184/622] lustre: llite: Fix style issues for symlink.c James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 186/622] lustre: obdclass: report all obd states for OBD_IOC_GETDEVICE James Simmons
                   ` (437 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Ben Evans <bevans@cray.com>

pct() is defined 6 times in different places.  Define it in one
place.  Also change it to a static inline to do a better job of
enforcing types.
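
The consolidated helper behaves like the sketch below (the kernel's u32/s64 typedefs are replaced with stdint equivalents for a standalone build):

```c
#include <assert.h>
#include <stdint.h>

/* Single definition, as in the patched lprocfs_status.h: the inline
 * fixes the operand and result types, which the old per-file
 * pct(a, b) macros did not. */
static inline uint32_t pct(int64_t a, int64_t b)
{
	return b ? a * 100 / b : 0;
}
```

Division by zero is handled by returning 0, matching the old macros, and the integer division truncates (1 of 3 reports as 33%).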

WC-bug-id: https://jira.whamcloud.com/browse/LU-10171
Lustre-commit: 9b924e86b27d ("LU-10171 headers: define pct(a,b) once")
Signed-off-by: Ben Evans <bevans@cray.com>
Reviewed-on: https://review.whamcloud.com/29852
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/fld/fld_cache.c          | 12 ++----------
 fs/lustre/include/lprocfs_status.h |  5 +++++
 fs/lustre/llite/lproc_llite.c      |  7 +++----
 fs/lustre/obdclass/genops.c        |  6 +-----
 fs/lustre/osc/lproc_osc.c          | 10 +++-------
 5 files changed, 14 insertions(+), 26 deletions(-)

diff --git a/fs/lustre/fld/fld_cache.c b/fs/lustre/fld/fld_cache.c
index d289c29..96be544 100644
--- a/fs/lustre/fld/fld_cache.c
+++ b/fs/lustre/fld/fld_cache.c
@@ -94,22 +94,14 @@ struct fld_cache *fld_cache_init(const char *name,
  */
 void fld_cache_fini(struct fld_cache *cache)
 {
-	u64 pct;
-
 	LASSERT(cache);
 	fld_cache_flush(cache);
 
-	if (cache->fci_stat.fst_count > 0) {
-		pct = cache->fci_stat.fst_cache * 100;
-		do_div(pct, cache->fci_stat.fst_count);
-	} else {
-		pct = 0;
-	}
-
 	CDEBUG(D_INFO, "FLD cache statistics (%s):\n", cache->fci_name);
 	CDEBUG(D_INFO, "  Total reqs: %llu\n", cache->fci_stat.fst_count);
 	CDEBUG(D_INFO, "  Cache reqs: %llu\n", cache->fci_stat.fst_cache);
-	CDEBUG(D_INFO, "  Cache hits: %llu%%\n", pct);
+	CDEBUG(D_INFO, "  Cache hits: %u%%\n",
+	       pct(cache->fci_stat.fst_cache, cache->fci_stat.fst_count));
 
 	kfree(cache);
 }
diff --git a/fs/lustre/include/lprocfs_status.h b/fs/lustre/include/lprocfs_status.h
index 32d43fb..1ef548ae 100644
--- a/fs/lustre/include/lprocfs_status.h
+++ b/fs/lustre/include/lprocfs_status.h
@@ -58,6 +58,11 @@ struct lprocfs_vars {
 	umode_t				proc_mode;
 };
 
+static inline u32 pct(s64 a, s64 b)
+{
+	return b ? a * 100 / b : 0;
+}
+
 struct lprocfs_static_vars {
 	struct lprocfs_vars		*obd_vars;
 	const struct attribute_group	*sysfs_vars;
diff --git a/fs/lustre/llite/lproc_llite.c b/fs/lustre/llite/lproc_llite.c
index 5fc7705..4060271 100644
--- a/fs/lustre/llite/lproc_llite.c
+++ b/fs/lustre/llite/lproc_llite.c
@@ -1483,8 +1483,6 @@ void ll_debugfs_unregister_super(struct super_block *sb)
 	lprocfs_free_stats(&sbi->ll_stats);
 }
 
-#define pct(a, b) (b ? a * 100 / b : 0)
-
 static void ll_display_extents_info(struct ll_rw_extents_info *io_extents,
 				    struct seq_file *seq, int which)
 {
@@ -1508,8 +1506,9 @@ static void ll_display_extents_info(struct ll_rw_extents_info *io_extents,
 		w = pp_info->pp_w_hist.oh_buckets[i];
 		read_cum += r;
 		write_cum += w;
-		end = 1 << (i + LL_HIST_START - units);
-		seq_printf(seq, "%4lu%c - %4lu%c%c: %14lu %4lu %4lu  | %14lu %4lu %4lu\n",
+		end = BIT(i + LL_HIST_START - units);
+		seq_printf(seq,
+			   "%4lu%c - %4lu%c%c: %14lu %4u %4u  | %14lu %4u %4u\n",
 			   start, *unitp, end, *unitp,
 			   (i == LL_HIST_MAX - 1) ? '+' : ' ',
 			   r, pct(r, read_tot), pct(read_cum, read_tot),
diff --git a/fs/lustre/obdclass/genops.c b/fs/lustre/obdclass/genops.c
index 2254943..fd9dd96 100644
--- a/fs/lustre/obdclass/genops.c
+++ b/fs/lustre/obdclass/genops.c
@@ -1425,8 +1425,6 @@ int obd_set_max_mod_rpcs_in_flight(struct client_obd *cli, u16 max)
 }
 EXPORT_SYMBOL(obd_set_max_mod_rpcs_in_flight);
 
-#define pct(a, b) (b ? (a * 100) / b : 0)
-
 int obd_mod_rpc_stats_seq_show(struct client_obd *cli, struct seq_file *seq)
 {
 	unsigned long mod_tot = 0, mod_cum;
@@ -1452,7 +1450,7 @@ int obd_mod_rpc_stats_seq_show(struct client_obd *cli, struct seq_file *seq)
 		unsigned long mod = cli->cl_mod_rpcs_hist.oh_buckets[i];
 
 		mod_cum += mod;
-		seq_printf(seq, "%d:\t\t%10lu %3lu %3lu\n",
+		seq_printf(seq, "%d:\t\t%10lu %3u %3u\n",
 			   i, mod, pct(mod, mod_tot),
 			   pct(mod_cum, mod_tot));
 		if (mod_cum == mod_tot)
@@ -1464,8 +1462,6 @@ int obd_mod_rpc_stats_seq_show(struct client_obd *cli, struct seq_file *seq)
 	return 0;
 }
 EXPORT_SYMBOL(obd_mod_rpc_stats_seq_show);
-#undef pct
-
 /*
  * The number of modify RPCs sent in parallel is limited
  * because the server has a finite number of slots per client to
diff --git a/fs/lustre/osc/lproc_osc.c b/fs/lustre/osc/lproc_osc.c
index d9030b7..ac64724 100644
--- a/fs/lustre/osc/lproc_osc.c
+++ b/fs/lustre/osc/lproc_osc.c
@@ -780,8 +780,6 @@ static ssize_t grant_shrink_store(struct kobject *kobj, struct attribute *attr,
 	{ NULL }
 };
 
-#define pct(a, b) (b ? a * 100 / b : 0)
-
 static int osc_rpc_stats_seq_show(struct seq_file *seq, void *v)
 {
 	struct timespec64 now;
@@ -820,7 +818,7 @@ static int osc_rpc_stats_seq_show(struct seq_file *seq, void *v)
 
 		read_cum += r;
 		write_cum += w;
-		seq_printf(seq, "%d:\t\t%10lu %3lu %3lu   | %10lu %3lu %3lu\n",
+		seq_printf(seq, "%d:\t\t%10lu %3u %3u   | %10lu %3u %3u\n",
 			   1 << i, r, pct(r, read_tot),
 			   pct(read_cum, read_tot), w,
 			   pct(w, write_tot),
@@ -844,7 +842,7 @@ static int osc_rpc_stats_seq_show(struct seq_file *seq, void *v)
 
 		read_cum += r;
 		write_cum += w;
-		seq_printf(seq, "%d:\t\t%10lu %3lu %3lu   | %10lu %3lu %3lu\n",
+		seq_printf(seq, "%d:\t\t%10lu %3u %3u   | %10lu %3u %3u\n",
 			   i, r, pct(r, read_tot),
 			   pct(read_cum, read_tot), w,
 			   pct(w, write_tot),
@@ -868,7 +866,7 @@ static int osc_rpc_stats_seq_show(struct seq_file *seq, void *v)
 
 		read_cum += r;
 		write_cum += w;
-		seq_printf(seq, "%d:\t\t%10lu %3lu %3lu   | %10lu %3lu %3lu\n",
+		seq_printf(seq, "%d:\t\t%10lu %3u %3u   | %10lu %3u %3u\n",
 			   (i == 0) ? 0 : 1 << (i - 1),
 			   r, pct(r, read_tot), pct(read_cum, read_tot),
 			   w, pct(w, write_tot), pct(write_cum, write_tot));
@@ -881,8 +879,6 @@ static int osc_rpc_stats_seq_show(struct seq_file *seq, void *v)
 	return 0;
 }
 
-#undef pct
-
 static ssize_t osc_rpc_stats_seq_write(struct file *file,
 				       const char __user *buf,
 				       size_t len, loff_t *off)
-- 
1.8.3.1

* [lustre-devel] [PATCH 186/622] lustre: obdclass: report all obd states for OBD_IOC_GETDEVICE
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (184 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 185/622] lustre: headers: define pct(a, b) once James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 187/622] lustre: ldlm: remove trace from ldlm_pool_count() James Simmons
                   ` (436 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

The wrong state '--' is reported when the obd device is
inactive. Reporting the "IN" state covers all the information
that is provided by the 'devices' debugfs file. Now all the
information from 'devices' can be collected from the lustre
sysfs tree.
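
A sketch of the resulting status selection in class_handle_ioctl(); only the "ST"/"IN"/"UP" branches appear in the hunk below, and the "AT" and "--" fallbacks here are assumptions for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Illustrative ordering: stopping wins over inactive, which wins
 * over the set-up and attached states. */
static const char *obd_status(bool stopping, bool inactive,
			      bool set_up, bool attached)
{
	if (stopping)
		return "ST";
	else if (inactive)
		return "IN";
	else if (set_up)
		return "UP";
	else if (attached)
		return "AT";	/* assumed fallback */
	return "--";		/* assumed fallback */
}
```

Before the patch an inactive but set-up device fell through to "UP" (or reported no useful state at all); with the new branch it reports "IN".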

WC-bug-id: https://jira.whamcloud.com/browse/LU-8066
Lustre-commit: adfec49f334d ("LU-8066 obdclass: report all obd states for OBD_IOC_GETDEVICE")
Signed-off-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-on: https://review.whamcloud.com/33774
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/class_obd.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/lustre/obdclass/class_obd.c b/fs/lustre/obdclass/class_obd.c
index 4ef9cca..0435f62 100644
--- a/fs/lustre/obdclass/class_obd.c
+++ b/fs/lustre/obdclass/class_obd.c
@@ -427,6 +427,8 @@ int class_handle_ioctl(unsigned int cmd, unsigned long arg)
 
 		if (obd->obd_stopping)
 			status = "ST";
+		else if (obd->obd_inactive)
+			status = "IN";
 		else if (obd->obd_set_up)
 			status = "UP";
 		else if (obd->obd_attached)
-- 
1.8.3.1


* [lustre-devel] [PATCH 187/622] lustre: ldlm: remove trace from ldlm_pool_count()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (185 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 186/622] lustre: obdclass: report all obd states for OBD_IOC_GETDEVICE James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 188/622] lustre: ptlrpc: clean up rq_interpret_reply callbacks James Simmons
                   ` (435 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: "John L. Hammond" <jhammond@whamcloud.com>

The trace message in ldlm_pool_count() is too noisy given its
low information value, so remove it.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10862
Lustre-commit: 3fe2096dfc30 ("LU-10862 ldlm: remove trace from ldlm_pool_{count,skrink}()")
Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/31820
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_pool.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/lustre/ldlm/ldlm_pool.c b/fs/lustre/ldlm/ldlm_pool.c
index d2149a6..b2b3ead 100644
--- a/fs/lustre/ldlm/ldlm_pool.c
+++ b/fs/lustre/ldlm/ldlm_pool.c
@@ -773,9 +773,6 @@ static unsigned long ldlm_pools_count(enum ldlm_side client, gfp_t gfp_mask)
 	if (client == LDLM_NAMESPACE_CLIENT && !(gfp_mask & __GFP_FS))
 		return 0;
 
-	CDEBUG(D_DLMTRACE, "Request to count %s locks from all pools\n",
-	       client == LDLM_NAMESPACE_CLIENT ? "client" : "server");
-
 	/*
 	 * Find out how many resources we may release.
 	 */
-- 
1.8.3.1


* [lustre-devel] [PATCH 188/622] lustre: ptlrpc: clean up rq_interpret_reply callbacks
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (186 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 187/622] lustre: ldlm: remove trace from ldlm_pool_count() James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 189/622] lustre: lov: quiet lov_dump_lmm_ console messages James Simmons
                   ` (434 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Clean up the function prototypes of several rq_interpret_reply
callback functions to match the function pointer type, instead of
typecasting, to avoid the risk of calling through a mismatched
function pointer.

Clean up related code to match code style.
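
The pattern can be sketched outside the Lustre tree as follows. This is an
illustrative example only, not the ptlrpc code itself: the struct and
function names (request, async_args, interpret_reply, run_request) are
invented, but the shape mirrors the ptlrpc_interpterer_t convention the
patch converts to, where the callback takes an opaque void *args and
recovers its real type locally, so no assignment-time cast is needed and
the compiler can check the pointer type:

```c
#include <assert.h>
#include <stddef.h>

struct request { int status; };

/* callback type with an opaque args pointer, like ptlrpc_interpterer_t */
typedef int (*interpret_cb)(struct request *req, void *args, int rc);

struct async_args { int handle; };

/* Matches the typedef exactly: assigning it needs no cast, and the
 * callee recovers the concrete type with a local conversion.
 */
static int interpret_reply(struct request *req, void *args, int rc)
{
	struct async_args *aa = args;	/* recover the real type here */

	if (rc != 0)
		return rc;
	return aa->handle + req->status;
}

static int run_request(struct request *req, interpret_cb cb, void *args)
{
	/* with the old casted prototypes, a wrong signature would have
	 * compiled silently; here a mismatch is a compile error
	 */
	return cb(req, args, 0);
}
```

Calling through a pointer cast to an incompatible function type is
undefined behavior in C, which is why converting the prototypes is safer
than sprinkling (ptlrpc_interpterer_t) casts at every assignment.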

WC-bug-id: https://jira.whamcloud.com/browse/LU-11398
Lustre-commit: 4014ddbb2350 ("LU-11398 ptlrpc: clean up rq_interpret_reply callbacks")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33203
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_request.c | 11 +++++----
 fs/lustre/mdc/mdc_dev.c       | 10 ++++----
 fs/lustre/osc/osc_io.c        |  4 ++--
 fs/lustre/osc/osc_request.c   | 53 +++++++++++++++++++++++--------------------
 fs/lustre/ptlrpc/client.c     | 13 ++++-------
 fs/lustre/ptlrpc/import.c     |  9 ++++----
 6 files changed, 50 insertions(+), 50 deletions(-)

diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index 1afe9a5..b9e9ae9 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -826,8 +826,9 @@ int ldlm_cli_enqueue(struct obd_export *exp, struct ptlrpc_request **reqp,
  */
 static int lock_convert_interpret(const struct lu_env *env,
 				  struct ptlrpc_request *req,
-				  struct ldlm_async_args *aa, int rc)
+				  void *args, int rc)
 {
+	struct ldlm_async_args *aa = args;
 	struct ldlm_lock *lock;
 	struct ldlm_reply *reply;
 
@@ -1010,7 +1011,7 @@ int ldlm_cli_convert(struct ldlm_lock *lock, u32 *flags)
 
 	aa = ptlrpc_req_async_args(aa, req);
 	ldlm_lock2handle(lock, &aa->lock_handle);
-	req->rq_interpret_reply = (ptlrpc_interpterer_t)lock_convert_interpret;
+	req->rq_interpret_reply = lock_convert_interpret;
 
 	ptlrpcd_add_req(req);
 	return 0;
@@ -2117,9 +2118,9 @@ static int ldlm_chain_lock_for_replay(struct ldlm_lock *lock, void *closure)
 }
 
 static int replay_lock_interpret(const struct lu_env *env,
-				 struct ptlrpc_request *req,
-				 struct ldlm_async_args *aa, int rc)
+				 struct ptlrpc_request *req, void *args, int rc)
 {
+	struct ldlm_async_args *aa = args;
 	struct ldlm_lock *lock;
 	struct ldlm_reply *reply;
 	struct obd_export *exp;
@@ -2234,7 +2235,7 @@ static int replay_one_lock(struct obd_import *imp, struct ldlm_lock *lock)
 	atomic_inc(&req->rq_import->imp_replay_inflight);
 	aa = ptlrpc_req_async_args(aa, req);
 	aa->lock_handle = body->lock_handle[0];
-	req->rq_interpret_reply = (ptlrpc_interpterer_t)replay_lock_interpret;
+	req->rq_interpret_reply = replay_lock_interpret;
 	ptlrpcd_add_req(req);
 
 	return 0;
diff --git a/fs/lustre/mdc/mdc_dev.c b/fs/lustre/mdc/mdc_dev.c
index 21dc83e..306b917 100644
--- a/fs/lustre/mdc/mdc_dev.c
+++ b/fs/lustre/mdc/mdc_dev.c
@@ -602,8 +602,9 @@ int mdc_enqueue_fini(struct ptlrpc_request *req, osc_enqueue_upcall_f upcall,
 }
 
 int mdc_enqueue_interpret(const struct lu_env *env, struct ptlrpc_request *req,
-			  struct osc_enqueue_args *aa, int rc)
+			  void *args, int rc)
 {
+	struct osc_enqueue_args *aa = args;
 	struct ldlm_lock *lock;
 	struct lustre_handle *lockh = &aa->oa_lockh;
 	enum ldlm_mode mode = aa->oa_mode;
@@ -745,8 +746,7 @@ int mdc_enqueue_send(const struct lu_env *env, struct obd_export *exp,
 			aa->oa_flags = flags;
 			aa->oa_lvb = lvb;
 
-			req->rq_interpret_reply =
-				(ptlrpc_interpterer_t)mdc_enqueue_interpret;
+			req->rq_interpret_reply = mdc_enqueue_interpret;
 			ptlrpcd_add_req(req);
 		} else {
 			ptlrpc_req_finished(req);
@@ -1121,9 +1121,9 @@ struct mdc_data_version_args {
 
 static int
 mdc_data_version_interpret(const struct lu_env *env, struct ptlrpc_request *req,
-			   void *arg, int rc)
+			   void *args, int rc)
 {
-	struct mdc_data_version_args *dva = arg;
+	struct mdc_data_version_args *dva = args;
 	struct osc_io *oio = dva->dva_oio;
 	const struct mdt_body *body;
 
diff --git a/fs/lustre/osc/osc_io.c b/fs/lustre/osc/osc_io.c
index 56f30cb..76657f3 100644
--- a/fs/lustre/osc/osc_io.c
+++ b/fs/lustre/osc/osc_io.c
@@ -656,9 +656,9 @@ struct osc_data_version_args {
 
 static int
 osc_data_version_interpret(const struct lu_env *env, struct ptlrpc_request *req,
-			   void *arg, int rc)
+			   void *args, int rc)
 {
-	struct osc_data_version_args *dva = arg;
+	struct osc_data_version_args *dva = args;
 	struct osc_io *oio = dva->dva_oio;
 	const struct ost_body *body;
 
diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index 1fc7a57..ba84bd1 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -188,9 +188,9 @@ static int osc_setattr(const struct lu_env *env, struct obd_export *exp,
 }
 
 static int osc_setattr_interpret(const struct lu_env *env,
-				 struct ptlrpc_request *req,
-				 struct osc_setattr_args *sa, int rc)
+				 struct ptlrpc_request *req, void *args, int rc)
 {
+	struct osc_setattr_args *sa = args;
 	struct ost_body *body;
 
 	if (rc != 0)
@@ -236,8 +236,7 @@ int osc_setattr_async(struct obd_export *exp, struct obdo *oa,
 		/* Do not wait for response. */
 		ptlrpcd_add_req(req);
 	} else {
-		req->rq_interpret_reply =
-			(ptlrpc_interpterer_t)osc_setattr_interpret;
+		req->rq_interpret_reply = osc_setattr_interpret;
 
 		sa = ptlrpc_req_async_args(sa, req);
 		sa->sa_oa = oa;
@@ -417,7 +416,7 @@ int osc_punch_send(struct obd_export *exp, struct obdo *oa,
 
 	ptlrpc_request_set_replen(req);
 
-	req->rq_interpret_reply = (ptlrpc_interpterer_t)osc_setattr_interpret;
+	req->rq_interpret_reply = osc_setattr_interpret;
 	sa = ptlrpc_req_async_args(sa, req);
 	sa->sa_oa = oa;
 	sa->sa_upcall = upcall;
@@ -545,13 +544,13 @@ static int osc_resource_get_unused(struct obd_export *exp, struct obdo *oa,
 }
 
 static int osc_destroy_interpret(const struct lu_env *env,
-				 struct ptlrpc_request *req, void *data,
-				 int rc)
+				 struct ptlrpc_request *req, void *args, int rc)
 {
 	struct client_obd *cli = &req->rq_import->imp_obd->u.cli;
 
 	atomic_dec(&cli->cl_destroy_in_flight);
 	wake_up(&cli->cl_destroy_waitq);
+
 	return 0;
 }
 
@@ -734,14 +733,14 @@ struct grant_thread_data {
 
 static int osc_shrink_grant_interpret(const struct lu_env *env,
 				      struct ptlrpc_request *req,
-				      void *aa, int rc)
+				      void *args, int rc)
 {
+	struct osc_brw_async_args *aa = args;
 	struct client_obd *cli = &req->rq_import->imp_obd->u.cli;
-	struct obdo *oa = ((struct osc_brw_async_args *)aa)->aa_oa;
 	struct ost_body *body;
 
 	if (rc != 0) {
-		__osc_update_grant(cli, oa->o_grant);
+		__osc_update_grant(cli, aa->aa_oa->o_grant);
 		goto out;
 	}
 
@@ -749,7 +748,8 @@ static int osc_shrink_grant_interpret(const struct lu_env *env,
 	LASSERT(body);
 	osc_update_grant(cli, body);
 out:
-	kmem_cache_free(osc_obdo_kmem, oa);
+	kmem_cache_free(osc_obdo_kmem, aa->aa_oa);
+
 	return rc;
 }
 
@@ -1951,7 +1951,8 @@ static int osc_brw_redo_request(struct ptlrpc_request *request,
 				 request, oap->oap_request);
 		}
 	}
-	/* New request takes over pga and oaps from old request.
+	/*
+	 * New request takes over pga and oaps from old request.
 	 * Note that copying a list_head doesn't work, need to move it...
 	 */
 	aa->aa_resends++;
@@ -2034,9 +2035,9 @@ static void osc_release_ppga(struct brw_page **ppga, u32 count)
 }
 
 static int brw_interpret(const struct lu_env *env,
-			 struct ptlrpc_request *req, void *data, int rc)
+			 struct ptlrpc_request *req, void *args, int rc)
 {
-	struct osc_brw_async_args *aa = data;
+	struct osc_brw_async_args *aa = args;
 	struct osc_extent *ext;
 	struct osc_extent *tmp;
 	struct client_obd *cli = aa->aa_cli;
@@ -2044,7 +2045,8 @@ static int brw_interpret(const struct lu_env *env,
 
 	rc = osc_brw_fini_request(req, rc);
 	CDEBUG(D_INODE, "request %p aa %p rc %d\n", req, aa, rc);
-	/* When server return -EINPROGRESS, client should always retry
+	/*
+	 * When server returns -EINPROGRESS, client should always retry
 	 * regardless of the number of times the bulk was resent already.
 	 */
 	if (osc_recoverable_error(rc) && !req->rq_no_delay) {
@@ -2425,8 +2427,9 @@ int osc_enqueue_fini(struct ptlrpc_request *req, osc_enqueue_upcall_f upcall,
 }
 
 int osc_enqueue_interpret(const struct lu_env *env, struct ptlrpc_request *req,
-			  struct osc_enqueue_args *aa, int rc)
+			  void *args, int rc)
 {
+	struct osc_enqueue_args *aa = args;
 	struct ldlm_lock *lock;
 	struct lustre_handle *lockh = &aa->oa_lockh;
 	enum ldlm_mode mode = aa->oa_mode;
@@ -2627,8 +2630,7 @@ int osc_enqueue_base(struct obd_export *exp, struct ldlm_res_id *res_id,
 				aa->oa_flags = NULL;
 			}
 
-			req->rq_interpret_reply =
-				(ptlrpc_interpterer_t)osc_enqueue_interpret;
+			req->rq_interpret_reply = osc_enqueue_interpret;
 			ptlrpc_set_add_req(rqset, req);
 		} else if (intent) {
 			ptlrpc_req_finished(req);
@@ -2690,16 +2692,16 @@ int osc_match_base(struct obd_export *exp, struct ldlm_res_id *res_id,
 
 static int osc_statfs_interpret(const struct lu_env *env,
 				struct ptlrpc_request *req,
-				struct osc_async_args *aa, int rc)
+				void *args, int rc)
 {
+	struct osc_async_args *aa = args;
 	struct obd_statfs *msfs;
 
 	if (rc == -EBADR)
-		/* The request has in fact never been sent
-		 * due to issues at a higher level (LOV).
-		 * Exit immediately since the caller is
-		 * aware of the problem and takes care
-		 * of the clean up
+		/* The request has in fact never been sent due to
+		 * issues at a higher level (LOV).  Exit immediately
+		 * since the caller is aware of the problem and takes
+		 * care of the clean up
 		 */
 		return rc;
 
@@ -2721,6 +2723,7 @@ static int osc_statfs_interpret(const struct lu_env *env,
 	*aa->aa_oi->oi_osfs = *msfs;
 out:
 	rc = aa->aa_oi->oi_cb_up(aa->aa_oi, rc);
+
 	return rc;
 }
 
@@ -2759,7 +2762,7 @@ static int osc_statfs_async(struct obd_export *exp,
 		req->rq_no_delay = 1;
 	}
 
-	req->rq_interpret_reply = (ptlrpc_interpterer_t)osc_statfs_interpret;
+	req->rq_interpret_reply = osc_statfs_interpret;
 	aa = ptlrpc_req_async_args(aa, req);
 	aa->aa_oi = oinfo;
 
diff --git a/fs/lustre/ptlrpc/client.c b/fs/lustre/ptlrpc/client.c
index fabe675..ff212a3 100644
--- a/fs/lustre/ptlrpc/client.c
+++ b/fs/lustre/ptlrpc/client.c
@@ -2872,9 +2872,9 @@ int ptlrpc_queue_wait(struct ptlrpc_request *req)
  */
 static int ptlrpc_replay_interpret(const struct lu_env *env,
 				   struct ptlrpc_request *req,
-				   void *data, int rc)
+				   void *args, int rc)
 {
-	struct ptlrpc_replay_async_args *aa = data;
+	struct ptlrpc_replay_async_args *aa = args;
 	struct obd_import *imp = req->rq_import;
 
 	atomic_dec(&imp->imp_replay_inflight);
@@ -2993,10 +2993,7 @@ int ptlrpc_replay_req(struct ptlrpc_request *req)
 	/* Re-adjust the timeout for current conditions */
 	ptlrpc_at_set_req_timeout(req);
 
-	/*
-	 * Tell server the net_latency, so the server can calculate how long
-	 * it should wait for next replay
-	 */
+	/* Tell server net_latency to calculate how long to wait for reply. */
 	lustre_msg_set_service_time(req->rq_reqmsg,
 				    ptlrpc_at_get_net_latency(req));
 	DEBUG_REQ(D_HA, req, "REPLAY");
@@ -3252,9 +3249,9 @@ static void ptlrpcd_add_work_req(struct ptlrpc_request *req)
 }
 
 static int work_interpreter(const struct lu_env *env,
-			    struct ptlrpc_request *req, void *data, int rc)
+			    struct ptlrpc_request *req, void *args, int rc)
 {
-	struct ptlrpc_work_async_args *arg = data;
+	struct ptlrpc_work_async_args *arg = args;
 
 	LASSERT(ptlrpcd_check_work(req));
 
diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index f59af80..867aff6 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -104,7 +104,7 @@ static void __import_set_state(struct obd_import *imp,
 
 static int ptlrpc_connect_interpret(const struct lu_env *env,
 				    struct ptlrpc_request *request,
-				    void *data, int rc);
+				    void *args, int rc);
 
 /* Only this function is allowed to change the import state when it is
  * CLOSED. I would rather refcount the import and free it after
@@ -1263,11 +1263,10 @@ static int ptlrpc_connect_interpret(const struct lu_env *env,
  */
 static int completed_replay_interpret(const struct lu_env *env,
 				      struct ptlrpc_request *req,
-				      void *data, int rc)
+				      void *args, int rc)
 {
 	atomic_dec(&req->rq_import->imp_replay_inflight);
-	if (req->rq_status == 0 &&
-	    !req->rq_import->imp_vbr_failed) {
+	if (req->rq_status == 0 && !req->rq_import->imp_vbr_failed) {
 		ptlrpc_import_recovery_state_machine(req->rq_import);
 	} else {
 		if (req->rq_import->imp_vbr_failed) {
@@ -1590,7 +1589,7 @@ int ptlrpc_disconnect_import(struct obd_import *imp, int noclose)
 
 static int ptlrpc_disconnect_idle_interpret(const struct lu_env *env,
 					    struct ptlrpc_request *req,
-					    void *data, int rc)
+					    void *args, int rc)
 {
 	struct obd_import *imp = req->rq_import;
 	int connect = 0;
-- 
1.8.3.1


* [lustre-devel] [PATCH 189/622] lustre: lov: quiet lov_dump_lmm_ console messages
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (187 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 188/622] lustre: ptlrpc: clean up rq_interpret_reply callbacks James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 190/622] lustre: lov: cl_cache could miss initialize James Simmons
                   ` (433 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Limit messages in lov_dump_lmm_objects() and lov_dump_lmm_common()
from printing to the console repeatedly when D_ERROR is used.  Change
CDEBUG() to CDEBUG_LIMIT() so that rate limiting is applied.
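
The idea behind a rate-limited debug macro can be sketched as below. This
is a standalone illustration, not the libcfs implementation: the names
(dbg_limit_allow, DBG_BURST, the 60-second window) are invented here, and
the real CDEBUG_LIMIT uses the kernel's own rate-limit state, but the
mechanism is the same: allow a small burst per callsite, then drop
messages until the window rolls over:

```c
#include <assert.h>
#include <time.h>

#define DBG_BURST	3	/* messages allowed per window (assumed) */

struct dbg_limit_state {
	time_t window_start;	/* start of current rate window */
	int emitted;		/* messages printed in this window */
};

/* Returns 1 if this message may be printed, 0 if rate-limited. */
static int dbg_limit_allow(struct dbg_limit_state *st, time_t now)
{
	if (now - st->window_start >= 60) {
		/* new 60-second window: reset the budget */
		st->window_start = now;
		st->emitted = 0;
	}
	if (st->emitted >= DBG_BURST)
		return 0;	/* budget spent, drop the message */
	st->emitted++;
	return 1;
}
```

In the real macro the per-callsite state is static, so each CDEBUG_LIMIT()
site is throttled independently rather than sharing one global budget.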

WC-bug-id: https://jira.whamcloud.com/browse/LU-11579
Lustre-commit: d9ef75eb8226 ("LU-11579 lov: quiet lov_dump_lmm_ console messages")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33513
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/lov/lov_pack.c | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/fs/lustre/lov/lov_pack.c b/fs/lustre/lov/lov_pack.c
index 5f8b281..c6dec2d 100644
--- a/fs/lustre/lov/lov_pack.c
+++ b/fs/lustre/lov/lov_pack.c
@@ -55,13 +55,13 @@ void lov_dump_lmm_common(int level, void *lmmp)
 	struct ost_id oi;
 
 	lmm_oi_le_to_cpu(&oi, &lmm->lmm_oi);
-	CDEBUG(level, "objid " DOSTID ", magic 0x%08x, pattern %#x\n",
-	       POSTID(&oi), le32_to_cpu(lmm->lmm_magic),
-	       le32_to_cpu(lmm->lmm_pattern));
-	CDEBUG(level, "stripe_size %u, stripe_count %u, layout_gen %u\n",
-	       le32_to_cpu(lmm->lmm_stripe_size),
-	       le16_to_cpu(lmm->lmm_stripe_count),
-	       le16_to_cpu(lmm->lmm_layout_gen));
+	CDEBUG_LIMIT(level, "objid " DOSTID ", magic 0x%08x, pattern %#x\n",
+		     POSTID(&oi), le32_to_cpu(lmm->lmm_magic),
+		     le32_to_cpu(lmm->lmm_pattern));
+	CDEBUG_LIMIT(level, "stripe_size %u, stripe_count %u, layout_gen %u\n",
+		     le32_to_cpu(lmm->lmm_stripe_size),
+		     le16_to_cpu(lmm->lmm_stripe_count),
+		     le16_to_cpu(lmm->lmm_layout_gen));
 }
 
 static void lov_dump_lmm_objects(int level, struct lov_ost_data *lod,
@@ -70,8 +70,9 @@ static void lov_dump_lmm_objects(int level, struct lov_ost_data *lod,
 	int i;
 
 	if (stripe_count > LOV_V1_INSANE_STRIPE_COUNT) {
-		CDEBUG(level, "bad stripe_count %u > max_stripe_count %u\n",
-		       stripe_count, LOV_V1_INSANE_STRIPE_COUNT);
+		CDEBUG_LIMIT(level,
+			     "bad stripe_count %u > max_stripe_count %u\n",
+			     stripe_count, LOV_V1_INSANE_STRIPE_COUNT);
 		return;
 	}
 
@@ -79,8 +80,8 @@ static void lov_dump_lmm_objects(int level, struct lov_ost_data *lod,
 		struct ost_id oi;
 
 		ostid_le_to_cpu(&lod->l_ost_oi, &oi);
-		CDEBUG(level, "stripe %u idx %u subobj " DOSTID "\n", i,
-		       le32_to_cpu(lod->l_ost_idx), POSTID(&oi));
+		CDEBUG_LIMIT(level, "stripe %u idx %u subobj " DOSTID "\n", i,
+			     le32_to_cpu(lod->l_ost_idx), POSTID(&oi));
 	}
 }
 
@@ -94,7 +95,8 @@ void lov_dump_lmm_v1(int level, struct lov_mds_md_v1 *lmm)
 void lov_dump_lmm_v3(int level, struct lov_mds_md_v3 *lmm)
 {
 	lov_dump_lmm_common(level, lmm);
-	CDEBUG(level, "pool_name " LOV_POOLNAMEF "\n", lmm->lmm_pool_name);
+	CDEBUG_LIMIT(level, "pool_name " LOV_POOLNAMEF "\n",
+		     lmm->lmm_pool_name);
 	lov_dump_lmm_objects(level, lmm->lmm_objects,
 			     le16_to_cpu(lmm->lmm_stripe_count));
 }
-- 
1.8.3.1


* [lustre-devel] [PATCH 190/622] lustre: lov: cl_cache could miss initialize
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (188 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 189/622] lustre: lov: quiet lov_dump_lmm_ console messages James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:10 ` [lustre-devel] [PATCH 191/622] lnet: socklnd: improve scheduling algorithm James Simmons
                   ` (432 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Yang Sheng <ys@whamcloud.com>

The cl_cache may miss initialization when we mount
a client with a deactivated osc and then activate it.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11658
Lustre-commit: 42e83c44eb5a ("LU-11658 lov: cl_cache could miss initialize")
Signed-off-by: Yang Sheng <ys@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33650
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/lov/lov_obd.c | 46 +++++++++++++++++++++++++++++-----------------
 1 file changed, 29 insertions(+), 17 deletions(-)

diff --git a/fs/lustre/lov/lov_obd.c b/fs/lustre/lov/lov_obd.c
index a16c663..08d7edc 100644
--- a/fs/lustre/lov/lov_obd.c
+++ b/fs/lustre/lov/lov_obd.c
@@ -360,23 +360,6 @@ static int lov_set_osc_active(struct obd_device *obd, struct obd_uuid *uuid,
 		tgt = lov->lov_tgts[index];
 		if (!tgt)
 			continue;
-		/*
-		 * LU-642, initially inactive OSC could miss the obd_connect,
-		 * we make up for it here.
-		 */
-		if (ev == OBD_NOTIFY_ACTIVATE && !tgt->ltd_exp &&
-		    obd_uuid_equals(uuid, &tgt->ltd_uuid)) {
-			struct obd_uuid lov_osc_uuid = {"LOV_OSC_UUID"};
-
-			obd_connect(NULL, &tgt->ltd_exp, tgt->ltd_obd,
-				    &lov_osc_uuid, &lov->lov_ocd, NULL);
-		}
-		if (!tgt->ltd_exp)
-			continue;
-
-		CDEBUG(D_INFO, "lov idx %d is %s conn %#llx\n",
-		       index, obd_uuid2str(&tgt->ltd_uuid),
-		       tgt->ltd_exp->exp_handle.h_cookie);
 		if (obd_uuid_equals(uuid, &tgt->ltd_uuid))
 			break;
 	}
@@ -389,6 +372,31 @@ static int lov_set_osc_active(struct obd_device *obd, struct obd_uuid *uuid,
 	if (ev == OBD_NOTIFY_DEACTIVATE || ev == OBD_NOTIFY_ACTIVATE) {
 		activate = (ev == OBD_NOTIFY_ACTIVATE) ? 1 : 0;
 
+		/*
+		 * LU-642, initially inactive OSC could miss the obd_connect,
+		 * we make up for it here.
+		 */
+		if (activate && !tgt->ltd_exp) {
+			int rc;
+			struct obd_uuid lov_osc_uuid = {"LOV_OSC_UUID"};
+
+			rc = obd_connect(NULL, &tgt->ltd_exp, tgt->ltd_obd,
+					 &lov_osc_uuid, &lov->lov_ocd, NULL);
+			if (rc || !tgt->ltd_exp) {
+				index = rc;
+				goto out;
+			}
+			rc = obd_set_info_async(NULL, tgt->ltd_exp,
+						sizeof(KEY_CACHE_SET),
+						KEY_CACHE_SET,
+						sizeof(struct cl_client_cache),
+						lov->lov_cache, NULL);
+			if (rc < 0) {
+				index = rc;
+				goto out;
+			}
+		}
+
 		if (lov->lov_tgts[index]->ltd_activate == activate) {
 			CDEBUG(D_INFO, "OSC %s already %sactivate!\n",
 			       uuid->uuid, activate ? "" : "de");
@@ -421,6 +429,10 @@ static int lov_set_osc_active(struct obd_device *obd, struct obd_uuid *uuid,
 		CERROR("Unknown event(%d) for uuid %s", ev, uuid->uuid);
 	}
 
+	if (tgt->ltd_exp)
+		CDEBUG(D_INFO, "%s: lov idx %d conn %llx\n", obd_uuid2str(uuid),
+		       index, tgt->ltd_exp->exp_handle.h_cookie);
+
 out:
 	lov_tgts_putref(obd);
 	return index;
-- 
1.8.3.1


* [lustre-devel] [PATCH 191/622] lnet: socklnd: improve scheduling algorithm
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (189 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 190/622] lustre: lov: cl_cache could miss initialize James Simmons
@ 2020-02-27 21:10 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 192/622] lustre: ldlm: Adjust search_* functions James Simmons
                   ` (431 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:10 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Modify the scheduling algorithm to use all available scheduler
threads. Previously a connection was assigned a single thread
and could only use that one. With this patch any scheduler thread
available on the assigned CPT can pick up and work on requests
queued on the connection.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11415
Lustre-commit: 89df5e712ffd ("LU-11415 socklnd: improve scheduling algorithm")
Reviewed-on: https://review.whamcloud.com/33740
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/socklnd/socklnd.c    | 156 +++++++++++++-----------------------
 net/lnet/klnds/socklnd/socklnd.h    |  18 ++---
 net/lnet/klnds/socklnd/socklnd_cb.c |   8 +-
 3 files changed, 65 insertions(+), 117 deletions(-)

diff --git a/net/lnet/klnds/socklnd/socklnd.c b/net/lnet/klnds/socklnd/socklnd.c
index ba5623a..8b283ac 100644
--- a/net/lnet/klnds/socklnd/socklnd.c
+++ b/net/lnet/klnds/socklnd/socklnd.c
@@ -648,34 +648,21 @@ struct ksock_peer *
 static struct ksock_sched *
 ksocknal_choose_scheduler_locked(unsigned int cpt)
 {
-	struct ksock_sched_info	*info = ksocknal_data.ksnd_sched_info[cpt];
-	struct ksock_sched *sched;
+	struct ksock_sched *sched = ksocknal_data.ksnd_schedulers[cpt];
 	int i;
 
-	if (info->ksi_nthreads == 0) {
-		cfs_percpt_for_each(info, i, ksocknal_data.ksnd_sched_info) {
-			if (info->ksi_nthreads > 0) {
+	if (sched->kss_nthreads == 0) {
+		cfs_percpt_for_each(sched, i, ksocknal_data.ksnd_schedulers) {
+			if (sched->kss_nthreads > 0) {
 				CDEBUG(D_NET,
 				       "scheduler[%d] has no threads. selected scheduler[%d]\n",
-				       cpt, info->ksi_cpt);
-				goto select_sched;
+				       cpt, sched->kss_cpt);
+				return sched;
 			}
 		}
 		return NULL;
 	}
 
-select_sched:
-	sched = &info->ksi_scheds[0];
-	/*
-	 * NB: it's safe so far, but info->ksi_nthreads could be changed
-	 * at runtime when we have dynamic LNet configuration, then we
-	 * need to take care of this.
-	 */
-	for (i = 1; i < info->ksi_nthreads; i++) {
-		if (sched->kss_nconns > info->ksi_scheds[i].kss_nconns)
-			sched = &info->ksi_scheds[i];
-	}
-
 	return sched;
 }
 
@@ -1276,7 +1263,7 @@ struct ksock_peer *
 	 * The cpt might have changed if we ended up selecting a non cpt
 	 * native scheduler. So use the scheduler's cpt instead.
 	 */
-	cpt = sched->kss_info->ksi_cpt;
+	cpt = sched->kss_cpt;
 	sched->kss_nconns++;
 	conn->ksnc_scheduler = sched;
 
@@ -1316,11 +1303,11 @@ struct ksock_peer *
 	 *    (b) normal I/O on the conn is blocked until I setup and call the
 	 *	socket callbacks.
 	 */
-	CDEBUG(D_NET, "New conn %s p %d.x %pI4h -> %pI4h/%d incarnation:%lld sched[%d:%d]\n",
+	CDEBUG(D_NET,
+	       "New conn %s p %d.x %pI4h -> %pI4h/%d incarnation:%lld sched[%d]\n",
 	       libcfs_id2str(peerid), conn->ksnc_proto->pro_version,
 	       &conn->ksnc_myipaddr, &conn->ksnc_ipaddr,
-	       conn->ksnc_port, incarnation, cpt,
-	       (int)(sched - &sched->kss_info->ksi_scheds[0]));
+	       conn->ksnc_port, incarnation, cpt);
 
 	if (active) {
 		/* additional routes after interface exchange? */
@@ -2209,7 +2196,7 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 		data->ioc_u32[1] = conn->ksnc_port;
 		data->ioc_u32[2] = conn->ksnc_myipaddr;
 		data->ioc_u32[3] = conn->ksnc_type;
-		data->ioc_u32[4] = conn->ksnc_scheduler->kss_info->ksi_cpt;
+		data->ioc_u32[4] = conn->ksnc_scheduler->kss_cpt;
 		data->ioc_u32[5] = rxmem;
 		data->ioc_u32[6] = conn->ksnc_peer->ksnp_id.pid;
 		ksocknal_conn_decref(conn);
@@ -2248,14 +2235,8 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 {
 	LASSERT(!atomic_read(&ksocknal_data.ksnd_nactive_txs));
 
-	if (ksocknal_data.ksnd_sched_info) {
-		struct ksock_sched_info *info;
-		int i;
-
-		cfs_percpt_for_each(info, i, ksocknal_data.ksnd_sched_info)
-			kfree(info->ksi_scheds);
-		cfs_percpt_free(ksocknal_data.ksnd_sched_info);
-	}
+	if (ksocknal_data.ksnd_schedulers)
+		cfs_percpt_free(ksocknal_data.ksnd_schedulers);
 
 	kvfree(ksocknal_data.ksnd_peers);
 
@@ -2282,10 +2263,8 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 static void
 ksocknal_base_shutdown(void)
 {
-	struct ksock_sched_info *info;
 	struct ksock_sched *sched;
 	int i;
-	int j;
 
 	LASSERT(!ksocknal_data.ksnd_nnets);
 
@@ -2305,22 +2284,14 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 		LASSERT(list_empty(&ksocknal_data.ksnd_connd_connreqs));
 		LASSERT(list_empty(&ksocknal_data.ksnd_connd_routes));
 
-		if (ksocknal_data.ksnd_sched_info) {
-			cfs_percpt_for_each(info, i,
-					    ksocknal_data.ksnd_sched_info) {
-				if (!info->ksi_scheds)
-					continue;
+		if (ksocknal_data.ksnd_schedulers) {
+			cfs_percpt_for_each(sched, i,
+					    ksocknal_data.ksnd_schedulers) {
 
-				for (j = 0; j < info->ksi_nthreads_max; j++) {
-					sched = &info->ksi_scheds[j];
-					LASSERT(list_empty(
-						&sched->kss_tx_conns));
-					LASSERT(list_empty(
-						&sched->kss_rx_conns));
-					LASSERT(list_empty(
-						&sched->kss_zombie_noop_txs));
-					LASSERT(!sched->kss_nconns);
-				}
+				LASSERT(list_empty(&sched->kss_tx_conns));
+				LASSERT(list_empty(&sched->kss_rx_conns));
+				LASSERT(list_empty(&sched->kss_zombie_noop_txs));
+				LASSERT(!sched->kss_nconns);
 			}
 		}
 
@@ -2329,17 +2300,10 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 		wake_up_all(&ksocknal_data.ksnd_connd_waitq);
 		wake_up_all(&ksocknal_data.ksnd_reaper_waitq);
 
-		if (ksocknal_data.ksnd_sched_info) {
-			cfs_percpt_for_each(info, i,
-					    ksocknal_data.ksnd_sched_info) {
-				if (!info->ksi_scheds)
-					continue;
-
-				for (j = 0; j < info->ksi_nthreads_max; j++) {
-					sched = &info->ksi_scheds[j];
+		if (ksocknal_data.ksnd_schedulers) {
+			cfs_percpt_for_each(sched, i,
+					    ksocknal_data.ksnd_schedulers)
 					wake_up_all(&sched->kss_waitq);
-				}
-			}
 		}
 
 		i = 4;
@@ -2367,7 +2331,7 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 static int
 ksocknal_base_startup(void)
 {
-	struct ksock_sched_info	*info;
+	struct ksock_sched *sched;
 	int rc;
 	int i;
 
@@ -2409,15 +2373,18 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 	ksocknal_data.ksnd_init = SOCKNAL_INIT_DATA;
 	try_module_get(THIS_MODULE);
 
-	ksocknal_data.ksnd_sched_info = cfs_percpt_alloc(lnet_cpt_table(),
-							 sizeof(*info));
-	if (!ksocknal_data.ksnd_sched_info)
+	/* Create a scheduler block per available CPT */
+	ksocknal_data.ksnd_schedulers = cfs_percpt_alloc(lnet_cpt_table(),
+							 sizeof(*sched));
+	if (!ksocknal_data.ksnd_schedulers)
 		goto failed;
 
-	cfs_percpt_for_each(info, i, ksocknal_data.ksnd_sched_info) {
-		struct ksock_sched *sched;
+	cfs_percpt_for_each(sched, i, ksocknal_data.ksnd_schedulers) {
 		int nthrs;
 
+		/* make sure not to allocate more threads than there are
+		 * cores/CPUs in the CPT
+		 */
 		nthrs = cfs_cpt_weight(lnet_cpt_table(), i);
 		if (*ksocknal_tunables.ksnd_nscheds > 0) {
 			nthrs = min(nthrs, *ksocknal_tunables.ksnd_nscheds);
@@ -2429,27 +2396,14 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 			nthrs = min(max(SOCKNAL_NSCHEDS, nthrs >> 1), nthrs);
 		}
 
-		info->ksi_nthreads_max = nthrs;
-		info->ksi_cpt = i;
-
-		if (nthrs == 0)
-			continue;
-
-		info->ksi_scheds = kzalloc_cpt(info->ksi_nthreads_max * sizeof(*sched),
-					       GFP_NOFS, i);
-		if (!info->ksi_scheds)
-			goto failed;
-
-		for (; nthrs > 0; nthrs--) {
-			sched = &info->ksi_scheds[nthrs - 1];
+		sched->kss_nthreads_max = nthrs;
+		sched->kss_cpt = i;
 
-			sched->kss_info = info;
-			spin_lock_init(&sched->kss_lock);
-			INIT_LIST_HEAD(&sched->kss_rx_conns);
-			INIT_LIST_HEAD(&sched->kss_tx_conns);
-			INIT_LIST_HEAD(&sched->kss_zombie_noop_txs);
-			init_waitqueue_head(&sched->kss_waitq);
-		}
+		spin_lock_init(&sched->kss_lock);
+		INIT_LIST_HEAD(&sched->kss_rx_conns);
+		INIT_LIST_HEAD(&sched->kss_tx_conns);
+		INIT_LIST_HEAD(&sched->kss_zombie_noop_txs);
+		init_waitqueue_head(&sched->kss_waitq);
 	}
 
 	ksocknal_data.ksnd_connd_starting = 0;
@@ -2646,37 +2600,35 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 }
 
 static int
-ksocknal_start_schedulers(struct ksock_sched_info *info)
+ksocknal_start_schedulers(struct ksock_sched *sched)
 {
 	int nthrs;
 	int rc = 0;
 	int i;
 
-	if (!info->ksi_nthreads) {
+	if (sched->kss_nthreads == 0) {
 		if (*ksocknal_tunables.ksnd_nscheds > 0) {
-			nthrs = info->ksi_nthreads_max;
+			nthrs = sched->kss_nthreads_max;
 		} else {
 			nthrs = cfs_cpt_weight(lnet_cpt_table(),
-					       info->ksi_cpt);
+					       sched->kss_cpt);
 			nthrs = min(max(SOCKNAL_NSCHEDS, nthrs >> 1), nthrs);
 			nthrs = min(SOCKNAL_NSCHEDS_HIGH, nthrs);
 		}
-		nthrs = min(nthrs, info->ksi_nthreads_max);
+		nthrs = min(nthrs, sched->kss_nthreads_max);
 	} else {
-		LASSERT(info->ksi_nthreads <= info->ksi_nthreads_max);
+		LASSERT(sched->kss_nthreads <= sched->kss_nthreads_max);
 		/* increase two threads if there is new interface */
-		nthrs = min(2, info->ksi_nthreads_max - info->ksi_nthreads);
+		nthrs = min(2, sched->kss_nthreads_max - sched->kss_nthreads);
 	}
 
 	for (i = 0; i < nthrs; i++) {
 		long id;
 		char name[20];
-		struct ksock_sched *sched;
 
-		id = KSOCK_THREAD_ID(info->ksi_cpt, info->ksi_nthreads + i);
-		sched = &info->ksi_scheds[KSOCK_THREAD_SID(id)];
+		id = KSOCK_THREAD_ID(sched->kss_cpt, sched->kss_nthreads + i);
 		snprintf(name, sizeof(name), "socknal_sd%02d_%02d",
-			 info->ksi_cpt, (int)(sched - &info->ksi_scheds[0]));
+			 sched->kss_cpt, (int)KSOCK_THREAD_SID(id));
 
 		rc = ksocknal_thread_start(ksocknal_scheduler,
 					   (void *)id, name);
@@ -2684,11 +2636,11 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 			continue;
 
 		CERROR("Can't spawn thread %d for scheduler[%d]: %d\n",
-		       info->ksi_cpt, info->ksi_nthreads + i, rc);
+		       sched->kss_cpt, (int)KSOCK_THREAD_SID(id), rc);
 		break;
 	}
 
-	info->ksi_nthreads += i;
+	sched->kss_nthreads += i;
 	return rc;
 }
 
@@ -2703,16 +2655,16 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 		return -EINVAL;
 
 	for (i = 0; i < ncpts; i++) {
-		struct ksock_sched_info *info;
+		struct ksock_sched *sched;
 		int cpt = !cpts ? i : cpts[i];
 
 		LASSERT(cpt < cfs_cpt_number(lnet_cpt_table()));
-		info = ksocknal_data.ksnd_sched_info[cpt];
+		sched = ksocknal_data.ksnd_schedulers[cpt];
 
-		if (!newif && info->ksi_nthreads > 0)
+		if (!newif && sched->kss_nthreads > 0)
 			continue;
 
-		rc = ksocknal_start_schedulers(info);
+		rc = ksocknal_start_schedulers(sched);
 		if (rc)
 			return rc;
 	}
diff --git a/net/lnet/klnds/socklnd/socklnd.h b/net/lnet/klnds/socklnd/socklnd.h
index c8d8acf..2e292f0 100644
--- a/net/lnet/klnds/socklnd/socklnd.h
+++ b/net/lnet/klnds/socklnd/socklnd.h
@@ -74,8 +74,7 @@
 # define SOCKNAL_RISK_KMAP_DEADLOCK	1
 #endif
 
-struct ksock_sched_info;
-
+/* per scheduler state */
 struct ksock_sched {				/* per scheduler state */
 	spinlock_t		kss_lock;	/* serialise */
 	struct list_head	kss_rx_conns;	/* conn waiting to be read */
@@ -85,15 +84,14 @@ struct ksock_sched {				/* per scheduler state */
 	int			kss_nconns;	/* # connections assigned to
 						 * this scheduler
 						 */
-	struct ksock_sched_info	*kss_info;	/* owner of it */
+	/* max allowed threads */
+	int			kss_nthreads_max;
+	/* number of threads */
+	int			kss_nthreads;
+	/* CPT id */
+	int			kss_cpt;
 };
 
-struct ksock_sched_info {
-	int			ksi_nthreads_max; /* max allowed threads */
-	int			ksi_nthreads;	  /* number of threads */
-	int			ksi_cpt;	  /* CPT id */
-	struct ksock_sched	*ksi_scheds;	  /* array of schedulers */
-};
 
 #define KSOCK_CPT_SHIFT			16
 #define KSOCK_THREAD_ID(cpt, sid)	(((cpt) << KSOCK_CPT_SHIFT) | (sid))
@@ -197,7 +195,7 @@ struct ksock_nal_data {
 	int			ksnd_nthreads;		/* # live threads */
 	int			ksnd_shuttingdown;	/* tell threads to exit
 							 */
-	struct ksock_sched_info	**ksnd_sched_info;	/* schedulers info */
+	struct ksock_sched	**ksnd_schedulers;	/* schedulers info */
 
 	atomic_t		ksnd_nactive_txs;	/* #active txs */
 
diff --git a/net/lnet/klnds/socklnd/socklnd_cb.c b/net/lnet/klnds/socklnd/socklnd_cb.c
index abb3529..581f734 100644
--- a/net/lnet/klnds/socklnd/socklnd_cb.c
+++ b/net/lnet/klnds/socklnd/socklnd_cb.c
@@ -1349,7 +1349,6 @@ struct ksock_route *
 
 int ksocknal_scheduler(void *arg)
 {
-	struct ksock_sched_info *info;
 	struct ksock_sched *sched;
 	struct ksock_conn *conn;
 	struct ksock_tx *tx;
@@ -1357,13 +1356,12 @@ int ksocknal_scheduler(void *arg)
 	int nloops = 0;
 	long id = (long)arg;
 
-	info = ksocknal_data.ksnd_sched_info[KSOCK_THREAD_CPT(id)];
-	sched = &info->ksi_scheds[KSOCK_THREAD_SID(id)];
+	sched = ksocknal_data.ksnd_schedulers[KSOCK_THREAD_CPT(id)];
 
-	rc = cfs_cpt_bind(lnet_cpt_table(), info->ksi_cpt);
+	rc = cfs_cpt_bind(lnet_cpt_table(), sched->kss_cpt);
 	if (rc) {
 		CWARN("Can't set CPU partition affinity to %d: %d\n",
-		      info->ksi_cpt, rc);
+		      sched->kss_cpt, rc);
 	}
 
 	spin_lock_bh(&sched->kss_lock);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 192/622] lustre: ldlm: Adjust search_* functions
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (190 preceding siblings ...)
  2020-02-27 21:10 ` [lustre-devel] [PATCH 191/622] lnet: socklnd: improve scheduling algorithm James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 193/622] lustre: sysfs: make ping sysfs file read and writable James Simmons
                   ` (430 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

The search_itree and search_queue functions should both
return either a pointer to a found lock or NULL.

Currently, search_itree just returns the contents of
data->lmd_lock, whether or not a lock was found.

search_queue will do the same under certain circumstances.

Zero lmd_lock in both search_* functions, and also stop
searching in search_itree once a lock is found.

cray-bug-id: LUS-6783
WC-bug-id: https://jira.whamcloud.com/browse/LU-11719
Lustre-commit: a231148843bd ("LU-11719 ldlm: Adjust search_* functions")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33754
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_lock.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/ldlm/ldlm_lock.c b/fs/lustre/ldlm/ldlm_lock.c
index b9771ef..06690a6 100644
--- a/fs/lustre/ldlm/ldlm_lock.c
+++ b/fs/lustre/ldlm/ldlm_lock.c
@@ -1159,6 +1159,8 @@ static struct ldlm_lock *search_itree(struct ldlm_resource *res,
 {
 	int idx;
 
+	data->lmd_lock = NULL;
+
 	for (idx = 0; idx < LCK_MODE_NUM; idx++) {
 		struct ldlm_interval_tree *tree = &res->lr_itree[idx];
 
@@ -1172,11 +1174,14 @@ static struct ldlm_lock *search_itree(struct ldlm_resource *res,
 				   data->lmd_policy->l_extent.start,
 				   data->lmd_policy->l_extent.end,
 				   lock_matches, data);
+		if (data->lmd_lock)
+			return data->lmd_lock;
 	}
-	return data->lmd_lock;
+
+	return NULL;
 }
 
-/**
+/*
  * Search for a lock with given properties in a queue.
  *
  * @queue	search for a lock in this queue
@@ -1189,9 +1194,12 @@ static struct ldlm_lock *search_queue(struct list_head *queue,
 {
 	struct ldlm_lock *lock;
 
+	data->lmd_lock = NULL;
+
 	list_for_each_entry(lock, queue, l_res_link)
 		if (lock_matches(lock, data))
 			return data->lmd_lock;
+
 	return NULL;
 }
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 193/622] lustre: sysfs: make ping sysfs file read and writable
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (191 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 192/622] lustre: ldlm: Adjust search_* functions James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 194/622] lustre: ptlrpc: connect vs import invalidate race James Simmons
                   ` (429 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

Starting with 4.15 kernels, any read-only sysfs file is limited to
root access. To retain the ability for non-root users to detect
whether a remote server is alive using the 'ping' sysfs file, we
need to make it writable. Retain the read ability so that older
tools keep working.

WC-bug-id: https://jira.whamcloud.com/browse/LU-8066
Lustre-commit: 6bbae72c6900 ("LU-8066 sysfs: make ping sysfs file read and writable")
Signed-off-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-on: https://review.whamcloud.com/33776
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lprocfs_status.h | 3 ++-
 fs/lustre/mdc/lproc_mdc.c          | 2 +-
 fs/lustre/mgc/lproc_mgc.c          | 2 +-
 fs/lustre/osc/lproc_osc.c          | 2 +-
 fs/lustre/ptlrpc/lproc_ptlrpc.c    | 9 +++++++++
 5 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/lustre/include/lprocfs_status.h b/fs/lustre/include/lprocfs_status.h
index 1ef548ae..c1079f1 100644
--- a/fs/lustre/include/lprocfs_status.h
+++ b/fs/lustre/include/lprocfs_status.h
@@ -462,7 +462,8 @@ int lprocfs_wr_uint(struct file *file, const char __user *buffer,
 struct adaptive_timeout;
 int lprocfs_at_hist_helper(struct seq_file *m, struct adaptive_timeout *at);
 int lprocfs_rd_timeouts(struct seq_file *m, void *data);
-
+ssize_t ping_store(struct kobject *kobj, struct attribute *attr,
+		   const char *buffer, size_t count);
 ssize_t ping_show(struct kobject *kobj, struct attribute *attr,
 		  char *buffer);
 
diff --git a/fs/lustre/mdc/lproc_mdc.c b/fs/lustre/mdc/lproc_mdc.c
index 746dd21..70c9eaf 100644
--- a/fs/lustre/mdc/lproc_mdc.c
+++ b/fs/lustre/mdc/lproc_mdc.c
@@ -306,7 +306,7 @@ static ssize_t max_mod_rpcs_in_flight_store(struct kobject *kobj,
 LUSTRE_ATTR(mds_conn_uuid, 0444, conn_uuid_show, NULL);
 LUSTRE_RO_ATTR(conn_uuid);
 
-LUSTRE_RO_ATTR(ping);
+LUSTRE_RW_ATTR(ping);
 
 static ssize_t mdc_rpc_stats_seq_write(struct file *file,
 				       const char __user *buf,
diff --git a/fs/lustre/mgc/lproc_mgc.c b/fs/lustre/mgc/lproc_mgc.c
index 676d479..0c716df 100644
--- a/fs/lustre/mgc/lproc_mgc.c
+++ b/fs/lustre/mgc/lproc_mgc.c
@@ -69,7 +69,7 @@ struct lprocfs_vars lprocfs_mgc_obd_vars[] = {
 LUSTRE_ATTR(mgs_conn_uuid, 0444, conn_uuid_show, NULL);
 LUSTRE_RO_ATTR(conn_uuid);
 
-LUSTRE_RO_ATTR(ping);
+LUSTRE_RW_ATTR(ping);
 
 static struct attribute *mgc_attrs[] = {
 	&lustre_attr_mgs_conn_uuid.attr,
diff --git a/fs/lustre/osc/lproc_osc.c b/fs/lustre/osc/lproc_osc.c
index ac64724..ea67d20 100644
--- a/fs/lustre/osc/lproc_osc.c
+++ b/fs/lustre/osc/lproc_osc.c
@@ -176,7 +176,7 @@ static ssize_t max_dirty_mb_store(struct kobject *kobj,
 LUSTRE_ATTR(ost_conn_uuid, 0444, conn_uuid_show, NULL);
 LUSTRE_RO_ATTR(conn_uuid);
 
-LUSTRE_RO_ATTR(ping);
+LUSTRE_RW_ATTR(ping);
 
 static int osc_cached_mb_seq_show(struct seq_file *m, void *v)
 {
diff --git a/fs/lustre/ptlrpc/lproc_ptlrpc.c b/fs/lustre/ptlrpc/lproc_ptlrpc.c
index eb0ecc0..700e109 100644
--- a/fs/lustre/ptlrpc/lproc_ptlrpc.c
+++ b/fs/lustre/ptlrpc/lproc_ptlrpc.c
@@ -1234,6 +1234,7 @@ void ptlrpc_lprocfs_unregister_obd(struct obd_device *obd)
 }
 EXPORT_SYMBOL(ptlrpc_lprocfs_unregister_obd);
 
+/* Kept for older tools */
 ssize_t ping_show(struct kobject *kobj, struct attribute *attr,
 		  char *buffer)
 {
@@ -1260,6 +1261,14 @@ ssize_t ping_show(struct kobject *kobj, struct attribute *attr,
 }
 EXPORT_SYMBOL(ping_show);
 
+ssize_t ping_store(struct kobject *kobj, struct attribute *attr,
+		   const char *buffer, size_t count)
+{
+	int rc = ping_show(kobj, attr, (char *)buffer);
+
+	return (rc < 0) ? rc : count;
+}
+EXPORT_SYMBOL(ping_store);
+
 #undef BUFLEN
 
 /* Write the connection UUID to this file to attempt to connect to that node.
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 194/622] lustre: ptlrpc: connect vs import invalidate race
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (192 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 193/622] lustre: sysfs: make ping sysfs file read and writable James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 195/622] lustre: ptlrpc: always unregister bulk James Simmons
                   ` (428 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Andriy Skulysh <c17819@cray.com>

A connect request can't be sent while import invalidation is in
progress, which leaves the import in an uninitialized state.

Don't allow reconnect in evicted state.

Cray-bug-id: LUS-6322
WC-bug-id: https://jira.whamcloud.com/browse/LU-7558
Lustre-commit: b1827ff1da82 ("LU-7558 ptlrpc: connect vs import invalidate race")
Signed-off-by: Andriy Skulysh <c17819@cray.com>
Reviewed-by: Alexander Boyko <c17825@cray.com>
Reviewed-by: Andrew Perepechko <c17827@cray.com>
Reviewed-on: https://review.whamcloud.com/33718
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_support.h | 1 +
 fs/lustre/ptlrpc/import.c       | 6 ++++++
 fs/lustre/ptlrpc/recover.c      | 2 ++
 3 files changed, 9 insertions(+)

diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index c2db38f..5ff270a 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -353,6 +353,7 @@
 #define OBD_FAIL_PTLRPC_LONG_REQ_UNLINK			0x51b
 #define OBD_FAIL_PTLRPC_LONG_BOTH_UNLINK		0x51c
 #define OBD_FAIL_PTLRPC_BULK_ATTACH      0x521
+#define OBD_FAIL_PTLRPC_CONNECT_RACE			0x531
 
 #define OBD_FAIL_OBD_PING_NET				0x600
 /*	OBD_FAIL_OBD_LOG_CANCEL_NET	0x601 obsolete since 1.5 */
diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index 867aff6..df6c459 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -38,6 +38,7 @@
 #define DEBUG_SUBSYSTEM S_RPC
 
 #include <linux/kthread.h>
+#include <linux/delay.h>
 #include <linux/fs_struct.h>
 #include <obd_support.h>
 #include <lustre_ha.h>
@@ -273,6 +274,10 @@ void ptlrpc_invalidate_import(struct obd_import *imp)
 	if (!imp->imp_invalid || imp->imp_obd->obd_no_recov)
 		ptlrpc_deactivate_import(imp);
 
+	if (OBD_FAIL_PRECHECK(OBD_FAIL_PTLRPC_CONNECT_RACE)) {
+		OBD_RACE(OBD_FAIL_PTLRPC_CONNECT_RACE);
+		msleep(10 * MSEC_PER_SEC);
+	}
 	CFS_FAIL_TIMEOUT(OBD_FAIL_MGS_CONNECT_NET, 3 * cfs_fail_val / 2);
 	LASSERT(imp->imp_invalid);
 
@@ -615,6 +620,7 @@ int ptlrpc_connect_import(struct obd_import *imp)
 		CERROR("already connected\n");
 		return 0;
 	} else if (imp->imp_state == LUSTRE_IMP_CONNECTING ||
+		   imp->imp_state == LUSTRE_IMP_EVICTED ||
 		   imp->imp_connected) {
 		spin_unlock(&imp->imp_lock);
 		CERROR("already connecting\n");
diff --git a/fs/lustre/ptlrpc/recover.c b/fs/lustre/ptlrpc/recover.c
index 7c09c4e..ceab288 100644
--- a/fs/lustre/ptlrpc/recover.c
+++ b/fs/lustre/ptlrpc/recover.c
@@ -339,6 +339,8 @@ int ptlrpc_recover_import(struct obd_import *imp, char *new_uuid, int async)
 	if (rc)
 		goto out;
 
+	OBD_RACE(OBD_FAIL_PTLRPC_CONNECT_RACE);
+
 	rc = ptlrpc_connect_import(imp);
 	if (rc)
 		goto out;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 195/622] lustre: ptlrpc: always unregister bulk
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (193 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 194/622] lustre: ptlrpc: connect vs import invalidate race James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 196/622] lustre: sptlrpc: split sptlrpc_process_config() James Simmons
                   ` (427 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Hongchao Zhang <hongchao@whamcloud.com>

In ptlrpc_check_set, the bulk should be unregistered before
ptl_send_rpc in any case.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11647
Lustre-commit: 21c53b18a1bc ("LU-11647 ptlrpc: always unregister bulk")
Signed-off-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/22378
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/client.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/ptlrpc/client.c b/fs/lustre/ptlrpc/client.c
index ff212a3..f57ec1883 100644
--- a/fs/lustre/ptlrpc/client.c
+++ b/fs/lustre/ptlrpc/client.c
@@ -1902,9 +1902,6 @@ int ptlrpc_check_set(const struct lu_env *env, struct ptlrpc_request_set *set)
 					spin_lock(&req->rq_lock);
 					req->rq_resend = 1;
 					spin_unlock(&req->rq_lock);
-					if (req->rq_bulk &&
-					    !ptlrpc_unregister_bulk(req, 1))
-						continue;
 				}
 				/*
 				 * rq_wait_ctx is only touched by ptlrpcd,
@@ -1931,6 +1928,13 @@ int ptlrpc_check_set(const struct lu_env *env, struct ptlrpc_request_set *set)
 					spin_unlock(&req->rq_lock);
 				}
 
+				/* In any case, the previous bulk should be
+				 * cleaned up to prepare for the new sending
+				 */
+				if (req->rq_bulk &&
+				    !ptlrpc_unregister_bulk(req, 1))
+					continue;
+
 				rc = ptl_send_rpc(req, 0);
 				if (rc == -ENOMEM) {
 					spin_lock(&imp->imp_lock);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 196/622] lustre: sptlrpc: split sptlrpc_process_config()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (194 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 195/622] lustre: ptlrpc: always unregister bulk James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 197/622] lustre: cfg: reserve flags for SELinux status checking James Simmons
                   ` (426 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

Make sptlrpc_process_config() more than a single-line wrapper around
an exported function. Instead, migrate the lcfg parsing out of
__sptlrpc_process_config() so that we can use this function for
both LCFG_PARAM and LCFG_SET_PARAM handling.

The first field parsed from struct lustre_cfg *lcfg is the target.
This can be "_mgs", a file system name, or an obd target such as
fsname-MDT0000. We can move to extracting the file system name out
of the target string using server_name2fsname().

WC-bug-id: https://jira.whamcloud.com/browse/LU-10937
Lustre-commit: 0ff7d548eb7b ("LU-10937 sptlrpc: split sptlrpc_process_config()")
Signed-off-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-on: https://review.whamcloud.com/33760
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_disk.h |  1 +
 fs/lustre/obdclass/obd_mount.c  |  5 ++-
 fs/lustre/ptlrpc/sec_config.c   | 85 +++++++++++++++++++++++++----------------
 3 files changed, 57 insertions(+), 34 deletions(-)

diff --git a/fs/lustre/include/lustre_disk.h b/fs/lustre/include/lustre_disk.h
index 92618e8..b6b693f 100644
--- a/fs/lustre/include/lustre_disk.h
+++ b/fs/lustre/include/lustre_disk.h
@@ -145,6 +145,7 @@ struct lustre_sb_info {
 /****************** prototypes *********************/
 
 /* obd_mount.c */
+int server_name2fsname(const char *svname, char *fsname, const char **endptr);
 
 int lustre_start_mgc(struct super_block *sb);
 int lustre_common_put_super(struct super_block *sb);
diff --git a/fs/lustre/obdclass/obd_mount.c b/fs/lustre/obdclass/obd_mount.c
index d143112..6c68bc7 100644
--- a/fs/lustre/obdclass/obd_mount.c
+++ b/fs/lustre/obdclass/obd_mount.c
@@ -597,8 +597,8 @@ int lustre_put_lsi(struct super_block *sb)
  *
  * Returns:	rc < 0  on error
  */
-static int server_name2fsname(const char *svname, char *fsname,
-			      const char **endptr)
+int server_name2fsname(const char *svname, char *fsname,
+		       const char **endptr)
 {
 	const char *dash;
 
@@ -618,6 +618,7 @@ static int server_name2fsname(const char *svname, char *fsname,
 
 	return 0;
 }
+EXPORT_SYMBOL(server_name2fsname);
 
 /* Get the index from the obd name.
  *  rc = server type, or
diff --git a/fs/lustre/ptlrpc/sec_config.c b/fs/lustre/ptlrpc/sec_config.c
index 135ce99..e4b1a075 100644
--- a/fs/lustre/ptlrpc/sec_config.c
+++ b/fs/lustre/ptlrpc/sec_config.c
@@ -41,6 +41,7 @@
 #include <obd_class.h>
 #include <obd_support.h>
 #include <lustre_import.h>
+#include <lustre_disk.h>
 #include <uapi/linux/lustre/lustre_param.h>
 #include <lustre_sec.h>
 
@@ -577,14 +578,45 @@ static int sptlrpc_conf_merge_rule(struct sptlrpc_conf *conf,
  * find one through the target name in the record inside conf_lock;
  * otherwise means caller already hold conf_lock.
  */
-static int __sptlrpc_process_config(struct lustre_cfg *lcfg,
+static int __sptlrpc_process_config(char *target, const char *fsname,
+				    struct sptlrpc_rule *rule,
 				    struct sptlrpc_conf *conf)
 {
-	char *target, *param;
+	int rc;
+
+	if (!conf) {
+		if (!fsname)
+			return -ENODEV;
+
+		mutex_lock(&sptlrpc_conf_lock);
+		conf = sptlrpc_conf_get(fsname, 0);
+		if (!conf) {
+			CERROR("can't find conf\n");
+			rc = -ENOMEM;
+		} else {
+			rc = sptlrpc_conf_merge_rule(conf, target, rule);
+		}
+		mutex_unlock(&sptlrpc_conf_lock);
+	} else {
+		LASSERT(mutex_is_locked(&sptlrpc_conf_lock));
+		rc = sptlrpc_conf_merge_rule(conf, target, rule);
+	}
+
+	if (!rc)
+		conf->sc_modified++;
+
+	return rc;
+}
+
+int sptlrpc_process_config(struct lustre_cfg *lcfg)
+{
 	char fsname[MTI_NAME_MAXLEN];
 	struct sptlrpc_rule rule;
+	char *target, *param;
 	int rc;
 
+	print_lustre_cfg(lcfg);
+
 	target = lustre_cfg_string(lcfg, 1);
 	if (!target) {
 		CERROR("missing target name\n");
@@ -597,45 +629,34 @@ static int __sptlrpc_process_config(struct lustre_cfg *lcfg,
 		return -EINVAL;
 	}
 
-	CDEBUG(D_SEC, "processing rule: %s.%s\n", target, param);
-
 	/* parse rule to make sure the format is correct */
-	if (strncmp(param, PARAM_SRPC_FLVR, sizeof(PARAM_SRPC_FLVR) - 1) != 0) {
+	if (strncmp(param, PARAM_SRPC_FLVR,
+		    sizeof(PARAM_SRPC_FLVR) - 1) != 0) {
 		CERROR("Invalid sptlrpc parameter: %s\n", param);
 		return -EINVAL;
 	}
 	param += sizeof(PARAM_SRPC_FLVR) - 1;
 
-	rc = sptlrpc_parse_rule(param, &rule);
-	if (rc)
-		return -EINVAL;
-
-	if (!conf) {
-		target2fsname(target, fsname, sizeof(fsname));
-
-		mutex_lock(&sptlrpc_conf_lock);
-		conf = sptlrpc_conf_get(fsname, 0);
-		if (!conf) {
-			CERROR("can't find conf\n");
-			rc = -ENOMEM;
-		} else {
-			rc = sptlrpc_conf_merge_rule(conf, target, &rule);
-		}
-		mutex_unlock(&sptlrpc_conf_lock);
-	} else {
-		LASSERT(mutex_is_locked(&sptlrpc_conf_lock));
-		rc = sptlrpc_conf_merge_rule(conf, target, &rule);
-	}
+	CDEBUG(D_SEC, "processing rule: %s.%s\n", target, param);
 
-	if (rc == 0)
-		conf->sc_modified++;
+	/*
+	 * Three types of targets exist for sptlrpc using conf_param
+	 * 1.	'_mgs' which targets mgc srpc settings. Treat it
+	 *	as a special file system name.
+	 * 2.	target is a device which can be fsname-MDTXXXX or
+	 *	fsname-OSTXXXX. This can be verified by the function
+	 *	server_name2fsname.
+	 * 3.	If neither of the above conditions is met then the
+	 *	target is an actual filesystem.
+	 */
+	if (server_name2fsname(target, fsname, NULL))
+		strlcpy(fsname, target, sizeof(fsname));
 
-	return rc;
-}
+	rc = sptlrpc_parse_rule(param, &rule);
+	if (rc)
+		return rc;
 
-int sptlrpc_process_config(struct lustre_cfg *lcfg)
-{
-	return __sptlrpc_process_config(lcfg, NULL);
+	return __sptlrpc_process_config(target, fsname, &rule, NULL);
 }
 EXPORT_SYMBOL(sptlrpc_process_config);
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 197/622] lustre: cfg: reserve flags for SELinux status checking
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (195 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 196/622] lustre: sptlrpc: split sptlrpc_process_config() James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 198/622] lustre: llite: remove cl_file_inode_init() LASSERT James Simmons
                   ` (425 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Sebastien Buisson <sbuisson@ddn.com>

Reserve LCFG_NODEMAP_SET_SEPOL config flag that will be used to
define sepol parameter on nodemap entries.
Reserve OBD_CONNECT2_SELINUX_POLICY connection flag that will be set
(in ocd_connect_flags2) if a client supports sending the SELinux
policy status info.
Add checks for all lcfg_command_type constants, along with lustre_cfg
and cfg_record_type.

WC-bug-id: https://jira.whamcloud.com/browse/LU-8955
Lustre-commit: e71a77ba8d47 ("LU-8955 cfg: reserve flags for SELinux status checking")
Signed-off-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-on: https://review.whamcloud.com/33797
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Patrick Farrell <farr0186@gmail.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/lprocfs_status.c    |   1 +
 fs/lustre/ptlrpc/wiretest.c            | 115 +++++++++++++++++++++++++++++++--
 include/uapi/linux/lustre/lustre_cfg.h |   1 +
 include/uapi/linux/lustre/lustre_idl.h |   1 +
 4 files changed, 112 insertions(+), 6 deletions(-)

diff --git a/fs/lustre/obdclass/lprocfs_status.c b/fs/lustre/obdclass/lprocfs_status.c
index cce9bec..7701bc3 100644
--- a/fs/lustre/obdclass/lprocfs_status.c
+++ b/fs/lustre/obdclass/lprocfs_status.c
@@ -120,6 +120,7 @@
 	"wbc",		/* 0x40 */
 	"lock_convert",	/* 0x80 */
 	"archive_id_array",	/* 0x100 */
+	"selinux_policy",	/* 0x200 */
 	NULL
 };
 
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index 66dce80..bf79b8b 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -41,6 +41,7 @@
 #include <lustre_net.h>
 #include <lustre_disk.h>
 #include <uapi/linux/lustre/lustre_idl.h>
+#include <uapi/linux/lustre/lustre_cfg.h>
 
 #include "ptlrpc_internal.h"
 
@@ -1143,6 +1144,8 @@ void lustre_assert_wire_constants(void)
 		 OBD_CONNECT2_LOCK_CONVERT);
 	LASSERTF(OBD_CONNECT2_ARCHIVE_ID_ARRAY == 0x100ULL, "found 0x%.16llxULL\n",
 		 OBD_CONNECT2_ARCHIVE_ID_ARRAY);
+	LASSERTF(OBD_CONNECT2_SELINUX_POLICY == 0x400ULL, "found 0x%.16llxULL\n",
+		 OBD_CONNECT2_SELINUX_POLICY);
 	LASSERTF(OBD_CKSUM_CRC32 == 0x00000001UL, "found 0x%.8xUL\n",
 		 (unsigned int)OBD_CKSUM_CRC32);
 	LASSERTF(OBD_CKSUM_ADLER == 0x00000002UL, "found 0x%.8xUL\n",
@@ -1150,17 +1153,17 @@ void lustre_assert_wire_constants(void)
 	LASSERTF(OBD_CKSUM_CRC32C == 0x00000004UL, "found 0x%.8xUL\n",
 		 (unsigned int)OBD_CKSUM_CRC32C);
 	LASSERTF(OBD_CKSUM_RESERVED == 0x00000008UL, "found 0x%.8xUL\n",
-		(unsigned int)OBD_CKSUM_RESERVED);
+		 (unsigned int)OBD_CKSUM_RESERVED);
 	LASSERTF(OBD_CKSUM_T10IP512 == 0x00000010UL, "found 0x%.8xUL\n",
-		(unsigned int)OBD_CKSUM_T10IP512);
+		 (unsigned int)OBD_CKSUM_T10IP512);
 	LASSERTF(OBD_CKSUM_T10IP4K == 0x00000020UL, "found 0x%.8xUL\n",
-		(unsigned int)OBD_CKSUM_T10IP4K);
+		 (unsigned int)OBD_CKSUM_T10IP4K);
 	LASSERTF(OBD_CKSUM_T10CRC512 == 0x00000040UL, "found 0x%.8xUL\n",
-		(unsigned int)OBD_CKSUM_T10CRC512);
+		 (unsigned int)OBD_CKSUM_T10CRC512);
 	LASSERTF(OBD_CKSUM_T10CRC4K == 0x00000080UL, "found 0x%.8xUL\n",
-		(unsigned int)OBD_CKSUM_T10CRC4K);
+		 (unsigned int)OBD_CKSUM_T10CRC4K);
 	LASSERTF(OBD_CKSUM_T10_TOP == 0x00000002UL, "found 0x%.8xUL\n",
-		(unsigned int)OBD_CKSUM_T10_TOP);
+		 (unsigned int)OBD_CKSUM_T10_TOP);
 
 	/* Checks for struct ost_layout */
 	LASSERTF((int)sizeof(struct ost_layout) == 28, "found %lld\n",
@@ -4633,4 +4636,104 @@ void lustre_assert_wire_constants(void)
 		 (long long)(int)sizeof(((struct ladvise_hdr *)0)->lah_advise));
 	LASSERTF(LF_ASYNC == 0x00000001UL, "found 0x%.8xUL\n",
 		 (unsigned int)LF_ASYNC);
+
+	/* Checks for struct lustre_cfg */
+	LASSERTF((int)sizeof(struct lustre_cfg) == 32, "found %lld\n",
+		 (long long)(int)sizeof(struct lustre_cfg));
+	LASSERTF((int)offsetof(struct lustre_cfg, lcfg_version) == 0, "found %lld\n",
+		 (long long)(int)offsetof(struct lustre_cfg, lcfg_version));
+	LASSERTF((int)sizeof(((struct lustre_cfg *)0)->lcfg_version) == 4, "found %lld\n",
+		 (long long)(int)sizeof(((struct lustre_cfg *)0)->lcfg_version));
+	LASSERTF((int)offsetof(struct lustre_cfg, lcfg_command) == 4, "found %lld\n",
+		 (long long)(int)offsetof(struct lustre_cfg, lcfg_command));
+	LASSERTF((int)sizeof(((struct lustre_cfg *)0)->lcfg_command) == 4, "found %lld\n",
+		 (long long)(int)sizeof(((struct lustre_cfg *)0)->lcfg_command));
+	LASSERTF((int)offsetof(struct lustre_cfg, lcfg_num) == 8, "found %lld\n",
+		 (long long)(int)offsetof(struct lustre_cfg, lcfg_num));
+	LASSERTF((int)sizeof(((struct lustre_cfg *)0)->lcfg_num) == 4, "found %lld\n",
+		 (long long)(int)sizeof(((struct lustre_cfg *)0)->lcfg_num));
+	LASSERTF((int)offsetof(struct lustre_cfg, lcfg_flags) == 12, "found %lld\n",
+		 (long long)(int)offsetof(struct lustre_cfg, lcfg_flags));
+	LASSERTF((int)sizeof(((struct lustre_cfg *)0)->lcfg_flags) == 4, "found %lld\n",
+		 (long long)(int)sizeof(((struct lustre_cfg *)0)->lcfg_flags));
+	LASSERTF((int)offsetof(struct lustre_cfg, lcfg_nid) == 16, "found %lld\n",
+		 (long long)(int)offsetof(struct lustre_cfg, lcfg_nid));
+	LASSERTF((int)sizeof(((struct lustre_cfg *)0)->lcfg_nid) == 8, "found %lld\n",
+		 (long long)(int)sizeof(((struct lustre_cfg *)0)->lcfg_nid));
+	LASSERTF((int)offsetof(struct lustre_cfg, lcfg_nal) == 24, "found %lld\n",
+		 (long long)(int)offsetof(struct lustre_cfg, lcfg_nal));
+	LASSERTF((int)sizeof(((struct lustre_cfg *)0)->lcfg_nal) == 4, "found %lld\n",
+		 (long long)(int)sizeof(((struct lustre_cfg *)0)->lcfg_nal));
+	LASSERTF((int)offsetof(struct lustre_cfg, lcfg_bufcount) == 28, "found %lld\n",
+		 (long long)(int)offsetof(struct lustre_cfg, lcfg_bufcount));
+	LASSERTF((int)sizeof(((struct lustre_cfg *)0)->lcfg_bufcount) == 4, "found %lld\n",
+		 (long long)(int)sizeof(((struct lustre_cfg *)0)->lcfg_bufcount));
+	LASSERTF((int)offsetof(struct lustre_cfg, lcfg_buflens[0]) == 32, "found %lld\n",
+		 (long long)(int)offsetof(struct lustre_cfg, lcfg_buflens[0]));
+	LASSERTF((int)sizeof(((struct lustre_cfg *)0)->lcfg_buflens[0]) == 4, "found %lld\n",
+		 (long long)(int)sizeof(((struct lustre_cfg *)0)->lcfg_buflens[0]));
+	LASSERTF(LCFG_ATTACH == 0x000cf001UL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_ATTACH);
+	LASSERTF(LCFG_DETACH == 0x000cf002UL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_DETACH);
+	LASSERTF(LCFG_SETUP == 0x000cf003UL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_SETUP);
+	LASSERTF(LCFG_CLEANUP == 0x000cf004UL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_CLEANUP);
+	LASSERTF(LCFG_ADD_UUID == 0x000cf005UL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_ADD_UUID);
+	LASSERTF(LCFG_DEL_UUID == 0x000cf006UL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_DEL_UUID);
+	LASSERTF(LCFG_MOUNTOPT == 0x000cf007UL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_MOUNTOPT);
+	LASSERTF(LCFG_DEL_MOUNTOPT == 0x000cf008UL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_DEL_MOUNTOPT);
+	LASSERTF(LCFG_SET_TIMEOUT == 0x000cf009UL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_SET_TIMEOUT);
+	LASSERTF(LCFG_SET_UPCALL == 0x000cf00aUL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_SET_UPCALL);
+	LASSERTF(LCFG_ADD_CONN == 0x000cf00bUL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_ADD_CONN);
+	LASSERTF(LCFG_DEL_CONN == 0x000cf00cUL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_DEL_CONN);
+	LASSERTF(LCFG_LOV_ADD_OBD == 0x000cf00dUL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_LOV_ADD_OBD);
+	LASSERTF(LCFG_LOV_DEL_OBD == 0x000cf00eUL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_LOV_DEL_OBD);
+	LASSERTF(LCFG_PARAM == 0x000cf00fUL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_PARAM);
+	LASSERTF(LCFG_MARKER == 0x000cf010UL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_MARKER);
+	LASSERTF(LCFG_LOG_START == 0x000ce011UL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_LOG_START);
+	LASSERTF(LCFG_LOG_END == 0x000ce012UL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_LOG_END);
+	LASSERTF(LCFG_LOV_ADD_INA == 0x000ce013UL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_LOV_ADD_INA);
+	LASSERTF(LCFG_ADD_MDC == 0x000cf014UL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_ADD_MDC);
+	LASSERTF(LCFG_DEL_MDC == 0x000cf015UL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_DEL_MDC);
+	LASSERTF(LCFG_SPTLRPC_CONF == 0x000ce016UL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_SPTLRPC_CONF);
+	LASSERTF(LCFG_POOL_NEW == 0x000ce020UL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_POOL_NEW);
+	LASSERTF(LCFG_POOL_ADD == 0x000ce021UL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_POOL_ADD);
+	LASSERTF(LCFG_POOL_REM == 0x000ce022UL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_POOL_REM);
+	LASSERTF(LCFG_POOL_DEL == 0x000ce023UL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_POOL_DEL);
+	LASSERTF(LCFG_SET_LDLM_TIMEOUT == 0x000ce030UL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_SET_LDLM_TIMEOUT);
+	LASSERTF(LCFG_PRE_CLEANUP == 0x000cf031UL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_PRE_CLEANUP);
+	LASSERTF(LCFG_SET_PARAM == 0x000ce032UL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_SET_PARAM);
+	LASSERTF(LCFG_NODEMAP_SET_SEPOL == 0x000ce05bUL, "found 0x%.8xUL\n",
+		(unsigned int)LCFG_NODEMAP_SET_SEPOL);
+	LASSERTF(PORTALS_CFG_TYPE == 1, "found %lld\n",
+		 (long long)PORTALS_CFG_TYPE);
+	LASSERTF(LUSTRE_CFG_TYPE == 123, "found %lld\n",
+		 (long long)LUSTRE_CFG_TYPE);
 }
diff --git a/include/uapi/linux/lustre/lustre_cfg.h b/include/uapi/linux/lustre/lustre_cfg.h
index 0620e49..5d6b585 100644
--- a/include/uapi/linux/lustre/lustre_cfg.h
+++ b/include/uapi/linux/lustre/lustre_cfg.h
@@ -107,6 +107,7 @@ enum lcfg_command_type {
 	LCFG_SET_PARAM		  = 0x00ce032, /**< use set_param syntax to set
 						 * a proc parameters
 						 */
+	LCFG_NODEMAP_SET_SEPOL	  = 0x00ce05b, /**< set SELinux policy */
 };
 
 struct lustre_cfg_bufs {
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 4236a43..f723d7b 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -805,6 +805,7 @@ struct ptlrpc_body_v2 {
 						 */
 #define OBD_CONNECT2_LOCK_CONVERT	0x80ULL /* IBITS lock convert support */
 #define OBD_CONNECT2_ARCHIVE_ID_ARRAY  0x100ULL	/* store HSM archive_id in array */
+#define OBD_CONNECT2_SELINUX_POLICY    0x400ULL	/* has client SELinux policy */
 
 /* XXX README XXX:
  * Please DO NOT add flag values here before first ensuring that this same
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 198/622] lustre: llite: remove cl_file_inode_init() LASSERT
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (196 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 197/622] lustre: cfg: reserve flags for SELinux status checking James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 199/622] lnet: add fault injection for bulk transfers James Simmons
                   ` (424 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

If there is some corruption or other reason that the file layout
cannot be used, the first call to cl_file_inode_init() will fail.
If it is called a second time on the same file then it will hit
an LASSERT() since I_NEW is no longer set on the inode.

It would be good to handle the error in lov_init_raid0() better,
but we still want to avoid this LASSERT() if there is an error.

Convert the LASSERT() in cl_file_inode_init() into a CERROR() and
error return.  This is being triggered due to corruption on the
server, but that shouldn't cause the client to assert.

    lov_dump_lmm_common() oid 0xdf4e:311367, magic 0x0bd10bd0
    lov_dump_lmm_common() stripe_size 1048576, stripe_count 4
    lov_dump_lmm_objects() stripe 0 idx 10 subobj 0x0:151194471
    lov_dump_lmm_objects() stripe 1 idx 12 subobj 0x0:152477530
    lov_dump_lmm_objects() stripe 2 idx 25 subobj 0x0:151589797
    lov_dump_lmm_objects() stripe 3 idx 2 subobj 0x0:150332564
    lov_init_raid0() fsname-clilov: OST0019 is not initialized
    cl_file_inode_init() Failure to initialize cl object
        [0x20004c047:0xdf4e:0x0]: -5

    cl_file_inode_init() ASSERTION(inode->i_state & (1 << 3) ) failed
    cl_file_inode_init() LBUG
    Pid: 37233, comm: ll_sa_4709 3.10.0-862.14.4.el7.x86_64 #1 SMP
    Call Trace:
    libcfs_call_trace+0x8c/0xc0 [libcfs]
    lbug_with_loc+0x4c/0xa0 [libcfs]
    cl_file_inode_init+0x2ac/0x300 [lustre]
    ll_update_inode+0x315/0x600 [lustre]
    ll_iget+0x163/0x350 [lustre]
    ll_prep_inode+0x232/0xc80 [lustre]
    sa_handle_callback+0x3a4/0xf70 [lustre]
    ll_statahead_thread+0x40e/0x2080 [lustre]

Return an I/O error instead of killing the client.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11579
Lustre-commit: 0baa3eb1a4ab ("LU-11579 llite: remove cl_file_inode_init() LASSERT")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33505
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/lcommon_cl.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/fs/lustre/llite/lcommon_cl.c b/fs/lustre/llite/lcommon_cl.c
index 978e05b..9ac80e0 100644
--- a/fs/lustre/llite/lcommon_cl.c
+++ b/fs/lustre/llite/lcommon_cl.c
@@ -171,7 +171,14 @@ int cl_file_inode_init(struct inode *inode, struct lustre_md *md)
 		 * unnecessary to perform lookup-alloc-lookup-insert, just
 		 * alloc and insert directly.
 		 */
-		LASSERT(inode->i_state & I_NEW);
+		if (!(inode->i_state & I_NEW)) {
+			result = -EIO;
+			CERROR("%s: unexpected not-NEW inode "DFID": rc = %d\n",
+			       ll_get_fsname(inode->i_sb, NULL, 0), PFID(fid),
+			       result);
+			goto out;
+		}
+
 		conf.coc_lu.loc_flags = LOC_F_NEW;
 		clob = cl_object_find(env, lu2cl_dev(site->ls_top_dev),
 				      fid, &conf);
@@ -193,11 +200,13 @@ int cl_file_inode_init(struct inode *inode, struct lustre_md *md)
 		}
 	}
 
+	if (result)
+		CERROR("%s: failed to initialize cl_object "DFID": rc = %d\n",
+			ll_get_fsname(inode->i_sb, NULL, 0), PFID(fid), result);
+
+out:
 	cl_env_put(env, &refcheck);
 
-	if (result != 0)
-		CERROR("Failure to initialize cl object " DFID ": %d\n",
-		       PFID(fid), result);
 	return result;
 }
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 199/622] lnet: add fault injection for bulk transfers
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (197 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 198/622] lustre: llite: remove cl_file_inode_init() LASSERT James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 200/622] lnet: remove .nf_min_max handling James Simmons
                   ` (423 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Artem Blagodarenko <artem.blagodarenko@seagate.com>

An internal test was always passing because no fault injection
was happening. Add CFS_FAIL_PTLRPC_OST_BULK_CB2 to simulate a bulk
transfer timeout.

WC-bug-id: https://jira.whamcloud.com/browse/LU-7159
Lustre-commit: 707820692275 ("LU-7159 tests: fix 224c fault injection")
Signed-off-by: Artem Blagodarenko <artem.blagodarenko@seagate.com>
Xyratex-bug-id: MRP-2472
Reviewed-on: https://review.whamcloud.com/16426
Reviewed-by: Alexander Zarochentsev <c17826@cray.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_support.h    | 1 +
 include/linux/libcfs/libcfs_fail.h | 6 ++++++
 include/linux/lnet/lib-lnet.h      | 3 +++
 net/lnet/lnet/lib-move.c           | 6 +++++-
 4 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index 5ff270a..d9a0395 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -487,6 +487,7 @@
 #define OBD_FAIL_FLR_LV_INC			0x1A02
 #define OBD_FAIL_FLR_RANDOM_PICK_MIRROR	0x1A03
 
+/* LNet is allocated failure locations 0xe000 to 0xffff */
 /* Assign references to moved code to reduce code changes */
 #define OBD_FAIL_PRECHECK(id)			CFS_FAIL_PRECHECK(id)
 #define OBD_FAIL_CHECK(id)			CFS_FAIL_CHECK(id)
diff --git a/include/linux/libcfs/libcfs_fail.h b/include/linux/libcfs/libcfs_fail.h
index f52a82a..c341567 100644
--- a/include/linux/libcfs/libcfs_fail.h
+++ b/include/linux/libcfs/libcfs_fail.h
@@ -54,6 +54,12 @@ enum {
 	CFS_FAIL_LOC_VALUE	= 3
 };
 
+/* Failure ranges
+ * "0x0100 - 0x3fff" for Lustre
+ * "0xe000 - 0xefff" for LNet
+ * "0xf000 - 0xffff" for LNDs
+ */
+
 /* Failure injection control */
 #define CFS_FAIL_MASK_SYS	0x0000FF00
 #define CFS_FAIL_MASK_LOC	(0x000000FF | CFS_FAIL_MASK_SYS)
diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index bbb678f..d09fb4c 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -49,6 +49,9 @@
 #include <uapi/linux/lnet/lnetctl.h>
 #include <uapi/linux/lnet/nidstr.h>
 
+/* LNET has 0xeXXX */
+#define CFS_FAIL_PTLRPC_OST_BULK_CB2	0xe000
+
 extern struct lnet the_lnet;	/* THE network */
 
 #if (BITS_PER_LONG == 32)
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 3bcac03..f5548eb 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -4323,7 +4323,11 @@ void lnet_monitor_thr_stop(void)
 	if (ack == LNET_ACK_REQ)
 		lnet_attach_rsp_tracker(rspt, cpt, md, mdh);
 
-	rc = lnet_send(self, msg, LNET_NID_ANY);
+	if (CFS_FAIL_CHECK_ORSET(CFS_FAIL_PTLRPC_OST_BULK_CB2,
+				 CFS_FAIL_ONCE))
+		rc = -EIO;
+	else
+		rc = lnet_send(self, msg, LNET_NID_ANY);
 	if (rc) {
 		CNETERR("Error sending PUT to %s: %d\n",
 			libcfs_id2str(target), rc);
-- 
1.8.3.1


* [lustre-devel] [PATCH 200/622] lnet: remove .nf_min_max handling
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (198 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 199/622] lnet: add fault injection for bulk transfers James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 201/622] lustre: sec: create new function sptlrpc_get_sepol() James Simmons
                   ` (422 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Kit Westneat <kit.westneat@gmail.com>

The .nf_min_max handling was only used for server side debugging.
It has already been removed in the OpenSFS tree, so let's remove
it here as well, since the nf_min_max handling code is no longer
used.

WC-bug-id: https://jira.whamcloud.com/browse/LU-8939
Lustre-commit: a9b830da51bd ("LU-8939 nodemap: remove deprecated lproc files")
Signed-off-by: Kit Westneat <kit.westneat@gmail.com>
Reviewed-on: https://review.whamcloud.com/24352
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/nidstrings.c | 278 ++-------------------------------------------
 1 file changed, 10 insertions(+), 268 deletions(-)

diff --git a/net/lnet/lnet/nidstrings.c b/net/lnet/lnet/nidstrings.c
index 13338d0..eca5092 100644
--- a/net/lnet/lnet/nidstrings.c
+++ b/net/lnet/lnet/nidstrings.c
@@ -451,264 +451,6 @@ int cfs_print_nidlist(char *buffer, int count, struct list_head *nidlist)
 }
 EXPORT_SYMBOL(cfs_print_nidlist);
 
-/**
- * Determines minimum and maximum addresses for a single
- * numeric address range
- *
- * @ar
- * @min_nid	*min_nid __u32 representation of min NID
- * @max_nid	*max_nid __u32 representation of max NID
- *
- * Return:	-EINVAL unsupported LNET range
- *		-ERANGE non-contiguous LNET range
- */
-static int cfs_ip_ar_min_max(struct addrrange *ar, u32 *min_nid,
-			     u32 *max_nid)
-{
-	struct cfs_expr_list *expr_list;
-	struct cfs_range_expr *range;
-	unsigned int min_ip[4] = { 0 };
-	unsigned int max_ip[4] = { 0 };
-	int cur_octet = 0;
-	bool expect_full_octet = false;
-
-	list_for_each_entry(expr_list, &ar->ar_numaddr_ranges, el_link) {
-		int re_count = 0;
-
-		list_for_each_entry(range, &expr_list->el_exprs, re_link) {
-		/* XXX: add support for multiple & non-contig. re's */
-			if (re_count > 0)
-				return -EINVAL;
-
-			/* if a previous octet was ranged, then all remaining
-			 * octets must be full for contiguous range
-			 */
-			if (expect_full_octet && (range->re_lo != 0 ||
-						  range->re_hi != 255))
-				return -ERANGE;
-
-			if (range->re_stride != 1)
-				return -ERANGE;
-
-			if (range->re_lo > range->re_hi)
-				return -EINVAL;
-
-			if (range->re_lo != range->re_hi)
-				expect_full_octet = true;
-
-			min_ip[cur_octet] = range->re_lo;
-			max_ip[cur_octet] = range->re_hi;
-
-			re_count++;
-		}
-
-		cur_octet++;
-	}
-
-	if (min_nid)
-		*min_nid = ((min_ip[0] << 24) | (min_ip[1] << 16) |
-			    (min_ip[2] << 8) | min_ip[3]);
-
-	if (max_nid)
-		*max_nid = ((max_ip[0] << 24) | (max_ip[1] << 16) |
-			    (max_ip[2] << 8) | max_ip[3]);
-
-	return 0;
-}
-
-/**
- * Determines minimum and maximum addresses for a single
- * numeric address range
- *
- * @ar
- * @min_nid	*min_nid __u32 representation of min NID
- * @max_nid	*max_nid __u32 representation of max NID
- *
- * Return:	-EINVAL unsupported LNET range
- */
-static int cfs_num_ar_min_max(struct addrrange *ar, u32 *min_nid,
-			      u32 *max_nid)
-{
-	struct cfs_expr_list *el;
-	struct cfs_range_expr *re;
-	unsigned int min_addr = 0;
-	unsigned int max_addr = 0;
-
-	list_for_each_entry(el, &ar->ar_numaddr_ranges, el_link) {
-		int re_count = 0;
-
-		list_for_each_entry(re, &el->el_exprs, re_link) {
-			if (re_count > 0)
-				return -EINVAL;
-			if (re->re_lo > re->re_hi)
-				return -EINVAL;
-
-			if (re->re_lo < min_addr || !min_addr)
-				min_addr = re->re_lo;
-			if (re->re_hi > max_addr)
-				max_addr = re->re_hi;
-
-			re_count++;
-		}
-	}
-
-	if (min_nid)
-		*min_nid = min_addr;
-	if (max_nid)
-		*max_nid = max_addr;
-
-	return 0;
-}
-
-/**
- * Takes a linked list of nidrange expressions, determines the minimum
- * and maximum nid and creates appropriate nid structures
- *
- * @nidlist
- * @min_nid	*min_nid string representation of min NID
- * @max_nid	*max_nid string representation of max NID
- *
- * Return:	-EINVAL unsupported LNET range
- *		-ERANGE non-contiguous LNET range
- */
-int cfs_nidrange_find_min_max(struct list_head *nidlist, char *min_nid,
-			      char *max_nid, size_t nidstr_length)
-{
-	struct nidrange *first_nidrange;
-	int netnum;
-	struct netstrfns *nf;
-	char *lndname;
-	u32 min_addr;
-	u32 max_addr;
-	char min_addr_str[IPSTRING_LENGTH];
-	char max_addr_str[IPSTRING_LENGTH];
-	int rc;
-
-	first_nidrange = list_entry(nidlist->next, struct nidrange, nr_link);
-
-	netnum = first_nidrange->nr_netnum;
-	nf = first_nidrange->nr_netstrfns;
-	lndname = nf->nf_name;
-
-	rc = nf->nf_min_max(nidlist, &min_addr, &max_addr);
-	if (rc < 0)
-		return rc;
-
-	nf->nf_addr2str(min_addr, min_addr_str, sizeof(min_addr_str));
-	nf->nf_addr2str(max_addr, max_addr_str, sizeof(max_addr_str));
-
-	snprintf(min_nid, nidstr_length, "%s@%s%d", min_addr_str, lndname,
-		 netnum);
-	snprintf(max_nid, nidstr_length, "%s@%s%d", max_addr_str, lndname,
-		 netnum);
-
-	return 0;
-}
-EXPORT_SYMBOL(cfs_nidrange_find_min_max);
-
-/**
- * Determines the min and max NID values for num LNDs
- *
- * @nidlist
- * @min_nid	*min_nid if provided, returns string representation of min NID
- * @max_nid	*max_nid if provided, returns string representation of max NID
- *
- * Return:	-EINVAL unsupported LNET range
- *		-ERANGE non-contiguous LNET range
- */
-static int cfs_num_min_max(struct list_head *nidlist, u32 *min_nid,
-			   u32 *max_nid)
-{
-	struct nidrange	*nr;
-	struct addrrange *ar;
-	unsigned int tmp_min_addr = 0;
-	unsigned int tmp_max_addr = 0;
-	unsigned int min_addr = 0;
-	unsigned int max_addr = 0;
-	int nidlist_count = 0;
-	int rc;
-
-	list_for_each_entry(nr, nidlist, nr_link) {
-		if (nidlist_count > 0)
-			return -EINVAL;
-
-		list_for_each_entry(ar, &nr->nr_addrranges, ar_link) {
-			rc = cfs_num_ar_min_max(ar, &tmp_min_addr,
-						&tmp_max_addr);
-			if (rc)
-				return rc;
-
-			if (tmp_min_addr < min_addr || !min_addr)
-				min_addr = tmp_min_addr;
-			if (tmp_max_addr > max_addr)
-				max_addr = tmp_min_addr;
-		}
-	}
-
-	if (max_nid)
-		*max_nid = max_addr;
-	if (min_nid)
-		*min_nid = min_addr;
-
-	return 0;
-}
-
-/**
- * Takes an nidlist and determines the minimum and maximum
- * ip addresses.
- *
- * @nidlist
- * @min_nid	*min_nid if provided, returns string representation of min NID
- * @max_nid	*max_nid if provided, returns string representation of max NID
- *
- * Return:	-EINVAL unsupported LNET range
- *		-ERANGE non-contiguous LNET range
- */
-static int cfs_ip_min_max(struct list_head *nidlist, u32 *min_nid,
-			  u32 *max_nid)
-{
-	struct nidrange *nr;
-	struct addrrange *ar;
-	u32 tmp_min_ip_addr = 0;
-	u32 tmp_max_ip_addr = 0;
-	u32 min_ip_addr = 0;
-	u32 max_ip_addr = 0;
-	int nidlist_count = 0;
-	int rc;
-
-	list_for_each_entry(nr, nidlist, nr_link) {
-		if (nidlist_count > 0)
-			return -EINVAL;
-
-		if (nr->nr_all) {
-			min_ip_addr = 0;
-			max_ip_addr = 0xffffffff;
-			break;
-		}
-
-		list_for_each_entry(ar, &nr->nr_addrranges, ar_link) {
-			rc = cfs_ip_ar_min_max(ar, &tmp_min_ip_addr,
-					       &tmp_max_ip_addr);
-			if (rc)
-				return rc;
-
-			if (tmp_min_ip_addr < min_ip_addr || !min_ip_addr)
-				min_ip_addr = tmp_min_ip_addr;
-			if (tmp_max_ip_addr > max_ip_addr)
-				max_ip_addr = tmp_max_ip_addr;
-		}
-
-		nidlist_count++;
-	}
-
-	if (min_nid)
-		*min_nid = min_ip_addr;
-	if (max_nid)
-		*max_nid = max_ip_addr;
-
-	return 0;
-}
-
 static int
 libcfs_lo_str2addr(const char *str, int nob, u32 *addr)
 {
@@ -912,8 +654,8 @@ static int cfs_ip_min_max(struct list_head *nidlist, u32 *min_nid,
 	  .nf_str2addr		= libcfs_lo_str2addr,
 	  .nf_parse_addrlist	= libcfs_num_parse,
 	  .nf_print_addrlist	= libcfs_num_addr_range_print,
-	  .nf_match_addr	= libcfs_num_match,
-	  .nf_min_max		= cfs_num_min_max },
+	  .nf_match_addr	= libcfs_num_match
+	},
 	{ .nf_type		= SOCKLND,
 	  .nf_name		= "tcp",
 	  .nf_modname		= "ksocklnd",
@@ -921,8 +663,8 @@ static int cfs_ip_min_max(struct list_head *nidlist, u32 *min_nid,
 	  .nf_str2addr		= libcfs_ip_str2addr,
 	  .nf_parse_addrlist	= cfs_ip_addr_parse,
 	  .nf_print_addrlist	= libcfs_ip_addr_range_print,
-	  .nf_match_addr	= cfs_ip_addr_match,
-	  .nf_min_max		= cfs_ip_min_max },
+	  .nf_match_addr	= cfs_ip_addr_match
+	},
 	{ .nf_type		= O2IBLND,
 	  .nf_name		= "o2ib",
 	  .nf_modname		= "ko2iblnd",
@@ -930,8 +672,8 @@ static int cfs_ip_min_max(struct list_head *nidlist, u32 *min_nid,
 	  .nf_str2addr		= libcfs_ip_str2addr,
 	  .nf_parse_addrlist	= cfs_ip_addr_parse,
 	  .nf_print_addrlist	= libcfs_ip_addr_range_print,
-	  .nf_match_addr	= cfs_ip_addr_match,
-	  .nf_min_max		= cfs_ip_min_max },
+	  .nf_match_addr	= cfs_ip_addr_match
+	},
 	{ .nf_type		= GNILND,
 	  .nf_name		= "gni",
 	  .nf_modname		= "kgnilnd",
@@ -939,8 +681,8 @@ static int cfs_ip_min_max(struct list_head *nidlist, u32 *min_nid,
 	  .nf_str2addr		= libcfs_num_str2addr,
 	  .nf_parse_addrlist	= libcfs_num_parse,
 	  .nf_print_addrlist	= libcfs_num_addr_range_print,
-	  .nf_match_addr	= libcfs_num_match,
-	  .nf_min_max		= cfs_num_min_max },
+	  .nf_match_addr	= libcfs_num_match
+	},
 	{ .nf_type		= GNIIPLND,
 	  .nf_name		= "gip",
 	  .nf_modname		= "kgnilnd",
@@ -948,8 +690,8 @@ static int cfs_ip_min_max(struct list_head *nidlist, u32 *min_nid,
 	  .nf_str2addr		= libcfs_ip_str2addr,
 	  .nf_parse_addrlist	= cfs_ip_addr_parse,
 	  .nf_print_addrlist	= libcfs_ip_addr_range_print,
-	  .nf_match_addr	= cfs_ip_addr_match,
-	  .nf_min_max		= cfs_ip_min_max },
+	  .nf_match_addr	= cfs_ip_addr_match
+	},
 };
 
 static const size_t libcfs_nnetstrfns = ARRAY_SIZE(libcfs_netstrfns);
-- 
1.8.3.1


* [lustre-devel] [PATCH 201/622] lustre: sec: create new function sptlrpc_get_sepol()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (199 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 200/622] lnet: remove .nf_min_max handling James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 202/622] lustre: clio: fix incorrect invariant in cl_io_iter_fini() James Simmons
                   ` (421 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Sebastien Buisson <sbuisson@ddn.com>

Create a new function sptlrpc_get_sepol() in ptlrpc/sec.c to compute
the SELinux policy info by calling the new userland command l_getsepol.

The SELinux policy info syntax is the following:
<mode>:<name>:<version>:<hash>
where:
- <mode> is a digit telling if SELinux is in Permissive mode (0)
  or Enforcing mode (1)
- <name> is the name of the SELinux policy
- <version> is the version of the SELinux policy
- <hash> is the computed hash of the binary representation of the
  policy, as exported in /etc/selinux/<name>/policy/policy.<version>

Userland command l_getsepol can be called on the command line by a
security administrator to get SELinux status information to store into
'sepol' field of nodemap.

SELinux status information is reported by Lustre client only if
new 'send_sepol' ptlrpc kernel module's parameter is not zero, and
SELinux is enabled on the client.
'send_sepol' accepts various values:
- 0: do not send SELinux policy info;
- -1: send SELinux policy info for every request;
- N > 0: only send SELinux policy info every N seconds. Use max value
  2^31-1 (signed int on 32 bits) to make sure SELinux policy info is
  only checked at mount time.
Independently of the 'send_sepol' value, the SELinux policy info has
an associated mtime. l_getsepol checks the mtime and recalculates the
whole SELinux policy info (including the hash) only if the mtime
changed.

WC-bug-id: https://jira.whamcloud.com/browse/LU-8955
Lustre-commit: c61168239eff ("LU-8955 sec: create new function sptlrpc_get_sepol()")
Signed-off-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-on: https://review.whamcloud.com/24421
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Li Dongyang <dongyangli@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_net.h          |   7 ++
 fs/lustre/include/lustre_sec.h          |  12 +++
 fs/lustre/ptlrpc/sec.c                  | 125 ++++++++++++++++++++++++++++++++
 fs/lustre/ptlrpc/sec_lproc.c            |  74 +++++++++++++++++++
 include/uapi/linux/lustre/lustre_idl.h  |  13 ++++
 include/uapi/linux/lustre/lustre_user.h |   9 +++
 6 files changed, 240 insertions(+)

diff --git a/fs/lustre/include/lustre_net.h b/fs/lustre/include/lustre_net.h
index 81a6ac9..36de665 100644
--- a/fs/lustre/include/lustre_net.h
+++ b/fs/lustre/include/lustre_net.h
@@ -845,6 +845,13 @@ struct ptlrpc_request {
 	/** description of flavors for client & server */
 	struct sptlrpc_flavor		rq_flvr;
 
+	/**
+	 * SELinux policy info at the time of the request
+	 * sepol string format is:
+	 * <mode>:<policy name>:<policy version>:<policy hash>
+	 */
+	char rq_sepol[LUSTRE_NODEMAP_SEPOL_LENGTH + 1];
+
 	/* client/server security flags */
 	unsigned int
 				 rq_ctx_init:1,      /* context initiation */
diff --git a/fs/lustre/include/lustre_sec.h b/fs/lustre/include/lustre_sec.h
index 99702fd..00710d6 100644
--- a/fs/lustre/include/lustre_sec.h
+++ b/fs/lustre/include/lustre_sec.h
@@ -792,6 +792,17 @@ struct ptlrpc_sec {
 	/** owning import */
 	struct obd_import	       *ps_import;
 	spinlock_t			ps_lock;
+	/** mtime of SELinux policy file */
+	time_t				ps_sepol_mtime;
+	/** next check time of SELinux policy file */
+	ktime_t				ps_sepol_checknext;
+	/**
+	 * SELinux policy info
+	 * sepol string format is:
+	 * <mode>:<policy name>:<policy version>:<policy hash>
+	 */
+	char				ps_sepol[LUSTRE_NODEMAP_SEPOL_LENGTH
+						 + 1];
 
 	/*
 	 * garbage collection
@@ -987,6 +998,7 @@ int sptlrpc_cli_unwrap_early_reply(struct ptlrpc_request *req,
 void sptlrpc_cli_finish_early_reply(struct ptlrpc_request *early_req);
 
 void sptlrpc_request_out_callback(struct ptlrpc_request *req);
+int sptlrpc_get_sepol(struct ptlrpc_request *req);
 
 /*
  * exported higher interface of import & request
diff --git a/fs/lustre/ptlrpc/sec.c b/fs/lustre/ptlrpc/sec.c
index 54ca97c..789b5cb 100644
--- a/fs/lustre/ptlrpc/sec.c
+++ b/fs/lustre/ptlrpc/sec.c
@@ -53,6 +53,10 @@
 
 #include "ptlrpc_internal.h"
 
+static int send_sepol;
+module_param(send_sepol, int, 0644);
+MODULE_PARM_DESC(send_sepol, "Client sends SELinux policy status");
+
 /***********************************************
  * policy registers			    *
  ***********************************************/
@@ -1692,6 +1696,127 @@ static int sptlrpc_svc_install_rvs_ctx(struct obd_import *imp,
 	return policy->sp_sops->install_rctx(imp, ctx);
 }
 
+#ifdef CONFIG_SECURITY_SELINUX
+/* Get SELinux policy info from userspace */
+static int sepol_helper(struct obd_import *imp)
+{
+	char mtime_str[21] = { 0 }, mode_str[2] = { 0 };
+	char *argv[] = {
+		[0] = "/usr/sbin/l_getsepol",
+		[1] = "-o",
+		[2] = NULL,	    /* obd type */
+		[3] = "-n",
+		[4] = NULL,	    /* obd name */
+		[5] = "-t",
+		[6] = mtime_str,    /* policy mtime */
+		[7] = "-m",
+		[8] = mode_str,	    /* enforcing mode */
+		[9] = NULL
+	};
+	static char *envp[] = {
+		[0] = "HOME=/",
+		[1] = "PATH=/sbin:/usr/sbin",
+		[2] = NULL
+	};
+	signed short ret;
+	int rc = 0;
+
+	if (!imp || !imp->imp_obd ||
+	    !imp->imp_obd->obd_type) {
+		rc = -EINVAL;
+	} else {
+		argv[2] = (char *)imp->imp_obd->obd_type->typ_name;
+		argv[4] = imp->imp_obd->obd_name;
+		spin_lock(&imp->imp_sec->ps_lock);
+		if (imp->imp_sec->ps_sepol_mtime == 0 &&
+		    imp->imp_sec->ps_sepol[0] == '\0') {
+			/* ps_sepol has not been initialized */
+			argv[5] = NULL;
+			argv[7] = NULL;
+		} else {
+			snprintf(mtime_str, sizeof(mtime_str), "%lu",
+				 imp->imp_sec->ps_sepol_mtime);
+			mode_str[0] = imp->imp_sec->ps_sepol[0];
+		}
+		spin_unlock(&imp->imp_sec->ps_lock);
+		ret = call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
+		rc = ret>>8;
+	}
+
+	return rc;
+}
+#endif
+
+static inline int sptlrpc_sepol_needs_check(struct ptlrpc_sec *imp_sec)
+{
+	ktime_t checknext;
+
+	if (send_sepol == 0)
+		return 0;
+
+	if (send_sepol == -1)
+		/* send_sepol == -1 means fetch sepol status every time */
+		return 1;
+
+	spin_lock(&imp_sec->ps_lock);
+	checknext = imp_sec->ps_sepol_checknext;
+	spin_unlock(&imp_sec->ps_lock);
+
+	/* next check is too far in time, please update */
+	if (ktime_after(checknext,
+			ktime_add(ktime_get(), ktime_set(send_sepol, 0))))
+		goto setnext;
+
+	if (ktime_before(ktime_get(), checknext))
+		/* too early to fetch sepol status */
+		return 0;
+
+setnext:
+	/* define new sepol_checknext time */
+	spin_lock(&imp_sec->ps_lock);
+	imp_sec->ps_sepol_checknext = ktime_add(ktime_get(),
+						ktime_set(send_sepol, 0));
+	spin_unlock(&imp_sec->ps_lock);
+
+	return 1;
+}
+
+int sptlrpc_get_sepol(struct ptlrpc_request *req)
+{
+#ifndef CONFIG_SECURITY_SELINUX
+	(req->rq_sepol)[0] = '\0';
+
+	if (unlikely(send_sepol != 0))
+		CDEBUG(D_SEC,
+		       "Client cannot report SELinux status, it was not built against libselinux.\n");
+	return 0;
+#else
+	struct ptlrpc_sec *imp_sec = req->rq_import->imp_sec;
+	int rc = 0;
+
+	(req->rq_sepol)[0] = '\0';
+
+	if (send_sepol == 0)
+		return 0;
+
+	if (!imp_sec)
+		return -EINVAL;
+
+	/* Retrieve SELinux status info */
+	if (sptlrpc_sepol_needs_check(imp_sec))
+		rc = sepol_helper(req->rq_import);
+	if (likely(rc == 0)) {
+		spin_lock(&imp_sec->ps_lock);
+		memcpy(req->rq_sepol, imp_sec->ps_sepol,
+		       sizeof(req->rq_sepol));
+		spin_unlock(&imp_sec->ps_lock);
+	}
+
+	return rc;
+#endif
+}
+EXPORT_SYMBOL(sptlrpc_get_sepol);
+
 /****************************************
  * server side security		 *
  ****************************************/
diff --git a/fs/lustre/ptlrpc/sec_lproc.c b/fs/lustre/ptlrpc/sec_lproc.c
index df7c667..04e421d 100644
--- a/fs/lustre/ptlrpc/sec_lproc.c
+++ b/fs/lustre/ptlrpc/sec_lproc.c
@@ -131,6 +131,78 @@ static int sptlrpc_ctxs_lprocfs_seq_show(struct seq_file *seq, void *v)
 
 LPROC_SEQ_FOPS_RO(sptlrpc_ctxs_lprocfs);
 
+static ssize_t
+lprocfs_wr_sptlrpc_sepol(struct file *file, const char __user *buffer,
+			 size_t count, void *data)
+{
+	struct seq_file	*seq = file->private_data;
+	struct obd_device *dev = seq->private;
+	struct client_obd *cli = &dev->u.cli;
+	struct obd_import *imp = cli->cl_import;
+	struct sepol_downcall_data *param;
+	int size = sizeof(*param);
+	int rc = 0;
+
+	if (count < size) {
+		CERROR("%s: invalid data count = %lu, size = %d\n",
+		       dev->obd_name, (unsigned long) count, size);
+		return -EINVAL;
+	}
+
+	param = kzalloc(size, GFP_KERNEL);
+	if (!param)
+		return -ENOMEM;
+
+	if (copy_from_user(param, buffer, size)) {
+		CERROR("%s: bad sepol data\n", dev->obd_name);
+		rc = -EFAULT;
+		goto out;
+	}
+
+	if (param->sdd_magic != SEPOL_DOWNCALL_MAGIC) {
+		CERROR("%s: sepol downcall bad params\n",
+		       dev->obd_name);
+		rc = -EINVAL;
+		goto out;
+	}
+
+	if (param->sdd_sepol_len == 0 ||
+	    param->sdd_sepol_len >= sizeof(imp->imp_sec->ps_sepol)) {
+		CERROR("%s: invalid sepol data returned\n",
+		       dev->obd_name);
+		rc = -EINVAL;
+		goto out;
+	}
+	rc = param->sdd_sepol_len; /* save sdd_sepol_len */
+	kfree(param);
+	size = offsetof(struct sepol_downcall_data,
+			sdd_sepol[rc]);
+
+	/* alloc again with real size */
+	rc = 0;
+	param = kzalloc(size, GFP_KERNEL);
+	if (!param)
+		return -ENOMEM;
+
+	if (copy_from_user(param, buffer, size)) {
+		CERROR("%s: bad sepol data\n", dev->obd_name);
+		rc = -EFAULT;
+		goto out;
+	}
+
+	spin_lock(&imp->imp_sec->ps_lock);
+	snprintf(imp->imp_sec->ps_sepol, param->sdd_sepol_len + 1, "%s",
+		 param->sdd_sepol);
+	imp->imp_sec->ps_sepol_mtime = param->sdd_sepol_mtime;
+	spin_unlock(&imp->imp_sec->ps_lock);
+
+out:
+	kfree(param);
+
+	return rc ? rc : count;
+}
+LPROC_SEQ_FOPS_WR_ONLY(srpc, sptlrpc_sepol);
+
 int sptlrpc_lprocfs_cliobd_attach(struct obd_device *dev)
 {
 	if (strcmp(dev->obd_type->typ_name, LUSTRE_OSC_NAME) != 0 &&
@@ -145,6 +217,8 @@ int sptlrpc_lprocfs_cliobd_attach(struct obd_device *dev)
 			    &sptlrpc_info_lprocfs_fops);
 	debugfs_create_file("srpc_contexts", 0444, dev->obd_debugfs_entry, dev,
 			    &sptlrpc_ctxs_lprocfs_fops);
+	debugfs_create_file("srpc_sepol", 0200, dev->obd_debugfs_entry, dev,
+			    &srpc_sptlrpc_sepol_fops);
 
 	return 0;
 }
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index f723d7b..77b9539 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -2936,6 +2936,19 @@ struct close_data {
 	};
 };
 
+/* sepol string format is:
+ * <1-digit for SELinux status>:<policy name>:<policy version>:<policy hash>
+ */
+/* Max length of the sepol string
+ * Should be large enough to contain a sha512sum of the policy
+ */
+#define SELINUX_MODE_LEN 1
+#define SELINUX_POLICY_VER_LEN 3 /* 3 chars to leave room for the future */
+#define SELINUX_POLICY_HASH_LEN 64
+#define LUSTRE_NODEMAP_SEPOL_LENGTH (SELINUX_MODE_LEN + NAME_MAX + \
+				     SELINUX_POLICY_VER_LEN + \
+				     SELINUX_POLICY_HASH_LEN + 3)
+
 /*
  * This is the lu_ladvise struct which goes out on the wire.
  * Corresponds to the userspace arg llapi_lu_ladvise.
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 649aeeb..c1e9dca 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -798,6 +798,7 @@ static inline char *qtype_name(int qtype)
 }
 
 #define IDENTITY_DOWNCALL_MAGIC 0x6d6dd629
+#define SEPOL_DOWNCALL_MAGIC 0x8b8bb842
 
 /* permission */
 #define N_PERMS_MAX	64
@@ -819,6 +820,14 @@ struct identity_downcall_data {
 	__u32			    idd_groups[0];
 };
 
+struct sepol_downcall_data {
+	__u32		sdd_magic;
+	time_t		sdd_sepol_mtime;
+	__u16		sdd_sepol_len;
+	char		sdd_sepol[0];
+};
+
+
 /* lustre volatile file support
  * file name header: ".^L^S^T^R:volatile"
  */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 202/622] lustre: clio: fix incorrect invariant in cl_io_iter_fini()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (200 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 201/622] lustre: sec: create new function sptlrpc_get_sepol() James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 203/622] lustre: mdc: Improve xattr buffer allocations James Simmons
                   ` (420 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

It was discovered during PFL testing that, if LINVRNT() is enabled,
cl_io_iter_fini() will crash with the following backtrace:

    kernel: LustreError: 16009:0:(cl_io.c:439:cl_io_iter_fini())
            ASSERTION( io->ci_state == CIS_UNLOCKED ) failed
    kernel: cl_io_iter_fini+0x10c/0x110 [obdclass]
    kernel: cl_io_loop+0x46/0x220 [obdclass]
    kernel: cl_setattr_ost+0x1ed/0x2a0 [lustre]
    kernel: ll_setattr_raw+0x7b0/0x9a0 [lustre]
    kernel: notify_change+0x1dc/0x430
    kernel: do_truncate+0x72/0xc0
    kernel: do_sys_ftruncate+0xf5/0x160

This is due to the incorrect assumption that ci_state will
always be CIS_UNLOCKED; looking at the behavior of cl_io_loop()
shows that this is not the case with PFL. We do want to make sure
the IO state is not in the middle of some other action (up to
CIS_IT_STARTED, or CIS_IO_FINISHED or later) when
cl_io_iter_fini() is called.
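
The relaxed range check can be sketched outside the kernel as follows; the state names and their ordering here are assumptions for illustration, not the exact Lustre definitions in cl_object.h:

```c
#include <assert.h>

/* Illustrative ordering of the cl_io state machine; only the relative
 * order matters for this sketch. */
enum cis_state {
	CIS_ZERO,
	CIS_INIT,
	CIS_IT_STARTED,
	CIS_LOCKED,
	CIS_IO_GOING,
	CIS_IO_FINISHED,
	CIS_UNLOCKED,
	CIS_FINI,
};

/* Strict check that crashed under PFL: requires exactly CIS_UNLOCKED */
static int strict_ok(enum cis_state s)
{
	return s == CIS_UNLOCKED;
}

/* Relaxed check in the spirit of the patch: only reject states in the
 * middle of lock/IO processing (after CIS_IT_STARTED, up to and
 * including CIS_IO_FINISHED) */
static int relaxed_ok(enum cis_state s)
{
	return s <= CIS_IT_STARTED || s > CIS_IO_FINISHED;
}
```

With PFL, cl_io_loop() can legitimately call cl_io_iter_fini() from states other than CIS_UNLOCKED, which only the relaxed check accepts.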

WC-bug-id: https://jira.whamcloud.com/browse/LU-11828
Lustre-commit: 8160b9bdf16c ("LU-11828 clio: fix incorrect invariant in cl_io_iter_fini()")
Signed-off-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-on: https://review.whamcloud.com/33915
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/cl_io.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/obdclass/cl_io.c b/fs/lustre/obdclass/cl_io.c
index a98be15..4278bc0 100644
--- a/fs/lustre/obdclass/cl_io.c
+++ b/fs/lustre/obdclass/cl_io.c
@@ -412,7 +412,8 @@ void cl_io_iter_fini(const struct lu_env *env, struct cl_io *io)
 	const struct cl_io_slice *scan;
 
 	LINVRNT(cl_io_is_loopable(io));
-	LINVRNT(io->ci_state < CIS_LOCKED || io->ci_state > CIS_IO_FINISHED);
+	LINVRNT(io->ci_state <= CIS_IT_STARTED ||
+		io->ci_state > CIS_IO_FINISHED);
 	LINVRNT(cl_io_invariant(io));
 
 	list_for_each_entry_reverse(scan, &io->ci_layers, cis_linkage) {
-- 
1.8.3.1


* [lustre-devel] [PATCH 203/622] lustre: mdc: Improve xattr buffer allocations
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (201 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 202/622] lustre: clio: fix incorrect invariant in cl_io_iter_fini() James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 204/622] lnet: libcfs: allow file/func/line passed to CDEBUG() James Simmons
                   ` (419 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

Many of the xattr related buffers in the mdc/mdt code are
allocated at max_easize, but they are used for normal POSIX
xattrs (primarily ACLs) and so they are guaranteed not to
exceed XATTR_SIZE_MAX.

HSM xattrs should also be less than XATTR_SIZE_MAX.

Reduce allocations to MIN(XATTR_SIZE_MAX, max_easize).
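
The clamp can be sketched in plain C; XATTR_SIZE_MAX is defined locally here only to keep the sketch self-contained (it is 65536 in the kernel), and the function name is illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Kernel value of XATTR_SIZE_MAX, repeated here for a standalone sketch */
#define XATTR_SIZE_MAX 65536

/* Userspace sketch of the min_t(u32, ...) clamp used by the patch:
 * cap the ACL buffer at XATTR_SIZE_MAX even when the server advertises
 * a larger ocd_max_easize. */
static uint32_t acl_bufsize(uint32_t ocd_max_easize)
{
	return ocd_max_easize < XATTR_SIZE_MAX ?
	       ocd_max_easize : XATTR_SIZE_MAX;
}
```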

WC-bug-id: https://jira.whamcloud.com/browse/LU-11868
Lustre-commit: 4f78164f8748 ("LU-11868 mdc: Improve xattr buffer allocations")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34059
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Li Dongyang <dongyangli@ddn.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/mdc/mdc_locks.c   |  9 ++++++---
 fs/lustre/mdc/mdc_reint.c   |  4 +++-
 fs/lustre/mdc/mdc_request.c | 14 ++++++++------
 3 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/fs/lustre/mdc/mdc_locks.c b/fs/lustre/mdc/mdc_locks.c
index f9d66a4..9898b6a 100644
--- a/fs/lustre/mdc/mdc_locks.c
+++ b/fs/lustre/mdc/mdc_locks.c
@@ -810,7 +810,9 @@ int mdc_enqueue_base(struct obd_export *exp, struct ldlm_enqueue_info *einfo,
 
 	generation = obddev->u.cli.cl_import->imp_generation;
 	if (!it || (it->it_op & (IT_OPEN | IT_CREAT)))
-		acl_bufsize = imp->imp_connect_data.ocd_max_easize;
+		acl_bufsize = min_t(u32,
+				    imp->imp_connect_data.ocd_max_easize,
+				    XATTR_SIZE_MAX);
 	else
 		acl_bufsize = LUSTRE_POSIX_ACL_MAX_SIZE_OLD;
 
@@ -936,10 +938,11 @@ int mdc_enqueue_base(struct obd_export *exp, struct ldlm_enqueue_info *einfo,
 
 	if ((int)lockrep->lock_policy_res2 == -ERANGE &&
 	    it->it_op & (IT_OPEN | IT_GETATTR | IT_LOOKUP) &&
-	    acl_bufsize != imp->imp_connect_data.ocd_max_easize) {
+	    acl_bufsize == LUSTRE_POSIX_ACL_MAX_SIZE_OLD) {
 		mdc_clear_replay_flag(req, -ERANGE);
 		ptlrpc_req_finished(req);
-		acl_bufsize = imp->imp_connect_data.ocd_max_easize;
+		acl_bufsize = min_t(u32, imp->imp_connect_data.ocd_max_easize,
+				    XATTR_SIZE_MAX);
 		goto resend;
 	}
 
diff --git a/fs/lustre/mdc/mdc_reint.c b/fs/lustre/mdc/mdc_reint.c
index 062685c..2611fc4 100644
--- a/fs/lustre/mdc/mdc_reint.c
+++ b/fs/lustre/mdc/mdc_reint.c
@@ -135,7 +135,9 @@ int mdc_setattr(struct obd_export *exp, struct md_op_data *op_data,
 	mdc_setattr_pack(req, op_data, ea, ealen);
 
 	req_capsule_set_size(&req->rq_pill, &RMF_ACL, RCL_SERVER,
-			     req->rq_import->imp_connect_data.ocd_max_easize);
+			     min_t(u32,
+				   req->rq_import->imp_connect_data.ocd_max_easize,
+				   XATTR_SIZE_MAX));
 	ptlrpc_request_set_replen(req);
 
 	rc = mdc_reint(req, LUSTRE_IMP_FULL);
diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index d702fd1..4711288 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -234,9 +234,10 @@ static int mdc_getattr(struct obd_export *exp, struct md_op_data *op_data,
 
 	rc = mdc_getattr_common(exp, req);
 	if (rc) {
-		if (rc == -ERANGE &&
-		    acl_bufsize != imp->imp_connect_data.ocd_max_easize) {
-			acl_bufsize = imp->imp_connect_data.ocd_max_easize;
+		if (rc == -ERANGE) {
+			acl_bufsize = min_t(u32,
+					    imp->imp_connect_data.ocd_max_easize,
+					    XATTR_SIZE_MAX);
 			mdc_reset_acl_req(req);
 			goto again;
 		}
@@ -289,9 +290,10 @@ static int mdc_getattr_name(struct obd_export *exp, struct md_op_data *op_data,
 
 	rc = mdc_getattr_common(exp, req);
 	if (rc) {
-		if (rc == -ERANGE &&
-		    acl_bufsize != imp->imp_connect_data.ocd_max_easize) {
-			acl_bufsize = imp->imp_connect_data.ocd_max_easize;
+		if (rc == -ERANGE) {
+			acl_bufsize = min_t(u32,
+					    imp->imp_connect_data.ocd_max_easize,
+					    XATTR_SIZE_MAX);
 			mdc_reset_acl_req(req);
 			goto again;
 		}
-- 
1.8.3.1


* [lustre-devel] [PATCH 204/622] lnet: libcfs: allow file/func/line passed to CDEBUG()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (202 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 203/622] lustre: mdc: Improve xattr buffer allocations James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 205/622] lustre: llog: add startcat for wrapped catalog James Simmons
                   ` (418 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Allow the file, function, and line number to be passed to CDEBUG()
messages so that they are not duplicated in helper functions that
may be called from multiple places.

This patch is largely a no-op in terms of code, with the exception
of one call in osc_extent_sanity_check0() to OSC_EXTENT_DUMP() that
is changed to OSC_EXTENT_DUMP_WITH_LOC().
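
The pattern can be sketched in userspace C; the macro and helper names below are illustrative, not the libcfs API:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sketch of the _WITH_LOC pattern: the plain macro captures
 * __FILE__/__func__/__LINE__ at the call site, while the _WITH_LOC
 * variant lets a shared helper report its caller's location instead
 * of its own. */
static char last_msg[256];

#define MYDEBUG_WITH_LOC(file, func, line, fmt, ...)			\
	snprintf(last_msg, sizeof(last_msg), "%s:%s:%d " fmt,		\
		 (file), (func), (line), ##__VA_ARGS__)

#define MYDEBUG(fmt, ...)						\
	MYDEBUG_WITH_LOC(__FILE__, __func__, __LINE__,			\
			 fmt, ##__VA_ARGS__)

/* A helper called from many places: it logs with the location that
 * was passed in, so the message is not attributed to the helper. */
static void sanity_check_failed(const char *file, const char *func,
				int line, int rc)
{
	MYDEBUG_WITH_LOC(file, func, line,
			 "sanity check failed: rc = %d", rc);
}
```

This mirrors why osc_extent_sanity_check0() switches to the _WITH_LOC variant: the dump should point at the caller that failed the check, not at the shared dump helper.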

WC-bug-id: https://jira.whamcloud.com/browse/LU-4664
Lustre-commit: 8503e73bd936 ("LU-4664 libcfs: allow file/func/line passed to CDEBUG()")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33588
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/osc_cache.c           | 43 +++++++++++++----------
 include/linux/libcfs/libcfs_debug.h | 69 ++++++++++++++++++++++---------------
 2 files changed, 65 insertions(+), 47 deletions(-)

diff --git a/fs/lustre/osc/osc_cache.c b/fs/lustre/osc/osc_cache.c
index 1ff258c..a18e791 100644
--- a/fs/lustre/osc/osc_cache.c
+++ b/fs/lustre/osc/osc_cache.c
@@ -58,10 +58,10 @@ static int osc_io_unplug_async(const struct lu_env *env,
 static void osc_free_grant(struct client_obd *cli, unsigned int nr_pages,
 			   unsigned int lost_grant, unsigned int dirty_grant);
 
-static void __osc_extent_tree_dump(int level, struct osc_object *obj,
+static void __osc_extent_tree_dump(int mask, struct osc_object *obj,
 				   const char *func, int line);
-#define osc_extent_tree_dump(lvl, obj) \
-	__osc_extent_tree_dump(lvl, obj, __func__, __LINE__)
+#define osc_extent_tree_dump(mask, obj) \
+	__osc_extent_tree_dump(mask, obj, __func__, __LINE__)
 
 static void osc_unreserve_grant(struct client_obd *cli, unsigned int reserved,
 				unsigned int unused);
@@ -106,18 +106,19 @@ static inline char list_empty_marker(struct list_head *list)
 static const char * const oes_strings[] = {
 	"inv", "active", "cache", "locking", "lockdone", "rpc", "trunc", NULL };
 
-#define OSC_EXTENT_DUMP(lvl, extent, fmt, ...) do {			      \
+#define OSC_EXTENT_DUMP_WITH_LOC(file, func, line, mask, extent, fmt, ...) do {\
+	static struct cfs_debug_limit_state cdls;			      \
 	struct osc_extent *__ext = (extent);				      \
 	char __buf[16];							      \
 									      \
-	CDEBUG(lvl,							      \
+	__CDEBUG_WITH_LOC(file, func, line, mask, &cdls,		      \
 		"extent %p@{" EXTSTR ", "				      \
 		"[%d|%d|%c|%s|%s|%p], [%d|%d|%c|%c|%p|%u|%p]} " fmt,	      \
 		/* ----- extent part 0 ----- */				      \
 		__ext, EXTPARA(__ext),					      \
 		/* ----- part 1 ----- */				      \
 		kref_read(&__ext->oe_refc),				      \
-		atomic_read(&__ext->oe_users),			      \
+		atomic_read(&__ext->oe_users),				      \
 		list_empty_marker(&__ext->oe_link),			      \
 		oes_strings[__ext->oe_state], ext_flags(__ext, __buf),	      \
 		__ext->oe_obj,						      \
@@ -128,12 +129,16 @@ static inline char list_empty_marker(struct list_head *list)
 		__ext->oe_dlmlock, __ext->oe_mppr, __ext->oe_owner,	      \
 		/* ----- part 4 ----- */				      \
 		## __VA_ARGS__);					      \
-	if (lvl == D_ERROR && __ext->oe_dlmlock)			      \
+	if (mask == D_ERROR && __ext->oe_dlmlock)			      \
 		LDLM_ERROR(__ext->oe_dlmlock, "extent: %p", __ext);	      \
 	else								      \
 		LDLM_DEBUG(__ext->oe_dlmlock, "extent: %p", __ext);	      \
 } while (0)
 
+#define OSC_EXTENT_DUMP(mask, ext, fmt, ...)				\
+	OSC_EXTENT_DUMP_WITH_LOC(__FILE__, __func__, __LINE__,		\
+				 mask, ext, fmt, ## __VA_ARGS__)
+
 #undef EASSERTF
 #define EASSERTF(expr, ext, fmt, args...) do {				\
 	if (!(expr)) {							\
@@ -300,9 +305,9 @@ static int __osc_extent_sanity_check(struct osc_extent *ext,
 
 out:
 	if (rc != 0)
-		OSC_EXTENT_DUMP(D_ERROR, ext,
-				"%s:%d sanity check %p failed with rc = %d\n",
-				func, line, ext, rc);
+		OSC_EXTENT_DUMP_WITH_LOC(__FILE__, func, line, D_ERROR, ext,
+					 "sanity check %p failed: rc = %d\n",
+					 ext, rc);
 	return rc;
 }
 
@@ -1250,34 +1255,34 @@ static int osc_extent_expand(struct osc_extent *ext, pgoff_t index,
 	return rc;
 }
 
-static void __osc_extent_tree_dump(int level, struct osc_object *obj,
+static void __osc_extent_tree_dump(int mask, struct osc_object *obj,
 				   const char *func, int line)
 {
 	struct osc_extent *ext;
 	int cnt;
 
-	if (!cfs_cdebug_show(level, DEBUG_SUBSYSTEM))
+	if (!cfs_cdebug_show(mask, DEBUG_SUBSYSTEM))
 		return;
 
-	CDEBUG(level, "Dump object %p extents at %s:%d, mppr: %u.\n",
+	CDEBUG(mask, "Dump object %p extents at %s:%d, mppr: %u.\n",
 	       obj, func, line, osc_cli(obj)->cl_max_pages_per_rpc);
 
 	/* osc_object_lock(obj); */
 	cnt = 1;
 	for (ext = first_extent(obj); ext; ext = next_extent(ext))
-		OSC_EXTENT_DUMP(level, ext, "in tree %d.\n", cnt++);
+		OSC_EXTENT_DUMP(mask, ext, "in tree %d.\n", cnt++);
 
 	cnt = 1;
 	list_for_each_entry(ext, &obj->oo_hp_exts, oe_link)
-		OSC_EXTENT_DUMP(level, ext, "hp %d.\n", cnt++);
+		OSC_EXTENT_DUMP(mask, ext, "hp %d.\n", cnt++);
 
 	cnt = 1;
 	list_for_each_entry(ext, &obj->oo_urgent_exts, oe_link)
-		OSC_EXTENT_DUMP(level, ext, "urgent %d.\n", cnt++);
+		OSC_EXTENT_DUMP(mask, ext, "urgent %d.\n", cnt++);
 
 	cnt = 1;
 	list_for_each_entry(ext, &obj->oo_reading_exts, oe_link)
-		OSC_EXTENT_DUMP(level, ext, "reading %d.\n", cnt++);
+		OSC_EXTENT_DUMP(mask, ext, "reading %d.\n", cnt++);
 	/* osc_object_unlock(obj); */
 }
 
@@ -1395,9 +1400,9 @@ static int osc_completion(const struct lu_env *env, struct osc_async_page *oap,
 	return 0;
 }
 
-#define OSC_DUMP_GRANT(lvl, cli, fmt, args...) do {			\
+#define OSC_DUMP_GRANT(mask, cli, fmt, args...) do {			\
 	struct client_obd *__tmp = (cli);				\
-	CDEBUG(lvl, "%s: grant { dirty: %ld/%ld dirty_pages: %ld/%lu "	\
+	CDEBUG(mask, "%s: grant { dirty: %ld/%ld dirty_pages: %ld/%lu "	\
 	       "dropped: %ld avail: %ld, dirty_grant: %ld, "		\
 	       "reserved: %ld, flight: %d } lru {in list: %ld, "	\
 	       "left: %ld, waiters: %d }" fmt "\n",			\
diff --git a/include/linux/libcfs/libcfs_debug.h b/include/linux/libcfs/libcfs_debug.h
index 31a97ec..99905f7 100644
--- a/include/linux/libcfs/libcfs_debug.h
+++ b/include/linux/libcfs/libcfs_debug.h
@@ -79,26 +79,29 @@
 			   (THREAD_SIZE - 1)))
 # endif /* __ia64__ */
 
-#define __CHECK_STACK(msgdata, mask, cdls)				\
+#define __CHECK_STACK_WITH_LOC(file, func, line, msgdata, mask, cdls)	\
 do {									\
 	if (unlikely(CDEBUG_STACK() > libcfs_stack)) {			\
-		LIBCFS_DEBUG_MSG_DATA_INIT(msgdata, D_WARNING, NULL);   \
+		LIBCFS_DEBUG_MSG_DATA_INIT(file, func, line, msgdata,	\
+					   D_WARNING, NULL);		\
 		libcfs_stack = CDEBUG_STACK();				\
-		libcfs_debug_msg(msgdata,				\
-				 "maximum lustre stack %lu\n",		\
-				 CDEBUG_STACK());			\
+		libcfs_debug_msg(msgdata, "maximum lustre stack %u\n",	\
+				 libcfs_stack);				\
 		(msgdata)->msg_mask = mask;				\
 		(msgdata)->msg_cdls = cdls;				\
 		dump_stack();						\
 		/*panic("LBUG");*/					\
 	}								\
 } while (0)
-#define CFS_CHECK_STACK(msgdata, mask, cdls)  __CHECK_STACK(msgdata, mask, cdls)
 #else /* __x86_64__ */
-#define CFS_CHECK_STACK(msgdata, mask, cdls) do {} while (0)
 #define CDEBUG_STACK() (0L)
+#define __CHECK_STACK_WITH_LOC(file, func, line, msgdata, mask, cdls)	\
+	do {} while (0)
 #endif /* __x86_64__ */
 
+#define CFS_CHECK_STACK(msgdata, mask, cdls)				\
+	__CHECK_STACK_WITH_LOC(__FILE__, __func__, __LINE__,		\
+			       msgdata, mask, cdls)
 #ifndef DEBUG_SUBSYSTEM
 # define DEBUG_SUBSYSTEM S_UNDEFINED
 #endif
@@ -121,24 +124,28 @@ struct libcfs_debug_msg_data {
 	struct cfs_debug_limit_state   *msg_cdls;
 };
 
-#define LIBCFS_DEBUG_MSG_DATA_INIT(data, mask, cdls)			\
+#define LIBCFS_DEBUG_MSG_DATA_INIT(file, func, line, msgdata, mask, cdls)\
 do {									\
-	(data)->msg_subsys	= DEBUG_SUBSYSTEM;			\
-	(data)->msg_file	= __FILE__;				\
-	(data)->msg_fn		= __func__;				\
-	(data)->msg_line	= __LINE__;				\
-	(data)->msg_cdls	= (cdls);				\
-	(data)->msg_mask	= (mask);				\
+	(msgdata)->msg_subsys = DEBUG_SUBSYSTEM;			\
+	(msgdata)->msg_file   = (file);					\
+	(msgdata)->msg_fn     = (func);					\
+	(msgdata)->msg_line   = (line);					\
+	(msgdata)->msg_mask   = (mask);					\
+	(msgdata)->msg_cdls   = (cdls);					\
 } while (0)
 
-#define LIBCFS_DEBUG_MSG_DATA_DECL(dataname, mask, cdls)		\
-	static struct libcfs_debug_msg_data dataname = {		\
-		.msg_subsys	= DEBUG_SUBSYSTEM,			\
-		.msg_file	= __FILE__,				\
-		.msg_fn		= __func__,				\
-		.msg_line	= __LINE__,				\
-		.msg_cdls	= (cdls)	 };			\
-	dataname.msg_mask	= (mask)
+#define LIBCFS_DEBUG_MSG_DATA_DECL_LOC(file, func, line, msgdata, mask, cdls)\
+	static struct libcfs_debug_msg_data msgdata = {			\
+		.msg_subsys = DEBUG_SUBSYSTEM,				\
+		.msg_file   = (file),					\
+		.msg_fn     = (func),					\
+		.msg_line   = (line),					\
+		.msg_cdls   = (cdls) };					\
+	msgdata.msg_mask   = (mask)
+
+#define LIBCFS_DEBUG_MSG_DATA_DECL(msgdata, mask, cdls)			\
+	LIBCFS_DEBUG_MSG_DATA_DECL_LOC(__FILE__, __func__, __LINE__,	\
+				       msgdata, mask, cdls)
 
 /**
  * Filters out logging messages based on mask and subsystem.
@@ -147,27 +154,32 @@ static inline int cfs_cdebug_show(unsigned int mask, unsigned int subsystem)
 {
 	return mask & D_CANTMASK ||
 		((libcfs_debug & mask) && (libcfs_subsystem_debug & subsystem));
+
 }
 
-#define __CDEBUG(cdls, mask, format, ...)				\
+#define __CDEBUG_WITH_LOC(file, func, line, mask, cdls, format, ...)	\
 do {									\
 	static struct libcfs_debug_msg_data msgdata;			\
 									\
-	CFS_CHECK_STACK(&msgdata, mask, cdls);				\
+	__CHECK_STACK_WITH_LOC(file, func, line, &msgdata, mask, cdls);	\
 									\
 	if (cfs_cdebug_show(mask, DEBUG_SUBSYSTEM)) {			\
-		LIBCFS_DEBUG_MSG_DATA_INIT(&msgdata, mask, cdls);	\
+		LIBCFS_DEBUG_MSG_DATA_INIT(file, func, line,		\
+					   &msgdata, mask, cdls);	\
 		libcfs_debug_msg(&msgdata, format, ## __VA_ARGS__);	\
 	}								\
 } while (0)
 
-#define CDEBUG(mask, format, ...) __CDEBUG(NULL, mask, format, ## __VA_ARGS__)
+#define CDEBUG(mask, format, ...)					\
+	__CDEBUG_WITH_LOC(__FILE__, __func__, __LINE__,			\
+			  mask, NULL, format, ## __VA_ARGS__)
 
 #define CDEBUG_LIMIT(mask, format, ...)					\
 do {									\
 	static struct cfs_debug_limit_state cdls;			\
 									\
-	__CDEBUG(&cdls, mask, format, ## __VA_ARGS__);			\
+	__CDEBUG_WITH_LOC(__FILE__, __func__, __LINE__,			\
+			  mask, &cdls, format, ## __VA_ARGS__);		\
 } while (0)
 
 /*
@@ -189,7 +201,8 @@ static inline int cfs_cdebug_show(unsigned int mask, unsigned int subsystem)
 			   "%x-%x: " format, errnum, LERRCHKSUM(errnum), ## __VA_ARGS__)
 #define LCONSOLE_ERROR(format, ...) LCONSOLE_ERROR_MSG(0x00, format, ## __VA_ARGS__)
 
-#define LCONSOLE_EMERG(format, ...) CDEBUG(D_CONSOLE | D_EMERG, format, ## __VA_ARGS__)
+#define LCONSOLE_EMERG(format, ...) \
+	CDEBUG(D_CONSOLE | D_EMERG, format, ## __VA_ARGS__)
 
 int libcfs_debug_msg(struct libcfs_debug_msg_data *msgdata,
 		     const char *format1, ...)
-- 
1.8.3.1


* [lustre-devel] [PATCH 205/622] lustre: llog: add startcat for wrapped catalog
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (203 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 204/622] lnet: libcfs: allow file/func/line passed to CDEBUG() James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 206/622] lustre: llog: add synchronization for the last record James Simmons
                   ` (417 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Alexander Boyko <c17825@cray.com>

The osp_sync_thread loop around llog_cat_process has a mistake.
When llog_cat_process reaches the bottom of the catalog, processing
restarts with index 0, which means default processing. In that case
the catalog is wrapped and processing starts from llh_cat_idx, but
the records at the bottom were already processed and have not been
cancelled yet. The following message then appears in the log:
osp_sync_interpret() reply req ffff8800123e3600/1, rc -2, transno 0

llog_cat_process supports a startcat index, in which case processing
starts from that index. But if the catalog is wrapped, the startcat
index is ignored.

This patch adds support for the startcat index on a wrapped catalog.
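
The two-pass walk over a wrapped catalog can be sketched in plain C; the fixed-size array, the index handling, and all names below are assumptions for illustration only, not Lustre's llog on-disk layout:

```c
#include <assert.h>

#define CAT_SLOTS 8

/* Process a wrapped circular catalog in two passes, the way the patch
 * splits it: first from the start index to the physical end of the
 * catalog, then from slot 0 up to and including the last index. */
static int process_wrapped(const int *recs, int first_idx, int last_idx,
			   int *out, int max_out)
{
	int n = 0, i;

	/* pass 1: the tail of the catalog, from first_idx to the end */
	for (i = first_idx; i < CAT_SLOTS && n < max_out; i++)
		out[n++] = recs[i];
	/* pass 2: the head of the catalog, up to last_idx */
	for (i = 0; i <= last_idx && n < max_out; i++)
		out[n++] = recs[i];
	return n;
}
```

The fix in llog_cat_process_or_fork amounts to honoring a caller-supplied start index in pass 1 (instead of always starting at llh_cat_idx), then resetting it before pass 2.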

WC-bug-id: https://jira.whamcloud.com/browse/LU-10913
Cray-bug-id: LUS-6765
Lustre-commit: 8109c9e1718d ("LU-10913 llog: add startcat for wrapped catalog")
Signed-off-by: Alexander Boyko <c17825@cray.com>
Reviewed-on: https://review.whamcloud.com/33749
Reviewed-by: Sergey Cheremencev <c17829@cray.com>
Reviewed-by: Alexander Zarochentsev <c17826@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/llog_cat.c          | 33 ++++++++++++++++++++++++---------
 include/uapi/linux/lustre/lustre_idl.h |  5 +++++
 2 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/fs/lustre/obdclass/llog_cat.c b/fs/lustre/obdclass/llog_cat.c
index ca97e08..30b0ac5 100644
--- a/fs/lustre/obdclass/llog_cat.c
+++ b/fs/lustre/obdclass/llog_cat.c
@@ -222,7 +222,7 @@ static int llog_cat_process_or_fork(const struct lu_env *env,
 	LASSERT(llh->llh_flags & LLOG_F_IS_CAT);
 	d.lpd_data = data;
 	d.lpd_cb = cb;
-	d.lpd_startcat = startcat;
+	d.lpd_startcat = (startcat == LLOG_CAT_FIRST ? 0 : startcat);
 	d.lpd_startidx = startidx;
 
 	if (llh->llh_cat_idx > cat_llh->lgh_last_idx) {
@@ -231,14 +231,29 @@ static int llog_cat_process_or_fork(const struct lu_env *env,
 		CWARN("%s: catlog " DFID " crosses index zero\n",
 		      cat_llh->lgh_ctxt->loc_obd->obd_name,
 		      PFID(&cat_llh->lgh_id.lgl_oi.oi_fid));
-
-		cd.lpcd_first_idx = llh->llh_cat_idx;
-		cd.lpcd_last_idx = 0;
-		rc = llog_process_or_fork(env, cat_llh, cat_cb, &d, &cd, fork);
-		if (rc != 0)
-			return rc;
-
-		cd.lpcd_first_idx = 0;
+		/* startcat = 0 is the default value for general processing */
+		if ((startcat != LLOG_CAT_FIRST &&
+		    startcat >= llh->llh_cat_idx) || !startcat) {
+			/* processing the catalog part at the end */
+			cd.lpcd_first_idx = (startcat ? startcat :
+					     llh->llh_cat_idx);
+			cd.lpcd_last_idx = 0;
+			rc = llog_process_or_fork(env, cat_llh, cat_cb,
+						  &d, &cd, fork);
+			/* Reset the startcat because it has already reached
+			 * catalog bottom.
+			 */
+			startcat = 0;
+			if (rc != 0)
+				return rc;
+		}
+		/* processing the catalog part at the beginning */
+		cd.lpcd_first_idx = (startcat == LLOG_CAT_FIRST) ? 0 : startcat;
+		/* Note, the processing will stop at the lgh_last_idx value,
+		 * and it could be increased during processing. So records
+		 * between the current lgh_last_idx and a future lgh_last_idx
+		 * would be left unprocessed.
+		 */
 		cd.lpcd_last_idx = cat_llh->lgh_last_idx;
 		rc = llog_process_or_fork(env, cat_llh, cat_cb, &d, &cd, fork);
 	} else {
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 77b9539..76068ee 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -2618,6 +2618,11 @@ enum llog_flag {
 			  LLOG_F_EXT_X_OMODE | LLOG_F_EXT_X_XATTR,
 };
 
+/* means first record of catalog */
+enum {
+	LLOG_CAT_FIRST = -1,
+};
+
 /* On-disk header structure of each log object, stored in little endian order */
 #define LLOG_MIN_CHUNK_SIZE	8192
 #define LLOG_HEADER_SIZE	(96)	/* sizeof (llog_log_hdr) +
-- 
1.8.3.1


* [lustre-devel] [PATCH 206/622] lustre: llog: add synchronization for the last record
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (204 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 205/622] lustre: llog: add startcat for wrapped catalog James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 207/622] lustre: ptlrpc: improve memory allocation for service RPCs James Simmons
                   ` (416 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Alexander Boyko <c17825@cray.com>

The initial problem was a race between llog_process_thread
and llog_osd_write_rec on the last record, at lgh_last_idx; the
catalog must be wrapped for the problem to occur. lgh_last_idx
can be increased along with the modification of the llog bitmap
while the record itself is written a bit later. If
llog_process_thread processes lgh_last_idx after the modification
but before the write, it operates on stale record data.

The Lustre client is only a consumer of llog records, but we still
need these changes to better handle consumption of the llog records.
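
The guard that re-reads a chunk instead of trusting a possibly stale record can be sketched as follows; the structure and names are hypothetical, not the Lustre code:

```c
#include <assert.h>

/* Sketch of the re-read guard: the consumer snapshots the writer's
 * committed last index before reading a chunk; if it walks past that
 * snapshot it restarts with a fresh one rather than consuming a
 * possibly half-written record. */
struct log_view {
	int last_idx;	/* writer's committed last index */
	int rereads;	/* how many times the reader restarted */
};

/* Returns 1 when record `index` is safe to process against the
 * snapshot taken before reading the chunk, 0 when the chunk must be
 * re-read. */
static int record_is_safe(struct log_view *v, int snapshot_last,
			  int index)
{
	if (index <= snapshot_last)
		return 1;
	/* the writer advanced after our snapshot: re-read the chunk */
	v->rereads++;
	return 0;
}
```

This mirrors what the `repeated`/`synced_idx` logic in llog_process_thread does: processing is safe only up to the lgh_last_idx value that was observed before the chunk was read.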

WC-bug-id: https://jira.whamcloud.com/browse/LU-11591
Lustre-commit: ec4194e4e78c ("LU-11591 llog: add synchronization for the last record")
Signed-off-by: Alexander Boyko <c17825@cray.com>
Cray-bug-id: LUS-6683
Reviewed-on: https://review.whamcloud.com/33683
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alexander Zarochentsev <c17826@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/llog.c | 68 ++++++++++++++++++++++++++++++++++-------------
 1 file changed, 50 insertions(+), 18 deletions(-)

diff --git a/fs/lustre/obdclass/llog.c b/fs/lustre/obdclass/llog.c
index 65384ded..4e9fd17 100644
--- a/fs/lustre/obdclass/llog.c
+++ b/fs/lustre/obdclass/llog.c
@@ -230,10 +230,11 @@ static int llog_process_thread(void *arg)
 	struct llog_process_cat_data *cd  = lpi->lpi_catdata;
 	char *buf;
 	u64 cur_offset, tmp_offset;
-	int chunk_size;
+	size_t chunk_size;
 	int rc = 0, index = 1, last_index;
 	int saved_index = 0;
 	int last_called_index = 0;
+	bool repeated = false;
 
 	if (!llh)
 		return -EINVAL;
@@ -261,8 +262,10 @@ static int llog_process_thread(void *arg)
 	while (rc == 0) {
 		unsigned int buf_offset = 0;
 		struct llog_rec_hdr *rec;
+		off_t chunk_offset = 0;
 		bool partial_chunk;
-		off_t chunk_offset;
+		int synced_idx = 0;
+		int lh_last_idx;
 
 		/* skip records not set in bitmap */
 		while (index <= last_index &&
@@ -277,8 +280,23 @@ static int llog_process_thread(void *arg)
 repeat:
 		/* get the buf with our target record; avoid old garbage */
 		memset(buf, 0, chunk_size);
+		/* the record index for outdated chunk data */
+		/* it is safe to process the buffer up to the saved lgh_last_idx */
+		lh_last_idx = LLOG_HDR_TAIL(llh)->lrt_index;
 		rc = llog_next_block(lpi->lpi_env, loghandle, &saved_index,
 				     index, &cur_offset, buf, chunk_size);
+		if (repeated && rc)
+			CDEBUG(D_OTHER,
+			       "cur_offset %llu, chunk_offset %llu, buf_offset %u, rc = %d\n",
+			       cur_offset, (u64)chunk_offset, buf_offset, rc);
+		/* we've tried to reread the chunk, but there are no
+		 * new records
+		 */
+		if (rc == -EIO && repeated && (chunk_offset + buf_offset) ==
+		    cur_offset) {
+			rc = 0;
+			goto out;
+		}
 		if (rc)
 			goto out;
 
@@ -313,29 +331,43 @@ static int llog_process_thread(void *arg)
 			CDEBUG(D_OTHER, "after swabbing, type=%#x idx=%d\n",
 			       rec->lrh_type, rec->lrh_index);
 
-			/*
-			 * for partial chunk the end of it is zeroed, check
-			 * for index 0 to distinguish it.
+			if (index == (synced_idx + 1) &&
+			    synced_idx == LLOG_HDR_TAIL(llh)->lrt_index) {
+				rc = 0;
+				goto out;
+			}
+
+			/* the bitmap could be changed while processing records
+			 * from the chunk. For a wrapped catalog this means we
+			 * can read a deleted record and try to process it.
+			 * Check this case and reread the chunk. It is safe to
+			 * process up to lh_last_idx, including lh_last_idx if
+			 * it was synced. We cannot use a <= comparison, since
+			 * for a wrapped catalog lgh_last_idx could be less
+			 * than index, so we detect the last index to process
+			 * as index == lh_last_idx + 1. But when the catalog is
+			 * wrapped and full, lgh_last_idx = llh_cat_idx and the
+			 * first index to process is llh_cat_idx + 1.
 			 */
-			if (partial_chunk && !rec->lrh_index) {
-				/* concurrent llog_add() might add new records
-				 * while llog_processing, check this is not
-				 * the case and re-read the current chunk
-				 * otherwise.
-				 */
-				if (index > loghandle->lgh_last_idx) {
-					rc = 0;
-					goto out;
-				}
-				CDEBUG(D_OTHER,
-				       "Re-read last llog buffer for new records, index %u, last %u\n",
-				       index, loghandle->lgh_last_idx);
+			if ((index == lh_last_idx && synced_idx != index) ||
+			    (index == (lh_last_idx + 1) &&
+			     !(index == (llh->llh_cat_idx + 1) &&
+			       (llh->llh_flags & LLOG_F_IS_CAT))) ||
+			     (rec->lrh_index == 0 && !repeated)) {
 				/* save offset inside buffer for the re-read */
 				buf_offset = (char *)rec - (char *)buf;
 				cur_offset = chunk_offset;
+				repeated = true;
+				/* We need to be sure lgh_last_idx
+				 * record was saved to disk
+				 */
+				synced_idx = LLOG_HDR_TAIL(llh)->lrt_index;
+				CDEBUG(D_OTHER, "synced_idx: %d\n", synced_idx);
 				goto repeat;
 			}
 
+			repeated = false;
+
 			if (!rec->lrh_len || rec->lrh_len > chunk_size) {
 				CWARN("invalid length %d in llog record for index %d/%d\n",
 				      rec->lrh_len,
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 207/622] lustre: ptlrpc: improve memory allocation for service RPCs
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (205 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 206/622] lustre: llog: add synchronization for the last record James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 208/622] lustre: llite: enable flock mount option by default James Simmons
                   ` (415 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Andrew Perepechko <c17827@cray.com>

The memory allocated for service RPCs is not always a power-of-two
size, e.g. 17KiB. Round the buffer size up to the nearest power of 2,
since that is what slab/alloc_pages will allocate anyway, so the whole
allocated buffer can be used effectively.
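
The round-up can be sketched in isolation (a minimal stand-alone
version; `roundup_power2` below is a hypothetical equivalent of the
`size_roundup_power2()` helper the patch calls, not the kernel code):

```c
#include <assert.h>

/* Hypothetical stand-in for size_roundup_power2(): round a buffer
 * size up to the next power of two, so the slab/page allocator's own
 * rounding does not leave an unusable tail in the allocation.
 */
static unsigned int roundup_power2(unsigned int size)
{
	unsigned int r = 1;

	while (r < size)
		r <<= 1;
	return r;
}
```

With a 17KiB request this yields a 32KiB buffer, matching what
slab/alloc_pages would hand back for that allocation anyway.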

WC-bug-id: https://jira.whamcloud.com/browse/LU-11897
Cray-bug-id: LUS-6657
Lustre-commit: 3a90458bd84d ("LU-11897 ost: improve memory allocation for ost")
Signed-off-by: Andrew Perepechko <c17827@cray.com>
Reviewed-on: https://review.whamcloud.com/34127
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Alexander Zarochentsev <c17826@cray.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/service.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/lustre/ptlrpc/service.c b/fs/lustre/ptlrpc/service.c
index b94ed6a..7bc578c 100644
--- a/fs/lustre/ptlrpc/service.c
+++ b/fs/lustre/ptlrpc/service.c
@@ -641,6 +641,13 @@ struct ptlrpc_service *
 	service->srv_rep_portal	= conf->psc_buf.bc_rep_portal;
 	service->srv_req_portal	= conf->psc_buf.bc_req_portal;
 
+	/* With slab/alloc_pages buffer size will be rounded up to 2^n */
+	if (service->srv_buf_size & (service->srv_buf_size - 1)) {
+		int round = size_roundup_power2(service->srv_buf_size);
+
+		service->srv_buf_size = round;
+	}
+
 	/* Increase max reply size to next power of two */
 	service->srv_max_reply_size = 1;
 	while (service->srv_max_reply_size <
-- 
1.8.3.1

* [lustre-devel] [PATCH 208/622] lustre: llite: enable flock mount option by default
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (206 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 207/622] lustre: ptlrpc: improve memory allocation for service RPCs James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 209/622] lustre: lmv: avoid gratuitous 64-bit modulus James Simmons
                   ` (414 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

The "flock" mount option has been optional for many years, initially
because of potential stability issues, and also to provide a choice
for administrators to select between "flock" and "localflock" options.

However, the large number of problems that users report when
trying to use applications that depend on this feature (typically
databases and other cloud stacks) shows that disabling flock by
default causes more problems than it solves.

Enable the "flock" (distributed coherent userspace locking) feature
by default.  If applications do not need this functionality, then it
will not affect them.  If applications *do* need this functionality,
they will get it.  If administrators really know what they are doing,
then they can use the "localflock" feature to enable client-local
flock functionality, possibly only on select nodes that need this.

Users wanting to disable this functionality should mount with the
existing "-o noflock" mount option.

If clients are already using "-o {flock|localflock|noflock}" then
their existing options will be handled appropriately.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10885
Lustre-commit: 3613af3e15cb ("LU-10885 llite: enable flock mount option by default")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32091
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_lib.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 4797ee9..84fc54d 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -104,7 +104,7 @@ static struct ll_sb_info *ll_init_sbi(void)
 
 	sbi->ll_flags |= LL_SBI_VERBOSE;
 	sbi->ll_flags |= LL_SBI_CHECKSUM;
-
+	sbi->ll_flags |= LL_SBI_FLOCK;
 	sbi->ll_flags |= LL_SBI_LRU_RESIZE;
 	sbi->ll_flags |= LL_SBI_LAZYSTATFS;
 
-- 
1.8.3.1

* [lustre-devel] [PATCH 209/622] lustre: lmv: avoid gratuitous 64-bit modulus
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (207 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 208/622] lustre: llite: enable flock mount option by default James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 210/622] lustre: Ensure crc-t10pi is enabled James Simmons
                   ` (413 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Fix the pct() calculation to use unsigned long arguments, since this
is what callers use.  Remove duplicate pct() definition in lproc_mdc.

Don't do a 64-bit modulus of the LNet NID to find the starting MDT
index when this isn't really needed.
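
The NID fix can be illustrated in isolation (names here are
illustrative, not the Lustre API; on 32-bit kernels a 64-bit `%`
compiles to a library call, while a 32-bit modulus is a single
instruction):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative version of the lmv_statfs_start calculation:
 * truncating the 64-bit NID to 32 bits before the modulus still
 * spreads statfs requests across MDTs evenly, but avoids a 64-bit
 * division/modulus on 32-bit targets.
 */
static uint32_t statfs_start_mdt(uint64_t nid, uint32_t tgt_count)
{
	return (uint32_t)nid % tgt_count;
}
```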

Similarly, don't compute the FLD cache usage percentage for a debug
message that is never used.

Fixes: fed15ee3b3f2 ("lustre: headers: define pct(a,b) once")
WC-bug-id: https://jira.whamcloud.com/browse/LU-10171
Lustre-commit: e1b63fd21177 ("LU-10171 lmv: avoid gratuitous 64-bit modulus")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33922
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/fld/fld_cache.c          | 4 +---
 fs/lustre/include/lprocfs_status.h | 2 +-
 fs/lustre/include/obd.h            | 3 +--
 fs/lustre/lmv/lmv_obd.c            | 5 ++++-
 fs/lustre/mdc/lproc_mdc.c          | 8 +++-----
 5 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/fs/lustre/fld/fld_cache.c b/fs/lustre/fld/fld_cache.c
index 96be544..5267ba2 100644
--- a/fs/lustre/fld/fld_cache.c
+++ b/fs/lustre/fld/fld_cache.c
@@ -98,10 +98,8 @@ void fld_cache_fini(struct fld_cache *cache)
 	fld_cache_flush(cache);
 
 	CDEBUG(D_INFO, "FLD cache statistics (%s):\n", cache->fci_name);
-	CDEBUG(D_INFO, "  Total reqs: %llu\n", cache->fci_stat.fst_count);
 	CDEBUG(D_INFO, "  Cache reqs: %llu\n", cache->fci_stat.fst_cache);
-	CDEBUG(D_INFO, "  Cache hits: %u%%\n",
-	       pct(cache->fci_stat.fst_cache, cache->fci_stat.fst_count));
+	CDEBUG(D_INFO, "  Total reqs: %llu\n", cache->fci_stat.fst_count);
 
 	kfree(cache);
 }
diff --git a/fs/lustre/include/lprocfs_status.h b/fs/lustre/include/lprocfs_status.h
index c1079f1..8d74822 100644
--- a/fs/lustre/include/lprocfs_status.h
+++ b/fs/lustre/include/lprocfs_status.h
@@ -58,7 +58,7 @@ struct lprocfs_vars {
 	umode_t				proc_mode;
 };
 
-static inline u32 pct(s64 a, s64 b)
+static inline unsigned int pct(unsigned long a, unsigned long b)
 {
 	return b ? a * 100 / b : 0;
 }
diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index 4829e11..bf0bf97 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -437,11 +437,10 @@ struct lmv_obd {
 	int			connected;
 	int			max_easize;
 	int			max_def_easize;
+	u32			lmv_statfs_start;
 
 	u32			tgts_size; /* size of tgts array */
 	struct lmv_tgt_desc	**tgts;
-	int			lmv_statfs_start;
-
 	struct obd_connect_data	conn_data;
 	struct kobject		*lmv_tgts_kobj;
 };
diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index 9f9abd3..0685925 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -1366,8 +1366,11 @@ static int lmv_select_statfs_mdt(struct lmv_obd *lmv, u32 flags)
 			break;
 
 		if (LNET_NETTYP(LNET_NIDNET(lnet_id.nid)) != LOLND) {
+			/* We don't need a full 64-bit modulus, just enough
+			 * to distribute the requests across MDTs evenly.
+			 */
 			lmv->lmv_statfs_start =
-				lnet_id.nid % lmv->desc.ld_tgt_count;
+				(u32)lnet_id.nid % lmv->desc.ld_tgt_count;
 			break;
 		}
 	}
diff --git a/fs/lustre/mdc/lproc_mdc.c b/fs/lustre/mdc/lproc_mdc.c
index 70c9eaf..81167bbd 100644
--- a/fs/lustre/mdc/lproc_mdc.c
+++ b/fs/lustre/mdc/lproc_mdc.c
@@ -328,7 +328,6 @@ static ssize_t mdc_rpc_stats_seq_write(struct file *file,
 	return len;
 }
 
-#define pct(a, b) (b ? a * 100 / b : 0)
 static int mdc_rpc_stats_seq_show(struct seq_file *seq, void *v)
 {
 	struct obd_device *dev = seq->private;
@@ -364,7 +363,7 @@ static int mdc_rpc_stats_seq_show(struct seq_file *seq, void *v)
 
 		read_cum += r;
 		write_cum += w;
-		seq_printf(seq, "%d:\t\t%10lu %3lu %3lu   | %10lu %3lu %3lu\n",
+		seq_printf(seq, "%d:\t\t%10lu %3u %3u   | %10lu %3u %3u\n",
 			   1 << i, r, pct(r, read_tot),
 			   pct(read_cum, read_tot), w,
 			   pct(w, write_tot),
@@ -388,7 +387,7 @@ static int mdc_rpc_stats_seq_show(struct seq_file *seq, void *v)
 
 		read_cum += r;
 		write_cum += w;
-		seq_printf(seq, "%d:\t\t%10lu %3lu %3lu   | %10lu %3lu %3lu\n",
+		seq_printf(seq, "%d:\t\t%10lu %3u %3u   | %10lu %3u %3u\n",
 			   i, r, pct(r, read_tot), pct(read_cum, read_tot), w,
 			   pct(w, write_tot), pct(write_cum, write_tot));
 		if (read_cum == read_tot && write_cum == write_tot)
@@ -410,7 +409,7 @@ static int mdc_rpc_stats_seq_show(struct seq_file *seq, void *v)
 
 		read_cum += r;
 		write_cum += w;
-		seq_printf(seq, "%d:\t\t%10lu %3lu %3lu   | %10lu %3lu %3lu\n",
+		seq_printf(seq, "%d:\t\t%10lu %3u %3u   | %10lu %3u %3u\n",
 			   (i == 0) ? 0 : 1 << (i - 1),
 			   r, pct(r, read_tot), pct(read_cum, read_tot),
 			   w, pct(w, write_tot), pct(write_cum, write_tot));
@@ -421,7 +420,6 @@ static int mdc_rpc_stats_seq_show(struct seq_file *seq, void *v)
 
 	return 0;
 }
-#undef pct
 LPROC_SEQ_FOPS(mdc_rpc_stats);
 
 static int mdc_stats_seq_show(struct seq_file *seq, void *v)
-- 
1.8.3.1

* [lustre-devel] [PATCH 210/622] lustre: Ensure crc-t10pi is enabled.
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (208 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 209/622] lustre: lmv: avoid gratuitous 64-bit modulus James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 211/622] lustre: lov: fix lov_iocontrol for inactive OST case James Simmons
                   ` (412 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Also simplify the check_write_checksum() code a little: the t10pi
variable isn't needed.

Fixes: 86e186db3ed ("lustre: osc: T10PI between RPC and BIO")
WC-bug-id: https://jira.whamcloud.com/browse/LU-11770
Lustre-commit: e0fb3133372e ("LU-11770 osc: allow build without blk_integrity or crc-t10pi")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33923
Reviewed-by: Li Dongyang <dongyangli@ddn.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/Kconfig           |  1 +
 fs/lustre/osc/osc_request.c | 14 +++-----------
 2 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/fs/lustre/Kconfig b/fs/lustre/Kconfig
index 2eb7e45..bc89565 100644
--- a/fs/lustre/Kconfig
+++ b/fs/lustre/Kconfig
@@ -9,6 +9,7 @@ config LUSTRE_FS
 	select CRYPTO_SHA1
 	select CRYPTO_SHA256
 	select CRYPTO_SHA512
+	select CRC_T10DIF
 	select DEBUG_FS
 	select FHANDLE
 	select QUOTA
diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index ba84bd1..6ce22c3 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -1638,7 +1638,6 @@ static int check_write_checksum(struct obdo *oa,
 	const char *obd_name = aa->aa_cli->cl_import->imp_obd->obd_name;
 	obd_dif_csum_fn *fn = NULL;
 	int sector_size = 0;
-	bool t10pi = false;
 	u32 new_cksum;
 	char *msg;
 	enum cksum_type cksum_type;
@@ -1658,22 +1657,18 @@ static int check_write_checksum(struct obdo *oa,
 
 	switch (cksum_type) {
 	case OBD_CKSUM_T10IP512:
-		t10pi = true;
 		fn = obd_dif_ip_fn;
 		sector_size = 512;
 		break;
 	case OBD_CKSUM_T10IP4K:
-		t10pi = true;
 		fn = obd_dif_ip_fn;
 		sector_size = 4096;
 		break;
 	case OBD_CKSUM_T10CRC512:
-		t10pi = true;
 		fn = obd_dif_crc_fn;
 		sector_size = 512;
 		break;
 	case OBD_CKSUM_T10CRC4K:
-		t10pi = true;
 		fn = obd_dif_crc_fn;
 		sector_size = 4096;
 		break;
@@ -1681,13 +1676,10 @@ static int check_write_checksum(struct obdo *oa,
 		break;
 	}
 
-	if (t10pi)
+	if (fn)
 		rc = osc_checksum_bulk_t10pi(obd_name, aa->aa_requested_nob,
-					     aa->aa_page_count,
-					     aa->aa_ppga,
-					     OST_WRITE,
-					     fn,
-					     sector_size,
+					     aa->aa_page_count, aa->aa_ppga,
+					     OST_WRITE, fn, sector_size,
 					     &new_cksum);
 	else
 		rc = osc_checksum_bulk(aa->aa_requested_nob, aa->aa_page_count,
-- 
1.8.3.1

* [lustre-devel] [PATCH 211/622] lustre: lov: fix lov_iocontrol for inactive OST case
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (209 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 210/622] lustre: Ensure crc-t10pi is enabled James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 212/622] lustre: llite: Initialize cl_dirty_max_pages James Simmons
                   ` (411 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Vladimir Saveliev <c17830@cray.com>

For inactive OSTs lov->lov_tgts[index]->ltd_exp is NULL.
lov_iocontrol() has to check for that before dereferencing
lov->lov_tgts[index]->ltd_exp->exp_obd.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11911
Lustre-commit: 0facd12afa33 ("LU-11911 lov: fix lov_iocontrol for inactive OST case")
Signed-off-by: Vladimir Saveliev <c17830@cray.com>
Cray-bug-id: LUS-6937
Reviewed-on: https://review.whamcloud.com/34148
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/lov/lov_obd.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/lustre/lov/lov_obd.c b/fs/lustre/lov/lov_obd.c
index 08d7edc..cc0ca1c 100644
--- a/fs/lustre/lov/lov_obd.c
+++ b/fs/lustre/lov/lov_obd.c
@@ -1001,15 +1001,15 @@ static int lov_iocontrol(unsigned int cmd, struct obd_export *exp, int len,
 			/* Try again with the next index */
 			return -EAGAIN;
 
-		imp = lov->lov_tgts[index]->ltd_exp->exp_obd->u.cli.cl_import;
-		if (!lov->lov_tgts[index]->ltd_active &&
-		    imp->imp_state != LUSTRE_IMP_IDLE)
-			return -ENODATA;
-
 		osc_obd = class_exp2obd(lov->lov_tgts[index]->ltd_exp);
 		if (!osc_obd)
 			return -EINVAL;
 
+		imp = osc_obd->u.cli.cl_import;
+		if (!lov->lov_tgts[index]->ltd_active &&
+		    imp->imp_state != LUSTRE_IMP_IDLE)
+			return -ENODATA;
+
 		/* copy UUID */
 		if (copy_to_user(data->ioc_pbuf2, obd2cli_tgt(osc_obd),
 				 min_t(unsigned long, data->ioc_plen2,
-- 
1.8.3.1

* [lustre-devel] [PATCH 212/622] lustre: llite: Initialize cl_dirty_max_pages
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (210 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 211/622] lustre: lov: fix lov_iocontrol for inactive OST case James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 213/622] lustre: mdc: don't use ACL at setattr James Simmons
                   ` (410 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

cl_dirty_max_pages must be initialized to zero before
calling client_adjust_max_dirty.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11919
Lustre-commit: 2e9c896dec6d ("LU-11919 llite: Initialize cl_dirty_max_pages")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34173
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_lib.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/lustre/ldlm/ldlm_lib.c b/fs/lustre/ldlm/ldlm_lib.c
index 5fe5711..11955b1 100644
--- a/fs/lustre/ldlm/ldlm_lib.c
+++ b/fs/lustre/ldlm/ldlm_lib.c
@@ -315,6 +315,7 @@ int client_obd_setup(struct obd_device *obddev, struct lustre_cfg *lcfg)
 		     sizeof(server_uuid)));
 
 	cli->cl_dirty_pages = 0;
+	cli->cl_dirty_max_pages = 0;
 	cli->cl_avail_grant = 0;
 	/* FIXME: Should limit this for the sum of all cl_dirty_max_pages. */
 	/*
-- 
1.8.3.1

* [lustre-devel] [PATCH 213/622] lustre: mdc: don't use ACL at setattr
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (211 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 212/622] lustre: llite: Initialize cl_dirty_max_pages James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 214/622] lnet: o2iblnd: ibc_rxs is created and freed with different size James Simmons
                   ` (409 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Alexander Boyko <c17825@cray.com>

For ldiskfs with large_ea, the maximum EA size is 1MB.
At mdc_setattr() the ptlrpc reply size is 1.1MB, which is rounded
up to 2MB, so each REINT_SETATTR request takes about 2MB of memory
on the client. In an MDS failover case many requests stay in the
reply queue, which could lead to OOM.

The patch changes the ACL size to zero, because the server doesn't
fill in the ACL for setattr requests.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11934
Lustre-commit: e7f6f870c356 ("LU-11934 mdc: don't use ACL at setattr")
Signed-off-by: Alexander Boyko <c17825@cray.com>
Cray-bug-id: LUS-6938
Reviewed-on: https://review.whamcloud.com/34194
Reviewed-by: Andrew Perepechko <c17827@cray.com>
Reviewed-by: Andriy Skulysh <c17819@cray.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/mdc/mdc_reint.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/lustre/mdc/mdc_reint.c b/fs/lustre/mdc/mdc_reint.c
index 2611fc4..0e5f012 100644
--- a/fs/lustre/mdc/mdc_reint.c
+++ b/fs/lustre/mdc/mdc_reint.c
@@ -134,10 +134,8 @@ int mdc_setattr(struct obd_export *exp, struct md_op_data *op_data,
 		       op_data->op_attr.ia_ctime.tv_sec);
 	mdc_setattr_pack(req, op_data, ea, ealen);
 
-	req_capsule_set_size(&req->rq_pill, &RMF_ACL, RCL_SERVER,
-			     min_t(u32,
-				   req->rq_import->imp_connect_data.ocd_max_easize,
-				   XATTR_SIZE_MAX));
+	req_capsule_set_size(&req->rq_pill, &RMF_ACL, RCL_SERVER, 0);
+
 	ptlrpc_request_set_replen(req);
 
 	rc = mdc_reint(req, LUSTRE_IMP_FULL);
-- 
1.8.3.1

* [lustre-devel] [PATCH 214/622] lnet: o2iblnd: ibc_rxs is created and freed with different size
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (212 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 213/622] lustre: mdc: don't use ACL at setattr James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 215/622] lustre: osc: reduce atomic ops in osc_enter_cache_try James Simmons
                   ` (408 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Andriy Skulysh <c17819@cray.com>

kiblnd_create_conn()) alloc '(conn->ibc_rxs)': 26832 at ffffc90012e69000
kiblnd_destroy_conn()) kfreed 'conn->ibc_rxs': 4576 at ffffc90012e69000

The size is changed by kiblnd_create_conn():
"peer 172.18.2.3 at o2ib - queue depth reduced from 128 to 21"

Based on the size, LIBCFS_FREE() decides whether to use kfree or
vfree and accounts for memory usage, so the free-time size must
match the allocation-time size.

Fix this by allocating ibc_rxs after rdma_create_qp(), once the
final queue depth is known.
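
The bug class is easy to reproduce with any size-keyed free routine
like LIBCFS_FREE (a minimal illustration with hypothetical names,
not the LNet code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A free routine that trusts a recomputed size, as LIBCFS_FREE does,
 * mis-accounts memory when the size input (here, the queue depth)
 * changes between allocation and free.
 */
static size_t accounted;

static void *tracked_alloc(size_t size)
{
	accounted += size;
	return malloc(size);
}

static void tracked_free(void *ptr, size_t size)
{
	accounted -= size;	/* wrong if size changed since alloc */
	free(ptr);
}
```

Using the sizes from the log above, allocating 26832 bytes and
freeing with a recomputed 4576 leaves 22256 bytes stuck in the
accounting, exactly the mismatch the patch removes.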

Cray-bug-id: LUS-6339
WC-bug-id: https://jira.whamcloud.com/browse/LU-11702
Lustre-commit: 277a6faa5b16 ("LU-11702 o2iblnd: ibc_rxs is created and freed with different size")
Signed-off-by: Andriy Skulysh <c17819@cray.com>
Reviewed-by: Andrew Perepechko <c17827@cray.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/33721
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/o2iblnd/o2iblnd.c | 30 ++++++++++++++++--------------
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.c b/net/lnet/klnds/o2iblnd/o2iblnd.c
index 017fe5f..0e207ef 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.c
@@ -735,6 +735,8 @@ struct kib_conn *kiblnd_create_conn(struct kib_peer_ni *peer_ni,
 	conn->ibc_cmid = cmid;
 	conn->ibc_max_frags = peer_ni->ibp_max_frags;
 	conn->ibc_queue_depth = peer_ni->ibp_queue_depth;
+	conn->ibc_rxs = NULL;
+	conn->ibc_rx_pages = NULL;
 
 	INIT_LIST_HEAD(&conn->ibc_early_rxs);
 	INIT_LIST_HEAD(&conn->ibc_tx_noops);
@@ -778,20 +780,6 @@ struct kib_conn *kiblnd_create_conn(struct kib_peer_ni *peer_ni,
 
 	write_unlock_irqrestore(glock, flags);
 
-	conn->ibc_rxs = kzalloc_cpt(IBLND_RX_MSGS(conn) * sizeof(struct kib_rx),
-				    GFP_NOFS, cpt);
-	if (!conn->ibc_rxs) {
-		CERROR("Cannot allocate RX buffers\n");
-		goto failed_2;
-	}
-
-	rc = kiblnd_alloc_pages(&conn->ibc_rx_pages, cpt,
-				IBLND_RX_MSG_PAGES(conn));
-	if (rc)
-		goto failed_2;
-
-	kiblnd_map_rx_descs(conn);
-
 	cq_attr.cqe = IBLND_CQ_ENTRIES(conn);
 	cq_attr.comp_vector = kiblnd_get_completion_vector(conn, cpt);
 	cq = ib_create_cq(cmid->device,
@@ -856,6 +844,20 @@ struct kib_conn *kiblnd_create_conn(struct kib_peer_ni *peer_ni,
 
 	kfree(init_qp_attr);
 
+	conn->ibc_rxs = kzalloc_cpt(IBLND_RX_MSGS(conn) * sizeof(struct kib_rx),
+				    GFP_NOFS, cpt);
+	if (!conn->ibc_rxs) {
+		CERROR("Cannot allocate RX buffers\n");
+		goto failed_2;
+	}
+
+	rc = kiblnd_alloc_pages(&conn->ibc_rx_pages, cpt,
+				IBLND_RX_MSG_PAGES(conn));
+	if (rc)
+		goto failed_2;
+
+	kiblnd_map_rx_descs(conn);
+
 	/* 1 ref for caller and each rxmsg */
 	atomic_set(&conn->ibc_refcount, 1 + IBLND_RX_MSGS(conn));
 	conn->ibc_nrx = IBLND_RX_MSGS(conn);
-- 
1.8.3.1

* [lustre-devel] [PATCH 215/622] lustre: osc: reduce atomic ops in osc_enter_cache_try
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (213 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 214/622] lnet: o2iblnd: ibc_rxs is created and freed with different size James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 216/622] lustre: llite: ll_fault should fail for insane file offsets James Simmons
                   ` (407 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Li Dongyang <dongyangli@ddn.com>

We can reduce the number of atomic ops performed on
obd_dirty_pages for the common case.
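
The optimized pattern can be sketched with C11 atomics (illustrative
names, not the Lustre API): instead of an atomic_read() followed by a
separate increment, speculatively reserve with one atomic add and
undo only on overshoot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_long dirty_pages;

/* Reserve one dirty-page slot: a single fetch-add in the common
 * (under-limit) case, with a compensating decrement when the limit
 * is exceeded.
 */
static bool try_dirty_page(long max_dirty)
{
	if (atomic_fetch_add(&dirty_pages, 1) + 1 <= max_dirty)
		return true;
	atomic_fetch_sub(&dirty_pages, 1);	/* over the limit: roll back */
	return false;
}
```

The common case thus performs one atomic op instead of two; the
second op only occurs on the rare overshoot path.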

WC-bug-id: https://jira.whamcloud.com/browse/LU-11775
Lustre-commit: 8b364fbd6bd9 ("LU-11775 osc: reduce atomic ops in osc_enter_cache_try")
Signed-off-by: Li Dongyang <dongyangli@ddn.com>
Reviewed-on: https://review.whamcloud.com/33859
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/osc_cache.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/fs/lustre/osc/osc_cache.c b/fs/lustre/osc/osc_cache.c
index a18e791..bdaf65f 100644
--- a/fs/lustre/osc/osc_cache.c
+++ b/fs/lustre/osc/osc_cache.c
@@ -1423,7 +1423,6 @@ static void osc_consume_write_grant(struct client_obd *cli,
 {
 	assert_spin_locked(&cli->cl_loi_list_lock);
 	LASSERT(!(pga->flag & OBD_BRW_FROM_GRANT));
-	atomic_long_inc(&obd_dirty_pages);
 	cli->cl_dirty_pages++;
 	pga->flag |= OBD_BRW_FROM_GRANT;
 	CDEBUG(D_CACHE, "using %lu grant credits for brw %p page %p\n",
@@ -1560,13 +1559,18 @@ static bool osc_enter_cache_try(struct client_obd *cli,
 	if (osc_reserve_grant(cli, bytes) < 0)
 		return rc;
 
-	if (cli->cl_dirty_pages < cli->cl_dirty_max_pages &&
-	    atomic_long_read(&obd_dirty_pages) + 1 <= obd_max_dirty_pages) {
-		osc_consume_write_grant(cli, &oap->oap_brw_page);
-		rc = true;
-	} else {
-		__osc_unreserve_grant(cli, bytes, bytes);
+	if (cli->cl_dirty_pages < cli->cl_dirty_max_pages) {
+		if (atomic_long_add_return(1, &obd_dirty_pages) <=
+		    obd_max_dirty_pages) {
+			osc_consume_write_grant(cli, &oap->oap_brw_page);
+			rc = true;
+			goto out;
+		} else
+			atomic_long_dec(&obd_dirty_pages);
 	}
+	__osc_unreserve_grant(cli, bytes, bytes);
+
+out:
 	return rc;
 }
 
-- 
1.8.3.1

* [lustre-devel] [PATCH 216/622] lustre: llite: ll_fault should fail for insane file offsets
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (214 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 215/622] lustre: osc: reduce atomic ops in osc_enter_cache_try James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 217/622] lustre: ptlrpc: reset generation for old requests James Simmons
                   ` (406 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Alexander Zarochentsev <c17826@cray.com>

A page fault for an mmapped Lustre file at an offset larger than
2^63 causes the Lustre client to hang due to wrong page index
calculations from a signed loff_t.
There is no need to do such calculations; instead, perform
page offset sanity checks in ll_fault().
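
The check can be sketched stand-alone (illustrative constants,
assuming 4KiB pages and MAX_LFS_FILESIZE of 2^63 - 1 as on 64-bit
kernels; this is not the kernel code itself):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_PAGE_SHIFT		12		/* 4KiB pages */
#define SK_MAX_LFS_FILESIZE	INT64_MAX	/* 2^63 - 1 */

/* A byte offset at or above 2^63 goes negative in a signed loff_t,
 * so reject any page index beyond the maximum file size up front
 * rather than computing with it.
 */
static bool pgoff_is_sane(uint64_t pgoff)
{
	return pgoff <= ((uint64_t)SK_MAX_LFS_FILESIZE >> SK_PAGE_SHIFT);
}
```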

Cray-bug-id: LUS-1392
WC-bug-id: https://jira.whamcloud.com/browse/LU-8299
Lustre-commit: ada3b33b52cd ("LU-8299 llite: ll_fault should fail for insane file offsets")
Signed-off-by: Alexander Zarochentsev <c17826@cray.com>
Reviewed-on: https://review.whamcloud.com/34242
Reviewed-by: Andrew Perepechko <c17827@cray.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_mmap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/lustre/llite/llite_mmap.c b/fs/lustre/llite/llite_mmap.c
index 14080b6..236d1d2 100644
--- a/fs/lustre/llite/llite_mmap.c
+++ b/fs/lustre/llite/llite_mmap.c
@@ -373,6 +373,9 @@ static vm_fault_t ll_fault(struct vm_fault *vmf)
 	ll_stats_ops_tally(ll_i2sbi(file_inode(vma->vm_file)),
 			   LPROC_LL_FAULT, 1);
 
+	/* make sure offset is not a negative number */
+	if (vmf->pgoff > (MAX_LFS_FILESIZE >> PAGE_SHIFT))
+		return VM_FAULT_SIGBUS;
 restart:
 	result = __ll_fault(vmf->vma, vmf);
 	if (!(result & (VM_FAULT_RETRY | VM_FAULT_ERROR | VM_FAULT_LOCKED))) {
-- 
1.8.3.1

* [lustre-devel] [PATCH 217/622] lustre: ptlrpc: reset generation for old requests
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (215 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 216/622] lustre: llite: ll_fault should fail for insane file offsets James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 218/622] lustre: osc: check if opg is in lru list without locking James Simmons
                   ` (405 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Alex Zhuravlev <bzzz@whamcloud.com>

All requests generated while the import is changing from
FULL to IDLE need to be moved to the new generation.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11951
Lustre-commit: 42d8cb04637b ("LU-11951 ptlrpc: reset generation for old requests")
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34221
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_support.h |  1 +
 fs/lustre/ptlrpc/import.c       | 20 +++++++++++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index d9a0395..5e5cf3a 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -263,6 +263,7 @@
 #define OBD_FAIL_OST_DQACQ_NET				0x230
 #define OBD_FAIL_OST_STATFS_EINPROGRESS			0x231
 #define OBD_FAIL_OST_SET_INFO_NET			0x232
+#define OBD_FAIL_OST_DISCONNECT_DELAY	 0x245
 
 #define OBD_FAIL_LDLM					0x300
 #define OBD_FAIL_LDLM_NAMESPACE_NEW			0x301
diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index df6c459..34a2cb0 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -1593,6 +1593,23 @@ int ptlrpc_disconnect_import(struct obd_import *imp, int noclose)
 }
 EXPORT_SYMBOL(ptlrpc_disconnect_import);
 
+static void ptlrpc_reset_reqs_generation(struct obd_import *imp)
+{
+	struct ptlrpc_request *old, *tmp;
+
+	/* tag all resendable requests generated before disconnection
+	 * notice this code is part of disconnect-at-idle path only
+	 */
+	list_for_each_entry_safe(old, tmp, &imp->imp_delayed_list,
+			rq_list) {
+		spin_lock(&old->rq_lock);
+		if (old->rq_import_generation == imp->imp_generation - 1 &&
+		    !old->rq_no_resend)
+			old->rq_import_generation = imp->imp_generation;
+		spin_unlock(&old->rq_lock);
+	}
+}
+
 static int ptlrpc_disconnect_idle_interpret(const struct lu_env *env,
 					    struct ptlrpc_request *req,
 					    void *args, int rc)
@@ -1600,7 +1617,7 @@ static int ptlrpc_disconnect_idle_interpret(const struct lu_env *env,
 	struct obd_import *imp = req->rq_import;
 	int connect = 0;
 
-	DEBUG_REQ(D_HA, req, "inflight=%d, refcount=%d: rc = %d\n",
+	DEBUG_REQ(D_HA, req, "inflight=%d, refcount=%d: rc = %d",
 		  atomic_read(&imp->imp_inflight),
 		  atomic_read(&imp->imp_refcount), rc);
 
@@ -1620,6 +1637,7 @@ static int ptlrpc_disconnect_idle_interpret(const struct lu_env *env,
 			imp->imp_generation++;
 			imp->imp_initiated_at = imp->imp_generation;
 			IMPORT_SET_STATE_NOLOCK(imp, LUSTRE_IMP_NEW);
+			ptlrpc_reset_reqs_generation(imp);
 			connect = 1;
 		}
 	}
-- 
1.8.3.1


* [lustre-devel] [PATCH 218/622] lustre: osc: check if opg is in lru list without locking
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (216 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 217/622] lustre: ptlrpc: reset generation for old requests James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 219/622] lnet: use right rtr address James Simmons
                   ` (404 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Li Dongyang <dongyangli@ddn.com>

osc_lru_use() is called for every page queued for I/O, so as a
fast path we can check whether the osc_page is on the LRU list
without taking cl_lru_list_lock, and return early if it is not.
Note we still need to repeat the check after taking the lock,
since another thread could have removed the page from the LRU
list in the meantime.
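The change is the classic lockless-check / lock / re-check pattern. A
condensed user-space illustration (the boolean flag and the counting stub
lock below are illustrative stand-ins for ops_lru and cl_lru_list_lock,
not the real osc code):

```c
#include <assert.h>
#include <stdbool.h>

static int lock_taken;			/* counts lock acquisitions */
static void lru_lock(void)   { lock_taken++; }
static void lru_unlock(void) { }

struct lru_page { bool on_lru; };	/* stands in for !list_empty(&ops_lru) */

static void lru_use(struct lru_page *pg)
{
	if (!pg->on_lru)		/* fast path: skip the lock entirely */
		return;

	lru_lock();
	if (pg->on_lru)			/* re-check under the lock: another
					 * thread may have removed it
					 */
		pg->on_lru = false;
	lru_unlock();
}
```

The unlocked read can only race toward a harmless extra lock/re-check, which
is why the fast path is safe without any memory barrier here.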

WC-bug-id: https://jira.whamcloud.com/browse/LU-11775
Lustre-commit: b3af0798682b ("LU-11775 osc: check if opg is in lru list without locking")
Signed-off-by: Li Dongyang <dongyangli@ddn.com>
Reviewed-on: https://review.whamcloud.com/33860
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/osc_page.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/lustre/osc/osc_page.c b/fs/lustre/osc/osc_page.c
index 4dc6c18..7382e0d 100644
--- a/fs/lustre/osc/osc_page.c
+++ b/fs/lustre/osc/osc_page.c
@@ -494,6 +494,9 @@ static void osc_lru_use(struct client_obd *cli, struct osc_page *opg)
 	 * ops_lru should be empty
 	 */
 	if (opg->ops_in_lru) {
+		if (list_empty(&opg->ops_lru))
+			return;
+
 		spin_lock(&cli->cl_lru_list_lock);
 		if (!list_empty(&opg->ops_lru)) {
 			__osc_lru_del(cli, opg);
-- 
1.8.3.1


* [lustre-devel] [PATCH 219/622] lnet: use right rtr address
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (217 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 218/622] lustre: osc: check if opg is in lru list without locking James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 220/622] lnet: use right address for routing message James Simmons
                   ` (403 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Alexey Lyashkov <c17817@cray.com>

Use the sender's router to avoid a credits distribution problem.
The sender is now the preferred router.

Cray-bug-id: LUS-6490
WC-bug-id: https://jira.whamcloud.com/browse/LU-11413
Lustre-commit: 3f4520608130 ("LU-11413 lnet: use right rtr address")
Signed-off-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-on: https://review.whamcloud.com/34031
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-move.c | 2 +-
 net/lnet/lnet/lib-msg.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index f5548eb..468de06 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -3558,7 +3558,7 @@ void lnet_monitor_thr_stop(void)
 	lnet_ni_recv(ni, msg->msg_private, NULL, 0, 0, 0, 0);
 	msg->msg_receiving = 0;
 
-	rc = lnet_send(ni->ni_nid, msg, LNET_NID_ANY);
+	rc = lnet_send(ni->ni_nid, msg, msg->msg_from);
 	if (rc < 0) {
 		/* didn't get as far as lnet_ni_send() */
 		CERROR("%s: Unable to send REPLY for GET from %s: %d\n",
diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index af0675e..0738bf7 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -401,7 +401,7 @@
 		 * NB: we probably want to use NID of msg::msg_from as 3rd
 		 * parameter (router NID) if it's routed message
 		 */
-		rc = lnet_send(msg->msg_ev.target.nid, msg, LNET_NID_ANY);
+		rc = lnet_send(msg->msg_ev.target.nid, msg, msg->msg_from);
 
 		lnet_net_lock(cpt);
 		/*
-- 
1.8.3.1


* [lustre-devel] [PATCH 220/622] lnet: use right address for routing message
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (218 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 219/622] lnet: use right rtr address James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 221/622] lustre: lov: avoid signed vs. unsigned comparison James Simmons
                   ` (402 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Alexey Lyashkov <c17817@cray.com>

msg_initiator is the real sender address, so use it as the hash
source for a better distribution across CPTs on the server side.

Cray-bug-id: LUS-6841
WC-bug-id: https://jira.whamcloud.com/browse/LU-11413
Lustre-commit: ad263e5d6e93 ("LU-11413 lnet: use right address for routing message")
Signed-off-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-on: https://review.whamcloud.com/34032
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-move.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 468de06..185c31a 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -3463,7 +3463,7 @@ void lnet_monitor_thr_stop(void)
 	info.mi_rlength	= hdr->payload_length;
 	info.mi_roffset	= hdr->msg.put.offset;
 	info.mi_mbits = hdr->msg.put.match_bits;
-	info.mi_cpt = lnet_cpt_of_nid(msg->msg_rxpeer->lpni_nid, ni);
+	info.mi_cpt = lnet_cpt_of_nid(msg->msg_initiator, ni);
 
 	msg->msg_rx_ready_delay = !ni->ni_net->net_lnd->lnd_eager_recv;
 	ready_delay = msg->msg_rx_ready_delay;
@@ -3527,7 +3527,7 @@ void lnet_monitor_thr_stop(void)
 	info.mi_rlength = hdr->msg.get.sink_length;
 	info.mi_roffset = hdr->msg.get.src_offset;
 	info.mi_mbits = hdr->msg.get.match_bits;
-	info.mi_cpt = lnet_cpt_of_nid(msg->msg_rxpeer->lpni_nid, ni);
+	info.mi_cpt = lnet_cpt_of_nid(msg->msg_initiator, ni);
 
 	rc = lnet_ptl_match_md(&info, msg);
 	if (rc == LNET_MATCHMD_DROP) {
-- 
1.8.3.1


* [lustre-devel] [PATCH 221/622] lustre: lov: avoid signed vs. unsigned comparison
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (219 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 220/622] lnet: use right address for routing message James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 222/622] lustre: obd: use ldo_process_config for mdc and osc layer James Simmons
                   ` (401 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

In the expansion of lov_do_div64() GCC complains about a pointer
comparison because loff_t is not the u64 variable it should be.
lov_do_div64() also has signed vs. unsigned comparisons due to the
signed loff_t.  Change lov_do_div64() to use an unsigned 64-bit
variable for do_div() instead of loff_t to avoid these warnings.

Change OST_MAXREQSIZE and friends to be consistently unsigned values
to avoid compiler warnings.
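The 32-bit branch of lov_do_div64() works because stripe sizes are
multiples of LOV_MIN_STRIPE_SIZE, so both operands can be shifted right by
LOV_MIN_STRIPE_BITS before calling do_div(), whose divisor is limited to
32 bits. A portable sketch of the same reduction, using ordinary 64-bit
division in place of do_div() (the constants below are assumptions
mirroring the Lustre ones, with a 64 KiB minimum stripe size):

```c
#include <assert.h>
#include <stdint.h>

#define MIN_STRIPE_BITS 16			/* assumed LOV_MIN_STRIPE_BITS */
#define MIN_STRIPE_SIZE (1ULL << MIN_STRIPE_BITS)

/* divide *n by base, returning the remainder; assumes base is a
 * multiple of MIN_STRIPE_SIZE whenever it exceeds 32 bits
 */
static uint64_t stripe_div(uint64_t *n, uint64_t base)
{
	uint64_t rem;

	if (base > 0xffffffffULL) {
		/* keep the low bits aside, divide the shifted values */
		uint64_t low = *n & (MIN_STRIPE_SIZE - 1);

		*n >>= MIN_STRIPE_BITS;
		rem = *n % (base >> MIN_STRIPE_BITS);
		*n /= base >> MIN_STRIPE_BITS;
		rem = (rem << MIN_STRIPE_BITS) + low;	/* reassemble */
	} else {
		rem = *n % base;
		*n /= base;
	}
	return rem;
}
```

Because base is a multiple of 2^MIN_STRIPE_BITS, the shifted division loses
no information: the discarded low bits of n belong entirely to the remainder.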

WC-bug-id: https://jira.whamcloud.com/browse/LU-11830
Lustre-commit: 632b3591b6ea ("LU-11830 lov: avoid signed vs. unsigned comparison")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33921
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_net.h | 15 ++++++++-------
 fs/lustre/lov/lov_internal.h   | 15 +++++++++------
 2 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/fs/lustre/include/lustre_net.h b/fs/lustre/include/lustre_net.h
index 36de665..8d71559 100644
--- a/fs/lustre/include/lustre_net.h
+++ b/fs/lustre/include/lustre_net.h
@@ -281,21 +281,22 @@
  * - OST_IO_MAXREQSIZE must be at least 1 page of cookies plus some spillover
  * - Must be a multiple of 1024
  */
-#define _OST_MAXREQSIZE_BASE	(sizeof(struct lustre_msg) + \
+#define _OST_MAXREQSIZE_BASE ((unsigned long)(sizeof(struct lustre_msg) + \
 				 sizeof(struct ptlrpc_body) + \
 				 sizeof(struct obdo) + \
 				 sizeof(struct obd_ioobj) + \
-				 sizeof(struct niobuf_remote))
-#define _OST_MAXREQSIZE_SUM	(_OST_MAXREQSIZE_BASE + \
+				 sizeof(struct niobuf_remote)))
+#define _OST_MAXREQSIZE_SUM ((unsigned long)(_OST_MAXREQSIZE_BASE + \
 				 sizeof(struct niobuf_remote) * \
-				 (DT_MAX_BRW_PAGES - 1))
+				 (DT_MAX_BRW_PAGES - 1)))
 
 /**
  * FIEMAP request can be 4K+ for now
  */
-#define OST_MAXREQSIZE		(16 * 1024)
-#define OST_IO_MAXREQSIZE	max_t(int, OST_MAXREQSIZE, \
-				      (((_OST_MAXREQSIZE_SUM - 1) | (1024 - 1)) + 1))
+#define OST_MAXREQSIZE		(16UL * 1024UL)
+#define OST_IO_MAXREQSIZE	max(OST_MAXREQSIZE,		\
+				   ((_OST_MAXREQSIZE_SUM - 1) |	\
+				   (1024 - 1)) + 1)
 
 /* Safe estimate of free space in standard RPC, provides upper limit for # of
  * bytes of i/o to pack in RPC (skipping bulk transfer).
diff --git a/fs/lustre/lov/lov_internal.h b/fs/lustre/lov/lov_internal.h
index 376ac52..36586b3 100644
--- a/fs/lustre/lov/lov_internal.h
+++ b/fs/lustre/lov/lov_internal.h
@@ -186,19 +186,22 @@ struct lsm_operations {
 })
 #elif BITS_PER_LONG == 32
 # define lov_do_div64(n, base) ({					      \
+	u64 __num = (n);						      \
 	u64 __rem;							      \
 	if ((sizeof(base) > 4) && (((base) & 0xffffffff00000000ULL) != 0)) {  \
 		int __remainder;					      \
-		LASSERTF(!((base) & (LOV_MIN_STRIPE_SIZE - 1)), "64 bit lov " \
-			 "division %llu / %llu\n", (n), (u64)(base));	      \
-		__remainder = (n) & (LOV_MIN_STRIPE_SIZE - 1);		      \
-		(n) >>= LOV_MIN_STRIPE_BITS;				      \
-		__rem = do_div(n, (base) >> LOV_MIN_STRIPE_BITS);	      \
+		LASSERTF(!((base) & (LOV_MIN_STRIPE_SIZE - 1)),		      \
+			 "64 bit lov division %llu / %llu\n",		      \
+			 __num, (u64)(base));				      \
+		__remainder = __num & (LOV_MIN_STRIPE_SIZE - 1);	      \
+		__num >>= LOV_MIN_STRIPE_BITS;				      \
+		__rem = do_div(__num, (base) >> LOV_MIN_STRIPE_BITS);	      \
 		__rem <<= LOV_MIN_STRIPE_BITS;				      \
 		__rem += __remainder;					      \
 	} else {							      \
-		__rem = do_div(n, base);				      \
+		__rem = do_div(__num, base);				      \
 	}								      \
+	(n) = __num;							      \
 	__rem;								      \
 })
 #endif
-- 
1.8.3.1


* [lustre-devel] [PATCH 222/622] lustre: obd: use ldo_process_config for mdc and osc layer
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (220 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 221/622] lustre: lov: avoid signed vs. unsigned comparison James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 223/622] lnet: check for asymmetrical route messages James Simmons
                   ` (400 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

Both the mdc and osc layers use the lu_device infrastructure, but
they do not use ldo_process_config(), which is preferred over the
currently used obd_process_config() handling. Migrate both the mdc
and osc layers to the lu_device ldo_process_config() method.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9855
Lustre-commit: d12959c69fd4 ("LU-9855 obd: use ldo_process_config for mdc and osc layer")
Signed-off-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-on: https://review.whamcloud.com/34106
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/mdc/mdc_dev.c      | 11 +++++++----
 fs/lustre/mdc/mdc_internal.h |  1 -
 fs/lustre/mdc/mdc_request.c  | 11 -----------
 fs/lustre/osc/osc_dev.c      | 11 +++++++----
 fs/lustre/osc/osc_request.c  | 14 --------------
 5 files changed, 14 insertions(+), 34 deletions(-)

diff --git a/fs/lustre/mdc/mdc_dev.c b/fs/lustre/mdc/mdc_dev.c
index 306b917..f23f6cf 100644
--- a/fs/lustre/mdc/mdc_dev.c
+++ b/fs/lustre/mdc/mdc_dev.c
@@ -35,6 +35,7 @@
 
 #include <obd_class.h>
 #include <lustre_osc.h>
+#include <uapi/linux/lustre/lustre_param.h>
 
 #include "mdc_internal.h"
 
@@ -1422,15 +1423,17 @@ struct lu_object *mdc_object_alloc(const struct lu_env *env,
 	return obj;
 }
 
-static int mdc_cl_process_config(const struct lu_env *env,
-				 struct lu_device *d, struct lustre_cfg *cfg)
+static int mdc_process_config(const struct lu_env *env, struct lu_device *d,
+			      struct lustre_cfg *cfg)
 {
-	return mdc_process_config(d->ld_obd, 0, cfg);
+	size_t count  = class_modify_config(cfg, PARAM_MDC,
+					    &d->ld_obd->obd_kset.kobj);
+	return count > 0 ? 0 : count;
 }
 
 const struct lu_device_operations mdc_lu_ops = {
 	.ldo_object_alloc	= mdc_object_alloc,
-	.ldo_process_config	= mdc_cl_process_config,
+	.ldo_process_config	= mdc_process_config,
 	.ldo_recovery_complete	= NULL,
 };
 
diff --git a/fs/lustre/mdc/mdc_internal.h b/fs/lustre/mdc/mdc_internal.h
index 7a6ec81..a5fe164 100644
--- a/fs/lustre/mdc/mdc_internal.h
+++ b/fs/lustre/mdc/mdc_internal.h
@@ -93,7 +93,6 @@ int mdc_resource_get_unused(struct obd_export *exp, const struct lu_fid *fid,
 int mdc_fid_alloc(const struct lu_env *env, struct obd_export *exp,
 		  struct lu_fid *fid, struct md_op_data *op_data);
 int mdc_setup(struct obd_device *obd, struct lustre_cfg *cfg);
-int mdc_process_config(struct obd_device *obd, u32 len, void *buf);
 
 struct obd_client_handle;
 
diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index 4711288..c08a6ee 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -51,7 +51,6 @@
 #include <lustre_kernelcomm.h>
 #include <lustre_lmv.h>
 #include <lustre_log.h>
-#include <uapi/linux/lustre/lustre_param.h>
 #include <lustre_swab.h>
 #include <obd_class.h>
 #include <lustre_osc.h>
@@ -2743,15 +2742,6 @@ static int mdc_cleanup(struct obd_device *obd)
 	return osc_cleanup_common(obd);
 }
 
-int mdc_process_config(struct obd_device *obd, u32 len, void *buf)
-{
-	struct lustre_cfg *lcfg = buf;
-	size_t count  = class_modify_config(lcfg, PARAM_MDC,
-					    &obd->obd_kset.kobj);
-
-	return count > 0 ? 0 : count;
-}
-
 static const struct obd_ops mdc_obd_ops = {
 	.owner			= THIS_MODULE,
 	.setup			= mdc_setup,
@@ -2770,7 +2760,6 @@ int mdc_process_config(struct obd_device *obd, u32 len, void *buf)
 	.fid_alloc		= mdc_fid_alloc,
 	.import_event		= mdc_import_event,
 	.get_info		= mdc_get_info,
-	.process_config		= mdc_process_config,
 	.get_uuid		= mdc_get_uuid,
 	.quotactl		= mdc_quotactl,
 };
diff --git a/fs/lustre/osc/osc_dev.c b/fs/lustre/osc/osc_dev.c
index b8bf75a..6469973 100644
--- a/fs/lustre/osc/osc_dev.c
+++ b/fs/lustre/osc/osc_dev.c
@@ -40,6 +40,7 @@
 /* class_name2obd() */
 #include <obd_class.h>
 #include <lustre_osc.h>
+#include <uapi/linux/lustre/lustre_param.h>
 
 #include "osc_internal.h"
 
@@ -161,15 +162,17 @@ struct lu_context_key osc_session_key = {
 /* type constructor/destructor: osc_type_{init,fini,start,stop}(). */
 LU_TYPE_INIT_FINI(osc, &osc_key, &osc_session_key);
 
-static int osc_cl_process_config(const struct lu_env *env,
-				 struct lu_device *d, struct lustre_cfg *cfg)
+static int osc_process_config(const struct lu_env *env, struct lu_device *d,
+			      struct lustre_cfg *cfg)
 {
-	return osc_process_config_base(d->ld_obd, cfg);
+	ssize_t count  = class_modify_config(cfg, PARAM_OSC,
+					     &d->ld_obd->obd_kset.kobj);
+	return count > 0 ? 0 : count;
 }
 
 static const struct lu_device_operations osc_lu_ops = {
 	.ldo_object_alloc	= osc_object_alloc,
-	.ldo_process_config	= osc_cl_process_config,
+	.ldo_process_config	= osc_process_config,
 	.ldo_recovery_complete	= NULL
 };
 
diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index 6ce22c3..c55d5a9 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -47,7 +47,6 @@
 #include <lprocfs_status.h>
 #include <uapi/linux/lustre/lustre_ioctl.h>
 #include <lustre_obdo.h>
-#include <uapi/linux/lustre/lustre_param.h>
 #include <lustre_fid.h>
 #include <obd_class.h>
 #include <obd.h>
@@ -3348,18 +3347,6 @@ int osc_cleanup_common(struct obd_device *obd)
 }
 EXPORT_SYMBOL(osc_cleanup_common);
 
-int osc_process_config_base(struct obd_device *obd, struct lustre_cfg *lcfg)
-{
-	ssize_t count  = class_modify_config(lcfg, PARAM_OSC,
-					     &obd->obd_kset.kobj);
-	return count > 0 ? 0 : count;
-}
-
-static int osc_process_config(struct obd_device *obd, u32 len, void *buf)
-{
-	return osc_process_config_base(obd, buf);
-}
-
 static const struct obd_ops osc_obd_ops = {
 	.owner		= THIS_MODULE,
 	.setup		= osc_setup,
@@ -3379,7 +3366,6 @@ static int osc_process_config(struct obd_device *obd, u32 len, void *buf)
 	.iocontrol	= osc_iocontrol,
 	.set_info_async	= osc_set_info_async,
 	.import_event	= osc_import_event,
-	.process_config	= osc_process_config,
 	.quotactl	= osc_quotactl,
 };
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 223/622] lnet: check for asymmetrical route messages
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (221 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 222/622] lustre: obd: use ldo_process_config for mdc and osc layer James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 224/622] lustre: llite: Lock inode on tiny write if setuid/setgid set James Simmons
                   ` (399 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Sebastien Buisson <sbuisson@ddn.com>

Asymmetrical routes can be an issue when debugging the network,
and allowing them also opens the door to attacks in which hostile
clients inject data into the servers.

In order to prevent asymmetrical routes, add a new lnet kernel
module option named 'lnet_drop_asym_route'. When set to non-zero,
lnet_parse() will check if the message received from a remote peer
is coming through a router that would normally be used by this node
to reach the remote peer. If it is not the case, then it means we
are dealing with an asymmetrical route message, and the message will
be dropped.

The check for asymmetrical routes can also be switched on/off with
the command 'lnetctl set drop_asym_route 0|1'. This parameter is
also exported/imported in YAML.
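The check described above boils down to: for a message addressed to this
node whose source net differs from the arrival net, accept it only if the
last hop matches a gateway this node would itself use to reach the
source's net. A table-driven sketch (the arrays and names here are
illustrative stand-ins for the lnet route lists, and the local-net filter
on gateways is omitted for brevity):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t nid_t;

struct route { uint32_t dst_net; nid_t gw_nid; };

/* would this node route a message to src_net through from_nid? */
static bool route_is_symmetric(const struct route *tbl, size_t n,
			       uint32_t src_net, nid_t from_nid)
{
	bool found = true;	/* no route entry: nothing to compare against */

	for (size_t i = 0; i < n; i++) {
		if (tbl[i].dst_net != src_net)
			continue;
		/* a candidate route exists: require a gateway match */
		found = (tbl[i].gw_nid == from_nid);
		if (found)
			break;
	}
	return found;
}
```

A message failing this test arrived through a router the local node would
never pick for the reverse path, which is exactly what lnet_parse() drops.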

WC-bug-id: https://jira.whamcloud.com/browse/LU-11894
Lustre-commit: 4932febc1213 ("LU-11894 lnet: check for asymmetrical route messages")
Signed-off-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-on: https://review.whamcloud.com/34119
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h |  1 +
 net/lnet/lnet/api-ni.c        | 44 ++++++++++++++++++++++++++++++++++++++++
 net/lnet/lnet/lib-move.c      | 47 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 92 insertions(+)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index d09fb4c..a6e64f6 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -507,6 +507,7 @@ struct lnet_ni *
 extern unsigned int lnet_health_sensitivity;
 extern unsigned int lnet_recovery_interval;
 extern unsigned int lnet_peer_discovery_disabled;
+extern unsigned int lnet_drop_asym_route;
 extern int portal_rotor;
 
 int lnet_lib_init(void);
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 3ee10da..e5f5c6c 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -126,6 +126,20 @@ static int recovery_interval_set(const char *val,
 MODULE_PARM_DESC(lnet_peer_discovery_disabled,
 		 "Set to 1 to disable peer discovery on this node.");
 
+unsigned int lnet_drop_asym_route;
+static int drop_asym_route_set(const char *val, const struct kernel_param *kp);
+
+static struct kernel_param_ops param_ops_drop_asym_route = {
+	.set = drop_asym_route_set,
+	.get = param_get_int,
+};
+
+#define param_check_drop_asym_route(name, p)	\
+	__param_check(name, p, int)
+module_param(lnet_drop_asym_route, drop_asym_route, 0644);
+MODULE_PARM_DESC(lnet_drop_asym_route,
+		 "Set to 1 to drop asymmetrical route messages.");
+
 unsigned int lnet_transaction_timeout = 50;
 static int transaction_to_set(const char *val, const struct kernel_param *kp);
 static struct kernel_param_ops param_ops_transaction_timeout = {
@@ -292,6 +306,36 @@ static int lnet_discover(struct lnet_process_id id, u32 force,
 }
 
 static int
+drop_asym_route_set(const char *val, const struct kernel_param *kp)
+{
+	int rc;
+	unsigned int *drop_asym_route = (unsigned int *)kp->arg;
+	unsigned long value;
+
+	rc = kstrtoul(val, 0, &value);
+	if (rc) {
+		CERROR("Invalid module parameter value for 'lnet_drop_asym_route'\n");
+		return rc;
+	}
+
+	/* The purpose of locking the api_mutex here is to ensure that
+	 * the correct value ends up stored properly.
+	 */
+	mutex_lock(&the_lnet.ln_api_mutex);
+
+	if (value == *drop_asym_route) {
+		mutex_unlock(&the_lnet.ln_api_mutex);
+		return 0;
+	}
+
+	*drop_asym_route = value;
+
+	mutex_unlock(&the_lnet.ln_api_mutex);
+
+	return 0;
+}
+
+static int
 transaction_to_set(const char *val, const struct kernel_param *kp)
 {
 	unsigned int *transaction_to = (unsigned int *)kp->arg;
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 185c31a..809d2b6 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -3959,6 +3959,53 @@ void lnet_monitor_thr_stop(void)
 		goto drop;
 	}
 
+	if (lnet_drop_asym_route && for_me &&
+	    LNET_NIDNET(src_nid) != LNET_NIDNET(from_nid)) {
+		struct lnet_net *net;
+		struct lnet_remotenet *rnet;
+		bool found = true;
+
+		/* we are dealing with a routed message,
+		 * so see if route to reach src_nid goes through from_nid
+		 */
+		lnet_net_lock(cpt);
+		net = lnet_get_net_locked(LNET_NIDNET(ni->ni_nid));
+		if (!net) {
+			lnet_net_unlock(cpt);
+			CERROR("net %s not found\n",
+			       libcfs_net2str(LNET_NIDNET(ni->ni_nid)));
+			return -EPROTO;
+		}
+
+		rnet = lnet_find_rnet_locked(LNET_NIDNET(src_nid));
+		if (rnet) {
+			struct lnet_peer_ni *gw = NULL;
+			struct lnet_route *route;
+
+			list_for_each_entry(route, &rnet->lrn_routes, lr_list) {
+				found = false;
+				gw = route->lr_gateway;
+				if (gw->lpni_net != net)
+					continue;
+				if (gw->lpni_nid == from_nid) {
+					found = true;
+					break;
+				}
+			}
+		}
+		lnet_net_unlock(cpt);
+		if (!found) {
+			/* we would not use from_nid to route a message to
+			 * src_nid
+			 * => asymmetric routing detected but forbidden
+			 */
+			CERROR("%s, src %s: Dropping asymmetrical route %s\n",
+			       libcfs_nid2str(from_nid),
+			       libcfs_nid2str(src_nid), lnet_msgtyp2str(type));
+			goto drop;
+		}
+	}
+
 	msg = kzalloc(sizeof(*msg), GFP_NOFS);
 	if (!msg) {
 		CERROR("%s, src %s: Dropping %s (out of memory)\n",
-- 
1.8.3.1


* [lustre-devel] [PATCH 224/622] lustre: llite: Lock inode on tiny write if setuid/setgid set
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (222 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 223/622] lnet: check for asymmetrical route messages James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 225/622] lustre: llite: make sure name pack atomic James Simmons
                   ` (398 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Ann Koehler <amk@cray.com>

During a write, the setuid/setgid bits must be reset if they are
enabled and the user does not have the correct permissions. Setting
any file attributes, including setuid and setgid, requires the inode
to be locked. Writes became lockless with the introduction of
LU-1669. Locking the inode in the setuid/setgid case was added to
vvp_io_write_start() as a special case. The inode locking was not
included when support for tiny writes was added with LU-9409. This
mod adds the necessary inode lock/unlock calls to ll_do_tiny_write().

If the inode is not locked when setuid/setgid are reset, the kernel
will issue a one time warning and Lustre may hang trying to get the
inode lock in ll_setattr_raw().
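The resulting pattern is simply a conditional lock around the generic
write, taken only when IS_NOSEC() is false. A trivial user-space rendition
with a counting stub lock (all names here are illustrative, not the VFS
API):

```c
#include <assert.h>
#include <stdbool.h>

static int locks;			/* counts inode_lock() calls */
static void fake_inode_lock(void)   { locks++; }
static void fake_inode_unlock(void) { }

/* the write path only takes the inode lock when security bits
 * (setuid/setgid) might need clearing, i.e. when IS_NOSEC() is false
 */
static long tiny_write(bool nosec)
{
	bool lock_inode = !nosec;	/* mirrors !IS_NOSEC(inode) */
	long ret;

	if (lock_inode)
		fake_inode_lock();
	ret = 42;			/* stands in for __generic_file_write_iter() */
	if (lock_inode)
		fake_inode_unlock();
	return ret;
}
```

Keeping the common case (no suid/sgid bits set) lockless preserves the
performance benefit that motivated lockless tiny writes in the first place.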

WC-bug-id: https://jira.whamcloud.com/browse/LU-11944
Lustre-commit: f39a552922ca ("LU-11944 llite: Lock inode on tiny write if setuid/setgid set")
Signed-off-by: Ann Koehler <amk@cray.com>
Reviewed-on: https://review.whamcloud.com/34218
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c   | 6 ++++++
 fs/lustre/llite/vvp_io.c | 6 +++---
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 7078734..a73d11f 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -1616,6 +1616,7 @@ static ssize_t ll_do_tiny_write(struct kiocb *iocb, struct iov_iter *iter)
 	ssize_t count = iov_iter_count(iter);
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
+	bool lock_inode = !IS_NOSEC(inode);
 	ssize_t result = 0;
 
 	/* Restrict writes to single page and < PAGE_SIZE.  See comment@top
@@ -1625,8 +1626,13 @@ static ssize_t ll_do_tiny_write(struct kiocb *iocb, struct iov_iter *iter)
 	    (iocb->ki_pos & (PAGE_SIZE-1)) + count > PAGE_SIZE)
 		return 0;
 
+	if (unlikely(lock_inode))
+		inode_lock(inode);
 	result = __generic_file_write_iter(iocb, iter);
 
+	if (unlikely(lock_inode))
+		inode_unlock(inode);
+
 	/* If the page is not already dirty, ll_tiny_write_begin returns
 	 * -ENODATA.  We continue on to normal write.
 	 */
diff --git a/fs/lustre/llite/vvp_io.c b/fs/lustre/llite/vvp_io.c
index 85bb3e0..ad4b39e 100644
--- a/fs/lustre/llite/vvp_io.c
+++ b/fs/lustre/llite/vvp_io.c
@@ -1037,13 +1037,13 @@ static int vvp_io_write_start(const struct lu_env *env,
 		 * consistency, proper locking to protect against writes,
 		 * trucates, etc. is handled in the higher layers of lustre.
 		 */
-		bool lock_node = !IS_NOSEC(inode);
+		lock_inode = !IS_NOSEC(inode);
 
-		if (lock_node)
+		if (unlikely(lock_inode))
 			inode_lock(inode);
 		result = __generic_file_write_iter(vio->vui_iocb,
 						   vio->vui_iter);
-		if (lock_node)
+		if (unlikely(lock_inode))
 			inode_unlock(inode);
 
 		if (result > 0 || result == -EIOCBQUEUED)
-- 
1.8.3.1


* [lustre-devel] [PATCH 225/622] lustre: llite: make sure name pack atomic
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (223 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 224/622] lustre: llite: Lock inode on tiny write if setuid/setgid set James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 226/622] lustre: ptlrpc: handle proper import states for recovery James Simmons
                   ` (397 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

We access the dentry name directly and pass it down without
holding @d_lock. This is racy and can trigger assertions:

(mdc_lib.c:137:mdc_pack_name()) ASSERTION( lu_name_is_valid_2(buf, cpy_len) ) failed:

Fix the problem by allocating memory and copying the name with
@d_lock held.
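The fix follows a general retry pattern: snapshot the length without the
lock, allocate, then re-check the length under the lock and retry if a
concurrent rename changed it. A user-space rendition (the stub lock
functions and names are illustrative, not the dentry API):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static const char *d_name = "alpha";	/* may change under d_lock (rename) */
static void name_lock(void)   { }	/* stands in for spin_lock(&de->d_lock) */
static void name_unlock(void) { }

/* snapshot the name atomically w.r.t. renames; caller frees the copy */
static char *copy_name(size_t *lenp)
{
	char *buf;
	size_t len;

retry:
	len = strlen(d_name);		/* unlocked length snapshot */
	buf = malloc(len + 1);
	if (!buf)
		return NULL;

	name_lock();
	if (len != strlen(d_name)) {	/* lost a race with a rename: retry */
		name_unlock();
		free(buf);
		goto retry;
	}
	memcpy(buf, d_name, len + 1);
	name_unlock();

	*lenp = len;
	return buf;
}
```

Allocating outside the lock keeps the (sleeping) allocation off the
spinlock-held path, at the cost of the occasional retry when the length
changes underneath.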

WC-bug-id: https://jira.whamcloud.com/browse/LU-12020
Lustre-Commit: f575b6551b2b ("LU-12020 llite: make sure name pack atomic")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/34330
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Gu Zheng <gzheng@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c | 30 +++++++++++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index a73d11f..4560ae0 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -502,7 +502,7 @@ static int ll_intent_file_open(struct dentry *de, void *lmm, int lmmsize,
 	struct inode *inode = d_inode(de);
 	struct ll_sb_info *sbi = ll_i2sbi(inode);
 	struct dentry *parent = de->d_parent;
-	const char *name = NULL;
+	char *name = NULL;
 	struct md_op_data *op_data;
 	struct ptlrpc_request *req = NULL;
 	int len = 0, rc;
@@ -514,21 +514,41 @@ static int ll_intent_file_open(struct dentry *de, void *lmm, int lmmsize,
 	 * if server supports open-by-fid, or file name is invalid, don't pack
 	 * name in open request
 	 */
-	if (!(exp_connect_flags(sbi->ll_md_exp) & OBD_CONNECT_OPEN_BY_FID) &&
-	    lu_name_is_valid_2(de->d_name.name, de->d_name.len)) {
-		name = de->d_name.name;
+	if (!(exp_connect_flags(sbi->ll_md_exp) & OBD_CONNECT_OPEN_BY_FID)) {
+retry:
 		len = de->d_name.len;
+		name = kmalloc(len, GFP_NOFS);
+		if (!name)
+			return -ENOMEM;
+		/* race here */
+		spin_lock(&de->d_lock);
+		if (len != de->d_name.len) {
+			spin_unlock(&de->d_lock);
+			kfree(name);
+			goto retry;
+		}
+		memcpy(name, de->d_name.name, len);
+		spin_unlock(&de->d_lock);
+
+		if (!lu_name_is_valid_2(name, len)) {
+			kfree(name);
+			name = NULL;
+			len = 0;
+		}
 	}
 
 	op_data  = ll_prep_md_op_data(NULL, d_inode(parent), inode, name, len,
 				      O_RDWR, LUSTRE_OPC_ANY, NULL);
-	if (IS_ERR(op_data))
+	if (IS_ERR(op_data)) {
+		kfree(name);
 		return PTR_ERR(op_data);
+	}
 	op_data->op_data = lmm;
 	op_data->op_data_size = lmmsize;
 
 	rc = md_intent_lock(sbi->ll_md_exp, op_data, itp, &req,
 			    &ll_md_blocking_ast, 0);
+	kfree(name);
 	ll_finish_md_op_data(op_data);
 	if (rc == -ESTALE) {
 		/* reason for keep own exit path - don`t flood log
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 226/622] lustre: ptlrpc: handle proper import states for recovery
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (224 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 225/622] lustre: llite: make sure name pack atomic James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 227/622] lustre: ldlm: don't convert wrong resource James Simmons
                   ` (396 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

There are two problems:

See the following assertion:

    lod_add_device() lustre-OSTe42a-osc-MDT0000:
                     can't set up pool, failed with -12
    osp_disconnect() ASSERTION( imp != ((void *)0) ) failed:
    osp_disconnect() LBUG
    CPU: 1 PID: 10059 Comm: llog_process_th

The problem is that obd_disconnect() will clean up @imp and set it to NULL.
 ->osp_obd_disconnect
    ->class_manual_cleanup
       ->class_process_config
          ->class_cleanup
             ->obd_precleanup
                ->osp_device_fini
                   ->client_obd_cleanup

While ldo_process_config() will try to access @imp again:
 ->ldo_process_config
    ->osp_shutdown
       ->osp_disconnect
          ->LASSERT(imp != NULL)

The other problem is that if we failed before obd_connect(),
we will hang during mount:
 ->ldo_process_config
    ->osp_shutdown
       ->osp_disconnect
          ->ptlrpc_disconnect_import
             ->rc = l_wait_event(imp->imp_recovery_waitq,
                                 !ptlrpc_import_in_recovery(imp), &lwi);

Since connect is never called, the import state stays LUSTRE_IMP_NEW.
Fix this by checking properly whether we are in recovery: only consider
the import to be in recovery if it is in one of the following states:

 LUSTRE_IMP_CONNECTING = 4,
 LUSTRE_IMP_REPLAY     = 5,
 LUSTRE_IMP_REPLAY_LOCKS = 6,
 LUSTRE_IMP_REPLAY_WAIT  = 7,
 LUSTRE_IMP_RECOVER    = 8,
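
The diff replaces the explicit three-state test with a range check over the state enum. A userspace sketch of that idea follows; the states named in the message keep their listed values, while the neighbouring values and the helper name are assumptions for illustration:

```c
#include <assert.h>

/* Import states; values 4-8 are from the commit message, the rest assumed. */
enum imp_state {
	LUSTRE_IMP_CLOSED	= 1,
	LUSTRE_IMP_NEW		= 2,
	LUSTRE_IMP_DISCON	= 3,
	LUSTRE_IMP_CONNECTING	= 4,
	LUSTRE_IMP_REPLAY	= 5,
	LUSTRE_IMP_REPLAY_LOCKS	= 6,
	LUSTRE_IMP_REPLAY_WAIT	= 7,
	LUSTRE_IMP_RECOVER	= 8,
	LUSTRE_IMP_FULL		= 9,
	LUSTRE_IMP_EVICTED	= 10,
};

/*
 * Only states strictly between DISCON and FULL count as recovery.
 * A never-connected import (LUSTRE_IMP_NEW <= LUSTRE_IMP_DISCON) is
 * therefore excluded, which is what prevents the mount hang.
 */
static int in_recovery(enum imp_state s)
{
	return s > LUSTRE_IMP_DISCON && s < LUSTRE_IMP_FULL;
}
```

This is the same predicate the patch expresses as `imp_state <= LUSTRE_IMP_DISCON || imp_state >= LUSTRE_IMP_FULL` (negated).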

WC-bug-id: https://jira.whamcloud.com/browse/LU-11243
Lustre-commit: f28353b3d810 ("LU-11243 lod: fix assertion and hang upon lod_add_device failure")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/32994
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Gu Zheng <gzheng@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/recover.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/ptlrpc/recover.c b/fs/lustre/ptlrpc/recover.c
index ceab288..e26612d 100644
--- a/fs/lustre/ptlrpc/recover.c
+++ b/fs/lustre/ptlrpc/recover.c
@@ -367,9 +367,8 @@ int ptlrpc_import_in_recovery(struct obd_import *imp)
 	int in_recovery = 1;
 
 	spin_lock(&imp->imp_lock);
-	if (imp->imp_state == LUSTRE_IMP_FULL ||
-	    imp->imp_state == LUSTRE_IMP_CLOSED ||
-	    imp->imp_state == LUSTRE_IMP_DISCON ||
+	if (imp->imp_state <= LUSTRE_IMP_DISCON ||
+	    imp->imp_state >= LUSTRE_IMP_FULL ||
 	    imp->imp_obd->obd_no_recov)
 		in_recovery = 0;
 	spin_unlock(&imp->imp_lock);
-- 
1.8.3.1


* [lustre-devel] [PATCH 227/622] lustre: ldlm: don't convert wrong resource
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (225 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 226/622] lustre: ptlrpc: handle proper import states for recovery James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 228/622] lustre: llite: limit statfs ffree if less than OST ffree James Simmons
                   ` (395 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

During enqueue, the returned lock may have a different resource, and
the local client lock replaces its resource too. But there is a real
race between the BL AST and the reply from the server: the BL AST may
arrive earlier and find the client lock still holding the old resource.
In that case ldlm_handle_bl_callback() should proceed with a normal
cancel and not use cancel_bits for lock convert.
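
The guard the patch adds can be sketched in userspace C: only trust partial-cancel bits from the callback descriptor when it names the same resource as the local lock. The struct layouts and function names here are simplified stand-ins, not the real Lustre definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified resource name: ldlm_res_id holds a small array of u64s. */
#define RES_NAME_SIZE 4
struct res_id { uint64_t name[RES_NAME_SIZE]; };

struct fake_lock {
	struct res_id res;	/* resource the client lock currently holds */
	uint64_t cancel_bits;	/* partial-cancel bits; 0 means full cancel */
};

static bool res_eq(const struct res_id *a, const struct res_id *b)
{
	for (int i = 0; i < RES_NAME_SIZE; i++)
		if (a->name[i] != b->name[i])
			return false;
	return true;
}

/*
 * Apply a BL AST descriptor: use its cancel bits only if it describes
 * the same resource; otherwise fall back to a full cancel, since bits
 * from a different resource are meaningless here.
 */
static void apply_bl_desc(struct fake_lock *lock,
			  const struct res_id *desc_res, uint64_t desc_bits)
{
	if (desc_bits && res_eq(desc_res, &lock->res))
		lock->cancel_bits = desc_bits;
	else
		lock->cancel_bits = 0;	/* full cancel */
}
```

The diff below implements this with ldlm_res_eq() on `lr_name` before copying `cancel_bits`.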

WC-bug-id: https://jira.whamcloud.com/browse/LU-11836
Lustre-commit: 2bc71659db69 ("LU-11836 ldlm: don't convert wrong resource")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34264
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_lockd.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/ldlm/ldlm_lockd.c b/fs/lustre/ldlm/ldlm_lockd.c
index 6905ee5..2985e37 100644
--- a/fs/lustre/ldlm/ldlm_lockd.c
+++ b/fs/lustre/ldlm/ldlm_lockd.c
@@ -131,8 +131,14 @@ void ldlm_handle_bl_callback(struct ldlm_namespace *ns,
 		 * NOTE: ld can be NULL or can be not NULL but zeroed if
 		 * passed from ldlm_bl_thread_blwi(), check below used bits
 		 * in ld to make sure it is valid description.
+		 *
+		 * If server may replace lock resource keeping the same cookie,
+		 * never use cancel bits from different resource, full cancel
+		 * is to be used.
 		 */
-		if (ld && ld->l_policy_data.l_inodebits.bits)
+		if (ld && ld->l_policy_data.l_inodebits.bits &&
+		    ldlm_res_eq(&ld->l_resource.lr_name,
+				&lock->l_resource->lr_name))
 			lock->l_policy_data.l_inodebits.cancel_bits =
 				ld->l_policy_data.l_inodebits.cancel_bits;
 		/* if there is no valid ld and lock is cbpending already
-- 
1.8.3.1


* [lustre-devel] [PATCH 228/622] lustre: llite: limit statfs ffree if less than OST ffree
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (226 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 227/622] lustre: ldlm: don't convert wrong resource James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 229/622] lustre: mdc: prevent glimpse lock count grow James Simmons
                   ` (394 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

If the OSTs report fewer total free objects than the MDTs, then
use the free files count reported by the OSTs, since it represents
the minimum number of files that can be created in the filesystem
(creating more may be possible, but this depends on other factors).
This has always been what ll_statfs_internal() reports, but the
statfs aggregation via the MDT missed this step in lod_statfs().
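The clamping rule described above can be sketched in userspace C. The struct and function names are invented for illustration; only the os_files/os_ffree fields mirror obd_statfs:

```c
#include <assert.h>
#include <stdint.h>

struct fake_statfs { uint64_t os_files, os_ffree; };

/*
 * If OSTs exist (os_files != 0) and report fewer free objects than the
 * MDT reports free inodes, shrink the reported totals so that
 * "inodes in use" (os_files - os_ffree) stays unchanged.
 */
static void clamp_ffree(struct fake_statfs *md, const struct fake_statfs *ost)
{
	if (ost->os_files && ost->os_ffree < md->os_ffree) {
		md->os_files = (md->os_files - md->os_ffree) + ost->os_ffree;
		md->os_ffree = ost->os_ffree;
	}
}
```

For example, an MDT reporting 1000 total / 800 free inodes combined with OSTs reporting 100 free objects yields 300 total / 100 free, preserving the 200 inodes in use. The diff below applies exactly this in ll_statfs_internal().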

Fix a minor defect in sanity test_418() that would let it loop
forever until the test was killed due to timeout if the "df -i"
and "lfs df -i" output did not converge.

Fixes: 41a201a04c0f ("lustre: protocol: MDT as a statfs proxy")
WC-bug-id: https://jira.whamcloud.com/browse/LU-11721
Lustre-commit: a829595add80 ("LU-11721 lod: limit statfs ffree if less than OST ffree")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34167
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-by: Nikitas Angelinas <nangelinas@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_class.h |  5 +++--
 fs/lustre/llite/llite_lib.c   | 22 +++++++++++-----------
 fs/lustre/lmv/lmv_obd.c       |  4 ++--
 3 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/fs/lustre/include/obd_class.h b/fs/lustre/include/obd_class.h
index 434bb79..6a4b6a5 100644
--- a/fs/lustre/include/obd_class.h
+++ b/fs/lustre/include/obd_class.h
@@ -898,8 +898,9 @@ static inline int obd_statfs_async(struct obd_export *exp,
 
 	obd = exp->exp_obd;
 	if (!obd->obd_type || !obd->obd_type->typ_dt_ops->statfs) {
-		CERROR("%s: no %s operation\n", obd->obd_name, __func__);
-		return -EOPNOTSUPP;
+		rc = -EOPNOTSUPP;
+		CERROR("%s: no statfs operation: rc = %d\n", obd->obd_name, rc);
+		return rc;
 	}
 
 	CDEBUG(D_SUPER, "%s: age %lld, max_age %lld\n",
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 84fc54d..4d41981a 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -1723,17 +1723,15 @@ int ll_setattr(struct dentry *de, struct iattr *attr)
 int ll_statfs_internal(struct ll_sb_info *sbi, struct obd_statfs *osfs,
 		       u32 flags)
 {
-	struct obd_statfs obd_osfs;
+	struct obd_statfs obd_osfs = { 0 };
 	time64_t max_age;
 	int rc;
 
 	max_age = ktime_get_seconds() - OBD_STATFS_CACHE_SECONDS;
 
 	rc = obd_statfs(NULL, sbi->ll_md_exp, osfs, max_age, flags);
-	if (rc) {
-		CERROR("md_statfs fails: rc = %d\n", rc);
+	if (rc)
 		return rc;
-	}
 
 	osfs->os_type = LL_SUPER_MAGIC;
 
@@ -1749,8 +1747,9 @@ int ll_statfs_internal(struct ll_sb_info *sbi, struct obd_statfs *osfs,
 
 	rc = obd_statfs(NULL, sbi->ll_dt_exp, &obd_osfs, max_age, flags);
 	if (rc) {
-		CERROR("obd_statfs fails: rc = %d\n", rc);
-		return rc;
+		/* Possibly a filesystem with no OSTs.  Report MDT totals. */
+		rc = 0;
+		goto out;
 	}
 
 	CDEBUG(D_SUPER, "OSC blocks %llu/%llu objects %llu/%llu\n",
@@ -1762,13 +1761,14 @@ int ll_statfs_internal(struct ll_sb_info *sbi, struct obd_statfs *osfs,
 	osfs->os_bfree = obd_osfs.os_bfree;
 	osfs->os_bavail = obd_osfs.os_bavail;
 
-	/* If we don't have as many objects free on the OST as inodes
-	 * on the MDS, we reduce the total number of inodes to
-	 * compensate, so that the "inodes in use" number is correct.
+	/* If we have _some_ OSTs, but don't have as many free objects on the
+	 * OSTs as inodes on the MDTs, reduce the reported number of inodes
+	 * to compensate, so that the "inodes in use" number is correct.
+	 * This should be kept in sync with lod_statfs() behaviour.
 	 */
-	if (obd_osfs.os_ffree < osfs->os_ffree) {
+	if (obd_osfs.os_files && obd_osfs.os_ffree < osfs->os_ffree) {
 		osfs->os_files = (osfs->os_files - osfs->os_ffree) +
-			obd_osfs.os_ffree;
+				 obd_osfs.os_ffree;
 		osfs->os_ffree = obd_osfs.os_ffree;
 	}
 
diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index 0685925..6ad100c 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -1402,8 +1402,8 @@ static int lmv_statfs(const struct lu_env *env, struct obd_export *exp,
 		rc = obd_statfs(env, lmv->tgts[idx]->ltd_exp, temp,
 				max_age, flags);
 		if (rc) {
-			CERROR("can't stat MDS #%d (%s), error %d\n", i,
-			       lmv->tgts[idx]->ltd_exp->exp_obd->obd_name,
+			CERROR("%s: can't stat MDS #%d: rc = %d\n",
+			       lmv->tgts[idx]->ltd_exp->exp_obd->obd_name, i,
 			       rc);
 			goto out_free_temp;
 		}
-- 
1.8.3.1


* [lustre-devel] [PATCH 229/622] lustre: mdc: prevent glimpse lock count grow
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (227 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 228/622] lustre: llite: limit statfs ffree if less than OST ffree James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 230/622] lustre: dne: performance improvement for file creation James Simmons
                   ` (393 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

DOM lock matching tries to ignore locks with the LDLM_FL_KMS_IGNORE
flag during ldlm_lock_match(), but checks the flag only after the
ldlm_lock_match() call. Therefore, if any lock in the queue carries
this flag, all locks after it are ignored and a new lock is created,
causing a large number of locks on a single resource under some
access patterns.

The patch extends lock_matches() to check for flags to exclude, and
adds ldlm_lock_match_with_skip() to use that where needed.
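
The difference between "stop at a flagged lock" and "skip flagged locks and keep searching" can be illustrated with a small userspace sketch; the types, flag value, and function name are assumptions for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FL_KMS_IGNORE (1ULL << 0)	/* assumed bit value, illustrative */

struct fake_lock { uint64_t flags; int mode; };

/*
 * Walk the queue and return the first lock whose flags do not intersect
 * skip_flags.  The pre-patch behaviour effectively stopped considering
 * later candidates once a flagged lock was reached; filtering inside
 * the walk lets later usable locks still match.
 */
static struct fake_lock *match_with_skip(struct fake_lock *q, size_t n,
					 uint64_t skip_flags)
{
	for (size_t i = 0; i < n; i++) {
		if (q[i].flags & skip_flags)
			continue;	/* filtered out, keep searching */
		return &q[i];
	}
	return NULL;
}
```

In the diff below the same filter appears as the `lmd_skip_flags` check inside lock_matches(), with ldlm_lock_match() kept as a wrapper passing skip_flags = 0.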

WC-bug-id: https://jira.whamcloud.com/browse/LU-11964
Lustre-commit: b915221b6d0f ("LU-11964 mdc: prevent glimpse lock count grow")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34261
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_dlm.h  | 27 ++++++++++---
 fs/lustre/include/obd_support.h |  1 +
 fs/lustre/ldlm/ldlm_lock.c      | 90 ++++++++++++++++++-----------------------
 fs/lustre/mdc/mdc_dev.c         | 28 +++++++++----
 4 files changed, 82 insertions(+), 64 deletions(-)

diff --git a/fs/lustre/include/lustre_dlm.h b/fs/lustre/include/lustre_dlm.h
index 1133e20..a95555e 100644
--- a/fs/lustre/include/lustre_dlm.h
+++ b/fs/lustre/include/lustre_dlm.h
@@ -1136,12 +1136,27 @@ void ldlm_lock_decref_and_cancel(const struct lustre_handle *lockh,
 void ldlm_lock_fail_match_locked(struct ldlm_lock *lock);
 void ldlm_lock_allow_match(struct ldlm_lock *lock);
 void ldlm_lock_allow_match_locked(struct ldlm_lock *lock);
-enum ldlm_mode ldlm_lock_match(struct ldlm_namespace *ns, u64 flags,
-			       const struct ldlm_res_id *res_id,
-			       enum ldlm_type type,
-			       union ldlm_policy_data *policy,
-			       enum ldlm_mode mode, struct lustre_handle *lh,
-			       int unref);
+enum ldlm_mode ldlm_lock_match_with_skip(struct ldlm_namespace *ns,
+					 u64 flags, u64 skip_flags,
+					 const struct ldlm_res_id *res_id,
+					 enum ldlm_type type,
+					 union ldlm_policy_data *policy,
+					 enum ldlm_mode mode,
+					 struct lustre_handle *lh,
+					 int unref);
+static inline enum ldlm_mode ldlm_lock_match(struct ldlm_namespace *ns,
+					     u64 flags,
+					     const struct ldlm_res_id *res_id,
+					     enum ldlm_type type,
+					     union ldlm_policy_data *policy,
+					     enum ldlm_mode mode,
+					     struct lustre_handle *lh,
+					     int unref)
+{
+	return ldlm_lock_match_with_skip(ns, flags, 0, res_id, type, policy,
+					 mode, lh, unref);
+}
+
 enum ldlm_mode ldlm_revalidate_lock_handle(const struct lustre_handle *lockh,
 					   u64 *bits);
 void ldlm_lock_cancel(struct ldlm_lock *lock);
diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index 5e5cf3a..39547a0 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -391,6 +391,7 @@
 #define OBD_FAIL_MDC_LIGHTWEIGHT			0x805
 #define OBD_FAIL_MDC_CLOSE				0x806
 #define OBD_FAIL_MDC_MERGE				0x807
+#define OBD_FAIL_MDC_GLIMPSE_DDOS			0x808
 
 #define OBD_FAIL_MGS					0x900
 #define OBD_FAIL_MGS_ALL_REQUEST_NET			0x901
diff --git a/fs/lustre/ldlm/ldlm_lock.c b/fs/lustre/ldlm/ldlm_lock.c
index 06690a6..cc96fbd 100644
--- a/fs/lustre/ldlm/ldlm_lock.c
+++ b/fs/lustre/ldlm/ldlm_lock.c
@@ -1053,6 +1053,7 @@ struct lock_match_data {
 	enum ldlm_mode		*lmd_mode;
 	union ldlm_policy_data	*lmd_policy;
 	u64			 lmd_flags;
+	u64			 lmd_skip_flags;
 	int			 lmd_unref;
 };
 
@@ -1133,6 +1134,10 @@ static bool lock_matches(struct ldlm_lock *lock, void *vdata)
 	if (!equi(data->lmd_flags & LDLM_FL_LOCAL_ONLY, ldlm_is_local(lock)))
 		return false;
 
+	/* Filter locks by skipping flags */
+	if (data->lmd_skip_flags & lock->l_flags)
+		return false;
+
 	if (data->lmd_flags & LDLM_FL_TEST_LOCK) {
 		LDLM_LOCK_GET(lock);
 		ldlm_lock_touch_in_lru(lock);
@@ -1267,12 +1272,13 @@ void ldlm_lock_allow_match(struct ldlm_lock *lock)
  * keep caller code unchanged), the context failure will be discovered by
  * caller sometime later.
  */
-enum ldlm_mode ldlm_lock_match(struct ldlm_namespace *ns, u64 flags,
-			       const struct ldlm_res_id *res_id,
-			       enum ldlm_type type,
-			       union ldlm_policy_data *policy,
-			       enum ldlm_mode mode,
-			       struct lustre_handle *lockh, int unref)
+enum ldlm_mode ldlm_lock_match_with_skip(struct ldlm_namespace *ns,
+					 u64 flags, u64 skip_flags,
+					 const struct ldlm_res_id *res_id,
+					 enum ldlm_type type,
+					 union ldlm_policy_data *policy,
+					 enum ldlm_mode mode,
+					 struct lustre_handle *lockh, int unref)
 {
 	struct lock_match_data data = {
 		.lmd_old	= NULL,
@@ -1280,11 +1286,12 @@ enum ldlm_mode ldlm_lock_match(struct ldlm_namespace *ns, u64 flags,
 		.lmd_mode	= &mode,
 		.lmd_policy	= policy,
 		.lmd_flags	= flags,
+		.lmd_skip_flags	= skip_flags,
 		.lmd_unref	= unref,
 	};
 	struct ldlm_resource *res;
 	struct ldlm_lock *lock;
-	int rc = 0;
+	int matched;
 
 	if (!ns) {
 		data.lmd_old = ldlm_handle2lock(lockh);
@@ -1304,25 +1311,13 @@ enum ldlm_mode ldlm_lock_match(struct ldlm_namespace *ns, u64 flags,
 
 	LDLM_RESOURCE_ADDREF(res);
 	lock_res(res);
-
 	if (res->lr_type == LDLM_EXTENT)
 		lock = search_itree(res, &data);
 	else
 		lock = search_queue(&res->lr_granted, &data);
-	if (lock) {
-		rc = 1;
-		goto out;
-	}
-	if (flags & LDLM_FL_BLOCK_GRANTED) {
-		rc = 0;
-		goto out;
-	}
-	lock = search_queue(&res->lr_waiting, &data);
-	if (lock) {
-		rc = 1;
-		goto out;
-	}
-out:
+	if (!lock && !(flags & LDLM_FL_BLOCK_GRANTED))
+		lock = search_queue(&res->lr_waiting, &data);
+	matched = lock ? mode : 0;
 	unlock_res(res);
 	LDLM_RESOURCE_DELREF(res);
 	ldlm_resource_putref(res);
@@ -1338,13 +1333,8 @@ enum ldlm_mode ldlm_lock_match(struct ldlm_namespace *ns, u64 flags,
 							  LDLM_FL_WAIT_NOREPROC,
 								 NULL);
 				if (err) {
-					if (flags & LDLM_FL_TEST_LOCK)
-						LDLM_LOCK_RELEASE(lock);
-					else
-						ldlm_lock_decref_internal(lock,
-									  mode);
-					rc = 0;
-					goto out2;
+					matched = 0;
+					goto out_fail_match;
 				}
 			}
 
@@ -1352,49 +1342,49 @@ enum ldlm_mode ldlm_lock_match(struct ldlm_namespace *ns, u64 flags,
 			wait_event_idle_timeout(lock->l_waitq,
 						lock->l_flags & wait_flags,
 						obd_timeout * HZ);
+
 			if (!ldlm_is_lvb_ready(lock)) {
-				if (flags & LDLM_FL_TEST_LOCK)
-					LDLM_LOCK_RELEASE(lock);
-				else
-					ldlm_lock_decref_internal(lock, mode);
-				rc = 0;
+				matched = 0;
+				goto out_fail_match;
 			}
 		}
-	}
-out2:
-	if (rc) {
-		LDLM_DEBUG(lock, "matched (%llu %llu)",
-			   (type == LDLM_PLAIN || type == LDLM_IBITS) ?
-				res_id->name[2] : policy->l_extent.start,
-			   (type == LDLM_PLAIN || type == LDLM_IBITS) ?
-				res_id->name[3] : policy->l_extent.end);
 
 		/* check user's security context */
 		if (lock->l_conn_export &&
 		    sptlrpc_import_check_ctx(class_exp2cliimp(lock->l_conn_export))) {
-			if (!(flags & LDLM_FL_TEST_LOCK))
-				ldlm_lock_decref_internal(lock, mode);
-			rc = 0;
+			matched = 0;
+			goto out_fail_match;
 		}
 
+		LDLM_DEBUG(lock, "matched (%llu %llu)",
+			   (type == LDLM_PLAIN || type == LDLM_IBITS) ?
+			   res_id->name[2] : policy->l_extent.start,
+			   (type == LDLM_PLAIN || type == LDLM_IBITS) ?
+			   res_id->name[3] : policy->l_extent.end);
+
+out_fail_match:
 		if (flags & LDLM_FL_TEST_LOCK)
 			LDLM_LOCK_RELEASE(lock);
+		else if (!matched)
+			ldlm_lock_decref_internal(lock, mode);
+	}
 
-	} else if (!(flags & LDLM_FL_TEST_LOCK)) {/*less verbose for test-only*/
+	/* less verbose for test-only */
+	if (!matched && !(flags & LDLM_FL_TEST_LOCK)) {
 		LDLM_DEBUG_NOLOCK("not matched ns %p type %u mode %u res %llu/%llu (%llu %llu)",
 				  ns, type, mode, res_id->name[0],
 				  res_id->name[1],
 				  (type == LDLM_PLAIN || type == LDLM_IBITS) ?
-					res_id->name[2] : policy->l_extent.start,
+				  res_id->name[2] : policy->l_extent.start,
 				  (type == LDLM_PLAIN || type == LDLM_IBITS) ?
-					res_id->name[3] : policy->l_extent.end);
+				  res_id->name[3] : policy->l_extent.end);
 	}
 	if (data.lmd_old)
 		LDLM_LOCK_PUT(data.lmd_old);
 
-	return rc ? mode : 0;
+	return matched;
 }
-EXPORT_SYMBOL(ldlm_lock_match);
+EXPORT_SYMBOL(ldlm_lock_match_with_skip);
 
 enum ldlm_mode ldlm_revalidate_lock_handle(const struct lustre_handle *lockh,
 					   u64 *bits)
diff --git a/fs/lustre/mdc/mdc_dev.c b/fs/lustre/mdc/mdc_dev.c
index f23f6cf..cb173f4 100644
--- a/fs/lustre/mdc/mdc_dev.c
+++ b/fs/lustre/mdc/mdc_dev.c
@@ -676,10 +676,16 @@ int mdc_enqueue_send(const struct lu_env *env, struct obd_export *exp,
 	if (einfo->ei_mode == LCK_PR)
 		mode |= LCK_PW;
 
-	if (!glimpse)
+	if (glimpse)
 		match_flags |= LDLM_FL_BLOCK_GRANTED;
-	mode = ldlm_lock_match(obd->obd_namespace, match_flags, res_id,
-			       einfo->ei_type, policy, mode, &lockh, 0);
+	/* DOM locking uses LDLM_FL_KMS_IGNORE to mark locks wich have no valid
+	 * LVB information, e.g. canceled locks or locks of just pruned object,
+	 * such locks should be skipped.
+	 */
+	mode = ldlm_lock_match_with_skip(obd->obd_namespace, match_flags,
+					 LDLM_FL_KMS_IGNORE, res_id,
+					 einfo->ei_type, policy, mode,
+					 &lockh, 0);
 	if (mode) {
 		struct ldlm_lock *matched;
 
@@ -687,8 +693,16 @@ int mdc_enqueue_send(const struct lu_env *env, struct obd_export *exp,
 			return ELDLM_OK;
 
 		matched = ldlm_handle2lock(&lockh);
-		if (!matched || ldlm_is_kms_ignore(matched))
+		/* this shouldn't happen but this check is kept to make
+		 * related test fail if problem occurs
+		 */
+		if (unlikely(ldlm_is_kms_ignore(matched))) {
+			LDLM_ERROR(matched, "matched lock has KMS ignore flag");
 			goto no_match;
+		}
+
+		if (OBD_FAIL_CHECK(OBD_FAIL_MDC_GLIMPSE_DDOS))
+			ldlm_set_kms_ignore(matched);
 
 		if (mdc_set_dom_lock_data(env, matched, einfo->ei_cbdata)) {
 			*flags |= LDLM_FL_LVB_READY;
@@ -1337,11 +1351,9 @@ static int mdc_attr_get(const struct lu_env *env, struct cl_object *obj,
 
 static int mdc_object_ast_clear(struct ldlm_lock *lock, void *data)
 {
-	if ((!lock->l_ast_data && !ldlm_is_kms_ignore(lock)) ||
-	    (lock->l_ast_data == data)) {
+	if (lock->l_ast_data == data)
 		lock->l_ast_data = NULL;
-		ldlm_set_kms_ignore(lock);
-	}
+	ldlm_set_kms_ignore(lock);
 	return LDLM_ITER_CONTINUE;
 }
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 230/622] lustre: dne: performance improvement for file creation
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (228 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 229/622] lustre: mdc: prevent glimpse lock count grow James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 231/622] lustre: mdc: return DOM size on open resend James Simmons
                   ` (392 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Jinshan Xiong <jinshan.xiong@gmail.com>

This removes obsolete code that caused drastic performance
degradation. The code was written before the PERM lock was
introduced; it requests an UPDATE lock at path walk for a remote
directory, which is then cancelled at the later file creation.

Test results before and after this patch is applied:

Test case:
rm -rf /mnt/lustre_purple/testdir
lfs mkdir -i 0 /mnt/lustre_purple/testdir
lfs mkdir -i 2 /mnt/lustre_purple/testdir/dir2
./lustre-release/lustre/tests/createmany -o \
        /mnt/lustre_purple/testdir/dir2/f 10000

Before the patch is applied:
total: 10000 open/close in 12.82 seconds: 780.22 ops/second

After the patch is applied:
total: 10000 open/close in 4.89 seconds: 2044.75 ops/second

WC-bug-id: https://jira.whamcloud.com/browse/LU-11999
Lustre-commit: bfbd062e6b17 ("LU-11999 dne: performance improvement for file creation")
Signed-off-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Reviewed-on: https://review.whamcloud.com/34291
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Andrew Perepechko <c17827@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/lmv/lmv_intent.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/fs/lustre/lmv/lmv_intent.c b/fs/lustre/lmv/lmv_intent.c
index 3f51032..6933f7d 100644
--- a/fs/lustre/lmv/lmv_intent.c
+++ b/fs/lustre/lmv/lmv_intent.c
@@ -71,13 +71,6 @@ static int lmv_intent_remote(struct obd_export *exp, struct lookup_intent *it,
 	LASSERT((body->mbo_valid & OBD_MD_MDS));
 
 	/*
-	 * Unfortunately, we have to lie to MDC/MDS to retrieve
-	 * attributes llite needs and provideproper locking.
-	 */
-	if (it->it_op & IT_LOOKUP)
-		it->it_op = IT_GETATTR;
-
-	/*
 	 * We got LOOKUP lock, but we really need attrs.
 	 */
 	pmode = it->it_lock_mode;
-- 
1.8.3.1


* [lustre-devel] [PATCH 231/622] lustre: mdc: return DOM size on open resend
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (229 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 230/622] lustre: dne: performance improvement for file creation James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 232/622] lustre: llite: optimizations for not granted lock processing James Simmons
                   ` (391 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

The DOM size is always returned along with a DOM lock, but this is
not true on open resend. The fix was server-side, but we also update
an mdc debug message here.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11835
Lustre-commit: bc3ef43d36b5 ("LU-11835 mdt: return DOM size on open resend")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34044
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/mdc/mdc_locks.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/lustre/mdc/mdc_locks.c b/fs/lustre/mdc/mdc_locks.c
index 9898b6a..55de559 100644
--- a/fs/lustre/mdc/mdc_locks.c
+++ b/fs/lustre/mdc/mdc_locks.c
@@ -742,7 +742,7 @@ static int mdc_finish_enqueue(struct obd_export *exp,
 
 		body = req_capsule_server_get(pill, &RMF_MDT_BODY);
 		if (!(body->mbo_valid & OBD_MD_DOM_SIZE)) {
-			LDLM_ERROR(lock, "%s: DoM lock without size.\n",
+			LDLM_ERROR(lock, "%s: DoM lock without size.",
 				   exp->exp_obd->obd_name);
 			rc = -EPROTO;
 			goto out_lock;
-- 
1.8.3.1


* [lustre-devel] [PATCH 232/622] lustre: llite: optimizations for not granted lock processing
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (230 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 231/622] lustre: mdc: return DOM size on open resend James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 233/622] lustre: osc: propagate grant shrink interval immediately James Simmons
                   ` (390 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Andrew Perepechko <c17827@cray.com>

This patch removes ll_md_blocking_ast() processing for locks that
were not granted. The reason is that ll_invalidate_negative_children()
can significantly slow down I/O for no reason when there are thousands
or millions of files in the directory cache.

Seagate-bug-id: MRP-3409
WC-bug-id: https://jira.whamcloud.com/browse/LU-8047
Lustre-commit: 2c126c5a73ed ("LU-8047 llite: optimizations for not granted lock processing")
Signed-off-by: Andrew Perepechko <c17827@cray.com>
Reviewed-on: https://review.whamcloud.com/19665
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_dlm.h | 5 +++++
 fs/lustre/ldlm/ldlm_extent.c   | 2 +-
 fs/lustre/ldlm/ldlm_internal.h | 3 +--
 fs/lustre/ldlm/ldlm_lock.c     | 6 +++---
 fs/lustre/ldlm/ldlm_lockd.c    | 4 ++--
 fs/lustre/ldlm/ldlm_request.c  | 7 +++----
 fs/lustre/llite/namei.c        | 4 ++++
 fs/lustre/osc/osc_lock.c       | 8 ++++----
 fs/lustre/osc/osc_request.c    | 2 +-
 9 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/fs/lustre/include/lustre_dlm.h b/fs/lustre/include/lustre_dlm.h
index a95555e..355049f 100644
--- a/fs/lustre/include/lustre_dlm.h
+++ b/fs/lustre/include/lustre_dlm.h
@@ -876,6 +876,11 @@ struct ldlm_resource {
 	struct lu_ref			lr_reference;
 };
 
+static inline int ldlm_is_granted(struct ldlm_lock *lock)
+{
+	return lock->l_req_mode == lock->l_granted_mode;
+}
+
 static inline bool ldlm_has_layout(struct ldlm_lock *lock)
 {
 	return lock->l_resource->lr_type == LDLM_IBITS &&
diff --git a/fs/lustre/ldlm/ldlm_extent.c b/fs/lustre/ldlm/ldlm_extent.c
index 7c72d04..98e2a75 100644
--- a/fs/lustre/ldlm/ldlm_extent.c
+++ b/fs/lustre/ldlm/ldlm_extent.c
@@ -151,7 +151,7 @@ void ldlm_extent_add_lock(struct ldlm_resource *res,
 	struct ldlm_interval_tree *tree;
 	int idx;
 
-	LASSERT(lock->l_granted_mode == lock->l_req_mode);
+	LASSERT(ldlm_is_granted(lock));
 
 	LASSERT(RB_EMPTY_NODE(&lock->l_rb));
 
diff --git a/fs/lustre/ldlm/ldlm_internal.h b/fs/lustre/ldlm/ldlm_internal.h
index df57c02..ede48b2 100644
--- a/fs/lustre/ldlm/ldlm_internal.h
+++ b/fs/lustre/ldlm/ldlm_internal.h
@@ -310,8 +310,7 @@ static inline int is_granted_or_cancelled(struct ldlm_lock *lock)
 	int ret = 0;
 
 	lock_res_and_lock(lock);
-	if ((lock->l_req_mode == lock->l_granted_mode) &&
-	    !ldlm_is_cp_reqd(lock))
+	if (ldlm_is_granted(lock) && !ldlm_is_cp_reqd(lock))
 		ret = 1;
 	else if (ldlm_is_failed(lock) || ldlm_is_cancel(lock))
 		ret = 1;
diff --git a/fs/lustre/ldlm/ldlm_lock.c b/fs/lustre/ldlm/ldlm_lock.c
index cc96fbd..b6c49c5 100644
--- a/fs/lustre/ldlm/ldlm_lock.c
+++ b/fs/lustre/ldlm/ldlm_lock.c
@@ -992,7 +992,7 @@ void ldlm_grant_lock_with_skiplist(struct ldlm_lock *lock)
 {
 	struct sl_insert_point prev;
 
-	LASSERT(lock->l_req_mode == lock->l_granted_mode);
+	LASSERT(ldlm_is_granted(lock));
 
 	search_granted_lock(&lock->l_resource->lr_granted, lock, &prev);
 	ldlm_granted_list_add_lock(lock, &prev);
@@ -1591,7 +1591,7 @@ enum ldlm_error ldlm_lock_enqueue(const struct lu_env *env,
 	struct ldlm_resource *res = lock->l_resource;
 
 	lock_res_and_lock(lock);
-	if (lock->l_req_mode == lock->l_granted_mode) {
+	if (ldlm_is_granted(lock)) {
 		/* The server returned a blocked lock, but it was granted
 		 * before we got a chance to actually enqueue it.  We don't
 		 * need to do anything else.
@@ -1799,7 +1799,7 @@ void ldlm_lock_cancel(struct ldlm_lock *lock)
 	ldlm_resource_unlink_lock(lock);
 	ldlm_lock_destroy_nolock(lock);
 
-	if (lock->l_granted_mode == lock->l_req_mode)
+	if (ldlm_is_granted(lock))
 		ldlm_pool_del(&ns->ns_pool, lock);
 
 	/* Make sure we will not be called again for same lock what is possible
diff --git a/fs/lustre/ldlm/ldlm_lockd.c b/fs/lustre/ldlm/ldlm_lockd.c
index 2985e37..db0da99 100644
--- a/fs/lustre/ldlm/ldlm_lockd.c
+++ b/fs/lustre/ldlm/ldlm_lockd.c
@@ -193,7 +193,7 @@ static void ldlm_handle_cp_callback(struct ptlrpc_request *req,
 
 		while (to > 0) {
 			schedule_timeout_interruptible(to);
-			if (lock->l_granted_mode == lock->l_req_mode ||
+			if (ldlm_is_granted(lock) ||
 			    ldlm_is_destroyed(lock))
 				break;
 		}
@@ -236,7 +236,7 @@ static void ldlm_handle_cp_callback(struct ptlrpc_request *req,
 	}
 
 	if (ldlm_is_destroyed(lock) ||
-	    lock->l_granted_mode == lock->l_req_mode) {
+	    ldlm_is_granted(lock)) {
 		/* bug 11300: the lock has already been granted */
 		unlock_res_and_lock(lock);
 		LDLM_DEBUG(lock, "Double grant race happened");
diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index b9e9ae9..7c3935f 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -292,8 +292,7 @@ static void failed_lock_cleanup(struct ldlm_namespace *ns,
 	/* Set a flag to prevent us from sending a CANCEL (bug 407) */
 	lock_res_and_lock(lock);
 	/* Check that lock is not granted or failed, we might race. */
-	if ((lock->l_req_mode != lock->l_granted_mode) &&
-	    !ldlm_is_failed(lock)) {
+	if (!ldlm_is_granted(lock) && !ldlm_is_failed(lock)) {
 		/* Make sure that this lock will not be found by raced
 		 * bl_ast and -EINVAL reply is sent to server anyways.
 		 * bug 17645
@@ -477,7 +476,7 @@ int ldlm_cli_enqueue_fini(struct obd_export *exp, struct ptlrpc_request *req,
 		 * a tiny window for completion to get in
 		 */
 		lock_res_and_lock(lock);
-		if (lock->l_req_mode != lock->l_granted_mode)
+		if (!ldlm_is_granted(lock))
 			rc = ldlm_fill_lvb(lock, &req->rq_pill, RCL_SERVER,
 					   lock->l_lvb_data, lvb_len);
 		unlock_res_and_lock(lock);
@@ -2196,7 +2195,7 @@ static int replay_one_lock(struct obd_import *imp, struct ldlm_lock *lock)
 	 * This happens whenever a lock enqueue is the request that triggers
 	 * recovery.
 	 */
-	if (lock->l_granted_mode == lock->l_req_mode)
+	if (ldlm_is_granted(lock))
 		flags = LDLM_FL_REPLAY | LDLM_FL_BLOCK_GRANTED;
 	else if (lock->l_granted_mode)
 		flags = LDLM_FL_REPLAY | LDLM_FL_BLOCK_CONV;
diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index 3e3fbd9..e410ff0 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -464,6 +464,10 @@ int ll_md_blocking_ast(struct ldlm_lock *lock, struct ldlm_lock_desc *desc,
 		break;
 	}
 	case LDLM_CB_CANCELING:
+		/* Nothing to do for non-granted locks */
+		if (!ldlm_is_granted(lock))
+			break;
+
 		if (ldlm_is_converting(lock)) {
 			/* this is called on already converted lock, so
 			 * ibits has remained bits only and cancel_bits
diff --git a/fs/lustre/osc/osc_lock.c b/fs/lustre/osc/osc_lock.c
index eccea37..29d8373 100644
--- a/fs/lustre/osc/osc_lock.c
+++ b/fs/lustre/osc/osc_lock.c
@@ -105,7 +105,7 @@ static int osc_lock_invariant(struct osc_lock *ols)
 		return 0;
 
 	if (!ergo(ols->ols_state == OLS_GRANTED,
-		  olock && olock->l_req_mode == olock->l_granted_mode &&
+		  olock && ldlm_is_granted(olock) &&
 		  ols->ols_hold))
 		return 0;
 	return 1;
@@ -227,7 +227,7 @@ static void osc_lock_granted(const struct lu_env *env, struct osc_lock *oscl,
 
 	/* Lock must have been granted. */
 	lock_res_and_lock(dlmlock);
-	if (dlmlock->l_granted_mode == dlmlock->l_req_mode) {
+	if (ldlm_is_granted(dlmlock)) {
 		struct ldlm_extent *ext = &dlmlock->l_policy_data.l_extent;
 		struct cl_lock_descr *descr = &oscl->ols_cl.cls_lock->cll_descr;
 
@@ -336,7 +336,7 @@ static int osc_lock_upcall_speculative(void *cookie,
 	LASSERT(dlmlock);
 
 	lock_res_and_lock(dlmlock);
-	LASSERT(dlmlock->l_granted_mode == dlmlock->l_req_mode);
+	LASSERT(ldlm_is_granted(dlmlock));
 
 	/* there is no osc_lock associated with speculative lock */
 	osc_lock_lvb_update(env, osc, dlmlock, NULL);
@@ -401,7 +401,7 @@ static int __osc_dlm_blocking_ast(const struct lu_env *env,
 	LASSERT(flag == LDLM_CB_CANCELING);
 
 	lock_res_and_lock(dlmlock);
-	if (dlmlock->l_granted_mode != dlmlock->l_req_mode) {
+	if (!ldlm_is_granted(dlmlock)) {
 		dlmlock->l_ast_data = NULL;
 		unlock_res_and_lock(dlmlock);
 		return 0;
diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index c55d5a9..7190da9 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -3163,7 +3163,7 @@ static int osc_cancel_weight(struct ldlm_lock *lock)
 	 * Cancel all unused and granted extent lock.
 	 */
 	if (lock->l_resource->lr_type == LDLM_EXTENT &&
-	    lock->l_granted_mode == lock->l_req_mode &&
+	    ldlm_is_granted(lock) &&
 	    osc_ldlm_weigh_ast(lock) == 0)
 		return 1;
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 233/622] lustre: osc: propagate grant shrink interval immediately
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (231 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 232/622] lustre: llite: optimizations for not granted lock processing James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 234/622] lustre: osc: grant shrink shouldn't account skipped OSC James Simmons
                   ` (389 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Alex Zhuravlev <bzzz@whamcloud.com>

Currently the new interval (updated with lctl) is used only
when the next shrink happens; with the default interval that
can take at least 20 minutes. Instead, we should refresh the
deadline immediately.
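
A toy userspace model of the change (the names and the flat "deadline"
arithmetic here are illustrative, not the real Lustre API): writing a new
interval recomputes the next-shrink deadline from "now" right away, instead
of leaving the deadline computed from the old interval in place.

```c
#include <assert.h>

/* Illustrative model only: a client with an absolute next-shrink
 * deadline derived from its shrink interval. */
struct grant_client {
	long long next_shrink;	/* absolute deadline, in seconds */
	int interval;		/* grant shrink interval, in seconds */
};

/* Analogue of the store handler after the fix: besides recording the
 * new interval, immediately recompute the deadline (the role played by
 * osc_update_next_shrink() + osc_schedule_grant_work() in the patch). */
static void store_interval(struct grant_client *c, long long now, int val)
{
	c->interval = val;
	c->next_shrink = now + val;
}
```

Without the immediate refresh, a deadline computed from the old 20-minute
default would stand until it expired on its own.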

WC-bug-id: https://jira.whamcloud.com/browse/LU-11408
Lustre-commit: 0b09a19bdf2d ("LU-11408 osc: propagate grant shrink interval immediately")
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33204
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/lproc_osc.c    | 2 ++
 fs/lustre/osc/osc_internal.h | 1 +
 fs/lustre/osc/osc_request.c  | 6 ++++++
 3 files changed, 9 insertions(+)

diff --git a/fs/lustre/osc/lproc_osc.c b/fs/lustre/osc/lproc_osc.c
index ea67d20..5faf518 100644
--- a/fs/lustre/osc/lproc_osc.c
+++ b/fs/lustre/osc/lproc_osc.c
@@ -349,6 +349,8 @@ static ssize_t grant_shrink_interval_store(struct kobject *kobj,
 		return -ERANGE;
 
 	obd->u.cli.cl_grant_shrink_interval = val;
+	osc_update_next_shrink(&obd->u.cli);
+	osc_schedule_grant_work();
 
 	return count;
 }
diff --git a/fs/lustre/osc/osc_internal.h b/fs/lustre/osc/osc_internal.h
index 2cb737b..0f0f4d4 100644
--- a/fs/lustre/osc/osc_internal.h
+++ b/fs/lustre/osc/osc_internal.h
@@ -43,6 +43,7 @@
 extern struct ptlrpc_request_pool *osc_rq_pool;
 
 int osc_shrink_grant_to_target(struct client_obd *cli, u64 target_bytes);
+void osc_schedule_grant_work(void);
 void osc_update_next_shrink(struct client_obd *cli);
 int lru_queue_work(const struct lu_env *env, void *data);
 int osc_extent_finish(const struct lu_env *env, struct osc_extent *ext,
diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index 7190da9..7b120da 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -905,6 +905,12 @@ static void osc_grant_work_handler(struct work_struct *data)
 		schedule_work(&work.work);
 }
 
+void osc_schedule_grant_work(void)
+{
+	cancel_delayed_work_sync(&work);
+	schedule_work(&work.work);
+}
+
 /**
  * Start grant thread for returing grant to server for idle clients.
  */
-- 
1.8.3.1


* [lustre-devel] [PATCH 234/622] lustre: osc: grant shrink shouldn't account skipped OSC
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (232 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 233/622] lustre: osc: propagate grant shrink interval immediately James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 235/622] lustre: quota: protect quota flags at OSC James Simmons
                   ` (388 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Alex Zhuravlev <bzzz@whamcloud.com>

Otherwise only the first 100 OSCs are subject to the grant shrink procedure.
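
A toy model of the fixed loop (illustrative only, not the real Lustre
code): the old code used `++rpc_sent < GRANT_SHRINK_RPC_BATCH && ...`,
which charged the per-pass RPC budget for every client visited, even ones
that were skipped; with more than 100 OSCs the later ones were never
shrunk. The fix charges the budget only when a shrink RPC is actually
sent.

```c
#include <assert.h>

#define GRANT_SHRINK_RPC_BATCH 100

/* Model of one pass over the client list: should_shrink[i] plays the
 * role of osc_should_shrink_grant(cli). Returns how many shrink RPCs
 * were sent. */
static int shrink_pass(const int *should_shrink, int nclients)
{
	int rpc_sent = 0;
	int i;

	for (i = 0; i < nclients; i++) {
		if (rpc_sent < GRANT_SHRINK_RPC_BATCH && should_shrink[i])
			rpc_sent++;	/* count only real sends */
	}
	return rpc_sent;
}
```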

WC-bug-id: https://jira.whamcloud.com/browse/LU-11409
Lustre-commit: 2b215d3763a8 ("LU-11409 osc: grant shrink shouldn't account skipped OSC")
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33206
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/osc_request.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index 7b120da..14180a4 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -879,9 +879,11 @@ static void osc_grant_work_handler(struct work_struct *data)
 	mutex_lock(&client_gtd.gtd_mutex);
 	list_for_each_entry(cli, &client_gtd.gtd_clients,
 			    cl_grant_chain) {
-		if (++rpc_sent < GRANT_SHRINK_RPC_BATCH &&
-		    osc_should_shrink_grant(cli))
+		if (rpc_sent < GRANT_SHRINK_RPC_BATCH &&
+		    osc_should_shrink_grant(cli)) {
 			osc_shrink_grant(cli);
+			rpc_sent++;
+		}
 
 		if (!init_next_shrink) {
 			if (cli->cl_next_shrink_grant < next_shrink &&
-- 
1.8.3.1


* [lustre-devel] [PATCH 235/622] lustre: quota: protect quota flags at OSC
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (233 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 234/622] lustre: osc: grant shrink shouldn't account skipped OSC James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 236/622] lustre: osc: pass client page size during reconnect too James Simmons
                   ` (387 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Hongchao Zhang <hongchao@whamcloud.com>

There is no protection in the OSC quota hash tracking the quota flags
of different qids; because replies can arrive out of order, an earlier
request's reply could overwrite the quota flags already set by a later
request.

This patch also adds a lock to protect the operations on the quota
hash from different requests.
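
A toy model of the ordering guard (illustrative only; in the real patch
the check and update run under the new cl_quota_mutex): replies carry the
xid of the request that produced them, and a reply older than the last
one applied must not overwrite the quota flags.

```c
#include <assert.h>

/* Illustrative per-client quota state: the xid of the last reply
 * applied, plus the flags it carried. */
struct quota_state {
	unsigned long long last_xid;
	unsigned int flags;
};

/* Analogue of osc_quota_setdq() after the fix: refuse to apply a reply
 * whose xid is older than one already applied. Returns 1 if applied,
 * 0 if discarded as stale. */
static int setdq(struct quota_state *q, unsigned long long xid,
		 unsigned int flags)
{
	if (q->last_xid > xid)
		return 0;	/* stale out-of-order reply: ignore */
	q->last_xid = xid;
	q->flags = flags;
	return 1;
}
```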

WC-bug-id: https://jira.whamcloud.com/browse/LU-11678
Lustre-commit: 77d9f4e05a5c ("LU-11678 quota: protect quota flags at OSC")
Signed-off-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33747
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd.h      |  3 +++
 fs/lustre/osc/osc_internal.h |  2 +-
 fs/lustre/osc/osc_quota.c    | 11 ++++++++++-
 fs/lustre/osc/osc_request.c  |  3 ++-
 4 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index bf0bf97..ff94092 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -344,8 +344,11 @@ struct client_obd {
 	/* ptlrpc work for writeback in ptlrpcd context */
 	void			*cl_writeback_work;
 	void			*cl_lru_work;
+	struct mutex		cl_quota_mutex;
 	/* hash tables for osc_quota_info */
 	struct rhashtable	cl_quota_hash[MAXQUOTAS];
+	/* the xid of the request updating the hash tables */
+	u64			cl_quota_last_xid;
 	/* Links to the global list of registered changelog devices */
 	struct list_head	cl_chg_dev_linkage;
 };
diff --git a/fs/lustre/osc/osc_internal.h b/fs/lustre/osc/osc_internal.h
index 0f0f4d4..6f71d8d 100644
--- a/fs/lustre/osc/osc_internal.h
+++ b/fs/lustre/osc/osc_internal.h
@@ -136,7 +136,7 @@ static inline char *cli_name(struct client_obd *cli)
 
 int osc_quota_setup(struct obd_device *obd);
 int osc_quota_cleanup(struct obd_device *obd);
-int osc_quota_setdq(struct client_obd *cli, const unsigned int qid[],
+int osc_quota_setdq(struct client_obd *cli, u64 xid, const unsigned int qid[],
 		    u32 valid, u32 flags);
 int osc_quota_chkdq(struct client_obd *cli, const unsigned int qid[]);
 int osc_quotactl(struct obd_device *unused, struct obd_export *exp,
diff --git a/fs/lustre/osc/osc_quota.c b/fs/lustre/osc/osc_quota.c
index cb5ddef..316e087 100644
--- a/fs/lustre/osc/osc_quota.c
+++ b/fs/lustre/osc/osc_quota.c
@@ -109,7 +109,7 @@ static inline u32 fl_quota_flag(int qtype)
 	}
 }
 
-int osc_quota_setdq(struct client_obd *cli, const unsigned int qid[],
+int osc_quota_setdq(struct client_obd *cli, u64 xid, const unsigned int qid[],
 		    u32 valid, u32 flags)
 {
 	int type;
@@ -118,6 +118,11 @@ int osc_quota_setdq(struct client_obd *cli, const unsigned int qid[],
 	if ((valid & (OBD_MD_FLALLQUOTA)) == 0)
 		return 0;
 
+	mutex_lock(&cli->cl_quota_mutex);
+	if (cli->cl_quota_last_xid > xid)
+		goto out_unlock;
+
+	cli->cl_quota_last_xid = xid;
 	for (type = 0; type < MAXQUOTAS; type++) {
 		struct osc_quota_info *oqi;
 
@@ -175,6 +180,8 @@ int osc_quota_setdq(struct client_obd *cli, const unsigned int qid[],
 		}
 	}
 
+out_unlock:
+	mutex_unlock(&cli->cl_quota_mutex);
 	return rc;
 }
 
@@ -191,6 +198,8 @@ int osc_quota_setup(struct obd_device *obd)
 	struct client_obd *cli = &obd->u.cli;
 	int i, type;
 
+	mutex_init(&cli->cl_quota_mutex);
+
 	for (type = 0; type < MAXQUOTAS; type++) {
 		if (rhashtable_init(&cli->cl_quota_hash[type],
 				    &quota_hash_params) != 0)
diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index 14180a4..dca141f 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -1753,7 +1753,8 @@ static int osc_brw_fini_request(struct ptlrpc_request *req, int rc)
 		       "setdq for [%u %u %u] with valid %#llx, flags %x\n",
 		       body->oa.o_uid, body->oa.o_gid, body->oa.o_projid,
 		       body->oa.o_valid, body->oa.o_flags);
-		osc_quota_setdq(cli, qid, body->oa.o_valid, body->oa.o_flags);
+		osc_quota_setdq(cli, req->rq_xid, qid, body->oa.o_valid,
+				body->oa.o_flags);
 	}
 
 	osc_update_grant(cli, body);
-- 
1.8.3.1


* [lustre-devel] [PATCH 236/622] lustre: osc: pass client page size during reconnect too
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (234 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 235/622] lustre: quota: protect quota flags at OSC James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 237/622] lustre: ptlrpc: Change static defines to use macro for sec_gc.c James Simmons
                   ` (386 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

The client page size is reported to the server in ocd_grant_blkbits,
and the server returns its device blocksize in the same field. During
reconnect, ocd_grant_blkbits therefore still holds the server device
blocksize, which the server wrongly takes as the client page size.

The patch resets ocd_grant_blkbits to the client page size during
reconnect so the server gets the expected information.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11752
Lustre-commit: 5bec8f95cc10 ("LU-11752 osc: pass client page size during reconnect too")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33847
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/osc_request.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index dca141f..a7e4f7a 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -3003,10 +3003,13 @@ int osc_reconnect(const struct lu_env *env, struct obd_export *exp,
 
 		spin_lock(&cli->cl_loi_list_lock);
 		grant = cli->cl_avail_grant + cli->cl_reserved_grant;
-		if (data->ocd_connect_flags & OBD_CONNECT_GRANT_PARAM)
+		if (data->ocd_connect_flags & OBD_CONNECT_GRANT_PARAM) {
+			/* restore ocd_grant_blkbits as client page bits */
+			data->ocd_grant_blkbits = PAGE_SHIFT;
 			grant += cli->cl_dirty_grant;
-		else
+		} else {
 			grant += cli->cl_dirty_pages << PAGE_SHIFT;
+		}
 		data->ocd_grant = grant ? : 2 * cli_brw_size(obd);
 		lost_grant = cli->cl_lost_grant;
 		cli->cl_lost_grant = 0;
-- 
1.8.3.1


* [lustre-devel] [PATCH 237/622] lustre: ptlrpc: Change static defines to use macro for sec_gc.c
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (235 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 236/622] lustre: osc: pass client page size during reconnect too James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 238/622] lnet: libcfs: do not calculate debug_mb if it is set James Simmons
                   ` (385 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Arshad Hussain <arshad.super@gmail.com>

This patch replaces the mutexes, locks, and wait queues that are
defined statically in fs/lustre/ptlrpc/sec_gc.c with the
kernel-provided initializer macros.
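
A userspace analogue of the change (not the kernel API itself):
DEFINE_MUTEX()/DEFINE_SPINLOCK() yield a lock that is valid from load
time, the same way PTHREAD_MUTEX_INITIALIZER removes the need for a
pthread_mutex_init() call in an init function. The lock name here
mirrors the one in the patch; the helper function is hypothetical.

```c
#include <assert.h>
#include <pthread.h>

/* Statically initialized: usable before any init function runs,
 * matching the effect of DEFINE_MUTEX(sec_gc_mutex) in the patch. */
static pthread_mutex_t sec_gc_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical critical section: returns 0 on success. */
static int gc_do_locked(void)
{
	if (pthread_mutex_lock(&sec_gc_mutex) != 0)
		return -1;
	/* ... garbage-collection work would go here ... */
	return pthread_mutex_unlock(&sec_gc_mutex);
}
```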

WC-bug-id: https://jira.whamcloud.com/browse/LU-9010
Lustre-commit: 50c01e02506f ("LU-9010 ptlrpc: Change static defines to use macro for sec_gc.c")
Signed-off-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-on: https://review.whamcloud.com/33937
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/sec_gc.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/fs/lustre/ptlrpc/sec_gc.c b/fs/lustre/ptlrpc/sec_gc.c
index d5edcec..3baed8c 100644
--- a/fs/lustre/ptlrpc/sec_gc.c
+++ b/fs/lustre/ptlrpc/sec_gc.c
@@ -48,12 +48,12 @@
 
 #define SEC_GC_INTERVAL (30 * 60)
 
-static struct mutex sec_gc_mutex;
+static DEFINE_MUTEX(sec_gc_mutex);
 static LIST_HEAD(sec_gc_list);
-static spinlock_t sec_gc_list_lock;
+static DEFINE_SPINLOCK(sec_gc_list_lock);
 
 static LIST_HEAD(sec_gc_ctx_list);
-static spinlock_t sec_gc_ctx_list_lock;
+static DEFINE_SPINLOCK(sec_gc_ctx_list_lock);
 
 static atomic_t sec_gc_wait_del = ATOMIC_INIT(0);
 
@@ -176,10 +176,6 @@ static void sec_gc_main(struct work_struct *ws)
 
 int sptlrpc_gc_init(void)
 {
-	mutex_init(&sec_gc_mutex);
-	spin_lock_init(&sec_gc_list_lock);
-	spin_lock_init(&sec_gc_ctx_list_lock);
-
 	schedule_delayed_work(&sec_gc_work, 0);
 	return 0;
 }
-- 
1.8.3.1


* [lustre-devel] [PATCH 238/622] lnet: libcfs: do not calculate debug_mb if it is set
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (236 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 237/622] lustre: ptlrpc: Change static defines to use macro for sec_gc.c James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 239/622] lustre: ldlm: Lost lease lock on migrate error James Simmons
                   ` (384 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Vladimir Saveliev <c17830@cray.com>

debug_mb is a libcfs module parameter. It should be possible to set it
via

modprobe libcfs libcfs_debug_mb=800

or via adding

options libcfs libcfs_debug_mb=800

to the module configuration.
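
A minimal model of the fix (illustrative; the function name is made up):
the module parameter, when set by the administrator, wins, and the size
computed from available memory is only a fallback used when the
parameter was left at its default of 0.

```c
#include <assert.h>

/* param_mb models libcfs_debug_mb as set via modprobe; computed_mb
 * models the value cfs_trace_get_debug_mb() would derive. */
static int effective_debug_mb(int param_mb, int computed_mb)
{
	return param_mb != 0 ? param_mb : computed_mb;
}
```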

Fixes: 0871d551af ("staging/lustre/libcfs: move /proc/sys/lnet to debugfs")
WC-bug-id: https://jira.whamcloud.com/browse/LU-11898
Lustre-commit: adeb29400a4a ("LU-11898 libcfs: do not calculate debug_mb if it is set")
Signed-off-by: Vladimir Saveliev <c17830@cray.com>
Cray-bug-id: LUS-6936
Reviewed-on: https://review.whamcloud.com/34128
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alexander Zarochentsev <c17826@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/libcfs/debug.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/lnet/libcfs/debug.c b/net/lnet/libcfs/debug.c
index 88c4c36..c6b92df 100644
--- a/net/lnet/libcfs/debug.c
+++ b/net/lnet/libcfs/debug.c
@@ -553,7 +553,8 @@ int libcfs_debug_init(unsigned long bufsize)
 
 	libcfs_register_panic_notifier();
 	kernel_param_lock(THIS_MODULE);
-	libcfs_debug_mb = cfs_trace_get_debug_mb();
+	if (libcfs_debug_mb == 0)
+		libcfs_debug_mb = cfs_trace_get_debug_mb();
 	kernel_param_unlock(THIS_MODULE);
 	return rc;
 }
-- 
1.8.3.1


* [lustre-devel] [PATCH 239/622] lustre: ldlm: Lost lease lock on migrate error
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (237 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 238/622] lnet: libcfs: do not calculate debug_mb if it is set James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 240/622] lnet: lnd: increase CQ entries James Simmons
                   ` (383 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Andriy Skulysh <c17819@cray.com>

All the file operations have the following locking order - parent,
child. If a lock for a child is returned to the client, the following
operations on this file are done by the child fid.

However, migrate is an exception: it takes the lease lock first and
takes the PW parent lock next during the MDS_REINT.

At the same time, a parallel racing operation (such as open) that has
taken a lock on the parent (conflicting with the next MDS_REINT) and
is trying to take a lock on the child is blocked until the lease
cancel arrives.

The lease cancel is piggy-backed on the MDS_REINT RPC and is handled
at the end of the operation, trying to take the conflicting parent lock
first - thus a deadlock occurs.

At the same time, the lease lock is not supposed to block anything;
it is just an indicator to the server that no other conflicting
operation has occurred during the migration. Thus, set
LDLM_FL_CANCEL_ON_BLOCK on it so that the conflicting operation
is not blocked.

In this case, the MDS_REINT will return -EAGAIN as the lease
is cancelled and the client will retry its migration.

Cray-bug-id: LUS-6811
WC-bug-id: https://jira.whamcloud.com/browse/LU-11926
Lustre-commit: ae7ca90713b4 ("LU-11926 ldlm: Lost lease lock on migrate error")
Signed-off-by: Andriy Skulysh <c17819@cray.com>
Reviewed-on: https://review.whamcloud.com/34182
Reviewed-by: Vitaly Fertman <c17818@cray.com>
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_support.h | 1 +
 fs/lustre/ldlm/ldlm_lockd.c     | 3 ---
 fs/lustre/ldlm/ldlm_request.c   | 4 ++++
 fs/lustre/llite/file.c          | 4 +++-
 4 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index 39547a0..a60fa07 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -302,6 +302,7 @@
 #define OBD_FAIL_LDLM_CP_CB_WAIT5			0x323
 
 #define OBD_FAIL_LDLM_GRANT_CHECK			0x32a
+#define OBD_FAIL_LDLM_LOCAL_CANCEL_PAUSE		0x32c
 
 /* LOCKLESS IO */
 #define OBD_FAIL_LDLM_SET_CONTENTION			0x385
diff --git a/fs/lustre/ldlm/ldlm_lockd.c b/fs/lustre/ldlm/ldlm_lockd.c
index db0da99..ea146aa 100644
--- a/fs/lustre/ldlm/ldlm_lockd.c
+++ b/fs/lustre/ldlm/ldlm_lockd.c
@@ -149,9 +149,6 @@ void ldlm_handle_bl_callback(struct ldlm_namespace *ns,
 	}
 	ldlm_set_cbpending(lock);
 
-	if (ldlm_is_cancel_on_block(lock))
-		ldlm_set_cancel(lock);
-
 	do_ast = !lock->l_readers && !lock->l_writers;
 	unlock_res_and_lock(lock);
 
diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index 7c3935f..fb564f4 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -1293,6 +1293,10 @@ int ldlm_cli_cancel(const struct lustre_handle *lockh,
 	ldlm_set_canceling(lock);
 	unlock_res_and_lock(lock);
 
+	if (cancel_flags & LCF_LOCAL)
+		OBD_FAIL_TIMEOUT(OBD_FAIL_LDLM_LOCAL_CANCEL_PAUSE,
+				 cfs_fail_val);
+
 	rc = ldlm_cli_cancel_local(lock);
 	if (rc == LDLM_FL_LOCAL_ONLY || cancel_flags & LCF_LOCAL) {
 		LDLM_LOCK_RELEASE(lock);
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 4560ae0..7ec1099 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -3934,7 +3934,9 @@ int ll_migrate(struct inode *parent, struct file *file, struct lmv_user_md *lum,
 	if (!rc) {
 		LASSERT(request);
 		ll_update_times(request, parent);
+	}
 
+	if (rc == 0 || rc == -EAGAIN) {
 		body = req_capsule_server_get(&request->rq_pill, &RMF_MDT_BODY);
 		LASSERT(body);
 
@@ -3957,7 +3959,7 @@ int ll_migrate(struct inode *parent, struct file *file, struct lmv_user_md *lum,
 		request = NULL;
 	}
 
-	/* Try again if the file layout has changed. */
+	/* Try again if the lease has cancelled. */
 	if (rc == -EAGAIN && S_ISREG(child_inode->i_mode))
 		goto again;
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 240/622] lnet: lnd: increase CQ entries
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (238 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 239/622] lustre: ldlm: Lost lease lock on migrate error James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 241/622] lustre: security: return security context for metadata ops James Simmons
                   ` (382 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Several sites have reported RDMA timeouts. Most of the timeouts
are occurring for transmits on the active_tx queue. Transmits are
placed on the active_tx queue until a completion is received. If
there aren't enough CQ entries available, completion events can be
delayed, causing these timeouts.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12065
Lustre-commit: bf3fc7f1a7bf ("LU-12065 lnd: increase CQ entries")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-on: https://review.whamcloud.com/34473
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/o2iblnd/o2iblnd.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.h b/net/lnet/klnds/o2iblnd/o2iblnd.h
index 999b58d..44f1d84 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.h
@@ -136,8 +136,7 @@ struct kib_tunables {
 /* WRs and CQEs (per connection) */
 #define IBLND_RECV_WRS(c)	IBLND_RX_MSGS(c)
 
-#define IBLND_CQ_ENTRIES(c)	\
-	(IBLND_RECV_WRS(c) + 2 * c->ibc_queue_depth)
+#define IBLND_CQ_ENTRIES(c)	(IBLND_RECV_WRS(c) + kiblnd_send_wrs(c))
 
 struct kib_hca_dev;
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 241/622] lustre: security: return security context for metadata ops
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (239 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 240/622] lnet: lnd: increase CQ entries James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 242/622] lustre: grant: prevent overflow of o_undirty James Simmons
                   ` (381 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Bruno Faccini <bruno.faccini@intel.com>

Security layer needs to fetch security context of files/dirs
upon metadata ops like lookup, getattr, open, truncate, and
layout, for its own purpose and control checks.
Retrieving the security context consists in a getxattr operation
at the file system level. The fact that the requested metadata
operation and the getxattr are not atomic can create a window
for a dead-lock situation where, based on some access patterns,
all MDT service threads can become stuck waiting for lookup lock
to be released and thus unable to serve getxattr for security context.
Another problem is that sending an additional getxattr request for
every metadata op hurts performance.

This patch achieves atomicity by having the MDT return the security
context in the granted lock reply, sparing the client an additional
getxattr request.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9193
Lustre-commit: fca35f74f9ec ("LU-9193 security: return security context for metadata ops")
Signed-off-by: Bruno Faccini <bruno.faccini@intel.com>
Signed-off-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-on: https://review.whamcloud.com/26831
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd.h                |  3 +-
 fs/lustre/llite/llite_internal.h       |  3 ++
 fs/lustre/llite/namei.c                | 60 +++++++++++++++++++++++++++++++--
 fs/lustre/llite/xattr_security.c       | 19 +++++++++++
 fs/lustre/lmv/lmv_intent.c             | 21 ++++++++++--
 fs/lustre/mdc/mdc_locks.c              | 61 +++++++++++++++++++++++++++++++++-
 fs/lustre/mdc/mdc_request.c            |  2 ++
 fs/lustre/ptlrpc/layout.c              |  9 +++--
 include/uapi/linux/lustre/lustre_idl.h |  1 +
 9 files changed, 169 insertions(+), 10 deletions(-)

diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index ff94092..758efc1 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -778,8 +778,9 @@ struct md_op_data {
 	u64			op_data_version;
 	struct lustre_handle	op_lease_handle;
 
-	/* File security context, for creates. */
+	/* File security context, for creates/metadata ops */
 	const char	       *op_file_secctx_name;
+	u32			op_file_secctx_name_size;
 	void		       *op_file_secctx;
 	u32			op_file_secctx_size;
 
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index d41531b..3c81c3b 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -279,6 +279,9 @@ int ll_dentry_init_security(struct dentry *dentry, int mode, struct qstr *name,
 int ll_inode_init_security(struct dentry *dentry, struct inode *inode,
 			   struct inode *dir);
 
+int ll_listsecurity(struct inode *inode, char *secctx_name,
+		    size_t secctx_name_size);
+
 /*
  * Locking to guarantee consistency of non-atomic updates to long long i_size,
  * consistency between file size and KMS.
diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index e410ff0..ee3ce70 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -592,7 +592,8 @@ struct dentry *ll_splice_alias(struct inode *inode, struct dentry *de)
 
 static int ll_lookup_it_finish(struct ptlrpc_request *request,
 			       struct lookup_intent *it,
-			       struct inode *parent, struct dentry **de)
+			       struct inode *parent, struct dentry **de,
+			       void *secctx, u32 secctxlen)
 {
 	struct inode *inode = NULL;
 	u64 bits = 0;
@@ -605,6 +606,10 @@ static int ll_lookup_it_finish(struct ptlrpc_request *request,
 	CDEBUG(D_DENTRY, "it %p it_disposition %x\n", it,
 	       it->it_disposition);
 	if (!it_disposition(it, DISP_LOOKUP_NEG)) {
+		struct req_capsule *pill = &request->rq_pill;
+		struct mdt_body *body = req_capsule_server_get(pill,
+							       &RMF_MDT_BODY);
+
 		rc = ll_prep_inode(&inode, request, (*de)->d_sb, it);
 		if (rc)
 			return rc;
@@ -623,6 +628,32 @@ static int ll_lookup_it_finish(struct ptlrpc_request *request,
 		 * ll_glimpse_size or some equivalent themselves anyway.
 		 * Also see bug 7198.
 		 */
+
+		/* If security context was returned by MDT, put it in
+		 * inode now to save an extra getxattr from security hooks,
+		 * and avoid deadlock.
+		 */
+		if (body->mbo_valid & OBD_MD_SECCTX) {
+			secctx = req_capsule_server_get(pill, &RMF_FILE_SECCTX);
+			secctxlen = req_capsule_get_size(pill,
+							 &RMF_FILE_SECCTX,
+							 RCL_SERVER);
+
+			if (secctxlen)
+				CDEBUG(D_SEC,
+				       "server returned security context for " DFID "\n",
+				       PFID(ll_inode2fid(inode)));
+		}
+
+		if (secctx && secctxlen != 0) {
+			inode_lock(inode);
+			rc = security_inode_notifysecctx(inode, secctx,
+							 secctxlen);
+			inode_unlock(inode);
+			if (rc)
+				CWARN("cannot set security context for " DFID ": rc = %d\n",
+				      PFID(ll_inode2fid(inode)), rc);
+		}
 	}
 
 	alias = ll_splice_alias(inode, *de);
@@ -680,6 +711,7 @@ static struct dentry *ll_lookup_it(struct inode *parent, struct dentry *dentry,
 	struct dentry *save = dentry, *retval;
 	struct ptlrpc_request *req = NULL;
 	struct md_op_data *op_data = NULL;
+	char secctx_name[XATTR_NAME_MAX + 1];
 	struct inode *inode;
 	u32 opc;
 	int rc;
@@ -742,6 +774,28 @@ static struct dentry *ll_lookup_it(struct inode *parent, struct dentry *dentry,
 			*secctx = op_data->op_file_secctx;
 		if (secctxlen)
 			*secctxlen = op_data->op_file_secctx_size;
+	} else {
+		if (secctx)
+			*secctx = NULL;
+		if (secctxlen)
+			*secctxlen = 0;
+	}
+
+	/* ask for security context upon intent */
+	if (it->it_op & (IT_LOOKUP | IT_GETATTR | IT_OPEN)) {
+		/* get name of security xattr to request to server */
+		rc = ll_listsecurity(parent, secctx_name,
+				     sizeof(secctx_name));
+		if (rc < 0) {
+			CDEBUG(D_SEC,
+			       "cannot get security xattr name for " DFID ": rc = %d\n",
+			       PFID(ll_inode2fid(parent)), rc);
+		} else if (rc > 0) {
+			op_data->op_file_secctx_name = secctx_name;
+			op_data->op_file_secctx_name_size = rc;
+			CDEBUG(D_SEC, "'%.*s' is security xattr for " DFID "\n",
+			       rc, secctx_name, PFID(ll_inode2fid(parent)));
+		}
 	}
 
 	rc = md_intent_lock(ll_i2mdexp(parent), op_data, it, &req,
@@ -783,7 +837,9 @@ static struct dentry *ll_lookup_it(struct inode *parent, struct dentry *dentry,
 
 	/* dir layout may change */
 	ll_unlock_md_op_lsm(op_data);
-	rc = ll_lookup_it_finish(req, it, parent, &dentry);
+	rc = ll_lookup_it_finish(req, it, parent, &dentry,
+				 secctx ? *secctx : NULL,
+				 secctxlen ? *secctxlen : 0);
 	if (rc != 0) {
 		ll_intent_release(it);
 		retval = ERR_PTR(rc);
diff --git a/fs/lustre/llite/xattr_security.c b/fs/lustre/llite/xattr_security.c
index e5a52d9..e4fb64a 100644
--- a/fs/lustre/llite/xattr_security.c
+++ b/fs/lustre/llite/xattr_security.c
@@ -132,3 +132,22 @@ int ll_dentry_init_security(struct dentry *dentry, int mode, struct qstr *name,
 		return 0;
 	return err;
 }
+
+/**
+ * Get security context xattr name used by policy.
+ *
+ * \retval >= 0     length of xattr name
+ * \retval < 0      failure to get security context xattr name
+ */
+int
+ll_listsecurity(struct inode *inode, char *secctx_name, size_t secctx_name_size)
+{
+	int rc;
+
+	rc = security_inode_listsecurity(inode, secctx_name, secctx_name_size);
+	if (rc >= secctx_name_size)
+		rc = -ERANGE;
+	else if (rc >= 0)
+		secctx_name[rc] = '\0';
+	return rc;
+}
diff --git a/fs/lustre/lmv/lmv_intent.c b/fs/lustre/lmv/lmv_intent.c
index 6933f7d..45f1ac5 100644
--- a/fs/lustre/lmv/lmv_intent.c
+++ b/fs/lustre/lmv/lmv_intent.c
@@ -52,7 +52,8 @@ static int lmv_intent_remote(struct obd_export *exp, struct lookup_intent *it,
 			     const struct lu_fid *parent_fid,
 			     struct ptlrpc_request **reqp,
 			     ldlm_blocking_callback cb_blocking,
-			     u64 extra_lock_flags)
+			     u64 extra_lock_flags,
+			     const char *secctx_name, u32 secctx_name_size)
 {
 	struct obd_device *obd = exp->exp_obd;
 	struct lmv_obd *lmv = &obd->u.lmv;
@@ -109,6 +110,16 @@ static int lmv_intent_remote(struct obd_export *exp, struct lookup_intent *it,
 	CDEBUG(D_INODE, "REMOTE_INTENT with fid=" DFID " -> mds #%u\n",
 	       PFID(&body->mbo_fid1), tgt->ltd_idx);
 
+	/* ask for security context upon intent */
+	if (it->it_op & (IT_LOOKUP | IT_GETATTR | IT_OPEN) &&
+	    secctx_name_size != 0 && secctx_name) {
+		op_data->op_file_secctx_name = secctx_name;
+		op_data->op_file_secctx_name_size = secctx_name_size;
+		CDEBUG(D_SEC,
+		       "'%.*s' is security xattr to fetch for " DFID "\n",
+		       secctx_name_size, secctx_name, PFID(&body->mbo_fid1));
+	}
+
 	rc = md_intent_lock(tgt->ltd_exp, op_data, it, &req, cb_blocking,
 			    extra_lock_flags);
 	if (rc)
@@ -385,7 +396,9 @@ static int lmv_intent_open(struct obd_export *exp, struct md_op_data *op_data,
 	/* Not cross-ref case, just get out of here. */
 	if (unlikely((body->mbo_valid & OBD_MD_MDS))) {
 		rc = lmv_intent_remote(exp, it, &op_data->op_fid1, reqp,
-				       cb_blocking, extra_lock_flags);
+				       cb_blocking, extra_lock_flags,
+				       op_data->op_file_secctx_name,
+				       op_data->op_file_secctx_name_size);
 		if (rc != 0)
 			return rc;
 
@@ -471,7 +484,9 @@ static int lmv_intent_lookup(struct obd_export *exp,
 	/* Not cross-ref case, just get out of here. */
 	if (unlikely((body->mbo_valid & OBD_MD_MDS))) {
 		rc = lmv_intent_remote(exp, it, NULL, reqp, cb_blocking,
-				       extra_lock_flags);
+				       extra_lock_flags,
+				       op_data->op_file_secctx_name,
+				       op_data->op_file_secctx_name_size);
 		if (rc != 0)
 			return rc;
 		body = req_capsule_server_get(&(*reqp)->rq_pill, &RMF_MDT_BODY);
diff --git a/fs/lustre/mdc/mdc_locks.c b/fs/lustre/mdc/mdc_locks.c
index 55de559..6f4baa6 100644
--- a/fs/lustre/mdc/mdc_locks.c
+++ b/fs/lustre/mdc/mdc_locks.c
@@ -310,7 +310,7 @@ static int mdc_save_lovea(struct ptlrpc_request *req,
 
 	req_capsule_set_size(&req->rq_pill, &RMF_FILE_SECCTX_NAME,
 			     RCL_CLIENT, op_data->op_file_secctx_name ?
-			     strlen(op_data->op_file_secctx_name) + 1 : 0);
+			     op_data->op_file_secctx_name_size : 0);
 
 	req_capsule_set_size(&req->rq_pill, &RMF_FILE_SECCTX, RCL_CLIENT,
 			     op_data->op_file_secctx_size);
@@ -337,6 +337,30 @@ static int mdc_save_lovea(struct ptlrpc_request *req,
 			     obddev->u.cli.cl_max_mds_easize);
 	req_capsule_set_size(&req->rq_pill, &RMF_ACL, RCL_SERVER, acl_bufsize);
 
+	if (!(it->it_op & IT_CREAT) && it->it_op & IT_OPEN &&
+	    req_capsule_has_field(&req->rq_pill, &RMF_FILE_SECCTX_NAME,
+				  RCL_CLIENT) &&
+	    op_data->op_file_secctx_name_size > 0 &&
+	    op_data->op_file_secctx_name) {
+		char *secctx_name;
+
+		secctx_name = req_capsule_client_get(&req->rq_pill,
+						     &RMF_FILE_SECCTX_NAME);
+		memcpy(secctx_name, op_data->op_file_secctx_name,
+		       op_data->op_file_secctx_name_size);
+		req_capsule_set_size(&req->rq_pill, &RMF_FILE_SECCTX,
+				     RCL_SERVER,
+				     obddev->u.cli.cl_max_mds_easize);
+
+		CDEBUG(D_SEC, "packed '%.*s' as security xattr name\n",
+		       op_data->op_file_secctx_name_size,
+		       op_data->op_file_secctx_name);
+
+	} else {
+		req_capsule_set_size(&req->rq_pill, &RMF_FILE_SECCTX,
+				     RCL_SERVER, 0);
+	}
+
 	/**
 	 * Inline buffer for possible data from Data-on-MDT files.
 	 */
@@ -407,6 +431,8 @@ static int mdc_save_lovea(struct ptlrpc_request *req,
 	/* pack the intent */
 	lit = req_capsule_client_get(&req->rq_pill, &RMF_LDLM_INTENT);
 	lit->opc = IT_GETXATTR;
+	CDEBUG(D_INFO, "%s: get xattrs for " DFID "\n",
+	       exp->exp_obd->obd_name, PFID(&op_data->op_fid1));
 
 	/* If the supplied buffer is too small then the server will
 	 * return -ERANGE and llite will fallback to using non cached
@@ -454,12 +480,25 @@ static int mdc_save_lovea(struct ptlrpc_request *req,
 	struct ldlm_intent *lit;
 	int rc;
 	u32 easize;
+	bool have_secctx = false;
 
 	req = ptlrpc_request_alloc(class_exp2cliimp(exp),
 				   &RQF_LDLM_INTENT_GETATTR);
 	if (!req)
 		return ERR_PTR(-ENOMEM);
 
+	/* send name of security xattr to get upon intent */
+	if (it->it_op & (IT_LOOKUP | IT_GETATTR) &&
+	    req_capsule_has_field(&req->rq_pill, &RMF_FILE_SECCTX_NAME,
+				  RCL_CLIENT) &&
+	    op_data->op_file_secctx_name_size > 0 &&
+	    op_data->op_file_secctx_name) {
+		have_secctx = true;
+		req_capsule_set_size(&req->rq_pill, &RMF_FILE_SECCTX_NAME,
+				     RCL_CLIENT,
+				     op_data->op_file_secctx_name_size);
+	}
+
 	req_capsule_set_size(&req->rq_pill, &RMF_NAME, RCL_CLIENT,
 			     op_data->op_namelen + 1);
 
@@ -483,6 +522,26 @@ static int mdc_save_lovea(struct ptlrpc_request *req,
 
 	req_capsule_set_size(&req->rq_pill, &RMF_MDT_MD, RCL_SERVER, easize);
 	req_capsule_set_size(&req->rq_pill, &RMF_ACL, RCL_SERVER, acl_bufsize);
+
+	if (have_secctx) {
+		char *secctx_name;
+
+		secctx_name = req_capsule_client_get(&req->rq_pill,
+						     &RMF_FILE_SECCTX_NAME);
+		memcpy(secctx_name, op_data->op_file_secctx_name,
+		       op_data->op_file_secctx_name_size);
+
+		req_capsule_set_size(&req->rq_pill, &RMF_FILE_SECCTX,
+				     RCL_SERVER, easize);
+
+		CDEBUG(D_SEC, "packed '%.*s' as security xattr name\n",
+		       op_data->op_file_secctx_name_size,
+		       op_data->op_file_secctx_name);
+	} else {
+		req_capsule_set_size(&req->rq_pill, &RMF_FILE_SECCTX,
+				     RCL_SERVER, 0);
+	}
+
 	ptlrpc_request_set_replen(req);
 	return req;
 }
diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index c08a6ee..88e790f0 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -439,6 +439,8 @@ static int mdc_getxattr(struct obd_export *exp, const struct lu_fid *fid,
 	LASSERT(obd_md_valid == OBD_MD_FLXATTR ||
 		obd_md_valid == OBD_MD_FLXATTRLS);
 
+	CDEBUG(D_INFO, "%s: get xattr '%s' for " DFID "\n",
+	       exp->exp_obd->obd_name, name, PFID(fid));
 	rc = mdc_xattr_common(exp, &RQF_MDS_GETXATTR, fid, MDS_GETXATTR,
 			      obd_md_valid, name, NULL, 0, buf_size, 0, -1,
 			      req);
diff --git a/fs/lustre/ptlrpc/layout.c b/fs/lustre/ptlrpc/layout.c
index 2e74ae1b..1dd18b9 100644
--- a/fs/lustre/ptlrpc/layout.c
+++ b/fs/lustre/ptlrpc/layout.c
@@ -417,6 +417,7 @@
 	&RMF_CAPA1,
 	&RMF_CAPA2,
 	&RMF_NIOBUF_INLINE,
+	&RMF_FILE_SECCTX
 };
 
 static const struct req_msg_field *ldlm_intent_getattr_client[] = {
@@ -425,7 +426,8 @@
 	&RMF_LDLM_INTENT,
 	&RMF_MDT_BODY,     /* coincides with mds_getattr_name_client[] */
 	&RMF_CAPA1,
-	&RMF_NAME
+	&RMF_NAME,
+	&RMF_FILE_SECCTX_NAME
 };
 
 static const struct req_msg_field *ldlm_intent_getattr_server[] = {
@@ -434,7 +436,8 @@
 	&RMF_MDT_BODY,
 	&RMF_MDT_MD,
 	&RMF_ACL,
-	&RMF_CAPA1
+	&RMF_CAPA1,
+	&RMF_FILE_SECCTX
 };
 
 static const struct req_msg_field *ldlm_intent_create_client[] = {
@@ -935,7 +938,7 @@ struct req_msg_field RMF_FILE_SECCTX_NAME =
 EXPORT_SYMBOL(RMF_FILE_SECCTX_NAME);
 
 struct req_msg_field RMF_FILE_SECCTX =
-	DEFINE_MSGF("file_secctx", 0, -1, NULL, NULL);
+	DEFINE_MSGF("file_secctx", RMF_F_NO_SIZE_CHECK, -1, NULL, NULL);
 EXPORT_SYMBOL(RMF_FILE_SECCTX);
 
 struct req_msg_field RMF_LLOGD_BODY =
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 76068ee..1a1b6c6 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -1198,6 +1198,7 @@ static inline __u32 lov_mds_md_size(__u16 stripes, __u32 lmm_magic)
 #define OBD_MD_DEFAULT_MEA	(0x0040000000000000ULL) /* default MEA */
 #define OBD_MD_FLOSTLAYOUT	(0x0080000000000000ULL)	/* contain ost_layout */
 #define OBD_MD_FLPROJID		(0x0100000000000000ULL) /* project ID */
+#define OBD_MD_SECCTX        (0x0200000000000000ULL) /* embed security xattr */
 
 #define OBD_MD_FLALLQUOTA (OBD_MD_FLUSRQUOTA | \
 			   OBD_MD_FLGRPQUOTA | \
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 242/622] lustre: grant: prevent overflow of o_undirty
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (240 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 241/622] lustre: security: return security context for metadata ops James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 243/622] lustre: ptlrpc: manage SELinux policy info at connect time James Simmons
                   ` (380 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Alexey Zhuravlev <bzzz@whamcloud.com>

On the server side, tgt_grant_inflate() returns a u64 which, if
tgd_blockbits and val are large enough, can be a value >= 2^32.
tgt_grant_incoming() assigns the returned value to oa->o_undirty.
Since o_undirty is a u32, it can overflow.

This occurs with Lustre clients < 2.10 and a ZFS backend when the
ZFS "recordsize" is larger than 128k (the default).

In tgt_grant_inflate(), check the returned value and prevent o_undirty
from being assigned a value greater than 2^30.

On the osc client side, use PTLRPC_MAX_BRW_SIZE to prevent o_undirty
overflow.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11798
Lustre-commit: d6f521916211 ("LU-11798 grant: prevent overflow of o_undirty")
Signed-off-by: Alexey Zhuravlev <bzzz@whamcloud.com>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-on: https://review.whamcloud.com/33948
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/osc_request.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index a7e4f7a..1fc50cc 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -686,8 +686,8 @@ static void osc_announce_cached(struct client_obd *cli, struct obdo *oa,
 		/* Do not ask for more than OBD_MAX_GRANT - a margin for server
 		 * to add extent tax, etc.
 		 */
-		oa->o_undirty = min(undirty, OBD_MAX_GRANT -
-				    (PTLRPC_MAX_BRW_PAGES << PAGE_SHIFT)*4UL);
+		oa->o_undirty = min(undirty, OBD_MAX_GRANT &
+				    ~(PTLRPC_MAX_BRW_SIZE * 4UL));
 	}
 	oa->o_grant = cli->cl_avail_grant + cli->cl_reserved_grant;
 	oa->o_dropped = cli->cl_lost_grant;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 243/622] lustre: ptlrpc: manage SELinux policy info at connect time
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (241 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 242/622] lustre: grant: prevent overflow of o_undirty James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 244/622] lustre: ptlrpc: manage SELinux policy info for metadata ops James Simmons
                   ` (379 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Sebastien Buisson <sbuisson@ddn.com>

At connect time, compute the SELinux policy info on the client side
and send it over the wire.
On the server side, get the SELinux policy info from the nodemap and
compare it with the one received from the client.

WC-bug-id: https://jira.whamcloud.com/browse/LU-8955
Lustre-commit: dd200e5530fd ("LU-8955 ptlrpc: manage SELinux policy info at connect time")
Signed-off-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-on: https://review.whamcloud.com/24422
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Li Dongyang <dongyangli@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_req_layout.h |  1 +
 fs/lustre/llite/llite_lib.c           |  4 ++++
 fs/lustre/ptlrpc/import.c             | 16 +++++++++++++++-
 fs/lustre/ptlrpc/layout.c             |  7 ++++++-
 4 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/include/lustre_req_layout.h b/fs/lustre/include/lustre_req_layout.h
index 36656c6..9b618fe 100644
--- a/fs/lustre/include/lustre_req_layout.h
+++ b/fs/lustre/include/lustre_req_layout.h
@@ -269,6 +269,7 @@ void req_capsule_shrink(struct req_capsule *pill,
 extern struct req_msg_field RMF_HSM_STATE_SET;
 extern struct req_msg_field RMF_MDS_HSM_CURRENT_ACTION;
 extern struct req_msg_field RMF_MDS_HSM_REQUEST;
+extern struct req_msg_field RMF_SELINUX_POL;
 
 /* seq-mgr fields */
 extern struct req_msg_field RMF_SEQ_OPC;
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 4d41981a..10d9180 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -256,6 +256,10 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 
 	obd_connect_set_secctx(data);
 
+#if defined(CONFIG_SECURITY)
+	data->ocd_connect_flags2 |= OBD_CONNECT2_SELINUX_POLICY;
+#endif
+
 	data->ocd_brw_size = MD_MAX_BRW_SIZE;
 
 	err = obd_connect(NULL, &sbi->ll_md_exp, sbi->ll_md_obd,
diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index 34a2cb0..39d9e3e 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -606,7 +606,8 @@ int ptlrpc_connect_import(struct obd_import *imp)
 			 obd2cli_tgt(imp->imp_obd),
 			 obd->obd_uuid.uuid,
 			 (char *)&imp->imp_dlm_handle,
-			 (char *)&imp->imp_connect_data };
+			 (char *)&imp->imp_connect_data,
+			 NULL };
 	struct ptlrpc_connect_async_args *aa;
 	int rc;
 
@@ -670,6 +671,19 @@ int ptlrpc_connect_import(struct obd_import *imp)
 		goto out;
 	}
 
+	/* get SELinux policy info if any */
+	rc = sptlrpc_get_sepol(request);
+	if (rc < 0) {
+		ptlrpc_request_free(request);
+		goto out;
+	}
+
+	bufs[5] = request->rq_sepol;
+
+	req_capsule_set_size(&request->rq_pill, &RMF_SELINUX_POL, RCL_CLIENT,
+			     strlen(request->rq_sepol) ?
+			     strlen(request->rq_sepol) + 1 : 0);
+
 	rc = ptlrpc_request_bufs_pack(request, LUSTRE_OBD_VERSION,
 				      imp->imp_connect_op, bufs, NULL);
 	if (rc) {
diff --git a/fs/lustre/ptlrpc/layout.c b/fs/lustre/ptlrpc/layout.c
index 1dd18b9..f80c627 100644
--- a/fs/lustre/ptlrpc/layout.c
+++ b/fs/lustre/ptlrpc/layout.c
@@ -315,7 +315,8 @@
 	&RMF_TGTUUID,
 	&RMF_CLUUID,
 	&RMF_CONN,
-	&RMF_CONNECT_DATA
+	&RMF_CONNECT_DATA,
+	&RMF_SELINUX_POL,
 };
 
 static const struct req_msg_field *obd_connect_server[] = {
@@ -1039,6 +1040,10 @@ struct req_msg_field RMF_LAYOUT_INTENT =
 		    NULL);
 EXPORT_SYMBOL(RMF_LAYOUT_INTENT);
 
+struct req_msg_field RMF_SELINUX_POL =
+	DEFINE_MSGF("selinux_pol", RMF_F_STRING, -1, NULL, NULL);
+EXPORT_SYMBOL(RMF_SELINUX_POL);
+
 /*
  * OST request field.
  */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 244/622] lustre: ptlrpc: manage SELinux policy info for metadata ops
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (242 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 243/622] lustre: ptlrpc: manage SELinux policy info at connect time James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 245/622] lustre: obd: make health_check sysfs compliant James Simmons
                   ` (378 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Sebastien Buisson <sbuisson@ddn.com>

Add SELinux policy info for the following metadata operations:
- create
- open
- unlink
- rename
- getxattr
- setxattr
- setattr
- getattr
- symlink
- hardlink

On server side, get SELinux policy info from nodemap and compare
it with the one received from client.

WC-bug-id: https://jira.whamcloud.com/browse/LU-8955
Lustre-commit: 0a773f04b288 ("LU-8955 ptlrpc: manage SELinux policy info for metadata ops")
Signed-off-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-on: https://review.whamcloud.com/24424
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Li Dongyang <dongyangli@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_req_layout.h |  2 +-
 fs/lustre/mdc/mdc_internal.h          |  1 +
 fs/lustre/mdc/mdc_lib.c               | 31 +++++++++++++++++++++++++++
 fs/lustre/mdc/mdc_locks.c             | 23 ++++++++++++++++++++
 fs/lustre/mdc/mdc_reint.c             | 40 +++++++++++++++++++++++++++++++++++
 fs/lustre/mdc/mdc_request.c           | 17 ++++++++++++---
 fs/lustre/ptlrpc/layout.c             | 32 +++++++++++++++++++---------
 7 files changed, 132 insertions(+), 14 deletions(-)

diff --git a/fs/lustre/include/lustre_req_layout.h b/fs/lustre/include/lustre_req_layout.h
index 9b618fe..378f0b6 100644
--- a/fs/lustre/include/lustre_req_layout.h
+++ b/fs/lustre/include/lustre_req_layout.h
@@ -60,7 +60,7 @@ enum req_location {
 };
 
 /* Maximal number of fields (buffers) in a request message. */
-#define REQ_MAX_FIELD_NR 10
+#define REQ_MAX_FIELD_NR 11
 
 struct req_capsule {
 	struct ptlrpc_request		*rc_req;
diff --git a/fs/lustre/mdc/mdc_internal.h b/fs/lustre/mdc/mdc_internal.h
index a5fe164..f75498a 100644
--- a/fs/lustre/mdc/mdc_internal.h
+++ b/fs/lustre/mdc/mdc_internal.h
@@ -57,6 +57,7 @@ void mdc_open_pack(struct ptlrpc_request *req, struct md_op_data *op_data,
 void mdc_file_secctx_pack(struct ptlrpc_request *req,
 			  const char *secctx_name,
 			  const void *secctx, size_t secctx_size);
+void mdc_file_sepol_pack(struct ptlrpc_request *req);
 
 void mdc_unlink_pack(struct ptlrpc_request *req, struct md_op_data *op_data);
 void mdc_link_pack(struct ptlrpc_request *req, struct md_op_data *op_data);
diff --git a/fs/lustre/mdc/mdc_lib.c b/fs/lustre/mdc/mdc_lib.c
index 00a6be4..980676a 100644
--- a/fs/lustre/mdc/mdc_lib.c
+++ b/fs/lustre/mdc/mdc_lib.c
@@ -138,6 +138,22 @@ void mdc_file_secctx_pack(struct ptlrpc_request *req, const char *secctx_name,
 	memcpy(buf, secctx, buf_size);
 }
 
+void mdc_file_sepol_pack(struct ptlrpc_request *req)
+{
+	void *buf;
+	size_t buf_size;
+
+	if (strlen(req->rq_sepol) == 0)
+		return;
+
+	buf = req_capsule_client_get(&req->rq_pill, &RMF_SELINUX_POL);
+	buf_size = req_capsule_get_size(&req->rq_pill, &RMF_SELINUX_POL,
+					RCL_CLIENT);
+
+	LASSERT(buf_size == strlen(req->rq_sepol) + 1);
+	snprintf(buf, strlen(req->rq_sepol) + 1, "%s", req->rq_sepol);
+}
+
 void mdc_readdir_pack(struct ptlrpc_request *req, u64 pgoff, size_t size,
 		      const struct lu_fid *fid)
 {
@@ -192,6 +208,9 @@ void mdc_create_pack(struct ptlrpc_request *req, struct md_op_data *op_data,
 	mdc_file_secctx_pack(req, op_data->op_file_secctx_name,
 			     op_data->op_file_secctx,
 			     op_data->op_file_secctx_size);
+
+	/* pack SELinux policy info if any */
+	mdc_file_sepol_pack(req);
 }
 
 static inline u64 mds_pack_open_flags(u64 flags)
@@ -266,6 +285,9 @@ void mdc_open_pack(struct ptlrpc_request *req, struct md_op_data *op_data,
 		mdc_file_secctx_pack(req, op_data->op_file_secctx_name,
 				     op_data->op_file_secctx,
 				     op_data->op_file_secctx_size);
+
+		/* pack SELinux policy info if any */
+		mdc_file_sepol_pack(req);
 	}
 
 	if (lmm) {
@@ -412,6 +434,9 @@ void mdc_unlink_pack(struct ptlrpc_request *req, struct md_op_data *op_data)
 	rec->ul_bias = op_data->op_bias;
 
 	mdc_pack_name(req, &RMF_NAME, op_data->op_name, op_data->op_namelen);
+
+	/* pack SELinux policy info if any */
+	mdc_file_sepol_pack(req);
 }
 
 void mdc_link_pack(struct ptlrpc_request *req, struct md_op_data *op_data)
@@ -434,6 +459,9 @@ void mdc_link_pack(struct ptlrpc_request *req, struct md_op_data *op_data)
 	rec->lk_bias = op_data->op_bias;
 
 	mdc_pack_name(req, &RMF_NAME, op_data->op_name, op_data->op_namelen);
+
+	/* pack SELinux policy info if any */
+	mdc_file_sepol_pack(req);
 }
 
 static void mdc_close_intent_pack(struct ptlrpc_request *req,
@@ -505,6 +533,9 @@ void mdc_rename_pack(struct ptlrpc_request *req, struct md_op_data *op_data,
 
 	if (new)
 		mdc_pack_name(req, &RMF_SYMTGT, new, newlen);
+
+	/* pack SELinux policy info if any */
+	mdc_file_sepol_pack(req);
 }
 
 void mdc_migrate_pack(struct ptlrpc_request *req, struct md_op_data *op_data,
diff --git a/fs/lustre/mdc/mdc_locks.c b/fs/lustre/mdc/mdc_locks.c
index 6f4baa6..05447ea 100644
--- a/fs/lustre/mdc/mdc_locks.c
+++ b/fs/lustre/mdc/mdc_locks.c
@@ -315,6 +315,16 @@ static int mdc_save_lovea(struct ptlrpc_request *req,
 	req_capsule_set_size(&req->rq_pill, &RMF_FILE_SECCTX, RCL_CLIENT,
 			     op_data->op_file_secctx_size);
 
+	/* get SELinux policy info if any */
+	rc = sptlrpc_get_sepol(req);
+	if (rc < 0) {
+		ptlrpc_request_free(req);
+		return ERR_PTR(rc);
+	}
+	req_capsule_set_size(&req->rq_pill, &RMF_SELINUX_POL, RCL_CLIENT,
+			     strlen(req->rq_sepol) ?
+			     strlen(req->rq_sepol) + 1 : 0);
+
 	rc = ldlm_prep_enqueue_req(exp, req, &cancels, count);
 	if (rc < 0) {
 		ptlrpc_request_free(req);
@@ -422,6 +432,16 @@ static int mdc_save_lovea(struct ptlrpc_request *req,
 	if (!req)
 		return ERR_PTR(-ENOMEM);
 
+	/* get SELinux policy info if any */
+	rc = sptlrpc_get_sepol(req);
+	if (rc < 0) {
+		ptlrpc_request_free(req);
+		return ERR_PTR(rc);
+	}
+	req_capsule_set_size(&req->rq_pill, &RMF_SELINUX_POL, RCL_CLIENT,
+			     strlen(req->rq_sepol) ?
+			     strlen(req->rq_sepol) + 1 : 0);
+
 	rc = ldlm_prep_enqueue_req(exp, req, &cancels, count);
 	if (rc) {
 		ptlrpc_request_free(req);
@@ -452,6 +472,9 @@ static int mdc_save_lovea(struct ptlrpc_request *req,
 	mdc_pack_body(req, &op_data->op_fid1, op_data->op_valid,
 		      ea_vals_buf_size, -1, 0);
 
+	/* get SELinux policy info if any */
+	mdc_file_sepol_pack(req);
+
 	req_capsule_set_size(&req->rq_pill, &RMF_EADATA, RCL_SERVER,
 			     GA_DEFAULT_EA_NAME_LEN * GA_DEFAULT_EA_NUM);
 
diff --git a/fs/lustre/mdc/mdc_reint.c b/fs/lustre/mdc/mdc_reint.c
index 0e5f012..86acb4e 100644
--- a/fs/lustre/mdc/mdc_reint.c
+++ b/fs/lustre/mdc/mdc_reint.c
@@ -197,6 +197,16 @@ int mdc_create(struct obd_export *exp, struct md_op_data *op_data,
 	req_capsule_set_size(&req->rq_pill, &RMF_FILE_SECCTX, RCL_CLIENT,
 			     op_data->op_file_secctx_size);
 
+	/* get SELinux policy info if any */
+	rc = sptlrpc_get_sepol(req);
+	if (rc < 0) {
+		ptlrpc_request_free(req);
+		return rc;
+	}
+	req_capsule_set_size(&req->rq_pill, &RMF_SELINUX_POL, RCL_CLIENT,
+			     strlen(req->rq_sepol) ?
+			     strlen(req->rq_sepol) + 1 : 0);
+
 	rc = mdc_prep_elc_req(exp, req, MDS_REINT, &cancels, count);
 	if (rc) {
 		ptlrpc_request_free(req);
@@ -286,6 +296,16 @@ int mdc_unlink(struct obd_export *exp, struct md_op_data *op_data,
 	req_capsule_set_size(&req->rq_pill, &RMF_NAME, RCL_CLIENT,
 			     op_data->op_namelen + 1);
 
+	/* get SELinux policy info if any */
+	rc = sptlrpc_get_sepol(req);
+	if (rc < 0) {
+		ptlrpc_request_free(req);
+		return rc;
+	}
+	req_capsule_set_size(&req->rq_pill, &RMF_SELINUX_POL, RCL_CLIENT,
+			     strlen(req->rq_sepol) ?
+			     strlen(req->rq_sepol) + 1 : 0);
+
 	rc = mdc_prep_elc_req(exp, req, MDS_REINT, &cancels, count);
 	if (rc) {
 		ptlrpc_request_free(req);
@@ -332,6 +352,16 @@ int mdc_link(struct obd_export *exp, struct md_op_data *op_data,
 	req_capsule_set_size(&req->rq_pill, &RMF_NAME, RCL_CLIENT,
 			     op_data->op_namelen + 1);
 
+	/* get SELinux policy info if any */
+	rc = sptlrpc_get_sepol(req);
+	if (rc < 0) {
+		ptlrpc_request_free(req);
+		return rc;
+	}
+	req_capsule_set_size(&req->rq_pill, &RMF_SELINUX_POL, RCL_CLIENT,
+			     strlen(req->rq_sepol) ?
+			     strlen(req->rq_sepol) + 1 : 0);
+
 	rc = mdc_prep_elc_req(exp, req, MDS_REINT, &cancels, count);
 	if (rc) {
 		ptlrpc_request_free(req);
@@ -394,6 +424,16 @@ int mdc_rename(struct obd_export *exp, struct md_op_data *op_data,
 		req_capsule_set_size(&req->rq_pill, &RMF_EADATA, RCL_CLIENT,
 				     op_data->op_data_size);
 
+	/* get SELinux policy info if any */
+	rc = sptlrpc_get_sepol(req);
+	if (rc < 0) {
+		ptlrpc_request_free(req);
+		return rc;
+	}
+	req_capsule_set_size(&req->rq_pill, &RMF_SELINUX_POL, RCL_CLIENT,
+			     strlen(req->rq_sepol) ?
+			     strlen(req->rq_sepol) + 1 : 0);
+
 	rc = mdc_prep_elc_req(exp, req, MDS_REINT, &cancels, count);
 	if (rc) {
 		ptlrpc_request_free(req);
diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index 88e790f0..80e58c8 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -328,11 +328,20 @@ static int mdc_xattr_common(struct obd_export *exp,
 		req_capsule_set_size(&req->rq_pill, &RMF_NAME, RCL_CLIENT,
 				     xattr_namelen);
 	}
-	if (input_size) {
+	if (input_size)
 		LASSERT(input);
-		req_capsule_set_size(&req->rq_pill, &RMF_EADATA, RCL_CLIENT,
-				     input_size);
+	req_capsule_set_size(&req->rq_pill, &RMF_EADATA, RCL_CLIENT,
+			     input_size);
+
+	/* get SELinux policy info if any */
+	rc = sptlrpc_get_sepol(req);
+	if (rc < 0) {
+		ptlrpc_request_free(req);
+		return rc;
 	}
+	req_capsule_set_size(&req->rq_pill, &RMF_SELINUX_POL, RCL_CLIENT,
+			     strlen(req->rq_sepol) ?
+			     strlen(req->rq_sepol) + 1 : 0);
 
 	/* Flush local XATTR locks to get rid of a possible cancel RPC */
 	if (opcode == MDS_REINT && fid_is_sane(fid) &&
@@ -393,6 +402,8 @@ static int mdc_xattr_common(struct obd_export *exp,
 		memcpy(tmp, input, input_size);
 	}
 
+	mdc_file_sepol_pack(req);
+
 	if (req_capsule_has_field(&req->rq_pill, &RMF_EADATA, RCL_SERVER))
 		req_capsule_set_size(&req->rq_pill, &RMF_EADATA,
 				     RCL_SERVER, output_size);
diff --git a/fs/lustre/ptlrpc/layout.c b/fs/lustre/ptlrpc/layout.c
index f80c627..9a676ae 100644
--- a/fs/lustre/ptlrpc/layout.c
+++ b/fs/lustre/ptlrpc/layout.c
@@ -193,7 +193,8 @@
 	&RMF_EADATA,
 	&RMF_DLM_REQ,
 	&RMF_FILE_SECCTX_NAME,
-	&RMF_FILE_SECCTX
+	&RMF_FILE_SECCTX,
+	&RMF_SELINUX_POL
 };
 
 static const struct req_msg_field *mds_reint_create_sym_client[] = {
@@ -204,7 +205,8 @@
 	&RMF_SYMTGT,
 	&RMF_DLM_REQ,
 	&RMF_FILE_SECCTX_NAME,
-	&RMF_FILE_SECCTX
+	&RMF_FILE_SECCTX,
+	&RMF_SELINUX_POL
 };
 
 static const struct req_msg_field *mds_reint_open_client[] = {
@@ -215,7 +217,8 @@
 	&RMF_NAME,
 	&RMF_EADATA,
 	&RMF_FILE_SECCTX_NAME,
-	&RMF_FILE_SECCTX
+	&RMF_FILE_SECCTX,
+	&RMF_SELINUX_POL
 };
 
 static const struct req_msg_field *mds_reint_open_server[] = {
@@ -232,7 +235,8 @@
 	&RMF_REC_REINT,
 	&RMF_CAPA1,
 	&RMF_NAME,
-	&RMF_DLM_REQ
+	&RMF_DLM_REQ,
+	&RMF_SELINUX_POL
 };
 
 static const struct req_msg_field *mds_reint_link_client[] = {
@@ -241,7 +245,8 @@
 	&RMF_CAPA1,
 	&RMF_CAPA2,
 	&RMF_NAME,
-	&RMF_DLM_REQ
+	&RMF_DLM_REQ,
+	&RMF_SELINUX_POL
 };
 
 static const struct req_msg_field *mds_reint_rename_client[] = {
@@ -251,7 +256,8 @@
 	&RMF_CAPA2,
 	&RMF_NAME,
 	&RMF_SYMTGT,
-	&RMF_DLM_REQ
+	&RMF_DLM_REQ,
+	&RMF_SELINUX_POL
 };
 
 static const struct req_msg_field *mds_reint_migrate_client[] = {
@@ -262,6 +268,7 @@
 	&RMF_NAME,
 	&RMF_SYMTGT,
 	&RMF_DLM_REQ,
+	&RMF_SELINUX_POL,
 	&RMF_MDT_EPOCH,
 	&RMF_CLOSE_DATA,
 	&RMF_EADATA
@@ -292,7 +299,8 @@
 	&RMF_CAPA1,
 	&RMF_NAME,
 	&RMF_EADATA,
-	&RMF_DLM_REQ
+	&RMF_DLM_REQ,
+	&RMF_SELINUX_POL
 };
 
 static const struct req_msg_field *mds_reint_resync[] = {
@@ -450,7 +458,8 @@
 	&RMF_NAME,
 	&RMF_EADATA,
 	&RMF_FILE_SECCTX_NAME,
-	&RMF_FILE_SECCTX
+	&RMF_FILE_SECCTX,
+	&RMF_SELINUX_POL
 };
 
 static const struct req_msg_field *ldlm_intent_open_client[] = {
@@ -463,7 +472,8 @@
 	&RMF_NAME,
 	&RMF_EADATA,
 	&RMF_FILE_SECCTX_NAME,
-	&RMF_FILE_SECCTX
+	&RMF_FILE_SECCTX,
+	&RMF_SELINUX_POL
 };
 
 static const struct req_msg_field *ldlm_intent_getxattr_client[] = {
@@ -472,6 +482,7 @@
 	&RMF_LDLM_INTENT,
 	&RMF_MDT_BODY,
 	&RMF_CAPA1,
+	&RMF_SELINUX_POL
 };
 
 static const struct req_msg_field *ldlm_intent_getxattr_server[] = {
@@ -496,7 +507,8 @@
 	&RMF_MDT_BODY,
 	&RMF_CAPA1,
 	&RMF_NAME,
-	&RMF_EADATA
+	&RMF_EADATA,
+	&RMF_SELINUX_POL
 };
 
 static const struct req_msg_field *mds_getxattr_server[] = {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 245/622] lustre: obd: make health_check sysfs compliant
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (243 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 244/622] lustre: ptlrpc: manage SELinux policy info for metadata ops James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 246/622] lustre: misc: delete OBD_IOC_PING_TARGET ioctl James Simmons
                   ` (377 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

The patch http://review.whamcloud.com/16721 was
ported to the upstream client but was rejected
since it violated the sysfs one-value-per-file
rule. Change the sysfs reporting of LBUG plus
unhealthy to just reporting LBUG, and move the
reporting of which device is unhealthy to a new
debugfs file that mirrors the sysfs file.

WC-bug-id: https://jira.whamcloud.com/browse/LU-8066
Lustre-commit: 5d368bd0b203 ("LU-8066 obd: make health_check sysfs compliant")
Signed-off-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-on: https://review.whamcloud.com/25631
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Emoly Liu <emoly@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/obd_sysfs.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/fs/lustre/obdclass/obd_sysfs.c b/fs/lustre/obdclass/obd_sysfs.c
index 73e44e7..ca15936 100644
--- a/fs/lustre/obdclass/obd_sysfs.c
+++ b/fs/lustre/obdclass/obd_sysfs.c
@@ -194,8 +194,12 @@ static ssize_t pinger_show(struct kobject *kobj, struct attribute *attr,
 
 		if (obd_health_check(NULL, obd))
 			healthy = false;
+
 		class_decref(obd, __func__, current);
 		read_lock(&obd_dev_lock);
+
+		if (!healthy)
+			break;
 	}
 	read_unlock(&obd_dev_lock);
 
@@ -363,6 +367,40 @@ static int obd_device_list_open(struct inode *inode, struct file *file)
 	.release = seq_release,
 };
 
+static int
+health_check_seq_show(struct seq_file *m, void *unused)
+{
+	int i;
+
+	read_lock(&obd_dev_lock);
+	for (i = 0; i < class_devno_max(); i++) {
+		struct obd_device *obd;
+
+		obd = class_num2obd(i);
+		if (!obd || !obd->obd_attached || !obd->obd_set_up)
+			continue;
+
+		LASSERT(obd->obd_magic == OBD_DEVICE_MAGIC);
+		if (obd->obd_stopping)
+			continue;
+
+		class_incref(obd, __func__, current);
+		read_unlock(&obd_dev_lock);
+
+		if (obd_health_check(NULL, obd)) {
+			seq_printf(m, "device %s reported unhealthy\n",
+				   obd->obd_name);
+		}
+		class_decref(obd, __func__, current);
+		read_lock(&obd_dev_lock);
+	}
+	read_unlock(&obd_dev_lock);
+
+	return 0;
+}
+
+LPROC_SEQ_FOPS_RO(health_check);
+
 struct kset *lustre_kset;
 EXPORT_SYMBOL_GPL(lustre_kset);
 
@@ -407,6 +445,9 @@ int class_procfs_init(void)
 
 	debugfs_create_file("devices", 0444, debugfs_lustre_root, NULL,
 			    &obd_device_list_fops);
+
+	debugfs_create_file("health_check", 0444, debugfs_lustre_root,
+			    NULL, &health_check_fops);
 out:
 	return rc;
 }
-- 
1.8.3.1


* [lustre-devel] [PATCH 246/622] lustre: misc: delete OBD_IOC_PING_TARGET ioctl
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (244 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 245/622] lustre: obd: make health_check sysfs compliant James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 247/622] lustre: misc: remove LIBCFS_IOC_DEBUG_MASK ioctl James Simmons
                   ` (376 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

The OBD_IOC_PING_TARGET ioctl was removed from tool usage in
Lustre v2_5_60_0-27-g122aadd and replaced with a sysfs interface.
It is no longer needed and can be removed.

WC-bug-id: https://jira.whamcloud.com/browse/LU-6202
Lustre-commit: d17d6ef74e52 ("LU-6202 misc: delete OBD_IOC_PING_TARGET ioctl")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33691
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Emoly Liu <emoly@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/mdc/mdc_request.c              |  4 +---
 fs/lustre/obdclass/class_obd.c           |  4 ++--
 fs/lustre/osc/osc_request.c              | 25 +++++++++++--------------
 include/uapi/linux/lustre/lustre_ioctl.h |  2 +-
 4 files changed, 15 insertions(+), 20 deletions(-)

diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index 80e58c8..f197abc 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -2114,9 +2114,7 @@ static int mdc_iocontrol(unsigned int cmd, struct obd_export *exp, int len,
 	case IOC_OSC_SET_ACTIVE:
 		rc = ptlrpc_set_import_active(imp, data->ioc_offset);
 		goto out;
-	case OBD_IOC_PING_TARGET:
-		rc = ptlrpc_obd_ping(obd);
-		goto out;
+
 	/*
 	 * Normally IOC_OBD_STATFS, OBD_IOC_QUOTACTL iocontrol are handled by
 	 * LMV instead of MDC. But when the cluster is upgraded from 1.8,
diff --git a/fs/lustre/obdclass/class_obd.c b/fs/lustre/obdclass/class_obd.c
index 0435f62..373a8d2 100644
--- a/fs/lustre/obdclass/class_obd.c
+++ b/fs/lustre/obdclass/class_obd.c
@@ -510,8 +510,8 @@ int class_handle_ioctl(unsigned int cmd, unsigned long arg)
 static long obd_class_ioctl(struct file *filp, unsigned int cmd,
 			    unsigned long arg)
 {
-	/* Allow non-root access for OBD_IOC_PING_TARGET - used by lfs check */
-	if (!capable(CAP_SYS_ADMIN) && (cmd != OBD_IOC_PING_TARGET))
+	/* Allow non-root access for some limited ioctls */
+	if (!capable(CAP_SYS_ADMIN))
 		return -EACCES;
 
 	if ((cmd & 0xffffff00) == ((int)'T') << 8) /* ignore all tty ioctls */
diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index 1fc50cc..7a99ef2 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -2840,7 +2840,7 @@ static int osc_iocontrol(unsigned int cmd, struct obd_export *exp, int len,
 {
 	struct obd_device *obd = exp->exp_obd;
 	struct obd_ioctl_data *data = karg;
-	int err = 0;
+	int rc = 0;
 
 	if (!try_module_get(THIS_MODULE)) {
 		CERROR("%s: cannot get module '%s'\n", obd->obd_name,
@@ -2849,27 +2849,24 @@ static int osc_iocontrol(unsigned int cmd, struct obd_export *exp, int len,
 	}
 	switch (cmd) {
 	case OBD_IOC_CLIENT_RECOVER:
-		err = ptlrpc_recover_import(obd->u.cli.cl_import,
-					    data->ioc_inlbuf1, 0);
-		if (err > 0)
-			err = 0;
+		rc = ptlrpc_recover_import(obd->u.cli.cl_import,
+					   data->ioc_inlbuf1, 0);
+		if (rc > 0)
+			rc = 0;
 		goto out;
 	case IOC_OSC_SET_ACTIVE:
-		err = ptlrpc_set_import_active(obd->u.cli.cl_import,
-					       data->ioc_offset);
-		goto out;
-	case OBD_IOC_PING_TARGET:
-		err = ptlrpc_obd_ping(obd);
+		rc = ptlrpc_set_import_active(obd->u.cli.cl_import,
+					      data->ioc_offset);
 		goto out;
 	default:
-		CDEBUG(D_INODE, "unrecognised ioctl %#x by %s\n",
-		       cmd, current->comm);
-		err = -ENOTTY;
+		CDEBUG(D_INODE, "%s: unrecognised ioctl %#x by %s\n",
+		       obd->obd_name, cmd, current->comm);
+		rc = -ENOTTY;
 		goto out;
 	}
 out:
 	module_put(THIS_MODULE);
-	return err;
+	return rc;
 }
 
 int osc_set_info_async(const struct lu_env *env, struct obd_export *exp,
diff --git a/include/uapi/linux/lustre/lustre_ioctl.h b/include/uapi/linux/lustre/lustre_ioctl.h
index 8289d43..30eb120 100644
--- a/include/uapi/linux/lustre/lustre_ioctl.h
+++ b/include/uapi/linux/lustre/lustre_ioctl.h
@@ -162,7 +162,7 @@ static inline __u32 obd_ioctl_packlen(struct obd_ioctl_data *data)
 #define OBD_IOC_GETDTNAME	OBD_IOC_GETNAME
 #define OBD_IOC_LOV_GET_CONFIG	_IOWR('f', 132, OBD_IOC_DATA_TYPE)
 #define OBD_IOC_CLIENT_RECOVER	_IOW('f', 133, OBD_IOC_DATA_TYPE)
-#define OBD_IOC_PING_TARGET	_IOW('f', 136, OBD_IOC_DATA_TYPE)
+/* was	OBD_IOC_PING_TARGET	_IOW('f', 136, OBD_IOC_DATA_TYPE) until 2.11 */
 
 /*	OBD_IOC_DEC_FS_USE_COUNT _IO('f', 139) */
 #define OBD_IOC_NO_TRANSNO	_IOW('f', 140, OBD_IOC_DATA_TYPE)
-- 
1.8.3.1


* [lustre-devel] [PATCH 247/622] lustre: misc: remove LIBCFS_IOC_DEBUG_MASK ioctl
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (245 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 246/622] lustre: misc: delete OBD_IOC_PING_TARGET ioctl James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 248/622] lustre: llite: add file heat support James Simmons
                   ` (375 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Remove the LIBCFS_IOC_DEBUG_MASK ioctl: the debug and subsystem
masks have been modifiable via sysfs for a long time, and tools
have not used this ioctl since Lustre 2.6.

WC-bug-id: https://jira.whamcloud.com/browse/LU-6202
Lustre-commit: 70f932c7bfc5 ("LU-6202 misc: remove LIBCFS_IOC_DEBUG_MASK ioctl")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33692
Reviewed-by: Patrick Farrell <paf@cray.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/class_obd.c           | 9 ---------
 include/uapi/linux/lnet/libcfs_ioctl.h   | 8 --------
 include/uapi/linux/lustre/lustre_ioctl.h | 2 +-
 3 files changed, 1 insertion(+), 18 deletions(-)

diff --git a/fs/lustre/obdclass/class_obd.c b/fs/lustre/obdclass/class_obd.c
index 373a8d2..609b4cc 100644
--- a/fs/lustre/obdclass/class_obd.c
+++ b/fs/lustre/obdclass/class_obd.c
@@ -274,18 +274,9 @@ int obd_ioctl_getdata(struct obd_ioctl_data **datap, int *len, void __user *arg)
 int class_handle_ioctl(unsigned int cmd, unsigned long arg)
 {
 	struct obd_ioctl_data *data;
-	struct libcfs_debug_ioctl_data *debug_data;
 	struct obd_device *obd = NULL;
 	int err = 0, len = 0;
 
-	/* only for debugging */
-	if (cmd == LIBCFS_IOC_DEBUG_MASK) {
-		debug_data = (struct libcfs_debug_ioctl_data *)arg;
-		libcfs_subsystem_debug = debug_data->subs;
-		libcfs_debug = debug_data->debug;
-		return 0;
-	}
-
 	CDEBUG(D_IOCTL, "cmd = %x\n", cmd);
 	if (obd_ioctl_getdata(&data, &len, (void __user *)arg)) {
 		CERROR("OBD ioctl: data error\n");
diff --git a/include/uapi/linux/lnet/libcfs_ioctl.h b/include/uapi/linux/lnet/libcfs_ioctl.h
index dfb73f7..455ed78 100644
--- a/include/uapi/linux/lnet/libcfs_ioctl.h
+++ b/include/uapi/linux/lnet/libcfs_ioctl.h
@@ -77,14 +77,6 @@ struct libcfs_ioctl_data {
 	char ioc_bulk[0];
 };
 
-struct libcfs_debug_ioctl_data {
-	struct libcfs_ioctl_hdr hdr;
-	unsigned int subs;
-	unsigned int debug;
-};
-
-/* 'f' ioctls are defined in lustre_ioctl.h and lustre_user.h except for: */
-#define LIBCFS_IOC_DEBUG_MASK		_IOWR('f', 250, long)
 #define IOCTL_LIBCFS_TYPE		long
 
 #define IOC_LIBCFS_TYPE			('e')
diff --git a/include/uapi/linux/lustre/lustre_ioctl.h b/include/uapi/linux/lustre/lustre_ioctl.h
index 30eb120..b067cc6 100644
--- a/include/uapi/linux/lustre/lustre_ioctl.h
+++ b/include/uapi/linux/lustre/lustre_ioctl.h
@@ -222,7 +222,7 @@ static inline __u32 obd_ioctl_packlen(struct obd_ioctl_data *data)
 #define OBD_IOC_STOP_LFSCK	_IOW('f', 231, OBD_IOC_DATA_TYPE)
 #define OBD_IOC_QUERY_LFSCK	_IOR('f', 232, struct obd_ioctl_data)
 /*	lustre/lustre_user.h	240-249 */
-/*	LIBCFS_IOC_DEBUG_MASK	250 */
+/* was LIBCFS_IOC_DEBUG_MASK   _IOWR('f', 250, long) until 2.11 */
 
 #define IOC_OSC_SET_ACTIVE	_IOWR('h', 21, void *)
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 248/622] lustre: llite: add file heat support
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (246 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 247/622] lustre: misc: remove LIBCFS_IOC_DEBUG_MASK ioctl James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 249/622] lustre: obdclass: improve llog config record message James Simmons
                   ` (374 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Li Xi <lixi@ddn.com>

File heat is a special attribute of files/objects which reflects
their access frequency.
File heat is mainly designed for cache management: caches like
PCC can use file heat to decide which files to remove from
the cache or which files to fetch into it.
This patch adds file heat support at the llite level.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10602
Lustre-commit: ae723cf8161f ("LU-10602 llite: add file heat support")
Signed-off-by: Li Xi <lixi@ddn.com>
Signed-off-by: Qian Yingjin <qian@ddn.com>
Reviewed-on: https://review.whamcloud.com/34399
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_class.h           |  11 ++++
 fs/lustre/include/obd_support.h         |   6 ++
 fs/lustre/llite/file.c                  | 104 ++++++++++++++++++++++++++++++-
 fs/lustre/llite/llite_internal.h        |  20 +++++-
 fs/lustre/llite/llite_lib.c             |   6 ++
 fs/lustre/llite/lproc_llite.c           | 106 ++++++++++++++++++++++++++++++++
 fs/lustre/obdclass/class_obd.c          |  73 ++++++++++++++++++++++
 include/uapi/linux/lustre/lustre_user.h |  32 ++++++++++
 8 files changed, 356 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/include/obd_class.h b/fs/lustre/include/obd_class.h
index 6a4b6a5..6cddc4f 100644
--- a/fs/lustre/include/obd_class.h
+++ b/fs/lustre/include/obd_class.h
@@ -1710,4 +1710,15 @@ struct root_squash_info {
 struct obd_ioctl_data;
 int obd_ioctl_getdata(struct obd_ioctl_data **data, int *len, void __user *arg);
 
+extern void obd_heat_add(struct obd_heat_instance *instance,
+			 unsigned int time_second, u64 count,
+			 unsigned int weight, unsigned int period_second);
+extern void obd_heat_decay(struct obd_heat_instance *instance,
+			   u64 time_second, unsigned int weight,
+			   unsigned int period_second);
+extern u64 obd_heat_get(struct obd_heat_instance *instance,
+			unsigned int time_second, unsigned int weight,
+			unsigned int period_second);
+extern void obd_heat_clear(struct obd_heat_instance *instance, int count);
+
 #endif /* __LINUX_OBD_CLASS_H */
diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index a60fa07..36955e8 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -536,4 +536,10 @@
 	(keylen >= (sizeof(str) - 1) &&			\
 	memcmp(key, str, (sizeof(str) - 1)) == 0)
 
+struct obd_heat_instance {
+	u64 ohi_heat;
+	u64 ohi_time_second;
+	u64 ohi_count;
+};
+
 #endif
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 7ec1099..f5b5eec 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -1399,6 +1399,37 @@ static void ll_io_init(struct cl_io *io, const struct file *file, int write)
 	ll_io_set_mirror(io, file);
 }
 
+static void ll_heat_add(struct inode *inode, enum cl_io_type iot,
+			u64 count)
+{
+	struct ll_inode_info *lli = ll_i2info(inode);
+	struct ll_sb_info *sbi = ll_i2sbi(inode);
+	enum obd_heat_type sample_type;
+	enum obd_heat_type iobyte_type;
+	u64 now = ktime_get_real_seconds();
+
+	if (!ll_sbi_has_file_heat(sbi) ||
+	    lli->lli_heat_flags & LU_HEAT_FLAG_OFF)
+		return;
+
+	if (iot == CIT_READ) {
+		sample_type = OBD_HEAT_READSAMPLE;
+		iobyte_type = OBD_HEAT_READBYTE;
+	} else if (iot == CIT_WRITE) {
+		sample_type = OBD_HEAT_WRITESAMPLE;
+		iobyte_type = OBD_HEAT_WRITEBYTE;
+	} else {
+		return;
+	}
+
+	spin_lock(&lli->lli_heat_lock);
+	obd_heat_add(&lli->lli_heat_instances[sample_type], now, 1,
+		     sbi->ll_heat_decay_weight, sbi->ll_heat_period_second);
+	obd_heat_add(&lli->lli_heat_instances[iobyte_type], now, count,
+		     sbi->ll_heat_decay_weight, sbi->ll_heat_period_second);
+	spin_unlock(&lli->lli_heat_lock);
+}
+
 static ssize_t
 ll_file_io_generic(const struct lu_env *env, struct vvp_io_args *args,
 		   struct file *file, enum cl_io_type iot,
@@ -1512,6 +1543,8 @@ static void ll_io_init(struct cl_io *io, const struct file *file, int write)
 		}
 	}
 	CDEBUG(D_VFSTRACE, "iot: %d, result: %zd\n", iot, result);
+	if (result > 0)
+		ll_heat_add(file_inode(file), iot, result);
 
 	return result > 0 ? result : rc;
 }
@@ -1575,9 +1608,11 @@ static void ll_io_init(struct cl_io *io, const struct file *file, int write)
 	if (result == -ENODATA)
 		result = 0;
 
-	if (result > 0)
+	if (result > 0) {
+		ll_heat_add(file_inode(iocb->ki_filp), CIT_READ, result);
 		ll_stats_ops_tally(ll_i2sbi(file_inode(iocb->ki_filp)),
 				   LPROC_LL_READ_BYTES, result);
+	}
 
 	return result;
 }
@@ -1660,6 +1695,7 @@ static ssize_t ll_do_tiny_write(struct kiocb *iocb, struct iov_iter *iter)
 		result = 0;
 
 	if (result > 0) {
+		ll_heat_add(inode, CIT_WRITE, result);
 		ll_stats_ops_tally(ll_i2sbi(inode), LPROC_LL_WRITE_BYTES,
 				   result);
 		set_bit(LLIF_DATA_MODIFIED, &ll_i2info(inode)->lli_flags);
@@ -3128,6 +3164,41 @@ static long ll_file_set_lease(struct file *file, struct ll_ioc_lease *ioc,
 	return rc;
 }
 
+static void ll_heat_get(struct inode *inode, struct lu_heat *heat)
+{
+	struct ll_inode_info *lli = ll_i2info(inode);
+	struct ll_sb_info *sbi = ll_i2sbi(inode);
+	u64 now = ktime_get_real_seconds();
+	int i;
+
+	spin_lock(&lli->lli_heat_lock);
+	heat->lh_flags = lli->lli_heat_flags;
+	for (i = 0; i < heat->lh_count; i++)
+		heat->lh_heat[i] = obd_heat_get(&lli->lli_heat_instances[i],
+						now, sbi->ll_heat_decay_weight,
+						sbi->ll_heat_period_second);
+	spin_unlock(&lli->lli_heat_lock);
+}
+
+static int ll_heat_set(struct inode *inode, u64 flags)
+{
+	struct ll_inode_info *lli = ll_i2info(inode);
+	int rc = 0;
+
+	spin_lock(&lli->lli_heat_lock);
+	if (flags & LU_HEAT_FLAG_CLEAR)
+		obd_heat_clear(lli->lli_heat_instances, OBD_HEAT_COUNT);
+
+	if (flags & LU_HEAT_FLAG_OFF)
+		lli->lli_heat_flags |= LU_HEAT_FLAG_OFF;
+	else
+		lli->lli_heat_flags &= ~LU_HEAT_FLAG_OFF;
+
+	spin_unlock(&lli->lli_heat_lock);
+
+	return rc;
+}
+
 static long
 ll_file_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
@@ -3510,6 +3581,37 @@ static long ll_file_set_lease(struct file *file, struct ll_ioc_lease *ioc,
 		return ll_ioctl_fssetxattr(inode, cmd, arg);
 	case BLKSSZGET:
 		return put_user(PAGE_SIZE, (int __user *)arg);
+	case LL_IOC_HEAT_GET: {
+		struct lu_heat uheat;
+		struct lu_heat *heat;
+		int size;
+
+		if (copy_from_user(&uheat, (void __user *)arg, sizeof(uheat)))
+			return -EFAULT;
+
+		if (uheat.lh_count > OBD_HEAT_COUNT)
+			uheat.lh_count = OBD_HEAT_COUNT;
+
+		size = offsetof(typeof(uheat), lh_heat[uheat.lh_count]);
+		heat = kzalloc(size, GFP_KERNEL);
+		if (!heat)
+			return -ENOMEM;
+
+		heat->lh_count = uheat.lh_count;
+		ll_heat_get(inode, heat);
+		rc = copy_to_user((char __user *)arg, heat, size);
+		kfree(heat);
+		return rc ? -EFAULT : 0;
+	}
+	case LL_IOC_HEAT_SET: {
+		u64 flags;
+
+		if (copy_from_user(&flags, (void __user *)arg, sizeof(flags)))
+			return -EFAULT;
+
+		rc = ll_heat_set(inode, flags);
+		return rc;
+	}
 	default:
 		return obd_iocontrol(cmd, ll_i2dtexp(inode), 0, NULL,
 				     (void __user *)arg);
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index 3c81c3b..5a0a5ed 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -196,6 +196,11 @@ struct ll_inode_info {
 			/* for writepage() only to communicate to fsync */
 			int				lli_async_rc;
 
+			/* protect the file heat fields */
+			spinlock_t			lli_heat_lock;
+			u32				lli_heat_flags;
+			struct obd_heat_instance	lli_heat_instances[OBD_HEAT_COUNT];
+
 			/*
 			 * Whenever a process try to read/write the file, the
 			 * jobid of the process will be saved here, and it'll
@@ -418,7 +423,7 @@ enum stats_track_type {
 					  * create
 					  */
 #define LL_SBI_TINY_WRITE	0x2000000 /* tiny write support */
-
+#define LL_SBI_FILE_HEAT    0x4000000 /* file heat support */
 #define LL_SBI_FLAGS {	\
 	"nolck",	\
 	"checksum",	\
@@ -446,6 +451,7 @@ enum stats_track_type {
 	"file_secctx",	\
 	"pio",		\
 	"tiny_write",	\
+	"file_heat",	\
 }
 
 /*
@@ -546,8 +552,15 @@ struct ll_sb_info {
 
 	struct kset		ll_kset;	/* sysfs object */
 	struct completion	 ll_kobj_unregister;
+
+	/* File heat */
+	unsigned int		ll_heat_decay_weight;
+	unsigned int		ll_heat_period_second;
 };
 
+#define SBI_DEFAULT_HEAT_DECAY_WEIGHT	((80 * 256 + 50) / 100)
+#define SBI_DEFAULT_HEAT_PERIOD_SECOND	(60)
+
 /*
  * per file-descriptor read-ahead data.
  */
@@ -710,6 +723,11 @@ static inline bool ll_sbi_has_tiny_write(struct ll_sb_info *sbi)
 	return !!(sbi->ll_flags & LL_SBI_TINY_WRITE);
 }
 
+static inline bool ll_sbi_has_file_heat(struct ll_sb_info *sbi)
+{
+	return !!(sbi->ll_flags & LL_SBI_FILE_HEAT);
+}
+
 void ll_ras_enter(struct file *f);
 
 /* llite/lcommon_misc.c */
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 10d9180..795a1f1 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -133,6 +133,9 @@ static struct ll_sb_info *ll_init_sbi(void)
 	INIT_LIST_HEAD(&sbi->ll_squash.rsi_nosquash_nids);
 	spin_lock_init(&sbi->ll_squash.rsi_lock);
 
+	/* Per-filesystem file heat */
+	sbi->ll_heat_decay_weight = SBI_DEFAULT_HEAT_DECAY_WEIGHT;
+	sbi->ll_heat_period_second = SBI_DEFAULT_HEAT_PERIOD_SECOND;
 	return sbi;
 }
 
@@ -949,6 +952,9 @@ void ll_lli_init(struct ll_inode_info *lli)
 		INIT_LIST_HEAD(&lli->lli_agl_list);
 		lli->lli_agl_index = 0;
 		lli->lli_async_rc = 0;
+		spin_lock_init(&lli->lli_heat_lock);
+		obd_heat_clear(lli->lli_heat_instances, OBD_HEAT_COUNT);
+		lli->lli_heat_flags = 0;
 	}
 	mutex_init(&lli->lli_layout_mutex);
 	memset(lli->lli_jobid, 0, sizeof(lli->lli_jobid));
diff --git a/fs/lustre/llite/lproc_llite.c b/fs/lustre/llite/lproc_llite.c
index 4060271..596aad8 100644
--- a/fs/lustre/llite/lproc_llite.c
+++ b/fs/lustre/llite/lproc_llite.c
@@ -1096,6 +1096,109 @@ static ssize_t fast_read_store(struct kobject *kobj,
 }
 LUSTRE_RW_ATTR(fast_read);
 
+static ssize_t file_heat_show(struct kobject *kobj,
+			      struct attribute *attr,
+			      char *buf)
+{
+	struct ll_sb_info *sbi = container_of(kobj, struct ll_sb_info,
+					      ll_kset.kobj);
+
+	return snprintf(buf, PAGE_SIZE, "%u\n",
+			!!(sbi->ll_flags & LL_SBI_FILE_HEAT));
+}
+
+static ssize_t file_heat_store(struct kobject *kobj,
+			       struct attribute *attr,
+			       const char *buffer,
+			       size_t count)
+{
+	struct ll_sb_info *sbi = container_of(kobj, struct ll_sb_info,
+					      ll_kset.kobj);
+	bool val;
+	int rc;
+
+	rc = kstrtobool(buffer, &val);
+	if (rc)
+		return rc;
+
+	spin_lock(&sbi->ll_lock);
+	if (val)
+		sbi->ll_flags |= LL_SBI_FILE_HEAT;
+	else
+		sbi->ll_flags &= ~LL_SBI_FILE_HEAT;
+	spin_unlock(&sbi->ll_lock);
+
+	return count;
+}
+LUSTRE_RW_ATTR(file_heat);
+
+static ssize_t heat_decay_percentage_show(struct kobject *kobj,
+					  struct attribute *attr,
+					  char *buf)
+{
+	struct ll_sb_info *sbi = container_of(kobj, struct ll_sb_info,
+					      ll_kset.kobj);
+
+	return snprintf(buf, PAGE_SIZE, "%u\n",
+		       (sbi->ll_heat_decay_weight * 100 + 128) / 256);
+}
+
+static ssize_t heat_decay_percentage_store(struct kobject *kobj,
+					   struct attribute *attr,
+					   const char *buffer,
+					   size_t count)
+{
+	struct ll_sb_info *sbi = container_of(kobj, struct ll_sb_info,
+					      ll_kset.kobj);
+	unsigned long val;
+	int rc;
+
+	rc = kstrtoul(buffer, 10, &val);
+	if (rc)
+		return rc;
+
+	if (val < 0 || val > 100)
+		return -ERANGE;
+
+	sbi->ll_heat_decay_weight = (val * 256 + 50) / 100;
+
+	return count;
+}
+LUSTRE_RW_ATTR(heat_decay_percentage);
+
+static ssize_t heat_period_second_show(struct kobject *kobj,
+				       struct attribute *attr,
+				       char *buf)
+{
+	struct ll_sb_info *sbi = container_of(kobj, struct ll_sb_info,
+					      ll_kset.kobj);
+
+	return snprintf(buf, PAGE_SIZE, "%u\n", sbi->ll_heat_period_second);
+}
+
+static ssize_t heat_period_second_store(struct kobject *kobj,
+					struct attribute *attr,
+					const char *buffer,
+					size_t count)
+{
+	struct ll_sb_info *sbi = container_of(kobj, struct ll_sb_info,
+					      ll_kset.kobj);
+	unsigned long val;
+	int rc;
+
+	rc = kstrtoul(buffer, 10, &val);
+	if (rc)
+		return rc;
+
+	if (val <= 0)
+		return -ERANGE;
+
+	sbi->ll_heat_period_second = val;
+
+	return count;
+}
+LUSTRE_RW_ATTR(heat_period_second);
+
 static int ll_unstable_stats_seq_show(struct seq_file *m, void *v)
 {
 	struct super_block *sb = m->private;
@@ -1264,6 +1367,9 @@ static ssize_t ll_nosquash_nids_seq_write(struct file *file,
 	&lustre_attr_xattr_cache.attr,
 	&lustre_attr_fast_read.attr,
 	&lustre_attr_tiny_write.attr,
+	&lustre_attr_file_heat.attr,
+	&lustre_attr_heat_decay_percentage.attr,
+	&lustre_attr_heat_period_second.attr,
 	NULL,
 };
 
diff --git a/fs/lustre/obdclass/class_obd.c b/fs/lustre/obdclass/class_obd.c
index 609b4cc..0718fdb 100644
--- a/fs/lustre/obdclass/class_obd.c
+++ b/fs/lustre/obdclass/class_obd.c
@@ -706,6 +706,79 @@ static void obdclass_exit(void)
 	obd_zombie_impexp_stop();
 }
 
+void obd_heat_clear(struct obd_heat_instance *instance, int count)
+{
+	memset(instance, 0, sizeof(*instance) * count);
+}
+EXPORT_SYMBOL(obd_heat_clear);
+
+/*
+ * The file heat is calculated for every time interval period I. The access
+ * frequency during each period is counted. The file heat is only recalculated
+ * at the end of a time period.  And a percentage of the former file heat is
+ * lost when recalculated. The recursion formula to calculate the heat of the
+ * file f is as follow:
+ *
+ * Hi+1(f) = (1-P)*Hi(f)+ P*Ci
+ *
+ * Where Hi is the heat value in the period between time points i*I and
+ * (i+1)*I; Ci is the access count in the period; the symbol P refers to the
+ * weight of Ci. The larger the value the value of P is, the more influence Ci
+ * has on the file heat.
+ */
+void obd_heat_decay(struct obd_heat_instance *instance,  u64 time_second,
+		    unsigned int weight, unsigned int period_second)
+{
+	u64 second;
+
+	if (instance->ohi_time_second > time_second) {
+		obd_heat_clear(instance, 1);
+		return;
+	}
+
+	if (instance->ohi_time_second == 0)
+		return;
+
+	for (second = instance->ohi_time_second + period_second;
+	     second < time_second;
+	     second += period_second) {
+		instance->ohi_heat = instance->ohi_heat *
+				(256 - weight) / 256 +
+				instance->ohi_count * weight / 256;
+		instance->ohi_count = 0;
+		instance->ohi_time_second = second;
+	}
+}
+EXPORT_SYMBOL(obd_heat_decay);
+
+u64 obd_heat_get(struct obd_heat_instance *instance, unsigned int time_second,
+		 unsigned int weight, unsigned int period_second)
+{
+	obd_heat_decay(instance, time_second, weight, period_second);
+
+	if (instance->ohi_count == 0)
+		return instance->ohi_heat;
+
+	return instance->ohi_heat * (256 - weight) / 256 +
+	       instance->ohi_count * weight / 256;
+}
+EXPORT_SYMBOL(obd_heat_get);
+
+void obd_heat_add(struct obd_heat_instance *instance,
+		  unsigned int time_second,  u64 count,
+		  unsigned int weight, unsigned int period_second)
+{
+	obd_heat_decay(instance, time_second, weight, period_second);
+	if (instance->ohi_time_second == 0) {
+		instance->ohi_time_second = time_second;
+		instance->ohi_heat = 0;
+		instance->ohi_count = count;
+	} else {
+		instance->ohi_count += count;
+	}
+}
+EXPORT_SYMBOL(obd_heat_add);
+
 MODULE_AUTHOR("OpenSFS, Inc. <http://www.lustre.org/>");
 MODULE_DESCRIPTION("Lustre Class Driver");
 MODULE_VERSION(LUSTRE_VERSION_STRING);
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index c1e9dca..1d402f1 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -352,6 +352,8 @@ struct ll_ioc_lease_id {
 #define LL_IOC_FID2MDTIDX		_IOWR('f', 248, struct lu_fid)
 #define LL_IOC_GETPARENT		_IOWR('f', 249, struct getparent)
 #define LL_IOC_LADVISE			_IOR('f', 250, struct llapi_lu_ladvise)
+#define LL_IOC_HEAT_GET			_IOWR('f', 251, struct lu_heat)
+#define LL_IOC_HEAT_SET			_IOW('f', 252, long)
 
 #define LL_STATFS_LMV		1
 #define LL_STATFS_LOV		2
@@ -1957,6 +1959,36 @@ enum lockahead_results {
 	LLA_RESULT_SAME,
 };
 
+enum lu_heat_flag_bit {
+	LU_HEAT_FLAG_BIT_INVALID = 0,
+	LU_HEAT_FLAG_BIT_OFF,
+	LU_HEAT_FLAG_BIT_CLEAR,
+};
+
+#define LU_HEAT_FLAG_CLEAR	(1 << LU_HEAT_FLAG_BIT_CLEAR)
+#define LU_HEAT_FLAG_OFF	(1 << LU_HEAT_FLAG_BIT_OFF)
+
+enum obd_heat_type {
+	OBD_HEAT_READSAMPLE	= 0,
+	OBD_HEAT_WRITESAMPLE	= 1,
+	OBD_HEAT_READBYTE	= 2,
+	OBD_HEAT_WRITEBYTE	= 3,
+	OBD_HEAT_COUNT
+};
+
+#define LU_HEAT_NAMES {					\
+	[OBD_HEAT_READSAMPLE]	= "readsample",		\
+	[OBD_HEAT_WRITESAMPLE]	= "writesample",	\
+	[OBD_HEAT_READBYTE]	= "readbyte",		\
+	[OBD_HEAT_WRITEBYTE]	= "writebyte",		\
+}
+
+struct lu_heat {
+	__u32 lh_count;
+	__u32 lh_flags;
+	__u64 lh_heat[0];
+};
+
 /** @} lustreuser */
 
 #endif /* _LUSTRE_USER_H */
-- 
1.8.3.1


* [lustre-devel] [PATCH 249/622] lustre: obdclass: improve llog config record message
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (247 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 248/622] lustre: llite: add file heat support James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 250/622] lustre: lov: remove KEY_CACHE_SET to simplify the code James Simmons
                   ` (373 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Improve the config record message in class_config_parse_rec()
by removing the trailing newline and formatting it to match the
other entries in the output dump buffer.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11566
Lustre-commit: 2ec11b04dd76 ("LU-11566 utils: improve usage/docs for lctl llog commands")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34004
Reviewed-by: Joseph Gmitter <jgmitter@whamcloud.com>
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/obd_config.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/lustre/obdclass/obd_config.c b/fs/lustre/obdclass/obd_config.c
index 398f888..4b1848f 100644
--- a/fs/lustre/obdclass/obd_config.c
+++ b/fs/lustre/obdclass/obd_config.c
@@ -1561,7 +1561,7 @@ static int class_config_parse_rec(struct llog_rec_hdr *rec, char *buf,
 		char nidstr[LNET_NIDSTR_SIZE];
 
 		libcfs_nid2str_r(lcfg->lcfg_nid, nidstr, sizeof(nidstr));
-		ptr += snprintf(ptr, end - ptr, "nid=%s(%#llx)\n     ",
+		ptr += snprintf(ptr, end - ptr, "nid=%s(%#llx)  ",
 				nidstr, lcfg->lcfg_nid);
 	}
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 250/622] lustre: lov: remove KEY_CACHE_SET to simplify the code
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (248 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 249/622] lustre: obdclass: improve llog config record message James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:11 ` [lustre-devel] [PATCH 251/622] lustre: ldlm: Fix style issues for ldlm_lockd.c James Simmons
                   ` (372 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Yang Sheng <ys@whamcloud.com>

Currently we must invoke obd_set_info_async() with KEY_CACHE_SET
after obd_connect() for each OSC device. In fact, this step can be
folded into obd_connect() to simplify the code.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12072
Lustre-commit: 6d21fbbf018b ("LU-12072 lov: remove KEY_CACHE_SET to simplify the code")
Signed-off-by: Yang Sheng <ys@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34419
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd.h     |  2 +-
 fs/lustre/ldlm/ldlm_lib.c   | 14 +++++++++++++
 fs/lustre/llite/llite_lib.c | 13 ++----------
 fs/lustre/lmv/lmv_obd.c     |  3 ++-
 fs/lustre/lov/lov_obd.c     | 49 ++++++++++++++-------------------------------
 fs/lustre/osc/osc_request.c | 17 ----------------
 6 files changed, 34 insertions(+), 64 deletions(-)

diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index 758efc1..2195f85 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -446,6 +446,7 @@ struct lmv_obd {
 	struct lmv_tgt_desc	**tgts;
 	struct obd_connect_data	conn_data;
 	struct kobject		*lmv_tgts_kobj;
+	void			*lmv_cache;
 };
 
 struct niobuf_local {
@@ -672,7 +673,6 @@ struct obd_device {
 /*      KEY_SET_INFO in lustre_idl.h */
 #define KEY_SPTLRPC_CONF	"sptlrpc_conf"
 
-#define KEY_CACHE_SET		"cache_set"
 #define KEY_CACHE_LRU_SHRINK	"cache_lru_shrink"
 
 /* Flags for op_xvalid */
diff --git a/fs/lustre/ldlm/ldlm_lib.c b/fs/lustre/ldlm/ldlm_lib.c
index 11955b1..4a982ab 100644
--- a/fs/lustre/ldlm/ldlm_lib.c
+++ b/fs/lustre/ldlm/ldlm_lib.c
@@ -40,6 +40,7 @@
 
 #define DEBUG_SUBSYSTEM S_LDLM
 
+#include <cl_object.h>
 #include <obd.h>
 #include <obd_class.h>
 #include <lustre_dlm.h>
@@ -579,6 +580,19 @@ int client_connect_import(const struct lu_env *env,
 out_sem:
 	up_write(&cli->cl_sem);
 
+	if (!rc && localdata) {
+		LASSERT(!cli->cl_cache); /* only once */
+		cli->cl_cache = (struct cl_client_cache *)localdata;
+		cl_cache_incref(cli->cl_cache);
+		cli->cl_lru_left = &cli->cl_cache->ccc_lru_left;
+
+		/* add this osc into entity list */
+		LASSERT(list_empty(&cli->cl_lru_osc));
+		spin_lock(&cli->cl_cache->ccc_lru_lock);
+		list_add(&cli->cl_lru_osc, &cli->cl_cache->ccc_lru);
+		spin_unlock(&cli->cl_cache->ccc_lru_lock);
+	}
+
 	return rc;
 }
 EXPORT_SYMBOL(client_connect_import);
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 795a1f1..57486b4 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -266,7 +266,7 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 	data->ocd_brw_size = MD_MAX_BRW_SIZE;
 
 	err = obd_connect(NULL, &sbi->ll_md_exp, sbi->ll_md_obd,
-			  &sbi->ll_sb_uuid, data, NULL);
+			  &sbi->ll_sb_uuid, data, sbi->ll_cache);
 	if (err == -EBUSY) {
 		LCONSOLE_ERROR_MSG(0x14f,
 				   "An MDT (md %s) is performing recovery, of which this client is not a part. Please wait for recovery to complete, abort, or time out.\n",
@@ -462,7 +462,7 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 	data->ocd_brw_size = DT_MAX_BRW_SIZE;
 
 	err = obd_connect(NULL, &sbi->ll_dt_exp, sbi->ll_dt_obd,
-			  &sbi->ll_sb_uuid, data, NULL);
+			  &sbi->ll_sb_uuid, data, sbi->ll_cache);
 	if (err == -EBUSY) {
 		LCONSOLE_ERROR_MSG(0x150,
 				   "An OST (dt %s) is performing recovery, of which this client is not a part.  Please wait for recovery to complete, abort, or time out.\n",
@@ -583,15 +583,6 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 	}
 	cl_sb_init(sb);
 
-	err = obd_set_info_async(NULL, sbi->ll_dt_exp, sizeof(KEY_CACHE_SET),
-				 KEY_CACHE_SET, sizeof(*sbi->ll_cache),
-				 sbi->ll_cache, NULL);
-	if (err) {
-		CERROR("%s: Set cache_set failed: rc = %d\n",
-		       sbi->ll_dt_exp->exp_obd->obd_name, err);
-		goto out_root;
-	}
-
 	sb->s_root = d_make_root(root);
 	if (!sb->s_root) {
 		CERROR("%s: can't make root dentry\n",
diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index 6ad100c..9f3d6de 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -207,6 +207,7 @@ static int lmv_connect(const struct lu_env *env,
 
 	lmv->connected = 0;
 	lmv->conn_data = *data;
+	lmv->lmv_cache = localdata;
 
 	lmv->lmv_tgts_kobj = kobject_create_and_add("target_obds",
 						    &obd->obd_kset.kobj);
@@ -299,7 +300,7 @@ static int lmv_connect_mdc(struct obd_device *obd, struct lmv_tgt_desc *tgt)
 	}
 
 	rc = obd_connect(NULL, &mdc_exp, mdc_obd, &obd->obd_uuid,
-			 &lmv->conn_data, NULL);
+			 &lmv->conn_data, lmv->lmv_cache);
 	if (rc) {
 		CERROR("target %s connect error %d\n", tgt->ltd_uuid.uuid, rc);
 		return rc;
diff --git a/fs/lustre/lov/lov_obd.c b/fs/lustre/lov/lov_obd.c
index cc0ca1c..240cc6f9 100644
--- a/fs/lustre/lov/lov_obd.c
+++ b/fs/lustre/lov/lov_obd.c
@@ -120,7 +120,7 @@ static int lov_set_osc_active(struct obd_device *obd, struct obd_uuid *uuid,
 static int lov_notify(struct obd_device *obd, struct obd_device *watched,
 		      enum obd_notify_event ev);
 
-int lov_connect_obd(struct obd_device *obd, u32 index, int activate,
+int lov_connect_osc(struct obd_device *obd, u32 index, int activate,
 		    struct obd_connect_data *data)
 {
 	struct lov_obd *lov = &obd->u.lov;
@@ -169,13 +169,13 @@ int lov_connect_obd(struct obd_device *obd, u32 index, int activate,
 
 	if (imp->imp_invalid) {
 		CDEBUG(D_CONFIG,
-		       "not connecting OSC %s; administratively disabled\n",
+		       "%s: not connecting - administratively disabled\n",
 		       obd_uuid2str(tgt_uuid));
 		return 0;
 	}
 
 	rc = obd_connect(NULL, &lov->lov_tgts[index]->ltd_exp, tgt_obd,
-			 &lov_osc_uuid, data, NULL);
+			 &lov_osc_uuid, data, lov->lov_cache);
 	if (rc || !lov->lov_tgts[index]->ltd_exp) {
 		CERROR("Target %s connect error %d\n",
 		       obd_uuid2str(tgt_uuid), rc);
@@ -231,12 +231,17 @@ static int lov_connect(const struct lu_env *env,
 
 	lov_tgts_getref(obd);
 
+	if (localdata) {
+		lov->lov_cache = localdata;
+		cl_cache_incref(lov->lov_cache);
+	}
+
 	for (i = 0; i < lov->desc.ld_tgt_count; i++) {
 		tgt = lov->lov_tgts[i];
 		if (!tgt || obd_uuid_empty(&tgt->ltd_uuid))
 			continue;
 		/* Flags will be lowest common denominator */
-		rc = lov_connect_obd(obd, i, tgt->ltd_activate, &lov->lov_ocd);
+		rc = lov_connect_osc(obd, i, tgt->ltd_activate, &lov->lov_ocd);
 		if (rc) {
 			CERROR("%s: lov connect tgt %d failed: %d\n",
 			       obd->obd_name, i, rc);
@@ -381,20 +386,12 @@ static int lov_set_osc_active(struct obd_device *obd, struct obd_uuid *uuid,
 			struct obd_uuid lov_osc_uuid = {"LOV_OSC_UUID"};
 
 			rc = obd_connect(NULL, &tgt->ltd_exp, tgt->ltd_obd,
-					 &lov_osc_uuid, &lov->lov_ocd, NULL);
+					 &lov_osc_uuid, &lov->lov_ocd,
+					 lov->lov_cache);
 			if (rc || !tgt->ltd_exp) {
 				index = rc;
 				goto out;
 			}
-			rc = obd_set_info_async(NULL, tgt->ltd_exp,
-						sizeof(KEY_CACHE_SET),
-						KEY_CACHE_SET,
-						sizeof(struct cl_client_cache),
-						lov->lov_cache, NULL);
-			if (rc < 0) {
-				index = rc;
-				goto out;
-			}
 		}
 
 		if (lov->lov_tgts[index]->ltd_activate == activate) {
@@ -574,17 +571,16 @@ static int lov_add_target(struct obd_device *obd, struct obd_uuid *uuidp,
 	CDEBUG(D_CONFIG, "idx=%d ltd_gen=%d ld_tgt_count=%d\n",
 	       index, tgt->ltd_gen, lov->desc.ld_tgt_count);
 
-	if (lov->lov_connects == 0) {
+	if (lov->lov_connects == 0)
 		/* lov_connect hasn't been called yet. We'll do the
-		 * lov_connect_obd on this target when that fn first runs,
+		 * lov_connect_osc on this target when that fn first runs,
 		 * because we don't know the connect flags yet.
 		 */
 		return 0;
-	}
 
 	lov_tgts_getref(obd);
 
-	rc = lov_connect_obd(obd, index, active, &lov->lov_ocd);
+	rc = lov_connect_osc(obd, index, active, &lov->lov_ocd);
 	if (rc)
 		goto out;
 
@@ -594,15 +590,6 @@ static int lov_add_target(struct obd_device *obd, struct obd_uuid *uuidp,
 		goto out;
 	}
 
-	if (lov->lov_cache) {
-		rc = obd_set_info_async(NULL, tgt->ltd_exp,
-					sizeof(KEY_CACHE_SET), KEY_CACHE_SET,
-					sizeof(struct cl_client_cache),
-					lov->lov_cache, NULL);
-		if (rc < 0)
-			goto out;
-	}
-
 	rc = lov_notify(obd, tgt->ltd_exp->exp_obd,
 			active ? OBD_NOTIFY_CONNECT : OBD_NOTIFY_INACTIVE);
 
@@ -1216,14 +1203,8 @@ static int lov_set_info_async(const struct lu_env *env, struct obd_export *exp,
 
 	lov_tgts_getref(obddev);
 
-	if (KEY_IS(KEY_CHECKSUM)) {
+	if (KEY_IS(KEY_CHECKSUM))
 		do_inactive = true;
-	} else if (KEY_IS(KEY_CACHE_SET)) {
-		LASSERT(!lov->lov_cache);
-		lov->lov_cache = val;
-		do_inactive = true;
-		cl_cache_incref(lov->lov_cache);
-	}
 
 	for (i = 0; i < lov->desc.ld_tgt_count; i++) {
 		tgt = lov->lov_tgts[i];
diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index 7a99ef2..a988cbf 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -2899,23 +2899,6 @@ int osc_set_info_async(const struct lu_env *env, struct obd_export *exp,
 		return 0;
 	}
 
-	if (KEY_IS(KEY_CACHE_SET)) {
-		struct client_obd *cli = &obd->u.cli;
-
-		LASSERT(!cli->cl_cache); /* only once */
-		cli->cl_cache = val;
-		cl_cache_incref(cli->cl_cache);
-		cli->cl_lru_left = &cli->cl_cache->ccc_lru_left;
-
-		/* add this osc into entity list */
-		LASSERT(list_empty(&cli->cl_lru_osc));
-		spin_lock(&cli->cl_cache->ccc_lru_lock);
-		list_add(&cli->cl_lru_osc, &cli->cl_cache->ccc_lru);
-		spin_unlock(&cli->cl_cache->ccc_lru_lock);
-
-		return 0;
-	}
-
 	if (KEY_IS(KEY_CACHE_LRU_SHRINK)) {
 		struct client_obd *cli = &obd->u.cli;
 		long nr = atomic_long_read(&cli->cl_lru_in_list) >> 1;
-- 
1.8.3.1


* [lustre-devel] [PATCH 251/622] lustre: ldlm: Fix style issues for ldlm_lockd.c
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (249 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 250/622] lustre: lov: remove KEY_CACHE_SET to simplify the code James Simmons
@ 2020-02-27 21:11 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 252/622] lustre: ldlm: Fix style issues for ldlm_request.c James Simmons
                   ` (371 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:11 UTC (permalink / raw)
  To: lustre-devel

From: Arshad Hussain <arshad.super@gmail.com>

This patch fixes issues reported by checkpatch
for file fs/lustre/ldlm/ldlm_lockd.c

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: 5275c82c67d9 ("LU-6142 ldlm: Fix style issues for ldlm_lockd.c")
Signed-off-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-on: https://review.whamcloud.com/34544
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Ben Evans <bevans@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_lockd.c | 64 +++++++++++++++++++++++++++------------------
 1 file changed, 39 insertions(+), 25 deletions(-)

diff --git a/fs/lustre/ldlm/ldlm_lockd.c b/fs/lustre/ldlm/ldlm_lockd.c
index ea146aa..f37d8ef 100644
--- a/fs/lustre/ldlm/ldlm_lockd.c
+++ b/fs/lustre/ldlm/ldlm_lockd.c
@@ -80,7 +80,7 @@ struct ldlm_bl_pool {
 	/*
 	 * blp_prio_list is used for callbacks that should be handled
 	 * as a priority. It is used for LDLM_FL_DISCARD_DATA requests.
-	 * see bug 13843
+	 * see b=13843
 	 */
 	struct list_head	blp_prio_list;
 
@@ -126,22 +126,24 @@ void ldlm_handle_bl_callback(struct ldlm_namespace *ns,
 
 	/* set bits to cancel for this lock for possible lock convert */
 	if (lock->l_resource->lr_type == LDLM_IBITS) {
-		/* Lock description contains policy of blocking lock,
-		 * and its cancel_bits is used to pass conflicting bits.
-		 * NOTE: ld can be NULL or can be not NULL but zeroed if
-		 * passed from ldlm_bl_thread_blwi(), check below used bits
-		 * in ld to make sure it is valid description.
+		/*
+		 * Lock description contains policy of blocking lock, and its
+		 * cancel_bits is used to pass conflicting bits.  NOTE: ld can
+		 * be NULL or can be not NULL but zeroed if passed from
+		 * ldlm_bl_thread_blwi(), check below used bits in ld to make
+		 * sure it is valid description.
 		 *
-		 * If server may replace lock resource keeping the same cookie,
-		 * never use cancel bits from different resource, full cancel
-		 * is to be used.
+		 * If server may replace lock resource keeping the same
+		 * cookie, never use cancel bits from different resource, full
+		 * cancel is to be used.
 		 */
 		if (ld && ld->l_policy_data.l_inodebits.bits &&
 		    ldlm_res_eq(&ld->l_resource.lr_name,
 				&lock->l_resource->lr_name))
 			lock->l_policy_data.l_inodebits.cancel_bits =
 				ld->l_policy_data.l_inodebits.cancel_bits;
-		/* if there is no valid ld and lock is cbpending already
+		/*
+		 * If there is no valid ld and lock is cbpending already
 		 * then cancel_bits should be kept, otherwise it is zeroed.
 		 */
 		else if (!ldlm_is_cbpending(lock))
@@ -169,7 +171,7 @@ void ldlm_handle_bl_callback(struct ldlm_namespace *ns,
 	LDLM_LOCK_RELEASE(lock);
 }
 
-/**
+/*
  * Callback handler for receiving incoming completion ASTs.
  *
  * This only can happen on client side.
@@ -241,8 +243,10 @@ static void ldlm_handle_cp_callback(struct ptlrpc_request *req,
 		goto out;
 	}
 
-	/* If we receive the completion AST before the actual enqueue returned,
-	 * then we might need to switch lock modes, resources, or extents.
+	/*
+	 * If we receive the completion AST before the actual enqueue
+	 * returned, then we might need to switch lock modes, resources, or
+	 * extents.
 	 */
 	if (dlm_req->lock_desc.l_granted_mode != lock->l_req_mode) {
 		lock->l_req_mode = dlm_req->lock_desc.l_granted_mode;
@@ -260,7 +264,8 @@ static void ldlm_handle_cp_callback(struct ptlrpc_request *req,
 	ldlm_resource_unlink_lock(lock);
 
 	if (dlm_req->lock_flags & LDLM_FL_AST_SENT) {
-		/* BL_AST locks are not needed in LRU.
+		/*
+		 * BL_AST locks are not needed in LRU.
 		 * Let ldlm_cancel_lru() be fast.
 		 */
 		ldlm_lock_remove_from_lru(lock);
@@ -374,7 +379,8 @@ static int __ldlm_bl_to_thread(struct ldlm_bl_work_item *blwi,
 
 	wake_up(&blp->blp_waitq);
 
-	/* can not check blwi->blwi_flags as blwi could be already freed in
+	/*
+	 * Can not check blwi->blwi_flags as blwi could be already freed in
 	 * LCF_ASYNC mode
 	 */
 	if (!(cancel_flags & LCF_ASYNC))
@@ -439,7 +445,8 @@ static int ldlm_bl_to_thread(struct ldlm_namespace *ns,
 
 		rc = __ldlm_bl_to_thread(blwi, cancel_flags);
 	} else {
-		/* if it is synchronous call do minimum mem alloc, as it could
+		/*
+		 * If it is synchronous call do minimum mem alloc, as it could
 		 * be triggered from kernel shrinker
 		 */
 		struct ldlm_bl_work_item blwi;
@@ -535,7 +542,8 @@ static int ldlm_callback_handler(struct ptlrpc_request *req)
 	struct ldlm_lock *lock;
 	int rc;
 
-	/* Requests arrive in sender's byte order.  The ptlrpc service
+	/*
+	 * Requests arrive in sender's byte order.  The ptlrpc service
 	 * handler has already checked and, if necessary, byte-swapped the
 	 * incoming request message body, but I am responsible for the
 	 * message buffers.
@@ -596,7 +604,8 @@ static int ldlm_callback_handler(struct ptlrpc_request *req)
 		return 0;
 	}
 
-	/* Force a known safe race, send a cancel to the server for a lock
+	/*
+	 * Force a known safe race, send a cancel to the server for a lock
 	 * which the server has already started a blocking callback on.
 	 */
 	if (OBD_FAIL_CHECK(OBD_FAIL_LDLM_CANCEL_BL_CB_RACE) &&
@@ -626,7 +635,8 @@ static int ldlm_callback_handler(struct ptlrpc_request *req)
 	lock->l_flags |= ldlm_flags_from_wire(dlm_req->lock_flags &
 					      LDLM_FL_AST_MASK);
 	if (lustre_msg_get_opc(req->rq_reqmsg) == LDLM_BL_CALLBACK) {
-		/* If somebody cancels lock and cache is already dropped,
+		/*
+		 * If somebody cancels lock and cache is already dropped,
 		 * or lock is failed before cp_ast received on client,
 		 * we can tell the server we have no lock. Otherwise, we
 		 * should send cancel after dropping the cache.
@@ -643,7 +653,8 @@ static int ldlm_callback_handler(struct ptlrpc_request *req)
 					     &dlm_req->lock_handle[0]);
 			return 0;
 		}
-		/* BL_AST locks are not needed in LRU.
+		/*
+		 * BL_AST locks are not needed in LRU.
 		 * Let ldlm_cancel_lru() be fast.
 		 */
 		ldlm_lock_remove_from_lru(lock);
@@ -651,14 +662,15 @@ static int ldlm_callback_handler(struct ptlrpc_request *req)
 	}
 	unlock_res_and_lock(lock);
 
-	/* We want the ost thread to get this reply so that it can respond
+	/*
+	 * We want the ost thread to get this reply so that it can respond
 	 * to ost requests (write cache writeback) that might be triggered
 	 * in the callback.
 	 *
 	 * But we'd also like to be able to indicate in the reply that we're
 	 * cancelling right now, because it's unused, or have an intent result
-	 * in the reply, so we might have to push the responsibility for sending
-	 * the reply down into the AST handlers, alas.
+	 * in the reply, so we might have to push the responsibility for
+	 * sending the reply down into the AST handlers, alas.
 	 */
 
 	switch (lustre_msg_get_opc(req->rq_reqmsg)) {
@@ -866,7 +878,8 @@ static int ldlm_bl_thread_main(void *arg)
 		if (rc == LDLM_ITER_STOP)
 			break;
 
-		/* If there are many namespaces, we will not sleep waiting for
+		/*
+		 * If there are many namespaces, we will not sleep waiting for
 		 * work, and must do a cond_resched to avoid holding the CPU
 		 * for too long
 		 */
@@ -1171,7 +1184,8 @@ void ldlm_exit(void)
 	if (ldlm_refcount)
 		CERROR("ldlm_refcount is %d in %s!\n", ldlm_refcount, __func__);
 	kmem_cache_destroy(ldlm_resource_slab);
-	/* ldlm_lock_put() use RCU to call ldlm_lock_free, so need call
+	/*
+	 * ldlm_lock_put() use RCU to call ldlm_lock_free, so need call
 	 * synchronize_rcu() to wait a grace period elapsed, so that
 	 * ldlm_lock_free() get a chance to be called.
 	 */
-- 
1.8.3.1


* [lustre-devel] [PATCH 252/622] lustre: ldlm: Fix style issues for ldlm_request.c
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (250 preceding siblings ...)
  2020-02-27 21:11 ` [lustre-devel] [PATCH 251/622] lustre: ldlm: Fix style issues for ldlm_lockd.c James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 253/622] lustre: ptlrpc: Fix style issues for sec_bulk.c James Simmons
                   ` (370 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Arshad Hussain <arshad.super@gmail.com>

This patch fixes issues reported by checkpatch
for file fs/lustre/ldlm/ldlm_request.c

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: 3a56c0e5f42f ("LU-6142 ldlm: Fix style issues for ldlm_request.c")
Signed-off-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-on: https://review.whamcloud.com/34547
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Ben Evans <bevans@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_request.c | 144 +++++++++++++++++++++++++++---------------
 1 file changed, 94 insertions(+), 50 deletions(-)

diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index fb564f4..45d70d4 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -147,7 +147,8 @@ static void ldlm_expired_completion_wait(struct ldlm_lock *lock, u32 conn_cnt)
  *
  * Return:	timeout in seconds to wait for the server reply
  */
-/* We use the same basis for both server side and client side functions
+/*
+ * We use the same basis for both server side and client side functions
  * from a single node.
  */
 static time64_t ldlm_cp_timeout(struct ldlm_lock *lock)
@@ -289,13 +290,14 @@ static void failed_lock_cleanup(struct ldlm_namespace *ns,
 {
 	int need_cancel = 0;
 
-	/* Set a flag to prevent us from sending a CANCEL (bug 407) */
+	/* Set a flag to prevent us from sending a CANCEL (b=407) */
 	lock_res_and_lock(lock);
 	/* Check that lock is not granted or failed, we might race. */
 	if (!ldlm_is_granted(lock) && !ldlm_is_failed(lock)) {
-		/* Make sure that this lock will not be found by raced
+		/*
+		 * Make sure that this lock will not be found by raced
 		 * bl_ast and -EINVAL reply is sent to server anyways.
-		 * bug 17645
+		 * b=17645
 		 */
 		lock->l_flags |= LDLM_FL_LOCAL_ONLY | LDLM_FL_FAILED |
 				 LDLM_FL_ATOMIC_CB | LDLM_FL_CBPENDING;
@@ -309,10 +311,12 @@ static void failed_lock_cleanup(struct ldlm_namespace *ns,
 	else
 		LDLM_DEBUG(lock, "lock was granted or failed in race");
 
-	/* XXX - HACK because we shouldn't call ldlm_lock_destroy()
+	/*
+	 * XXX - HACK because we shouldn't call ldlm_lock_destroy()
 	 *       from llite/file.c/ll_file_flock().
 	 */
-	/* This code makes for the fact that we do not have blocking handler on
+	/*
+	 * This code makes for the fact that we do not have blocking handler on
 	 * a client for flock locks. As such this is the place where we must
 	 * completely kill failed locks. (interrupted and those that
 	 * were waiting to be granted when server evicted us.
@@ -416,7 +420,8 @@ int ldlm_cli_enqueue_fini(struct obd_export *exp, struct ptlrpc_request *req,
 	CDEBUG(D_INFO, "local: %p, remote cookie: %#llx, flags: 0x%llx\n",
 	       lock, reply->lock_handle.cookie, *flags);
 
-	/* If enqueue returned a blocked lock but the completion handler has
+	/*
+	 * If enqueue returned a blocked lock but the completion handler has
 	 * already run, then it fixed up the resource and we don't need to do it
 	 * again.
 	 */
@@ -466,11 +471,13 @@ int ldlm_cli_enqueue_fini(struct obd_export *exp, struct ptlrpc_request *req,
 		LDLM_DEBUG(lock, "enqueue reply includes blocking AST");
 	}
 
-	/* If the lock has already been granted by a completion AST, don't
+	/*
+	 * If the lock has already been granted by a completion AST, don't
 	 * clobber the LVB with an older one.
 	 */
 	if (lvb_len > 0) {
-		/* We must lock or a racing completion might update lvb without
+		/*
+		 * We must lock or a racing completion might update lvb without
 		 * letting us know and we'll clobber the correct value.
 		 * Cannot unlock after the check either, as that still leaves
 		 * a tiny window for completion to get in
@@ -499,7 +506,8 @@ int ldlm_cli_enqueue_fini(struct obd_export *exp, struct ptlrpc_request *req,
 	}
 
 	if (lvb_len > 0 && lvb) {
-		/* Copy the LVB here, and not earlier, because the completion
+		/*
+		 * Copy the LVB here, and not earlier, because the completion
 		 * AST (if any) can override what we got in the reply
 		 */
 		memcpy(lvb, lock->l_lvb_data, lvb_len);
@@ -586,7 +594,8 @@ int ldlm_prep_elc_req(struct obd_export *exp, struct ptlrpc_request *req,
 		to_free = !ns_connect_lru_resize(ns) &&
 			  opc == LDLM_ENQUEUE ? 1 : 0;
 
-		/* Cancel LRU locks here _only_ if the server supports
+		/*
+		 * Cancel LRU locks here _only_ if the server supports
 		 * EARLY_CANCEL. Otherwise we have to send extra CANCEL
 		 * RPC, which will make us slower.
 		 */
@@ -611,7 +620,8 @@ int ldlm_prep_elc_req(struct obd_export *exp, struct ptlrpc_request *req,
 		if (canceloff) {
 			dlm = req_capsule_client_get(pill, &RMF_DLM_REQ);
 			LASSERT(dlm);
-			/* Skip first lock handler in ldlm_request_pack(),
+			/*
+			 * Skip first lock handler in ldlm_request_pack(),
 			 * this method will increment @lock_count according
 			 * to the lock handle amount actually written to
 			 * the buffer.
@@ -685,7 +695,8 @@ int ldlm_cli_enqueue(struct obd_export *exp, struct ptlrpc_request **reqp,
 
 	ns = exp->exp_obd->obd_namespace;
 
-	/* If we're replaying this lock, just check some invariants.
+	/*
+	 * If we're replaying this lock, just check some invariants.
 	 * If we're creating a new lock, get everything all setup nicely.
 	 */
 	if (is_replay) {
@@ -752,7 +763,8 @@ int ldlm_cli_enqueue(struct obd_export *exp, struct ptlrpc_request **reqp,
 	if (*flags & LDLM_FL_NDELAY) {
 		DEBUG_REQ(D_DLMTRACE, req, "enque lock with no delay\n");
 		req->rq_no_resend = req->rq_no_delay = 1;
-		/* probably set a shorter timeout value and handle ETIMEDOUT
+		/*
+		 * probably set a shorter timeout value and handle ETIMEDOUT
 		 * in osc_lock_upcall() correctly
 		 */
 		/* lustre_msg_set_timeout(req, req->rq_timeout / 2); */
@@ -799,7 +811,8 @@ int ldlm_cli_enqueue(struct obd_export *exp, struct ptlrpc_request **reqp,
 				    einfo->ei_mode, flags, lvb, lvb_len,
 				    lockh, rc);
 
-	/* If ldlm_cli_enqueue_fini did not find the lock, we need to free
+	/*
+	 * If ldlm_cli_enqueue_fini did not find the lock, we need to free
 	 * one reference that we took
 	 */
 	if (err == -ENOLCK)
@@ -860,7 +873,8 @@ static int lock_convert_interpret(const struct lu_env *env,
 	}
 
 	lock_res_and_lock(lock);
-	/* Lock convert is sent for any new bits to drop, the converting flag
+	/*
+	 * Lock convert is sent for any new bits to drop, the converting flag
 	 * is dropped when ibits on server are the same as on client. Meanwhile
 	 * that can be so that more later convert will be replied first with
 	 * and clear converting flag, so in case of such race just exit here.
@@ -872,7 +886,8 @@ static int lock_convert_interpret(const struct lu_env *env,
 			   reply->lock_desc.l_policy_data.l_inodebits.bits);
 	} else if (reply->lock_desc.l_policy_data.l_inodebits.bits !=
 		   lock->l_policy_data.l_inodebits.bits) {
-		/* Compare server returned lock ibits and local lock ibits
+		/*
+		 * Compare server returned lock ibits and local lock ibits
 		 * if they are the same we consider conversion is done,
 		 * otherwise we have more converts inflight and keep
 		 * converting flag.
@@ -882,14 +897,16 @@ static int lock_convert_interpret(const struct lu_env *env,
 	} else {
 		ldlm_clear_converting(lock);
 
-		/* Concurrent BL AST may arrive and cause another convert
+		/*
+		 * Concurrent BL AST may arrive and cause another convert
 		 * or cancel so just do nothing here if bl_ast is set,
 		 * finish with convert otherwise.
 		 */
 		if (!ldlm_is_bl_ast(lock)) {
 			struct ldlm_namespace *ns = ldlm_lock_to_ns(lock);
 
-			/* Drop cancel_bits since there are no more converts
+			/*
+			 * Drop cancel_bits since there are no more converts
 			 * and put lock into LRU if it is still not used and
 			 * is not there yet.
 			 */
@@ -918,7 +935,8 @@ static int lock_convert_interpret(const struct lu_env *env,
 		}
 		unlock_res_and_lock(lock);
 
-		/* fallback to normal lock cancel. If rc means there is no
+		/*
+		 * fallback to normal lock cancel. If rc means there is no
 		 * valid lock on server, do only local cancel
 		 */
 		if (rc == ELDLM_NO_LOCK_DATA)
@@ -959,7 +977,8 @@ int ldlm_cli_convert(struct ldlm_lock *lock, u32 *flags)
 		return -EINVAL;
 	}
 
-	/* this is better to check earlier and it is done so already,
+	/*
+	 * this is better to check earlier and it is done so already,
 	 * but this check is kept too as final one to issue an error
 	 * if any new code will miss such check.
 	 */
@@ -1075,7 +1094,8 @@ static void ldlm_cancel_pack(struct ptlrpc_request *req,
 	max += LDLM_LOCKREQ_HANDLES;
 	LASSERT(max >= dlm->lock_count + count);
 
-	/* XXX: it would be better to pack lock handles grouped by resource.
+	/*
+	 * XXX: it would be better to pack lock handles grouped by resource.
 	 * so that the server cancel would call filter_lvbo_update() less
 	 * frequently.
 	 */
@@ -1202,7 +1222,8 @@ int ldlm_cli_update_pool(struct ptlrpc_request *req)
 		return 0;
 	}
 
-	/* In some cases RPC may contain SLV and limit zeroed out. This
+	/*
+	 * In some cases RPC may contain SLV and limit zeroed out. This
 	 * is the case when server does not support LRU resize feature.
 	 * This is also possible in some recovery cases when server-side
 	 * reqs have no reference to the OBD export and thus access to
@@ -1221,7 +1242,8 @@ int ldlm_cli_update_pool(struct ptlrpc_request *req)
 	new_slv = lustre_msg_get_slv(req->rq_repmsg);
 	obd = req->rq_import->imp_obd;
 
-	/* Set new SLV and limit in OBD fields to make them accessible
+	/*
+	 * Set new SLV and limit in OBD fields to make them accessible
 	 * to the pool thread. We do not access obd_namespace and pool
 	 * directly here as there is no reliable way to make sure that
 	 * they are still alive at cleanup time. Evil races are possible
@@ -1281,7 +1303,8 @@ int ldlm_cli_cancel(const struct lustre_handle *lockh,
 		return 0;
 	}
 
-	/* Lock is being converted, cancel it immediately.
+	/*
+	 * Lock is being converted, cancel it immediately.
 	 * When convert will end, it releases lock and it will be gone.
 	 */
 	if (ldlm_is_converting(lock)) {
@@ -1302,7 +1325,8 @@ int ldlm_cli_cancel(const struct lustre_handle *lockh,
 		LDLM_LOCK_RELEASE(lock);
 		return 0;
 	}
-	/* Even if the lock is marked as LDLM_FL_BL_AST, this is a LDLM_CANCEL
+	/*
+	 * Even if the lock is marked as LDLM_FL_BL_AST, this is a LDLM_CANCEL
 	 * RPC which goes to canceld portal, so we can cancel other LRU locks
 	 * here and send them all as one LDLM_CANCEL RPC.
 	 */
@@ -1350,7 +1374,8 @@ int ldlm_cli_cancel_list_local(struct list_head *cancels, int count,
 		} else {
 			rc = ldlm_cli_cancel_local(lock);
 		}
-		/* Until we have compound requests and can send LDLM_CANCEL
+		/*
+		 * Until we have compound requests and can send LDLM_CANCEL
 		 * requests batched with generic RPCs, we need to send cancels
 		 * with the LDLM_FL_BL_AST flag in a separate RPC from
 		 * the one being generated now.
@@ -1387,7 +1412,8 @@ int ldlm_cli_cancel_list_local(struct list_head *cancels, int count,
 {
 	enum ldlm_policy_res result = LDLM_POLICY_CANCEL_LOCK;
 
-	/* don't check added & count since we want to process all locks
+	/*
+	 * don't check added & count since we want to process all locks
 	 * from unused list.
 	 * It's fine to not take lock to access lock->l_resource since
 	 * the lock has already been granted so it won't change.
@@ -1424,7 +1450,8 @@ static enum ldlm_policy_res ldlm_cancel_lrur_policy(struct ldlm_namespace *ns,
 	u64 slv, lvf, lv;
 	s64 la;
 
-	/* Stop LRU processing when we reach past @count or have checked all
+	/*
+	 * Stop LRU processing when we reach past @count or have checked all
 	 * locks in LRU.
 	 */
 	if (count && added >= count)
@@ -1447,7 +1474,8 @@ static enum ldlm_policy_res ldlm_cancel_lrur_policy(struct ldlm_namespace *ns,
 	/* Inform pool about current CLV to see it via debugfs. */
 	ldlm_pool_set_clv(pl, lv);
 
-	/* Stop when SLV is not yet come from server or lv is smaller than
+	/*
+	 * Stop when SLV is not yet come from server or lv is smaller than
 	 * it is.
 	 */
 	if (slv == 0 || lv < slv)
@@ -1469,7 +1497,8 @@ static enum ldlm_policy_res ldlm_cancel_passed_policy(struct ldlm_namespace *ns,
 						      int unused, int added,
 						      int count)
 {
-	/* Stop LRU processing when we reach past @count or have checked all
+	/*
+	 * Stop LRU processing when we reach past @count or have checked all
 	 * locks in LRU.
 	 */
 	return (added >= count) ?
@@ -1538,7 +1567,8 @@ static enum ldlm_policy_res ldlm_cancel_aged_policy(struct ldlm_namespace *ns,
 ldlm_cancel_default_policy(struct ldlm_namespace *ns, struct ldlm_lock *lock,
 			   int unused, int added, int count)
 {
-	/* Stop LRU processing when we reach past count or have checked all
+	/*
+	 * Stop LRU processing when we reach past count or have checked all
 	 * locks in LRU.
 	 */
 	return (added >= count) ?
@@ -1652,7 +1682,8 @@ static int ldlm_prepare_lru_list(struct ldlm_namespace *ns,
 			    !ldlm_is_converting(lock))
 				break;
 
-			/* Somebody is already doing CANCEL. No need for this
+			/*
+			 * Somebody is already doing CANCEL. No need for this
 			 * lock in LRU, do not traverse it again.
 			 */
 			ldlm_lock_remove_from_lru_nolock(lock);
@@ -1668,7 +1699,8 @@ static int ldlm_prepare_lru_list(struct ldlm_namespace *ns,
 		spin_unlock(&ns->ns_lock);
 		lu_ref_add(&lock->l_reference, __func__, current);
 
-		/* Pass the lock through the policy filter and see if it
+		/*
+		 * Pass the lock through the policy filter and see if it
 		 * should stay in LRU.
 		 *
 		 * Even for shrinker policy we stop scanning if
@@ -1707,7 +1739,8 @@ static int ldlm_prepare_lru_list(struct ldlm_namespace *ns,
 		/* Check flags again under the lock. */
 		if (ldlm_is_canceling(lock) || ldlm_is_converting(lock) ||
 		    (ldlm_lock_remove_from_lru_check(lock, last_use) == 0)) {
-			/* Another thread is removing lock from LRU, or
+			/*
+			 * Another thread is removing lock from LRU, or
 			 * somebody is already doing CANCEL, or there
 			 * is a blocking request which will send cancel
 			 * by itself, or the lock is no longer unused or
@@ -1722,7 +1755,8 @@ static int ldlm_prepare_lru_list(struct ldlm_namespace *ns,
 		}
 		LASSERT(!lock->l_readers && !lock->l_writers);
 
-		/* If we have chosen to cancel this lock voluntarily, we
+		/*
+		 * If we have chosen to cancel this lock voluntarily, we
 		 * better send cancel notification to server, so that it
 		 * frees appropriate state. This might lead to a race
 		 * where while we are doing cancel here, server is also
@@ -1730,7 +1764,8 @@ static int ldlm_prepare_lru_list(struct ldlm_namespace *ns,
 		 */
 		ldlm_clear_cancel_on_block(lock);
 
-		/* Setting the CBPENDING flag is a little misleading,
+		/*
+		 * Setting the CBPENDING flag is a little misleading,
 		 * but prevents an important race; namely, once
 		 * CBPENDING is set, the lock can accumulate no more
 		 * readers/writers. Since readers and writers are
@@ -1744,11 +1779,12 @@ static int ldlm_prepare_lru_list(struct ldlm_namespace *ns,
 		     ldlm_has_dom(lock)) && lock->l_granted_mode == LCK_PR)
 			ldlm_set_discard_data(lock);
 
-		/* We can't re-add to l_lru as it confuses the
+		/*
+		 * We can't re-add to l_lru as it confuses the
 		 * refcounting in ldlm_lock_remove_from_lru() if an AST
 		 * arrives after we drop lr_lock below. We use l_bl_ast
 		 * and can't use l_pending_chain as it is used both on
-		 * server and client nevertheless bug 5666 says it is
+		 * server and client nevertheless b=5666 says it is
 		 * used only on server
 		 */
 		LASSERT(list_empty(&lock->l_bl_ast));
@@ -1787,7 +1823,8 @@ int ldlm_cancel_lru(struct ldlm_namespace *ns, int nr,
 	LIST_HEAD(cancels);
 	int count, rc;
 
-	/* Just prepare the list of locks, do not actually cancel them yet.
+	/*
+	 * Just prepare the list of locks, do not actually cancel them yet.
 	 * Locks are cancelled later in a separate thread.
 	 */
 	count = ldlm_prepare_lru_list(ns, &cancels, nr, 0, flags);
@@ -1824,7 +1861,8 @@ int ldlm_cancel_resource_local(struct ldlm_resource *res,
 		if (lock->l_readers || lock->l_writers)
 			continue;
 
-		/* If somebody is already doing CANCEL, or blocking AST came,
+		/*
+		 * If somebody is already doing CANCEL, or blocking AST came,
 		 * skip this lock.
 		 */
 		if (ldlm_is_bl_ast(lock) || ldlm_is_canceling(lock) ||
@@ -1834,7 +1872,8 @@ int ldlm_cancel_resource_local(struct ldlm_resource *res,
 		if (lockmode_compat(lock->l_granted_mode, mode))
 			continue;
 
-		/* If policy is given and this is IBITS lock, add to list only
+		/*
+		 * If policy is given and this is IBITS lock, add to list only
 		 * those locks that match by policy.
 		 * Skip locks with DoM bit always to don't flush data.
 		 */
@@ -1878,7 +1917,8 @@ int ldlm_cli_cancel_list(struct list_head *cancels, int count,
 	if (list_empty(cancels) || count == 0)
 		return 0;
 
-	/* XXX: requests (both batched and not) could be sent in parallel.
+	/*
+	 * XXX: requests (both batched and not) could be sent in parallel.
 	 * Usually it is enough to have just 1 RPC, but it is possible that
 	 * there are too many locks to be cancelled in LRU or on a resource.
 	 * It would also speed up the case when the server does not support
@@ -2071,7 +2111,8 @@ static void ldlm_namespace_foreach(struct ldlm_namespace *ns,
 				 ldlm_res_iter_helper, &helper, 0);
 }
 
-/* non-blocking function to manipulate a lock whose cb_data is being put away.
+/*
+ * non-blocking function to manipulate a lock whose cb_data is being put away.
  * return  0:  find no resource
  *       > 0:  must be LDLM_ITER_STOP/LDLM_ITER_CONTINUE.
  *       < 0:  errors
@@ -2108,8 +2149,9 @@ static int ldlm_chain_lock_for_replay(struct ldlm_lock *lock, void *closure)
 		 "lock %p next %p prev %p\n",
 		 lock, &lock->l_pending_chain.next,
 		 &lock->l_pending_chain.prev);
-	/* bug 9573: don't replay locks left after eviction, or
-	 * bug 17614: locks being actively cancelled. Get a reference
+	/*
+	 * b=9573: don't replay locks left after eviction, or
+	 * b=17614: locks being actively cancelled. Get a reference
 	 * on a lock so that it does not disappear under us (e.g. due to cancel)
 	 */
 	if (!(lock->l_flags & (LDLM_FL_FAILED | LDLM_FL_BL_DONE))) {
@@ -2169,7 +2211,7 @@ static int replay_one_lock(struct obd_import *imp, struct ldlm_lock *lock)
 	struct ldlm_request *body;
 	int flags;
 
-	/* Bug 11974: Do not replay a lock which is actively being canceled */
+	/* B=11974: Do not replay a lock which is actively being canceled */
 	if (ldlm_is_bl_done(lock)) {
 		LDLM_DEBUG(lock, "Not replaying canceled lock:");
 		return 0;
@@ -2226,10 +2268,11 @@ static int replay_one_lock(struct obd_import *imp, struct ldlm_lock *lock)
 	req_capsule_set_size(&req->rq_pill, &RMF_DLM_LVB, RCL_SERVER,
 			     lock->l_lvb_len);
 	ptlrpc_request_set_replen(req);
-	/* notify the server we've replayed all requests.
+	/*
+	 * notify the server we've replayed all requests.
 	 * also, we mark the request to be put on a dedicated
 	 * queue to be processed after all request replayes.
-	 * bug 6063
+	 * b=6063
 	 */
 	lustre_msg_set_flags(req->rq_reqmsg, MSG_REQ_REPLAY_DONE);
 
@@ -2263,7 +2306,8 @@ static void ldlm_cancel_unused_locks_for_replay(struct ldlm_namespace *ns)
 	       "Dropping as many unused locks as possible before replay for namespace %s (%d)\n",
 	       ldlm_ns_name(ns), ns->ns_nr_unused);
 
-	/* We don't need to care whether or not LRU resize is enabled
+	/*
+	 * We don't need to care whether or not LRU resize is enabled
 	 * because the LDLM_LRU_FLAG_NO_WAIT policy doesn't use the
 	 * count parameter
 	 */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 253/622] lustre: ptlrpc: Fix style issues for sec_bulk.c
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (251 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 252/622] lustre: ldlm: Fix style issues for ldlm_request.c James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 254/622] lustre: ldlm: Fix style issues for ptlrpcd.c James Simmons
                   ` (369 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Arshad Hussain <arshad.super@gmail.com>

This patch fixes issues reported by checkpatch
for file fs/lustre/ptlrpc/sec_bulk.c

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: a294ea9a0e04 ("LU-6142 ptlrpc: Fix style issues for sec_bulk.c")
Signed-off-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-on: https://review.whamcloud.com/34548
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Ben Evans <bevans@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/sec_bulk.c | 71 ++++++++++++++++++++-------------------------
 1 file changed, 32 insertions(+), 39 deletions(-)

diff --git a/fs/lustre/ptlrpc/sec_bulk.c b/fs/lustre/ptlrpc/sec_bulk.c
index e170da1..d36230b 100644
--- a/fs/lustre/ptlrpc/sec_bulk.c
+++ b/fs/lustre/ptlrpc/sec_bulk.c
@@ -50,9 +50,9 @@
 
 #include "ptlrpc_internal.h"
 
-/****************************************
- * bulk encryption page pools	   *
- ****************************************/
+/*
+ * bulk encryption page pools
+ */
 
 #define POINTERS_PER_PAGE	(PAGE_SIZE / sizeof(void *))
 #define PAGES_PER_POOL		(POINTERS_PER_PAGE)
@@ -63,19 +63,16 @@
 #define CACHE_QUIESCENT_PERIOD  (20)
 
 static struct ptlrpc_enc_page_pool {
-	/*
-	 * constants
-	 */
-	unsigned long		epp_max_pages;	/* maximum pages can hold, const */
-	unsigned int		epp_max_pools;	/* number of pools, const */
+	unsigned long epp_max_pages;	/* maximum pages can hold, const */
+	unsigned int epp_max_pools;	/* number of pools, const */
 
 	/*
 	 * wait queue in case of not enough free pages.
 	 */
-	wait_queue_head_t	epp_waitq;	/* waiting threads */
-	unsigned int		epp_waitqlen;	/* wait queue length */
-	unsigned long		epp_pages_short; /* # of pages wanted of in-q users */
-	unsigned int		epp_growing:1;	/* during adding pages */
+	wait_queue_head_t epp_waitq;	/* waiting threads */
+	unsigned int epp_waitqlen;	/* wait queue length */
+	unsigned long epp_pages_short;	/* # of pages wanted of in-q users */
+	unsigned int epp_growing:1;	/* during adding pages */
 
 	/*
 	 * indicating how idle the pools are, from 0 to MAX_IDLE_IDX
@@ -84,36 +81,32 @@
 	 * is idled for a while but the idle_idx might still be low if no
 	 * activities happened in the pools.
 	 */
-	unsigned long		epp_idle_idx;
+	unsigned long epp_idle_idx;
 
 	/* last shrink time due to mem tight */
-	time64_t		epp_last_shrink;
-	time64_t		epp_last_access;
-
-	/*
-	 * in-pool pages bookkeeping
-	 */
-	spinlock_t		epp_lock;	 /* protect following fields */
-	unsigned long		epp_total_pages; /* total pages in pools */
-	unsigned long		epp_free_pages;	 /* current pages available */
-
-	/*
-	 * statistics
-	 */
-	unsigned long		epp_st_max_pages;	/* # of pages ever reached */
-	unsigned int		epp_st_grows;		/* # of grows */
-	unsigned int		epp_st_grow_fails;	/* # of add pages failures */
-	unsigned int		epp_st_shrinks;		/* # of shrinks */
-	unsigned long		epp_st_access;		/* # of access */
-	unsigned long		epp_st_missings;	/* # of cache missing */
-	unsigned long		epp_st_lowfree;		/* lowest free pages reached */
-	unsigned int		epp_st_max_wqlen;	/* highest waitqueue length */
-	ktime_t			epp_st_max_wait;	/* in nanoseconds */
-	unsigned long		epp_st_outofmem;	/* # of out of mem requests */
+	time64_t epp_last_shrink;
+	time64_t epp_last_access;
+
+	/* in-pool pages bookkeeping */
+	spinlock_t epp_lock;			/* protect following fields */
+	unsigned long epp_total_pages;		/* total pages in pools */
+	unsigned long epp_free_pages;		/* current pages available */
+
+	/* statistics */
+	unsigned long epp_st_max_pages;		/* # of pages ever reached */
+	unsigned int epp_st_grows;		/* # of grows */
+	unsigned int epp_st_grow_fails;		/* # of add pages failures */
+	unsigned int epp_st_shrinks;		/* # of shrinks */
+	unsigned long epp_st_access;		/* # of access */
+	unsigned long epp_st_missings;		/* # of cache missing */
+	unsigned long epp_st_lowfree;		/* lowest free pages reached */
+	unsigned int epp_st_max_wqlen;		/* highest waitqueue length */
+	ktime_t epp_st_max_wait;		/* in nanoseconds */
+	unsigned long epp_st_outofmem;		/* # of out of mem requests */
 	/*
-	 * pointers to pools
+	 * pointers to pools, may be vmalloc'd
 	 */
-	struct page		***epp_pools;
+	struct page ***epp_pools;
 } page_pools;
 
 /*
@@ -185,7 +178,7 @@ static void enc_pools_release_free_pages(long npages)
 
 	/* max pool index after the release */
 	p_idx_max1 = page_pools.epp_total_pages == 0 ? -1 :
-		     ((page_pools.epp_total_pages - 1) / PAGES_PER_POOL);
+		((page_pools.epp_total_pages - 1) / PAGES_PER_POOL);
 
 	p_idx = page_pools.epp_free_pages / PAGES_PER_POOL;
 	g_idx = page_pools.epp_free_pages % PAGES_PER_POOL;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 254/622] lustre: ldlm: Fix style issues for ptlrpcd.c
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (252 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 253/622] lustre: ptlrpc: Fix style issues for sec_bulk.c James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 255/622] lustre: ptlrpc: IR doesn't reconnect after EAGAIN James Simmons
                   ` (368 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Arshad Hussain <arshad.super@gmail.com>

This patch fixes issues reported by checkpatch
for file fs/lustre/ptlrpc/ptlrpcd.c

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: f64aeebfceb3 ("LU-6142 ldlm: Fix style issues for ptlrpcd.c")
Signed-off-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-on: https://review.whamcloud.com/34604
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Ben Evans <bevans@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/ptlrpcd.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/lustre/ptlrpc/ptlrpcd.c b/fs/lustre/ptlrpc/ptlrpcd.c
index e9c03ba..bcf1e46 100644
--- a/fs/lustre/ptlrpc/ptlrpcd.c
+++ b/fs/lustre/ptlrpc/ptlrpcd.c
@@ -238,7 +238,8 @@ void ptlrpcd_add_req(struct ptlrpc_request *req)
 			wait_event_idle(req->rq_set_waitq,
 					!req->rq_set);
 	} else if (req->rq_set) {
-		/* If we have a valid "rq_set", just reuse it to avoid double
+		/*
+		 * If we have a valid "rq_set", just reuse it to avoid double
 		 * linked.
 		 */
 		LASSERT(req->rq_phase == RQ_PHASE_NEW);
@@ -294,7 +295,8 @@ static int ptlrpcd_check(struct lu_env *env, struct ptlrpcd_ctl *pc)
 		spin_unlock(&set->set_new_req_lock);
 	}
 
-	/* We should call lu_env_refill() before handling new requests to make
+	/*
+	 * We should call lu_env_refill() before handling new requests to make
 	 * sure that env key the requests depending on really exists.
 	 */
 	rc2 = lu_env_refill(env);
@@ -316,7 +318,8 @@ static int ptlrpcd_check(struct lu_env *env, struct ptlrpcd_ctl *pc)
 	if (atomic_read(&set->set_remaining))
 		rc |= ptlrpc_check_set(env, set);
 
-	/* NB: ptlrpc_check_set has already moved completed request at the
+	/*
+	 * NB: ptlrpc_check_set has already moved completed request at the
 	 * head of seq::set_requests
 	 */
 	list_for_each_entry_safe(req, tmp, &set->set_requests, rq_set_chain) {
@@ -334,7 +337,8 @@ static int ptlrpcd_check(struct lu_env *env, struct ptlrpcd_ctl *pc)
 		 */
 		rc = atomic_read(&set->set_new_count);
 
-		/* If we have nothing to do, check whether we can take some
+		/*
+		 * If we have nothing to do, check whether we can take some
 		 * work from our partner threads.
 		 */
 		if (rc == 0 && pc->pc_npartners > 0) {
@@ -379,7 +383,6 @@ static int ptlrpcd_check(struct lu_env *env, struct ptlrpcd_ctl *pc)
  * Main ptlrpcd thread.
  * ptlrpc's code paths like to execute in process context, so we have this
  * thread which spins on a set which contains the rpcs and sends them.
- *
  */
 static int ptlrpcd(void *arg)
 {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 255/622] lustre: ptlrpc: IR doesn't reconnect after EAGAIN
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (253 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 254/622] lustre: ldlm: Fix style issues for ptlrpcd.c James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 256/622] lustre: llite: ll_fault fixes James Simmons
                   ` (367 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Sergey Cheremencev <c17829@cray.com>

There is a chance that a client connects to an OST before
recovery starts, while the OST is not yet configured.
In that case the OST returns -EAGAIN (target->obd_no_conn == 1).
This is not a problem when pinger_recov is enabled, because
ptlrpc_pinger_main will reconnect later, but no reconnect
happens when pinger_recov is 0.

Move the setting of imp_connect_error into ptlrpc_connect_interpret,
so that only connection errors are stored there.

Cray-bug-id: LUS-2034
WC-bug-id: https://jira.whamcloud.com/browse/LU-11601
Lustre-commit: 3341c8c31871 ("LU-11601 ptlrpc: IR doesn't reconnect after EAGAIN")
Signed-off-by: Sergey Cheremencev <c17829@cray.com>
Reviewed-on: https://es-gerrit.dev.cray.com/153542
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Vitaly Fertman <c17818@cray.com>
Reviewed-on: https://review.whamcloud.com/33557
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_support.h | 1 +
 fs/lustre/ptlrpc/client.c       | 1 -
 fs/lustre/ptlrpc/import.c       | 1 +
 fs/lustre/ptlrpc/pinger.c       | 3 ++-
 4 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index 36955e8..9ebdcb6 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -264,6 +264,7 @@
 #define OBD_FAIL_OST_STATFS_EINPROGRESS			0x231
 #define OBD_FAIL_OST_SET_INFO_NET			0x232
 #define OBD_FAIL_OST_DISCONNECT_DELAY	 0x245
+#define OBD_FAIL_OST_PREPARE_DELAY	 0x247
 
 #define OBD_FAIL_LDLM					0x300
 #define OBD_FAIL_LDLM_NAMESPACE_NEW			0x301
diff --git a/fs/lustre/ptlrpc/client.c b/fs/lustre/ptlrpc/client.c
index f57ec1883..0f5aa92 100644
--- a/fs/lustre/ptlrpc/client.c
+++ b/fs/lustre/ptlrpc/client.c
@@ -1457,7 +1457,6 @@ static int after_reply(struct ptlrpc_request *req)
 				  lustre_msg_get_service_time(req->rq_repmsg));
 
 	rc = ptlrpc_check_status(req);
-	imp->imp_connect_error = rc;
 
 	if (rc) {
 		/*
diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index 39d9e3e..a75856a 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -944,6 +944,7 @@ static int ptlrpc_connect_interpret(const struct lu_env *env,
 		return 0;
 	}
 
+	imp->imp_connect_error = rc;
 	if (rc) {
 		struct ptlrpc_request *free_req;
 		struct ptlrpc_request *tmp;
diff --git a/fs/lustre/ptlrpc/pinger.c b/fs/lustre/ptlrpc/pinger.c
index c565e2d..c3fbddc 100644
--- a/fs/lustre/ptlrpc/pinger.c
+++ b/fs/lustre/ptlrpc/pinger.c
@@ -228,7 +228,8 @@ static void ptlrpc_pinger_process_import(struct obd_import *imp,
 	if (level == LUSTRE_IMP_DISCON && !imp_is_deactive(imp)) {
 		/* wait for a while before trying recovery again */
 		imp->imp_next_ping = ptlrpc_next_reconnect(imp);
-		if (!imp->imp_no_pinger_recover)
+		if (!imp->imp_no_pinger_recover ||
+		    imp->imp_connect_error == -EAGAIN)
 			ptlrpc_initiate_recovery(imp);
 	} else if (level != LUSTRE_IMP_FULL ||
 		   imp->imp_obd->obd_no_recov ||
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 256/622] lustre: llite: ll_fault fixes
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (254 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 255/622] lustre: ptlrpc: IR doesn't reconnect after EAGAIN James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 257/622] lustre: lsom: Add an OBD_CONNECT2_LSOM connect flag James Simmons
                   ` (366 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

Various error conditions in the fault path can cause no page to
be returned in vm_fault. Check that the page is present before
accessing it.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11403
Lustre-commit: a8f4d1e5fd79 ("LU-11403 llite: ll_fault fixes")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34247
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Alexander Zarochentsev <c17826@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_mmap.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/llite/llite_mmap.c b/fs/lustre/llite/llite_mmap.c
index 236d1d2..37ce508 100644
--- a/fs/lustre/llite/llite_mmap.c
+++ b/fs/lustre/llite/llite_mmap.c
@@ -378,7 +378,8 @@ static vm_fault_t ll_fault(struct vm_fault *vmf)
 		return VM_FAULT_SIGBUS;
 restart:
 	result = __ll_fault(vmf->vma, vmf);
-	if (!(result & (VM_FAULT_RETRY | VM_FAULT_ERROR | VM_FAULT_LOCKED))) {
+	if (vmf->page &&
+	    !(result & (VM_FAULT_RETRY | VM_FAULT_ERROR | VM_FAULT_LOCKED))) {
 		struct page *vmpage = vmf->page;
 
 		/* check if this page has been truncated */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 257/622] lustre: lsom: Add an OBD_CONNECT2_LSOM connect flag
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (255 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 256/622] lustre: llite: ll_fault fixes James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 258/622] lustre: pcc: Reserve a new connection flag for PCC James Simmons
                   ` (365 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Qian Yingjin <qian@ddn.com>

Add an OBD_CONNECT2_LSOM connect flag so that clients do not send
MDS_ATTR_LSIZE and MDS_ATTR_LBLOCKS flags to the old servers that
do not support them.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12021
Lustre-commit: fdd2c5d3a6e5 ("LU-12021 lsom: Add an OBD_CONNECT2_LSOM connect flag")
Signed-off-by: Qian Yingjin <qian@ddn.com>
Reviewed-on: https://review.whamcloud.com/34343
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Aurelien Degremont <degremoa@amazon.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_lib.c            | 3 ++-
 fs/lustre/mdc/mdc_request.c            | 4 ++++
 fs/lustre/obdclass/lprocfs_status.c    | 4 +++-
 fs/lustre/ptlrpc/wiretest.c            | 2 ++
 include/uapi/linux/lustre/lustre_idl.h | 1 +
 5 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 57486b4..347bdd6 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -216,7 +216,8 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 				   OBD_CONNECT2_LOCK_CONVERT |
 				   OBD_CONNECT2_DIR_MIGRATE |
 				   OBD_CONNECT2_SUM_STATFS |
-				   OBD_CONNECT2_ARCHIVE_ID_ARRAY;
+				   OBD_CONNECT2_ARCHIVE_ID_ARRAY |
+				   OBD_CONNECT2_LSOM;
 
 	if (sbi->ll_flags & LL_SBI_LRU_RESIZE)
 		data->ocd_connect_flags |= OBD_CONNECT_LRU_RESIZE;
diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index f197abc..5931bc1 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -945,6 +945,10 @@ static int mdc_close(struct obd_export *exp, struct md_op_data *op_data,
 	req->rq_request_portal = MDS_READPAGE_PORTAL;
 	ptlrpc_at_set_req_timeout(req);
 
+	if (!(exp_connect_flags2(exp) & OBD_CONNECT2_LSOM))
+		op_data->op_xvalid &= ~(OP_XVALID_LAZYSIZE |
+					OP_XVALID_LAZYBLOCKS);
+
 	mdc_close_pack(req, op_data);
 
 	req_capsule_set_size(&req->rq_pill, &RMF_MDT_MD, RCL_SERVER,
diff --git a/fs/lustre/obdclass/lprocfs_status.c b/fs/lustre/obdclass/lprocfs_status.c
index 7701bc3..cdf25ed 100644
--- a/fs/lustre/obdclass/lprocfs_status.c
+++ b/fs/lustre/obdclass/lprocfs_status.c
@@ -120,7 +120,9 @@
 	"wbc",		/* 0x40 */
 	"lock_convert",	/* 0x80 */
 	"archive_id_array",	/* 0x100 */
-	"selinux_policy",	/* 0x200 */
+	"unknown",		/* 0x200 */
+	"selinux_policy",	/* 0x400 */
+	"lsom",			/* 0x800 */
 	NULL
 };
 
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index bf79b8b..7cb6d74 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1146,6 +1146,8 @@ void lustre_assert_wire_constants(void)
 		 OBD_CONNECT2_ARCHIVE_ID_ARRAY);
 	LASSERTF(OBD_CONNECT2_SELINUX_POLICY == 0x400ULL, "found 0x%.16llxULL\n",
 		 OBD_CONNECT2_SELINUX_POLICY);
+	LASSERTF(OBD_CONNECT2_LSOM == 0x800ULL, "found 0x%.16llxULL\n",
+		 OBD_CONNECT2_LSOM);
 	LASSERTF(OBD_CKSUM_CRC32 == 0x00000001UL, "found 0x%.8xUL\n",
 		 (unsigned int)OBD_CKSUM_CRC32);
 	LASSERTF(OBD_CKSUM_ADLER == 0x00000002UL, "found 0x%.8xUL\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 1a1b6c6..6b9a623 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -806,6 +806,7 @@ struct ptlrpc_body_v2 {
 #define OBD_CONNECT2_LOCK_CONVERT	0x80ULL /* IBITS lock convert support */
 #define OBD_CONNECT2_ARCHIVE_ID_ARRAY  0x100ULL	/* store HSM archive_id in array */
 #define OBD_CONNECT2_SELINUX_POLICY    0x400ULL	/* has client SELinux policy */
+#define OBD_CONNECT2_LSOM	       0x800ULL	/* LSOM support */
 
 /* XXX README XXX:
  * Please DO NOT add flag values here before first ensuring that this same
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 258/622] lustre: pcc: Reserve a new connection flag for PCC
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (256 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 257/622] lustre: lsom: Add an OBD_CONNECT2_LSOM connect flag James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 259/622] lustre: uapi: reserve connect flag for plain layout James Simmons
                   ` (364 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Qian Yingjin <qian@ddn.com>

Reserve OBD_CONNECT2_PCC connection flag that will be set
(in ocd_connect_flags2) if a Lustre server or a client supports
Persistent Client Cache (PCC).

WC-bug-id: https://jira.whamcloud.com/browse/LU-10092
Lustre-commit: 93aa68404669 ("LU-10092 pcc: Reserve a new connection flag for PCC")
Signed-off-by: Qian Yingjin <qian@ddn.com>
Reviewed-on: https://review.whamcloud.com/34356
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/lprocfs_status.c    | 1 +
 fs/lustre/ptlrpc/wiretest.c            | 2 ++
 include/uapi/linux/lustre/lustre_idl.h | 1 +
 3 files changed, 4 insertions(+)

diff --git a/fs/lustre/obdclass/lprocfs_status.c b/fs/lustre/obdclass/lprocfs_status.c
index cdf25ed..254a600 100644
--- a/fs/lustre/obdclass/lprocfs_status.c
+++ b/fs/lustre/obdclass/lprocfs_status.c
@@ -123,6 +123,7 @@
 	"unknown",		/* 0x200 */
 	"selinux_policy",	/* 0x400 */
 	"lsom",			/* 0x800 */
+	"pcc",			/* 0x1000 */
 	NULL
 };
 
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index 7cb6d74..22447e2 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1148,6 +1148,8 @@ void lustre_assert_wire_constants(void)
 		 OBD_CONNECT2_SELINUX_POLICY);
 	LASSERTF(OBD_CONNECT2_LSOM == 0x800ULL, "found 0x%.16llxULL\n",
 		 OBD_CONNECT2_LSOM);
+	LASSERTF(OBD_CONNECT2_PCC == 0x1000ULL, "found 0x%.16llxULL\n",
+		 OBD_CONNECT2_PCC);
 	LASSERTF(OBD_CKSUM_CRC32 == 0x00000001UL, "found 0x%.8xUL\n",
 		 (unsigned int)OBD_CKSUM_CRC32);
 	LASSERTF(OBD_CKSUM_ADLER == 0x00000002UL, "found 0x%.8xUL\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 6b9a623..46c3369 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -807,6 +807,7 @@ struct ptlrpc_body_v2 {
 #define OBD_CONNECT2_ARCHIVE_ID_ARRAY  0x100ULL	/* store HSM archive_id in array */
 #define OBD_CONNECT2_SELINUX_POLICY    0x400ULL	/* has client SELinux policy */
 #define OBD_CONNECT2_LSOM	       0x800ULL	/* LSOM support */
+#define OBD_CONNECT2_PCC	       0x1000ULL /* Persistent Client Cache */
 
 /* XXX README XXX:
  * Please DO NOT add flag values here before first ensuring that this same
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 259/622] lustre: uapi: reserve connect flag for plain layout
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (257 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 258/622] lustre: pcc: Reserve a new connection flag for PCC James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 260/622] lustre: ptlrpc: allow stopping threads above threads_max James Simmons
                   ` (363 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

Reserve the OBD_CONNECT2_PLAIN_LAYOUT flag, so that a client
supporting plain layout won't enable it if the MDT doesn't support
it, and conversely, an MDT supporting plain layout won't send such a
layout to a client that doesn't support it.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11213
Lustre-commit: 14ee65e77bdc ("LU-11213 uapi: reserve connect flag for plain layout")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34656
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/lprocfs_status.c    | 1 +
 fs/lustre/ptlrpc/wiretest.c            | 2 ++
 include/uapi/linux/lustre/lustre_idl.h | 1 +
 3 files changed, 4 insertions(+)

diff --git a/fs/lustre/obdclass/lprocfs_status.c b/fs/lustre/obdclass/lprocfs_status.c
index 254a600..a7c274a 100644
--- a/fs/lustre/obdclass/lprocfs_status.c
+++ b/fs/lustre/obdclass/lprocfs_status.c
@@ -124,6 +124,7 @@
 	"selinux_policy",	/* 0x400 */
 	"lsom",			/* 0x800 */
 	"pcc",			/* 0x1000 */
+	"plain_layout",		/* 0x2000 */
 	NULL
 };
 
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index 22447e2..4a268f6 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1150,6 +1150,8 @@ void lustre_assert_wire_constants(void)
 		 OBD_CONNECT2_LSOM);
 	LASSERTF(OBD_CONNECT2_PCC == 0x1000ULL, "found 0x%.16llxULL\n",
 		 OBD_CONNECT2_PCC);
+	LASSERTF(OBD_CONNECT2_PLAIN_LAYOUT == 0x2000ULL, "found 0x%.16llxULL\n",
+		 OBD_CONNECT2_PLAIN_LAYOUT);
 	LASSERTF(OBD_CKSUM_CRC32 == 0x00000001UL, "found 0x%.8xUL\n",
 		 (unsigned int)OBD_CKSUM_CRC32);
 	LASSERTF(OBD_CKSUM_ADLER == 0x00000002UL, "found 0x%.8xUL\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 46c3369..1b4b018 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -808,6 +808,7 @@ struct ptlrpc_body_v2 {
 #define OBD_CONNECT2_SELINUX_POLICY    0x400ULL	/* has client SELinux policy */
 #define OBD_CONNECT2_LSOM	       0x800ULL	/* LSOM support */
 #define OBD_CONNECT2_PCC	       0x1000ULL /* Persistent Client Cache */
+#define OBD_CONNECT2_PLAIN_LAYOUT      0x2000ULL /* Plain Directory Layout */
 
 /* XXX README XXX:
  * Please DO NOT add flag values here before first ensuring that this same
-- 
1.8.3.1

* [lustre-devel] [PATCH 260/622] lustre: ptlrpc: allow stopping threads above threads_max
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (258 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 259/622] lustre: uapi: reserve connect flag for plain layout James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 261/622] lnet: Avoid lnet debugfs read/write if ctl_table does not exist James Simmons
                   ` (362 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

If a service "threads_max" parameter is set below the number of
running threads, stop each highest-numbered running thread until
the running thread count is below threads_max.  Stopping only the
last thread ensures the thread t_id numbers are always contiguous
rather than having gaps.  If the threads are started again they
will again be assigned contiguous t_id values.

Each thread is stopped only after it has finished processing an
incoming request, so running threads may not immediately stop
when the tunable is changed.

Also fix function declarations in this file to match the proper coding
style.

WC-bug-id: https://jira.whamcloud.com/browse/LU-947
Lustre-commit: 183cb1e3cdd2 ("LU-947 ptlrpc: allow stopping threads above threads_max")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34400
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/service.c | 124 +++++++++++++++++++++++++--------------------
 1 file changed, 69 insertions(+), 55 deletions(-)

diff --git a/fs/lustre/ptlrpc/service.c b/fs/lustre/ptlrpc/service.c
index 7bc578c..362102b 100644
--- a/fs/lustre/ptlrpc/service.c
+++ b/fs/lustre/ptlrpc/service.c
@@ -106,8 +106,7 @@
 	return rqbd;
 }
 
-static void
-ptlrpc_free_rqbd(struct ptlrpc_request_buffer_desc *rqbd)
+static void ptlrpc_free_rqbd(struct ptlrpc_request_buffer_desc *rqbd)
 {
 	struct ptlrpc_service_part *svcpt = rqbd->rqbd_svcpt;
 
@@ -123,8 +122,7 @@
 	kfree(rqbd);
 }
 
-static int
-ptlrpc_grow_req_bufs(struct ptlrpc_service_part *svcpt, int post)
+static int ptlrpc_grow_req_bufs(struct ptlrpc_service_part *svcpt, int post)
 {
 	struct ptlrpc_service *svc = svcpt->scp_service;
 	struct ptlrpc_request_buffer_desc *rqbd;
@@ -230,8 +228,8 @@ struct ptlrpc_hr_service {
 /**
  * Choose an hr thread to dispatch requests to.
  */
-static struct ptlrpc_hr_thread *
-ptlrpc_hr_select(struct ptlrpc_service_part *svcpt)
+static
+struct ptlrpc_hr_thread *ptlrpc_hr_select(struct ptlrpc_service_part *svcpt)
 {
 	struct ptlrpc_hr_partition *hrp;
 	unsigned int rotor;
@@ -270,8 +268,7 @@ void ptlrpc_dispatch_difficult_reply(struct ptlrpc_reply_state *rs)
 	wake_up(&hrt->hrt_waitq);
 }
 
-void
-ptlrpc_schedule_difficult_reply(struct ptlrpc_reply_state *rs)
+void ptlrpc_schedule_difficult_reply(struct ptlrpc_reply_state *rs)
 {
 	assert_spin_locked(&rs->rs_svcpt->scp_rep_lock);
 	assert_spin_locked(&rs->rs_lock);
@@ -288,8 +285,7 @@ void ptlrpc_dispatch_difficult_reply(struct ptlrpc_reply_state *rs)
 }
 EXPORT_SYMBOL(ptlrpc_schedule_difficult_reply);
 
-static int
-ptlrpc_server_post_idle_rqbds(struct ptlrpc_service_part *svcpt)
+static int ptlrpc_server_post_idle_rqbds(struct ptlrpc_service_part *svcpt)
 {
 	struct ptlrpc_request_buffer_desc *rqbd;
 	int rc;
@@ -345,9 +341,8 @@ static void ptlrpc_at_timer(struct timer_list *t)
 	wake_up(&svcpt->scp_waitq);
 }
 
-static void
-ptlrpc_server_nthreads_check(struct ptlrpc_service *svc,
-			     struct ptlrpc_service_conf *conf)
+static void ptlrpc_server_nthreads_check(struct ptlrpc_service *svc,
+					 struct ptlrpc_service_conf *conf)
 {
 	struct ptlrpc_service_thr_conf *tc = &conf->psc_thr;
 	unsigned int init;
@@ -457,9 +452,8 @@ static void ptlrpc_at_timer(struct timer_list *t)
 /**
  * Initialize percpt data for a service
  */
-static int
-ptlrpc_service_part_init(struct ptlrpc_service *svc,
-			 struct ptlrpc_service_part *svcpt, int cpt)
+static int ptlrpc_service_part_init(struct ptlrpc_service *svc,
+				    struct ptlrpc_service_part *svcpt, int cpt)
 {
 	struct ptlrpc_at_array *array;
 	int size;
@@ -549,10 +543,9 @@ static void ptlrpc_at_timer(struct timer_list *t)
  * This includes starting serving threads , allocating and posting rqbds and
  * so on.
  */
-struct ptlrpc_service *
-ptlrpc_register_service(struct ptlrpc_service_conf *conf,
-			struct kset *parent,
-			struct dentry *debugfs_entry)
+struct ptlrpc_service *ptlrpc_register_service(struct ptlrpc_service_conf *conf,
+					       struct kset *parent,
+					       struct dentry *debugfs_entry)
 {
 	struct ptlrpc_service_cpt_conf *cconf = &conf->psc_cpt;
 	struct ptlrpc_service *service;
@@ -1019,8 +1012,7 @@ static int ptlrpc_at_add_timed(struct ptlrpc_request *req)
 	return 0;
 }
 
-static void
-ptlrpc_at_remove_timed(struct ptlrpc_request *req)
+static void ptlrpc_at_remove_timed(struct ptlrpc_request *req)
 {
 	struct ptlrpc_at_array *array;
 
@@ -1351,7 +1343,7 @@ static void ptlrpc_server_hpreq_fini(struct ptlrpc_request *req)
 	}
 }
 
-static int ptlrpc_server_request_add(struct ptlrpc_service_part *svcpt,
+static int ptlrpc_server_request_add(struct ptlrpc_service_part  *svcpt,
 				     struct ptlrpc_request *req)
 {
 	int rc;
@@ -1453,8 +1445,9 @@ static bool ptlrpc_server_normal_pending(struct ptlrpc_service_part *svcpt,
  * \see ptlrpc_server_allow_normal
  * \see ptlrpc_server_allow high
  */
-static inline bool
-ptlrpc_server_request_pending(struct ptlrpc_service_part *svcpt, bool force)
+static inline
+bool ptlrpc_server_request_pending(struct ptlrpc_service_part *svcpt,
+				   bool force)
 {
 	return ptlrpc_server_high_pending(svcpt, force) ||
 	       ptlrpc_server_normal_pending(svcpt, force);
@@ -1510,9 +1503,8 @@ static bool ptlrpc_server_normal_pending(struct ptlrpc_service_part *svcpt,
  * All incoming requests pass through here before getting into
  * ptlrpc_server_handle_req later on.
  */
-static int
-ptlrpc_server_handle_req_in(struct ptlrpc_service_part *svcpt,
-			    struct ptlrpc_thread *thread)
+static int ptlrpc_server_handle_req_in(struct ptlrpc_service_part *svcpt,
+				       struct ptlrpc_thread *thread)
 {
 	struct ptlrpc_service *svc = svcpt->scp_service;
 	struct ptlrpc_request *req;
@@ -1668,9 +1660,8 @@ static bool ptlrpc_server_normal_pending(struct ptlrpc_service_part *svcpt,
  * Main incoming request handling logic.
  * Calls handler function from service to do actual processing.
  */
-static int
-ptlrpc_server_handle_request(struct ptlrpc_service_part *svcpt,
-			     struct ptlrpc_thread *thread)
+static int ptlrpc_server_handle_request(struct ptlrpc_service_part *svcpt,
+					struct ptlrpc_thread *thread)
 {
 	struct ptlrpc_service *svc = svcpt->scp_service;
 	struct ptlrpc_request *request;
@@ -1817,8 +1808,7 @@ static bool ptlrpc_server_normal_pending(struct ptlrpc_service_part *svcpt,
 /**
  * An internal function to process a single reply state object.
  */
-static int
-ptlrpc_handle_rs(struct ptlrpc_reply_state *rs)
+static int ptlrpc_handle_rs(struct ptlrpc_reply_state *rs)
 {
 	struct ptlrpc_service_part *svcpt = rs->rs_svcpt;
 	struct ptlrpc_service *svc = svcpt->scp_service;
@@ -1918,8 +1908,7 @@ static bool ptlrpc_server_normal_pending(struct ptlrpc_service_part *svcpt,
 	return 1;
 }
 
-static void
-ptlrpc_check_rqbd_pool(struct ptlrpc_service_part *svcpt)
+static void ptlrpc_check_rqbd_pool(struct ptlrpc_service_part *svcpt)
 {
 	int avail = svcpt->scp_nrqbds_posted;
 	int low_water = test_req_buffer_pressure ? 0 :
@@ -1942,8 +1931,7 @@ static bool ptlrpc_server_normal_pending(struct ptlrpc_service_part *svcpt,
 	}
 }
 
-static inline int
-ptlrpc_threads_enough(struct ptlrpc_service_part *svcpt)
+static inline int ptlrpc_threads_enough(struct ptlrpc_service_part *svcpt)
 {
 	return svcpt->scp_nreqs_active <
 	       svcpt->scp_nthrs_running - 1 -
@@ -1955,8 +1943,7 @@ static bool ptlrpc_server_normal_pending(struct ptlrpc_service_part *svcpt,
  * user can call it w/o any lock but need to hold
  * ptlrpc_service_part::scp_lock to get reliable result
  */
-static inline int
-ptlrpc_threads_increasable(struct ptlrpc_service_part *svcpt)
+static inline int ptlrpc_threads_increasable(struct ptlrpc_service_part *svcpt)
 {
 	return svcpt->scp_nthrs_running +
 	       svcpt->scp_nthrs_starting <
@@ -1966,22 +1953,47 @@ static bool ptlrpc_server_normal_pending(struct ptlrpc_service_part *svcpt,
 /**
  * too many requests and allowed to create more threads
  */
-static inline int
-ptlrpc_threads_need_create(struct ptlrpc_service_part *svcpt)
+static inline int ptlrpc_threads_need_create(struct ptlrpc_service_part *svcpt)
 {
 	return !ptlrpc_threads_enough(svcpt) &&
 		ptlrpc_threads_increasable(svcpt);
 }
 
-static inline int
-ptlrpc_thread_stopping(struct ptlrpc_thread *thread)
+static inline int ptlrpc_thread_stopping(struct ptlrpc_thread *thread)
 {
 	return thread_is_stopping(thread) ||
 	       thread->t_svcpt->scp_service->srv_is_stopping;
 }
 
-static inline int
-ptlrpc_rqbd_pending(struct ptlrpc_service_part *svcpt)
+/* stop the highest numbered thread if there are too many threads running */
+static inline bool ptlrpc_thread_should_stop(struct ptlrpc_thread *thread)
+{
+	struct ptlrpc_service_part *svcpt = thread->t_svcpt;
+
+	return thread->t_id >= svcpt->scp_service->srv_nthrs_cpt_limit &&
+		thread->t_id == svcpt->scp_thr_nextid - 1;
+}
+
+static void ptlrpc_stop_thread(struct ptlrpc_thread *thread)
+{
+	CDEBUG(D_INFO, "Stopping thread %s #%u\n",
+	       thread->t_svcpt->scp_service->srv_thread_name, thread->t_id);
+	thread_add_flags(thread, SVC_STOPPING);
+}
+
+static inline void ptlrpc_thread_stop(struct ptlrpc_thread *thread)
+{
+	struct ptlrpc_service_part *svcpt = thread->t_svcpt;
+
+	spin_lock(&svcpt->scp_lock);
+	if (ptlrpc_thread_should_stop(thread)) {
+		ptlrpc_stop_thread(thread);
+		svcpt->scp_thr_nextid--;
+	}
+	spin_unlock(&svcpt->scp_lock);
+}
+
+static inline int ptlrpc_rqbd_pending(struct ptlrpc_service_part *svcpt)
 {
 	return !list_empty(&svcpt->scp_rqbd_idle) &&
 	       svcpt->scp_rqbd_timeout == 0;
@@ -2250,14 +2262,19 @@ static int ptlrpc_main(void *arg)
 			CDEBUG(D_RPCTRACE, "Posted buffers: %d\n",
 			       svcpt->scp_nrqbds_posted);
 		}
+
+		/* If the number of threads has been tuned downward and this
+		 * thread should be stopped, then stop in reverse order so the
+		 * the threads always have contiguous thread index values.
+		 */
+		if (unlikely(ptlrpc_thread_should_stop(thread)))
+			ptlrpc_thread_stop(thread);
 	}
 
 	ptlrpc_watchdog_disable(&thread->t_watchdog);
 
 out_srv_fini:
-	/*
-	 * deconstruct service specific state created by ptlrpc_start_thread()
-	 */
+	/* deconstruct service thread state created by ptlrpc_start_thread() */
 	if (svc->srv_ops.so_thr_done)
 		svc->srv_ops.so_thr_done(thread);
 
@@ -2266,8 +2283,8 @@ static int ptlrpc_main(void *arg)
 		kfree(env);
 	}
 out:
-	CDEBUG(D_RPCTRACE, "service thread [ %p : %u ] %d exiting: rc %d\n",
-	       thread, thread->t_pid, thread->t_id, rc);
+	CDEBUG(D_RPCTRACE, "%s: service thread [%p:%u] %d exiting: rc = %d\n",
+	       thread->t_name, thread, thread->t_pid, thread->t_id, rc);
 
 	spin_lock(&svcpt->scp_lock);
 	if (thread_test_and_clear_flags(thread, SVC_STARTING))
@@ -2416,11 +2433,8 @@ static void ptlrpc_svcpt_stop_threads(struct ptlrpc_service_part *svcpt)
 
 	spin_lock(&svcpt->scp_lock);
 	/* let the thread know that we would like it to stop asap */
-	list_for_each_entry(thread, &svcpt->scp_threads, t_link) {
-		CDEBUG(D_INFO, "Stopping thread %s #%u\n",
-		       svcpt->scp_service->srv_thread_name, thread->t_id);
-		thread_add_flags(thread, SVC_STOPPING);
-	}
+	list_for_each_entry(thread, &svcpt->scp_threads, t_link)
+		ptlrpc_stop_thread(thread);
 
 	wake_up_all(&svcpt->scp_waitq);
 
-- 
1.8.3.1

* [lustre-devel] [PATCH 261/622] lnet: Avoid lnet debugfs read/write if ctl_table does not exist
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (259 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 260/622] lustre: ptlrpc: allow stopping threads above threads_max James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 262/622] lnet: lnd: bring back concurrent_sends James Simmons
                   ` (361 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Sonia Sharma <sharmaso@whamcloud.com>

Running the command "lctl get_param -n stats" after lnet
is taken down leads to a kernel panic because it
tries to read from a file which doesn't exist
anymore.

In lnet_debugfs_read() and lnet_debugfs_write(),
check if struct ctl_table is valid before trying
to read/write to it.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11986
Lustre-commit: 54ca5e471d9f ("LU-11986 lnet: Avoid lnet debugfs read/write if ctl_table does not exist")
Signed-off-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34634
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/libcfs/module.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/lnet/libcfs/module.c b/net/lnet/libcfs/module.c
index bee2581..37a3fee 100644
--- a/net/lnet/libcfs/module.c
+++ b/net/lnet/libcfs/module.c
@@ -597,9 +597,11 @@ static ssize_t lnet_debugfs_read(struct file *filp, char __user *buf,
 {
 	struct ctl_table *table = filp->private_data;
 	loff_t old_pos = *ppos;
-	ssize_t rc;
+	ssize_t rc = -EINVAL;
 
-	rc = table->proc_handler(table, 0, (void __user *)buf, &count, ppos);
+	if (table)
+		rc = table->proc_handler(table, 0, (void __user *)buf,
+					 &count, ppos);
 	/*
 	 * On success, the length read is either in error or in count.
 	 * If ppos changed, then use count, else use error
@@ -617,9 +619,11 @@ static ssize_t lnet_debugfs_write(struct file *filp, const char __user *buf,
 {
 	struct ctl_table *table = filp->private_data;
 	loff_t old_pos = *ppos;
-	ssize_t rc;
+	ssize_t rc = -EINVAL;
 
-	rc = table->proc_handler(table, 1, (void __user *)buf, &count, ppos);
+	if (table)
+		rc = table->proc_handler(table, 1, (void __user *)buf, &count,
+					 ppos);
 	if (rc)
 		return rc;
 
-- 
1.8.3.1

* [lustre-devel] [PATCH 262/622] lnet: lnd: bring back concurrent_sends
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (260 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 261/622] lnet: Avoid lnet debugfs read/write if ctl_table does not exist James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 263/622] lnet: properly cleanup lnet debugfs files James Simmons
                   ` (360 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Revert "lustre: lnd: remove concurrent_sends tunable"

This reverts commit 0d4b38f73774f8363d6c419b16d3b34d23ad1ca9.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11931
Lustre-commit: 83e45ead69ba ("LU-11931 lnd: bring back concurrent_sends")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34646
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/o2iblnd/o2iblnd.h           | 24 +++++++++++++++++++++++-
 net/lnet/klnds/o2iblnd/o2iblnd_cb.c        |  5 +++--
 net/lnet/klnds/o2iblnd/o2iblnd_modparams.c | 30 +++++++++++++++++++++++++++---
 3 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.h b/net/lnet/klnds/o2iblnd/o2iblnd.h
index 44f1d84..baf1006 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.h
@@ -136,7 +136,9 @@ struct kib_tunables {
 /* WRs and CQEs (per connection) */
 #define IBLND_RECV_WRS(c)	IBLND_RX_MSGS(c)
 
-#define IBLND_CQ_ENTRIES(c)	(IBLND_RECV_WRS(c) + kiblnd_send_wrs(c))
+#define IBLND_CQ_ENTRIES(c)	\
+	(IBLND_RECV_WRS(c) + 2 * kiblnd_concurrent_sends(c->ibc_version, \
+							 c->ibc_peer->ibp_ni))
 
 struct kib_hca_dev;
 
@@ -635,6 +637,26 @@ struct kib_peer_ni {
 
 int kiblnd_msg_queue_size(int version, struct lnet_ni *ni);
 
+static inline int
+kiblnd_concurrent_sends(int version, struct lnet_ni *ni)
+{
+	struct lnet_ioctl_config_o2iblnd_tunables *tunables;
+	int concurrent_sends;
+
+	tunables = &ni->ni_lnd_tunables.lnd_tun_u.lnd_o2ib;
+	concurrent_sends = tunables->lnd_concurrent_sends;
+
+	if (version == IBLND_MSG_VERSION_1) {
+		if (concurrent_sends > IBLND_MSG_QUEUE_SIZE_V1 * 2)
+			return IBLND_MSG_QUEUE_SIZE_V1 * 2;
+
+		if (concurrent_sends < IBLND_MSG_QUEUE_SIZE_V1 / 2)
+			return IBLND_MSG_QUEUE_SIZE_V1 / 2;
+	}
+
+	return concurrent_sends;
+}
+
 static inline void
 kiblnd_hdev_addref_locked(struct kib_hca_dev *hdev)
 {
diff --git a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
index 68ab7d5..fa5c93a 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -806,6 +806,7 @@ static int kiblnd_map_tx(struct lnet_ni *ni, struct kib_tx *tx,
 {
 	struct kib_msg *msg = tx->tx_msg;
 	struct kib_peer_ni *peer_ni = conn->ibc_peer;
+	struct lnet_ni *ni = peer_ni->ibp_ni;
 	int ver = conn->ibc_version;
 	int rc;
 	int done;
@@ -821,7 +822,7 @@ static int kiblnd_map_tx(struct lnet_ni *ni, struct kib_tx *tx,
 	LASSERT(conn->ibc_credits >= 0);
 	LASSERT(conn->ibc_credits <= conn->ibc_queue_depth);
 
-	if (conn->ibc_nsends_posted == conn->ibc_queue_depth) {
+	if (conn->ibc_nsends_posted == kiblnd_concurrent_sends(ver, ni)) {
 		/* tx completions outstanding... */
 		CDEBUG(D_NET, "%s: posted enough\n",
 		       libcfs_nid2str(peer_ni->ibp_nid));
@@ -976,7 +977,7 @@ static int kiblnd_map_tx(struct lnet_ni *ni, struct kib_tx *tx,
 		return;
 	}
 
-	LASSERT(conn->ibc_nsends_posted <= conn->ibc_queue_depth);
+	LASSERT(conn->ibc_nsends_posted <= kiblnd_concurrent_sends(ver, ni));
 	LASSERT(!IBLND_OOB_CAPABLE(ver) ||
 		conn->ibc_noops_posted <= IBLND_OOB_MSGS(ver));
 	LASSERT(conn->ibc_reserved_credits >= 0);
diff --git a/net/lnet/klnds/o2iblnd/o2iblnd_modparams.c b/net/lnet/klnds/o2iblnd/o2iblnd_modparams.c
index b5df7fe..c9e14ec 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd_modparams.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd_modparams.c
@@ -109,7 +109,7 @@
 
 static int concurrent_sends;
 module_param(concurrent_sends, int, 0444);
-MODULE_PARM_DESC(concurrent_sends, "send work-queue sizing (obsolete)");
+MODULE_PARM_DESC(concurrent_sends, "send work-queue sizing");
 
 static bool use_fastreg_gaps;
 module_param(use_fastreg_gaps, bool, 0444);
@@ -272,10 +272,33 @@ int kiblnd_tunables_setup(struct lnet_ni *ni)
 		tunables->lnd_peercredits_hiw = peer_credits_hiw;
 
 	if (tunables->lnd_peercredits_hiw < net_tunables->lct_peer_tx_credits / 2)
-		tunables->lnd_peercredits_hiw = net_tunables->lct_peer_tx_credits / 2;
+		tunables->lnd_peercredits_hiw =
+			net_tunables->lct_peer_tx_credits / 2;
 
 	if (tunables->lnd_peercredits_hiw >= net_tunables->lct_peer_tx_credits)
-		tunables->lnd_peercredits_hiw = net_tunables->lct_peer_tx_credits - 1;
+		tunables->lnd_peercredits_hiw =
+			net_tunables->lct_peer_tx_credits - 1;
+
+	if (tunables->lnd_concurrent_sends == 0)
+		tunables->lnd_concurrent_sends =
+			net_tunables->lct_peer_tx_credits;
+
+	if (tunables->lnd_concurrent_sends >
+	    net_tunables->lct_peer_tx_credits * 2)
+		tunables->lnd_concurrent_sends =
+			net_tunables->lct_peer_tx_credits * 2;
+
+	if (tunables->lnd_concurrent_sends <
+	    net_tunables->lct_peer_tx_credits / 2)
+		tunables->lnd_concurrent_sends =
+			net_tunables->lct_peer_tx_credits / 2;
+
+	if (tunables->lnd_concurrent_sends <
+	    net_tunables->lct_peer_tx_credits) {
+		CWARN("Concurrent sends %d is lower than message queue size: %d, performance may drop slightly.\n",
+		      tunables->lnd_concurrent_sends,
+		      net_tunables->lct_peer_tx_credits);
+	}
 
 	if (!tunables->lnd_fmr_pool_size)
 		tunables->lnd_fmr_pool_size = fmr_pool_size;
@@ -298,6 +321,7 @@ void kiblnd_tunables_init(void)
 	default_tunables.lnd_version = 0;
 	default_tunables.lnd_peercredits_hiw = peer_credits_hiw;
 	default_tunables.lnd_map_on_demand = map_on_demand;
+	default_tunables.lnd_concurrent_sends = concurrent_sends;
 	default_tunables.lnd_fmr_pool_size = fmr_pool_size;
 	default_tunables.lnd_fmr_flush_trigger = fmr_flush_trigger;
 	default_tunables.lnd_fmr_cache = fmr_cache;
-- 
1.8.3.1

* [lustre-devel] [PATCH 263/622] lnet: properly cleanup lnet debugfs files
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (261 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 262/622] lnet: lnd: bring back concurrent_sends James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 264/622] lustre: mdc: reset lmm->lmm_stripe_offset in mdc_save_lovea James Simmons
                   ` (359 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

The function lnet_router_debugfs_remove() is supposed to clean up
the lnet-specific debugfs files, but that is not happening at all.
Change lnet_remove_debugfs() from doing the final lnet and libcfs
debugfs cleanup to removing specific debugfs files. We can instead
make libcfs module unloading directly finish the entire libcfs and
debugfs tree removal. With this change we can make
lnet_router_debugfs_fini() call lnet_remove_debugfs().

WC-bug-id: https://jira.whamcloud.com/browse/LU-11986
Lustre-commit: 8cb7ccf54e2d ("LU-11986 lnet: properly cleanup lnet debugfs files")
Signed-off-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-on: https://review.whamcloud.com/34669
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/libcfs/libcfs.h |  1 +
 net/lnet/libcfs/module.c      | 16 ++++++++++++----
 net/lnet/lnet/router_proc.c   |  1 +
 3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/include/linux/libcfs/libcfs.h b/include/linux/libcfs/libcfs.h
index 33f7477..d3a9754 100644
--- a/include/linux/libcfs/libcfs.h
+++ b/include/linux/libcfs/libcfs.h
@@ -57,6 +57,7 @@ static inline int notifier_from_ioctl_errno(int err)
 extern struct workqueue_struct *cfs_rehash_wq;
 
 void lnet_insert_debugfs(struct ctl_table *table);
+void lnet_remove_debugfs(struct ctl_table *table);
 
 /*
  * Memory
diff --git a/net/lnet/libcfs/module.c b/net/lnet/libcfs/module.c
index 37a3fee..2e803d6 100644
--- a/net/lnet/libcfs/module.c
+++ b/net/lnet/libcfs/module.c
@@ -691,12 +691,18 @@ static void lnet_insert_debugfs_links(
 				       symlinks->target);
 }
 
-static void lnet_remove_debugfs(void)
+void lnet_remove_debugfs(struct ctl_table *table)
 {
-	debugfs_remove_recursive(lnet_debugfs_root);
+	for (; table && table->procname; table++) {
+		struct qstr dname = QSTR_INIT(table->procname,
+					      strlen(table->procname));
+		struct dentry *dentry;
 
-	lnet_debugfs_root = NULL;
+		dentry = d_hash_and_lookup(lnet_debugfs_root, &dname);
+		debugfs_remove(dentry);
+	}
 }
+EXPORT_SYMBOL_GPL(lnet_remove_debugfs);
 
 static DEFINE_MUTEX(libcfs_startup);
 static int libcfs_active;
@@ -771,7 +777,9 @@ static void libcfs_exit(void)
 {
 	int rc;
 
-	lnet_remove_debugfs();
+	/* Remove everthing */
+	debugfs_remove_recursive(lnet_debugfs_root);
+	lnet_debugfs_root = NULL;
 
 	if (cfs_rehash_wq)
 		destroy_workqueue(cfs_rehash_wq);
diff --git a/net/lnet/lnet/router_proc.c b/net/lnet/lnet/router_proc.c
index 45abcfb..8517411 100644
--- a/net/lnet/lnet/router_proc.c
+++ b/net/lnet/lnet/router_proc.c
@@ -936,4 +936,5 @@ void lnet_router_debugfs_init(void)
 
 void lnet_router_debugfs_fini(void)
 {
+	lnet_remove_debugfs(lnet_table);
 }
-- 
1.8.3.1

* [lustre-devel] [PATCH 264/622] lustre: mdc: reset lmm->lmm_stripe_offset in mdc_save_lovea
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (262 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 263/622] lnet: properly cleanup lnet debugfs files James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 265/622] lnet: Cleanup lnet_get_rtr_pool_cfg James Simmons
                   ` (358 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Alexey Lyashkov <c17817@cray.com>

In order to prepare for replay, lmm->lmm_stripe_offset (which
contains the layout generation) has to be set to -1
(LOV_OFFSET_DEFAULT) so as not to confuse lod_verify_v1v3.

This fixes the patch ("LU-169 lov: add generation number to LOV EA"),
which was part of the original Lustre merge into the Linux kernel.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12040
Lustre-commit: c872afa36ff5 ("LU-12040 mdc: reset lmm->lmm_stripe_offset in mdc_save_lovea")
Signed-off-by: Vladimir Saveliev <c17830@cray.com>
Signed-off-by: Alexey Lyashkov <c17817@cray.com>
Cray-bug-id: LUS-7008
Reviewed-on: https://review.whamcloud.com/34371
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/mdc/mdc_lib.c                 | 3 ++-
 fs/lustre/mdc/mdc_locks.c               | 8 ++++++--
 include/uapi/linux/lustre/lustre_user.h | 1 +
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/mdc/mdc_lib.c b/fs/lustre/mdc/mdc_lib.c
index 980676a..f0e5a84 100644
--- a/fs/lustre/mdc/mdc_lib.c
+++ b/fs/lustre/mdc/mdc_lib.c
@@ -406,7 +406,8 @@ void mdc_setattr_pack(struct ptlrpc_request *req, struct md_op_data *op_data,
 		lum->lmm_magic = cpu_to_le32(LOV_USER_MAGIC_V1);
 		lum->lmm_stripe_size = 0;
 		lum->lmm_stripe_count = 0;
-		lum->lmm_stripe_offset = (typeof(lum->lmm_stripe_offset))(-1);
+		lum->lmm_stripe_offset =
+		  (typeof(lum->lmm_stripe_offset))LOV_OFFSET_DEFAULT;
 	} else {
 		memcpy(lum, ea, ealen);
 	}
diff --git a/fs/lustre/mdc/mdc_locks.c b/fs/lustre/mdc/mdc_locks.c
index 05447ea..019eb35 100644
--- a/fs/lustre/mdc/mdc_locks.c
+++ b/fs/lustre/mdc/mdc_locks.c
@@ -220,8 +220,8 @@ static int mdc_save_lovea(struct ptlrpc_request *req,
 			  void *data, u32 size)
 {
 	struct req_capsule *pill = &req->rq_pill;
+	struct lov_user_md *lmm;
 	int rc = 0;
-	void *lmm;
 
 	if (req_capsule_get_size(pill, field, RCL_CLIENT) < size) {
 		rc = sptlrpc_cli_enlarge_reqbuf(req, field, size);
@@ -237,8 +237,12 @@ static int mdc_save_lovea(struct ptlrpc_request *req,
 
 	req_capsule_set_size(pill, field, RCL_CLIENT, size);
 	lmm = req_capsule_client_get(pill, field);
-	if (lmm)
+	if (lmm) {
 		memcpy(lmm, data, size);
+		/* overwrite layout generation returned from the MDS */
+		lmm->lmm_stripe_offset =
+		  (typeof(lmm->lmm_stripe_offset))LOV_OFFSET_DEFAULT;
+	}
 
 	return rc;
 }
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 1d402f1..3901eb2 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -404,6 +404,7 @@ struct ll_ioc_lease_id {
 
 #define LOV_MAXPOOLNAME 15
 #define LOV_POOLNAMEF "%.15s"
+#define LOV_OFFSET_DEFAULT      ((__u16)-1)
 
 #define LOV_MIN_STRIPE_BITS	16	/* maximum PAGE_SIZE (ia64), power of 2 */
 #define LOV_MIN_STRIPE_SIZE	(1 << LOV_MIN_STRIPE_BITS)
-- 
1.8.3.1

* [lustre-devel] [PATCH 265/622] lnet: Cleanup lnet_get_rtr_pool_cfg
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (263 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 264/622] lustre: mdc: reset lmm->lmm_stripe_offset in mdc_save_lovea James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 266/622] lustre: quota: make overquota flag for old req James Simmons
                   ` (357 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

The cfs_percpt_for_each loop contains an off-by-one error that causes
memory corruption. In addition, the way these loops are nested results
in unnecessary iterations. We only need to iterate through the cpts
until we match the cpt number passed as an argument. At that point we
want to copy the router buffer pools for that cpt.
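
The off-by-one can be reproduced outside LNet. The sketch below is a standalone model of the two loop shapes; NRBPOOLS, NCPTS and the array contents are invented for illustration, not taken from the LNet sources:

```c
#define NRBPOOLS 3          /* stands in for LNET_NRBPOOLS */
#define NCPTS    4          /* assumed number of CPTs for the sketch */

/* Buggy shape of the old lnet_get_rtr_pool_cfg(): the match test
 * post-increments the outer counter, so the pool copy below it indexes
 * pools[i] with i already equal to idx + 1 -- one past the end when
 * idx == NRBPOOLS - 1.  Returns the index actually used. */
static int buggy_pool_index(int idx)
{
	int i = 0, j;

	for (j = 0; j < NCPTS; j++) {
		if (i++ != idx)
			continue;
		return i;	/* used as pools[i]: off by one */
	}
	return -1;
}

/* Fixed shape: walk the per-CPT pools once, match on the cpt number,
 * then copy all NRBPOOLS pools with a separate, in-bounds counter. */
static int fixed_pool_copy(int cpt, int dst[NRBPOOLS], int src[][NRBPOOLS])
{
	int i, j;

	for (i = 0; i < NCPTS; i++) {
		if (i != cpt)
			continue;
		for (j = 0; j < NRBPOOLS; j++)
			dst[j] = src[i][j];
		return 0;
	}
	return -1;
}
```

The single-loop form also removes the needless nesting: once the cpt matches, all of its pools are copied in one pass.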

Cray-bug-id: LUS-7240
WC-bug-id: https://jira.whamcloud.com/browse/LU-12152
Lustre-commit: 187117fd94e4 ("LU-12152 lnet: Cleanup lnet_get_rtr_pool_cfg")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/34591
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/router.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index 66a116c..78a8659 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -549,29 +549,30 @@ static void lnet_shuffle_seed(void)
 	lnet_del_route(LNET_NIDNET(LNET_NID_ANY), LNET_NID_ANY);
 }
 
-int lnet_get_rtr_pool_cfg(int idx, struct lnet_ioctl_pool_cfg *pool_cfg)
+int lnet_get_rtr_pool_cfg(int cpt, struct lnet_ioctl_pool_cfg *pool_cfg)
 {
+	struct lnet_rtrbufpool *rbp;
 	int i, rc = -ENOENT, j;
 
 	if (!the_lnet.ln_rtrpools)
 		return rc;
 
-	for (i = 0; i < LNET_NRBPOOLS; i++) {
-		struct lnet_rtrbufpool *rbp;
 
-		lnet_net_lock(LNET_LOCK_EX);
-		cfs_percpt_for_each(rbp, j, the_lnet.ln_rtrpools) {
-			if (i++ != idx)
-				continue;
+	cfs_percpt_for_each(rbp, i, the_lnet.ln_rtrpools) {
+		if (i != cpt)
+			continue;
 
-			pool_cfg->pl_pools[i].pl_npages = rbp[i].rbp_npages;
-			pool_cfg->pl_pools[i].pl_nbuffers = rbp[i].rbp_nbuffers;
-			pool_cfg->pl_pools[i].pl_credits = rbp[i].rbp_credits;
-			pool_cfg->pl_pools[i].pl_mincredits = rbp[i].rbp_mincredits;
-			rc = 0;
-			break;
+		lnet_net_lock(i);
+		for (j = 0; j < LNET_NRBPOOLS; j++) {
+			pool_cfg->pl_pools[j].pl_npages = rbp[j].rbp_npages;
+			pool_cfg->pl_pools[j].pl_nbuffers = rbp[j].rbp_nbuffers;
+			pool_cfg->pl_pools[j].pl_credits = rbp[j].rbp_credits;
+			pool_cfg->pl_pools[j].pl_mincredits =
+				rbp[j].rbp_mincredits;
 		}
-		lnet_net_unlock(LNET_LOCK_EX);
+		lnet_net_unlock(i);
+		rc = 0;
+		break;
 	}
 
 	lnet_net_lock(LNET_LOCK_EX);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread


* [lustre-devel] [PATCH 266/622] lustre: quota: make overquota flag for old req
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (264 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 265/622] lnet: Cleanup lnet_get_rtr_pool_cfg James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 267/622] lustre: osd: Set max ea size to XATTR_SIZE_MAX James Simmons
                   ` (356 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Hongchao Zhang <hongchao@whamcloud.com>

For an old request that carries the over-quota flag, the flag should
still be marked at the OSC, because the old request could be processed
after the new request at the OST; marking it keeps quota enforcement
at the OST intact.
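
A minimal model of the patched check in osc_quota_setdq(): the flag values below are illustrative placeholders, not the real enum obdo_flags bit values:

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bits standing in for enum obdo_flags. */
#define OBD_FL_NO_USRQUOTA	0x1
#define OBD_FL_NO_GRPQUOTA	0x2
#define OBD_FL_NO_PRJQUOTA	0x4
#define OBD_FL_NO_QUOTA_ALL	(OBD_FL_NO_USRQUOTA | OBD_FL_NO_GRPQUOTA | \
				 OBD_FL_NO_PRJQUOTA)

/* Decision sketch: an old reply (xid <= last_xid) is skipped only when
 * it carries no over-quota flag; a reply that reports over-quota is
 * always honoured, because the OST may have processed it after a newer
 * request. */
static bool should_process(uint64_t last_xid, uint64_t xid, uint32_t flags)
{
	if (last_xid > xid && !(flags & OBD_FL_NO_QUOTA_ALL))
		return false;
	return true;
}
```

The cost of honouring a stale over-quota reply is only that the next request goes out synchronously, which is safe.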

WC-bug-id: https://jira.whamcloud.com/browse/LU-11678
Lustre-commit: c59cf862c3c0 ("LU-11678 quota: make overquota flag for old req")
Signed-off-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34645
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Shilong Wang <wshilong@ddn.com>
Reviewed-by: Gu Zheng <gzheng@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/osc_quota.c              | 11 +++++++++--
 include/uapi/linux/lustre/lustre_idl.h |  3 +++
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/osc/osc_quota.c b/fs/lustre/osc/osc_quota.c
index 316e087..8ff803c 100644
--- a/fs/lustre/osc/osc_quota.c
+++ b/fs/lustre/osc/osc_quota.c
@@ -119,10 +119,17 @@ int osc_quota_setdq(struct client_obd *cli, u64 xid, const unsigned int qid[],
 		return 0;
 
 	mutex_lock(&cli->cl_quota_mutex);
-	if (cli->cl_quota_last_xid > xid)
+	/* still mark the quota as running out for the old request, because it
+	 * could be processed after the new request at OST, the side effect is
+	 * the following request will be processed synchronously, but it will
+	 * not break the quota enforcement.
+	 */
+	if (cli->cl_quota_last_xid > xid && !(flags & OBD_FL_NO_QUOTA_ALL))
 		goto out_unlock;
 
-	cli->cl_quota_last_xid = xid;
+	if (cli->cl_quota_last_xid < xid)
+		cli->cl_quota_last_xid = xid;
+
 	for (type = 0; type < MAXQUOTAS; type++) {
 		struct osc_quota_info *oqi;
 
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 1b4b018..3a2a093 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -998,6 +998,9 @@ enum obdo_flags {
 				   OBD_FL_CKSUM_T10IP4K |
 				   OBD_FL_CKSUM_T10CRC512 |
 				   OBD_FL_CKSUM_T10CRC4K),
+
+	OBD_FL_NO_QUOTA_ALL = OBD_FL_NO_USRQUOTA | OBD_FL_NO_GRPQUOTA |
+			      OBD_FL_NO_PRJQUOTA,
 };
 
 /*
-- 
1.8.3.1


* [lustre-devel] [PATCH 267/622] lustre: osd: Set max ea size to XATTR_SIZE_MAX
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (265 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 266/622] lustre: quota: make overquota flag for old req James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 268/622] lustre: lov: Remove unnecessary assert James Simmons
                   ` (355 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

Lustre currently limits EA size to either ~1 MiB (ldiskfs)
or 32K (ZFS).  VFS has its own limit, XATTR_SIZE_MAX,
which we must respect to interoperate correctly with
userspace tools like tar, getattr, and the getxattr()
syscall.

Set this as the new max EA size for both ldiskfs and ZFS.

(The current 32K on ZFS is too small for
LOV_MAX_STRIPE_COUNT [2000] files, so needs to be raised
regardless.)

In order to use this correctly, we have to use the real ea
size on the client.  The previous code for maximum ea size
on the client (KEY_MAX_EASIZE, llite.max_easize) used a
calculated value based on number of targets.

With one exception, the mdc code already uses the default
ea size rather than the max.  Default ea size adjusts
automatically to the largest size sent by the server.

The exception is the open code, which uses the max so it
never has to resend a layout request.  This patch changes
it to use default, which means that the first time a very
widely striped file is opened, the open will be resent.

Add limit checks on client & server so the xattr size limit
is honored.
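
A back-of-envelope size model shows why the 32K ZFS limit is too small for maximally striped files. The header and per-stripe entry sizes below are assumptions for illustration (a v3 LOV EA header plus one OST entry per stripe), not values read from the Lustre headers:

```c
#include <stddef.h>

#define LOV_MDS_MD_V3_HDR	48	/* assumed header size, incl. pool name */
#define LOV_OST_DATA_SIZE	24	/* assumed per-stripe entry size */
#define LOV_MAX_STRIPE_COUNT	2000
#define ZFS_OLD_EA_LIMIT	(32 * 1024)

/* Rough wire size of a v1/v3 LOV EA for a plain striped file. */
static size_t lov_ea_size(unsigned int stripes)
{
	return LOV_MDS_MD_V3_HDR + (size_t)stripes * LOV_OST_DATA_SIZE;
}
```

Under these assumptions a 2000-stripe layout needs roughly 48K of EA space: over the old 32K ZFS limit, but still within the 64K VFS XATTR_SIZE_MAX that this patch adopts as the common ceiling.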

WC-bug-id: https://jira.whamcloud.com/browse/LU-11868
Lustre-commit: 3ec712bd183a ("LU-11868 osd: Set max ea size to XATTR_SIZE_MAX")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34058
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd.h     |  7 +++++++
 fs/lustre/llite/llite_lib.c |  4 ++++
 fs/lustre/lov/lov_obd.c     |  5 +----
 fs/lustre/mdc/mdc_locks.c   | 12 ++++++------
 4 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index 2195f85..687b54b 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -154,6 +154,13 @@ enum obd_cl_sem_lock_class {
  */
 #define OBD_MAX_DEFAULT_EA_SIZE		4096
 
+/*
+ * Lustre can handle larger xattrs internally, but we must respect the Linux
+ * VFS limitation or tools like tar cannot interact with Lustre volumes
+ * correctly.
+ */
+#define OBD_MAX_EA_SIZE		XATTR_SIZE_MAX
+
 struct mdc_rpc_lock;
 struct obd_import;
 struct client_obd {
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 347bdd6..aadde3f 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -663,12 +663,16 @@ int ll_get_max_mdsize(struct ll_sb_info *sbi, int *lmmsize)
 		return rc;
 	}
 
+	CDEBUG(D_INFO, "max LOV ea size: %d\n", *lmmsize);
+
 	size = sizeof(int);
 	rc = obd_get_info(NULL, sbi->ll_md_exp, sizeof(KEY_MAX_EASIZE),
 			  KEY_MAX_EASIZE, &size, lmmsize);
 	if (rc)
 		CERROR("Get max mdsize error rc %d\n", rc);
 
+	CDEBUG(D_INFO, "max LMV ea size: %d\n", *lmmsize);
+
 	return rc;
 }
 
diff --git a/fs/lustre/lov/lov_obd.c b/fs/lustre/lov/lov_obd.c
index 240cc6f9..3a90e7e 100644
--- a/fs/lustre/lov/lov_obd.c
+++ b/fs/lustre/lov/lov_obd.c
@@ -1162,10 +1162,7 @@ static int lov_get_info(const struct lu_env *env, struct obd_export *exp,
 	lov_tgts_getref(obddev);
 
 	if (KEY_IS(KEY_MAX_EASIZE)) {
-		u32 max_stripe_count = min_t(u32, ld->ld_active_tgt_count,
-					     LOV_MAX_STRIPE_COUNT);
-
-		*((u32 *)val) = lov_mds_md_size(max_stripe_count, LOV_MAGIC_V3);
+		*((u32 *)val) = exp->exp_connect_data.ocd_max_easize;
 	} else if (KEY_IS(KEY_DEFAULT_EASIZE)) {
 		u32 def_stripe_count = min_t(u32, ld->ld_default_stripe_count,
 					     LOV_MAX_STRIPE_COUNT);
diff --git a/fs/lustre/mdc/mdc_locks.c b/fs/lustre/mdc/mdc_locks.c
index 019eb35..f6273ef 100644
--- a/fs/lustre/mdc/mdc_locks.c
+++ b/fs/lustre/mdc/mdc_locks.c
@@ -256,12 +256,15 @@ static int mdc_save_lovea(struct ptlrpc_request *req,
 	struct ldlm_intent *lit;
 	const void *lmm = op_data->op_data;
 	u32 lmmsize = op_data->op_data_size;
+	u32 mdt_md_capsule_size;
 	LIST_HEAD(cancels);
 	int count = 0;
 	enum ldlm_mode mode;
 	int rc;
 	int repsize, repsize_estimate;
 
+	mdt_md_capsule_size = obddev->u.cli.cl_default_mds_easize;
+
 	it->it_create_mode = (it->it_create_mode & ~S_IFMT) | S_IFREG;
 
 	/* XXX: openlock is not cancelled for cross-refs. */
@@ -348,7 +351,7 @@ static int mdc_save_lovea(struct ptlrpc_request *req,
 		      lmmsize);
 
 	req_capsule_set_size(&req->rq_pill, &RMF_MDT_MD, RCL_SERVER,
-			     obddev->u.cli.cl_max_mds_easize);
+			     mdt_md_capsule_size);
 	req_capsule_set_size(&req->rq_pill, &RMF_ACL, RCL_SERVER, acl_bufsize);
 
 	if (!(it->it_op & IT_CREAT) && it->it_op & IT_OPEN &&
@@ -387,7 +390,7 @@ static int mdc_save_lovea(struct ptlrpc_request *req,
 				      lustre_msg_early_size());
 	/* Estimate free space for DoM files in repbuf */
 	repsize_estimate = repsize - (req->rq_replen -
-			   obddev->u.cli.cl_max_mds_easize +
+			   mdt_md_capsule_size +
 			   sizeof(struct lov_comp_md_v1) +
 			   sizeof(struct lov_comp_md_entry_v1) +
 			   lov_mds_md_size(0, LOV_MAGIC_V3));
@@ -539,10 +542,7 @@ static int mdc_save_lovea(struct ptlrpc_request *req,
 	lit = req_capsule_client_get(&req->rq_pill, &RMF_LDLM_INTENT);
 	lit->opc = (u64)it->it_op;
 
-	if (obddev->u.cli.cl_default_mds_easize > 0)
-		easize = obddev->u.cli.cl_default_mds_easize;
-	else
-		easize = obddev->u.cli.cl_max_mds_easize;
+	easize = obddev->u.cli.cl_default_mds_easize;
 
 	/* pack the intended request */
 	mdc_getattr_pack(req, valid, it->it_flags, op_data, easize);
-- 
1.8.3.1

* [lustre-devel] [PATCH 268/622] lustre: lov: Remove unnecessary assert
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (266 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 267/622] lustre: osd: Set max ea size to XATTR_SIZE_MAX James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 269/622] lnet: o2iblnd: kib_conn leak James Simmons
                   ` (354 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

This assertion fires on network data from the server, and the LU-9846
(overstriping) work additionally shows the condition is harmless if it
does somehow occur.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11796
Lustre-commit: 1d7104485119 ("LU-11796 lov: Remove unnecessary assert")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33882
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/lov/lov_object.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/lustre/lov/lov_object.c b/fs/lustre/lov/lov_object.c
index c6324f4..c04b2ae 100644
--- a/fs/lustre/lov/lov_object.c
+++ b/fs/lustre/lov/lov_object.c
@@ -210,7 +210,6 @@ static int lov_init_raid0(const struct lu_env *env, struct lov_device *dev,
 
 	spin_lock_init(&r0->lo_sub_lock);
 	r0->lo_nr = lse->lsme_stripe_count;
-	LASSERT(r0->lo_nr <= lov_targets_nr(dev));
 
 	r0->lo_sub = kcalloc(r0->lo_nr, sizeof(r0->lo_sub[0]),
 			     GFP_KERNEL);
-- 
1.8.3.1

* [lustre-devel] [PATCH 269/622] lnet: o2iblnd: kib_conn leak
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (267 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 268/622] lustre: lov: Remove unnecessary assert James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 270/622] lustre: llite: switch to use ll_fsname directly James Simmons
                   ` (353 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Andriy Skulysh <c17819@cray.com>

A new tx can be queued while kiblnd_finalise_conn() is aborting txs,
so the reference held by the new tx will prevent the connection from
moving onto kib_connd_zombies.

Insert new tx after IBLND_CONN_DISCONNECTED into
ibc_zombie_txs list and abort it during
kiblnd_destroy_conn().
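
The shape of the fix can be modelled with a toy connection object; the struct, states, and counter below are simplified stand-ins for kib_conn and its lists, not the real definitions:

```c
enum { CONN_ESTABLISHED, CONN_DISCONNECTED };

struct conn {
	int state;
	int refcount;	/* stands in for the conn reference count */
	int nzombies;	/* stands in for the ibc_zombie_txs list */
};

/* A tx queued after the connection reached DISCONNECTED must not take
 * a new conn reference, or the refcount never drops to zero and the
 * conn never moves to kib_connd_zombies.  Instead the tx is parked on
 * the zombie list, which is drained in kiblnd_destroy_conn(). */
static void queue_tx(struct conn *c)
{
	if (c->state >= CONN_DISCONNECTED) {
		c->nzombies++;	/* aborted later, at destroy time */
		return;
	}
	c->refcount++;		/* live tx pins the connection */
}
```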

Cray-bug-id: LUS-6412
WC-bug-id: https://jira.whamcloud.com/browse/LU-11756
Lustre-commit: a155c3fca38d ("LU-11756 o2iblnd: kib_conn leak")
Signed-off-by: Andriy Skulysh <c17819@cray.com>
Reviewed-on: https://review.whamcloud.com/33828
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/o2iblnd/o2iblnd.c    |  4 ++++
 net/lnet/klnds/o2iblnd/o2iblnd.h    |  5 ++++-
 net/lnet/klnds/o2iblnd/o2iblnd_cb.c | 21 ++++++++++++++++++---
 3 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.c b/net/lnet/klnds/o2iblnd/o2iblnd.c
index 0e207ef..bb7590f 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.c
@@ -744,6 +744,7 @@ struct kib_conn *kiblnd_create_conn(struct kib_peer_ni *peer_ni,
 	INIT_LIST_HEAD(&conn->ibc_tx_queue_rsrvd);
 	INIT_LIST_HEAD(&conn->ibc_tx_queue_nocred);
 	INIT_LIST_HEAD(&conn->ibc_active_txs);
+	INIT_LIST_HEAD(&conn->ibc_zombie_txs);
 	spin_lock_init(&conn->ibc_lock);
 
 	conn->ibc_connvars = kzalloc_cpt(sizeof(*conn->ibc_connvars), GFP_NOFS, cpt);
@@ -951,6 +952,9 @@ void kiblnd_destroy_conn(struct kib_conn *conn)
 	if (conn->ibc_cq)
 		ib_destroy_cq(conn->ibc_cq);
 
+	kiblnd_txlist_done(&conn->ibc_zombie_txs, -ECONNABORTED,
+			   LNET_MSG_STATUS_OK);
+
 	if (conn->ibc_rx_pages)
 		kiblnd_unmap_rx_descs(conn);
 
diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.h b/net/lnet/klnds/o2iblnd/o2iblnd.h
index baf1006..eb80d5e 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.h
@@ -581,7 +581,9 @@ struct kib_conn {
 	struct list_head	ibc_tx_queue_rsrvd;   /* sends that need to */
 						      /* reserve an ACK/DONE msg */
 	struct list_head	ibc_active_txs;	/* active tx awaiting completion */
-	spinlock_t		ibc_lock;	/* serialise */
+	spinlock_t		ibc_lock;	/* zombie tx awaiting done */
+	struct list_head	ibc_zombie_txs;
+	/* serialise */
 	struct kib_rx		*ibc_rxs;	/* the rx descs */
 	struct kib_pages	*ibc_rx_pages;	/* premapped rx msg pages */
 
@@ -1005,6 +1007,7 @@ static inline unsigned int kiblnd_sg_dma_len(struct ib_device *dev,
 #define KIBLND_CONN_PARAM(e)		((e)->param.conn.private_data)
 #define KIBLND_CONN_PARAM_LEN(e)	((e)->param.conn.private_data_len)
 
+void kiblnd_abort_txs(struct kib_conn *conn, struct list_head *txs);
 void kiblnd_map_rx_descs(struct kib_conn *conn);
 void kiblnd_unmap_rx_descs(struct kib_conn *conn);
 void kiblnd_pool_free_node(struct kib_pool *pool, struct list_head *node);
diff --git a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
index fa5c93a..a3abbb6 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -1211,6 +1211,21 @@ static int kiblnd_map_tx(struct lnet_ni *ni, struct kib_tx *tx,
 	LASSERT(!tx->tx_queued);	/* not queued for sending already */
 	LASSERT(conn->ibc_state >= IBLND_CONN_ESTABLISHED);
 
+	if (conn->ibc_state >= IBLND_CONN_DISCONNECTED) {
+		tx->tx_status = -ECONNABORTED;
+		tx->tx_waiting = 0;
+		if (tx->tx_conn) {
+			/* PUT_DONE first attached to conn as a PUT_REQ */
+			LASSERT(tx->tx_conn == conn);
+			LASSERT(tx->tx_msg->ibm_type == IBLND_MSG_PUT_DONE);
+			tx->tx_conn = NULL;
+			kiblnd_conn_decref(conn);
+		}
+		list_add(&tx->tx_list, &conn->ibc_zombie_txs);
+
+		return;
+	}
+
 	timeout_ns = lnet_get_lnd_timeout() * NSEC_PER_SEC;
 	tx->tx_queued = 1;
 	tx->tx_deadline = ktime_add_ns(ktime_get(), timeout_ns);
@@ -2056,7 +2071,7 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 	write_unlock_irqrestore(&kiblnd_data.kib_global_lock, flags);
 }
 
-static void
+void
 kiblnd_abort_txs(struct kib_conn *conn, struct list_head *txs)
 {
 	LIST_HEAD(zombies);
@@ -2123,8 +2138,6 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 	LASSERT(!in_interrupt());
 	LASSERT(conn->ibc_state > IBLND_CONN_INIT);
 
-	kiblnd_set_conn_state(conn, IBLND_CONN_DISCONNECTED);
-
 	/*
 	 * abort_receives moves QP state to IB_QPS_ERR.  This is only required
 	 * for connections that didn't get as far as being connected, because
@@ -2132,6 +2145,8 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 	 */
 	kiblnd_abort_receives(conn);
 
+	kiblnd_set_conn_state(conn, IBLND_CONN_DISCONNECTED);
+
 	/*
 	 * Complete all tx descs not waiting for sends to complete.
 	 * NB we should be safe from RDMA now that the QP has changed state
-- 
1.8.3.1

* [lustre-devel] [PATCH 270/622] lustre: llite: switch to use ll_fsname directly
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (268 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 269/622] lnet: o2iblnd: kib_conn leak James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 271/622] lustre: llite: improve max_readahead console messages James Simmons
                   ` (352 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

There are many places that need the filesystem's fsname. Instead of
parsing it every time, store it in @sbi once so that @ll_fsname can
be used directly wherever it is needed.
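
A sketch of the cached-fsname logic, assuming an 8-character LUSTRE_MAXFSNAME; the helper name and buffer handling are illustrative, not the exact ll_fill_super() code:

```c
#include <string.h>
#include <errno.h>

#define LUSTRE_MAXFSNAME 8	/* assumed, as in lustre_user.h */

/* Strip a trailing "-client" from the profile name and cache the
 * fsname once in the sb info, rejecting overlong names up front so
 * later users can rely on a fixed-size buffer. */
static int cache_fsname(const char *profile,
			char out[LUSTRE_MAXFSNAME + 1])
{
	size_t len = strlen(profile);
	const char *dash = strrchr(profile, '-');

	if (dash && strcmp(dash, "-client") == 0)
		len -= strlen("-client");
	if (len > LUSTRE_MAXFSNAME)
		return -ENAMETOOLONG;
	memcpy(out, profile, len);
	out[len] = '\0';
	return 0;
}
```

Doing the parse once at mount time replaces the repeated ll_get_fsname() calls on every error-message path.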

WC-bug-id: https://jira.whamcloud.com/browse/LU-12043
Lustre-commit: 506b68a35904 ("LU-12043 llite: switch to use ll_fsname directly")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/34602
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/dir.c            |  7 ++--
 fs/lustre/llite/file.c           | 42 ++++++++++------------
 fs/lustre/llite/lcommon_cl.c     |  5 ++-
 fs/lustre/llite/llite_internal.h |  4 ++-
 fs/lustre/llite/llite_lib.c      | 76 ++++++++++++++--------------------------
 fs/lustre/llite/llite_nfs.c      |  8 ++---
 fs/lustre/llite/lproc_llite.c    | 10 +++---
 fs/lustre/llite/statahead.c      | 10 +++---
 fs/lustre/llite/symlink.c        |  9 +++--
 fs/lustre/llite/vvp_io.c         |  4 +--
 fs/lustre/llite/xattr.c          |  2 +-
 11 files changed, 72 insertions(+), 105 deletions(-)

diff --git a/fs/lustre/llite/dir.c b/fs/lustre/llite/dir.c
index ef4fa36..8293a01 100644
--- a/fs/lustre/llite/dir.c
+++ b/fs/lustre/llite/dir.c
@@ -602,8 +602,8 @@ int ll_dir_setstripe(struct inode *inode, struct lov_user_md *lump,
 
 		buf = param;
 		/* Get fsname and assume devname to be -MDT0000. */
-		ll_get_fsname(inode->i_sb, buf, MTI_NAME_MAXLEN);
-		strcat(buf, "-MDT0000.lov");
+		snprintf(buf, MGS_PARAM_MAXLEN, "%s-MDT0000.lov",
+			 sbi->ll_fsname);
 		buf += strlen(buf);
 
 		/* Set root stripesize */
@@ -1276,8 +1276,7 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		rc = ll_get_fid_by_name(inode, filename, namelen, NULL, NULL);
 		if (rc < 0) {
 			CERROR("%s: lookup %.*s failed: rc = %d\n",
-			       ll_get_fsname(inode->i_sb, NULL, 0), namelen,
-			       filename, rc);
+			       sbi->ll_fsname, namelen, filename, rc);
 			goto out_free;
 		}
 out_free:
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index f5b5eec..0f15ea8 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -135,8 +135,7 @@ static int ll_close_inode_openhandle(struct inode *inode,
 
 	if (!class_exp2obd(md_exp)) {
 		CERROR("%s: invalid MDC connection handle closing " DFID "\n",
-		       ll_get_fsname(inode->i_sb, NULL, 0),
-		       PFID(&lli->lli_fid));
+		       ll_i2sbi(inode)->ll_fsname, PFID(&lli->lli_fid));
 		rc = 0;
 		goto out;
 	}
@@ -460,7 +459,7 @@ void ll_dom_finish_open(struct inode *inode, struct ptlrpc_request *req,
 	 */
 	if (rnb->rnb_offset + rnb->rnb_len < i_size_read(inode)) {
 		CERROR("%s: server returns off/len %llu/%u < i_size %llu\n",
-		       ll_get_fsname(inode->i_sb, NULL, 0), rnb->rnb_offset,
+		       ll_i2sbi(inode)->ll_fsname, rnb->rnb_offset,
 		       rnb->rnb_len, i_size_read(inode));
 		return;
 	}
@@ -486,8 +485,8 @@ void ll_dom_finish_open(struct inode *inode, struct ptlrpc_request *req,
 		if (IS_ERR(vmpage)) {
 			CWARN("%s: cannot fill page %lu for "DFID
 			      " with data: rc = %li\n",
-			      ll_get_fsname(inode->i_sb, NULL, 0),
-			      index + start, PFID(lu_object_fid(&obj->co_lu)),
+			      ll_i2sbi(inode)->ll_fsname, index + start,
+			      PFID(lu_object_fid(&obj->co_lu)),
 			      PTR_ERR(vmpage));
 			break;
 		}
@@ -1080,8 +1079,7 @@ static int ll_lease_och_release(struct inode *inode, struct file *file)
 	rc2 = ll_close_inode_openhandle(inode, och, 0, NULL);
 	if (rc2 < 0)
 		CERROR("%s: error closing file " DFID ": %d\n",
-		       ll_get_fsname(inode->i_sb, NULL, 0),
-		       PFID(&ll_i2info(inode)->lli_fid), rc2);
+		       sbi->ll_fsname, PFID(&ll_i2info(inode)->lli_fid), rc2);
 	och = NULL; /* och has been freed in ll_close_inode_openhandle() */
 out_release_it:
 	ll_intent_release(&it);
@@ -1124,7 +1122,7 @@ static int ll_swap_layouts_close(struct obd_client_handle *och,
 	int rc;
 
 	CDEBUG(D_INODE, "%s: biased close of file " DFID "\n",
-	       ll_get_fsname(inode->i_sb, NULL, 0), PFID(fid1));
+	       ll_i2sbi(inode)->ll_fsname, PFID(fid1));
 
 	rc = ll_check_swap_layouts_validity(inode, inode2);
 	if (rc < 0)
@@ -2293,7 +2291,7 @@ int ll_hsm_release(struct inode *inode)
 	u16 refcheck;
 
 	CDEBUG(D_INODE, "%s: Releasing file " DFID ".\n",
-	       ll_get_fsname(inode->i_sb, NULL, 0),
+	       ll_i2sbi(inode)->ll_fsname,
 	       PFID(&ll_i2info(inode)->lli_fid));
 
 	och = ll_lease_open(inode, NULL, FMODE_WRITE, MDS_OPEN_RELEASE);
@@ -2716,6 +2714,7 @@ int ll_file_lock_ahead(struct file *file, struct llapi_lu_ladvise *ladvise)
 static int ll_ladvise_sanity(struct inode *inode,
 			     struct llapi_lu_ladvise *ladvise)
 {
+	struct ll_sb_info *sbi = ll_i2sbi(inode);
 	enum lu_ladvise_type advice = ladvise->lla_advice;
 	/* Note the peradvice flags is a 32 bit field, so per advice flags must
 	 * be in the first 32 bits of enum ladvise_flags
@@ -2728,7 +2727,7 @@ static int ll_ladvise_sanity(struct inode *inode,
 		rc = -EINVAL;
 		CDEBUG(D_VFSTRACE,
 		       "%s: advice with value '%d' not recognized, last supported advice is %s (value '%d'): rc = %d\n",
-		       ll_get_fsname(inode->i_sb, NULL, 0), advice,
+		       sbi->ll_fsname, advice,
 		       ladvise_names[LU_LADVISE_MAX - 1], LU_LADVISE_MAX - 1,
 		       rc);
 		goto out;
@@ -2741,7 +2740,7 @@ static int ll_ladvise_sanity(struct inode *inode,
 			rc = -EINVAL;
 			CDEBUG(D_VFSTRACE,
 			       "%s: Invalid flags (%x) for %s: rc = %d\n",
-			       ll_get_fsname(inode->i_sb, NULL, 0), flags,
+			       sbi->ll_fsname, flags,
 			       ladvise_names[advice], rc);
 			goto out;
 		}
@@ -2753,7 +2752,7 @@ static int ll_ladvise_sanity(struct inode *inode,
 			rc = -EINVAL;
 			CDEBUG(D_VFSTRACE,
 			       "%s: Invalid mode (%d) for %s: rc = %d\n",
-			       ll_get_fsname(inode->i_sb, NULL, 0),
+			       sbi->ll_fsname,
 			       ladvise->lla_lockahead_mode,
 			       ladvise_names[advice], rc);
 			goto out;
@@ -2769,7 +2768,7 @@ static int ll_ladvise_sanity(struct inode *inode,
 			rc = -EINVAL;
 			CDEBUG(D_VFSTRACE,
 			       "%s: Invalid flags (%x) for %s: rc = %d\n",
-			       ll_get_fsname(inode->i_sb, NULL, 0), flags,
+			       sbi->ll_fsname, flags,
 			       ladvise_names[advice], rc);
 			goto out;
 		}
@@ -2777,7 +2776,7 @@ static int ll_ladvise_sanity(struct inode *inode,
 			rc = -EINVAL;
 			CDEBUG(D_VFSTRACE,
 			       "%s: Invalid range (%llu to %llu) for %s: rc = %d\n",
-			       ll_get_fsname(inode->i_sb, NULL, 0),
+			       sbi->ll_fsname,
 			       ladvise->lla_start, ladvise->lla_end,
 			       ladvise_names[advice], rc);
 			goto out;
@@ -3970,7 +3969,7 @@ int ll_migrate(struct inode *parent, struct file *file, struct lmv_user_md *lum,
 		if (le32_to_cpu(lum->lum_stripe_count) > 1 ||
 		    ll_i2info(child_inode)->lli_lsm_md) {
 			CERROR("%s: MDT doesn't support stripe directory migration!\n",
-			       ll_get_fsname(parent->i_sb, NULL, 0));
+			       ll_i2sbi(parent)->ll_fsname);
 			rc = -EOPNOTSUPP;
 			goto out_iput;
 		}
@@ -3997,7 +3996,7 @@ int ll_migrate(struct inode *parent, struct file *file, struct lmv_user_md *lum,
 	op_data->op_fid3 = *ll_inode2fid(child_inode);
 	if (!fid_is_sane(&op_data->op_fid3)) {
 		CERROR("%s: migrate %s, but fid " DFID " is insane\n",
-		       ll_get_fsname(parent->i_sb, NULL, 0), name,
+		       ll_i2sbi(parent)->ll_fsname, name,
 		       PFID(&op_data->op_fid3));
 		rc = -EINVAL;
 		goto out_unlock;
@@ -4171,7 +4170,7 @@ static int ll_inode_revalidate_fini(struct inode *inode, int rc)
 	} else if (rc != 0) {
 		CDEBUG_LIMIT((rc == -EACCES || rc == -EIDRM) ? D_INFO : D_ERROR,
 			     "%s: revalidate FID " DFID " error: rc = %d\n",
-			     ll_get_fsname(inode->i_sb, NULL, 0),
+			     ll_i2sbi(inode)->ll_fsname,
 			     PFID(ll_inode2fid(inode)), rc);
 	}
 
@@ -4677,8 +4676,7 @@ static int ll_layout_lock_set(struct lustre_handle *lockh, enum ldlm_mode mode,
 	/* wait for IO to complete if it's still being used. */
 	if (wait_layout) {
 		CDEBUG(D_INODE, "%s: " DFID "(%p) wait for layout reconf\n",
-		       ll_get_fsname(inode->i_sb, NULL, 0),
-		       PFID(&lli->lli_fid), inode);
+		       sbi->ll_fsname, PFID(&lli->lli_fid), inode);
 
 		memset(&conf, 0, sizeof(conf));
 		conf.coc_opc = OBJECT_CONF_WAIT;
@@ -4689,8 +4687,7 @@ static int ll_layout_lock_set(struct lustre_handle *lockh, enum ldlm_mode mode,
 
 		CDEBUG(D_INODE,
 		       "%s: file=" DFID " waiting layout return: %d.\n",
-		       ll_get_fsname(inode->i_sb, NULL, 0),
-		       PFID(&lli->lli_fid), rc);
+		       sbi->ll_fsname, PFID(&lli->lli_fid), rc);
 	}
 	return rc;
 }
@@ -4727,8 +4724,7 @@ static int ll_layout_intent(struct inode *inode, struct layout_intent *intent)
 		it.it_flags = FMODE_WRITE;
 
 	LDLM_DEBUG_NOLOCK("%s: requeue layout lock for file " DFID "(%p)",
-			  ll_get_fsname(inode->i_sb, NULL, 0),
-			  PFID(&lli->lli_fid), inode);
+			  sbi->ll_fsname, PFID(&lli->lli_fid), inode);
 
 	rc = md_intent_lock(sbi->ll_md_exp, op_data, &it, &req,
 			    &ll_md_blocking_ast, 0);
diff --git a/fs/lustre/llite/lcommon_cl.c b/fs/lustre/llite/lcommon_cl.c
index 9ac80e0..3129316 100644
--- a/fs/lustre/llite/lcommon_cl.c
+++ b/fs/lustre/llite/lcommon_cl.c
@@ -174,8 +174,7 @@ int cl_file_inode_init(struct inode *inode, struct lustre_md *md)
 		if (!(inode->i_state & I_NEW)) {
 			result = -EIO;
 			CERROR("%s: unexpected not-NEW inode "DFID": rc = %d\n",
-			       ll_get_fsname(inode->i_sb, NULL, 0), PFID(fid),
-			       result);
+			       ll_i2sbi(inode)->ll_fsname, PFID(fid), result);
 			goto out;
 		}
 
@@ -202,7 +201,7 @@ int cl_file_inode_init(struct inode *inode, struct lustre_md *md)
 
 	if (result)
 		CERROR("%s: failed to initialize cl_object "DFID": rc = %d\n",
-			ll_get_fsname(inode->i_sb, NULL, 0), PFID(fid), result);
+		       ll_i2sbi(inode)->ll_fsname, PFID(fid), result);
 
 out:
 	cl_env_put(env, &refcheck);
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index 5a0a5ed..b9478f4d 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -556,6 +556,9 @@ struct ll_sb_info {
 	/* File heat */
 	unsigned int		ll_heat_decay_weight;
 	unsigned int		ll_heat_period_second;
+
+	/* filesystem fsname */
+	char			ll_fsname[LUSTRE_MAXFSNAME + 1];
 };
 
 #define SBI_DEFAULT_HEAT_DECAY_WEIGHT	((80 * 256 + 50) / 100)
@@ -935,7 +938,6 @@ struct md_op_data *ll_prep_md_op_data(struct md_op_data *op_data,
 				      u32 mode, u32 opc, void *data);
 void ll_finish_md_op_data(struct md_op_data *op_data);
 int ll_get_obd_name(struct inode *inode, unsigned int cmd, unsigned long arg);
-char *ll_get_fsname(struct super_block *sb, char *buf, int buflen);
 void ll_compute_rootsquash_state(struct ll_sb_info *sbi);
 void ll_open_cleanup(struct super_block *sb, struct ptlrpc_request *open_req);
 ssize_t ll_copy_user_md(const struct lov_user_md __user *md,
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index aadde3f..8e5cf0a 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -586,9 +586,9 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 
 	sb->s_root = d_make_root(root);
 	if (!sb->s_root) {
-		CERROR("%s: can't make root dentry\n",
-		       ll_get_fsname(sb, NULL, 0));
 		err = -ENOMEM;
+		CERROR("%s: can't make root dentry, rc = %d\n",
+		       sbi->ll_fsname, err);
 		goto out_lock_cn_cb;
 	}
 
@@ -614,7 +614,7 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 					sbi->ll_dt_obd->obd_type->typ_name);
 		if (err < 0) {
 			CERROR("%s: could not register %s in llite: rc = %d\n",
-			       dt, ll_get_fsname(sb, NULL, 0), err);
+			       dt, sbi->ll_fsname, err);
 			err = 0;
 		}
 	}
@@ -625,7 +625,7 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 					sbi->ll_md_obd->obd_type->typ_name);
 		if (err < 0) {
 			CERROR("%s: could not register %s in llite: rc = %d\n",
-			       md, ll_get_fsname(sb, NULL, 0), err);
+			       md, sbi->ll_fsname, err);
 			err = 0;
 		}
 	}
@@ -1004,6 +1004,19 @@ int ll_fill_super(struct super_block *sb)
 	if (ptr && (strcmp(ptr, "-client") == 0))
 		len -= 7;
 
+	if (len > LUSTRE_MAXFSNAME) {
+		if (unlikely(len >= MAX_OBD_NAME))
+			len = MAX_OBD_NAME - 1;
+		strncpy(name, profilenm, len);
+		name[len] = '\0';
+		err = -ENAMETOOLONG;
+		CERROR("%s: fsname longer than %u characters: rc = %d\n",
+		       name, LUSTRE_MAXFSNAME, err);
+		goto out_free;
+	}
+	strncpy(sbi->ll_fsname, profilenm, len);
+	sbi->ll_fsname[len] = '\0';
+
 	/* Mount info */
 	snprintf(name, sizeof(name), "%.*s-%px", len,
 		 lsi->lsi_lmd->lmd_profile, sb);
@@ -1014,7 +1027,7 @@ int ll_fill_super(struct super_block *sb)
 	err = ll_debugfs_register_super(sb, name);
 	if (err < 0) {
 		CERROR("%s: could not register mountpoint in llite: rc = %d\n",
-		       ll_get_fsname(sb, NULL, 0), err);
+		       sbi->ll_fsname, err);
 		err = 0;
 	}
 
@@ -1208,7 +1221,7 @@ static struct inode *ll_iget_anon_dir(struct super_block *sb,
 	inode = iget_locked(sb, ino);
 	if (!inode) {
 		CERROR("%s: failed get simple inode " DFID ": rc = -ENOENT\n",
-		       ll_get_fsname(sb, NULL, 0), PFID(fid));
+		       sbi->ll_fsname, PFID(fid));
 		return ERR_PTR(-ENOENT);
 	}
 
@@ -1252,8 +1265,7 @@ static int ll_init_lsm_md(struct inode *inode, struct lustre_md *md)
 	LASSERT(lsm);
 
 	CDEBUG(D_INODE, "%s: "DFID" set dir layout:\n",
-		ll_get_fsname(inode->i_sb, NULL, 0),
-		PFID(&lli->lli_fid));
+	       ll_i2sbi(inode)->ll_fsname, PFID(&lli->lli_fid));
 	lsm_md_dump(D_INODE, lsm);
 
 	/*
@@ -1322,7 +1334,7 @@ static int ll_update_lsm_md(struct inode *inode, struct lustre_md *md)
 		if (lsm->lsm_md_layout_version <=
 		    lli->lli_lsm_md->lsm_md_layout_version) {
 			CERROR("%s: " DFID " dir layout mismatch:\n",
-			       ll_get_fsname(inode->i_sb, NULL, 0),
+			       ll_i2sbi(inode)->ll_fsname,
 			       PFID(&lli->lli_fid));
 			lsm_md_dump(D_ERROR, lli->lli_lsm_md);
 			lsm_md_dump(D_ERROR, lsm);
@@ -1529,7 +1541,7 @@ int ll_setattr_raw(struct dentry *dentry, struct iattr *attr,
 	int rc = 0;
 
 	CDEBUG(D_VFSTRACE, "%s: setattr inode " DFID "(%p) from %llu to %llu, valid %x, hsm_import %d\n",
-	       ll_get_fsname(inode->i_sb, NULL, 0), PFID(&lli->lli_fid), inode,
+	       ll_i2sbi(inode)->ll_fsname, PFID(&lli->lli_fid), inode,
 	       i_size_read(inode), attr->ia_size, attr->ia_valid, hsm_import);
 
 	if (attr->ia_valid & ATTR_SIZE) {
@@ -2046,7 +2058,7 @@ void ll_delete_inode(struct inode *inode)
 
 	LASSERTF(nrpages == 0,
 		 "%s: inode="DFID"(%p) nrpages=%lu, see https://jira.whamcloud.com/browse/LU-118\n",
-		 ll_get_fsname(inode->i_sb, NULL, 0),
+		 ll_i2sbi(inode)->ll_fsname,
 		 PFID(ll_inode2fid(inode)), inode, nrpages);
 
 	ll_clear_inode(inode);
@@ -2300,7 +2312,7 @@ int ll_prep_inode(struct inode **inode, struct ptlrpc_request *req,
 		 */
 		if (!fid_is_sane(&md.body->mbo_fid1)) {
 			CERROR("%s: Fid is insane " DFID "\n",
-			       ll_get_fsname(sb, NULL, 0),
+			       sbi->ll_fsname,
 			       PFID(&md.body->mbo_fid1));
 			rc = -EINVAL;
 			goto out;
@@ -2570,40 +2582,6 @@ int ll_get_obd_name(struct inode *inode, unsigned int cmd, unsigned long arg)
 	return 0;
 }
 
-/**
- * Get lustre file system name by @sbi. If @buf is provided(non-NULL), the
- * fsname will be returned in this buffer; otherwise, a static buffer will be
- * used to store the fsname and returned to caller.
- */
-char *ll_get_fsname(struct super_block *sb, char *buf, int buflen)
-{
-	static char fsname_static[MTI_NAME_MAXLEN];
-	struct lustre_sb_info *lsi = s2lsi(sb);
-	char *ptr;
-	int len;
-
-	if (!buf) {
-		/* this means the caller wants to use static buffer
-		 * and it doesn't care about race. Usually this is
-		 * in error reporting path
-		 */
-		buf = fsname_static;
-		buflen = sizeof(fsname_static);
-	}
-
-	len = strlen(lsi->lsi_lmd->lmd_profile);
-	ptr = strrchr(lsi->lsi_lmd->lmd_profile, '-');
-	if (ptr && (strcmp(ptr, "-client") == 0))
-		len -= 7;
-
-	if (unlikely(len >= buflen))
-		len = buflen - 1;
-	strncpy(buf, lsi->lsi_lmd->lmd_profile, len);
-	buf[len] = '\0';
-
-	return buf;
-}
-
 void ll_dirty_page_discard_warn(struct page *page, int ioret)
 {
 	char *buf, *path = NULL;
@@ -2613,15 +2591,15 @@ void ll_dirty_page_discard_warn(struct page *page, int ioret)
 	/* this can be called inside spin lock so use GFP_ATOMIC. */
 	buf = (char *)__get_free_page(GFP_ATOMIC);
 	if (buf) {
-		dentry = d_find_alias(page->mapping->host);
+		dentry = d_find_alias(inode);
 		if (dentry)
 			path = dentry_path_raw(dentry, buf, PAGE_SIZE);
 	}
 
 	CDEBUG(D_WARNING,
 	       "%s: dirty page discard: %s/fid: " DFID "/%s may get corrupted (rc %d)\n",
-	       ll_get_fsname(page->mapping->host->i_sb, NULL, 0),
-	       s2lsi(page->mapping->host->i_sb)->lsi_lmd->lmd_dev,
+	       ll_i2sbi(inode)->ll_fsname,
+	       s2lsi(inode->i_sb)->lsi_lmd->lmd_dev,
 	       PFID(ll_inode2fid(inode)),
 	       (path && !IS_ERR(path)) ? path : "", ioret);
 
diff --git a/fs/lustre/llite/llite_nfs.c b/fs/lustre/llite/llite_nfs.c
index de8f707..2ac5ad9 100644
--- a/fs/lustre/llite/llite_nfs.c
+++ b/fs/lustre/llite/llite_nfs.c
@@ -181,7 +181,7 @@ static int ll_encode_fh(struct inode *inode, u32 *fh, int *plen,
 	struct lustre_nfs_fid *nfs_fid = (void *)fh;
 
 	CDEBUG(D_INFO, "%s: encoding for (" DFID ") maxlen=%d minlen=%d\n",
-	       ll_get_fsname(inode->i_sb, NULL, 0),
+	       ll_i2sbi(inode)->ll_fsname,
 	       PFID(ll_inode2fid(inode)), *plen, fileid_len);
 
 	if (*plen < fileid_len) {
@@ -298,8 +298,7 @@ int ll_dir_get_parent_fid(struct inode *dir, struct lu_fid *parent_fid)
 	sbi = ll_s2sbi(dir->i_sb);
 
 	CDEBUG(D_INFO, "%s: getting parent for (" DFID ")\n",
-	       ll_get_fsname(dir->i_sb, NULL, 0),
-	       PFID(ll_inode2fid(dir)));
+	       sbi->ll_fsname, PFID(ll_inode2fid(dir)));
 
 	rc = ll_get_default_mdsize(sbi, &lmmsize);
 	if (rc != 0)
@@ -315,8 +314,7 @@ int ll_dir_get_parent_fid(struct inode *dir, struct lu_fid *parent_fid)
 	ll_finish_md_op_data(op_data);
 	if (rc) {
 		CERROR("%s: failure inode " DFID " get parent: rc = %d\n",
-		       ll_get_fsname(dir->i_sb, NULL, 0),
-		       PFID(ll_inode2fid(dir)), rc);
+		       sbi->ll_fsname, PFID(ll_inode2fid(dir)), rc);
 		return rc;
 	}
 	body = req_capsule_server_get(&req->rq_pill, &RMF_MDT_BODY);
diff --git a/fs/lustre/llite/lproc_llite.c b/fs/lustre/llite/lproc_llite.c
index 596aad8..197c09c 100644
--- a/fs/lustre/llite/lproc_llite.c
+++ b/fs/lustre/llite/lproc_llite.c
@@ -523,7 +523,7 @@ static ssize_t ll_max_cached_mb_seq_write(struct file *file,
 
 	if (pages_number < 0 || pages_number > totalram_pages()) {
 		CERROR("%s: can't set max cache more than %lu MB\n",
-		       ll_get_fsname(sb, NULL, 0),
+		       sbi->ll_fsname,
 		       totalram_pages() >> (20 - PAGE_SHIFT));
 		return -ERANGE;
 	}
@@ -977,7 +977,7 @@ static int ll_sbi_flags_seq_show(struct seq_file *m, void *v)
 	while (flags != 0) {
 		if (ARRAY_SIZE(str) <= i) {
 			CERROR("%s: Revise array LL_SBI_FLAGS to match sbi flags please.\n",
-			       ll_get_fsname(sb, NULL, 0));
+			       ll_s2sbi(sb)->ll_fsname);
 			return -EINVAL;
 		}
 
@@ -1273,8 +1273,7 @@ static ssize_t ll_root_squash_seq_write(struct file *file,
 	struct ll_sb_info *sbi = ll_s2sbi(sb);
 	struct root_squash_info *squash = &sbi->ll_squash;
 
-	return lprocfs_wr_root_squash(buffer, count, squash,
-				      ll_get_fsname(sb, NULL, 0));
+	return lprocfs_wr_root_squash(buffer, count, squash, sbi->ll_fsname);
 }
 LPROC_SEQ_FOPS(ll_root_squash);
 
@@ -1309,8 +1308,7 @@ static ssize_t ll_nosquash_nids_seq_write(struct file *file,
 	struct root_squash_info *squash = &sbi->ll_squash;
 	int rc;
 
-	rc = lprocfs_wr_nosquash_nids(buffer, count, squash,
-				      ll_get_fsname(sb, NULL, 0));
+	rc = lprocfs_wr_nosquash_nids(buffer, count, squash, sbi->ll_fsname);
 	if (rc < 0)
 		return rc;
 
diff --git a/fs/lustre/llite/statahead.c b/fs/lustre/llite/statahead.c
index 1de62b5..7dfb045 100644
--- a/fs/lustre/llite/statahead.c
+++ b/fs/lustre/llite/statahead.c
@@ -663,9 +663,8 @@ static void sa_instantiate(struct ll_statahead_info *sai,
 		goto out;
 
 	CDEBUG(D_READA, "%s: setting %.*s" DFID " l_data to inode %p\n",
-	       ll_get_fsname(child->i_sb, NULL, 0),
-	       entry->se_qstr.len, entry->se_qstr.name,
-	       PFID(ll_inode2fid(child)), child);
+	       ll_i2sbi(dir)->ll_fsname, entry->se_qstr.len,
+	       entry->se_qstr.name, PFID(ll_inode2fid(child)), child);
 	ll_set_lock_data(ll_i2sbi(dir)->ll_md_exp, child, it, NULL);
 
 	entry->se_inode = child;
@@ -1270,7 +1269,7 @@ static int is_first_dirent(struct inode *dir, struct dentry *dentry)
 
 			rc = PTR_ERR(page);
 			CERROR("%s: error reading dir " DFID " at %llu: opendir_pid = %u : rc = %d\n",
-			       ll_get_fsname(dir->i_sb, NULL, 0),
+			       ll_i2sbi(dir)->ll_fsname,
 			       PFID(ll_inode2fid(dir)), pos,
 			       lli->lli_opendir_pid, rc);
 			break;
@@ -1472,8 +1471,7 @@ static int revalidate_statahead_dentry(struct inode *dir,
 				/* revalidate, but inode is recreated */
 				CDEBUG(D_READA,
 				       "%s: stale dentry %pd inode " DFID ", statahead inode " DFID "\n",
-				       ll_get_fsname((*dentryp)->d_inode->i_sb,
-						     NULL, 0),
+				       ll_i2sbi(inode)->ll_fsname,
 				       *dentryp,
 				       PFID(ll_inode2fid((*dentryp)->d_inode)),
 				       PFID(ll_inode2fid(inode)));
diff --git a/fs/lustre/llite/symlink.c b/fs/lustre/llite/symlink.c
index d2922d1..aae449c 100644
--- a/fs/lustre/llite/symlink.c
+++ b/fs/lustre/llite/symlink.c
@@ -75,7 +75,7 @@ static int ll_readlink_internal(struct inode *inode,
 	if (rc) {
 		if (rc != -ENOENT)
 			CERROR("%s: inode " DFID ": rc = %d\n",
-			       ll_get_fsname(inode->i_sb, NULL, 0),
+			       ll_i2sbi(inode)->ll_fsname,
 			       PFID(ll_inode2fid(inode)), rc);
 		goto failed;
 	}
@@ -90,9 +90,8 @@ static int ll_readlink_internal(struct inode *inode,
 	LASSERT(symlen != 0);
 	if (body->mbo_eadatasize != symlen) {
 		CERROR("%s: inode " DFID ": symlink length %d not expected %d\n",
-		       ll_get_fsname(inode->i_sb, NULL, 0),
-		       PFID(ll_inode2fid(inode)), body->mbo_eadatasize - 1,
-		       symlen - 1);
+		       sbi->ll_fsname, PFID(ll_inode2fid(inode)),
+		       body->mbo_eadatasize - 1, symlen - 1);
 		rc = -EPROTO;
 		goto failed;
 	}
@@ -101,7 +100,7 @@ static int ll_readlink_internal(struct inode *inode,
 	if (!*symname || strnlen(*symname, symlen) != symlen - 1) {
 		/* not full/NULL terminated */
 		CERROR("%s: inode " DFID ": symlink not NULL terminated string of length %d\n",
-		       ll_get_fsname(inode->i_sb, NULL, 0),
+		       ll_i2sbi(inode)->ll_fsname,
 		       PFID(ll_inode2fid(inode)), symlen - 1);
 		rc = -EPROTO;
 		goto failed;
diff --git a/fs/lustre/llite/vvp_io.c b/fs/lustre/llite/vvp_io.c
index ad4b39e..43f4088 100644
--- a/fs/lustre/llite/vvp_io.c
+++ b/fs/lustre/llite/vvp_io.c
@@ -1012,7 +1012,7 @@ static int vvp_io_write_start(const struct lu_env *env,
 	if (pos + cnt > ll_file_maxbytes(inode)) {
 		CDEBUG(D_INODE,
 		       "%s: file " DFID " offset %llu > maxbytes %llu\n",
-		       ll_get_fsname(inode->i_sb, NULL, 0),
+		       ll_i2sbi(inode)->ll_fsname,
 		       PFID(ll_inode2fid(inode)), pos + cnt,
 		       ll_file_maxbytes(inode));
 		return -EFBIG;
@@ -1440,7 +1440,7 @@ int vvp_io_init(const struct lu_env *env, struct cl_object *obj,
 			result = 0;
 		if (result < 0)
 			CERROR("%s: refresh file layout " DFID " error %d.\n",
-			       ll_get_fsname(inode->i_sb, NULL, 0),
+			       ll_i2sbi(inode)->ll_fsname,
 			       PFID(lu_object_fid(&obj->co_lu)), result);
 	}
 
diff --git a/fs/lustre/llite/xattr.c b/fs/lustre/llite/xattr.c
index 948aaf6..aa61a5a 100644
--- a/fs/lustre/llite/xattr.c
+++ b/fs/lustre/llite/xattr.c
@@ -381,7 +381,7 @@ int ll_xattr_list(struct inode *inode, const char *name, int type, void *buffer,
 	if (rc == -EOPNOTSUPP && type == XATTR_USER_T) {
 		LCONSOLE_INFO(
 			"%s: disabling user_xattr feature because it is not supported on the server: rc = %d\n",
-			ll_get_fsname(inode->i_sb, NULL, 0), rc);
+			sbi->ll_fsname, rc);
 		spin_lock(&sbi->ll_lock);
 		sbi->ll_flags &= ~LL_SBI_USER_XATTR;
 		spin_unlock(&sbi->ll_lock);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 271/622] lustre: llite: improve max_readahead console messages
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (269 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 270/622] lustre: llite: switch to use ll_fsname directly James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 272/622] lustre: llite: fill copied dentry name's ending char properly James Simmons
                   ` (351 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Improve the max_readahead_mb, max_readahead_per_file_mb, and
max_read_ahead_whole_mb console error messages to print the
parameters properly in MB instead of PAGE_SIZE units, and include
the filesystem name and bad parameters in the output.

The unit handling this patch moves to can be sketched in userspace. This is a minimal demo, not the kernel code: PAGE_SHIFT is assumed to be 12 (4 KiB pages) here, while the real value comes from the kernel headers.

```c
/* pages_shift = 20 - PAGE_SHIFT, so value_mb << pages_shift gives pages
 * and pages >> pages_shift gives MB again for the console message. */
#include <assert.h>
#include <stdio.h>

#define PAGE_SHIFT 12			/* assumed for this demo */
#define PAGES_SHIFT (20 - PAGE_SHIFT)	/* log2(1 MiB / PAGE_SIZE) */

static unsigned long mb_to_pages(unsigned long mb)
{
	return mb << PAGES_SHIFT;
}

static unsigned long pages_to_mb(unsigned long pages)
{
	return pages >> PAGES_SHIFT;
}

int main(void)
{
	unsigned long pages = mb_to_pages(64);

	/* error messages should report MB, not raw page counts */
	printf("64 MB = %lu pages, back to %lu MB\n",
	       pages, pages_to_mb(pages));
	assert(pages == 16384 && pages_to_mb(pages) == 64);
	return 0;
}
```

The fixed messages apply the right shift before printing, which is why they now show MB on both sides of the comparison.
<imports>
</imports>
<test>
assert(mb_to_pages(1) == 256);
assert(pages_to_mb(16384) == 64);
</test>
WC-bug-id: https://jira.whamcloud.com/browse/LU-1095
Lustre-commit: 48a0697d7910 ("LU-1095 llite: improve max_readahead console messages")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: http://review.whamcloud.com/12399
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/lproc_llite.c | 28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/fs/lustre/llite/lproc_llite.c b/fs/lustre/llite/lproc_llite.c
index 197c09c..cc9f80e 100644
--- a/fs/lustre/llite/lproc_llite.c
+++ b/fs/lustre/llite/lproc_llite.c
@@ -346,16 +346,19 @@ static ssize_t max_read_ahead_mb_store(struct kobject *kobj,
 					      ll_kset.kobj);
 	int rc;
 	unsigned long pages_number;
+	int pages_shift;
 
+	pages_shift = 20 - PAGE_SHIFT;
 	rc = kstrtoul(buffer, 10, &pages_number);
 	if (rc)
 		return rc;
 
-	pages_number *= 1 << (20 - PAGE_SHIFT); /* MB -> pages */
+	pages_number <<= pages_shift; /* MB -> pages */
 
 	if (pages_number > totalram_pages() / 2) {
-		CERROR("can't set file readahead more than %lu MB\n",
-		       totalram_pages() >> (20 - PAGE_SHIFT + 1)); /*1/2 of RAM*/
+		CERROR("%s: can't set max_readahead_mb=%lu > %luMB\n",
+		       sbi->ll_fsname, pages_number >> pages_shift,
+		       totalram_pages() >> (pages_shift + 1)); /*1/2 of RAM*/
 		return -ERANGE;
 	}
 
@@ -393,14 +396,20 @@ static ssize_t max_read_ahead_per_file_mb_store(struct kobject *kobj,
 					      ll_kset.kobj);
 	int rc;
 	unsigned long pages_number;
+	int pages_shift;
 
+	pages_shift = 20 - PAGE_SHIFT;
 	rc = kstrtoul(buffer, 10, &pages_number);
 	if (rc)
 		return rc;
 
+	pages_number <<= pages_shift; /* MB -> pages */
+
 	if (pages_number > sbi->ll_ra_info.ra_max_pages) {
-		CERROR("can't set file readahead more than max_read_ahead_mb %lu MB\n",
-		       sbi->ll_ra_info.ra_max_pages);
+		CERROR("%s: can't set max_readahead_per_file_mb=%lu > max_read_ahead_mb=%lu\n",
+		       sbi->ll_fsname,
+		       pages_number >> pages_shift,
+		       sbi->ll_ra_info.ra_max_pages >> pages_shift);
 		return -ERANGE;
 	}
 
@@ -438,17 +447,22 @@ static ssize_t max_read_ahead_whole_mb_store(struct kobject *kobj,
 					      ll_kset.kobj);
 	int rc;
 	unsigned long pages_number;
+	int pages_shift;
 
+	pages_shift = 20 - PAGE_SHIFT;
 	rc = kstrtoul(buffer, 10, &pages_number);
 	if (rc)
 		return rc;
+	pages_number <<= pages_shift; /* MB -> pages */
 
 	/* Cap this at the current max readahead window size, the readahead
 	 * algorithm does this anyway so it's pointless to set it larger.
 	 */
 	if (pages_number > sbi->ll_ra_info.ra_max_pages_per_file) {
-		CERROR("can't set max_read_ahead_whole_mb more than max_read_ahead_per_file_mb: %lu\n",
-		       sbi->ll_ra_info.ra_max_pages_per_file >> (20 - PAGE_SHIFT));
+		CERROR("%s: can't set max_read_ahead_whole_mb=%lu > max_read_ahead_per_file_mb=%lu\n",
+		       sbi->ll_fsname,
+		       pages_number >> pages_shift,
+		       sbi->ll_ra_info.ra_max_pages_per_file >> pages_shift);
 		return -ERANGE;
 	}
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 272/622] lustre: llite: fill copied dentry name's ending char properly
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (270 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 271/622] lustre: llite: improve max_readahead console messages James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 273/622] lustre: obd: update udev event handling James Simmons
                   ` (350 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

A dentry name expects an extra '\0', and dentry_len does not account
for that terminator, so we must allocate the extra byte and fill it
in ourselves when copying the dentry name.

Otherwise, lu_name_is_valid_2() will access @name[len] to check
whether it is '\0', which is an invalid memory access. We may hit a
crash if that byte happens to read as '\0' on the first access but is
later overwritten by someone else, so that we finally fail the
sanity check in mdc_name_pack().

LustreError: 157839:0:(mdc_lib.c:137:mdc_pack_name()) LBUG

Fixes: 2eae6a4 ("lustre: llite: make sure name pack atomic")
WC-bug-id: https://jira.whamcloud.com/browse/LU-12169
Lustre-commit: bc9cc327983c ("LU-12169 llite: fill copied dentry name's ending char properly")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/34611
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Gu Zheng <gzheng@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_support.h |  1 +
 fs/lustre/llite/file.c          | 10 ++++++----
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index 9ebdcb6..4e956da 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -456,6 +456,7 @@
 #define OBD_FAIL_LLITE_CREATE_NODE_PAUSE		0x140c
 #define OBD_FAIL_LLITE_IMUTEX_SEC			0x140e
 #define OBD_FAIL_LLITE_IMUTEX_NOSEC			0x140f
+#define OBD_FAIL_LLITE_OPEN_BY_NAME			0x1410
 
 #define OBD_FAIL_FID_INDIR				0x1501
 #define OBD_FAIL_FID_INLMA				0x1502
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 0f15ea8..61d53c4 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -513,12 +513,14 @@ static int ll_intent_file_open(struct dentry *de, void *lmm, int lmmsize,
 	 * if server supports open-by-fid, or file name is invalid, don't pack
 	 * name in open request
 	 */
-	if (!(exp_connect_flags(sbi->ll_md_exp) & OBD_CONNECT_OPEN_BY_FID)) {
+	if (OBD_FAIL_CHECK(OBD_FAIL_LLITE_OPEN_BY_NAME) ||
+	    !(exp_connect_flags(sbi->ll_md_exp) & OBD_CONNECT_OPEN_BY_FID)) {
 retry:
 		len = de->d_name.len;
-		name = kmalloc(len, GFP_NOFS);
+		name = kmalloc(len + 1, GFP_NOFS);
 		if (!name)
 			return -ENOMEM;
+
 		/* race here */
 		spin_lock(&de->d_lock);
 		if (len != de->d_name.len) {
@@ -527,12 +529,12 @@ static int ll_intent_file_open(struct dentry *de, void *lmm, int lmmsize,
 			goto retry;
 		}
 		memcpy(name, de->d_name.name, len);
+		name[len] = '\0';
 		spin_unlock(&de->d_lock);
 
 		if (!lu_name_is_valid_2(name, len)) {
 			kfree(name);
-			name = NULL;
-			len = 0;
+			return -ESTALE;
 		}
 	}
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 273/622] lustre: obd: update udev event handling
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (271 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 272/622] lustre: llite: fill copied dentry name's ending char properly James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 274/622] lustre: ptlrpc: Bulk assertion fails on -ENOMEM James Simmons
                   ` (349 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

Add a timestamp, which users have requested, so that the time when a
sysfs lustre file changed can be recorded. Second, the PARAM field
was created from only the kobject source and parent name, but the
sysfs file could be deeper in the lustre sysfs tree. Add handling
for deeper sysfs tree paths.

WC-bug-id: https://jira.whamcloud.com/browse/LU-8066
Lustre-commit: b0d162390ad6 ("LU-8066 obd: update udev event handling")
Signed-off-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-on: https://review.whamcloud.com/34624
Reviewed-by: Emoly Liu <emoly@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/obd_config.c | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/fs/lustre/obdclass/obd_config.c b/fs/lustre/obdclass/obd_config.c
index 4b1848f..97cb8c1 100644
--- a/fs/lustre/obdclass/obd_config.c
+++ b/fs/lustre/obdclass/obd_config.c
@@ -773,7 +773,7 @@ static int process_param2_config(struct lustre_cfg *lcfg)
 	char *param = lustre_cfg_string(lcfg, 1);
 	struct kobject *kobj = NULL;
 	const char *subsys = param;
-	char *envp[3];
+	char *envp[4];
 	char *value;
 	size_t len;
 	int rc;
@@ -802,7 +802,9 @@ static int process_param2_config(struct lustre_cfg *lcfg)
 	param = strsep(&value, "=");
 	envp[0] = kasprintf(GFP_KERNEL, "PARAM=%s", param);
 	envp[1] = kasprintf(GFP_KERNEL, "SETTING=%s", value);
-	envp[2] = NULL;
+	envp[2] = kasprintf(GFP_KERNEL, "TIME=%lld",
+			    ktime_get_real_seconds());
+	envp[3] = NULL;
 
 	rc = kobject_uevent_env(kobj, KOBJ_CHANGE, envp);
 	for (i = 0; i < ARRAY_SIZE(envp); i++)
@@ -1128,14 +1130,25 @@ ssize_t class_modify_config(struct lustre_cfg *lcfg, const char *prefix,
 		}
 
 		if (!attr) {
-			char *envp[3];
+			char *envp[4], *param, *path;
 
-			envp[0] = kasprintf(GFP_KERNEL, "PARAM=%s.%s.%.*s",
-					    kobject_name(kobj->parent),
-					    kobject_name(kobj),
-					    (int)keylen, key);
+			path = kobject_get_path(kobj, GFP_KERNEL);
+			if (!path)
+				return -EINVAL;
+
+			/* convert sysfs path to uevent format */
+			param = path;
+			while ((param = strchr(param, '/')) != NULL)
+				*param = '.';
+
+			param = strstr(path, "fs.lustre.") + 10;
+
+			envp[0] = kasprintf(GFP_KERNEL, "PARAM=%s.%.*s",
+					    param, (int)keylen, key);
 			envp[1] = kasprintf(GFP_KERNEL, "SETTING=%s", value);
-			envp[2] = NULL;
+			envp[2] = kasprintf(GFP_KERNEL, "TIME=%lld",
+					    ktime_get_real_seconds());
+			envp[3] = NULL;
 
 			if (kobject_uevent_env(kobj, KOBJ_CHANGE, envp)) {
 				CERROR("%s: failed to send uevent %s\n",
@@ -1144,6 +1157,7 @@ ssize_t class_modify_config(struct lustre_cfg *lcfg, const char *prefix,
 
 			for (i = 0; i < ARRAY_SIZE(envp); i++)
 				kfree(envp[i]);
+			kfree(path);
 		} else {
 			count += lustre_attr_store(kobj, attr, value,
 						   strlen(value));
-- 
1.8.3.1


* [lustre-devel] [PATCH 274/622] lustre: ptlrpc: Bulk assertion fails on -ENOMEM
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (272 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 273/622] lustre: obd: update udev event handling James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 275/622] lustre: obd: Add overstriping CONNECT flag James Simmons
                   ` (348 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Andriy Skulysh <c17819@cray.com>

Recalculate rq_mbits on ENOMEM resend if OBD_CONNECT_BULK_MBITS
isn't used.

Cray-bug-id: LUS-7159
WC-bug-id: https://jira.whamcloud.com/browse/LU-12218
Lustre-commit: e63a49fa6920 ("LU-12218 ptlrpc: Bulk assertion fails on -ENOMEM")
Signed-off-by: Andriy Skulysh <c17819@cray.com>
Reviewed-by: Alexander Boyko <c17825@cray.com>
Reviewed-by: Andrew Perepechko <c17827@cray.com>
Reviewed-on: https://review.whamcloud.com/34753
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/client.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/ptlrpc/client.c b/fs/lustre/ptlrpc/client.c
index 0f5aa92..7c243af 100644
--- a/fs/lustre/ptlrpc/client.c
+++ b/fs/lustre/ptlrpc/client.c
@@ -3182,7 +3182,14 @@ void ptlrpc_set_bulk_mbits(struct ptlrpc_request *req)
 		       old_mbits, req->rq_mbits);
 	} else if (!(lustre_msg_get_flags(req->rq_reqmsg) & MSG_REPLAY)) {
 		/* Request being sent first time, use xid as matchbits. */
-		req->rq_mbits = req->rq_xid;
+		if (OCD_HAS_FLAG(&bd->bd_import->imp_connect_data, BULK_MBITS)
+		    || req->rq_mbits == 0) {
+			req->rq_mbits = req->rq_xid;
+		} else {
+			int total_md = (bd->bd_iov_count + LNET_MAX_IOV - 1) /
+					LNET_MAX_IOV;
+			req->rq_mbits -= total_md - 1;
+		}
 	} else {
 		/*
 		 * Replay request, xid and matchbits have already been
-- 
1.8.3.1


* [lustre-devel] [PATCH 275/622] lustre: obd: Add overstriping CONNECT flag
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (273 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 274/622] lustre: ptlrpc: Bulk assertion fails on -ENOMEM James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 276/622] lustre: llite, readahead: fix to call ll_ras_enter() properly James Simmons
                   ` (347 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

This patch reserves the OBD_CONNECT flag for overstriping,
and also does some cleanup of OBD_CONNECT flags, putting
them in the correct order and adding some missing ones in
proc and the wire{test,check} checks.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9846
Lustre-commit: 5d085745af43 ("LU-9846 obd: Add overstriping CONNECT flag")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34743
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_export.h      | 5 +++++
 fs/lustre/llite/llite_lib.c            | 6 +++---
 fs/lustre/obdclass/lprocfs_status.c    | 4 ++--
 fs/lustre/ptlrpc/wiretest.c            | 4 ++++
 include/uapi/linux/lustre/lustre_idl.h | 1 +
 5 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/fs/lustre/include/lustre_export.h b/fs/lustre/include/lustre_export.h
index c94efb0..967ce37 100644
--- a/fs/lustre/include/lustre_export.h
+++ b/fs/lustre/include/lustre_export.h
@@ -264,6 +264,11 @@ static inline int exp_connect_lockahead(struct obd_export *exp)
 	return !!(exp_connect_flags2(exp) & OBD_CONNECT2_LOCKAHEAD);
 }
 
+static inline int exp_connect_overstriping(struct obd_export *exp)
+{
+	return !!(exp_connect_flags2(exp) & OBD_CONNECT2_OVERSTRIPING);
+}
+
 static inline int exp_connect_flr(struct obd_export *exp)
 {
 	return !!(exp_connect_flags2(exp) & OBD_CONNECT2_FLR);
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 8e5cf0a..fd19035 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -212,10 +212,10 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 				  OBD_CONNECT_GRANT_PARAM |
 				  OBD_CONNECT_SHORTIO | OBD_CONNECT_FLAGS2;
 
-	data->ocd_connect_flags2 = OBD_CONNECT2_FLR |
-				   OBD_CONNECT2_LOCK_CONVERT |
-				   OBD_CONNECT2_DIR_MIGRATE |
+	data->ocd_connect_flags2 = OBD_CONNECT2_DIR_MIGRATE |
 				   OBD_CONNECT2_SUM_STATFS |
+				   OBD_CONNECT2_FLR |
+				   OBD_CONNECT2_LOCK_CONVERT |
 				   OBD_CONNECT2_ARCHIVE_ID_ARRAY |
 				   OBD_CONNECT2_LSOM;
 
diff --git a/fs/lustre/obdclass/lprocfs_status.c b/fs/lustre/obdclass/lprocfs_status.c
index a7c274a..55057cf 100644
--- a/fs/lustre/obdclass/lprocfs_status.c
+++ b/fs/lustre/obdclass/lprocfs_status.c
@@ -114,8 +114,8 @@
 	"file_secctx",	/* 0x01 */
 	"lockaheadv2",	/* 0x02 */
 	"dir_migrate",	/* 0x04 */
-	"unknown",	/* 0x08 */
-	"unknown",	/* 0x10 */
+	"sum_statfs",	/* 0x08 */
+	"overstriping",	/* 0x10 */
 	"flr",		/* 0x20 */
 	"wbc",		/* 0x40 */
 	"lock_convert",	/* 0x80 */
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index 4a268f6..fb57def 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1136,6 +1136,10 @@ void lustre_assert_wire_constants(void)
 		 OBD_CONNECT2_LOCKAHEAD);
 	LASSERTF(OBD_CONNECT2_DIR_MIGRATE == 0x4ULL, "found 0x%.16llxULL\n",
 		 OBD_CONNECT2_DIR_MIGRATE);
+	LASSERTF(OBD_CONNECT2_SUM_STATFS == 0x8ULL, "found 0x%.16llxULL\n",
+		 OBD_CONNECT2_SUM_STATFS);
+	LASSERTF(OBD_CONNECT2_OVERSTRIPING == 0x10ULL, "found 0x%.16llxULL\n",
+		 OBD_CONNECT2_OVERSTRIPING);
 	LASSERTF(OBD_CONNECT2_FLR == 0x20ULL, "found 0x%.16llxULL\n",
 		 OBD_CONNECT2_FLR);
 	LASSERTF(OBD_CONNECT2_WBC_INTENTS == 0x40ULL, "found 0x%.16llxULL\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 3a2a093..bba3a77 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -797,6 +797,7 @@ struct ptlrpc_body_v2 {
 #define OBD_CONNECT2_DIR_MIGRATE	0x4ULL		/* migrate striped dir
 							 */
 #define OBD_CONNECT2_SUM_STATFS		0x8ULL /* MDT return aggregated stats */
+#define OBD_CONNECT2_OVERSTRIPING	0x10ULL /* OST overstriping support */
 #define OBD_CONNECT2_FLR		0x20ULL		/* FLR support */
 #define OBD_CONNECT2_WBC_INTENTS	0x40ULL /* create/unlink/... intents
 						 * for wbc, also operations
-- 
1.8.3.1


* [lustre-devel] [PATCH 276/622] lustre: llite, readahead: fix to call ll_ras_enter() properly
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (274 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 275/622] lustre: obd: Add overstriping CONNECT flag James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 277/622] lustre: ptlrpc: ASSERTION (req_transno < next_transno) failed James Simmons
                   ` (346 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

ll_ras_enter() is expected to be called once per syscall.
However, with fast read enabled, it is no longer true that
vvp_io_read_start() is called for every syscall.

To fix this problem, move the call into the file read handler.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12043
Lustre-commit: 500edcada7e4 ("LU-12043 llite, readahead: fix to call ll_ras_enter() properly")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/34755
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c   | 2 ++
 fs/lustre/llite/vvp_io.c | 1 -
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 61d53c4..d059ac7 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -1625,6 +1625,8 @@ static ssize_t ll_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	u16 refcheck;
 	ssize_t rc2;
 
+	ll_ras_enter(iocb->ki_filp);
+
 	result = ll_do_fast_read(iocb, to);
 	if (result < 0 || iov_iter_count(to) == 0)
 		goto out;
diff --git a/fs/lustre/llite/vvp_io.c b/fs/lustre/llite/vvp_io.c
index 43f4088..1f82fe6 100644
--- a/fs/lustre/llite/vvp_io.c
+++ b/fs/lustre/llite/vvp_io.c
@@ -773,7 +773,6 @@ static int vvp_io_read_start(const struct lu_env *env,
 		vio->vui_ra_valid = true;
 		vio->vui_ra_start = cl_index(obj, pos);
 		vio->vui_ra_count = cl_index(obj, tot + PAGE_SIZE - 1);
-		ll_ras_enter(file);
 	}
 
 	/* BUG: 5972 */
-- 
1.8.3.1


* [lustre-devel] [PATCH 277/622] lustre: ptlrpc: ASSERTION (req_transno < next_transno) failed
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (275 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 276/622] lustre: llite, readahead: fix to call ll_ras_enter() properly James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 278/622] lustre: lov: new foreign LOV format James Simmons
                   ` (345 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Andriy Skulysh <c17819@cray.com>

An update request is checked for duplicates by xid in
is_req_replayed_by_update(). However, an xid is unique per
client only, so it may happen that there are two requests
with the same xid from different clients.

Perform the lookup by transno instead, which is unique per MDT.

Cray-bug-id: LUS-6015
WC-bug-id: https://jira.whamcloud.com/browse/LU-11251
Lustre-commit: 53764826b95f ("LU-11251 mdt: ASSERTION (req_transno < next_transno) failed")
Signed-off-by: Andriy Skulysh <c17819@cray.com>
Reviewed-by: Vitaly Fertman <c17818@cray.com>
Reviewed-by: Alexander Boyko <c17825@cray.com>
Reviewed-on: https://review.whamcloud.com/33001
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_support.h |  3 ++-
 fs/lustre/ptlrpc/client.c       | 11 ++++++++---
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index 4e956da..837b68d 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -355,7 +355,8 @@
 #define OBD_FAIL_PTLRPC_DROP_BULK			0x51a
 #define OBD_FAIL_PTLRPC_LONG_REQ_UNLINK			0x51b
 #define OBD_FAIL_PTLRPC_LONG_BOTH_UNLINK		0x51c
-#define OBD_FAIL_PTLRPC_BULK_ATTACH      0x521
+#define OBD_FAIL_PTLRPC_BULK_ATTACH			0x521
+#define OBD_FAIL_PTLRPC_ROUND_XID			0x530
 #define OBD_FAIL_PTLRPC_CONNECT_RACE			0x531
 
 #define OBD_FAIL_OBD_PING_NET				0x600
diff --git a/fs/lustre/ptlrpc/client.c b/fs/lustre/ptlrpc/client.c
index 7c243af..ac16878 100644
--- a/fs/lustre/ptlrpc/client.c
+++ b/fs/lustre/ptlrpc/client.c
@@ -712,6 +712,8 @@ static inline void ptlrpc_assign_next_xid(struct ptlrpc_request *req)
 	spin_unlock(&req->rq_import->imp_lock);
 }
 
+static atomic64_t ptlrpc_last_xid;
+
 int ptlrpc_request_bufs_pack(struct ptlrpc_request *request,
 			     u32 version, int opcode, char **bufs,
 			     struct ptlrpc_cli_ctx *ctx)
@@ -761,7 +763,6 @@ int ptlrpc_request_bufs_pack(struct ptlrpc_request *request,
 	ptlrpc_at_set_req_timeout(request);
 
 	lustre_msg_set_opc(request->rq_reqmsg, opcode);
-	ptlrpc_assign_next_xid(request);
 
 	/* Let's setup deadline for req/reply/bulk unlink for opcode. */
 	if (cfs_fail_val == opcode) {
@@ -776,6 +777,11 @@ int ptlrpc_request_bufs_pack(struct ptlrpc_request *request,
 		} else if (CFS_FAIL_CHECK(OBD_FAIL_PTLRPC_LONG_BOTH_UNLINK)) {
 			fail_t = &request->rq_reply_deadline;
 			fail2_t = &request->rq_bulk_deadline;
+		} else if (CFS_FAIL_CHECK(OBD_FAIL_PTLRPC_ROUND_XID)) {
+			time64_t now = ktime_get_real_seconds();
+
+			atomic64_set(&ptlrpc_last_xid,
+				     ((u64)now >> 4) << 24);
 		}
 
 		if (fail_t) {
@@ -791,6 +797,7 @@ int ptlrpc_request_bufs_pack(struct ptlrpc_request *request,
 			msleep(4 * MSEC_PER_SEC);
 		}
 	}
+	ptlrpc_assign_next_xid(request);
 
 	return 0;
 
@@ -3085,8 +3092,6 @@ void ptlrpc_abort_set(struct ptlrpc_request_set *set)
 	}
 }
 
-static atomic64_t ptlrpc_last_xid;
-
 /**
  * Initialize the XID for the node.  This is common among all requests on
  * this node, and only requires the property that it is monotonically
-- 
1.8.3.1


* [lustre-devel] [PATCH 278/622] lustre: lov: new foreign LOV format
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (276 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 277/622] lustre: ptlrpc: ASSERTION (req_transno < next_transno) failed James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 279/622] lustre: lmv: new foreign LMV format James Simmons
                   ` (344 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Bruno Faccini <bruno.faccini@intel.com>

This patch introduces a new layout/LOV format in order to
allow specifying an arbitrary external reference for a file
in the Lustre namespace.
The new LOV format is made of {newmagic, length, type, flags,
string[length]} to be as flexible as possible.
A foreign file can be created by using the open(O_LOV_DELAY_CREATE) +
ioctl(LL_IOC_LOV_SETSTRIPE) operations, and it can only be and remain
an empty file until removed.
A new API method, llapi_file_create_foreign(), has been introduced,
and "lfs [get,set]stripe" and "lfs find" have been modified to
understand the new layout.
The idea behind this is to provide Lustre namespace support and
layout prefetch/caching under layout protection, for user/external
usage.

Code has been added for lfsck to handle foreign files, and
a new sub-test has been added in sanity-lfsck in order to verify
that lfsck does not break foreign files and that the reverse is
also true.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11376
Lustre-commit: 6a20bdcc608b ("LU-11376 lov: new foreign LOV format")
Signed-off-by: Bruno Faccini <bruno.faccini@intel.com>
Reviewed-on: https://review.whamcloud.com/33755
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c                  | 12 ++++++-
 fs/lustre/llite/llite_internal.h        |  2 ++
 fs/lustre/llite/vvp_io.c                |  2 +-
 fs/lustre/llite/xattr.c                 |  4 ++-
 fs/lustre/lov/lov_cl_internal.h         |  6 ++++
 fs/lustre/lov/lov_ea.c                  | 63 ++++++++++++++++++++++++++++++---
 fs/lustre/lov/lov_internal.h            | 19 +++++++---
 fs/lustre/lov/lov_object.c              | 49 ++++++++++++++++++++++++-
 fs/lustre/lov/lov_pack.c                | 44 ++++++++++++++++++++---
 fs/lustre/lov/lov_page.c                |  7 ++++
 include/uapi/linux/lustre/lustre_idl.h  |  1 +
 include/uapi/linux/lustre/lustre_user.h | 31 ++++++++++++++++
 12 files changed, 222 insertions(+), 18 deletions(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index d059ac7..0d7d566 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -1827,7 +1827,8 @@ int ll_lov_getstripe_ea_info(struct inode *inode, const char *filename,
 
 	if (lmm->lmm_magic != cpu_to_le32(LOV_MAGIC_V1) &&
 	    lmm->lmm_magic != cpu_to_le32(LOV_MAGIC_V3) &&
-	    lmm->lmm_magic != cpu_to_le32(LOV_MAGIC_COMP_V1)) {
+	    lmm->lmm_magic != cpu_to_le32(LOV_MAGIC_COMP_V1) &&
+	    lmm->lmm_magic != cpu_to_le32(LOV_MAGIC_FOREIGN)) {
 		rc = -EPROTO;
 		goto out;
 	}
@@ -1863,6 +1864,15 @@ int ll_lov_getstripe_ea_info(struct inode *inode, const char *filename,
 								stripe_count);
 		} else if (lmm->lmm_magic == cpu_to_le32(LOV_MAGIC_COMP_V1)) {
 			lustre_swab_lov_comp_md_v1((struct lov_comp_md_v1 *)lmm);
+		} else if (lmm->lmm_magic ==
+			   cpu_to_le32(LOV_MAGIC_FOREIGN)) {
+			struct lov_foreign_md *lfm;
+
+			lfm = (struct lov_foreign_md *)lmm;
+			__swab32s(&lfm->lfm_magic);
+			__swab32s(&lfm->lfm_length);
+			__swab32s(&lfm->lfm_type);
+			__swab32s(&lfm->lfm_flags);
 		}
 	}
 
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index b9478f4d..9d7345a 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -962,6 +962,8 @@ static inline ssize_t ll_lov_user_md_size(const struct lov_user_md *lum)
 					LOV_USER_MAGIC_SPECIFIC);
 	case LOV_USER_MAGIC_COMP_V1:
 		return ((struct lov_comp_md_v1 *)lum)->lcm_size;
+	case LOV_USER_MAGIC_FOREIGN:
+		return foreign_size(lum);
 	}
 	return -EINVAL;
 }
diff --git a/fs/lustre/llite/vvp_io.c b/fs/lustre/llite/vvp_io.c
index 1f82fe6..ee44a18 100644
--- a/fs/lustre/llite/vvp_io.c
+++ b/fs/lustre/llite/vvp_io.c
@@ -165,7 +165,7 @@ static int vvp_prep_size(const struct lu_env *env, struct cl_object *obj,
 				 * --bug 17336
 				 */
 				loff_t size = i_size_read(inode);
-				loff_t cur_index = start >> PAGE_SHIFT;
+				unsigned long cur_index = start >> PAGE_SHIFT;
 				loff_t size_index = (size - 1) >> PAGE_SHIFT;
 
 				if ((size == 0 && cur_index != 0) ||
diff --git a/fs/lustre/llite/xattr.c b/fs/lustre/llite/xattr.c
index aa61a5a..9707e78 100644
--- a/fs/lustre/llite/xattr.c
+++ b/fs/lustre/llite/xattr.c
@@ -453,6 +453,7 @@ static ssize_t ll_getxattr_lov(struct inode *inode, void *buf, size_t buf_size)
 		};
 		struct lu_env *env;
 		u16 refcheck;
+		u32 magic;
 
 		if (!obj)
 			return -ENODATA;
@@ -483,7 +484,8 @@ static ssize_t ll_getxattr_lov(struct inode *inode, void *buf, size_t buf_size)
 		 * recognizing layout gen as stripe offset when the
 		 * file is restored. See LU-2809.
 		 */
-		if (((struct lov_mds_md *)buf)->lmm_magic == LOV_MAGIC_COMP_V1)
+		magic = ((struct lov_mds_md *)buf)->lmm_magic;
+		if (magic == LOV_MAGIC_COMP_V1 || magic == LOV_MAGIC_FOREIGN)
 			goto out_env;
 
 		((struct lov_mds_md *)buf)->lmm_layout_gen = 0;
diff --git a/fs/lustre/lov/lov_cl_internal.h b/fs/lustre/lov/lov_cl_internal.h
index e14567d..7b95a00 100644
--- a/fs/lustre/lov/lov_cl_internal.h
+++ b/fs/lustre/lov/lov_cl_internal.h
@@ -122,6 +122,7 @@ enum lov_layout_type {
 	LLT_EMPTY,	/** empty file without body (mknod + truncate) */
 	LLT_RELEASED,	/** file with no objects (data in HSM) */
 	LLT_COMP,	/** support composite layout */
+	LLT_FOREIGN,	/** foreign layout */
 	LLT_NR
 };
 
@@ -134,6 +135,8 @@ static inline char *llt2str(enum lov_layout_type llt)
 		return "RELEASED";
 	case LLT_COMP:
 		return "COMPOSITE";
+	case LLT_FOREIGN:
+		return "FOREIGN";
 	case LLT_NR:
 		LBUG();
 	}
@@ -626,9 +629,12 @@ int lov_page_init_empty(const struct lu_env *env, struct cl_object *obj,
 			struct cl_page *page, pgoff_t index);
 int lov_page_init_composite(const struct lu_env *env, struct cl_object *obj,
 			    struct cl_page *page, pgoff_t index);
+int lov_page_init_foreign(const struct lu_env *env, struct cl_object *obj,
+			   struct cl_page *page, pgoff_t index);
 struct lu_object *lov_object_alloc(const struct lu_env *env,
 				   const struct lu_object_header *hdr,
 				   struct lu_device *dev);
+
 struct lu_object *lovsub_object_alloc(const struct lu_env *env,
 				      const struct lu_object_header *hdr,
 				      struct lu_device *dev);
diff --git a/fs/lustre/lov/lov_ea.c b/fs/lustre/lov/lov_ea.c
index 31a18d0..b7a6d91 100644
--- a/fs/lustre/lov/lov_ea.c
+++ b/fs/lustre/lov/lov_ea.c
@@ -134,8 +134,12 @@ void lsm_free(struct lov_stripe_md *lsm)
 	unsigned int entry_count = lsm->lsm_entry_count;
 	unsigned int i;
 
-	for (i = 0; i < entry_count; i++)
-		lsme_free(lsm->lsm_entries[i]);
+	if (lsm->lsm_magic == LOV_MAGIC_FOREIGN) {
+		kvfree(lsm_foreign(lsm));
+	} else {
+		for (i = 0; i < entry_count; i++)
+			lsme_free(lsm->lsm_entries[i]);
+	}
 
 	kfree(lsm);
 }
@@ -513,6 +517,44 @@ static int lsm_verify_comp_md_v1(struct lov_comp_md_v1 *lcm,
 	.lsm_unpackmd		= lsm_unpackmd_comp_md_v1,
 };
 
+static struct
+lov_stripe_md *lsm_unpackmd_foreign(struct lov_obd *lov, void *buf,
+				    size_t buf_size)
+{
+	struct lov_foreign_md *lfm = buf;
+	struct lov_stripe_md *lsm;
+	size_t lsm_size;
+	struct lov_stripe_md_entry *lsme;
+
+	lsm_size = offsetof(typeof(*lsm), lsm_entries[1]);
+	lsm = kzalloc(lsm_size, GFP_NOFS);
+	if (!lsm)
+		return ERR_PTR(-ENOMEM);
+
+	atomic_set(&lsm->lsm_refc, 1);
+	spin_lock_init(&lsm->lsm_lock);
+	lsm->lsm_magic = le32_to_cpu(lfm->lfm_magic);
+	lsm->lsm_foreign_size = foreign_size_le(lfm);
+
+	/* alloc for full foreign EA including format fields */
+	lsme = kvzalloc(lsm->lsm_foreign_size, GFP_NOFS);
+	if (!lsme) {
+		kfree(lsm);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/* copy full foreign EA including format fields */
+	memcpy(lsme, buf, lsm->lsm_foreign_size);
+
+	lsm_foreign(lsm) = lsme;
+
+	return lsm;
+}
+
+const struct lsm_operations lsm_foreign_ops = {
+	.lsm_unpackmd         = lsm_unpackmd_foreign,
+};
+
 const struct lsm_operations *lsm_op_find(int magic)
 {
 	const struct lsm_operations *lsm = NULL;
@@ -527,6 +569,9 @@ const struct lsm_operations *lsm_op_find(int magic)
 	case LOV_MAGIC_COMP_V1:
 		lsm = &lsm_comp_md_v1_ops;
 		break;
+	case LOV_MAGIC_FOREIGN:
+		lsm = &lsm_foreign_ops;
+		break;
 	default:
 		CERROR("unrecognized lsm_magic %08x\n", magic);
 		break;
@@ -539,12 +584,22 @@ void dump_lsm(unsigned int level, const struct lov_stripe_md *lsm)
 {
 	int i, j;
 
-	CDEBUG(level,
-	       "lsm %p, objid " DOSTID ", maxbytes %#llx, magic 0x%08X, refc: %d, entry: %u, layout_gen %u\n",
+	CDEBUG_LIMIT(level,
+		     "lsm %p, objid " DOSTID ", maxbytes %#llx, magic 0x%08X, refc: %d, entry: %u, layout_gen %u\n",
 	       lsm, POSTID(&lsm->lsm_oi), lsm->lsm_maxbytes, lsm->lsm_magic,
 	       atomic_read(&lsm->lsm_refc), lsm->lsm_entry_count,
 	       lsm->lsm_layout_gen);
 
+	if (lsm->lsm_magic == LOV_MAGIC_FOREIGN) {
+		struct lov_foreign_md *lfm = (void *)lsm_foreign(lsm);
+
+		CDEBUG_LIMIT(level,
+			     "foreign LOV EA, magic %x, length %u, type %x, flags %x, value '%.*s'\n",
+		       lfm->lfm_magic, lfm->lfm_length, lfm->lfm_type,
+		       lfm->lfm_flags, lfm->lfm_length, lfm->lfm_value);
+		return;
+	}
+
 	for (i = 0; i < lsm->lsm_entry_count; i++) {
 		struct lov_stripe_md_entry *lse = lsm->lsm_entries[i];
 
diff --git a/fs/lustre/lov/lov_internal.h b/fs/lustre/lov/lov_internal.h
index 36586b3..d235abe 100644
--- a/fs/lustre/lov/lov_internal.h
+++ b/fs/lustre/lov/lov_internal.h
@@ -79,11 +79,15 @@ struct lov_stripe_md {
 	spinlock_t	lsm_lock;
 	pid_t		lsm_lock_owner; /* debugging */
 
-	/*
-	 * maximum possible file size, might change as OSTs status changes,
-	 * e.g. disconnected, deactivated
-	 */
-	loff_t		lsm_maxbytes;
+	union {
+		/*
+		 * maximum possible file size, might change as OSTs status
+		 * changes, e.g. disconnected, deactivated
+		 */
+		loff_t          lsm_maxbytes;
+		/* size of full foreign LOV */
+		size_t          lsm_foreign_size;
+	};
 	struct ost_id	lsm_oi;
 	u32		lsm_magic;
 	u32		lsm_layout_gen;
@@ -94,6 +98,8 @@ struct lov_stripe_md {
 	struct lov_stripe_md_entry *lsm_entries[];
 };
 
+#define lsm_foreign(lsm) (lsm->lsm_entries[0])
+
 static inline bool lsme_inited(const struct lov_stripe_md_entry *lsme)
 {
 	return lsme->lsme_flags & LCME_FL_INIT;
@@ -119,6 +125,9 @@ static inline size_t lov_comp_md_size(const struct lov_stripe_md *lsm)
 		return lov_mds_md_size(lsm->lsm_entries[0]->lsme_stripe_count,
 				       lsm->lsm_entries[0]->lsme_magic);
 
+	if (lsm->lsm_magic == LOV_MAGIC_FOREIGN)
+		return lsm->lsm_foreign_size;
+
 	LASSERT(lsm->lsm_magic == LOV_MAGIC_COMP_V1);
 
 	size = sizeof(struct lov_comp_md_v1);
diff --git a/fs/lustre/lov/lov_object.c b/fs/lustre/lov/lov_object.c
index c04b2ae..7543ef2 100644
--- a/fs/lustre/lov/lov_object.c
+++ b/fs/lustre/lov/lov_object.c
@@ -810,10 +810,25 @@ static int lov_init_released(const struct lu_env *env,
 	return 0;
 }
 
+static int lov_init_foreign(const struct lu_env *env,
+			    struct lov_device *dev, struct lov_object *lov,
+			    struct lov_stripe_md *lsm,
+			    const struct cl_object_conf *conf,
+			    union lov_layout_state *state)
+{
+	LASSERT(lsm);
+	LASSERT(lov->lo_type == LLT_FOREIGN);
+	LASSERT(!lov->lo_lsm);
+
+	lov->lo_lsm = lsm_addref(lsm);
+	return 0;
+}
+
 static int lov_delete_empty(const struct lu_env *env, struct lov_object *lov,
 			    union lov_layout_state *state)
 {
-	LASSERT(lov->lo_type == LLT_EMPTY || lov->lo_type == LLT_RELEASED);
+	LASSERT(lov->lo_type == LLT_EMPTY || lov->lo_type == LLT_RELEASED ||
+		lov->lo_type == LLT_FOREIGN);
 
 	lov_layout_wait(env, lov);
 	return 0;
@@ -923,6 +938,23 @@ static int lov_print_released(const struct lu_env *env, void *cookie,
 	return 0;
 }
 
+static int lov_print_foreign(const struct lu_env *env, void *cookie,
+				lu_printer_t p, const struct lu_object *o)
+{
+	struct lov_object	*lov = lu2lov(o);
+	struct lov_stripe_md	*lsm = lov->lo_lsm;
+
+	(*p)(env, cookie,
+		"foreign: %s, lsm{%p 0x%08X %d %u}:\n",
+		lov->lo_layout_invalid ? "invalid" : "valid", lsm,
+		lsm->lsm_magic, atomic_read(&lsm->lsm_refc),
+		lsm->lsm_layout_gen);
+	(*p)(env, cookie,
+		"raw_ea_content '%.*s'\n",
+		(int)lsm->lsm_foreign_size, (char *)lsm_foreign(lsm));
+	return 0;
+}
+
 /**
  * Implements cl_object_operations::coo_attr_get() method for an object
  * without stripes (LLT_EMPTY layout type).
@@ -1020,6 +1052,16 @@ static int lov_attr_get_composite(const struct lu_env *env,
 		.llo_io_init	= lov_io_init_composite,
 		.llo_getattr	= lov_attr_get_composite,
 	},
+	[LLT_FOREIGN] = {
+		.llo_init      = lov_init_foreign,
+		.llo_delete    = lov_delete_empty,
+		.llo_fini      = lov_fini_released,
+		.llo_print     = lov_print_foreign,
+		.llo_page_init = lov_page_init_foreign,
+		.llo_lock_init = lov_lock_init_empty,
+		.llo_io_init   = lov_io_init_empty,
+		.llo_getattr   = lov_attr_get_empty,
+	},
 };
 
 /**
@@ -1051,6 +1093,9 @@ static enum lov_layout_type lov_type(struct lov_stripe_md *lsm)
 	    lsm->lsm_magic == LOV_MAGIC_COMP_V1)
 		return LLT_COMP;
 
+	if (lsm->lsm_magic == LOV_MAGIC_FOREIGN)
+		return LLT_FOREIGN;
+
 	return LLT_EMPTY;
 }
 
@@ -2141,6 +2186,8 @@ int lov_read_and_clear_async_rc(struct cl_object *clob)
 		}
 		case LLT_RELEASED:
 		case LLT_EMPTY:
+			/* fall through */
+		case LLT_FOREIGN:
 			break;
 		default:
 			LBUG();
diff --git a/fs/lustre/lov/lov_pack.c b/fs/lustre/lov/lov_pack.c
index c6dec2d..2b348d3 100644
--- a/fs/lustre/lov/lov_pack.c
+++ b/fs/lustre/lov/lov_pack.c
@@ -162,6 +162,28 @@ ssize_t lov_lsm_pack_v1v3(const struct lov_stripe_md *lsm, void *buf,
 	return lmm_size;
 }
 
+ssize_t lov_lsm_pack_foreign(const struct lov_stripe_md *lsm, void *buf,
+			     size_t buf_size)
+{
+	struct lov_foreign_md *lfm = buf;
+	size_t lfm_size;
+
+	lfm_size = lsm->lsm_foreign_size;
+
+	if (buf_size == 0)
+		return lfm_size;
+
+	if (buf_size < lfm_size)
+		return -ERANGE;
+
+	/* full foreign LOV is already avail in its cache
+	 * no need to translate format fields to little-endian
+	 */
+	memcpy(lfm, lsm_foreign(lsm), lsm->lsm_foreign_size);
+
+	return lfm_size;
+}
+
 ssize_t lov_lsm_pack(const struct lov_stripe_md *lsm, void *buf,
 		     size_t buf_size)
 {
@@ -177,6 +199,9 @@ ssize_t lov_lsm_pack(const struct lov_stripe_md *lsm, void *buf,
 	if (lsm->lsm_magic == LOV_MAGIC_V1 || lsm->lsm_magic == LOV_MAGIC_V3)
 		return lov_lsm_pack_v1v3(lsm, buf, buf_size);
 
+	if (lsm->lsm_magic == LOV_MAGIC_FOREIGN)
+		return lov_lsm_pack_foreign(lsm, buf, buf_size);
+
 	lmm_size = lov_comp_md_size(lsm);
 	if (buf_size == 0)
 		return lmm_size;
@@ -331,6 +356,7 @@ int lov_getstripe(const struct lu_env *env, struct lov_object *obj,
 {
 	/* we use lov_user_md_v3 because it is larger than lov_user_md_v1 */
 	struct lov_mds_md *lmmk, *lmm;
+	struct lov_foreign_md *lfm;
 	struct lov_user_md_v1 lum;
 	ssize_t lmm_size, lum_size = 0;
 	static bool printed;
@@ -338,7 +364,8 @@ int lov_getstripe(const struct lu_env *env, struct lov_object *obj,
 	int rc = 0;
 
 	if (lsm->lsm_magic != LOV_MAGIC_V1 && lsm->lsm_magic != LOV_MAGIC_V3 &&
-	    lsm->lsm_magic != LOV_MAGIC_COMP_V1) {
+	    lsm->lsm_magic != LOV_MAGIC_COMP_V1 &&
+	    lsm->lsm_magic != LOV_MAGIC_FOREIGN) {
 		CERROR("bad LSM MAGIC: 0x%08X != 0x%08X nor 0x%08X\n",
 		       lsm->lsm_magic, LOV_MAGIC_V1, LOV_MAGIC_V3);
 		rc = -EIO;
@@ -374,16 +401,23 @@ int lov_getstripe(const struct lu_env *env, struct lov_object *obj,
 				lmmk->lmm_stripe_count);
 		} else if (lmmk->lmm_magic == cpu_to_le32(LOV_MAGIC_COMP_V1)) {
 			lustre_swab_lov_comp_md_v1((struct lov_comp_md_v1 *)lmmk);
+		} else if (lmmk->lmm_magic == cpu_to_le32(LOV_MAGIC_FOREIGN)) {
+			lfm = (struct lov_foreign_md *)lmmk;
+			__swab32s(&lfm->lfm_magic);
+			__swab32s(&lfm->lfm_length);
+			__swab32s(&lfm->lfm_type);
+			__swab32s(&lfm->lfm_flags);
 		}
 	}
 
 	/* Legacy appication passes limited buffer, we need to figure out
 	 * the user buffer size by the passed in lmm_stripe_count.
 	 */
-	if (copy_from_user(&lum, lump, sizeof(struct lov_user_md_v1))) {
-		rc = -EFAULT;
-		goto out_free;
-	}
+	if (lsm->lsm_magic != LOV_MAGIC_FOREIGN)
+		if (copy_from_user(&lum, lump, sizeof(struct lov_user_md_v1))) {
+			rc = -EFAULT;
+			goto out_free;
+		}
 
 	if (lum.lmm_magic == LOV_USER_MAGIC_V1 ||
 	    lum.lmm_magic == LOV_USER_MAGIC_V3)
diff --git a/fs/lustre/lov/lov_page.c b/fs/lustre/lov/lov_page.c
index 3f08da7..c3337706 100644
--- a/fs/lustre/lov/lov_page.c
+++ b/fs/lustre/lov/lov_page.c
@@ -145,6 +145,13 @@ int lov_page_init_empty(const struct lu_env *env, struct cl_object *obj,
 	return 0;
 }
 
+int lov_page_init_foreign(const struct lu_env *env, struct cl_object *obj,
+			struct cl_page *page, pgoff_t index)
+{
+	CDEBUG(D_PAGE, DFID" has no data\n", PFID(lu_object_fid(&obj->co_lu)));
+	return -ENODATA;
+}
+
 bool lov_page_is_empty(const struct cl_page *page)
 {
 	const struct cl_page_slice *slice = cl_page_at(page, &lov_device_type);
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index bba3a77..fd35023 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -1022,6 +1022,7 @@ enum obdo_flags {
 #define LOV_MAGIC_SPECIFIC	(0x0BD50000 | LOV_MAGIC_MAGIC)
 #define LOV_MAGIC		LOV_MAGIC_V1
 #define LOV_MAGIC_COMP_V1	(0x0BD60000 | LOV_MAGIC_MAGIC)
+#define LOV_MAGIC_FOREIGN	(0x0BD70000 | LOV_MAGIC_MAGIC)
 
 /*
  * magic for fully defined striping
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 3901eb2..ad5d446 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -56,6 +56,7 @@
 # include <limits.h>
 # include <stdbool.h>
 # include <stdio.h> /* snprintf() */
+# include <stdint.h>
 # include <string.h>
 # include <sys/stat.h>
 #endif /* __KERNEL__ */
@@ -388,6 +389,7 @@ struct ll_ioc_lease_id {
 /* 0x0BD40BD0 is occupied by LOV_MAGIC_MIGRATE */
 #define LOV_USER_MAGIC_SPECIFIC	0x0BD50BD0	/* for specific OSTs */
 #define LOV_USER_MAGIC_COMP_V1	0x0BD60BD0
+#define LOV_USER_MAGIC_FOREIGN	0x0BD70BD0
 
 #define LMV_USER_MAGIC		0x0CD30CD0	/*default lmv magic*/
 #define LMV_USER_MAGIC_SPECIFIC	0x0CD40CD0
@@ -469,6 +471,21 @@ struct lov_user_md_v3 {		/* LOV EA user data (host-endian) */
 	struct lov_user_ost_data_v1 lmm_objects[0]; /* per-stripe data */
 } __packed;
 
+struct lov_foreign_md {
+	__u32 lfm_magic;	/* magic number = LOV_MAGIC_FOREIGN */
+	__u32 lfm_length;	/* length of lfm_value */
+	__u32 lfm_type;		/* type, see LOV_FOREIGN_TYPE_ */
+	__u32 lfm_flags;	/* flags, type specific */
+	char lfm_value[];
+};
+
+#define foreign_size(lfm) (((struct lov_foreign_md *)lfm)->lfm_length + \
+			   offsetof(struct lov_foreign_md, lfm_value))
+
+#define foreign_size_le(lfm) \
+	(le32_to_cpu(((struct lov_foreign_md *)lfm)->lfm_length) + \
+	offsetof(struct lov_foreign_md, lfm_value))
+
 struct lu_extent {
 	__u64	e_start;
 	__u64	e_end;
@@ -628,6 +645,20 @@ enum lmv_hash_type {
 #define LMV_HASH_NAME_ALL_CHARS		"all_char"
 #define LMV_HASH_NAME_FNV_1A_64		"fnv_1a_64"
 
+/**
+ * LOV foreign types
+ **/
+#define LOV_FOREIGN_TYPE_NONE 0
+#define LOV_FOREIGN_TYPE_DAOS 0xda05
+#define LOV_FOREIGN_TYPE_UNKNOWN UINT32_MAX
+
+struct lustre_foreign_type {
+	uint32_t lft_type;
+	const char *lft_name;
+};
+
+extern struct lustre_foreign_type lov_foreign_type[];
+
 /*
  * Got this according to how get LOV_MAX_STRIPE_COUNT, see above,
  * (max buffer size - lmv+rpc header) / sizeof(struct lmv_user_mds_data)
-- 
1.8.3.1


* [lustre-devel] [PATCH 279/622] lustre: lmv: new foreign LMV format
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (277 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 278/622] lustre: lov: new foreign LOV format James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 280/622] lustre: obd: replace class_uuid with linux kernel version James Simmons
                   ` (343 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Bruno Faccini <bruno.faccini@intel.com>

This patch introduces a new striping/LMV format in order to
allow specifying an arbitrary external reference for a dir
in the Lustre namespace.
The new LMV format is made of {newmagic, length, type, flags,
string[length]} to be as flexible as possible.
A foreign dir can be created by using the ioctl(LL_IOC_LMV_SETDIRSTRIPE)
operation, and it can only be and remain an empty dir until removed.

The idea behind this is to provide Lustre namespace support and
striping prefetch/caching under lock protection, for user/external
usage.

This patch is the LMV/dirs complement of the previous LOV/files change
(lustre: lov: new foreign LOV format) and has been rebased on top of
the latter, along with some obvious mutualizations and
simplifications.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11376
Lustre-commit: fdad38781ccc ("LU-11376 lmv: new foreign LMV format")
Signed-off-by: Bruno Faccini <bruno.faccini@intel.com>
Reviewed-on: https://review.whamcloud.com/34087
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_lmv.h          |  7 +++
 fs/lustre/include/obd.h                 |  5 +-
 fs/lustre/llite/dir.c                   | 94 +++++++++++++++++++++++++++++----
 fs/lustre/llite/file.c                  |  5 ++
 fs/lustre/llite/llite_lib.c             | 20 +++++--
 fs/lustre/lmv/lmv_intent.c              | 14 +++++
 fs/lustre/lmv/lmv_obd.c                 | 50 +++++++++++++++++-
 fs/lustre/mdc/mdc_request.c             | 17 ++++--
 fs/lustre/ptlrpc/pack_generic.c         |  9 +++-
 include/uapi/linux/lustre/lustre_idl.h  | 11 ++++
 include/uapi/linux/lustre/lustre_user.h | 31 +++++++----
 11 files changed, 232 insertions(+), 31 deletions(-)

diff --git a/fs/lustre/include/lustre_lmv.h b/fs/lustre/include/lustre_lmv.h
index 1246c25..cef315d 100644
--- a/fs/lustre/include/lustre_lmv.h
+++ b/fs/lustre/include/lustre_lmv.h
@@ -189,4 +189,11 @@ static inline bool lmv_is_known_hash_type(u32 type)
 	       (type & LMV_HASH_TYPE_MASK) == LMV_HASH_TYPE_ALL_CHARS;
 }
 
+static inline bool lmv_magic_supported(u32 lum_magic)
+{
+	return lum_magic == LMV_USER_MAGIC ||
+	       lum_magic == LMV_USER_MAGIC_SPECIFIC ||
+	       lum_magic == LMV_MAGIC_FOREIGN;
+}
+
 #endif
diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index 687b54b..996211a 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -929,7 +929,10 @@ struct obd_ops {
 struct lustre_md {
 	struct mdt_body			*body;
 	struct lu_buf			 layout;
-	struct lmv_stripe_md		*lmv;
+	union {
+		struct lmv_stripe_md	*lmv;
+		struct lmv_foreign_md   *lfm;
+	};
 #ifdef CONFIG_LUSTRE_FS_POSIX_ACL
 	struct posix_acl		*posix_acl;
 #endif
diff --git a/fs/lustre/llite/dir.c b/fs/lustre/llite/dir.c
index 8293a01..fd7cd2d 100644
--- a/fs/lustre/llite/dir.c
+++ b/fs/lustre/llite/dir.c
@@ -346,6 +346,14 @@ static int ll_readdir(struct file *filp, struct dir_context *ctx)
 		rc = PTR_ERR(op_data);
 		goto out;
 	}
+
+	/* foreign dirs are browsed out of Lustre */
+	if (unlikely(op_data->op_mea1 &&
+		     op_data->op_mea1->lsm_md_magic == LMV_MAGIC_FOREIGN)) {
+		ll_finish_md_op_data(op_data);
+		return -ENODATA;
+	}
+
 	op_data->op_fid3 = pfid;
 
 	ctx->pos = pos;
@@ -421,14 +429,22 @@ static int ll_dir_setdirstripe(struct dentry *dparent, struct lmv_user_md *lump,
 	};
 	int err;
 
-	if (unlikely(lump->lum_magic != LMV_USER_MAGIC &&
-		     lump->lum_magic != LMV_USER_MAGIC_SPECIFIC))
+	if (unlikely(!lmv_magic_supported(lump->lum_magic)))
 		return -EINVAL;
 
-	CDEBUG(D_VFSTRACE,
-	       "VFS Op:inode=" DFID "(%p) name %s stripe_offset %d, stripe_count: %u\n",
-	       PFID(ll_inode2fid(parent)), parent, dirname,
-	       (int)lump->lum_stripe_offset, lump->lum_stripe_count);
+	if (lump->lum_magic != LMV_MAGIC_FOREIGN) {
+		CDEBUG(D_VFSTRACE,
+		       "VFS Op:inode=" DFID "(%p) name %s stripe_offset %d, stripe_count: %u\n",
+		       PFID(ll_inode2fid(parent)), parent, dirname,
+		       (int)lump->lum_stripe_offset, lump->lum_stripe_count);
+	} else {
+		struct lmv_foreign_md *lfm = (struct lmv_foreign_md *)lump;
+
+		CDEBUG(D_VFSTRACE,
+		       "VFS Op:inode=" DFID "(%p) name %s foreign, length %u, value '%.*s'\n",
+		       PFID(ll_inode2fid(parent)), parent, dirname,
+		       lfm->lfm_length, lfm->lfm_length, lfm->lfm_value);
+	}
 
 	if (lump->lum_stripe_count > 1 &&
 	    !(exp_connect_flags(sbi->ll_md_exp) & OBD_CONNECT_DIR_STRIPE))
@@ -438,8 +454,7 @@ static int ll_dir_setdirstripe(struct dentry *dparent, struct lmv_user_md *lump,
 	    !OBD_FAIL_CHECK(OBD_FAIL_LLITE_NO_CHECK_DEAD))
 		return -ENOENT;
 
-	if (lump->lum_magic != cpu_to_le32(LMV_USER_MAGIC) &&
-	    lump->lum_magic != cpu_to_le32(LMV_USER_MAGIC_SPECIFIC))
+	if (unlikely(!lmv_magic_supported(cpu_to_le32(lump->lum_magic))))
 		lustre_swab_lmv_user_md(lump);
 
 	if (!IS_POSIXACL(parent) || !exp_connect_umask(ll_i2mdexp(parent)))
@@ -721,6 +736,17 @@ int ll_dir_getstripe(struct inode *inode, void **plmm, int *plmm_size,
 			}
 		}
 		break;
+	case LMV_MAGIC_FOREIGN: {
+		struct lmv_foreign_md *lfm = (struct lmv_foreign_md *)lmm;
+
+		if (cpu_to_le32(LMV_MAGIC_FOREIGN) != LMV_MAGIC_FOREIGN) {
+			__swab32s(&lfm->lfm_magic);
+			__swab32s(&lfm->lfm_length);
+			__swab32s(&lfm->lfm_type);
+			__swab32s(&lfm->lfm_flags);
+		}
+		break;
+	}
 	default:
 		CERROR("unknown magic: %lX\n", (unsigned long)lmm->lmm_magic);
 		rc = -EPROTO;
@@ -1313,9 +1339,24 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		lum = (struct lmv_user_md *)data->ioc_inlbuf2;
 		lumlen = data->ioc_inllen2;
 
-		if ((lum->lum_magic != LMV_USER_MAGIC &&
-		     lum->lum_magic != LMV_USER_MAGIC_SPECIFIC) ||
+		if (!lmv_magic_supported(lum->lum_magic)) {
+			CERROR("%s: wrong lum magic %x : rc = %d\n", filename,
+			       lum->lum_magic, -EINVAL);
+			rc = -EINVAL;
+			goto lmv_out_free;
+		}
+
+		if ((lum->lum_magic == LMV_USER_MAGIC ||
+		     lum->lum_magic == LMV_USER_MAGIC_SPECIFIC) &&
 		    lumlen < sizeof(*lum)) {
+			CERROR("%s: wrong lum size %d for magic %x : rc = %d\n",
+			       filename, lumlen, lum->lum_magic, -EINVAL);
+			rc = -EINVAL;
+			goto lmv_out_free;
+		}
+
+		if (lum->lum_magic == LMV_MAGIC_FOREIGN &&
+		    lumlen < sizeof(struct lmv_foreign_md)) {
 			CERROR("%s: wrong lum magic %x or size %d: rc = %d\n",
 			       filename, lum->lum_magic, lumlen, -EFAULT);
 			rc = -EINVAL;
@@ -1447,7 +1488,25 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 			goto finish_req;
 		}
 
-		stripe_count = lmv_mds_md_stripe_count_get(lmm);
+		/* in the foreign LMV case, fake the stripe count */
+		if (lmm->lmv_magic == LMV_MAGIC_FOREIGN) {
+			struct lmv_foreign_md *lfm;
+
+			lfm = (struct lmv_foreign_md *)lmm;
+			if (lfm->lfm_length < XATTR_SIZE_MAX -
+			    offsetof(typeof(*lfm), lfm_value)) {
+				u32 size = lfm->lfm_length +
+					   offsetof(typeof(*lfm), lfm_value);
+
+				stripe_count = lmv_foreign_to_md_stripes(size);
+			} else {
+				CERROR("invalid foreign size %d returned\n",
+				       lfm->lfm_length);
+				return -EINVAL;
+			}
+		} else {
+			stripe_count = lmv_mds_md_stripe_count_get(lmm);
+		}
 		if (max_stripe_count < stripe_count) {
 			lum.lum_stripe_count = stripe_count;
 			if (copy_to_user(ulmv, &lum, sizeof(lum))) {
@@ -1458,6 +1517,19 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 			goto finish_req;
 		}
 
+		/* foreign case with enough room on the user side */
+		if (lmm->lmv_magic == LMV_MAGIC_FOREIGN) {
+			struct lmv_foreign_md *lfm;
+			u32 size;
+
+			lfm = (struct lmv_foreign_md *)lmm;
+			size = lfm->lfm_length +
+			       offsetof(struct lmv_foreign_md, lfm_value);
+			if (copy_to_user(ulmv, lfm, size))
+				rc = -EFAULT;
+			goto finish_req;
+		}
+
 		lum_size = lmv_user_md_size(stripe_count,
 					    LMV_USER_MAGIC_SPECIFIC);
 		tmp = kzalloc(lum_size, GFP_NOFS);
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 0d7d566..76d3b4c 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -4249,6 +4249,11 @@ static int ll_merge_md_attr(struct inode *inode)
 	int rc;
 
 	LASSERT(lli->lli_lsm_md);
+
+	/* foreign dir is not striped dir */
+	if (lli->lli_lsm_md->lsm_md_magic == LMV_MAGIC_FOREIGN)
+		return 0;
+
 	down_read(&lli->lli_lsm_sem);
 	rc = md_merge_attr(ll_i2mdexp(inode), ll_i2info(inode)->lli_lsm_md,
 			   &attr, ll_md_blocking_ast);
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index fd19035..21825251 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -1329,8 +1329,12 @@ static int ll_update_lsm_md(struct inode *inode, struct lustre_md *md)
 	/*
 	 * if dir layout mismatch, check whether version is increased, which
 	 * means layout is changed, this happens in dir migration and lfsck.
+	 *
+	 * foreign LMV should not change.
 	 */
-	if (lli->lli_lsm_md && !lsm_md_eq(lli->lli_lsm_md, lsm)) {
+	if (lli->lli_lsm_md &&
+	    lli->lli_lsm_md->lsm_md_magic != LMV_MAGIC_FOREIGN &&
+	   !lsm_md_eq(lli->lli_lsm_md, lsm)) {
 		if (lsm->lsm_md_layout_version <=
 		    lli->lli_lsm_md->lsm_md_layout_version) {
 			CERROR("%s: " DFID " dir layout mismatch:\n",
@@ -1352,6 +1356,16 @@ static int ll_update_lsm_md(struct inode *inode, struct lustre_md *md)
 	if (!lli->lli_lsm_md) {
 		struct cl_attr *attr;
 
+		if (lsm->lsm_md_magic == LMV_MAGIC_FOREIGN) {
+			/* set md->lmv to NULL so that the later
+			 * md_free_lustre_md() will not free this lsm
+			 */
+			md->lmv = NULL;
+			lli->lli_lsm_md = lsm;
+			up_write(&lli->lli_lsm_sem);
+			return 0;
+		}
+
 		rc = ll_init_lsm_md(inode, md);
 		up_write(&lli->lli_lsm_sem);
 		if (rc)
@@ -2297,7 +2311,7 @@ int ll_prep_inode(struct inode **inode, struct ptlrpc_request *req,
 	rc = md_get_lustre_md(sbi->ll_md_exp, req, sbi->ll_dt_exp,
 			      sbi->ll_md_exp, &md);
 	if (rc)
-		goto cleanup;
+		goto out;
 
 	if (*inode) {
 		rc = ll_update_inode(*inode, &md);
@@ -2365,8 +2379,8 @@ int ll_prep_inode(struct inode **inode, struct ptlrpc_request *req,
 	}
 
 out:
+	/* cleanup will be done if necessary */
 	md_free_lustre_md(sbi->ll_md_exp, &md);
-cleanup:
 	if (rc != 0 && it && it->it_op & IT_OPEN)
 		ll_open_cleanup(sb ? sb : (*inode)->i_sb, req);
 
diff --git a/fs/lustre/lmv/lmv_intent.c b/fs/lustre/lmv/lmv_intent.c
index 45f1ac5..84a21a0 100644
--- a/fs/lustre/lmv/lmv_intent.c
+++ b/fs/lustre/lmv/lmv_intent.c
@@ -276,6 +276,11 @@ static int lmv_intent_open(struct obd_export *exp, struct md_op_data *op_data,
 	u64 flags = it->it_flags;
 	int rc;
 
+	/* do not allow file creation in foreign dir */
+	if ((it->it_op & IT_CREAT) && op_data->op_mea1 &&
+	    op_data->op_mea1->lsm_md_magic == LMV_MAGIC_FOREIGN)
+		return -ENODATA;
+
 	if ((it->it_op & IT_CREAT) && !(flags & MDS_OPEN_BY_FID)) {
 		/* don't allow create under dir with bad hash */
 		if (lmv_is_dir_bad_hash(op_data->op_mea1))
@@ -426,6 +431,15 @@ static int lmv_intent_lookup(struct obd_export *exp,
 	struct mdt_body	*body;
 	int rc;
 
+	/* foreign dir is not striped */
+	if (op_data->op_mea1 &&
+	    op_data->op_mea1->lsm_md_magic == LMV_MAGIC_FOREIGN) {
+		/* only allow getattr/lookup for itself */
+		if (op_data->op_name)
+			return -ENODATA;
+		return 0;
+	}
+
 retry:
 	tgt = lmv_locate_tgt(lmv, op_data, &op_data->op_fid1);
 	if (IS_ERR(tgt))
diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index 9f3d6de..dc4bd1e 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -1166,15 +1166,22 @@ static int lmv_placement_policy(struct obd_device *obd,
 	 * 2. Then check if there is default stripe offset.
 	 * 3. Finally choose MDS by name hash if the parent
 	 *    is striped directory. (see lmv_locate_tgt()).
+	 *
+	 * presently explicit MDT location is not supported
+	 * for foreign dirs (as it can't be embedded into free
+	 * format LMV, like with lum_stripe_offset), so we only
+	 * rely on the default stripe offset or name hashing.
 	 */
 	if (op_data->op_cli_flags & CLI_SET_MEA && lum &&
+	    le32_to_cpu(lum->lum_magic) != LMV_MAGIC_FOREIGN &&
 	    le32_to_cpu(lum->lum_stripe_offset) != (u32)-1) {
 		*mds = le32_to_cpu(lum->lum_stripe_offset);
 	} else if (op_data->op_default_stripe_offset != (u32)-1) {
 		*mds = op_data->op_default_stripe_offset;
 		op_data->op_mds = *mds;
 		/* Correct the stripe offset in lum */
-		if (lum)
+		if (lum &&
+		    le32_to_cpu(lum->lum_magic) != LMV_MAGIC_FOREIGN)
 			lum->lum_stripe_offset = cpu_to_le32(*mds);
 	} else {
 		*mds = op_data->op_mds;
@@ -1606,6 +1613,10 @@ struct lmv_tgt_desc*
 	struct lmv_oinfo *oinfo;
 	struct lmv_tgt_desc *tgt;
 
+	/* foreign dir is not striped dir */
+	if (lsm && lsm->lsm_md_magic == LMV_MAGIC_FOREIGN)
+		return ERR_PTR(-ENODATA);
+
 	/*
 	 * During creating VOLATILE file, it should honor the mdt
 	 * index if the file under striped dir is being restored, see
@@ -2657,6 +2668,10 @@ static int lmv_read_page(struct obd_export *exp, struct md_op_data *op_data,
 	struct lmv_tgt_desc *tgt;
 
 	if (unlikely(lsm)) {
+		/* foreign dir is not striped dir */
+		if (lsm->lsm_md_magic == LMV_MAGIC_FOREIGN)
+			return -ENODATA;
+
 		return lmv_striped_read_page(exp, op_data, cb_op,
 					     offset, ppage);
 	}
@@ -2962,6 +2977,16 @@ static int lmv_unpackmd(struct obd_export *exp, struct lmv_stripe_md **lsmp,
 	/* Free memmd */
 	if (lsm && !lmm) {
 		int i;
+		struct lmv_foreign_md *lfm = (struct lmv_foreign_md *)lsm;
+
+		if (lfm->lfm_magic == LMV_MAGIC_FOREIGN) {
+			size_t lfm_size;
+
+			lfm_size = lfm->lfm_length + offsetof(typeof(*lfm),
+							      lfm_value[0]);
+			kvfree(lfm);
+			return 0;
+		}
 
 		for (i = 0; i < lsm->lsm_md_stripe_count; i++)
 			iput(lsm->lsm_md_oinfo[i].lmo_root);
@@ -2971,6 +2996,25 @@ static int lmv_unpackmd(struct obd_export *exp, struct lmv_stripe_md **lsmp,
 		return 0;
 	}
 
+	/* foreign lmv case */
+	if (le32_to_cpu(lmm->lmv_magic) == LMV_MAGIC_FOREIGN) {
+		struct lmv_foreign_md *lfm = (struct lmv_foreign_md *)lsm;
+
+		if (!lfm) {
+			lfm = kvzalloc(lmm_size, GFP_NOFS);
+			if (!lfm)
+				return -ENOMEM;
+			*lsmp = (struct lmv_stripe_md *)lfm;
+		}
+		lfm->lfm_magic = le32_to_cpu(lmm->lmv_foreign_md.lfm_magic);
+		lfm->lfm_length = le32_to_cpu(lmm->lmv_foreign_md.lfm_length);
+		lfm->lfm_type = le32_to_cpu(lmm->lmv_foreign_md.lfm_type);
+		lfm->lfm_flags = le32_to_cpu(lmm->lmv_foreign_md.lfm_flags);
+		memcpy(&lfm->lfm_value, &lmm->lmv_foreign_md.lfm_value,
+		       lfm->lfm_length);
+		return lmm_size;
+	}
+
 	if (le32_to_cpu(lmm->lmv_magic) == LMV_MAGIC_STRIPE)
 		return -EPERM;
 
@@ -3279,6 +3323,10 @@ static int lmv_merge_attr(struct obd_export *exp,
 {
 	int rc, i;
 
+	/* foreign dir is not striped dir */
+	if (lsm->lsm_md_magic == LMV_MAGIC_FOREIGN)
+		return 0;
+
 	rc = lmv_revalidate_slaves(exp, lsm, cb_blocking, 0);
 	if (rc < 0)
 		return rc;
diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index 5931bc1..57da3c3 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -613,11 +613,18 @@ static int mdc_get_lustre_md(struct obd_export *exp,
 				goto out;
 
 			if (rc < (typeof(rc))sizeof(*md->lmv)) {
-				CDEBUG(D_INFO,
-				       "size too small: rc < sizeof(*md->lmv) (%d < %d)\n",
-					rc, (int)sizeof(*md->lmv));
-				rc = -EPROTO;
-				goto out;
+				struct lmv_foreign_md *lfm = md->lfm;
+
+				/* short (< sizeof(struct lmv_stripe_md))
+				 * foreign LMV case
+				 */
+				if (lfm->lfm_magic != LMV_MAGIC_FOREIGN) {
+					CDEBUG(D_INFO,
+					       "size too small: rc < sizeof(*md->lmv) (%d < %d)\n",
+					       rc, (int)sizeof(*md->lmv));
+					rc = -EPROTO;
+					goto out;
+				}
 			}
 		}
 	}
diff --git a/fs/lustre/ptlrpc/pack_generic.c b/fs/lustre/ptlrpc/pack_generic.c
index 231cb26..a4f28f3 100644
--- a/fs/lustre/ptlrpc/pack_generic.c
+++ b/fs/lustre/ptlrpc/pack_generic.c
@@ -1974,8 +1974,15 @@ void lustre_swab_lmv_user_md_objects(struct lmv_user_mds_data *lmd,
 
 void lustre_swab_lmv_user_md(struct lmv_user_md *lum)
 {
-	u32 count = lum->lum_stripe_count;
+	u32 count;
 
+	if (lum->lum_magic == LMV_MAGIC_FOREIGN) {
+		__swab32s(&lum->lum_magic);
+		__swab32s(&((struct lmv_foreign_md *)lum)->lfm_length);
+		return;
+	}
+
+	count = lum->lum_stripe_count;
 	__swab32s(&lum->lum_magic);
 	__swab32s(&lum->lum_stripe_count);
 	__swab32s(&lum->lum_stripe_offset);
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index fd35023..f7ea744 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -1976,11 +1976,21 @@ struct lmv_mds_md_v1 {
 	struct lu_fid lmv_stripe_fids[0];	/* FIDs for each stripe */
 };
 
+/* foreign LMV EA */
+struct lmv_foreign_md {
+	__u32 lfm_magic;	/* magic number = LMV_MAGIC_FOREIGN */
+	__u32 lfm_length;	/* length of lfm_value */
+	__u32 lfm_type;		/* type, see LU_FOREIGN_TYPE_ */
+	__u32 lfm_flags;	/* flags, type specific */
+	char lfm_value[];	/* free format value */
+};
+
 #define LMV_MAGIC_V1	 0x0CD20CD0	/* normal stripe lmv magic */
 #define LMV_MAGIC	 LMV_MAGIC_V1
 
 /* #define LMV_USER_MAGIC 0x0CD30CD0 */
 #define LMV_MAGIC_STRIPE 0x0CD40CD0	/* magic for dir sub_stripe */
+#define LMV_MAGIC_FOREIGN 0x0CD50CD0	/* magic for lmv foreign */
 
 /*
  *Right now only the lower part(0-16bits) of lmv_hash_type is being used,
@@ -2025,6 +2035,7 @@ static inline __u64 lustre_hash_fnv_1a_64(const void *buf, size_t size)
 	__u32			lmv_magic;
 	struct lmv_mds_md_v1	lmv_md_v1;
 	struct lmv_user_md	lmv_user_md;
+	struct lmv_foreign_md	lmv_foreign_md;
 };
 
 static inline ssize_t lmv_mds_md_size(int stripe_count, unsigned int lmm_magic)
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index ad5d446..03ec680 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -474,7 +474,7 @@ struct lov_user_md_v3 {		/* LOV EA user data (host-endian) */
 struct lov_foreign_md {
 	__u32 lfm_magic;	/* magic number = LOV_MAGIC_FOREIGN */
 	__u32 lfm_length;	/* length of lfm_value */
-	__u32 lfm_type;		/* type, see LOV_FOREIGN_TYPE_ */
+	__u32 lfm_type;		/* type, see LU_FOREIGN_TYPE_ */
 	__u32 lfm_flags;	/* flags, type specific */
 	char lfm_value[];
 };
@@ -645,19 +645,22 @@ enum lmv_hash_type {
 #define LMV_HASH_NAME_ALL_CHARS		"all_char"
 #define LMV_HASH_NAME_FNV_1A_64		"fnv_1a_64"
 
-/**
- * LOV foreign types
- **/
-#define LOV_FOREIGN_TYPE_NONE 0
-#define LOV_FOREIGN_TYPE_DAOS 0xda05
-#define LOV_FOREIGN_TYPE_UNKNOWN UINT32_MAX
-
 struct lustre_foreign_type {
 	uint32_t lft_type;
 	const char *lft_name;
 };
 
-extern struct lustre_foreign_type lov_foreign_type[];
+/**
+ * LOV/LMV foreign types
+ **/
+enum lustre_foreign_types {
+	LU_FOREIGN_TYPE_NONE = 0,
+	LU_FOREIGN_TYPE_DAOS = 0xda05,
+	/* must be the max/last one */
+	LU_FOREIGN_TYPE_UNKNOWN = 0xffffffff,
+};
+
+extern struct lustre_foreign_type lu_foreign_types[];
 
 /*
  * Got this according to how get LOV_MAX_STRIPE_COUNT, see above,
@@ -678,6 +681,16 @@ struct lmv_user_md_v1 {
 	struct lmv_user_mds_data  lum_objects[0];
 } __packed;
 
+static inline __u32 lmv_foreign_to_md_stripes(__u32 size)
+{
+	if (size <= sizeof(struct lmv_user_md))
+		return 0;
+
+	size -= sizeof(struct lmv_user_md);
+	return (size + sizeof(struct lmv_user_mds_data) - 1) /
+	       sizeof(struct lmv_user_mds_data);
+}
+
 static inline int lmv_user_md_size(int stripes, int lmm_magic)
 {
 	int size = sizeof(struct lmv_user_md);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 280/622] lustre: obd: replace class_uuid with linux kernel version.
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (278 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 279/622] lustre: lmv: new foreign LMV format James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 281/622] lustre: ptlrpc: Fix style issues for sec_null.c James Simmons
                   ` (342 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

We can replace the custom Lustre class_uuid_t with the Linux
kernel's uuid handling.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11803
Lustre-commit: 604c266a175b ("LU-11803 obd: replace class_uuid with linux kernel version.")
Signed-off-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-on: https://review.whamcloud.com/33916
Reviewed-by: Petros Koutoupis <pkoutoupis@cray.com>
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Reviewed-by: Gu Zheng <gzheng@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_class.h  | 10 ----------
 fs/lustre/llite/llite_lib.c    | 23 +++++++++++++----------
 fs/lustre/obdclass/obd_mount.c |  8 +++++---
 3 files changed, 18 insertions(+), 23 deletions(-)

diff --git a/fs/lustre/include/obd_class.h b/fs/lustre/include/obd_class.h
index 6cddc4f..a142d6e 100644
--- a/fs/lustre/include/obd_class.h
+++ b/fs/lustre/include/obd_class.h
@@ -1672,13 +1672,6 @@ struct lwp_register_item {
 /* obd_mount.c */
 int lustre_check_exclusion(struct super_block *sb, char *svname);
 
-typedef u8 class_uuid_t[16];
-
-static inline void class_uuid_unparse(class_uuid_t uu, struct obd_uuid *out)
-{
-	sprintf(out->uuid, "%pU", uu);
-}
-
 /* lustre_peer.c    */
 int lustre_uuid_to_peer(const char *uuid, lnet_nid_t *peer_nid, int index);
 int class_add_uuid(const char *uuid, u64 nid);
@@ -1689,9 +1682,6 @@ static inline void class_uuid_unparse(class_uuid_t uu, struct obd_uuid *out)
 extern char obd_jobid_name[];
 int class_procfs_init(void);
 int class_procfs_clean(void);
-/* prng.c */
-#define ll_generate_random_uuid(uuid_out) \
-	get_random_bytes(uuid_out, sizeof(class_uuid_t))
 
 /* statfs_pack.c */
 struct kstatfs;
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 21825251..99cedcf 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -38,9 +38,10 @@
 #define DEBUG_SUBSYSTEM S_LLITE
 
 #include <linux/module.h>
 #include <linux/statfs.h>
 #include <linux/types.h>
 #include <linux/mm.h>
+#include <linux/uuid.h>
 #include <linux/random.h>
 #include <linux/security.h>
 #include <linux/fs_struct.h>
@@ -69,7 +71,6 @@ static struct ll_sb_info *ll_init_sbi(void)
 	unsigned long pages;
 	unsigned long lru_page_max;
 	struct sysinfo si;
-	class_uuid_t uuid;
 	int i;
 
 	sbi = kzalloc(sizeof(*sbi), GFP_NOFS);
@@ -97,11 +98,6 @@ static struct ll_sb_info *ll_init_sbi(void)
 	sbi->ll_ra_info.ra_max_pages = sbi->ll_ra_info.ra_max_pages_per_file;
 	sbi->ll_ra_info.ra_max_read_ahead_whole_pages = -1;
 
-	ll_generate_random_uuid(uuid);
-	sprintf(sbi->ll_sb_uuid.uuid, "%pU", uuid);
-
-	CDEBUG(D_CONFIG, "generated uuid: %s\n", sbi->ll_sb_uuid.uuid);
-
 	sbi->ll_flags |= LL_SBI_VERBOSE;
 	sbi->ll_flags |= LL_SBI_CHECKSUM;
 	sbi->ll_flags |= LL_SBI_FLOCK;
@@ -965,6 +961,7 @@ int ll_fill_super(struct super_block *sb)
 	char *profilenm = get_profile_name(sb);
 	struct config_llog_instance *cfg;
 	char name[MAX_OBD_NAME];
+	uuid_t uuid;
 	char *ptr;
 	int len;
 	int err;
@@ -991,13 +988,15 @@ int ll_fill_super(struct super_block *sb)
 	if (err)
 		goto out_free;
 
-	err = super_setup_bdi_name(sb, "lustre-%p", sb);
-	if (err)
-		goto out_free;
-
 	/* kernel >= 2.6.38 store dentry operations in sb->s_d_op. */
 	sb->s_d_op = &ll_d_ops;
 
+	/* UUID handling */
+	generate_random_uuid(uuid.b);
+	snprintf(sbi->ll_sb_uuid.uuid, sizeof(sbi->ll_sb_uuid.uuid), "%pU", uuid.b);
+
+	CDEBUG(D_CONFIG, "llite sb uuid: %s\n", sbi->ll_sb_uuid.uuid);
+
 	/* Get fsname */
 	len = strlen(lsi->lsi_lmd->lmd_profile);
 	ptr = strrchr(lsi->lsi_lmd->lmd_profile, '-');
@@ -1021,6 +1020,10 @@ int ll_fill_super(struct super_block *sb)
 	snprintf(name, sizeof(name), "%.*s-%px", len,
 		 lsi->lsi_lmd->lmd_profile, sb);
 
+	err = super_setup_bdi_name(sb, "%s", name);
+	if (err)
+		goto out_free;
+
 	/* Call ll_debugsfs_register_super() before lustre_process_log()
 	 * so that "llite.*.*" params can be processed correctly.
 	 */
diff --git a/fs/lustre/obdclass/obd_mount.c b/fs/lustre/obdclass/obd_mount.c
index 6c68bc7..31f2f5b 100644
--- a/fs/lustre/obdclass/obd_mount.c
+++ b/fs/lustre/obdclass/obd_mount.c
@@ -44,6 +44,7 @@
 #include <linux/random.h>
 #include <obd.h>
 #include <obd_class.h>
+#include <linux/uuid.h>
 #include <uapi/linux/lustre/lustre_idl.h>
 #include <lustre_log.h>
 #include <lustre_disk.h>
@@ -216,7 +218,7 @@ int lustre_start_mgc(struct super_block *sb)
 	struct obd_device *obd;
 	struct obd_export *exp;
 	struct obd_uuid *uuid = NULL;
-	class_uuid_t uuidc;
+	uuid_t uuidc;
 	lnet_nid_t nid;
 	char nidstr[LNET_NIDSTR_SIZE];
 	char *mgcname = NULL, *niduuid = NULL, *mgssec = NULL;
@@ -336,8 +338,8 @@ int lustre_start_mgc(struct super_block *sb)
 		goto out_free;
 	}
 
-	ll_generate_random_uuid(uuidc);
-	sprintf(uuid->uuid, "%pU", uuidc);
+	generate_random_uuid(uuidc.b);
+	snprintf(uuid->uuid, sizeof(uuid->uuid), "%pU", uuidc.b);
 
 	/* Start the MGC */
 	rc = lustre_start_simple(mgcname, LUSTRE_MGC_NAME,
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 281/622] lustre: ptlrpc: Fix style issues for sec_null.c
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (279 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 280/622] lustre: obd: replace class_uuid with linux kernel version James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 282/622] lustre: ptlrpc: Fix style issues for service.c James Simmons
                   ` (341 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Arshad Hussain <arshad.super@gmail.com>

This patch fixes issues reported by checkpatch
for file fs/lustre/ptlrpc/sec_null.c

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: 7d00fbae100b ("LU-6142 ptlrpc: Fix style issues for sec_null.c")
Signed-off-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-on: https://review.whamcloud.com/34549
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/sec_null.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/fs/lustre/ptlrpc/sec_null.c b/fs/lustre/ptlrpc/sec_null.c
index 3c7fb68..2eaa788 100644
--- a/fs/lustre/ptlrpc/sec_null.c
+++ b/fs/lustre/ptlrpc/sec_null.c
@@ -101,6 +101,7 @@ int null_ctx_verify(struct ptlrpc_cli_ctx *ctx, struct ptlrpc_request *req)
 	if (req->rq_early) {
 		cksums = lustre_msg_get_cksum(req->rq_repdata);
 		cksumc = lustre_msg_calc_cksum(req->rq_repmsg);
+
 		if (cksumc != cksums) {
 			CDEBUG(D_SEC,
 			       "early reply checksum mismatch: %08x != %08x\n",
@@ -119,7 +120,8 @@ struct ptlrpc_sec *null_create_sec(struct obd_import *imp,
 {
 	LASSERT(SPTLRPC_FLVR_POLICY(sf->sf_rpc) == SPTLRPC_POLICY_NULL);
 
-	/* general layer has take a module reference for us, because we never
+	/*
+	 * general layer has taken a module reference for us, because we never
 	 * really destroy the sec, simply release the reference here.
 	 */
 	sptlrpc_policy_put(&null_policy);
@@ -142,9 +144,8 @@ struct ptlrpc_cli_ctx *null_lookup_ctx(struct ptlrpc_sec *sec,
 }
 
 static
-int null_flush_ctx_cache(struct ptlrpc_sec *sec,
-			 uid_t uid,
-			 int grace, int force)
+int null_flush_ctx_cache(struct ptlrpc_sec *sec, uid_t uid, int grace,
+			 int force)
 {
 	return 0;
 }
@@ -250,7 +251,8 @@ int null_enlarge_reqbuf(struct ptlrpc_sec *sec,
 		if (!newbuf)
 			return -ENOMEM;
 
-		/* Must lock this, so that otherwise unprotected change of
+		/*
+		 * Must lock this, so that otherwise unprotected change of
 		 * rq_reqmsg is not racing with parallel processing of
 		 * imp_replay_list traversing threads. See LU-3333
 		 * This is a bandaid at best, we really need to deal with this
@@ -454,6 +456,6 @@ void sptlrpc_null_fini(void)
 
 	rc = sptlrpc_unregister_policy(&null_policy);
 	if (rc)
-		CERROR("failed to unregister %s: %d\n",
-		       null_policy.sp_name, rc);
+		CERROR("failed to unregister %s: %d\n", null_policy.sp_name,
+		       rc);
 }
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 282/622] lustre: ptlrpc: Fix style issues for service.c
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (280 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 281/622] lustre: ptlrpc: Fix style issues for sec_null.c James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 283/622] lustre: uapi: fix file heat support James Simmons
                   ` (340 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Arshad Hussain <arshad.super@gmail.com>

This patch fixes issues reported by checkpatch
for file fs/lustre/ptlrpc/service.c

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: cb82520d2474 ("LU-6142 ptlrpc: Fix style issues for service.c")
Signed-off-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-on: https://review.whamcloud.com/34605
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/service.c | 159 +++++++++++++++++++++++++++++----------------
 1 file changed, 102 insertions(+), 57 deletions(-)

diff --git a/fs/lustre/ptlrpc/service.c b/fs/lustre/ptlrpc/service.c
index 362102b..1513f51 100644
--- a/fs/lustre/ptlrpc/service.c
+++ b/fs/lustre/ptlrpc/service.c
@@ -145,7 +145,8 @@ static int ptlrpc_grow_req_bufs(struct ptlrpc_service_part *svcpt, int post)
 	spin_unlock(&svcpt->scp_lock);
 
 	for (i = 0; i < svc->srv_nbuf_per_group; i++) {
-		/* NB: another thread might have recycled enough rqbds, we
+		/*
+		 * NB: another thread might have recycled enough rqbds, we
 		 * need to make sure it wouldn't over-allocate, see LU-1212.
 		 */
 		if (svcpt->scp_nrqbds_posted >= svc->srv_nbuf_per_group ||
@@ -321,7 +322,8 @@ static int ptlrpc_server_post_idle_rqbds(struct ptlrpc_service_part *svcpt)
 	svcpt->scp_nrqbds_posted--;
 	list_move_tail(&rqbd->rqbd_list, &svcpt->scp_rqbd_idle);
 
-	/* Don't complain if no request buffers are posted right now; LNET
+	/*
+	 * Don't complain if no request buffers are posted right now; LNET
 	 * won't drop requests because we set the portal lazy!
 	 */
 
@@ -362,13 +364,15 @@ static void ptlrpc_server_nthreads_check(struct ptlrpc_service *svc,
 	init = PTLRPC_NTHRS_INIT + (svc->srv_ops.so_hpreq_handler != NULL);
 	init = max_t(int, init, tc->tc_nthrs_init);
 
-	/* NB: please see comments in lustre_lnet.h for definition
+	/*
+	 * NB: please see comments in lustre_lnet.h for definition
 	 * details of these members
 	 */
 	LASSERT(tc->tc_nthrs_max != 0);
 
 	if (tc->tc_nthrs_user != 0) {
-		/* In case there is a reason to test a service with many
+		/*
+		 * In case there is a reason to test a service with many
 		 * threads, we give a less strict check here, it can
 		 * be up to 8 * nthrs_max
 		 */
@@ -380,7 +384,8 @@ static void ptlrpc_server_nthreads_check(struct ptlrpc_service *svc,
 
 	total = tc->tc_nthrs_max;
 	if (tc->tc_nthrs_base == 0) {
-		/* don't care about base threads number per partition,
+		/*
+		 * don't care about base threads number per partition,
 		 * this is most for non-affinity service
 		 */
 		nthrs = total / svc->srv_ncpts;
@@ -391,7 +396,8 @@ static void ptlrpc_server_nthreads_check(struct ptlrpc_service *svc,
 	if (svc->srv_ncpts == 1) {
 		int i;
 
-		/* NB: Increase the base number if it's single partition
+		/*
+		 * NB: Increase the base number if it's single partition
 		 * and total number of cores/HTs is larger or equal to 4.
 		 * result will always < 2 * nthrs_base
 		 */
@@ -419,7 +425,8 @@ static void ptlrpc_server_nthreads_check(struct ptlrpc_service *svc,
 		 */
 		/* weight is # of HTs */
 		preempt_disable();
-		if (cpumask_weight(topology_sibling_cpumask(smp_processor_id())) > 1) {
+		if (cpumask_weight
+		    (topology_sibling_cpumask(smp_processor_id())) > 1) {
 			/* depress thread factor for hyper-thread */
 			factor = factor - (factor >> 1) + (factor >> 3);
 		}
@@ -511,7 +518,8 @@ static int ptlrpc_service_part_init(struct ptlrpc_service *svc,
 
 	timer_setup(&svcpt->scp_at_timer, ptlrpc_at_timer, 0);
 
-	/* At SOW, service time should be quick; 10s seems generous. If client
+	/*
+	 * At SOW, service time should be quick; 10s seems generous. If client
 	 * timeout is less than this, we'll be sending an early reply.
 	 */
 	at_init(&svcpt->scp_at_estimate, 10, 0);
@@ -520,7 +528,8 @@ static int ptlrpc_service_part_init(struct ptlrpc_service *svc,
 	svcpt->scp_service = svc;
 	/* Now allocate the request buffers, but don't post them now */
 	rc = ptlrpc_grow_req_bufs(svcpt, 0);
-	/* We shouldn't be under memory pressure at startup, so
+	/*
+	 * We shouldn't be under memory pressure at startup, so
 	 * fail if we can't allocate all our buffers at this time.
 	 */
 	if (rc != 0)
@@ -719,7 +728,8 @@ static void ptlrpc_server_free_request(struct ptlrpc_request *req)
 	LASSERT(atomic_read(&req->rq_refcount) == 0);
 	LASSERT(list_empty(&req->rq_timed_list));
 
-	/* DEBUG_REQ() assumes the reply state of a request with a valid
+	/*
+	 * DEBUG_REQ() assumes the reply state of a request with a valid
 	 * ref will not be destroyed until that reference is dropped.
 	 */
 	ptlrpc_req_drop_rs(req);
@@ -727,7 +737,8 @@ static void ptlrpc_server_free_request(struct ptlrpc_request *req)
 	sptlrpc_svc_ctx_decref(req);
 
 	if (req != &req->rq_rqbd->rqbd_req) {
-		/* NB request buffers use an embedded
+		/*
+		 * NB request buffers use an embedded
 		 * req if the incoming req unlinked the
 		 * MD; this isn't one of them!
 		 */
@@ -751,7 +762,8 @@ static void ptlrpc_server_drop_request(struct ptlrpc_request *req)
 
 	if (req->rq_at_linked) {
 		spin_lock(&svcpt->scp_at_lock);
-		/* recheck with lock, in case it's unlinked by
+		/*
+		 * recheck with lock, in case it's unlinked by
 		 * ptlrpc_at_check_timed()
 		 */
 		if (likely(req->rq_at_linked))
@@ -777,7 +789,8 @@ static void ptlrpc_server_drop_request(struct ptlrpc_request *req)
 		list_move_tail(&rqbd->rqbd_list, &svcpt->scp_hist_rqbds);
 		svcpt->scp_hist_nrqbds++;
 
-		/* cull some history?
+		/*
+		 * cull some history?
 		 * I expect only about 1 or 2 rqbds need to be recycled here
 		 */
 		while (svcpt->scp_hist_nrqbds > svc->srv_hist_nrqbds_cpt_max) {
@@ -788,11 +801,12 @@ static void ptlrpc_server_drop_request(struct ptlrpc_request *req)
 			list_del(&rqbd->rqbd_list);
 			svcpt->scp_hist_nrqbds--;
 
-			/* remove rqbd's reqs from svc's req history while
+			/*
+			 * remove rqbd's reqs from svc's req history while
 			 * I've got the service lock
 			 */
 			list_for_each_entry(req, &rqbd->rqbd_reqs, rq_list) {
 				/* Track the highest culled req seq */
 				if (req->rq_history_seq >
 				    svcpt->scp_hist_seq_culled) {
 					svcpt->scp_hist_seq_culled =
@@ -980,7 +994,8 @@ static int ptlrpc_at_add_timed(struct ptlrpc_request *req)
 
 	div_u64_rem(req->rq_deadline, array->paa_size, &index);
 	if (array->paa_reqs_count[index] > 0) {
-		/* latest rpcs will have the latest deadlines in the list,
+		/*
+		 * latest rpcs will have the latest deadlines in the list,
 		 * so search backward.
 		 */
 		list_for_each_entry_reverse(rq, &array->paa_reqs_array[index],
@@ -1043,7 +1058,8 @@ static int ptlrpc_at_send_early_reply(struct ptlrpc_request *req)
 	time64_t newdl;
 	int rc;
 
-	/* deadline is when the client expects us to reply, margin is the
+	/*
+	 * deadline is when the client expects us to reply, margin is the
 	 * difference between clients' and servers' expectations
 	 */
 	DEBUG_REQ(D_ADAPTTO, req,
@@ -1057,14 +1073,15 @@ static int ptlrpc_at_send_early_reply(struct ptlrpc_request *req)
 
 	if (olddl < 0) {
 		DEBUG_REQ(D_WARNING, req,
-			  "Already past deadline (%+lds), not sending early reply. Consider increasing at_early_margin (%d)?",
-			  olddl, at_early_margin);
+			  "Already past deadline (%+llds), not sending early reply. Consider increasing at_early_margin (%d)?",
+			  (s64)olddl, at_early_margin);
 
 		/* Return an error so we're not re-added to the timed list. */
 		return -ETIMEDOUT;
 	}
 
-	if (!(lustre_msghdr_get_flags(req->rq_reqmsg) & MSGHDR_AT_SUPPORT)) {
+	if (!(lustre_msghdr_get_flags(req->rq_reqmsg) &
+	     MSGHDR_AT_SUPPORT)) {
 		DEBUG_REQ(D_INFO, req,
 			  "Wanted to ask client for more time, but no AT support");
 		return -ENOSYS;
@@ -1082,7 +1099,8 @@ static int ptlrpc_at_send_early_reply(struct ptlrpc_request *req)
 		    ktime_get_real_seconds() - req->rq_arrival_time.tv_sec);
 	newdl = req->rq_arrival_time.tv_sec + at_get(&svcpt->scp_at_estimate);
 
-	/* Check to see if we've actually increased the deadline -
+	/*
+	 * Check to see if we've actually increased the deadline -
 	 * we may be past adaptive_max
 	 */
 	if (req->rq_deadline >= newdl) {
@@ -1159,7 +1177,8 @@ static int ptlrpc_at_send_early_reply(struct ptlrpc_request *req)
 		DEBUG_REQ(D_ERROR, req, "Early reply send failed %d", rc);
 	}
 
-	/* Free the (early) reply state from lustre_pack_reply.
+	/*
+	 * Free the (early) reply state from lustre_pack_reply.
 	 * (ptlrpc_send_reply takes it's own rs ref, so this is safe here)
 	 */
 	ptlrpc_req_drop_rs(reqcopy);
@@ -1175,7 +1194,8 @@ static int ptlrpc_at_send_early_reply(struct ptlrpc_request *req)
 	return rc;
 }
 
-/* Send early replies to everybody expiring within at_early_margin
+/*
+ * Send early replies to everybody expiring within at_early_margin
  * asking for at_extra time
  */
 static void ptlrpc_at_check_timed(struct ptlrpc_service_part *svcpt)
@@ -1211,7 +1231,8 @@ static void ptlrpc_at_check_timed(struct ptlrpc_service_part *svcpt)
 		return;
 	}
 
-	/* We're close to a timeout, and we don't know how much longer the
+	/*
+	 * We're close to a timeout, and we don't know how much longer the
 	 * server will take. Send early replies to everyone expiring soon.
 	 */
 	INIT_LIST_HEAD(&work_list);
@@ -1258,7 +1279,8 @@ static void ptlrpc_at_check_timed(struct ptlrpc_service_part *svcpt)
 	       "timeout in %+ds, asking for %d secs on %d early replies\n",
 	       first, at_extra, counter);
 	if (first < 0) {
-		/* We're already past request deadlines before we even get a
+		/*
+		 * We're already past request deadlines before we even get a
 		 * chance to send early replies
 		 */
 		LCONSOLE_WARN("%s: This server is not able to keep up with request traffic (cpu-bound).\n",
@@ -1269,7 +1291,8 @@ static void ptlrpc_at_check_timed(struct ptlrpc_service_part *svcpt)
 		      at_get(&svcpt->scp_at_estimate), delay);
 	}
 
-	/* we took additional refcount so entries can't be deleted from list, no
+	/*
+	 * we took additional refcount so entries can't be deleted from list, no
 	 * locking is needed
 	 */
 	while ((rq = list_first_entry_or_null(&work_list,
@@ -1285,8 +1308,10 @@ static void ptlrpc_at_check_timed(struct ptlrpc_service_part *svcpt)
 }
 
 /**
+ *
  * Put the request to the export list if the request may become
  * a high priority one.
+
  */
 static int ptlrpc_server_hpreq_init(struct ptlrpc_service_part *svcpt,
 				    struct ptlrpc_request *req)
@@ -1300,7 +1325,8 @@ static int ptlrpc_server_hpreq_init(struct ptlrpc_service_part *svcpt,
 		LASSERT(rc == 0);
 	}
 	if (req->rq_export && req->rq_ops) {
-		/* Perform request specific check. We should do this check
+		/*
+		 * Perform request specific check. We should do this check
 		 * before the request is added into exp_hp_rpcs list otherwise
 		 * it may hit swab race at LU-1044.
 		 */
@@ -1310,9 +1336,10 @@ static int ptlrpc_server_hpreq_init(struct ptlrpc_service_part *svcpt,
 				req->rq_status = rc;
 				ptlrpc_error(req);
 			}
-			/** can only return error,
+			/*
+			 * can only return error,
 			 * 0 for normal request,
-			 *  or 1 for high priority request
+			 * or 1 for high priority request
 			 */
 			LASSERT(rc <= 1);
 		}
@@ -1331,7 +1358,8 @@ static int ptlrpc_server_hpreq_init(struct ptlrpc_service_part *svcpt,
 static void ptlrpc_server_hpreq_fini(struct ptlrpc_request *req)
 {
 	if (req->rq_export && req->rq_ops) {
-		/* refresh lock timeout again so that client has more
+		/*
+		 * refresh lock timeout again so that client has more
 		 * room to send lock cancel RPC.
 		 */
 		if (req->rq_ops->hpreq_fini)
@@ -1357,7 +1385,7 @@ static int ptlrpc_server_request_add(struct ptlrpc_service_part  *svcpt,
 	return 0;
 }
 
-/**
+/*
  * Allow to handle high priority request
  * User can call it w/o any lock but need to hold
  * ptlrpc_service_part::scp_req_lock to get reliable result
@@ -1521,7 +1549,8 @@ static int ptlrpc_server_handle_req_in(struct ptlrpc_service_part *svcpt,
 			       struct ptlrpc_request, rq_list);
 	list_del_init(&req->rq_list);
 	svcpt->scp_nreqs_incoming--;
-	/* Consider this still a "queued" request as far as stats are
+	/*
+	 * Consider this still a "queued" request as far as stats are
 	 * concerned
 	 */
 	spin_unlock(&svcpt->scp_lock);
@@ -1556,7 +1585,7 @@ static int ptlrpc_server_handle_req_in(struct ptlrpc_service_part *svcpt,
 
 	rc = lustre_unpack_req_ptlrpc_body(req, MSG_PTLRPC_BODY_OFF);
 	if (rc) {
-		CERROR("error unpacking ptlrpc body: ptl %d from %s x%llu\n",
+		CERROR("error unpacking ptlrpc body: ptl %d from %s x %llu\n",
 		       svc->srv_req_portal, libcfs_id2str(req->rq_peer),
 		       req->rq_xid);
 		goto err_req;
@@ -1615,8 +1644,9 @@ static int ptlrpc_server_handle_req_in(struct ptlrpc_service_part *svcpt,
 	/* Set rpc server deadline and add it to the timed list */
 	deadline = (lustre_msghdr_get_flags(req->rq_reqmsg) &
 		    MSGHDR_AT_SUPPORT) ?
-		   /* The max time the client expects us to take */
-		   lustre_msg_get_timeout(req->rq_reqmsg) : obd_timeout;
+		    /* The max time the client expects us to take */
+		    lustre_msg_get_timeout(req->rq_reqmsg) : obd_timeout;
+
 	req->rq_deadline = req->rq_arrival_time.tv_sec + deadline;
 	if (unlikely(deadline == 0)) {
 		DEBUG_REQ(D_ERROR, req, "Dropping request with 0 timeout");
@@ -1625,11 +1655,12 @@ static int ptlrpc_server_handle_req_in(struct ptlrpc_service_part *svcpt,
 
 	req->rq_svc_thread = thread;
 	if (thread) {
-		/* initialize request session, it is needed for request
+		/*
+		 * initialize request session, it is needed for request
 		 * processing by target
 		 */
-		rc = lu_context_init(&req->rq_session,
-				     LCT_SERVER_SESSION | LCT_NOREF);
+		rc = lu_context_init(&req->rq_session, LCT_SERVER_SESSION |
+						       LCT_NOREF);
 		if (rc) {
 			CERROR("%s: failure to initialize session: rc = %d\n",
 			       thread->t_name, rc);
@@ -1710,7 +1741,8 @@ static int ptlrpc_server_handle_request(struct ptlrpc_service_part *svcpt,
 			goto put_conn;
 	}
 
-	/* Discard requests queued for longer than the deadline.
+	/*
+	 * Discard requests queued for longer than the deadline.
 	 * The deadline is increased if we send an early reply.
 	 */
 	if (ktime_get_real_seconds() > request->rq_deadline) {
@@ -1827,7 +1859,8 @@ static int ptlrpc_handle_rs(struct ptlrpc_reply_state *rs)
 	list_del_init(&rs->rs_exp_list);
 	spin_unlock(&exp->exp_lock);
 
-	/* The disk commit callback holds exp_uncommitted_replies_lock while it
+	/*
+	 * The disk commit callback holds exp_uncommitted_replies_lock while it
 	 * iterates over newly committed replies, removing them from
 	 * exp_uncommitted_replies.  It then drops this lock and schedules the
 	 * replies it found for handling here.
@@ -1864,7 +1897,8 @@ static int ptlrpc_handle_rs(struct ptlrpc_reply_state *rs)
 	rs->rs_nlocks = 0;		/* locks still on rs_locks! */
 
 	if (nlocks == 0 && !been_handled) {
-		/* If we see this, we should already have seen the warning
+		/*
+		 * If we see this, we should already have seen the warning
 		 * in mds_steal_ack_locks()
 		 */
 		CDEBUG(D_HA,
@@ -1916,7 +1950,8 @@ static void ptlrpc_check_rqbd_pool(struct ptlrpc_service_part *svcpt)
 
 	/* NB I'm not locking; just looking. */
 
-	/* CAVEAT EMPTOR: We might be allocating buffers here because we've
+	/*
+	 * CAVEAT EMPTOR: We might be allocating buffers here because we've
 	 * allowed the request history to grow out of control.  We could put a
 	 * sanity check on that here and cull some history if we need the
 	 * space.
@@ -2194,7 +2229,8 @@ static int ptlrpc_main(void *arg)
 	LASSERT(svcpt->scp_nthrs_starting == 1);
 	svcpt->scp_nthrs_starting--;
 
-	/* SVC_STOPPING may already be set here if someone else is trying
+	/*
+	 * SVC_STOPPING may already be set here if someone else is trying
 	 * to stop the service while this new thread has been dynamically
 	 * forked. We still set SVC_RUNNING to let our creator know that
 	 * we are now running, however we will exit as soon as possible
@@ -2254,7 +2290,8 @@ static int ptlrpc_main(void *arg)
 
 		if (ptlrpc_rqbd_pending(svcpt) &&
 		    ptlrpc_server_post_idle_rqbds(svcpt) < 0) {
-			/* I just failed to repost request buffers.
+			/*
+			 * I just failed to repost request buffers.
 			 * Wait for a timeout (unless something else
 			 * happens) before I try again
 			 */
@@ -2262,8 +2299,8 @@ static int ptlrpc_main(void *arg)
 			CDEBUG(D_RPCTRACE, "Posted buffers: %d\n",
 			       svcpt->scp_nrqbds_posted);
 		}
-
-		/* If the number of threads has been tuned downward and this
+		/*
+		 * If the number of threads has been tuned downward and this
 		 * thread should be stopped, then stop in reverse order so the
 		 * the threads always have contiguous thread index values.
 		 */
@@ -2285,7 +2322,6 @@ static int ptlrpc_main(void *arg)
 out:
 	CDEBUG(D_RPCTRACE, "%s: service thread [%p:%u] %d exiting: rc = %d\n",
 	       thread->t_name, thread, thread->t_pid, thread->t_id, rc);
-
 	spin_lock(&svcpt->scp_lock);
 	if (thread_test_and_clear_flags(thread, SVC_STARTING))
 		svcpt->scp_nthrs_starting--;
@@ -2546,7 +2582,8 @@ int ptlrpc_start_thread(struct ptlrpc_service_part *svcpt, int wait)
 	}
 
 	if (svcpt->scp_nthrs_starting != 0) {
-		/* serialize starting because some modules (obdfilter)
+		/*
+		 * serialize starting because some modules (obdfilter)
 		 * might require unique and contiguous t_id
 		 */
 		LASSERT(svcpt->scp_nthrs_starting == 1);
@@ -2589,7 +2626,8 @@ int ptlrpc_start_thread(struct ptlrpc_service_part *svcpt, int wait)
 		spin_lock(&svcpt->scp_lock);
 		--svcpt->scp_nthrs_starting;
 		if (thread_is_stopping(thread)) {
-			/* this ptlrpc_thread is being handled
+			/*
+			 * this ptlrpc_thread is being handled
 			 * by ptlrpc_svcpt_stop_threads now
 			 */
 			thread_add_flags(thread, SVC_STOPPED);
@@ -2616,7 +2654,7 @@ int ptlrpc_start_thread(struct ptlrpc_service_part *svcpt, int wait)
 int ptlrpc_hr_init(void)
 {
 	struct ptlrpc_hr_partition *hrp;
-	struct ptlrpc_hr_thread	*hrt;
+	struct ptlrpc_hr_thread *hrt;
 	int rc;
 	int i;
 	int j;
@@ -2736,7 +2774,8 @@ static void ptlrpc_wait_replies(struct ptlrpc_service_part *svcpt)
 	int rc;
 	int i;
 
-	/* All history will be culled when the next request buffer is
+	/*
+	 * All history will be culled when the next request buffer is
 	 * freed in ptlrpc_service_purge_all()
 	 */
 	svc->srv_hist_nrqbds_cpt_max = 0;
@@ -2748,7 +2787,8 @@ static void ptlrpc_wait_replies(struct ptlrpc_service_part *svcpt)
 		if (!svcpt->scp_service)
 			break;
 
-		/* Unlink all the request buffers.  This forces a 'final'
+		/*
+		 * Unlink all the request buffers.  This forces a 'final'
 		 * event with its 'unlink' flag set for each posted rqbd
 		 */
 		list_for_each_entry(rqbd, &svcpt->scp_rqbd_posted,
@@ -2762,13 +2802,15 @@ static void ptlrpc_wait_replies(struct ptlrpc_service_part *svcpt)
 		if (!svcpt->scp_service)
 			break;
 
-		/* Wait for the network to release any buffers
+		/*
+		 * Wait for the network to release any buffers
 		 * it's currently filling
 		 */
 		spin_lock(&svcpt->scp_lock);
 		while (svcpt->scp_nrqbds_posted != 0) {
 			spin_unlock(&svcpt->scp_lock);
-			/* Network access will complete in finite time but
+			/*
+			 * Network access will complete in finite time but
 			 * the HUGE timeout lets us CWARN for visibility
 			 * of sluggish LNDs
 			 */
@@ -2811,7 +2853,8 @@ static void ptlrpc_wait_replies(struct ptlrpc_service_part *svcpt)
 		}
 		spin_unlock(&svcpt->scp_rep_lock);
 
-		/* purge the request queue.  NB No new replies (rqbds
+		/*
+		 * purge the request queue.  NB No new replies (rqbds
 		 * all unlinked) and no service threads, so I'm the only
 		 * thread noodling the request queue now
 		 */
@@ -2831,12 +2874,14 @@ static void ptlrpc_wait_replies(struct ptlrpc_service_part *svcpt)
 		LASSERT(list_empty(&svcpt->scp_rqbd_posted));
 		LASSERT(svcpt->scp_nreqs_incoming == 0);
 		LASSERT(svcpt->scp_nreqs_active == 0);
-		/* history should have been culled by
+		/*
+		 * history should have been culled by
 		 * ptlrpc_server_finish_request
 		 */
 		LASSERT(svcpt->scp_hist_nrqbds == 0);
 
-		/* Now free all the request buffers since nothing
+		/*
+		 * Now free all the request buffers since nothing
 		 * references them any more...
 		 */
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 283/622] lustre: uapi: fix file heat support
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (281 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 282/622] lustre: ptlrpc: Fix style issues for service.c James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 284/622] lnet: libcfs: poll fail_loc in cfs_fail_timeout_set() James Simmons
                   ` (339 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Change the LL_IOC_HEAT_SET ioctl number assignment to reduce the
number of different values used, since we are running out.  Use
a __u64 as the IOC struct argument instead of a "long" since that
is what is actually passed, and it avoids being CPU-dependent.

Move the LU_HEAT_FLAG_* values into an enum to avoid a generic
"flags" argument in the code.  This makes it clear what is passed.

Clean up code style for lfs_heat_get() and lfs_heat_set().

Fixes: 868c66dca13f ("lustre: llite: add file heat support")
WC-bug-id: https://jira.whamcloud.com/browse/LU-10602
Lustre-commit: ac1f97a88101 ("LU-10602 utils: fix file heat support")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34757
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Yingjin Qian <qian@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c                  | 2 +-
 include/uapi/linux/lustre/lustre_user.h | 8 +++++---
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 76d3b4c..e9d0ff9 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -3193,7 +3193,7 @@ static void ll_heat_get(struct inode *inode, struct lu_heat *heat)
 	spin_unlock(&lli->lli_heat_lock);
 }
 
-static int ll_heat_set(struct inode *inode, u64 flags)
+static int ll_heat_set(struct inode *inode, enum lu_heat_flag flags)
 {
 	struct ll_inode_info *lli = ll_i2info(inode);
 	int rc = 0;
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 03ec680..d52879e 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -354,7 +354,7 @@ struct ll_ioc_lease_id {
 #define LL_IOC_GETPARENT		_IOWR('f', 249, struct getparent)
 #define LL_IOC_LADVISE			_IOR('f', 250, struct llapi_lu_ladvise)
 #define LL_IOC_HEAT_GET			_IOWR('f', 251, struct lu_heat)
-#define LL_IOC_HEAT_SET			_IOW('f', 252, long)
+#define LL_IOC_HEAT_SET			_IOW('f', 251, __u64)
 
 #define LL_STATFS_LMV		1
 #define LL_STATFS_LOV		2
@@ -2010,8 +2010,10 @@ enum lu_heat_flag_bit {
 	LU_HEAT_FLAG_BIT_CLEAR,
 };
 
-#define LU_HEAT_FLAG_CLEAR	(1 << LU_HEAT_FLAG_BIT_CLEAR)
-#define LU_HEAT_FLAG_OFF	(1 << LU_HEAT_FLAG_BIT_OFF)
+enum lu_heat_flag {
+	LU_HEAT_FLAG_OFF	= 1ULL << LU_HEAT_FLAG_BIT_OFF,
+	LU_HEAT_FLAG_CLEAR	= 1ULL << LU_HEAT_FLAG_BIT_CLEAR,
+};
 
 enum obd_heat_type {
 	OBD_HEAT_READSAMPLE	= 0,
-- 
1.8.3.1


* [lustre-devel] [PATCH 284/622] lnet: libcfs: poll fail_loc in cfs_fail_timeout_set()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (282 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 283/622] lustre: uapi: fix file heat support James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 285/622] lustre: obd: round values to nearest MiB for *_mb sysfs files James Simmons
                   ` (338 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Alex Zhuravlev <bzzz@whamcloud.com>

Some internal tests usually take 800-900s, which is almost
half of the whole sanityn test suite run time. 99.(9)% of
the time the tests just wait to ensure the specific order
the operations execute in.

The patch changes cfs_fail_timeout_set() so that it can
interrupt the wait if fail_loc is set to 0; polling with
1/10s frequency is used.

The tests themselves are modified to reset fail_loc. To be
able to do so, both operations (referenced as OP1 and OP2
in the tests) are run in the background. Once they have started,
and the pdo_sched() helper has ensured that the MDS threads got to
the blocking points, we can interrupt OP1 and do the usual checks.

ONLY=40-47 sh sanityn.sh takes: 1017s before and 78s after.

WC-bug-id: https://jira.whamcloud.com/browse/LU-2233
Lustre-commit: 743b85a32e24 ("LU-2233 tests: improve tests sanityn/40-47")
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/4392
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/libcfs/fail.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/net/lnet/libcfs/fail.c b/net/lnet/libcfs/fail.c
index 6ee4de2..40e93b00 100644
--- a/net/lnet/libcfs/fail.c
+++ b/net/lnet/libcfs/fail.c
@@ -131,14 +131,21 @@ int __cfs_fail_check_set(u32 id, u32 value, int set)
 
 int __cfs_fail_timeout_set(u32 id, u32 value, int ms, int set)
 {
+	ktime_t till = ktime_add_ms(ktime_get(), ms);
 	int ret;
 
 	ret = __cfs_fail_check_set(id, value, set);
 	if (ret && likely(ms > 0)) {
-		CERROR("cfs_fail_timeout id %x sleeping for %dms\n",
-		       id, ms);
-		schedule_timeout_uninterruptible(ms * HZ / 1000);
-		CERROR("cfs_fail_timeout id %x awake\n", id);
+		CERROR("cfs_fail_timeout id %x sleeping for %dms\n", id, ms);
+		while (ktime_before(ktime_get(), till)) {
+			schedule_timeout_uninterruptible(HZ / 10);
+			if (!cfs_fail_loc) {
+				CERROR("cfs_fail_timeout interrupted\n");
+				break;
+			}
+		}
+		if (cfs_fail_loc)
+			CERROR("cfs_fail_timeout id %x awake\n", id);
 	}
 	return ret;
 }
-- 
1.8.3.1


* [lustre-devel] [PATCH 285/622] lustre: obd: round values to nearest MiB for *_mb sysfs files
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (283 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 284/622] lnet: libcfs: poll fail_loc in cfs_fail_timeout_set() James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 286/622] lustre: osc: don't check capability for every page James Simmons
                   ` (337 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

Several sysfs files report their settings with the function
lprocfs_read_frac_helper(), which is intended to show fractional
values, e.g. 1.5 MiB. This approach has caused problems with
shells that don't handle fractional representation, and the
values reported don't faithfully represent the original value
the configurator passed into the sysfs file. To resolve this,
let's instead always round the value the configurator passed
into the sysfs file up to the nearest MiB. This way it is
guaranteed that the values reported are always exactly some
MiB value.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11157
Lustre-commit: ba2817fe3ead ("LU-11157 obd: round values to nearest MiB for *_mb syfs files")
Signed-off-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-on: https://review.whamcloud.com/34317
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lprocfs_status.h |  5 ++-
 fs/lustre/llite/llite_internal.h   |  6 +--
 fs/lustre/llite/lproc_llite.c      | 78 +++++++++++++++-----------------------
 fs/lustre/mdc/lproc_mdc.c          |  8 ++--
 fs/lustre/osc/lproc_osc.c          | 21 +++++-----
 5 files changed, 50 insertions(+), 68 deletions(-)

diff --git a/fs/lustre/include/lprocfs_status.h b/fs/lustre/include/lprocfs_status.h
index 8d74822..9f62d4e 100644
--- a/fs/lustre/include/lprocfs_status.h
+++ b/fs/lustre/include/lprocfs_status.h
@@ -63,6 +63,9 @@ static inline unsigned int pct(unsigned long a, unsigned long b)
 	return b ? a * 100 / b : 0;
 }
 
+#define PAGES_TO_MiB(pages)	((pages) >> (20 - PAGE_SHIFT))
+#define MiB_TO_PAGES(mb)	((mb) << (20 - PAGE_SHIFT))
+
 struct lprocfs_static_vars {
 	struct lprocfs_vars		*obd_vars;
 	const struct attribute_group	*sysfs_vars;
@@ -363,8 +366,6 @@ enum {
 
 int lprocfs_write_frac_helper(const char __user *buffer,
 			      unsigned long count, int *val, int mult);
-int lprocfs_read_frac_helper(char *buffer, unsigned long count,
-			     long val, int mult);
 
 int lprocfs_stats_alloc_one(struct lprocfs_stats *stats,
 			    unsigned int cpuid);
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index 9d7345a..eb7e0dc 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -297,18 +297,16 @@ int ll_listsecurity(struct inode *inode, char *secctx_name,
 void ll_inode_size_lock(struct inode *inode);
 void ll_inode_size_unlock(struct inode *inode);
 
-/* FIXME: replace the name of this with LL_I to conform to kernel stuff */
-/* static inline struct ll_inode_info *LL_I(struct inode *inode) */
 static inline struct ll_inode_info *ll_i2info(struct inode *inode)
 {
 	return container_of(inode, struct ll_inode_info, lli_vfs_inode);
 }
 
 /* default to about 64M of readahead on a given system. */
-#define SBI_DEFAULT_READAHEAD_MAX	(64UL << (20 - PAGE_SHIFT))
+#define SBI_DEFAULT_READAHEAD_MAX		MiB_TO_PAGES(64UL)
 
 /* default to read-ahead full files smaller than 2MB on the second read */
-#define SBI_DEFAULT_READAHEAD_WHOLE_MAX (2UL << (20 - PAGE_SHIFT))
+#define SBI_DEFAULT_READAHEAD_WHOLE_MAX		MiB_TO_PAGES(2UL)
 
 enum ra_stat {
 	RA_STAT_HIT = 0,
diff --git a/fs/lustre/llite/lproc_llite.c b/fs/lustre/llite/lproc_llite.c
index cc9f80e..165d37f 100644
--- a/fs/lustre/llite/lproc_llite.c
+++ b/fs/lustre/llite/lproc_llite.c
@@ -326,15 +326,13 @@ static ssize_t max_read_ahead_mb_show(struct kobject *kobj,
 {
 	struct ll_sb_info *sbi = container_of(kobj, struct ll_sb_info,
 					      ll_kset.kobj);
-	long pages_number;
-	int mult;
+	unsigned long ra_max_mb;
 
 	spin_lock(&sbi->ll_lock);
-	pages_number = sbi->ll_ra_info.ra_max_pages;
+	ra_max_mb = PAGES_TO_MiB(sbi->ll_ra_info.ra_max_pages);
 	spin_unlock(&sbi->ll_lock);
 
-	mult = 1 << (20 - PAGE_SHIFT);
-	return lprocfs_read_frac_helper(buf, PAGE_SIZE, pages_number, mult);
+	return scnprintf(buf, PAGE_SIZE, "%lu\n", ra_max_mb);
 }
 
 static ssize_t max_read_ahead_mb_store(struct kobject *kobj,
@@ -344,21 +342,19 @@ static ssize_t max_read_ahead_mb_store(struct kobject *kobj,
 {
 	struct ll_sb_info *sbi = container_of(kobj, struct ll_sb_info,
 					      ll_kset.kobj);
+	u64 ra_max_mb, pages_number;
 	int rc;
-	unsigned long pages_number;
-	int pages_shift;
 
-	pages_shift = 20 - PAGE_SHIFT;
-	rc = kstrtoul(buffer, 10, &pages_number);
+	rc = kstrtoull(buffer, 10, &ra_max_mb);
 	if (rc)
 		return rc;
 
-	pages_number <<= pages_shift; /* MB -> pages */
-
+	pages_number = round_up(ra_max_mb, 1024 * 1024) >> PAGE_SHIFT;
 	if (pages_number > totalram_pages() / 2) {
-		CERROR("%s: can't set max_readahead_mb=%lu > %luMB\n",
-		       sbi->ll_fsname, pages_number >> pages_shift,
-		       totalram_pages() >> (pages_shift + 1)); /*1/2 of RAM*/
+		/* 1/2 of RAM */
+		CERROR("%s: can't set max_readahead_mb=%llu > %luMB\n",
+		       sbi->ll_fsname, PAGES_TO_MiB(pages_number),
+		       PAGES_TO_MiB(totalram_pages()));
 		return -ERANGE;
 	}
 
@@ -376,15 +372,13 @@ static ssize_t max_read_ahead_per_file_mb_show(struct kobject *kobj,
 {
 	struct ll_sb_info *sbi = container_of(kobj, struct ll_sb_info,
 					      ll_kset.kobj);
-	long pages_number;
-	int mult;
+	unsigned long ra_max_file_mb;
 
 	spin_lock(&sbi->ll_lock);
-	pages_number = sbi->ll_ra_info.ra_max_pages_per_file;
+	ra_max_file_mb = PAGES_TO_MiB(sbi->ll_ra_info.ra_max_pages_per_file);
 	spin_unlock(&sbi->ll_lock);
 
-	mult = 1 << (20 - PAGE_SHIFT);
-	return lprocfs_read_frac_helper(buf, PAGE_SIZE, pages_number, mult);
+	return scnprintf(buf, PAGE_SIZE, "%lu\n", ra_max_file_mb);
 }
 
 static ssize_t max_read_ahead_per_file_mb_store(struct kobject *kobj,
@@ -394,22 +388,18 @@ static ssize_t max_read_ahead_per_file_mb_store(struct kobject *kobj,
 {
 	struct ll_sb_info *sbi = container_of(kobj, struct ll_sb_info,
 					      ll_kset.kobj);
+	u64 ra_max_file_mb, pages_number;
 	int rc;
-	unsigned long pages_number;
-	int pages_shift;
 
-	pages_shift = 20 - PAGE_SHIFT;
-	rc = kstrtoul(buffer, 10, &pages_number);
+	rc = kstrtoull(buffer, 10, &ra_max_file_mb);
 	if (rc)
 		return rc;
 
-	pages_number <<= pages_shift; /* MB -> pages */
-
+	pages_number = round_up(ra_max_file_mb, 1024 * 1024) >> PAGE_SHIFT;
 	if (pages_number > sbi->ll_ra_info.ra_max_pages) {
-		CERROR("%s: can't set max_readahead_per_file_mb=%lu > max_read_ahead_mb=%lu\n",
-		       sbi->ll_fsname,
-		       pages_number >> pages_shift,
-		       sbi->ll_ra_info.ra_max_pages >> pages_shift);
+		CERROR("%s: can't set max_readahead_per_file_mb=%llu > max_read_ahead_mb=%lu\n",
+		       sbi->ll_fsname, PAGES_TO_MiB(pages_number),
+		       PAGES_TO_MiB(sbi->ll_ra_info.ra_max_pages));
 		return -ERANGE;
 	}
 
@@ -427,15 +417,13 @@ static ssize_t max_read_ahead_whole_mb_show(struct kobject *kobj,
 {
 	struct ll_sb_info *sbi = container_of(kobj, struct ll_sb_info,
 					      ll_kset.kobj);
-	long pages_number;
-	int mult;
+	unsigned long ra_max_whole_mb;
 
 	spin_lock(&sbi->ll_lock);
-	pages_number = sbi->ll_ra_info.ra_max_read_ahead_whole_pages;
+	ra_max_whole_mb = PAGES_TO_MiB(sbi->ll_ra_info.ra_max_read_ahead_whole_pages);
 	spin_unlock(&sbi->ll_lock);
 
-	mult = 1 << (20 - PAGE_SHIFT);
-	return lprocfs_read_frac_helper(buf, PAGE_SIZE, pages_number, mult);
+	return scnprintf(buf, PAGE_SIZE, "%lu\n", ra_max_whole_mb);
 }
 
 static ssize_t max_read_ahead_whole_mb_store(struct kobject *kobj,
@@ -445,24 +433,21 @@ static ssize_t max_read_ahead_whole_mb_store(struct kobject *kobj,
 {
 	struct ll_sb_info *sbi = container_of(kobj, struct ll_sb_info,
 					      ll_kset.kobj);
+	u64 ra_max_whole_mb, pages_number;
 	int rc;
-	unsigned long pages_number;
-	int pages_shift;
 
-	pages_shift = 20 - PAGE_SHIFT;
-	rc = kstrtoul(buffer, 10, &pages_number);
+	rc = kstrtoull(buffer, 10, &ra_max_whole_mb);
 	if (rc)
 		return rc;
-	pages_number <<= pages_shift; /* MB -> pages */
 
+	pages_number = round_up(ra_max_whole_mb, 1024 * 1024) >> PAGE_SHIFT;
 	/* Cap this at the current max readahead window size, the readahead
 	 * algorithm does this anyway so it's pointless to set it larger.
 	 */
 	if (pages_number > sbi->ll_ra_info.ra_max_pages_per_file) {
-		CERROR("%s: can't set max_read_ahead_whole_mb=%lu > max_read_ahead_per_file_mb=%lu\n",
-		       sbi->ll_fsname,
-		       pages_number >> pages_shift,
-		       sbi->ll_ra_info.ra_max_pages_per_file >> pages_shift);
+		CERROR("%s: can't set max_read_ahead_whole_mb=%llu > max_read_ahead_per_file_mb=%lu\n",
+		       sbi->ll_fsname, PAGES_TO_MiB(pages_number),
+		       PAGES_TO_MiB(sbi->ll_ra_info.ra_max_pages_per_file));
 		return -ERANGE;
 	}
 
@@ -479,12 +464,11 @@ static int ll_max_cached_mb_seq_show(struct seq_file *m, void *v)
 	struct super_block *sb = m->private;
 	struct ll_sb_info *sbi = ll_s2sbi(sb);
 	struct cl_client_cache *cache = sbi->ll_cache;
-	int shift = 20 - PAGE_SHIFT;
 	long max_cached_mb;
 	long unused_mb;
 
-	max_cached_mb = cache->ccc_lru_max >> shift;
-	unused_mb = atomic_long_read(&cache->ccc_lru_left) >> shift;
+	max_cached_mb = PAGES_TO_MiB(cache->ccc_lru_max);
+	unused_mb = PAGES_TO_MiB(atomic_long_read(&cache->ccc_lru_left));
 	seq_printf(m,
 		   "users: %d\n"
 		   "max_cached_mb: %ld\n"
@@ -538,7 +522,7 @@ static ssize_t ll_max_cached_mb_seq_write(struct file *file,
 	if (pages_number < 0 || pages_number > totalram_pages()) {
 		CERROR("%s: can't set max cache more than %lu MB\n",
 		       sbi->ll_fsname,
-		       totalram_pages() >> (20 - PAGE_SHIFT));
+		       PAGES_TO_MiB(totalram_pages()));
 		return -ERANGE;
 	}
 	/* Allow enough cache so clients can make well-formed RPCs */
diff --git a/fs/lustre/mdc/lproc_mdc.c b/fs/lustre/mdc/lproc_mdc.c
index 81167bbd..454b69d 100644
--- a/fs/lustre/mdc/lproc_mdc.c
+++ b/fs/lustre/mdc/lproc_mdc.c
@@ -47,7 +47,7 @@ static int mdc_max_dirty_mb_seq_show(struct seq_file *m, void *v)
 	unsigned long val;
 
 	spin_lock(&cli->cl_loi_list_lock);
-	val = cli->cl_dirty_max_pages >> (20 - PAGE_SHIFT);
+	val = PAGES_TO_MiB(cli->cl_dirty_max_pages);
 	spin_unlock(&cli->cl_loi_list_lock);
 
 	seq_printf(m, "%lu\n", val);
@@ -69,10 +69,10 @@ static ssize_t mdc_max_dirty_mb_seq_write(struct file *file,
 	if (rc)
 		return rc;
 
-	pages_number >>= PAGE_SHIFT;
-
+	/* MB -> pages */
+	pages_number = round_up(pages_number, 1024 * 1024) >> PAGE_SHIFT;
 	if (pages_number <= 0 ||
-	    pages_number >= OSC_MAX_DIRTY_MB_MAX << (20 - PAGE_SHIFT) ||
+	    pages_number >= MiB_TO_PAGES(OSC_MAX_DIRTY_MB_MAX) ||
 	    pages_number > totalram_pages() / 4) /* 1/4 of RAM */
 		return -ERANGE;
 
diff --git a/fs/lustre/osc/lproc_osc.c b/fs/lustre/osc/lproc_osc.c
index 5faf518..775bf74 100644
--- a/fs/lustre/osc/lproc_osc.c
+++ b/fs/lustre/osc/lproc_osc.c
@@ -134,12 +134,13 @@ static ssize_t max_dirty_mb_show(struct kobject *kobj,
 	struct obd_device *dev = container_of(kobj, struct obd_device,
 					      obd_kset.kobj);
 	struct client_obd *cli = &dev->u.cli;
-	long val;
-	int mult;
+	unsigned long val;
 
-	val = cli->cl_dirty_max_pages;
-	mult = 1 << (20 - PAGE_SHIFT);
-	return lprocfs_read_frac_helper(buf, PAGE_SIZE, val, mult);
+	spin_lock(&cli->cl_loi_list_lock);
+	val = PAGES_TO_MiB(cli->cl_dirty_max_pages);
+	spin_unlock(&cli->cl_loi_list_lock);
+
+	return scnprintf(buf, PAGE_SIZE, "%lu\n", val);
 }
 
 static ssize_t max_dirty_mb_store(struct kobject *kobj,
@@ -150,17 +151,15 @@ static ssize_t max_dirty_mb_store(struct kobject *kobj,
 	struct obd_device *dev = container_of(kobj, struct obd_device,
 					      obd_kset.kobj);
 	struct client_obd *cli = &dev->u.cli;
-	unsigned long pages_number;
+	unsigned long pages_number, max_dirty_mb;
 	int rc;
 
-	rc = kstrtoul(buffer, 10, &pages_number);
+	rc = kstrtoul(buffer, 10, &max_dirty_mb);
 	if (rc)
 		return rc;
 
-	pages_number *= 1 << (20 - PAGE_SHIFT); /* MB -> pages */
-
-	if (pages_number <= 0 ||
-	    pages_number >= OSC_MAX_DIRTY_MB_MAX << (20 - PAGE_SHIFT) ||
+	pages_number = MiB_TO_PAGES(max_dirty_mb);
+	if (pages_number >= MiB_TO_PAGES(OSC_MAX_DIRTY_MB_MAX) ||
 	    pages_number > totalram_pages() / 4) /* 1/4 of RAM */
 		return -ERANGE;
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 286/622] lustre: osc: don't check capability for every page
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (284 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 285/622] lustre: obd: round values to nearest MiB for *_mb sysfs files James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 287/622] lustre: statahead: sa_handle_callback get lli_sa_lock earlier James Simmons
                   ` (336 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Li Dongyang <dongyangli@ddn.com>

We check CFS_CAP_SYS_RESOURCE for every page during the I/O. This is
expensive on AppArmor-enabled systems; we need only do the check once
for the entire I/O and reuse the result when submitting the pages.

Don't init the oap_brw_flags during osc_page_init(); the flag
will be set in either osc_queue_async_io() or osc_page_submit().
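The once-per-I/O pattern above can be sketched as follows. This is an
illustrative sketch, not Lustre code: io_ctx, check_capability() and
BRW_NOQUOTA are hypothetical stand-ins for the patch's
oi_cap_sys_resource bit, capable(CAP_SYS_RESOURCE) and OBD_BRW_NOQUOTA.

```c
#include <assert.h>
#include <stdbool.h>

#define BRW_NOQUOTA 0x1

struct io_ctx {
	bool cap_sys_resource;	/* capability checked once at io init */
};

/* stand-in for capable(CAP_SYS_RESOURCE); expensive on LSM-enabled kernels */
static bool check_capability(void)
{
	return true;
}

static void io_init(struct io_ctx *io)
{
	/* one capability check for the whole io */
	io->cap_sys_resource = check_capability();
}

static int page_submit_flags(const struct io_ctx *io)
{
	/* the per-page path only reads the cached bit */
	return io->cap_sys_resource ? BRW_NOQUOTA : 0;
}
```

The point of the design is that the (potentially slow) security-module
call moves out of the per-page hot path into the I/O setup path.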

WC-bug-id: https://jira.whamcloud.com/browse/LU-12093
Lustre-commit: c1cab789aaa2 ("LU-12093 osc: don't check capability for every page")
Signed-off-by: Li Dongyang <dongyangli@ddn.com>
Reviewed-on: https://review.whamcloud.com/34478
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Gu Zheng <gzheng@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_osc.h | 4 +++-
 fs/lustre/osc/osc_cache.c      | 5 +----
 fs/lustre/osc/osc_io.c         | 6 ++++--
 fs/lustre/osc/osc_page.c       | 5 +++--
 4 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/fs/lustre/include/lustre_osc.h b/fs/lustre/include/lustre_osc.h
index aa3d4c3..1c5af80 100644
--- a/fs/lustre/include/lustre_osc.h
+++ b/fs/lustre/include/lustre_osc.h
@@ -139,7 +139,9 @@ struct osc_io {
 	/* true if this io is lockless. */
 	unsigned int		oi_lockless:1,
 	/* true if this io is counted as active IO */
-				oi_is_active:1;
+				oi_is_active:1,
+	/** true if this io has CAP_SYS_RESOURCE */
+				oi_cap_sys_resource:1;
 	/* how many LRU pages are reserved for this IO */
 	unsigned long		oi_lru_reserved;
 
diff --git a/fs/lustre/osc/osc_cache.c b/fs/lustre/osc/osc_cache.c
index bdaf65f..a02adac 100644
--- a/fs/lustre/osc/osc_cache.c
+++ b/fs/lustre/osc/osc_cache.c
@@ -2283,9 +2283,6 @@ int osc_prep_async_page(struct osc_object *osc, struct osc_page *ops,
 	oap->oap_obj_off = offset;
 	LASSERT(!(offset & ~PAGE_MASK));
 
-	if (capable(CAP_SYS_RESOURCE))
-		oap->oap_brw_flags = OBD_BRW_NOQUOTA;
-
 	INIT_LIST_HEAD(&oap->oap_pending_item);
 	INIT_LIST_HEAD(&oap->oap_rpc_item);
 
@@ -2324,7 +2321,7 @@ int osc_queue_async_io(const struct lu_env *env, struct cl_io *io,
 
 	/* Set the OBD_BRW_SRVLOCK before the page is queued. */
 	brw_flags |= ops->ops_srvlock ? OBD_BRW_SRVLOCK : 0;
-	if (capable(CAP_SYS_RESOURCE)) {
+	if (oio->oi_cap_sys_resource) {
 		brw_flags |= OBD_BRW_NOQUOTA;
 		cmd |= OBD_BRW_NOQUOTA;
 	}
diff --git a/fs/lustre/osc/osc_io.c b/fs/lustre/osc/osc_io.c
index 76657f3..dfdf064 100644
--- a/fs/lustre/osc/osc_io.c
+++ b/fs/lustre/osc/osc_io.c
@@ -357,18 +357,20 @@ int osc_io_iter_init(const struct lu_env *env, const struct cl_io_slice *ios)
 {
 	struct osc_object *osc = cl2osc(ios->cis_obj);
 	struct obd_import *imp = osc_cli(osc)->cl_import;
+	struct osc_io *oio = osc_env_io(env);
 	int rc = -EIO;
 
 	spin_lock(&imp->imp_lock);
 	if (likely(!imp->imp_invalid)) {
-		struct osc_io *oio = osc_env_io(env);
-
 		atomic_inc(&osc->oo_nr_ios);
 		oio->oi_is_active = 1;
 		rc = 0;
 	}
 	spin_unlock(&imp->imp_lock);
 
+	if (capable(CAP_SYS_RESOURCE))
+		oio->oi_cap_sys_resource = 1;
+
 	return rc;
 }
 EXPORT_SYMBOL(osc_io_iter_init);
diff --git a/fs/lustre/osc/osc_page.c b/fs/lustre/osc/osc_page.c
index 7382e0d..0910f3a 100644
--- a/fs/lustre/osc/osc_page.c
+++ b/fs/lustre/osc/osc_page.c
@@ -302,6 +302,7 @@ int osc_page_init(const struct lu_env *env, struct cl_object *obj,
 void osc_page_submit(const struct lu_env *env, struct osc_page *opg,
 		     enum cl_req_type crt, int brw_flags)
 {
+	struct osc_io *oio = osc_env_io(env);
 	struct osc_async_page *oap = &opg->ops_oap;
 
 	LASSERTF(oap->oap_magic == OAP_MAGIC,
@@ -313,9 +314,9 @@ void osc_page_submit(const struct lu_env *env, struct osc_page *opg,
 	oap->oap_cmd = crt == CRT_WRITE ? OBD_BRW_WRITE : OBD_BRW_READ;
 	oap->oap_page_off = opg->ops_from;
 	oap->oap_count = opg->ops_to - opg->ops_from;
-	oap->oap_brw_flags = brw_flags | OBD_BRW_SYNC;
+	oap->oap_brw_flags = OBD_BRW_SYNC | brw_flags;
 
-	if (capable(CAP_SYS_RESOURCE)) {
+	if (oio->oi_cap_sys_resource) {
 		oap->oap_brw_flags |= OBD_BRW_NOQUOTA;
 		oap->oap_cmd |= OBD_BRW_NOQUOTA;
 	}
-- 
1.8.3.1


* [lustre-devel] [PATCH 287/622] lustre: statahead: sa_handle_callback get lli_sa_lock earlier
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (285 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 286/622] lustre: osc: don't check capability for every page James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 288/622] lnet: use number of wrs to calculate CQEs James Simmons
                   ` (335 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Ann Koehler <amk@cray.com>

sa_handle_callback() must acquire the lli_sa_lock before calling
sa_has_callback(), which checks whether the sai_interim_entries list is
empty. Acquiring the lock avoids a race between an RPC handler
executing ll_statahead_interpret() and the separate ll_statahead_thread().

When a client receives a stat request response, ll_statahead_interpret
increments sai_replied and if needed adds the request to the
sai_interim_entries list for instantiation by the ll_statahead_thread.
ll_statahead_interpret() holds the lli_sa_lock while doing this work.
On process termination, ll_statahead_thread() waits for sai_sent to
equal sai_replied and then removes any entries in the
sai_interim_entries list. It does not get the lli_sa_lock until
it determines that there are sai_interim_entries to process.

A bug occurs on weak memory model processors, which do not guarantee
that both ll_statahead_interpret() updates done under the lock become
visible to other processors at the same time. For example, on ARM
nodes, an ll_statahead_thread can read the updated value of
sai_replied and a stale value of sai_interim_entries.
ll_statahead_thread then thinks all replies have been received (true)
and all sai_interim_entries have been processed (false). Later, the
update to sai_interim_entries becomes visible, leaving the
ll_statahead_info struct in an unexpected state.

The bad state eventually triggers the LBUG:
statahead.c:477:ll_sai_put()) ASSERTION( !sa_has_callback(sai) )
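The fixed pattern can be modeled with ordinary pthreads: the emptiness
check and the list removal happen under the same lock, so a consumer can
never observe "replied" updated without also observing the queued entry.
This is an illustrative sketch with hypothetical names, not the Lustre
code itself; the lock is dropped only around the expensive
"instantiate" step, mirroring how sa_handle_callback() drops
lli_sa_lock around sa_instantiate().

```c
#include <pthread.h>
#include <assert.h>

struct node { struct node *next; };

struct queue {
	pthread_mutex_t lock;
	struct node *head;	/* models sai_interim_entries */
	int handled;
};

static void handle_all(struct queue *q)
{
	pthread_mutex_lock(&q->lock);
	while (q->head) {			/* checked while holding lock */
		struct node *n = q->head;

		q->head = n->next;
		pthread_mutex_unlock(&q->lock);
		q->handled++;			/* "instantiate" outside lock */
		pthread_mutex_lock(&q->lock);
	}
	pthread_mutex_unlock(&q->lock);
}
```

Because the check is always made under the lock, the weak-memory
reordering described above cannot leave entries behind.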

Cray-bug-id: LUS-6243
WC-bug-id: https://jira.whamcloud.com/browse/LU-12221
Lustre-commit: 31ef093c2197 ("LU-12221 statahead: sa_handle_callback get lli_sa_lock earlier")
Signed-off-by: Ann Koehler <amk@cray.com>
Reviewed-on: https://review.whamcloud.com/34760
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/statahead.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/lustre/llite/statahead.c b/fs/lustre/llite/statahead.c
index 7dfb045..497aba3 100644
--- a/fs/lustre/llite/statahead.c
+++ b/fs/lustre/llite/statahead.c
@@ -688,21 +688,19 @@ static void sa_handle_callback(struct ll_statahead_info *sai)
 
 	lli = ll_i2info(sai->sai_dentry->d_inode);
 
+	spin_lock(&lli->lli_sa_lock);
 	while (sa_has_callback(sai)) {
 		struct sa_entry *entry;
 
-		spin_lock(&lli->lli_sa_lock);
-		if (unlikely(!sa_has_callback(sai))) {
-			spin_unlock(&lli->lli_sa_lock);
-			break;
-		}
 		entry = list_first_entry(&sai->sai_interim_entries,
 					 struct sa_entry, se_list);
 		list_del_init(&entry->se_list);
 		spin_unlock(&lli->lli_sa_lock);
 
 		sa_instantiate(sai, entry);
+		spin_lock(&lli->lli_sa_lock);
 	}
+	spin_unlock(&lli->lli_sa_lock);
 }
 
 /*
-- 
1.8.3.1


* [lustre-devel] [PATCH 288/622] lnet: use number of wrs to calculate CQEs
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (286 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 287/622] lustre: statahead: sa_handle_callback get lli_sa_lock earlier James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 289/622] lustre: ldlm: Fix style issues for ldlm_resource.c James Simmons
                   ` (334 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Using concurrent sends to calculate the number of CQEs results in too
few CQEs, which exposes an issue under failure scenarios: for example,
when a node reboots there wouldn't be enough CQEs available, leading
to IB_EVENT_QP_FATAL.
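The sizing change can be sketched as below: the completion queue is
sized from the number of work requests actually posted (receive WRs
plus the worst-case send WRs), not from the concurrent-sends tunable.
All names and numbers here are illustrative stand-ins for
IBLND_RECV_WRS()/kiblnd_send_wrs(), not the real macros.

```c
#include <assert.h>

static int recv_wrs(int rx_msgs)
{
	/* one receive WR per posted receive buffer */
	return rx_msgs;
}

/* worst case: every queued send WR may complete while the peer is down */
static int send_wrs(int queue_depth, int frags_per_msg)
{
	return queue_depth * frags_per_msg;
}

static int cq_entries(int rx_msgs, int queue_depth, int frags_per_msg)
{
	/* CQ must be able to absorb a completion for every posted WR */
	return recv_wrs(rx_msgs) + send_wrs(queue_depth, frags_per_msg);
}
```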

Fixes: b61010ddf672 ("lnet: lnd: bring back concurrent_sends")
WC-bug-id: https://jira.whamcloud.com/browse/LU-12279
Lustre-commit: 24294b843f79 ("LU-12279 lnet: use number of wrs to calculate CQEs")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34945
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/o2iblnd/o2iblnd.h | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.h b/net/lnet/klnds/o2iblnd/o2iblnd.h
index eb80d5e..2f7ca52 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.h
@@ -136,9 +136,7 @@ struct kib_tunables {
 /* WRs and CQEs (per connection) */
 #define IBLND_RECV_WRS(c)	IBLND_RX_MSGS(c)
 
-#define IBLND_CQ_ENTRIES(c)	\
-	(IBLND_RECV_WRS(c) + 2 * kiblnd_concurrent_sends(c->ibc_version, \
-							 c->ibc_peer->ibp_ni))
+#define IBLND_CQ_ENTRIES(c) (IBLND_RECV_WRS(c) + kiblnd_send_wrs(c))
 
 struct kib_hca_dev;
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 289/622] lustre: ldlm: Fix style issues for ldlm_resource.c
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (287 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 288/622] lnet: use number of wrs to calculate CQEs James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 290/622] lustre: ptlrpc: Fix style issues for sec_gc.c James Simmons
                   ` (333 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Arshad Hussain <arshad.super@gmail.com>

This patch fixes issues reported by checkpatch
for file fs/lustre/ldlm/ldlm_resource.c

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: d7627feb4594 ("LU-6142 ldlm: Fix style issues for ldlm_resource.c")
Signed-off-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-on: https://review.whamcloud.com/34492
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Ben Evans <bevans@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_resource.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/ldlm/ldlm_resource.c b/fs/lustre/ldlm/ldlm_resource.c
index 59b17b5..14e03bc 100644
--- a/fs/lustre/ldlm/ldlm_resource.c
+++ b/fs/lustre/ldlm/ldlm_resource.c
@@ -443,7 +443,7 @@ struct ldlm_resource *ldlm_resource_getref(struct ldlm_resource *res)
 static unsigned int ldlm_res_hop_hash(struct cfs_hash *hs,
 				      const void *key, unsigned int mask)
 {
-	const struct ldlm_res_id *id  = key;
+	const struct ldlm_res_id *id = key;
 	unsigned int val = 0;
 	unsigned int i;
 
@@ -627,7 +627,7 @@ struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
 		return NULL;
 	}
 
-	for (idx = 0;; idx++) {
+	for (idx = 0; ; idx++) {
 		nsd = &ldlm_ns_hash_defs[idx];
 		if (nsd->nsd_type == LDLM_NS_TYPE_UNKNOWN) {
 			CERROR("Unknown type %d for ns %s\n", ns_type, name);
@@ -770,7 +770,8 @@ static void cleanup_resource(struct ldlm_resource *res, struct list_head *q,
 			ldlm_set_local_only(lock);
 
 		if (local_only && (lock->l_readers || lock->l_writers)) {
-			/* This is a little bit gross, but much better than the
+			/*
+			 * This is a little bit gross, but much better than the
 			 * alternative: pretend that we got a blocking AST from
 			 * the server, so that when the lock is decref'd, it
 			 * will go away ...
-- 
1.8.3.1


* [lustre-devel] [PATCH 290/622] lustre: ptlrpc: Fix style issues for sec_gc.c
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (288 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 289/622] lustre: ldlm: Fix style issues for ldlm_resource.c James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 291/622] lustre: ptlrpc: Fix style issues for llog_client.c James Simmons
                   ` (332 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Arshad Hussain <arshad.super@gmail.com>

This patch fixes issues reported by checkpatch for
file fs/lustre/ptlrpc/sec_gc.c

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: 930d88e71d16 ("LU-6142 ptlrpc: Fix style issues for sec_gc.c")
Signed-off-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-on: https://review.whamcloud.com/34551
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/sec_gc.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/ptlrpc/sec_gc.c b/fs/lustre/ptlrpc/sec_gc.c
index 3baed8c..36ac319 100644
--- a/fs/lustre/ptlrpc/sec_gc.c
+++ b/fs/lustre/ptlrpc/sec_gc.c
@@ -147,7 +147,8 @@ static void sec_gc_main(struct work_struct *ws)
 
 	sec_process_ctx_list();
 again:
-	/* go through sec list do gc.
+	/*
+	 * go through sec list do gc.
 	 * FIXME here we iterate through the whole list each time which
 	 * is not optimal. we perhaps want to use balanced binary tree
 	 * to trace each sec as order of expiry time.
@@ -156,7 +157,8 @@ static void sec_gc_main(struct work_struct *ws)
 	 */
 	mutex_lock(&sec_gc_mutex);
 	list_for_each_entry(sec, &sec_gc_list, ps_gc_list) {
-		/* if someone is waiting to be deleted, let it
+		/*
+		 * if someone is waiting to be deleted, let it
 		 * proceed as soon as possible.
 		 */
 		if (atomic_read(&sec_gc_wait_del)) {
-- 
1.8.3.1


* [lustre-devel] [PATCH 291/622] lustre: ptlrpc: Fix style issues for llog_client.c
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (289 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 290/622] lustre: ptlrpc: Fix style issues for sec_gc.c James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 292/622] lustre: dne: allow access to striped dir with broken layout James Simmons
                   ` (331 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Arshad Hussain <arshad.super@gmail.com>

This patch fixes issues reported by checkpatch
for file fs/lustre/ptlrpc/llog_client.c

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: b0372d346200 ("LU-6142 ptlrpc: Fix style issues for llog_client.c")
Signed-off-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-on: https://review.whamcloud.com/34900
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/llog_client.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/fs/lustre/ptlrpc/llog_client.c b/fs/lustre/ptlrpc/llog_client.c
index e5ff080..ff1ca36 100644
--- a/fs/lustre/ptlrpc/llog_client.c
+++ b/fs/lustre/ptlrpc/llog_client.c
@@ -55,7 +55,7 @@
 		       ctxt->loc_idx);					\
 		imp = NULL;						\
 		mutex_unlock(&ctxt->loc_mutex);				\
-		return (-EINVAL);					\
+		return -EINVAL;						\
 	}								\
 	mutex_unlock(&ctxt->loc_mutex);					\
 } while (0)
@@ -64,12 +64,13 @@
 	mutex_lock(&ctxt->loc_mutex);					\
 	if (ctxt->loc_imp != imp)					\
 		CWARN("loc_imp has changed from %p to %p\n",		\
-		       ctxt->loc_imp, imp);				\
+		      ctxt->loc_imp, imp);				\
 	class_import_put(imp);						\
 	mutex_unlock(&ctxt->loc_mutex);					\
 } while (0)
 
-/* This is a callback from the llog_* functions.
+/*
+ * This is a callback from the llog_* functions.
  * Assumes caller has already pushed us into the kernel context.
  */
 static int llog_client_open(const struct lu_env *env,
@@ -171,7 +172,8 @@ static int llog_client_next_block(const struct lu_env *env,
 	req_capsule_set_size(&req->rq_pill, &RMF_EADATA, RCL_SERVER, len);
 	ptlrpc_request_set_replen(req);
 	rc = ptlrpc_queue_wait(req);
-	/* -EIO has a special meaning here. If llog_osd_next_block()
+	/*
+	 * -EIO has a special meaning here. If llog_osd_next_block()
 	 * reaches the end of the log without finding the desired
 	 * record then it updates *cur_offset and *cur_idx and returns
 	 * -EIO. In llog_process_thread() we use this to detect
@@ -338,8 +340,9 @@ static int llog_client_read_header(const struct lu_env *env,
 static int llog_client_close(const struct lu_env *env,
 			     struct llog_handle *handle)
 {
-	/* this doesn't call LLOG_ORIGIN_HANDLE_CLOSE because
-	 *  the servers all close the file at the end of every
+	/*
+	 * this doesn't call LLOG_ORIGIN_HANDLE_CLOSE because
+	 * the servers all close the file at the end of every
 	 * other LLOG_ RPC.
 	 */
 	return 0;
-- 
1.8.3.1


* [lustre-devel] [PATCH 292/622] lustre: dne: allow access to striped dir with broken layout
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (290 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 291/622] lustre: ptlrpc: Fix style issues for llog_client.c James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 293/622] lustre: ptlrpc: ocd_connect_flags are wrong during reconnect James Simmons
                   ` (330 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

Sometimes the layout of striped directories may become broken:
* creation/unlink is partially executed on some MDT.
* a disk failure or stopped MDS makes some stripe inaccessible.
* software bugs.

In this situation, the directory should still be accessible, and in
particular it should still be possible to migrate it to other active
MDTs.

This patch adds this support on both server and client: don't assume
the stripe FID is sane, and when a stripe doesn't exist, skip it.
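The skip-broken-stripes rule used throughout the patch can be sketched
as follows. fid_is_sane() is the real Lustre predicate name; the struct
layout, count_valid_stripes() and its callers are hypothetical
simplifications of the loops in lmv_revalidate_slaves() and
lmv_unpack_md_v1().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct fid { unsigned long long seq; };

/* a zero sequence stands in for an insane/unset FID */
static bool fid_is_sane(const struct fid *f)
{
	return f->seq != 0;
}

static int count_valid_stripes(const struct fid *stripes, int n)
{
	int valid = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (!fid_is_sane(&stripes[i]))
			continue;	/* broken stripe: skip, don't fail */
		valid++;
	}
	/* only if every stripe is broken does the whole lookup fail */
	return valid ? valid : -ENOENT;
}
```

This mirrors the patch's valid_stripe_count logic: individual broken
stripes are tolerated, and -ENOENT is returned only when no stripe at
all is usable.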

WC-bug-id: https://jira.whamcloud.com/browse/LU-11907
Lustre-commit: d2725563e7af ("LU-11907 dne: allow access to striped dir with broken layout")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34750
Reviewed-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/dir.c       | 17 ++++++++++-------
 fs/lustre/llite/llite_lib.c |  4 ++++
 fs/lustre/lmv/lmv_intent.c  | 16 ++++++++++++++++
 fs/lustre/lmv/lmv_obd.c     | 27 ++++++++++++++++++++++++---
 4 files changed, 54 insertions(+), 10 deletions(-)

diff --git a/fs/lustre/llite/dir.c b/fs/lustre/llite/dir.c
index fd7cd2d..f75183b 100644
--- a/fs/lustre/llite/dir.c
+++ b/fs/lustre/llite/dir.c
@@ -321,7 +321,7 @@ static int ll_readdir(struct file *filp, struct dir_context *ctx)
 		 */
 		if (file_dentry(filp)->d_parent &&
 		    file_dentry(filp)->d_parent->d_inode) {
-			u64 ibits = MDS_INODELOCK_UPDATE;
+			u64 ibits = MDS_INODELOCK_LOOKUP;
 			struct inode *parent;
 
 			parent = file_dentry(filp)->d_parent->d_inode;
@@ -1551,13 +1551,16 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 			struct lu_fid fid;
 
 			fid_le_to_cpu(&fid, &lmm->lmv_md_v1.lmv_stripe_fids[i]);
-			mdt_index = ll_get_mdt_idx_by_fid(sbi, &fid);
-			if (mdt_index < 0) {
-				rc = mdt_index;
-				goto out_tmp;
+			if (fid_is_sane(&fid)) {
+				mdt_index = ll_get_mdt_idx_by_fid(sbi, &fid);
+				if (mdt_index < 0) {
+					rc = mdt_index;
+					goto out_tmp;
+				}
+				tmp->lum_objects[i].lum_mds = mdt_index;
+				tmp->lum_objects[i].lum_fid = fid;
 			}
-			tmp->lum_objects[i].lum_mds = mdt_index;
-			tmp->lum_objects[i].lum_fid = fid;
+
 			tmp->lum_stripe_count++;
 		}
 
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 99cedcf..ba477ad 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -1279,6 +1279,10 @@ static int ll_init_lsm_md(struct inode *inode, struct lustre_md *md)
 	for (i = 0; i < lsm->lsm_md_stripe_count; i++) {
 		fid = &lsm->lsm_md_oinfo[i].lmo_fid;
 		LASSERT(!lsm->lsm_md_oinfo[i].lmo_root);
+
+		if (!fid_is_sane(fid))
+			continue;
+
 		/* Unfortunately ll_iget will call ll_update_inode,
 		 * where the initialization of slave inode is slightly
 		 * different, so it reset lsm_md to NULL to avoid
diff --git a/fs/lustre/lmv/lmv_intent.c b/fs/lustre/lmv/lmv_intent.c
index 84a21a0..ba14e7c 100644
--- a/fs/lustre/lmv/lmv_intent.c
+++ b/fs/lustre/lmv/lmv_intent.c
@@ -162,6 +162,7 @@ int lmv_revalidate_slaves(struct obd_export *exp,
 	struct ptlrpc_request *req = NULL;
 	struct mdt_body *body;
 	struct md_op_data *op_data;
+	int valid_stripe_count = 0;
 	int rc = 0, i;
 
 	/**
@@ -186,6 +187,9 @@ int lmv_revalidate_slaves(struct obd_export *exp,
 		fid = lsm->lsm_md_oinfo[i].lmo_fid;
 		inode = lsm->lsm_md_oinfo[i].lmo_root;
 
+		if (!inode)
+			continue;
+
 		/*
 		 * Prepare op_data for revalidating. Note that @fid2 shluld be
 		 * defined otherwise it will go to server and take new lock
@@ -211,6 +215,12 @@ int lmv_revalidate_slaves(struct obd_export *exp,
 
 		rc = md_intent_lock(tgt->ltd_exp, op_data, &it, &req,
 				    cb_blocking, extra_lock_flags);
+		if (rc == -ENOENT) {
+			/* skip stripe is not exists */
+			rc = 0;
+			continue;
+		}
+
 		if (rc < 0)
 			goto cleanup;
 
@@ -249,12 +259,18 @@ int lmv_revalidate_slaves(struct obd_export *exp,
 			ldlm_lock_decref(lockh, it.it_lock_mode);
 			it.it_lock_mode = 0;
 		}
+
+		valid_stripe_count++;
 	}
 
 cleanup:
 	if (req)
 		ptlrpc_req_finished(req);
 
+	/* if all stripes are invalid, return -ENOENT to notify user */
+	if (!rc && !valid_stripe_count)
+		rc = -ENOENT;
+
 	kfree(op_data);
 	return rc;
 }
diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index dc4bd1e..4b5bd36 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -2398,6 +2398,11 @@ static struct lu_dirent *stripe_dirent_load(struct lmv_dir_ctxt *ctxt,
 		}
 
 		oinfo = &op_data->op_mea1->lsm_md_oinfo[stripe_index];
+		if (!oinfo->lmo_root) {
+			rc = -ENOENT;
+			break;
+		}
+
 		tgt = lmv_get_target(ctxt->ldc_lmv, oinfo->lmo_mds, NULL);
 		if (IS_ERR(tgt)) {
 			rc = PTR_ERR(tgt);
@@ -2953,10 +2958,22 @@ static int lmv_unpack_md_v1(struct obd_export *exp, struct lmv_stripe_md *lsm,
 	for (i = 0; i < stripe_count; i++) {
 		fid_le_to_cpu(&lsm->lsm_md_oinfo[i].lmo_fid,
 			      &lmm1->lmv_stripe_fids[i]);
+		/*
+		 * set default value -1, so lmv_locate_tgt() knows this stripe
+		 * target is not initialized.
+		 */
+		lsm->lsm_md_oinfo[i].lmo_mds = (u32)-1;
+		if (!fid_is_sane(&lsm->lsm_md_oinfo[i].lmo_fid))
+			continue;
+
 		rc = lmv_fld_lookup(lmv, &lsm->lsm_md_oinfo[i].lmo_fid,
 				    &lsm->lsm_md_oinfo[i].lmo_mds);
+		if (rc == -ENOENT)
+			continue;
+
 		if (rc)
 			return rc;
+
 		CDEBUG(D_INFO, "unpack fid #%d " DFID "\n", i,
 		       PFID(&lsm->lsm_md_oinfo[i].lmo_fid));
 	}
@@ -2988,9 +3005,10 @@ static int lmv_unpackmd(struct obd_export *exp, struct lmv_stripe_md **lsmp,
 			return 0;
 		}
 
-		for (i = 0; i < lsm->lsm_md_stripe_count; i++)
-			iput(lsm->lsm_md_oinfo[i].lmo_root);
-
+		for (i = 0; i < lsm->lsm_md_stripe_count; i++) {
+			if (lsm->lsm_md_oinfo[i].lmo_root)
+				iput(lsm->lsm_md_oinfo[i].lmo_root);
+		}
 		kvfree(lsm);
 		*lsmp = NULL;
 		return 0;
@@ -3334,6 +3352,9 @@ static int lmv_merge_attr(struct obd_export *exp,
 	for (i = 0; i < lsm->lsm_md_stripe_count; i++) {
 		struct inode *inode = lsm->lsm_md_oinfo[i].lmo_root;
 
+		if (!inode)
+			continue;
+
 		CDEBUG(D_INFO,
 		       "" DFID " size %llu, blocks %llu nlink %u, atime %lld ctime %lld, mtime %lld.\n",
 		       PFID(&lsm->lsm_md_oinfo[i].lmo_fid),
-- 
1.8.3.1


* [lustre-devel] [PATCH 293/622] lustre: ptlrpc: ocd_connect_flags are wrong during reconnect
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (291 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 292/622] lustre: dne: allow access to striped dir with broken layout James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 294/622] lnet: libcfs: fix panic for too large cpu partitions James Simmons
                   ` (329 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Andriy Skulysh <c17819@cray.com>

Import connect flags are reset to the originally requested ones during
reconnect, so a request can be created with features the server does
not support.

Use a separate obd_connect_data to send the connect request.
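The copy-before-reset idea can be sketched as below: the wire request is
built from a stack copy of the connect data, so resetting the flags for
the request never clobbers the currently negotiated import state. The
struct fields and build_connect_request() are hypothetical
simplifications of imp_connect_data/imp_connect_flags_orig in
ptlrpc_connect_import().

```c
#include <assert.h>

struct connect_data { unsigned long long flags; };

struct import {
	struct connect_data negotiated;	/* what the server last granted */
	unsigned long long flags_orig;	/* what we originally asked for */
};

static struct connect_data build_connect_request(const struct import *imp)
{
	struct connect_data ocd = imp->negotiated;	/* copy, not alias */

	/* request the full feature set again on the wire only */
	ocd.flags = imp->flags_orig;
	return ocd;
}
```

With the copy, code that consults the import between sending the
request and receiving the reply still sees the negotiated flags.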

Cray-bug-id: LUS-6397
WC-bug-id: https://jira.whamcloud.com/browse/LU-12095
Lustre-commit: 1224084c6300 ("LU-12095 ptlrpc: ocd_connect_flags are wrong during reconnect")
Signed-off-by: Andriy Skulysh <c17819@cray.com>
Reviewed-by: Alexander Boyko <c17825@cray.com>
Reviewed-by: Andrew Perepechko <c17827@cray.com>
Reviewed-on: https://review.whamcloud.com/34480
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/import.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index a75856a..6f13ec1 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -602,11 +602,12 @@ int ptlrpc_connect_import(struct obd_import *imp)
 	int set_transno = 0;
 	u64 committed_before_reconnect = 0;
 	struct ptlrpc_request *request;
+	struct obd_connect_data ocd;
 	char *bufs[] = { NULL,
 			 obd2cli_tgt(imp->imp_obd),
 			 obd->obd_uuid.uuid,
 			 (char *)&imp->imp_dlm_handle,
-			 (char *)&imp->imp_connect_data,
+			 (char *)&ocd,
 			 NULL };
 	struct ptlrpc_connect_async_args *aa;
 	int rc;
@@ -653,15 +654,16 @@ int ptlrpc_connect_import(struct obd_import *imp)
 	/* Reset connect flags to the originally requested flags, in case
 	 * the server is updated on-the-fly we will get the new features.
 	 */
-	imp->imp_connect_data.ocd_connect_flags = imp->imp_connect_flags_orig;
-	imp->imp_connect_data.ocd_connect_flags2 = imp->imp_connect_flags2_orig;
+	ocd = imp->imp_connect_data;
+	ocd.ocd_connect_flags = imp->imp_connect_flags_orig;
+	ocd.ocd_connect_flags2 = imp->imp_connect_flags2_orig;
 	/* Reset ocd_version each time so the server knows the exact versions */
-	imp->imp_connect_data.ocd_version = LUSTRE_VERSION_CODE;
+	ocd.ocd_version = LUSTRE_VERSION_CODE;
 	imp->imp_msghdr_flags &= ~MSGHDR_AT_SUPPORT;
 	imp->imp_msghdr_flags &= ~MSGHDR_CKSUM_INCOMPAT18;
 
 	rc = obd_reconnect(NULL, imp->imp_obd->obd_self_export, obd,
-			   &obd->obd_uuid, &imp->imp_connect_data, NULL);
+			   &obd->obd_uuid, &ocd, NULL);
 	if (rc)
 		goto out;
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 294/622] lnet: libcfs: fix panic for too large cpu partitions
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (292 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 293/622] lustre: ptlrpc: ocd_connect_flags are wrong during reconnect James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 295/622] lustre: obdclass: put all service's env on the list James Simmons
                   ` (328 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

If the number of CPU partitions is larger than the number of online
CPUs, the following calculation will be 0:

num = num_online_cpus() / ncpt;

And it will trigger the following panic in cfs_cpt_choose_ncpus():

        LASSERT(number > 0);

We never actually supported this; instead of panicking, it is better
to return a failure.

Also fix an invalid pointer access when we fail to init @cfs_cpt_table,
as it is converted to ERR_PTR() when an error happens.
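The validation can be sketched as follows; validate_ncpt() is a
hypothetical condensation of the check added to cfs_cpt_table_create(),
not the real function.

```c
#include <assert.h>
#include <errno.h>

static int validate_ncpt(int ncpt, int online_cpus)
{
	if (ncpt <= 0)
		ncpt = 1;		/* caller would pick a default */

	/* reject early: online_cpus / ncpt would be 0 and a later
	 * LASSERT(number > 0) would panic the node */
	if (ncpt > online_cpus)
		return -EINVAL;

	return ncpt;
}
```

Returning -EINVAL turns a guaranteed LBUG on a misconfigured module
parameter into an ordinary module-load failure.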

WC-bug-id: https://jira.whamcloud.com/browse/LU-12299
Lustre-commit: 77771ff24c03 ("LU-12299 libcfs: fix panic for too large cpu partions")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/34864
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Gu Zheng <gzheng@ddn.com>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/libcfs/libcfs_cpu.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/net/lnet/libcfs/libcfs_cpu.c b/net/lnet/libcfs/libcfs_cpu.c
index 3e566ac..80533c2 100644
--- a/net/lnet/libcfs/libcfs_cpu.c
+++ b/net/lnet/libcfs/libcfs_cpu.c
@@ -878,7 +878,14 @@ static struct cfs_cpt_table *cfs_cpt_table_create(int ncpt)
 	if (ncpt <= 0)
 		ncpt = num;
 
-	if (ncpt > num_online_cpus() || ncpt > 4 * num) {
+	if (ncpt > num_online_cpus()) {
+		rc = -EINVAL;
+		CERROR("libcfs: CPU partition count %d > cores %d: rc = %d\n",
+		       ncpt, num_online_cpus(), rc);
+		goto failed;
+	}
+
+	if (ncpt > 4 * num) {
 		CWARN("CPU partition number %d is larger than suggested value (%d), your system may have performance issue or run out of memory while under pressure\n",
 		      ncpt, num);
 	}
-- 
1.8.3.1


* [lustre-devel] [PATCH 295/622] lustre: obdclass: put all service's env on the list
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (293 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 294/622] lnet: libcfs: fix panic for too large cpu partitions James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 296/622] lustre: mdt: fix mdt_dom_discard_data() timeouts James Simmons
                   ` (327 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Alex Zhuravlev <bzzz@whamcloud.com>

Put every service's env on a list so it can be looked up by the
current thread in places where it is too complicated to pass the env
as an argument.

This version has stats to observe slow/fast lookups: in
sanity-benchmark there were 172850 fast lookups (from the per-cpu
cache) and 27228 slow lookups (from the rhashtable). We are going to
watch the ratio in autotest's reports.
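The lookup-by-thread idea behind lu_env_add()/lu_env_find()/
lu_env_remove() can be sketched as below. The real code keys an
rhashtable (with a per-cpu fast path) on the current task; here a tiny
linear table keyed by an integer thread id stands in, and all names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ENVS 8

struct lu_env_stub { int tag; };

static struct {
	int tid;
	struct lu_env_stub *env;
} env_table[MAX_ENVS];

/* register the env for a thread so deep callees can find it */
static int env_add(int tid, struct lu_env_stub *env)
{
	int i;

	for (i = 0; i < MAX_ENVS; i++) {
		if (!env_table[i].env) {
			env_table[i].tid = tid;
			env_table[i].env = env;
			return 0;
		}
	}
	return -1;	/* table full */
}

static struct lu_env_stub *env_find(int tid)
{
	int i;

	for (i = 0; i < MAX_ENVS; i++)
		if (env_table[i].env && env_table[i].tid == tid)
			return env_table[i].env;
	return NULL;
}

static void env_remove(int tid)
{
	int i;

	for (i = 0; i < MAX_ENVS; i++)
		if (env_table[i].env && env_table[i].tid == tid)
			env_table[i].env = NULL;
}
```

A callee anywhere in the service's call chain can then recover the env
from its own thread id instead of having it threaded through every
function signature, which is exactly what obd_precleanup() above does
with lu_env_find() before falling back to a locally initialized env.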

Fixes: 8a9e013dad74 ("lustre: ldlm: pass env to lvbo methods")

WC-bug-id: https://jira.whamcloud.com/browse/LU-12034
Lustre-commit: aa82cc83612d ("LU-12034 obdclass: put all service's env on the list")
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34566
Reviewed-by: Andrew Perepechko <c17827@cray.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lu_object.h   |   4 ++
 fs/lustre/include/obd_class.h   |  19 +++++---
 fs/lustre/ldlm/ldlm_lockd.c     |  20 +++++++-
 fs/lustre/obdclass/lu_object.c  | 104 +++++++++++++++++++++++++++++++++++++++-
 fs/lustre/obdecho/echo_client.c |   2 +
 fs/lustre/ptlrpc/service.c      |  20 ++++----
 6 files changed, 153 insertions(+), 16 deletions(-)

diff --git a/fs/lustre/include/lu_object.h b/fs/lustre/include/lu_object.h
index a709ad7..c34605c 100644
--- a/fs/lustre/include/lu_object.h
+++ b/fs/lustre/include/lu_object.h
@@ -1213,6 +1213,10 @@ struct lu_env {
 void lu_env_fini(struct lu_env *env);
 int lu_env_refill(struct lu_env *env);
 
+struct lu_env *lu_env_find(void);
+int lu_env_add(struct lu_env *env);
+void lu_env_remove(struct lu_env *env);
+
 /** @} lu_context */
 
 /**
diff --git a/fs/lustre/include/obd_class.h b/fs/lustre/include/obd_class.h
index a142d6e..a890d00 100644
--- a/fs/lustre/include/obd_class.h
+++ b/fs/lustre/include/obd_class.h
@@ -477,12 +477,19 @@ static inline int obd_precleanup(struct obd_device *obd)
 	int rc;
 
 	if (ldt && d) {
-		struct lu_env env;
-
-		rc = lu_env_init(&env, ldt->ldt_ctx_tags);
-		if (!rc) {
-			ldt->ldt_ops->ldto_device_fini(&env, d);
-			lu_env_fini(&env);
+		struct lu_env *env = lu_env_find();
+		struct lu_env _env;
+
+		if (!env) {
+			env = &_env;
+			rc = lu_env_init(env, ldt->ldt_ctx_tags);
+			LASSERT(!rc);
+			lu_env_add(env);
+		}
+		ldt->ldt_ops->ldto_device_fini(env, d);
+		if (env == &_env) {
+			lu_env_remove(env);
+			lu_env_fini(env);
 		}
 	}
 	if (!obd->obd_type->typ_dt_ops->precleanup)
diff --git a/fs/lustre/ldlm/ldlm_lockd.c b/fs/lustre/ldlm/ldlm_lockd.c
index f37d8ef..3b405be 100644
--- a/fs/lustre/ldlm/ldlm_lockd.c
+++ b/fs/lustre/ldlm/ldlm_lockd.c
@@ -846,8 +846,20 @@ static int ldlm_bl_thread_blwi(struct ldlm_bl_pool *blp,
  */
 static int ldlm_bl_thread_main(void *arg)
 {
+	struct lu_env *env;
 	struct ldlm_bl_pool *blp;
 	struct ldlm_bl_thread_data *bltd = arg;
+	int rc;
+
+	env = kzalloc(sizeof(*env), GFP_NOFS);
+	if (!env)
+		return -ENOMEM;
+	rc = lu_env_init(env, LCT_DT_THREAD);
+	if (rc)
+		goto out_env;
+	rc = lu_env_add(env);
+	if (rc)
+		goto out_env_fini;
 
 	blp = bltd->bltd_blp;
 
@@ -888,7 +900,13 @@ static int ldlm_bl_thread_main(void *arg)
 
 	atomic_dec(&blp->blp_num_threads);
 	complete(&blp->blp_comp);
-	return 0;
+
+	lu_env_remove(env);
+out_env_fini:
+	lu_env_fini(env);
+out_env:
+	kfree(env);
+	return rc;
 }
 
 static int ldlm_setup(void);
diff --git a/fs/lustre/obdclass/lu_object.c b/fs/lustre/obdclass/lu_object.c
index 2ab4977..2f709b0 100644
--- a/fs/lustre/obdclass/lu_object.c
+++ b/fs/lustre/obdclass/lu_object.c
@@ -1859,6 +1859,101 @@ static unsigned long lu_cache_shrink_scan(struct shrinker *sk,
 /**
  * Debugging printer function using printk().
  */
+
+struct lu_env_item {
+	struct task_struct	*lei_task;	/* rhashtable key */
+	struct rhash_head	lei_linkage;
+	struct lu_env		*lei_env;
+};
+
+static const struct rhashtable_params lu_env_rhash_params = {
+	.key_len     = sizeof(struct task_struct *),
+	.key_offset  = offsetof(struct lu_env_item, lei_task),
+	.head_offset = offsetof(struct lu_env_item, lei_linkage),
+};
+
+struct rhashtable lu_env_rhash;
+
+struct lu_env_percpu {
+	struct task_struct *lep_task;
+	struct lu_env *lep_env ____cacheline_aligned_in_smp;
+};
+
+static struct lu_env_percpu lu_env_percpu[NR_CPUS];
+
+int lu_env_add(struct lu_env *env)
+{
+	struct lu_env_item *lei, *old;
+
+	LASSERT(env);
+
+	lei = kzalloc(sizeof(*lei), GFP_NOFS);
+	if (!lei)
+		return -ENOMEM;
+
+	lei->lei_task = current;
+	lei->lei_env = env;
+
+	old = rhashtable_lookup_get_insert_fast(&lu_env_rhash,
+						&lei->lei_linkage,
+						lu_env_rhash_params);
+	LASSERT(!old);
+
+	return 0;
+}
+EXPORT_SYMBOL(lu_env_add);
+
+void lu_env_remove(struct lu_env *env)
+{
+	struct lu_env_item *lei;
+	const void *task = current;
+	int i;
+
+	for_each_possible_cpu(i) {
+		if (lu_env_percpu[i].lep_env == env) {
+			LASSERT(lu_env_percpu[i].lep_task == task);
+			lu_env_percpu[i].lep_task = NULL;
+			lu_env_percpu[i].lep_env = NULL;
+		}
+	}
+
+	rcu_read_lock();
+	lei = rhashtable_lookup_fast(&lu_env_rhash, &task,
+				     lu_env_rhash_params);
+	if (lei && rhashtable_remove_fast(&lu_env_rhash, &lei->lei_linkage,
+					  lu_env_rhash_params) == 0)
+		kfree(lei);
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL(lu_env_remove);
+
+struct lu_env *lu_env_find(void)
+{
+	struct lu_env *env = NULL;
+	struct lu_env_item *lei;
+	const void *task = current;
+	int i = get_cpu();
+
+	if (lu_env_percpu[i].lep_task == current) {
+		env = lu_env_percpu[i].lep_env;
+		put_cpu();
+		LASSERT(env);
+		return env;
+	}
+
+	lei = rhashtable_lookup_fast(&lu_env_rhash, &task,
+				     lu_env_rhash_params);
+	if (lei) {
+		env = lei->lei_env;
+		lu_env_percpu[i].lep_task = current;
+		lu_env_percpu[i].lep_env = env;
+	}
+	put_cpu();
+
+	return env;
+}
+EXPORT_SYMBOL(lu_env_find);
+
 static struct shrinker lu_site_shrinker = {
 	.count_objects		= lu_cache_shrink_count,
 	.scan_objects		= lu_cache_shrink_scan,
@@ -1905,6 +2000,11 @@ int lu_global_init(void)
 	 * lu_object/inode cache consuming all the memory.
 	 */
 	result = register_shrinker(&lu_site_shrinker);
+	if (result == 0) {
+		result = rhashtable_init(&lu_env_rhash, &lu_env_rhash_params);
+		if (result != 0)
+			unregister_shrinker(&lu_site_shrinker);
+	}
 	if (result != 0) {
 		/* Order explained in lu_global_fini(). */
 		lu_context_key_degister(&lu_global_key);
@@ -1917,7 +2017,7 @@ int lu_global_init(void)
 		return result;
 	}
 
-	return 0;
+	return result;
 }
 
 /**
@@ -1936,6 +2036,8 @@ void lu_global_fini(void)
 	lu_env_fini(&lu_shrink_env);
 	up_write(&lu_sites_guard);
 
+	rhashtable_destroy(&lu_env_rhash);
+
 	lu_ref_global_fini();
 }
 
diff --git a/fs/lustre/obdecho/echo_client.c b/fs/lustre/obdecho/echo_client.c
index 5ac4519..01d8c04 100644
--- a/fs/lustre/obdecho/echo_client.c
+++ b/fs/lustre/obdecho/echo_client.c
@@ -1506,6 +1506,7 @@ static int echo_client_brw_ioctl(const struct lu_env *env, int rw,
 		rc = -ENOMEM;
 		goto out;
 	}
+	lu_env_add(env);
 
 	switch (cmd) {
 	case OBD_IOC_CREATE:		/* may create echo object */
@@ -1572,6 +1573,7 @@ static int echo_client_brw_ioctl(const struct lu_env *env, int rw,
 	}
 
 out:
+	lu_env_remove(env);
 	lu_env_fini(env);
 	kfree(env);
 
diff --git a/fs/lustre/ptlrpc/service.c b/fs/lustre/ptlrpc/service.c
index 1513f51..d93cf14 100644
--- a/fs/lustre/ptlrpc/service.c
+++ b/fs/lustre/ptlrpc/service.c
@@ -2194,11 +2194,14 @@ static int ptlrpc_main(void *arg)
 		rc = -ENOMEM;
 		goto out_srv_fini;
 	}
+	rc = lu_env_add(env);
+	if (rc)
+		goto out_env;
 
 	rc = lu_context_init(&env->le_ctx,
 			     svc->srv_ctx_tags | LCT_REMEMBER | LCT_NOREF);
 	if (rc)
-		goto out_srv_fini;
+		goto out_env_remove;
 
 	thread->t_env = env;
 	env->le_ctx.lc_thread = thread;
@@ -2211,14 +2214,14 @@ static int ptlrpc_main(void *arg)
 
 		CERROR("Failed to post rqbd for %s on CPT %d: %d\n",
 		       svc->srv_name, svcpt->scp_cpt, rc);
-		goto out_srv_fini;
+		goto out_ctx_fini;
 	}
 
 	/* Alloc reply state structure for this one */
 	rs = kvzalloc(svc->srv_max_reply_size, GFP_KERNEL);
 	if (!rs) {
 		rc = -ENOMEM;
-		goto out_srv_fini;
+		goto out_ctx_fini;
 	}
 
 	spin_lock(&svcpt->scp_lock);
@@ -2310,15 +2313,16 @@ static int ptlrpc_main(void *arg)
 
 	ptlrpc_watchdog_disable(&thread->t_watchdog);
 
+out_ctx_fini:
+	lu_context_fini(&env->le_ctx);
+out_env_remove:
+	lu_env_remove(env);
+out_env:
+	kfree(env);
 out_srv_fini:
 	/* deconstruct service thread state created by ptlrpc_start_thread() */
 	if (svc->srv_ops.so_thr_done)
 		svc->srv_ops.so_thr_done(thread);
-
-	if (env) {
-		lu_context_fini(&env->le_ctx);
-		kfree(env);
-	}
 out:
 	CDEBUG(D_RPCTRACE, "%s: service thread [%p:%u] %d exiting: rc = %d\n",
 	       thread->t_name, thread, thread->t_pid, thread->t_id, rc);
-- 
1.8.3.1


* [lustre-devel] [PATCH 296/622] lustre: mdt: fix mdt_dom_discard_data() timeouts
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (294 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 295/622] lustre: obdclass: put all service's env on the list James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 297/622] lustre: lov: Add overstriping support James Simmons
                   ` (326 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

mdt_dom_discard_data() issues a new lock to force data discard
for all conflicting client locks. This was done in the context
of unlink RPC processing and could get stuck waiting for clients
to cancel their locks, leading to cascading timeouts for any
other locks waiting on the same resource and parent directory.

The patch skips waiting for the discard lock in the current
context by using its own CP (completion) callback, which doesn't
wait for blocking locks. They will be finished later by LDLM and
cleaned up in that completion callback. The current thread just
makes sure discard locks are taken and BL ASTs are sent, but
doesn't wait for lock granting, and that fixes the original
problem.

At the same time this opens a window for a race with data being
flushed on the client, so new IO from a client may arrive on a
just-unlinked object, causing an error message, and it is not
possible to distinguish that case from other, possibly critical,
situations. To solve that, the unlinked object is pinned in
memory until the discard lock is granted. Such objects can then
be easily recognized as stale, and any IO against them can be
silently ignored.

Older clients are not fully compatible with async DoM discard,
so the patch also adds a new connection flag, ASYNC_DISCARD, to
distinguish old clients and use the old blocking discard for
them.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11359
Lustre-commit: 9c028e74c220 ("LU-11359 mdt: fix mdt_dom_discard_data() timeouts")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34071
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_dlm.h         |  2 ++
 fs/lustre/ldlm/ldlm_internal.h         |  5 +----
 fs/lustre/ldlm/ldlm_request.c          | 13 +++++++++++++
 fs/lustre/llite/llite_lib.c            | 19 ++++++++++++-------
 fs/lustre/llite/namei.c                | 12 +++++++++++-
 fs/lustre/obdclass/lprocfs_status.c    |  1 +
 fs/lustre/osc/osc_cache.c              |  2 +-
 fs/lustre/ptlrpc/service.c             | 23 +++++++++++++++++++++++
 fs/lustre/ptlrpc/wiretest.c            |  2 ++
 include/uapi/linux/lustre/lustre_idl.h |  3 +++
 10 files changed, 69 insertions(+), 13 deletions(-)

diff --git a/fs/lustre/include/lustre_dlm.h b/fs/lustre/include/lustre_dlm.h
index 355049f..4060bb4 100644
--- a/fs/lustre/include/lustre_dlm.h
+++ b/fs/lustre/include/lustre_dlm.h
@@ -1082,6 +1082,8 @@ static inline struct ldlm_lock *ldlm_handle2lock(const struct lustre_handle *h)
 	return lock;
 }
 
+int is_granted_or_cancelled_nolock(struct ldlm_lock *lock);
+
 int ldlm_error2errno(enum ldlm_error error);
 
 #if LUSTRE_TRACKS_LOCK_EXP_REFS
diff --git a/fs/lustre/ldlm/ldlm_internal.h b/fs/lustre/ldlm/ldlm_internal.h
index ede48b2..3789496 100644
--- a/fs/lustre/ldlm/ldlm_internal.h
+++ b/fs/lustre/ldlm/ldlm_internal.h
@@ -310,10 +310,7 @@ static inline int is_granted_or_cancelled(struct ldlm_lock *lock)
 	int ret = 0;
 
 	lock_res_and_lock(lock);
-	if (ldlm_is_granted(lock) && !ldlm_is_cp_reqd(lock))
-		ret = 1;
-	else if (ldlm_is_failed(lock) || ldlm_is_cancel(lock))
-		ret = 1;
+	ret = is_granted_or_cancelled_nolock(lock);
 	unlock_res_and_lock(lock);
 
 	return ret;
diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index 45d70d4..71892a5 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -138,6 +138,19 @@ static void ldlm_expired_completion_wait(struct ldlm_lock *lock, u32 conn_cnt)
 		   obd2cli_tgt(obd), imp->imp_connection->c_remote_uuid.uuid);
 }
 
+int is_granted_or_cancelled_nolock(struct ldlm_lock *lock)
+{
+	int ret = 0;
+
+	check_res_locked(lock->l_resource);
+	if (ldlm_is_granted(lock) && !ldlm_is_cp_reqd(lock))
+		ret = 1;
+	else if (ldlm_is_failed(lock) || ldlm_is_cancel(lock))
+		ret = 1;
+	return ret;
+}
+EXPORT_SYMBOL(is_granted_or_cancelled_nolock);
+
 /**
  * Calculate the Completion timeout (covering enqueue, BL AST, data flush,
  * lock cancel, and their replies). Used for lock completion timeout on the
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index ba477ad..a89189c 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -213,7 +213,8 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 				   OBD_CONNECT2_FLR |
 				   OBD_CONNECT2_LOCK_CONVERT |
 				   OBD_CONNECT2_ARCHIVE_ID_ARRAY |
-				   OBD_CONNECT2_LSOM;
+				   OBD_CONNECT2_LSOM |
+				   OBD_CONNECT2_ASYNC_DISCARD;
 
 	if (sbi->ll_flags & LL_SBI_LRU_RESIZE)
 		data->ocd_connect_flags |= OBD_CONNECT_LRU_RESIZE;
@@ -2054,13 +2055,17 @@ void ll_delete_inode(struct inode *inode)
 	struct address_space *mapping = &inode->i_data;
 	unsigned long nrpages;
 
-	if (S_ISREG(inode->i_mode) && lli->lli_clob)
-		/* discard all dirty pages before truncating them, required by
-		 * osc_extent implementation at LU-1030.
+	if (S_ISREG(inode->i_mode) && lli->lli_clob) {
+		/* It is last chance to write out dirty pages,
+		 * otherwise we may lose data while umount.
+		 *
+		 * If i_nlink is 0 then just discard data. This is safe because
+		 * local inode gets i_nlink 0 from server only for the last
+		 * unlink, so that file is not opened somewhere else
 		 */
-		cl_sync_file_range(inode, 0, OBD_OBJECT_EOF,
-				   CL_FSYNC_LOCAL, 1);
-
+		cl_sync_file_range(inode, 0, OBD_OBJECT_EOF, inode->i_nlink ?
+				   CL_FSYNC_LOCAL : CL_FSYNC_DISCARD, 1);
+	}
 	truncate_inode_pages_final(mapping);
 
 	/* Workaround for LU-118: Note nrpages may not be totally updated when
diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index ee3ce70..c3e8de4 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -224,8 +224,18 @@ void ll_lock_cancel_bits(struct ldlm_lock *lock, u64 to_cancel)
 	u64 bits = to_cancel;
 	int rc;
 
-	if (!inode)
+	if (!inode) {
+		/* That means the inode is evicted most likely and may cause
+		 * the skipping of lock cleanups below, so print the message
+		 * about that in log.
+		 */
+		if (lock->l_resource->lr_lvb_inode)
+			LDLM_DEBUG(lock,
+				   "can't take inode for the lock (%sevicted)\n",
+				   lock->l_resource->lr_lvb_inode->i_state &
+				   I_FREEING ? "" : "not ");
 		return;
+	}
 
 	if (!fid_res_name_eq(ll_inode2fid(inode),
 			     &lock->l_resource->lr_name)) {
diff --git a/fs/lustre/obdclass/lprocfs_status.c b/fs/lustre/obdclass/lprocfs_status.c
index 55057cf..c244adb 100644
--- a/fs/lustre/obdclass/lprocfs_status.c
+++ b/fs/lustre/obdclass/lprocfs_status.c
@@ -125,6 +125,7 @@
 	"lsom",			/* 0x800 */
 	"pcc",			/* 0x1000 */
 	"plain_layout",		/* 0x2000 */
+	"async_discard",	/* 0x4000 */
 	NULL
 };
 
diff --git a/fs/lustre/osc/osc_cache.c b/fs/lustre/osc/osc_cache.c
index a02adac..8ffd8f9 100644
--- a/fs/lustre/osc/osc_cache.c
+++ b/fs/lustre/osc/osc_cache.c
@@ -2926,7 +2926,7 @@ int osc_cache_writeback_range(const struct lu_env *env, struct osc_object *obj,
 				 * [start, end] must contain this extent
 				 */
 				EASSERT(ext->oe_start >= start &&
-					ext->oe_max_end <= end, ext);
+					ext->oe_end <= end, ext);
 				osc_extent_state_set(ext, OES_LOCKING);
 				ext->oe_owner = current;
 				list_move_tail(&ext->oe_link, &discard_list);
diff --git a/fs/lustre/ptlrpc/service.c b/fs/lustre/ptlrpc/service.c
index d93cf14..8e6013a 100644
--- a/fs/lustre/ptlrpc/service.c
+++ b/fs/lustre/ptlrpc/service.c
@@ -2367,8 +2367,13 @@ static int ptlrpc_hr_main(void *arg)
 	struct ptlrpc_hr_thread	*hrt = arg;
 	struct ptlrpc_hr_partition *hrp = hrt->hrt_partition;
 	LIST_HEAD(replies);
+	struct lu_env *env;
 	int rc;
 
+	env = kzalloc(sizeof(*env), GFP_NOFS);
+	if (!env)
+		return -ENOMEM;
+
 	unshare_fs_struct();
 
 	rc = cfs_cpt_bind(ptlrpc_hr.hr_cpt_table, hrp->hrp_cpt);
@@ -2381,6 +2386,15 @@ static int ptlrpc_hr_main(void *arg)
 		      threadname, hrp->hrp_cpt, ptlrpc_hr.hr_cpt_table, rc);
 	}
 
+	rc = lu_context_init(&env->le_ctx, LCT_MD_THREAD | LCT_DT_THREAD |
+			     LCT_REMEMBER | LCT_NOREF);
+	if (rc)
+		goto out_env;
+
+	rc = lu_env_add(env);
+	if (rc)
+		goto out_ctx_fini;
+
 	atomic_inc(&hrp->hrp_nstarted);
 	wake_up(&ptlrpc_hr.hr_waitq);
 
@@ -2394,13 +2408,22 @@ static int ptlrpc_hr_main(void *arg)
 					     struct ptlrpc_reply_state,
 					     rs_list);
 			list_del_init(&rs->rs_list);
+			/* refill keys if needed */
+			lu_env_refill(env);
+			lu_context_enter(&env->le_ctx);
 			ptlrpc_handle_rs(rs);
+			lu_context_exit(&env->le_ctx);
 		}
 	}
 
 	atomic_inc(&hrp->hrp_nstopped);
 	wake_up(&ptlrpc_hr.hr_waitq);
 
+	lu_env_remove(env);
+out_ctx_fini:
+	lu_context_fini(&env->le_ctx);
+out_env:
+	kfree(env);
 	return 0;
 }
 
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index fb57def..34c1d13 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1156,6 +1156,8 @@ void lustre_assert_wire_constants(void)
 		 OBD_CONNECT2_PCC);
 	LASSERTF(OBD_CONNECT2_PLAIN_LAYOUT == 0x2000ULL, "found 0x%.16llxULL\n",
 		 OBD_CONNECT2_PLAIN_LAYOUT);
+	LASSERTF(OBD_CONNECT2_ASYNC_DISCARD == 0x4000ULL, "found 0x%.16llxULL\n",
+		 OBD_CONNECT2_ASYNC_DISCARD);
 	LASSERTF(OBD_CKSUM_CRC32 == 0x00000001UL, "found 0x%.8xUL\n",
 		 (unsigned int)OBD_CKSUM_CRC32);
 	LASSERTF(OBD_CKSUM_ADLER == 0x00000002UL, "found 0x%.8xUL\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index f7ea744..86395b7 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -810,6 +810,9 @@ struct ptlrpc_body_v2 {
 #define OBD_CONNECT2_LSOM	       0x800ULL	/* LSOM support */
 #define OBD_CONNECT2_PCC	       0x1000ULL /* Persistent Client Cache */
 #define OBD_CONNECT2_PLAIN_LAYOUT      0x2000ULL /* Plain Directory Layout */
+#define OBD_CONNECT2_ASYNC_DISCARD     0x4000ULL /* support async DoM data
+						  * discard
+						  */
 
 /* XXX README XXX:
  * Please DO NOT add flag values here before first ensuring that this same
-- 
1.8.3.1


* [lustre-devel] [PATCH 297/622] lustre: lov: Add overstriping support
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (295 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 296/622] lustre: mdt: fix mdt_dom_discard_data() timeouts James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 298/622] lustre: rpc: support maximum 64MB I/O RPC James Simmons
                   ` (325 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

Each stripe in a shared file in Lustre corresponds to a
single LDLM extent locking domain and also to a single
object on disk (and in the OSS page cache).  LDLM locks are
extent locks, but there are still significant issues with
false sharing with multiple writers.  On-disk file systems
also have per-object performance limitations for both read
and write.

The LDLM limitation means it is best to have a single
writer per stripe, but modern OSTs can be faster than a
single client, so this restricts maximum performance unless
special methods are used (eg, Lustre lock ahead).

The on disk file system limitations mean that even if LDLM
locking is not an issue (read and not write, or lockahead),
OST performance in a shared file is still limited by having
only one object per OST.

These limitations make it impossible to get the full
performance of a modern Lustre FS with a single shared
file.

This patch makes it possible to have >1 stripe on a given
OST in each layout component.  This is known as
overstriping.  It works exactly like a normally striped
file, and is largely transparent to users.

By raising the object count per OST, this avoids the single
object limits, and by creating more stripes, also avoids
the "single effective writer per stripe" LDLM limitation.

However, it is only desirable in some situations, so users
must request it with a special setstripe command:

lfs setstripe -C [count] [file]

Users can also access overstriping using the standard '-o'
option to manually select OSTs:

lfs setstripe -o [ost_indices] [file]

Overstriping also makes it easy to test layout size limits, so
we add a test for that.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9846
Lustre-commit: 591a9b4cebc5 ("LU-9846 lod: Add overstriping support")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/28425
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_lib.c             |  1 +
 fs/lustre/lov/lov_cl_internal.h         |  5 +++--
 fs/lustre/lov/lov_ea.c                  | 33 ++++++++++++++++++++++-----------
 fs/lustre/lov/lov_obd.c                 |  4 ++--
 fs/lustre/ptlrpc/wiretest.c             |  4 ++--
 include/uapi/linux/lustre/lustre_user.h | 22 +++++++++++++++++-----
 6 files changed, 47 insertions(+), 22 deletions(-)

diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index a89189c..d6293d1 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -210,6 +210,7 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 
 	data->ocd_connect_flags2 = OBD_CONNECT2_DIR_MIGRATE |
 				   OBD_CONNECT2_SUM_STATFS |
+				   OBD_CONNECT2_OVERSTRIPING |
 				   OBD_CONNECT2_FLR |
 				   OBD_CONNECT2_LOCK_CONVERT |
 				   OBD_CONNECT2_ARCHIVE_ID_ARRAY |
diff --git a/fs/lustre/lov/lov_cl_internal.h b/fs/lustre/lov/lov_cl_internal.h
index 7b95a00..6fea0f5 100644
--- a/fs/lustre/lov/lov_cl_internal.h
+++ b/fs/lustre/lov/lov_cl_internal.h
@@ -150,9 +150,10 @@ static inline char *llt2str(enum lov_layout_type llt)
  */
 static inline u32 lov_entry_type(struct lov_stripe_md_entry *lsme)
 {
-	if ((lov_pattern(lsme->lsme_pattern) == LOV_PATTERN_RAID0) ||
+	if ((lov_pattern(lsme->lsme_pattern) & LOV_PATTERN_RAID0) ||
 	    (lov_pattern(lsme->lsme_pattern) == LOV_PATTERN_MDT))
-		return lov_pattern(lsme->lsme_pattern);
+		return lov_pattern(lsme->lsme_pattern &
+				   ~LOV_PATTERN_OVERSTRIPING);
 	return 0;
 }
 
diff --git a/fs/lustre/lov/lov_ea.c b/fs/lustre/lov/lov_ea.c
index b7a6d91..07bfe0f 100644
--- a/fs/lustre/lov/lov_ea.c
+++ b/fs/lustre/lov/lov_ea.c
@@ -84,34 +84,45 @@ static loff_t lov_tgt_maxbytes(struct lov_tgt_desc *tgt)
 static int lsm_lmm_verify_v1v3(struct lov_mds_md *lmm, size_t lmm_size,
 			       u16 stripe_count)
 {
+	int rc = 0;
+
 	if (stripe_count > LOV_V1_INSANE_STRIPE_COUNT) {
-		CERROR("bad stripe count %d\n", stripe_count);
+		rc = -EINVAL;
+		CERROR("lov: bad stripe count %d: rc = %d\n",
+		       stripe_count, rc);
 		lov_dump_lmm_common(D_WARNING, lmm);
-		return -EINVAL;
+		goto out;
 	}
 
 	if (lmm_oi_id(&lmm->lmm_oi) == 0) {
-		CERROR("zero object id\n");
+		rc = -EINVAL;
+		CERROR("lov: zero object id: rc = %d\n", rc);
 		lov_dump_lmm_common(D_WARNING, lmm);
-		return -EINVAL;
+		goto out;
 	}
 
 	if (lov_pattern(le32_to_cpu(lmm->lmm_pattern)) != LOV_PATTERN_MDT &&
-	    lov_pattern(le32_to_cpu(lmm->lmm_pattern)) != LOV_PATTERN_RAID0) {
-		CERROR("bad striping pattern\n");
+	    lov_pattern(le32_to_cpu(lmm->lmm_pattern)) != LOV_PATTERN_RAID0 &&
+	    lov_pattern(le32_to_cpu(lmm->lmm_pattern)) !=
+			(LOV_PATTERN_RAID0 | LOV_PATTERN_OVERSTRIPING)) {
+		rc = -EINVAL;
+		CERROR("lov: unrecognized striping pattern: rc = %d\n", rc);
 		lov_dump_lmm_common(D_WARNING, lmm);
-		return -EINVAL;
+		goto out;
 	}
 
 	if (lmm->lmm_stripe_size == 0 ||
 	    (le32_to_cpu(lmm->lmm_stripe_size) &
 	     (LOV_MIN_STRIPE_SIZE - 1)) != 0) {
-		CERROR("bad stripe size %u\n",
-		       le32_to_cpu(lmm->lmm_stripe_size));
+		rc = -EINVAL;
+		CERROR("lov: bad stripe size %u: rc = %d\n",
+		       le32_to_cpu(lmm->lmm_stripe_size), rc);
 		lov_dump_lmm_common(D_WARNING, lmm);
-		return -EINVAL;
+		goto out;
 	}
-	return 0;
+
+out:
+	return rc;
 }
 
 static void lsme_free(struct lov_stripe_md_entry *lsme)
diff --git a/fs/lustre/lov/lov_obd.c b/fs/lustre/lov/lov_obd.c
index 3a90e7e..234b556 100644
--- a/fs/lustre/lov/lov_obd.c
+++ b/fs/lustre/lov/lov_obd.c
@@ -699,8 +699,8 @@ void lov_fix_desc_stripe_count(u32 *val)
 void lov_fix_desc_pattern(u32 *val)
 {
 	/* from lov_setstripe */
-	if ((*val != 0) && (*val != LOV_PATTERN_RAID0)) {
-		LCONSOLE_WARN("Unknown stripe pattern: %#x\n", *val);
+	if ((*val != 0) && !lov_pattern_supported_normal_comp(*val)) {
+		LCONSOLE_WARN("lov: Unknown stripe pattern: %#x\n", *val);
 		*val = 0;
 	}
 }
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index 34c1d13..b8b561c 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1517,8 +1517,8 @@ void lustre_assert_wire_constants(void)
 		 (unsigned int)LOV_PATTERN_RAID1);
 	LASSERTF(LOV_PATTERN_MDT == 0x00000100UL, "found 0x%.8xUL\n",
 		 (unsigned int)LOV_PATTERN_MDT);
-	LASSERTF(LOV_PATTERN_CMOBD == 0x00000200UL, "found 0x%.8xUL\n",
-		 (unsigned int)LOV_PATTERN_CMOBD);
+	LASSERTF(LOV_PATTERN_OVERSTRIPING == 0x00000200UL, "found 0x%.8xUL\n",
+		 (unsigned int)LOV_PATTERN_OVERSTRIPING);
 
 	/* Checks for struct lov_comp_md_entry_v1 */
 	LASSERTF((int)sizeof(struct lov_comp_md_entry_v1) == 48, "found %lld\n",
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index d52879e..dc39265 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -394,16 +394,28 @@ struct ll_ioc_lease_id {
 #define LMV_USER_MAGIC		0x0CD30CD0	/*default lmv magic*/
 #define LMV_USER_MAGIC_SPECIFIC	0x0CD40CD0
 
-#define LOV_PATTERN_RAID0	0x001
+#define LOV_PATTERN_NONE		0x000
+#define LOV_PATTERN_RAID0		0x001
 
-#define LOV_PATTERN_RAID1	0x002
-#define LOV_PATTERN_MDT		0x100
-#define LOV_PATTERN_CMOBD	0x200
+#define LOV_PATTERN_RAID1		0x002
+#define LOV_PATTERN_MDT			0x100
+#define LOV_PATTERN_OVERSTRIPING	0x200
 
 #define LOV_PATTERN_F_MASK	0xffff0000
 #define LOV_PATTERN_F_HOLE	0x40000000 /* there is hole in LOV EA */
 #define LOV_PATTERN_F_RELEASED	0x80000000 /* HSM released file */
 
+/* RELEASED and MDT patterns are not valid in many places, so rather than
+ * having many extra checks on lov_pattern_supported, we have this separate
+ * check for non-released, non-DOM components
+ */
+static inline bool lov_pattern_supported_normal_comp(__u32 pattern)
+{
+	return pattern == LOV_PATTERN_RAID0 ||
+	       pattern == (LOV_PATTERN_RAID0 | LOV_PATTERN_OVERSTRIPING);
+
+}
+
 #define LOV_MAXPOOLNAME 15
 #define LOV_POOLNAMEF "%.15s"
 #define LOV_OFFSET_DEFAULT      ((__u16)-1)
@@ -421,7 +433,7 @@ struct ll_ioc_lease_id {
  *
  * (max buffer size - lov+rpc header) / sizeof(struct lov_ost_data_v1)
  */
-#define LOV_MAX_STRIPE_COUNT	2000  /* ((12 * 4096 - 256) / 24) */
+#define LOV_MAX_STRIPE_COUNT	2000  /* ~((12 * 4096 - 256) / 24) */
 #define LOV_ALL_STRIPES		0xffff /* only valid for directories */
 #define LOV_V1_INSANE_STRIPE_COUNT 65532 /* maximum stripe count bz13933 */
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 298/622] lustre: rpc: support maximum 64MB I/O RPC
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (296 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 297/622] lustre: lov: Add overstriping support James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 299/622] lustre: dom: per-resource ELC for WRITE lock enqueue James Simmons
                   ` (324 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Qian Yingjin <qian@ddn.com>

On newer systems, some block drivers allow max_hw_sectors_kb to
be up to 65536KB (64MB) for the underlying storage. To maximize
driver efficiency, Lustre should also bump its maximum I/O RPC
size to 64MB.

Also clamp max_read_ahead_whole_mb so it does not exceed
max_read_ahead_per_file_mb.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11526
Lustre-commit: 1a9be0046b1f ("LU-11526 rpc: support maximum 64MB I/O RPC")
Signed-off-by: Qian Yingjin <qian@ddn.com>
Reviewed-on: https://review.whamcloud.com/34042
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_net.h | 2 +-
 fs/lustre/llite/llite_lib.c    | 7 ++++++-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/include/lustre_net.h b/fs/lustre/include/lustre_net.h
index 8d71559..f96265b 100644
--- a/fs/lustre/include/lustre_net.h
+++ b/fs/lustre/include/lustre_net.h
@@ -82,7 +82,7 @@
  * transfer via cl_max_pages_per_rpc to some non-power-of-two value.
  * NOTE: This is limited to 16 (=64GB RPCs) by IOOBJ_MAX_BRW_BITS.
  */
-#define PTLRPC_BULK_OPS_BITS	4
+#define PTLRPC_BULK_OPS_BITS	6
 #if PTLRPC_BULK_OPS_BITS > 16
 #error "More than 65536 BRW RPCs not allowed by IOOBJ_MAX_BRW_BITS."
 #endif
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index d6293d1..e6ac16f 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -281,10 +281,15 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 	sbi->ll_md_exp->exp_connect_data = *data;
 
 	/* Don't change value if it was specified in the config log */
-	if (sbi->ll_ra_info.ra_max_read_ahead_whole_pages == -1)
+	if (sbi->ll_ra_info.ra_max_read_ahead_whole_pages == -1) {
 		sbi->ll_ra_info.ra_max_read_ahead_whole_pages =
 			max_t(unsigned long, SBI_DEFAULT_READAHEAD_WHOLE_MAX,
 			      (data->ocd_brw_size >> PAGE_SHIFT));
+		if (sbi->ll_ra_info.ra_max_read_ahead_whole_pages >
+		    sbi->ll_ra_info.ra_max_pages_per_file)
+			sbi->ll_ra_info.ra_max_read_ahead_whole_pages =
+				sbi->ll_ra_info.ra_max_pages_per_file;
+	}
 
 	err = obd_fid_init(sbi->ll_md_exp->exp_obd, sbi->ll_md_exp,
 			   LUSTRE_SEQ_METADATA);
-- 
1.8.3.1


* [lustre-devel] [PATCH 299/622] lustre: dom: per-resource ELC for WRITE lock enqueue
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (297 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 298/622] lustre: rpc: support maximum 64MB I/O RPC James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 300/622] lustre: dom: mdc_lock_flush() improvement James Simmons
                   ` (323 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

Improve client write lock enqueue by doing ELC for any
read lock on the same resource. This helps with read/write
access; e.g., compilebench shows ~10% better results with
about 45% fewer ldlm cancel RPCs.

In mdc_enqueue_send(), collect the resource's unused read locks
and pack them into the enqueue request.

ldlm_cancel_resource_local() is also changed so that it no longer
skips a DOM lock when the DOM bit is set explicitly in the policy.
WC-bug-id: https://jira.whamcloud.com/browse/LU-10894
Lustre-commit: 16c156c3218b ("LU-10894 dom: per-resource ELC for WRITE lock enqueue")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34736
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_request.c | 17 ++++++++++++-----
 fs/lustre/mdc/mdc_dev.c       | 13 +++++++++++--
 fs/lustre/mdc/mdc_internal.h  |  5 ++++-
 fs/lustre/mdc/mdc_reint.c     | 26 +++++++++++++++++---------
 4 files changed, 44 insertions(+), 17 deletions(-)

diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index 71892a5..5a7026d 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -1888,12 +1888,19 @@ int ldlm_cancel_resource_local(struct ldlm_resource *res,
 		/*
 		 * If policy is given and this is IBITS lock, add to list only
 		 * those locks that match by policy.
-		 * Skip locks with DoM bit always to don't flush data.
 		 */
-		if (policy && (lock->l_resource->lr_type == LDLM_IBITS) &&
-		    (!(lock->l_policy_data.l_inodebits.bits &
-		      policy->l_inodebits.bits) || ldlm_has_dom(lock)))
-			continue;
+		if (policy && (lock->l_resource->lr_type == LDLM_IBITS)) {
+			if (!(lock->l_policy_data.l_inodebits.bits &
+			      policy->l_inodebits.bits))
+				continue;
+			/* Skip locks with DoM bit if it is not set in policy
+			 * to don't flush data by side-bits. Lock convert will
+			 * drop those bits separately.
+			 */
+			if (ldlm_has_dom(lock) &&
+			    !(policy->l_inodebits.bits & MDS_INODELOCK_DOM))
+				continue;
+		}
 
 		/* See CBPENDING comment in ldlm_cancel_lru */
 		lock->l_flags |= LDLM_FL_CBPENDING | LDLM_FL_CANCELING |
diff --git a/fs/lustre/mdc/mdc_dev.c b/fs/lustre/mdc/mdc_dev.c
index cb173f4..8f0e283 100644
--- a/fs/lustre/mdc/mdc_dev.c
+++ b/fs/lustre/mdc/mdc_dev.c
@@ -670,7 +670,8 @@ int mdc_enqueue_send(const struct lu_env *env, struct obd_export *exp,
 	enum ldlm_mode mode;
 	bool glimpse = *flags & LDLM_FL_HAS_INTENT;
 	u64 match_flags = *flags;
-	int rc;
+	LIST_HEAD(cancels);
+	int rc, count;
 
 	mode = einfo->ei_mode;
 	if (einfo->ei_mode == LCK_PR)
@@ -726,7 +727,15 @@ int mdc_enqueue_send(const struct lu_env *env, struct obd_export *exp,
 	if (!req)
 		return -ENOMEM;
 
-	rc = ldlm_prep_enqueue_req(exp, req, NULL, 0);
+	/* For WRITE lock cancel other locks on resource early if any */
+	if (einfo->ei_mode & LCK_PW)
+		count = mdc_resource_get_unused_res(exp, res_id, &cancels,
+						    einfo->ei_mode,
+						    MDS_INODELOCK_DOM);
+	else
+		count = 0;
+
+	rc = ldlm_prep_enqueue_req(exp, req, &cancels, count);
 	if (rc < 0) {
 		ptlrpc_request_free(req);
 		return rc;
diff --git a/fs/lustre/mdc/mdc_internal.h b/fs/lustre/mdc/mdc_internal.h
index f75498a..2b540f8 100644
--- a/fs/lustre/mdc/mdc_internal.h
+++ b/fs/lustre/mdc/mdc_internal.h
@@ -86,7 +86,10 @@ int mdc_enqueue(struct obd_export *exp, struct ldlm_enqueue_info *einfo,
 		const union ldlm_policy_data *policy,
 		struct md_op_data *op_data,
 		struct lustre_handle *lockh, u64 extra_lock_flags);
-
+int mdc_resource_get_unused_res(struct obd_export *exp,
+				struct ldlm_res_id *res_id,
+				struct list_head *cancels,
+				enum ldlm_mode mode, u64 bits);
 int mdc_resource_get_unused(struct obd_export *exp, const struct lu_fid *fid,
 			    struct list_head *cancels, enum ldlm_mode  mode,
 			    u64 bits);
diff --git a/fs/lustre/mdc/mdc_reint.c b/fs/lustre/mdc/mdc_reint.c
index 86acb4e..d26e27d 100644
--- a/fs/lustre/mdc/mdc_reint.c
+++ b/fs/lustre/mdc/mdc_reint.c
@@ -62,13 +62,13 @@ static int mdc_reint(struct ptlrpc_request *request, int level)
  * found by @fid. Found locks are added into @cancel list. Returns the amount of
  * locks added to @cancels list.
  */
-int mdc_resource_get_unused(struct obd_export *exp, const struct lu_fid *fid,
-			    struct list_head *cancels, enum ldlm_mode mode,
-			    u64 bits)
+int mdc_resource_get_unused_res(struct obd_export *exp,
+				struct ldlm_res_id *res_id,
+				struct list_head *cancels,
+				enum ldlm_mode mode, u64 bits)
 {
 	struct ldlm_namespace *ns = exp->exp_obd->obd_namespace;
 	union ldlm_policy_data policy = {};
-	struct ldlm_res_id res_id;
 	struct ldlm_resource *res;
 	int count;
 
@@ -82,21 +82,29 @@ int mdc_resource_get_unused(struct obd_export *exp, const struct lu_fid *fid,
 	if (exp_connect_cancelset(exp) && !ns_connect_cancelset(ns))
 		return 0;
 
-	fid_build_reg_res_name(fid, &res_id);
-	res = ldlm_resource_get(exp->exp_obd->obd_namespace,
-				NULL, &res_id, 0, 0);
+	res = ldlm_resource_get(ns, NULL, res_id, 0, 0);
 	if (IS_ERR(res))
 		return 0;
 	LDLM_RESOURCE_ADDREF(res);
 	/* Initialize ibits lock policy. */
 	policy.l_inodebits.bits = bits;
-	count = ldlm_cancel_resource_local(res, cancels, &policy,
-					   mode, 0, 0, NULL);
+	count = ldlm_cancel_resource_local(res, cancels, &policy, mode, 0, 0,
+					   NULL);
 	LDLM_RESOURCE_DELREF(res);
 	ldlm_resource_putref(res);
 	return count;
 }
 
+int mdc_resource_get_unused(struct obd_export *exp, const struct lu_fid *fid,
+			    struct list_head *cancels, enum ldlm_mode mode,
+			    u64 bits)
+{
+	struct ldlm_res_id res_id;
+
+	fid_build_reg_res_name(fid, &res_id);
+	return mdc_resource_get_unused_res(exp, &res_id, cancels, mode, bits);
+}
+
 int mdc_setattr(struct obd_export *exp, struct md_op_data *op_data,
 		void *ea, size_t ealen, struct ptlrpc_request **request)
 {
-- 
1.8.3.1


* [lustre-devel] [PATCH 300/622] lustre: dom: mdc_lock_flush() improvement
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (298 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 299/622] lustre: dom: per-resource ELC for WRITE lock enqueue James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 301/622] lnet: Fix NI status in debugfs for loopback ni James Simmons
                   ` (322 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

There is a small improvement in osc_lock_flush(): it does not
match other locks when flushing a write lock, because there can be none.

Do the same in mdc_lock_flush().
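
The condition can be captured in a one-line helper; the enum below is a
simplified stand-in for the cl_lock machinery, not the kernel's types:

```c
#include <assert.h>
#include <stdbool.h>

enum cl_lock_mode { CLM_READ, CLM_WRITE };	/* simplified stand-in */

/*
 * When flushing a CLM_WRITE lock no other lock can cover the same pages,
 * so they are discarded outright instead of being matched against the
 * remaining locks first.
 */
static bool should_discard(enum cl_lock_mode mode, bool discard)
{
	return mode == CLM_WRITE || discard;
}
```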

WC-bug-id: https://jira.whamcloud.com/browse/LU-10894
Lustre-commit: 276221c2a1d2 ("LU-10894 dom: mdc_lock_flush() improvement")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34738
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Andrew Perepechko <c17827@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/mdc/mdc_dev.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/mdc/mdc_dev.c b/fs/lustre/mdc/mdc_dev.c
index 8f0e283..14cece1 100644
--- a/fs/lustre/mdc/mdc_dev.c
+++ b/fs/lustre/mdc/mdc_dev.c
@@ -253,7 +253,9 @@ static int mdc_lock_flush(const struct lu_env *env, struct osc_object *obj,
 			result = 0;
 	}
 
-	rc = mdc_lock_discard_pages(env, obj, start, end, discard);
+	/* Avoid lock matching with CLM_WRITE, there can be no other locks */
+	rc = mdc_lock_discard_pages(env, obj, start, end,
+				    mode == CLM_WRITE || discard);
 	if (result == 0 && rc < 0)
 		result = rc;
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 301/622] lnet: Fix NI status in debugfs for loopback ni
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (299 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 300/622] lustre: dom: mdc_lock_flush() improvement James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 302/622] lustre: ptlrpc: Add more flags to DEBUG_REQ_FLAGS macro James Simmons
                   ` (321 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

The loopback NI is never really "down", but since its associated
ns_status is used for other purposes, that's how it is reported in
proc_lnet_nis(). There's an existing check for lolnd, so just hardcode
the status as "up" there.
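
After the fix the reporting logic amounts to the following; the constant
values are illustrative, not the kernel's actual definitions:

```c
#include <assert.h>
#include <string.h>

#define LNET_NI_STATUS_UP	1	/* illustrative values */
#define LNET_NI_STATUS_DOWN	2
#define LOLND			0	/* loopback LND type */

/*
 * A loopback NI is reported "up" no matter what ns_status holds, since
 * that field is repurposed for other bookkeeping on the lolnd.
 */
static const char *ni_stat_str(int lnd_type, int ns_status)
{
	if (lnd_type == LOLND)
		return "up";

	return ns_status == LNET_NI_STATUS_UP ? "up" : "down";
}
```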

WC-bug-id: https://jira.whamcloud.com/browse/LU-12302
Lustre-commit: 0c27e760c357 ("LU-12302 lnet: Fix NI status in proc for loopback ni")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/34871
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/router_proc.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/lnet/lnet/router_proc.c b/net/lnet/lnet/router_proc.c
index 8517411..5341599 100644
--- a/net/lnet/lnet/router_proc.c
+++ b/net/lnet/lnet/router_proc.c
@@ -723,16 +723,18 @@ static int proc_lnet_nis(struct ctl_table *table, int write,
 			if (the_lnet.ln_routing)
 				last_alive = now - ni->ni_last_alive;
 
-			/* @lo forever alive */
-			if (ni->ni_net->net_lnd->lnd_type == LOLND)
-				last_alive = 0;
-
 			lnet_ni_lock(ni);
 			LASSERT(ni->ni_status);
 			stat = (ni->ni_status->ns_status ==
 				LNET_NI_STATUS_UP) ? "up" : "down";
 			lnet_ni_unlock(ni);
 
+			/* @lo forever alive */
+			if (ni->ni_net->net_lnd->lnd_type == LOLND) {
+				last_alive = 0;
+				stat = "up";
+			}
+
 			/*
 			 * we actually output credits information for
 			 * TX queue of each partition
-- 
1.8.3.1


* [lustre-devel] [PATCH 302/622] lustre: ptlrpc: Add more flags to DEBUG_REQ_FLAGS macro
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (300 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 301/622] lnet: Fix NI status in debugfs for loopback ni James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 303/622] lustre: llite: Revalidate dentries in ll_intent_file_open James Simmons
                   ` (320 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

The rq_req_unlinked, rq_reply_unlinked and rq_receiving_reply flags
determine whether a PtlRPC request can transition out of
RQ_PHASE_UNREG_RPC. Add these flags to the DEBUG_REQ_FLAGS macro to
aid in debugging issues where requests are stuck in this unregistering
state.
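
The FLAG machinery concatenates a one-letter marker per set bit; a
userspace miniature of it (the struct and helper names here are
hypothetical) behaves like this:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Each boolean request field contributes one letter to the state string */
#define FLAG(field, str)	((field) ? (str) : "")

struct req_state {			/* hypothetical miniature request */
	int rq_committed;
	int rq_req_unlinked;
	int rq_reply_unlinked;
	int rq_receiving_reply;
};

static void req_flags(const struct req_state *req, char *buf, size_t len)
{
	/* "M" was already printed; "Q", "U" and "r" are the new flags */
	snprintf(buf, len, "%s%s%s%s",
		 FLAG(req->rq_committed, "M"),
		 FLAG(req->rq_req_unlinked, "Q"),
		 FLAG(req->rq_reply_unlinked, "U"),
		 FLAG(req->rq_receiving_reply, "r"));
}
```

A request stuck in RQ_PHASE_UNREG_RPC will now show which of the three
conditions is still holding it there.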

WC-bug-id: https://jira.whamcloud.com/browse/LU-12333
Lustre-commit: 5bcc3a330e21 ("LU-12333 ptlrpc: Add more flags to DEBUG_REQ_FLAGS macro")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/34949
Reviewed-by: Ann Koehler <amk@cray.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_net.h | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/include/lustre_net.h b/fs/lustre/include/lustre_net.h
index f96265b..383d59e 100644
--- a/fs/lustre/include/lustre_net.h
+++ b/fs/lustre/include/lustre_net.h
@@ -1069,9 +1069,12 @@ static inline void lustre_set_rep_swabbed(struct ptlrpc_request *req,
 	FLAG(req->rq_no_resend, "N"),					      \
 	FLAG(req->rq_waiting, "W"),					      \
 	FLAG(req->rq_wait_ctx, "C"), FLAG(req->rq_hp, "H"),		      \
-	FLAG(req->rq_committed, "M")
+	FLAG(req->rq_committed, "M"),                                          \
+	FLAG(req->rq_req_unlinked, "Q"),                                       \
+	FLAG(req->rq_reply_unlinked, "U"),                                     \
+	FLAG(req->rq_receiving_reply, "r")
 
-#define REQ_FLAGS_FMT "%s:%s%s%s%s%s%s%s%s%s%s%s%s%s"
+#define REQ_FLAGS_FMT "%s:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s"
 
 void _debug_req(struct ptlrpc_request *req,
 		struct libcfs_debug_msg_data *data, const char *fmt, ...)
-- 
1.8.3.1


* [lustre-devel] [PATCH 303/622] lustre: llite: Revalidate dentries in ll_intent_file_open
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (301 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 302/622] lustre: ptlrpc: Add more flags to DEBUG_REQ_FLAGS macro James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 304/622] lustre: llite: hash just created files if lock allows James Simmons
                   ` (319 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Oleg Drokin <green@whamcloud.com>

We might get a lookup lock in response to our open request, and we
definitely want to ensure that our dentry is valid so that it can
actually be matched by the dcache code in future operations.

Benchmark results:

This patch can significantly improve open-create + stat on the same
client.

This patch in combination with two others:

https://review.whamcloud.com/#/c/33584
https://review.whamcloud.com/#/c/33585

Improves the 'stat' side of open-create + stat by >10x.

Without patches (master branch commit 26a7abe):

mpirun -np 24 --allow-run-as-root /work/tools/bin/mdtest -n 50000 -d /cache1/out/ -F -C -T -v -w 32k

   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   File creation     :       3838.205       3838.204       3838.204          0.000
   File stat         :      33459.289      33459.249      33459.271          0.011
   File read         :          0.000          0.000          0.000          0.000
   File removal      :          0.000          0.000          0.000          0.000
   Tree creation     :       3146.841       3146.841       3146.841          0.000
   Tree removal      :          0.000          0.000          0.000          0.000

With the three patches:

mpirun -np 24 --allow-run-as-root /work/tools/bin/mdtest -n 50000 -d /cache1/out/ -F -C -T -v -w 32k
SUMMARY rate: (of 1 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   File creation     :       3822.440       3822.439       3822.440          0.000
   File stat         :     350620.140     350615.980     350617.193          1.051
   File read         :          0.000          0.000          0.000          0.000
   File removal      :          0.000          0.000          0.000          0.000
   Tree creation     :       2076.727       2076.727       2076.727          0.000
   Tree removal      :          0.000          0.000          0.000          0.000

Note 33K stats/second vs 350K stats/second.

ls -l time of the mdtest directory is also reduced from 23.5 seconds to
5.8 seconds.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10948
Lustre-commit: 14ca3157b21d ("LU-10948 llite: Revalidate dentries in ll_intent_file_open")
Signed-off-by: Oleg Drokin <green@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32157
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c | 34 ++++++++++++++++++++--------------
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index e9d0ff9..191b0f9 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -419,25 +419,12 @@ void ll_dom_finish_open(struct inode *inode, struct ptlrpc_request *req,
 	struct page *vmpage;
 	struct niobuf_remote *rnb;
 	char *data;
-	struct lustre_handle lockh;
-	struct ldlm_lock *lock;
 	unsigned long index, start;
 	struct niobuf_local lnb;
-	bool dom_lock = false;
 
 	if (!obj)
 		return;
 
-	if (it->it_lock_mode != 0) {
-		lockh.cookie = it->it_lock_handle;
-		lock = ldlm_handle2lock(&lockh);
-		if (lock)
-			dom_lock = ldlm_has_dom(lock);
-		LDLM_LOCK_PUT(lock);
-	}
-	if (!dom_lock)
-		return;
-
 	if (!req_capsule_has_field(&req->rq_pill, &RMF_NIOBUF_INLINE,
 				   RCL_SERVER))
 		return;
@@ -576,8 +563,27 @@ static int ll_intent_file_open(struct dentry *de, void *lmm, int lmmsize,
 	rc = ll_prep_inode(&inode, req, NULL, itp);
 
 	if (!rc && itp->it_lock_mode) {
-		ll_dom_finish_open(d_inode(de), req, itp);
+		struct lustre_handle handle = {.cookie = itp->it_lock_handle};
+		struct ldlm_lock *lock;
+		bool has_dom_bit = false;
+
+		/* If we got a lock back and it has a LOOKUP bit set,
+		 * make sure the dentry is marked as valid so we can find it.
+		 * We don't need to care about actual hashing since other bits
+		 * of kernel will deal with that later.
+		 */
+		lock = ldlm_handle2lock(&handle);
+		if (lock) {
+			has_dom_bit = ldlm_has_dom(lock);
+			if (lock->l_policy_data.l_inodebits.bits &
+			    MDS_INODELOCK_LOOKUP)
+				d_lustre_revalidate(de);
+
+			LDLM_LOCK_PUT(lock);
+		}
 		ll_set_lock_data(sbi->ll_md_exp, inode, itp, NULL);
+		if (has_dom_bit)
+			ll_dom_finish_open(inode, req, itp);
 	}
 
 out:
-- 
1.8.3.1


* [lustre-devel] [PATCH 304/622] lustre: llite: hash just created files if lock allows
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (302 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 303/622] lustre: llite: Revalidate dentries in ll_intent_file_open James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 305/622] lnet: adds checking msg len James Simmons
                   ` (318 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Oleg Drokin <green@whamcloud.com>

If open|creat (and, later, other intent operations) returned a lookup bit
as part of the lock, hash the resultant dentry under this lock
so as not to trigger further RPCs in subsequent lookups.

Benchmark results:

This patch can significantly improve open-create + stat on the same
client.

This patch in combination with two others:

https://review.whamcloud.com/32157
https://review.whamcloud.com/33585

Improves the 'stat' side of open-create + stat by >10x.

Without patches (master branch commit 26a7abe):

mpirun -np 24 --allow-run-as-root /work/tools/bin/mdtest -n 50000 -d /cache1/out/ -F -C -T -v -w 32k
SUMMARY rate: (of 1 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   File creation     :       3838.205       3838.204       3838.204          0.000
   File stat         :      33459.289      33459.249      33459.271          0.011
   File read         :          0.000          0.000          0.000          0.000
   File removal      :          0.000          0.000          0.000          0.000
   Tree creation     :       3146.841       3146.841       3146.841          0.000
   Tree removal      :          0.000          0.000          0.000          0.000

With the three patches:

mpirun -np 24 --allow-run-as-root /work/tools/bin/mdtest -n 50000 -d /cache1/out/ -F -C -T -v -w 32k
SUMMARY rate: (of 1 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   File creation     :       3822.440       3822.439       3822.440          0.000
   File stat         :     350620.140     350615.980     350617.193          1.051
   File read         :          0.000          0.000          0.000          0.000
   File removal      :          0.000          0.000          0.000          0.000
   Tree creation     :       2076.727       2076.727       2076.727          0.000
   Tree removal      :          0.000          0.000          0.000          0.000

Note 33K stats/second vs 350K stats/second.

ls -l time of the mdtest directory is also reduced from 23.5 seconds to
5.8 seconds.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11623
Lustre-commit: fc42cbe0e2e5 ("LU-11623 llite: hash just created files if lock allows")
Signed-off-by: Oleg Drokin <green@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33584
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/namei.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index c3e8de4..3c796bd 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -678,9 +678,9 @@ static int ll_lookup_it_finish(struct ptlrpc_request *request,
 		if (bits & MDS_INODELOCK_LOOKUP)
 			d_lustre_revalidate(*de);
 	} else if (!it_disposition(it, DISP_OPEN_CREATE)) {
-		/* If file created on server, don't depend on parent UPDATE
-		 * lock to unhide it. It is left hidden and next lookup can
-		 * find it in ll_splice_alias.
+		/*
+		 * If file was created on the server, the dentry is revalidated
+		 * in ll_create_it if the lock allows for it.
 		 */
 		/* Check that parent has UPDATE lock. */
 		struct lookup_intent parent_it = {
@@ -1063,6 +1063,7 @@ static int ll_create_it(struct inode *dir, struct dentry *dentry,
 			struct lookup_intent *it, void *secctx, u32 secctxlen)
 {
 	struct inode *inode;
+	u64 bits = 0;
 	int rc = 0;
 
 	CDEBUG(D_VFSTRACE, "VFS Op:name=%pd, dir=" DFID "(%p), intent=%s\n",
@@ -1088,6 +1089,10 @@ static int ll_create_it(struct inode *dir, struct dentry *dentry,
 			return rc;
 	}
 
+	ll_set_lock_data(ll_i2sbi(dir)->ll_md_exp, inode, it, &bits);
+	if (bits & MDS_INODELOCK_LOOKUP)
+		d_lustre_revalidate(dentry);
+
 	d_instantiate(dentry, inode);
 
 	if (!(ll_i2sbi(inode)->ll_flags & LL_SBI_FILE_SECCTX))
-- 
1.8.3.1


* [lustre-devel] [PATCH 305/622] lnet: adds checking msg len
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (303 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 304/622] lustre: llite: hash just created files if lock allows James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 306/622] lustre: dne: add new dir hash type "space" James Simmons
                   ` (317 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Alexander Boyko <c17825@cray.com>

LNet can't handle a message with a length larger than LNET_MTU.
The following error occurred for 1MB DOM:
 LNetError: 3137:0:(lib-move.c:4143:lnet_parse()) 192.168.8.1@tcp,
 src 192.168.8.1@tcp: bad PUT payload 1051832 (1048576 max expected)

The patch adds a fragment size check.
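
The added validation reduces to a bounds check like the sketch below;
LNET_MAX_IOV's value here is illustrative, and -EINVAL mirrors the
kernel's return convention:

```c
#include <assert.h>
#include <errno.h>

#define LNET_MTU	(1 << 20)	/* 1MB, the limit quoted in the log */
#define LNET_MAX_IOV	256		/* illustrative fragment-count cap */

/*
 * Reject a memory descriptor whose fragment count or single-fragment
 * length exceeds what LNet can actually put on the wire.
 */
static int md_length_valid(unsigned int nfrags, unsigned int length)
{
	if (nfrags > LNET_MAX_IOV)
		return -EINVAL;		/* too many fragments */

	if (length > LNET_MTU)
		return -EINVAL;		/* fragment bigger than LNET_MTU */

	return 0;
}
```

The 1051832-byte payload from the log above fails this check, while a
full 1048576-byte (1MB) fragment still passes.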

WC-bug-id: https://jira.whamcloud.com/browse/LU-12140
Lustre-commit: 4d43a6c3b182 ("LU-12140 lnet: adds checking msg len")
Signed-off-by: Alexander Boyko <c17825@cray.com>
Cray-bug-id: LUS-7174
Reviewed-on: https://review.whamcloud.com/34975
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-md.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/lnet/lnet/lib-md.c b/net/lnet/lnet/lib-md.c
index 7ea0f5e..4a70c76 100644
--- a/net/lnet/lnet/lib-md.c
+++ b/net/lnet/lnet/lib-md.c
@@ -325,6 +325,10 @@ int lnet_cpt_of_md(struct lnet_libmd *md, unsigned int offset)
 		CERROR("Invalid option: too many fragments %u, %d max\n",
 		       umd->length, LNET_MAX_IOV);
 		return -EINVAL;
+	} else if (umd->length > LNET_MTU) {
+		CERROR("Invalid length: too big fragment size %u, %d max\n",
+		       umd->length, LNET_MTU);
+		return -EINVAL;
 	}
 
 	return 0;
-- 
1.8.3.1


* [lustre-devel] [PATCH 306/622] lustre: dne: add new dir hash type "space"
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (304 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 305/622] lnet: adds checking msg len James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 307/622] lustre: uapi: Add nonrotational flag to statfs James Simmons
                   ` (316 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

Add a new hash type "space". If this is set on the default LMV of
a directory, its subdirs will be created on all MDTs with
balanced space usage.

* new hash type LMV_HASH_TYPE_SPACE.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11213
Lustre-commit: a24f61532927 ("LU-11213 dne: add new dir hash type "space"")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34358
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Hongchao Zhang <hongchao@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/uapi/linux/lustre/lustre_user.h | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index dc39265..22a0144 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -650,12 +650,16 @@ enum lmv_hash_type {
 	LMV_HASH_TYPE_UNKNOWN	= 0,	/* 0 is reserved for testing purpose */
 	LMV_HASH_TYPE_ALL_CHARS = 1,
 	LMV_HASH_TYPE_FNV_1A_64 = 2,
+	LMV_HASH_TYPE_SPACE	= 3,	/*
+					 * distribute subdirs among all MDTs
+					 * with balanced space usage.
+					 */
+	LMV_HASH_TYPE_MAX,
 };
 
-#define LMV_HASH_TYPE_MAX	LMV_HASH_TYPE_FNV_1A_64 + 1
-
 #define LMV_HASH_NAME_ALL_CHARS		"all_char"
 #define LMV_HASH_NAME_FNV_1A_64		"fnv_1a_64"
+#define LMV_HASH_NAME_SPACE		"space"
 
 struct lustre_foreign_type {
 	uint32_t lft_type;
@@ -685,7 +689,7 @@ struct lmv_user_md_v1 {
 	__u32	lum_stripe_count;	/* dirstripe count */
 	__u32	lum_stripe_offset;	/* MDT idx for default dirstripe */
 	__u32	lum_hash_type;		/* Dir stripe policy */
-	__u32	lum_type;		/* LMV type: default or normal */
+	__u32	lum_type;		/* LMV type: default */
 	__u32	lum_padding1;
 	__u32	lum_padding2;
 	__u32	lum_padding3;
@@ -703,6 +707,15 @@ static inline __u32 lmv_foreign_to_md_stripes(__u32 size)
 	       sizeof(struct lmv_user_mds_data);
 }
 
+/*
+ * NB, historically default layout didn't set type, but use XATTR name to differ
+ * from normal layout, for backward compatibility, define LMV_TYPE_DEFAULT 0x0,
+ * and still use the same method.
+ */
+enum lmv_type {
+	LMV_TYPE_DEFAULT = 0x0000,
+};
+
 static inline int lmv_user_md_size(int stripes, int lmm_magic)
 {
 	int size = sizeof(struct lmv_user_md);
-- 
1.8.3.1


* [lustre-devel] [PATCH 307/622] lustre: uapi: Add nonrotational flag to statfs
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (305 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 306/622] lustre: dne: add new dir hash type "space" James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 308/622] lnet: libcfs: crashes with certain cpu part numbers James Simmons
                   ` (315 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

It is potentially useful for the MDS and userspace to
know whether or not an OST is using non-rotational media.

Add a flag to obd_statfs that reflects this.

Users can override this parameter in sysfs.

ZFS does not currently make this information available to
Lustre, so default to rotational and allow users to
override.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11963
Lustre-commit: 68635c3d9b31 ("LU-11963 osd: Add nonrotational flag to statfs")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34235
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Li Dongyang <dongyangli@ddn.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/wiretest.c             | 14 ++++++++++++++
 include/uapi/linux/lustre/lustre_user.h |  1 +
 2 files changed, 15 insertions(+)

diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index b8b561c..64ccc6e 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1745,6 +1745,20 @@ void lustre_assert_wire_constants(void)
 		 (long long)(int)offsetof(struct obd_statfs, os_spare9));
 	LASSERTF((int)sizeof(((struct obd_statfs *)0)->os_spare9) == 4, "found %lld\n",
 		 (long long)(int)sizeof(((struct obd_statfs *)0)->os_spare9));
+	LASSERTF(OS_STATE_DEGRADED == 0x1, "found %lld\n",
+		 (long long)OS_STATE_DEGRADED);
+	LASSERTF(OS_STATE_READONLY == 0x2, "found %lld\n",
+		 (long long)OS_STATE_READONLY);
+	LASSERTF(OS_STATE_NOPRECREATE == 0x4, "found %lld\n",
+		 (long long)OS_STATE_NOPRECREATE);
+	LASSERTF(OS_STATE_ENOSPC == 0x20, "found %lld\n",
+		 (long long)OS_STATE_ENOSPC);
+	LASSERTF(OS_STATE_ENOINO == 0x40, "found %lld\n",
+		 (long long)OS_STATE_ENOINO);
+	LASSERTF(OS_STATE_SUM == 0x100, "found %lld\n",
+		 (long long)OS_STATE_SUM);
+	LASSERTF(OS_STATE_NONROT == 0x200, "found %lld\n",
+		 (long long)OS_STATE_NONROT);
 
 	/* Checks for struct obd_ioobj */
 	LASSERTF((int)sizeof(struct obd_ioobj) == 24, "found %lld\n",
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 22a0144..d66c883 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -105,6 +105,7 @@ enum obd_statfs_state {
 	OS_STATE_ENOSPC		= 0x00000020, /**< not enough free space */
 	OS_STATE_ENOINO		= 0x00000040, /**< not enough inodes */
 	OS_STATE_SUM		= 0x00000100, /**< aggregated for all tagrets */
+	OS_STATE_NONROT		= 0x00000200, /**< non-rotational device */
 };
 
 struct obd_statfs {
-- 
1.8.3.1


* [lustre-devel] [PATCH 308/622] lnet: libcfs: crashes with certain cpu part numbers
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (306 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 307/622] lustre: uapi: Add nonrotational flag to statfs James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 309/622] lustre: lov: fix wrong calculated length for fiemap James Simmons
                   ` (314 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Andrew Perepechko <c17827@cray.com>

Due to a bug in the code, libcfs will crash if the
number of online cpus is not evenly divisible by the
number of cpu partitions.

Based on the checks in cfs_cpt_table_create(), it
appears that the original intent was to push the
remaining cpus into the initial partitions.

So let's do that properly.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12352
Lustre-commit: e33e3da58972 ("LU-12352 libcfs: crashes with certain cpu part numbers")
Signed-off-by: Andrew Perepechko <c17827@cray.com>
Reviewed-by: Alexander Boyko <c17825@cray.com>
Reviewed-by: Alexander Zarochentsev <c17826@cray.com>
Cray-bug-id: LUS-6455
Reviewed-on: https://review.whamcloud.com/34991
Reviewed-by: Gu Zheng <gzheng@ddn.com>
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/libcfs/libcfs_cpu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/lnet/libcfs/libcfs_cpu.c b/net/lnet/libcfs/libcfs_cpu.c
index 80533c2..20ca15a 100644
--- a/net/lnet/libcfs/libcfs_cpu.c
+++ b/net/lnet/libcfs/libcfs_cpu.c
@@ -913,7 +913,7 @@ static struct cfs_cpt_table *cfs_cpt_table_create(int ncpt)
 			int ncpu = cpumask_weight(part->cpt_cpumask);
 
 			rc = cfs_cpt_choose_ncpus(cptab, cpt, node_mask,
-						  num - ncpu);
+						  (rem > 0) + num - ncpu);
 			if (rc < 0) {
 				rc = -EINVAL;
 				goto failed_mask;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 309/622] lustre: lov: fix wrong calculated length for fiemap
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (307 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 308/622] lnet: libcfs: crashes with certain cpu part numbers James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 310/622] lustre: obdclass: remove unprotected access to lu_object James Simmons
                   ` (313 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

lov_stripe_intersects() will return a closed interval
[@obd_start, @obd_end], so to calculate the length of the interval we need

 @obd_end - @obd_start + 1

rather than

 @obd_end - @obd_start

A wrong extent length makes us return incorrect fiemap information.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12361
Lustre-commit: 225e7b8c70fb ("LU-12361 lov: fix wrong calculated length for fiemap")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/34998
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Gu Zheng <gzheng@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/lov/lov_object.c | 4 ++--
 fs/lustre/lov/lov_offset.c | 2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/lov/lov_object.c b/fs/lustre/lov/lov_object.c
index 7543ef2..27e0ca5 100644
--- a/fs/lustre/lov/lov_object.c
+++ b/fs/lustre/lov/lov_object.c
@@ -1677,7 +1677,7 @@ static int fiemap_for_stripe(const struct lu_env *env, struct cl_object *obj,
 	if (lun_start == lun_end)
 		return 0;
 
-	req_fm_len = obd_object_end - lun_start;
+	req_fm_len = obd_object_end - lun_start + 1;
 	fs->fs_fm->fm_length = 0;
 	len_mapped_single_call = 0;
 
@@ -1723,7 +1723,7 @@ static int fiemap_for_stripe(const struct lu_env *env, struct cl_object *obj,
 			fs->fs_fm->fm_mapped_extents = 1;
 
 			fm_ext[0].fe_logical = lun_start;
-			fm_ext[0].fe_length = obd_object_end - lun_start;
+			fm_ext[0].fe_length = obd_object_end - lun_start + 1;
 			fm_ext[0].fe_flags |= FIEMAP_EXTENT_UNKNOWN;
 
 			goto inactive_tgt;
diff --git a/fs/lustre/lov/lov_offset.c b/fs/lustre/lov/lov_offset.c
index bb67d82..b53ce43 100644
--- a/fs/lustre/lov/lov_offset.c
+++ b/fs/lustre/lov/lov_offset.c
@@ -226,6 +226,8 @@ u64 lov_size_to_stripe(struct lov_stripe_md *lsm, int index, u64 file_size,
 /* given an extent in an lov and a stripe, calculate the extent of the stripe
  * that is contained within the lov extent.  this returns true if the given
  * stripe does intersect with the lov extent.
+ *
+ * Closed interval [@obd_start, @obd_end] will be returned.
  */
 int lov_stripe_intersects(struct lov_stripe_md *lsm, int index, int stripeno,
 			  struct lu_extent *ext, u64 *obd_start, u64 *obd_end)
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 310/622] lustre: obdclass: remove unprotected access to lu_object
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (308 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 309/622] lustre: lov: fix wrong calculated length for fiemap James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:12 ` [lustre-devel] [PATCH 311/622] lustre: push rcu_barrier() before destroying slab James Simmons
                   ` (312 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

The check of lu_object_is_dying() is done after the reference
drop and without the lock held, so it can access a freed object
if a concurrent thread did the final put.

The patch saves the object state right before atomic_dec_and_lock()
and checks the saved value afterwards, so the object itself is not accessed.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11204
Lustre-commit: 336cf0f2f3a9 ("LU-11204 obdclass: remove unprotected access to lu_object")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34960
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/lu_object.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/fs/lustre/obdclass/lu_object.c b/fs/lustre/obdclass/lu_object.c
index 2f709b0..bafd817 100644
--- a/fs/lustre/obdclass/lu_object.c
+++ b/fs/lustre/obdclass/lu_object.c
@@ -128,22 +128,18 @@ enum {
 void lu_object_put(const struct lu_env *env, struct lu_object *o)
 {
 	struct lu_site_bkt_data *bkt;
-	struct lu_object_header *top;
-	struct lu_site *site;
-	struct lu_object *orig;
+	struct lu_object_header *top = o->lo_header;
+	struct lu_site *site = o->lo_dev->ld_site;
+	struct lu_object *orig = o;
 	struct cfs_hash_bd bd;
-	const struct lu_fid *fid;
-
-	top  = o->lo_header;
-	site = o->lo_dev->ld_site;
-	orig = o;
+	const struct lu_fid *fid = lu_object_fid(o);
+	bool is_dying;
 
 	/*
 	 * till we have full fids-on-OST implemented anonymous objects
 	 * are possible in OSP. such an object isn't listed in the site
 	 * so we should not remove it from the site.
 	 */
-	fid = lu_object_fid(o);
 	if (fid_is_zero(fid)) {
 		LASSERT(!top->loh_hash.next && !top->loh_hash.pprev);
 		LASSERT(list_empty(&top->loh_lru));
@@ -160,8 +156,14 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 	cfs_hash_bd_get(site->ls_obj_hash, &top->loh_fid, &bd);
 	bkt = cfs_hash_bd_extra_get(site->ls_obj_hash, &bd);
 
+	is_dying = lu_object_is_dying(top);
 	if (!cfs_hash_bd_dec_and_lock(site->ls_obj_hash, &bd, &top->loh_ref)) {
-		if (lu_object_is_dying(top)) {
+		/* at this point the object reference is dropped and the lock
+		 * is not taken, so the lu_object must not be touched because
+		 * it can be freed by a concurrent thread. Use the local
+		 * variable for the check.
+		 */
+		if (is_dying) {
 			/*
 			 * somebody may be waiting for this, currently only
 			 * used for cl_object, see cl_object_put_last().
@@ -180,6 +182,10 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 			o->lo_ops->loo_object_release(env, o);
 	}
 
+	/* don't use the local 'is_dying' here because it was taken without
+	 * the lock, but here we need the latest actual value of it, so check
+	 * lu_object directly.
+	 */
 	if (!lu_object_is_dying(top)) {
 		LASSERT(list_empty(&top->loh_lru));
 		list_add_tail(&top->loh_lru, &bkt->lsb_lru);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 311/622] lustre: push rcu_barrier() before destroying slab
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (309 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 310/622] lustre: obdclass: remove unprotected access to lu_object James Simmons
@ 2020-02-27 21:12 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 312/622] lustre: ptlrpc: intent_getattr fetches default LMV James Simmons
                   ` (311 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:12 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

From rcubarrier.txt:

"
We could try placing a synchronize_rcu() in the module-exit code path,
but this is not sufficient. Although synchronize_rcu() does wait for a
grace period to elapse, it does not wait for the callbacks to complete.

One might be tempted to try several back-to-back synchronize_rcu()
calls, but this is still not guaranteed to work. If there is a very
heavy RCU-callback load, then some of the callbacks might be deferred
in order to allow other processing to proceed. Such deferral is required
in realtime kernels in order to avoid excessive scheduling latencies.

We instead need the rcu_barrier() primitive. This primitive is similar
to synchronize_rcu(), but instead of waiting solely for a grace
period to elapse, it also waits for all outstanding RCU callbacks to
complete. Pseudo-code using rcu_barrier() is as follows:

   1. Prevent any new RCU callbacks from being posted.
   2. Execute rcu_barrier().
   3. Allow the module to be unloaded.
"

So using synchronize_rcu() in ldlm_exit() is not safe enough, and we might
still hit a use-after-free problem. We also missed an rcu_barrier() when
destroying the inode cache; this follows the same idea that current local
filesystems use.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12374
Lustre-commit: 1f7613968c80 ("LU-12374 lustre: push rcu_barrier() before destroying slab")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/35030
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Gu Zheng <gzheng@ddn.com>
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_lockd.c | 6 +++---
 fs/lustre/llite/super25.c   | 5 +++++
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/ldlm/ldlm_lockd.c b/fs/lustre/ldlm/ldlm_lockd.c
index 3b405be..79dab6e 100644
--- a/fs/lustre/ldlm/ldlm_lockd.c
+++ b/fs/lustre/ldlm/ldlm_lockd.c
@@ -1204,10 +1204,10 @@ void ldlm_exit(void)
 	kmem_cache_destroy(ldlm_resource_slab);
 	/*
 	 * ldlm_lock_put() use RCU to call ldlm_lock_free, so need call
-	 * synchronize_rcu() to wait a grace period elapsed, so that
-	 * ldlm_lock_free() get a chance to be called.
+	 * rcu_barrier() to wait all outstanding RCU callbacks to complete,
+	 * so that ldlm_lock_free() get a chance to be called.
 	 */
-	synchronize_rcu();
+	rcu_barrier();
 	kmem_cache_destroy(ldlm_lock_slab);
 	kmem_cache_destroy(ldlm_interval_tree_slab);
 }
diff --git a/fs/lustre/llite/super25.c b/fs/lustre/llite/super25.c
index 133fe2a..6cae48c 100644
--- a/fs/lustre/llite/super25.c
+++ b/fs/lustre/llite/super25.c
@@ -271,6 +271,11 @@ static void __exit lustre_exit(void)
 	cl_env_put(cl_inode_fini_env, &cl_inode_fini_refcheck);
 	vvp_global_fini();
 
+	/*
+	 * Make sure all delayed rcu free inodes are flushed before we
+	 * destroy cache.
+	 */
+	rcu_barrier();
 	kmem_cache_destroy(ll_inode_cachep);
 	kmem_cache_destroy(ll_file_data_slab);
 }
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 312/622] lustre: ptlrpc: intent_getattr fetches default LMV
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (310 preceding siblings ...)
  2020-02-27 21:12 ` [lustre-devel] [PATCH 311/622] lustre: push rcu_barrier() before destroying slab James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 313/622] lustre: mdc: add async statfs James Simmons
                   ` (310 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

Intent_getattr fetches the default LMV and caches it on the client,
where it is used for subdirectory creation.

* Add RMF_DEFAULT_MDT_MD to the intent_getattr reply.
* Save the default LMV in ll_inode_info->lli_default_lsm_md, and
  replace lli_def_stripe_offset with it.
* Take a LOOKUP lock on default LMV setting so the client can update
  its cached default LMV.
* Improve mdt_object_striped() to read from the bottom device
  to avoid reading stripe FIDs.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11213
Lustre-commit: 55ca00c3d1cd ("LU-11213 ptlrpc: intent_getattr fetches default LMV")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34802
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_lmv.h        | 13 +++++--
 fs/lustre/include/lustre_req_layout.h |  1 +
 fs/lustre/include/obd.h               | 14 +++++--
 fs/lustre/llite/llite_internal.h      | 20 ++--------
 fs/lustre/llite/llite_lib.c           | 72 ++++++++++++++++++++++++++++++++---
 fs/lustre/llite/namei.c               | 56 +++++++++++++++++++--------
 fs/lustre/lmv/lmv_obd.c               | 41 +++++++++++++++-----
 fs/lustre/mdc/mdc_locks.c             | 10 +++--
 fs/lustre/mdc/mdc_request.c           | 41 ++++++++++++++++----
 fs/lustre/ptlrpc/layout.c             |  8 +++-
 10 files changed, 210 insertions(+), 66 deletions(-)

diff --git a/fs/lustre/include/lustre_lmv.h b/fs/lustre/include/lustre_lmv.h
index cef315d..c88e4b5 100644
--- a/fs/lustre/include/lustre_lmv.h
+++ b/fs/lustre/include/lustre_lmv.h
@@ -72,10 +72,12 @@ struct lmv_stripe_md {
 	    strcmp(lsm1->lsm_md_pool_name, lsm2->lsm_md_pool_name) != 0)
 		return false;
 
-	for (idx = 0; idx < lsm1->lsm_md_stripe_count; idx++) {
-		if (!lu_fid_eq(&lsm1->lsm_md_oinfo[idx].lmo_fid,
-			       &lsm2->lsm_md_oinfo[idx].lmo_fid))
-			return false;
+	if (lsm1->lsm_md_magic == LMV_MAGIC_V1) {
+		for (idx = 0; idx < lsm1->lsm_md_stripe_count; idx++) {
+			if (!lu_fid_eq(&lsm1->lsm_md_oinfo[idx].lmo_fid,
+				       &lsm2->lsm_md_oinfo[idx].lmo_fid))
+				return false;
+		}
 	}
 
 	return true;
@@ -92,6 +94,9 @@ static inline void lsm_md_dump(int mask, const struct lmv_stripe_md *lsm)
 	       lsm->lsm_md_layout_version, lsm->lsm_md_migrate_offset,
 	       lsm->lsm_md_migrate_hash, lsm->lsm_md_pool_name);
 
+	if (lsm->lsm_md_magic != LMV_MAGIC_V1)
+		return;
+
 	for (i = 0; i < lsm->lsm_md_stripe_count; i++)
 		CDEBUG(mask, "stripe[%d] "DFID"\n",
 		       i, PFID(&lsm->lsm_md_oinfo[i].lmo_fid));
diff --git a/fs/lustre/include/lustre_req_layout.h b/fs/lustre/include/lustre_req_layout.h
index 378f0b6..dca4ef4 100644
--- a/fs/lustre/include/lustre_req_layout.h
+++ b/fs/lustre/include/lustre_req_layout.h
@@ -249,6 +249,7 @@ void req_capsule_shrink(struct req_capsule *pill,
 extern struct req_msg_field RMF_LDLM_INTENT;
 extern struct req_msg_field RMF_LAYOUT_INTENT;
 extern struct req_msg_field RMF_MDT_MD;
+extern struct req_msg_field RMF_DEFAULT_MDT_MD;
 extern struct req_msg_field RMF_REC_REINT;
 extern struct req_msg_field RMF_EADATA;
 extern struct req_msg_field RMF_EAVALS;
diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index 996211a..fb77df7 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -729,6 +729,14 @@ enum md_cli_flags {
 	CLI_MIGRATE		= BIT(4),
 };
 
+enum md_op_code {
+	LUSTRE_OPC_MKDIR	= 0,
+	LUSTRE_OPC_SYMLINK	= 1,
+	LUSTRE_OPC_MKNOD	= 2,
+	LUSTRE_OPC_CREATE	= 3,
+	LUSTRE_OPC_ANY		= 5,
+};
+
 /**
  * GETXATTR is not included as only a couple of fields in the reply body
  * is filled, but not FID which is needed for common intent handling in
@@ -746,6 +754,7 @@ struct md_op_data {
 	struct lu_fid		op_fid4; /* to the operation locks. */
 	u32			op_mds;  /* what mds server open will go to */
 	u32			op_mode;
+	enum md_op_code		op_code;
 	struct lustre_handle	op_open_handle;
 	s64			op_mod_time;
 	const char	       *op_name;
@@ -754,6 +763,7 @@ struct md_op_data {
 	struct rw_semaphore	*op_mea2_sem;
 	struct lmv_stripe_md   *op_mea1;
 	struct lmv_stripe_md   *op_mea2;
+	struct lmv_stripe_md	*op_default_mea1;	/* default LMV */
 	u32			op_suppgids[2];
 	u32			op_fsuid;
 	u32			op_fsgid;
@@ -791,9 +801,6 @@ struct md_op_data {
 	void		       *op_file_secctx;
 	u32			op_file_secctx_size;
 
-	/* default stripe offset */
-	u32			op_default_stripe_offset;
-
 	u32			op_projid;
 
 	u16			op_mirror_id;
@@ -933,6 +940,7 @@ struct lustre_md {
 		struct lmv_stripe_md	*lmv;
 		struct lmv_foreign_md   *lfm;
 	};
+	struct lmv_stripe_md    *default_lmv;
 #ifdef CONFIG_LUSTRE_FS_POSIX_ACL
 	struct posix_acl		*posix_acl;
 #endif
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index eb7e0dc..687d504 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -172,13 +172,8 @@ struct ll_inode_info {
 			struct rw_semaphore		lli_lsm_sem;
 			/* directory stripe information */
 			struct lmv_stripe_md	       *lli_lsm_md;
-			/* default directory stripe offset.  This is extracted
-			 * from the "dmv" xattr in order to decide which MDT to
-			 * create a subdirectory on.  The MDS itself fetches
-			 * "dmv" and gets the rest of the default layout itself
-			 * (count, hash, etc).
-			 */
-			u32				lli_def_stripe_offset;
+			/* directory default LMV */
+			struct lmv_stripe_md		*lli_default_lsm_md;
 		};
 
 		/* for non-directory */
@@ -921,19 +916,12 @@ int ll_prep_inode(struct inode **inode, struct ptlrpc_request *req,
 int ll_get_default_mdsize(struct ll_sb_info *sbi, int *default_mdsize);
 int ll_set_default_mdsize(struct ll_sb_info *sbi, int default_mdsize);
 
-enum {
-	LUSTRE_OPC_MKDIR	= 0,
-	LUSTRE_OPC_SYMLINK	= 1,
-	LUSTRE_OPC_MKNOD	= 2,
-	LUSTRE_OPC_CREATE	= 3,
-	LUSTRE_OPC_ANY		= 5,
-};
-
 void ll_unlock_md_op_lsm(struct md_op_data *op_data);
 struct md_op_data *ll_prep_md_op_data(struct md_op_data *op_data,
 				      struct inode *i1, struct inode *i2,
 				      const char *name, size_t namelen,
-				      u32 mode, u32 opc, void *data);
+				      u32 mode, enum md_op_code opc,
+				      void *data);
 void ll_finish_md_op_data(struct md_op_data *op_data);
 int ll_get_obd_name(struct inode *inode, unsigned int cmd, unsigned long arg);
 void ll_compute_rootsquash_state(struct ll_sb_info *sbi);
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index e6ac16f..bd17ba1 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -939,7 +939,6 @@ void ll_lli_init(struct ll_inode_info *lli)
 		spin_lock_init(&lli->lli_sa_lock);
 		lli->lli_opendir_pid = 0;
 		lli->lli_sa_enabled = 0;
-		lli->lli_def_stripe_offset = -1;
 		init_rwsem(&lli->lli_lsm_sem);
 	} else {
 		mutex_init(&lli->lli_size_mutex);
@@ -1216,6 +1215,11 @@ void ll_dir_clear_lsm_md(struct inode *inode)
 		lmv_free_memmd(lli->lli_lsm_md);
 		lli->lli_lsm_md = NULL;
 	}
+
+	if (lli->lli_default_lsm_md) {
+		lmv_free_memmd(lli->lli_default_lsm_md);
+		lli->lli_default_lsm_md = NULL;
+	}
 }
 
 static struct inode *ll_iget_anon_dir(struct super_block *sb,
@@ -1314,6 +1318,46 @@ static int ll_init_lsm_md(struct inode *inode, struct lustre_md *md)
 	return 0;
 }
 
+static void ll_update_default_lsm_md(struct inode *inode, struct lustre_md *md)
+{
+	struct ll_inode_info *lli = ll_i2info(inode);
+
+	if (!md->default_lmv) {
+		/* clear default lsm */
+		if (lli->lli_default_lsm_md) {
+			down_write(&lli->lli_lsm_sem);
+			if (lli->lli_default_lsm_md) {
+				lmv_free_memmd(lli->lli_default_lsm_md);
+				lli->lli_default_lsm_md = NULL;
+			}
+			up_write(&lli->lli_lsm_sem);
+		}
+	} else if (lli->lli_default_lsm_md) {
+		/* update default lsm if it changes */
+		down_read(&lli->lli_lsm_sem);
+		if (lli->lli_default_lsm_md &&
+		    !lsm_md_eq(lli->lli_default_lsm_md, md->default_lmv)) {
+			up_read(&lli->lli_lsm_sem);
+			down_write(&lli->lli_lsm_sem);
+			if (lli->lli_default_lsm_md)
+				lmv_free_memmd(lli->lli_default_lsm_md);
+			lli->lli_default_lsm_md = md->default_lmv;
+			lsm_md_dump(D_INODE, md->default_lmv);
+			md->default_lmv = NULL;
+			up_write(&lli->lli_lsm_sem);
+		} else {
+			up_read(&lli->lli_lsm_sem);
+		}
+	} else {
+		/* init default lsm */
+		down_write(&lli->lli_lsm_sem);
+		lli->lli_default_lsm_md = md->default_lmv;
+		lsm_md_dump(D_INODE, md->default_lmv);
+		md->default_lmv = NULL;
+		up_write(&lli->lli_lsm_sem);
+	}
+}
+
 static int ll_update_lsm_md(struct inode *inode, struct lustre_md *md)
 {
 	struct ll_inode_info *lli = ll_i2info(inode);
@@ -1324,6 +1368,10 @@ static int ll_update_lsm_md(struct inode *inode, struct lustre_md *md)
 	CDEBUG(D_INODE, "update lsm %p of " DFID "\n", lli->lli_lsm_md,
 	       PFID(ll_inode2fid(inode)));
 
+	/* update default LMV */
+	if (md->default_lmv)
+		ll_update_default_lsm_md(inode, md);
+
 	/*
 	 * no striped information from request, lustre_md from req does not
 	 * include stripeEA, see ll_md_setattr()
@@ -2322,6 +2370,7 @@ int ll_prep_inode(struct inode **inode, struct ptlrpc_request *req,
 {
 	struct ll_sb_info *sbi = NULL;
 	struct lustre_md md = { NULL };
+	bool default_lmv_deleted = false;
 	int rc;
 
 	LASSERT(*inode || sb);
@@ -2331,6 +2380,15 @@ int ll_prep_inode(struct inode **inode, struct ptlrpc_request *req,
 	if (rc)
 		goto out;
 
+	/*
+	 * clear default_lmv only if intent_getattr reply doesn't contain it.
+	 * but it needs to be done after iget, check this early because
+	 * ll_update_lsm_md() may change md.
+	 */
+	if (it && (it->it_op & (IT_LOOKUP | IT_GETATTR)) &&
+	    S_ISDIR(md.body->mbo_mode) && !md.default_lmv)
+		default_lmv_deleted = true;
+
 	if (*inode) {
 		rc = ll_update_inode(*inode, &md);
 		if (rc)
@@ -2396,9 +2454,12 @@ int ll_prep_inode(struct inode **inode, struct ptlrpc_request *req,
 		LDLM_LOCK_PUT(lock);
 	}
 
+	if (default_lmv_deleted)
+		ll_update_default_lsm_md(*inode, &md);
 out:
 	/* cleanup will be done if necessary */
 	md_free_lustre_md(sbi->ll_md_exp, &md);
+
 	if (rc != 0 && it && it->it_op & IT_OPEN)
 		ll_open_cleanup(sb ? sb : (*inode)->i_sb, req);
 
@@ -2481,7 +2542,8 @@ void ll_unlock_md_op_lsm(struct md_op_data *op_data)
 struct md_op_data *ll_prep_md_op_data(struct md_op_data *op_data,
 				      struct inode *i1, struct inode *i2,
 				      const char *name, size_t namelen,
-				      u32 mode, u32 opc, void *data)
+				      u32 mode, enum md_op_code opc,
+				      void *data)
 {
 	if (!name) {
 		/* Do not reuse namelen for something else. */
@@ -2503,15 +2565,13 @@ struct md_op_data *ll_prep_md_op_data(struct md_op_data *op_data,
 
 	ll_i2gids(op_data->op_suppgids, i1, i2);
 	op_data->op_fid1 = *ll_inode2fid(i1);
-	op_data->op_default_stripe_offset = -1;
+	op_data->op_code = opc;
 
 	if (S_ISDIR(i1->i_mode)) {
 		down_read(&ll_i2info(i1)->lli_lsm_sem);
 		op_data->op_mea1_sem = &ll_i2info(i1)->lli_lsm_sem;
 		op_data->op_mea1 = ll_i2info(i1)->lli_lsm_md;
-		if (opc == LUSTRE_OPC_MKDIR)
-			op_data->op_default_stripe_offset =
-				ll_i2info(i1)->lli_def_stripe_offset;
+		op_data->op_default_mea1 = ll_i2info(i1)->lli_default_lsm_md;
 	}
 
 	if (i2) {
diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index 3c796bd..1aaf184 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -246,8 +246,6 @@ void ll_lock_cancel_bits(struct ldlm_lock *lock, u64 to_cancel)
 	}
 
 	if (bits & MDS_INODELOCK_XATTR) {
-		if (S_ISDIR(inode->i_mode))
-			ll_i2info(inode)->lli_def_stripe_offset = -1;
 		ll_xattr_cache_destroy(inode);
 		bits &= ~MDS_INODELOCK_XATTR;
 	}
@@ -1155,14 +1153,10 @@ static int ll_new_node(struct inode *dir, struct dentry *dentry,
 			from_kuid(&init_user_ns, current_fsuid()),
 			from_kgid(&init_user_ns, current_fsgid()),
 			current_cap(), rdev, &request);
-	if (err < 0 && err != -EREMOTE)
-		goto err_exit;
-
+#if OBD_OCD_VERSION(2, 14, 58, 0) > LUSTRE_VERSION_CODE
 	/*
-	 * If the client doesn't know where to create a subdirectory (or
-	 * in case of a race that sends the RPC to the wrong MDS), the
-	 * MDS will return -EREMOTE and the client will fetch the layout
-	 * of the directory, then create the directory on the right MDT.
+	 * server < 2.12.58 doesn't pack default LMV in intent_getattr reply,
+	 * fetch default LMV here.
 	 */
 	if (unlikely(err == -EREMOTE)) {
 		struct ll_inode_info *lli = ll_i2info(dir);
@@ -1174,26 +1168,58 @@ static int ll_new_node(struct inode *dir, struct dentry *dentry,
 
 		err2 = ll_dir_getstripe(dir, (void **)&lum, &lumsize, &request,
 					OBD_MD_DEFAULT_MEA);
+		ll_finish_md_op_data(op_data);
+		op_data = NULL;
 		if (!err2) {
-			/* Update stripe_offset and retry */
-			lli->lli_def_stripe_offset = lum->lum_stripe_offset;
-		} else if (err2 == -ENODATA &&
-			   lli->lli_def_stripe_offset != -1) {
+			struct lustre_md md = { NULL };
+
+			md.body = req_capsule_server_get(&request->rq_pill,
+							 &RMF_MDT_BODY);
+			if (!md.body) {
+				err = -EPROTO;
+				goto err_exit;
+			}
+
+			md.default_lmv = kzalloc(sizeof(*md.default_lmv),
+						 GFP_NOFS);
+			if (!md.default_lmv) {
+				err = -ENOMEM;
+				goto err_exit;
+			}
+
+			md.default_lmv->lsm_md_magic = lum->lum_magic;
+			md.default_lmv->lsm_md_stripe_count =
+				lum->lum_stripe_count;
+			md.default_lmv->lsm_md_master_mdt_index =
+				lum->lum_stripe_offset;
+			md.default_lmv->lsm_md_hash_type = lum->lum_hash_type;
+
+			err = ll_update_inode(dir, &md);
+			md_free_lustre_md(sbi->ll_md_exp, &md);
+			if (err)
+				goto err_exit;
+		} else if (err2 == -ENODATA && lli->lli_default_lsm_md) {
 			/*
 			 * If there are no default stripe EA on the MDT, but the
 			 * client has default stripe, then it probably means
 			 * default stripe EA has just been deleted.
 			 */
-			lli->lli_def_stripe_offset = -1;
+			down_write(&lli->lli_lsm_sem);
+			kfree(lli->lli_default_lsm_md);
+			lli->lli_default_lsm_md = NULL;
+			up_write(&lli->lli_lsm_sem);
 		} else {
 			goto err_exit;
 		}
 
 		ptlrpc_req_finished(request);
 		request = NULL;
-		ll_finish_md_op_data(op_data);
 		goto again;
 	}
+#endif
+
+	if (err < 0)
+		goto err_exit;
 
 	ll_update_times(request, dir);
 
diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index 4b5bd36..48cd41a 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -1176,13 +1176,12 @@ static int lmv_placement_policy(struct obd_device *obd,
 	    le32_to_cpu(lum->lum_magic != LMV_MAGIC_FOREIGN) &&
 	    le32_to_cpu(lum->lum_stripe_offset) != (u32)-1) {
 		*mds = le32_to_cpu(lum->lum_stripe_offset);
-	} else if (op_data->op_default_stripe_offset != (u32)-1) {
-		*mds = op_data->op_default_stripe_offset;
+	} else if (op_data->op_code == LUSTRE_OPC_MKDIR &&
+		   op_data->op_default_mea1 &&
+		   op_data->op_default_mea1->lsm_md_master_mdt_index !=
+			 (u32)-1) {
+		*mds = op_data->op_default_mea1->lsm_md_master_mdt_index;
 		op_data->op_mds = *mds;
-		/* Correct the stripe offset in lum */
-		if (lum &&
-		    le32_to_cpu(lum->lum_magic != LMV_MAGIC_FOREIGN))
-			lum->lum_stripe_offset = cpu_to_le32(*mds);
 	} else {
 		*mds = op_data->op_mds;
 	}
@@ -2981,6 +2980,18 @@ static int lmv_unpack_md_v1(struct obd_export *exp, struct lmv_stripe_md *lsm,
 	return rc;
 }
 
+static inline int lmv_unpack_user_md(struct obd_export *exp,
+				     struct lmv_stripe_md *lsm,
+				     const struct lmv_user_md *lmu)
+{
+	lsm->lsm_md_magic = le32_to_cpu(lmu->lum_magic);
+	lsm->lsm_md_stripe_count = le32_to_cpu(lmu->lum_stripe_count);
+	lsm->lsm_md_master_mdt_index = le32_to_cpu(lmu->lum_stripe_offset);
+	lsm->lsm_md_hash_type = le32_to_cpu(lmu->lum_hash_type);
+
+	return 0;
+}
+
 static int lmv_unpackmd(struct obd_export *exp, struct lmv_stripe_md **lsmp,
 			const union lmv_mds_md *lmm, size_t lmm_size)
 {
@@ -3005,9 +3016,14 @@ static int lmv_unpackmd(struct obd_export *exp, struct lmv_stripe_md **lsmp,
 			return 0;
 		}
 
-		for (i = 0; i < lsm->lsm_md_stripe_count; i++) {
-			if (lsm->lsm_md_oinfo[i].lmo_root)
-				iput(lsm->lsm_md_oinfo[i].lmo_root);
+		if (lsm->lsm_md_magic == LMV_MAGIC) {
+			for (i = 0; i < lsm->lsm_md_stripe_count; i++) {
+				if (lsm->lsm_md_oinfo[i].lmo_root)
+					iput(lsm->lsm_md_oinfo[i].lmo_root);
+			}
+			lsm_size = lmv_stripe_md_size(lsm->lsm_md_stripe_count);
+		} else {
+			lsm_size = lmv_stripe_md_size(0);
 		}
 		kvfree(lsm);
 		*lsmp = NULL;
@@ -3066,6 +3082,9 @@ static int lmv_unpackmd(struct obd_export *exp, struct lmv_stripe_md **lsmp,
 	case LMV_MAGIC_V1:
 		rc = lmv_unpack_md_v1(exp, lsm, &lmm->lmv_md_v1);
 		break;
+	case LMV_USER_MAGIC:
+		rc = lmv_unpack_user_md(exp, lsm, &lmm->lmv_user_md);
+		break;
 	default:
 		CERROR("%s: unrecognized magic %x\n", exp->exp_obd->obd_name,
 		       le32_to_cpu(lmm->lmv_magic));
@@ -3190,6 +3209,10 @@ static int lmv_free_lustre_md(struct obd_export *exp, struct lustre_md *md)
 	struct lmv_obd *lmv = &obd->u.lmv;
 	struct lmv_tgt_desc *tgt = lmv->tgts[0];
 
+	if (md->default_lmv) {
+		lmv_free_memmd(md->default_lmv);
+		md->default_lmv = NULL;
+	}
 	if (md->lmv) {
 		lmv_free_memmd(md->lmv);
 		md->lmv = NULL;
diff --git a/fs/lustre/mdc/mdc_locks.c b/fs/lustre/mdc/mdc_locks.c
index f6273ef..cf6bc9d 100644
--- a/fs/lustre/mdc/mdc_locks.c
+++ b/fs/lustre/mdc/mdc_locks.c
@@ -504,13 +504,13 @@ static int mdc_save_lovea(struct ptlrpc_request *req,
 {
 	struct ptlrpc_request *req;
 	struct obd_device *obddev = class_exp2obd(exp);
-	u64 valid = OBD_MD_FLGETATTR | OBD_MD_FLEASIZE |
-		    OBD_MD_FLMODEASIZE | OBD_MD_FLDIREA |
-		    OBD_MD_MEA | OBD_MD_FLACL;
+	u64 valid = OBD_MD_FLGETATTR | OBD_MD_FLEASIZE | OBD_MD_FLMODEASIZE |
+		    OBD_MD_FLDIREA | OBD_MD_MEA | OBD_MD_FLACL |
+		    OBD_MD_DEFAULT_MEA;
 	struct ldlm_intent *lit;
-	int rc;
 	u32 easize;
 	bool have_secctx = false;
+	int rc;
 
 	req = ptlrpc_request_alloc(class_exp2cliimp(exp),
 				   &RQF_LDLM_INTENT_GETATTR);
@@ -549,6 +549,8 @@ static int mdc_save_lovea(struct ptlrpc_request *req,
 
 	req_capsule_set_size(&req->rq_pill, &RMF_MDT_MD, RCL_SERVER, easize);
 	req_capsule_set_size(&req->rq_pill, &RMF_ACL, RCL_SERVER, acl_bufsize);
+	req_capsule_set_size(&req->rq_pill, &RMF_DEFAULT_MDT_MD, RCL_SERVER,
+			     sizeof(struct lmv_user_md));
 
 	if (have_secctx) {
 		char *secctx_name;
diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index 57da3c3..c834891 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -594,13 +594,13 @@ static int mdc_get_lustre_md(struct obd_export *exp,
 			goto out;
 		}
 
-		lmv_size = md->body->mbo_eadatasize;
-		if (!lmv_size) {
-			CDEBUG(D_INFO,
-			       "OBD_MD_FLDIREA is set, but eadatasize 0\n");
-			return -EPROTO;
-		}
 		if (md->body->mbo_valid & OBD_MD_MEA) {
+			lmv_size = md->body->mbo_eadatasize;
+			if (!lmv_size) {
+				CDEBUG(D_INFO,
+				       "OBD_MD_FLDIREA is set, but eadatasize 0\n");
+				return -EPROTO;
+			}
 			lmv = req_capsule_server_sized_get(pill, &RMF_MDT_MD,
 							   lmv_size);
 			if (!lmv) {
@@ -612,7 +612,7 @@ static int mdc_get_lustre_md(struct obd_export *exp,
 			if (rc < 0)
 				goto out;
 
-			if (rc < (typeof(rc))sizeof(*md->lmv)) {
+			if (rc < (int)sizeof(*md->lmv)) {
 				struct lmv_foreign_md *lfm = md->lfm;
 
 				/* short (< sizeof(struct lmv_stripe_md))
@@ -620,13 +620,38 @@ static int mdc_get_lustre_md(struct obd_export *exp,
 				 */
 				if (lfm->lfm_magic != LMV_MAGIC_FOREIGN) {
 					CDEBUG(D_INFO,
-					       "size too small: rc < sizeof(*md->lmv) (%d < %d)\n",
+					       "lmv size too small: %d < %d\n",
 					       rc, (int)sizeof(*md->lmv));
 					rc = -EPROTO;
 					goto out;
 				}
 			}
 		}
+
+		/* since 2.12.58 intent_getattr fetches default LMV */
+		if (md->body->mbo_valid & OBD_MD_DEFAULT_MEA) {
+			lmv_size = sizeof(struct lmv_user_md);
+			lmv = req_capsule_server_sized_get(pill,
+							   &RMF_DEFAULT_MDT_MD,
+							   lmv_size);
+			if (!lmv) {
+				rc = -EPROTO;
+				goto out;
+			}
+
+			rc = md_unpackmd(md_exp, &md->default_lmv, lmv,
+					 lmv_size);
+			if (rc < 0)
+				goto out;
+
+			if (rc < (int)sizeof(*md->default_lmv)) {
+				CDEBUG(D_INFO,
+				       "default lmv size too small: %d < %d\n",
+				       rc, (int)sizeof(*md->default_lmv));
+				rc = -EPROTO;
+				goto out;
+			}
+		}
 	}
 	rc = 0;
 
diff --git a/fs/lustre/ptlrpc/layout.c b/fs/lustre/ptlrpc/layout.c
index 9a676ae..c10b593 100644
--- a/fs/lustre/ptlrpc/layout.c
+++ b/fs/lustre/ptlrpc/layout.c
@@ -446,7 +446,8 @@
 	&RMF_MDT_MD,
 	&RMF_ACL,
 	&RMF_CAPA1,
-	&RMF_FILE_SECCTX
+	&RMF_FILE_SECCTX,
+	&RMF_DEFAULT_MDT_MD
 };
 
 static const struct req_msg_field *ldlm_intent_create_client[] = {
@@ -1016,6 +1017,11 @@ struct req_msg_field RMF_MDT_MD =
 	DEFINE_MSGF("mdt_md", RMF_F_NO_SIZE_CHECK, MIN_MD_SIZE, NULL, NULL);
 EXPORT_SYMBOL(RMF_MDT_MD);
 
+struct req_msg_field RMF_DEFAULT_MDT_MD =
+	DEFINE_MSGF("default_mdt_md", RMF_F_NO_SIZE_CHECK, MIN_MD_SIZE, NULL,
+		    NULL);
+EXPORT_SYMBOL(RMF_DEFAULT_MDT_MD);
+
 struct req_msg_field RMF_REC_REINT =
 	DEFINE_MSGF("rec_reint", 0, sizeof(struct mdt_rec_reint),
 		    lustre_swab_mdt_rec_reint, NULL);
-- 
1.8.3.1


* [lustre-devel] [PATCH 313/622] lustre: mdc: add async statfs
@ 2020-02-27 21:13 ` James Simmons
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

Add an obd_statfs_async() interface for MDC; the statfs request
is sent by ptlrpcd.

This statfs result is per-MDT, unlike the current cached statfs,
which is aggregated across all MDTs.

The maximum age of a statfs result is controlled by
lmv_desc.ld_qos_maxage.

It deactivates the MDC on failure and activates it on success.
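
The ld_qos_maxage gating that lmv_statfs_check_update() performs in the
patch below can be modeled in a few lines. This is a stand-alone sketch
under stated assumptions, not the kernel code: the struct, function name,
and the explicit "now" clock parameter are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative model (not the kernel implementation): a per-target
 * statfs snapshot is only re-fetched asynchronously once it is older
 * than qos_maxage seconds; fresher snapshots are served from cache.
 */
struct tgt_statfs_cache {
	int64_t last_age;	/* seconds timestamp of last refresh */
	int	refreshes;	/* async statfs RPCs issued so far */
};

static bool statfs_check_update(struct tgt_statfs_cache *c,
				int64_t now, int64_t qos_maxage)
{
	if (now - c->last_age < qos_maxage)
		return false;		/* cache still fresh, skip the RPC */

	/* model the async reply landing immediately */
	c->last_age = now;
	c->refreshes++;
	return true;
}
```

With qos_maxage = 60, repeated calls within a minute of a refresh hit
the cache; only calls past the age limit issue another async statfs.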

WC-bug-id: https://jira.whamcloud.com/browse/LU-11213
Lustre-commit: 7f412954ad38 ("LU-11213 mdc: add async statfs")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34359
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd.h       |  4 ++++
 fs/lustre/include/obd_class.h | 18 +++-------------
 fs/lustre/lmv/lmv_internal.h  |  2 ++
 fs/lustre/lmv/lmv_obd.c       | 44 +++++++++++++++++++++++++++++++++++++++
 fs/lustre/mdc/mdc_request.c   | 48 +++++++++++++++++++++++++++++++++++++++++++
 fs/lustre/osc/osc_request.c   | 16 +++++++++++++++
 6 files changed, 117 insertions(+), 15 deletions(-)

diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index fb77df7..e815584 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -86,6 +86,8 @@ static inline void loi_kms_set(struct lov_oinfo *oinfo, u64 kms)
 struct obd_info {
 	/* OBD_STATFS_* flags */
 	u64			oi_flags;
+	struct obd_device      *oi_obd;
+	struct lmv_tgt_desc    *oi_tgt;
 	/* lsm data specific for every OSC. */
 	struct lov_stripe_md   *oi_md;
 	/* statfs data specific for every OSC, if needed at all. */
@@ -435,6 +437,8 @@ struct lmv_tgt_desc {
 	struct obd_export      *ltd_exp;
 	u32			ltd_idx;
 	struct mutex		ltd_fid_mutex;
+	struct obd_statfs	ltd_statfs;
+	time64_t		ltd_statfs_age;
 	unsigned long		ltd_active:1; /* target up for requests */
 };
 
diff --git a/fs/lustre/include/obd_class.h b/fs/lustre/include/obd_class.h
index a890d00..58c743c 100644
--- a/fs/lustre/include/obd_class.h
+++ b/fs/lustre/include/obd_class.h
@@ -912,21 +912,9 @@ static inline int obd_statfs_async(struct obd_export *exp,
 
 	CDEBUG(D_SUPER, "%s: age %lld, max_age %lld\n",
 	       obd->obd_name, obd->obd_osfs_age, max_age);
-	if (obd->obd_osfs_age < max_age) {
-		rc = OBP(obd, statfs_async)(exp, oinfo, max_age, rqset);
-	} else {
-		CDEBUG(D_SUPER,
-		       "%s: use %p cache blocks %llu/%llu objects %llu/%llu\n",
-		       obd->obd_name, &obd->obd_osfs,
-		       obd->obd_osfs.os_bavail, obd->obd_osfs.os_blocks,
-		       obd->obd_osfs.os_ffree, obd->obd_osfs.os_files);
-		spin_lock(&obd->obd_osfs_lock);
-		memcpy(oinfo->oi_osfs, &obd->obd_osfs, sizeof(*oinfo->oi_osfs));
-		spin_unlock(&obd->obd_osfs_lock);
-		oinfo->oi_flags |= OBD_STATFS_FROM_CACHE;
-		if (oinfo->oi_cb_up)
-			oinfo->oi_cb_up(oinfo, 0);
-	}
+
+	rc = OBP(obd, statfs_async)(exp, oinfo, max_age, rqset);
+
 	return rc;
 }
 
diff --git a/fs/lustre/lmv/lmv_internal.h b/fs/lustre/lmv/lmv_internal.h
index e434919..b4c5297 100644
--- a/fs/lustre/lmv/lmv_internal.h
+++ b/fs/lustre/lmv/lmv_internal.h
@@ -61,6 +61,8 @@ int lmv_revalidate_slaves(struct obd_export *exp,
 int lmv_getattr_name(struct obd_export *exp, struct md_op_data *op_data,
 		     struct ptlrpc_request **preq);
 
+int lmv_statfs_check_update(struct obd_device *obd, struct lmv_tgt_desc *tgt);
+
 static inline struct obd_device *lmv2obd_dev(struct lmv_obd *lmv)
 {
 	return container_of_safe(lmv, struct obd_device, u.lmv);
diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index 48cd41a..4365533 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -349,6 +349,8 @@ static int lmv_connect_mdc(struct obd_device *obd, struct lmv_tgt_desc *tgt)
 	       mdc_obd->obd_name, mdc_obd->obd_uuid.uuid,
 	       atomic_read(&obd->obd_refcount));
 
+	lmv_statfs_check_update(obd, tgt);
+
 	if (lmv->lmv_tgts_kobj)
 		/* Even if we failed to create the link, that's fine */
 		rc = sysfs_create_link(lmv->lmv_tgts_kobj,
@@ -1276,6 +1278,7 @@ static int lmv_setup(struct obd_device *obd, struct lustre_cfg *lcfg)
 	obd_str2uuid(&lmv->desc.ld_uuid, desc->ld_uuid.uuid);
 	lmv->desc.ld_tgt_count = 0;
 	lmv->desc.ld_active_tgt_count = 0;
+	lmv->desc.ld_qos_maxage = 60;
 	lmv->max_def_easize = 0;
 	lmv->max_easize = 0;
 
@@ -1445,6 +1448,47 @@ static int lmv_statfs(const struct lu_env *env, struct obd_export *exp,
 	return rc;
 }
 
+static int lmv_statfs_update(void *cookie, int rc)
+{
+	struct obd_info *oinfo = cookie;
+	struct obd_device *obd = oinfo->oi_obd;
+	struct lmv_obd *lmv = &obd->u.lmv;
+	struct lmv_tgt_desc *tgt = oinfo->oi_tgt;
+	struct obd_statfs *osfs = oinfo->oi_osfs;
+
+	/*
+	 * NB: don't deactivate TGT upon error, because we may not trigger async
+	 * statfs any longer, then there is no chance to activate TGT.
+	 */
+	if (!rc) {
+		spin_lock(&lmv->lmv_lock);
+		tgt->ltd_statfs = *osfs;
+		tgt->ltd_statfs_age = ktime_get_seconds();
+		spin_unlock(&lmv->lmv_lock);
+	}
+
+	return rc;
+}
+
+/* update tgt statfs async if it's ld_qos_maxage old */
+int lmv_statfs_check_update(struct obd_device *obd, struct lmv_tgt_desc *tgt)
+{
+	struct obd_info oinfo = {
+		.oi_obd	= obd,
+		.oi_tgt = tgt,
+		.oi_cb_up = lmv_statfs_update,
+	};
+	int rc;
+
+	if (ktime_get_seconds() - tgt->ltd_statfs_age <
+	    obd->u.lmv.desc.ld_qos_maxage)
+		return 0;
+
+	rc = obd_statfs_async(tgt->ltd_exp, &oinfo, 0, NULL);
+
+	return rc;
+}
+
 static int lmv_get_root(struct obd_export *exp, const char *fileset,
 			struct lu_fid *fid)
 {
diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index c834891..a26efa1 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -1570,6 +1570,53 @@ static int mdc_read_page(struct obd_export *exp, struct md_op_data *op_data,
 	goto out_unlock;
 }
 
+static int mdc_statfs_interpret(const struct lu_env *env,
+				struct ptlrpc_request *req, void *args, int rc)
+{
+	struct obd_info *oinfo = args;
+	struct obd_statfs *osfs;
+
+	if (!rc) {
+		osfs = req_capsule_server_get(&req->rq_pill, &RMF_OBD_STATFS);
+		if (!osfs)
+			return -EPROTO;
+
+		oinfo->oi_osfs = osfs;
+
+		CDEBUG(D_CACHE,
+		       "blocks=%llu free=%llu avail=%llu objects=%llu free=%llu state=%x\n",
+			osfs->os_blocks, osfs->os_bfree, osfs->os_bavail,
+			osfs->os_files, osfs->os_ffree, osfs->os_state);
+	}
+
+	oinfo->oi_cb_up(oinfo, rc);
+
+	return rc;
+}
+
+static int mdc_statfs_async(struct obd_export *exp,
+			    struct obd_info *oinfo, time64_t max_age,
+			    struct ptlrpc_request_set *unused)
+{
+	struct ptlrpc_request *req;
+	struct obd_info *aa;
+
+	req = ptlrpc_request_alloc_pack(class_exp2cliimp(exp), &RQF_MDS_STATFS,
+					LUSTRE_MDS_VERSION, MDS_STATFS);
+	if (!req)
+		return -ENOMEM;
+
+	ptlrpc_request_set_replen(req);
+	req->rq_interpret_reply = mdc_statfs_interpret;
+
+	aa = ptlrpc_req_async_args(aa, req);
+	*aa = *oinfo;
+
+	ptlrpcd_add_req(req);
+
+	return 0;
+}
+
 static int mdc_statfs(const struct lu_env *env,
 		      struct obd_export *exp, struct obd_statfs *osfs,
 		      time64_t max_age, u32 flags)
@@ -2802,6 +2849,7 @@ static int mdc_cleanup(struct obd_device *obd)
 	.iocontrol		= mdc_iocontrol,
 	.set_info_async		= mdc_set_info_async,
 	.statfs			= mdc_statfs,
+	.statfs_async		= mdc_statfs_async,
 	.fid_init		= client_fid_init,
 	.fid_fini		= client_fid_fini,
 	.fid_alloc		= mdc_fid_alloc,
diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index a988cbf..f929908 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -2736,6 +2736,22 @@ static int osc_statfs_async(struct obd_export *exp,
 	struct osc_async_args *aa;
 	int rc;
 
+	if (obd->obd_osfs_age >= max_age) {
+		CDEBUG(D_SUPER,
+		       "%s: use %p cache blocks %llu/%llu objects %llu/%llu\n",
+		       obd->obd_name, &obd->obd_osfs,
+		       obd->obd_osfs.os_bavail, obd->obd_osfs.os_blocks,
+		       obd->obd_osfs.os_ffree, obd->obd_osfs.os_files);
+		spin_lock(&obd->obd_osfs_lock);
+		memcpy(oinfo->oi_osfs, &obd->obd_osfs, sizeof(*oinfo->oi_osfs));
+		spin_unlock(&obd->obd_osfs_lock);
+		oinfo->oi_flags |= OBD_STATFS_FROM_CACHE;
+		if (oinfo->oi_cb_up)
+			oinfo->oi_cb_up(oinfo, 0);
+
+		return 0;
+	}
+
 	/* We could possibly pass max_age in the request (as an absolute
 	 * timestamp or a "seconds.usec ago") so the target can avoid doing
 	 * extra calls into the filesystem if that isn't necessary (e.g.
-- 
1.8.3.1


* [lustre-devel] [PATCH 314/622] lustre: lmv: mkdir with balanced space usage
@ 2020-02-27 21:13 ` James Simmons
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

If a plain directory's default LMV hash type is "space", create
subdirs on all MDTs with balanced space usage:
* on mkdir, the client allocates the FID on the MDT with balanced
  space usage (the space QoS code is in the next patch).
* the MDT allows mkdir on an MDT different from its parent's if the
  default LMV has the "space" hash type; this is normally rejected
  because mkdir shouldn't create a remote directory.
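
The first step of the space-balanced placement in this patch is plain
round-robin (the real space QoS code lands in the next patch). Below is
a minimal stand-alone model of the index arithmetic in
lmv_locate_tgt_qos(); tgt_count stands in for lmv->tgts_size and is an
illustrative parameter, not the kernel API.

```c
#include <assert.h>

/*
 * Sketch of the round-robin step in lmv_locate_tgt_qos(): a static
 * counter cycles through the configured MDT indices, so successive
 * "space"-hashed mkdirs land on successive MDTs.
 */
static unsigned int rr_index;

static unsigned int locate_tgt_rr(unsigned int tgt_count)
{
	unsigned int mdt = rr_index % tgt_count;

	rr_index++;		/* next mkdir picks the next MDT */
	return mdt;
}
```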

WC-bug-id: https://jira.whamcloud.com/browse/LU-11213
Lustre-commit: 6d296587441d ("LU-11213 lmv: mkdir with balanced space usage")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34360
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_lmv.h   |  51 +++++--
 fs/lustre/llite/dir.c            |   5 +-
 fs/lustre/llite/file.c           |  10 +-
 fs/lustre/llite/llite_internal.h |   7 +
 fs/lustre/llite/llite_lib.c      |  25 ++--
 fs/lustre/llite/namei.c          |   8 +-
 fs/lustre/lmv/lmv_intent.c       |  21 ++-
 fs/lustre/lmv/lmv_internal.h     |  30 +---
 fs/lustre/lmv/lmv_obd.c          | 299 +++++++++++++++++++--------------------
 9 files changed, 229 insertions(+), 227 deletions(-)

diff --git a/fs/lustre/include/lustre_lmv.h b/fs/lustre/include/lustre_lmv.h
index c88e4b5..bb1efb4 100644
--- a/fs/lustre/include/lustre_lmv.h
+++ b/fs/lustre/include/lustre_lmv.h
@@ -55,6 +55,47 @@ struct lmv_stripe_md {
 	struct lmv_oinfo lsm_md_oinfo[0];
 };
 
+/* NB: LMV_HASH_TYPE_SPACE is set in default LMV only */
+static inline bool lmv_is_known_hash_type(u32 type)
+{
+	return (type & LMV_HASH_TYPE_MASK) == LMV_HASH_TYPE_FNV_1A_64 ||
+	       (type & LMV_HASH_TYPE_MASK) == LMV_HASH_TYPE_ALL_CHARS;
+}
+
+static inline bool lmv_dir_striped(const struct lmv_stripe_md *lsm)
+{
+	return lsm && lsm->lsm_md_magic == LMV_MAGIC;
+}
+
+static inline bool lmv_dir_foreign(const struct lmv_stripe_md *lsm)
+{
+	return lsm && lsm->lsm_md_magic == LMV_MAGIC_FOREIGN;
+}
+
+static inline bool lmv_dir_migrating(const struct lmv_stripe_md *lsm)
+{
+	return lmv_dir_striped(lsm) &&
+	       lsm->lsm_md_hash_type & LMV_HASH_FLAG_MIGRATION;
+}
+
+static inline bool lmv_dir_bad_hash(const struct lmv_stripe_md *lsm)
+{
+	if (!lmv_dir_striped(lsm))
+		return false;
+
+	if (lmv_dir_migrating(lsm) &&
+	    lsm->lsm_md_stripe_count - lsm->lsm_md_migrate_offset <= 1)
+		return false;
+
+	return !lmv_is_known_hash_type(lsm->lsm_md_hash_type);
+}
+
+/* NB, this is checking directory default LMV */
+static inline bool lmv_dir_space_hashed(const struct lmv_stripe_md *lsm)
+{
+	return lsm && lsm->lsm_md_hash_type == LMV_HASH_TYPE_SPACE;
+}
+
 static inline bool
 lsm_md_eq(const struct lmv_stripe_md *lsm1, const struct lmv_stripe_md *lsm2)
 {
@@ -72,7 +113,7 @@ struct lmv_stripe_md {
 	    strcmp(lsm1->lsm_md_pool_name, lsm2->lsm_md_pool_name) != 0)
 		return false;
 
-	if (lsm1->lsm_md_magic == LMV_MAGIC_V1) {
+	if (lmv_dir_striped(lsm1)) {
 		for (idx = 0; idx < lsm1->lsm_md_stripe_count; idx++) {
 			if (!lu_fid_eq(&lsm1->lsm_md_oinfo[idx].lmo_fid,
 				       &lsm2->lsm_md_oinfo[idx].lmo_fid))
@@ -94,7 +135,7 @@ static inline void lsm_md_dump(int mask, const struct lmv_stripe_md *lsm)
 	       lsm->lsm_md_layout_version, lsm->lsm_md_migrate_offset,
 	       lsm->lsm_md_migrate_hash, lsm->lsm_md_pool_name);
 
-	if (lsm->lsm_md_magic != LMV_MAGIC_V1)
+	if (!lmv_dir_striped(lsm))
 		return;
 
 	for (i = 0; i < lsm->lsm_md_stripe_count; i++)
@@ -188,12 +229,6 @@ static inline int lmv_name_to_stripe_index(u32 lmv_hash_type,
 	return idx;
 }
 
-static inline bool lmv_is_known_hash_type(u32 type)
-{
-	return (type & LMV_HASH_TYPE_MASK) == LMV_HASH_TYPE_FNV_1A_64 ||
-	       (type & LMV_HASH_TYPE_MASK) == LMV_HASH_TYPE_ALL_CHARS;
-}
-
 static inline bool lmv_magic_supported(u32 lum_magic)
 {
 	return lum_magic == LMV_USER_MAGIC ||
diff --git a/fs/lustre/llite/dir.c b/fs/lustre/llite/dir.c
index f75183b..a1dce52 100644
--- a/fs/lustre/llite/dir.c
+++ b/fs/lustre/llite/dir.c
@@ -160,8 +160,7 @@ void ll_release_page(struct inode *inode, struct page *page, bool remove)
 	 * Always remove the page for striped dir, because the page is
 	 * built from temporarily in LMV layer
 	 */
-	if (inode && S_ISDIR(inode->i_mode) &&
-	    ll_i2info(inode)->lli_lsm_md) {
+	if (inode && ll_dir_striped(inode)) {
 		__free_page(page);
 		return;
 	}
@@ -314,7 +313,7 @@ static int ll_readdir(struct file *filp, struct dir_context *ctx)
 		goto out;
 	}
 
-	if (unlikely(ll_i2info(inode)->lli_lsm_md)) {
+	if (unlikely(ll_dir_striped(inode))) {
 		/*
 		 * This is only needed for striped dir to fill ..,
 		 * see lmv_read_page
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 191b0f9..50220eb 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -3987,7 +3987,7 @@ int ll_migrate(struct inode *parent, struct file *file, struct lmv_user_md *lum,
 	if (!(exp_connect_flags2(ll_i2sbi(parent)->ll_md_exp) &
 	      OBD_CONNECT2_DIR_MIGRATE)) {
 		if (le32_to_cpu(lum->lum_stripe_count) > 1 ||
-		    ll_i2info(child_inode)->lli_lsm_md) {
+		    ll_dir_striped(child_inode)) {
 			CERROR("%s: MDT doesn't support stripe directory migration!\n",
 			       ll_i2sbi(parent)->ll_fsname);
 			rc = -EOPNOTSUPP;
@@ -4179,7 +4179,7 @@ static int ll_inode_revalidate_fini(struct inode *inode, int rc)
 		 * Let's revalidate the dentry again, instead of returning
 		 * error
 		 */
-		if (S_ISDIR(inode->i_mode) && ll_i2info(inode)->lli_lsm_md)
+		if (ll_dir_striped(inode))
 			return 0;
 
 		/* This path cannot be hit for regular files unless in
@@ -4256,8 +4256,7 @@ static int ll_merge_md_attr(struct inode *inode)
 
 	LASSERT(lli->lli_lsm_md);
 
-	/* foreign dir is not striped dir */
-	if (lli->lli_lsm_md->lsm_md_magic == LMV_MAGIC_FOREIGN)
+	if (!lmv_dir_striped(lli->lli_lsm_md))
 		return 0;
 
 	down_read(&lli->lli_lsm_sem);
@@ -4307,8 +4306,7 @@ int ll_getattr(const struct path *path, struct kstat *stat,
 		}
 	} else {
 		/* If object isn't regular a file then don't validate size. */
-		if (S_ISDIR(inode->i_mode) &&
-		    lli->lli_lsm_md != NULL) {
+		if (ll_dir_striped(inode)) {
 			rc = ll_merge_md_attr(inode);
 			if (rc < 0)
 				return rc;
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index 687d504..9e413c2 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -1071,6 +1071,13 @@ static inline struct lu_fid *ll_inode2fid(struct inode *inode)
 	return fid;
 }
 
+static inline bool ll_dir_striped(struct inode *inode)
+{
+	LASSERT(inode);
+	return S_ISDIR(inode->i_mode) &&
+	       lmv_dir_striped(ll_i2info(inode)->lli_lsm_md);
+}
+
 static inline loff_t ll_file_maxbytes(struct inode *inode)
 {
 	struct cl_object *obj = ll_i2info(inode)->lli_clob;
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index bd17ba1..0633cc5 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -1282,6 +1282,9 @@ static int ll_init_lsm_md(struct inode *inode, struct lustre_md *md)
 	       ll_i2sbi(inode)->ll_fsname, PFID(&lli->lli_fid));
 	lsm_md_dump(D_INODE, lsm);
 
+	if (!lmv_dir_striped(lsm))
+		goto out;
+
 	/*
 	 * XXX sigh, this lsm_root initialization should be in
 	 * LMV layer, but it needs ll_iget right now, so we
@@ -1312,7 +1315,7 @@ static int ll_init_lsm_md(struct inode *inode, struct lustre_md *md)
 			return rc;
 		}
 	}
-
+out:
 	lli->lli_lsm_md = lsm;
 
 	return 0;
@@ -1394,10 +1397,9 @@ static int ll_update_lsm_md(struct inode *inode, struct lustre_md *md)
 	 *
 	 * foreign LMV should not change.
 	 */
-	if (lli->lli_lsm_md &&
-	    lli->lli_lsm_md->lsm_md_magic != LMV_MAGIC_FOREIGN &&
-	   !lsm_md_eq(lli->lli_lsm_md, lsm)) {
-		if (lsm->lsm_md_layout_version <=
+	if (lli->lli_lsm_md && !lsm_md_eq(lli->lli_lsm_md, lsm)) {
+		if (lmv_dir_striped(lli->lli_lsm_md) &&
+		    lsm->lsm_md_layout_version <=
 		    lli->lli_lsm_md->lsm_md_layout_version) {
 			CERROR("%s: " DFID " dir layout mismatch:\n",
 			       ll_i2sbi(inode)->ll_fsname,
@@ -1418,16 +1420,6 @@ static int ll_update_lsm_md(struct inode *inode, struct lustre_md *md)
 	if (!lli->lli_lsm_md) {
 		struct cl_attr *attr;
 
-		if (lsm->lsm_md_magic == LMV_MAGIC_FOREIGN) {
-			/* set md->lmv to NULL, so the following free lustre_md
-			 * will not free this lsm
-			 */
-			md->lmv = NULL;
-			lli->lli_lsm_md = lsm;
-			up_write(&lli->lli_lsm_sem);
-			return 0;
-		}
-
 		rc = ll_init_lsm_md(inode, md);
 		up_write(&lli->lli_lsm_sem);
 		if (rc)
@@ -1445,6 +1437,9 @@ static int ll_update_lsm_md(struct inode *inode, struct lustre_md *md)
 		 */
 		down_read(&lli->lli_lsm_sem);
 
+		if (!lmv_dir_striped(lli->lli_lsm_md))
+			goto unlock;
+
 		attr = kzalloc(sizeof(*attr), GFP_NOFS);
 		if (!attr) {
 			rc = -ENOMEM;
diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index 1aaf184..fb5caaf 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -221,6 +221,7 @@ int ll_dom_lock_cancel(struct inode *inode, struct ldlm_lock *lock)
 void ll_lock_cancel_bits(struct ldlm_lock *lock, u64 to_cancel)
 {
 	struct inode *inode = ll_inode_from_resource_lock(lock);
+	struct ll_inode_info *lli;
 	u64 bits = to_cancel;
 	int rc;
 
@@ -308,13 +309,12 @@ void ll_lock_cancel_bits(struct ldlm_lock *lock, u64 to_cancel)
 			       PFID(ll_inode2fid(inode)), rc);
 	}
 
+	lli = ll_i2info(inode);
 	if (bits & MDS_INODELOCK_UPDATE)
 		set_bit(LLIF_UPDATE_ATIME,
-			&ll_i2info(inode)->lli_flags);
+			&lli->lli_flags);
 
 	if ((bits & MDS_INODELOCK_UPDATE) && S_ISDIR(inode->i_mode)) {
-		struct ll_inode_info *lli = ll_i2info(inode);
-
 		CDEBUG(D_INODE,
 		       "invalidating inode "DFID" lli = %p, pfid  = "DFID"\n",
 		       PFID(ll_inode2fid(inode)),
@@ -688,7 +688,7 @@ static int ll_lookup_it_finish(struct ptlrpc_request *request,
 		struct lu_fid fid = ll_i2info(parent)->lli_fid;
 
 		/* If it is striped directory, get the real stripe parent */
-		if (unlikely(ll_i2info(parent)->lli_lsm_md)) {
+		if (unlikely(ll_dir_striped(parent))) {
 			rc = md_get_fid_from_lsm(ll_i2mdexp(parent),
 						 ll_i2info(parent)->lli_lsm_md,
 						 (*de)->d_name.name,
diff --git a/fs/lustre/lmv/lmv_intent.c b/fs/lustre/lmv/lmv_intent.c
index ba14e7c..6017375 100644
--- a/fs/lustre/lmv/lmv_intent.c
+++ b/fs/lustre/lmv/lmv_intent.c
@@ -293,16 +293,15 @@ static int lmv_intent_open(struct obd_export *exp, struct md_op_data *op_data,
 	int rc;
 
 	/* do not allow file creation in foreign dir */
-	if ((it->it_op & IT_CREAT) && op_data->op_mea1 &&
-	    op_data->op_mea1->lsm_md_magic == LMV_MAGIC_FOREIGN)
+	if ((it->it_op & IT_CREAT) && lmv_dir_foreign(op_data->op_mea1))
 		return -ENODATA;
 
 	if ((it->it_op & IT_CREAT) && !(flags & MDS_OPEN_BY_FID)) {
 		/* don't allow create under dir with bad hash */
-		if (lmv_is_dir_bad_hash(op_data->op_mea1))
+		if (lmv_dir_bad_hash(op_data->op_mea1))
 			return -EBADF;
 
-		if (lmv_is_dir_migrating(op_data->op_mea1)) {
+		if (lmv_dir_migrating(op_data->op_mea1)) {
 			if (flags & O_EXCL) {
 				/*
 				 * open(O_CREAT | O_EXCL) needs to check
@@ -311,8 +310,7 @@ static int lmv_intent_open(struct obd_export *exp, struct md_op_data *op_data,
 				 * file under old layout, check old layout on
 				 * client side.
 				 */
-				tgt = lmv_locate_tgt(lmv, op_data,
-						     &op_data->op_fid1);
+				tgt = lmv_locate_tgt(lmv, op_data);
 				if (IS_ERR(tgt))
 					return PTR_ERR(tgt);
 
@@ -348,7 +346,7 @@ static int lmv_intent_open(struct obd_export *exp, struct md_op_data *op_data,
 		 * without name, but we can set it to child fid, and MDT
 		 * will obtain it from linkea in open in such case.
 		 */
-		if (op_data->op_mea1)
+		if (lmv_dir_striped(op_data->op_mea1))
 			op_data->op_fid1 = op_data->op_fid2;
 
 		tgt = lmv_find_target(lmv, &op_data->op_fid2);
@@ -361,7 +359,7 @@ static int lmv_intent_open(struct obd_export *exp, struct md_op_data *op_data,
 		LASSERT(fid_is_zero(&op_data->op_fid2));
 		LASSERT(op_data->op_name);
 
-		tgt = lmv_locate_tgt(lmv, op_data, &op_data->op_fid1);
+		tgt = lmv_locate_tgt(lmv, op_data);
 		if (IS_ERR(tgt))
 			return PTR_ERR(tgt);
 	}
@@ -448,8 +446,7 @@ static int lmv_intent_lookup(struct obd_export *exp,
 	int rc;
 
 	/* foreign dir is not striped */
-	if (op_data->op_mea1 &&
-	    op_data->op_mea1->lsm_md_magic == LMV_MAGIC_FOREIGN) {
+	if (lmv_dir_foreign(op_data->op_mea1)) {
 		/* only allow getattr/lookup for itself */
 		if (op_data->op_name)
 			return -ENODATA;
@@ -457,7 +454,7 @@ static int lmv_intent_lookup(struct obd_export *exp,
 	}
 
 retry:
-	tgt = lmv_locate_tgt(lmv, op_data, &op_data->op_fid1);
+	tgt = lmv_locate_tgt(lmv, op_data);
 	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
@@ -482,7 +479,7 @@ static int lmv_intent_lookup(struct obd_export *exp,
 		 * If RPC happens, lsm information will be revalidated
 		 * during update_inode process (see ll_update_lsm_md)
 		 */
-		if (op_data->op_mea2) {
+		if (lmv_dir_striped(op_data->op_mea2)) {
 			rc = lmv_revalidate_slaves(exp, op_data->op_mea2,
 						   cb_blocking,
 						   extra_lock_flags);
diff --git a/fs/lustre/lmv/lmv_internal.h b/fs/lustre/lmv/lmv_internal.h
index b4c5297..9974ec5 100644
--- a/fs/lustre/lmv/lmv_internal.h
+++ b/fs/lustre/lmv/lmv_internal.h
@@ -137,6 +137,8 @@ static inline int lmv_stripe_md_size(int stripe_count)
 	u32 stripe_count = lsm->lsm_md_stripe_count;
 	int stripe_index;
 
+	LASSERT(lmv_dir_striped(lsm));
+
 	if (hash_type & LMV_HASH_FLAG_MIGRATION) {
 		if (post_migrate) {
 			hash_type &= ~LMV_HASH_FLAG_MIGRATION;
@@ -166,26 +168,6 @@ static inline int lmv_stripe_md_size(int stripe_count)
 	return &lsm->lsm_md_oinfo[stripe_index];
 }
 
-static inline bool lmv_is_dir_migrating(const struct lmv_stripe_md *lsm)
-{
-	return lsm ? lsm->lsm_md_hash_type & LMV_HASH_FLAG_MIGRATION : false;
-}
-
-static inline bool lmv_is_dir_bad_hash(const struct lmv_stripe_md *lsm)
-{
-	if (!lsm)
-		return false;
-
-	if (lmv_is_dir_migrating(lsm)) {
-		if (lsm->lsm_md_stripe_count - lsm->lsm_md_migrate_offset > 1)
-			return !lmv_is_known_hash_type(
-					lsm->lsm_md_migrate_hash);
-		return false;
-	}
-
-	return !lmv_is_known_hash_type(lsm->lsm_md_hash_type);
-}
-
 static inline bool lmv_dir_retry_check_update(struct md_op_data *op_data)
 {
 	const struct lmv_stripe_md *lsm = op_data->op_mea1;
@@ -193,12 +175,12 @@ static inline bool lmv_dir_retry_check_update(struct md_op_data *op_data)
 	if (!lsm)
 		return false;
 
-	if (lmv_is_dir_migrating(lsm) && !op_data->op_post_migrate) {
+	if (lmv_dir_migrating(lsm) && !op_data->op_post_migrate) {
 		op_data->op_post_migrate = true;
 		return true;
 	}
 
-	if (lmv_is_dir_bad_hash(lsm) &&
+	if (lmv_dir_bad_hash(lsm) &&
 	    op_data->op_stripe_index < lsm->lsm_md_stripe_count - 1) {
 		op_data->op_stripe_index++;
 		return true;
@@ -208,8 +190,8 @@ static inline bool lmv_dir_retry_check_update(struct md_op_data *op_data)
 }
 
 struct lmv_tgt_desc *lmv_locate_tgt(struct lmv_obd *lmv,
-				    struct md_op_data *op_data,
-				    struct lu_fid *fid);
+				    struct md_op_data *op_data);
+
 /* lproc_lmv.c */
 int lmv_tunables_init(struct obd_device *obd);
 
diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index 4365533..02dfd35 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -1149,24 +1149,24 @@ static int lmv_iocontrol(unsigned int cmd, struct obd_export *exp,
 /**
  * This is _inode_ placement policy function (not name).
  */
-static int lmv_placement_policy(struct obd_device *obd,
-				struct md_op_data *op_data, u32 *mds)
+static u32 lmv_placement_policy(struct obd_device *obd,
+				struct md_op_data *op_data)
 {
 	struct lmv_obd *lmv = &obd->u.lmv;
 	struct lmv_user_md *lum;
+	u32 mdt;
 
-	LASSERT(mds);
-
-	if (lmv->desc.ld_tgt_count == 1) {
-		*mds = 0;
+	if (lmv->desc.ld_tgt_count == 1)
 		return 0;
-	}
 
 	lum = op_data->op_data;
-	/* Choose MDS by
+	/*
+	 * Choose MDT by
 	 * 1. See if the stripe offset is specified by lum.
-	 * 2. Then check if there is default stripe offset.
-	 * 3. Finally choose MDS by name hash if the parent
+	 * 2. If parent has default LMV, and its hash type is "space", choose
+	 *    MDT with QoS. (see lmv_locate_tgt_qos()).
+	 * 3. Then check if default LMV stripe offset is not -1.
+	 * 4. Finally choose MDS by name hash if the parent
 	 *    is striped directory. (see lmv_locate_tgt()).
 	 *
 	 * presently explicit MDT location is not supported
@@ -1177,18 +1177,22 @@ static int lmv_placement_policy(struct obd_device *obd,
 	if (op_data->op_cli_flags & CLI_SET_MEA && lum &&
 	    le32_to_cpu(lum->lum_magic != LMV_MAGIC_FOREIGN) &&
 	    le32_to_cpu(lum->lum_stripe_offset) != (u32)-1) {
-		*mds = le32_to_cpu(lum->lum_stripe_offset);
+		mdt = le32_to_cpu(lum->lum_stripe_offset);
+	} else if (op_data->op_code == LUSTRE_OPC_MKDIR &&
+		   !lmv_dir_striped(op_data->op_mea1) &&
+		   lmv_dir_space_hashed(op_data->op_default_mea1)) {
+		mdt = op_data->op_mds;
 	} else if (op_data->op_code == LUSTRE_OPC_MKDIR &&
 		   op_data->op_default_mea1 &&
 		   op_data->op_default_mea1->lsm_md_master_mdt_index !=
-			 (u32)-1) {
-		*mds = op_data->op_default_mea1->lsm_md_master_mdt_index;
-		op_data->op_mds = *mds;
+			(u32)-1) {
+		mdt = op_data->op_default_mea1->lsm_md_master_mdt_index;
+		op_data->op_mds = mdt;
 	} else {
-		*mds = op_data->op_mds;
+		mdt = op_data->op_mds;
 	}
 
-	return 0;
+	return mdt;
 }
 
 int __lmv_fid_alloc(struct lmv_obd *lmv, struct lu_fid *fid, u32 mds)
@@ -1230,24 +1234,17 @@ int lmv_fid_alloc(const struct lu_env *env, struct obd_export *exp,
 {
 	struct obd_device *obd = class_exp2obd(exp);
 	struct lmv_obd *lmv = &obd->u.lmv;
-	u32 mds = 0;
+	u32 mds;
 	int rc;
 
 	LASSERT(op_data);
 	LASSERT(fid);
 
-	rc = lmv_placement_policy(obd, op_data, &mds);
-	if (rc) {
-		CERROR("Can't get target for allocating fid, rc %d\n",
-		       rc);
-		return rc;
-	}
+	mds = lmv_placement_policy(obd, op_data);
 
 	rc = __lmv_fid_alloc(lmv, fid, mds);
-	if (rc) {
+	if (rc)
 		CERROR("Can't alloc new fid, rc %d\n", rc);
-		return rc;
-	}
 
 	return rc;
 }
@@ -1588,20 +1585,30 @@ static int lmv_close(struct obd_export *exp, struct md_op_data *op_data,
 	return md_close(tgt->ltd_exp, op_data, mod, request);
 }
 
-struct lmv_tgt_desc*
-__lmv_locate_tgt(struct lmv_obd *lmv, struct lmv_stripe_md *lsm,
-		 const char *name, int namelen, struct lu_fid *fid, u32 *mds,
-		 bool post_migrate)
+static struct lmv_tgt_desc *lmv_locate_tgt_qos(struct lmv_obd *lmv, u32 *mdt)
+{
+	static unsigned int rr_index;
+
+	/* locate MDT round-robin is the first step */
+	*mdt = rr_index % lmv->tgts_size;
+	rr_index++;
+
+	return lmv->tgts[*mdt];
+}
+
+static struct lmv_tgt_desc *
+lmv_locate_tgt_by_name(struct lmv_obd *lmv, struct lmv_stripe_md *lsm,
+		       const char *name, int namelen, struct lu_fid *fid,
+		       u32 *mds, bool post_migrate)
 {
 	const struct lmv_oinfo *oinfo;
 	struct lmv_tgt_desc *tgt;
 
-	if (!lsm || namelen == 0) {
+	if (!lmv_dir_striped(lsm) || !namelen) {
 		tgt = lmv_find_target(lmv, fid);
 		if (IS_ERR(tgt))
 			return tgt;
 
-		LASSERT(mds);
 		*mds = tgt->ltd_idx;
 		return tgt;
 	}
@@ -1617,47 +1624,41 @@ struct lmv_tgt_desc*
 			return ERR_CAST(oinfo);
 	}
 
-	if (fid)
-		*fid = oinfo->lmo_fid;
-	if (mds)
-		*mds = oinfo->lmo_mds;
-
+	*fid = oinfo->lmo_fid;
+	*mds = oinfo->lmo_mds;
 	tgt = lmv_get_target(lmv, oinfo->lmo_mds, NULL);
 
-	CDEBUG(D_INFO, "locate on mds %u " DFID "\n", oinfo->lmo_mds,
-	       PFID(&oinfo->lmo_fid));
+	CDEBUG(D_INODE, "locate MDT %u parent " DFID "\n", *mds, PFID(fid));
 
 	return tgt;
 }
 
 /**
- * Locate mdt by fid or name
+ * Locate MDT of op_data->op_fid1
  *
  * For striped directory, it will locate the stripe by name hash, if hash_type
  * is unknown, it will return the stripe specified by 'op_data->op_stripe_index'
  * which is set outside, and if dir is migrating, 'op_data->op_post_migrate'
  * indicates whether old or new layout is used to locate.
  *
- * For normal direcotry, it will locate MDS by FID directly.
+ * For plain directory, normally it will locate MDT by FID, but if this
+ * directory has default LMV, and its hash type is "space", locate MDT with QoS.
  *
  * @lmv:	LMV device
  * @op_data:	client MD stack parameters, name, namelen
  *		mds_num etc.
- * @fid:	object FID used to locate MDS.
  *
  * Returns:	pointer to the lmv_tgt_desc if succeed.
  *		ERR_PTR(errno) if failed.
  */
-struct lmv_tgt_desc*
-lmv_locate_tgt(struct lmv_obd *lmv, struct md_op_data *op_data,
-	       struct lu_fid *fid)
+struct lmv_tgt_desc *
+lmv_locate_tgt(struct lmv_obd *lmv, struct md_op_data *op_data)
 {
 	struct lmv_stripe_md *lsm = op_data->op_mea1;
 	struct lmv_oinfo *oinfo;
 	struct lmv_tgt_desc *tgt;
 
-	/* foreign dir is not striped dir */
-	if (lsm && lsm->lsm_md_magic == LMV_MAGIC_FOREIGN)
+	if (lmv_dir_foreign(lsm))
 		return ERR_PTR(-ENODATA);
 
 	/*
@@ -1671,43 +1672,101 @@ struct lmv_tgt_desc*
 		if (IS_ERR(tgt))
 			return tgt;
 
-		if (lsm) {
+		if (lmv_dir_striped(lsm)) {
 			int i;
 
 			/* refill the right parent fid */
 			for (i = 0; i < lsm->lsm_md_stripe_count; i++) {
 				oinfo = &lsm->lsm_md_oinfo[i];
 				if (oinfo->lmo_mds == op_data->op_mds) {
-					*fid = oinfo->lmo_fid;
+					op_data->op_fid1 = oinfo->lmo_fid;
 					break;
 				}
 			}
 
 			if (i == lsm->lsm_md_stripe_count)
-				*fid = lsm->lsm_md_oinfo[0].lmo_fid;
+				op_data->op_fid1 = lsm->lsm_md_oinfo[0].lmo_fid;
 		}
-	} else if (lmv_is_dir_bad_hash(lsm)) {
+	} else if (lmv_dir_bad_hash(lsm)) {
 		LASSERT(op_data->op_stripe_index < lsm->lsm_md_stripe_count);
 		oinfo = &lsm->lsm_md_oinfo[op_data->op_stripe_index];
 
-		*fid = oinfo->lmo_fid;
+		op_data->op_fid1 = oinfo->lmo_fid;
 		op_data->op_mds = oinfo->lmo_mds;
-
 		tgt = lmv_get_target(lmv, oinfo->lmo_mds, NULL);
+	} else if (op_data->op_code == LUSTRE_OPC_MKDIR &&
+		   lmv_dir_space_hashed(op_data->op_default_mea1) &&
+		   !lmv_dir_striped(lsm)) {
+		tgt = lmv_locate_tgt_qos(lmv, &op_data->op_mds);
+		/*
+		 * only update statfs when mkdir under dir with "space" hash,
+		 * this means the cached statfs may be stale, and current mkdir
+		 * may not follow QoS accurately, but it's not serious, and it
+		 * avoids periodic statfs when client doesn't mkdir under
+		 * "space" hashed directories.
+		 */
+		if (!IS_ERR(tgt)) {
+			struct obd_device *obd;
+
+			obd = container_of(lmv, struct obd_device, u.lmv);
+			lmv_statfs_check_update(obd, tgt);
+		}
 	} else {
-		tgt = __lmv_locate_tgt(lmv, lsm, op_data->op_name,
-				       op_data->op_namelen, fid,
-				       &op_data->op_mds,
-				       op_data->op_post_migrate);
+		tgt = lmv_locate_tgt_by_name(lmv, op_data->op_mea1,
+				op_data->op_name, op_data->op_namelen,
+				&op_data->op_fid1, &op_data->op_mds,
+				op_data->op_post_migrate);
 	}
 
 	return tgt;
 }
 
-static int lmv_create(struct obd_export *exp, struct md_op_data *op_data,
-		      const void *data, size_t datalen, umode_t mode,
-		      uid_t uid, gid_t gid, kernel_cap_t cap_effective,
-		      u64 rdev, struct ptlrpc_request **request)
+/* Locate MDT of op_data->op_fid2 for link/rename */
+static struct lmv_tgt_desc *
+lmv_locate_tgt2(struct lmv_obd *lmv, struct md_op_data *op_data)
+{
+	struct lmv_tgt_desc *tgt;
+	int rc;
+
+	LASSERT(op_data->op_name);
+	if (lmv_dir_migrating(op_data->op_mea2)) {
+		struct lu_fid fid1 = op_data->op_fid1;
+		struct lmv_stripe_md *lsm1 = op_data->op_mea1;
+		struct ptlrpc_request *request = NULL;
+
+		/*
+		 * avoid creating new file under old layout of migrating
+		 * directory, check it here.
+		 */
+		tgt = lmv_locate_tgt_by_name(lmv, op_data->op_mea2,
+				op_data->op_name, op_data->op_namelen,
+				&op_data->op_fid2, &op_data->op_mds, false);
+		if (IS_ERR(tgt))
+			return tgt;
+
+		op_data->op_fid1 = op_data->op_fid2;
+		op_data->op_mea1 = op_data->op_mea2;
+		rc = md_getattr_name(tgt->ltd_exp, op_data, &request);
+		op_data->op_fid1 = fid1;
+		op_data->op_mea1 = lsm1;
+		if (!rc) {
+			ptlrpc_req_finished(request);
+			return ERR_PTR(-EEXIST);
+		}
+
+		if (rc != -ENOENT)
+			return ERR_PTR(rc);
+	}
+
+	return lmv_locate_tgt_by_name(lmv, op_data->op_mea2, op_data->op_name,
+				op_data->op_namelen, &op_data->op_fid2,
+				&op_data->op_mds, true);
+}
+
+int lmv_create(struct obd_export *exp, struct md_op_data *op_data,
+		const void *data, size_t datalen, umode_t mode, uid_t uid,
+		gid_t gid, kernel_cap_t cap_effective, u64 rdev,
+		struct ptlrpc_request **request)
 {
 	struct obd_device *obd = exp->exp_obd;
 	struct lmv_obd *lmv = &obd->u.lmv;
@@ -1717,16 +1776,16 @@ static int lmv_create(struct obd_export *exp, struct md_op_data *op_data,
 	if (!lmv->desc.ld_active_tgt_count)
 		return -EIO;
 
-	if (lmv_is_dir_bad_hash(op_data->op_mea1))
+	if (lmv_dir_bad_hash(op_data->op_mea1))
 		return -EBADF;
 
-	if (lmv_is_dir_migrating(op_data->op_mea1)) {
+	if (lmv_dir_migrating(op_data->op_mea1)) {
 		/*
 		 * if parent is migrating, create() needs to lookup existing
 		 * name, to avoid creating new file under old layout of
 		 * migrating directory, check old layout here.
 		 */
-		tgt = lmv_locate_tgt(lmv, op_data, &op_data->op_fid1);
+		tgt = lmv_locate_tgt(lmv, op_data);
 		if (IS_ERR(tgt))
 			return PTR_ERR(tgt);
 
@@ -1743,7 +1802,7 @@ static int lmv_create(struct obd_export *exp, struct md_op_data *op_data,
 		op_data->op_post_migrate = true;
 	}
 
-	tgt = lmv_locate_tgt(lmv, op_data, &op_data->op_fid1);
+	tgt = lmv_locate_tgt(lmv, op_data);
 	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
@@ -1765,8 +1824,6 @@ static int lmv_create(struct obd_export *exp, struct md_op_data *op_data,
 			return PTR_ERR(tgt);
 
 		op_data->op_mds = tgt->ltd_idx;
-	} else {
-		CDEBUG(D_CONFIG, "Server doesn't support striped dirs\n");
 	}
 
 	CDEBUG(D_INODE, "CREATE obj " DFID " -> mds #%x\n",
@@ -1818,7 +1875,7 @@ static int lmv_create(struct obd_export *exp, struct md_op_data *op_data,
 	int rc;
 
 retry:
-	tgt = lmv_locate_tgt(lmv, op_data, &op_data->op_fid1);
+	tgt = lmv_locate_tgt(lmv, op_data);
 	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
@@ -1916,39 +1973,7 @@ static int lmv_link(struct obd_export *exp, struct md_op_data *op_data,
 	op_data->op_fsgid = from_kgid(&init_user_ns, current_fsgid());
 	op_data->op_cap = current_cap();
 
-	if (lmv_is_dir_migrating(op_data->op_mea2)) {
-		struct lu_fid fid1 = op_data->op_fid1;
-		struct lmv_stripe_md *lsm1 = op_data->op_mea1;
-
-		/*
-		 * avoid creating new file under old layout of migrating
-		 * directory, check it here.
-		 */
-		tgt = __lmv_locate_tgt(lmv, op_data->op_mea2, op_data->op_name,
-				       op_data->op_namelen, &op_data->op_fid2,
-				       &op_data->op_mds, false);
-		tgt = lmv_locate_tgt(lmv, op_data, &op_data->op_fid1);
-		if (IS_ERR(tgt))
-			return PTR_ERR(tgt);
-
-		op_data->op_fid1 = op_data->op_fid2;
-		op_data->op_mea1 = op_data->op_mea2;
-		rc = md_getattr_name(tgt->ltd_exp, op_data, request);
-		op_data->op_fid1 = fid1;
-		op_data->op_mea1 = lsm1;
-		if (!rc) {
-			ptlrpc_req_finished(*request);
-			*request = NULL;
-			return -EEXIST;
-		}
-
-		if (rc != -ENOENT)
-			return rc;
-	}
-
-	tgt = __lmv_locate_tgt(lmv, op_data->op_mea2, op_data->op_name,
-			       op_data->op_namelen, &op_data->op_fid2,
-			       &op_data->op_mds, true);
+	tgt = lmv_locate_tgt2(lmv, op_data);
 	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
@@ -1992,7 +2017,7 @@ static int lmv_migrate(struct obd_export *exp, struct md_op_data *op_data,
 	if (IS_ERR(parent_tgt))
 		return PTR_ERR(parent_tgt);
 
-	if (lsm) {
+	if (lmv_dir_striped(lsm)) {
 		u32 hash_type = lsm->lsm_md_hash_type;
 		u32 stripe_count = lsm->lsm_md_stripe_count;
 
@@ -2000,7 +2025,7 @@ static int lmv_migrate(struct obd_export *exp, struct md_op_data *op_data,
 		 * old stripes are appended after new stripes for migrating
 		 * directory.
 		 */
-		if (lsm->lsm_md_hash_type & LMV_HASH_FLAG_MIGRATION) {
+		if (lmv_dir_migrating(lsm)) {
 			hash_type = lsm->lsm_md_migrate_hash;
 			stripe_count -= lsm->lsm_md_migrate_offset;
 		}
@@ -2010,7 +2035,7 @@ static int lmv_migrate(struct obd_export *exp, struct md_op_data *op_data,
 		if (rc < 0)
 			return rc;
 
-		if (lsm->lsm_md_hash_type & LMV_HASH_FLAG_MIGRATION)
+		if (lmv_dir_migrating(lsm))
 			rc += lsm->lsm_md_migrate_offset;
 
 		/* save it in fid4 temporarily for early cancel */
@@ -2024,7 +2049,7 @@ static int lmv_migrate(struct obd_export *exp, struct md_op_data *op_data,
 		 * if parent is being migrated too, fill op_fid2 with target
 		 * stripe fid, otherwise the target stripe is not created yet.
 		 */
-		if (lsm->lsm_md_hash_type & LMV_HASH_FLAG_MIGRATION) {
+		if (lmv_dir_migrating(lsm)) {
 			hash_type = lsm->lsm_md_hash_type &
 				    ~LMV_HASH_FLAG_MIGRATION;
 			stripe_count = lsm->lsm_md_migrate_offset;
@@ -2151,44 +2176,10 @@ static int lmv_rename(struct obd_export *exp, struct md_op_data *op_data,
 	op_data->op_fsgid = from_kgid(&init_user_ns, current_fsgid());
 	op_data->op_cap = current_cap();
 
-	if (lmv_is_dir_migrating(op_data->op_mea2)) {
-		struct lu_fid fid1 = op_data->op_fid1;
-		struct lmv_stripe_md *lsm1 = op_data->op_mea1;
-
-		/*
-		 * we avoid creating new file under old layout of migrating
-		 * directory, if there is an existing file with new name under
-		 * old layout, we can't unlink file in old layout and rename to
-		 * new layout in one transaction, so return -EBUSY here.`
-		 */
-		tgt = __lmv_locate_tgt(lmv, op_data->op_mea2, new, newlen,
-				       &op_data->op_fid2, &op_data->op_mds,
-				       false);
-		if (IS_ERR(tgt))
-			return PTR_ERR(tgt);
-
-		op_data->op_fid1 = op_data->op_fid2;
-		op_data->op_mea1 = op_data->op_mea2;
-		op_data->op_name = new;
-		op_data->op_namelen = newlen;
-		rc = md_getattr_name(tgt->ltd_exp, op_data, request);
-		op_data->op_fid1 = fid1;
-		op_data->op_mea1 = lsm1;
-		op_data->op_name = NULL;
-		op_data->op_namelen = 0;
-		if (!rc) {
-			ptlrpc_req_finished(*request);
-			*request = NULL;
-			return -EBUSY;
-		}
+	op_data->op_name = new;
+	op_data->op_namelen = newlen;
 
-		if (rc != -ENOENT)
-			return rc;
-	}
-
-	/* rename to new layout for migrating directory */
-	tp_tgt = __lmv_locate_tgt(lmv, op_data->op_mea2, new, newlen,
-				  &op_data->op_fid2, &op_data->op_mds, true);
+	tp_tgt = lmv_locate_tgt2(lmv, op_data);
 	if (IS_ERR(tp_tgt))
 		return PTR_ERR(tp_tgt);
 
@@ -2240,10 +2231,10 @@ static int lmv_rename(struct obd_export *exp, struct md_op_data *op_data,
 			return rc;
 	}
 
+	op_data->op_name = old;
+	op_data->op_namelen = oldlen;
 retry:
-	sp_tgt = __lmv_locate_tgt(lmv, op_data->op_mea1, old, oldlen,
-				  &op_data->op_fid1, &op_data->op_mds,
-				  op_data->op_post_migrate);
+	sp_tgt = lmv_locate_tgt(lmv, op_data);
 	if (IS_ERR(sp_tgt))
 		return PTR_ERR(sp_tgt);
 
@@ -2710,16 +2701,14 @@ static int lmv_read_page(struct obd_export *exp, struct md_op_data *op_data,
 			 struct md_callback *cb_op, u64 offset,
 			 struct page **ppage)
 {
-	struct lmv_stripe_md *lsm = op_data->op_mea1;
 	struct obd_device *obd = exp->exp_obd;
 	struct lmv_obd *lmv = &obd->u.lmv;
 	struct lmv_tgt_desc *tgt;
 
-	if (unlikely(lsm)) {
-		/* foreign dir is not striped dir */
-		if (lsm->lsm_md_magic == LMV_MAGIC_FOREIGN)
-			return -ENODATA;
+	if (unlikely(lmv_dir_foreign(op_data->op_mea1)))
+		return -ENODATA;
 
+	if (unlikely(lmv_dir_striped(op_data->op_mea1))) {
 		return lmv_striped_read_page(exp, op_data, cb_op,
 					     offset, ppage);
 	}
@@ -2770,7 +2759,7 @@ static int lmv_unlink(struct obd_export *exp, struct md_op_data *op_data,
 	op_data->op_cap = current_cap();
 
 retry:
-	parent_tgt = lmv_locate_tgt(lmv, op_data, &op_data->op_fid1);
+	parent_tgt = lmv_locate_tgt(lmv, op_data);
 	if (IS_ERR(parent_tgt))
 		return PTR_ERR(parent_tgt);
 
@@ -3060,7 +3049,7 @@ static int lmv_unpackmd(struct obd_export *exp, struct lmv_stripe_md **lsmp,
 			return 0;
 		}
 
-		if (lsm->lsm_md_magic == LMV_MAGIC) {
+		if (lmv_dir_striped(lsm)) {
 			for (i = 0; i < lsm->lsm_md_stripe_count; i++) {
 				if (lsm->lsm_md_oinfo[i].lmo_root)
 					iput(lsm->lsm_md_oinfo[i].lmo_root);
@@ -3343,7 +3332,8 @@ static int lmv_revalidate_lock(struct obd_export *exp, struct lookup_intent *it,
 {
 	const struct lmv_oinfo *oinfo;
 
-	LASSERT(lsm);
+	LASSERT(lmv_dir_striped(lsm));
+
 	oinfo = lsm_name_to_stripe_info(lsm, name, namelen, false);
 	if (IS_ERR(oinfo))
 		return PTR_ERR(oinfo);
@@ -3408,8 +3398,7 @@ static int lmv_merge_attr(struct obd_export *exp,
 {
 	int rc, i;
 
-	/* foreign dir is not striped dir */
-	if (lsm->lsm_md_magic == LMV_MAGIC_FOREIGN)
+	if (!lmv_dir_striped(lsm))
 		return 0;
 
 	rc = lmv_revalidate_slaves(exp, lsm, cb_blocking, 0);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 315/622] lustre: llite: check correct size in ll_dom_finish_open()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

The data-end check in ll_dom_finish_open() shouldn't compare against
i_size, because i_size may not yet be updated with the data just
returned from the server. Use the size value from the mdt_body in the
reply for that check instead.
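The intent of the new check can be sketched as a standalone helper: the returned buffer is trusted only when offset plus length lands exactly on the DoM size reported in the same reply, not on the possibly stale inode size. The struct below is a simplified stand-in for the kernel's niobuf_remote/mdt_body fields, not the real layout:

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel's niobuf_remote / mdt_body fields. */
struct reply {
	uint64_t rnb_offset;   /* offset of the returned data */
	uint32_t rnb_len;      /* length of the returned data */
	uint64_t mbo_dom_size; /* DoM size reported in the same reply */
};

/* Accept the data only when offset + len equals the DoM size carried in
 * the reply itself, rather than comparing against a stale i_size. */
static bool dom_reply_usable(const struct reply *r)
{
	return r->rnb_offset + r->rnb_len == r->mbo_dom_size;
}
```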

WC-bug-id: https://jira.whamcloud.com/browse/LU-12014
Lustre-commit: 7b9fd576f7de ("LU-12014 llite: check correct size in ll_dom_finish_open()")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33895
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 50220eb..88d5c2d 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -418,6 +418,7 @@ void ll_dom_finish_open(struct inode *inode, struct ptlrpc_request *req,
 	struct address_space *mapping = inode->i_mapping;
 	struct page *vmpage;
 	struct niobuf_remote *rnb;
+	struct mdt_body *body;
 	char *data;
 	unsigned long index, start;
 	struct niobuf_local lnb;
@@ -441,18 +442,19 @@ void ll_dom_finish_open(struct inode *inode, struct ptlrpc_request *req,
 	if (rnb->rnb_offset % PAGE_SIZE)
 		return;
 
-	/* Server returns whole file or just file tail if it fills in
-	 * reply buffer, in both cases total size should be inode size.
+	/* Server returns whole file or just file tail if it fills in reply
+	 * buffer, in both cases total size should be equal to the file size.
 	 */
-	if (rnb->rnb_offset + rnb->rnb_len < i_size_read(inode)) {
-		CERROR("%s: server returns off/len %llu/%u < i_size %llu\n",
+	body = req_capsule_server_get(&req->rq_pill, &RMF_MDT_BODY);
+	if (rnb->rnb_offset + rnb->rnb_len != body->mbo_dom_size) {
+		CERROR("%s: server returns off/len %llu/%u but size %llu\n",
 		       ll_i2sbi(inode)->ll_fsname, rnb->rnb_offset,
-		       rnb->rnb_len, i_size_read(inode));
+		       rnb->rnb_len, body->mbo_dom_size);
 		return;
 	}
 
-	CDEBUG(D_INFO, "Get data along with open at %llu len %i, i_size %llu\n",
-	       rnb->rnb_offset, rnb->rnb_len, i_size_read(inode));
+	CDEBUG(D_INFO, "Get data along with open at %llu len %i, size %llu\n",
+	       rnb->rnb_offset, rnb->rnb_len, body->mbo_dom_size);
 
 	data = (char *)rnb + sizeof(*rnb);
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 316/622] lnet: recovery event handling broken
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Don't increment health on an unlink event.
If a SEND fails, an unlink will follow, so no special processing
is needed on the SEND event. If the SEND succeeds, we wait for
the reply.
When queuing a message on the NI recovery queue, only do so if
the MT thread is still running.
WC-bug-id: https://jira.whamcloud.com/browse/LU-12080
Lustre-commit: 5409e620e025 ("LU-12080 lnet: recovery event handling broken")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34445
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-move.c | 9 +++++----
 net/lnet/lnet/lib-msg.c  | 5 +++++
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 809d2b6..a6df9ba 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -3197,7 +3197,7 @@ struct lnet_mt_event_info {
 
 static void
 lnet_handle_recovery_reply(struct lnet_mt_event_info *ev_info,
-			   int status)
+			   int status, bool unlink_event)
 {
 	lnet_nid_t nid = ev_info->mt_nid;
 
@@ -3228,7 +3228,8 @@ struct lnet_mt_event_info {
 		 * carry forward too much information.
 		 * In the peer case, it'll naturally be incremented
 		 */
-		lnet_inc_healthv(&ni->ni_healthv);
+		if (!unlink_event)
+			lnet_inc_healthv(&ni->ni_healthv);
 	} else {
 		struct lnet_peer_ni *lpni;
 		int cpt;
@@ -3273,14 +3274,14 @@ struct lnet_mt_event_info {
 		       libcfs_nid2str(ev_info->mt_nid));
 		/* fall-through */
 	case LNET_EVENT_REPLY:
-		lnet_handle_recovery_reply(ev_info, event->status);
+		lnet_handle_recovery_reply(ev_info, event->status,
+					   event->type == LNET_EVENT_UNLINK);
 		break;
 	case LNET_EVENT_SEND:
 		CDEBUG(D_NET, "%s recovery message sent %s:%d\n",
 		       libcfs_nid2str(ev_info->mt_nid),
 		       (event->status) ? "unsuccessfully" :
 		       "successfully", event->status);
-		lnet_handle_recovery_reply(ev_info, event->status);
 		break;
 	default:
 		CERROR("Unexpected event: %d\n", event->type);
diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index 0738bf7..146e23c 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -521,6 +521,11 @@
 		return;
 
 	lnet_net_lock(0);
+	/* the mt could've shutdown and cleaned up the queues */
+	if (the_lnet.ln_mt_state != LNET_MT_STATE_RUNNING) {
+		lnet_net_unlock(0);
+		return;
+	}
 	lnet_handle_remote_failure_locked(lpni);
 	lnet_net_unlock(0);
 }
-- 
1.8.3.1


* [lustre-devel] [PATCH 317/622] lnet: clean mt_eqh properly
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

There is a scenario where a peer on your recovery queue is down,
so you keep pinging it, but every ping times out after 10 seconds.
In the middle of these 10 seconds you perform a shutdown. First
rsp_tracker_clean runs and calls MDUnlink on the MD related to
that ping. But because the message holds a reference on the MD,
the MD doesn't go away: it is zombied, waiting for
lnet_md_unlink() to be called from lnet_finalize(). Then
clean_peer_ni_recovery runs; it sees the peer on the queue and
tries to call Unlink, but looking up the MD with lnet_handle2md()
fails. Afterwards, cleaning up the EQ trips an assert. Even with
the assert removed, the EQ leaks, since it is never actually
freed: LNetEQFree() won't be called again.

The solution is to move the EQ creation into LNetNIInit() and the
deletion into lnet_unprepare(). By that point all remaining
messages have been finalized and all references on the EQ are
gone, allowing it to be cleaned up properly.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12080
Lustre-commit: 1065c8888e96 ("LU-12080 lnet: clean mt_eqh properly")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34477
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h |  2 ++
 net/lnet/lnet/api-ni.c        | 15 +++++++++++++++
 net/lnet/lnet/lib-eq.c        |  2 --
 net/lnet/lnet/lib-move.c      | 13 +------------
 4 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index a6e64f6..10922ae 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -513,6 +513,8 @@ struct lnet_ni *
 int lnet_lib_init(void);
 void lnet_lib_exit(void);
 
+void lnet_mt_event_handler(struct lnet_event *event);
+
 int lnet_notify(struct lnet_ni *ni, lnet_nid_t peer, int alive,
 		time64_t when);
 void lnet_notify_locked(struct lnet_peer_ni *lp, int notifylnd, int alive,
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index e5f5c6c..1388bd4 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -1059,6 +1059,7 @@ struct lnet_libhandle *
 	INIT_LIST_HEAD(&the_lnet.ln_mt_localNIRecovq);
 	INIT_LIST_HEAD(&the_lnet.ln_mt_peerNIRecovq);
 	init_waitqueue_head(&the_lnet.ln_dc_waitq);
+	LNetInvalidateEQHandle(&the_lnet.ln_mt_eqh);
 
 	rc = lnet_descriptor_setup();
 	if (rc != 0)
@@ -1126,6 +1127,8 @@ struct lnet_libhandle *
 static int
 lnet_unprepare(void)
 {
+	int rc;
+
 	/*
 	 * NB no LNET_LOCK since this is the last reference.  All LND instances
 	 * have shut down already, so it is safe to unlink and free all
@@ -1138,6 +1141,12 @@ struct lnet_libhandle *
 	LASSERT(list_empty(&the_lnet.ln_test_peers));
 	LASSERT(list_empty(&the_lnet.ln_nets));
 
+	if (!LNetEQHandleIsInvalid(the_lnet.ln_mt_eqh)) {
+		rc = LNetEQFree(the_lnet.ln_mt_eqh);
+		LNetInvalidateEQHandle(&the_lnet.ln_mt_eqh);
+		LASSERT(rc == 0);
+	}
+
 	lnet_portals_destroy();
 
 	if (the_lnet.ln_md_containers) {
@@ -2503,6 +2512,12 @@ void lnet_lib_exit(void)
 
 	lnet_ping_target_update(pbuf, ping_mdh);
 
+	rc = LNetEQAlloc(0, lnet_mt_event_handler, &the_lnet.ln_mt_eqh);
+	if (rc != 0) {
+		CERROR("Can't allocate monitor thread EQ: %d\n", rc);
+		goto err_stop_ping;
+	}
+
 	rc = lnet_monitor_thr_start();
 	if (rc)
 		goto err_stop_ping;
diff --git a/net/lnet/lnet/lib-eq.c b/net/lnet/lnet/lib-eq.c
index 3d99f0a..01b8ee3 100644
--- a/net/lnet/lnet/lib-eq.c
+++ b/net/lnet/lnet/lib-eq.c
@@ -164,8 +164,6 @@
 	int size = 0;
 	int i;
 
-	LASSERT(the_lnet.ln_refcount > 0);
-
 	lnet_res_lock(LNET_LOCK_EX);
 	/*
 	 * NB: hold lnet_eq_wait_lock for EQ link/unlink, so we can do
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index a6df9ba..7c135c4 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -3254,7 +3254,7 @@ struct lnet_mt_event_info {
 	}
 }
 
-static void
+void
 lnet_mt_event_handler(struct lnet_event *event)
 {
 	struct lnet_mt_event_info *ev_info = event->md.user_ptr;
@@ -3333,12 +3333,6 @@ int lnet_monitor_thr_start(void)
 	if (rc)
 		goto clean_queues;
 
-	rc = LNetEQAlloc(0, lnet_mt_event_handler, &the_lnet.ln_mt_eqh);
-	if (rc != 0) {
-		CERROR("Can't allocate monitor thread EQ: %d\n", rc);
-		goto clean_queues;
-	}
-
 	/* Pre monitor thread start processing */
 	rc = lnet_router_pre_mt_start();
 	if (rc)
@@ -3371,7 +3365,6 @@ int lnet_monitor_thr_start(void)
 	lnet_clean_local_ni_recoveryq();
 	lnet_clean_peer_ni_recoveryq();
 	lnet_clean_resendqs();
-	LNetEQFree(the_lnet.ln_mt_eqh);
 	LNetInvalidateEQHandle(&the_lnet.ln_mt_eqh);
 	return rc;
 clean_queues:
@@ -3384,8 +3377,6 @@ int lnet_monitor_thr_start(void)
 
 void lnet_monitor_thr_stop(void)
 {
-	int rc;
-
 	if (the_lnet.ln_mt_state == LNET_MT_STATE_SHUTDOWN)
 		return;
 
@@ -3405,8 +3396,6 @@ void lnet_monitor_thr_stop(void)
 	lnet_clean_local_ni_recoveryq();
 	lnet_clean_peer_ni_recoveryq();
 	lnet_clean_resendqs();
-	rc = LNetEQFree(the_lnet.ln_mt_eqh);
-	LASSERT(rc == 0);
 }
 
 void
-- 
1.8.3.1


* [lustre-devel] [PATCH 318/622] lnet: handle remote health error
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

When a peer is dead, set the health status to REMOTE_DROPPED in
order to handle health properly for the peer.
When dropping a routed message, set REMOTE_ERROR. Routed messages
are dropped when the routing feature is turned off, which could
be considered a configuration error if it happens in the middle
of traffic. Therefore, it's better to flag the issue at this
point without resending the message.
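The mapping the patch establishes can be summarized in a small sketch. The enum names mirror the statuses in the description, but the types here are simplified stand-ins, not the kernel's lnet_msg_hstatus values:

```c
/* Simplified mapping from the drop reason to the health status
 * recorded on the message, as described in the patch. */
enum health_status {
	STATUS_OK,
	REMOTE_DROPPED,	/* peer known to be dead: message dropped */
	REMOTE_ERROR,	/* routed message dropped: routing turned off */
};

enum drop_reason { PEER_DEAD, ROUTING_DISABLED };

static enum health_status drop_status(enum drop_reason why)
{
	switch (why) {
	case PEER_DEAD:
		return REMOTE_DROPPED;
	case ROUTING_DISABLED:
		return REMOTE_ERROR;	/* not resent: likely misconfig */
	}
	return STATUS_OK;
}
```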

WC-bug-id: https://jira.whamcloud.com/browse/LU-12344
Lustre-commit: b45e3d96fc4d ("LU-12344 lnet: handle remote health error")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34967
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-move.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 7c135c4..8eeb5ec 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -770,7 +770,7 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 
 		CNETERR("Dropping message for %s: peer not alive\n",
 			libcfs_id2str(msg->msg_target));
-		msg->msg_health_status = LNET_MSG_STATUS_LOCAL_DROPPED;
+		msg->msg_health_status = LNET_MSG_STATUS_REMOTE_DROPPED;
 		if (do_send)
 			lnet_finalize(msg, -EHOSTUNREACH);
 
@@ -786,6 +786,9 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 			libcfs_id2str(msg->msg_target));
 		if (do_send) {
 			msg->msg_no_resend = true;
+			CDEBUG(D_NET,
+			       "msg %p to %s canceled and will not be resent\n",
+			       msg, libcfs_id2str(msg->msg_target));
 			lnet_finalize(msg, -ECANCELED);
 		}
 
@@ -1065,6 +1068,7 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 			     0, 0, 0, msg->msg_hdr.payload_length);
 		list_del_init(&msg->msg_list);
 		msg->msg_no_resend = true;
+		msg->msg_health_status = LNET_MSG_STATUS_REMOTE_ERROR;
 		lnet_finalize(msg, -ECANCELED);
 	}
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 319/622] lnet: setup health timeout defaults
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Enable the health feature by default.
Set the transaction timeout to a default of 10 seconds and the
retry count to 3 when health is enabled. When health is disabled,
set the default transaction timeout to 50 seconds.
When toggling between health enabled/disabled, the defaults
always kick in.
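The toggle logic in sensitivity_set() can be sketched as follows, using the default values named in the patch. The struct and function names are illustrative stand-ins for the module-parameter globals:

```c
/* Defaults named in the patch; applied whenever the health sensitivity
 * toggles between zero (health off) and non-zero (health on). */
#define TIMEOUT_NO_HEALTH 50	/* seconds, health disabled */
#define TIMEOUT_HEALTH    10	/* seconds, health enabled */
#define RETRY_HEALTH       3

struct health_cfg {
	unsigned int timeout;	/* lnet_transaction_timeout stand-in */
	unsigned int retry;	/* lnet_retry_count stand-in */
};

static void set_sensitivity(struct health_cfg *cfg,
			    unsigned int *sensitivity, unsigned int value)
{
	if (*sensitivity == 0 && value != 0) {
		/* turning health on: use the health defaults */
		cfg->timeout = TIMEOUT_HEALTH;
		cfg->retry = RETRY_HEALTH;
	} else if (*sensitivity != 0 && value == 0) {
		/* turning health off: long timeout, no retries */
		cfg->timeout = TIMEOUT_NO_HEALTH;
		cfg->retry = 0;
	}
	*sensitivity = value;
}
```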

WC-bug-id: https://jira.whamcloud.com/browse/LU-11816
Lustre-commit: 8632e94aeb7e ("LU-11816 lnet: setup health timeout defaults")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34252
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/api-ni.c | 55 +++++++++++++++++++++++++-------------------------
 1 file changed, 28 insertions(+), 27 deletions(-)

diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 1388bd4..aeb9d92 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -79,10 +79,10 @@ struct lnet the_lnet = {
 		 "NUMA range to consider during Multi-Rail selection");
 
 /* lnet_health_sensitivity determines by how much we decrement the health
- * value on sending error. The value defaults to 0, which means health
- * checking is turned off by default.
+ * value on sending error. The value defaults to 100, which means health
+ * interface health is decremented by 100 points every failure.
  */
-unsigned int lnet_health_sensitivity;
+unsigned int lnet_health_sensitivity = 100;
 static int sensitivity_set(const char *val, const struct kernel_param *kp);
 static struct kernel_param_ops param_ops_health_sensitivity = {
 	.set = sensitivity_set,
@@ -140,7 +140,10 @@ static int recovery_interval_set(const char *val,
 MODULE_PARM_DESC(lnet_drop_asym_route,
 		 "Set to 1 to drop asymmetrical route messages.");
 
-unsigned int lnet_transaction_timeout = 50;
+#define LNET_TRANSACTION_TIMEOUT_NO_HEALTH_DEFAULT 50
+#define LNET_TRANSACTION_TIMEOUT_HEALTH_DEFAULT 10
+
+unsigned int lnet_transaction_timeout = LNET_TRANSACTION_TIMEOUT_HEALTH_DEFAULT;
 static int transaction_to_set(const char *val, const struct kernel_param *kp);
 static struct kernel_param_ops param_ops_transaction_timeout = {
 	.set = transaction_to_set,
@@ -153,7 +156,8 @@ static int recovery_interval_set(const char *val,
 MODULE_PARM_DESC(lnet_transaction_timeout,
 		 "Maximum number of seconds to wait for a peer response.");
 
-unsigned int lnet_retry_count;
+#define LNET_RETRY_COUNT_HEALTH_DEFAULT 3
+unsigned int lnet_retry_count = LNET_RETRY_COUNT_HEALTH_DEFAULT;
 static int retry_count_set(const char *val, const struct kernel_param *kp);
 static struct kernel_param_ops param_ops_retry_count = {
 	.set = retry_count_set,
@@ -201,11 +205,6 @@ static int lnet_discover(struct lnet_process_id id, u32 force,
 	 */
 	mutex_lock(&the_lnet.ln_api_mutex);
 
-	if (the_lnet.ln_state != LNET_STATE_RUNNING) {
-		mutex_unlock(&the_lnet.ln_api_mutex);
-		return 0;
-	}
-
 	if (value > LNET_MAX_HEALTH_VALUE) {
 		mutex_unlock(&the_lnet.ln_api_mutex);
 		CERROR("Invalid health value. Maximum: %d value = %lu\n",
@@ -213,6 +212,22 @@ static int lnet_discover(struct lnet_process_id id, u32 force,
 		return -EINVAL;
 	}
 
+	/* if we're turning on health then use the health timeout
+	 * defaults.
+	 */
+	if (*sensitivity == 0 && value != 0) {
+		lnet_transaction_timeout =
+			LNET_TRANSACTION_TIMEOUT_HEALTH_DEFAULT;
+		lnet_retry_count = LNET_RETRY_COUNT_HEALTH_DEFAULT;
+	/* if we're turning off health then use the no health timeout
+	 * default.
+	 */
+	} else if (*sensitivity != 0 && value == 0) {
+		lnet_transaction_timeout =
+			LNET_TRANSACTION_TIMEOUT_NO_HEALTH_DEFAULT;
+		lnet_retry_count = 0;
+	}
+
 	*sensitivity = value;
 
 	mutex_unlock(&the_lnet.ln_api_mutex);
@@ -243,11 +258,6 @@ static int lnet_discover(struct lnet_process_id id, u32 force,
 	 */
 	mutex_lock(&the_lnet.ln_api_mutex);
 
-	if (the_lnet.ln_state != LNET_STATE_RUNNING) {
-		mutex_unlock(&the_lnet.ln_api_mutex);
-		return 0;
-	}
-
 	*interval = value;
 
 	mutex_unlock(&the_lnet.ln_api_mutex);
@@ -353,11 +363,6 @@ static int lnet_discover(struct lnet_process_id id, u32 force,
 	 */
 	mutex_lock(&the_lnet.ln_api_mutex);
 
-	if (the_lnet.ln_state != LNET_STATE_RUNNING) {
-		mutex_unlock(&the_lnet.ln_api_mutex);
-		return 0;
-	}
-
 	if (value < lnet_retry_count || value == 0) {
 		mutex_unlock(&the_lnet.ln_api_mutex);
 		CERROR("Invalid value for lnet_transaction_timeout (%lu). Has to be greater than lnet_retry_count (%u)\n",
@@ -399,9 +404,10 @@ static int lnet_discover(struct lnet_process_id id, u32 force,
 	 */
 	mutex_lock(&the_lnet.ln_api_mutex);
 
-	if (the_lnet.ln_state != LNET_STATE_RUNNING) {
+	if (lnet_health_sensitivity == 0) {
 		mutex_unlock(&the_lnet.ln_api_mutex);
-		return 0;
+		CERROR("Can not set retry_count when health feature is turned off\n");
+		return -EINVAL;
 	}
 
 	if (value > lnet_transaction_timeout) {
@@ -411,11 +417,6 @@ static int lnet_discover(struct lnet_process_id id, u32 force,
 		return -EINVAL;
 	}
 
-	if (value == *retry_count) {
-		mutex_unlock(&the_lnet.ln_api_mutex);
-		return 0;
-	}
-
 	*retry_count = value;
 
 	if (value == 0)
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread
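The sensitivity_set() hunk above swaps between two sets of timeout
defaults whenever the sensitivity value crosses zero. A minimal
userspace sketch of that transition logic (the struct and function
here are hypothetical; only the constants mirror the patch):

```c
#include <assert.h>

/* Hypothetical re-creation of the toggle in sensitivity_set():
 * crossing from zero to non-zero installs the health defaults,
 * crossing back to zero restores the no-health defaults.
 */
#define TX_TIMEOUT_NO_HEALTH	50
#define TX_TIMEOUT_HEALTH	10
#define RETRY_COUNT_HEALTH	3

struct health_cfg {
	unsigned int sensitivity;
	unsigned int transaction_timeout;
	unsigned int retry_count;
};

static void set_sensitivity(struct health_cfg *cfg, unsigned int value)
{
	if (cfg->sensitivity == 0 && value != 0) {
		/* turning health on: use the health timeout defaults */
		cfg->transaction_timeout = TX_TIMEOUT_HEALTH;
		cfg->retry_count = RETRY_COUNT_HEALTH;
	} else if (cfg->sensitivity != 0 && value == 0) {
		/* turning health off: restore the no-health defaults */
		cfg->transaction_timeout = TX_TIMEOUT_NO_HEALTH;
		cfg->retry_count = 0;
	}
	cfg->sensitivity = value;
}
```

Note the defaults only change on a zero/non-zero transition; tuning
sensitivity between two non-zero values leaves the timeouts alone.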

* [lustre-devel] [PATCH 320/622] lnet: fix cpt locking
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (318 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 319/622] lnet: setup health timeout defaults James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 321/622] lnet: detach response tracker James Simmons
                   ` (302 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

In lnet_select_pathway() the call to lnet_handle_send_case_locked()
can result in sd_cpt being changed. If this function returns
REPEAT_SEND, we'll go back to the again label. It is possible at
this time to initiate discovery, which will unlock the cpt.
If the local cpt isn't updated we could potentially be manipulating
the wrong cpt, resulting in corruption or deadlock.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12163
Lustre-commit: f6d63067e1ec ("LU-12163 lnet: fix cpt locking")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34607
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-move.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 8eeb5ec..0ee3a55 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -2390,10 +2390,15 @@ struct lnet_ni *
 
 	rc = lnet_handle_send_case_locked(&send_data);
 
+	/* Update the local cpt since send_data.sd_cpt might've been
+	 * updated as a result of calling lnet_handle_send_case_locked().
+	 */
+	cpt = send_data.sd_cpt;
+
 	if (rc == REPEAT_SEND)
 		goto again;
 
-	lnet_net_unlock(send_data.sd_cpt);
+	lnet_net_unlock(cpt);
 
 	return rc;
 }
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 321/622] lnet: detach response tracker
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (319 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 320/622] lnet: fix cpt locking James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 322/622] lnet: invalidate recovery ping mdh James Simmons
                   ` (301 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

We need to unlink the response tracker from MDs even if the
corresponding message failed to send.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12201
Lustre-commit: 1bb91b966d15 ("LU-12201 lnet: detach response tracker")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34770
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-msg.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index 146e23c..a245942 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -771,12 +771,7 @@
 	}
 
 	if (unlink) {
-		/* if this is an ACK or a REPLY then make sure to remove the
-		 * response tracker.
-		 */
-		if (msg->msg_ev.type == LNET_EVENT_REPLY ||
-		    msg->msg_ev.type == LNET_EVENT_ACK)
-			lnet_detach_rsp_tracker(msg->msg_md, cpt);
+		lnet_detach_rsp_tracker(md, cpt);
 		lnet_md_unlink(md);
 	}
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 322/622] lnet: invalidate recovery ping mdh
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (320 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 321/622] lnet: detach response tracker James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 323/622] lnet: fix list corruption James Simmons
                   ` (300 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

For cleanliness, ensure that the recovery ping mdh is invalidated
when a peer ni or a local ni is allocated.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11297
Lustre-commit: d7b5f3114d51 ("LU-11297 lnet: invalidate recovery ping mdh")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34771
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/config.c | 1 +
 net/lnet/lnet/peer.c   | 1 +
 2 files changed, 2 insertions(+)

diff --git a/net/lnet/lnet/config.c b/net/lnet/lnet/config.c
index 5e0831a..760452c 100644
--- a/net/lnet/lnet/config.c
+++ b/net/lnet/lnet/config.c
@@ -443,6 +443,7 @@ struct lnet_net *
 	spin_lock_init(&ni->ni_lock);
 	INIT_LIST_HEAD(&ni->ni_netlist);
 	INIT_LIST_HEAD(&ni->ni_recovery);
+	LNetInvalidateMDHandle(&ni->ni_ping_mdh);
 	ni->ni_refs = cfs_percpt_alloc(lnet_cpt_table(),
 				       sizeof(*ni->ni_refs[0]));
 	if (!ni->ni_refs)
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 24a5cd3..7b11f28 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -126,6 +126,7 @@
 	INIT_LIST_HEAD(&lpni->lpni_peer_nis);
 	INIT_LIST_HEAD(&lpni->lpni_recovery);
 	INIT_LIST_HEAD(&lpni->lpni_on_remote_peer_ni_list);
+	LNetInvalidateMDHandle(&lpni->lpni_recovery_ping_mdh);
 
 	spin_lock_init(&lpni->lpni_lock);
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread
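The "invalidate at allocation" pattern used above can be sketched in
isolation: a handle initialized to a sentinel cookie can always be
passed safely to cleanup code, which treats the sentinel as "nothing
to unlink". The names below are illustrative, not LNet's:

```c
#include <assert.h>

/* Hypothetical stand-in for an MD handle and its invalid sentinel. */
struct md_handle {
	unsigned long long cookie;
};

#define MD_INVALID_COOKIE	(~0ULL)

/* Analogue of LNetInvalidateMDHandle(): mark the handle unused. */
static void md_handle_invalidate(struct md_handle *h)
{
	h->cookie = MD_INVALID_COOKIE;
}

/* Cleanup paths check this before attempting any unlink. */
static int md_handle_is_invalid(const struct md_handle *h)
{
	return h->cookie == MD_INVALID_COOKIE;
}
```

Without the invalidate call at allocation, a handle left as stack or
heap garbage could look valid to the cleanup path.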

* [lustre-devel] [PATCH 323/622] lnet: fix list corruption
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (321 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 322/622] lnet: invalidate recovery ping mdh James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 324/622] lnet: correct discovery LNetEQFree() James Simmons
                   ` (299 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

In shutdown the resend queues are cleared and freed, and the monitor
thread state is set to shutdown. It is possible for lnet_finalize()
to be called after the queues are freed. The code checks ln_state to
see if we're shutting down, but in this case it should really be
checking ln_mt_state. The monitor thread is the one that matters
here, because it is the one that allocates and frees the resend
queues.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12249
Lustre-commit: d799ac910cd6 ("LU-12249 lnet: fix list corruption")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34778
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-move.c | 10 ++++++++++
 net/lnet/lnet/lib-msg.c  |  8 +++++++-
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 0ee3a55..8bce3a9 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -3135,7 +3135,9 @@ struct lnet_mt_event_info {
 	lnet_prune_rc_data(1);
 
 	/* Shutting down */
+	lnet_net_lock(LNET_LOCK_EX);
 	the_lnet.ln_mt_state = LNET_MT_STATE_SHUTDOWN;
+	lnet_net_unlock(LNET_LOCK_EX);
 
 	/* signal that the monitor thread is exiting */
 	complete(&the_lnet.ln_mt_signal);
@@ -3349,7 +3351,9 @@ int lnet_monitor_thr_start(void)
 
 	init_completion(&the_lnet.ln_mt_signal);
 
+	lnet_net_lock(LNET_LOCK_EX);
 	the_lnet.ln_mt_state = LNET_MT_STATE_RUNNING;
+	lnet_net_unlock(LNET_LOCK_EX);
 	task = kthread_run(lnet_monitor_thread, NULL, "monitor_thread");
 	if (IS_ERR(task)) {
 		rc = PTR_ERR(task);
@@ -3363,13 +3367,17 @@ int lnet_monitor_thr_start(void)
 	return 0;
 
 clean_thread:
+	lnet_net_lock(LNET_LOCK_EX);
 	the_lnet.ln_mt_state = LNET_MT_STATE_STOPPING;
+	lnet_net_unlock(LNET_LOCK_EX);
 	/* block until event callback signals exit */
 	wait_for_completion(&the_lnet.ln_mt_signal);
 	/* clean up */
 	lnet_router_cleanup();
 free_mem:
+	lnet_net_lock(LNET_LOCK_EX);
 	the_lnet.ln_mt_state = LNET_MT_STATE_SHUTDOWN;
+	lnet_net_unlock(LNET_LOCK_EX);
 	lnet_rsp_tracker_clean();
 	lnet_clean_local_ni_recoveryq();
 	lnet_clean_peer_ni_recoveryq();
@@ -3390,7 +3398,9 @@ void lnet_monitor_thr_stop(void)
 		return;
 
 	LASSERT(the_lnet.ln_mt_state == LNET_MT_STATE_RUNNING);
+	lnet_net_lock(LNET_LOCK_EX);
 	the_lnet.ln_mt_state = LNET_MT_STATE_STOPPING;
+	lnet_net_unlock(LNET_LOCK_EX);
 
 	/* tell the monitor thread that we're shutting down */
 	wake_up(&the_lnet.ln_mt_waitq);
diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index a245942..ad35c3d 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -604,7 +604,7 @@
 	bool lo = false;
 
 	/* if we're shutting down no point in handling health. */
-	if (the_lnet.ln_state != LNET_STATE_RUNNING)
+	if (the_lnet.ln_mt_state != LNET_MT_STATE_RUNNING)
 		return -1;
 
 	LASSERT(msg->msg_txni);
@@ -712,6 +712,12 @@
 
 	lnet_net_lock(msg->msg_tx_cpt);
 
+	/* check again under lock */
+	if (the_lnet.ln_mt_state != LNET_MT_STATE_RUNNING) {
+		lnet_net_unlock(msg->msg_tx_cpt);
+		return -1;
+	}
+
 	/* remove message from the active list and reset it in preparation
 	 * for a resend. Two exception to this
 	 *
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread
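The "check again under lock" hunk in lib-msg.c above is an instance
of a classic pattern: the unlocked read is a cheap early exit, but
the decision only becomes authoritative once the state is re-read
with the lock held, since another thread may change it in between. A
hypothetical userspace sketch with a pthread mutex standing in for
lnet_net_lock():

```c
#include <assert.h>
#include <pthread.h>

enum mt_state { MT_RUNNING, MT_SHUTDOWN };

struct monitor {
	pthread_mutex_t lock;	/* stands in for lnet_net_lock() */
	enum mt_state state;	/* stands in for ln_mt_state */
};

/* Returns -1 if shutting down, 0 if the message may be resent. */
static int try_resend(struct monitor *mt)
{
	if (mt->state != MT_RUNNING)	/* cheap unlocked check */
		return -1;

	pthread_mutex_lock(&mt->lock);
	if (mt->state != MT_RUNNING) {	/* check again under lock */
		pthread_mutex_unlock(&mt->lock);
		return -1;
	}
	/* ... safe to touch the resend queues while still locked ... */
	pthread_mutex_unlock(&mt->lock);
	return 0;
}
```

This is also why the patch wraps every ln_mt_state store in
lnet_net_lock(LNET_LOCK_EX): the locked re-check is only meaningful
if writers take the same lock.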

* [lustre-devel] [PATCH 324/622] lnet: correct discovery LNetEQFree()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (322 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 323/622] lnet: fix list corruption James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 325/622] lnet: Protect lp_dc_pendq manipulation with lp_lock James Simmons
                   ` (298 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

The EQ must be freed only after all the queues are cleaned;
otherwise unprocessed events left on the event queue at free time
prevent its memory from being freed.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12254
Lustre-commit: a0879b5985b4 ("LU-12254 lnet: correct discovery LNetEQFree()")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34796
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/peer.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 7b11f28..8af9db2 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -3142,8 +3142,6 @@ static int lnet_peer_discovery(void *arg)
 	 * size of the thundering herd if there are multiple threads
 	 * waiting on discovery of a single peer.
 	 */
-	LNetEQFree(the_lnet.ln_dc_eqh);
-	LNetInvalidateEQHandle(&the_lnet.ln_dc_eqh);
 
 	/* Queue cleanup 1: stop all pending pings and pushes. */
 	lnet_net_lock(LNET_LOCK_EX);
@@ -3171,6 +3169,9 @@ static int lnet_peer_discovery(void *arg)
 	}
 	lnet_net_unlock(LNET_LOCK_EX);
 
+	LNetEQFree(the_lnet.ln_dc_eqh);
+	LNetInvalidateEQHandle(&the_lnet.ln_dc_eqh);
+
 	the_lnet.ln_dc_state = LNET_DC_STATE_SHUTDOWN;
 	wake_up(&the_lnet.ln_dc_waitq);
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 325/622] lnet: Protect lp_dc_pendq manipulation with lp_lock
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (323 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 324/622] lnet: correct discovery LNetEQFree() James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 326/622] lnet: Ensure md is detached when msg is not committed James Simmons
                   ` (297 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

Protect the peer discovery queue from concurrent manipulation by
acquiring the lp_lock.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12264
Lustre-commit: dd16a31bf4ae ("LU-12264 lnet: Protect lp_dc_pendq manipulation with lp_lock")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/34798
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-move.c | 2 ++
 net/lnet/lnet/peer.c     | 4 ++++
 2 files changed, 6 insertions(+)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 8bce3a9..de5951a 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -2336,7 +2336,9 @@ struct lnet_ni *
 		/* queue message and return */
 		msg->msg_rtr_nid_param = rtr_nid;
 		msg->msg_sending = 0;
+		spin_lock(&peer->lp_lock);
 		list_add_tail(&msg->msg_list, &peer->lp_dc_pendq);
+		spin_unlock(&peer->lp_lock);
 		lnet_peer_ni_decref_locked(lpni);
 		primary_nid = peer->lp_primary_nid;
 		lnet_net_unlock(cpt);
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 8af9db2..0d2d356 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -254,7 +254,9 @@
 	 * Releasing the lock can cause an inconsistent state
 	 */
 	spin_lock(&the_lnet.ln_msg_resend_lock);
+	spin_lock(&lp->lp_lock);
 	list_splice(&lp->lp_dc_pendq, &the_lnet.ln_msg_resend);
+	spin_unlock(&lp->lp_lock);
 	spin_unlock(&the_lnet.ln_msg_resend_lock);
 	wake_up(&the_lnet.ln_dc_waitq);
 
@@ -1778,7 +1780,9 @@ static void lnet_peer_discovery_complete(struct lnet_peer *lp)
 	       libcfs_nid2str(lp->lp_primary_nid));
 
 	list_del_init(&lp->lp_dc_list);
+	spin_lock(&lp->lp_lock);
 	list_splice_init(&lp->lp_dc_pendq, &pending_msgs);
+	spin_unlock(&lp->lp_lock);
 	wake_up_all(&lp->lp_dc_waitq);
 
 	lnet_net_unlock(LNET_LOCK_EX);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread
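The pattern this patch enforces can be sketched on its own: detach
the whole pending queue atomically under the peer lock, then walk
the private list with no lock held (the analogue of
list_splice_init() under lp_lock). Names below are illustrative, and
a singly linked LIFO stands in for the kernel list for brevity:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct msg {
	struct msg *next;
};

struct peer {
	pthread_mutex_t lock;	/* stands in for lp_lock */
	struct msg *pendq;	/* stands in for lp_dc_pendq */
};

/* Every writer must append under the lock, matching the patch. */
static void peer_queue_msg(struct peer *p, struct msg *m)
{
	pthread_mutex_lock(&p->lock);
	m->next = p->pendq;	/* LIFO here; LNet adds at the tail */
	p->pendq = m;
	pthread_mutex_unlock(&p->lock);
}

/* Take every queued message in one locked step; the caller then owns
 * the returned chain and can process it without holding the lock.
 */
static struct msg *peer_take_pendq(struct peer *p)
{
	struct msg *head;

	pthread_mutex_lock(&p->lock);
	head = p->pendq;
	p->pendq = NULL;
	pthread_mutex_unlock(&p->lock);
	return head;
}
```

The splice-under-lock step is what keeps a concurrent
peer_queue_msg() from racing with the drain and corrupting the list.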

* [lustre-devel] [PATCH 326/622] lnet: Ensure md is detached when msg is not committed
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (324 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 325/622] lnet: Protect lp_dc_pendq manipulation with lp_lock James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 327/622] lnet: verify msg is commited for send/recv James Simmons
                   ` (296 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

It's possible for lnet_is_health_check() to return "true" when the
message has not hit the network. In this situation the message is
freed without detaching the MD. As a result, requests do not receive
their unlink events and these requests are stuck forever.

A little cleanup is included here:
 - The value of lnet_is_health_check() is only used in one place, so
   we don't need to save the result of it in a variable.
 - We don't need separate logic to detach the md when the send was
   successful; we'll fall through to the finalizing code after
   incrementing the health counters.

Cray-bug-id: LUS-7239
WC-bug-id: https://jira.whamcloud.com/browse/LU-12199
Lustre-commit: b65f3a1767ae ("LU-12199 lnet: Ensure md is detached when msg is not committed")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/34885
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-msg.c | 66 +++++++++++++++----------------------------------
 1 file changed, 20 insertions(+), 46 deletions(-)

diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index ad35c3d..dbd8de4 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -784,16 +784,6 @@
 	msg->msg_md = NULL;
 }
 
-static void
-lnet_detach_md(struct lnet_msg *msg, int status)
-{
-	int cpt = lnet_cpt_of_cookie(msg->msg_md->md_lh.lh_cookie);
-
-	lnet_res_lock(cpt);
-	lnet_msg_detach_md(msg, cpt, status);
-	lnet_res_unlock(cpt);
-}
-
 static bool
 lnet_is_health_check(struct lnet_msg *msg)
 {
@@ -881,7 +871,6 @@
 	int cpt;
 	int rc;
 	int i;
-	bool hc;
 
 	LASSERT(!in_interrupt());
 
@@ -890,36 +879,7 @@
 
 	msg->msg_ev.status = status;
 
-	/* if the message is successfully sent, no need to keep the MD around */
-	if (msg->msg_md && !status)
-		lnet_detach_md(msg, status);
-
-again:
-	hc = lnet_is_health_check(msg);
-
-	/* the MD would've been detached from the message if it was
-	 * successfully sent. However, if it wasn't successfully sent the
-	 * MD would be around. And since we recalculate whether to
-	 * health check or not, it's possible that we change our minds and
-	 * we don't want to health check this message. In this case also
-	 * free the MD.
-	 *
-	 * If the message is successful we're going to
-	 * go through the lnet_health_check() function, but that'll just
-	 * increment the appropriate health value and return.
-	 */
-	if (msg->msg_md && !hc)
-		lnet_detach_md(msg, status);
-
-	rc = 0;
-	if (!msg->msg_tx_committed && !msg->msg_rx_committed) {
-		/* not committed to network yet */
-		LASSERT(!msg->msg_onactivelist);
-		kfree(msg);
-		return;
-	}
-
-	if (hc) {
+	if (lnet_is_health_check(msg)) {
 		/* Check the health status of the message. If it has one
 		 * of the errors that we're supposed to handle, and it has
 		 * not timed out, then
@@ -932,13 +892,26 @@
 		 * put on the resend queue.
 		 */
 		if (!lnet_health_check(msg))
+			/* Message is queued for resend */
 			return;
+	}
 
-		/* if we get here then we need to clean up the md because we're
-		 * finalizing the message.
-		 */
-		if (msg->msg_md)
-			lnet_detach_md(msg, status);
+	/* We're not going to resend this message so detach its MD and invoke
+	 * the appropriate callbacks
+	 */
+	if (msg->msg_md) {
+		cpt = lnet_cpt_of_cookie(msg->msg_md->md_lh.lh_cookie);
+		lnet_res_lock(cpt);
+		lnet_msg_detach_md(msg, cpt, status);
+		lnet_res_unlock(cpt);
+	}
+
+again:
+	if (!msg->msg_tx_committed && !msg->msg_rx_committed) {
+		/* not committed to network yet */
+		LASSERT(!msg->msg_onactivelist);
+		kfree(msg);
+		return;
 	}
 
 	/*
@@ -972,6 +945,7 @@
 
 	container->msc_finalizers[my_slot] = current;
 
+	rc = 0;
 	while ((msg = list_first_entry_or_null(&container->msc_finalizing,
 					       struct lnet_msg,
 					       msg_list)) != NULL) {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 327/622] lnet: verify msg is commited for send/recv
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (325 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 326/622] lnet: Ensure md is detached when msg is not committed James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 328/622] lnet: select LO interface for sending James Simmons
                   ` (295 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Before performing a health check, make sure the message is committed
for either send or receive. Otherwise we can just finalize it.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12199
Lustre-commit: fc6b321036f3 ("LU-12199 lnet: verify msg is commited for send/recv")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34797
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-msg.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index dbd8de4..e4253de 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -790,6 +790,20 @@
 	bool hc;
 	int status = msg->msg_ev.status;
 
+	if ((!msg->msg_tx_committed && !msg->msg_rx_committed) ||
+	    !msg->msg_onactivelist) {
+		CDEBUG(D_NET, "msg %p not committed for send or receive\n",
+		       msg);
+		return false;
+	}
+
+	if ((msg->msg_tx_committed && !msg->msg_txpeer) ||
+	    (msg->msg_rx_committed && !msg->msg_rxpeer)) {
+		CDEBUG(D_NET, "msg %p failed too early to retry and send\n",
+		       msg);
+		return false;
+	}
+
 	/* perform a health check for any message committed for transmit */
 	hc = msg->msg_tx_committed;
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 328/622] lnet: select LO interface for sending
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (326 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 327/622] lnet: verify msg is commited for send/recv James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 329/622] lnet: remove route add restriction James Simmons
                   ` (294 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

In the following scenario

Lustre->LNetPrimaryNID with 0@lo
Discovery is initiated on 0@lo
The peer is created with 0@lo and <addr>@<net>
The interface health of the peer's <addr>@<net> is decremented
LNetPut() to self
selection algorithm selects 0@lo to send to

This exposes an issue where we try to go through the peer credit
management algorithm, but because there are no credits associated
with 0@lo we end up indefinitely queuing the message. ptlrpc will
then get stuck waiting for send completion on the message.

This was exposed via conf-sanity 32a

WC-bug-id: https://jira.whamcloud.com/browse/LU-12339
Lustre-commit: 69d1535ebdac ("LU-12339 lnet: select LO interface for sending")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34957
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-move.c | 53 ++++++++++++++++++++++++++++++++++--------------
 1 file changed, 38 insertions(+), 15 deletions(-)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index de5951a..75049ec 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -751,6 +751,8 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	LASSERT(!do_send || msg->msg_tx_delayed);
 	LASSERT(!msg->msg_receiving);
 	LASSERT(msg->msg_tx_committed);
+	/* can't get here if we're sending to the loopback interface */
+	LASSERT(lp->lpni_nid != the_lnet.ln_loni->ni_nid);
 
 	/* NB 'lp' is always the next hop */
 	if (!(msg->msg_target.pid & LNET_PID_USERFLAG) &&
@@ -1426,6 +1428,25 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 #define SRC_ANY_ROUTER_NMR_DST	(SRC_ANY | REMOTE_DST | NMR_DST)
 
 static int
+lnet_handle_lo_send(struct lnet_send_data *sd)
+{
+	struct lnet_msg *msg = sd->sd_msg;
+	int cpt = sd->sd_cpt;
+
+	/* No send credit hassles with LOLND */
+	lnet_ni_addref_locked(the_lnet.ln_loni, cpt);
+	msg->msg_hdr.dest_nid = cpu_to_le64(the_lnet.ln_loni->ni_nid);
+	if (!msg->msg_routing)
+		msg->msg_hdr.src_nid =
+			cpu_to_le64(the_lnet.ln_loni->ni_nid);
+	msg->msg_target.nid = the_lnet.ln_loni->ni_nid;
+	lnet_msg_commit(msg, cpt);
+	msg->msg_txni = the_lnet.ln_loni;
+
+	return LNET_CREDIT_OK;
+}
+
+static int
 lnet_handle_send(struct lnet_send_data *sd)
 {
 	struct lnet_ni *best_ni = sd->sd_best_ni;
@@ -1733,7 +1754,10 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 					     sd->sd_best_ni->ni_net->net_id);
 	}
 
-	if (sd->sd_best_lpni)
+	if (sd->sd_best_lpni &&
+	    sd->sd_best_lpni->lpni_nid == the_lnet.ln_loni->ni_nid)
+		return lnet_handle_lo_send(sd);
+	else if (sd->sd_best_lpni)
 		return lnet_handle_send(sd);
 
 	CERROR("can't send to %s. no NI on %s\n",
@@ -2074,7 +2098,15 @@ struct lnet_ni *
 		 * try and see if we can reach it over another routed
 		 * network
 		 */
-		if (sd->sd_best_lpni) {
+		if (sd->sd_best_lpni &&
+		    sd->sd_best_lpni->lpni_nid == the_lnet.ln_loni->ni_nid) {
+			/* in case we initially started with a routed
+			 * destination, let's reset to local
+			 */
+			sd->sd_send_case &= ~REMOTE_DST;
+			sd->sd_send_case |= LOCAL_DST;
+			return lnet_handle_lo_send(sd);
+		} else if (sd->sd_best_lpni) {
 			/* in case we initially started with a routed
 			 * destination, let's reset to local
 			 */
@@ -2284,19 +2316,12 @@ struct lnet_ni *
 	 * is no need to go through any selection. We can just shortcut
 	 * the entire process and send over lolnd
 	 */
+	send_data.sd_msg = msg;
+	send_data.sd_cpt = cpt;
 	if (LNET_NETTYP(LNET_NIDNET(dst_nid)) == LOLND) {
-		/* No send credit hassles with LOLND */
-		lnet_ni_addref_locked(the_lnet.ln_loni, cpt);
-		msg->msg_hdr.dest_nid = cpu_to_le64(the_lnet.ln_loni->ni_nid);
-		if (!msg->msg_routing)
-			msg->msg_hdr.src_nid =
-				cpu_to_le64(the_lnet.ln_loni->ni_nid);
-		msg->msg_target.nid = the_lnet.ln_loni->ni_nid;
-		lnet_msg_commit(msg, cpt);
-		msg->msg_txni = the_lnet.ln_loni;
+		rc = lnet_handle_lo_send(&send_data);
 		lnet_net_unlock(cpt);
-
-		return LNET_CREDIT_OK;
+		return rc;
 	}
 
 	/* find an existing peer_ni, or create one and mark it as having been
@@ -2376,7 +2401,6 @@ struct lnet_ni *
 		send_case |= SND_RESP;
 
 	/* assign parameters to the send_data */
-	send_data.sd_msg = msg;
 	send_data.sd_rtr_nid = rtr_nid;
 	send_data.sd_src_nid = src_nid;
 	send_data.sd_dst_nid = dst_nid;
@@ -2387,7 +2411,6 @@ struct lnet_ni *
 	send_data.sd_final_dst_lpni = lpni;
 	send_data.sd_peer = peer;
 	send_data.sd_md_cpt = md_cpt;
-	send_data.sd_cpt = cpt;
 	send_data.sd_send_case = send_case;
 
 	rc = lnet_handle_send_case_locked(&send_data);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 329/622] lnet: remove route add restriction
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (327 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 328/622] lnet: select LO interface for sending James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 330/622] lnet: Discover routers on first use James Simmons
                   ` (293 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Remove restriction with adding routes to the same remote network
via two different gateways.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10153
Lustre-commit: 79ea6af86f57 ("LU-10153 lnet: remove route add restriction")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33447
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h |  1 -
 net/lnet/lnet/api-ni.c        | 10 ---------
 net/lnet/lnet/router.c        | 49 -------------------------------------------
 3 files changed, 60 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index 10922ae..534be2a 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -521,7 +521,6 @@ void lnet_notify_locked(struct lnet_peer_ni *lp, int notifylnd, int alive,
 			time64_t when);
 int lnet_add_route(u32 net, u32 hops, lnet_nid_t gateway_nid,
 		   unsigned int priority);
-int lnet_check_routes(void);
 int lnet_del_route(u32 net, lnet_nid_t gw_nid);
 void lnet_destroy_routes(void);
 int lnet_get_route(int idx, u32 *net, u32 *hops,
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index aeb9d92..d27e9a4 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -2491,10 +2491,6 @@ void lnet_lib_exit(void)
 		if (rc)
 			goto err_shutdown_lndnis;
 
-		rc = lnet_check_routes();
-		if (rc)
-			goto err_destroy_routes;
-
 		rc = lnet_rtrpools_alloc(im_a_router);
 		if (rc)
 			goto err_destroy_routes;
@@ -3449,12 +3445,6 @@ u32 lnet_get_dlc_seq_locked(void)
 				    config->cfg_config_u.cfg_route.rtr_hop,
 				    config->cfg_nid,
 				    config->cfg_config_u.cfg_route.rtr_priority);
-		if (!rc) {
-			rc = lnet_check_routes();
-			if (rc)
-				lnet_del_route(config->cfg_net,
-					       config->cfg_nid);
-		}
 		mutex_unlock(&the_lnet.ln_api_mutex);
 		return rc;
 
diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index 78a8659..c00b9251 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -427,55 +427,6 @@ static void lnet_shuffle_seed(void)
 }
 
 int
-lnet_check_routes(void)
-{
-	struct lnet_remotenet *rnet;
-	struct lnet_route *route;
-	struct lnet_route *route2;
-	int cpt;
-	struct list_head *rn_list;
-	int i;
-
-	cpt = lnet_net_lock_current();
-
-	for (i = 0; i < LNET_REMOTE_NETS_HASH_SIZE; i++) {
-		rn_list = &the_lnet.ln_remote_nets_hash[i];
-		list_for_each_entry(rnet, rn_list, lrn_list) {
-			route2 = NULL;
-			list_for_each_entry(route, &rnet->lrn_routes, lr_list) {
-				lnet_nid_t nid1;
-				lnet_nid_t nid2;
-				int net;
-
-				if (!route2) {
-					route2 = route;
-					continue;
-				}
-
-				if (route->lr_gateway->lpni_net ==
-				    route2->lr_gateway->lpni_net)
-					continue;
-
-				nid1 = route->lr_gateway->lpni_nid;
-				nid2 = route2->lr_gateway->lpni_nid;
-				net = rnet->lrn_net;
-
-				lnet_net_unlock(cpt);
-
-				CERROR("Routes to %s via %s and %s not supported\n",
-				       libcfs_net2str(net),
-				       libcfs_nid2str(nid1),
-				       libcfs_nid2str(nid2));
-				return -EINVAL;
-			}
-		}
-	}
-
-	lnet_net_unlock(cpt);
-	return 0;
-}
-
-int
 lnet_del_route(u32 net, lnet_nid_t gw_nid)
 {
 	struct lnet_peer_ni *gateway;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 330/622] lnet: Discover routers on first use
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (328 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 329/622] lnet: remove route add restriction James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 331/622] lnet: use peer for gateway James Simmons
                   ` (292 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Discover routers on first use. This brings the behavior when
interacting with routers in line with how normal peers are
handled.
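
The control flow this patch factors into lnet_initiate_peer_discovery() can
be sketched as below. This is a minimal toy model with hypothetical names:
if the gateway peer is not yet up to date, the message is parked on the
peer's pending-discovery queue and the caller gets a "wait" return code
rather than an error, mirroring the LNET_DC_WAIT path in the diff.

```c
#include <assert.h>

#define TOY_DC_WAIT 1	/* stand-in for LNET_DC_WAIT */

struct toy_peer {
	int uptodate;	/* has discovery completed for this peer? */
	int pending;	/* messages parked awaiting discovery */
};

static int toy_send_via_gateway(struct toy_peer *gw)
{
	if (!gw->uptodate) {
		gw->pending++;		/* queue the message... */
		return TOY_DC_WAIT;	/* ...and tell the caller to wait */
	}
	return 0;	/* gateway already discovered: send immediately */
}
```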

WC-bug-id: https://jira.whamcloud.com/browse/LU-11292
Lustre-commit: c7f8215d74a2 ("LU-11292 lnet: Discover routers on first use")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33182
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-move.c | 101 +++++++++++++++++++++++++++++++----------------
 1 file changed, 67 insertions(+), 34 deletions(-)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 75049ec..e080580 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -1224,7 +1224,8 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 
 static struct lnet_peer_ni *
 lnet_find_route_locked(struct lnet_net *net, u32 remote_net,
-		       lnet_nid_t rtr_nid)
+		       lnet_nid_t rtr_nid, struct lnet_route **use_route,
+		       struct lnet_route **prev_route)
 {
 	struct lnet_remotenet *rnet;
 	struct lnet_route *route;
@@ -1276,13 +1277,10 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 		lpni_best = lp;
 	}
 
-	/*
-	 * set sequence number on the best router to the latest sequence + 1
-	 * so we can round-robin all routers, it's race and inaccurate but
-	 * harmless and functional
-	 */
-	if (best_route)
-		best_route->lr_seq = last_route->lr_seq + 1;
+	if (best_route) {
+		*use_route = best_route;
+		*prev_route = last_route;
+	}
 	return lpni_best;
 }
 
@@ -1798,16 +1796,52 @@ struct lnet_ni *
 }
 
 static int
+lnet_initiate_peer_discovery(struct lnet_peer_ni *lpni,
+			     struct lnet_msg *msg, lnet_nid_t rtr_nid,
+			     int cpt)
+{
+	struct lnet_peer *peer;
+	lnet_nid_t primary_nid;
+	int rc;
+
+	lnet_peer_ni_addref_locked(lpni);
+
+	rc = lnet_discover_peer_locked(lpni, cpt, false);
+	if (rc) {
+		lnet_peer_ni_decref_locked(lpni);
+		return rc;
+	}
+	/* The peer may have changed. */
+	peer = lpni->lpni_peer_net->lpn_peer;
+	/* queue message and return */
+	msg->msg_rtr_nid_param = rtr_nid;
+	msg->msg_sending = 0;
+	msg->msg_txpeer = NULL;
+	spin_lock(&peer->lp_lock);
+	list_add_tail(&msg->msg_list, &peer->lp_dc_pendq);
+	spin_unlock(&peer->lp_lock);
+	lnet_peer_ni_decref_locked(lpni);
+	primary_nid = peer->lp_primary_nid;
+
+	CDEBUG(D_NET, "msg %p delayed. %s pending discovery\n",
+	       msg, libcfs_nid2str(primary_nid));
+
+	return LNET_DC_WAIT;
+}
+
+static int
 lnet_handle_find_routed_path(struct lnet_send_data *sd,
 			     lnet_nid_t dst_nid,
 			     struct lnet_peer_ni **gw_lpni,
 			     struct lnet_peer **gw_peer)
 {
+	struct lnet_route *best_route = NULL;
+	struct lnet_route *last_route = NULL;
 	struct lnet_peer_ni *gw;
 	lnet_nid_t src_nid = sd->sd_src_nid;
 
 	gw = lnet_find_route_locked(NULL, LNET_NIDNET(dst_nid),
-				    sd->sd_rtr_nid);
+				    sd->sd_rtr_nid, &best_route, &last_route);
 	if (!gw) {
 		CERROR("no route to %s from %s\n",
 		       libcfs_nid2str(dst_nid), libcfs_nid2str(src_nid));
@@ -1820,6 +1854,17 @@ struct lnet_ni *
 
 	*gw_peer = gw->lpni_peer_net->lpn_peer;
 
+	/* Discover this gateway if it hasn't already been discovered.
+	 * This means we might delay the message until discovery has
+	 * completed
+	 */
+	if (lnet_msg_discovery(sd->sd_msg) &&
+	    !lnet_peer_is_uptodate(*gw_peer)) {
+		sd->sd_msg->msg_src_nid_param = sd->sd_src_nid;
+		return lnet_initiate_peer_discovery(gw, sd->sd_msg,
+						    sd->sd_rtr_nid, sd->sd_cpt);
+	}
+
 	if (!sd->sd_best_ni)
 		sd->sd_best_ni =
 			lnet_find_best_ni_on_spec_net(NULL, *gw_peer,
@@ -1853,6 +1898,12 @@ struct lnet_ni *
 
 	*gw_lpni = gw;
 
+	/* increment the route sequence number since now we're sure we're
+	 * going to use it
+	 */
+	LASSERT(best_route && last_route);
+	best_route->lr_seq = last_route->lr_seq + 1;
+
 	return 0;
 }
 
@@ -1889,7 +1940,7 @@ struct lnet_ni *
 
 	rc = lnet_handle_find_routed_path(sd, sd->sd_dst_nid, &gw_lpni,
 					  &gw_peer);
-	if (rc < 0)
+	if (rc)
 		return rc;
 
 	if (sd->sd_send_case & NMR_DST)
@@ -2165,6 +2216,8 @@ struct lnet_ni *
 			CERROR("Can't send response to %s. No route available\n",
 			       libcfs_nid2str(sd->sd_dst_nid));
 			return -EHOSTUNREACH;
+		} else if (rc > 0) {
+			return rc;
 		}
 
 		sd->sd_best_lpni = gw;
@@ -2192,7 +2245,7 @@ struct lnet_ni *
 	 */
 	rc = lnet_handle_find_routed_path(sd, sd->sd_dst_nid, &gw_lpni,
 					  &gw_peer);
-	if (rc < 0)
+	if (rc)
 		return rc;
 
 	sd->sd_send_case &= ~LOCAL_DST;
@@ -2228,7 +2281,7 @@ struct lnet_ni *
 	 */
 	rc = lnet_handle_find_routed_path(sd, sd->sd_dst_nid, &gw_lpni,
 					  &gw_peer);
-	if (rc < 0)
+	if (rc)
 		return rc;
 
 	/* set the best_ni we've chosen as the preferred one for
@@ -2348,30 +2401,10 @@ struct lnet_ni *
 	 */
 	peer = lpni->lpni_peer_net->lpn_peer;
 	if (lnet_msg_discovery(msg) && !lnet_peer_is_uptodate(peer)) {
-		lnet_nid_t primary_nid;
-
-		rc = lnet_discover_peer_locked(lpni, cpt, false);
-		if (rc) {
-			lnet_peer_ni_decref_locked(lpni);
-			lnet_net_unlock(cpt);
-			return rc;
-		}
-		/* The peer may have changed. */
-		peer = lpni->lpni_peer_net->lpn_peer;
-		/* queue message and return */
-		msg->msg_rtr_nid_param = rtr_nid;
-		msg->msg_sending = 0;
-		spin_lock(&peer->lp_lock);
-		list_add_tail(&msg->msg_list, &peer->lp_dc_pendq);
-		spin_unlock(&peer->lp_lock);
+		rc = lnet_initiate_peer_discovery(lpni, msg, rtr_nid, cpt);
 		lnet_peer_ni_decref_locked(lpni);
-		primary_nid = peer->lp_primary_nid;
 		lnet_net_unlock(cpt);
-
-		CDEBUG(D_NET, "%s pending discovery\n",
-		       libcfs_nid2str(primary_nid));
-
-		return LNET_DC_WAIT;
+		return rc;
 	}
 	lnet_peer_ni_decref_locked(lpni);
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 331/622] lnet: use peer for gateway
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (329 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 330/622] lnet: Discover routers on first use James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 332/622] lnet: lnet_add/del_route() James Simmons
                   ` (291 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

The routing code uses peer_ni for a gateway. However, with Multi-Rail
a gateway could have multiple interfaces on several different
networks. Instead of using a single peer_ni as the gateway we should
be using the peer and let the MR selection code select the best
peer_ni to send to.

This patch moves the gateway from peer_ni to peer. Much of the
code needs to be rewritten in the following patches to account
for that change. This patch disables the routing features by
disabling the code to add/delete routes.

The asymmetric routing detection feature is also modified to
use the MR routing code.
WC-bug-id: https://jira.whamcloud.com/browse/LU-11298
Lustre-commit: 53f7b8b7a228 ("LU-11298 lnet: use peer for gateway")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33183
Reviewed-by: Chris Horn <hornc@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h  |  19 +-
 include/linux/lnet/lib-types.h |  46 +--
 net/lnet/lnet/lib-move.c       | 215 +++++++-----
 net/lnet/lnet/peer.c           |  17 +-
 net/lnet/lnet/router.c         | 720 ++---------------------------------------
 net/lnet/lnet/router_proc.c    |  31 +-
 6 files changed, 230 insertions(+), 818 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index 534be2a..80f6f8c 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -92,15 +92,12 @@
 
 static inline int lnet_is_route_alive(struct lnet_route *route)
 {
-	/* gateway is down */
-	if (!route->lr_gateway->lpni_alive)
-		return 0;
-	/* no NI status, assume it's alive */
-	if ((route->lr_gateway->lpni_ping_feats &
-	     LNET_PING_FEAT_NI_STATUS) == 0)
-		return 1;
-	/* has NI status, check # down NIs */
-	return route->lr_downis == 0;
+	/* TODO re-implement gateway alive indication */
+	CDEBUG(D_NET, "TODO: reimplement routing. gateway = %s\n",
+	       route->lr_gateway ?
+		libcfs_nid2str(route->lr_gateway->lp_primary_nid) :
+		"undefined");
+	return 1;
 }
 
 static inline int lnet_is_wire_handle_none(struct lnet_handle_wire *wh)
@@ -402,9 +399,9 @@ void lnet_res_lh_initialize(struct lnet_res_container *rec,
 }
 
 static inline int
-lnet_isrouter(struct lnet_peer_ni *lp)
+lnet_isrouter(struct lnet_peer_ni *lpni)
 {
-	return lp->lpni_rtr_refcount ? 1 : 0;
+	return lpni->lpni_peer_net->lpn_peer->lp_rtr_refcount ? 1 : 0;
 }
 
 static inline void
diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index b1a6f6a..31fe22a 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -534,20 +534,21 @@ struct lnet_peer_ni {
 	struct list_head	 lpni_hashlist;
 	/* messages blocking for tx credits */
 	struct list_head	 lpni_txq;
-	/* messages blocking for router credits */
-	struct list_head	 lpni_rtrq;
-	/* chain on router list */
-	struct list_head	 lpni_rtr_list;
+	/* pointer to peer net I'm part of */
+	struct lnet_peer_net	*lpni_peer_net;
 	/* statistics kept on each peer NI */
 	struct lnet_element_stats lpni_stats;
 	struct lnet_health_remote_stats lpni_hstats;
-	/* spin lock protecting credits and lpni_txq / lpni_rtrq */
+	/* spin lock protecting credits and lpni_txq */
 	spinlock_t		 lpni_lock;
 	/* # tx credits available */
 	int			 lpni_txcredits;
-	struct lnet_peer_net	*lpni_peer_net;
 	/* low water mark */
 	int			 lpni_mintxcredits;
+	/*
+	 * Each peer_ni in a gateway maintains its own credits. This
+	 * allows more traffic to gateways that have multiple interfaces.
+	 */
 	/* # router credits */
 	int			 lpni_rtrcredits;
 	/* low water mark */
@@ -560,18 +561,12 @@ struct lnet_peer_ni {
 	bool			 lpni_notifylnd;
 	/* some thread is handling notification */
 	bool			 lpni_notifying;
-	/* SEND event outstanding from ping */
-	unsigned int		 lpni_ping_notsent;
 	/* # times router went dead<->alive */
 	int			 lpni_alive_count;
 	/* ytes queued for sending */
 	long			 lpni_txqnob;
 	/* time of last aliveness news */
 	time64_t		 lpni_timestamp;
-	/* time of last ping attempt */
-	time64_t		 lpni_ping_timestamp;
-	/* != 0 if ping reply expected */
-	time64_t		 lpni_ping_deadline;
 	/* when I was last alive */
 	time64_t		 lpni_last_alive;
 	/* when lpni_ni was queried last time */
@@ -590,18 +585,12 @@ struct lnet_peer_ni {
 	int			 lpni_cpt;
 	/* state flags -- protected by lpni_lock */
 	unsigned int		 lpni_state;
-	/* # refs from lnet_route::lr_gateway */
-	int			 lpni_rtr_refcount;
 	/* sequence number used to round robin over peer nis within a net */
 	u32			 lpni_seq;
 	/* sequence number used to round robin over gateways */
 	u32			 lpni_gw_seq;
-	/* health flag */
-	bool			 lpni_healthy;
 	/* returned RC ping features. Protected with lpni_lock */
 	unsigned int		 lpni_ping_feats;
-	/* routers on this peer */
-	struct list_head	 lpni_routes;
 	/* preferred local nids: if only one, use lpni_pref.nid */
 	union lpni_pref {
 		lnet_nid_t	 nid;
@@ -632,6 +621,9 @@ struct lnet_peer {
 	/* list of messages pending discovery*/
 	struct list_head	lp_dc_pendq;
 
+	/* chain on router list */
+	struct list_head	lp_rtr_list;
+
 	/* primary NID of the peer */
 	lnet_nid_t		lp_primary_nid;
 
@@ -641,10 +633,22 @@ struct lnet_peer {
 	/* number of NIDs on this peer */
 	int			lp_nnis;
 
+	/* # refs from lnet_route_t::lr_gateway */
+	int			lp_rtr_refcount;
+
+	/* messages blocking for router credits */
+	struct list_head	lp_rtrq;
+
+	/* routes on this peer */
+	struct list_head	lp_routes;
+
+	/* time of last router check attempt */
+	time64_t		lp_rtrcheck_timestamp;
+
 	/* reference count */
 	atomic_t		lp_refcount;
 
-	/* lock protecting peer state flags */
+	/* lock protecting peer state flags and lpni_rtrq */
 	spinlock_t		lp_lock;
 
 	/* peer state flags */
@@ -808,9 +812,11 @@ struct lnet_route {
 	/* chain on gateway */
 	struct list_head	lr_gwlist;
 	/* router node */
-	struct lnet_peer_ni    *lr_gateway;
+	struct lnet_peer       *lr_gateway;
 	/* remote network number */
 	u32			lr_net;
+	/* local network number */
+	u32			lr_lnet;
 	/* sequence for round-robin */
 	int			lr_seq;
 	/* number of down NIs */
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index e080580..99ff882 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -877,7 +877,8 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	 * I return LNET_CREDIT_WAIT if msg blocked and LNET_CREDIT_OK if
 	 * received or OK to receive
 	 */
-	struct lnet_peer_ni *lp = msg->msg_rxpeer;
+	struct lnet_peer_ni *lpni = msg->msg_rxpeer;
+	struct lnet_peer *lp;
 	struct lnet_rtrbufpool *rbp;
 	struct lnet_rtrbuf *rb;
 
@@ -887,29 +888,36 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	LASSERT(msg->msg_routing);
 	LASSERT(msg->msg_receiving);
 	LASSERT(!msg->msg_sending);
+	LASSERT(lpni->lpni_peer_net);
+	LASSERT(lpni->lpni_peer_net->lpn_peer);
+
+	lp = lpni->lpni_peer_net->lpn_peer;
 
 	/* non-lnet_parse callers only receive delayed messages */
 	LASSERT(!do_recv || msg->msg_rx_delayed);
 
 	if (!msg->msg_peerrtrcredit) {
-		spin_lock(&lp->lpni_lock);
-		LASSERT((lp->lpni_rtrcredits < 0) ==
-			!list_empty(&lp->lpni_rtrq));
+		/* lpni_lock protects the credit manipulation */
+		spin_lock(&lpni->lpni_lock);
+		/* lp_lock protects the lp_rtrq */
+		spin_lock(&lp->lp_lock);
 
 		msg->msg_peerrtrcredit = 1;
-		lp->lpni_rtrcredits--;
-		if (lp->lpni_rtrcredits < lp->lpni_minrtrcredits)
-			lp->lpni_minrtrcredits = lp->lpni_rtrcredits;
+		lpni->lpni_rtrcredits--;
+		if (lpni->lpni_rtrcredits < lpni->lpni_minrtrcredits)
+			lpni->lpni_minrtrcredits = lpni->lpni_rtrcredits;
 
-		if (lp->lpni_rtrcredits < 0) {
+		if (lpni->lpni_rtrcredits < 0) {
 			/* must have checked eager_recv before here */
 			LASSERT(msg->msg_rx_ready_delay);
 			msg->msg_rx_delayed = 1;
-			list_add_tail(&msg->msg_list, &lp->lpni_rtrq);
-			spin_unlock(&lp->lpni_lock);
+			list_add_tail(&msg->msg_list, &lp->lp_rtrq);
+			spin_unlock(&lp->lp_lock);
+			spin_unlock(&lpni->lpni_lock);
 			return LNET_CREDIT_WAIT;
 		}
-		spin_unlock(&lp->lpni_lock);
+		spin_unlock(&lp->lp_lock);
+		spin_unlock(&lpni->lpni_lock);
 	}
 
 	rbp = lnet_msg2bufpool(msg);
@@ -1080,7 +1088,8 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 void
 lnet_return_rx_credits_locked(struct lnet_msg *msg)
 {
-	struct lnet_peer_ni *rxpeer = msg->msg_rxpeer;
+	struct lnet_peer_ni *rxpeerni = msg->msg_rxpeer;
+	struct lnet_peer *lp;
 	struct lnet_ni *rxni = msg->msg_rxni;
 	struct lnet_msg *msg2;
 
@@ -1135,44 +1144,69 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 
 routing_off:
 	if (msg->msg_peerrtrcredit) {
+		LASSERT(rxpeerni);
+		LASSERT(rxpeerni->lpni_peer_net);
+		LASSERT(rxpeerni->lpni_peer_net->lpn_peer);
+
+		lp = rxpeerni->lpni_peer_net->lpn_peer;
+
 		/* give back peer router credits */
 		msg->msg_peerrtrcredit = 0;
 
-		spin_lock(&rxpeer->lpni_lock);
-		LASSERT((rxpeer->lpni_rtrcredits < 0) ==
-			!list_empty(&rxpeer->lpni_rtrq));
+		spin_lock(&rxpeerni->lpni_lock);
+		spin_lock(&lp->lp_lock);
 
-		rxpeer->lpni_rtrcredits++;
-		/*
-		 * drop all messages which are queued to be routed on that
+		rxpeerni->lpni_rtrcredits++;
+
+		/* drop all messages which are queued to be routed on that
 		 * peer.
 		 */
 		if (!the_lnet.ln_routing) {
 			LIST_HEAD(drop);
 
-			list_splice_init(&rxpeer->lpni_rtrq, &drop);
-			spin_unlock(&rxpeer->lpni_lock);
+			list_splice_init(&lp->lp_rtrq, &drop);
+			spin_unlock(&lp->lp_lock);
+			spin_unlock(&rxpeerni->lpni_lock);
 			lnet_drop_routed_msgs_locked(&drop, msg->msg_rx_cpt);
-		} else if (rxpeer->lpni_rtrcredits <= 0) {
-			msg2 = list_first_entry(&rxpeer->lpni_rtrq,
+		} else if (!list_empty(&lp->lp_rtrq)) {
+			int msg2_cpt;
+
+			msg2 = list_first_entry(&lp->lp_rtrq,
 						struct lnet_msg, msg_list);
 			list_del(&msg2->msg_list);
-			spin_unlock(&rxpeer->lpni_lock);
+			msg2_cpt = msg2->msg_rx_cpt;
+			spin_unlock(&lp->lp_lock);
+			spin_unlock(&rxpeerni->lpni_lock);
+			/* messages on the lp_rtrq can be from any NID in
+			 * the peer, which means they might have different
+			 * cpts. We need to make sure we lock the right
+			 * one.
+			 */
+			if (msg2_cpt != msg->msg_rx_cpt) {
+				lnet_net_unlock(msg->msg_rx_cpt);
+				lnet_net_lock(msg2_cpt);
+			}
 			(void)lnet_post_routed_recv_locked(msg2, 1);
+			if (msg2_cpt != msg->msg_rx_cpt) {
+				lnet_net_unlock(msg2_cpt);
+				lnet_net_lock(msg->msg_rx_cpt);
+			}
 		} else {
-			spin_unlock(&rxpeer->lpni_lock);
+			spin_unlock(&lp->lp_lock);
+			spin_unlock(&rxpeerni->lpni_lock);
 		}
 	}
 	if (rxni) {
 		msg->msg_rxni = NULL;
 		lnet_ni_decref_locked(rxni, msg->msg_rx_cpt);
 	}
-	if (rxpeer) {
+	if (rxpeerni) {
 		msg->msg_rxpeer = NULL;
-		lnet_peer_ni_decref_locked(rxpeer);
+		lnet_peer_ni_decref_locked(rxpeerni);
 	}
 }
 
+#if 0
 static int
 lnet_compare_peers(struct lnet_peer_ni *p1, struct lnet_peer_ni *p2)
 {
@@ -1190,15 +1224,18 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 
 	return 0;
 }
+#endif
 
 static int
 lnet_compare_routes(struct lnet_route *r1, struct lnet_route *r2)
 {
+	/* TODO re-implement gateway comparison
 	struct lnet_peer_ni *p1 = r1->lr_gateway;
 	struct lnet_peer_ni *p2 = r2->lr_gateway;
+	*/
 	int r1_hops = (r1->lr_hops == LNET_UNDEFINED_HOPS) ? 1 : r1->lr_hops;
 	int r2_hops = (r2->lr_hops == LNET_UNDEFINED_HOPS) ? 1 : r2->lr_hops;
-	int rc;
+	/*int rc;*/
 
 	if (r1->lr_priority < r2->lr_priority)
 		return 1;
@@ -1212,9 +1249,11 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	if (r1_hops > r2_hops)
 		return -1;
 
+	/*
 	rc = lnet_compare_peers(p1, p2);
 	if (rc)
 		return rc;
+	*/
 
 	if (r1->lr_seq - r2->lr_seq <= 0)
 		return 1;
@@ -1222,17 +1261,17 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	return -1;
 }
 
-static struct lnet_peer_ni *
+/* TODO: lnet_find_route_locked() needs to be reimplemented */
+static struct lnet_route *
 lnet_find_route_locked(struct lnet_net *net, u32 remote_net,
-		       lnet_nid_t rtr_nid, struct lnet_route **use_route,
-		       struct lnet_route **prev_route)
+		       lnet_nid_t rtr_nid, struct lnet_route **prev_route)
 {
 	struct lnet_remotenet *rnet;
 	struct lnet_route *route;
 	struct lnet_route *best_route;
 	struct lnet_route *last_route;
-	struct lnet_peer_ni *lpni_best;
-	struct lnet_peer_ni *lp;
+	struct lnet_peer *lp_best;
+	struct lnet_peer *lp;
 	int rc;
 
 	/*
@@ -1243,7 +1282,7 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	if (!rnet)
 		return NULL;
 
-	lpni_best = NULL;
+	lp_best = NULL;
 	best_route = NULL;
 	last_route = NULL;
 	list_for_each_entry(route, &rnet->lrn_routes, lr_list) {
@@ -1252,16 +1291,10 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 		if (!lnet_is_route_alive(route))
 			continue;
 
-		if (net && lp->lpni_net != net)
-			continue;
-
-		if (lp->lpni_nid == rtr_nid) /* it's pre-determined router */
-			return lp;
-
-		if (!lpni_best) {
+		if (!lp_best) {
 			best_route = route;
 			last_route = route;
-			lpni_best = lp;
+			lp_best = lp;
 			continue;
 		}
 
@@ -1274,14 +1307,12 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 			continue;
 
 		best_route = route;
-		lpni_best = lp;
+		lp_best = lp;
 	}
 
-	if (best_route) {
-		*use_route = best_route;
-		*prev_route = last_route;
-	}
-	return lpni_best;
+	*prev_route = last_route;
+
+	return best_route;
 }
 
 static struct lnet_ni *
@@ -1835,60 +1866,80 @@ struct lnet_ni *
 			     struct lnet_peer_ni **gw_lpni,
 			     struct lnet_peer **gw_peer)
 {
-	struct lnet_route *best_route = NULL;
-	struct lnet_route *last_route = NULL;
-	struct lnet_peer_ni *gw;
+	struct lnet_peer *gw;
+	struct lnet_route *best_route;
+	struct lnet_route *last_route;
+	struct lnet_peer_ni *lpni = NULL;
 	lnet_nid_t src_nid = sd->sd_src_nid;
 
-	gw = lnet_find_route_locked(NULL, LNET_NIDNET(dst_nid),
-				    sd->sd_rtr_nid, &best_route, &last_route);
-	if (!gw) {
+	best_route = lnet_find_route_locked(NULL, LNET_NIDNET(dst_nid),
+					    sd->sd_rtr_nid, &last_route);
+	if (!best_route) {
 		CERROR("no route to %s from %s\n",
 		       libcfs_nid2str(dst_nid), libcfs_nid2str(src_nid));
 		return -EHOSTUNREACH;
 	}
 
-	/* get the peer of the gw_ni */
-	LASSERT(gw->lpni_peer_net);
-	LASSERT(gw->lpni_peer_net->lpn_peer);
-
-	*gw_peer = gw->lpni_peer_net->lpn_peer;
+	gw = best_route->lr_gateway;
+	*gw_peer = gw;
 
 	/* Discover this gateway if it hasn't already been discovered.
 	 * This means we might delay the message until discovery has
 	 * completed
 	 */
+#if 0
+	/* TODO: disable discovey for now */
 	if (lnet_msg_discovery(sd->sd_msg) &&
 	    !lnet_peer_is_uptodate(*gw_peer)) {
 		sd->sd_msg->msg_src_nid_param = sd->sd_src_nid;
 		return lnet_initiate_peer_discovery(gw, sd->sd_msg,
 						    sd->sd_rtr_nid, sd->sd_cpt);
 	}
+#endif
 
-	if (!sd->sd_best_ni)
-		sd->sd_best_ni =
-			lnet_find_best_ni_on_spec_net(NULL, *gw_peer,
-						      gw->lpni_peer_net,
-						      sd->sd_md_cpt,
-						      true);
+	if (!sd->sd_best_ni) {
+		struct lnet_peer_net *lpeer;
 
+		lpeer = lnet_peer_get_net_locked(gw, best_route->lr_lnet);
+		sd->sd_best_ni = lnet_find_best_ni_on_spec_net(NULL, gw, lpeer,
+							       sd->sd_md_cpt,
+							       true);
+	}
 	if (!sd->sd_best_ni) {
 		CERROR("Internal Error. Expected local ni on %s but non found :%s\n",
-		       libcfs_net2str(gw->lpni_peer_net->lpn_net_id),
+		       libcfs_net2str(best_route->lr_lnet),
 		       libcfs_nid2str(sd->sd_src_nid));
 		return -EFAULT;
 	}
 
 	/* if gw is MR let's find its best peer_ni
 	 */
-	if (lnet_peer_is_multi_rail(*gw_peer)) {
-		gw = lnet_find_best_lpni_on_net(sd, *gw_peer,
-						sd->sd_best_ni->ni_net->net_id);
+	if (lnet_peer_is_multi_rail(gw)) {
+		lpni = lnet_find_best_lpni_on_net(sd, gw,
+						  sd->sd_best_ni->ni_net->net_id);
 		/* We've already verified that the gw has an NI on that
 		 * desired net, but we're not finding it. Something is
 		 * wrong.
 		 */
-		if (!gw) {
+		if (!lpni) {
+			CERROR("Internal Error. Route expected to %s from %s\n",
+			       libcfs_nid2str(dst_nid),
+			       libcfs_nid2str(src_nid));
+			return -EFAULT;
+		}
+	} else {
+		struct lnet_peer_net *lpn;
+
+		lpn = lnet_peer_get_net_locked(gw, best_route->lr_lnet);
+		if (!lpn) {
+			CERROR("Internal Error. Route expected to %s from %s\n",
+			       libcfs_nid2str(dst_nid),
+			       libcfs_nid2str(src_nid));
+			return -EFAULT;
+		}
+		lpni = list_entry(lpn->lpn_peer_nis.next, struct lnet_peer_ni,
+				  lpni_peer_nis);
+		if (!lpni) {
 			CERROR("Internal Error. Route expected to %s from %s\n",
 			       libcfs_nid2str(dst_nid),
 			       libcfs_nid2str(src_nid));
@@ -1896,7 +1947,7 @@ struct lnet_ni *
 		}
 	}
 
-	*gw_lpni = gw;
+	*gw_lpni = lpni;
 
 	/* increment the route sequence number since now we're sure we're
 	 * going to use it
@@ -4046,17 +4097,23 @@ void lnet_monitor_thr_stop(void)
 
 		rnet = lnet_find_rnet_locked(LNET_NIDNET(src_nid));
 		if (rnet) {
-			struct lnet_peer_ni *gw = NULL;
+			struct lnet_peer *gw = NULL;
+			struct lnet_peer_ni *lpni = NULL;
 			struct lnet_route *route;
 
 			list_for_each_entry(route, &rnet->lrn_routes, lr_list) {
 				found = false;
 				gw = route->lr_gateway;
-				if (gw->lpni_net != net)
+				if (route->lr_lnet != net->net_id)
 					continue;
-				if (gw->lpni_nid == from_nid) {
-					found = true;
-					break;
+				/* if the nid is one of the gateway's NIDs
+				 * then this is a valid gateway
+				 */
+				while ((lpni = lnet_get_next_peer_ni_locked(gw, NULL, lpni)) != NULL) {
+					if (lpni->lpni_nid == from_nid) {
+						found = true;
+						break;
+					}
 				}
 			}
 		}
@@ -4773,9 +4830,11 @@ struct lnet_msg *
 			LASSERT(shortest);
 			hops = shortest_hops;
 			if (srcnidp) {
-				ni = lnet_get_next_ni_locked(
-					shortest->lr_gateway->lpni_net,
-					NULL);
+				struct lnet_net *net;
+
+				net = lnet_get_net_locked(shortest->lr_lnet);
+				LASSERT(net);
+				ni = lnet_get_next_ni_locked(net, NULL);
 				*srcnidp = ni->ni_nid;
 			}
 			if (orderp)
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 0d2d356..faaf94a 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -120,8 +120,6 @@
 		return NULL;
 
 	INIT_LIST_HEAD(&lpni->lpni_txq);
-	INIT_LIST_HEAD(&lpni->lpni_rtrq);
-	INIT_LIST_HEAD(&lpni->lpni_routes);
 	INIT_LIST_HEAD(&lpni->lpni_hashlist);
 	INIT_LIST_HEAD(&lpni->lpni_peer_nis);
 	INIT_LIST_HEAD(&lpni->lpni_recovery);
@@ -206,10 +204,13 @@
 	if (!lp)
 		return NULL;
 
+	INIT_LIST_HEAD(&lp->lp_rtrq);
+	INIT_LIST_HEAD(&lp->lp_routes);
 	INIT_LIST_HEAD(&lp->lp_peer_list);
 	INIT_LIST_HEAD(&lp->lp_peer_nets);
 	INIT_LIST_HEAD(&lp->lp_dc_list);
 	INIT_LIST_HEAD(&lp->lp_dc_pendq);
+	INIT_LIST_HEAD(&lp->lp_rtr_list);
 	init_waitqueue_head(&lp->lp_dc_waitq);
 	spin_lock_init(&lp->lp_lock);
 	lp->lp_primary_nid = nid;
@@ -235,6 +236,7 @@
 	CDEBUG(D_NET, "%p nid %s\n", lp, libcfs_nid2str(lp->lp_primary_nid));
 
 	LASSERT(atomic_read(&lp->lp_refcount) == 0);
+	LASSERT(lp->lp_rtr_refcount == 0);
 	LASSERT(list_empty(&lp->lp_peer_nets));
 	LASSERT(list_empty(&lp->lp_peer_list));
 	LASSERT(list_empty(&lp->lp_dc_list));
@@ -324,7 +326,7 @@
 	struct lnet_peer_table *ptable = NULL;
 
 	/* don't remove a peer_ni if it's also a gateway */
-	if (lpni->lpni_rtr_refcount > 0) {
+	if (lnet_isrouter(lpni)) {
 		CERROR("Peer NI %s is a gateway. Can not delete it\n",
 		       libcfs_nid2str(lpni->lpni_nid));
 		return -EBUSY;
@@ -570,7 +572,7 @@ void lnet_peer_uninit(void)
 {
 	struct lnet_peer_ni *lp;
 	struct lnet_peer_ni *tmp;
-	lnet_nid_t lpni_nid;
+	lnet_nid_t gw_nid;
 	int i;
 
 	for (i = 0; i < LNET_PEER_HASH_SIZE; i++) {
@@ -579,13 +581,13 @@ void lnet_peer_uninit(void)
 			if (net != lp->lpni_net)
 				continue;
 
-			if (!lp->lpni_rtr_refcount)
+			if (!lnet_isrouter(lp))
 				continue;
 
-			lpni_nid = lp->lpni_nid;
+			gw_nid = lp->lpni_peer_net->lpn_peer->lp_primary_nid;
 
 			lnet_net_unlock(LNET_LOCK_EX);
-			lnet_del_route(LNET_NIDNET(LNET_NID_ANY), lpni_nid);
+			lnet_del_route(LNET_NIDNET(LNET_NID_ANY), gw_nid);
 			lnet_net_lock(LNET_LOCK_EX);
 		}
 	}
@@ -1567,7 +1569,6 @@ struct lnet_peer_net *
 	CDEBUG(D_NET, "%p nid %s\n", lpni, libcfs_nid2str(lpni->lpni_nid));
 
 	LASSERT(atomic_read(&lpni->lpni_refcount) == 0);
-	LASSERT(lpni->lpni_rtr_refcount == 0);
 	LASSERT(list_empty(&lpni->lpni_txq));
 	LASSERT(lpni->lpni_txqnob == 0);
 	LASSERT(list_empty(&lpni->lpni_peer_nis));
diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index c00b9251..4e79c21 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -114,7 +114,6 @@
 	spin_lock(&lp->lpni_lock);
 
 	lp->lpni_timestamp = when;		/* update timestamp */
-	lp->lpni_ping_deadline = 0;		/* disable ping timeout */
 
 	if (lp->lpni_alive_count &&		/* got old news */
 	    (!lp->lpni_alive) == (!alive)) {	/* new date for old news */
@@ -191,58 +190,6 @@
 	spin_unlock(&lp->lpni_lock);
 }
 
-static void
-lnet_rtr_addref_locked(struct lnet_peer_ni *lp)
-{
-	LASSERT(atomic_read(&lp->lpni_refcount) > 0);
-	LASSERT(lp->lpni_rtr_refcount >= 0);
-
-	/* lnet_net_lock must be exclusively locked */
-	lp->lpni_rtr_refcount++;
-	if (lp->lpni_rtr_refcount == 1) {
-		struct list_head *pos;
-
-		/* a simple insertion sort */
-		list_for_each_prev(pos, &the_lnet.ln_routers) {
-			struct lnet_peer_ni *rtr;
-
-			rtr = list_entry(pos, struct lnet_peer_ni,
-					 lpni_rtr_list);
-			if (rtr->lpni_nid < lp->lpni_nid)
-				break;
-		}
-
-		list_add(&lp->lpni_rtr_list, pos);
-		/* addref for the_lnet.ln_routers */
-		lnet_peer_ni_addref_locked(lp);
-		the_lnet.ln_routers_version++;
-	}
-}
-
-static void
-lnet_rtr_decref_locked(struct lnet_peer_ni *lp)
-{
-	LASSERT(atomic_read(&lp->lpni_refcount) > 0);
-	LASSERT(lp->lpni_rtr_refcount > 0);
-
-	/* lnet_net_lock must be exclusively locked */
-	lp->lpni_rtr_refcount--;
-	if (!lp->lpni_rtr_refcount) {
-		LASSERT(list_empty(&lp->lpni_routes));
-
-		if (lp->lpni_rcd) {
-			list_add(&lp->lpni_rcd->rcd_list,
-				 &the_lnet.ln_rcd_deathrow);
-			lp->lpni_rcd = NULL;
-		}
-
-		list_del(&lp->lpni_rtr_list);
-		/* decref for the_lnet.ln_routers */
-		lnet_peer_ni_decref_locked(lp);
-		the_lnet.ln_routers_version++;
-	}
-}
-
 struct lnet_remotenet *
 lnet_find_rnet_locked(u32 net)
 {
@@ -259,239 +206,24 @@ struct lnet_remotenet *
 	return NULL;
 }
 
-static void lnet_shuffle_seed(void)
-{
-	static int seeded;
-	struct lnet_ni *ni = NULL;
-
-	if (seeded)
-		return;
-
-	/* Nodes with small feet have little entropy
-	 * the NID for this node gives the most entropy in the low bits */
-	while ((ni = lnet_get_next_ni_locked(NULL, ni))) {
-		u32 lnd_type, seed;
-
-		lnd_type = LNET_NETTYP(LNET_NIDNET(ni->ni_nid));
-		if (lnd_type != LOLND) {
-			seed = (LNET_NIDADDR(ni->ni_nid) | lnd_type);
-			add_device_randomness(&seed, sizeof(seed));
-		}
-	}
-
-	seeded = 1;
-}
-
-/* NB expects LNET_LOCK held */
-static void
-lnet_add_route_to_rnet(struct lnet_remotenet *rnet, struct lnet_route *route)
-{
-	unsigned int len = 0;
-	unsigned int offset = 0;
-	struct list_head *e;
-
-	lnet_shuffle_seed();
-
-	list_for_each(e, &rnet->lrn_routes) {
-		len++;
-	}
-
-	/* len+1 positions to add a new entry */
-	offset = prandom_u32_max(len + 1);
-	list_for_each(e, &rnet->lrn_routes) {
-		if (!offset)
-			break;
-		offset--;
-	}
-	list_add(&route->lr_list, e);
-	list_add(&route->lr_gwlist, &route->lr_gateway->lpni_routes);
-
-	the_lnet.ln_remote_nets_version++;
-	lnet_rtr_addref_locked(route->lr_gateway);
-}
-
 int
 lnet_add_route(u32 net, u32 hops, lnet_nid_t gateway,
 	       unsigned int priority)
 {
-	struct lnet_remotenet *rnet;
-	struct lnet_remotenet *rnet2;
-	struct lnet_route *route;
-	struct lnet_route *route2;
-	struct lnet_ni *ni;
-	struct lnet_peer_ni *lpni;
-	int add_route;
-	int rc;
-
-	CDEBUG(D_NET, "Add route: net %s hops %d priority %u gw %s\n",
-	       libcfs_net2str(net), hops, priority, libcfs_nid2str(gateway));
-
-	if (gateway == LNET_NID_ANY ||
-	    LNET_NETTYP(LNET_NIDNET(gateway)) == LOLND ||
-	    net == LNET_NIDNET(LNET_NID_ANY) ||
-	    LNET_NETTYP(net) == LOLND ||
-	    LNET_NIDNET(gateway) == net ||
-	    (hops != LNET_UNDEFINED_HOPS && (hops < 1 || hops > 255)))
-		return -EINVAL;
-
-	if (lnet_islocalnet(net))	/* it's a local network */
-		return -EEXIST;
-
-	/* Assume net, route, all new */
-	route = kzalloc(sizeof(*route), GFP_NOFS);
-	rnet = kzalloc(sizeof(*rnet), GFP_NOFS);
-	if (!route || !rnet) {
-		CERROR("Out of memory creating route %s %d %s\n",
-		       libcfs_net2str(net), hops, libcfs_nid2str(gateway));
-		kfree(route);
-		kfree(rnet);
-		return -ENOMEM;
-	}
-
-	INIT_LIST_HEAD(&rnet->lrn_routes);
-	rnet->lrn_net = net;
-	route->lr_hops = hops;
-	route->lr_net = net;
-	route->lr_priority = priority;
-
-	lnet_net_lock(LNET_LOCK_EX);
-
-	lpni = lnet_nid2peerni_ex(gateway, LNET_LOCK_EX);
-	if (IS_ERR(lpni)) {
-		lnet_net_unlock(LNET_LOCK_EX);
-
-		kfree(route);
-		kfree(rnet);
-
-		rc = PTR_ERR(lpni);
-		if (rc == -EHOSTUNREACH) /* gateway is not on a local net */
-			return rc;	/* ignore the route entry */
-		CERROR("Error %d creating route %s %d %s\n", rc,
-		       libcfs_net2str(net), hops,
-		       libcfs_nid2str(gateway));
-		return rc;
-	}
-	route->lr_gateway = lpni;
-	LASSERT(the_lnet.ln_state == LNET_STATE_RUNNING);
-
-	rnet2 = lnet_find_rnet_locked(net);
-	if (!rnet2) {
-		/* new network */
-		list_add_tail(&rnet->lrn_list, lnet_net2rnethash(net));
-		rnet2 = rnet;
-	}
-
-	/* Search for a duplicate route (it's a NOOP if it is) */
-	add_route = 1;
-	list_for_each_entry(route2, &rnet2->lrn_routes, lr_list) {
-		if (route2->lr_gateway == route->lr_gateway) {
-			add_route = 0;
-			break;
-		}
-
-		/* our lookups must be true */
-		LASSERT(route2->lr_gateway->lpni_nid != gateway);
-	}
-
-	if (add_route) {
-		lnet_peer_ni_addref_locked(route->lr_gateway); /* +1 for notify */
-		lnet_add_route_to_rnet(rnet2, route);
-
-		ni = lnet_get_next_ni_locked(route->lr_gateway->lpni_net, NULL);
-		lnet_net_unlock(LNET_LOCK_EX);
-
-		/* XXX Assume alive */
-		if (ni->ni_net->net_lnd->lnd_notify)
-			ni->ni_net->net_lnd->lnd_notify(ni, gateway, 1);
-
-		lnet_net_lock(LNET_LOCK_EX);
-	}
-
-	/* -1 for notify or !add_route */
-	lnet_peer_ni_decref_locked(route->lr_gateway);
-	lnet_net_unlock(LNET_LOCK_EX);
-	rc = 0;
-
-	if (!add_route) {
-		rc = -EEXIST;
-		kfree(route);
-	}
-
-	if (rnet != rnet2)
-		kfree(rnet);
-
-	/* kick start the monitor thread to handle the added route */
-	wake_up(&the_lnet.ln_mt_waitq);
-
-	return rc;
+	net = net;
+	hops = hops;
+	gateway = gateway;
+	priority = priority;
+	return -EINVAL;
 }
 
+/* TODO: reimplement lnet_check_routes() */
 int
 lnet_del_route(u32 net, lnet_nid_t gw_nid)
 {
-	struct lnet_peer_ni *gateway;
-	struct lnet_remotenet *rnet;
-	struct lnet_route *route;
-	int rc = -ENOENT;
-	struct list_head *rn_list;
-	int idx = 0;
-
-	CDEBUG(D_NET, "Del route: net %s : gw %s\n",
-	       libcfs_net2str(net), libcfs_nid2str(gw_nid));
-
-	/*
-	 * NB Caller may specify either all routes via the given gateway
-	 * or a specific route entry actual NIDs)
-	 */
-	lnet_net_lock(LNET_LOCK_EX);
-	if (net == LNET_NIDNET(LNET_NID_ANY))
-		rn_list = &the_lnet.ln_remote_nets_hash[0];
-	else
-		rn_list = lnet_net2rnethash(net);
-
-again:
-	list_for_each_entry(rnet, rn_list, lrn_list) {
-		if (!(net == LNET_NIDNET(LNET_NID_ANY) ||
-		      net == rnet->lrn_net))
-			continue;
-
-		list_for_each_entry(route, &rnet->lrn_routes, lr_list) {
-			gateway = route->lr_gateway;
-			if (!(gw_nid == LNET_NID_ANY ||
-			      gw_nid == gateway->lpni_nid))
-				continue;
-
-			list_del(&route->lr_list);
-			list_del(&route->lr_gwlist);
-			the_lnet.ln_remote_nets_version++;
-
-			if (list_empty(&rnet->lrn_routes))
-				list_del(&rnet->lrn_list);
-			else
-				rnet = NULL;
-
-			lnet_rtr_decref_locked(gateway);
-			lnet_peer_ni_decref_locked(gateway);
-
-			lnet_net_unlock(LNET_LOCK_EX);
-
-			kfree(route);
-			kfree(rnet);
-
-			rc = 0;
-			lnet_net_lock(LNET_LOCK_EX);
-			goto again;
-		}
-	}
-
-	if (net == LNET_NIDNET(LNET_NID_ANY) &&
-	    ++idx < LNET_REMOTE_NETS_HASH_SIZE) {
-		rn_list = &the_lnet.ln_remote_nets_hash[idx];
-		goto again;
-	}
-	lnet_net_unlock(LNET_LOCK_EX);
-
-	return rc;
+	net = net;
+	gw_nid = gw_nid;
+	return -EINVAL;
 }
 
 void
@@ -553,7 +285,8 @@ int lnet_get_rtr_pool_cfg(int cpt, struct lnet_ioctl_pool_cfg *pool_cfg)
 					*net = rnet->lrn_net;
 					*hops = route->lr_hops;
 					*priority = route->lr_priority;
-					*gateway = route->lr_gateway->lpni_nid;
+					*gateway =
+					    route->lr_gateway->lp_primary_nid;
 					*alive = lnet_is_route_alive(route);
 					lnet_net_unlock(cpt);
 					return 0;
@@ -588,110 +321,12 @@ int lnet_get_rtr_pool_cfg(int cpt, struct lnet_ioctl_pool_cfg *pool_cfg)
 }
 
 /**
- * parse router-checker pinginfo, record number of down NIs for remote
- * networks on that router.
+ * TODO: re-implement
  */
 static void
 lnet_parse_rc_info(struct lnet_rc_data *rcd)
 {
-	struct lnet_ping_buffer *pbuf = rcd->rcd_pingbuffer;
-	struct lnet_peer_ni *gw = rcd->rcd_gateway;
-	struct lnet_route *rte;
-	int nnis;
-
-	if (!gw->lpni_alive || !pbuf)
-		return;
-
-	/*
-	 * Protect gw->lpni_ping_feats. This can be set from
-	 * lnet_notify_locked with different locks being held
-	 */
-	spin_lock(&gw->lpni_lock);
-
-	if (pbuf->pb_info.pi_magic == __swab32(LNET_PROTO_PING_MAGIC))
-		lnet_swap_pinginfo(pbuf);
-
-	/* NB always racing with network! */
-	if (pbuf->pb_info.pi_magic != LNET_PROTO_PING_MAGIC) {
-		CDEBUG(D_NET, "%s: Unexpected magic %08x\n",
-		       libcfs_nid2str(gw->lpni_nid), pbuf->pb_info.pi_magic);
-		gw->lpni_ping_feats = LNET_PING_FEAT_INVAL;
-		goto out;
-	}
-
-	gw->lpni_ping_feats = pbuf->pb_info.pi_features;
-
-	/* Without NI status info there's nothing more to do. */
-	if (!(gw->lpni_ping_feats & LNET_PING_FEAT_NI_STATUS))
-		goto out;
-
-	/* Determine the number of NIs for which there is data. */
-	nnis = pbuf->pb_info.pi_nnis;
-	if (pbuf->pb_nnis < nnis) {
-		if (rcd->rcd_nnis < nnis)
-			rcd->rcd_nnis = nnis;
-		nnis = pbuf->pb_nnis;
-	}
-
-	list_for_each_entry(rte, &gw->lpni_routes, lr_gwlist) {
-		int down = 0;
-		int up = 0;
-		int i;
-
-		/* If routing disabled then the route is down. */
-		if (gw->lpni_ping_feats & LNET_PING_FEAT_RTE_DISABLED) {
-			rte->lr_downis = 1;
-			continue;
-		}
-
-		for (i = 0; i < nnis; i++) {
-			struct lnet_ni_status *stat = &pbuf->pb_info.pi_ni[i];
-			lnet_nid_t nid = stat->ns_nid;
-
-			if (nid == LNET_NID_ANY) {
-				CDEBUG(D_NET, "%s: unexpected LNET_NID_ANY\n",
-				       libcfs_nid2str(gw->lpni_nid));
-				gw->lpni_ping_feats = LNET_PING_FEAT_INVAL;
-				goto out;
-			}
-
-			if (LNET_NETTYP(LNET_NIDNET(nid)) == LOLND)
-				continue;
-
-			if (stat->ns_status == LNET_NI_STATUS_DOWN) {
-				down++;
-				continue;
-			}
-
-			if (stat->ns_status == LNET_NI_STATUS_UP) {
-				if (LNET_NIDNET(nid) == rte->lr_net) {
-					up = 1;
-					break;
-				}
-				continue;
-			}
-
-			CDEBUG(D_NET, "%s: Unexpected status 0x%x\n",
-			       libcfs_nid2str(gw->lpni_nid), stat->ns_status);
-			gw->lpni_ping_feats = LNET_PING_FEAT_INVAL;
-			goto out;
-		}
-
-		if (up) { /* ignore downed NIs if NI for dest network is up */
-			rte->lr_downis = 0;
-			continue;
-		}
-		/**
-		 * if @down is zero and this route is single-hop, it means
-		 * we can't find NI for target network
-		 */
-		if (!down && rte->lr_hops == 1)
-			down = 1;
-
-		rte->lr_downis = down;
-	}
-out:
-	spin_unlock(&gw->lpni_lock);
+	rcd = rcd;
 }
 
 static void
@@ -725,7 +360,6 @@ int lnet_get_rtr_pool_cfg(int cpt, struct lnet_ioctl_pool_cfg *pool_cfg)
 	}
 
 	if (event->type == LNET_EVENT_SEND) {
-		lp->lpni_ping_notsent = 0;
 		if (!event->status)
 			goto out;
 	}
@@ -755,7 +389,7 @@ int lnet_get_rtr_pool_cfg(int cpt, struct lnet_ioctl_pool_cfg *pool_cfg)
 static void
 lnet_wait_known_routerstate(void)
 {
-	struct lnet_peer_ni *rtr;
+	struct lnet_peer *rtr;
 	int all_known;
 
 	LASSERT(the_lnet.ln_mt_state == LNET_MT_STATE_RUNNING);
@@ -764,15 +398,15 @@ int lnet_get_rtr_pool_cfg(int cpt, struct lnet_ioctl_pool_cfg *pool_cfg)
 		int cpt = lnet_net_lock_current();
 
 		all_known = 1;
-		list_for_each_entry(rtr, &the_lnet.ln_routers, lpni_rtr_list) {
-			spin_lock(&rtr->lpni_lock);
+		list_for_each_entry(rtr, &the_lnet.ln_routers, lp_rtr_list) {
+			spin_lock(&rtr->lp_lock);
 
-			if (!rtr->lpni_alive_count) {
+			if (!(rtr->lp_state & LNET_PEER_DISCOVERED)) {
 				all_known = 0;
-				spin_unlock(&rtr->lpni_lock);
+				spin_unlock(&rtr->lp_lock);
 				break;
 			}
-			spin_unlock(&rtr->lpni_lock);
+			spin_unlock(&rtr->lp_lock);
 		}
 
 		lnet_net_unlock(cpt);
@@ -784,17 +418,22 @@ int lnet_get_rtr_pool_cfg(int cpt, struct lnet_ioctl_pool_cfg *pool_cfg)
 	}
 }
 
+/* TODO: reimplement */
 void
 lnet_router_ni_update_locked(struct lnet_peer_ni *gw, u32 net)
 {
 	struct lnet_route *rte;
+	struct lnet_peer *lp;
 
-	if ((gw->lpni_ping_feats & LNET_PING_FEAT_NI_STATUS)) {
-		list_for_each_entry(rte, &gw->lpni_routes, lr_gwlist) {
-			if (rte->lr_net == net) {
-				rte->lr_downis = 0;
-				break;
-			}
+	if ((gw->lpni_ping_feats & LNET_PING_FEAT_NI_STATUS))
+		lp = gw->lpni_peer_net->lpn_peer;
+	else
+		return;
+
+	list_for_each_entry(rte, &lp->lp_routes, lr_gwlist) {
+		if (rte->lr_net == net) {
+			rte->lr_downis = 0;
+			break;
 		}
 	}
 }
@@ -841,212 +480,6 @@ int lnet_get_rtr_pool_cfg(int cpt, struct lnet_ioctl_pool_cfg *pool_cfg)
 	}
 }
 
-static void
-lnet_destroy_rc_data(struct lnet_rc_data *rcd)
-{
-	LASSERT(list_empty(&rcd->rcd_list));
-	/* detached from network */
-	LASSERT(LNetMDHandleIsInvalid(rcd->rcd_mdh));
-
-	if (rcd->rcd_gateway) {
-		int cpt = rcd->rcd_gateway->lpni_cpt;
-
-		lnet_net_lock(cpt);
-		lnet_peer_ni_decref_locked(rcd->rcd_gateway);
-		lnet_net_unlock(cpt);
-	}
-
-	if (rcd->rcd_pingbuffer)
-		lnet_ping_buffer_decref(rcd->rcd_pingbuffer);
-
-	kfree(rcd);
-}
-
-static struct lnet_rc_data *
-lnet_update_rc_data_locked(struct lnet_peer_ni *gateway)
-{
-	struct lnet_handle_md mdh;
-	struct lnet_rc_data *rcd;
-	struct lnet_ping_buffer *pbuf = NULL;
-	struct lnet_md md;
-	int nnis = LNET_INTERFACES_MIN;
-	int rc;
-	int i;
-
-	rcd = gateway->lpni_rcd;
-	if (rcd) {
-		nnis = rcd->rcd_nnis;
-		mdh = rcd->rcd_mdh;
-		LNetInvalidateMDHandle(&rcd->rcd_mdh);
-		pbuf = rcd->rcd_pingbuffer;
-		rcd->rcd_pingbuffer = NULL;
-	} else {
-		LNetInvalidateMDHandle(&mdh);
-	}
-
-	lnet_net_unlock(gateway->lpni_cpt);
-
-	if (rcd) {
-		LNetMDUnlink(mdh);
-		lnet_ping_buffer_decref(pbuf);
-	} else {
-		rcd = kzalloc(sizeof(*rcd), GFP_NOFS);
-		if (!rcd)
-			goto out;
-
-		LNetInvalidateMDHandle(&rcd->rcd_mdh);
-		INIT_LIST_HEAD(&rcd->rcd_list);
-		rcd->rcd_nnis = nnis;
-	}
-
-	pbuf = lnet_ping_buffer_alloc(nnis, GFP_NOFS);
-	if (!pbuf)
-		goto out;
-
-	for (i = 0; i < nnis; i++) {
-		pbuf->pb_info.pi_ni[i].ns_nid = LNET_NID_ANY;
-		pbuf->pb_info.pi_ni[i].ns_status = LNET_NI_STATUS_INVALID;
-	}
-	rcd->rcd_pingbuffer = pbuf;
-
-	md.start = &pbuf->pb_info;
-	md.user_ptr = rcd;
-	md.length = LNET_PING_INFO_SIZE(nnis);
-	md.threshold = LNET_MD_THRESH_INF;
-	md.options = LNET_MD_TRUNCATE;
-	md.eq_handle = the_lnet.ln_rc_eqh;
-
-	LASSERT(!LNetEQHandleIsInvalid(the_lnet.ln_rc_eqh));
-	rc = LNetMDBind(md, LNET_UNLINK, &rcd->rcd_mdh);
-	if (rc < 0) {
-		CERROR("Can't bind MD: %d\n", rc);
-		goto out_ping_buffer_decref;
-	}
-	LASSERT(!rc);
-
-	lnet_net_lock(gateway->lpni_cpt);
-	/* Check if this is still a router. */
-	if (!lnet_isrouter(gateway))
-		goto out_unlock;
-	/* Check if someone else installed router data. */
-	if (gateway->lpni_rcd && gateway->lpni_rcd != rcd)
-		goto out_unlock;
-
-	/* Install and/or update the router data. */
-	if (!gateway->lpni_rcd) {
-		lnet_peer_ni_addref_locked(gateway);
-		rcd->rcd_gateway = gateway;
-		gateway->lpni_rcd = rcd;
-	}
-	gateway->lpni_ping_notsent = 0;
-
-	return rcd;
-
-out_unlock:
-	lnet_net_unlock(gateway->lpni_cpt);
-	rc = LNetMDUnlink(mdh);
-	LASSERT(!rc);
-out_ping_buffer_decref:
-	lnet_ping_buffer_decref(pbuf);
-out:
-	if (rcd && rcd != gateway->lpni_rcd)
-		lnet_destroy_rc_data(rcd);
-	lnet_net_lock(gateway->lpni_cpt);
-	return gateway->lpni_rcd;
-}
-
-static int
-lnet_router_check_interval(struct lnet_peer_ni *rtr)
-{
-	int secs;
-
-	secs = rtr->lpni_alive ? live_router_check_interval :
-			       dead_router_check_interval;
-	if (secs < 0)
-		secs = 0;
-
-	return secs;
-}
-
-static void
-lnet_ping_router_locked(struct lnet_peer_ni *rtr)
-{
-	struct lnet_rc_data *rcd = NULL;
-	time64_t now = ktime_get_seconds();
-	time64_t secs;
-	struct lnet_ni *ni;
-
-	lnet_peer_ni_addref_locked(rtr);
-
-	if (rtr->lpni_ping_deadline && /* ping timed out? */
-	    now > rtr->lpni_ping_deadline)
-		lnet_notify_locked(rtr, 1, 0, now);
-
-	/* Run any outstanding notifications */
-	ni = lnet_get_next_ni_locked(rtr->lpni_net, NULL);
-	lnet_ni_notify_locked(ni, rtr);
-
-	if (!lnet_isrouter(rtr) ||
-	    the_lnet.ln_mt_state != LNET_MT_STATE_RUNNING) {
-		/* router table changed or router checker is shutting down */
-		lnet_peer_ni_decref_locked(rtr);
-		return;
-	}
-
-	rcd = rtr->lpni_rcd;
-
-	/* The response to the router checker ping could've timed out and
-	 * the mdh might've been invalidated, so we need to update it
-	 * again.
-	 */
-	if (!rcd || rcd->rcd_nnis > rcd->rcd_pingbuffer->pb_nnis ||
-	    LNetMDHandleIsInvalid(rcd->rcd_mdh))
-		rcd = lnet_update_rc_data_locked(rtr);
-	if (!rcd)
-		return;
-
-	secs = lnet_router_check_interval(rtr);
-
-	CDEBUG(D_NET,
-	       "rtr %s %lldd: deadline %lld ping_notsent %d alive %d alive_count %d lpni_ping_timestamp %lld\n",
-	       libcfs_nid2str(rtr->lpni_nid), secs,
-	       rtr->lpni_ping_deadline, rtr->lpni_ping_notsent,
-	       rtr->lpni_alive, rtr->lpni_alive_count,
-	       rtr->lpni_ping_timestamp);
-
-	if (secs && !rtr->lpni_ping_notsent &&
-	    now > rtr->lpni_ping_timestamp + secs) {
-		int rc;
-		struct lnet_process_id id;
-		struct lnet_handle_md mdh;
-
-		id.nid = rtr->lpni_nid;
-		id.pid = LNET_PID_LUSTRE;
-		CDEBUG(D_NET, "Check: %s\n", libcfs_id2str(id));
-
-		rtr->lpni_ping_notsent = 1;
-		rtr->lpni_ping_timestamp = now;
-
-		mdh = rcd->rcd_mdh;
-
-		if (!rtr->lpni_ping_deadline) {
-			rtr->lpni_ping_deadline = ktime_get_seconds() +
-						  router_ping_timeout;
-		}
-
-		lnet_net_unlock(rtr->lpni_cpt);
-
-		rc = LNetGet(LNET_NID_ANY, mdh, id, LNET_RESERVED_PORTAL,
-			     LNET_PROTO_PING_MATCHBITS, 0, false);
-
-		lnet_net_lock(rtr->lpni_cpt);
-		if (rc)
-			rtr->lpni_ping_notsent = 0; /* no event pending */
-	}
-
-	lnet_peer_ni_decref_locked(rtr);
-}
-
 int lnet_router_pre_mt_start(void)
 {
 	int rc;
@@ -1088,81 +521,7 @@ void lnet_router_cleanup(void)
 
 void lnet_prune_rc_data(int wait_unlink)
 {
-	struct lnet_rc_data *rcd;
-	struct lnet_rc_data *tmp;
-	struct lnet_peer_ni *lp;
-	struct list_head head;
-	int i = 2;
-
-	if (likely(the_lnet.ln_mt_state == LNET_MT_STATE_RUNNING &&
-		   list_empty(&the_lnet.ln_rcd_deathrow) &&
-		   list_empty(&the_lnet.ln_rcd_zombie)))
-		return;
-
-	INIT_LIST_HEAD(&head);
-
-	lnet_net_lock(LNET_LOCK_EX);
-
-	if (the_lnet.ln_mt_state != LNET_MT_STATE_RUNNING) {
-		/* router checker is stopping, prune all */
-		list_for_each_entry(lp, &the_lnet.ln_routers,
-				    lpni_rtr_list) {
-			if (!lp->lpni_rcd)
-				continue;
-
-			LASSERT(list_empty(&lp->lpni_rcd->rcd_list));
-			list_add(&lp->lpni_rcd->rcd_list,
-				 &the_lnet.ln_rcd_deathrow);
-			lp->lpni_rcd = NULL;
-		}
-	}
-
-	/* unlink all RCDs on deathrow list */
-	list_splice_init(&the_lnet.ln_rcd_deathrow, &head);
-
-	if (!list_empty(&head)) {
-		lnet_net_unlock(LNET_LOCK_EX);
-
-		list_for_each_entry(rcd, &head, rcd_list)
-			LNetMDUnlink(rcd->rcd_mdh);
-
-		lnet_net_lock(LNET_LOCK_EX);
-	}
-
-	list_splice_init(&head, &the_lnet.ln_rcd_zombie);
-
-	/* release all zombie RCDs */
-	while (!list_empty(&the_lnet.ln_rcd_zombie)) {
-		list_for_each_entry_safe(rcd, tmp, &the_lnet.ln_rcd_zombie,
-					 rcd_list) {
-			if (LNetMDHandleIsInvalid(rcd->rcd_mdh))
-				list_move(&rcd->rcd_list, &head);
-		}
-
-		wait_unlink = wait_unlink &&
-			      !list_empty(&the_lnet.ln_rcd_zombie);
-
-		lnet_net_unlock(LNET_LOCK_EX);
-
-		while ((rcd = list_first_entry_or_null(&head,
-						       struct lnet_rc_data,
-						       rcd_list)) != NULL) {
-			list_del_init(&rcd->rcd_list);
-			lnet_destroy_rc_data(rcd);
-		}
-
-		if (!wait_unlink)
-			return;
-
-		i++;
-		CDEBUG(((i & (-i)) == i) ? D_WARNING : D_NET,
-		       "Waiting for rc buffers to unlink\n");
-		schedule_timeout_uninterruptible(HZ / 4);
-
-		lnet_net_lock(LNET_LOCK_EX);
-	}
-
-	lnet_net_unlock(LNET_LOCK_EX);
+	wait_unlink = wait_unlink;
 }
 
 /*
@@ -1194,27 +553,16 @@ bool lnet_router_checker_active(void)
 void
 lnet_check_routers(void)
 {
-	struct lnet_peer_ni *rtr;
+	struct lnet_peer *rtr;
 	u64 version;
 	int cpt;
-	int cpt2;
 
 	cpt = lnet_net_lock_current();
 rescan:
 	version = the_lnet.ln_routers_version;
 
-	list_for_each_entry(rtr, &the_lnet.ln_routers, lpni_rtr_list) {
-		cpt2 = rtr->lpni_cpt;
-		if (cpt != cpt2) {
-			lnet_net_unlock(cpt);
-			cpt = cpt2;
-			lnet_net_lock(cpt);
-			/* the routers list has changed */
-			if (version != the_lnet.ln_routers_version)
-				goto rescan;
-		}
-
-		lnet_ping_router_locked(rtr);
+	list_for_each_entry(rtr, &the_lnet.ln_routers, lp_rtr_list) {
+		/* TODO use discovery to determine if router is alive */
 
 		/* NB dropped lock */
 		if (version != the_lnet.ln_routers_version) {
diff --git a/net/lnet/lnet/router_proc.c b/net/lnet/lnet/router_proc.c
index 5341599..d41ff00 100644
--- a/net/lnet/lnet/router_proc.c
+++ b/net/lnet/lnet/router_proc.c
@@ -215,7 +215,7 @@ static int proc_lnet_routes(struct ctl_table *table, int write,
 			u32 net = rnet->lrn_net;
 			u32 hops = route->lr_hops;
 			unsigned int priority = route->lr_priority;
-			lnet_nid_t nid = route->lr_gateway->lpni_nid;
+			lnet_nid_t nid = route->lr_gateway->lp_primary_nid;
 			int alive = lnet_is_route_alive(route);
 
 			s += snprintf(s, tmpstr + tmpsiz - s,
@@ -290,7 +290,7 @@ static int proc_lnet_routers(struct ctl_table *table, int write,
 		*ppos = LNET_PROC_POS_MAKE(0, ver, 0, off);
 	} else {
 		struct list_head *r;
-		struct lnet_peer_ni *peer = NULL;
+		struct lnet_peer *peer = NULL;
 		int skip = off - 1;
 
 		lnet_net_lock(0);
@@ -305,9 +305,9 @@ static int proc_lnet_routers(struct ctl_table *table, int write,
 		r = the_lnet.ln_routers.next;
 
 		while (r != &the_lnet.ln_routers) {
-			struct lnet_peer_ni *lp;
+			struct lnet_peer *lp;
 
-			lp = list_entry(r, struct lnet_peer_ni, lpni_rtr_list);
+			lp = list_entry(r, struct lnet_peer, lp_rtr_list);
 			if (!skip) {
 				peer = lp;
 				break;
@@ -318,21 +318,22 @@ static int proc_lnet_routers(struct ctl_table *table, int write,
 		}
 
 		if (peer) {
-			lnet_nid_t nid = peer->lpni_nid;
+			lnet_nid_t nid = peer->lp_primary_nid;
 			time64_t now = ktime_get_seconds();
-			time64_t deadline = peer->lpni_ping_deadline;
-			int nrefs = atomic_read(&peer->lpni_refcount);
-			int nrtrrefs = peer->lpni_rtr_refcount;
-			int alive_cnt = peer->lpni_alive_count;
-			int alive = peer->lpni_alive;
-			int pingsent = !peer->lpni_ping_notsent;
-			time64_t last_ping = now - peer->lpni_ping_timestamp;
+			/* TODO: readjust what's being printed */
+			time64_t deadline = 0;
+			int nrefs = atomic_read(&peer->lp_refcount);
+			int nrtrrefs = peer->lp_rtr_refcount;
+			int alive_cnt = 0;
+			int alive = 0;
+			int pingsent = ((peer->lp_state & LNET_PEER_PING_SENT)
+				       != 0);
+			time64_t last_ping = now - peer->lp_rtrcheck_timestamp;
 			int down_ni = 0;
 			struct lnet_route *rtr;
 
-			if ((peer->lpni_ping_feats &
-			     LNET_PING_FEAT_NI_STATUS)) {
-				list_for_each_entry(rtr, &peer->lpni_routes,
+			if (nrtrrefs > 0) {
+				list_for_each_entry(rtr, &peer->lp_routes,
 						    lr_gwlist) {
 					/*
 					 * downis on any route should be the
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 332/622] lnet: lnet_add/del_route()
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Reimplemented lnet_add_route() and lnet_del_route() to use
the peer instead of the peer_ni.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11299
Lustre-commit: 680da7444a06 ("LU-11299 lnet: lnet_add/del_route()")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33184
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/router.c | 317 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 307 insertions(+), 10 deletions(-)

diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index 4e79c21..8374ce1 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -190,6 +190,39 @@
 	spin_unlock(&lp->lpni_lock);
 }
 
+static void
+lnet_rtr_addref_locked(struct lnet_peer *lp)
+{
+	LASSERT(lp->lp_rtr_refcount >= 0);
+
+	/* lnet_net_lock must be exclusively locked */
+	lp->lp_rtr_refcount++;
+	if (lp->lp_rtr_refcount == 1) {
+		list_add_tail(&lp->lp_rtr_list, &the_lnet.ln_routers);
+		/* addref for the_lnet.ln_routers */
+		lnet_peer_addref_locked(lp);
+		the_lnet.ln_routers_version++;
+	}
+}
+
+static void
+lnet_rtr_decref_locked(struct lnet_peer *lp)
+{
+	LASSERT(atomic_read(&lp->lp_refcount) > 0);
+	LASSERT(lp->lp_rtr_refcount > 0);
+
+	/* lnet_net_lock must be exclusively locked */
+	lp->lp_rtr_refcount--;
+	if (lp->lp_rtr_refcount == 0) {
+		LASSERT(list_empty(&lp->lp_routes));
+
+		list_del(&lp->lp_rtr_list);
+		/* decref for the_lnet.ln_routers */
+		lnet_peer_decref_locked(lp);
+		the_lnet.ln_routers_version++;
+	}
+}
+
 struct lnet_remotenet *
 lnet_find_rnet_locked(u32 net)
 {
@@ -206,24 +239,288 @@ struct lnet_remotenet *
 	return NULL;
 }
 
+static void lnet_shuffle_seed(void)
+{
+	static int seeded;
+	struct lnet_ni *ni = NULL;
+
+	if (seeded)
+		return;
+
+	/* Nodes with small feet have little entropy
+	 * the NID for this node gives the most entropy in the low bits
+	 */
+	while ((ni = lnet_get_next_ni_locked(NULL, ni)))
+		add_device_randomness(&ni->ni_nid, sizeof(ni->ni_nid));
+
+	seeded = 1;
+}
+
+/* NB expects LNET_LOCK held */
+static void
+lnet_add_route_to_rnet(struct lnet_remotenet *rnet, struct lnet_route *route)
+{
+	unsigned int len = 0;
+	unsigned int offset = 0;
+	struct list_head *e;
+
+	lnet_shuffle_seed();
+
+	list_for_each(e, &rnet->lrn_routes)
+		len++;
+
+	/* Randomly adding routes to the list is done to ensure that when
+	 * different nodes are using the same list of routers, they end up
+	 * preferring different routers.
+	 */
+	offset = prandom_u32_max(len + 1);
+	list_for_each(e, &rnet->lrn_routes) {
+		if (offset == 0)
+			break;
+		offset--;
+	}
+	list_add(&route->lr_list, e);
+	/* force a router check on the gateway to make sure the route is
+	 * alive
+	 */
+	route->lr_gateway->lp_rtrcheck_timestamp = 0;
+
+	the_lnet.ln_remote_nets_version++;
+
+	/* add the route on the gateway list */
+	list_add(&route->lr_gwlist, &route->lr_gateway->lp_routes);
+
+	/* take a router reference count on the gateway */
+	lnet_rtr_addref_locked(route->lr_gateway);
+}
+
 int
 lnet_add_route(u32 net, u32 hops, lnet_nid_t gateway,
 	       unsigned int priority)
 {
-	net = net;
-	hops = hops;
-	gateway = gateway;
-	priority = priority;
-	return -EINVAL;
+	struct list_head *route_entry;
+	struct lnet_remotenet *rnet;
+	struct lnet_remotenet *rnet2;
+	struct lnet_route *route;
+	struct lnet_peer_ni *lpni;
+	struct lnet_peer *gw;
+	int add_route;
+	int rc;
+
+	CDEBUG(D_NET, "Add route: remote net %s hops %d priority %u gw %s\n",
+	       libcfs_net2str(net), hops, priority, libcfs_nid2str(gateway));
+
+	if (gateway == LNET_NID_ANY ||
+	    LNET_NETTYP(LNET_NIDNET(gateway)) == LOLND ||
+	    net == LNET_NIDNET(LNET_NID_ANY) ||
+	    LNET_NETTYP(net) == LOLND ||
+	    LNET_NIDNET(gateway) == net ||
+	    (hops != LNET_UNDEFINED_HOPS && (hops < 1 || hops > 255)))
+		return -EINVAL;
+
+	/* it's a local network */
+	if (lnet_islocalnet(net))
+		return -EEXIST;
+
+	/* Assume net, route, all new */
+	route = kzalloc(sizeof(*route), GFP_NOFS);
+	rnet = kzalloc(sizeof(*rnet), GFP_NOFS);
+	if (!route || !rnet) {
+		CERROR("Out of memory creating route %s %d %s\n",
+		       libcfs_net2str(net), hops, libcfs_nid2str(gateway));
+		kfree(route);
+		kfree(rnet);
+		return -ENOMEM;
+	}
+
+	INIT_LIST_HEAD(&rnet->lrn_routes);
+	rnet->lrn_net = net;
+	/* store the local and remote net that the route represents */
+	route->lr_lnet = LNET_NIDNET(gateway);
+	route->lr_net = net;
+	route->lr_priority = priority;
+	route->lr_hops = hops;
+
+	lnet_net_lock(LNET_LOCK_EX);
+
+	/* lnet_nid2peerni_ex() grabs a ref on the lpni. We will need to
+	 * lose that once we're done
+	 */
+	lpni = lnet_nid2peerni_ex(gateway, LNET_LOCK_EX);
+	if (IS_ERR(lpni)) {
+		lnet_net_unlock(LNET_LOCK_EX);
+
+		kfree(route);
+		kfree(rnet);
+
+		rc = PTR_ERR(lpni);
+		CERROR("Error %d creating route %s %d %s\n", rc,
+		       libcfs_net2str(net), hops,
+		       libcfs_nid2str(gateway));
+		return rc;
+	}
+
+	LASSERT(lpni->lpni_peer_net && lpni->lpni_peer_net->lpn_peer);
+	gw = lpni->lpni_peer_net->lpn_peer;
+
+	route->lr_gateway = gw;
+
+	rnet2 = lnet_find_rnet_locked(net);
+	if (!rnet2) {
+		/* new network */
+		list_add_tail(&rnet->lrn_list, lnet_net2rnethash(net));
+		rnet2 = rnet;
+	}
+
+	/* Search for a duplicate route (it's a NOOP if it is) */
+	add_route = 1;
+	list_for_each(route_entry, &rnet2->lrn_routes) {
+		struct lnet_route *route2;
+
+		route2 = list_entry(route_entry, struct lnet_route, lr_list);
+		if (route2->lr_gateway == route->lr_gateway) {
+			add_route = 0;
+			break;
+		}
+
+		/* our lookups must be true */
+		LASSERT(route2->lr_gateway->lp_primary_nid != gateway);
+	}
+
+	/* It is possible to add multiple routes through the same peer,
+	 * but it'll be using a different NID of that peer. When the
+	 * gateway is discovered, discovery will consolidate the different
+	 * peers into one peer. In this case the discovery code will have
+	 * to move the routes from the peer that's being deleted to the
+	 * consolidated peer lp_routes list
+	 */
+	if (add_route)
+		lnet_add_route_to_rnet(rnet2, route);
+
+	/* get rid of the reference on the lpni.
+	 */
+	lnet_peer_ni_decref_locked(lpni);
+	lnet_net_unlock(LNET_LOCK_EX);
+
+	rc = 0;
+
+	if (!add_route) {
+		rc = -EEXIST;
+		kfree(route);
+	}
+
+	if (rnet != rnet2)
+		kfree(rnet);
+
+	/* kick start the monitor thread to handle the added route */
+	wake_up(&the_lnet.ln_mt_waitq);
+
+	return rc;
+}
+
+static void
+lnet_del_route_from_rnet(lnet_nid_t gw_nid, struct list_head *route_list,
+			 struct list_head *zombies)
+{
+	struct lnet_peer *gateway;
+	struct lnet_route *route;
+	struct lnet_route *tmp;
+
+	list_for_each_entry_safe(route, tmp, route_list, lr_list) {
+		gateway = route->lr_gateway;
+		if (gw_nid != LNET_NID_ANY &&
+		    gw_nid != gateway->lp_primary_nid)
+			continue;
+
+		/* move to zombie to delete outside the lock
+		 * Note that this function is called with the
+		 * ln_api_mutex held as well as the exclusive net
+		 * lock. Adding to the remote net list happens
+		 * under the same conditions. Same goes for the
+		 * gateway router list
+		 */
+		list_move(&route->lr_list, zombies);
+		the_lnet.ln_remote_nets_version++;
+
+		list_del(&route->lr_gwlist);
+		lnet_rtr_decref_locked(gateway);
+	}
 }
 
-/* TODO: reimplement lnet_check_routes() */
 int
 lnet_del_route(u32 net, lnet_nid_t gw_nid)
 {
-	net = net;
-	gw_nid = gw_nid;
-	return -EINVAL;
+	struct list_head rnet_zombies;
+	struct lnet_remotenet *rnet;
+	struct lnet_remotenet *tmp;
+	struct list_head *rn_list;
+	struct lnet_peer_ni *lpni;
+	struct lnet_route *route;
+	struct list_head zombies;
+	struct lnet_peer *lp;
+	int i = 0;
+
+	INIT_LIST_HEAD(&rnet_zombies);
+	INIT_LIST_HEAD(&zombies);
+
+	CDEBUG(D_NET, "Del route: net %s : gw %s\n",
+	       libcfs_net2str(net), libcfs_nid2str(gw_nid));
+
+	/* NB Caller may specify either all routes via the given gateway
+	 * or a specific route entry actual NIDs)
+	 */
+
+	lnet_net_lock(LNET_LOCK_EX);
+
+	lpni = lnet_find_peer_ni_locked(gw_nid);
+	if (lpni) {
+		lp = lpni->lpni_peer_net->lpn_peer;
+		LASSERT(lp);
+		gw_nid = lp->lp_primary_nid;
+		lnet_peer_ni_decref_locked(lpni);
+	}
+
+	if (net != LNET_NIDNET(LNET_NID_ANY)) {
+		rnet = lnet_find_rnet_locked(net);
+		if (!rnet) {
+			lnet_net_unlock(LNET_LOCK_EX);
+			return -ENOENT;
+		}
+		lnet_del_route_from_rnet(gw_nid, &rnet->lrn_routes,
+					 &zombies);
+		if (list_empty(&rnet->lrn_routes))
+			list_move(&rnet->lrn_list, &rnet_zombies);
+		goto delete_zombies;
+	}
+
+	for (i = 0; i < LNET_REMOTE_NETS_HASH_SIZE; i++) {
+		rn_list = &the_lnet.ln_remote_nets_hash[i];
+
+		list_for_each_entry_safe(rnet, tmp, rn_list, lrn_list) {
+			lnet_del_route_from_rnet(gw_nid, &rnet->lrn_routes,
+						 &zombies);
+			if (list_empty(&rnet->lrn_routes))
+				list_move(&rnet->lrn_list, &rnet_zombies);
+		}
+	}
+
+delete_zombies:
+	lnet_net_unlock(LNET_LOCK_EX);
+
+	while (!list_empty(&zombies)) {
+		route = list_first_entry(&zombies, struct lnet_route, lr_list);
+		list_del(&route->lr_list);
+		kfree(route);
+	}
+
+	while (!list_empty(&rnet_zombies)) {
+		rnet = list_first_entry(&rnet_zombies, struct lnet_remotenet,
+					lrn_list);
+		list_del(&rnet->lrn_list);
+		kfree(rnet);
+	}
+
+	return 0;
 }
 
 void
@@ -900,7 +1197,7 @@ bool lnet_router_checker_active(void)
 	lnet_net_lock(LNET_LOCK_EX);
 	the_lnet.ln_routing = 1;
 	lnet_net_unlock(LNET_LOCK_EX);
-
+	wake_up(&the_lnet.ln_mt_waitq);
 	return 0;
 
 failed:
-- 
1.8.3.1


* [lustre-devel] [PATCH 333/622] lnet: Do not allow deleting of router nis
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Check the peer before deleting a peer_ni. If the peer is a router,
do not allow deletion of the peer_ni.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11551
Lustre-commit: 7832a9f52d90 ("LU-11551 lnet: Do not allow deleting of router nis")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33448
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/peer.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index faaf94a..cb70bc7 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -1550,6 +1550,15 @@ struct lnet_peer_net *
 		return -ENODEV;
 	}
 
+	lnet_net_lock(LNET_LOCK_EX);
+	if (lp->lp_rtr_refcount > 0) {
+		lnet_net_unlock(LNET_LOCK_EX);
+		CERROR("%s is a router. Can not be deleted\n",
+		       libcfs_nid2str(prim_nid));
+		return -EBUSY;
+	}
+	lnet_net_unlock(LNET_LOCK_EX);
+
 	if (nid == LNET_NID_ANY || nid == lp->lp_primary_nid)
 		return lnet_peer_del(lp);
 
-- 
1.8.3.1

* [lustre-devel] [PATCH 334/622] lnet: router sensitivity
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (332 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 333/622] lnet: Do not allow deleting of router nis James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 335/622] lnet: cache ni status James Simmons
                   ` (288 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Introduce the router_sensitivity_percentage module parameter to
control the sensitivity of routers to failures. It defaults to 100%
which means a router interface needs to be fully healthy in order
to be used.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11300
Lustre-commit: 2b59dae54efc ("LU-11300 lnet: router sensitivity")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33449
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h |  1 +
 net/lnet/lnet/router.c        | 50 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index 80f6f8c..eae55d5 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -505,6 +505,7 @@ struct lnet_ni *
 extern unsigned int lnet_recovery_interval;
 extern unsigned int lnet_peer_discovery_disabled;
 extern unsigned int lnet_drop_asym_route;
+extern unsigned int router_sensitivity_percentage;
 extern int portal_rotor;
 
 int lnet_lib_init(void);
diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index 8374ce1..40725d2 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -90,6 +90,56 @@
 module_param(router_ping_timeout, int, 0644);
 MODULE_PARM_DESC(router_ping_timeout, "Seconds to wait for the reply to a router health query");
 
+/* A value between 0 and 100. 0 means that even if the router's interfaces
+ * have the worst possible health, still consider the gateway usable.
+ * 100 means that at least one interface on the route's remote net is 100%
+ * healthy to consider the route alive.
+ * The default is set to 100 to ensure we maintain the original behavior.
+ */
+unsigned int router_sensitivity_percentage = 100;
+static int rtr_sensitivity_set(const char *val,
+			       const struct kernel_param *kp);
+static struct kernel_param_ops param_ops_rtr_sensitivity = {
+	.set = rtr_sensitivity_set,
+	.get = param_get_int,
+};
+
+#define param_check_rtr_sensitivity(name, p) \
+	__param_check(name, p, int)
+module_param(router_sensitivity_percentage, rtr_sensitivity, 0644);
+MODULE_PARM_DESC(router_sensitivity_percentage,
+		 "How healthy a gateway should be to be used in percent");
+
+static int
+rtr_sensitivity_set(const char *val, const struct kernel_param *kp)
+{
+	int rc;
+	unsigned int *sen = (unsigned int *)kp->arg;
+	unsigned long value;
+
+	rc = kstrtoul(val, 0, &value);
+	if (rc) {
+		CERROR("Invalid module parameter value for 'router_sensitivity_percentage'\n");
+		return rc;
+	}
+
+	if (value > 100) {
+		CERROR("Invalid value: %lu for 'router_sensitivity_percentage'\n", value);
+		return -EINVAL;
+	}
+
+	/* The purpose of locking the api_mutex here is to ensure that
+	 * the correct value ends up stored properly.
+	 */
+	mutex_lock(&the_lnet.ln_api_mutex);
+
+	*sen = value;
+
+	mutex_unlock(&the_lnet.ln_api_mutex);
+
+	return 0;
+}
+
 int
 lnet_peers_start_down(void)
 {
-- 
1.8.3.1

* [lustre-devel] [PATCH 335/622] lnet: cache ni status
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (333 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 334/622] lnet: router sensitivity James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 336/622] lnet: Cache the routing feature James Simmons
                   ` (287 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

When processing the data in a PUSH or a REPLY, make sure to cache
the ns_status. This is the status of the peer_ni as reported by the
peer itself.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11300
Lustre-commit: 398f4071dc17 ("LU-11300 lnet: cache ni status")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33450
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-types.h |  2 ++
 net/lnet/lnet/peer.c           | 42 +++++++++++++++++++++++++++++++-----------
 2 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index 31fe22a..a551005 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -585,6 +585,8 @@ struct lnet_peer_ni {
 	int			 lpni_cpt;
 	/* state flags -- protected by lpni_lock */
 	unsigned int		 lpni_state;
+	/* status of the peer NI as reported by the peer */
+	u32			lpni_ns_status;
 	/* sequence number used to round robin over peer nis within a net */
 	u32			 lpni_seq;
 	/* sequence number used to round robin over gateways */
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index cb70bc7..cba3da2 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -128,8 +128,10 @@
 
 	spin_lock_init(&lpni->lpni_lock);
 
-	lpni->lpni_alive = !lnet_peers_start_down(); /* 1 bit!! */
-	lpni->lpni_last_alive = ktime_get_seconds(); /* assumes alive */
+	if (lnet_peers_start_down())
+		lpni->lpni_ns_status = LNET_NI_STATUS_DOWN;
+	else
+		lpni->lpni_ns_status = LNET_NI_STATUS_UP;
 	lpni->lpni_ping_feats = LNET_PING_FEAT_INVAL;
 	lpni->lpni_nid = nid;
 	lpni->lpni_cpt = cpt;
@@ -2410,7 +2412,7 @@ static int lnet_peer_merge_data(struct lnet_peer *lp,
 {
 	struct lnet_peer_ni *lpni;
 	lnet_nid_t *curnis = NULL;
-	lnet_nid_t *addnis = NULL;
+	struct lnet_ni_status *addnis = NULL;
 	lnet_nid_t *delnis = NULL;
 	unsigned int flags;
 	int ncurnis;
@@ -2426,9 +2428,9 @@ static int lnet_peer_merge_data(struct lnet_peer *lp,
 		flags |= LNET_PEER_MULTI_RAIL;
 
 	nnis = max_t(int, lp->lp_nnis, pbuf->pb_info.pi_nnis);
-	curnis = kmalloc_array(nnis, sizeof(lnet_nid_t), GFP_NOFS);
-	addnis = kmalloc_array(nnis, sizeof(lnet_nid_t), GFP_NOFS);
-	delnis = kmalloc_array(nnis, sizeof(lnet_nid_t), GFP_NOFS);
+	curnis = kmalloc_array(nnis, sizeof(*curnis), GFP_NOFS);
+	addnis = kmalloc_array(nnis, sizeof(*addnis), GFP_NOFS);
+	delnis = kmalloc_array(nnis, sizeof(*delnis), GFP_NOFS);
 	if (!curnis || !addnis || !delnis) {
 		rc = -ENOMEM;
 		goto out;
@@ -2451,7 +2453,7 @@ static int lnet_peer_merge_data(struct lnet_peer *lp,
 			if (pbuf->pb_info.pi_ni[i].ns_nid == curnis[j])
 				break;
 		if (j == ncurnis)
-			addnis[naddnis++] = pbuf->pb_info.pi_ni[i].ns_nid;
+			addnis[naddnis++] = pbuf->pb_info.pi_ni[i];
 	}
 	/*
 	 * Check for NIDs in curnis[] not present in pbuf.
@@ -2463,23 +2465,41 @@ static int lnet_peer_merge_data(struct lnet_peer *lp,
 	for (i = 0; i < ncurnis; i++) {
 		if (LNET_NETTYP(LNET_NIDNET(curnis[i])) == LOLND)
 			continue;
-		for (j = 1; j < pbuf->pb_info.pi_nnis; j++)
-			if (curnis[i] == pbuf->pb_info.pi_ni[j].ns_nid)
+		for (j = 1; j < pbuf->pb_info.pi_nnis; j++) {
+			if (curnis[i] == pbuf->pb_info.pi_ni[j].ns_nid) {
+				/* update the information we cache for the
+				 * peer with the latest information we
+				 * received
+				 */
+				lpni = lnet_find_peer_ni_locked(curnis[i]);
+				if (lpni) {
+					lpni->lpni_ns_status =
+						pbuf->pb_info.pi_ni[j].ns_status;
+					lnet_peer_ni_decref_locked(lpni);
+				}
 				break;
+			}
+		}
 		if (j == pbuf->pb_info.pi_nnis)
 			delnis[ndelnis++] = curnis[i];
 	}
 
 	for (i = 0; i < naddnis; i++) {
-		rc = lnet_peer_add_nid(lp, addnis[i], flags);
+		rc = lnet_peer_add_nid(lp, addnis[i].ns_nid, flags);
 		if (rc) {
 			CERROR("Error adding NID %s to peer %s: %d\n",
-			       libcfs_nid2str(addnis[i]),
+			       libcfs_nid2str(addnis[i].ns_nid),
 			       libcfs_nid2str(lp->lp_primary_nid), rc);
 			if (rc == -ENOMEM)
 				goto out;
 		}
+		lpni = lnet_find_peer_ni_locked(addnis[i].ns_nid);
+		if (lpni) {
+			lpni->lpni_ns_status = addnis[i].ns_status;
+			lnet_peer_ni_decref_locked(lpni);
+		}
 	}
+
 	for (i = 0; i < ndelnis; i++) {
 		rc = lnet_peer_del_nid(lp, delnis[i], flags);
 		if (rc) {
-- 
1.8.3.1

* [lustre-devel] [PATCH 336/622] lnet: Cache the routing feature
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (334 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 335/622] lnet: cache ni status James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 337/622] lnet: peer aliveness James Simmons
                   ` (286 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

When processing a REPLY or a PUSH during discovery, cache whether
the routing feature is enabled or disabled as reported by the peer.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11300
Lustre-commit: d65a7b8727ee ("LU-11300 lnet: Cache the routing feature")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33451
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-types.h | 28 ++++++++++++++++------------
 net/lnet/lnet/peer.c           | 10 ++++++++++
 2 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index a551005..ecc6dee 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -705,9 +705,13 @@ struct lnet_peer {
  *
  * A peer is marked NO_DISCOVERY if the LNET_PING_FEAT_DISCOVERY bit was
  * NOT set when the peer was pinged by discovery.
+ *
+ * A peer is marked ROUTER if it indicates so in the feature bit.
  */
 #define LNET_PEER_MULTI_RAIL	BIT(0)	/* Multi-rail aware */
 #define LNET_PEER_NO_DISCOVERY	BIT(1)	/* Peer disabled discovery */
+#define LNET_PEER_ROUTER_ENABLED BIT(2)	/* router feature enabled */
+
 /*
  * A peer is marked CONFIGURED if it was configured by DLC.
  *
@@ -721,28 +725,28 @@ struct lnet_peer {
  * A peer that was created as the result of inbound traffic will not
  * be marked at all.
  */
-#define LNET_PEER_CONFIGURED	BIT(2)	/* Configured via DLC */
-#define LNET_PEER_DISCOVERED	BIT(3)	/* Peer was discovered */
-#define LNET_PEER_REDISCOVER	BIT(4)	/* Discovery was disabled */
+#define LNET_PEER_CONFIGURED	BIT(3)	/* Configured via DLC */
+#define LNET_PEER_DISCOVERED	BIT(4)	/* Peer was discovered */
+#define LNET_PEER_REDISCOVER	BIT(5)	/* Discovery was disabled */
 /*
  * A peer is marked DISCOVERING when discovery is in progress.
  * The other flags below correspond to stages of discovery.
  */
-#define LNET_PEER_DISCOVERING	BIT(5)	/* Discovering */
-#define LNET_PEER_DATA_PRESENT	BIT(6)	/* Remote peer data present */
-#define LNET_PEER_NIDS_UPTODATE	BIT(7)	/* Remote peer info uptodate */
-#define LNET_PEER_PING_SENT	BIT(8)	/* Waiting for REPLY to Ping */
-#define LNET_PEER_PUSH_SENT	BIT(9)	/* Waiting for ACK of Push */
-#define LNET_PEER_PING_FAILED	BIT(10)	/* Ping send failure */
-#define LNET_PEER_PUSH_FAILED	BIT(11)	/* Push send failure */
+#define LNET_PEER_DISCOVERING	BIT(6)	/* Discovering */
+#define LNET_PEER_DATA_PRESENT	BIT(7)	/* Remote peer data present */
+#define LNET_PEER_NIDS_UPTODATE	BIT(8)	/* Remote peer info uptodate */
+#define LNET_PEER_PING_SENT	BIT(9)	/* Waiting for REPLY to Ping */
+#define LNET_PEER_PUSH_SENT	BIT(10)	/* Waiting for ACK of Push */
+#define LNET_PEER_PING_FAILED	BIT(11)	/* Ping send failure */
+#define LNET_PEER_PUSH_FAILED	BIT(12)	/* Push send failure */
 /*
  * A ping can be forced as a way to fix up state, or as a manual
  * intervention by an admin.
  * A push can be forced in circumstances that would normally not
  * allow for one to happen.
  */
-#define LNET_PEER_FORCE_PING	BIT(12)	/* Forced Ping */
-#define LNET_PEER_FORCE_PUSH	BIT(13)	/* Forced Push */
+#define LNET_PEER_FORCE_PING	BIT(13)	/* Forced Ping */
+#define LNET_PEER_FORCE_PUSH	BIT(14)	/* Forced Push */
 
 struct lnet_peer_net {
 	/* chain on lp_peer_nets */
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index cba3da2..91ad6b4 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -2427,6 +2427,16 @@ static int lnet_peer_merge_data(struct lnet_peer *lp,
 	if (pbuf->pb_info.pi_features & LNET_PING_FEAT_MULTI_RAIL)
 		flags |= LNET_PEER_MULTI_RAIL;
 
+	/* Cache the routing feature for the peer; whether it is enabled
+	 * or disabled as reported by the remote peer.
+	 */
+	spin_lock(&lp->lp_lock);
+	if (!(pbuf->pb_info.pi_features & LNET_PING_FEAT_RTE_DISABLED))
+		lp->lp_state |= LNET_PEER_ROUTER_ENABLED;
+	else
+		lp->lp_state &= ~LNET_PEER_ROUTER_ENABLED;
+	spin_unlock(&lp->lp_lock);
+
 	nnis = max_t(int, lp->lp_nnis, pbuf->pb_info.pi_nnis);
 	curnis = kmalloc_array(nnis, sizeof(*curnis), GFP_NOFS);
 	addnis = kmalloc_array(nnis, sizeof(*addnis), GFP_NOFS);
-- 
1.8.3.1

* [lustre-devel] [PATCH 337/622] lnet: peer aliveness
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (335 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 336/622] lnet: Cache the routing feature James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 338/622] lnet: router aliveness James Simmons
                   ` (285 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Peer NI aliveness is now solely dependent on the health
infrastructure. With the addition of router_sensitivity_percentage,
a peer NI is considered dead if its health drops below the specified
percentage of the maximum health value. Setting the percentage to
100% means that a peer_ni is considered dead if its interface is
less than fully healthy.

Removed obsolete code that queries the peer NI every second since
the health infrastructure introduces the recovery mechanism which
is designed to recover the health of peer NIs.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11300
Lustre-commit: 8e498d3f23ea ("LU-11300 lnet: peer aliveness")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33186
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h  |  25 ++++++---
 include/linux/lnet/lib-types.h |   2 -
 net/lnet/lnet/lib-move.c       | 124 +++++------------------------------------
 net/lnet/lnet/peer.c           |   7 ++-
 net/lnet/lnet/router.c         |  11 ++--
 net/lnet/lnet/router_proc.c    |   3 +-
 6 files changed, 42 insertions(+), 130 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index eae55d5..d5704b7 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -846,15 +846,6 @@ int lnet_get_peer_ni_info(u32 peer_index, u64 *nid,
 	return NULL;
 }
 
-static inline void
-lnet_peer_set_alive(struct lnet_peer_ni *lp)
-{
-	lp->lpni_last_query = ktime_get_seconds();
-	lp->lpni_last_alive = lp->lpni_last_query;
-	if (!lp->lpni_alive)
-		lnet_notify_locked(lp, 0, 1, lp->lpni_last_alive);
-}
-
 static inline bool
 lnet_peer_is_multi_rail(struct lnet_peer *lp)
 {
@@ -889,6 +880,22 @@ int lnet_get_peer_ni_info(u32 peer_index, u64 *nid,
 	return false;
 }
 
+/*
+ * A peer is alive if it satisfies the following two conditions:
+ *  1. peer health >= LNET_MAX_HEALTH_VALUE * router_sensitivity_percentage
+ *  2. the cached NI status received when we discover the peer is UP
+ */
+static inline bool
+lnet_is_peer_ni_alive(struct lnet_peer_ni *lpni)
+{
+	bool halive = false;
+
+	halive = (atomic_read(&lpni->lpni_healthv) >=
+		 (LNET_MAX_HEALTH_VALUE * router_sensitivity_percentage / 100));
+
+	return halive && lpni->lpni_ns_status == LNET_NI_STATUS_UP;
+}
+
 static inline void
 lnet_inc_healthv(atomic_t *healthv)
 {
diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index ecc6dee..9a09fad 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -553,8 +553,6 @@ struct lnet_peer_ni {
 	int			 lpni_rtrcredits;
 	/* low water mark */
 	int			 lpni_minrtrcredits;
-	/* alive/dead? */
-	bool			 lpni_alive;
 	/* notification outstanding? */
 	bool			 lpni_notify;
 	/* outstanding notification for LND? */
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 99ff882..af3cd1e 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -609,86 +609,16 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 }
 
 /*
- * This function can be called from two paths:
- *	1. when sending a message
- *	2. when decommiting a message (lnet_msg_decommit_tx())
- * In both these cases the peer_ni should have it's reference count
- * acquired by the caller and therefore it is safe to drop the spin
- * lock before calling lnd_query()
- */
-static void
-lnet_ni_query_locked(struct lnet_ni *ni, struct lnet_peer_ni *lp)
-{
-	time64_t last_alive = 0;
-	int cpt = lnet_cpt_of_nid_locked(lp->lpni_nid, ni);
-
-	LASSERT(lnet_peer_aliveness_enabled(lp));
-	LASSERT(ni->ni_net->net_lnd->lnd_query);
-
-	lnet_net_unlock(cpt);
-	ni->ni_net->net_lnd->lnd_query(ni, lp->lpni_nid, &last_alive);
-	lnet_net_lock(cpt);
-
-	lp->lpni_last_query = ktime_get_seconds();
-
-	if (last_alive) /* NI has updated timestamp */
-		lp->lpni_last_alive = last_alive;
-}
-
-/* NB: always called with lnet_net_lock held */
-static inline int
-lnet_peer_is_alive(struct lnet_peer_ni *lp, unsigned long now)
-{
-	int alive;
-	time64_t deadline;
-
-	LASSERT(lnet_peer_aliveness_enabled(lp));
-
-	/* Trust lnet_notify() if it has more recent aliveness news, but
-	 * ignore the initial assumed death (see lnet_peers_start_down()).
-	 */
-	spin_lock(&lp->lpni_lock);
-	if (!lp->lpni_alive && lp->lpni_alive_count > 0 &&
-	    lp->lpni_timestamp >= lp->lpni_last_alive) {
-		spin_unlock(&lp->lpni_lock);
-		return 0;
-	}
-
-	deadline = lp->lpni_last_alive +
-		   lp->lpni_net->net_tunables.lct_peer_timeout;
-	alive = deadline > now;
-
-	/* Update obsolete lpni_alive except for routers assumed to be dead
-	 * initially, because router checker would update aliveness in this
-	 * case, and moreover lpni_last_alive at peer creation is assumed.
-	 */
-	if (alive && !lp->lpni_alive &&
-	    !(lnet_isrouter(lp) && !lp->lpni_alive_count)) {
-		spin_unlock(&lp->lpni_lock);
-		lnet_notify_locked(lp, 0, 1, lp->lpni_last_alive);
-	} else {
-		spin_unlock(&lp->lpni_lock);
-	}
-
-	return alive;
-}
-
-/*
  * NB: returns 1 when alive, 0 when dead, negative when error;
  *     may drop the lnet_net_lock
  */
 static int
-lnet_peer_alive_locked(struct lnet_ni *ni, struct lnet_peer_ni *lp,
+lnet_peer_alive_locked(struct lnet_ni *ni, struct lnet_peer_ni *lpni,
 		       struct lnet_msg *msg)
 {
-	time64_t now = ktime_get_seconds();
-
-	if (!lnet_peer_aliveness_enabled(lp))
+	if (!lnet_peer_aliveness_enabled(lpni))
 		return -ENODEV;
 
-	if (lnet_peer_is_alive(lp, now))
-		return 1;
-
 	/*
 	 * If we're resending a message, let's attempt to send it even if
 	 * the peer is down to fulfill our resend quota on the message
@@ -696,35 +626,16 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	if (msg->msg_retry_count > 0)
 		return 1;
 
-	/*
-	 * Peer appears dead, but we should avoid frequent NI queries (at
-	 * most once per lnet_queryinterval seconds).
-	 */
-	if (lp->lpni_last_query) {
-		static const int lnet_queryinterval = 1;
-		time64_t next_query;
-
-		next_query = lp->lpni_last_query + lnet_queryinterval;
-
-		if (now < next_query) {
-			if (lp->lpni_alive)
-				CWARN("Unexpected aliveness of peer %s: %lld < %lld (%d/%d)\n",
-				      libcfs_nid2str(lp->lpni_nid),
-				      now, next_query,
-				      lnet_queryinterval,
-				      lp->lpni_net->net_tunables.lct_peer_timeout);
-			return 0;
-		}
-	}
-
-	/* query NI for latest aliveness news */
-	lnet_ni_query_locked(ni, lp);
+	/* try and send recovery messages regardless */
+	if (msg->msg_recovery)
+		return 1;
 
-	if (lnet_peer_is_alive(lp, now))
+	/* always send any responses */
+	if (msg->msg_type == LNET_MSG_ACK ||
+	    msg->msg_type == LNET_MSG_REPLY)
 		return 1;
 
-	lnet_notify_locked(lp, 0, 0, lp->lpni_last_alive);
-	return 0;
+	return lnet_is_peer_ni_alive(lpni);
 }
 
 /**
@@ -4184,18 +4095,11 @@ void lnet_monitor_thr_stop(void)
 	/* Multi-Rail: Primary NID of source. */
 	msg->msg_initiator = lnet_peer_primary_nid_locked(src_nid);
 
-	if (lnet_isrouter(msg->msg_rxpeer)) {
-		lnet_peer_set_alive(msg->msg_rxpeer);
-		if (avoid_asym_router_failure &&
-		    LNET_NIDNET(src_nid) != LNET_NIDNET(from_nid)) {
-			/* received a remote message from router, update
-			 * remote NI status on this router.
-			 * NB: multi-hop routed message will be ignored.
-			 */
-			lnet_router_ni_update_locked(msg->msg_rxpeer,
-						     LNET_NIDNET(src_nid));
-		}
-	}
+	/* mark the status of this lpni as UP since we received a message
+	 * from it. The ping response reports back the ns_status which is
+	 * marked on the remote as up or down and we cache it here.
+	 */
+	msg->msg_rxpeer->lpni_ns_status = LNET_NI_STATUS_UP;
 
 	lnet_msg_commit(msg, cpt);
 
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 91ad6b4..8669fbb 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -3296,7 +3296,7 @@ void lnet_peer_discovery_stop(void)
 	}
 
 	if (lnet_isrouter(lp) || lnet_peer_aliveness_enabled(lp))
-		aliveness = lp->lpni_alive ? "up" : "down";
+		aliveness = (lnet_is_peer_ni_alive(lp)) ? "up" : "down";
 
 	CDEBUG(D_WARNING, "%-24s %4d %5s %5d %5d %5d %5d %5d %ld\n",
 	       libcfs_nid2str(lp->lpni_nid), atomic_read(&lp->lpni_refcount),
@@ -3353,7 +3353,8 @@ void lnet_peer_discovery_stop(void)
 			if (lnet_isrouter(lp) ||
 			    lnet_peer_aliveness_enabled(lp))
 				snprintf(aliveness, LNET_MAX_STR_LEN,
-					 lp->lpni_alive ? "up" : "down");
+					 lnet_is_peer_ni_alive(lp)
+					 ? "up" : "down");
 
 			*nid = lp->lpni_nid;
 			*refcount = atomic_read(&lp->lpni_refcount);
@@ -3439,7 +3440,7 @@ int lnet_get_peer_info(struct lnet_ioctl_peer_cfg *cfg, void __user *bulk)
 		if (lnet_isrouter(lpni) ||
 		    lnet_peer_aliveness_enabled(lpni))
 			snprintf(lpni_info->cr_aliveness, LNET_MAX_STR_LEN,
-				 lpni->lpni_alive ? "up" : "down");
+				 lnet_is_peer_ni_alive(lpni) ? "up" : "down");
 
 		lpni_info->cr_refcount = atomic_read(&lpni->lpni_refcount);
 		lpni_info->cr_ni_peer_tx_credits = lpni->lpni_net ?
diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index 40725d2..d5b4914 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -165,8 +165,10 @@ static int rtr_sensitivity_set(const char *val,
 
 	lp->lpni_timestamp = when;		/* update timestamp */
 
-	if (lp->lpni_alive_count &&		/* got old news */
-	    (!lp->lpni_alive) == (!alive)) {	/* new date for old news */
+	/* got old news */
+	if (lp->lpni_alive_count != 0 &&
+	    /* new date for old news */
+	    (!lnet_is_peer_ni_alive(lp)) == !alive) {
 		spin_unlock(&lp->lpni_lock);
 		CDEBUG(D_NET, "Old news\n");
 		return;
@@ -175,10 +177,9 @@ static int rtr_sensitivity_set(const char *val,
 	/* Flag that notification is outstanding */
 
 	lp->lpni_alive_count++;
-	lp->lpni_alive = !!alive;	/* 1 bit! */
 	lp->lpni_notify = 1;
 	lp->lpni_notifylnd = notifylnd;
-	if (lp->lpni_alive)
+	if (lnet_is_peer_ni_alive(lp))
 		lp->lpni_ping_feats = LNET_PING_FEAT_INVAL; /* reset */
 
 	spin_unlock(&lp->lpni_lock);
@@ -214,7 +215,7 @@ static int rtr_sensitivity_set(const char *val,
 	 * lnet_notify_locked().
 	 */
 	while (lp->lpni_notify) {
-		alive = lp->lpni_alive;
+		alive = lnet_is_peer_ni_alive(lp);
 		notifylnd = lp->lpni_notifylnd;
 
 		lp->lpni_notifylnd = 0;
diff --git a/net/lnet/lnet/router_proc.c b/net/lnet/lnet/router_proc.c
index d41ff00..e9aef1e 100644
--- a/net/lnet/lnet/router_proc.c
+++ b/net/lnet/lnet/router_proc.c
@@ -529,7 +529,8 @@ static int proc_lnet_peers(struct ctl_table *table, int write,
 
 			if (lnet_isrouter(peer) ||
 			    lnet_peer_aliveness_enabled(peer))
-				aliveness = peer->lpni_alive ? "up" : "down";
+				aliveness = lnet_is_peer_ni_alive(peer) ?
+					"up" : "down";
 
 			if (lnet_peer_aliveness_enabled(peer)) {
 				time64_t now = ktime_get_seconds();
-- 
1.8.3.1

* [lustre-devel] [PATCH 338/622] lnet: router aliveness
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (336 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 337/622] lnet: peer aliveness James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 339/622] lnet: simplify lnet_handle_local_failure() James Simmons
                   ` (284 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

A route is considered alive if the gateway is able to route
messages from the local to the remote net. That means that
at least one of the network interfaces on the remote net of
the gateway is viable.

Introduced the concept of a sensitivity percentage, which defaults
to 100%. It holds a dual meaning:
1. A route is considered alive if at least one of its interfaces'
health is >= LNET_MAX_HEALTH_VALUE * router_sensitivity_percentage.
100 means at least one interface has to be 100% healthy.
2. On a router, consider a peer_ni dead if its health is not at least
LNET_MAX_HEALTH_VALUE * router_sensitivity_percentage.
100% means the interface has to be 100% healthy.

Re-implemented lnet_notify() to decrement the health of the
peer interface if the LND reports a failure on that peer.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11300
Lustre-commit: 21d2252648be ("LU-11300 lnet: router aliveness")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33185
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h | 11 ++-----
 net/lnet/lnet/router.c        | 74 +++++++++++++++++++++++++++++++++++++++++++
 net/lnet/lnet/router_proc.c   |  2 +-
 3 files changed, 77 insertions(+), 10 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index d5704b7..0007adf 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -90,15 +90,8 @@
 						  */
 #define LNET_LND_DEFAULT_TIMEOUT 5
 
-static inline int lnet_is_route_alive(struct lnet_route *route)
-{
-	/* TODO re-implement gateway alive indication */
-	CDEBUG(D_NET, "TODO: reimplement routing. gateway = %s\n",
-	       route->lr_gateway ?
-		libcfs_nid2str(route->lr_gateway->lp_primary_nid) :
-		"undefined");
-	return 1;
-}
+bool lnet_is_route_alive(struct lnet_route *route);
+bool lnet_is_gateway_alive(struct lnet_peer *gw);
 
 static inline int lnet_is_wire_handle_none(struct lnet_handle_wire *wh)
 {
diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index d5b4914..bb92759 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -146,6 +146,80 @@ static int rtr_sensitivity_set(const char *val,
 	return check_routers_before_use;
 }
 
+/* A net is alive if at least one gateway NI on the network is alive. */
+static bool
+lnet_is_gateway_net_alive(struct lnet_peer_net *lpn)
+{
+	struct lnet_peer_ni *lpni;
+
+	list_for_each_entry(lpni, &lpn->lpn_peer_nis, lpni_peer_nis) {
+		if (lnet_is_peer_ni_alive(lpni))
+			return true;
+	}
+
+	return false;
+}
+
+/* a gateway is alive only if all its nets are alive
+ * called with cpt lock held
+ */
+bool lnet_is_gateway_alive(struct lnet_peer *gw)
+{
+	struct lnet_peer_net *lpn;
+
+	list_for_each_entry(lpn, &gw->lp_peer_nets, lpn_peer_nets) {
+		if (!lnet_is_gateway_net_alive(lpn))
+			return false;
+	}
+
+	return true;
+}
+
+/* lnet_is_route_alive() needs to be called with cpt lock held
+ * A route is alive if the gateway can route between the local network and
+ * the remote network of the route.
+ * This means at least one NI is alive on each of the local and remote
+ * networks of the gateway.
+ */
+bool lnet_is_route_alive(struct lnet_route *route)
+{
+	struct lnet_peer *gw = route->lr_gateway;
+	struct lnet_peer_net *llpn;
+	struct lnet_peer_net *rlpn;
+	bool route_alive;
+
+	/* check the gateway's interfaces on the route rnet to make sure
+	 * that the gateway is viable.
+	 */
+	llpn = lnet_peer_get_net_locked(gw, route->lr_lnet);
+	if (!llpn)
+		return false;
+
+	route_alive = lnet_is_gateway_net_alive(llpn);
+
+	if (avoid_asym_router_failure) {
+		rlpn = lnet_peer_get_net_locked(gw, route->lr_net);
+		if (!rlpn)
+			return false;
+		route_alive = route_alive &&
+			      lnet_is_gateway_net_alive(rlpn);
+	}
+
+	if (!route_alive)
+		return route_alive;
+
+	spin_lock(&gw->lp_lock);
+	if (!(gw->lp_state & LNET_PEER_ROUTER_ENABLED)) {
+		if (gw->lp_rtr_refcount > 0)
+			CERROR("peer %s is being used as a gateway but routing feature is not turned on\n",
+			       libcfs_nid2str(gw->lp_primary_nid));
+		route_alive = false;
+	}
+	spin_unlock(&gw->lp_lock);
+
+	return route_alive;
+}
+
 void
 lnet_notify_locked(struct lnet_peer_ni *lp, int notifylnd, int alive,
 		   time64_t when)
diff --git a/net/lnet/lnet/router_proc.c b/net/lnet/lnet/router_proc.c
index e9aef1e..3120533 100644
--- a/net/lnet/lnet/router_proc.c
+++ b/net/lnet/lnet/router_proc.c
@@ -325,7 +325,7 @@ static int proc_lnet_routers(struct ctl_table *table, int write,
 			int nrefs = atomic_read(&peer->lp_refcount);
 			int nrtrrefs = peer->lp_rtr_refcount;
 			int alive_cnt = 0;
-			int alive = 0;
+			int alive = lnet_is_gateway_alive(peer);
 			int pingsent = ((peer->lp_state & LNET_PEER_PING_SENT)
 				       != 0);
 			time64_t last_ping = now - peer->lp_rtrcheck_timestamp;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 339/622] lnet: simplify lnet_handle_local_failure()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (337 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 338/622] lnet: router aliveness James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 340/622] lnet: Cleanup rcd James Simmons
                   ` (283 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Pass the struct lnet_ni to lnet_handle_local_failure() instead of the
message structure, since nothing else from the message is being
used. This also makes it symmetrical with lnet_handle_remote_failure().

WC-bug-id: https://jira.whamcloud.com/browse/LU-11300
Lustre-commit: f8c7dd6f5374 ("LU-11300 lnet: simplify lnet_handle_local_failure()")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33452
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-msg.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index e4253de..23c3bf4 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -461,12 +461,8 @@
 }
 
 static void
-lnet_handle_local_failure(struct lnet_msg *msg)
+lnet_handle_local_failure(struct lnet_ni *local_ni)
 {
-	struct lnet_ni *local_ni;
-
-	local_ni = msg->msg_txni;
-
 	/* the lnet_net_lock(0) is used to protect the addref on the ni
 	 * and the recovery queue.
 	 */
@@ -652,7 +648,7 @@
 	case LNET_MSG_STATUS_LOCAL_ABORTED:
 	case LNET_MSG_STATUS_LOCAL_NO_ROUTE:
 	case LNET_MSG_STATUS_LOCAL_TIMEOUT:
-		lnet_handle_local_failure(msg);
+		lnet_handle_local_failure(msg->msg_txni);
 		/* add to the re-send queue */
 		goto resend;
 
@@ -660,7 +656,7 @@
 	 * finalize the message
 	 */
 	case LNET_MSG_STATUS_LOCAL_ERROR:
-		lnet_handle_local_failure(msg);
+		lnet_handle_local_failure(msg->msg_txni);
 		return -1;
 
 	/* TODO: since the remote dropped the message we can
-- 
1.8.3.1


* [lustre-devel] [PATCH 340/622] lnet: Cleanup rcd
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (338 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 339/622] lnet: simplify lnet_handle_local_failure() James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 341/622] lnet: modify lnd notification mechanism James Simmons
                   ` (282 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Clean up all code pertaining to rcd, since the routing code will use
discovery going forward and there will be no need for it to keep its
own pinging code.

test_215 looks at the routers file which had its format changed.
Update the test to reflect the change.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11299
Lustre-commit: 9ee453928ab8 ("LU-11299 lnet: Cleanup rcd")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33187
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h  |   4 -
 include/linux/lnet/lib-types.h |  40 +------
 net/lnet/lnet/api-ni.c         |  24 +++-
 net/lnet/lnet/lib-move.c       |  11 --
 net/lnet/lnet/router.c         | 255 -----------------------------------------
 net/lnet/lnet/router_proc.c    |  66 ++---------
 6 files changed, 31 insertions(+), 369 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index 0007adf..8730670 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -748,11 +748,7 @@ int lnet_sock_connect(struct socket **sockp, int *fatal,
 
 bool lnet_router_checker_active(void);
 void lnet_check_routers(void);
-int lnet_router_pre_mt_start(void);
 void lnet_router_post_mt_start(void);
-void lnet_prune_rc_data(int wait_unlink);
-void lnet_router_cleanup(void);
-void lnet_router_ni_update_locked(struct lnet_peer_ni *gw, u32 net);
 void lnet_swap_pinginfo(struct lnet_ping_buffer *pbuf);
 
 int lnet_ping_info_validate(struct lnet_ping_info *pinfo);
diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index 9a09fad..495e805 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -509,20 +509,6 @@ struct lnet_ping_buffer {
 #define LNET_PING_INFO_TO_BUFFER(PINFO)	\
 	container_of((PINFO), struct lnet_ping_buffer, pb_info)
 
-/* router checker data, per router */
-struct lnet_rc_data {
-	/* chain on the_lnet.ln_zombie_rcd or ln_deathrow_rcd */
-	struct list_head	rcd_list;
-	/* ping buffer MD */
-	struct lnet_handle_md	rcd_mdh;
-	/* reference to gateway */
-	struct lnet_peer_ni	*rcd_gateway;
-	/* ping buffer */
-	struct lnet_ping_buffer	*rcd_pingbuffer;
-	/* desired size of buffer */
-	int			rcd_nnis;
-};
-
 struct lnet_peer_ni {
 	/* chain on lpn_peer_nis */
 	struct list_head	 lpni_peer_nis;
@@ -553,22 +539,8 @@ struct lnet_peer_ni {
 	int			 lpni_rtrcredits;
 	/* low water mark */
 	int			 lpni_minrtrcredits;
-	/* notification outstanding? */
-	bool			 lpni_notify;
-	/* outstanding notification for LND? */
-	bool			 lpni_notifylnd;
-	/* some thread is handling notification */
-	bool			 lpni_notifying;
-	/* # times router went dead<->alive */
-	int			 lpni_alive_count;
-	/* ytes queued for sending */
+	/* bytes queued for sending */
 	long			 lpni_txqnob;
-	/* time of last aliveness news */
-	time64_t		 lpni_timestamp;
-	/* when I was last alive */
-	time64_t		 lpni_last_alive;
-	/* when lpni_ni was queried last time */
-	time64_t		 lpni_last_query;
 	/* network peer is on */
 	struct lnet_net		*lpni_net;
 	/* peer's NID */
@@ -598,8 +570,6 @@ struct lnet_peer_ni {
 	} lpni_pref;
 	/* number of preferred NIDs in lnpi_pref_nids */
 	u32			 lpni_pref_nnids;
-	/* router checker state */
-	struct lnet_rc_data	*lpni_rcd;
 };
 
 /* Preferred path added due to traffic on non-MR peer_ni */
@@ -823,8 +793,6 @@ struct lnet_route {
 	u32			lr_lnet;
 	/* sequence for round-robin */
 	int			lr_seq;
-	/* number of down NIs */
-	unsigned int		lr_downis;
 	/* how far I am */
 	u32			lr_hops;
 	/* route priority */
@@ -1115,12 +1083,6 @@ struct lnet {
 
 	/* monitor thread startup/shutdown state */
 	enum lnet_rc_state		ln_mt_state;
-	/* router checker's event queue */
-	struct lnet_handle_eq		ln_rc_eqh;
-	/* rcd still pending on net */
-	struct list_head		ln_rcd_deathrow;
-	/* rcd ready for free */
-	struct list_head		ln_rcd_zombie;
 	/* serialise startup/shutdown */
 	struct completion		ln_mt_signal;
 
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index d27e9a4..32b4b4f 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -1457,6 +1457,27 @@ struct lnet_ping_buffer *
 	return count;
 }
 
+void
+lnet_swap_pinginfo(struct lnet_ping_buffer *pbuf)
+{
+	struct lnet_ni_status *stat;
+	int nnis;
+	int i;
+
+	__swab32s(&pbuf->pb_info.pi_magic);
+	__swab32s(&pbuf->pb_info.pi_features);
+	__swab32s(&pbuf->pb_info.pi_pid);
+	__swab32s(&pbuf->pb_info.pi_nnis);
+	nnis = pbuf->pb_info.pi_nnis;
+	if (nnis > pbuf->pb_nnis)
+		nnis = pbuf->pb_nnis;
+	for (i = 0; i < nnis; i++) {
+		stat = &pbuf->pb_info.pi_ni[i];
+		__swab64s(&stat->ns_nid);
+		__swab32s(&stat->ns_status);
+	}
+}
+
 int
 lnet_ping_info_validate(struct lnet_ping_info *pinfo)
 {
@@ -2362,12 +2383,9 @@ int lnet_lib_init(void)
 	}
 
 	the_lnet.ln_refcount = 0;
-	LNetInvalidateEQHandle(&the_lnet.ln_rc_eqh);
 	INIT_LIST_HEAD(&the_lnet.ln_lnds);
 	INIT_LIST_HEAD(&the_lnet.ln_net_zombie);
-	INIT_LIST_HEAD(&the_lnet.ln_rcd_zombie);
 	INIT_LIST_HEAD(&the_lnet.ln_msg_resend);
-	INIT_LIST_HEAD(&the_lnet.ln_rcd_deathrow);
 
 	/*
 	 * The hash table size is the number of bits it takes to express the set
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index af3cd1e..2e2299d 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -3151,9 +3151,6 @@ struct lnet_mt_event_info {
 						 false, HZ * interval);
 	}
 
-	/* clean up the router checker */
-	lnet_prune_rc_data(1);
-
 	/* Shutting down */
 	lnet_net_lock(LNET_LOCK_EX);
 	the_lnet.ln_mt_state = LNET_MT_STATE_SHUTDOWN;
@@ -3364,11 +3361,6 @@ int lnet_monitor_thr_start(void)
 	if (rc)
 		goto clean_queues;
 
-	/* Pre monitor thread start processing */
-	rc = lnet_router_pre_mt_start();
-	if (rc)
-		goto free_mem;
-
 	init_completion(&the_lnet.ln_mt_signal);
 
 	lnet_net_lock(LNET_LOCK_EX);
@@ -3393,8 +3385,6 @@ int lnet_monitor_thr_start(void)
 	/* block until event callback signals exit */
 	wait_for_completion(&the_lnet.ln_mt_signal);
 	/* clean up */
-	lnet_router_cleanup();
-free_mem:
 	lnet_net_lock(LNET_LOCK_EX);
 	the_lnet.ln_mt_state = LNET_MT_STATE_SHUTDOWN;
 	lnet_net_unlock(LNET_LOCK_EX);
@@ -3430,7 +3420,6 @@ void lnet_monitor_thr_stop(void)
 	LASSERT(the_lnet.ln_mt_state == LNET_MT_STATE_SHUTDOWN);
 
 	/* perform cleanup tasks */
-	lnet_router_cleanup();
 	lnet_rsp_tracker_clean();
 	lnet_clean_local_ni_recoveryq();
 	lnet_clean_peer_ni_recoveryq();
diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index bb92759..1399545 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -220,101 +220,6 @@ bool lnet_is_route_alive(struct lnet_route *route)
 	return route_alive;
 }
 
-void
-lnet_notify_locked(struct lnet_peer_ni *lp, int notifylnd, int alive,
-		   time64_t when)
-{
-	if (lp->lpni_timestamp > when) { /* out of date information */
-		CDEBUG(D_NET, "Out of date\n");
-		return;
-	}
-
-	/*
-	 * This function can be called with different cpt locks being
-	 * held. lpni_alive_count modification needs to be properly protected.
-	 * Significant reads to lpni_alive_count are also protected with
-	 * the same lock
-	 */
-	spin_lock(&lp->lpni_lock);
-
-	lp->lpni_timestamp = when;		/* update timestamp */
-
-	/* got old news */
-	if (lp->lpni_alive_count != 0 &&
-	    /* new date for old news */
-	    (!lnet_is_peer_ni_alive(lp)) == !alive) {
-		spin_unlock(&lp->lpni_lock);
-		CDEBUG(D_NET, "Old news\n");
-		return;
-	}
-
-	/* Flag that notification is outstanding */
-
-	lp->lpni_alive_count++;
-	lp->lpni_notify = 1;
-	lp->lpni_notifylnd = notifylnd;
-	if (lnet_is_peer_ni_alive(lp))
-		lp->lpni_ping_feats = LNET_PING_FEAT_INVAL; /* reset */
-
-	spin_unlock(&lp->lpni_lock);
-
-	CDEBUG(D_NET, "set %s %d\n", libcfs_nid2str(lp->lpni_nid), alive);
-}
-
-/*
- * This function will always be called with lp->lpni_cpt lock held.
- */
-static void
-lnet_ni_notify_locked(struct lnet_ni *ni, struct lnet_peer_ni *lp)
-{
-	int alive;
-	int notifylnd;
-
-	/*
-	 * Notify only in 1 thread at any time to ensure ordered notification.
-	 * NB individual events can be missed; the only guarantee is that you
-	 * always get the most recent news
-	 */
-	spin_lock(&lp->lpni_lock);
-
-	if (lp->lpni_notifying || !ni) {
-		spin_unlock(&lp->lpni_lock);
-		return;
-	}
-
-	lp->lpni_notifying = 1;
-
-	/*
-	 * lp->lpni_notify needs to be protected because it can be set in
-	 * lnet_notify_locked().
-	 */
-	while (lp->lpni_notify) {
-		alive = lnet_is_peer_ni_alive(lp);
-		notifylnd = lp->lpni_notifylnd;
-
-		lp->lpni_notifylnd = 0;
-		lp->lpni_notify = 0;
-
-		if (notifylnd && ni->ni_net->net_lnd->lnd_notify) {
-			spin_unlock(&lp->lpni_lock);
-			lnet_net_unlock(lp->lpni_cpt);
-
-			/*
-			 * A new notification could happen now; I'll handle it
-			 * when control returns to me
-			 */
-			ni->ni_net->net_lnd->lnd_notify(ni, lp->lpni_nid,
-							alive);
-
-			lnet_net_lock(lp->lpni_cpt);
-			spin_lock(&lp->lpni_lock);
-		}
-	}
-
-	lp->lpni_notifying = 0;
-	spin_unlock(&lp->lpni_lock);
-}
-
 static void
 lnet_rtr_addref_locked(struct lnet_peer *lp)
 {
@@ -721,93 +626,6 @@ int lnet_get_rtr_pool_cfg(int cpt, struct lnet_ioctl_pool_cfg *pool_cfg)
 	return -ENOENT;
 }
 
-void
-lnet_swap_pinginfo(struct lnet_ping_buffer *pbuf)
-{
-	struct lnet_ni_status *stat;
-	int nnis;
-	int i;
-
-	__swab32s(&pbuf->pb_info.pi_magic);
-	__swab32s(&pbuf->pb_info.pi_features);
-	__swab32s(&pbuf->pb_info.pi_pid);
-	__swab32s(&pbuf->pb_info.pi_nnis);
-	nnis = pbuf->pb_info.pi_nnis;
-	if (nnis > pbuf->pb_nnis)
-		nnis = pbuf->pb_nnis;
-	for (i = 0; i < nnis; i++) {
-		stat = &pbuf->pb_info.pi_ni[i];
-		__swab64s(&stat->ns_nid);
-		__swab32s(&stat->ns_status);
-	}
-}
-
-/**
- * TODO: re-implement
- */
-static void
-lnet_parse_rc_info(struct lnet_rc_data *rcd)
-{
-	rcd = rcd;
-}
-
-static void
-lnet_router_checker_event(struct lnet_event *event)
-{
-	struct lnet_rc_data *rcd = event->md.user_ptr;
-	struct lnet_peer_ni *lp;
-
-	LASSERT(rcd);
-
-	if (event->unlinked) {
-		LNetInvalidateMDHandle(&rcd->rcd_mdh);
-		return;
-	}
-
-	LASSERT(event->type == LNET_EVENT_SEND ||
-		event->type == LNET_EVENT_REPLY);
-
-	lp = rcd->rcd_gateway;
-	LASSERT(lp);
-
-	/*
-	 * NB: it's called with holding lnet_res_lock, we have a few
-	 * places need to hold both locks at the same time, please take
-	 * care of lock ordering
-	 */
-	lnet_net_lock(lp->lpni_cpt);
-	if (!lnet_isrouter(lp) || lp->lpni_rcd != rcd) {
-		/* ignore if no longer a router or rcd is replaced */
-		goto out;
-	}
-
-	if (event->type == LNET_EVENT_SEND) {
-		if (!event->status)
-			goto out;
-	}
-
-	/* LNET_EVENT_REPLY */
-	/*
-	 * A successful REPLY means the router is up.  If _any_ comms
-	 * to the router fail I assume it's down (this will happen if
-	 * we ping alive routers to try to detect router death before
-	 * apps get burned).
-	 */
-	lnet_notify_locked(lp, 1, !event->status, ktime_get_seconds());
-
-	/*
-	 * The router checker will wake up very shortly and do the
-	 * actual notification.
-	 * XXX If 'lp' stops being a router before then, it will still
-	 * have the notification pending!!!
-	 */
-	if (avoid_asym_router_failure && !event->status)
-		lnet_parse_rc_info(rcd);
-
-out:
-	lnet_net_unlock(lp->lpni_cpt);
-}
-
 static void
 lnet_wait_known_routerstate(void)
 {
@@ -840,26 +658,6 @@ int lnet_get_rtr_pool_cfg(int cpt, struct lnet_ioctl_pool_cfg *pool_cfg)
 	}
 }
 
-/* TODO: reimplement */
-void
-lnet_router_ni_update_locked(struct lnet_peer_ni *gw, u32 net)
-{
-	struct lnet_route *rte;
-	struct lnet_peer *lp;
-
-	if ((gw->lpni_ping_feats & LNET_PING_FEAT_NI_STATUS))
-		lp = gw->lpni_peer_net->lpn_peer;
-	else
-		return;
-
-	list_for_each_entry(rte, &lp->lp_routes, lr_gwlist) {
-		if (rte->lr_net == net) {
-			rte->lr_downis = 0;
-			break;
-		}
-	}
-}
-
 static void
 lnet_update_ni_status_locked(void)
 {
@@ -902,25 +700,6 @@ int lnet_get_rtr_pool_cfg(int cpt, struct lnet_ioctl_pool_cfg *pool_cfg)
 	}
 }
 
-int lnet_router_pre_mt_start(void)
-{
-	int rc;
-
-	if (check_routers_before_use &&
-	    dead_router_check_interval <= 0) {
-		LCONSOLE_ERROR_MSG(0x10a, "'dead_router_check_interval' must be set if 'check_routers_before_use' is set\n");
-		return -EINVAL;
-	}
-
-	rc = LNetEQAlloc(0, lnet_router_checker_event, &the_lnet.ln_rc_eqh);
-	if (rc) {
-		CERROR("Can't allocate EQ(0): %d\n", rc);
-		return -ENOMEM;
-	}
-
-	return 0;
-}
-
 void lnet_router_post_mt_start(void)
 {
 	if (check_routers_before_use) {
@@ -933,19 +712,6 @@ void lnet_router_post_mt_start(void)
 	}
 }
 
-void lnet_router_cleanup(void)
-{
-	int rc;
-
-	rc = LNetEQFree(the_lnet.ln_rc_eqh);
-	LASSERT(rc == 0);
-}
-
-void lnet_prune_rc_data(int wait_unlink)
-{
-	wait_unlink = wait_unlink;
-}
-
 /*
  * This function is called from the monitor thread to check if there are
  * any active routers that need to be checked.
@@ -962,11 +728,6 @@ bool lnet_router_checker_active(void)
 	if (the_lnet.ln_routing)
 		return true;
 
-	/* if there are routers that need to be cleaned up then do so */
-	if (!list_empty(&the_lnet.ln_rcd_deathrow) ||
-	    !list_empty(&the_lnet.ln_rcd_zombie))
-		return true;
-
 	return !list_empty(&the_lnet.ln_routers) &&
 		(live_router_check_interval > 0 ||
 		 dead_router_check_interval > 0);
@@ -997,8 +758,6 @@ bool lnet_router_checker_active(void)
 		lnet_update_ni_status_locked();
 
 	lnet_net_unlock(cpt);
-
-	lnet_prune_rc_data(0); /* don't wait for UNLINK */
 }
 
 void
@@ -1503,20 +1262,6 @@ bool lnet_router_checker_active(void)
 		lnet_net_lock(cpt);
 	}
 
-	/*
-	 * We can't fully trust LND on reporting exact peer last_alive
-	 * if he notifies us about dead peer. For example ksocklnd can
-	 * call us with when == _time_when_the_node_was_booted_ if
-	 * no connections were successfully established
-	 */
-	if (ni && !alive && when < lp->lpni_last_alive)
-		when = lp->lpni_last_alive;
-
-	lnet_notify_locked(lp, !ni, alive, when);
-
-	if (ni)
-		lnet_ni_notify_locked(ni, lp);
-
 	lnet_peer_ni_decref_locked(lp);
 
 	lnet_net_unlock(cpt);
diff --git a/net/lnet/lnet/router_proc.c b/net/lnet/lnet/router_proc.c
index 3120533..e494d19 100644
--- a/net/lnet/lnet/router_proc.c
+++ b/net/lnet/lnet/router_proc.c
@@ -215,7 +215,6 @@ static int proc_lnet_routes(struct ctl_table *table, int write,
 			u32 net = rnet->lrn_net;
 			u32 hops = route->lr_hops;
 			unsigned int priority = route->lr_priority;
-			lnet_nid_t nid = route->lr_gateway->lp_primary_nid;
 			int alive = lnet_is_route_alive(route);
 
 			s += snprintf(s, tmpstr + tmpsiz - s,
@@ -223,7 +222,8 @@ static int proc_lnet_routes(struct ctl_table *table, int write,
 				      libcfs_net2str(net), hops,
 				      priority,
 				      alive ? "up" : "down",
-				      libcfs_nid2str(nid));
+				      /* TODO: replace with actual nid */
+				      libcfs_nid2str(LNET_NID_ANY));
 			LASSERT(tmpstr + tmpsiz - s > 0);
 		}
 
@@ -278,10 +278,8 @@ static int proc_lnet_routers(struct ctl_table *table, int write,
 
 	if (!*ppos) {
 		s += snprintf(s, tmpstr + tmpsiz - s,
-			      "%-4s %7s %9s %6s %12s %9s %8s %7s %s\n",
-			      "ref", "rtr_ref", "alive_cnt", "state",
-			      "last_ping", "ping_sent", "deadline",
-			      "down_ni", "router");
+			      "%-4s %7s %5s %s\n",
+			      "ref", "rtr_ref", "alive", "router");
 		LASSERT(tmpstr + tmpsiz - s > 0);
 
 		lnet_net_lock(0);
@@ -319,48 +317,15 @@ static int proc_lnet_routers(struct ctl_table *table, int write,
 
 		if (peer) {
 			lnet_nid_t nid = peer->lp_primary_nid;
-			time64_t now = ktime_get_seconds();
-			/* TODO: readjust what's being printed */
-			time64_t deadline = 0;
 			int nrefs = atomic_read(&peer->lp_refcount);
 			int nrtrrefs = peer->lp_rtr_refcount;
-			int alive_cnt = 0;
 			int alive = lnet_is_gateway_alive(peer);
-			int pingsent = ((peer->lp_state & LNET_PEER_PING_SENT)
-				       != 0);
-			time64_t last_ping = now - peer->lp_rtrcheck_timestamp;
-			int down_ni = 0;
-			struct lnet_route *rtr;
-
-			if (nrtrrefs > 0) {
-				list_for_each_entry(rtr, &peer->lp_routes,
-						    lr_gwlist) {
-					/*
-					 * downis on any route should be the
-					 * number of downis on the gateway
-					 */
-					if (rtr->lr_downis) {
-						down_ni = rtr->lr_downis;
-						break;
-					}
-				}
-			}
 
-			if (!deadline)
-				s += snprintf(s, tmpstr + tmpsiz - s,
-					      "%-4d %7d %9d %6s %12llu %9d %8s %7d %s\n",
-					      nrefs, nrtrrefs, alive_cnt,
-					      alive ? "up" : "down", last_ping,
-					      pingsent, "NA", down_ni,
-					      libcfs_nid2str(nid));
-			else
-				s += snprintf(s, tmpstr + tmpsiz - s,
-					      "%-4d %7d %9d %6s %12llu %9d %8llu %7d %s\n",
-					      nrefs, nrtrrefs, alive_cnt,
-					      alive ? "up" : "down", last_ping,
-					      pingsent, deadline - now,
-					      down_ni, libcfs_nid2str(nid));
-			LASSERT(tmpstr + tmpsiz - s > 0);
+			s += snprintf(s, tmpstr + tmpsiz - s,
+				      "%-4d %7d %5s %s\n",
+				      nrefs, nrtrrefs,
+				      alive ? "up" : "down",
+				      libcfs_nid2str(nid));
 		}
 
 		lnet_net_unlock(0);
@@ -532,19 +497,6 @@ static int proc_lnet_peers(struct ctl_table *table, int write,
 				aliveness = lnet_is_peer_ni_alive(peer) ?
 					"up" : "down";
 
-			if (lnet_peer_aliveness_enabled(peer)) {
-				time64_t now = ktime_get_seconds();
-
-				lastalive = now - peer->lpni_last_alive;
-
-				/* No need to mess up peers contents with
-				 * arbitrarily long integers - it suffices to
-				 * know that lastalive is more than 10000s old
-				 */
-				if (lastalive >= 10000)
-					lastalive = 9999;
-			}
-
 			lnet_net_unlock(cpt);
 
 			s += snprintf(s, tmpstr + tmpsiz - s,
-- 
1.8.3.1


* [lustre-devel] [PATCH 341/622] lnet: modify lnd notification mechanism
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (339 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 340/622] lnet: Cleanup rcd James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 342/622] lnet: use discovery for routing James Simmons
                   ` (281 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

The LND notifies LNet when a peer is up or down. If the LND notifies
LNet that the peer is up and sets the "reset" flag to true, this
indicates that the LND knows about the health of the peer and is
telling LNet that the peer is fully healthy. LNet then sets the
health value of the peer to the maximum; otherwise it increments the
health by one.

If the LND notifies LNet that the peer is down, LNet decrements the
health of the peer by the configured sensitivity value.

LNet then rechecks the peer's aliveness and, if it is dead, notifies
the LND. This code path is only used by the socklnd because it needs
to tear down connections. This is in keeping with the original
functionality.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11299
Lustre-commit: b34e754c1a0b ("LU-11299 lnet: modify lnd notification mechanism")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33453
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h       |  8 ++++-
 include/linux/lnet/lib-types.h      |  4 +--
 net/lnet/klnds/o2iblnd/o2iblnd_cb.c |  2 +-
 net/lnet/klnds/socklnd/socklnd.c    | 21 ++++++-------
 net/lnet/klnds/socklnd/socklnd.h    |  2 +-
 net/lnet/lnet/api-ni.c              |  2 +-
 net/lnet/lnet/router.c              | 60 +++++++++++++++++++++++++------------
 7 files changed, 62 insertions(+), 37 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index 8730670..94918d3 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -506,7 +506,7 @@ struct lnet_ni *
 
 void lnet_mt_event_handler(struct lnet_event *event);
 
-int lnet_notify(struct lnet_ni *ni, lnet_nid_t peer, int alive,
+int lnet_notify(struct lnet_ni *ni, lnet_nid_t peer, bool alive, bool reset,
 		time64_t when);
 void lnet_notify_locked(struct lnet_peer_ni *lp, int notifylnd, int alive,
 			time64_t when);
@@ -886,6 +886,12 @@ int lnet_get_peer_ni_info(u32 peer_index, u64 *nid,
 }
 
 static inline void
+lnet_set_healthv(atomic_t *healthv, int value)
+{
+	atomic_set(healthv, value);
+}
+
+static inline void
 lnet_inc_healthv(atomic_t *healthv)
 {
 	atomic_add_unless(healthv, 1, LNET_MAX_HEALTH_VALUE);
diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index 495e805..2d5ae21 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -298,8 +298,8 @@ struct lnet_lnd {
 	int (*lnd_eager_recv)(struct lnet_ni *ni, void *private,
 			      struct lnet_msg *msg, void **new_privatep);
 
-	/* notification of peer health */
-	void (*lnd_notify)(struct lnet_ni *ni, lnet_nid_t peer, int alive);
+	/* notification of peer down */
+	void (*lnd_notify_peer_down)(lnet_nid_t peer);
 
 	/* query of peer aliveness */
 	void (*lnd_query)(struct lnet_ni *ni, lnet_nid_t peer, time64_t *when);
diff --git a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
index a3abbb6..69918cf 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -1960,7 +1960,7 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 
 	if (error)
 		lnet_notify(peer_ni->ibp_ni,
-			    peer_ni->ibp_nid, 0, last_alive);
+			    peer_ni->ibp_nid, false, false, last_alive);
 }
 
 void
diff --git a/net/lnet/klnds/socklnd/socklnd.c b/net/lnet/klnds/socklnd/socklnd.c
index 8b283ac..0f5c7fc 100644
--- a/net/lnet/klnds/socklnd/socklnd.c
+++ b/net/lnet/klnds/socklnd/socklnd.c
@@ -1518,8 +1518,8 @@ struct ksock_peer *
 	read_unlock(&ksocknal_data.ksnd_global_lock);
 
 	if (notify)
-		lnet_notify(peer_ni->ksnp_ni, peer_ni->ksnp_id.nid, 0,
-			    last_alive);
+		lnet_notify(peer_ni->ksnp_ni, peer_ni->ksnp_id.nid,
+			    false, false, last_alive);
 }
 
 void
@@ -1787,7 +1787,7 @@ struct ksock_peer *
 }
 
 void
-ksocknal_notify(struct lnet_ni *ni, lnet_nid_t gw_nid, int alive)
+ksocknal_notify_gw_down(lnet_nid_t gw_nid)
 {
 	/*
 	 * The router is telling me she's been notified of a change in
@@ -1798,17 +1798,14 @@ struct ksock_peer *
 	id.nid = gw_nid;
 	id.pid = LNET_PID_ANY;
 
-	CDEBUG(D_NET, "gw %s %s\n", libcfs_nid2str(gw_nid),
-	       alive ? "up" : "down");
+	CDEBUG(D_NET, "gw %s down\n", libcfs_nid2str(gw_nid));
 
-	if (!alive) {
-		/* If the gateway crashed, close all open connections... */
-		ksocknal_close_matching_conns(id, 0);
-		return;
-	}
+	/* If the gateway crashed, close all open connections... */
+	ksocknal_close_matching_conns(id, 0);
+	return;
 
 	/*
-	 * ...otherwise do nothing.  We can only establish new connections
+	 * We can only establish new connections
 	 * if we have autroutes, and these connect on demand.
 	 */
 }
@@ -2839,7 +2836,7 @@ static int __init ksocklnd_init(void)
 	the_ksocklnd.lnd_ctl = ksocknal_ctl;
 	the_ksocklnd.lnd_send = ksocknal_send;
 	the_ksocklnd.lnd_recv = ksocknal_recv;
-	the_ksocklnd.lnd_notify = ksocknal_notify;
+	the_ksocklnd.lnd_notify_peer_down = ksocknal_notify_gw_down;
 	the_ksocklnd.lnd_query = ksocknal_query;
 	the_ksocklnd.lnd_accept = ksocknal_accept;
 
diff --git a/net/lnet/klnds/socklnd/socklnd.h b/net/lnet/klnds/socklnd/socklnd.h
index 2e292f0..80c2e19 100644
--- a/net/lnet/klnds/socklnd/socklnd.h
+++ b/net/lnet/klnds/socklnd/socklnd.h
@@ -659,7 +659,7 @@ int ksocknal_launch_packet(struct lnet_ni *ni, struct ksock_tx *tx,
 void ksocknal_next_tx_carrier(struct ksock_conn *conn);
 void ksocknal_queue_tx_locked(struct ksock_tx *tx, struct ksock_conn *conn);
 void ksocknal_txlist_done(struct lnet_ni *ni, struct list_head *txlist, int error);
-void ksocknal_notify(struct lnet_ni *ni, lnet_nid_t gw_nid, int alive);
+void ksocknal_notify_gw_down(lnet_nid_t gw_nid);
 void ksocknal_query(struct lnet_ni *ni, lnet_nid_t nid, time64_t *when);
 int ksocknal_thread_start(int (*fn)(void *arg), void *arg, char *name);
 void ksocknal_thread_fini(void);
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 32b4b4f..4dc9514 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -3767,7 +3767,7 @@ u32 lnet_get_dlc_seq_locked(void)
 		 * that deadline to the wall clock.
 		 */
 		deadline += ktime_get_seconds();
-		return lnet_notify(NULL, data->ioc_nid, data->ioc_flags,
+		return lnet_notify(NULL, data->ioc_nid, data->ioc_flags, false,
 				   deadline);
 	}
 
diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index 1399545..22a3018 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -1199,12 +1199,26 @@ bool lnet_router_checker_active(void)
 	lnet_rtrpools_free(1);
 }
 
+static inline void
+lnet_notify_peer_down(struct lnet_ni *ni, lnet_nid_t nid)
+{
+	if (ni->ni_net->net_lnd->lnd_notify_peer_down)
+		ni->ni_net->net_lnd->lnd_notify_peer_down(nid);
+}
+
+/* ni: local NI used to communicate with the peer
+ * nid: peer NID
+ * alive: true if peer is alive, false otherwise
+ * reset: reset health value. This is requested by the LND.
+ * when: notification time.
+ */
 int
-lnet_notify(struct lnet_ni *ni, lnet_nid_t nid, int alive, time64_t when)
+lnet_notify(struct lnet_ni *ni, lnet_nid_t nid, bool alive, bool reset,
+	    time64_t when)
 {
-	struct lnet_peer_ni *lp = NULL;
+	struct lnet_peer_ni *lpni = NULL;
 	time64_t now = ktime_get_seconds();
-	int cpt = lnet_cpt_of_nid(nid, ni);
+	int cpt;
 
 	LASSERT(!in_interrupt());
 
@@ -1235,36 +1249,44 @@ bool lnet_router_checker_active(void)
 		return 0;
 	}
 
-	lnet_net_lock(cpt);
+	/* must lock 0 since this is used for synchronization */
+	lnet_net_lock(0);
 
 	if (the_lnet.ln_state != LNET_STATE_RUNNING) {
-		lnet_net_unlock(cpt);
+		lnet_net_unlock(0);
 		return -ESHUTDOWN;
 	}
 
-	lp = lnet_find_peer_ni_locked(nid);
-	if (!lp) {
+	lpni = lnet_find_peer_ni_locked(nid);
+	if (!lpni) {
 		/* nid not found */
-		lnet_net_unlock(cpt);
+		lnet_net_unlock(0);
 		CDEBUG(D_NET, "%s not found\n", libcfs_nid2str(nid));
 		return 0;
 	}
 
-	/*
-	 * It is possible for this function to be called for the same peer
-	 * but with different NIs. We want to synchronize the notification
-	 * between the different calls. So we will use the lpni_cpt to
-	 * grab the net lock.
-	 */
-	if (lp->lpni_cpt != cpt) {
-		lnet_net_unlock(cpt);
-		cpt = lp->lpni_cpt;
-		lnet_net_lock(cpt);
+	if (alive) {
+		if (reset)
+			lnet_set_healthv(&lpni->lpni_healthv,
+					 LNET_MAX_HEALTH_VALUE);
+		else
+			lnet_inc_healthv(&lpni->lpni_healthv);
+	} else {
+		lnet_handle_remote_failure_locked(lpni);
 	}
 
-	lnet_peer_ni_decref_locked(lp);
+	/* recalculate aliveness */
+	alive = lnet_is_peer_ni_alive(lpni);
+	lnet_net_unlock(0);
 
+	if (ni && !alive)
+		lnet_notify_peer_down(ni, lpni->lpni_nid);
+
+	cpt = lpni->lpni_cpt;
+	lnet_net_lock(cpt);
+	lnet_peer_ni_decref_locked(lpni);
 	lnet_net_unlock(cpt);
+
 	return 0;
 }
 EXPORT_SYMBOL(lnet_notify);
-- 
1.8.3.1


* [lustre-devel] [PATCH 342/622] lnet: use discovery for routing
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (340 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 341/622] lnet: modify lnd notification mechanism James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 343/622] lnet: MR aware gateway selection James Simmons
                   ` (280 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Instead of re-inventing the wheel, routing now uses discovery.
Every router check interval, each router is discovered. This
updates the router information locally and also serves to let the
router know that the peer is alive.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11299
Lustre-commit: 146580754295 ("LU-11299 lnet: use discovery for routing")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33454
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h  |   9 ++-
 include/linux/lnet/lib-types.h |   5 ++
 net/lnet/lnet/api-ni.c         |  19 +++---
 net/lnet/lnet/lib-move.c       |  10 ++-
 net/lnet/lnet/peer.c           |  41 ++++++++++++-
 net/lnet/lnet/router.c         | 134 +++++++++++++++++++++++++++++++++++------
 net/lnet/lnet/router_proc.c    |   3 +-
 7 files changed, 186 insertions(+), 35 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index 94918d3..1d06263 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -499,6 +499,7 @@ struct lnet_ni *
 extern unsigned int lnet_peer_discovery_disabled;
 extern unsigned int lnet_drop_asym_route;
 extern unsigned int router_sensitivity_percentage;
+extern int alive_router_check_interval;
 extern int portal_rotor;
 
 int lnet_lib_init(void);
@@ -742,13 +743,16 @@ int lnet_sock_connect(struct socket **sockp, int *fatal,
 
 int lnet_peers_start_down(void);
 int lnet_peer_buffer_credits(struct lnet_net *net);
+void lnet_consolidate_routes_locked(struct lnet_peer *orig_lp,
+				    struct lnet_peer *new_lp);
+void lnet_router_discovery_complete(struct lnet_peer *lp);
 
 int lnet_monitor_thr_start(void);
 void lnet_monitor_thr_stop(void);
 
 bool lnet_router_checker_active(void);
 void lnet_check_routers(void);
-void lnet_router_post_mt_start(void);
+void lnet_wait_router_start(void);
 void lnet_swap_pinginfo(struct lnet_ping_buffer *pbuf);
 
 int lnet_ping_info_validate(struct lnet_ping_info *pinfo);
@@ -795,6 +799,8 @@ struct lnet_peer_ni *lnet_get_next_peer_ni_locked(struct lnet_peer *peer,
 struct lnet_peer_ni *lnet_nid2peerni_locked(lnet_nid_t nid, lnet_nid_t pref,
 					    int cpt);
 struct lnet_peer_ni *lnet_nid2peerni_ex(lnet_nid_t nid, int cpt);
+struct lnet_peer_ni *lnet_peer_get_ni_locked(struct lnet_peer *lp,
+					     lnet_nid_t nid);
 struct lnet_peer_ni *lnet_find_peer_ni_locked(lnet_nid_t nid);
 struct lnet_peer *lnet_find_peer(lnet_nid_t nid);
 void lnet_peer_net_added(struct lnet_net *net);
@@ -854,6 +860,7 @@ int lnet_get_peer_ni_info(u32 peer_index, u64 *nid,
 }
 
 bool lnet_peer_is_uptodate(struct lnet_peer *lp);
+bool lnet_peer_gw_discovery(struct lnet_peer *lp);
 
 static inline bool
 lnet_peer_needs_push(struct lnet_peer *lp)
diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index 2d5ae21..9662c9e 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -716,6 +716,9 @@ struct lnet_peer {
 #define LNET_PEER_FORCE_PING	BIT(13)	/* Forced Ping */
 #define LNET_PEER_FORCE_PUSH	BIT(14)	/* Forced Push */
 
+/* gw undergoing alive discovery */
+#define LNET_PEER_RTR_DISCOVERY	BIT(16)
+
 struct lnet_peer_net {
 	/* chain on lp_peer_nets */
 	struct list_head	lpn_peer_nets;
@@ -787,6 +790,8 @@ struct lnet_route {
 	struct list_head	lr_gwlist;
 	/* router node */
 	struct lnet_peer       *lr_gateway;
+	/* NID used to add route */
+	lnet_nid_t		lr_nid;
 	/* remote network number */
 	u32			lr_net;
 	/* local network number */
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 4dc9514..b1823cd 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -2533,29 +2533,32 @@ void lnet_lib_exit(void)
 		goto err_stop_ping;
 	}
 
-	rc = lnet_monitor_thr_start();
+	rc = lnet_push_target_init();
 	if (rc)
 		goto err_stop_ping;
 
-	rc = lnet_push_target_init();
-	if (rc != 0)
-		goto err_stop_monitor_thr;
-
 	rc = lnet_peer_discovery_start();
 	if (rc != 0)
 		goto err_destroy_push_target;
 
+	rc = lnet_monitor_thr_start();
+	if (rc != 0)
+		goto err_stop_discovery_thr;
+
 	lnet_fault_init();
 	lnet_router_debugfs_init();
 
 	mutex_unlock(&the_lnet.ln_api_mutex);
 
+	/* wait for all routers to start */
+	lnet_wait_router_start();
+
 	return 0;
 
+err_stop_discovery_thr:
+	lnet_peer_discovery_stop();
 err_destroy_push_target:
 	lnet_push_target_fini();
-err_stop_monitor_thr:
-	lnet_monitor_thr_stop();
 err_stop_ping:
 	lnet_ping_target_fini();
 err_acceptor_stop:
@@ -2603,9 +2606,9 @@ void lnet_lib_exit(void)
 
 		lnet_fault_fini();
 		lnet_router_debugfs_fini();
+		lnet_monitor_thr_stop();
 		lnet_peer_discovery_stop();
 		lnet_push_target_fini();
-		lnet_monitor_thr_stop();
 		lnet_ping_target_fini();
 
 		/* Teardown fns that use my own API functions BEFORE here */
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 2e2299d..e214a95 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -1748,6 +1748,13 @@ struct lnet_ni *
 
 	lnet_peer_ni_addref_locked(lpni);
 
+	peer = lpni->lpni_peer_net->lpn_peer;
+
+	if (lnet_peer_gw_discovery(peer)) {
+		lnet_peer_ni_decref_locked(lpni);
+		return 0;
+	}
+
 	rc = lnet_discover_peer_locked(lpni, cpt, false);
 	if (rc) {
 		lnet_peer_ni_decref_locked(lpni);
@@ -3373,9 +3380,6 @@ int lnet_monitor_thr_start(void)
 		goto clean_thread;
 	}
 
-	/* post monitor thread start processing */
-	lnet_router_post_mt_start();
-
 	return 0;
 
 clean_thread:
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 8669fbb..b804d78 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -659,6 +659,24 @@ struct lnet_peer_ni *
 	return lpni;
 }
 
+struct lnet_peer_ni *
+lnet_peer_get_ni_locked(struct lnet_peer *lp, lnet_nid_t nid)
+{
+	struct lnet_peer_net *lpn;
+	struct lnet_peer_ni *lpni;
+
+	lpn = lnet_peer_get_net_locked(lp, LNET_NIDNET(nid));
+	if (!lpn)
+		return NULL;
+
+	list_for_each_entry(lpni, &lpn->lpn_peer_nis, lpni_peer_nis) {
+		if (lpni->lpni_nid == nid)
+			return lpni;
+	}
+
+	return NULL;
+}
+
 struct lnet_peer *
 lnet_find_peer(lnet_nid_t nid)
 {
@@ -1708,6 +1726,19 @@ struct lnet_peer_ni *
  * Peer Discovery
  */
 
+bool
+lnet_peer_gw_discovery(struct lnet_peer *lp)
+{
+	bool rc = false;
+
+	spin_lock(&lp->lp_lock);
+	if (lp->lp_state & LNET_PEER_RTR_DISCOVERY)
+		rc = true;
+	spin_unlock(&lp->lp_lock);
+
+	return rc;
+}
+
 /*
  * Is a peer uptodate from the point of view of discovery?
  *
@@ -1797,6 +1828,9 @@ static void lnet_peer_discovery_complete(struct lnet_peer *lp)
 	spin_unlock(&lp->lp_lock);
 	wake_up_all(&lp->lp_dc_waitq);
 
+	if (lp->lp_rtr_refcount > 0)
+		lnet_router_discovery_complete(lp);
+
 	lnet_net_unlock(LNET_LOCK_EX);
 
 	/* iterate through all pending messages and send them again */
@@ -2685,8 +2719,11 @@ static int lnet_peer_data_present(struct lnet_peer *lp)
 				rc = lnet_peer_merge_data(lp, pbuf);
 			}
 		} else {
-			rc = lnet_peer_set_primary_data(
-				lpni->lpni_peer_net->lpn_peer, pbuf);
+			struct lnet_peer *new_lp;
+
+			new_lp = lpni->lpni_peer_net->lpn_peer;
+			rc = lnet_peer_set_primary_data(new_lp, pbuf);
+			lnet_consolidate_routes_locked(lp, new_lp);
 			lnet_peer_ni_decref_locked(lpni);
 		}
 	}
diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index 22a3018..4a061f3 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -78,13 +78,9 @@
 module_param(avoid_asym_router_failure, int, 0644);
 MODULE_PARM_DESC(avoid_asym_router_failure, "Avoid asymmetrical router failures (0 to disable)");
 
-static int dead_router_check_interval = 60;
-module_param(dead_router_check_interval, int, 0644);
-MODULE_PARM_DESC(dead_router_check_interval, "Seconds between dead router health checks (<= 0 to disable)");
-
-static int live_router_check_interval = 60;
-module_param(live_router_check_interval, int, 0644);
-MODULE_PARM_DESC(live_router_check_interval, "Seconds between live router health checks (<= 0 to disable)");
+int alive_router_check_interval = 60;
+module_param(alive_router_check_interval, int, 0644);
+MODULE_PARM_DESC(alive_router_check_interval, "Seconds between live router health checks (<= 0 to disable)");
 
 static int router_ping_timeout = 50;
 module_param(router_ping_timeout, int, 0644);
@@ -220,6 +216,61 @@ bool lnet_is_route_alive(struct lnet_route *route)
 	return route_alive;
 }
 
+void
+lnet_consolidate_routes_locked(struct lnet_peer *orig_lp,
+			       struct lnet_peer *new_lp)
+{
+	struct lnet_peer_ni *lpni;
+	struct lnet_route *route;
+
+	/* A route is correlated with a peer, but when it is added
+	 * a specific NID is used. That NID refers to a peer_ni within
+	 * a peer. There could be other peer_nis on the same net, which
+	 * can be used to send to that gateway. However when we are
+	 * consolidating gateways because of discovery, the nid used to
+	 * add the route might've moved between gateway peers. In this
+	 * case we want to move the route to the new gateway as well. The
+	 * intent here is not to confuse the user who added the route.
+	 */
+	list_for_each_entry(route, &orig_lp->lp_routes, lr_gwlist) {
+		lpni = lnet_peer_get_ni_locked(orig_lp, route->lr_nid);
+		if (!lpni) {
+			lnet_net_lock(LNET_LOCK_EX);
+			list_move(&route->lr_gwlist, &new_lp->lp_routes);
+			lnet_net_unlock(LNET_LOCK_EX);
+		}
+	}
+}
+
+void
+lnet_router_discovery_complete(struct lnet_peer *lp)
+{
+	struct lnet_peer_ni *lpni = NULL;
+
+	spin_lock(&lp->lp_lock);
+	lp->lp_state &= ~LNET_PEER_RTR_DISCOVERY;
+	spin_unlock(&lp->lp_lock);
+
+	/* Router discovery successful? All peer information would've been
+	 * updated already. No need to do any more processing
+	 */
+	if (!lp->lp_dc_error)
+		return;
+	/* discovery failed? then we need to set the status of each lpni
+	 * to DOWN. It will be updated the next time we discover the
+	 * router. For router peer NIs not on local networks, we never send
+	 * messages directly to them, so their health will always remain
+	 * at maximum. We can only tell if they are up or down from the
+	 * status returned in the PING response. If we fail to get that
+	 * status in our scheduled router discovery, then we'll assume
+	 * it's down until we're told otherwise.
+	 */
+	CDEBUG(D_NET, "%s: Router discovery failed %d\n",
+	       libcfs_nid2str(lp->lp_primary_nid), lp->lp_dc_error);
+	while ((lpni = lnet_get_next_peer_ni_locked(lp, NULL, lpni)) != NULL)
+		lpni->lpni_ns_status = LNET_NI_STATUS_DOWN;
+}
+
 static void
 lnet_rtr_addref_locked(struct lnet_peer *lp)
 {
@@ -368,6 +419,7 @@ static void lnet_shuffle_seed(void)
 	/* store the local and remote net that the route represents */
 	route->lr_lnet = LNET_NIDNET(gateway);
 	route->lr_net = net;
+	route->lr_nid = gateway;
 	route->lr_priority = priority;
 	route->lr_hops = hops;
 
@@ -610,10 +662,10 @@ int lnet_get_rtr_pool_cfg(int cpt, struct lnet_ioctl_pool_cfg *pool_cfg)
 			list_for_each_entry(route, &rnet->lrn_routes, lr_list) {
 				if (!idx--) {
 					*net = rnet->lrn_net;
+					*gateway = route->lr_nid;
 					*hops = route->lr_hops;
-					*priority = route->lr_priority;
-					*gateway =
-					    route->lr_gateway->lp_primary_nid;
+					*priority =
+					    route->lr_priority;
 					*alive = lnet_is_route_alive(route);
 					lnet_net_unlock(cpt);
 					return 0;
@@ -667,8 +719,7 @@ int lnet_get_rtr_pool_cfg(int cpt, struct lnet_ioctl_pool_cfg *pool_cfg)
 
 	LASSERT(the_lnet.ln_routing);
 
-	timeout = router_ping_timeout +
-		  max(live_router_check_interval, dead_router_check_interval);
+	timeout = router_ping_timeout + alive_router_check_interval;
 
 	now = ktime_get_real_seconds();
 	while ((ni = lnet_get_next_ni_locked(NULL, ni))) {
@@ -700,7 +751,7 @@ int lnet_get_rtr_pool_cfg(int cpt, struct lnet_ioctl_pool_cfg *pool_cfg)
 	}
 }
 
-void lnet_router_post_mt_start(void)
+void lnet_wait_router_start(void)
 {
 	if (check_routers_before_use) {
 		/*
@@ -718,9 +769,6 @@ void lnet_router_post_mt_start(void)
  */
 bool lnet_router_checker_active(void)
 {
-	if (the_lnet.ln_mt_state != LNET_MT_STATE_RUNNING)
-		return true;
-
 	/*
 	 * Router Checker thread needs to run when routing is enabled in
 	 * order to call lnet_update_ni_status_locked()
@@ -729,23 +777,71 @@ bool lnet_router_checker_active(void)
 		return true;
 
 	return !list_empty(&the_lnet.ln_routers) &&
-		(live_router_check_interval > 0 ||
-		 dead_router_check_interval > 0);
+		alive_router_check_interval > 0;
 }
 
 void
 lnet_check_routers(void)
 {
+	struct lnet_peer_ni *lpni;
 	struct lnet_peer *rtr;
 	u64 version;
+	time64_t now;
 	int cpt;
+	int rc;
 
 	cpt = lnet_net_lock_current();
 rescan:
 	version = the_lnet.ln_routers_version;
 
 	list_for_each_entry(rtr, &the_lnet.ln_routers, lp_rtr_list) {
-		/* TODO use discovery to determine if router is alive */
+		now = ktime_get_real_seconds();
+
+		/* only discover the router if we've passed
+		 * alive_router_check_interval seconds. Some of the router
+		 * interfaces could be down and in that case they would be
+		 * undergoing recovery separately from this discovery.
+		 */
+		if (now - rtr->lp_rtrcheck_timestamp <
+		    alive_router_check_interval)
+			continue;
+
+		/* If we're currently discovering the peer then don't
+		 * issue another discovery
+		 */
+		spin_lock(&rtr->lp_lock);
+		if (rtr->lp_state & LNET_PEER_RTR_DISCOVERY) {
+			spin_unlock(&rtr->lp_lock);
+			continue;
+		}
+		/* make sure we actively discover the router */
+		rtr->lp_state &= ~LNET_PEER_NIDS_UPTODATE;
+		rtr->lp_state |= LNET_PEER_RTR_DISCOVERY;
+		spin_unlock(&rtr->lp_lock);
+
+		/* find the peer_ni associated with the primary NID */
+		lpni = lnet_peer_get_ni_locked(rtr, rtr->lp_primary_nid);
+		if (!lpni) {
+			CDEBUG(D_NET,
+			       "Expected to find an lpni for %s, but none found\n",
+			       libcfs_nid2str(rtr->lp_primary_nid));
+			continue;
+		}
+		lnet_peer_ni_addref_locked(lpni);
+
+		/* discover the router */
+		CDEBUG(D_NET, "discover %s, cpt = %d\n",
+		       libcfs_nid2str(lpni->lpni_nid), cpt);
+		rc = lnet_discover_peer_locked(lpni, cpt, false);
+
+		/* decrement ref count acquired by find_peer_ni_locked() */
+		lnet_peer_ni_decref_locked(lpni);
+
+		if (!rc)
+			rtr->lp_rtrcheck_timestamp = now;
+		else
+			CERROR("Failed to discover router %s\n",
+			       libcfs_nid2str(rtr->lp_primary_nid));
 
 		/* NB dropped lock */
 		if (version != the_lnet.ln_routers_version) {
diff --git a/net/lnet/lnet/router_proc.c b/net/lnet/lnet/router_proc.c
index e494d19..9771ef0 100644
--- a/net/lnet/lnet/router_proc.c
+++ b/net/lnet/lnet/router_proc.c
@@ -222,8 +222,7 @@ static int proc_lnet_routes(struct ctl_table *table, int write,
 				      libcfs_net2str(net), hops,
 				      priority,
 				      alive ? "up" : "down",
-				      /* TODO: replace with actual nid */
-				      libcfs_nid2str(LNET_NID_ANY));
+				      libcfs_nid2str(route->lr_nid));
 			LASSERT(tmpstr + tmpsiz - s > 0);
 		}
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 343/622] lnet: MR aware gateway selection
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

When selecting a route, use the Multi-Rail selection algorithm to
select the best available peer_ni of the best route. The selected
peer_ni can then be used to send the message, or to discover the
gateway peer if it still needs discovery.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11378
Lustre-commit: 11d8380d5ad0 ("LU-11378 lnet: MR aware gateway selection")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33188
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-move.c | 353 +++++++++++++++++++++++------------------------
 1 file changed, 171 insertions(+), 182 deletions(-)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index e214a95..054ae48 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -1117,7 +1117,6 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	}
 }
 
-#if 0
 static int
 lnet_compare_peers(struct lnet_peer_ni *p1, struct lnet_peer_ni *p2)
 {
@@ -1135,53 +1134,189 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 
 	return 0;
 }
-#endif
+
+static struct lnet_peer_ni *
+lnet_select_peer_ni(struct lnet_ni *best_ni, lnet_nid_t dst_nid,
+		    struct lnet_peer *peer,
+		    struct lnet_peer_net *peer_net)
+{
+	/* Look at the peer NIs for the destination peer that connect
+	 * to the chosen net. If a peer_ni is preferred when using the
+	 * best_ni to communicate, we use that one. If there is no
+	 * preferred peer_ni, or there are multiple preferred peer_ni,
+	 * the available transmit credits are used. If the transmit
+	 * credits are equal, we round-robin over the peer_ni.
+	 */
+	struct lnet_peer_ni *lpni = NULL;
+	struct lnet_peer_ni *best_lpni = NULL;
+	int best_lpni_credits = INT_MIN;
+	bool preferred = false;
+	bool ni_is_pref;
+	int best_lpni_healthv = 0;
+	int lpni_healthv;
+
+	while ((lpni = lnet_get_next_peer_ni_locked(peer, peer_net, lpni))) {
+		/* if the best_ni we've chosen already has this lpni
+		 * preferred, then let's use it
+		 */
+		if (best_ni) {
+			ni_is_pref = lnet_peer_is_pref_nid_locked(lpni,
+								  best_ni->ni_nid);
+			CDEBUG(D_NET, "%s ni_is_pref = %d\n",
+			       libcfs_nid2str(best_ni->ni_nid), ni_is_pref);
+		} else {
+			ni_is_pref = false;
+		}
+
+		lpni_healthv = atomic_read(&lpni->lpni_healthv);
+
+		if (best_lpni)
+			CDEBUG(D_NET, "%s c:[%d, %d], s:[%d, %d]\n",
+			       libcfs_nid2str(lpni->lpni_nid),
+			       lpni->lpni_txcredits, best_lpni_credits,
+			       lpni->lpni_seq, best_lpni->lpni_seq);
+
+		/* pick the healthiest peer ni */
+		if (lpni_healthv < best_lpni_healthv) {
+			continue;
+		} else if (lpni_healthv > best_lpni_healthv) {
+			best_lpni_healthv = lpni_healthv;
+		/* if this is a preferred peer use it */
+		} else if (!preferred && ni_is_pref) {
+			preferred = true;
+		} else if (preferred && !ni_is_pref) {
+			/* this is not the preferred peer so let's ignore
+			 * it.
+			 */
+			continue;
+		} else if (lpni->lpni_txcredits < best_lpni_credits) {
+			/* We already have a peer that has more credits
+			 * available than this one. No need to consider
+			 * this peer further.
+			 */
+			continue;
+		} else if (lpni->lpni_txcredits == best_lpni_credits) {
+			/* The best peer found so far and the current peer
+			 * have the same number of available credits let's
+			 * make sure to select between them using Round
+			 * Robin
+			 */
+			if (best_lpni) {
+				if (best_lpni->lpni_seq <= lpni->lpni_seq)
+					continue;
+			}
+		}
+
+		best_lpni = lpni;
+		best_lpni_credits = lpni->lpni_txcredits;
+	}
+
+	/* if we still can't find a peer ni then we can't reach it */
+	if (!best_lpni) {
+		u32 net_id = (peer_net) ? peer_net->lpn_net_id :
+			LNET_NIDNET(dst_nid);
+		CDEBUG(D_NET, "no peer_ni found on peer net %s\n",
+		       libcfs_net2str(net_id));
+		return NULL;
+	}
+
+	CDEBUG(D_NET, "sd_best_lpni = %s\n",
+	       libcfs_nid2str(best_lpni->lpni_nid));
+
+	return best_lpni;
+}
+
+/* Prerequisite: the best_ni should already be set in the sd */
+static inline struct lnet_peer_ni *
+lnet_find_best_lpni_on_net(struct lnet_send_data *sd, struct lnet_peer *peer,
+			   u32 net_id)
+{
+	struct lnet_peer_net *peer_net;
+
+	/* The gateway is Multi-Rail capable so now we must select the
+	 * proper peer_ni
+	 */
+	peer_net = lnet_peer_get_net_locked(peer, net_id);
+
+	if (!peer_net) {
+		CERROR("gateway peer %s has no NI on net %s\n",
+		       libcfs_nid2str(peer->lp_primary_nid),
+		       libcfs_net2str(net_id));
+		return NULL;
+	}
+
+	return lnet_select_peer_ni(sd->sd_best_ni, sd->sd_dst_nid,
+				   peer, peer_net);
+}
 
 static int
-lnet_compare_routes(struct lnet_route *r1, struct lnet_route *r2)
+lnet_compare_routes(struct lnet_route *r1, struct lnet_route *r2,
+		    struct lnet_peer_ni **best_lpni)
 {
-	/* TODO re-implement gateway comparison
-	struct lnet_peer_ni *p1 = r1->lr_gateway;
-	struct lnet_peer_ni *p2 = r2->lr_gateway;
-	*/
 	int r1_hops = (r1->lr_hops == LNET_UNDEFINED_HOPS) ? 1 : r1->lr_hops;
 	int r2_hops = (r2->lr_hops == LNET_UNDEFINED_HOPS) ? 1 : r2->lr_hops;
-	/*int rc;*/
+	struct lnet_peer *lp1 = r1->lr_gateway;
+	struct lnet_peer *lp2 = r2->lr_gateway;
+	struct lnet_peer_ni *lpni1;
+	struct lnet_peer_ni *lpni2;
+	struct lnet_send_data sd;
+	int rc;
+
+	sd.sd_best_ni = NULL;
+	sd.sd_dst_nid = LNET_NID_ANY;
+	lpni1 = lnet_find_best_lpni_on_net(&sd, lp1, r1->lr_lnet);
+	lpni2 = lnet_find_best_lpni_on_net(&sd, lp2, r2->lr_lnet);
+	LASSERT(lpni1 && lpni2);
 
-	if (r1->lr_priority < r2->lr_priority)
+	if (r1->lr_priority < r2->lr_priority) {
+		*best_lpni = lpni1;
 		return 1;
+	}
 
-	if (r1->lr_priority > r2->lr_priority)
+	if (r1->lr_priority > r2->lr_priority) {
+		*best_lpni = lpni2;
 		return -1;
+	}
 
-	if (r1_hops < r2_hops)
+	if (r1_hops < r2_hops) {
+		*best_lpni = lpni1;
 		return 1;
+	}
 
-	if (r1_hops > r2_hops)
+	if (r1_hops > r2_hops) {
+		*best_lpni = lpni2;
 		return -1;
+	}
 
-	/*
-	rc = lnet_compare_peers(p1, p2);
-	if (rc)
+	rc = lnet_compare_peers(lpni1, lpni2);
+	if (rc == 1) {
+		*best_lpni = lpni1;
+		return rc;
+	} else if (rc == -1) {
+		*best_lpni = lpni2;
 		return rc;
-	*/
+	}
 
-	if (r1->lr_seq - r2->lr_seq <= 0)
+	if (r1->lr_seq - r2->lr_seq <= 0) {
+		*best_lpni = lpni1;
 		return 1;
+	}
 
+	*best_lpni = lpni2;
 	return -1;
 }
 
-/* TODO: lnet_find_route_locked() needs to be reimplemented */
 static struct lnet_route *
 lnet_find_route_locked(struct lnet_net *net, u32 remote_net,
-		       lnet_nid_t rtr_nid, struct lnet_route **prev_route)
+		       lnet_nid_t rtr_nid, struct lnet_route **prev_route,
+		       struct lnet_peer_ni **gwni)
 {
-	struct lnet_remotenet *rnet;
-	struct lnet_route *route;
+	struct lnet_peer_ni *best_gw_ni = NULL;
 	struct lnet_route *best_route;
 	struct lnet_route *last_route;
+	struct lnet_remotenet *rnet;
 	struct lnet_peer *lp_best;
+	struct lnet_route *route;
 	struct lnet_peer *lp;
 	int rc;
 
@@ -1206,14 +1341,13 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 			best_route = route;
 			last_route = route;
 			lp_best = lp;
-			continue;
 		}
 
 		/* no protection on below fields, but it's harmless */
 		if (last_route->lr_seq - route->lr_seq < 0)
 			last_route = route;
 
-		rc = lnet_compare_routes(route, best_route);
+		rc = lnet_compare_routes(route, best_route, &best_gw_ni);
 		if (rc < 0)
 			continue;
 
@@ -1222,6 +1356,7 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	}
 
 	*prev_route = last_route;
+	*gwni = best_gw_ni;
 
 	return best_route;
 }
@@ -1507,123 +1642,6 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	return rc;
 }
 
-static struct lnet_peer_ni *
-lnet_select_peer_ni(struct lnet_send_data *sd, struct lnet_peer *peer,
-		    struct lnet_peer_net *peer_net)
-{
-	/*
-	 * Look@the peer NIs for the destination peer that connect
-	 * to the chosen net. If a peer_ni is preferred when using the
-	 * best_ni to communicate, we use that one. If there is no
-	 * preferred peer_ni, or there are multiple preferred peer_ni,
-	 * the available transmit credits are used. If the transmit
-	 * credits are equal, we round-robin over the peer_ni.
-	 */
-	struct lnet_peer_ni *lpni = NULL;
-	struct lnet_peer_ni *best_lpni = NULL;
-	struct lnet_ni *best_ni = sd->sd_best_ni;
-	lnet_nid_t dst_nid = sd->sd_dst_nid;
-	int best_lpni_credits = INT_MIN;
-	bool preferred = false;
-	bool ni_is_pref;
-	int best_lpni_healthv = 0;
-	int lpni_healthv;
-
-	while ((lpni = lnet_get_next_peer_ni_locked(peer, peer_net, lpni))) {
-		/* if the best_ni we've chosen aleady has this lpni
-		 * preferred, then let's use it
-		 */
-		ni_is_pref = lnet_peer_is_pref_nid_locked(lpni,
-							  best_ni->ni_nid);
-
-		lpni_healthv = atomic_read(&lpni->lpni_healthv);
-
-		CDEBUG(D_NET, "%s ni_is_pref = %d\n",
-		       libcfs_nid2str(best_ni->ni_nid), ni_is_pref);
-
-		if (best_lpni)
-			CDEBUG(D_NET, "%s c:[%d, %d], s:[%d, %d]\n",
-			       libcfs_nid2str(lpni->lpni_nid),
-			       lpni->lpni_txcredits, best_lpni_credits,
-			       lpni->lpni_seq, best_lpni->lpni_seq);
-
-		/* pick the healthiest peer ni */
-		if (lpni_healthv < best_lpni_healthv) {
-			continue;
-		} else if (lpni_healthv > best_lpni_healthv) {
-			best_lpni_healthv = lpni_healthv;
-		/* if this is a preferred peer use it */
-		} else if (!preferred && ni_is_pref) {
-			preferred = true;
-		} else if (preferred && !ni_is_pref) {
-			/*
-			 * this is not the preferred peer so let's ignore
-			 * it.
-			 */
-			continue;
-		} else if (lpni->lpni_txcredits < best_lpni_credits) {
-			/*
-			 * We already have a peer that has more credits
-			 * available than this one. No need to consider
-			 * this peer further.
-			 */
-			continue;
-		} else if (lpni->lpni_txcredits == best_lpni_credits) {
-			/*
-			 * The best peer found so far and the current peer
-			 * have the same number of available credits let's
-			 * make sure to select between them using Round
-			 * Robin
-			 */
-			if (best_lpni) {
-				if (best_lpni->lpni_seq <= lpni->lpni_seq)
-					continue;
-			}
-		}
-
-		best_lpni = lpni;
-		best_lpni_credits = lpni->lpni_txcredits;
-	}
-
-	/* if we still can't find a peer ni then we can't reach it */
-	if (!best_lpni) {
-		u32 net_id = peer_net ? peer_net->lpn_net_id :
-					LNET_NIDNET(dst_nid);
-
-		CDEBUG(D_NET, "no peer_ni found on peer net %s\n",
-		       libcfs_net2str(net_id));
-		return NULL;
-	}
-
-	CDEBUG(D_NET, "sd_best_lpni = %s\n",
-	       libcfs_nid2str(best_lpni->lpni_nid));
-
-	return best_lpni;
-}
-
-/* Prerequisite: the best_ni should already be set in the sd
- */
-static inline struct lnet_peer_ni *
-lnet_find_best_lpni_on_net(struct lnet_send_data *sd, struct lnet_peer *peer,
-			   u32 net_id)
-{
-	struct lnet_peer_net *peer_net;
-
-	/* The gateway is Multi-Rail capable so now we must select the
-	 * proper peer_ni
-	 */
-	peer_net = lnet_peer_get_net_locked(peer, net_id);
-
-	if (!peer_net) {
-		CERROR("gateway peer %s has no NI on net %s\n",
-		       libcfs_nid2str(peer->lp_primary_nid),
-		       libcfs_net2str(net_id));
-		return NULL;
-	}
-
-	return lnet_select_peer_ni(sd, peer, peer_net);
-}
-
 static inline void
 lnet_set_non_mr_pref_nid(struct lnet_send_data *sd)
 {
@@ -1791,29 +1809,34 @@ struct lnet_ni *
 	lnet_nid_t src_nid = sd->sd_src_nid;
 
 	best_route = lnet_find_route_locked(NULL, LNET_NIDNET(dst_nid),
-					    sd->sd_rtr_nid, &last_route);
+					    sd->sd_rtr_nid, &last_route,
+					    &lpni);
 	if (!best_route) {
 		CERROR("no route to %s from %s\n",
 		       libcfs_nid2str(dst_nid), libcfs_nid2str(src_nid));
 		return -EHOSTUNREACH;
 	}
 
+	if (!lpni) {
+		CERROR("Internal Error. Route expected to %s from %s\n",
+		       libcfs_nid2str(dst_nid),
+		       libcfs_nid2str(src_nid));
+		return -EFAULT;
+	}
+
 	gw = best_route->lr_gateway;
-	*gw_peer = gw;
+	LASSERT(gw == lpni->lpni_peer_net->lpn_peer);
 
 	/* Discover this gateway if it hasn't already been discovered.
 	 * This means we might delay the message until discovery has
 	 * completed
 	 */
-#if 0
-	/* TODO: disable discovey for now */
 	if (lnet_msg_discovery(sd->sd_msg) &&
-	    !lnet_peer_is_uptodate(*gw_peer)) {
+	    !lnet_peer_is_uptodate(gw)) {
 		sd->sd_msg->msg_src_nid_param = sd->sd_src_nid;
-		return lnet_initiate_peer_discovery(gw, sd->sd_msg,
+		return lnet_initiate_peer_discovery(lpni, sd->sd_msg,
 						    sd->sd_rtr_nid, sd->sd_cpt);
 	}
-#endif
 
 	if (!sd->sd_best_ni) {
 		struct lnet_peer_net *lpeer;
@@ -1830,42 +1853,8 @@ struct lnet_ni *
 		return -EFAULT;
 	}
 
-	/* if gw is MR let's find its best peer_ni
-	 */
-	if (lnet_peer_is_multi_rail(gw)) {
-		lpni = lnet_find_best_lpni_on_net(sd, gw,
-						  sd->sd_best_ni->ni_net->net_id);
-		/* We've already verified that the gw has an NI on that
-		 * desired net, but we're not finding it. Something is
-		 * wrong.
-		 */
-		if (!lpni) {
-			CERROR("Internal Error. Route expected to %s from %s\n",
-			       libcfs_nid2str(dst_nid),
-			       libcfs_nid2str(src_nid));
-			return -EFAULT;
-		}
-	} else {
-		struct lnet_peer_net *lpn;
-
-		lpn = lnet_peer_get_net_locked(gw, best_route->lr_lnet);
-		if (!lpn) {
-			CERROR("Internal Error. Route expected to %s from %s\n",
-			       libcfs_nid2str(dst_nid),
-			       libcfs_nid2str(src_nid));
-			return -EFAULT;
-		}
-		lpni = list_entry(lpn->lpn_peer_nis.next, struct lnet_peer_ni,
-				  lpni_peer_nis);
-		if (!lpni) {
-			CERROR("Internal Error. Route expected to %s from %s\n",
-			       libcfs_nid2str(dst_nid),
-			       libcfs_nid2str(src_nid));
-			return -EFAULT;
-		}
-	}
-
 	*gw_lpni = lpni;
+	*gw_peer = gw;
 
 	/* increment the route sequence number since now we're sure we're
 	 * going to use it
-- 
1.8.3.1


* [lustre-devel] [PATCH 344/622] lnet: consider alive_router_check_interval
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Consider alive_router_check_interval when computing the monitor
thread wakeup interval, to make sure the monitor thread wakes up
at the earliest time a router check could be due.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11300
Lustre-commit: 434456256f30 ("LU-11300 lnet: consider alive_router_check_interval")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33298
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-move.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 054ae48..90b4e3f 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -3142,7 +3142,8 @@ struct lnet_mt_event_info {
 		 * is waking up unnecessarily.
 		 */
 		interval = min(lnet_recovery_interval,
-			       lnet_transaction_timeout / 2);
+			       min((unsigned int)alive_router_check_interval,
+				   lnet_transaction_timeout / 2));
 		wait_event_interruptible_timeout(the_lnet.ln_mt_waitq,
 						 false, HZ * interval);
 	}
-- 
1.8.3.1


* [lustre-devel] [PATCH 345/622] lnet: allow deleting router primary_nid
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Discovery doesn't allow deleting the primary_nid of a peer. This
is necessary because upper layers only know how to reach the peer
by its primary_nid. For routers this restriction does not apply,
so if a router changes its interfaces and comes back up again, the
peer_ni should be adjusted accordingly.
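
A minimal sketch of the adjustment described above, using a plain array
in place of LNet's peer_ni lists (the names `demo_peer` and
`demo_del_nid` are hypothetical, not the real symbols): before
force-deleting the primary NID of a multi-NID peer, the next NID is
promoted to primary so the peer stays addressable.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_peer {
	unsigned long nids[4];	/* nids[0] plays the role of the primary */
	int nnids;
};

/* Returns 0 on success, -1 (think -EBUSY) when asked to delete the
 * primary of a multi-NID peer without force. */
static int demo_del_nid(struct demo_peer *p, int idx, bool force)
{
	int i;

	if (idx == 0 && p->nnids != 1 && !force)
		return -1;
	/* shift the remaining NIDs down over the deleted slot; when the
	 * primary (slot 0) is force-deleted, this promotes nids[1] */
	for (i = idx; i < p->nnids - 1; i++)
		p->nids[i] = p->nids[i + 1];
	p->nnids--;
	return 0;
}
```

The real code instead walks the peer_ni list with
lnet_get_next_peer_ni_locked() under LNET_LOCK_EX, but the ordering is
the same: promote first, then delete.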

WC-bug-id: https://jira.whamcloud.com/browse/LU-11475
Lustre-commit: 086962e37737 ("LU-11475 lnet: allow deleting router primary_nid")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33300
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-types.h |  3 +++
 net/lnet/lnet/peer.c           | 29 ++++++++++++++++++++++-------
 2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index 9662c9e..97d35e0 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -716,6 +716,9 @@ struct lnet_peer {
 #define LNET_PEER_FORCE_PING	BIT(13)	/* Forced Ping */
 #define LNET_PEER_FORCE_PUSH	BIT(14)	/* Forced Push */
 
+/* force delete even if router */
+#define LNET_PEER_RTR_NI_FORCE_DEL BIT(15)
+
 /* gw undergoing alive discovery */
 #define LNET_PEER_RTR_DISCOVERY	BIT(16)
 
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index b804d78..a81fee2 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -323,12 +323,12 @@
 
 /* called with lnet_net_lock LNET_LOCK_EX held */
 static int
-lnet_peer_ni_del_locked(struct lnet_peer_ni *lpni)
+lnet_peer_ni_del_locked(struct lnet_peer_ni *lpni, bool force)
 {
 	struct lnet_peer_table *ptable = NULL;
 
 	/* don't remove a peer_ni if it's also a gateway */
-	if (lnet_isrouter(lpni)) {
+	if (lnet_isrouter(lpni) && !force) {
 		CERROR("Peer NI %s is a gateway. Can not delete it\n",
 		       libcfs_nid2str(lpni->lpni_nid));
 		return -EBUSY;
@@ -384,7 +384,7 @@ void lnet_peer_uninit(void)
 	/* remove all peer_nis from the remote peer and the hash list */
 	list_for_each_entry_safe(lpni, tmp, &the_lnet.ln_remote_peer_ni_list,
 				 lpni_on_remote_peer_ni_list)
-		lnet_peer_ni_del_locked(lpni);
+		lnet_peer_ni_del_locked(lpni, false);
 
 	lnet_peer_tables_destroy();
 
@@ -439,7 +439,7 @@ void lnet_peer_uninit(void)
 	lpni = lnet_get_next_peer_ni_locked(peer, NULL, lpni);
 	while (lpni) {
 		lpni2 = lnet_get_next_peer_ni_locked(peer, NULL, lpni);
-		rc = lnet_peer_ni_del_locked(lpni);
+		rc = lnet_peer_ni_del_locked(lpni, false);
 		if (rc != 0)
 			rc2 = rc;
 		lpni = lpni2;
@@ -473,6 +473,7 @@ void lnet_peer_uninit(void)
 	struct lnet_peer_ni *lpni;
 	lnet_nid_t primary_nid = lp->lp_primary_nid;
 	int rc = 0;
+	bool force = (flags & LNET_PEER_RTR_NI_FORCE_DEL) ? true : false;
 
 	if (!(flags & LNET_PEER_CONFIGURED)) {
 		if (lp->lp_state & LNET_PEER_CONFIGURED) {
@@ -495,14 +496,21 @@ void lnet_peer_uninit(void)
 	 * This function only allows deletion of the primary NID if it
 	 * is the only NID.
 	 */
-	if (nid == lp->lp_primary_nid && lp->lp_nnis != 1) {
+	if (nid == lp->lp_primary_nid && lp->lp_nnis != 1 && !force) {
 		rc = -EBUSY;
 		goto out;
 	}
 
 	lnet_net_lock(LNET_LOCK_EX);
 
-	rc = lnet_peer_ni_del_locked(lpni);
+	if (nid == lp->lp_primary_nid && lp->lp_nnis != 1 && force) {
+		struct lnet_peer_ni *lpni2;
+		/* assign the next peer_ni to be the primary */
+		lpni2 = lnet_get_next_peer_ni_locked(lp, NULL, lpni);
+		LASSERT(lpni2);
+		lp->lp_primary_nid = lpni2->lpni_nid;
+	}
+	rc = lnet_peer_ni_del_locked(lpni, force);
 
 	lnet_net_unlock(LNET_LOCK_EX);
 
@@ -530,7 +538,7 @@ void lnet_peer_uninit(void)
 
 			peer = lpni->lpni_peer_net->lpn_peer;
 			if (peer->lp_primary_nid != lpni->lpni_nid) {
-				lnet_peer_ni_del_locked(lpni);
+				lnet_peer_ni_del_locked(lpni, false);
 				continue;
 			}
 			/*
@@ -2545,6 +2553,13 @@ static int lnet_peer_merge_data(struct lnet_peer *lp,
 	}
 
 	for (i = 0; i < ndelnis; i++) {
+		/* for routers it's okay to delete the primary_nid because
+		 * the upper layers don't really rely on it. So if we're
+		 * being told that the router changed its primary_nid
+		 * then it's okay to delete it.
+		 */
+		if (lp->lp_rtr_refcount > 0)
+			flags |= LNET_PEER_RTR_NI_FORCE_DEL;
 		rc = lnet_peer_del_nid(lp, delnis[i], flags);
 		if (rc) {
 			CERROR("Error deleting NID %s from peer %s: %d\n",
-- 
1.8.3.1


* [lustre-devel] [PATCH 346/622] lnet: transfer routers
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (344 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 345/622] lnet: allow deleting router primary_nid James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 347/622] lnet: handle health for incoming messages James Simmons
                   ` (276 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

When the primary NID of a peer is about to be deleted because it is
being transferred to another peer, and that peer is a gateway, then
transfer all gateway properties to the new peer.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11475
Lustre-commit: cab57464e17b ("LU-11475 lnet: transfer routers")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34539
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h |  2 ++
 net/lnet/lnet/peer.c          | 12 ++++++++++++
 net/lnet/lnet/router.c        | 29 +++++++++++++++++++++++++++++
 3 files changed, 43 insertions(+)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index 1d06263..5a83e3a 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -534,6 +534,8 @@ int lnet_get_peer_list(u32 *countp, u32 *sizep,
 int lnet_rtrpools_enable(void);
 void lnet_rtrpools_disable(void);
 void lnet_rtrpools_free(int keep_pools);
+void lnet_rtr_transfer_to_peer(struct lnet_peer *src,
+			       struct lnet_peer *target);
 struct lnet_remotenet *lnet_find_rnet_locked(u32 net);
 int lnet_dyn_add_net(struct lnet_ioctl_config_data *conf);
 int lnet_dyn_del_net(u32 net);
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index a81fee2..5d13986 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -1355,6 +1355,18 @@ struct lnet_peer_net *
 		}
 		/* If this is the primary NID, destroy the peer. */
 		if (lnet_peer_ni_is_primary(lpni)) {
+			struct lnet_peer *rtr_lp =
+			  lpni->lpni_peer_net->lpn_peer;
+			int rtr_refcount = rtr_lp->lp_rtr_refcount;
+
+			/* if we're trying to delete a router it means
+			 * we're moving this peer NI to a new peer so must
+			 * transfer router properties to the new peer
+			 */
+			if (rtr_refcount > 0) {
+				flags |= LNET_PEER_RTR_NI_FORCE_DEL;
+				lnet_rtr_transfer_to_peer(rtr_lp, lp);
+			}
 			lnet_peer_del(lpni->lpni_peer_net->lpn_peer);
 			lpni = lnet_peer_ni_alloc(nid);
 			if (!lpni) {
diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index 4a061f3..aa8ec8c 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -136,6 +136,35 @@ static int rtr_sensitivity_set(const char *val,
 	return 0;
 }
 
+void
+lnet_rtr_transfer_to_peer(struct lnet_peer *src, struct lnet_peer *target)
+{
+	struct lnet_route *route;
+
+	lnet_net_lock(LNET_LOCK_EX);
+	target->lp_rtr_refcount += src->lp_rtr_refcount;
+	/* move the list of queued messages to the new peer */
+	list_splice_init(&src->lp_rtrq, &target->lp_rtrq);
+	/* move all the routes that reference the peer */
+	list_splice_init(&src->lp_routes, &target->lp_routes);
+	/* update all the routes to point to the new peer */
+	list_for_each_entry(route, &target->lp_routes, lr_gwlist)
+		route->lr_gateway = target;
+	/* remove the old peer from the ln_routers list */
+	list_del_init(&src->lp_rtr_list);
+	/* add the new peer to the ln_routers list */
+	if (list_empty(&target->lp_rtr_list)) {
+		lnet_peer_addref_locked(target);
+		list_add_tail(&target->lp_rtr_list, &the_lnet.ln_routers);
+	}
+	/* reset the ref count on the old peer and decrement its ref count */
+	src->lp_rtr_refcount = 0;
+	lnet_peer_decref_locked(src);
+	/* update the router version */
+	the_lnet.ln_routers_version++;
+	lnet_net_unlock(LNET_LOCK_EX);
+}
+
 int
 lnet_peers_start_down(void)
 {
-- 
1.8.3.1


* [lustre-devel] [PATCH 347/622] lnet: handle health for incoming messages
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (345 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 346/622] lnet: transfer routers James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 348/622] lnet: misleading discovery seqno James Simmons
                   ` (275 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

In the case of routers (as well as in the general case) it's important
to update the health of the ni/lpni for incoming messages. For an lpni
specifically, receiving a message from it is how we know the lpni
is up.

A minimum router health percentage is required in order to send a
message to a gateway. It defaults to 100, meaning that a router
interface has to be absolutely healthy in order to send to it, which
matches the current behavior. So if a router interface goes down and
its health drops significantly, but it then comes back up again
(either we receive a message from it, or we discover it and get a
reply), then in order to start using that router interface again we
have to boost its health all the way back up to maximum.

This behavior is special-cased for routers.
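
The boost-to-maximum rule above can be sketched as follows. The names
and constants here are illustrative (standing in for LNet's
`LNET_MAX_HEALTH_VALUE` and its recovery increment), not the actual
symbols:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_HEALTH  1000	/* stands in for LNET_MAX_HEALTH_VALUE */
#define HEALTH_STEP 100		/* illustrative recovery increment */

/* On a successful receive, a router NI jumps straight back to full
 * health so it can be used for routing again; a regular peer NI
 * recovers gradually, clamped at the maximum. */
static int update_peer_health(int healthv, bool is_router)
{
	if (is_router)
		return MAX_HEALTH;
	healthv += HEALTH_STEP;
	return healthv > MAX_HEALTH ? MAX_HEALTH : healthv;
}
```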

WC-bug-id: https://jira.whamcloud.com/browse/LU-11477
Lustre-commit: 18c850cb91a6 ("LU-11477 lnet: handle health for incoming messages")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33301
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-msg.c | 90 +++++++++++++++++++++++++++++++++++--------------
 1 file changed, 65 insertions(+), 25 deletions(-)

diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index 23c3bf4..2cbaff8a 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -598,19 +598,23 @@
 {
 	enum lnet_msg_hstatus hstatus = msg->msg_health_status;
 	bool lo = false;
+	struct lnet_ni *ni;
+	struct lnet_peer_ni *lpni;
 
 	/* if we're shutting down no point in handling health. */
 	if (the_lnet.ln_mt_state != LNET_MT_STATE_RUNNING)
 		return -1;
 
-	LASSERT(msg->msg_txni);
+	LASSERT(msg->msg_tx_committed || msg->msg_rx_committed);
 
 	/* if we're sending to the LOLND then the msg_txpeer will not be
 	 * set. So no need to sanity check it.
 	 */
-	if (LNET_NETTYP(LNET_NIDNET(msg->msg_txni->ni_nid)) != LOLND)
+	if (msg->msg_tx_committed &&
+	    LNET_NETTYP(LNET_NIDNET(msg->msg_txni->ni_nid)) != LOLND)
 		LASSERT(msg->msg_txpeer);
-	else
+	else if (msg->msg_tx_committed &&
+		 LNET_NETTYP(LNET_NIDNET(msg->msg_txni->ni_nid)) == LOLND)
 		lo = true;
 
 	if (hstatus != LNET_MSG_STATUS_OK &&
@@ -626,20 +630,52 @@
 		lnet_net_unlock(0);
 	}
 
+	/* always prefer txni/txpeer if the message is committed for both
+	 * directions.
+	 */
+	if (msg->msg_tx_committed) {
+		ni = msg->msg_txni;
+		lpni = msg->msg_txpeer;
+	} else {
+		ni = msg->msg_rxni;
+		lpni = msg->msg_rxpeer;
+	}
+
+	if (!lo)
+		LASSERT(ni && lpni);
+	else
+		LASSERT(ni);
+
 	CDEBUG(D_NET, "health check: %s->%s: %s: %s\n",
-	       libcfs_nid2str(msg->msg_txni->ni_nid),
-	       (lo) ? "self" : libcfs_nid2str(msg->msg_txpeer->lpni_nid),
+	       libcfs_nid2str(ni->ni_nid),
+	       (lo) ? "self" : libcfs_nid2str(lpni->lpni_nid),
 	       lnet_msgtyp2str(msg->msg_type),
 	       lnet_health_error2str(hstatus));
 
 	switch (hstatus) {
 	case LNET_MSG_STATUS_OK:
-		lnet_inc_healthv(&msg->msg_txni->ni_healthv);
+		/* increment the local ni health whether we successfully
+		 * received or sent a message on it.
+		 */
+		lnet_inc_healthv(&ni->ni_healthv);
 		/* It's possible msg_txpeer is NULL in the LOLND
-		 * case.
+		 * case. Only increment the peer's health if we're
+		 * receiving a message from it. It's the only sure way to
+		 * know that a remote interface is up.
+		 * If this interface is part of a router, then take that
+		 * as indication that the router is fully healthy.
 		 */
-		if (msg->msg_txpeer)
-			lnet_inc_healthv(&msg->msg_txpeer->lpni_healthv);
+		if (lpni && msg->msg_rx_committed) {
+			/* If we're receiving a message from the router or
+			 * I'm a router, then set that lpni's health to
+			 * maximum so we can commence communication
+			 */
+			if (lnet_isrouter(lpni) || the_lnet.ln_routing)
+				lnet_set_healthv(&lpni->lpni_healthv,
+						 LNET_MAX_HEALTH_VALUE);
+			else
+				lnet_inc_healthv(&lpni->lpni_healthv);
+		}
 
 		/* we can finalize this message */
 		return -1;
@@ -648,34 +684,41 @@
 	case LNET_MSG_STATUS_LOCAL_ABORTED:
 	case LNET_MSG_STATUS_LOCAL_NO_ROUTE:
 	case LNET_MSG_STATUS_LOCAL_TIMEOUT:
-		lnet_handle_local_failure(msg->msg_txni);
-		/* add to the re-send queue */
-		goto resend;
+		lnet_handle_local_failure(ni);
+		if (msg->msg_tx_committed)
+			/* add to the re-send queue */
+			goto resend;
+		break;
 
 	/* These errors will not trigger a resend so simply
 	 * finalize the message
 	 */
 	case LNET_MSG_STATUS_LOCAL_ERROR:
-		lnet_handle_local_failure(msg->msg_txni);
+		lnet_handle_local_failure(ni);
 		return -1;
 
 	/* TODO: since the remote dropped the message we can
 	 * attempt a resend safely.
 	 */
 	case LNET_MSG_STATUS_REMOTE_DROPPED:
-		lnet_handle_remote_failure(msg->msg_txpeer);
-		goto resend;
+		lnet_handle_remote_failure(lpni);
+		if (msg->msg_tx_committed)
+			goto resend;
+		break;
 
 	case LNET_MSG_STATUS_REMOTE_ERROR:
 	case LNET_MSG_STATUS_REMOTE_TIMEOUT:
 	case LNET_MSG_STATUS_NETWORK_TIMEOUT:
-		lnet_handle_remote_failure(msg->msg_txpeer);
+		lnet_handle_remote_failure(lpni);
 		return -1;
 	default:
 		LBUG();
 	}
 
 resend:
+	/* we can only resend tx_committed messages */
+	LASSERT(msg->msg_tx_committed);
+
 	/* don't resend recovery messages */
 	if (msg->msg_recovery) {
 		CDEBUG(D_NET, "msg %s->%s is a recovery ping. retry# %d\n",
@@ -783,7 +826,7 @@
 static bool
 lnet_is_health_check(struct lnet_msg *msg)
 {
-	bool hc;
+	bool hc = true;
 	int status = msg->msg_ev.status;
 
 	if ((!msg->msg_tx_committed && !msg->msg_rx_committed) ||
@@ -800,15 +843,12 @@
 		return false;
 	}
 
-	/* perform a health check for any message committed for transmit */
-	hc = msg->msg_tx_committed;
-
 	/* Check for status inconsistencies */
-	if (hc &&
-	    ((!status && msg->msg_health_status != LNET_MSG_STATUS_OK) ||
-	     (status && msg->msg_health_status == LNET_MSG_STATUS_OK))) {
-		CERROR("Msg is in inconsistent state, don't perform health checking (%d, %d)\n",
-		       status, msg->msg_health_status);
+	if ((!status && msg->msg_health_status != LNET_MSG_STATUS_OK) ||
+	    (status && msg->msg_health_status == LNET_MSG_STATUS_OK)) {
+		CDEBUG(D_NET,
+		       "Msg %p is in inconsistent state, don't perform health checking (%d, %d)\n",
+		       msg, status, msg->msg_health_status);
 		hc = false;
 	}
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 348/622] lnet: misleading discovery seqno.
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (346 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 347/622] lnet: handle health for incoming messages James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 349/622] lnet: drop all rule James Simmons
                   ` (274 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

There is a sequence number used when sending discovery messages,
intended to detect stale messages. However, it can be misleading if
the peer reboots: the peer's sequence number will reset, and the
node will think that all information being sent to it is stale,
while in reality the peer might have changed configuration.

There is no reliable way to know whether a peer rebooted, so we
always assume that the messages we're receiving are valid and
operate on a first-come, first-served basis.
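
The first-come, first-served handling can be sketched as a small
helper (hypothetical names; the real code updates `lp_peer_seqno`
inline in lnet_peer_push_event()):

```c
#include <assert.h>
#include <stdbool.h>

/* Always accept the incoming sequence number. A lower value no longer
 * means "stale data, drop it" -- it may simply mean the peer rebooted
 * and its counter reset, so it is only worth a debug message. */
static unsigned int accept_seqno(unsigned int have, unsigned int got,
				 bool *maybe_rebooted)
{
	*maybe_rebooted = got < have;
	return got;
}
```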

WC-bug-id: https://jira.whamcloud.com/browse/LU-11478
Lustre-commit: 42d999ed8f61 ("LU-11478 lnet: misleading discovery seqno.")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33304
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/peer.c | 45 +++++++--------------------------------------
 1 file changed, 7 insertions(+), 38 deletions(-)

diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 5d13986..2097a97 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -1987,38 +1987,9 @@ void lnet_peer_push_event(struct lnet_event *ev)
 		goto out;
 	}
 
-	/*
-	 * Check whether the Put data is stale. Stale data can just be
-	 * dropped.
-	 */
-	if (pbuf->pb_info.pi_nnis > 1 &&
-	    lp->lp_primary_nid == pbuf->pb_info.pi_ni[1].ns_nid &&
-	    LNET_PING_BUFFER_SEQNO(pbuf) < lp->lp_peer_seqno) {
-		CDEBUG(D_NET, "Stale Push from %s: got %u have %u\n",
-		       libcfs_nid2str(lp->lp_primary_nid),
-		       LNET_PING_BUFFER_SEQNO(pbuf),
-		       lp->lp_peer_seqno);
-		goto out;
-	}
-
-	/*
-	 * Check whether the Put data is new, in which case we clear
-	 * the UPTODATE flag and prepare to process it.
-	 *
-	 * If the Put data is current, and the peer is UPTODATE then
-	 * we assome everything is all right and drop the data as
-	 * stale.
-	 */
-	if (LNET_PING_BUFFER_SEQNO(pbuf) > lp->lp_peer_seqno) {
-		lp->lp_peer_seqno = LNET_PING_BUFFER_SEQNO(pbuf);
-		lp->lp_state &= ~LNET_PEER_NIDS_UPTODATE;
-	} else if (lp->lp_state & LNET_PEER_NIDS_UPTODATE) {
-		CDEBUG(D_NET, "Stale Push from %s: got %u have %u\n",
-		       libcfs_nid2str(lp->lp_primary_nid),
-		       LNET_PING_BUFFER_SEQNO(pbuf),
-		       lp->lp_peer_seqno);
-		goto out;
-	}
+	/* always assume new data */
+	lp->lp_peer_seqno = LNET_PING_BUFFER_SEQNO(pbuf);
+	lp->lp_state &= ~LNET_PEER_NIDS_UPTODATE;
 
 	/*
 	 * If there is data present that hasn't been processed yet,
@@ -2302,16 +2273,14 @@ static void lnet_peer_clear_discovery_error(struct lnet_peer *lp)
 	if (pbuf->pb_info.pi_features & LNET_PING_FEAT_MULTI_RAIL &&
 	    pbuf->pb_info.pi_nnis > 1 &&
 	    lp->lp_primary_nid == pbuf->pb_info.pi_ni[1].ns_nid) {
-		if (LNET_PING_BUFFER_SEQNO(pbuf) < lp->lp_peer_seqno) {
-			CDEBUG(D_NET, "Stale Reply from %s: got %u have %u\n",
+		if (LNET_PING_BUFFER_SEQNO(pbuf) < lp->lp_peer_seqno)
+			CDEBUG(D_NET,
+			       "peer %s: seq# got %u have %u. peer rebooted?\n",
 			       libcfs_nid2str(lp->lp_primary_nid),
 			       LNET_PING_BUFFER_SEQNO(pbuf),
 			       lp->lp_peer_seqno);
-			goto out;
-		}
 
-		if (LNET_PING_BUFFER_SEQNO(pbuf) > lp->lp_peer_seqno)
-			lp->lp_peer_seqno = LNET_PING_BUFFER_SEQNO(pbuf);
+		lp->lp_peer_seqno = LNET_PING_BUFFER_SEQNO(pbuf);
 	}
 
 	/* We're happy with the state of the data in the buffer. */
-- 
1.8.3.1


* [lustre-devel] [PATCH 349/622] lnet: drop all rule
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (347 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 348/622] lnet: misleading discovery seqno James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 350/622] lnet: handle discovery off James Simmons
                   ` (273 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Add a rule to drop all messages arriving on a specific interface.
This is useful for simulating failures on a specific router interface.
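
The short-circuit this rule adds can be sketched like this
(simplified, with hypothetical names; the real `drop_rule_match()`
also checks source NID, portals, message types and drop rates):

```c
#include <assert.h>
#include <stdbool.h>

struct demo_drop_attr {
	unsigned long fa_local_nid;	/* 0 here means "match any NID" */
	bool da_drop_all;
};

/* Drop when the local (receiving) NID matches and either the
 * drop-all flag is set or some other drop criterion fires. */
static bool demo_should_drop(const struct demo_drop_attr *attr,
			     unsigned long local_nid, bool other_match)
{
	if (attr->fa_local_nid && attr->fa_local_nid != local_nid)
		return false;
	if (attr->da_drop_all)
		return true;	/* unconditionally drop on this NID */
	return other_match;
}
```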

WC-bug-id: https://jira.whamcloud.com/browse/LU-11470
Lustre-commit: deb31c2ffad5 ("LU-11470 lnet: drop all rule")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33305
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h     |  3 ++-
 include/uapi/linux/lnet/lnetctl.h |  6 ++++++
 net/lnet/lnet/lib-move.c          |  2 +-
 net/lnet/lnet/lib-msg.c           |  7 +++++--
 net/lnet/lnet/net_fault.c         | 28 +++++++++++++++++++++-------
 5 files changed, 35 insertions(+), 11 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index 5a83e3a..4dee7a9 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -663,7 +663,8 @@ void lnet_drop_message(struct lnet_ni *ni, int cpt, void *private,
 int lnet_fault_init(void);
 void lnet_fault_fini(void);
 
-bool lnet_drop_rule_match(struct lnet_hdr *hdr, enum lnet_msg_hstatus *hstatus);
+bool lnet_drop_rule_match(struct lnet_hdr *hdr, lnet_nid_t local_nid,
+			  enum lnet_msg_hstatus *hstatus);
 
 int lnet_delay_rule_add(struct lnet_fault_attr *attr);
 int lnet_delay_rule_del(lnet_nid_t src, lnet_nid_t dst, bool shutdown);
diff --git a/include/uapi/linux/lnet/lnetctl.h b/include/uapi/linux/lnet/lnetctl.h
index 2eb9c82..bd08b4f 100644
--- a/include/uapi/linux/lnet/lnetctl.h
+++ b/include/uapi/linux/lnet/lnetctl.h
@@ -64,6 +64,10 @@ struct lnet_fault_attr {
 	lnet_nid_t			fa_src;
 	/** destination NID of drop rule, see @dr_src for details */
 	lnet_nid_t			fa_dst;
+	/** local NID. In the router case this is the NID we're receiving
+	 * messages on
+	 */
+	lnet_nid_t			fa_local_nid;
 	/**
 	 * Portal mask to drop, -1 means all portals, for example:
 	 * fa_ptl_mask = (1 << _LDLM_CB_REQUEST_PORTAL ) |
@@ -95,6 +99,8 @@ struct lnet_fault_attr {
 			__u32			da_health_error_mask;
 			/** randomize error generation */
 			bool			da_random;
+			/** drop all messages if flag is set */
+			bool			da_drop_all;
 		} drop;
 		/** message latency simulation */
 		struct {
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 90b4e3f..fff9fea 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -3964,7 +3964,7 @@ void lnet_monitor_thr_stop(void)
 	}
 
 	if (!list_empty(&the_lnet.ln_drop_rules) &&
-	    lnet_drop_rule_match(hdr, NULL)) {
+	    lnet_drop_rule_match(hdr, ni->ni_nid, NULL)) {
 		CDEBUG(D_NET, "%s, src %s, dst %s: Dropping %s to simulate silent message loss\n",
 		       libcfs_nid2str(from_nid), libcfs_nid2str(src_nid),
 		       libcfs_nid2str(dest_nid), lnet_msgtyp2str(type));
diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index 2cbaff8a..8876866 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -900,11 +900,14 @@
 		return false;
 
 	/* match only health rules */
-	if (!lnet_drop_rule_match(&msg->msg_hdr, hstatus))
+	if (!lnet_drop_rule_match(&msg->msg_hdr, LNET_NID_ANY,
+				  hstatus))
 		return false;
 
-	CDEBUG(D_NET, "src %s, dst %s: %s simulate health error: %s\n",
+	CDEBUG(D_NET,
+	       "src %s(%s)->dst %s: %s simulate health error: %s\n",
 	       libcfs_nid2str(msg->msg_hdr.src_nid),
+	       libcfs_nid2str(msg->msg_txni->ni_nid),
 	       libcfs_nid2str(msg->msg_hdr.dest_nid),
 	       lnet_msgtyp2str(msg->msg_type),
 	       lnet_health_error2str(*hstatus));
diff --git a/net/lnet/lnet/net_fault.c b/net/lnet/lnet/net_fault.c
index becb709..9f78e43 100644
--- a/net/lnet/lnet/net_fault.c
+++ b/net/lnet/lnet/net_fault.c
@@ -79,10 +79,12 @@ struct lnet_drop_rule {
 
 static bool
 lnet_fault_attr_match(struct lnet_fault_attr *attr, lnet_nid_t src,
-		      lnet_nid_t dst, unsigned int type, unsigned int portal)
+		      lnet_nid_t local_nid, lnet_nid_t dst,
+		      unsigned int type, unsigned int portal)
 {
 	if (!lnet_fault_nid_match(attr->fa_src, src) ||
-	    !lnet_fault_nid_match(attr->fa_dst, dst))
+	    !lnet_fault_nid_match(attr->fa_dst, dst) ||
+	    !lnet_fault_nid_match(attr->fa_local_nid, local_nid))
 		return false;
 
 	if (!(attr->fa_msg_mask & (1 << type)))
@@ -340,15 +342,22 @@ struct lnet_drop_rule {
  */
 static bool
 drop_rule_match(struct lnet_drop_rule *rule, lnet_nid_t src,
-		lnet_nid_t dst, unsigned int type, unsigned int portal,
+		lnet_nid_t local_nid, lnet_nid_t dst,
+		unsigned int type, unsigned int portal,
 		enum lnet_msg_hstatus *hstatus)
 {
 	struct lnet_fault_attr *attr = &rule->dr_attr;
 	bool drop;
 
-	if (!lnet_fault_attr_match(attr, src, dst, type, portal))
+	if (!lnet_fault_attr_match(attr, src, local_nid, dst, type, portal))
 		return false;
 
+	if (attr->u.drop.da_drop_all) {
+		CDEBUG(D_NET, "set to drop all messages\n");
+		drop = true;
+		goto drop_matched;
+	}
+
 	/* if we're trying to match a health status error but it hasn't
 	 * been set in the rule, then don't match
 	 */
@@ -396,6 +405,8 @@ struct lnet_drop_rule {
 		}
 	}
 
+drop_matched:
+
 	if (drop) { /* drop this message, update counters */
 		if (hstatus)
 			lnet_fault_match_health(hstatus,
@@ -412,7 +423,9 @@ struct lnet_drop_rule {
  * Check if message from @src to @dst can match any existed drop rule
  */
 bool
-lnet_drop_rule_match(struct lnet_hdr *hdr, enum lnet_msg_hstatus *hstatus)
+lnet_drop_rule_match(struct lnet_hdr *hdr,
+		     lnet_nid_t local_nid,
+		     enum lnet_msg_hstatus *hstatus)
 {
 	lnet_nid_t src = le64_to_cpu(hdr->src_nid);
 	lnet_nid_t dst = le64_to_cpu(hdr->dest_nid);
@@ -433,7 +446,7 @@ struct lnet_drop_rule {
 
 	cpt = lnet_net_lock_current();
 	list_for_each_entry(rule, &the_lnet.ln_drop_rules, dr_link) {
-		drop = drop_rule_match(rule, src, dst, typ, ptl,
+		drop = drop_rule_match(rule, src, local_nid, dst, typ, ptl,
 				       hstatus);
 		if (drop)
 			break;
@@ -524,7 +537,8 @@ struct delay_daemon_data {
 	struct lnet_fault_attr *attr = &rule->dl_attr;
 	bool delay;
 
-	if (!lnet_fault_attr_match(attr, src, dst, type, portal))
+	if (!lnet_fault_attr_match(attr, src, LNET_NID_ANY,
+				   dst, type, portal))
 		return false;
 
 	/* match this rule, check delay rate now */
-- 
1.8.3.1


* [lustre-devel] [PATCH 350/622] lnet: handle discovery off
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (348 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 349/622] lnet: drop all rule James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 351/622] lnet: handle router health off James Simmons
                   ` (272 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

When discovery is turned off locally, or when the peer either has
discovery off or doesn't support MR at all, degrade discovery
behavior to a standard ping. This allows routers to continue using
the discovery mechanism even if it's turned off.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11641
Lustre-commit: f9ad0d13b092 ("LU-11641 lnet: handle discovery off")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33620
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h |   4 +
 net/lnet/lnet/lib-move.c      |  21 +++--
 net/lnet/lnet/peer.c          | 176 ++++++++++++++++++++++++++++++------------
 3 files changed, 144 insertions(+), 57 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index 4dee7a9..09adfc3 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -863,6 +863,7 @@ int lnet_get_peer_ni_info(u32 peer_index, u64 *nid,
 }
 
 bool lnet_peer_is_uptodate(struct lnet_peer *lp);
+bool lnet_is_discovery_disabled(struct lnet_peer *lp);
 bool lnet_peer_gw_discovery(struct lnet_peer *lp);
 
 static inline bool
@@ -874,6 +875,9 @@ int lnet_get_peer_ni_info(u32 peer_index, u64 *nid,
 		return true;
 	if (lp->lp_state & LNET_PEER_NO_DISCOVERY)
 		return false;
+	/* if discovery is not enabled then no need to push */
+	if (lnet_peer_discovery_disabled)
+		return false;
 	if (lp->lp_node_seqno < atomic_read(&the_lnet.ln_ping_target_seqno))
 		return true;
 	return false;
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index fff9fea..0ff1d38 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -1773,6 +1773,11 @@ struct lnet_ni *
 		return 0;
 	}
 
+	if (!lnet_msg_discovery(msg) || lnet_peer_is_uptodate(peer)) {
+		lnet_peer_ni_decref_locked(lpni);
+		return 0;
+	}
+
 	rc = lnet_discover_peer_locked(lpni, cpt, false);
 	if (rc) {
 		lnet_peer_ni_decref_locked(lpni);
@@ -1802,6 +1807,7 @@ struct lnet_ni *
 			     struct lnet_peer_ni **gw_lpni,
 			     struct lnet_peer **gw_peer)
 {
+	int rc;
 	struct lnet_peer *gw;
 	struct lnet_route *best_route;
 	struct lnet_route *last_route;
@@ -1831,12 +1837,11 @@ struct lnet_ni *
 	 * This means we might delay the message until discovery has
 	 * completed
 	 */
-	if (lnet_msg_discovery(sd->sd_msg) &&
-	    !lnet_peer_is_uptodate(gw)) {
-		sd->sd_msg->msg_src_nid_param = sd->sd_src_nid;
-		return lnet_initiate_peer_discovery(lpni, sd->sd_msg,
-						    sd->sd_rtr_nid, sd->sd_cpt);
-	}
+	sd->sd_msg->msg_src_nid_param = sd->sd_src_nid;
+	rc = lnet_initiate_peer_discovery(lpni, sd->sd_msg, sd->sd_rtr_nid,
+					  sd->sd_cpt);
+	if (rc)
+		return rc;
 
 	if (!sd->sd_best_ni) {
 		struct lnet_peer_net *lpeer;
@@ -2358,8 +2363,8 @@ struct lnet_ni *
 	 * trigger discovery.
 	 */
 	peer = lpni->lpni_peer_net->lpn_peer;
-	if (lnet_msg_discovery(msg) && !lnet_peer_is_uptodate(peer)) {
-		rc = lnet_initiate_peer_discovery(lpni, msg, rtr_nid, cpt);
+	rc = lnet_initiate_peer_discovery(lpni, msg, rtr_nid, cpt);
+	if (rc) {
 		lnet_peer_ni_decref_locked(lpni);
 		lnet_net_unlock(cpt);
 		return rc;
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 2097a97..41a6180 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -1444,7 +1444,10 @@ struct lnet_peer_net *
 	struct lnet_peer *lp;
 	struct lnet_peer_net *lpn;
 	struct lnet_peer_ni *lpni;
-	unsigned int flags = 0;
+	/* Assume peer is Multi-Rail capable and let discovery find out
+	 * otherwise.
+	 */
+	unsigned int flags = LNET_PEER_MULTI_RAIL;
 	int rc = 0;
 
 	if (nid == LNET_NID_ANY) {
@@ -1742,9 +1745,34 @@ struct lnet_peer_ni *
 	return lpni;
 }
 
+bool
+lnet_is_discovery_disabled_locked(struct lnet_peer *lp)
+{
+	if (lnet_peer_discovery_disabled)
+		return true;
+
+	if (!(lp->lp_state & LNET_PEER_MULTI_RAIL) ||
+	    (lp->lp_state & LNET_PEER_NO_DISCOVERY)) {
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * Peer Discovery
  */
+bool
+lnet_is_discovery_disabled(struct lnet_peer *lp)
+{
+	bool rc = false;
+
+	spin_lock(&lp->lp_lock);
+	rc = lnet_is_discovery_disabled_locked(lp);
+	spin_unlock(&lp->lp_lock);
+
+	return rc;
+}
 
 bool
 lnet_peer_gw_discovery(struct lnet_peer *lp)
@@ -1777,13 +1805,8 @@ struct lnet_peer_ni *
 			    LNET_PEER_FORCE_PING |
 			    LNET_PEER_FORCE_PUSH)) {
 		rc = false;
-	} else if (lp->lp_state & LNET_PEER_NO_DISCOVERY) {
-		rc = true;
 	} else if (lp->lp_state & LNET_PEER_REDISCOVER) {
-		if (lnet_peer_discovery_disabled)
-			rc = true;
-		else
-			rc = false;
+		rc = false;
 	} else if (lnet_peer_needs_push(lp)) {
 		rc = false;
 	} else if (lp->lp_state & LNET_PEER_DISCOVERED) {
@@ -2095,6 +2118,9 @@ static void lnet_peer_clear_discovery_error(struct lnet_peer *lp)
 		if (lnet_peer_is_uptodate(lp))
 			break;
 		lnet_peer_queue_for_discovery(lp);
+
+		if (lnet_is_discovery_disabled(lp))
+			break;
 		/*
 		 * if caller requested a non-blocking operation then
 		 * return immediately. Once discovery is complete then the
@@ -2133,7 +2159,7 @@ static void lnet_peer_clear_discovery_error(struct lnet_peer *lp)
 		rc = lp->lp_dc_error;
 	else if (!block)
 		CDEBUG(D_NET, "non-blocking discovery\n");
-	else if (!lnet_peer_is_uptodate(lp))
+	else if (!lnet_peer_is_uptodate(lp) && !lnet_is_discovery_disabled(lp))
 		goto again;
 
 	CDEBUG(D_NET, "peer %s NID %s: %d. %s\n",
@@ -2205,6 +2231,34 @@ static void lnet_peer_clear_discovery_error(struct lnet_peer *lp)
 	}
 
 	/*
+	 * Only enable the multi-rail feature on the peer if both sides of
+	 * the connection have discovery on
+	 */
+	if (pbuf->pb_info.pi_features & LNET_PING_FEAT_MULTI_RAIL) {
+		CDEBUG(D_NET, "Peer %s has Multi-Rail feature enabled\n",
+		       libcfs_nid2str(lp->lp_primary_nid));
+		lp->lp_state |= LNET_PEER_MULTI_RAIL;
+	} else {
+		CDEBUG(D_NET, "Peer %s has Multi-Rail feature disabled\n",
+		       libcfs_nid2str(lp->lp_primary_nid));
+		lp->lp_state &= ~LNET_PEER_MULTI_RAIL;
+	}
+
+	/* The peer may have discovery disabled at its end. Set
+	 * NO_DISCOVERY as appropriate.
+	 */
+	if ((pbuf->pb_info.pi_features & LNET_PING_FEAT_DISCOVERY) &&
+	    !lnet_peer_discovery_disabled) {
+		CDEBUG(D_NET, "Peer %s has discovery enabled\n",
+		       libcfs_nid2str(lp->lp_primary_nid));
+		lp->lp_state &= ~LNET_PEER_NO_DISCOVERY;
+	} else {
+		CDEBUG(D_NET, "Peer %s has discovery disabled\n",
+		       libcfs_nid2str(lp->lp_primary_nid));
+		lp->lp_state |= LNET_PEER_NO_DISCOVERY;
+	}
+
+	/*
 	 * Update the MULTI_RAIL flag based on the reply. If the peer
 	 * was configured with DLC then the setting should match what
 	 * DLC put in.
@@ -2216,8 +2270,16 @@ static void lnet_peer_clear_discovery_error(struct lnet_peer *lp)
 			CWARN("Reply says %s is Multi-Rail, DLC says not\n",
 			      libcfs_nid2str(lp->lp_primary_nid));
 		} else {
-			lp->lp_state |= LNET_PEER_MULTI_RAIL;
-			lnet_peer_clr_non_mr_pref_nids(lp);
+			/* if discovery is disabled then we don't want to
+			 * update the state of the peer. All we'll do is
+			 * update the peer_nis which were reported back in
+			 * the initial ping
+			 */
+
+			if (!lnet_is_discovery_disabled_locked(lp)) {
+				lp->lp_state |= LNET_PEER_MULTI_RAIL;
+				lnet_peer_clr_non_mr_pref_nids(lp);
+			}
 		}
 	} else if (lp->lp_state & LNET_PEER_MULTI_RAIL) {
 		if (lp->lp_state & LNET_PEER_CONFIGURED) {
@@ -2238,20 +2300,6 @@ static void lnet_peer_clear_discovery_error(struct lnet_peer *lp)
 		lp->lp_data_nnis = pbuf->pb_info.pi_nnis;
 
 	/*
-	 * The peer may have discovery disabled at its end. Set
-	 * NO_DISCOVERY as appropriate.
-	 */
-	if (!(pbuf->pb_info.pi_features & LNET_PING_FEAT_DISCOVERY)) {
-		CDEBUG(D_NET, "Peer %s has discovery disabled\n",
-		       libcfs_nid2str(lp->lp_primary_nid));
-		lp->lp_state |= LNET_PEER_NO_DISCOVERY;
-	} else if (lp->lp_state & LNET_PEER_NO_DISCOVERY) {
-		CDEBUG(D_NET, "Peer %s has discovery enabled\n",
-		       libcfs_nid2str(lp->lp_primary_nid));
-		lp->lp_state &= ~LNET_PEER_NO_DISCOVERY;
-	}
-
-	/*
 	 * Check for truncation of the Reply. Clear PING_SENT and set
 	 * PING_FAILED to trigger a retry.
 	 */
@@ -2284,8 +2332,9 @@ static void lnet_peer_clear_discovery_error(struct lnet_peer *lp)
 	}
 
 	/* We're happy with the state of the data in the buffer. */
-	CDEBUG(D_NET, "peer %s data present %u\n",
-	       libcfs_nid2str(lp->lp_primary_nid), lp->lp_peer_seqno);
+	CDEBUG(D_NET, "peer %s data present %u. state = 0x%x\n",
+	       libcfs_nid2str(lp->lp_primary_nid), lp->lp_peer_seqno,
+	       lp->lp_state);
 	if (lp->lp_state & LNET_PEER_DATA_PRESENT)
 		lnet_ping_buffer_decref(lp->lp_data);
 	else
@@ -2517,6 +2566,14 @@ static int lnet_peer_merge_data(struct lnet_peer *lp,
 			delnis[ndelnis++] = curnis[i];
 	}
 
+	/* If we get here and the discovery is disabled then we don't want
+	 * to add or delete any NIs. We just updated the ones we have some
+	 * information on, and call it a day
+	 */
+	rc = 0;
+	if (lnet_is_discovery_disabled(lp))
+		goto out;
+
 	for (i = 0; i < naddnis; i++) {
 		rc = lnet_peer_add_nid(lp, addnis[i].ns_nid, flags);
 		if (rc) {
@@ -2561,7 +2618,8 @@ static int lnet_peer_merge_data(struct lnet_peer *lp,
 	kfree(addnis);
 	kfree(delnis);
 	lnet_ping_buffer_decref(pbuf);
-	CDEBUG(D_NET, "peer %s: %d\n", libcfs_nid2str(lp->lp_primary_nid), rc);
+	CDEBUG(D_NET, "peer %s (%p): %d\n", libcfs_nid2str(lp->lp_primary_nid),
+	       lp, rc);
 
 	if (rc) {
 		spin_lock(&lp->lp_lock);
@@ -2634,6 +2692,19 @@ static int lnet_peer_merge_data(struct lnet_peer *lp,
 	return 0;
 }
 
+static bool lnet_is_nid_in_ping_info(lnet_nid_t nid,
+				     struct lnet_ping_info *pinfo)
+{
+	int i;
+
+	for (i = 0; i < pinfo->pi_nnis; i++) {
+		if (pinfo->pi_ni[i].ns_nid == nid)
+			return true;
+	}
+
+	return false;
+}
+
 /*
  * Update a peer using the data received.
  */
@@ -2701,7 +2772,17 @@ static int lnet_peer_data_present(struct lnet_peer *lp)
 		rc = lnet_peer_set_primary_nid(lp, nid, flags);
 		if (!rc)
 			rc = lnet_peer_merge_data(lp, pbuf);
-	} else if (lp->lp_primary_nid == nid) {
+	/* if the primary nid of the peer is present in the ping info returned
+	 * from the peer, but it's not the local primary peer we have
+	 * cached and discovery is disabled, then we don't want to update
+	 * our local peer info, by adding or removing NIDs, we just want
+	 * to update the status of the nids that we currently have
+	 * recorded in that peer.
+	 */
+	} else if (lp->lp_primary_nid == nid ||
+		   (lnet_is_nid_in_ping_info(lp->lp_primary_nid,
+					     &pbuf->pb_info) &&
+		    lnet_is_discovery_disabled(lp))) {
 		rc = lnet_peer_merge_data(lp, pbuf);
 	} else {
 		lpni = lnet_find_peer_ni_locked(nid);
@@ -2718,13 +2799,24 @@ static int lnet_peer_data_present(struct lnet_peer *lp)
 			struct lnet_peer *new_lp;
 
 			new_lp = lpni->lpni_peer_net->lpn_peer;
+			/* if lp has discovery/MR enabled that means new_lp
+			 * should have discovery/MR enabled as well, since
+			 * it's the same peer, which we're about to merge
+			 */
+			if (!(lp->lp_state & LNET_PEER_NO_DISCOVERY))
+				new_lp->lp_state &= ~LNET_PEER_NO_DISCOVERY;
+			if (lp->lp_state & LNET_PEER_MULTI_RAIL)
+				new_lp->lp_state |= LNET_PEER_MULTI_RAIL;
+
 			rc = lnet_peer_set_primary_data(new_lp, pbuf);
 			lnet_consolidate_routes_locked(lp, new_lp);
 			lnet_peer_ni_decref_locked(lpni);
 		}
 	}
 out:
-	CDEBUG(D_NET, "peer %s: %d\n", libcfs_nid2str(lp->lp_primary_nid), rc);
+	CDEBUG(D_NET, "peer %s(%p): %d. state = 0x%x\n",
+	       libcfs_nid2str(lp->lp_primary_nid), lp, rc,
+	       lp->lp_state);
 	mutex_unlock(&the_lnet.ln_api_mutex);
 
 	spin_lock(&lp->lp_lock);
@@ -2941,7 +3033,8 @@ static int lnet_peer_send_push(struct lnet_peer *lp)
 	LNetMDUnlink(lp->lp_push_mdh);
 	LNetInvalidateMDHandle(&lp->lp_push_mdh);
 fail_error:
-	CDEBUG(D_NET, "peer %s: %d\n", libcfs_nid2str(lp->lp_primary_nid), rc);
+	CDEBUG(D_NET, "peer %s(%p): %d\n",
+	       libcfs_nid2str(lp->lp_primary_nid), lp, rc);
 	/*
 	 * The errors that get us here are considered hard errors and
 	 * cause Discovery to terminate. So we clear PUSH_SENT, but do
@@ -2985,19 +3078,6 @@ static int lnet_peer_discovered(struct lnet_peer *lp)
 	return 0;
 }
 
-/*
- * Mark the peer as to be rediscovered.
- */
-static int lnet_peer_rediscover(struct lnet_peer *lp)
-__must_hold(&lp->lp_lock)
-{
-	lp->lp_state |= LNET_PEER_REDISCOVER;
-	lp->lp_state &= ~LNET_PEER_DISCOVERING;
-
-	CDEBUG(D_NET, "peer %s\n", libcfs_nid2str(lp->lp_primary_nid));
-
-	return 0;
-}
 
 /*
  * Discovering this peer is taking too long. Cancel any Ping or Push
@@ -3170,8 +3250,8 @@ static int lnet_peer_discovery(void *arg)
 			 * forcing a Ping or Push.
 			 */
 			spin_lock(&lp->lp_lock);
-			CDEBUG(D_NET, "peer %s state %#x\n",
-			       libcfs_nid2str(lp->lp_primary_nid),
+			CDEBUG(D_NET, "peer %s(%p) state %#x\n",
+			       libcfs_nid2str(lp->lp_primary_nid), lp,
 			       lp->lp_state);
 			if (lp->lp_state & LNET_PEER_DATA_PRESENT)
 				rc = lnet_peer_data_present(lp);
@@ -3183,16 +3263,14 @@ static int lnet_peer_discovery(void *arg)
 				rc = lnet_peer_send_ping(lp);
 			else if (lp->lp_state & LNET_PEER_FORCE_PUSH)
 				rc = lnet_peer_send_push(lp);
-			else if (lnet_peer_discovery_disabled)
-				rc = lnet_peer_rediscover(lp);
 			else if (!(lp->lp_state & LNET_PEER_NIDS_UPTODATE))
 				rc = lnet_peer_send_ping(lp);
 			else if (lnet_peer_needs_push(lp))
 				rc = lnet_peer_send_push(lp);
 			else
 				rc = lnet_peer_discovered(lp);
-			CDEBUG(D_NET, "peer %s state %#x rc %d\n",
-			       libcfs_nid2str(lp->lp_primary_nid),
+			CDEBUG(D_NET, "peer %s(%p) state %#x rc %d\n",
+			       libcfs_nid2str(lp->lp_primary_nid), lp,
 			       lp->lp_state, rc);
 			spin_unlock(&lp->lp_lock);
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 351/622] lnet: handle router health off
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (349 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 350/622] lnet: handle discovery off James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 352/622] lnet: push router interface updates James Simmons
                   ` (271 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Routing infrastructure depends on health infrastructure to manage
route status. However, health can be turned off. Therefore, we need
to enable health for gateways in order to monitor them properly.
Each peer now has its own health sensitivity. When adding a route
the gateway's health sensitivity can be explicitly set from lnetctl
or if not specified then it'll default to 1, thereby turning health
on for that gateway, allowing peer NI recovery if there is a failure.
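The decrement-and-select logic described above can be sketched as below. This is a hypothetical simplification (the names `dec_healthv` and `pick_sensitivity` are illustrative, not the kernel symbols); the real code in `lnet_dec_healthv_locked()` and `lnet_handle_remote_failure_locked()` operates on `atomic_t` health values under the net lock.

```c
#include <assert.h>

/* Simplified form of lnet_dec_healthv_locked() after this patch: the
 * caller passes in the sensitivity to subtract rather than always
 * using the global lnet_health_sensitivity. A health value smaller
 * than the step clamps to zero. */
static int dec_healthv(int healthv, int sensitivity)
{
	if (healthv < sensitivity)
		return 0;
	return healthv - sensitivity;
}

/* Mirrors the selection in lnet_handle_remote_failure_locked(): a
 * non-zero per-peer sensitivity overrides the global one. */
static unsigned int pick_sensitivity(unsigned int global_sens,
				     unsigned int peer_sens)
{
	return peer_sens ? peer_sens : global_sens;
}
```

With a gateway created while routing is enabled, `lp_health_sensitivity` is 1, so each failure costs one health point instead of the (possibly zero) global amount.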

WC-bug-id: https://jira.whamcloud.com/browse/LU-11297
Lustre-commit: 00a2932b0aa7 ("LU-11297 lnet: handle router health off")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33634
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h      |  5 +++--
 include/linux/lnet/lib-types.h     |  6 ++++++
 include/uapi/linux/lnet/lnet-dlc.h |  1 +
 net/lnet/lnet/api-ni.c             | 16 +++++++++++++---
 net/lnet/lnet/config.c             |  2 +-
 net/lnet/lnet/lib-msg.c            | 20 +++++++++++++++-----
 net/lnet/lnet/peer.c               |  6 ++++++
 net/lnet/lnet/router.c             | 11 +++++++----
 8 files changed, 52 insertions(+), 15 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index 09adfc3..36aaaa5 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -512,11 +512,12 @@ int lnet_notify(struct lnet_ni *ni, lnet_nid_t peer, bool alive, bool reset,
 void lnet_notify_locked(struct lnet_peer_ni *lp, int notifylnd, int alive,
 			time64_t when);
 int lnet_add_route(u32 net, u32 hops, lnet_nid_t gateway_nid,
-		   unsigned int priority);
+		   u32 priority, u32 sensitivity);
 int lnet_del_route(u32 net, lnet_nid_t gw_nid);
 void lnet_destroy_routes(void);
 int lnet_get_route(int idx, u32 *net, u32 *hops,
-		   lnet_nid_t *gateway, u32 *alive, u32 *priority);
+		   lnet_nid_t *gateway, u32 *alive, u32 *priority,
+		   u32 *sensitivity);
 int lnet_get_rtr_pool_cfg(int idx, struct lnet_ioctl_pool_cfg *pool_cfg);
 struct lnet_ni *lnet_get_next_ni_locked(struct lnet_net *mynet,
 					struct lnet_ni *prev);
diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index 97d35e0..56654f5 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -606,6 +606,12 @@ struct lnet_peer {
 	/* # refs from lnet_route_t::lr_gateway */
 	int			lp_rtr_refcount;
 
+	/*
+	 * peer specific health sensitivity value to decrement peer nis in
+	 * this peer with if set to something other than 0
+	 */
+	u32			lp_health_sensitivity;
+
 	/* messages blocking for router credits */
 	struct list_head	lp_rtrq;
 
diff --git a/include/uapi/linux/lnet/lnet-dlc.h b/include/uapi/linux/lnet/lnet-dlc.h
index 87f7680..e0b9eae 100644
--- a/include/uapi/linux/lnet/lnet-dlc.h
+++ b/include/uapi/linux/lnet/lnet-dlc.h
@@ -129,6 +129,7 @@ struct lnet_ioctl_config_data {
 			__u32 rtr_hop;
 			__u32 rtr_priority;
 			__u32 rtr_flags;
+			__u32 rtr_sensitivity;
 		} cfg_route;
 		struct {
 			char net_intf[LNET_MAX_STR_LEN];
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index b1823cd..702e4b9 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -3455,19 +3455,28 @@ u32 lnet_get_dlc_seq_locked(void)
 	case IOC_LIBCFS_FAIL_NID:
 		return lnet_fail_nid(data->ioc_nid, data->ioc_count);
 
-	case IOC_LIBCFS_ADD_ROUTE:
+	case IOC_LIBCFS_ADD_ROUTE: {
+		/* default router sensitivity to 1 */
+		unsigned int sensitivity = 1;
 		config = arg;
 
 		if (config->cfg_hdr.ioc_len < sizeof(*config))
 			return -EINVAL;
 
+		if (config->cfg_config_u.cfg_route.rtr_sensitivity) {
+			sensitivity =
+			  config->cfg_config_u.cfg_route.rtr_sensitivity;
+		}
+
 		mutex_lock(&the_lnet.ln_api_mutex);
 		rc = lnet_add_route(config->cfg_net,
 				    config->cfg_config_u.cfg_route.rtr_hop,
 				    config->cfg_nid,
-				    config->cfg_config_u.cfg_route.rtr_priority);
+				    config->cfg_config_u.cfg_route.rtr_priority,
+				    sensitivity);
 		mutex_unlock(&the_lnet.ln_api_mutex);
 		return rc;
+	}
 
 	case IOC_LIBCFS_DEL_ROUTE:
 		config = arg;
@@ -3492,7 +3501,8 @@ u32 lnet_get_dlc_seq_locked(void)
 				    &config->cfg_config_u.cfg_route.rtr_hop,
 				    &config->cfg_nid,
 				    &config->cfg_config_u.cfg_route.rtr_flags,
-				    &config->cfg_config_u.cfg_route.rtr_priority);
+				    &config->cfg_config_u.cfg_route.rtr_priority,
+				    &config->cfg_config_u.cfg_route.rtr_sensitivity);
 		mutex_unlock(&the_lnet.ln_api_mutex);
 		return rc;
 
diff --git a/net/lnet/lnet/config.c b/net/lnet/lnet/config.c
index 760452c..949cdd3 100644
--- a/net/lnet/lnet/config.c
+++ b/net/lnet/lnet/config.c
@@ -1215,7 +1215,7 @@ struct lnet_ni *
 				continue;
 			}
 
-			rc = lnet_add_route(net, hops, nid, priority);
+			rc = lnet_add_route(net, hops, nid, priority, 1);
 			if (rc && rc != -EEXIST && rc != -EHOSTUNREACH) {
 				CERROR("Can't create route to %s via %s\n",
 				       libcfs_net2str(net),
diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index 8876866..9ffd874 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -448,14 +448,14 @@
 }
 
 static void
-lnet_dec_healthv_locked(atomic_t *healthv)
+lnet_dec_healthv_locked(atomic_t *healthv, int sensitivity)
 {
 	int h = atomic_read(healthv);
 
-	if (h < lnet_health_sensitivity) {
+	if (h < sensitivity) {
 		atomic_set(healthv, 0);
 	} else {
-		h -= lnet_health_sensitivity;
+		h -= sensitivity;
 		atomic_set(healthv, h);
 	}
 }
@@ -473,7 +473,7 @@
 		return;
 	}
 
-	lnet_dec_healthv_locked(&local_ni->ni_healthv);
+	lnet_dec_healthv_locked(&local_ni->ni_healthv, lnet_health_sensitivity);
 	/* add the NI to the recovery queue if it's not already there
 	 * and it's health value is actually below the maximum. It's
 	 * possible that the sensitivity might be set to 0, and the health
@@ -495,11 +495,21 @@
 void
 lnet_handle_remote_failure_locked(struct lnet_peer_ni *lpni)
 {
+	u32 sensitivity = lnet_health_sensitivity;
+	u32 lp_sensitivity;
+
 	/* lpni could be NULL if we're in the LOLND case */
 	if (!lpni)
 		return;
 
-	lnet_dec_healthv_locked(&lpni->lpni_healthv);
+	/* If there is a health sensitivity in the peer then use that
+	 * instead of the globally set one.
+	 */
+	lp_sensitivity = lpni->lpni_peer_net->lpn_peer->lp_health_sensitivity;
+	if (lp_sensitivity)
+		sensitivity = lp_sensitivity;
+
+	lnet_dec_healthv_locked(&lpni->lpni_healthv, sensitivity);
 	/* add the peer NI to the recovery queue if it's not already there
 	 * and it's health value is actually below the maximum. It's
 	 * possible that the sensitivity might be set to 0, and the health
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 41a6180..294f968 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -217,6 +217,12 @@
 	spin_lock_init(&lp->lp_lock);
 	lp->lp_primary_nid = nid;
 
+	/* all peers created on a router should have health on
+	 * if it's not already on.
+	 */
+	if (the_lnet.ln_routing && !lnet_health_sensitivity)
+		lp->lp_health_sensitivity = 1;
+
 	/* Turn off discovery for loopback peer. If you're creating a peer
 	 * for the loopback interface then that was initiated when we
 	 * attempted to send a message over the loopback. There is no need
diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index aa8ec8c..eb36df5 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -406,7 +406,7 @@ static void lnet_shuffle_seed(void)
 
 int
 lnet_add_route(u32 net, u32 hops, lnet_nid_t gateway,
-	       unsigned int priority)
+	       u32 priority, u32 sensitivity)
 {
 	struct list_head *route_entry;
 	struct lnet_remotenet *rnet;
@@ -505,8 +505,10 @@ static void lnet_shuffle_seed(void)
 	 * to move the routes from the peer that's being deleted to the
 	 * consolidated peer lp_routes list
 	 */
-	if (add_route)
+	if (add_route) {
+		gw->lp_health_sensitivity = sensitivity;
 		lnet_add_route_to_rnet(rnet2, route);
+	}
 
 	/* get rid of the reference on the lpni.
 	 */
@@ -675,13 +677,13 @@ int lnet_get_rtr_pool_cfg(int cpt, struct lnet_ioctl_pool_cfg *pool_cfg)
 
 int
 lnet_get_route(int idx, u32 *net, u32 *hops,
-	       lnet_nid_t *gateway, u32 *alive, u32 *priority)
+	       lnet_nid_t *gateway, u32 *alive, u32 *priority, u32 *sensitivity)
 {
 	struct lnet_remotenet *rnet;
+	struct list_head *rn_list;
 	struct lnet_route *route;
 	int cpt;
 	int i;
-	struct list_head *rn_list;
 
 	cpt = lnet_net_lock_current();
 
@@ -695,6 +697,7 @@ int lnet_get_rtr_pool_cfg(int cpt, struct lnet_ioctl_pool_cfg *pool_cfg)
 					*hops = route->lr_hops;
 					*priority =
 					    route->lr_priority;
+					*sensitivity = route->lr_gateway->lp_health_sensitivity;
 					*alive = lnet_is_route_alive(route);
 					lnet_net_unlock(cpt);
 					return 0;
-- 
1.8.3.1


* [lustre-devel] [PATCH 352/622] lnet: push router interface updates
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (350 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 351/622] lnet: handle router health off James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 353/622] lnet: net aliveness James Simmons
                   ` (270 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

A router can bring its interfaces up or down if it hasn't received any
messages on an interface for a configurable period
(alive_router_ping_timeout). When this event occurs the router can now
push its status change to the peers it's talking to in order to inform
them of the change in its status. This allows the router's users to
handle asymmetric router failures more quickly.
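The pattern used by this patch can be sketched as below. This is a hypothetical reduction (the helper name `mark_ni_up` is illustrative): record whether a status transition happened while the NI lock is held, and only call `lnet_push_update_to_peers()` after the lock has been dropped, since the push may itself take locks.

```c
#include <assert.h>
#include <stdbool.h>

enum ni_status {
	NI_STATUS_UP,
	NI_STATUS_DOWN,
};

/* Flip a DOWN interface to UP and tell the caller whether a push to
 * peers is needed. In the kernel this runs under lnet_ni_lock(); the
 * push itself happens only after lnet_ni_unlock(). */
static bool mark_ni_up(enum ni_status *ns_status)
{
	if (*ns_status == NI_STATUS_DOWN) {
		*ns_status = NI_STATUS_UP;
		return true;	/* caller should push the update */
	}
	return false;		/* no transition, nothing to push */
}
```

The same shape appears on the down path: `lnet_update_ni_status_locked()` now returns a `bool` and the monitor thread pushes after releasing the net lock.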

WC-bug-id: https://jira.whamcloud.com/browse/LU-11664
Lustre-commit: 0fa02a7d81e7 ("LU-11664 lnet: push router interface updates")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33651
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-move.c | 18 ++++++++++++------
 net/lnet/lnet/router.c   | 13 +++++++++++--
 2 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 0ff1d38..d6cbcd1 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -3840,16 +3840,17 @@ void lnet_monitor_thr_stop(void)
 lnet_parse(struct lnet_ni *ni, struct lnet_hdr *hdr, lnet_nid_t from_nid,
 	   void *private, int rdma_req)
 {
-	int rc = 0;
-	int cpt;
-	int for_me;
+	struct lnet_peer_ni *lpni;
 	struct lnet_msg *msg;
+	u32 payload_length;
 	lnet_pid_t dest_pid;
 	lnet_nid_t dest_nid;
 	lnet_nid_t src_nid;
-	struct lnet_peer_ni *lpni;
-	u32 payload_length;
+	bool push = false;
+	int for_me;
 	u32 type;
+	int rc = 0;
+	int cpt;
 
 	LASSERT(!in_interrupt());
 
@@ -3907,11 +3908,16 @@ void lnet_monitor_thr_stop(void)
 		lnet_ni_lock(ni);
 		ni->ni_last_alive = ktime_get_real_seconds();
 		if (ni->ni_status &&
-		    ni->ni_status->ns_status == LNET_NI_STATUS_DOWN)
+		    ni->ni_status->ns_status == LNET_NI_STATUS_DOWN) {
 			ni->ni_status->ns_status = LNET_NI_STATUS_UP;
+			push = true;
+		}
 		lnet_ni_unlock(ni);
 	}
 
+	if (push)
+		lnet_push_update_to_peers(1);
+
 	/*
 	 * Regard a bad destination NID as a protocol error.  Senders should
 	 * know what they're doing; if they don't they're misconfigured, buggy
diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index eb36df5..0a396d9 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -742,10 +742,11 @@ int lnet_get_rtr_pool_cfg(int cpt, struct lnet_ioctl_pool_cfg *pool_cfg)
 	}
 }
 
-static void
+static bool
 lnet_update_ni_status_locked(void)
 {
 	struct lnet_ni *ni = NULL;
+	bool push = false;
 	time64_t now;
 	time64_t timeout;
 
@@ -778,9 +779,12 @@ int lnet_get_rtr_pool_cfg(int cpt, struct lnet_ioctl_pool_cfg *pool_cfg)
 			 * NI status to "down"
 			 */
 			ni->ni_status->ns_status = LNET_NI_STATUS_DOWN;
+			push = true;
 		}
 		lnet_ni_unlock(ni);
 	}
+
+	return push;
 }
 
 void lnet_wait_router_start(void)
@@ -817,6 +821,7 @@ bool lnet_router_checker_active(void)
 {
 	struct lnet_peer_ni *lpni;
 	struct lnet_peer *rtr;
+	bool push = false;
 	u64 version;
 	time64_t now;
 	int cpt;
@@ -883,9 +888,13 @@ bool lnet_router_checker_active(void)
 	}
 
 	if (the_lnet.ln_routing)
-		lnet_update_ni_status_locked();
+		push = lnet_update_ni_status_locked();
 
 	lnet_net_unlock(cpt);
+
+	/* if the status of the ni changed update the peers */
+	if (push)
+		lnet_push_update_to_peers(1);
 }
 
 void
-- 
1.8.3.1


* [lustre-devel] [PATCH 353/622] lnet: net aliveness
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (351 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 352/622] lnet: push router interface updates James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 354/622] lnet: discover each gateway Net James Simmons
                   ` (269 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

If a router is discovered on any interface on a network, then update
the network's last-alive time and set the NI's status to UP.
If a router isn't discovered on any interface on a network,
then change the status of all the interfaces on that network to DOWN.
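The staleness test this patch moves from per-NI to per-net can be sketched as below. This is an illustrative helper (the name `net_is_stale` is not a kernel symbol); it matches the check in `lnet_update_ni_status_locked()`, where the timeout is `router_ping_timeout + alive_router_check_interval` and `net_last_alive` is the net-wide timestamp updated in `lnet_parse()`.

```c
#include <assert.h>
#include <stdbool.h>

/* A net is considered stale once no traffic has arrived on any of its
 * constituent NIs for ping_timeout + check_interval seconds. When a
 * net goes stale, every NI on it is marked LNET_NI_STATUS_DOWN. */
static bool net_is_stale(long long now, long long net_last_alive,
			 long long ping_timeout, long long check_interval)
{
	return now >= net_last_alive + ping_timeout + check_interval;
}
```

Tracking aliveness at the net level means one received message on any interface keeps the whole net up, which is why the per-NI `ni_last_alive` field is removed.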

WC-bug-id: https://jira.whamcloud.com/browse/LU-11299
Lustre-commit: 1d80e9debf99 ("LU-11299 lnet: net aliveness")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34510
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-types.h |  9 +++++---
 net/lnet/lnet/config.c         |  3 ++-
 net/lnet/lnet/lib-move.c       |  7 +++---
 net/lnet/lnet/router.c         | 52 ++++++++++++++++++++++++++----------------
 net/lnet/lnet/router_proc.c    |  2 +-
 5 files changed, 45 insertions(+), 28 deletions(-)

diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index 56654f5..7b43236 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -397,6 +397,12 @@ struct lnet_net {
 
 	/* dying LND instances */
 	struct list_head	net_ni_zombie;
+
+	/* when I was last alive */
+	time64_t		net_last_alive;
+
+	/* protects access to net_last_alive */
+	spinlock_t		net_lock;
 };
 
 struct lnet_ni {
@@ -431,9 +437,6 @@ struct lnet_ni {
 	/* percpt reference count */
 	int			**ni_refs;
 
-	/* when I was last alive */
-	time64_t		ni_last_alive;
-
 	/* pointer to parent network */
 	struct lnet_net		*ni_net;
 
diff --git a/net/lnet/lnet/config.c b/net/lnet/lnet/config.c
index 949cdd3..a2a9c79 100644
--- a/net/lnet/lnet/config.c
+++ b/net/lnet/lnet/config.c
@@ -366,8 +366,10 @@ struct lnet_net *
 	INIT_LIST_HEAD(&net->net_ni_list);
 	INIT_LIST_HEAD(&net->net_ni_added);
 	INIT_LIST_HEAD(&net->net_ni_zombie);
+	spin_lock_init(&net->net_lock);
 
 	net->net_id = net_id;
+	net->net_last_alive = ktime_get_real_seconds();
 
 	/* initialize global paramters to undefiend */
 	net->net_tunables.lct_peer_timeout = -1;
@@ -467,7 +469,6 @@ struct lnet_net *
 	else
 		ni->ni_net_ns = NULL;
 
-	ni->ni_last_alive = ktime_get_real_seconds();
 	ni->ni_state = LNET_NI_STATE_INIT;
 	list_add_tail(&ni->ni_netlist, &net->net_ni_added);
 
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index d6cbcd1..ec32d22 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -3903,10 +3903,11 @@ void lnet_monitor_thr_stop(void)
 	}
 
 	if (the_lnet.ln_routing &&
-	    ni->ni_last_alive != ktime_get_real_seconds()) {
-		/* NB: so far here is the only place to set NI status to "up */
+	    ni->ni_net->net_last_alive != ktime_get_real_seconds()) {
 		lnet_ni_lock(ni);
-		ni->ni_last_alive = ktime_get_real_seconds();
+		spin_lock(&ni->ni_net->net_lock);
+		ni->ni_net->net_last_alive = ktime_get_real_seconds();
+		spin_unlock(&ni->ni_net->net_lock);
 		if (ni->ni_status &&
 		    ni->ni_status->ns_status == LNET_NI_STATUS_DOWN) {
 			ni->ni_status->ns_status = LNET_NI_STATUS_UP;
diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index 0a396d9..4ca3c5c 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -742,10 +742,29 @@ int lnet_get_rtr_pool_cfg(int cpt, struct lnet_ioctl_pool_cfg *pool_cfg)
 	}
 }
 
+static inline bool
+lnet_net_set_status_locked(struct lnet_net *net, u32 status)
+{
+	struct lnet_ni *ni;
+	bool update = false;
+
+	list_for_each_entry(ni, &net->net_ni_list, ni_netlist) {
+		lnet_ni_lock(ni);
+		if (ni->ni_status &&
+		    ni->ni_status->ns_status != status) {
+			ni->ni_status->ns_status = status;
+			update = true;
+		}
+		lnet_ni_unlock(ni);
+	}
+
+	return update;
+}
+
 static bool
 lnet_update_ni_status_locked(void)
 {
-	struct lnet_ni *ni = NULL;
+	struct lnet_net *net;
 	bool push = false;
 	time64_t now;
 	time64_t timeout;
@@ -755,33 +774,26 @@ int lnet_get_rtr_pool_cfg(int cpt, struct lnet_ioctl_pool_cfg *pool_cfg)
 	timeout = router_ping_timeout + alive_router_check_interval;
 
 	now = ktime_get_real_seconds();
-	while ((ni = lnet_get_next_ni_locked(NULL, ni))) {
-		if (ni->ni_net->net_lnd->lnd_type == LOLND)
+	list_for_each_entry(net, &the_lnet.ln_nets, net_list) {
+		if (net->net_lnd->lnd_type == LOLND)
 			continue;
 
-		if (now < ni->ni_last_alive + timeout)
+		if (now < net->net_last_alive + timeout)
 			continue;
 
-		lnet_ni_lock(ni);
+		spin_lock(&net->net_lock);
 		/* re-check with lock */
-		if (now < ni->ni_last_alive + timeout) {
-			lnet_ni_unlock(ni);
+		if (now < net->net_last_alive + timeout) {
+			spin_unlock(&net->net_lock);
 			continue;
 		}
+		spin_unlock(&net->net_lock);
 
-		LASSERT(ni->ni_status);
-
-		if (ni->ni_status->ns_status != LNET_NI_STATUS_DOWN) {
-			CDEBUG(D_NET, "NI(%s:%lld) status changed to down\n",
-			       libcfs_nid2str(ni->ni_nid), timeout);
-			/*
-			 * NB: so far, this is the only place to set
-			 * NI status to "down"
-			 */
-			ni->ni_status->ns_status = LNET_NI_STATUS_DOWN;
-			push = true;
-		}
-		lnet_ni_unlock(ni);
+		/* if the net didn't receive any traffic for past the
+		 * timeout on any of its constituent NIs, then mark all
+		 * the NIs down.
+		 */
+		push = lnet_net_set_status_locked(net, LNET_NI_STATUS_DOWN);
 	}
 
 	return push;
diff --git a/net/lnet/lnet/router_proc.c b/net/lnet/lnet/router_proc.c
index 9771ef0..2e9342c 100644
--- a/net/lnet/lnet/router_proc.c
+++ b/net/lnet/lnet/router_proc.c
@@ -674,7 +674,7 @@ static int proc_lnet_nis(struct ctl_table *table, int write,
 			int j;
 
 			if (the_lnet.ln_routing)
-				last_alive = now - ni->ni_last_alive;
+				last_alive = now - ni->ni_net->net_last_alive;
 
 			lnet_ni_lock(ni);
 			LASSERT(ni->ni_status);
-- 
1.8.3.1


* [lustre-devel] [PATCH 354/622] lnet: discover each gateway Net
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (352 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 353/622] lnet: net aliveness James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 355/622] lnet: look up MR peers routes James Simmons
                   ` (268 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Wake up every (gateway aliveness interval / number of local networks)
seconds and discover each of the gateway's local networks in
round-robin order.

This is done to make sure the gateway keeps its networks up.
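The round-robin selection over the gateway's networks can be sketched as below. This is a hypothetical simplification (the helper name `next_net` is illustrative); in the kernel, `lnet_get_next_peer_net_locked()` walks the peer's `lpn` list starting from the previously discovered net ID stored in `lp_disc_net_id`.

```c
#include <assert.h>

/* Given the net the gateway was last discovered on, pick the next net
 * in its list, wrapping around at the end. An unknown previous net
 * (e.g. first discovery) starts from the first entry. */
static unsigned int next_net(const unsigned int *nets, int nnets,
			     unsigned int prev)
{
	int i;

	for (i = 0; i < nnets; i++) {
		if (nets[i] == prev)
			return nets[(i + 1) % nnets];
	}
	return nets[0];
}
```

Dividing the aliveness interval by the number of local networks ensures each of the gateway's nets is probed once per full interval.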

WC-bug-id: https://jira.whamcloud.com/browse/LU-11299
Lustre-commit: 526679c681c3 ("LU-11299 lnet: discover each gateway Net")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34511
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h  |  5 ++++
 include/linux/lnet/lib-types.h |  9 ++++---
 net/lnet/lnet/api-ni.c         | 39 +++++++++++++++++++++++++++---
 net/lnet/lnet/lib-move.c       | 19 ++++++++++++---
 net/lnet/lnet/peer.c           | 32 ++++++++++++++++++++++++
 net/lnet/lnet/router.c         | 55 ++++++++++++++++++++++++++++++++++++++----
 6 files changed, 145 insertions(+), 14 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index 36aaaa5..3dd56a2 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -53,6 +53,7 @@
 #define CFS_FAIL_PTLRPC_OST_BULK_CB2	0xe000
 
 extern struct lnet the_lnet;	/* THE network */
+extern unsigned int lnet_current_net_count;
 
 #if (BITS_PER_LONG == 32)
 /* 2 CPTs, allowing more CPTs might make us under memory pressure */
@@ -547,6 +548,7 @@ void lnet_rtr_transfer_to_peer(struct lnet_peer *src,
 
 int lnet_islocalnid(lnet_nid_t nid);
 int lnet_islocalnet(u32 net);
+int lnet_islocalnet_locked(u32 net);
 
 void lnet_msg_attach_md(struct lnet_msg *msg, struct lnet_libmd *md,
 			unsigned int offset, unsigned int mlen);
@@ -796,7 +798,10 @@ bool lnet_net_unique(u32 net_id, struct list_head *nilist,
 bool lnet_ni_unique_net(struct list_head *nilist, char *iface);
 void lnet_incr_dlc_seq(void);
 u32 lnet_get_dlc_seq_locked(void);
+int lnet_get_net_count(void);
 
+struct lnet_peer_net *lnet_get_next_peer_net_locked(struct lnet_peer *lp,
+						    u32 prev_lpn_id);
 struct lnet_peer_ni *lnet_get_next_peer_ni_locked(struct lnet_peer *peer,
 						  struct lnet_peer_net *peer_net,
 						  struct lnet_peer_ni *prev);
diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index 7b43236..8c9ae9e 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -600,6 +600,9 @@ struct lnet_peer {
 	/* primary NID of the peer */
 	lnet_nid_t		lp_primary_nid;
 
+	/* net to perform discovery on */
+	u32			lp_disc_net_id;
+
 	/* CPT of peer_table */
 	int			lp_cpt;
 
@@ -621,9 +624,6 @@ struct lnet_peer {
 	/* routes on this peer */
 	struct list_head	lp_routes;
 
-	/* time of last router check attempt */
-	time64_t		lp_rtrcheck_timestamp;
-
 	/* reference count */
 	atomic_t		lp_refcount;
 
@@ -744,6 +744,9 @@ struct lnet_peer_net {
 	/* Net ID */
 	u32			lpn_net_id;
 
+	/* time of last router net check attempt */
+	time64_t		lpn_rtrcheck_timestamp;
+
 	/* reference count */
 	atomic_t		lpn_refcount;
 };
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 702e4b9..65f1f17 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -171,6 +171,7 @@ static int recovery_interval_set(const char *val,
 		 "Maximum number of times to retry transmitting a message");
 
 unsigned int lnet_lnd_timeout = LNET_LND_DEFAULT_TIMEOUT;
+unsigned int lnet_current_net_count;
 
 /*
  * This sequence number keeps track of how many times DLC was used to
@@ -1294,16 +1295,28 @@ struct lnet_net *
 EXPORT_SYMBOL(lnet_cpt_of_nid);
 
 int
-lnet_islocalnet(u32 net_id)
+lnet_islocalnet_locked(u32 net_id)
 {
 	struct lnet_net *net;
+
+	net = lnet_get_net_locked(net_id);
+
+	return !!net;
+}
+
+int
+lnet_islocalnet(u32 net_id)
+{
 	int cpt;
+	bool local;
 
 	cpt = lnet_net_lock_current();
-	net = lnet_get_net_locked(net_id);
+
+	local = lnet_islocalnet_locked(net_id);
+
 	lnet_net_unlock(cpt);
 
-	return !!net;
+	return local;
 }
 
 struct lnet_ni *
@@ -1457,6 +1470,23 @@ struct lnet_ping_buffer *
 	return count;
 }
 
+int
+lnet_get_net_count(void)
+{
+	struct lnet_net *net;
+	int count = 0;
+
+	lnet_net_lock(0);
+
+	list_for_each_entry(net, &the_lnet.ln_nets, net_list) {
+		count++;
+	}
+
+	lnet_net_unlock(0);
+
+	return count;
+}
+
 void
 lnet_swap_pinginfo(struct lnet_ping_buffer *pbuf)
 {
@@ -2292,6 +2322,9 @@ static void lnet_push_target_fini(void)
 		lnet_net_unlock(LNET_LOCK_EX);
 	}
 
+	/* update net count */
+	lnet_current_net_count = lnet_get_net_count();
+
 	return ni_count;
 
 failed1:
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index ec32d22..e93284b 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -1922,7 +1922,8 @@ struct lnet_ni *
 }
 
 struct lnet_ni *
-lnet_find_best_ni_on_local_net(struct lnet_peer *peer, int md_cpt)
+lnet_find_best_ni_on_local_net(struct lnet_peer *peer, int md_cpt,
+			       bool discovery)
 {
 	struct lnet_peer_net *peer_net = NULL;
 	struct lnet_ni *best_ni = NULL;
@@ -1943,6 +1944,12 @@ struct lnet_ni *
 		best_ni = lnet_find_best_ni_on_spec_net(best_ni, peer,
 							peer_net, md_cpt,
 							false);
+		/* if this is a discovery message and lp_disc_net_id is
+		 * specified then use that net to send the discovery on.
+		 */
+		if (peer->lp_disc_net_id == peer_net->lpn_net_id &&
+		    discovery)
+			break;
 	}
 
 	if (best_ni)
@@ -2101,7 +2108,8 @@ struct lnet_ni *
 	 * networks.
 	 */
 	sd->sd_best_ni = lnet_find_best_ni_on_local_net(sd->sd_peer,
-							sd->sd_md_cpt);
+					sd->sd_md_cpt,
+					lnet_msg_discovery(sd->sd_msg));
 	if (sd->sd_best_ni) {
 		sd->sd_best_lpni =
 		  lnet_find_best_lpni_on_net(sd, sd->sd_peer,
@@ -3145,9 +3153,14 @@ struct lnet_mt_event_info {
 		 * if we wake up every 1 second? Although, we've seen
 		 * cases where we get a complaint that an idle thread
 		 * is waking up unnecessarily.
+		 *
+		 * Take into account the current net_count when you wake
+		 * up for alive router checking, since we need to check
+		 * possibly as many networks as we have configured.
 		 */
 		interval = min(lnet_recovery_interval,
-			       min((unsigned int)alive_router_check_interval,
+			       min((unsigned int)alive_router_check_interval /
+					lnet_current_net_count,
 				   lnet_transaction_timeout / 2));
 		wait_event_interruptible_timeout(the_lnet.ln_mt_waitq,
 						 false, HZ * interval);
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 294f968..55ff01d 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -710,6 +710,38 @@ struct lnet_peer *
 	return lp;
 }
 
+struct lnet_peer_net *
+lnet_get_next_peer_net_locked(struct lnet_peer *lp, u32 prev_lpn_id)
+{
+	struct lnet_peer_net *net;
+
+	if (!prev_lpn_id) {
+		/* no net id provided return the first net */
+		net = list_first_entry_or_null(&lp->lp_peer_nets,
+					       struct lnet_peer_net,
+					       lpn_peer_nets);
+
+		return net;
+	}
+
+	/* find the net after the one provided */
+	list_for_each_entry(net, &lp->lp_peer_nets, lpn_peer_nets) {
+		if (net->lpn_net_id == prev_lpn_id) {
+			/* if we reached the end of the list loop to the
+			 * beginning.
+			 */
+			if (net->lpn_peer_nets.next == &lp->lp_peer_nets)
+				return list_first_entry_or_null(&lp->lp_peer_nets,
+								struct lnet_peer_net,
+								lpn_peer_nets);
+			else
+				return list_next_entry(net, lpn_peer_nets);
+		}
+	}
+
+	return NULL;
+}
+
 struct lnet_peer_ni *
 lnet_get_next_peer_ni_locked(struct lnet_peer *peer,
 			     struct lnet_peer_net *peer_net,
diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index 4ca3c5c..81f7a94 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -370,8 +370,9 @@ static void lnet_shuffle_seed(void)
 static void
 lnet_add_route_to_rnet(struct lnet_remotenet *rnet, struct lnet_route *route)
 {
-	unsigned int len = 0;
+	struct lnet_peer_net *lpn;
 	unsigned int offset = 0;
+	unsigned int len = 0;
 	struct list_head *e;
 
 	lnet_shuffle_seed();
@@ -393,7 +394,10 @@ static void lnet_shuffle_seed(void)
 	/* force a router check on the gateway to make sure the route is
 	 * alive
 	 */
-	route->lr_gateway->lp_rtrcheck_timestamp = 0;
+	list_for_each_entry(lpn, &route->lr_gateway->lp_peer_nets,
+			    lpn_peer_nets) {
+		lpn->lpn_rtrcheck_timestamp = 0;
+	}
 
 	the_lnet.ln_remote_nets_version++;
 
@@ -618,6 +622,17 @@ static void lnet_shuffle_seed(void)
 	}
 
 delete_zombies:
+	/* check if there are any routes remaining on the gateway
+	 * If there are no more routes make sure to set the peer's
+	 * lp_disc_net_id to 0 (invalid), in case we add more routes in
+	 * the future on that gateway, then we start our discovery process
+	 * from scratch
+	 */
+	if (lpni) {
+		if (list_empty(&lp->lp_routes))
+			lp->lp_disc_net_id = 0;
+	}
+
 	lnet_net_unlock(LNET_LOCK_EX);
 
 	while (!list_empty(&zombies)) {
@@ -831,10 +846,14 @@ bool lnet_router_checker_active(void)
 void
 lnet_check_routers(void)
 {
+	struct lnet_peer_net *first_lpn = NULL;
+	struct lnet_peer_net *lpn;
 	struct lnet_peer_ni *lpni;
 	struct lnet_peer *rtr;
 	bool push = false;
+	bool found_lpn;
 	u64 version;
+	u32 net_id;
 	time64_t now;
 	int cpt;
 	int rc;
@@ -851,8 +870,31 @@ bool lnet_router_checker_active(void)
 		 * interfaces could be down and in that case they would be
 		 * undergoing recovery separately from this discovery.
 		 */
-		if (now - rtr->lp_rtrcheck_timestamp <
-		    alive_router_check_interval)
+		/* find next peer net which is also local */
+		net_id = rtr->lp_disc_net_id;
+		do {
+			lpn = lnet_get_next_peer_net_locked(rtr, net_id);
+			if (!lpn) {
+				CERROR("gateway %s has no networks\n",
+				       libcfs_nid2str(rtr->lp_primary_nid));
+				break;
+			}
+			if (first_lpn == lpn)
+				break;
+			if (!first_lpn)
+				first_lpn = lpn;
+			found_lpn = lnet_islocalnet_locked(lpn->lpn_net_id);
+			net_id = lpn->lpn_net_id;
+		} while (!found_lpn);
+
+		if (!found_lpn || !lpn) {
+			CERROR("no local network found for gateway %s\n",
+			       libcfs_nid2str(rtr->lp_primary_nid));
+			continue;
+		}
+
+		if (now - lpn->lpn_rtrcheck_timestamp <
+		    alive_router_check_interval / lnet_current_net_count)
 			continue;
 
 		/* If we're currently discovering the peer then don't
@@ -878,6 +920,9 @@ bool lnet_router_checker_active(void)
 		}
 		lnet_peer_ni_addref_locked(lpni);
 
+		/* specify the net to use */
+		rtr->lp_disc_net_id = lpn->lpn_net_id;
+
 		/* discover the router */
 		CDEBUG(D_NET, "discover %s, cpt = %d\n",
 		       libcfs_nid2str(lpni->lpni_nid), cpt);
@@ -887,7 +932,7 @@ bool lnet_router_checker_active(void)
 		lnet_peer_ni_decref_locked(lpni);
 
 		if (!rc)
-			rtr->lp_rtrcheck_timestamp = now;
+			lpn->lpn_rtrcheck_timestamp = now;
 		else
 			CERROR("Failed to discover router %s\n",
 			       libcfs_nid2str(rtr->lp_primary_nid));
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 355/622] lnet: look up MR peers routes
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (353 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 354/622] lnet: discover each gateway Net James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 356/622] lnet: check peer timeout on a router James Simmons
                   ` (267 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

An MR peer can have multiple interfaces, only some of which we might
have a route to. The primary NID of the peer does not necessarily
identify a NID we have a route to. When looking up a route, we must
therefore iterate over all the nets the peer is on and select one we
can route to. Since the peer can exist on multiple routed networks,
we also use a simple round-robin algorithm to iterate over all the
networks on which we can reach the peer.
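The round-robin selection above (the best_lpn loop in lnet_handle_find_routed_path(), driven by the new lpn_seq counter) can be sketched like this. It is an illustration only, with a simplified struct standing in for lnet_peer_net and a boolean standing in for a successful lnet_find_rnet_locked() lookup.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for lnet_peer_net */
struct pnet {
	unsigned int net_id;
	unsigned int seq;	/* lpn_seq: how often this net was chosen */
	bool routable;		/* we have a route to this remote net */
};

/* Among the peer's networks we can actually route to, pick the one
 * with the lowest selection sequence number, then bump it so the next
 * send rotates to another reachable net. */
static struct pnet *pick_routed_net(struct pnet *nets, size_t n)
{
	struct pnet *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!nets[i].routable)
			continue;
		if (!best || nets[i].seq < best->seq)
			best = &nets[i];
	}
	if (best)
		best->seq++;	/* commit to this path */
	return best;
}
```

Incrementing the sequence number only once the path is actually chosen is what makes repeated sends to the same MR peer alternate between its reachable networks.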

WC-bug-id: https://jira.whamcloud.com/browse/LU-12053
Lustre-commit: 52eef8179743 ("LU-12053 lnet: look up MR peers routes")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34625
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-types.h |  3 ++
 net/lnet/lnet/lib-move.c       | 73 ++++++++++++++++++++++++++++++++++--------
 2 files changed, 62 insertions(+), 14 deletions(-)

diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index 8c9ae9e..da5b860 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -747,6 +747,9 @@ struct lnet_peer_net {
 	/* time of last router net check attempt */
 	time64_t		lpn_rtrcheck_timestamp;
 
+	/* selection sequence number */
+	u32			lpn_seq;
+
 	/* reference count */
 	atomic_t		lpn_refcount;
 };
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index e93284b..f0804e1 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -1809,21 +1809,60 @@ struct lnet_ni *
 {
 	int rc;
 	struct lnet_peer *gw;
+	struct lnet_peer *lp;
+	struct lnet_peer_net *lpn;
+	struct lnet_peer_net *best_lpn = NULL;
+	struct lnet_remotenet *rnet;
 	struct lnet_route *best_route;
 	struct lnet_route *last_route;
 	struct lnet_peer_ni *lpni = NULL;
+	struct lnet_peer_ni *gwni = NULL;
 	lnet_nid_t src_nid = sd->sd_src_nid;
 
-	best_route = lnet_find_route_locked(NULL, LNET_NIDNET(dst_nid),
+	/* we've already looked up the initial lpni using dst_nid */
+	lpni = sd->sd_best_lpni;
+	/* the peer tree must be in existence */
+	LASSERT(lpni && lpni->lpni_peer_net && lpni->lpni_peer_net->lpn_peer);
+	lp = lpni->lpni_peer_net->lpn_peer;
+
+	list_for_each_entry(lpn, &lp->lp_peer_nets, lpn_peer_nets) {
+		/* is this remote network reachable?  */
+		rnet = lnet_find_rnet_locked(lpn->lpn_net_id);
+		if (!rnet)
+			continue;
+
+		if (!best_lpn)
+			best_lpn = lpn;
+
+		if (best_lpn->lpn_seq <= lpn->lpn_seq)
+			continue;
+
+		best_lpn = lpn;
+	}
+
+	if (!best_lpn) {
+		CERROR("peer %s has no available nets\n",
+		       libcfs_nid2str(sd->sd_dst_nid));
+		return -EHOSTUNREACH;
+	}
+
+	sd->sd_best_lpni = lnet_find_best_lpni_on_net(sd, lp,
+						      best_lpn->lpn_net_id);
+	if (!sd->sd_best_lpni) {
+		CERROR("peer %s down\n", libcfs_nid2str(sd->sd_dst_nid));
+		return -EHOSTUNREACH;
+	}
+
+	best_route = lnet_find_route_locked(NULL, best_lpn->lpn_net_id,
 					    sd->sd_rtr_nid, &last_route,
-					    &lpni);
+					    &gwni);
 	if (!best_route) {
 		CERROR("no route to %s from %s\n",
 		       libcfs_nid2str(dst_nid), libcfs_nid2str(src_nid));
 		return -EHOSTUNREACH;
 	}
 
-	if (!lpni) {
+	if (!gwni) {
 		CERROR("Internal Error. Route expected to %s from %s\n",
 		       libcfs_nid2str(dst_nid),
 		       libcfs_nid2str(src_nid));
@@ -1831,14 +1870,14 @@ struct lnet_ni *
 	}
 
 	gw = best_route->lr_gateway;
-	LASSERT(gw == lpni->lpni_peer_net->lpn_peer);
+	LASSERT(gw == gwni->lpni_peer_net->lpn_peer);
 
 	/* Discover this gateway if it hasn't already been discovered.
 	 * This means we might delay the message until discovery has
 	 * completed
 	 */
 	sd->sd_msg->msg_src_nid_param = sd->sd_src_nid;
-	rc = lnet_initiate_peer_discovery(lpni, sd->sd_msg, sd->sd_rtr_nid,
+	rc = lnet_initiate_peer_discovery(gwni, sd->sd_msg, sd->sd_rtr_nid,
 					  sd->sd_cpt);
 	if (rc)
 		return rc;
@@ -1858,14 +1897,15 @@ struct lnet_ni *
 		return -EFAULT;
 	}
 
-	*gw_lpni = lpni;
+	*gw_lpni = gwni;
 	*gw_peer = gw;
 
-	/* increment the route sequence number since now we're sure we're
-	 * going to use it
+	/* increment the sequence numbers since now we're sure we're
+	 * going to use this path
 	 */
 	LASSERT(best_route && last_route);
 	best_route->lr_seq = last_route->lr_seq + 1;
+	best_lpn->lpn_seq++;
 
 	return 0;
 }
@@ -2208,11 +2248,11 @@ struct lnet_ni *
 	if (rc != PASS_THROUGH)
 		return rc;
 
-	/* TODO; One possible enhancement is to run the selection
-	 * algorithm on the peer. However for remote peers the credits are
-	 * not decremented, so we'll be basically going over the peer NIs
-	 * in round robin. An MR router will run the selection algorithm
-	 * on the next-hop interfaces.
+	/* Now that we must route to the destination, we must consider the
+	 * MR case, where the destination has multiple interfaces, some of
+	 * which we can route to and others we do not. For this reason we
+	 * need to select the destination which we can route to and if
+	 * there are multiple, we need to round robin.
 	 */
 	rc = lnet_handle_find_routed_path(sd, sd->sd_dst_nid, &gw_lpni,
 					  &gw_peer);
@@ -2455,8 +2495,13 @@ struct lnet_ni *
 	LASSERT(!msg->msg_tx_committed);
 
 	rc = lnet_select_pathway(src_nid, dst_nid, msg, rtr_nid);
-	if (rc < 0)
+	if (rc < 0) {
+		if (rc == -EHOSTUNREACH)
+			msg->msg_health_status = LNET_MSG_STATUS_REMOTE_ERROR;
+		else
+			msg->msg_health_status = LNET_MSG_STATUS_LOCAL_ERROR;
 		return rc;
+	}
 
 	if (rc == LNET_CREDIT_OK)
 		lnet_ni_send(msg->msg_txni, msg);
-- 
1.8.3.1


* [lustre-devel] [PATCH 356/622] lnet: check peer timeout on a router
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (354 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 355/622] lnet: look up MR peers routes James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 357/622] lustre: lmv: reuse object alloc QoS code from LOD James Simmons
                   ` (266 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

On a router, assume that a peer is alive and attempt to send it
messages as long as the peer_timeout has not expired.
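The aliveness window this adds (lnet_is_peer_deadline_passed()) reduces to a single comparison. A minimal sketch, with plain integers standing in for time64_t and the lct_peer_timeout tunable:

```c
#include <stdbool.h>

/* A router treats a peer_ni as alive while "now" is still inside
 * last_alive + peer_timeout, and only falls back to the real health
 * check (lnet_is_peer_ni_alive()) once that deadline passes. */
static bool peer_deadline_passed(long long last_alive,
				 long long peer_timeout,
				 long long now)
{
	/* assume alive as long as we're within the configured timeout */
	return last_alive + peer_timeout <= now;
}
```

lpni_last_alive is refreshed on every message received while routing, so a chatty peer never hits the deadline at all.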

WC-bug-id: https://jira.whamcloud.com/browse/LU-12200
Lustre-commit: 41f3c27adf16 ("LU-12200 lnet: check peer timeout on a router")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34772
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-types.h |  2 ++
 net/lnet/lnet/lib-move.c       | 26 ++++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index da5b860..b240361 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -566,6 +566,8 @@ struct lnet_peer_ni {
 	u32			 lpni_gw_seq;
 	/* returned RC ping features. Protected with lpni_lock */
 	unsigned int		 lpni_ping_feats;
+	/* time last message was received from the peer */
+	time64_t		lpni_last_alive;
 	/* preferred local nids: if only one, use lpni_pref.nid */
 	union lpni_pref {
 		lnet_nid_t	 nid;
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index f0804e1..629856c 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -608,6 +608,23 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	return rc;
 }
 
+static bool
+lnet_is_peer_deadline_passed(struct lnet_peer_ni *lpni, time64_t now)
+{
+	time64_t deadline;
+
+	deadline = lpni->lpni_last_alive +
+		   lpni->lpni_net->net_tunables.lct_peer_timeout;
+
+	/* assume peer_ni is alive as long as we're within the configured
+	 * peer timeout
+	 */
+	if (deadline > now)
+		return false;
+
+	return true;
+}
+
 /*
  * NB: returns 1 when alive, 0 when dead, negative when error;
  *     may drop the lnet_net_lock
@@ -616,6 +633,8 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 lnet_peer_alive_locked(struct lnet_ni *ni, struct lnet_peer_ni *lpni,
 		       struct lnet_msg *msg)
 {
+	time64_t now = ktime_get_seconds();
+
 	if (!lnet_peer_aliveness_enabled(lpni))
 		return -ENODEV;
 
@@ -635,6 +654,9 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	    msg->msg_type == LNET_MSG_REPLY)
 		return 1;
 
+	if (!lnet_is_peer_deadline_passed(lpni, now))
+		return true;
+
 	return lnet_is_peer_ni_alive(lpni);
 }
 
@@ -4142,6 +4164,10 @@ void lnet_monitor_thr_stop(void)
 			return 0;
 		goto drop;
 	}
+
+	if (the_lnet.ln_routing)
+		lpni->lpni_last_alive = ktime_get_seconds();
+
 	msg->msg_rxpeer = lpni;
 	msg->msg_rxni = ni;
 	lnet_ni_addref_locked(ni, cpt);
-- 
1.8.3.1


* [lustre-devel] [PATCH 357/622] lustre: lmv: reuse object alloc QoS code from LOD
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (355 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 356/622] lnet: check peer timeout on a router James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 358/622] lustre: llite: Add persistent cache on client James Simmons
                   ` (265 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

Reuse the same object allocation QoS code as LOD. The QoS code is not
moved into a lower-layer module; instead it is copied into LMV,
because moving it would touch almost all of the LMV code, which is
too big a change and should be done separately in the future.

For LMV round-robin object allocation, because we only need to
allocate one object, use the saved MDT index and then advance it to
the next MDT.
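The saved round-robin index described above (lmv_qos_rr_index, seeded from the lower 32 bits of the client's NID so different clients start at different MDTs) behaves like a simple modular counter. A sketch under that assumption, with the function name invented for illustration:

```c
/* Sketch: with a saved index seeded per client, allocation number k
 * lands on MDT (seed + k) % mdt_count, so consecutive subdir
 * creations rotate evenly across all MDTs from the very first one. */
static unsigned int rr_mdt_at(unsigned int seed, unsigned int k,
			      unsigned int mdt_count)
{
	return (seed + k) % mdt_count;
}
```

The per-allocation increment of the stored index is what this closed form unrolls; the kernel code keeps the running counter instead of k.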

WC-bug-id: https://jira.whamcloud.com/browse/LU-11213
Lustre-commit: b601eb35e97a ("LU-11213 lmv: reuse object alloc QoS code from LOD")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34657
Reviewed-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lu_object.h          |  88 +++++++
 fs/lustre/include/obd.h                |  36 +--
 fs/lustre/lmv/Makefile                 |   2 +-
 fs/lustre/lmv/lmv_intent.c             |  10 +-
 fs/lustre/lmv/lmv_internal.h           |   8 +-
 fs/lustre/lmv/lmv_obd.c                | 106 +++++---
 fs/lustre/lmv/lmv_qos.c                | 446 +++++++++++++++++++++++++++++++++
 fs/lustre/lmv/lproc_lmv.c              | 108 +++++++-
 fs/lustre/obdclass/Makefile            |   2 +-
 fs/lustre/obdclass/lu_qos.c            | 166 ++++++++++++
 include/uapi/linux/lustre/lustre_idl.h |   2 +
 11 files changed, 896 insertions(+), 78 deletions(-)
 create mode 100644 fs/lustre/lmv/lmv_qos.c
 create mode 100644 fs/lustre/obdclass/lu_qos.c

diff --git a/fs/lustre/include/lu_object.h b/fs/lustre/include/lu_object.h
index c34605c..0f3e3be 100644
--- a/fs/lustre/include/lu_object.h
+++ b/fs/lustre/include/lu_object.h
@@ -1303,5 +1303,93 @@ struct lu_kmem_descr {
 extern u32 lu_context_tags_default;
 extern u32 lu_session_tags_default;
 
+/* Generic subset of OSTs */
+struct ost_pool {
+	u32		   *op_array;	/* array of index of
+					 * lov_obd->lov_tgts
+					 */
+	unsigned int	    op_count;	/* number of OSTs in the array */
+	unsigned int	    op_size;	/* allocated size of lp_array */
+	struct rw_semaphore op_rw_sem;	/* to protect ost_pool use */
+};
+
+/* round-robin QoS data for LOD/LMV */
+struct lu_qos_rr {
+	spinlock_t		 lqr_alloc;	/* protect allocation index */
+	u32			 lqr_start_idx;	/* start index of new inode */
+	u32			 lqr_offset_idx;/* aliasing for start_idx */
+	int			 lqr_start_count;/* reseed counter */
+	struct ost_pool		 lqr_pool;	/* round-robin optimized list */
+	unsigned long		 lqr_dirty:1;	/* recalc round-robin list */
+};
+
+/* QoS data per MDS/OSS */
+struct lu_svr_qos {
+	struct obd_uuid		 lsq_uuid;	/* ptlrpc's c_remote_uuid */
+	struct list_head	 lsq_svr_list;	/* link to lq_svr_list */
+	u64			 lsq_bavail;	/* total bytes avail on svr */
+	u64			 lsq_iavail;	/* tital inode avail on svr */
+	u64			 lsq_penalty;	/* current penalty */
+	u64			 lsq_penalty_per_obj; /* penalty decrease
+						       * every obj
+						       */
+	time64_t		 lsq_used;	/* last used time, seconds */
+	u32			 lsq_tgt_count;	/* number of tgts on this svr */
+	u32			 lsq_id;	/* unique svr id */
+};
+
+/* QoS data per MDT/OST */
+struct lu_tgt_qos {
+	struct lu_svr_qos	*ltq_svr;	/* svr info */
+	u64			 ltq_penalty;	/* current penalty */
+	u64			 ltq_penalty_per_obj; /* penalty decrease
+						       * every obj
+						       */
+	u64			 ltq_weight;	/* net weighting */
+	time64_t		 ltq_used;	/* last used time, seconds */
+	bool			 ltq_usable:1;	/* usable for striping */
+};
+
+/* target descriptor */
+struct lu_tgt_desc {
+	union {
+		struct dt_device	*ltd_tgt;
+		struct obd_device	*ltd_obd;
+	};
+	struct obd_export		*ltd_exp;
+	struct obd_uuid			ltd_uuid;
+	u32				ltd_index;
+	u32				ltd_gen;
+	struct list_head		ltd_kill;
+	struct ptlrpc_thread		*ltd_recovery_thread;
+	struct mutex			ltd_fid_mutex;
+	struct lu_tgt_qos		ltd_qos; /* qos info per target */
+	struct obd_statfs		ltd_statfs;
+	time64_t			ltd_statfs_age;
+	unsigned long	ltd_active:1,	/* is this target up for requests */
+			ltd_activate:1,	/* should target be activated */
+			ltd_reap:1,	/* should this target be deleted */
+			ltd_got_update_log:1, /* Already got update log */
+			ltd_connecting:1;  /* target is connecting */
+};
+
+/* QoS data for LOD/LMV */
+struct lu_qos {
+	struct list_head	 lq_svr_list;	/* lu_svr_qos list */
+	struct rw_semaphore	 lq_rw_sem;
+	u32			 lq_active_svr_count;
+	unsigned int		 lq_prio_free;   /* priority for free space */
+	unsigned int		 lq_threshold_rr;/* priority for rr */
+	struct lu_qos_rr	 lq_rr;          /* round robin qos data */
+	unsigned long		 lq_dirty:1,     /* recalc qos data */
+				 lq_same_space:1,/* the servers all have approx.
+						  * the same space avail
+						  */
+				 lq_reset:1;     /* zero current penalties */
+};
+
+int lqos_add_tgt(struct lu_qos *qos, struct lu_tgt_desc *ltd);
+int lqos_del_tgt(struct lu_qos *qos, struct lu_tgt_desc *ltd);
+
 /** @} lu */
 #endif /* __LUSTRE_LU_OBJECT_H */
diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index e815584..2f878d6 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -87,7 +87,7 @@ struct obd_info {
 	/* OBD_STATFS_* flags */
 	u64			oi_flags;
 	struct obd_device      *oi_obd;
-	struct lmv_tgt_desc    *oi_tgt;
+	struct lu_tgt_desc     *oi_tgt;
 	/* lsm data specific for every OSC. */
 	struct lov_stripe_md   *oi_md;
 	/* statfs data specific for every OSC, if needed at all. */
@@ -377,28 +377,10 @@ struct echo_client_obd {
 	u64			ec_unique;
 };
 
-/* Generic subset of OSTs */
-struct ost_pool {
-	u32			*op_array;  /* array of index of lov_obd->lov_tgts */
-	unsigned int		 op_count;  /* number of OSTs in the array */
-	unsigned int		 op_size;   /* allocated size of lp_array */
-	struct rw_semaphore	 op_rw_sem; /* to protect ost_pool use */
-};
-
 /* allow statfs data caching for 1 second */
 #define OBD_STATFS_CACHE_SECONDS 1
 
-struct lov_tgt_desc {
-	struct list_head	ltd_kill;
-	struct obd_uuid		ltd_uuid;
-	struct obd_device      *ltd_obd;
-	struct obd_export      *ltd_exp;
-	u32			ltd_gen;
-	u32			ltd_index;   /* index in lov_obd->tgts */
-	unsigned long		ltd_active:1,/* is this target up for requests */
-				ltd_activate:1,/* should  target be activated */
-				ltd_reap:1;  /* should this target be deleted */
-};
+#define lov_tgt_desc lu_tgt_desc
 
 struct lov_md_tgt_desc {
 	struct obd_device *lmtd_mdc;
@@ -431,16 +413,7 @@ struct lov_obd {
 	struct lov_md_tgt_desc	*lov_mdc_tgts;
 };
 
-struct lmv_tgt_desc {
-	struct obd_uuid		ltd_uuid;
-	struct obd_device	*ltd_obd;
-	struct obd_export      *ltd_exp;
-	u32			ltd_idx;
-	struct mutex		ltd_fid_mutex;
-	struct obd_statfs	ltd_statfs;
-	time64_t		ltd_statfs_age;
-	unsigned long		ltd_active:1; /* target up for requests */
-};
+#define lmv_tgt_desc lu_tgt_desc
 
 struct lmv_obd {
 	struct lu_client_fld	lmv_fld;
@@ -458,6 +431,9 @@ struct lmv_obd {
 	struct obd_connect_data	conn_data;
 	struct kobject		*lmv_tgts_kobj;
 	void			*lmv_cache;
+
+	struct lu_qos		lmv_qos;
+	u32			lmv_qos_rr_index;
 };
 
 struct niobuf_local {
diff --git a/fs/lustre/lmv/Makefile b/fs/lustre/lmv/Makefile
index ad470bf..6f9a19c 100644
--- a/fs/lustre/lmv/Makefile
+++ b/fs/lustre/lmv/Makefile
@@ -1,4 +1,4 @@
 ccflags-y += -I$(srctree)/$(src)/../include
 
 obj-$(CONFIG_LUSTRE_FS) += lmv.o
-lmv-y := lmv_obd.o lmv_intent.o lmv_fld.o lproc_lmv.o
+lmv-y := lmv_obd.o lmv_intent.o lmv_fld.o lproc_lmv.o lmv_qos.o
diff --git a/fs/lustre/lmv/lmv_intent.c b/fs/lustre/lmv/lmv_intent.c
index 6017375..3efd977 100644
--- a/fs/lustre/lmv/lmv_intent.c
+++ b/fs/lustre/lmv/lmv_intent.c
@@ -108,7 +108,7 @@ static int lmv_intent_remote(struct obd_export *exp, struct lookup_intent *it,
 
 	op_data->op_bias = MDS_CROSS_REF;
 	CDEBUG(D_INODE, "REMOTE_INTENT with fid=" DFID " -> mds #%u\n",
-	       PFID(&body->mbo_fid1), tgt->ltd_idx);
+	       PFID(&body->mbo_fid1), tgt->ltd_index);
 
 	/* ask for security context upon intent */
 	if (it->it_op & (IT_LOOKUP | IT_GETATTR | IT_OPEN) &&
@@ -206,7 +206,7 @@ int lmv_revalidate_slaves(struct obd_export *exp,
 		}
 
 		CDEBUG(D_INODE, "Revalidate slave " DFID " -> mds #%u\n",
-		       PFID(&fid), tgt->ltd_idx);
+		       PFID(&fid), tgt->ltd_index);
 
 		if (req) {
 			ptlrpc_req_finished(req);
@@ -353,7 +353,7 @@ static int lmv_intent_open(struct obd_export *exp, struct md_op_data *op_data,
 		if (IS_ERR(tgt))
 			return PTR_ERR(tgt);
 
-		op_data->op_mds = tgt->ltd_idx;
+		op_data->op_mds = tgt->ltd_index;
 	} else {
 		LASSERT(fid_is_sane(&op_data->op_fid1));
 		LASSERT(fid_is_zero(&op_data->op_fid2));
@@ -380,7 +380,7 @@ static int lmv_intent_open(struct obd_export *exp, struct md_op_data *op_data,
 	CDEBUG(D_INODE,
 	       "OPEN_INTENT with fid1=" DFID ", fid2=" DFID ", name='%s' -> mds #%u\n",
 	       PFID(&op_data->op_fid1),
-	       PFID(&op_data->op_fid2), op_data->op_name, tgt->ltd_idx);
+	       PFID(&op_data->op_fid2), op_data->op_name, tgt->ltd_index);
 
 	rc = md_intent_lock(tgt->ltd_exp, op_data, it, reqp, cb_blocking,
 			    extra_lock_flags);
@@ -465,7 +465,7 @@ static int lmv_intent_lookup(struct obd_export *exp,
 	       "LOOKUP_INTENT with fid1=" DFID ", fid2=" DFID ", name='%s' -> mds #%u\n",
 	       PFID(&op_data->op_fid1), PFID(&op_data->op_fid2),
 	       op_data->op_name ? op_data->op_name : "<NULL>",
-	       tgt->ltd_idx);
+	       tgt->ltd_index);
 
 	op_data->op_bias &= ~MDS_CROSS_REF;
 
diff --git a/fs/lustre/lmv/lmv_internal.h b/fs/lustre/lmv/lmv_internal.h
index 9974ec5..c673656 100644
--- a/fs/lustre/lmv/lmv_internal.h
+++ b/fs/lustre/lmv/lmv_internal.h
@@ -60,6 +60,8 @@ int lmv_revalidate_slaves(struct obd_export *exp,
 
 int lmv_getattr_name(struct obd_export *exp, struct md_op_data *op_data,
 		     struct ptlrpc_request **preq);
+void lmv_activate_target(struct lmv_obd *lmv, struct lmv_tgt_desc *tgt,
+			 int activate);
 
 int lmv_statfs_check_update(struct obd_device *obd, struct lmv_tgt_desc *tgt);
 
@@ -77,7 +79,7 @@ static inline struct obd_device *lmv2obd_dev(struct lmv_obd *lmv)
 		if (!lmv->tgts[i])
 			continue;
 
-		if (lmv->tgts[i]->ltd_idx == mdt_idx) {
+		if (lmv->tgts[i]->ltd_index == mdt_idx) {
 			if (index)
 				*index = i;
 			return lmv->tgts[i];
@@ -192,6 +194,10 @@ static inline bool lmv_dir_retry_check_update(struct md_op_data *op_data)
 struct lmv_tgt_desc *lmv_locate_tgt(struct lmv_obd *lmv,
 				    struct md_op_data *op_data);
 
+/* lmv_qos.c */
+struct lu_tgt_desc *lmv_locate_tgt_qos(struct lmv_obd *lmv, u32 *mdt);
+struct lu_tgt_desc *lmv_locate_tgt_rr(struct lmv_obd *lmv, u32 *mdt);
+
 /* lproc_lmv.c */
 int lmv_tunables_init(struct obd_device *obd);
 
diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index 02dfd35..20ae322 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -57,9 +57,8 @@
 
 static int lmv_check_connect(struct obd_device *obd);
 
-static void lmv_activate_target(struct lmv_obd *lmv,
-				struct lmv_tgt_desc *tgt,
-				int activate)
+void lmv_activate_target(struct lmv_obd *lmv, struct lmv_tgt_desc *tgt,
+			 int activate)
 {
 	if (tgt->ltd_active == activate)
 		return;
@@ -315,7 +314,7 @@ static int lmv_connect_mdc(struct obd_device *obd, struct lmv_tgt_desc *tgt)
 
 	target.ft_srv = NULL;
 	target.ft_exp = mdc_exp;
-	target.ft_idx = tgt->ltd_idx;
+	target.ft_idx = tgt->ltd_index;
 
 	fld_client_add_target(&lmv->lmv_fld, &target);
 
@@ -345,6 +344,12 @@ static int lmv_connect_mdc(struct obd_device *obd, struct lmv_tgt_desc *tgt)
 
 	md_init_ea_size(tgt->ltd_exp, lmv->max_easize, lmv->max_def_easize);
 
+	rc = lqos_add_tgt(&lmv->lmv_qos, tgt);
+	if (rc) {
+		obd_disconnect(mdc_exp);
+		return rc;
+	}
+
 	CDEBUG(D_CONFIG, "Connected to %s(%s) successfully (%d)\n",
 	       mdc_obd->obd_name, mdc_obd->obd_uuid.uuid,
 	       atomic_read(&obd->obd_refcount));
@@ -364,6 +369,8 @@ static void lmv_del_target(struct lmv_obd *lmv, int index)
 	if (!lmv->tgts[index])
 		return;
 
+	lqos_del_tgt(&lmv->lmv_qos, lmv->tgts[index]);
+
 	kfree(lmv->tgts[index]);
 	lmv->tgts[index] = NULL;
 }
@@ -435,7 +442,7 @@ static int lmv_add_target(struct obd_device *obd, struct obd_uuid *uuidp,
 	}
 
 	mutex_init(&tgt->ltd_fid_mutex);
-	tgt->ltd_idx = index;
+	tgt->ltd_index = index;
 	tgt->ltd_uuid = *uuidp;
 	tgt->ltd_active = 0;
 	lmv->tgts[index] = tgt;
@@ -1099,7 +1106,7 @@ static int lmv_iocontrol(unsigned int cmd, struct obd_export *exp,
 			return -EINVAL;
 
 		/* only files on same MDT can have their layouts swapped */
-		if (tgt1->ltd_idx != tgt2->ltd_idx)
+		if (tgt1->ltd_index != tgt2->ltd_index)
 			return -EPERM;
 
 		rc = obd_iocontrol(cmd, tgt1->ltd_exp, len, karg, uarg);
@@ -1253,6 +1260,8 @@ static int lmv_setup(struct obd_device *obd, struct lustre_cfg *lcfg)
 {
 	struct lmv_obd *lmv = &obd->u.lmv;
 	struct lmv_desc *desc;
+	struct lnet_process_id lnet_id;
+	int i = 0;
 	int rc;
 
 	if (LUSTRE_CFG_BUFLEN(lcfg, 1) < 1) {
@@ -1275,13 +1284,35 @@ static int lmv_setup(struct obd_device *obd, struct lustre_cfg *lcfg)
 	obd_str2uuid(&lmv->desc.ld_uuid, desc->ld_uuid.uuid);
 	lmv->desc.ld_tgt_count = 0;
 	lmv->desc.ld_active_tgt_count = 0;
-	lmv->desc.ld_qos_maxage = 60;
+	lmv->desc.ld_qos_maxage = LMV_DESC_QOS_MAXAGE_DEFAULT;
 	lmv->max_def_easize = 0;
 	lmv->max_easize = 0;
 
 	spin_lock_init(&lmv->lmv_lock);
 	mutex_init(&lmv->lmv_init_mutex);
 
+	/* Set up allocation policy (QoS and RR) */
+	INIT_LIST_HEAD(&lmv->lmv_qos.lq_svr_list);
+	init_rwsem(&lmv->lmv_qos.lq_rw_sem);
+	lmv->lmv_qos.lq_dirty = 1;
+	lmv->lmv_qos.lq_rr.lqr_dirty = 1;
+	lmv->lmv_qos.lq_reset = 1;
+	/* Default priority is toward free space balance */
+	lmv->lmv_qos.lq_prio_free = 232;
+	/* Default threshold for rr (roughly 17%) */
+	lmv->lmv_qos.lq_threshold_rr = 43;
+
+	/*
+	 * initialize rr_index to the lower 32 bits of the NID, so that
+	 * clients distribute subdirs evenly from the beginning.
+	 */
+	while (LNetGetId(i++, &lnet_id) != -ENOENT) {
+		if (LNET_NETTYP(LNET_NIDNET(lnet_id.nid)) != LOLND) {
+			lmv->lmv_qos_rr_index = (u32)lnet_id.nid;
+			break;
+		}
+	}
+
 	rc = lmv_tunables_init(obd);
 	if (rc)
 		CWARN("%s: error adding LMV sysfs/debugfs files: rc = %d\n",
@@ -1462,6 +1493,7 @@ static int lmv_statfs_update(void *cookie, int rc)
 		tgt->ltd_statfs = *osfs;
 		tgt->ltd_statfs_age = ktime_get_seconds();
 		spin_unlock(&lmv->lmv_lock);
+		lmv->lmv_qos.lq_dirty = 1;
 	}
 
 	return rc;
@@ -1541,7 +1573,7 @@ static int lmv_getattr(struct obd_export *exp, struct md_op_data *op_data,
 		return PTR_ERR(tgt);
 
 	if (op_data->op_flags & MF_GET_MDT_IDX) {
-		op_data->op_mds = tgt->ltd_idx;
+		op_data->op_mds = tgt->ltd_index;
 		return 0;
 	}
 
@@ -1585,17 +1617,6 @@ static int lmv_close(struct obd_export *exp, struct md_op_data *op_data,
 	return md_close(tgt->ltd_exp, op_data, mod, request);
 }
 
-static struct lmv_tgt_desc *lmv_locate_tgt_qos(struct lmv_obd *lmv, u32 *mdt)
-{
-	static unsigned int rr_index;
-
-	/* locate MDT round-robin is the first step */
-	*mdt = rr_index % lmv->tgts_size;
-	rr_index++;
-
-	return lmv->tgts[*mdt];
-}
-
 static struct lmv_tgt_desc *
 lmv_locate_tgt_by_name(struct lmv_obd *lmv, struct lmv_stripe_md *lsm,
 		       const char *name, int namelen, struct lu_fid *fid,
@@ -1609,7 +1630,7 @@ static struct lmv_tgt_desc *lmv_locate_tgt_qos(struct lmv_obd *lmv, u32 *mdt)
 		if (IS_ERR(tgt))
 			return tgt;
 
-		*mds = tgt->ltd_idx;
+		*mds = tgt->ltd_index;
 		return tgt;
 	}
 
@@ -1698,12 +1719,18 @@ struct lmv_tgt_desc *
 		   lmv_dir_space_hashed(op_data->op_default_mea1) &&
 		   !lmv_dir_striped(lsm)) {
 		tgt = lmv_locate_tgt_qos(lmv, &op_data->op_mds);
+		if (tgt == ERR_PTR(-EAGAIN))
+			tgt = lmv_locate_tgt_rr(lmv, &op_data->op_mds);
 		/*
 		 * only update statfs when mkdir under dir with "space" hash,
 		 * this means the cached statfs may be stale, and current mkdir
 		 * may not follow QoS accurately, but it's not serious, and it
 		 * avoids periodic statfs when client doesn't mkdir under
 		 * "space" hashed directories.
+		 *
+	 * TODO: after the MDT supports QoS object allocation, also update
+	 * statfs for 'lfs mkdir -i -1 ...'; currently this is done in
+	 * user space.
 		 */
 		if (!IS_ERR(tgt)) {
 			struct obd_device *obd;
@@ -1823,7 +1850,7 @@ int lmv_create(struct obd_export *exp, struct md_op_data *op_data,
 		if (IS_ERR(tgt))
 			return PTR_ERR(tgt);
 
-		op_data->op_mds = tgt->ltd_idx;
+		op_data->op_mds = tgt->ltd_index;
 	}
 
 	CDEBUG(D_INODE, "CREATE obj " DFID " -> mds #%x\n",
@@ -1858,7 +1885,7 @@ int lmv_create(struct obd_export *exp, struct md_op_data *op_data,
 		return PTR_ERR(tgt);
 
 	CDEBUG(D_INODE, "ENQUEUE on " DFID " -> mds #%u\n",
-	       PFID(&op_data->op_fid1), tgt->ltd_idx);
+	       PFID(&op_data->op_fid1), tgt->ltd_index);
 
 	return md_enqueue(tgt->ltd_exp, einfo, policy, op_data, lockh,
 			  extra_lock_flags);
@@ -1881,7 +1908,7 @@ int lmv_create(struct obd_export *exp, struct md_op_data *op_data,
 
 	CDEBUG(D_INODE, "GETATTR_NAME for %*s on " DFID " -> mds #%u\n",
 	       (int)op_data->op_namelen, op_data->op_name,
-	       PFID(&op_data->op_fid1), tgt->ltd_idx);
+	       PFID(&op_data->op_fid1), tgt->ltd_index);
 
 	rc = md_getattr_name(tgt->ltd_exp, op_data, preq);
 	if (rc == -ENOENT && lmv_dir_retry_check_update(op_data)) {
@@ -1935,7 +1962,7 @@ static int lmv_early_cancel(struct obd_export *exp, struct lmv_tgt_desc *tgt,
 			return PTR_ERR(tgt);
 	}
 
-	if (tgt->ltd_idx != op_tgt) {
+	if (tgt->ltd_index != op_tgt) {
 		CDEBUG(D_INODE, "EARLY_CANCEL on " DFID "\n", PFID(fid));
 		policy.l_inodebits.bits = bits;
 		rc = md_cancel_unused(tgt->ltd_exp, fid, &policy,
@@ -1981,7 +2008,7 @@ static int lmv_link(struct obd_export *exp, struct md_op_data *op_data,
 	 * Cancel UPDATE lock on child (fid1).
 	 */
 	op_data->op_flags |= MF_MDC_CANCEL_FID2;
-	rc = lmv_early_cancel(exp, NULL, op_data, tgt->ltd_idx, LCK_EX,
+	rc = lmv_early_cancel(exp, NULL, op_data, tgt->ltd_index, LCK_EX,
 			      MDS_INODELOCK_UPDATE, MF_MDC_CANCEL_FID1);
 	if (rc != 0)
 		return rc;
@@ -2075,7 +2102,7 @@ static int lmv_migrate(struct obd_export *exp, struct md_op_data *op_data,
 		return PTR_ERR(child_tgt);
 
 	if (!S_ISDIR(op_data->op_mode) && tp_tgt)
-		rc = __lmv_fid_alloc(lmv, &target_fid, tp_tgt->ltd_idx);
+		rc = __lmv_fid_alloc(lmv, &target_fid, tp_tgt->ltd_index);
 	else
 		rc = lmv_fid_alloc(NULL, exp, &target_fid, op_data);
 	if (rc)
@@ -2101,7 +2128,7 @@ static int lmv_migrate(struct obd_export *exp, struct md_op_data *op_data,
 	}
 
 	/* cancel UPDATE lock of parent master object */
-	rc = lmv_early_cancel(exp, parent_tgt, op_data, tgt->ltd_idx, LCK_EX,
+	rc = lmv_early_cancel(exp, parent_tgt, op_data, tgt->ltd_index, LCK_EX,
 			      MDS_INODELOCK_UPDATE, MF_MDC_CANCEL_FID1);
 	if (rc)
 		return rc;
@@ -2126,14 +2153,14 @@ static int lmv_migrate(struct obd_export *exp, struct md_op_data *op_data,
 	op_data->op_fid4 = target_fid;
 
 	/* cancel UPDATE locks of target parent */
-	rc = lmv_early_cancel(exp, tp_tgt, op_data, tgt->ltd_idx, LCK_EX,
+	rc = lmv_early_cancel(exp, tp_tgt, op_data, tgt->ltd_index, LCK_EX,
 			      MDS_INODELOCK_UPDATE, MF_MDC_CANCEL_FID2);
 	if (rc)
 		return rc;
 
 	/* cancel LOOKUP lock of source if source is remote object */
 	if (child_tgt != sp_tgt) {
-		rc = lmv_early_cancel(exp, sp_tgt, op_data, tgt->ltd_idx,
+		rc = lmv_early_cancel(exp, sp_tgt, op_data, tgt->ltd_index,
 				      LCK_EX, MDS_INODELOCK_LOOKUP,
 				      MF_MDC_CANCEL_FID3);
 		if (rc)
@@ -2141,7 +2168,7 @@ static int lmv_migrate(struct obd_export *exp, struct md_op_data *op_data,
 	}
 
 	/* cancel ELC locks of source */
-	rc = lmv_early_cancel(exp, child_tgt, op_data, tgt->ltd_idx, LCK_EX,
+	rc = lmv_early_cancel(exp, child_tgt, op_data, tgt->ltd_index, LCK_EX,
 			      MDS_INODELOCK_ELC, MF_MDC_CANCEL_FID3);
 	if (rc)
 		return rc;
@@ -2201,7 +2228,7 @@ static int lmv_rename(struct obd_export *exp, struct md_op_data *op_data,
 	op_data->op_flags |= MF_MDC_CANCEL_FID4;
 
 	/* cancel UPDATE locks of target parent */
-	rc = lmv_early_cancel(exp, tp_tgt, op_data, tgt->ltd_idx, LCK_EX,
+	rc = lmv_early_cancel(exp, tp_tgt, op_data, tgt->ltd_index, LCK_EX,
 			      MDS_INODELOCK_UPDATE, MF_MDC_CANCEL_FID2);
 	if (rc != 0)
 		return rc;
@@ -2210,7 +2237,7 @@ static int lmv_rename(struct obd_export *exp, struct md_op_data *op_data,
 		/* cancel LOOKUP lock of target on target parent */
 		if (tgt != tp_tgt) {
 			rc = lmv_early_cancel(exp, tp_tgt, op_data,
-					      tgt->ltd_idx, LCK_EX,
+					      tgt->ltd_index, LCK_EX,
 					      MDS_INODELOCK_LOOKUP,
 					      MF_MDC_CANCEL_FID4);
 			if (rc != 0)
@@ -2224,7 +2251,7 @@ static int lmv_rename(struct obd_export *exp, struct md_op_data *op_data,
 			return PTR_ERR(src_tgt);
 
 		/* cancel ELC locks of source */
-		rc = lmv_early_cancel(exp, src_tgt, op_data, tgt->ltd_idx,
+		rc = lmv_early_cancel(exp, src_tgt, op_data, tgt->ltd_index,
 				      LCK_EX, MDS_INODELOCK_ELC,
 				      MF_MDC_CANCEL_FID3);
 		if (rc != 0)
@@ -2239,7 +2266,7 @@ static int lmv_rename(struct obd_export *exp, struct md_op_data *op_data,
 		return PTR_ERR(sp_tgt);
 
 	/* cancel UPDATE locks of source parent */
-	rc = lmv_early_cancel(exp, sp_tgt, op_data, tgt->ltd_idx, LCK_EX,
+	rc = lmv_early_cancel(exp, sp_tgt, op_data, tgt->ltd_index, LCK_EX,
 			      MDS_INODELOCK_UPDATE, MF_MDC_CANCEL_FID1);
 	if (rc != 0)
 		return rc;
@@ -2248,7 +2275,7 @@ static int lmv_rename(struct obd_export *exp, struct md_op_data *op_data,
 		/* cancel LOOKUP lock of source on source parent */
 		if (src_tgt != sp_tgt) {
 			rc = lmv_early_cancel(exp, sp_tgt, op_data,
-					      tgt->ltd_idx, LCK_EX,
+					      tgt->ltd_index, LCK_EX,
 					      MDS_INODELOCK_LOOKUP,
 					      MF_MDC_CANCEL_FID3);
 			if (rc != 0)
@@ -2293,7 +2320,7 @@ static int lmv_rename(struct obd_export *exp, struct md_op_data *op_data,
 		/* cancel LOOKUP lock of target on target parent */
 		if (tgt != tp_tgt) {
 			rc = lmv_early_cancel(exp, tp_tgt, op_data,
-					      tgt->ltd_idx, LCK_EX,
+					      tgt->ltd_index, LCK_EX,
 					      MDS_INODELOCK_LOOKUP,
 					      MF_MDC_CANCEL_FID4);
 			if (rc != 0)
@@ -2781,17 +2808,18 @@ static int lmv_unlink(struct obd_export *exp, struct md_op_data *op_data,
 	op_data->op_flags |= MF_MDC_CANCEL_FID1 | MF_MDC_CANCEL_FID3;
 
 	if (parent_tgt != tgt)
-		rc = lmv_early_cancel(exp, parent_tgt, op_data, tgt->ltd_idx,
+		rc = lmv_early_cancel(exp, parent_tgt, op_data, tgt->ltd_index,
 				      LCK_EX, MDS_INODELOCK_LOOKUP,
 				      MF_MDC_CANCEL_FID3);
 
-	rc = lmv_early_cancel(exp, NULL, op_data, tgt->ltd_idx, LCK_EX,
+	rc = lmv_early_cancel(exp, NULL, op_data, tgt->ltd_index, LCK_EX,
 			      MDS_INODELOCK_ELC, MF_MDC_CANCEL_FID3);
 	if (rc)
 		return rc;
 
 	CDEBUG(D_INODE, "unlink with fid=" DFID "/" DFID " -> mds #%u\n",
-	       PFID(&op_data->op_fid1), PFID(&op_data->op_fid2), tgt->ltd_idx);
+	       PFID(&op_data->op_fid1), PFID(&op_data->op_fid2),
+	       tgt->ltd_index);
 
 	rc = md_unlink(tgt->ltd_exp, op_data, request);
 	if (rc == -ENOENT && lmv_dir_retry_check_update(op_data)) {
diff --git a/fs/lustre/lmv/lmv_qos.c b/fs/lustre/lmv/lmv_qos.c
new file mode 100644
index 0000000..e323398
--- /dev/null
+++ b/fs/lustre/lmv/lmv_qos.c
@@ -0,0 +1,446 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * GPL HEADER START
+ *
+ * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 only,
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License version 2 for more details (a copy is included
+ * in the LICENSE file that accompanied this code).
+ *
+ * You should have received a copy of the GNU General Public License
+ * version 2 along with this program; If not, see
+ * http://www.gnu.org/licenses/gpl-2.0.html
+ *
+ * GPL HEADER END
+ */
+/*
+ * This file is part of Lustre, http://www.lustre.org/
+ *
+ * lustre/lmv/lmv_qos.c
+ *
+ * LMV QoS.
+ * These are the only exported functions; they provide some generic
+ * infrastructure for object allocation QoS.
+ *
+ */
+
+#define DEBUG_SUBSYSTEM S_LMV
+
+#include <asm/div64.h>
+#include <uapi/linux/lustre/lustre_idl.h>
+#include <lustre_swab.h>
+#include <obd_class.h>
+
+#include "lmv_internal.h"
+
+static inline u64 tgt_statfs_bavail(struct lu_tgt_desc *tgt)
+{
+	struct obd_statfs *statfs = &tgt->ltd_statfs;
+
+	return statfs->os_bavail * statfs->os_bsize;
+}
+
+static inline u64 tgt_statfs_iavail(struct lu_tgt_desc *tgt)
+{
+	return tgt->ltd_statfs.os_ffree;
+}
+
+/**
+ * Calculate penalties per-tgt and per-server
+ *
+ * Re-calculate penalties when the configuration changes, active targets
+ * change and after statfs refresh (all these are reflected by lq_dirty flag).
+ * On every MDT and MDS: decay the penalty by half for every 8x the update
+ * interval that the device has been idle. That gives lots of time for the
+ * statfs information to be updated (which the penalty is only a proxy for),
+ * and avoids penalizing MDS/MDTs under light load.
+ * See lmv_qos_calc_weight() for how penalties are factored into the weight.
+ *
+ * @lmv			LMV device
+ *
+ * Return:		0 on success
+ *			-EAGAIN	if fewer than two MDTs are active or all
+ *			MDTs have almost the same free space
+ */
+static int lmv_qos_calc_ppts(struct lmv_obd *lmv)
+{
+	struct lu_qos *qos = &lmv->lmv_qos;
+	struct lu_tgt_desc *tgt;
+	struct lu_svr_qos *svr;
+	u64 ba_max, ba_min, ba;
+	u64 ia_max, ia_min, ia;
+	u32 num_active;
+	unsigned int i;
+	int prio_wide;
+	time64_t now, age;
+	u32 maxage = lmv->desc.ld_qos_maxage;
+	int rc = 0;
+
+	if (!qos->lq_dirty)
+		goto out;
+
+	num_active = lmv->desc.ld_active_tgt_count;
+	if (num_active < 2) {
+		rc = -EAGAIN;
+		goto out;
+	}
+
+	/* find bavail on each server */
+	list_for_each_entry(svr, &qos->lq_svr_list, lsq_svr_list) {
+		svr->lsq_bavail = 0;
+		svr->lsq_iavail = 0;
+	}
+	qos->lq_active_svr_count = 0;
+
+	/*
+	 * How badly the user wants to select targets "widely" (not recently
+	 * chosen and not on recently used MDSs), as opposed to "freely"
+	 * (by available free space). Range: 0-256.
+	 */
+	prio_wide = 256 - qos->lq_prio_free;
+
+	ba_min = (u64)(-1);
+	ba_max = 0;
+	ia_min = (u64)(-1);
+	ia_max = 0;
+	now = ktime_get_real_seconds();
+
+	/* Calculate server penalty per object */
+	for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
+		tgt = lmv->tgts[i];
+		if (!tgt || !tgt->ltd_exp || !tgt->ltd_active)
+			continue;
+
+		/* bavail >> 16 to avoid overflow */
+		ba = tgt_statfs_bavail(tgt) >> 16;
+		if (!ba)
+			continue;
+
+		ba_min = min(ba, ba_min);
+		ba_max = max(ba, ba_max);
+
+		/* iavail >> 8 to avoid overflow */
+		ia = tgt_statfs_iavail(tgt) >> 8;
+		if (!ia)
+			continue;
+
+		ia_min = min(ia, ia_min);
+		ia_max = max(ia, ia_max);
+
+		/* Count the number of usable MDS's */
+		if (tgt->ltd_qos.ltq_svr->lsq_bavail == 0)
+			qos->lq_active_svr_count++;
+		tgt->ltd_qos.ltq_svr->lsq_bavail += ba;
+		tgt->ltd_qos.ltq_svr->lsq_iavail += ia;
+
+		/*
+		 * per-MDT penalty is
+		 * prio * bavail * iavail / (num_tgt - 1) / 2
+		 */
+		tgt->ltd_qos.ltq_penalty_per_obj = prio_wide * ba * ia;
+		do_div(tgt->ltd_qos.ltq_penalty_per_obj, num_active - 1);
+		tgt->ltd_qos.ltq_penalty_per_obj >>= 1;
+
+		age = (now - tgt->ltd_qos.ltq_used) >> 3;
+		if (qos->lq_reset || age > 32 * maxage)
+			tgt->ltd_qos.ltq_penalty = 0;
+		else if (age > maxage)
+			/* Decay tgt penalty. */
+			tgt->ltd_qos.ltq_penalty >>= (age / maxage);
+	}
+
+	num_active = qos->lq_active_svr_count;
+	if (num_active < 2) {
+		/*
+		 * If there's only 1 MDS, we can't penalize it, so instead
+		 * we have to double the MDT penalty
+		 */
+		num_active = 2;
+		for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
+			tgt = lmv->tgts[i];
+			if (!tgt || !tgt->ltd_exp || !tgt->ltd_active)
+				continue;
+
+			tgt->ltd_qos.ltq_penalty_per_obj <<= 1;
+		}
+	}
+
+	/*
+	 * Per-MDS penalty is
+	 * prio * bavail * iavail / server_tgts / (num_svr - 1) / 2
+	 */
+	list_for_each_entry(svr, &qos->lq_svr_list, lsq_svr_list) {
+		ba = svr->lsq_bavail;
+		ia = svr->lsq_iavail;
+		svr->lsq_penalty_per_obj = prio_wide * ba * ia;
+		do_div(svr->lsq_penalty_per_obj,
+		       svr->lsq_tgt_count * (num_active - 1));
+		svr->lsq_penalty_per_obj >>= 1;
+
+		age = (now - svr->lsq_used) >> 3;
+		if (qos->lq_reset || age > 32 * maxage)
+			svr->lsq_penalty = 0;
+		else if (age > maxage)
+			/* Decay server penalty. */
+			svr->lsq_penalty >>= age / maxage;
+	}
+
+	qos->lq_dirty = 0;
+	qos->lq_reset = 0;
+
+	/*
+	 * If each MDT has almost same free space, do rr allocation for better
+	 * creation performance
+	 */
+	qos->lq_same_space = 0;
+	if ((ba_max * (256 - qos->lq_threshold_rr)) >> 8 < ba_min &&
+	    (ia_max * (256 - qos->lq_threshold_rr)) >> 8 < ia_min) {
+		qos->lq_same_space = 1;
+		/* Reset weights for the next time we enter qos mode */
+		qos->lq_reset = 1;
+	}
+	rc = 0;
+
+out:
+	if (!rc && qos->lq_same_space)
+		return -EAGAIN;
+
+	return rc;
+}
+
+static inline bool lmv_qos_is_usable(struct lmv_obd *lmv)
+{
+	if (!lmv->lmv_qos.lq_dirty && lmv->lmv_qos.lq_same_space)
+		return false;
+
+	if (lmv->desc.ld_active_tgt_count < 2)
+		return false;
+
+	return true;
+}
+
+/**
+ * Calculate weight for a given MDT.
+ *
+ * The final MDT weight is (bavail >> 16) * (iavail >> 8) minus the MDT and MDS
+ * penalties.  See lmv_qos_calc_ppts() for how penalties are calculated.
+ *
+ * \param[in] tgt	MDT target descriptor
+ */
+static void lmv_qos_calc_weight(struct lu_tgt_desc *tgt)
+{
+	struct lu_tgt_qos *ltq = &tgt->ltd_qos;
+	u64 temp, temp2;
+
+	temp = (tgt_statfs_bavail(tgt) >> 16) * (tgt_statfs_iavail(tgt) >> 8);
+	temp2 = ltq->ltq_penalty + ltq->ltq_svr->lsq_penalty;
+	if (temp < temp2)
+		ltq->ltq_weight = 0;
+	else
+		ltq->ltq_weight = temp - temp2;
+}
+
+/**
+ * Re-calculate weights.
+ *
+ * The function is called when some target was used for a new object. In
+ * this case we should re-calculate all the weights to keep new allocations
+ * balanced well.
+ *
+ * \param[in] lmv	LMV device
+ * \param[in] tgt	target where a new object was placed
+ * \param[out] total_wt	new total weight for the pool
+ *
+ * \retval		0
+ */
+static int lmv_qos_used(struct lmv_obd *lmv, struct lu_tgt_desc *tgt,
+			u64 *total_wt)
+{
+	struct lu_tgt_qos *ltq;
+	struct lu_svr_qos *svr;
+	unsigned int i;
+
+	ltq = &tgt->ltd_qos;
+	LASSERT(ltq);
+
+	/* Don't allocate on this device anymore, until the next alloc_qos */
+	ltq->ltq_usable = 0;
+
+	svr = ltq->ltq_svr;
+
+	/*
+	 * Decay old penalty by half (we're adding max penalty, and don't
+	 * want it to run away.)
+	 */
+	ltq->ltq_penalty >>= 1;
+	svr->lsq_penalty >>= 1;
+
+	/* mark the MDS and MDT as recently used */
+	ltq->ltq_used = svr->lsq_used = ktime_get_real_seconds();
+
+	/* Set max penalties for this MDT and MDS */
+	ltq->ltq_penalty += ltq->ltq_penalty_per_obj *
+			    lmv->desc.ld_active_tgt_count;
+	svr->lsq_penalty += svr->lsq_penalty_per_obj *
+		lmv->lmv_qos.lq_active_svr_count;
+
+	/* Decrease all MDS penalties */
+	list_for_each_entry(svr, &lmv->lmv_qos.lq_svr_list, lsq_svr_list) {
+		if (svr->lsq_penalty < svr->lsq_penalty_per_obj)
+			svr->lsq_penalty = 0;
+		else
+			svr->lsq_penalty -= svr->lsq_penalty_per_obj;
+	}
+
+	*total_wt = 0;
+	/* Decrease all MDT penalties */
+	for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
+		tgt = lmv->tgts[i];
+		if (!tgt || !tgt->ltd_exp || !tgt->ltd_active)
+			continue;
+
+		ltq = &tgt->ltd_qos;
+		if (ltq->ltq_penalty < ltq->ltq_penalty_per_obj)
+			ltq->ltq_penalty = 0;
+		else
+			ltq->ltq_penalty -= ltq->ltq_penalty_per_obj;
+
+		lmv_qos_calc_weight(tgt);
+
+		/* Recalc the total weight of usable MDTs */
+		if (ltq->ltq_usable)
+			*total_wt += ltq->ltq_weight;
+
+		CDEBUG(D_OTHER,
+		       "recalc tgt %d usable=%d avail=%llu tgtppo=%llu tgtp=%llu svrppo=%llu svrp=%llu wt=%llu\n",
+		       i, ltq->ltq_usable,
+		       tgt_statfs_bavail(tgt) >> 10,
+		       ltq->ltq_penalty_per_obj >> 10,
+		       ltq->ltq_penalty >> 10,
+		       ltq->ltq_svr->lsq_penalty_per_obj >> 10,
+		       ltq->ltq_svr->lsq_penalty >> 10,
+		       ltq->ltq_weight >> 10);
+	}
+
+	return 0;
+}
+
+struct lu_tgt_desc *lmv_locate_tgt_qos(struct lmv_obd *lmv, u32 *mdt)
+{
+	struct lu_tgt_desc *tgt;
+	u64 total_weight = 0;
+	u64 cur_weight = 0;
+	u64 rand;
+	int i;
+	int rc;
+
+	if (!lmv_qos_is_usable(lmv))
+		return ERR_PTR(-EAGAIN);
+
+	down_write(&lmv->lmv_qos.lq_rw_sem);
+
+	if (!lmv_qos_is_usable(lmv)) {
+		tgt = ERR_PTR(-EAGAIN);
+		goto unlock;
+	}
+
+	rc = lmv_qos_calc_ppts(lmv);
+	if (rc) {
+		tgt = ERR_PTR(rc);
+		goto unlock;
+	}
+
+	for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
+		tgt = lmv->tgts[i];
+		if (!tgt)
+			continue;
+
+		tgt->ltd_qos.ltq_usable = 0;
+		if (!tgt->ltd_exp || !tgt->ltd_active)
+			continue;
+
+		tgt->ltd_qos.ltq_usable = 1;
+		lmv_qos_calc_weight(tgt);
+		total_weight += tgt->ltd_qos.ltq_weight;
+	}
+
+	if (total_weight) {
+#if BITS_PER_LONG == 32
+		/*
+		 * If total_weight > 32-bit, first generate the high
+		 * 32 bits of the random number, then add in the low
+		 * 32 bits (truncated to the upper limit, if needed)
+		 */
+		if (total_weight > 0xffffffffULL)
+			rand = (u64)prandom_u32_max(
+				(unsigned int)(total_weight >> 32)) << 32;
+		else
+			rand = 0;
+
+		if (rand == (total_weight & 0xffffffff00000000ULL))
+			rand |= prandom_u32_max((unsigned int)total_weight);
+		else
+			rand |= prandom_u32();
+
+#else
+		rand = ((u64)prandom_u32() << 32 | prandom_u32()) %
+			total_weight;
+#endif
+	} else {
+		rand = 0;
+	}
+
+	for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
+		tgt = lmv->tgts[i];
+
+		if (!tgt || !tgt->ltd_qos.ltq_usable)
+			continue;
+
+		cur_weight += tgt->ltd_qos.ltq_weight;
+		if (cur_weight < rand)
+			continue;
+
+		*mdt = tgt->ltd_index;
+		lmv_qos_used(lmv, tgt, &total_weight);
+		rc = 0;
+		goto unlock;
+	}
+
+	/* no proper target found */
+	tgt = ERR_PTR(-EAGAIN);
+unlock:
+	up_write(&lmv->lmv_qos.lq_rw_sem);
+
+	return tgt;
+}
+
+struct lu_tgt_desc *lmv_locate_tgt_rr(struct lmv_obd *lmv, u32 *mdt)
+{
+	struct lu_tgt_desc *tgt;
+	int i;
+
+	spin_lock(&lmv->lmv_qos.lq_rr.lqr_alloc);
+	for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
+		tgt = lmv->tgts[(i + lmv->lmv_qos_rr_index) %
+				lmv->desc.ld_tgt_count];
+		if (tgt && tgt->ltd_exp && tgt->ltd_active) {
+			*mdt = tgt->ltd_index;
+			lmv->lmv_qos_rr_index =
+				(i + lmv->lmv_qos_rr_index + 1) %
+				lmv->desc.ld_tgt_count;
+			spin_unlock(&lmv->lmv_qos.lq_rr.lqr_alloc);
+
+			return tgt;
+		}
+	}
+	spin_unlock(&lmv->lmv_qos.lq_rr.lqr_alloc);
+
+	return ERR_PTR(-ENODEV);
+}
diff --git a/fs/lustre/lmv/lproc_lmv.c b/fs/lustre/lmv/lproc_lmv.c
index 170ed564..659ebeb 100644
--- a/fs/lustre/lmv/lproc_lmv.c
+++ b/fs/lustre/lmv/lproc_lmv.c
@@ -76,6 +76,109 @@ static ssize_t desc_uuid_show(struct kobject *kobj, struct attribute *attr,
 }
 LUSTRE_RO_ATTR(desc_uuid);
 
+static ssize_t qos_maxage_show(struct kobject *kobj,
+			       struct attribute *attr,
+			       char *buf)
+{
+	struct obd_device *dev = container_of(kobj, struct obd_device,
+					      obd_kset.kobj);
+
+	return sprintf(buf, "%u\n", dev->u.lmv.desc.ld_qos_maxage);
+}
+
+static ssize_t qos_maxage_store(struct kobject *kobj,
+				struct attribute *attr,
+				const char *buffer,
+				size_t count)
+{
+	struct obd_device *dev = container_of(kobj, struct obd_device,
+					      obd_kset.kobj);
+	unsigned int val;
+	int rc;
+
+	rc = kstrtouint(buffer, 0, &val);
+	if (rc)
+		return rc;
+
+	dev->u.lmv.desc.ld_qos_maxage = val;
+
+	return count;
+}
+LUSTRE_RW_ATTR(qos_maxage);
+
+static ssize_t qos_prio_free_show(struct kobject *kobj,
+				  struct attribute *attr,
+				  char *buf)
+{
+	struct obd_device *dev = container_of(kobj, struct obd_device,
+					      obd_kset.kobj);
+
+	return sprintf(buf, "%u%%\n",
+		       (dev->u.lmv.lmv_qos.lq_prio_free * 100 + 255) >> 8);
+}
+
+static ssize_t qos_prio_free_store(struct kobject *kobj,
+				   struct attribute *attr,
+				   const char *buffer,
+				   size_t count)
+{
+	struct obd_device *dev = container_of(kobj, struct obd_device,
+					      obd_kset.kobj);
+	struct lmv_obd *lmv = &dev->u.lmv;
+	unsigned int val;
+	int rc;
+
+	rc = kstrtouint(buffer, 0, &val);
+	if (rc)
+		return rc;
+
+	if (val > 100)
+		return -EINVAL;
+
+	lmv->lmv_qos.lq_prio_free = (val << 8) / 100;
+	lmv->lmv_qos.lq_dirty = 1;
+	lmv->lmv_qos.lq_reset = 1;
+
+	return count;
+}
+LUSTRE_RW_ATTR(qos_prio_free);
+
+static ssize_t qos_threshold_rr_show(struct kobject *kobj,
+				     struct attribute *attr,
+				     char *buf)
+{
+	struct obd_device *dev = container_of(kobj, struct obd_device,
+					      obd_kset.kobj);
+
+	return sprintf(buf, "%u%%\n",
+		       (dev->u.lmv.lmv_qos.lq_threshold_rr * 100 + 255) >> 8);
+}
+
+static ssize_t qos_threshold_rr_store(struct kobject *kobj,
+				      struct attribute *attr,
+				      const char *buffer,
+				      size_t count)
+{
+	struct obd_device *dev = container_of(kobj, struct obd_device,
+					      obd_kset.kobj);
+	struct lmv_obd *lmv = &dev->u.lmv;
+	unsigned int val;
+	int rc;
+
+	rc = kstrtouint(buffer, 0, &val);
+	if (rc)
+		return rc;
+
+	if (val > 100)
+		return -EINVAL;
+
+	lmv->lmv_qos.lq_threshold_rr = (val << 8) / 100;
+	lmv->lmv_qos.lq_dirty = 1;
+
+	return count;
+}
+LUSTRE_RW_ATTR(qos_threshold_rr);
+
 static void *lmv_tgt_seq_start(struct seq_file *p, loff_t *pos)
 {
 	struct obd_device *dev = p->private;
@@ -117,7 +220,7 @@ static int lmv_tgt_seq_show(struct seq_file *p, void *v)
 		return 0;
 
 	seq_printf(p, "%u: %s %sACTIVE\n",
-		   tgt->ltd_idx, tgt->ltd_uuid.uuid,
+		   tgt->ltd_index, tgt->ltd_uuid.uuid,
 		   tgt->ltd_active ? "" : "IN");
 	return 0;
 }
@@ -156,6 +259,9 @@ static int lmv_target_seq_open(struct inode *inode, struct file *file)
 	&lustre_attr_activeobd.attr,
 	&lustre_attr_desc_uuid.attr,
 	&lustre_attr_numobd.attr,
+	&lustre_attr_qos_maxage.attr,
+	&lustre_attr_qos_prio_free.attr,
+	&lustre_attr_qos_threshold_rr.attr,
 	NULL,
 };
 
diff --git a/fs/lustre/obdclass/Makefile b/fs/lustre/obdclass/Makefile
index 25d2e1d..6d762ed 100644
--- a/fs/lustre/obdclass/Makefile
+++ b/fs/lustre/obdclass/Makefile
@@ -8,4 +8,4 @@ obdclass-y := llog.o llog_cat.o llog_obd.o llog_swab.o class_obd.o \
 	      lustre_handles.o lustre_peer.o statfs_pack.o linkea.o \
 	      obdo.o obd_config.o obd_mount.o lu_object.o lu_ref.o \
 	      cl_object.o cl_page.o cl_lock.o cl_io.o kernelcomm.o \
-	      jobid.o integrity.o obd_cksum.o
+	      jobid.o integrity.o obd_cksum.o lu_qos.o
diff --git a/fs/lustre/obdclass/lu_qos.c b/fs/lustre/obdclass/lu_qos.c
new file mode 100644
index 0000000..4ee3f59
--- /dev/null
+++ b/fs/lustre/obdclass/lu_qos.c
@@ -0,0 +1,166 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * GPL HEADER START
+ *
+ * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 only,
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License version 2 for more details (a copy is included
+ * in the LICENSE file that accompanied this code).
+ *
+ * You should have received a copy of the GNU General Public License
+ * version 2 along with this program; If not, see
+ * http://www.gnu.org/licenses/gpl-2.0.html
+ *
+ * GPL HEADER END
+ */
+/*
+ * This file is part of Lustre, http://www.lustre.org/
+ *
+ * lustre/obdclass/lu_qos.c
+ *
+ * Lustre QoS.
+ * These are the only exported functions; they provide some generic
+ * infrastructure for object allocation QoS.
+ *
+ */
+
+#define DEBUG_SUBSYSTEM S_CLASS
+
+#include <linux/module.h>
+#include <linux/list.h>
+#include <obd_class.h>
+#include <obd_support.h>
+#include <lustre_disk.h>
+#include <lustre_fid.h>
+#include <lu_object.h>
+
+/**
+ * Add a new target to Quality of Service (QoS) target table.
+ *
+ * Add a new MDT/OST target to the structure representing an OSS. Resort the
+ * list of known MDSs/OSSs by the number of MDTs/OSTs attached to each MDS/OSS.
+ * The MDS/OSS list is protected internally and no external locking is required.
+ *
+ * @qos		lu_qos data
+ * @ltd		target description
+ *
+ * Return:	0 on success
+ *		-ENOMEM	on error
+ */
+int lqos_add_tgt(struct lu_qos *qos, struct lu_tgt_desc *ltd)
+{
+	struct lu_svr_qos *svr = NULL;
+	struct lu_svr_qos *tempsvr;
+	struct obd_export *exp = ltd->ltd_exp;
+	int found = 0;
+	u32 id = 0;
+	int rc = 0;
+
+	down_write(&qos->lq_rw_sem);
+	/*
+	 * a bit hacky approach to learn NID of corresponding connection
+	 * but there is no official API to access information like this
+	 * with OSD API.
+	 */
+	list_for_each_entry(svr, &qos->lq_svr_list, lsq_svr_list) {
+		if (obd_uuid_equals(&svr->lsq_uuid,
+				    &exp->exp_connection->c_remote_uuid)) {
+			found++;
+			break;
+		}
+		if (svr->lsq_id > id)
+			id = svr->lsq_id;
+	}
+
+	if (!found) {
+		svr = kmalloc(sizeof(*svr), GFP_NOFS);
+		if (!svr) {
+			rc = -ENOMEM;
+			goto out;
+		}
+		memcpy(&svr->lsq_uuid, &exp->exp_connection->c_remote_uuid,
+		       sizeof(svr->lsq_uuid));
+		++id;
+		svr->lsq_id = id;
+	} else {
+		/* Assume we have to move this one */
+		list_del(&svr->lsq_svr_list);
+	}
+
+	svr->lsq_tgt_count++;
+	ltd->ltd_qos.ltq_svr = svr;
+
+	CDEBUG(D_OTHER, "add tgt %s to server %s (%d targets)\n",
+	       obd_uuid2str(&ltd->ltd_uuid), obd_uuid2str(&svr->lsq_uuid),
+	       svr->lsq_tgt_count);
+
+	/*
+	 * Add sorted by # of tgts.  Find the first entry that we're
+	 * bigger than...
+	 */
+	list_for_each_entry(tempsvr, &qos->lq_svr_list, lsq_svr_list) {
+		if (svr->lsq_tgt_count > tempsvr->lsq_tgt_count)
+			break;
+	}
+	/*
+	 * ...and add before it.  If we're the first or smallest, tempsvr
+	 * points to the list head, and we add to the end.
+	 */
+	list_add_tail(&svr->lsq_svr_list, &tempsvr->lsq_svr_list);
+
+	qos->lq_dirty = 1;
+	qos->lq_rr.lqr_dirty = 1;
+
+out:
+	up_write(&qos->lq_rw_sem);
+	return rc;
+}
+EXPORT_SYMBOL(lqos_add_tgt);
+
+/**
+ * Remove MDT/OST target from QoS table.
+ *
+ * Remove the given MDT/OST target from the QoS table and release the
+ * related MDS/OSS structure if no targets remain on the MDS/OSS.
+ *
+ * @qos		lu_qos data
+ * @ltd		target description
+ *
+ * Return:	0 on success
+ *		-ENOENT	if no server was found
+ */
+int lqos_del_tgt(struct lu_qos *qos, struct lu_tgt_desc *ltd)
+{
+	struct lu_svr_qos *svr;
+	int rc = 0;
+
+	down_write(&qos->lq_rw_sem);
+	svr = ltd->ltd_qos.ltq_svr;
+	if (!svr) {
+		rc = -ENOENT;
+		goto out;
+	}
+
+	svr->lsq_tgt_count--;
+	if (svr->lsq_tgt_count == 0) {
+		CDEBUG(D_OTHER, "removing server %s\n",
+		       obd_uuid2str(&svr->lsq_uuid));
+		list_del(&svr->lsq_svr_list);
+		ltd->ltd_qos.ltq_svr = NULL;
+		kfree(svr);
+	}
+
+	qos->lq_dirty = 1;
+	qos->lq_rr.lqr_dirty = 1;
+out:
+	up_write(&qos->lq_rw_sem);
+	return rc;
+}
+EXPORT_SYMBOL(lqos_del_tgt);
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 86395b7..a26f3ae 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -1931,6 +1931,8 @@ struct mdt_rec_reint {
 	__u16		rr_padding_4; /* also fix lustre_swab_mdt_rec_reint */
 };
 
+#define LMV_DESC_QOS_MAXAGE_DEFAULT 60  /* Seconds */
+
 /* lmv structures */
 struct lmv_desc {
 	__u32 ld_tgt_count;		/* how many MDS's */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 358/622] lustre: llite: Add persistent cache on client
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (356 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 357/622] lustre: lmv: reuse object alloc QoS code from LOD James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 359/622] lustre: pcc: Non-blocking PCC caching James Simmons
                   ` (264 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Qian Yingjin <qian@ddn.com>

PCC is a new framework which provides a group of local caches
on the Lustre client side. PCC provides no global namespace:
each client uses its own local storage as a cache for itself,
and a local filesystem is used to manage the data in that
cache. Cached I/O is directed to the local filesystem while
normal I/O is directed to the OSTs.

PCC uses HSM for data synchronization: an HSM copytool is used
to restore files from the local cache to the Lustre OSTs, and
each PCC has a copytool instance running with a unique archive
number. Any remote access from another Lustre client triggers
the data synchronization. If a client with PCC goes offline,
its cached data becomes temporarily inaccessible to other
clients; after the PCC client reboots and the copytool
restarts, the data becomes accessible again.

ToDo:
1) Make PCC exclusive with HSM.
2) Strong size consistency for PCC-cached files among clients.
3) Support caching partial file content.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10092
Lustre-commit: f172b1168857 ("LU-10092 llite: Add persistent cache on client")
Signed-off-by: Li Xi <lixi@ddn.com>
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Qian Yingjin <qian@ddn.com>
Reviewed-on: https://review.whamcloud.com/32963
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-12438
Lustre-commit: b5a6ec93ce56 ("LU-12438 llite: vfs_read/write removed, use kernel_read/write")
Signed-off-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-on: https://review.whamcloud.com/35223
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Petros Koutoupis <pkoutoupis@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd.h                 |    2 +
 fs/lustre/llite/Makefile                |    2 +-
 fs/lustre/llite/dir.c                   |   74 +++
 fs/lustre/llite/file.c                  |  164 ++++-
 fs/lustre/llite/llite_internal.h        |   25 +
 fs/lustre/llite/llite_lib.c             |   45 +-
 fs/lustre/llite/llite_mmap.c            |    8 +
 fs/lustre/llite/lproc_llite.c           |   45 +-
 fs/lustre/llite/namei.c                 |   79 ++-
 fs/lustre/llite/pcc.c                   | 1042 +++++++++++++++++++++++++++++++
 fs/lustre/llite/pcc.h                   |  129 ++++
 fs/lustre/llite/super25.c               |   10 +
 fs/lustre/lmv/lmv_intent.c              |    6 +-
 fs/lustre/lmv/lmv_obd.c                 |    1 +
 fs/lustre/mdc/mdc_lib.c                 |    6 +
 include/uapi/linux/lustre/lustre_idl.h  |    8 +-
 include/uapi/linux/lustre/lustre_user.h |   50 +-
 17 files changed, 1654 insertions(+), 42 deletions(-)
 create mode 100644 fs/lustre/llite/pcc.c
 create mode 100644 fs/lustre/llite/pcc.h

diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index 2f878d6..f53c303 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -796,6 +796,8 @@ struct md_op_data {
 	bool			op_post_migrate;
 	/* used to access dir with bash hash */
 	u32			op_stripe_index;
+	/* Archive ID for PCC attach */
+	u32			op_archive_id;
 };
 
 struct md_callback {
diff --git a/fs/lustre/llite/Makefile b/fs/lustre/llite/Makefile
index 811b9ab..c88a1b0 100644
--- a/fs/lustre/llite/Makefile
+++ b/fs/lustre/llite/Makefile
@@ -7,6 +7,6 @@ lustre-y := dcache.o dir.o file.o llite_lib.o llite_nfs.o \
 	    xattr.o xattr_cache.o xattr_security.o \
 	    super25.o statahead.o glimpse.o lcommon_cl.o lcommon_misc.o \
 	    vvp_dev.o vvp_page.o vvp_io.o vvp_object.o \
-	    lproc_llite.o
+	    lproc_llite.o pcc.o
 
 lustre-$(CONFIG_LUSTRE_FS_POSIX_ACL) += acl.o
diff --git a/fs/lustre/llite/dir.c b/fs/lustre/llite/dir.c
index a1dce52..337582b 100644
--- a/fs/lustre/llite/dir.c
+++ b/fs/lustre/llite/dir.c
@@ -1917,6 +1917,80 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		return ll_ioctl_fsgetxattr(inode, cmd, arg);
 	case FS_IOC_FSSETXATTR:
 		return ll_ioctl_fssetxattr(inode, cmd, arg);
+	case LL_IOC_PCC_DETACH: {
+		struct lu_pcc_detach *detach;
+		struct lu_fid *fid;
+		struct inode *inode2;
+		unsigned long ino;
+
+		/*
+		 * The reason why a dir IOCTL is used to detach a PCC-cached
+		 * file rather than a file IOCTL is:
+		 * When PCC caches a file, it attaches the file first and
+		 * increases the refcount of the PCC inode
+		 * (pcci->pcci_refcount) from 0 to 1.
+		 * When detaching a PCC-cached file, it checks whether the
+		 * refcount is 1. If so, the file can be detached
+		 * successfully. Otherwise, some users still have the file
+		 * open and in use, and -EBUSY is returned.
+		 * Each open of the PCC-cached file increases the refcount
+		 * of the PCC inode;
+		 * each close of the PCC-cached file decreases the refcount
+		 * of the PCC inode.
+		 * If a file IOCTL were used to detach a PCC-cached file, the
+		 * file would need to be opened first, which increases the
+		 * refcount. So during the detach IOCTL, -EBUSY would be
+		 * returned as the PCC inode refcount is larger than 1.
+		 * Someone might argue that the IOCTL could just decrease the
+		 * refcount of the PCC inode, return success, and let the
+		 * close of the IOCTL file handle perform the real detach.
+		 * But this may result in an inconsistent state of the PCC
+		 * file, i.e. process A gets a successful return from the
+		 * detach IOCTL; process B opens the file before process A
+		 * finally closes the IOCTL file handle. The subsequent I/O
+		 * of process B is then directed into PCC although the file
+		 * was already detached from the view of process A.
+		 * Using a dir IOCTL avoids this problem.
+		 */
+		detach = kzalloc(sizeof(*detach), GFP_KERNEL);
+		if (!detach)
+			return -ENOMEM;
+
+		if (copy_from_user(detach,
+				   (const struct lu_pcc_detach __user *)arg,
+				   sizeof(*detach))) {
+			rc = -EFAULT;
+			goto out_detach;
+		}
+
+		fid = &detach->pccd_fid;
+		ino = cl_fid_build_ino(fid, ll_need_32bit_api(sbi));
+		inode2 = ilookup5(inode->i_sb, ino, ll_test_inode_by_fid, fid);
+		if (!inode2) {
+			/* Target inode is not in the inode cache and the PCC
+			 * file has already been released; return immediately.
+			 */
+			rc = 0;
+			goto out_detach;
+		}
+
+		if (!S_ISREG(inode2->i_mode)) {
+			rc = -EINVAL;
+			goto out_iput;
+		}
+
+		if (!inode_owner_or_capable(inode2)) {
+			rc = -EPERM;
+			goto out_iput;
+		}
+
+		rc = pcc_ioctl_detach(inode2);
+out_iput:
+		iput(inode2);
+out_detach:
+		kfree(detach);
+		return rc;
+	}
 	default:
 		return obd_iocontrol(cmd, sbi->ll_dt_exp, 0, NULL,
 				     (void __user *)arg);
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 88d5c2d..95e7c73 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -56,6 +56,11 @@ struct split_param {
 	u16		sp_mirror_id;
 };
 
+struct pcc_param {
+	u64	pa_data_version;
+	u32	pa_archive_id;
+};
+
 static int
 ll_put_grouplock(struct inode *inode, struct file *file, unsigned long arg);
 
@@ -70,6 +75,8 @@ static struct ll_file_data *ll_file_data_get(void)
 	if (!fd)
 		return NULL;
 	fd->fd_write_failed = false;
+	pcc_file_init(&fd->fd_pcc_file);
+
 	return fd;
 }
 
@@ -192,6 +199,17 @@ static int ll_close_inode_openhandle(struct inode *inode,
 		break;
 	}
 
+	case MDS_PCC_ATTACH: {
+		struct pcc_param *param = data;
+
+		LASSERT(data);
+		op_data->op_bias |= MDS_HSM_RELEASE | MDS_PCC_ATTACH;
+		op_data->op_archive_id = param->pa_archive_id;
+		op_data->op_data_version = param->pa_data_version;
+		op_data->op_lease_handle = och->och_lease_handle;
+		break;
+	}
+
 	case MDS_HSM_RELEASE:
 		LASSERT(data);
 		op_data->op_bias |= MDS_HSM_RELEASE;
@@ -378,6 +396,8 @@ int ll_file_release(struct inode *inode, struct file *file)
 		return 0;
 	}
 
+	pcc_file_release(inode, file);
+
 	if (!S_ISDIR(inode->i_mode)) {
 		if (lli->lli_clob)
 			lov_read_and_clear_async_rc(lli->lli_clob);
@@ -833,6 +853,10 @@ int ll_file_open(struct inode *inode, struct file *file)
 		if (rc)
 			goto out_och_free;
 	}
+	rc = pcc_file_open(inode, file);
+	if (rc)
+		goto out_och_free;
+
 	mutex_unlock(&lli->lli_och_mutex);
 	fd = NULL;
 
@@ -858,6 +882,7 @@ int ll_file_open(struct inode *inode, struct file *file)
 out_openerr:
 		if (lli->lli_opendir_key == fd)
 			ll_deauthorize_statahead(inode, fd);
+
 		if (fd)
 			ll_file_data_put(fd);
 	} else {
@@ -1632,6 +1657,22 @@ static ssize_t ll_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	ssize_t result;
 	u16 refcheck;
 	ssize_t rc2;
+	bool cached = false;
+
+	/*
+	 * Currently, when a PCC read fails, we do not fall back to the
+	 * normal read path, just return the error.
+	 * The reason is that, for RW-PCC, the file data may be modified
+	 * in the PCC and inconsistent with the data on the OSTs (or the
+	 * file data may have been removed from the Lustre file system);
+	 * in that case, falling back to the normal read path may read
+	 * the wrong data.
+	 * TODO: for RO-PCC (readonly PCC), fall back to the normal read
+	 * path: read data from the data copy on OSTs.
+	 */
+	result = pcc_file_read_iter(iocb, to, &cached);
+	if (cached)
+		return result;
 
 	ll_ras_enter(iocb->ki_filp);
 
@@ -1725,6 +1766,21 @@ static ssize_t ll_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	struct vvp_io_args *args;
 	ssize_t rc_tiny = 0, rc_normal;
 	u16 refcheck;
+	bool cached = false;
+	int result;
+
+	/*
+	 * When a PCC write fails, we do not fall back to the normal
+	 * write path, just return the error. The reason is that PCC
+	 * is actually an HSM device and HSM does not handle failures
+	 * well, especially -ENOSPC when space runs out. Moreover,
+	 * falling back to the normal I/O path on an ENOSPC failure
+	 * would require restoring the file data to the OSTs first and
+	 * redoing the write, making the PCC logic very complex.
+	 */
+	result = pcc_file_write_iter(iocb, from, &cached);
+	if (cached)
+		return result;
 
 	/* NB: we can't do direct IO for tiny writes because they use the page
 	 * cache, we can't do sync writes because tiny writes can't flush
@@ -2979,13 +3035,15 @@ static long ll_file_unlock_lease(struct file *file, struct ll_ioc_lease *ioc,
 	struct ll_inode_info *lli = ll_i2info(inode);
 	struct obd_client_handle *och = NULL;
 	struct split_param sp;
-	bool lease_broken;
+	struct pcc_param param;
+	bool lease_broken = false;
 	fmode_t fmode = 0;
 	enum mds_op_bias bias = 0;
 	struct file *layout_file = NULL;
 	void *data = NULL;
 	size_t data_size = 0;
-	long rc;
+	bool attached = false;
+	long rc, rc2 = 0;
 
 	mutex_lock(&lli->lli_och_mutex);
 	if (fd->fd_lease_och) {
@@ -2994,10 +3052,8 @@ static long ll_file_unlock_lease(struct file *file, struct ll_ioc_lease *ioc,
 	}
 	mutex_unlock(&lli->lli_och_mutex);
 
-	if (!och) {
-		rc = -ENOLCK;
-		goto out;
-	}
+	if (!och)
+		return -ENOLCK;
 
 	fmode = och->och_flags;
 
@@ -3005,19 +3061,19 @@ static long ll_file_unlock_lease(struct file *file, struct ll_ioc_lease *ioc,
 	case LL_LEASE_RESYNC_DONE:
 		if (ioc->lil_count > IOC_IDS_MAX) {
 			rc = -EINVAL;
-			goto out;
+			goto out_lease_close;
 		}
 
 		data_size = offsetof(typeof(*ioc), lil_ids[ioc->lil_count]);
 		data = kzalloc(data_size, GFP_KERNEL);
 		if (!data) {
 			rc = -ENOMEM;
-			goto out;
+			goto out_lease_close;
 		}
 
 		if (copy_from_user(data, (void __user *)arg, data_size)) {
 			rc = -EFAULT;
-			goto out;
+			goto out_lease_close;
 		}
 
 		bias = MDS_CLOSE_RESYNC_DONE;
@@ -3027,25 +3083,25 @@ static long ll_file_unlock_lease(struct file *file, struct ll_ioc_lease *ioc,
 
 		if (ioc->lil_count != 1) {
 			rc = -EINVAL;
-			goto out;
+			goto out_lease_close;
 		}
 
 		arg += sizeof(*ioc);
 		if (copy_from_user(&fd, (void __user *)arg, sizeof(u32))) {
 			rc = -EFAULT;
-			goto out;
+			goto out_lease_close;
 		}
 
 		layout_file = fget(fd);
 		if (!layout_file) {
 			rc = -EBADF;
-			goto out;
+			goto out_lease_close;
 		}
 
 		if ((file->f_flags & O_ACCMODE) == O_RDONLY ||
 		    (layout_file->f_flags & O_ACCMODE) == O_RDONLY) {
 			rc = -EPERM;
-			goto out;
+			goto out_lease_close;
 		}
 
 		data = file_inode(layout_file);
@@ -3058,26 +3114,26 @@ static long ll_file_unlock_lease(struct file *file, struct ll_ioc_lease *ioc,
 
 		if (ioc->lil_count != 2) {
 			rc = -EINVAL;
-			goto out;
+			goto out_lease_close;
 		}
 
 		arg += sizeof(*ioc);
 		if (copy_from_user(&fdv, (void __user *)arg, sizeof(u32))) {
 			rc = -EFAULT;
-			goto out;
+			goto out_lease_close;
 		}
 
 		arg += sizeof(u32);
 		if (copy_from_user(&mirror_id, (void __user *)arg,
 				   sizeof(u32))) {
 			rc = -EFAULT;
-			goto out;
+			goto out_lease_close;
 		}
 
 		layout_file = fget(fdv);
 		if (!layout_file) {
 			rc = -EBADF;
-			goto out;
+			goto out_lease_close;
 		}
 
 		sp.sp_inode = file_inode(layout_file);
@@ -3086,11 +3142,37 @@ static long ll_file_unlock_lease(struct file *file, struct ll_ioc_lease *ioc,
 		bias = MDS_CLOSE_LAYOUT_SPLIT;
 		break;
 	}
+	case LL_LEASE_PCC_ATTACH:
+		if (ioc->lil_count != 1)
+			return -EINVAL;
+
+		arg += sizeof(*ioc);
+		if (copy_from_user(&param.pa_archive_id, (void __user *)arg,
+				   sizeof(u32))) {
+			rc2 = -EFAULT;
+			goto out_lease_close;
+		}
+
+		rc2 = pcc_readwrite_attach(file, inode, param.pa_archive_id);
+		if (rc2)
+			goto out_lease_close;
+
+		attached = true;
+		/* Grab latest data version */
+		rc2 = ll_data_version(inode, &param.pa_data_version,
+				     LL_DV_WR_FLUSH);
+		if (rc2)
+			goto out_lease_close;
+
+		data = &param;
+		bias = MDS_PCC_ATTACH;
+		break;
 	default:
 		/* without close intent */
 		break;
 	}
 
+out_lease_close:
 	rc = ll_lease_close_intent(och, inode, &lease_broken, bias, data);
 	if (rc < 0)
 		goto out;
@@ -3112,6 +3194,12 @@ static long ll_file_unlock_lease(struct file *file, struct ll_ioc_lease *ioc,
 		if (layout_file)
 			fput(layout_file);
 		break;
+	case LL_LEASE_PCC_ATTACH:
+		if (!rc)
+			rc = rc2;
+		rc = pcc_readwrite_attach_fini(file, inode, lease_broken,
+					       rc, attached);
+		break;
 	}
 
 	if (!rc)
@@ -3633,6 +3721,33 @@ static int ll_heat_set(struct inode *inode, enum lu_heat_flag flags)
 		rc = ll_heat_set(inode, flags);
 		return rc;
 	}
+	case LL_IOC_PCC_STATE: {
+		struct lu_pcc_state __user *ustate =
+			(struct lu_pcc_state __user *)arg;
+		struct lu_pcc_state *state;
+
+		state = kzalloc(sizeof(*state), GFP_KERNEL);
+		if (!state)
+			return -ENOMEM;
+
+		if (copy_from_user(state, ustate, sizeof(*state))) {
+			rc = -EFAULT;
+			goto out_state;
+		}
+
+		rc = pcc_ioctl_state(inode, state);
+		if (rc)
+			goto out_state;
+
+		if (copy_to_user(ustate, state, sizeof(*state))) {
+			rc = -EFAULT;
+			goto out_state;
+		}
+
+out_state:
+		kfree(state);
+		return rc;
+	}
 	default:
 		return obd_iocontrol(cmd, ll_i2dtexp(inode), 0, NULL,
 				     (void __user *)arg);
@@ -3740,13 +3855,20 @@ int ll_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 {
 	struct inode *inode = file_inode(file);
 	struct ll_inode_info *lli = ll_i2info(inode);
+	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
 	struct ptlrpc_request *req;
+	struct file *pcc_file = fd->fd_pcc_file.pccf_file;
 	int rc, err;
 
 	CDEBUG(D_VFSTRACE, "VFS Op:inode=" DFID "(%p)\n",
 	       PFID(ll_inode2fid(inode)), inode);
 	ll_stats_ops_tally(ll_i2sbi(inode), LPROC_LL_FSYNC, 1);
 
+	/* pcc cache path */
+	if (pcc_file)
+		return file_inode(pcc_file)->i_fop->fsync(pcc_file,
+					start, end, datasync);
+
 	rc = file_write_and_wait_range(file, start, end);
 	inode_lock(inode);
 
@@ -4294,6 +4416,11 @@ int ll_getattr(const struct path *path, struct kstat *stat,
 		return rc;
 
 	if (S_ISREG(inode->i_mode)) {
+		bool cached = false;
+
+		rc = pcc_inode_getattr(inode, &cached);
+		if (cached && rc < 0)
+			return rc;
 		/* In case of restore, the MDT has the right size and has
 		 * already send it back without granting the layout lock,
 		 * inode is up-to-date so glimpse is useless.
@@ -4301,7 +4428,8 @@ int ll_getattr(const struct path *path, struct kstat *stat,
 		 * restore the MDT holds the layout lock so the glimpse will
 		 * block up to the end of restore (getattr will block)
 		 */
-		if (!test_bit(LLIF_FILE_RESTORING, &lli->lli_flags)) {
+		if (!cached && !test_bit(LLIF_FILE_RESTORING,
+					 &lli->lli_flags)) {
 			rc = ll_glimpse_size(inode);
 			if (rc < 0)
 				return rc;
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index 9e413c2..f2ea856 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -49,6 +49,7 @@
 #include <linux/posix_acl_xattr.h>
 #include "vvp_internal.h"
 #include "range_lock.h"
+#include "pcc.h"
 
 /** Only used on client-side for indicating the tail of dir hash/offset. */
 #define LL_DIR_END_OFF	  0x7fffffffffffffffULL
@@ -205,6 +206,9 @@ struct ll_inode_info {
 			 * accurate if the file is shared by different jobs.
 			 */
 			char				lli_jobid[LUSTRE_JOBID_SIZE];
+
+			struct mutex		 lli_pcc_lock;
+			struct pcc_inode	*lli_pcc_inode;
 		};
 	};
 
@@ -297,6 +301,11 @@ static inline struct ll_inode_info *ll_i2info(struct inode *inode)
 	return container_of(inode, struct ll_inode_info, lli_vfs_inode);
 }
 
+static inline struct pcc_inode *ll_i2pcci(struct inode *inode)
+{
+	return ll_i2info(inode)->lli_pcc_inode;
+}
+
 /* default to about 64M of readahead on a given system. */
 #define SBI_DEFAULT_READAHEAD_MAX		MiB_TO_PAGES(64UL)
 
@@ -552,6 +561,9 @@ struct ll_sb_info {
 
 	/* filesystem fsname */
 	char			ll_fsname[LUSTRE_MAXFSNAME + 1];
+
+	/* Persistent Client Cache */
+	struct pcc_super	ll_pcc_super;
 };
 
 #define SBI_DEFAULT_HEAT_DECAY_WEIGHT	((80 * 256 + 50) / 100)
@@ -672,6 +684,7 @@ struct ll_file_data {
 	 * layout version for verification to OST objects
 	 */
 	u32 fd_layout_version;
+	struct pcc_file fd_pcc_file;
 };
 
 void llite_tunables_unregister(void);
@@ -1355,6 +1368,18 @@ static inline void d_lustre_revalidate(struct dentry *dentry)
 	spin_unlock(&dentry->d_lock);
 }
 
+static inline dev_t ll_compat_encode_dev(dev_t dev)
+{
+	/* The compat_sys_*stat*() syscalls will fail unless the
+	 * device majors and minors are both less than 256. Note that
+	 * the value returned here will be passed through
+	 * old_encode_dev() in cp_compat_stat(). And so we are not
+	 * trying to return a valid compat (u16) device number, just
+	 * one that will pass the old_valid_dev() check.
+	 */
+	return MKDEV(MAJOR(dev) & 0xff, MINOR(dev) & 0xff);
+}
+
 int ll_layout_conf(struct inode *inode, const struct cl_object_conf *conf);
 int ll_layout_refresh(struct inode *inode, u32 *gen);
 int ll_layout_restore(struct inode *inode, loff_t start, u64 length);
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 0633cc5..d46bc99 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -128,6 +128,7 @@ static struct ll_sb_info *ll_init_sbi(void)
 	sbi->ll_squash.rsi_gid = 0;
 	INIT_LIST_HEAD(&sbi->ll_squash.rsi_nosquash_nids);
 	spin_lock_init(&sbi->ll_squash.rsi_lock);
+	pcc_super_init(&sbi->ll_pcc_super);
 
 	/* Per-filesystem file heat */
 	sbi->ll_heat_decay_weight = SBI_DEFAULT_HEAT_DECAY_WEIGHT;
@@ -139,13 +140,13 @@ static void ll_free_sbi(struct super_block *sb)
 {
 	struct ll_sb_info *sbi = ll_s2sbi(sb);
 
+	if (!list_empty(&sbi->ll_squash.rsi_nosquash_nids))
+		cfs_free_nidlist(&sbi->ll_squash.rsi_nosquash_nids);
 	if (sbi->ll_cache) {
-		if (!list_empty(&sbi->ll_squash.rsi_nosquash_nids))
-			cfs_free_nidlist(&sbi->ll_squash.rsi_nosquash_nids);
 		cl_cache_decref(sbi->ll_cache);
 		sbi->ll_cache = NULL;
 	}
-
+	pcc_super_fini(&sbi->ll_pcc_super);
 	kfree(sbi);
 }
 
@@ -215,7 +216,8 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 				   OBD_CONNECT2_LOCK_CONVERT |
 				   OBD_CONNECT2_ARCHIVE_ID_ARRAY |
 				   OBD_CONNECT2_LSOM |
-				   OBD_CONNECT2_ASYNC_DISCARD;
+				   OBD_CONNECT2_ASYNC_DISCARD |
+				   OBD_CONNECT2_PCC;
 
 	if (sbi->ll_flags & LL_SBI_LRU_RESIZE)
 		data->ocd_connect_flags |= OBD_CONNECT_LRU_RESIZE;
@@ -953,6 +955,8 @@ void ll_lli_init(struct ll_inode_info *lli)
 		spin_lock_init(&lli->lli_heat_lock);
 		obd_heat_clear(lli->lli_heat_instances, OBD_HEAT_COUNT);
 		lli->lli_heat_flags = 0;
+		mutex_init(&lli->lli_pcc_lock);
+		lli->lli_pcc_inode = NULL;
 	}
 	mutex_init(&lli->lli_layout_mutex);
 	memset(lli->lli_jobid, 0, sizeof(lli->lli_jobid));
@@ -1486,6 +1490,8 @@ void ll_clear_inode(struct inode *inode)
 		LASSERT(!lli->lli_opendir_key);
 		LASSERT(!lli->lli_sai);
 		LASSERT(lli->lli_opendir_pid == 0);
+	} else {
+		pcc_inode_free(inode);
 	}
 
 	md_null_inode(sbi->ll_md_exp, ll_inode2fid(inode));
@@ -1709,15 +1715,28 @@ int ll_setattr_raw(struct dentry *dentry, struct iattr *attr,
 	if (attr->ia_valid & (ATTR_SIZE | ATTR_ATIME | ATTR_ATIME_SET |
 			      ATTR_MTIME | ATTR_MTIME_SET | ATTR_CTIME) ||
 	    xvalid & OP_XVALID_CTIME_SET) {
-		/* For truncate and utimes sending attributes to OSTs, setting
-		 * mtime/atime to the past will be performed under PW [0:EOF]
-		 * extent lock (new_size:EOF for truncate).  It may seem
-		 * excessive to send mtime/atime updates to OSTs when not
-		 * setting times to past, but it is necessary due to possible
-		 * time de-synchronization between MDT inode and OST objects
-		 */
-		rc = cl_setattr_ost(ll_i2info(inode)->lli_clob,
-				    attr, xvalid, 0);
+		bool cached = false;
+
+		rc = pcc_inode_setattr(inode, attr, &cached);
+		if (cached) {
+			if (rc) {
+				CERROR("%s: PCC inode "DFID" setattr failed: rc = %d\n",
+				       ll_i2sbi(inode)->ll_fsname,
+				       PFID(&lli->lli_fid), rc);
+				goto out;
+			}
+		} else {
+			/* For truncate and utimes sending attributes to OSTs,
+			 * setting mtime/atime to the past will be performed
+			 * under PW [0:EOF] extent lock (new_size:EOF for
+			 * truncate). It may seem excessive to send mtime/atime
+			 * updates to OSTs when not setting times to past, but
+			 * it is necessary due to possible time
+			 * de-synchronization between MDT inode and OST objects
+			 */
+			rc = cl_setattr_ost(ll_i2info(inode)->lli_clob,
+					    attr, xvalid, 0);
+		}
 	}
 
 	/*
diff --git a/fs/lustre/llite/llite_mmap.c b/fs/lustre/llite/llite_mmap.c
index 37ce508..fc2331b 100644
--- a/fs/lustre/llite/llite_mmap.c
+++ b/fs/lustre/llite/llite_mmap.c
@@ -505,6 +505,14 @@ int ll_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct inode *inode = file_inode(file);
 	int rc;
+	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
+	struct file *pcc_file = fd->fd_pcc_file.pccf_file;
+
+	/* pcc cache path */
+	if (pcc_file) {
+		vma->vm_file = pcc_file;
+		return file_inode(pcc_file)->i_fop->mmap(pcc_file, vma);
+	}
 
 	if (ll_file_nolock(file))
 		return -EOPNOTSUPP;
diff --git a/fs/lustre/llite/lproc_llite.c b/fs/lustre/llite/lproc_llite.c
index 165d37f..8cb4983 100644
--- a/fs/lustre/llite/lproc_llite.c
+++ b/fs/lustre/llite/lproc_llite.c
@@ -1317,7 +1317,46 @@ static ssize_t ll_nosquash_nids_seq_write(struct file *file,
 
 LPROC_SEQ_FOPS(ll_nosquash_nids);
 
-static struct lprocfs_vars lprocfs_llite_obd_vars[] = {
+static int ll_pcc_seq_show(struct seq_file *m, void *v)
+{
+	struct super_block *sb = m->private;
+	struct ll_sb_info *sbi = ll_s2sbi(sb);
+
+	return pcc_super_dump(&sbi->ll_pcc_super, m);
+}
+
+static ssize_t ll_pcc_seq_write(struct file *file, const char __user *buffer,
+				size_t count, loff_t *off)
+{
+	struct seq_file *m = file->private_data;
+	struct super_block *sb = m->private;
+	struct ll_sb_info *sbi = ll_s2sbi(sb);
+	int rc;
+	char *kernbuf;
+
+	if (count >= LPROCFS_WR_PCC_MAX_CMD)
+		return -EINVAL;
+
+	if (!(exp_connect_flags2(sbi->ll_md_exp) & OBD_CONNECT2_PCC))
+		return -EOPNOTSUPP;
+
+	kernbuf = kzalloc(count + 1, GFP_KERNEL);
+	if (!kernbuf)
+		return -ENOMEM;
+
+	if (copy_from_user(kernbuf, buffer, count)) {
+		rc = -EFAULT;
+		goto out_free_kernbuff;
+	}
+
+	rc = pcc_cmd_handle(kernbuf, count, &sbi->ll_pcc_super);
+out_free_kernbuff:
+	kfree(kernbuf);
+	return rc ? rc : count;
+}
+LPROC_SEQ_FOPS(ll_pcc);
+
+struct lprocfs_vars lprocfs_llite_obd_vars[] = {
 	{ .name	=	"site",
 	  .fops	=	&ll_site_stats_fops			},
 	{ .name	=	"max_cached_mb",
@@ -1329,9 +1368,11 @@ static ssize_t ll_nosquash_nids_seq_write(struct file *file,
 	{ .name	=	"sbi_flags",
 	  .fops	=	&ll_sbi_flags_fops			},
 	{ .name =	"root_squash",
-	  .fops =       &ll_root_squash_fops			},
+	  .fops =	&ll_root_squash_fops			},
 	{ .name =	"nosquash_nids",
 	  .fops =	&ll_nosquash_nids_fops			},
+	{ .name =	"pcc",
+	  .fops =	&ll_pcc_fops,				},
 	{ NULL }
 };
 
diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index fb5caaf..4f39b2c 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -711,14 +711,21 @@ static int ll_lookup_it_finish(struct ptlrpc_request *request,
 	return rc;
 }
 
+struct pcc_create_attach {
+	struct pcc_dataset *pca_dataset;
+	struct dentry *pca_dentry;
+};
+
 static struct dentry *ll_lookup_it(struct inode *parent, struct dentry *dentry,
 				   struct lookup_intent *it, void **secctx,
-				   u32 *secctxlen)
+				   u32 *secctxlen,
+				   struct pcc_create_attach *pca)
 {
 	struct lookup_intent lookup_it = { .it_op = IT_LOOKUP };
 	struct dentry *save = dentry, *retval;
 	struct ptlrpc_request *req = NULL;
 	struct md_op_data *op_data = NULL;
+	struct lov_user_md *lum = NULL;
 	char secctx_name[XATTR_NAME_MAX + 1];
 	struct inode *inode;
 	u32 opc;
@@ -806,6 +813,42 @@ static struct dentry *ll_lookup_it(struct inode *parent, struct dentry *dentry,
 		}
 	}
 
+	if (pca && pca->pca_dataset) {
+		struct pcc_dataset *dataset = pca->pca_dataset;
+
+		lum = kzalloc(sizeof(*lum), GFP_NOFS);
+		if (!lum) {
+			retval = ERR_PTR(-ENOMEM);
+			goto out;
+		}
+
+		lum->lmm_magic = LOV_USER_MAGIC_V1;
+		lum->lmm_pattern = LOV_PATTERN_F_RELEASED | LOV_PATTERN_RAID0;
+		lum->lmm_stripe_size = 0;
+		lum->lmm_stripe_count = 0;
+		lum->lmm_stripe_offset = 0;
+
+		op_data->op_data = lum;
+		op_data->op_data_size = sizeof(*lum);
+		op_data->op_archive_id = dataset->pccd_id;
+
+		rc = obd_fid_alloc(NULL, ll_i2mdexp(parent), &op_data->op_fid2,
+				   op_data);
+		if (rc) {
+			retval = ERR_PTR(rc);
+			goto out;
+		}
+
+		rc = pcc_inode_create(dataset, &op_data->op_fid2,
+				      &pca->pca_dentry);
+		if (rc) {
+			retval = ERR_PTR(rc);
+			goto out;
+		}
+
+		it->it_flags |= MDS_OPEN_PCC;
+	}
+
 	rc = md_intent_lock(ll_i2mdexp(parent), op_data, it, &req,
 			    &ll_md_blocking_ast, 0);
 	/*
@@ -878,6 +921,8 @@ static struct dentry *ll_lookup_it(struct inode *parent, struct dentry *dentry,
 		ll_finish_md_op_data(op_data);
 	}
 
+	kfree(lum);
+
 	ptlrpc_req_finished(req);
 	return retval;
 }
@@ -903,7 +948,7 @@ static struct dentry *ll_lookup_nd(struct inode *parent, struct dentry *dentry,
 		itp = NULL;
 	else
 		itp = &it;
-	de = ll_lookup_it(parent, dentry, itp, NULL, NULL);
+	de = ll_lookup_it(parent, dentry, itp, NULL, NULL, NULL);
 
 	if (itp)
 		ll_intent_release(itp);
@@ -923,6 +968,9 @@ static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
 	void *secctx = NULL;
 	u32 secctxlen = 0;
 	struct dentry *de;
+	struct ll_sb_info *sbi;
+	struct pcc_create_attach pca = {NULL, NULL};
+	struct pcc_dataset *dataset = NULL;
 	int rc = 0;
 
 	CDEBUG(D_VFSTRACE,
@@ -952,14 +1000,24 @@ static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
 		return -ENOMEM;
 
 	it->it_op = IT_OPEN;
-	if (open_flags & O_CREAT)
+	if (open_flags & O_CREAT) {
 		it->it_op |= IT_CREAT;
+		sbi = ll_i2sbi(dir);
+		/* Volatile file is used for HSM restore, so do not use PCC */
+		if (!filename_is_volatile(dentry->d_name.name,
+					  dentry->d_name.len, NULL)) {
+			dataset = pcc_dataset_get(&sbi->ll_pcc_super,
+						  ll_i2info(dir)->lli_projid,
+						  0);
+			pca.pca_dataset = dataset;
+		}
+	}
 	it->it_create_mode = (mode & S_IALLUGO) | S_IFREG;
 	it->it_flags = (open_flags & ~O_ACCMODE) | OPEN_FMODE(open_flags);
 	it->it_flags &= ~MDS_OPEN_FL_INTERNAL;
 
 	/* Dentry added to dcache tree in ll_lookup_it */
-	de = ll_lookup_it(dir, dentry, it, &secctx, &secctxlen);
+	de = ll_lookup_it(dir, dentry, it, &secctx, &secctxlen, &pca);
 	if (IS_ERR(de))
 		rc = PTR_ERR(de);
 	else if (de)
@@ -976,9 +1034,20 @@ static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
 					dput(de);
 				goto out_release;
 			}
+			if (dataset && dentry->d_inode) {
+				rc = pcc_inode_create_fini(dataset,
+							   dentry->d_inode,
+							   pca.pca_dentry);
+				if (rc) {
+					if (de)
+						dput(de);
+					goto out_release;
+				}
+			}
 
 			file->f_mode |= FMODE_CREATED;
 		}
+
 		if (d_really_is_positive(dentry) &&
 		    it_disposition(it, DISP_OPEN_OPEN)) {
 			/* Open dentry. */
@@ -1003,6 +1072,8 @@ static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
 	}
 
 out_release:
+	if (dataset)
+		pcc_dataset_put(dataset);
 	ll_intent_release(it);
 	kfree(it);
 
diff --git a/fs/lustre/llite/pcc.c b/fs/lustre/llite/pcc.c
new file mode 100644
index 0000000..53e5cda
--- /dev/null
+++ b/fs/lustre/llite/pcc.c
@@ -0,0 +1,1042 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * GPL HEADER START
+ *
+ * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 only,
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License version 2 for more details (a copy is included
+ * in the LICENSE file that accompanied this code).
+ *
+ * You should have received a copy of the GNU General Public License
+ * version 2 along with this program; If not, see
+ * http://www.gnu.org/licenses/gpl-2.0.html
+ *
+ * GPL HEADER END
+ */
+/*
+ * Copyright (c) 2017, DDN Storage Corporation.
+ */
+/*
+ * Persistent Client Cache
+ *
+ * PCC is a new framework which provides a group of local caches on the
+ * Lustre client side. It works in two modes: RW-PCC enables a read-write
+ * cache on the local SSDs of a single client; RO-PCC provides a read-only
+ * cache on the local SSDs of multiple clients. Little overhead is visible
+ * to applications; network latencies and lock conflicts can be much reduced.
+ *
+ * For RW-PCC, no global namespace is provided. Each client uses its own
+ * local storage as a cache for itself. A local file system is used to
+ * manage the data on local caches. Cached I/O is directed to the local
+ * file system while normal I/O is directed to OSTs. RW-PCC uses HSM for
+ * data synchronization. It uses the HSM copytool to restore files from
+ * local caches to Lustre OSTs. Each PCC has a copytool instance running
+ * with a unique archive number. Any remote access from another Lustre
+ * client triggers data synchronization. If a client with RW-PCC goes
+ * offline, its cached data is temporarily inaccessible to other clients
+ * until the RW-PCC client reboots and the copytool restarts.
+ *
+ * Following is what will happen in different conditions for RW-PCC:
+ *
+ * > When file is being created on RW-PCC
+ *
+ * A normal HSM released file is created on MDT;
+ * An empty mirror file is created on local cache;
+ * The HSM status of the Lustre file will be set to archived and released;
+ * The archive number will be set to the proper value.
+ *
+ * > When file is being prefetched to RW-PCC
+ *
+ * A file is copied to the local cache;
+ * The HSM status of the Lustre file will be set to archived and released;
+ * The archive number will be set to the proper value.
+ *
+ * > When file is being accessed from PCC
+ *
+ * Data will be read directly from the local cache;
+ * Metadata will be read from the MDT, except the file size;
+ * The file size will be obtained from the local cache.
+ *
+ * > When PCC cached file is being accessed on another client
+ *
+ * RW-PCC cached files are automatically restored when a process on another
+ * client tries to read or modify them. The corresponding I/O will block
+ * waiting for the released file to be restored. This is transparent to the
+ * process.
+ *
+ * For RW-PCC, when a file is being created, a rule-based policy is used
+ * to determine whether it will be cached. Rule-based caching of newly
+ * created files determines which files can use a PCC cache directly,
+ * without any admission control.
+ *
+ * The RW-PCC design can accelerate I/O-intensive applications with
+ * one-to-one mappings between files and accessing clients. However, in
+ * several use cases, files are never updated but need to be read
+ * simultaneously from many clients. RO-PCC implements read-only caching
+ * on Lustre clients using SSDs. RO-PCC is based on the same framework as
+ * RW-PCC, except that no HSM mechanism is used.
+ *
+ * The main advantages of using this SSD cache on Lustre clients via PCC
+ * are:
+ * - The I/O stack becomes much simpler for the cached data, as there is no
+ *   interference with I/Os from other clients, which enables easier
+ *   performance optimizations;
+ * - The requirements on the HW inside the client nodes are small, any kind of
+ *   SSDs or even HDDs can be used as cache devices;
+ * - Caching reduces the pressure on the object storage targets (OSTs), as
+ *   small or random I/Os can be regularized to big sequential I/Os and
+ *   temporary files do not even need to be flushed to OSTs.
+ *
+ * PCC can accelerate applications with certain I/O patterns:
+ * - small-sized random writes (< 1MB) from a single client
+ * - repeated read of data that is larger than RAM
+ * - clients with high network latency
+ *
+ * Author: Li Xi <lixi@ddn.com>
+ * Author: Qian Yingjin <qian@ddn.com>
+ */
+
+#define DEBUG_SUBSYSTEM S_LLITE
+
+#include "pcc.h"
+#include <linux/namei.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include "llite_internal.h"
+
+struct kmem_cache *pcc_inode_slab;
+
+void pcc_super_init(struct pcc_super *super)
+{
+	spin_lock_init(&super->pccs_lock);
+	INIT_LIST_HEAD(&super->pccs_datasets);
+}
+
+/**
+ * pcc_dataset_add - Add a cache policy controlling which files are
+ * cached and where they are cached.
+ *
+ * @super: superblock of pcc
+ * @pathname: root path of pcc
+ * @archive_id: HSM archive ID
+ * @projid: files with the specified project ID will be cached
+ */
+static int
+pcc_dataset_add(struct pcc_super *super, const char *pathname,
+		u32 archive_id, u32 projid)
+{
+	int rc;
+	struct pcc_dataset *dataset;
+	struct pcc_dataset *tmp;
+	bool found = false;
+
+	dataset = kzalloc(sizeof(*dataset), GFP_NOFS);
+	if (!dataset)
+		return -ENOMEM;
+
+	rc = kern_path(pathname, LOOKUP_DIRECTORY, &dataset->pccd_path);
+	if (unlikely(rc)) {
+		kfree(dataset);
+		return rc;
+	}
+	strncpy(dataset->pccd_pathname, pathname, PATH_MAX);
+	dataset->pccd_id = archive_id;
+	dataset->pccd_projid = projid;
+	atomic_set(&dataset->pccd_refcount, 1);
+
+	spin_lock(&super->pccs_lock);
+	list_for_each_entry(tmp, &super->pccs_datasets, pccd_linkage) {
+		if (tmp->pccd_id == archive_id) {
+			found = true;
+			break;
+		}
+	}
+	if (!found)
+		list_add(&dataset->pccd_linkage, &super->pccs_datasets);
+	spin_unlock(&super->pccs_lock);
+
+	if (found) {
+		pcc_dataset_put(dataset);
+		rc = -EEXIST;
+	}
+
+	return rc;
+}
+
+struct pcc_dataset *
+pcc_dataset_get(struct pcc_super *super, u32 projid, u32 archive_id)
+{
+	struct pcc_dataset *dataset;
+	struct pcc_dataset *selected = NULL;
+
+	if (projid == 0 && archive_id == 0)
+		return NULL;
+
+	/*
+	 * The archive ID is unique in the list, but the projid may be
+	 * duplicated; the most recently added match is returned first.
+	 */
+	spin_lock(&super->pccs_lock);
+	list_for_each_entry(dataset, &super->pccs_datasets, pccd_linkage) {
+		if (projid && dataset->pccd_projid != projid)
+			continue;
+		if (archive_id && dataset->pccd_id != archive_id)
+			continue;
+		atomic_inc(&dataset->pccd_refcount);
+		selected = dataset;
+		break;
+	}
+	spin_unlock(&super->pccs_lock);
+	if (selected)
+		CDEBUG(D_CACHE, "matched projid %u, PCC create\n",
+		       selected->pccd_projid);
+	return selected;
+}
+
+void
+pcc_dataset_put(struct pcc_dataset *dataset)
+{
+	if (atomic_dec_and_test(&dataset->pccd_refcount)) {
+		path_put(&dataset->pccd_path);
+		kfree(dataset);
+	}
+}
+
+static int
+pcc_dataset_del(struct pcc_super *super, char *pathname)
+{
+	struct list_head *l, *tmp;
+	struct pcc_dataset *dataset;
+	int rc = -ENOENT;
+
+	spin_lock(&super->pccs_lock);
+	list_for_each_safe(l, tmp, &super->pccs_datasets) {
+		dataset = list_entry(l, struct pcc_dataset, pccd_linkage);
+		if (strcmp(dataset->pccd_pathname, pathname) == 0) {
+			list_del(&dataset->pccd_linkage);
+			pcc_dataset_put(dataset);
+			rc = 0;
+			break;
+		}
+	}
+	spin_unlock(&super->pccs_lock);
+	return rc;
+}
+
+static void
+pcc_dataset_dump(struct pcc_dataset *dataset, struct seq_file *m)
+{
+	seq_printf(m, "%s:\n", dataset->pccd_pathname);
+	seq_printf(m, "  rwid: %u\n", dataset->pccd_id);
+	seq_printf(m, "  autocache: projid=%u\n", dataset->pccd_projid);
+}
+
+int
+pcc_super_dump(struct pcc_super *super, struct seq_file *m)
+{
+	struct pcc_dataset *dataset;
+
+	spin_lock(&super->pccs_lock);
+	list_for_each_entry(dataset, &super->pccs_datasets, pccd_linkage) {
+		pcc_dataset_dump(dataset, m);
+	}
+	spin_unlock(&super->pccs_lock);
+	return 0;
+}
+
+void pcc_super_fini(struct pcc_super *super)
+{
+	struct pcc_dataset *dataset, *tmp;
+
+	list_for_each_entry_safe(dataset, tmp,
+				 &super->pccs_datasets, pccd_linkage) {
+		list_del(&dataset->pccd_linkage);
+		pcc_dataset_put(dataset);
+	}
+}
+
+static bool pathname_is_valid(const char *pathname)
+{
+	/* Needs to be absolute path */
+	if (!pathname || strlen(pathname) == 0 ||
+	    strlen(pathname) >= PATH_MAX || pathname[0] != '/')
+		return false;
+	return true;
+}
+
+static struct pcc_cmd *
+pcc_cmd_parse(char *buffer, unsigned long count)
+{
+	struct pcc_cmd *cmd;
+	char *token;
+	char *val;
+	unsigned long tmp;
+	int rc = 0;
+
+	cmd = kzalloc(sizeof(*cmd), GFP_KERNEL);
+	if (!cmd) {
+		rc = -ENOMEM;
+		goto out;
+	}
+
+	/* clear all setting */
+	if (strncmp(buffer, "clear", 5) == 0) {
+		cmd->pccc_cmd = PCC_CLEAR_ALL;
+		rc = 0;
+		goto out;
+	}
+
+	val = buffer;
+	token = strsep(&val, " ");
+	if (!val || strlen(val) == 0) {
+		rc = -EINVAL;
+		goto out_free_cmd;
+	}
+
+	/* Type of the command */
+	if (strcmp(token, "add") == 0) {
+		cmd->pccc_cmd = PCC_ADD_DATASET;
+	} else if (strcmp(token, "del") == 0) {
+		cmd->pccc_cmd = PCC_DEL_DATASET;
+	} else {
+		rc = -EINVAL;
+		goto out_free_cmd;
+	}
+
+	/* Pathname of the dataset */
+	token = strsep(&val, " ");
+	if ((!val && cmd->pccc_cmd != PCC_DEL_DATASET) ||
+	    !pathname_is_valid(token)) {
+		rc = -EINVAL;
+		goto out_free_cmd;
+	}
+	cmd->pccc_pathname = token;
+
+	if (cmd->pccc_cmd == PCC_ADD_DATASET) {
+		/* archive ID */
+		token = strsep(&val, " ");
+		if (!val) {
+			rc = -EINVAL;
+			goto out_free_cmd;
+		}
+
+		rc = kstrtoul(token, 10, &tmp);
+		if (rc != 0) {
+			rc = -EINVAL;
+			goto out_free_cmd;
+		}
+		if (tmp == 0) {
+			rc = -EINVAL;
+			goto out_free_cmd;
+		}
+		cmd->u.pccc_add.pccc_id = tmp;
+
+		token = val;
+		rc = kstrtoul(token, 10, &tmp);
+		if (rc != 0) {
+			rc = -EINVAL;
+			goto out_free_cmd;
+		}
+		if (tmp == 0) {
+			rc = -EINVAL;
+			goto out_free_cmd;
+		}
+		cmd->u.pccc_add.pccc_projid = tmp;
+	}
+
+	goto out;
+out_free_cmd:
+	kfree(cmd);
+out:
+	if (rc)
+		cmd = ERR_PTR(rc);
+	return cmd;
+}
+
+int pcc_cmd_handle(char *buffer, unsigned long count,
+		   struct pcc_super *super)
+{
+	int rc = 0;
+	struct pcc_cmd *cmd;
+
+	cmd = pcc_cmd_parse(buffer, count);
+	if (IS_ERR(cmd))
+		return PTR_ERR(cmd);
+
+	switch (cmd->pccc_cmd) {
+	case PCC_ADD_DATASET:
+		rc = pcc_dataset_add(super, cmd->pccc_pathname,
+				      cmd->u.pccc_add.pccc_id,
+				      cmd->u.pccc_add.pccc_projid);
+		break;
+	case PCC_DEL_DATASET:
+		rc = pcc_dataset_del(super, cmd->pccc_pathname);
+		break;
+	case PCC_CLEAR_ALL:
+		pcc_super_fini(super);
+		break;
+	default:
+		rc = -EINVAL;
+		break;
+	}
+
+	kfree(cmd);
+	return rc;
+}
+
+static inline void pcc_inode_lock(struct inode *inode)
+{
+	mutex_lock(&ll_i2info(inode)->lli_pcc_lock);
+}
+
+static inline void pcc_inode_unlock(struct inode *inode)
+{
+	mutex_unlock(&ll_i2info(inode)->lli_pcc_lock);
+}
+
+static void pcc_inode_init(struct pcc_inode *pcci)
+{
+	atomic_set(&pcci->pcci_refcount, 0);
+	pcci->pcci_type = LU_PCC_NONE;
+}
+
+static void pcc_inode_fini(struct pcc_inode *pcci)
+{
+	path_put(&pcci->pcci_path);
+	pcci->pcci_type = LU_PCC_NONE;
+	kmem_cache_free(pcc_inode_slab, pcci);
+}
+
+static void pcc_inode_get(struct pcc_inode *pcci)
+{
+	atomic_inc(&pcci->pcci_refcount);
+}
+
+static void pcc_inode_put(struct pcc_inode *pcci)
+{
+	if (atomic_dec_and_test(&pcci->pcci_refcount))
+		pcc_inode_fini(pcci);
+}
+
+void pcc_inode_free(struct inode *inode)
+{
+	struct ll_inode_info *lli = ll_i2info(inode);
+	struct pcc_inode *pcci = lli->lli_pcc_inode;
+
+	if (pcci) {
+		WARN_ON(atomic_read(&pcci->pcci_refcount) > 1);
+		pcc_inode_put(pcci);
+		lli->lli_pcc_inode = NULL;
+	}
+}
+
+/*
+ * TODO:
+ * As Andreas suggested, we'd better use new layout to
+ * reduce overhead:
+ * (fid->f_oid >> 16 & 0xFFFF)/FID
+ */
+#define MAX_PCC_DATABASE_PATH (6 * 5 + FID_NOBRACE_LEN + 1)
+static int pcc_fid2dataset_path(char *buf, int sz, struct lu_fid *fid)
+{
+	return snprintf(buf, sz, "%04x/%04x/%04x/%04x/%04x/%04x/"
+			DFID_NOBRACE,
+			(fid)->f_oid       & 0xFFFF,
+			(fid)->f_oid >> 16 & 0xFFFF,
+			(unsigned int)((fid)->f_seq       & 0xFFFF),
+			(unsigned int)((fid)->f_seq >> 16 & 0xFFFF),
+			(unsigned int)((fid)->f_seq >> 32 & 0xFFFF),
+			(unsigned int)((fid)->f_seq >> 48 & 0xFFFF),
+			PFID(fid));
+}
+
+void pcc_file_init(struct pcc_file *pccf)
+{
+	pccf->pccf_file = NULL;
+	pccf->pccf_type = LU_PCC_NONE;
+}
+
+int pcc_file_open(struct inode *inode, struct file *file)
+{
+	struct pcc_inode *pcci;
+	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
+	struct pcc_file *pccf = &fd->fd_pcc_file;
+	struct file *pcc_file;
+	struct path *path;
+	struct qstr *dname;
+	int rc = 0;
+
+	if (!S_ISREG(inode->i_mode))
+		return 0;
+
+	pcc_inode_lock(inode);
+	pcci = ll_i2pcci(inode);
+	if (!pcci)
+		goto out_unlock;
+
+	if (atomic_read(&pcci->pcci_refcount) == 0)
+		goto out_unlock;
+
+	pcc_inode_get(pcci);
+	WARN_ON(pccf->pccf_file);
+
+	path = &pcci->pcci_path;
+	dname = &path->dentry->d_name;
+	CDEBUG(D_CACHE, "opening pcc file '%.*s'\n", dname->len,
+	       dname->name);
+	pcc_file = dentry_open(path, file->f_flags, current_cred());
+	if (IS_ERR_OR_NULL(pcc_file)) {
+		rc = pcc_file ? PTR_ERR(pcc_file) : -EINVAL;
+		pcc_inode_put(pcci);
+	} else {
+		pccf->pccf_file = pcc_file;
+		pccf->pccf_type = pcci->pcci_type;
+	}
+
+out_unlock:
+	pcc_inode_unlock(inode);
+	return rc;
+}
+
+void pcc_file_release(struct inode *inode, struct file *file)
+{
+	struct pcc_inode *pcci;
+	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
+	struct pcc_file *pccf;
+	struct path *path;
+	struct qstr *dname;
+
+	if (!S_ISREG(inode->i_mode) || !fd)
+		return;
+
+	pccf = &fd->fd_pcc_file;
+	pcc_inode_lock(inode);
+	if (!pccf->pccf_file)
+		goto out;
+
+	pcci = ll_i2pcci(inode);
+	LASSERT(pcci);
+	path = &pcci->pcci_path;
+	dname = &path->dentry->d_name;
+	CDEBUG(D_CACHE, "releasing pcc file \"%.*s\"\n", dname->len,
+	       dname->name);
+	pcc_inode_put(pcci);
+	fput(pccf->pccf_file);
+	pccf->pccf_file = NULL;
+out:
+	pcc_inode_unlock(inode);
+}
+
+ssize_t pcc_file_read_iter(struct kiocb *iocb,
+			   struct iov_iter *iter, bool *cached)
+{
+	struct file *file = iocb->ki_filp;
+	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
+	struct pcc_file *pccf = &fd->fd_pcc_file;
+	ssize_t result;
+
+	if (!pccf->pccf_file) {
+		*cached = false;
+		return 0;
+	}
+	*cached = true;
+	iocb->ki_filp = pccf->pccf_file;
+
+	result = generic_file_read_iter(iocb, iter);
+	iocb->ki_filp = file;
+
+	return result;
+}
+
+ssize_t pcc_file_write_iter(struct kiocb *iocb,
+			    struct iov_iter *iter, bool *cached)
+{
+	struct file *file = iocb->ki_filp;
+	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
+	struct pcc_file *pccf = &fd->fd_pcc_file;
+	ssize_t result;
+
+	if (!pccf->pccf_file) {
+		*cached = false;
+		return 0;
+	}
+	*cached = true;
+
+	if (pccf->pccf_type != LU_PCC_READWRITE)
+		return -EWOULDBLOCK;
+
+	iocb->ki_filp = pccf->pccf_file;
+
+	/* Since file->f_op->write_iter makes write calls via
+	 * the normal vfs interface to the local PCC file system,
+	 * the inode lock is not needed.
+	 */
+	result = file->f_op->write_iter(iocb, iter);
+	iocb->ki_filp = file;
+	return result;
+}
+
+int pcc_inode_setattr(struct inode *inode, struct iattr *attr,
+		      bool *cached)
+{
+	int rc = 0;
+	struct pcc_inode *pcci;
+	struct iattr attr2 = *attr;
+	struct dentry *pcc_dentry;
+
+	if (!S_ISREG(inode->i_mode)) {
+		*cached = false;
+		return 0;
+	}
+
+	pcc_inode_lock(inode);
+	pcci = ll_i2pcci(inode);
+	if (!pcci || atomic_read(&pcci->pcci_refcount) == 0)
+		goto out_unlock;
+
+	*cached = true;
+	attr2.ia_valid = attr->ia_valid & (ATTR_SIZE | ATTR_ATIME |
+			 ATTR_ATIME_SET | ATTR_MTIME | ATTR_MTIME_SET |
+			 ATTR_CTIME);
+	pcc_dentry = pcci->pcci_path.dentry;
+	inode_lock(pcc_dentry->d_inode);
+	rc = pcc_dentry->d_inode->i_op->setattr(pcc_dentry, &attr2);
+	inode_unlock(pcc_dentry->d_inode);
+out_unlock:
+	pcc_inode_unlock(inode);
+	return rc;
+}
+
+int pcc_inode_getattr(struct inode *inode, bool *cached)
+{
+	struct ll_inode_info *lli = ll_i2info(inode);
+	struct pcc_inode *pcci;
+	struct kstat stat;
+	s64 atime;
+	s64 mtime;
+	s64 ctime;
+	int rc = 0;
+
+	if (!S_ISREG(inode->i_mode)) {
+		*cached = false;
+		return 0;
+	}
+
+	pcc_inode_lock(inode);
+	pcci = ll_i2pcci(inode);
+	if (!pcci || atomic_read(&pcci->pcci_refcount) == 0)
+		goto out_unlock;
+
+	*cached = true;
+	rc = vfs_getattr(&pcci->pcci_path, &stat,
+			 STATX_BASIC_STATS, AT_STATX_SYNC_AS_STAT);
+	if (rc)
+		goto out_unlock;
+
+	ll_inode_size_lock(inode);
+	if (test_and_clear_bit(LLIF_UPDATE_ATIME, &lli->lli_flags) ||
+	    inode->i_atime.tv_sec < lli->lli_atime)
+		inode->i_atime.tv_sec = lli->lli_atime;
+
+	inode->i_mtime.tv_sec = lli->lli_mtime;
+	inode->i_ctime.tv_sec = lli->lli_ctime;
+
+	atime = inode->i_atime.tv_sec;
+	mtime = inode->i_mtime.tv_sec;
+	ctime = inode->i_ctime.tv_sec;
+
+	if (atime < stat.atime.tv_sec)
+		atime = stat.atime.tv_sec;
+
+	if (ctime < stat.ctime.tv_sec)
+		ctime = stat.ctime.tv_sec;
+
+	if (mtime < stat.mtime.tv_sec)
+		mtime = stat.mtime.tv_sec;
+
+	i_size_write(inode, stat.size);
+	inode->i_blocks = stat.blocks;
+
+	inode->i_atime.tv_sec = atime;
+	inode->i_mtime.tv_sec = mtime;
+	inode->i_ctime.tv_sec = ctime;
+
+	ll_inode_size_unlock(inode);
+
+out_unlock:
+	pcc_inode_unlock(inode);
+	return rc;
+}
+
+/* Create a directory under base if it does not already exist */
+static struct dentry *
+pcc_mkdir(struct dentry *base, const char *name, umode_t mode)
+{
+	int rc;
+	struct dentry *dentry;
+	struct inode *dir = base->d_inode;
+
+	inode_lock(dir);
+	dentry = lookup_one_len(name, base, strlen(name));
+	if (IS_ERR(dentry))
+		goto out;
+
+	if (d_is_positive(dentry))
+		goto out;
+
+	rc = vfs_mkdir(dir, dentry, mode);
+	if (rc) {
+		dput(dentry);
+		dentry = ERR_PTR(rc);
+		goto out;
+	}
+out:
+	inode_unlock(dir);
+	return dentry;
+}
+
+static struct dentry *
+pcc_mkdir_p(struct dentry *root, char *path, umode_t mode)
+{
+	char *ptr, *entry_name;
+	struct dentry *parent;
+	struct dentry *child = ERR_PTR(-EINVAL);
+
+	ptr = path;
+	while (*ptr == '/')
+		ptr++;
+
+	entry_name = ptr;
+	parent = dget(root);
+	while ((ptr = strchr(ptr, '/')) != NULL) {
+		*ptr = '\0';
+		child = pcc_mkdir(parent, entry_name, mode);
+		*ptr = '/';
+		if (IS_ERR(child))
+			break;
+		dput(parent);
+		parent = child;
+		ptr++;
+		entry_name = ptr;
+	}
+
+	return child;
+}
+
+/* Create a file under base; if the file already exists, return its dentry */
+static struct dentry *
+pcc_create(struct dentry *base, const char *name, umode_t mode)
+{
+	int rc;
+	struct dentry *dentry;
+	struct inode *dir = base->d_inode;
+
+	inode_lock(dir);
+	dentry = lookup_one_len(name, base, strlen(name));
+	if (IS_ERR(dentry))
+		goto out;
+
+	if (d_is_positive(dentry))
+		goto out;
+
+	rc = vfs_create(dir, dentry, mode, false);
+	if (rc) {
+		dput(dentry);
+		dentry = ERR_PTR(rc);
+		goto out;
+	}
+out:
+	inode_unlock(dir);
+	return dentry;
+}
+
+/* Must be called with pcc_inode_lock() held on the corresponding inode */
+static void pcc_inode_attach_init(struct pcc_dataset *dataset,
+				  struct pcc_inode *pcci,
+				  struct dentry *dentry,
+				  enum lu_pcc_type type)
+{
+	pcci->pcci_path.mnt = mntget(dataset->pccd_path.mnt);
+	pcci->pcci_path.dentry = dentry;
+	LASSERT(atomic_read(&pcci->pcci_refcount) == 0);
+	atomic_set(&pcci->pcci_refcount, 1);
+	pcci->pcci_type = type;
+	pcci->pcci_attr_valid = false;
+}
+
+static int __pcc_inode_create(struct pcc_dataset *dataset,
+			      struct lu_fid *fid,
+			      struct dentry **dentry)
+{
+	char *path;
+	struct dentry *base;
+	struct dentry *child;
+	int rc = 0;
+
+	path = kzalloc(MAX_PCC_DATABASE_PATH, GFP_NOFS);
+	if (!path)
+		return -ENOMEM;
+
+	pcc_fid2dataset_path(path, MAX_PCC_DATABASE_PATH, fid);
+
+	base = pcc_mkdir_p(dataset->pccd_path.dentry, path, 0700);
+	if (IS_ERR(base)) {
+		rc = PTR_ERR(base);
+		goto out;
+	}
+
+	snprintf(path, MAX_PCC_DATABASE_PATH, DFID_NOBRACE, PFID(fid));
+	child = pcc_create(base, path, 0600);
+	if (IS_ERR(child)) {
+		rc = PTR_ERR(child);
+		goto out_base;
+	}
+	*dentry = child;
+
+out_base:
+	dput(base);
+out:
+	kfree(path);
+	return rc;
+}
+
+int pcc_inode_create(struct pcc_dataset *dataset, struct lu_fid *fid,
+		     struct dentry **pcc_dentry)
+{
+	return __pcc_inode_create(dataset, fid, pcc_dentry);
+}
+
+int pcc_inode_create_fini(struct pcc_dataset *dataset, struct inode *inode,
+			  struct dentry *pcc_dentry)
+{
+	struct ll_inode_info *lli = ll_i2info(inode);
+	struct pcc_inode *pcci;
+
+	LASSERT(!ll_i2pcci(inode));
+	pcci = kmem_cache_zalloc(pcc_inode_slab, GFP_NOFS);
+	if (!pcci)
+		return -ENOMEM;
+
+	pcc_inode_init(pcci);
+	pcc_inode_lock(inode);
+	pcc_inode_attach_init(dataset, pcci, pcc_dentry, LU_PCC_READWRITE);
+	lli->lli_pcc_inode = pcci;
+	pcc_inode_unlock(inode);
+
+	return 0;
+}
+
+static int pcc_filp_write(struct file *filp, const void *buf, ssize_t count,
+			  loff_t *offset)
+{
+	while (count > 0) {
+		ssize_t size;
+
+		size = kernel_write(filp, buf, count, offset);
+		if (size < 0)
+			return size;
+		count -= size;
+		buf += size;
+	}
+	return 0;
+}
+
+static int pcc_copy_data(struct file *src, struct file *dst)
+{
+	int rc = 0;
+	ssize_t rc2;
+	loff_t pos, offset = 0;
+	size_t buf_len = 1048576;
+	void *buf;
+
+	buf = kvzalloc(buf_len, GFP_NOFS);
+	if (!buf)
+		return -ENOMEM;
+
+	while (1) {
+		pos = offset;
+		rc2 = kernel_read(src, buf, buf_len, &pos);
+		if (rc2 < 0) {
+			rc = rc2;
+			goto out_free;
+		} else if (rc2 == 0)
+			break;
+
+		pos = offset;
+		rc = pcc_filp_write(dst, buf, rc2, &pos);
+		if (rc < 0)
+			goto out_free;
+		offset += rc2;
+	}
+
+out_free:
+	kvfree(buf);
+	return rc;
+}
+
+int pcc_readwrite_attach(struct file *file, struct inode *inode,
+			 u32 archive_id)
+{
+	struct pcc_dataset *dataset;
+	struct ll_inode_info *lli = ll_i2info(inode);
+	struct pcc_inode *pcci;
+	struct dentry *dentry;
+	struct file *pcc_filp;
+	struct path path;
+	int rc;
+
+	pcc_inode_lock(inode);
+	pcci = ll_i2pcci(inode);
+	if (!pcci) {
+		pcci = kmem_cache_zalloc(pcc_inode_slab, GFP_NOFS);
+		if (!pcci) {
+			pcc_inode_unlock(inode);
+			return -ENOMEM;
+		}
+
+		pcc_inode_init(pcci);
+	} else if (atomic_read(&pcci->pcci_refcount) > 0) {
+		pcc_inode_unlock(inode);
+		return -EEXIST;
+	}
+	pcc_inode_unlock(inode);
+
+	dataset = pcc_dataset_get(&ll_i2sbi(inode)->ll_pcc_super, 0,
+				  archive_id);
+	if (!dataset) {
+		rc = -ENOENT;
+		goto out_free_pcci;
+	}
+
+	rc = __pcc_inode_create(dataset, &lli->lli_fid, &dentry);
+	if (rc)
+		goto out_dataset_put;
+
+	path.mnt = dataset->pccd_path.mnt;
+	path.dentry = dentry;
+	pcc_filp = dentry_open(&path, O_TRUNC | O_WRONLY | O_LARGEFILE,
+			       current_cred());
+	if (IS_ERR_OR_NULL(pcc_filp)) {
+		rc = pcc_filp ? PTR_ERR(pcc_filp) : -EINVAL;
+		goto out_dentry;
+	}
+
+	rc = pcc_copy_data(file, pcc_filp);
+	if (rc)
+		goto out_fput;
+
+	pcc_inode_lock(inode);
+	if (lli->lli_pcc_inode) {
+		rc = -EEXIST;
+		goto out_unlock;
+	}
+	pcc_inode_attach_init(dataset, pcci, dentry, LU_PCC_READWRITE);
+	lli->lli_pcc_inode = pcci;
+out_unlock:
+	pcc_inode_unlock(inode);
+out_fput:
+	fput(pcc_filp);
+out_dentry:
+	if (rc)
+		dput(dentry);
+out_dataset_put:
+	pcc_dataset_put(dataset);
+out_free_pcci:
+	if (rc)
+		kmem_cache_free(pcc_inode_slab, pcci);
+	return rc;
+}
+
+int pcc_readwrite_attach_fini(struct file *file, struct inode *inode,
+			      bool lease_broken, int rc, bool attached)
+{
+	struct pcc_inode *pcci = ll_i2pcci(inode);
+
+	if ((rc || lease_broken) && attached && pcci)
+		pcc_inode_put(pcci);
+
+	return rc;
+}
+
+int pcc_ioctl_detach(struct inode *inode)
+{
+	struct ll_inode_info *lli = ll_i2info(inode);
+	struct pcc_inode *pcci = lli->lli_pcc_inode;
+	int rc = 0;
+	int count;
+
+	pcc_inode_lock(inode);
+	if (!pcci)
+		goto out_unlock;
+
+	count = atomic_read(&pcci->pcci_refcount);
+	if (count > 1) {
+		rc = -EBUSY;
+		goto out_unlock;
+	} else if (count == 0)
+		goto out_unlock;
+
+	pcc_inode_put(pcci);
+	lli->lli_pcc_inode = NULL;
+out_unlock:
+	pcc_inode_unlock(inode);
+
+	return rc;
+}
+
+int pcc_ioctl_state(struct inode *inode, struct lu_pcc_state *state)
+{
+	int rc = 0;
+	int count;
+	char *buf;
+	char *path;
+	int buf_len = sizeof(state->pccs_path);
+	struct pcc_inode *pcci;
+
+	if (buf_len <= 0)
+		return -EINVAL;
+
+	buf = kzalloc(buf_len, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	pcc_inode_lock(inode);
+	pcci = ll_i2pcci(inode);
+	if (!pcci) {
+		state->pccs_type = LU_PCC_NONE;
+		goto out_unlock;
+	}
+
+	count = atomic_read(&pcci->pcci_refcount);
+	if (count == 0) {
+		state->pccs_type = LU_PCC_NONE;
+		goto out_unlock;
+	}
+	state->pccs_type = pcci->pcci_type;
+	state->pccs_open_count = count - 1;
+	state->pccs_flags = pcci->pcci_attr_valid ?
+			    PCC_STATE_FLAG_ATTR_VALID : 0;
+	path = dentry_path_raw(pcci->pcci_path.dentry, buf, buf_len);
+	if (IS_ERR(path)) {
+		rc = PTR_ERR(path);
+		goto out_unlock;
+	}
+
+	if (strlcpy(state->pccs_path, path, buf_len) >= buf_len) {
+		rc = -ENAMETOOLONG;
+		goto out_unlock;
+	}
+
+out_unlock:
+	pcc_inode_unlock(inode);
+	kfree(buf);
+	return rc;
+}
diff --git a/fs/lustre/llite/pcc.h b/fs/lustre/llite/pcc.h
new file mode 100644
index 0000000..0f960b9
--- /dev/null
+++ b/fs/lustre/llite/pcc.h
@@ -0,0 +1,129 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * GPL HEADER START
+ *
+ * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 only,
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License version 2 for more details (a copy is included
+ * in the LICENSE file that accompanied this code).
+ *
+ * You should have received a copy of the GNU General Public License
+ * version 2 along with this program; If not, see
+ * http://www.gnu.org/licenses/gpl-2.0.html
+ *
+ * GPL HEADER END
+ */
+/*
+ * Copyright (c) 2017, DDN Storage Corporation.
+ */
+/*
+ *
+ * Persistent Client Cache
+ *
+ * Author: Li Xi <lixi@ddn.com>
+ */
+
+#ifndef LLITE_PCC_H
+#define LLITE_PCC_H
+
+#include <linux/types.h>
+#include <linux/fs.h>
+#include <linux/seq_file.h>
+#include <uapi/linux/lustre/lustre_user.h>
+
+extern struct kmem_cache *pcc_inode_slab;
+
+#define LPROCFS_WR_PCC_MAX_CMD 4096
+
+struct pcc_dataset {
+	u32			pccd_id;	 /* Archive ID */
+	u32			pccd_projid;	 /* Project ID */
+	char			pccd_pathname[PATH_MAX]; /* full path */
+	struct path		pccd_path;	 /* Root path */
+	struct list_head	pccd_linkage;  /* Linked to pccs_datasets */
+	atomic_t		pccd_refcount; /* reference count */
+};
+
+struct pcc_super {
+	spinlock_t		pccs_lock;	/* Protect pccs_datasets */
+	struct list_head	pccs_datasets;	/* List of datasets */
+};
+
+struct pcc_inode {
+	/* Cache path on local file system */
+	struct path			 pcci_path;
+	/*
+	 * If the reference count is 0, the cache is not initialized; if it
+	 * is 1, the cache is initialized but no one is using it.
+	 */
+	atomic_t			 pcci_refcount;
+	/* Whether readonly or readwrite PCC */
+	enum lu_pcc_type		 pcci_type;
+	/* Whether the inode is cached locally */
+	bool				 pcci_attr_valid;
+};
+
+struct pcc_file {
+	/* Opened cache file */
+	struct file		*pccf_file;
+	/* Whether readonly or readwrite PCC */
+	enum lu_pcc_type	 pccf_type;
+};
+
+enum pcc_cmd_type {
+	PCC_ADD_DATASET = 0,
+	PCC_DEL_DATASET,
+	PCC_CLEAR_ALL,
+};
+
+struct pcc_cmd {
+	enum pcc_cmd_type			 pccc_cmd;
+	char					*pccc_pathname;
+	union {
+		struct pcc_cmd_add {
+			u32			 pccc_id;
+			u32			 pccc_projid;
+		} pccc_add;
+		struct pcc_cmd_del {
+			u32			 pccc_pad;
+		} pccc_del;
+	} u;
+};
+
+void pcc_super_init(struct pcc_super *super);
+void pcc_super_fini(struct pcc_super *super);
+int pcc_cmd_handle(char *buffer, unsigned long count,
+		   struct pcc_super *super);
+int
+pcc_super_dump(struct pcc_super *super, struct seq_file *m);
+int pcc_readwrite_attach(struct file *file,
+			 struct inode *inode, u32 arch_id);
+int pcc_readwrite_attach_fini(struct file *file, struct inode *inode,
+			      bool lease_broken, int rc, bool attached);
+int pcc_ioctl_detach(struct inode *inode);
+int pcc_ioctl_state(struct inode *inode, struct lu_pcc_state *state);
+void pcc_file_init(struct pcc_file *pccf);
+int pcc_file_open(struct inode *inode, struct file *file);
+void pcc_file_release(struct inode *inode, struct file *file);
+ssize_t pcc_file_read_iter(struct kiocb *iocb, struct iov_iter *iter,
+			   bool *cached);
+ssize_t pcc_file_write_iter(struct kiocb *iocb, struct iov_iter *iter,
+			    bool *cached);
+int pcc_inode_getattr(struct inode *inode, bool *cached);
+int pcc_inode_setattr(struct inode *inode, struct iattr *attr, bool *cached);
+int pcc_inode_create(struct pcc_dataset *dataset, struct lu_fid *fid,
+		     struct dentry **pcc_dentry);
+int pcc_inode_create_fini(struct pcc_dataset *dataset, struct inode *inode,
+			  struct dentry *pcc_dentry);
+struct pcc_dataset *
+pcc_dataset_get(struct pcc_super *super, u32 projid, u32 archive_id);
+void pcc_dataset_put(struct pcc_dataset *dataset);
+void pcc_inode_free(struct inode *inode);
+#endif /* LLITE_PCC_H */
diff --git a/fs/lustre/llite/super25.c b/fs/lustre/llite/super25.c
index 6cae48c..afd51a6 100644
--- a/fs/lustre/llite/super25.c
+++ b/fs/lustre/llite/super25.c
@@ -222,6 +222,14 @@ static int __init lustre_init(void)
 	if (!ll_file_data_slab)
 		goto out_cache;
 
+	pcc_inode_slab = kmem_cache_create("ll_pcc_inode",
+					   sizeof(struct pcc_inode), 0,
+					   SLAB_HWCACHE_ALIGN, NULL);
+	if (!pcc_inode_slab) {
+		rc = -ENOMEM;
+		goto out_cache;
+	}
+
 	rc = llite_tunables_register();
 	if (rc)
 		goto out_cache;
@@ -258,6 +266,7 @@ static int __init lustre_init(void)
 out_cache:
 	kmem_cache_destroy(ll_inode_cachep);
 	kmem_cache_destroy(ll_file_data_slab);
+	kmem_cache_destroy(pcc_inode_slab);
 	return rc;
 }
 
@@ -278,6 +287,7 @@ static void __exit lustre_exit(void)
 	rcu_barrier();
 	kmem_cache_destroy(ll_inode_cachep);
 	kmem_cache_destroy(ll_file_data_slab);
+	kmem_cache_destroy(pcc_inode_slab);
 }
 
 MODULE_AUTHOR("OpenSFS, Inc. <http://www.lustre.org/>");
diff --git a/fs/lustre/lmv/lmv_intent.c b/fs/lustre/lmv/lmv_intent.c
index 3efd977..f62cd7c 100644
--- a/fs/lustre/lmv/lmv_intent.c
+++ b/fs/lustre/lmv/lmv_intent.c
@@ -356,7 +356,8 @@ static int lmv_intent_open(struct obd_export *exp, struct md_op_data *op_data,
 		op_data->op_mds = tgt->ltd_index;
 	} else {
 		LASSERT(fid_is_sane(&op_data->op_fid1));
-		LASSERT(fid_is_zero(&op_data->op_fid2));
+		LASSERT(it->it_flags & MDS_OPEN_PCC ||
+			fid_is_zero(&op_data->op_fid2));
 		LASSERT(op_data->op_name);
 
 		tgt = lmv_locate_tgt(lmv, op_data);
@@ -367,7 +368,8 @@ static int lmv_intent_open(struct obd_export *exp, struct md_op_data *op_data,
 	/* If it is ready to open the file by FID, do not need
 	 * allocate FID at all, otherwise it will confuse MDT
 	 */
-	if ((it->it_op & IT_CREAT) && !(it->it_flags & MDS_OPEN_BY_FID)) {
+	if ((it->it_op & IT_CREAT) && !(it->it_flags & MDS_OPEN_BY_FID ||
+					it->it_flags & MDS_OPEN_PCC)) {
 		/*
 		 * For lookup(IT_CREATE) cases allocate new fid and setup FLD
 		 * for it.
diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index 20ae322..bd64ebc 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -3480,6 +3480,7 @@ static int lmv_merge_attr(struct obd_export *exp,
 	.set_info_async		= lmv_set_info_async,
 	.notify			= lmv_notify,
 	.get_uuid		= lmv_get_uuid,
+	.fid_alloc		= lmv_fid_alloc,
 	.iocontrol		= lmv_iocontrol,
 	.quotactl		= lmv_quotactl
 };
diff --git a/fs/lustre/mdc/mdc_lib.c b/fs/lustre/mdc/mdc_lib.c
index f0e5a84..be77944b 100644
--- a/fs/lustre/mdc/mdc_lib.c
+++ b/fs/lustre/mdc/mdc_lib.c
@@ -294,6 +294,10 @@ void mdc_open_pack(struct ptlrpc_request *req, struct md_op_data *op_data,
 		cr_flags |= MDS_OPEN_HAS_EA;
 		tmp = req_capsule_client_get(&req->rq_pill, &RMF_EADATA);
 		memcpy(tmp, lmm, lmmlen);
+		if (cr_flags & MDS_OPEN_PCC) {
+			LASSERT(op_data);
+			rec->cr_archive_id = op_data->op_archive_id;
+		}
 	}
 	set_mrc_cr_flags(rec, cr_flags);
 }
@@ -504,6 +508,8 @@ static void mdc_close_intent_pack(struct ptlrpc_request *req,
 			memcpy(req_capsule_client_get(&req->rq_pill, &RMF_U32),
 				op_data->op_data, count * sizeof(u32));
 		}
+	} else if (bias & MDS_PCC_ATTACH) {
+		data->cd_archive_id = op_data->op_archive_id;
 	}
 }
 
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index a26f3ae..2e54dd1 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -1719,6 +1719,7 @@ enum mds_op_bias {
 	MDS_CLOSE_RESYNC_DONE	= 1 << 16,
 	MDS_CLOSE_LAYOUT_SPLIT	= 1 << 17,
 	MDS_TRUNC_KEEP_LEASE	= 1 << 18,
+	MDS_PCC_ATTACH		= 1 << 19,
 };
 
 #define MDS_CLOSE_INTENT (MDS_HSM_RELEASE | MDS_CLOSE_LAYOUT_SWAP |         \
@@ -1741,7 +1742,10 @@ struct mdt_rec_create {
 	struct lu_fid	cr_fid2;
 	struct lustre_handle cr_open_handle_old; /* in case of open replay */
 	__s64		cr_time;
-	__u64		cr_rdev;
+	union {
+		__u64		cr_rdev;
+		__u32		cr_archive_id;
+	};
 	__u64		cr_ioepoch;
 	__u64		cr_padding_1;	/* rr_blocks */
 	__u32		cr_mode;
@@ -2963,6 +2967,8 @@ struct close_data {
 		struct close_data_resync_done	cd_resync;
 		/* split close */
 		__u16				cd_mirror_id;
+		/* PCC release */
+		__u32				cd_archive_id;
 	};
 };
 
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index d66c883..2b12612 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -268,6 +268,7 @@ enum ll_lease_flags {
 	LL_LEASE_RESYNC_DONE	= 0x2,
 	LL_LEASE_LAYOUT_MERGE	= 0x4,
 	LL_LEASE_LAYOUT_SPLIT	= 0x8,
+	LL_LEASE_PCC_ATTACH	= 0x10,
 };
 
 #define IOC_IDS_MAX	4096
@@ -356,6 +357,8 @@ struct ll_ioc_lease_id {
 #define LL_IOC_LADVISE			_IOR('f', 250, struct llapi_lu_ladvise)
 #define LL_IOC_HEAT_GET			_IOWR('f', 251, struct lu_heat)
 #define LL_IOC_HEAT_SET			_IOW('f', 251, __u64)
+#define LL_IOC_PCC_DETACH		_IOW('f', 252, struct lu_pcc_detach)
+#define LL_IOC_PCC_STATE		_IOR('f', 252, struct lu_pcc_state)
 
 #define LL_STATFS_LMV		1
 #define LL_STATFS_LOV		2
@@ -1048,11 +1051,15 @@ enum la_valid {
 					      */
 #define MDS_OPEN_RELEASE   02000000000000ULL /* Open the file for HSM release */
 #define MDS_OPEN_RESYNC    04000000000000ULL /* FLR: file resync */
+#define MDS_OPEN_PCC      010000000000000ULL /* PCC: auto RW-PCC cache attach
+					      * for newly created file
+					      */
 
 #define MDS_OPEN_FL_INTERNAL (MDS_OPEN_HAS_EA | MDS_OPEN_HAS_OBJS |	\
 			      MDS_OPEN_OWNEROVERRIDE | MDS_OPEN_LOCK |	\
 			      MDS_OPEN_BY_FID | MDS_OPEN_LEASE |	\
-			      MDS_OPEN_RELEASE | MDS_OPEN_RESYNC)
+			      MDS_OPEN_RELEASE | MDS_OPEN_RESYNC |	\
+			      MDS_OPEN_PCC)
 
 /********* Changelogs **********/
 /** Changelog record types */
@@ -2062,6 +2069,47 @@ struct lu_heat {
 	__u64 lh_heat[0];
 };
 
+enum lu_pcc_type {
+	LU_PCC_NONE = 0,
+	LU_PCC_READWRITE,
+	LU_PCC_MAX
+};
+
+static inline const char *pcc_type2string(enum lu_pcc_type type)
+{
+	switch (type) {
+	case LU_PCC_NONE:
+		return "none";
+	case LU_PCC_READWRITE:
+		return "readwrite";
+	default:
+		return "fault";
+	}
+}
+
+struct lu_pcc_attach {
+	__u32 pcca_type; /* PCC type */
+	__u32 pcca_id; /* archive ID for readwrite, group ID for readonly */
+};
+
+struct lu_pcc_detach {
+	/* fid of the file to detach */
+	struct lu_fid	pccd_fid;
+};
+
+enum lu_pcc_state_flags {
+	/* Whether the inode attr is cached locally */
+	PCC_STATE_FLAG_ATTR_VALID	= 0x1,
+};
+
+struct lu_pcc_state {
+	__u32	pccs_type; /* enum lu_pcc_type */
+	__u32	pccs_open_count;
+	__u32	pccs_flags; /* enum lu_pcc_state_flags */
+	__u32	pccs_padding;
+	char	pccs_path[PATH_MAX];
+};
+
 /** @} lustreuser */
 
 #endif /* _LUSTRE_USER_H */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 359/622] lustre: pcc: Non-blocking PCC caching
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (357 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 358/622] lustre: llite: Add persistent cache on client James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 360/622] lustre: pcc: security and permission for non-root user access James Simmons
                   ` (263 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Qian Yingjin <qian@ddn.com>

The current PCC implementation uses the refcount of the PCC inode
to determine whether a PCC-attached file can be detached. If the
file is open (refcount > 1), the detach returns -EBUSY.

When another client accesses the PCC-cached file, it triggers the
restore process because the file is HSM released. During restore,
the agent needs to detach the PCC-cached file.
Thus, if a PCC-attached file is kept open for a long time, the
restore request will always fail.
This patch implements a non-blocking PCC caching mechanism for
Lustre. After attaching a file into PCC, the client acquires the
layout lock for the file and maintains the layout generation in
the PCC inode. While the layout lock is held, the PCC caching
state is valid and all I/O is directed into PCC. When the layout
lock is revoked, the blocking AST invalidates the PCC caching
state and detaches the file automatically.

This patch also helps handle the -ENOSPC error for PCC writes:
the client falls back to the normal I/O path, which restores the
file data to the OSTs (the file is in HSM released state) and
redoes the write.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10092
Lustre-commit: 58d744e3eaab ("LU-10092 pcc: Non-blocking PCC caching")
Signed-off-by: Qian Yingjin <qian@ddn.com>
Reviewed-on: https://review.whamcloud.com/32966
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_support.h         |   4 +
 fs/lustre/llite/dir.c                   |  31 +-
 fs/lustre/llite/file.c                  |  63 ++--
 fs/lustre/llite/llite_internal.h        |   1 +
 fs/lustre/llite/llite_lib.c             |   1 +
 fs/lustre/llite/llite_mmap.c            |  36 +-
 fs/lustre/llite/namei.c                 |   4 -
 fs/lustre/llite/pcc.c                   | 569 +++++++++++++++++++++++++++-----
 fs/lustre/llite/pcc.h                   |  51 ++-
 fs/lustre/llite/vvp_object.c            |   3 +-
 include/uapi/linux/lustre/lustre_user.h |  10 +-
 11 files changed, 604 insertions(+), 169 deletions(-)

diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index 837b68d..9609dd5 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -458,6 +458,10 @@
 #define OBD_FAIL_LLITE_IMUTEX_SEC			0x140e
 #define OBD_FAIL_LLITE_IMUTEX_NOSEC			0x140f
 #define OBD_FAIL_LLITE_OPEN_BY_NAME			0x1410
+#define OBD_FAIL_LLITE_PCC_FAKE_ERROR			0x1411
+#define OBD_FAIL_LLITE_PCC_DETACH_MKWRITE		0x1412
+#define OBD_FAIL_LLITE_PCC_MKWRITE_PAUSE		0x1413
+#define OBD_FAIL_LLITE_PCC_ATTACH_PAUSE			0x1414
 
 #define OBD_FAIL_FID_INDIR				0x1501
 #define OBD_FAIL_FID_INLMA				0x1502
diff --git a/fs/lustre/llite/dir.c b/fs/lustre/llite/dir.c
index 337582b..1f7ed32 100644
--- a/fs/lustre/llite/dir.c
+++ b/fs/lustre/llite/dir.c
@@ -1917,41 +1917,12 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		return ll_ioctl_fsgetxattr(inode, cmd, arg);
 	case FS_IOC_FSSETXATTR:
 		return ll_ioctl_fssetxattr(inode, cmd, arg);
-	case LL_IOC_PCC_DETACH: {
+	case LL_IOC_PCC_DETACH_BY_FID: {
 		struct lu_pcc_detach *detach;
 		struct lu_fid *fid;
 		struct inode *inode2;
 		unsigned long ino;
 
-		/*
-		 * The reason why a dir IOCTL is used to detach a PCC-cached
-		 * file rather than making it a file IOCTL is:
-		 * When PCC caching a file, it will attach the file firstly,
-		 * and increase the refcount of PCC inode (pcci->pcci_refcount)
-		 * from 0 to 1.
-		 * When detaching a PCC-cached file, it will check whether the
-		 * refcount is 1. If so, the file can be detached successfully.
-		 * Otherwise, it means there are some users opened and using
-		 * the file currently, and it will return -EBUSY.
-		 * Each open on the PCC-cached file will increase the refcount
-		 * of the PCC inode;
-		 * Each close on the PCC-cached file will decrease the refcount
-		 * of the PCC inode;
-		 * When used a file IOCTL to detach a PCC-cached file, it needs
-		 * to open it at first, which will increase the refcount. So
-		 * during the process of the detach IOCTL, it will return
-		 * -EBUSY as the PCC inode refcount is larger than 1. Someone
-		 * might argue that here it can just decrease the refcount
-		 * of the PCC inode, return succeed and make the close of
-		 * IOCTL file handle to perform the real detach. But this
-		 * may result in inconsistent state of a PCC file. i.e. Process
-		 * A got a successful return form the detach IOCTL; Process B
-		 * opens the file before Process A finally closed the IOCTL
-		 * file handle. It makes the following I/O of Process B will
-		 * direct into PCC although the file was already detached from
-		 * the view of Process A.
-		 * Using a dir IOCTL does not exist the problem above.
-		 */
 		detach = kzalloc(sizeof(*detach), GFP_KERNEL);
 		if (!detach)
 			return -ENOMEM;
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 95e7c73..5a52cad 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -59,6 +59,7 @@ struct split_param {
 struct pcc_param {
 	u64	pa_data_version;
 	u32	pa_archive_id;
+	u32	pa_layout_gen;
 };
 
 static int
@@ -241,6 +242,12 @@ static int ll_close_inode_openhandle(struct inode *inode,
 		body = req_capsule_server_get(&req->rq_pill, &RMF_MDT_BODY);
 		if (!(body->mbo_valid & OBD_MD_CLOSE_INTENT_EXECED))
 			rc = -EBUSY;
+
+		if (bias & MDS_PCC_ATTACH) {
+			struct pcc_param *param = data;
+
+			param->pa_layout_gen = body->mbo_layout_gen;
+		}
 	}
 
 	ll_finish_md_op_data(op_data);
@@ -1657,7 +1664,7 @@ static ssize_t ll_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	ssize_t result;
 	u16 refcheck;
 	ssize_t rc2;
-	bool cached = false;
+	bool cached;
 
 	/**
 	 * Currently when PCC read failed, we do not fall back to the
@@ -1766,20 +1773,21 @@ static ssize_t ll_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	struct vvp_io_args *args;
 	ssize_t rc_tiny = 0, rc_normal;
 	u16 refcheck;
-	bool cached = false;
+	bool cached;
 	int result;
 
 	/**
-	 * When PCC write failed, we do not fall back to the normal
-	 * write path, just return the error. The reason is that:
-	 * PCC is actually a HSM device, and HSM does not handle the
-	 * failure especially -ENOSPC due to space used out; Moreover,
-	 * the fallback to normal I/O path for ENOSPC failure, needs
-	 * to restore the file data to OSTs first and redo the write
-	 * again, making the logic of PCC very complex.
+	 * When PCC write failed, we usually do not fall back to the normal
+	 * write path, just return the error. But there is a special case when
+	 * returned error code is -ENOSPC due to running out of space on PCC HSM
+	 * backend. At this time, it will fall back to normal I/O path and
+	 * retry the I/O. As the file is in HSM released state, it will restore
+	 * the file data to OSTs first and redo the write again. And the
+	 * restore process will revoke the layout lock and detach the file
+	 * from PCC cache automatically.
 	 */
 	result = pcc_file_write_iter(iocb, from, &cached);
-	if (cached)
+	if (cached && result != -ENOSPC)
 		return result;
 
 	/* NB: we can't do direct IO for tiny writes because they use the page
@@ -3197,8 +3205,10 @@ static long ll_file_unlock_lease(struct file *file, struct ll_ioc_lease *ioc,
 	case LL_LEASE_PCC_ATTACH:
 		if (!rc)
 			rc = rc2;
-		rc = pcc_readwrite_attach_fini(file, inode, lease_broken,
-					       rc, attached);
+		rc = pcc_readwrite_attach_fini(file, inode,
+					       param.pa_layout_gen,
+					       lease_broken, rc,
+					       attached);
 		break;
 	}
 
@@ -3721,6 +3731,14 @@ static int ll_heat_set(struct inode *inode, enum lu_heat_flag flags)
 		rc = ll_heat_set(inode, flags);
 		return rc;
 	}
+	case LL_IOC_PCC_DETACH:
+		if (!S_ISREG(inode->i_mode))
+			return -EINVAL;
+
+		if (!inode_owner_or_capable(inode))
+			return -EPERM;
+
+		return pcc_ioctl_detach(inode);
 	case LL_IOC_PCC_STATE: {
 		struct lu_pcc_state __user *ustate =
 			(struct lu_pcc_state __user *)arg;
@@ -3735,7 +3753,7 @@ static int ll_heat_set(struct inode *inode, enum lu_heat_flag flags)
 			goto out_state;
 		}
 
-		rc = pcc_ioctl_state(inode, state);
+		rc = pcc_ioctl_state(file, inode, state);
 		if (rc)
 			goto out_state;
 
@@ -3855,19 +3873,13 @@ int ll_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 {
 	struct inode *inode = file_inode(file);
 	struct ll_inode_info *lli = ll_i2info(inode);
-	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
 	struct ptlrpc_request *req;
-	struct file *pcc_file = fd->fd_pcc_file.pccf_file;
 	int rc, err;
 
 	CDEBUG(D_VFSTRACE, "VFS Op:inode=" DFID "(%p)\n",
 	       PFID(ll_inode2fid(inode)), inode);
 	ll_stats_ops_tally(ll_i2sbi(inode), LPROC_LL_FSYNC, 1);
 
-	/* pcc cache path */
-	if (pcc_file)
-		return file_inode(pcc_file)->i_fop->fsync(pcc_file,
-					start, end, datasync);
 
 	rc = file_write_and_wait_range(file, start, end);
 	inode_lock(inode);
@@ -3877,6 +3889,7 @@ int ll_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 	 */
 	if (!S_ISDIR(inode->i_mode)) {
 		err = lli->lli_async_rc;
+
 		lli->lli_async_rc = 0;
 		if (rc == 0)
 			rc = err;
@@ -3895,8 +3908,15 @@ int ll_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 
 	if (S_ISREG(inode->i_mode)) {
 		struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
+		bool cached;
 
-		err = cl_sync_file_range(inode, start, end, CL_FSYNC_ALL, 0);
+		/* Sync metadata on MDT first, and then sync the cached data
+		 * on PCC.
+		 */
+		err = pcc_fsync(file, start, end, datasync, &cached);
+		if (!cached)
+			err = cl_sync_file_range(inode, start, end,
+						 CL_FSYNC_ALL, 0);
 		if (rc == 0 && err < 0)
 			rc = err;
 		if (rc < 0)
@@ -4416,11 +4436,12 @@ int ll_getattr(const struct path *path, struct kstat *stat,
 		return rc;
 
 	if (S_ISREG(inode->i_mode)) {
-		bool cached = false;
+		bool cached;
 
 		rc = pcc_inode_getattr(inode, &cached);
 		if (cached && rc < 0)
 			return rc;
+
 		/* In case of restore, the MDT has the right size and has
 		 * already send it back without granting the layout lock,
 		 * inode is up-to-date so glimpse is useless.
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index f2ea856..d36e01e 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -208,6 +208,7 @@ struct ll_inode_info {
 			char				lli_jobid[LUSTRE_JOBID_SIZE];
 
 			struct mutex		 lli_pcc_lock;
+			enum lu_pcc_state_flags	 lli_pcc_state;
 			struct pcc_inode	*lli_pcc_inode;
 		};
 	};
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index d46bc99..1b22062 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -956,6 +956,7 @@ void ll_lli_init(struct ll_inode_info *lli)
 		obd_heat_clear(lli->lli_heat_instances, OBD_HEAT_COUNT);
 		lli->lli_heat_flags = 0;
 		mutex_init(&lli->lli_pcc_lock);
+		lli->lli_pcc_state = PCC_STATE_FL_NONE;
 		lli->lli_pcc_inode = NULL;
 	}
 	mutex_init(&lli->lli_layout_mutex);
diff --git a/fs/lustre/llite/llite_mmap.c b/fs/lustre/llite/llite_mmap.c
index fc2331b..71799cd 100644
--- a/fs/lustre/llite/llite_mmap.c
+++ b/fs/lustre/llite/llite_mmap.c
@@ -360,9 +360,17 @@ static vm_fault_t ll_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	int count = 0;
 	bool printed = false;
+	bool cached;
 	vm_fault_t result;
 	sigset_t old, new;
 
+	ll_stats_ops_tally(ll_i2sbi(file_inode(vma->vm_file)),
+			   LPROC_LL_FAULT, 1);
+
+	result = pcc_fault(vma, vmf, &cached);
+	if (cached)
+		return result;
+
 	/* Only SIGKILL and SIGTERM are allowed for fault/nopage/mkwrite
 	 * so that it can be killed by admin but not cause segfault by
 	 * other signals.
@@ -370,9 +378,6 @@ static vm_fault_t ll_fault(struct vm_fault *vmf)
 	siginitsetinv(&new, sigmask(SIGKILL) | sigmask(SIGTERM));
 	sigprocmask(SIG_BLOCK, &new, &old);
 
-	ll_stats_ops_tally(ll_i2sbi(file_inode(vma->vm_file)),
-			   LPROC_LL_FAULT, 1);
-
 	/* make sure offset is not a negative number */
 	if (vmf->pgoff > (MAX_LFS_FILESIZE >> PAGE_SHIFT))
 		return VM_FAULT_SIGBUS;
@@ -410,12 +415,17 @@ static vm_fault_t ll_page_mkwrite(struct vm_fault *vmf)
 	int count = 0;
 	bool printed = false;
 	bool retry;
+	bool cached;
 	int err;
 	vm_fault_t ret;
 
 	ll_stats_ops_tally(ll_i2sbi(file_inode(vma->vm_file)),
 			   LPROC_LL_MKWRITE, 1);
 
+	err = pcc_page_mkwrite(vma, vmf, &cached);
+	if (cached)
+		return err;
+
 	file_update_time(vma->vm_file);
 	do {
 		retry = false;
@@ -463,6 +473,7 @@ static void ll_vm_open(struct vm_area_struct *vma)
 
 	LASSERT(atomic_read(&vob->vob_mmap_cnt) >= 0);
 	atomic_inc(&vob->vob_mmap_cnt);
+	pcc_vm_open(vma);
 }
 
 /**
@@ -475,6 +486,7 @@ static void ll_vm_close(struct vm_area_struct *vma)
 
 	atomic_dec(&vob->vob_mmap_cnt);
 	LASSERT(atomic_read(&vob->vob_mmap_cnt) >= 0);
+	pcc_vm_close(vma);
 }
 
 /* XXX put nice comment here.  talk about __free_pte -> dirty pages and
@@ -488,7 +500,7 @@ int ll_teardown_mmaps(struct address_space *mapping, u64 first, u64 last)
 	if (mapping_mapped(mapping)) {
 		rc = 0;
 		unmap_mapping_range(mapping, first + PAGE_SIZE - 1,
-				    last - first + 1, 0);
+				    last - first + 1, 1);
 	}
 
 	return rc;
@@ -504,26 +516,24 @@ int ll_teardown_mmaps(struct address_space *mapping, u64 first, u64 last)
 int ll_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct inode *inode = file_inode(file);
+	bool cached;
 	int rc;
-	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
-	struct file *pcc_file = fd->fd_pcc_file.pccf_file;
-
-	/* pcc cache path */
-	if (pcc_file) {
-		vma->vm_file = pcc_file;
-		return file_inode(pcc_file)->i_fop->mmap(pcc_file, vma);
-	}
 
 	if (ll_file_nolock(file))
 		return -EOPNOTSUPP;
 
+	rc = pcc_file_mmap(file, vma, &cached);
+	if (cached && rc != 0)
+		return rc;
+
 	ll_stats_ops_tally(ll_i2sbi(inode), LPROC_LL_MAP, 1);
 	rc = generic_file_mmap(file, vma);
 	if (rc == 0) {
 		vma->vm_ops = &ll_file_vm_ops;
 		vma->vm_ops->open(vma);
 		/* update the inode's size and mtime */
-		rc = ll_glimpse_size(inode);
+		if (!cached)
+			rc = ll_glimpse_size(inode);
 	}
 
 	return rc;
diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index 4f39b2c..d10decb 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -824,10 +824,6 @@ static struct dentry *ll_lookup_it(struct inode *parent, struct dentry *dentry,
 
 		lum->lmm_magic = LOV_USER_MAGIC_V1;
 		lum->lmm_pattern = LOV_PATTERN_F_RELEASED | LOV_PATTERN_RAID0;
-		lum->lmm_stripe_size = 0;
-		lum->lmm_stripe_count = 0;
-		lum->lmm_stripe_offset = 0;
-
 		op_data->op_data = lum;
 		op_data->op_data_size = sizeof(*lum);
 		op_data->op_archive_id = dataset->pccd_id;
diff --git a/fs/lustre/llite/pcc.c b/fs/lustre/llite/pcc.c
index 53e5cda..8440647 100644
--- a/fs/lustre/llite/pcc.c
+++ b/fs/lustre/llite/pcc.c
@@ -401,17 +401,25 @@ static inline void pcc_inode_unlock(struct inode *inode)
 	mutex_unlock(&ll_i2info(inode)->lli_pcc_lock);
 }
 
-static void pcc_inode_init(struct pcc_inode *pcci)
+static void pcc_inode_init(struct pcc_inode *pcci, struct ll_inode_info *lli)
 {
+	pcci->pcci_lli = lli;
+	lli->lli_pcc_inode = pcci;
 	atomic_set(&pcci->pcci_refcount, 0);
 	pcci->pcci_type = LU_PCC_NONE;
+	pcci->pcci_layout_gen = CL_LAYOUT_GEN_NONE;
+	atomic_set(&pcci->pcci_active_ios, 0);
+	init_waitqueue_head(&pcci->pcci_waitq);
 }
 
 static void pcc_inode_fini(struct pcc_inode *pcci)
 {
+	struct ll_inode_info *lli = pcci->pcci_lli;
+
 	path_put(&pcci->pcci_path);
 	pcci->pcci_type = LU_PCC_NONE;
 	kmem_cache_free(pcc_inode_slab, pcci);
+	lli->lli_pcc_inode = NULL;
 }
 
 static void pcc_inode_get(struct pcc_inode *pcci)
@@ -427,13 +435,11 @@ static void pcc_inode_put(struct pcc_inode *pcci)
 
 void pcc_inode_free(struct inode *inode)
 {
-	struct ll_inode_info *lli = ll_i2info(inode);
-	struct pcc_inode *pcci = lli->lli_pcc_inode;
+	struct pcc_inode *pcci = ll_i2pcci(inode);
 
 	if (pcci) {
 		WARN_ON(atomic_read(&pcci->pcci_refcount) > 1);
 		pcc_inode_put(pcci);
-		lli->lli_pcc_inode = NULL;
 	}
 }
 
@@ -463,6 +469,11 @@ void pcc_file_init(struct pcc_file *pccf)
 	pccf->pccf_type = LU_PCC_NONE;
 }
 
+static inline bool pcc_inode_has_layout(struct pcc_inode *pcci)
+{
+	return pcci->pcci_layout_gen != CL_LAYOUT_GEN_NONE;
+}
+
 int pcc_file_open(struct inode *inode, struct file *file)
 {
 	struct pcc_inode *pcci;
@@ -481,7 +492,8 @@ int pcc_file_open(struct inode *inode, struct file *file)
 	if (!pcci)
 		goto out_unlock;
 
-	if (atomic_read(&pcci->pcci_refcount) == 0)
+	if (atomic_read(&pcci->pcci_refcount) == 0 ||
+	    !pcc_inode_has_layout(pcci))
 		goto out_unlock;
 
 	pcc_inode_get(pcci);
@@ -534,24 +546,64 @@ void pcc_file_release(struct inode *inode, struct file *file)
 	pcc_inode_unlock(inode);
 }
 
+static inline void pcc_layout_gen_set(struct pcc_inode *pcci,
+				      u32 gen)
+{
+	pcci->pcci_layout_gen = gen;
+}
+
+static void pcc_io_init(struct inode *inode, bool *cached)
+{
+	struct pcc_inode *pcci;
+
+	pcc_inode_lock(inode);
+	pcci = ll_i2pcci(inode);
+	if (pcci && pcc_inode_has_layout(pcci)) {
+		LASSERT(atomic_read(&pcci->pcci_refcount) > 0);
+		atomic_inc(&pcci->pcci_active_ios);
+		*cached = true;
+	} else {
+		*cached = false;
+	}
+	pcc_inode_unlock(inode);
+}
+
+static void pcc_io_fini(struct inode *inode)
+{
+	struct pcc_inode *pcci = ll_i2pcci(inode);
+
+	LASSERT(pcci && atomic_read(&pcci->pcci_active_ios) > 0);
+	if (atomic_dec_and_test(&pcci->pcci_active_ios))
+		wake_up_all(&pcci->pcci_waitq);
+}
+
 ssize_t pcc_file_read_iter(struct kiocb *iocb,
 			   struct iov_iter *iter, bool *cached)
 {
 	struct file *file = iocb->ki_filp;
 	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
 	struct pcc_file *pccf = &fd->fd_pcc_file;
+	struct inode *inode = file_inode(file);
 	ssize_t result;
 
 	if (!pccf->pccf_file) {
 		*cached = false;
 		return 0;
 	}
-	*cached = true;
-	iocb->ki_filp = pccf->pccf_file;
 
-	result = generic_file_read_iter(iocb, iter);
+	pcc_io_init(inode, cached);
+	if (!*cached)
+		return 0;
+
+	iocb->ki_filp = pccf->pccf_file;
+	/* generic_file_aio_read does not support ext4-dax,
+	 * filp->f_ops->read_iter uses ->aio_read hook directly
+	 * to add support for ext4-dax.
+	 */
+	result = file->f_op->read_iter(iocb, iter);
 	iocb->ki_filp = file;
 
+	pcc_io_fini(inode);
 	return result;
 }
 
@@ -561,16 +613,27 @@ ssize_t pcc_file_write_iter(struct kiocb *iocb,
 	struct file *file = iocb->ki_filp;
 	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
 	struct pcc_file *pccf = &fd->fd_pcc_file;
+	struct inode *inode = file_inode(file);
 	ssize_t result;
 
 	if (!pccf->pccf_file) {
 		*cached = false;
 		return 0;
 	}
-	*cached = true;
 
-	if (pccf->pccf_type != LU_PCC_READWRITE)
-		return -EWOULDBLOCK;
+	if (pccf->pccf_type != LU_PCC_READWRITE) {
+		*cached = false;
+		return -EAGAIN;
+	}
+
+	pcc_io_init(inode, cached);
+	if (!*cached)
+		return 0;
+
+	if (OBD_FAIL_CHECK(OBD_FAIL_LLITE_PCC_FAKE_ERROR)) {
+		result = -ENOSPC;
+		goto out;
+	}
 
 	iocb->ki_filp = pccf->pccf_file;
 
@@ -580,6 +643,8 @@ ssize_t pcc_file_write_iter(struct kiocb *iocb,
 	 */
 	result = file->f_op->write_iter(iocb, iter);
 	iocb->ki_filp = file;
+out:
+	pcc_io_fini(inode);
 	return result;
 }
 
@@ -587,37 +652,35 @@ int pcc_inode_setattr(struct inode *inode, struct iattr *attr,
 		      bool *cached)
 {
 	int rc = 0;
-	struct pcc_inode *pcci;
 	struct iattr attr2 = *attr;
 	struct dentry *pcc_dentry;
+	struct pcc_inode *pcci;
 
 	if (!S_ISREG(inode->i_mode)) {
 		*cached = false;
 		return 0;
 	}
 
-	pcc_inode_lock(inode);
-	pcci = ll_i2pcci(inode);
-	if (!pcci || atomic_read(&pcci->pcci_refcount) == 0)
-		goto out_unlock;
+	pcc_io_init(inode, cached);
+	if (!*cached)
+		return 0;
 
-	*cached = true;
 	attr2.ia_valid = attr->ia_valid & (ATTR_SIZE | ATTR_ATIME |
 			 ATTR_ATIME_SET | ATTR_MTIME | ATTR_MTIME_SET |
 			 ATTR_CTIME);
+	pcci = ll_i2pcci(inode);
 	pcc_dentry = pcci->pcci_path.dentry;
 	inode_lock(pcc_dentry->d_inode);
 	rc = pcc_dentry->d_inode->i_op->setattr(pcc_dentry, &attr2);
 	inode_unlock(pcc_dentry->d_inode);
-out_unlock:
-	pcc_inode_unlock(inode);
+
+	pcc_io_fini(inode);
 	return rc;
 }
 
 int pcc_inode_getattr(struct inode *inode, bool *cached)
 {
 	struct ll_inode_info *lli = ll_i2info(inode);
-	struct pcc_inode *pcci;
 	struct kstat stat;
 	s64 atime;
 	s64 mtime;
@@ -629,16 +692,14 @@ int pcc_inode_getattr(struct inode *inode, bool *cached)
 		return 0;
 	}
 
-	pcc_inode_lock(inode);
-	pcci = ll_i2pcci(inode);
-	if (!pcci || atomic_read(&pcci->pcci_refcount) == 0)
-		goto out_unlock;
+	pcc_io_init(inode, cached);
+	if (!*cached)
+		return 0;
 
-	*cached = true;
-	rc = vfs_getattr(&pcci->pcci_path, &stat,
+	rc = vfs_getattr(&ll_i2pcci(inode)->pcci_path, &stat,
 			 STATX_BASIC_STATS, AT_STATX_SYNC_AS_STAT);
 	if (rc)
-		goto out_unlock;
+		goto out;
 
 	ll_inode_size_lock(inode);
 	if (test_and_clear_bit(LLIF_UPDATE_ATIME, &lli->lli_flags) ||
@@ -669,9 +730,274 @@ int pcc_inode_getattr(struct inode *inode, bool *cached)
 	inode->i_ctime.tv_sec = ctime;
 
 	ll_inode_size_unlock(inode);
+out:
+	pcc_io_fini(inode);
+	return rc;
+}
 
-out_unlock:
+ssize_t pcc_file_splice_read(struct file *in_file, loff_t *ppos,
+			     struct pipe_inode_info *pipe,
+			     size_t count, unsigned int flags,
+			     bool *cached)
+{
+	struct inode *inode = file_inode(in_file);
+	struct ll_file_data *fd = LUSTRE_FPRIVATE(in_file);
+	struct file *pcc_file = fd->fd_pcc_file.pccf_file;
+	ssize_t result;
+
+	*cached = false;
+	if (!pcc_file)
+		return 0;
+
+	if (!file_inode(pcc_file)->i_fop->splice_read)
+		return -ENOTSUPP;
+
+	pcc_io_init(inode, cached);
+	if (!*cached)
+		return 0;
+
+	result = file_inode(pcc_file)->i_fop->splice_read(pcc_file,
+							  ppos, pipe, count,
+							  flags);
+
+	pcc_io_fini(inode);
+	return result;
+}
+
+int pcc_fsync(struct file *file, loff_t start, loff_t end,
+	      int datasync, bool *cached)
+{
+	struct inode *inode = file_inode(file);
+	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
+	struct file *pcc_file = fd->fd_pcc_file.pccf_file;
+	int rc;
+
+	if (!pcc_file) {
+		*cached = false;
+		return 0;
+	}
+
+	pcc_io_init(inode, cached);
+	if (!*cached)
+		return 0;
+
+	rc = file_inode(pcc_file)->i_fop->fsync(pcc_file,
+						start, end, datasync);
+
+	pcc_io_fini(inode);
+	return rc;
+}
+
+int pcc_file_mmap(struct file *file, struct vm_area_struct *vma,
+		  bool *cached)
+{
+	struct inode *inode = file_inode(file);
+	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
+	struct file *pcc_file = fd->fd_pcc_file.pccf_file;
+	struct pcc_inode *pcci;
+	int rc = 0;
+
+	if (!pcc_file || !file_inode(pcc_file)->i_fop->mmap) {
+		*cached = false;
+		return 0;
+	}
+
+	pcc_inode_lock(inode);
+	pcci = ll_i2pcci(inode);
+	if (pcci && pcc_inode_has_layout(pcci)) {
+		LASSERT(atomic_read(&pcci->pcci_refcount) > 1);
+		*cached = true;
+		vma->vm_file = pcc_file;
+		rc = file_inode(pcc_file)->i_fop->mmap(pcc_file, vma);
+		vma->vm_file = file;
+		/* Save the vm ops of backend PCC */
+		vma->vm_private_data = (void *)vma->vm_ops;
+	} else {
+		*cached = false;
+	}
 	pcc_inode_unlock(inode);
+
+	return rc;
+}
+
+void pcc_vm_open(struct vm_area_struct *vma)
+{
+	struct pcc_inode *pcci;
+	struct file *file = vma->vm_file;
+	struct inode *inode = file_inode(file);
+	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
+	struct file *pcc_file = fd->fd_pcc_file.pccf_file;
+	const struct vm_operations_struct *pcc_vm_ops = vma->vm_private_data;
+
+	if (!pcc_file || !pcc_vm_ops || !pcc_vm_ops->open)
+		return;
+
+	pcc_inode_lock(inode);
+	pcci = ll_i2pcci(inode);
+	if (pcci && pcc_inode_has_layout(pcci)) {
+		vma->vm_file = pcc_file;
+		pcc_vm_ops->open(vma);
+		vma->vm_file = file;
+	}
+	pcc_inode_unlock(inode);
+}
+
+void pcc_vm_close(struct vm_area_struct *vma)
+{
+	struct file *file = vma->vm_file;
+	struct inode *inode = file_inode(file);
+	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
+	struct file *pcc_file = fd->fd_pcc_file.pccf_file;
+	const struct vm_operations_struct *pcc_vm_ops = vma->vm_private_data;
+
+	if (!pcc_file || !pcc_vm_ops || !pcc_vm_ops->close)
+		return;
+
+	pcc_inode_lock(inode);
+	/* Layout lock maybe revoked here */
+	vma->vm_file = pcc_file;
+	pcc_vm_ops->close(vma);
+	vma->vm_file = file;
+	pcc_inode_unlock(inode);
+}
+
+int pcc_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
+		     bool *cached)
+{
+	struct page *page = vmf->page;
+	struct mm_struct *mm = vma->vm_mm;
+	struct file *file = vma->vm_file;
+	struct inode *inode = file_inode(file);
+	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
+	struct file *pcc_file = fd->fd_pcc_file.pccf_file;
+	const struct vm_operations_struct *pcc_vm_ops = vma->vm_private_data;
+	int rc;
+
+	if (!pcc_file || !pcc_vm_ops || !pcc_vm_ops->page_mkwrite) {
+		*cached = false;
+		return 0;
+	}
+
+	/* Pause to allow for a race with concurrent detach */
+	OBD_FAIL_TIMEOUT(OBD_FAIL_LLITE_PCC_MKWRITE_PAUSE, cfs_fail_val);
+
+	pcc_io_init(inode, cached);
+	if (!*cached) {
+		/* This happens when the file is detached from PCC after it got
+		 * the fault page via ->fault() on the inode of the PCC copy.
+		 * Here it can not simply fall back to normal Lustre I/O path.
+		 * The reason is that the address space of fault page used by
+		 * ->page_mkwrite() is still the one of PCC inode. In the
+		 * normal Lustre ->page_mkwrite() I/O path, it will be wrongly
+		 * handled as the address space of the fault page is not
+		 * consistent with the one of the Lustre inode (though the
+		 * fault page was truncated).
+		 * As the file is detached from PCC, the fault page must
+		 * be released first, and retry the mmap write (->fault() and
+		 * ->page_mkwrite).
+		 * We use an ugly and tricky method by returning
+		 * VM_FAULT_NOPAGE | VM_FAULT_RETRY to the caller
+		 * __do_page_fault and retry the memory fault handling.
+		 */
+		if (page->mapping == file_inode(pcc_file)->i_mapping) {
+			*cached = true;
+			up_read(&mm->mmap_sem);
+			return VM_FAULT_RETRY | VM_FAULT_NOPAGE;
+		}
+
+		return 0;
+	}
+
+	/*
+	 * This fault injection can also be used to simulate -ENOSPC and
+	 * -EDQUOT failure of underlying PCC backend fs.
+	 */
+	if (OBD_FAIL_CHECK(OBD_FAIL_LLITE_PCC_DETACH_MKWRITE)) {
+		pcc_io_fini(inode);
+		pcc_ioctl_detach(inode);
+		up_read(&mm->mmap_sem);
+		return VM_FAULT_RETRY | VM_FAULT_NOPAGE;
+	}
+
+	vma->vm_file = pcc_file;
+	rc = pcc_vm_ops->page_mkwrite(vmf);
+	vma->vm_file = file;
+
+	pcc_io_fini(inode);
+	return rc;
+}
+
+int pcc_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+	      bool *cached)
+{
+	struct file *file = vma->vm_file;
+	struct inode *inode = file_inode(file);
+	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
+	struct file *pcc_file = fd->fd_pcc_file.pccf_file;
+	const struct vm_operations_struct *pcc_vm_ops = vma->vm_private_data;
+	int rc;
+
+	if (!pcc_file || !pcc_vm_ops || !pcc_vm_ops->fault) {
+		*cached = false;
+		return 0;
+	}
+
+	pcc_io_init(inode, cached);
+	if (!*cached)
+		return 0;
+
+	vma->vm_file = pcc_file;
+	rc = pcc_vm_ops->fault(vmf);
+	vma->vm_file = file;
+
+	pcc_io_fini(inode);
+	return rc;
+}
+
+static void pcc_layout_wait(struct pcc_inode *pcci)
+{
+	if (atomic_read(&pcci->pcci_active_ios) > 0)
+		CDEBUG(D_CACHE, "Waiting for IO completion: %d\n",
+		       atomic_read(&pcci->pcci_active_ios));
+	wait_event_idle(pcci->pcci_waitq,
+			atomic_read(&pcci->pcci_active_ios) == 0);
+}
+
+static void __pcc_layout_invalidate(struct pcc_inode *pcci)
+{
+	pcci->pcci_type = LU_PCC_NONE;
+	pcc_layout_gen_set(pcci, CL_LAYOUT_GEN_NONE);
+	pcc_layout_wait(pcci);
+}
+
+void pcc_layout_invalidate(struct inode *inode)
+{
+	struct pcc_inode *pcci;
+
+	pcc_inode_lock(inode);
+	pcci = ll_i2pcci(inode);
+	if (pcci && pcc_inode_has_layout(pcci)) {
+		LASSERT(atomic_read(&pcci->pcci_refcount) > 0);
+		__pcc_layout_invalidate(pcci);
+
+		CDEBUG(D_CACHE, "Invalidate "DFID" layout gen %d\n",
+		       PFID(&ll_i2info(inode)->lli_fid), pcci->pcci_layout_gen);
+
+		pcc_inode_put(pcci);
+	}
+	pcc_inode_unlock(inode);
+}
+
+static int pcc_inode_remove(struct pcc_inode *pcci)
+{
+	struct dentry *dentry;
+	int rc;
+
+	dentry = pcci->pcci_path.dentry;
+	rc = vfs_unlink(dentry->d_parent->d_inode, dentry, NULL);
+	if (rc)
+		CWARN("failed to unlink cached file, rc = %d\n", rc);
+
 	return rc;
 }
 
@@ -719,9 +1045,10 @@ int pcc_inode_getattr(struct inode *inode, bool *cached)
 		*ptr = '\0';
 		child = pcc_mkdir(parent, entry_name, mode);
 		*ptr = '/';
+		dput(parent);
 		if (IS_ERR(child))
 			break;
-		dput(parent);
+
 		parent = child;
 		ptr++;
 		entry_name = ptr;
@@ -816,21 +1143,36 @@ int pcc_inode_create(struct pcc_dataset *dataset, struct lu_fid *fid,
 int pcc_inode_create_fini(struct pcc_dataset *dataset, struct inode *inode,
 			  struct dentry *pcc_dentry)
 {
-	struct ll_inode_info *lli = ll_i2info(inode);
 	struct pcc_inode *pcci;
+	int rc = 0;
 
+	pcc_inode_lock(inode);
 	LASSERT(!ll_i2pcci(inode));
 	pcci = kmem_cache_zalloc(pcc_inode_slab, GFP_NOFS);
-	if (!pcci)
-		return -ENOMEM;
+	if (!pcci) {
+		rc = -ENOMEM;
+		goto out_unlock;
+	}
 
-	pcc_inode_init(pcci);
-	pcc_inode_lock(inode);
+	pcc_inode_init(pcci, ll_i2info(inode));
 	pcc_inode_attach_init(dataset, pcci, pcc_dentry, LU_PCC_READWRITE);
-	lli->lli_pcc_inode = pcci;
-	pcc_inode_unlock(inode);
+	/* Set the layout generation of newly created file with 0 */
+	pcc_layout_gen_set(pcci, 0);
 
-	return 0;
+out_unlock:
+	if (rc) {
+		int rc2;
+
+		rc2 = vfs_unlink(pcc_dentry->d_parent->d_inode,
+				 pcc_dentry, NULL);
+		if (rc2)
+			CWARN("failed to unlink PCC file, rc = %d\n", rc2);
+
+		dput(pcc_dentry);
+	}
+
+	pcc_inode_unlock(inode);
+	return rc;
 }
 
 static int pcc_filp_write(struct file *filp, const void *buf, ssize_t count,
@@ -881,6 +1223,30 @@ static int pcc_copy_data(struct file *src, struct file *dst)
 	return rc;
 }
 
+static int pcc_attach_allowed_check(struct inode *inode)
+{
+	struct ll_inode_info *lli = ll_i2info(inode);
+	struct pcc_inode *pcci;
+	int rc = 0;
+
+	pcc_inode_lock(inode);
+	if (lli->lli_pcc_state & PCC_STATE_FL_ATTACHING) {
+		rc = -EBUSY;
+		goto out_unlock;
+	}
+
+	pcci = ll_i2pcci(inode);
+	if (pcci && pcc_inode_has_layout(pcci)) {
+		rc = -EEXIST;
+		goto out_unlock;
+	}
+
+	lli->lli_pcc_state |= PCC_STATE_FL_ATTACHING;
+out_unlock:
+	pcc_inode_unlock(inode);
+	return rc;
+}
+
 int pcc_readwrite_attach(struct file *file, struct inode *inode,
 			 u32 archive_id)
 {
@@ -892,28 +1258,14 @@ int pcc_readwrite_attach(struct file *file, struct inode *inode,
 	struct path path;
 	int rc;
 
-	pcc_inode_lock(inode);
-	pcci = ll_i2pcci(inode);
-	if (!pcci) {
-		pcci = kmem_cache_zalloc(pcc_inode_slab, GFP_NOFS);
-		if (!pcci) {
-			pcc_inode_unlock(inode);
-			return -ENOMEM;
-		}
-
-		pcc_inode_init(pcci);
-	} else if (atomic_read(&pcci->pcci_refcount) > 0) {
-		pcc_inode_unlock(inode);
-		return -EEXIST;
-	}
-	pcc_inode_unlock(inode);
+	rc = pcc_attach_allowed_check(inode);
+	if (rc)
+		return rc;
 
 	dataset = pcc_dataset_get(&ll_i2sbi(inode)->ll_pcc_super, 0,
 				  archive_id);
-	if (!dataset) {
-		rc = -ENOENT;
-		goto out_free_pcci;
-	}
+	if (!dataset)
+		return -ENOENT;
 
 	rc = __pcc_inode_create(dataset, &lli->lli_fid, &dentry);
 	if (rc)
@@ -932,73 +1284,117 @@ int pcc_readwrite_attach(struct file *file, struct inode *inode,
 	if (rc)
 		goto out_fput;
 
+	/* Pause to allow for a race with concurrent HSM remove */
+	OBD_FAIL_TIMEOUT(OBD_FAIL_LLITE_PCC_ATTACH_PAUSE, cfs_fail_val);
+
 	pcc_inode_lock(inode);
-	if (lli->lli_pcc_inode) {
-		rc = -EEXIST;
+	pcci = ll_i2pcci(inode);
+	LASSERT(!pcci);
+	pcci = kmem_cache_zalloc(pcc_inode_slab, GFP_NOFS);
+	if (!pcci) {
+		rc = -ENOMEM;
 		goto out_unlock;
 	}
+
+	pcc_inode_init(pcci, lli);
 	pcc_inode_attach_init(dataset, pcci, dentry, LU_PCC_READWRITE);
-	lli->lli_pcc_inode = pcci;
 out_unlock:
 	pcc_inode_unlock(inode);
 out_fput:
 	fput(pcc_filp);
 out_dentry:
-	if (rc)
+	if (rc) {
+		int rc2;
+
+		rc2 = vfs_unlink(dentry->d_parent->d_inode, dentry, NULL);
+		if (rc2)
+			CWARN("failed to unlink PCC file, rc = %d\n", rc2);
+
 		dput(dentry);
+	}
 out_dataset_put:
 	pcc_dataset_put(dataset);
-out_free_pcci:
-	if (rc)
-		kmem_cache_free(pcc_inode_slab, pcci);
 	return rc;
-
 }
 
 int pcc_readwrite_attach_fini(struct file *file, struct inode *inode,
-			      bool lease_broken, int rc, bool attached)
+			      u32 gen, bool lease_broken, int rc,
+			      bool attached)
 {
-	struct pcc_inode *pcci = ll_i2pcci(inode);
+	struct ll_inode_info *lli = ll_i2info(inode);
+	struct pcc_inode *pcci;
+	u32 gen2;
 
-	if ((rc || lease_broken) && attached && pcci)
-		pcc_inode_put(pcci);
+	pcc_inode_lock(inode);
+	pcci = ll_i2pcci(inode);
+	lli->lli_pcc_state &= ~PCC_STATE_FL_ATTACHING;
+	if ((rc || lease_broken)) {
+		if (attached && pcci)
+			pcc_inode_put(pcci);
+
+		goto out_unlock;
+	}
+
+	/* PCC inode may be released due to layout lock revocation */
+	if (!pcci) {
+		rc = -ESTALE;
+		goto out_unlock;
+	}
 
+	LASSERT(attached);
+	rc = ll_layout_refresh(inode, &gen2);
+	if (!rc) {
+		if (gen2 == gen) {
+			pcc_layout_gen_set(pcci, gen);
+		} else {
+			CDEBUG(D_CACHE,
+			       DFID" layout changed from %d to %d.\n",
+			       PFID(ll_inode2fid(inode)), gen, gen2);
+			rc = -ESTALE;
+			goto out_put;
+		}
+	}
+
+out_put:
+	if (rc) {
+		pcc_inode_remove(pcci);
+		pcc_inode_put(pcci);
+	}
+out_unlock:
+	pcc_inode_unlock(inode);
 	return rc;
 }
 
 int pcc_ioctl_detach(struct inode *inode)
 {
 	struct ll_inode_info *lli = ll_i2info(inode);
-	struct pcc_inode *pcci = lli->lli_pcc_inode;
+	struct pcc_inode *pcci;
 	int rc = 0;
-	int count;
 
 	pcc_inode_lock(inode);
-	if (!pcci)
-		goto out_unlock;
-
-	count = atomic_read(&pcci->pcci_refcount);
-	if (count > 1) {
-		rc = -EBUSY;
-		goto out_unlock;
-	} else if (count == 0)
+	pcci = lli->lli_pcc_inode;
+	if (!pcci || lli->lli_pcc_state & PCC_STATE_FL_ATTACHING ||
+	    !pcc_inode_has_layout(pcci))
 		goto out_unlock;
 
+	__pcc_layout_invalidate(pcci);
 	pcc_inode_put(pcci);
-	lli->lli_pcc_inode = NULL;
+
 out_unlock:
 	pcc_inode_unlock(inode);
-
 	return rc;
 }
 
-int pcc_ioctl_state(struct inode *inode, struct lu_pcc_state *state)
+int pcc_ioctl_state(struct file *file, struct inode *inode,
+		    struct lu_pcc_state *state)
 {
 	int rc = 0;
 	int count;
 	char *buf;
 	char *path;
 	int buf_len = sizeof(state->pccs_path);
+	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
+	struct pcc_file *pccf = &fd->fd_pcc_file;
 	struct pcc_inode *pcci;
 
 	if (buf_len <= 0)
@@ -1018,12 +1414,17 @@ int pcc_ioctl_state(struct inode *inode, struct lu_pcc_state *state)
 	count = atomic_read(&pcci->pcci_refcount);
 	if (count == 0) {
 		state->pccs_type = LU_PCC_NONE;
+		state->pccs_open_count = 0;
 		goto out_unlock;
 	}
+
+	if (pcc_inode_has_layout(pcci))
+		count--;
+	if (pccf->pccf_file)
+		count--;
 	state->pccs_type = pcci->pcci_type;
-	state->pccs_open_count = count - 1;
-	state->pccs_flags = pcci->pcci_attr_valid ?
-			    PCC_STATE_FLAG_ATTR_VALID : 0;
+	state->pccs_open_count = count;
+	state->pccs_flags = ll_i2info(inode)->lli_pcc_state;
 	path = dentry_path_raw(pcci->pcci_path.dentry, buf, buf_len);
 	if (IS_ERR(path)) {
 		rc = PTR_ERR(path);
diff --git a/fs/lustre/llite/pcc.h b/fs/lustre/llite/pcc.h
index 0f960b9..1a73dbb 100644
--- a/fs/lustre/llite/pcc.h
+++ b/fs/lustre/llite/pcc.h
@@ -36,6 +36,7 @@
 #include <linux/types.h>
 #include <linux/fs.h>
 #include <linux/seq_file.h>
+#include <linux/mm.h>
 #include <uapi/linux/lustre/lustre_user.h>
 
 extern struct kmem_cache *pcc_inode_slab;
@@ -57,17 +58,27 @@ struct pcc_super {
 };
 
 struct pcc_inode {
+	struct ll_inode_info	*pcci_lli;
 	/* Cache path on local file system */
-	struct path			 pcci_path;
+	struct path		 pcci_path;
 	/*
 	 * If reference count is 0, then the cache is not inited, if 1, then
 	 * no one is using it.
 	 */
-	atomic_t			 pcci_refcount;
+	atomic_t		 pcci_refcount;
 	/* Whether readonly or readwrite PCC */
-	enum lu_pcc_type		 pcci_type;
-	/* Whether the inode is cached locally */
-	bool				 pcci_attr_valid;
+	enum lu_pcc_type	 pcci_type;
+	/* Whether the inode attr is cached locally */
+	bool			 pcci_attr_valid;
+	/* Layout generation */
+	u32			 pcci_layout_gen;
+	/*
+	 * How many IOs are ongoing on this cached object. Layout can be
+	 * changed only if there is no active IO.
+	 */
+	atomic_t		 pcci_active_ios;
+	/* Waitq - wait for PCC I/O completion. */
+	wait_queue_head_t	 pcci_waitq;
 };
 
 struct pcc_file {
@@ -101,14 +112,15 @@ struct pcc_cmd {
 void pcc_super_fini(struct pcc_super *super);
 int pcc_cmd_handle(char *buffer, unsigned long count,
 		   struct pcc_super *super);
-int
-pcc_super_dump(struct pcc_super *super, struct seq_file *m);
-int pcc_readwrite_attach(struct file *file,
-			 struct inode *inode, u32 arch_id);
+int pcc_super_dump(struct pcc_super *super, struct seq_file *m);
+int pcc_readwrite_attach(struct file *file, struct inode *inode,
+			 u32 arch_id);
 int pcc_readwrite_attach_fini(struct file *file, struct inode *inode,
-			      bool lease_broken, int rc, bool attached);
+			      u32 gen, bool lease_broken, int rc,
+			      bool attached);
 int pcc_ioctl_detach(struct inode *inode);
-int pcc_ioctl_state(struct inode *inode, struct lu_pcc_state *state);
+int pcc_ioctl_state(struct file *file, struct inode *inode,
+		    struct lu_pcc_state *state);
 void pcc_file_init(struct pcc_file *pccf);
 int pcc_file_open(struct inode *inode, struct file *file);
 void pcc_file_release(struct inode *inode, struct file *file);
@@ -118,12 +130,25 @@ ssize_t pcc_file_write_iter(struct kiocb *iocb, struct iov_iter *iter,
 			    bool *cached);
 int pcc_inode_getattr(struct inode *inode, bool *cached);
 int pcc_inode_setattr(struct inode *inode, struct iattr *attr, bool *cached);
+ssize_t pcc_file_splice_read(struct file *in_file, loff_t *ppos,
+			     struct pipe_inode_info *pipe, size_t count,
+			     unsigned int flags, bool *cached);
+int pcc_fsync(struct file *file, loff_t start, loff_t end,
+	      int datasync, bool *cached);
+int pcc_file_mmap(struct file *file, struct vm_area_struct *vma, bool *cached);
+void pcc_vm_open(struct vm_area_struct *vma);
+void pcc_vm_close(struct vm_area_struct *vma);
+int pcc_fault(struct vm_area_struct *mva, struct vm_fault *vmf, bool *cached);
+int pcc_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
+		     bool *cached);
 int pcc_inode_create(struct pcc_dataset *dataset, struct lu_fid *fid,
 		     struct dentry **pcc_dentry);
 int pcc_inode_create_fini(struct pcc_dataset *dataset, struct inode *inode,
 			  struct dentry *pcc_dentry);
-struct pcc_dataset *
-pcc_dataset_get(struct pcc_super *super, u32 projid, u32 archive_id);
+struct pcc_dataset *pcc_dataset_get(struct pcc_super *super, u32 projid,
+				    u32 archive_id);
 void pcc_dataset_put(struct pcc_dataset *dataset);
 void pcc_inode_free(struct inode *inode);
+void pcc_layout_invalidate(struct inode *inode);
+
 #endif /* LLITE_PCC_H */
diff --git a/fs/lustre/llite/vvp_object.c b/fs/lustre/llite/vvp_object.c
index eeb8823..b5ae7ad 100644
--- a/fs/lustre/llite/vvp_object.c
+++ b/fs/lustre/llite/vvp_object.c
@@ -146,7 +146,8 @@ static int vvp_conf_set(const struct lu_env *env, struct cl_object *obj,
 		 * a price themselves.
 		 */
 		unmap_mapping_range(conf->coc_inode->i_mapping,
-				    0, OBD_OBJECT_EOF, 0);
+				    0, OBD_OBJECT_EOF, 1);
+		pcc_layout_invalidate(conf->coc_inode);
 	}
 
 	return 0;
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 2b12612..b024a44 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -357,7 +357,8 @@ struct ll_ioc_lease_id {
 #define LL_IOC_LADVISE			_IOR('f', 250, struct llapi_lu_ladvise)
 #define LL_IOC_HEAT_GET			_IOWR('f', 251, struct lu_heat)
 #define LL_IOC_HEAT_SET			_IOW('f', 251, __u64)
-#define LL_IOC_PCC_DETACH		_IOW('f', 252, struct lu_pcc_detach)
+#define LL_IOC_PCC_DETACH		_IO('f', 252)
+#define LL_IOC_PCC_DETACH_BY_FID	_IOW('f', 252, struct lu_pcc_detach)
 #define LL_IOC_PCC_STATE		_IOR('f', 252, struct lu_pcc_state)
 
 #define LL_STATFS_LMV		1
@@ -2098,8 +2099,11 @@ struct lu_pcc_detach {
 };
 
 enum lu_pcc_state_flags {
-	/* Whether the inode attr is cached locally */
-	PCC_STATE_FLAG_ATTR_VALID	= 0x1,
+	PCC_STATE_FL_NONE		= 0x0,
+	/* The inode attr is cached locally */
+	PCC_STATE_FL_ATTR_VALID		= 0x01,
+	/* The file is being attached into PCC */
+	PCC_STATE_FL_ATTACHING		= 0x02,
 };
 
 struct lu_pcc_state {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 360/622] lustre: pcc: security and permission for non-root user access
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (358 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 359/622] lustre: pcc: Non-blocking PCC caching James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 361/622] lustre: llite: Rule based auto PCC caching when create files James Simmons
                   ` (262 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Qian Yingjin <qian@ddn.com>

For current PCC, if a file is left in the PCC cache, it may be
accessible to other jobs/users who would not normally be able to
access it. (That is, they can access it directly on the PCC mount
via FID, since the local PCC mount is basically just a normal
local file system.)

This patch solves the problem by restricting access on the PCC
side and relying only on the Lustre-side permissions for opening
a file. PCC files on the local mount are created with a minimal
(zero) set of permissions. Then, when accessing a PCC-cached
file, the permission check is done on the Lustre file and skipped
on the PCC file. This renders the PCC files inaccessible except
to root or via Lustre.

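The bracketing discipline this implies (every VFS operation on the
mode-0 PCC copy runs under the superblock's saved credentials, then
the caller's credentials are restored) can be sketched in userspace
C. The struct, globals, and *_sim names below are simplified
stand-ins for the kernel's prepare_creds()/override_creds()/
revert_creds() from linux/cred.h, not the real API:

```c
#include <assert.h>

/* Simplified stand-in for the kernel credential structure; the real
 * code keeps a prepared cred in pcc_super::pccs_cred from mount time. */
struct cred {
	int uid;
};

/* Stand-in for the per-task "current credentials" pointer. */
static const struct cred *current_cred_sim;

/* Like override_creds(): install new creds, hand back the old ones. */
static const struct cred *override_creds_sim(const struct cred *new_cred)
{
	const struct cred *old = current_cred_sim;

	current_cred_sim = new_cred;
	return old;
}

/* Like revert_creds(): restore the saved credentials. */
static void revert_creds_sim(const struct cred *old_cred)
{
	current_cred_sim = old_cred;
}

/* Every access to the zero-permission PCC file is bracketed the same
 * way: override to the mount-time creds, do the VFS call, revert. */
static int pcc_backend_uid(const struct cred *pccs_cred)
{
	const struct cred *old = override_creds_sim(pccs_cred);
	int uid = current_cred_sim->uid;	/* would be vfs_getattr() etc. */

	revert_creds_sim(old);
	return uid;
}
```

The pairing is why every error path in the patch carries a matching
revert_creds() call before returning.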
WC-bug-id: https://jira.whamcloud.com/browse/LU-10092
Lustre-commit: 2102c86e0d0a ("LU-10092 pcc: security and permission for non-root user access")
Signed-off-by: Qian Yingjin <qian@ddn.com>
Reviewed-on: https://review.whamcloud.com/34637
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c      |   3 +-
 fs/lustre/llite/llite_lib.c |  23 +++++++---
 fs/lustre/llite/namei.c     |   2 +-
 fs/lustre/llite/pcc.c       | 103 ++++++++++++++++++++++++++++++++++++++------
 fs/lustre/llite/pcc.h       |  16 ++++---
 5 files changed, 120 insertions(+), 27 deletions(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 5a52cad..96311ad 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -860,6 +860,7 @@ int ll_file_open(struct inode *inode, struct file *file)
 		if (rc)
 			goto out_och_free;
 	}
+
 	rc = pcc_file_open(inode, file);
 	if (rc)
 		goto out_och_free;
@@ -1787,7 +1788,7 @@ static ssize_t ll_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	 * from PCC cache automatically.
 	 */
 	result = pcc_file_write_iter(iocb, from, &cached);
-	if (cached && result != -ENOSPC)
+	if (cached && result != -ENOSPC && result != -EDQUOT)
 		return result;
 
 	/* NB: we can't do direct IO for tiny writes because they use the page
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 1b22062..5ac083c 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -71,11 +71,16 @@ static struct ll_sb_info *ll_init_sbi(void)
 	unsigned long pages;
 	unsigned long lru_page_max;
 	struct sysinfo si;
+	int rc;
 	int i;
 
 	sbi = kzalloc(sizeof(*sbi), GFP_NOFS);
 	if (!sbi)
-		return NULL;
+		return ERR_PTR(-ENOMEM);
+
+	rc = pcc_super_init(&sbi->ll_pcc_super);
+	if (rc < 0)
+		goto out_sbi;
 
 	spin_lock_init(&sbi->ll_lock);
 	mutex_init(&sbi->ll_lco.lco_lock);
@@ -89,8 +94,8 @@ static struct ll_sb_info *ll_init_sbi(void)
 
 	sbi->ll_cache = cl_cache_init(lru_page_max);
 	if (!sbi->ll_cache) {
-		kfree(sbi);
-		return NULL;
+		rc = -ENOMEM;
+		goto out_pcc;
 	}
 
 	sbi->ll_ra_info.ra_max_pages_per_file = min(pages / 32,
@@ -128,12 +133,16 @@ static struct ll_sb_info *ll_init_sbi(void)
 	sbi->ll_squash.rsi_gid = 0;
 	INIT_LIST_HEAD(&sbi->ll_squash.rsi_nosquash_nids);
 	spin_lock_init(&sbi->ll_squash.rsi_lock);
-	pcc_super_init(&sbi->ll_pcc_super);
 
 	/* Per-filesystem file heat */
 	sbi->ll_heat_decay_weight = SBI_DEFAULT_HEAT_DECAY_WEIGHT;
 	sbi->ll_heat_period_second = SBI_DEFAULT_HEAT_PERIOD_SECOND;
 	return sbi;
+out_pcc:
+	pcc_super_fini(&sbi->ll_pcc_super);
+out_sbi:
+	kfree(sbi);
+	return ERR_PTR(rc);
 }
 
 static void ll_free_sbi(struct super_block *sb)
@@ -990,8 +999,8 @@ int ll_fill_super(struct super_block *sb)
 	/* client additional sb info */
 	sbi = ll_init_sbi();
 	lsi->lsi_llsbi = sbi;
-	if (!sbi) {
-		err = -ENOMEM;
+	if (IS_ERR(sbi)) {
+		err = PTR_ERR(sbi);
 		goto out_free;
 	}
 
@@ -1120,7 +1129,7 @@ void ll_put_super(struct super_block *sb)
 	int next, force = 1, rc = 0;
 	long ccc_count;
 
-	if (!sbi)
+	if (IS_ERR(sbi))
 		goto out_no_sbi;
 
 	CDEBUG(D_VFSTRACE, "VFS Op: sb %p - %s\n", sb, profilenm);
diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index d10decb..10c0cef 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -835,7 +835,7 @@ static struct dentry *ll_lookup_it(struct inode *parent, struct dentry *dentry,
 			goto out;
 		}
 
-		rc = pcc_inode_create(dataset, &op_data->op_fid2,
+		rc = pcc_inode_create(parent->i_sb, dataset, &op_data->op_fid2,
 				      &pca->pca_dentry);
 		if (rc) {
 			retval = ERR_PTR(rc);
diff --git a/fs/lustre/llite/pcc.c b/fs/lustre/llite/pcc.c
index 8440647..fa81b55 100644
--- a/fs/lustre/llite/pcc.c
+++ b/fs/lustre/llite/pcc.c
@@ -113,10 +113,20 @@
 
 struct kmem_cache *pcc_inode_slab;
 
-void pcc_super_init(struct pcc_super *super)
+int pcc_super_init(struct pcc_super *super)
 {
+	struct cred *cred;
+
+	super->pccs_cred = cred = prepare_creds();
+	if (!cred)
+		return -ENOMEM;
+
+	/* Never override disk quota limits or use reserved space */
+	cap_lower(cred->cap_effective, CAP_SYS_RESOURCE);
 	spin_lock_init(&super->pccs_lock);
 	INIT_LIST_HEAD(&super->pccs_datasets);
+
+	return 0;
 }
 
 /**
@@ -251,7 +261,7 @@ struct pcc_dataset *
 	return 0;
 }
 
-void pcc_super_fini(struct pcc_super *super)
+static void pcc_remove_datasets(struct pcc_super *super)
 {
 	struct pcc_dataset *dataset, *tmp;
 
@@ -262,6 +272,12 @@ void pcc_super_fini(struct pcc_super *super)
 	}
 }
 
+void pcc_super_fini(struct pcc_super *super)
+{
+	pcc_remove_datasets(super);
+	put_cred(super->pccs_cred);
+}
+
 static bool pathname_is_valid(const char *pathname)
 {
 	/* Needs to be absolute path */
@@ -380,7 +396,7 @@ int pcc_cmd_handle(char *buffer, unsigned long count,
 		rc = pcc_dataset_del(super, cmd->pccc_pathname);
 		break;
 	case PCC_CLEAR_ALL:
-		pcc_super_fini(super);
+		pcc_remove_datasets(super);
 		break;
 	default:
 		rc = -EINVAL;
@@ -463,6 +479,11 @@ static int pcc_fid2dataset_path(char *buf, int sz, struct lu_fid *fid)
 			PFID(fid));
 }
 
+static inline const struct cred *pcc_super_cred(struct super_block *sb)
+{
+	return ll_s2sbi(sb)->ll_pcc_super.pccs_cred;
+}
+
 void pcc_file_init(struct pcc_file *pccf)
 {
 	pccf->pccf_file = NULL;
@@ -503,7 +524,9 @@ int pcc_file_open(struct inode *inode, struct file *file)
 	dname = &path->dentry->d_name;
 	CDEBUG(D_CACHE, "opening pcc file '%.*s'\n", dname->len,
 	       dname->name);
-	pcc_file = dentry_open(path, file->f_flags, current_cred());
+
+	pcc_file = dentry_open(path, file->f_flags,
+			       pcc_super_cred(inode->i_sb));
 	if (IS_ERR_OR_NULL(pcc_file)) {
 		rc = pcc_file ? PTR_ERR(pcc_file) : -EINVAL;
 		pcc_inode_put(pcci);
@@ -652,6 +675,7 @@ int pcc_inode_setattr(struct inode *inode, struct iattr *attr,
 		      bool *cached)
 {
 	int rc = 0;
+	const struct cred *old_cred;
 	struct iattr attr2 = *attr;
 	struct dentry *pcc_dentry;
 	struct pcc_inode *pcci;
@@ -667,11 +691,13 @@ int pcc_inode_setattr(struct inode *inode, struct iattr *attr,
 
 	attr2.ia_valid = attr->ia_valid & (ATTR_SIZE | ATTR_ATIME |
 			 ATTR_ATIME_SET | ATTR_MTIME | ATTR_MTIME_SET |
-			 ATTR_CTIME);
+			 ATTR_CTIME | ATTR_UID | ATTR_GID);
 	pcci = ll_i2pcci(inode);
 	pcc_dentry = pcci->pcci_path.dentry;
 	inode_lock(pcc_dentry->d_inode);
+	old_cred = override_creds(pcc_super_cred(inode->i_sb));
 	rc = pcc_dentry->d_inode->i_op->setattr(pcc_dentry, &attr2);
+	revert_creds(old_cred);
 	inode_unlock(pcc_dentry->d_inode);
 
 	pcc_io_fini(inode);
@@ -681,6 +707,7 @@ int pcc_inode_setattr(struct inode *inode, struct iattr *attr,
 int pcc_inode_getattr(struct inode *inode, bool *cached)
 {
 	struct ll_inode_info *lli = ll_i2info(inode);
+	const struct cred *old_cred;
 	struct kstat stat;
 	s64 atime;
 	s64 mtime;
@@ -696,8 +723,10 @@ int pcc_inode_getattr(struct inode *inode, bool *cached)
 	if (!*cached)
 		return 0;
 
+	old_cred = override_creds(pcc_super_cred(inode->i_sb));
 	rc = vfs_getattr(&ll_i2pcci(inode)->pcci_path, &stat,
 			 STATX_BASIC_STATS, AT_STATX_SYNC_AS_STAT);
+	revert_creds(old_cred);
 	if (rc)
 		goto out;
 
@@ -1113,14 +1142,14 @@ static int __pcc_inode_create(struct pcc_dataset *dataset,
 
 	pcc_fid2dataset_path(path, MAX_PCC_DATABASE_PATH, fid);
 
-	base = pcc_mkdir_p(dataset->pccd_path.dentry, path, 0700);
+	base = pcc_mkdir_p(dataset->pccd_path.dentry, path, 0);
 	if (IS_ERR(base)) {
 		rc = PTR_ERR(base);
 		goto out;
 	}
 
 	snprintf(path, MAX_PCC_DATABASE_PATH, DFID_NOBRACE, PFID(fid));
-	child = pcc_create(base, path, 0600);
+	child = pcc_create(base, path, 0);
 	if (IS_ERR(child)) {
 		rc = PTR_ERR(child);
 		goto out_base;
@@ -1134,18 +1163,44 @@ static int __pcc_inode_create(struct pcc_dataset *dataset,
 	return rc;
 }
 
-int pcc_inode_create(struct pcc_dataset *dataset, struct lu_fid *fid,
-		     struct dentry **pcc_dentry)
+/* TODO: Set the project ID for PCC copy */
+int pcc_inode_store_ugpid(struct dentry *dentry, kuid_t uid, kgid_t gid)
+{
+	struct inode *inode = dentry->d_inode;
+	struct iattr attr;
+	int rc;
+
+	attr.ia_valid = ATTR_UID | ATTR_GID;
+	attr.ia_uid = uid;
+	attr.ia_gid = gid;
+
+	inode_lock(inode);
+	rc = notify_change(dentry, &attr, NULL);
+	inode_unlock(inode);
+
+	return rc;
+}
+
+int pcc_inode_create(struct super_block *sb, struct pcc_dataset *dataset,
+		     struct lu_fid *fid, struct dentry **pcc_dentry)
 {
-	return __pcc_inode_create(dataset, fid, pcc_dentry);
+	const struct cred *old_cred;
+	int rc;
+
+	old_cred = override_creds(pcc_super_cred(sb));
+	rc = __pcc_inode_create(dataset, fid, pcc_dentry);
+	revert_creds(old_cred);
+	return rc;
 }
 
 int pcc_inode_create_fini(struct pcc_dataset *dataset, struct inode *inode,
 			  struct dentry *pcc_dentry)
 {
+	const struct cred *old_cred;
 	struct pcc_inode *pcci;
 	int rc = 0;
 
+	old_cred = override_creds(pcc_super_cred(inode->i_sb));
 	pcc_inode_lock(inode);
 	LASSERT(!ll_i2pcci(inode));
 	pcci = kmem_cache_zalloc(pcc_inode_slab, GFP_NOFS);
@@ -1154,6 +1209,11 @@ int pcc_inode_create_fini(struct pcc_dataset *dataset, struct inode *inode,
 		goto out_unlock;
 	}
 
+	rc = pcc_inode_store_ugpid(pcc_dentry, old_cred->suid,
+				   old_cred->sgid);
+	if (rc)
+		goto out_unlock;
+
 	pcc_inode_init(pcci, ll_i2info(inode));
 	pcc_inode_attach_init(dataset, pcci, pcc_dentry, LU_PCC_READWRITE);
 	/* Set the layout generation of newly created file with 0 */
@@ -1172,6 +1232,10 @@ int pcc_inode_create_fini(struct pcc_dataset *dataset, struct inode *inode,
 	}
 
 	pcc_inode_unlock(inode);
+	revert_creds(old_cred);
+	if (rc)
+		kmem_cache_free(pcc_inode_slab, pcci);
+
 	return rc;
 }
 
@@ -1253,6 +1317,7 @@ int pcc_readwrite_attach(struct file *file, struct inode *inode,
 	struct pcc_dataset *dataset;
 	struct ll_inode_info *lli = ll_i2info(inode);
 	struct pcc_inode *pcci;
+	const struct cred *old_cred;
 	struct dentry *dentry;
 	struct file *pcc_filp;
 	struct path path;
@@ -1267,9 +1332,12 @@ int pcc_readwrite_attach(struct file *file, struct inode *inode,
 	if (!dataset)
 		return -ENOENT;
 
+	old_cred = override_creds(pcc_super_cred(inode->i_sb));
 	rc = __pcc_inode_create(dataset, &lli->lli_fid, &dentry);
-	if (rc)
+	if (rc) {
+		revert_creds(old_cred);
 		goto out_dataset_put;
+	}
 
 	path.mnt = dataset->pccd_path.mnt;
 	path.dentry = dentry;
@@ -1277,9 +1345,15 @@ int pcc_readwrite_attach(struct file *file, struct inode *inode,
 			       current_cred());
 	if (IS_ERR_OR_NULL(pcc_filp)) {
 		rc = pcc_filp ? PTR_ERR(pcc_filp) : -EINVAL;
+		revert_creds(old_cred);
 		goto out_dentry;
 	}
 
+	rc = pcc_inode_store_ugpid(dentry, old_cred->uid, old_cred->gid);
+	revert_creds(old_cred);
+	if (rc)
+		goto out_fput;
+
 	rc = pcc_copy_data(file, pcc_filp);
 	if (rc)
 		goto out_fput;
@@ -1306,7 +1380,9 @@ int pcc_readwrite_attach(struct file *file, struct inode *inode,
 	if (rc) {
 		int rc2;
 
+		old_cred = override_creds(pcc_super_cred(inode->i_sb));
 		rc2 = vfs_unlink(dentry->d_parent->d_inode, dentry, NULL);
+		revert_creds(old_cred);
 		if (rc2)
 			CWARN("failed to unlink PCC file, rc = %d\n", rc2);
 
@@ -1322,13 +1398,14 @@ int pcc_readwrite_attach_fini(struct file *file, struct inode *inode,
 			      bool attached)
 {
 	struct ll_inode_info *lli = ll_i2info(inode);
+	const struct cred *old_cred;
 	struct pcc_inode *pcci;
 	u32 gen2;
 
 	pcc_inode_lock(inode);
 	pcci = ll_i2pcci(inode);
 	lli->lli_pcc_state &= ~PCC_STATE_FL_ATTACHING;
-	if ((rc || lease_broken)) {
+	if (rc || lease_broken) {
 		if (attached && pcci)
 			pcc_inode_put(pcci);
 
@@ -1357,7 +1434,9 @@ int pcc_readwrite_attach_fini(struct file *file, struct inode *inode,
 
 out_put:
 	if (rc) {
+		old_cred = override_creds(pcc_super_cred(inode->i_sb));
 		pcc_inode_remove(pcci);
+		revert_creds(old_cred);
 		pcc_inode_put(pcci);
 	}
 out_unlock:
diff --git a/fs/lustre/llite/pcc.h b/fs/lustre/llite/pcc.h
index 1a73dbb..54492c9 100644
--- a/fs/lustre/llite/pcc.h
+++ b/fs/lustre/llite/pcc.h
@@ -53,8 +53,12 @@ struct pcc_dataset {
 };
 
 struct pcc_super {
-	spinlock_t		pccs_lock;	/* Protect pccs_datasets */
-	struct list_head	pccs_datasets;	/* List of datasets */
+	/* Protect pccs_datasets */
+	spinlock_t		 pccs_lock;
+	/* List of datasets */
+	struct list_head	 pccs_datasets;
+	/* creds of process who forced instantiation of super block */
+	const struct cred	*pccs_cred;
 };
 
 struct pcc_inode {
@@ -108,7 +112,7 @@ struct pcc_cmd {
 	} u;
 };
 
-void pcc_super_init(struct pcc_super *super);
+int pcc_super_init(struct pcc_super *super);
 void pcc_super_fini(struct pcc_super *super);
 int pcc_cmd_handle(char *buffer, unsigned long count,
 		   struct pcc_super *super);
@@ -141,10 +145,10 @@ int pcc_fsync(struct file *file, loff_t start, loff_t end,
 int pcc_fault(struct vm_area_struct *mva, struct vm_fault *vmf, bool *cached);
 int pcc_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 		     bool *cached);
-int pcc_inode_create(struct pcc_dataset *dataset, struct lu_fid *fid,
-		     struct dentry **pcc_dentry);
+int pcc_inode_create(struct super_block *sb, struct pcc_dataset *dataset,
+		     struct lu_fid *fid, struct dentry **pcc_dentry);
 int pcc_inode_create_fini(struct pcc_dataset *dataset, struct inode *inode,
-			  struct dentry *pcc_dentry);
+			   struct dentry *pcc_dentry);
 struct pcc_dataset *pcc_dataset_get(struct pcc_super *super, u32 projid,
 				    u32 archive_id);
 void pcc_dataset_put(struct pcc_dataset *dataset);
-- 
1.8.3.1


* [lustre-devel] [PATCH 361/622] lustre: llite: Rule based auto PCC caching when create files
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (359 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 360/622] lustre: pcc: security and permission for non-root user access James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 362/622] lustre: pcc: auto attach during open for valid cache James Simmons
                   ` (261 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Qian Yingjin <qian@ddn.com>

Configurable rule-based auto PCC caching for newly created files
can significantly benefit users of readwrite PCC. It determines
which files can use a PCC cache directly, without any admission
control, for high-priority users/groups/projects or for filenames
with wildcard support. Meanwhile, a quota limit on capacity usage
can be enforced for each user/group/project to provide caching
isolation.

Similar to the NRS TBF command line, it supports logical
conjunction and disjunction operations among different
user/group/project conditions and filenames with wildcard
support.

The command line to add this kind of rule is as follows:
lctl pcc add /mnt/lustre /mnt/pcc
        "projid={500 1000}&fname={*.h5},uid={1001} rwid=1 roid=1"
It means that files whose project ID is 500 or 1000 AND whose
filename suffix is "h5", OR whose user ID is 1001, are auto
cached on PCC when newly created on the client. "rwid" is the
RW-PCC attach ID (usually the archive ID); "roid" is the RO-PCC
attach ID. By default, the RO-PCC attach ID is set to the same
value as the RW-PCC attach ID for a shared PCC backend.

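The conjunction/disjunction semantics above can be sketched as a
small userspace C model. The types, field names, and the use of
fnmatch(3) here are illustrative assumptions, not the actual
parser/matcher in fs/lustre/llite/pcc.c:

```c
#include <assert.h>
#include <fnmatch.h>
#include <stdio.h>

/* Hypothetical, simplified model of how a rule such as
 *   "projid={500 1000}&fname={*.h5},uid={1001}"
 * is evaluated: a disjunction (',') of conjunctions ('&'), where each
 * expression matches if ANY value in its {...} list matches. */

enum field { F_UID, F_PROJID, F_FNAME };

struct expr {
	enum field	 field;
	const char	*vals[4];	/* value list from "{...}" */
	int		 nvals;
};

struct item {
	unsigned int	 uid;
	unsigned int	 projid;
	const char	*fname;
};

static int expr_match(const struct expr *e, const struct item *it)
{
	char buf[16];
	const char *subject = it->fname;
	int i;

	if (e->field != F_FNAME) {
		snprintf(buf, sizeof(buf), "%u",
			 e->field == F_UID ? it->uid : it->projid);
		subject = buf;
	}
	for (i = 0; i < e->nvals; i++)
		/* fnmatch() supplies the "*.h5"-style wildcard matching */
		if (fnmatch(e->vals[i], subject, 0) == 0)
			return 1;
	return 0;
}

/* A conjunction matches only if every expression in it matches; the
 * rule as a whole matches if any one of its conjunctions does. */
static int conjunction_match(const struct expr *exprs, int n,
			     const struct item *it)
{
	int i;

	for (i = 0; i < n; i++)
		if (!expr_match(&exprs[i], it))
			return 0;
	return 1;
}
```

So for the example rule, a file created by uid 1001 matches via the
second conjunction alone, while other users must satisfy both the
projid list and the "*.h5" filename pattern.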
WC-bug-id: https://jira.whamcloud.com/browse/LU-10918
Lustre-commit: 4fbae1352947 ("LU-10918 llite: Rule based auto PCC caching when create files")
Signed-off-by: Qian Yingjin <qian@ddn.com>
Reviewed-on: https://review.whamcloud.com/34751
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/namei.c |  13 +-
 fs/lustre/llite/pcc.c   | 637 ++++++++++++++++++++++++++++++++++++++++++++----
 fs/lustre/llite/pcc.h   |  67 ++++-
 3 files changed, 659 insertions(+), 58 deletions(-)

diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index 10c0cef..49433c9 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -826,7 +826,7 @@ static struct dentry *ll_lookup_it(struct inode *parent, struct dentry *dentry,
 		lum->lmm_pattern = LOV_PATTERN_F_RELEASED | LOV_PATTERN_RAID0;
 		op_data->op_data = lum;
 		op_data->op_data_size = sizeof(*lum);
-		op_data->op_archive_id = dataset->pccd_id;
+		op_data->op_archive_id = dataset->pccd_rwid;
 
 		rc = obd_fid_alloc(NULL, ll_i2mdexp(parent), &op_data->op_fid2,
 				   op_data);
@@ -1002,9 +1002,14 @@ static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
 		/* Volatile file is used for HSM restore, so do not use PCC */
 		if (!filename_is_volatile(dentry->d_name.name,
 					  dentry->d_name.len, NULL)) {
-			dataset = pcc_dataset_get(&sbi->ll_pcc_super,
-						  ll_i2info(dir)->lli_projid,
-						  0);
+			struct pcc_matcher item;
+
+			item.pm_uid = from_kuid(&init_user_ns, current_uid());
+			item.pm_gid = from_kgid(&init_user_ns, current_gid());
+			item.pm_projid = ll_i2info(dir)->lli_projid;
+			item.pm_name = &dentry->d_name;
+			dataset = pcc_dataset_match_get(&sbi->ll_pcc_super,
+							&item);
 			pca.pca_dataset = dataset;
 		}
 	}
diff --git a/fs/lustre/llite/pcc.c b/fs/lustre/llite/pcc.c
index fa81b55..469ff6c 100644
--- a/fs/lustre/llite/pcc.c
+++ b/fs/lustre/llite/pcc.c
@@ -109,6 +109,7 @@
 #include <linux/namei.h>
 #include <linux/file.h>
 #include <linux/mount.h>
+#include <linux/libcfs/libcfs_string.h>
 #include "llite_internal.h"
 
 struct kmem_cache *pcc_inode_slab;
@@ -129,23 +130,550 @@ int pcc_super_init(struct pcc_super *super)
 	return 0;
 }
 
+/* Rule based auto caching */
+static void pcc_id_list_free(struct list_head *id_list)
+{
+	struct pcc_match_id *id, *n;
+
+	list_for_each_entry_safe(id, n, id_list, pmi_linkage) {
+		list_del_init(&id->pmi_linkage);
+		kfree(id);
+	}
+}
+
+static void pcc_fname_list_free(struct list_head *fname_list)
+{
+	struct pcc_match_fname *fname, *n;
+
+	list_for_each_entry_safe(fname, n, fname_list, pmf_linkage) {
+		kfree(fname->pmf_name);
+		list_del_init(&fname->pmf_linkage);
+		kfree(fname);
+	}
+}
+
+static void pcc_expression_free(struct pcc_expression *expr)
+{
+	LASSERT(expr->pe_field >= PCC_FIELD_UID &&
+		expr->pe_field < PCC_FIELD_MAX);
+	switch (expr->pe_field) {
+	case PCC_FIELD_UID:
+	case PCC_FIELD_GID:
+	case PCC_FIELD_PROJID:
+		pcc_id_list_free(&expr->pe_cond);
+		break;
+	case PCC_FIELD_FNAME:
+		pcc_fname_list_free(&expr->pe_cond);
+		break;
+	default:
+		LBUG();
+	}
+	kfree(expr);
+}
+
+static void pcc_conjunction_free(struct pcc_conjunction *conjunction)
+{
+	struct pcc_expression *expression, *n;
+
+	LASSERT(list_empty(&conjunction->pc_linkage));
+	list_for_each_entry_safe(expression, n,
+				 &conjunction->pc_expressions,
+				 pe_linkage) {
+		list_del_init(&expression->pe_linkage);
+		pcc_expression_free(expression);
+	}
+	kfree(conjunction);
+}
+
+static void pcc_rule_conds_free(struct list_head *cond_list)
+{
+	struct pcc_conjunction *conjunction, *n;
+
+	list_for_each_entry_safe(conjunction, n, cond_list, pc_linkage) {
+		list_del_init(&conjunction->pc_linkage);
+		pcc_conjunction_free(conjunction);
+	}
+}
+
+static void pcc_cmd_fini(struct pcc_cmd *cmd)
+{
+	if (cmd->pccc_cmd == PCC_ADD_DATASET) {
+		if (!list_empty(&cmd->u.pccc_add.pccc_conds))
+			pcc_rule_conds_free(&cmd->u.pccc_add.pccc_conds);
+		kfree(cmd->u.pccc_add.pccc_conds_str);
+	}
+}
+
+#define PCC_DISJUNCTION_DELIM	(',')
+#define PCC_CONJUNCTION_DELIM	('&')
+#define PCC_EXPRESSION_DELIM	('=')
+
+static int
+pcc_fname_list_add(struct cfs_lstr *id, struct list_head *fname_list)
+{
+	struct pcc_match_fname *fname;
+
+	fname = kzalloc(sizeof(*fname), GFP_KERNEL);
+	if (!fname)
+		return -ENOMEM;
+
+	fname->pmf_name = kzalloc(id->ls_len + 1, GFP_KERNEL);
+	if (!fname->pmf_name) {
+		kfree(fname);
+		return -ENOMEM;
+	}
+
+	memcpy(fname->pmf_name, id->ls_str, id->ls_len);
+	list_add_tail(&fname->pmf_linkage, fname_list);
+	return 0;
+}
+
+static int
+pcc_fname_list_parse(char *str, int len, struct list_head *fname_list)
+{
+	struct cfs_lstr src;
+	struct cfs_lstr res;
+	int rc = 0;
+
+	src.ls_str = str;
+	src.ls_len = len;
+	INIT_LIST_HEAD(fname_list);
+	while (src.ls_str) {
+		rc = cfs_gettok(&src, ' ', &res);
+		if (rc == 0) {
+			rc = -EINVAL;
+			break;
+		}
+		rc = pcc_fname_list_add(&res, fname_list);
+		if (rc)
+			break;
+	}
+	if (rc)
+		pcc_fname_list_free(fname_list);
+	return rc;
+}
+
+static int
+pcc_id_list_parse(char *str, int len, struct list_head *id_list,
+		  enum pcc_field type)
+{
+	struct cfs_lstr src;
+	struct cfs_lstr res;
+	int rc = 0;
+
+	if (type != PCC_FIELD_UID && type != PCC_FIELD_GID &&
+	    type != PCC_FIELD_PROJID)
+		return -EINVAL;
+
+	src.ls_str = str;
+	src.ls_len = len;
+	INIT_LIST_HEAD(id_list);
+	while (src.ls_str) {
+		struct pcc_match_id *id;
+		u32 id_val;
+
+		if (cfs_gettok(&src, ' ', &res) == 0) {
+			rc = -EINVAL;
+			goto out;
+		}
+
+		if (!cfs_str2num_check(res.ls_str, res.ls_len,
+				       &id_val, 0, (u32)~0U)) {
+			rc = -EINVAL;
+			goto out;
+		}
+
+		id = kzalloc(sizeof(*id), GFP_KERNEL);
+		if (!id) {
+			rc = -ENOMEM;
+			goto out;
+		}
+
+		id->pmi_id = id_val;
+		list_add_tail(&id->pmi_linkage, id_list);
+	}
+out:
+	if (rc)
+		pcc_id_list_free(id_list);
+	return rc;
+}
+
+static inline bool
+pcc_check_field(struct cfs_lstr *field, char *str)
+{
+	int len = strlen(str);
+
+	return (field->ls_len == len &&
+		strncmp(field->ls_str, str, len) == 0);
+}
+
+static int
+pcc_expression_parse(struct cfs_lstr *src, struct list_head *cond_list)
+{
+	struct pcc_expression *expr;
+	struct cfs_lstr field;
+	int rc = 0;
+
+	expr = kzalloc(sizeof(*expr), GFP_KERNEL);
+	if (!expr)
+		return -ENOMEM;
+
+	rc = cfs_gettok(src, PCC_EXPRESSION_DELIM, &field);
+	if (rc == 0 || src->ls_len <= 2 || src->ls_str[0] != '{' ||
+	    src->ls_str[src->ls_len - 1] != '}') {
+		rc = -EINVAL;
+		goto out;
+	}
+
+	/* Skip '{' and '}' */
+	src->ls_str++;
+	src->ls_len -= 2;
+
+	if (pcc_check_field(&field, "uid")) {
+		if (pcc_id_list_parse(src->ls_str,
+				      src->ls_len,
+				      &expr->pe_cond,
+				      PCC_FIELD_UID) < 0) {
+			rc = -EINVAL;
+			goto out;
+		}
+		expr->pe_field = PCC_FIELD_UID;
+	} else if (pcc_check_field(&field, "gid")) {
+		if (pcc_id_list_parse(src->ls_str,
+				      src->ls_len,
+				      &expr->pe_cond,
+				      PCC_FIELD_GID) < 0) {
+			rc = -EINVAL;
+			goto out;
+		}
+		expr->pe_field = PCC_FIELD_GID;
+	} else if (pcc_check_field(&field, "projid")) {
+		if (pcc_id_list_parse(src->ls_str,
+				      src->ls_len,
+				      &expr->pe_cond,
+				      PCC_FIELD_PROJID) < 0) {
+			rc = -EINVAL;
+			goto out;
+		}
+		expr->pe_field = PCC_FIELD_PROJID;
+	} else if (pcc_check_field(&field, "fname")) {
+		if (pcc_fname_list_parse(src->ls_str,
+					 src->ls_len,
+					 &expr->pe_cond) < 0) {
+			rc = -EINVAL;
+			goto out;
+		}
+		expr->pe_field = PCC_FIELD_FNAME;
+	} else {
+		rc = -EINVAL;
+		goto out;
+	}
+
+	list_add_tail(&expr->pe_linkage, cond_list);
+	return 0;
+out:
+	kfree(expr);
+	return rc;
+}
+
+static int
+pcc_conjunction_parse(struct cfs_lstr *src, struct list_head *cond_list)
+{
+	struct pcc_conjunction *conjunction;
+	struct cfs_lstr expr;
+	int rc = 0;
+
+	conjunction = kzalloc(sizeof(*conjunction), GFP_KERNEL);
+	if (!conjunction)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&conjunction->pc_expressions);
+	list_add_tail(&conjunction->pc_linkage, cond_list);
+
+	while (src->ls_str) {
+		rc = cfs_gettok(src, PCC_CONJUNCTION_DELIM, &expr);
+		if (rc == 0) {
+			rc = -EINVAL;
+			break;
+		}
+		rc = pcc_expression_parse(&expr,
+					  &conjunction->pc_expressions);
+		if (rc)
+			break;
+	}
+	return rc;
+}
+
+static int pcc_conds_parse(char *str, int len, struct list_head *cond_list)
+{
+	struct cfs_lstr src;
+	struct cfs_lstr res;
+	int rc = 0;
+
+	src.ls_str = str;
+	src.ls_len = len;
+	INIT_LIST_HEAD(cond_list);
+	while (src.ls_str) {
+		rc = cfs_gettok(&src, PCC_DISJUNCTION_DELIM, &res);
+		if (rc == 0) {
+			rc = -EINVAL;
+			break;
+		}
+		rc = pcc_conjunction_parse(&res, cond_list);
+		if (rc)
+			break;
+	}
+	return rc;
+}
+
+static int pcc_id_parse(struct pcc_cmd *cmd, const char *id)
+{
+	int rc;
+
+	cmd->u.pccc_add.pccc_conds_str = kzalloc(strlen(id) + 1, GFP_KERNEL);
+	if (!cmd->u.pccc_add.pccc_conds_str)
+		return -ENOMEM;
+
+	memcpy(cmd->u.pccc_add.pccc_conds_str, id, strlen(id));
+
+	rc = pcc_conds_parse(cmd->u.pccc_add.pccc_conds_str,
+			     strlen(cmd->u.pccc_add.pccc_conds_str),
+			     &cmd->u.pccc_add.pccc_conds);
+	if (rc)
+		pcc_cmd_fini(cmd);
+
+	return rc;
+}
+
+static int
+pcc_parse_value_pair(struct pcc_cmd *cmd, char *buffer)
+{
+	char *key, *val;
+	unsigned long id;
+	int rc;
+
+	val = buffer;
+	key = strsep(&val, "=");
+	if (!val || strlen(val) == 0)
+		return -EINVAL;
+
+	/* Key of the value pair */
+	if (strcmp(key, "rwid") == 0) {
+		rc = kstrtoul(val, 10, &id);
+		if (rc)
+			return rc;
+		if (id <= 0)
+			return -EINVAL;
+		cmd->u.pccc_add.pccc_rwid = id;
+	} else if (strcmp(key, "roid") == 0) {
+		rc = kstrtoul(val, 10, &id);
+		if (rc)
+			return rc;
+		if (id <= 0)
+			return -EINVAL;
+		cmd->u.pccc_add.pccc_roid = id;
+	} else {
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int
+pcc_parse_value_pairs(struct pcc_cmd *cmd, char *buffer)
+{
+	char *val;
+	char *token;
+	int rc;
+
+	val = buffer;
+	while (val && strlen(val) != 0) {
+		token = strsep(&val, " ");
+		rc = pcc_parse_value_pair(cmd, token);
+		if (rc)
+			return rc;
+	}
+
+	return 0;
+}
+
+static void
+pcc_dataset_rule_fini(struct pcc_match_rule *rule)
+{
+	if (!list_empty(&rule->pmr_conds))
+		pcc_rule_conds_free(&rule->pmr_conds);
+	LASSERT(rule->pmr_conds_str);
+	kfree(rule->pmr_conds_str);
+}
+
+static int
+pcc_dataset_rule_init(struct pcc_match_rule *rule, struct pcc_cmd *cmd)
+{
+	int rc = 0;
+
+	LASSERT(cmd->u.pccc_add.pccc_conds_str);
+	rule->pmr_conds_str = kzalloc(
+		strlen(cmd->u.pccc_add.pccc_conds_str) + 1,
+		GFP_KERNEL);
+	if (!rule->pmr_conds_str)
+		return -ENOMEM;
+
+	memcpy(rule->pmr_conds_str,
+	       cmd->u.pccc_add.pccc_conds_str,
+	       strlen(cmd->u.pccc_add.pccc_conds_str));
+
+	INIT_LIST_HEAD(&rule->pmr_conds);
+	if (!list_empty(&cmd->u.pccc_add.pccc_conds))
+		rc = pcc_conds_parse(rule->pmr_conds_str,
+					  strlen(rule->pmr_conds_str),
+					  &rule->pmr_conds);
+
+	if (rc)
+		pcc_dataset_rule_fini(rule);
+
+	return rc;
+}
+
+/* Rule Matching */
+static int
+pcc_id_list_match(struct list_head *id_list, u32 id_val)
+{
+	struct pcc_match_id *id;
+
+	list_for_each_entry(id, id_list, pmi_linkage) {
+		if (id->pmi_id == id_val)
+			return 1;
+	}
+	return 0;
+}
+
+static bool
+cfs_match_wildcard(const char *pattern, const char *content)
+{
+	if (*pattern == '\0' && *content == '\0')
+		return true;
+
+	if (*pattern == '*' && *(pattern + 1) != '\0' && *content == '\0')
+		return false;
+
+	while (*pattern == *content) {
+		pattern++;
+		content++;
+		if (*pattern == '\0' && *content == '\0')
+			return true;
+
+		if (*pattern == '*' && *(pattern + 1) != '\0' &&
+		    *content == '\0')
+			return false;
+	}
+
+	if (*pattern == '*')
+		return (cfs_match_wildcard(pattern + 1, content) ||
+			cfs_match_wildcard(pattern, content + 1));
+
+	return false;
+}
+
+static int
+pcc_fname_list_match(struct list_head *fname_list, const char *name)
+{
+	struct pcc_match_fname *fname;
+
+	list_for_each_entry(fname, fname_list, pmf_linkage) {
+		if (cfs_match_wildcard(fname->pmf_name, name))
+			return 1;
+	}
+	return 0;
+}
+
+static int
+pcc_expression_match(struct pcc_expression *expr, struct pcc_matcher *matcher)
+{
+	switch (expr->pe_field) {
+	case PCC_FIELD_UID:
+		return pcc_id_list_match(&expr->pe_cond, matcher->pm_uid);
+	case PCC_FIELD_GID:
+		return pcc_id_list_match(&expr->pe_cond, matcher->pm_gid);
+	case PCC_FIELD_PROJID:
+		return pcc_id_list_match(&expr->pe_cond, matcher->pm_projid);
+	case PCC_FIELD_FNAME:
+		return pcc_fname_list_match(&expr->pe_cond,
+					    matcher->pm_name->name);
+	default:
+		return 0;
+	}
+}
+
+static int
+pcc_conjunction_match(struct pcc_conjunction *conjunction,
+		      struct pcc_matcher *matcher)
+{
+	struct pcc_expression *expr;
+	int matched;
+
+	list_for_each_entry(expr, &conjunction->pc_expressions, pe_linkage) {
+		matched = pcc_expression_match(expr, matcher);
+		if (!matched)
+			return 0;
+	}
+
+	return 1;
+}
+
+static int
+pcc_cond_match(struct pcc_match_rule *rule, struct pcc_matcher *matcher)
+{
+	struct pcc_conjunction *conjunction;
+	int matched;
+
+	list_for_each_entry(conjunction, &rule->pmr_conds, pc_linkage) {
+		matched = pcc_conjunction_match(conjunction, matcher);
+		if (matched)
+			return 1;
+	}
+
+	return 0;
+}
+
+struct pcc_dataset*
+pcc_dataset_match_get(struct pcc_super *super, struct pcc_matcher *matcher)
+{
+	struct pcc_dataset *dataset;
+	struct pcc_dataset *selected = NULL;
+
+	spin_lock(&super->pccs_lock);
+	list_for_each_entry(dataset, &super->pccs_datasets, pccd_linkage) {
+		if (pcc_cond_match(&dataset->pccd_rule, matcher)) {
+			atomic_inc(&dataset->pccd_refcount);
+			selected = dataset;
+			break;
+		}
+	}
+	spin_unlock(&super->pccs_lock);
+	if (selected)
+		CDEBUG(D_CACHE, "PCC create, matched %s - %d:%d:%d:%s\n",
+		       dataset->pccd_rule.pmr_conds_str,
+		       matcher->pm_uid, matcher->pm_gid,
+		       matcher->pm_projid, matcher->pm_name->name);
+
+	return selected;
+}
+
 /**
  * pcc_dataset_add - Add a Cache policy to control which files need be
  * cached and where it will be cached.
  *
- * @super: superblock of pcc
- * @pathname: root path of pcc
- * @id: HSM archive ID
- * @projid: files with specified project ID will be cached.
+ * @super:	superblock of pcc
+ * @cmd:	pcc command
  */
 static int
-pcc_dataset_add(struct pcc_super *super, const char *pathname,
-		u32 archive_id, u32 projid)
+pcc_dataset_add(struct pcc_super *super, struct pcc_cmd *cmd)
 {
-	int rc;
+	char *pathname = cmd->pccc_pathname;
 	struct pcc_dataset *dataset;
 	struct pcc_dataset *tmp;
 	bool found = false;
+	int rc;
 
 	dataset = kzalloc(sizeof(*dataset), GFP_NOFS);
 	if (!dataset)
@@ -157,13 +685,23 @@ int pcc_super_init(struct pcc_super *super)
 		return rc;
 	}
 	strncpy(dataset->pccd_pathname, pathname, PATH_MAX);
-	dataset->pccd_id = archive_id;
-	dataset->pccd_projid = projid;
+	dataset->pccd_rwid = cmd->u.pccc_add.pccc_rwid;
+	dataset->pccd_roid = cmd->u.pccc_add.pccc_roid;
 	atomic_set(&dataset->pccd_refcount, 1);
 
+	rc = pcc_dataset_rule_init(&dataset->pccd_rule, cmd);
+	if (rc) {
+		pcc_dataset_put(dataset);
+		return rc;
+	}
+
 	spin_lock(&super->pccs_lock);
 	list_for_each_entry(tmp, &super->pccs_datasets, pccd_linkage) {
-		if (tmp->pccd_id == archive_id) {
+		if (strcmp(tmp->pccd_pathname, pathname) == 0 ||
+		    (dataset->pccd_rwid != 0 &&
+		     dataset->pccd_rwid == tmp->pccd_rwid) ||
+		    (dataset->pccd_roid != 0 &&
+		     dataset->pccd_roid == tmp->pccd_roid)) {
 			found = true;
 			break;
 		}
@@ -181,23 +719,21 @@ int pcc_super_init(struct pcc_super *super)
 }
 
 struct pcc_dataset *
-pcc_dataset_get(struct pcc_super *super, u32 projid, u32 archive_id)
+pcc_dataset_get(struct pcc_super *super, enum lu_pcc_type type, u32 id)
 {
 	struct pcc_dataset *dataset;
 	struct pcc_dataset *selected = NULL;
 
-	if (projid == 0 && archive_id == 0)
+	if (id == 0)
 		return NULL;
 
 	/*
-	 * archive ID is unique in the list, projid might be duplicate,
+	 * archive ID (read-write ID) or read-only ID is unique in the list,
 	 * we just return last added one as first priority.
 	 */
 	spin_lock(&super->pccs_lock);
 	list_for_each_entry(dataset, &super->pccs_datasets, pccd_linkage) {
-		if (projid && dataset->pccd_projid != projid)
-			continue;
-		if (archive_id && dataset->pccd_id != archive_id)
+		if (type == LU_PCC_READWRITE && dataset->pccd_rwid != id)
 			continue;
 		atomic_inc(&dataset->pccd_refcount);
 		selected = dataset;
@@ -205,8 +741,8 @@ struct pcc_dataset *
 	}
 	spin_unlock(&super->pccs_lock);
 	if (selected)
-		CDEBUG(D_CACHE, "matched projid %u, PCC create\n",
-		       selected->pccd_projid);
+		CDEBUG(D_CACHE, "matched id %u, PCC mode %d\n", id, type);
+
 	return selected;
 }
 
@@ -214,6 +750,7 @@ struct pcc_dataset *
 pcc_dataset_put(struct pcc_dataset *dataset)
 {
 	if (atomic_dec_and_test(&dataset->pccd_refcount)) {
+		pcc_dataset_rule_fini(&dataset->pccd_rule);
 		path_put(&dataset->pccd_path);
 		kfree(dataset);
 	}
@@ -244,8 +781,8 @@ struct pcc_dataset *
 pcc_dataset_dump(struct pcc_dataset *dataset, struct seq_file *m)
 {
 	seq_printf(m, "%s:\n", dataset->pccd_pathname);
-	seq_printf(m, "  rwid: %u\n", dataset->pccd_id);
-	seq_printf(m, "  autocache: projid=%u\n", dataset->pccd_projid);
+	seq_printf(m, "  rwid: %u\n", dataset->pccd_rwid);
+	seq_printf(m, "  autocache: %s\n", dataset->pccd_rule.pmr_conds_str);
 }
 
 int
@@ -293,7 +830,6 @@ static bool pathname_is_valid(const char *pathname)
 	static struct pcc_cmd *cmd;
 	char *token;
 	char *val;
-	unsigned long tmp;
 	int rc = 0;
 
 	cmd = kzalloc(sizeof(*cmd), GFP_KERNEL);
@@ -336,38 +872,40 @@ static bool pathname_is_valid(const char *pathname)
 	cmd->pccc_pathname = token;
 
 	if (cmd->pccc_cmd == PCC_ADD_DATASET) {
-		/* archive ID */
-		token = strsep(&val, " ");
+		/* List of ID */
+		LASSERT(val);
+		token = val;
+		val = strrchr(token, '}');
 		if (!val) {
 			rc = -EINVAL;
 			goto out_free_cmd;
 		}
 
-		rc = kstrtoul(token, 10, &tmp);
-		if (rc != 0) {
-			rc = -EINVAL;
-			goto out_free_cmd;
-		}
-		if (tmp == 0) {
+		/* Skip '}' */
+		val++;
+		if (*val == '\0') {
+			val = NULL;
+		} else if (*val == ' ') {
+			*val = '\0';
+			val++;
+		} else {
 			rc = -EINVAL;
 			goto out_free_cmd;
 		}
-		cmd->u.pccc_add.pccc_id = tmp;
 
-		token = val;
-		rc = kstrtoul(token, 10, &tmp);
-		if (rc != 0) {
-			rc = -EINVAL;
+		rc = pcc_id_parse(cmd, token);
+		if (rc)
 			goto out_free_cmd;
-		}
-		if (tmp == 0) {
+
+		rc = pcc_parse_value_pairs(cmd, val);
+		if (rc) {
 			rc = -EINVAL;
-			goto out_free_cmd;
+			goto out_cmd_fini;
 		}
-		cmd->u.pccc_add.pccc_projid = tmp;
 	}
-
 	goto out;
+out_cmd_fini:
+	pcc_cmd_fini(cmd);
 out_free_cmd:
 	kfree(cmd);
 out:
@@ -388,9 +926,7 @@ int pcc_cmd_handle(char *buffer, unsigned long count,
 
 	switch (cmd->pccc_cmd) {
 	case PCC_ADD_DATASET:
-		rc = pcc_dataset_add(super, cmd->pccc_pathname,
-				      cmd->u.pccc_add.pccc_id,
-				      cmd->u.pccc_add.pccc_projid);
+		rc = pcc_dataset_add(super, cmd);
 		break;
 	case PCC_DEL_DATASET:
 		rc = pcc_dataset_del(super, cmd->pccc_pathname);
@@ -403,6 +939,7 @@ int pcc_cmd_handle(char *buffer, unsigned long count,
 		break;
 	}
 
+	pcc_cmd_fini(cmd);
 	kfree(cmd);
 	return rc;
 }
@@ -1025,7 +1562,8 @@ static int pcc_inode_remove(struct pcc_inode *pcci)
 	dentry = pcci->pcci_path.dentry;
 	rc = vfs_unlink(dentry->d_parent->d_inode, dentry, NULL);
 	if (rc)
-		CWARN("failed to unlink cached file, rc = %d\n", rc);
+		CWARN("failed to unlink PCC file %.*s, rc = %d\n",
+		      dentry->d_name.len, dentry->d_name.name, rc);
 
 	return rc;
 }
@@ -1226,7 +1764,10 @@ int pcc_inode_create_fini(struct pcc_dataset *dataset, struct inode *inode,
 		rc2 = vfs_unlink(pcc_dentry->d_parent->d_inode,
 				 pcc_dentry, NULL);
 		if (rc2)
-			CWARN("failed to unlink PCC file, rc = %d\n", rc2);
+			CWARN("%s: failed to unlink PCC file %.*s, rc = %d\n",
+			      ll_i2sbi(inode)->ll_fsname,
+			      pcc_dentry->d_name.len, pcc_dentry->d_name.name,
+			      rc2);
 
 		dput(pcc_dentry);
 	}
@@ -1327,8 +1868,8 @@ int pcc_readwrite_attach(struct file *file, struct inode *inode,
 	if (rc)
 		return rc;
 
-	dataset = pcc_dataset_get(&ll_i2sbi(inode)->ll_pcc_super, 0,
-				  archive_id);
+	dataset = pcc_dataset_get(&ll_i2sbi(inode)->ll_pcc_super,
+				  LU_PCC_READWRITE, archive_id);
 	if (!dataset)
 		return -ENOENT;
 
@@ -1384,7 +1925,9 @@ int pcc_readwrite_attach(struct file *file, struct inode *inode,
 		rc2 = vfs_unlink(dentry->d_parent->d_inode, dentry, NULL);
 		revert_creds(old_cred);
 		if (rc2)
-			CWARN("failed to unlink PCC file, rc = %d\n", rc2);
+			CWARN("%s: failed to unlink PCC file %.*s, rc = %d\n",
+			      ll_i2sbi(inode)->ll_fsname, dentry->d_name.len,
+			      dentry->d_name.name, rc2);
 
 		dput(dentry);
 	}
diff --git a/fs/lustre/llite/pcc.h b/fs/lustre/llite/pcc.h
index 54492c9..f2b57f9 100644
--- a/fs/lustre/llite/pcc.h
+++ b/fs/lustre/llite/pcc.h
@@ -43,13 +43,64 @@
 
 #define LPROCFS_WR_PCC_MAX_CMD 4096
 
+/* User/Group/Project ID */
+struct pcc_match_id {
+	u32			pmi_id;
+	struct list_head	pmi_linkage;
+};
+
+/* wildcard file name */
+struct pcc_match_fname {
+	char			*pmf_name;
+	struct list_head	 pmf_linkage;
+};
+
+enum pcc_field {
+	PCC_FIELD_UID,
+	PCC_FIELD_GID,
+	PCC_FIELD_PROJID,
+	PCC_FIELD_FNAME,
+	PCC_FIELD_MAX
+};
+
+struct pcc_expression {
+	enum pcc_field		pe_field;
+	struct list_head	pe_cond;
+	struct list_head	pe_linkage;
+};
+
+struct pcc_conjunction {
+	/* link to disjunction */
+	struct list_head	pc_linkage;
+	/* list of logical conjunction */
+	struct list_head	pc_expressions;
+};
+
+/**
+ * Match rule for auto PCC-cached files.
+ */
+struct pcc_match_rule {
+	char			*pmr_conds_str;
+	struct list_head	 pmr_conds;
+};
+
+struct pcc_matcher {
+	u32		 pm_uid;
+	u32		 pm_gid;
+	u32		 pm_projid;
+	struct qstr	*pm_name;
+};
+
 struct pcc_dataset {
-	u32			pccd_id;	 /* Archive ID */
-	u32			pccd_projid;	 /* Project ID */
+	u32			pccd_rwid;	 /* Archive ID */
+	u32			pccd_roid;	 /* Readonly ID */
+	struct pcc_match_rule	pccd_rule;	 /* Match rule */
+	u32			pccd_rwonly:1, /* Only use as RW-PCC */
+				pccd_roonly:1; /* Only use as RO-PCC */
 	char			pccd_pathname[PATH_MAX]; /* full path */
 	struct path		pccd_path;	 /* Root path */
 	struct list_head	pccd_linkage;  /* Linked to pccs_datasets */
-	atomic_t		pccd_refcount; /* reference count */
+	atomic_t		pccd_refcount; /* Reference count */
 };
 
 struct pcc_super {
@@ -103,8 +154,10 @@ struct pcc_cmd {
 	char					*pccc_pathname;
 	union {
 		struct pcc_cmd_add {
-			u32			 pccc_id;
-			u32			 pccc_projid;
+			u32			 pccc_rwid;
+			u32			 pccc_roid;
+			struct list_head	 pccc_conds;
+			char			*pccc_conds_str;
 		} pccc_add;
 		struct pcc_cmd_del {
 			u32			 pccc_pad;
@@ -149,8 +202,8 @@ int pcc_inode_create(struct super_block *sb, struct pcc_dataset *dataset,
 		     struct lu_fid *fid, struct dentry **pcc_dentry);
 int pcc_inode_create_fini(struct pcc_dataset *dataset, struct inode *inode,
 			   struct dentry *pcc_dentry);
-struct pcc_dataset *pcc_dataset_get(struct pcc_super *super, u32 projid,
-				    u32 archive_id);
+struct pcc_dataset *pcc_dataset_match_get(struct pcc_super *super,
+					  struct pcc_matcher *matcher);
 void pcc_dataset_put(struct pcc_dataset *dataset);
 void pcc_inode_free(struct inode *inode);
 void pcc_layout_invalidate(struct inode *inode);
-- 
1.8.3.1


* [lustre-devel] [PATCH 362/622] lustre: pcc: auto attach during open for valid cache
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Qian Yingjin <qian@ddn.com>

In the current PCC implementation, all PCC state information is
stored in an in-memory data structure named pcc_inode (a member
of the data structure ll_inode_info). Once the file inode is
reclaimed due to memory pressure or shrinking, the corresponding
in-memory pcc_inode is released too, and the PCC-cached file
is detached automatically. Revocation of the layout lock also
triggers detach of the PCC-cached file. All of these can leave a
still-valid PCC-cached file unusable.

To solve this problem, we introduce an auto-attach mechanism
during open. During PCC attach, the L.Gen is stored as an
extended attribute of the local copy file on the PCC device. When
the in-memory inode is reclaimed or the layout lock is revoked,
and the file is opened again, we can check whether the L.Gen
stored on the PCC copy is the same as the Lustre file's current
L.Gen on the MDT. If they are consistent, the cached copy on the
PCC device is still valid, and we can continue to use it after
auto-attach.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10092
Lustre-commit: e29ecb659e51 ("LU-10092 pcc: auto attach during open for valid cache")
Signed-off-by: Qian Yingjin <qian@ddn.com>
Reviewed-on: https://review.whamcloud.com/33787
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/cl_object.h           |   2 +
 fs/lustre/llite/pcc.c                   | 400 ++++++++++++++++++++++++++------
 fs/lustre/llite/pcc.h                   |  18 +-
 fs/lustre/lov/lov_object.c              |   1 +
 include/uapi/linux/lustre/lustre_user.h |   2 +
 5 files changed, 348 insertions(+), 75 deletions(-)

diff --git a/fs/lustre/include/cl_object.h b/fs/lustre/include/cl_object.h
index 3337bbf..d1c1413 100644
--- a/fs/lustre/include/cl_object.h
+++ b/fs/lustre/include/cl_object.h
@@ -293,6 +293,8 @@ struct cl_layout {
 	u32			cl_layout_gen;
 	/** whether layout is a composite one */
 	bool			cl_is_composite;
+	/** Whether layout is a HSM released one */
+	bool			cl_is_released;
 };
 
 /**
diff --git a/fs/lustre/llite/pcc.c b/fs/lustre/llite/pcc.c
index 469ff6c..fc4a2a3 100644
--- a/fs/lustre/llite/pcc.c
+++ b/fs/lustre/llite/pcc.c
@@ -124,7 +124,7 @@ int pcc_super_init(struct pcc_super *super)
 
 	/* Never override disk quota limits or use reserved space */
 	cap_lower(cred->cap_effective, CAP_SYS_RESOURCE);
-	spin_lock_init(&super->pccs_lock);
+	init_rwsem(&super->pccs_rw_sem);
 	INIT_LIST_HEAD(&super->pccs_datasets);
 
 	return 0;
@@ -472,6 +472,24 @@ static int pcc_id_parse(struct pcc_cmd *cmd, const char *id)
 		if (id <= 0)
 			return -EINVAL;
 		cmd->u.pccc_add.pccc_roid = id;
+	} else if (strcmp(key, "open_attach") == 0) {
+		rc = kstrtoul(val, 10, &id);
+		if (rc)
+			return rc;
+		if (id > 0)
+			cmd->u.pccc_add.pccc_flags |= PCC_DATASET_OPEN_ATTACH;
+	} else if (strcmp(key, "rwpcc") == 0) {
+		rc = kstrtoul(val, 10, &id);
+		if (rc)
+			return rc;
+		if (id > 0)
+			cmd->u.pccc_add.pccc_flags |= PCC_DATASET_RWPCC;
+	} else if (strcmp(key, "ropcc") == 0) {
+		rc = kstrtoul(val, 10, &id);
+		if (rc)
+			return rc;
+		if (id > 0)
+			cmd->u.pccc_add.pccc_flags |= PCC_DATASET_ROPCC;
 	} else {
 		return -EINVAL;
 	}
@@ -494,6 +512,24 @@ static int pcc_id_parse(struct pcc_cmd *cmd, const char *id)
 			return rc;
 	}
 
+	switch (cmd->pccc_cmd) {
+	case PCC_ADD_DATASET:
+		if (cmd->u.pccc_add.pccc_flags & PCC_DATASET_RWPCC &&
+		    cmd->u.pccc_add.pccc_flags & PCC_DATASET_ROPCC)
+			return -EINVAL;
+		/*
+		 * By default, a PCC backend can provide caching service for
+		 * both RW-PCC and RO-PCC.
+		 */
+		if ((cmd->u.pccc_add.pccc_flags & PCC_DATASET_PCC_ALL) == 0)
+			cmd->u.pccc_add.pccc_flags |= PCC_DATASET_PCC_ALL;
+		break;
+	case PCC_DEL_DATASET:
+	case PCC_CLEAR_ALL:
+		break;
+	default:
+		return -EINVAL;
+	}
 	return 0;
 }
 
@@ -641,15 +677,18 @@ struct pcc_dataset*
 	struct pcc_dataset *dataset;
 	struct pcc_dataset *selected = NULL;
 
-	spin_lock(&super->pccs_lock);
+	down_read(&super->pccs_rw_sem);
 	list_for_each_entry(dataset, &super->pccs_datasets, pccd_linkage) {
+		if (!(dataset->pccd_flags & PCC_DATASET_RWPCC))
+			continue;
+
 		if (pcc_cond_match(&dataset->pccd_rule, matcher)) {
 			atomic_inc(&dataset->pccd_refcount);
 			selected = dataset;
 			break;
 		}
 	}
-	spin_unlock(&super->pccs_lock);
+	up_read(&super->pccs_rw_sem);
 	if (selected)
 		CDEBUG(D_CACHE, "PCC create, matched %s - %d:%d:%d:%s\n",
 		       dataset->pccd_rule.pmr_conds_str,
@@ -687,6 +726,7 @@ struct pcc_dataset*
 	strncpy(dataset->pccd_pathname, pathname, PATH_MAX);
 	dataset->pccd_rwid = cmd->u.pccc_add.pccc_rwid;
 	dataset->pccd_roid = cmd->u.pccc_add.pccc_roid;
+	dataset->pccd_flags = cmd->u.pccc_add.pccc_flags;
 	atomic_set(&dataset->pccd_refcount, 1);
 
 	rc = pcc_dataset_rule_init(&dataset->pccd_rule, cmd);
@@ -695,7 +735,7 @@ struct pcc_dataset*
 		return rc;
 	}
 
-	spin_lock(&super->pccs_lock);
+	down_write(&super->pccs_rw_sem);
 	list_for_each_entry(tmp, &super->pccs_datasets, pccd_linkage) {
 		if (strcmp(tmp->pccd_pathname, pathname) == 0 ||
 		    (dataset->pccd_rwid != 0 &&
@@ -708,7 +748,7 @@ struct pcc_dataset*
 	}
 	if (!found)
 		list_add(&dataset->pccd_linkage, &super->pccs_datasets);
-	spin_unlock(&super->pccs_lock);
+	up_write(&super->pccs_rw_sem);
 
 	if (found) {
 		pcc_dataset_put(dataset);
@@ -731,15 +771,16 @@ struct pcc_dataset *
 	 * archive ID (read-write ID) or read-only ID is unique in the list,
 	 * we just return last added one as first priority.
 	 */
-	spin_lock(&super->pccs_lock);
+	down_read(&super->pccs_rw_sem);
 	list_for_each_entry(dataset, &super->pccs_datasets, pccd_linkage) {
-		if (type == LU_PCC_READWRITE && dataset->pccd_rwid != id)
+		if (type == LU_PCC_READWRITE && (dataset->pccd_rwid != id ||
+		    !(dataset->pccd_flags & PCC_DATASET_RWPCC)))
 			continue;
 		atomic_inc(&dataset->pccd_refcount);
 		selected = dataset;
 		break;
 	}
-	spin_unlock(&super->pccs_lock);
+	up_read(&super->pccs_rw_sem);
 	if (selected)
 		CDEBUG(D_CACHE, "matched id %u, PCC mode %d\n", id, type);
 
@@ -763,17 +804,17 @@ struct pcc_dataset *
 	struct pcc_dataset *dataset;
 	int rc = -ENOENT;
 
-	spin_lock(&super->pccs_lock);
+	down_write(&super->pccs_rw_sem);
 	list_for_each_safe(l, tmp, &super->pccs_datasets) {
 		dataset = list_entry(l, struct pcc_dataset, pccd_linkage);
 		if (strcmp(dataset->pccd_pathname, pathname) == 0) {
-			list_del(&dataset->pccd_linkage);
+			list_del_init(&dataset->pccd_linkage);
 			pcc_dataset_put(dataset);
 			rc = 0;
 			break;
 		}
 	}
-	spin_unlock(&super->pccs_lock);
+	up_write(&super->pccs_rw_sem);
 	return rc;
 }
 
@@ -782,6 +823,7 @@ struct pcc_dataset *
 {
 	seq_printf(m, "%s:\n", dataset->pccd_pathname);
 	seq_printf(m, "  rwid: %u\n", dataset->pccd_rwid);
+	seq_printf(m, "  flags: %x\n", dataset->pccd_flags);
 	seq_printf(m, "  autocache: %s\n", dataset->pccd_rule.pmr_conds_str);
 }
 
@@ -790,11 +832,11 @@ struct pcc_dataset *
 {
 	struct pcc_dataset *dataset;
 
-	spin_lock(&super->pccs_lock);
+	down_read(&super->pccs_rw_sem);
 	list_for_each_entry(dataset, &super->pccs_datasets, pccd_linkage) {
 		pcc_dataset_dump(dataset, m);
 	}
-	spin_unlock(&super->pccs_lock);
+	up_read(&super->pccs_rw_sem);
 	return 0;
 }
 
@@ -802,11 +844,13 @@ static void pcc_remove_datasets(struct pcc_super *super)
 {
 	struct pcc_dataset *dataset, *tmp;
 
+	down_write(&super->pccs_rw_sem);
 	list_for_each_entry_safe(dataset, tmp,
 				 &super->pccs_datasets, pccd_linkage) {
 		list_del(&dataset->pccd_linkage);
 		pcc_dataset_put(dataset);
 	}
+	up_write(&super->pccs_rw_sem);
 }
 
 void pcc_super_fini(struct pcc_super *super)
@@ -1027,19 +1071,241 @@ void pcc_file_init(struct pcc_file *pccf)
 	pccf->pccf_type = LU_PCC_NONE;
 }
 
+static inline bool pcc_open_attach_enabled(struct pcc_dataset *dataset)
+{
+	return dataset->pccd_flags & PCC_DATASET_OPEN_ATTACH;
+}
+
+static const char pcc_xattr_layout[] = XATTR_USER_PREFIX "PCC.layout";
+
+static int pcc_layout_xattr_set(struct pcc_inode *pcci, u32 gen)
+{
+	struct dentry *pcc_dentry = pcci->pcci_path.dentry;
+	struct ll_inode_info *lli = pcci->pcci_lli;
+	int rc;
+
+	if (!(lli->lli_pcc_state & PCC_STATE_FL_OPEN_ATTACH))
+		return 0;
+
+	rc = __vfs_setxattr(pcc_dentry, pcc_dentry->d_inode, pcc_xattr_layout,
+			    &gen, sizeof(gen), 0);
+	return rc;
+}
+
+static int pcc_get_layout_info(struct inode *inode, struct cl_layout *clt)
+{
+	struct lu_env *env;
+	struct ll_inode_info *lli = ll_i2info(inode);
+	u16 refcheck;
+	int rc;
+
+	if (!lli->lli_clob)
+		return -EINVAL;
+
+	env = cl_env_get(&refcheck);
+	if (IS_ERR(env))
+		return PTR_ERR(env);
+
+	rc = cl_object_layout_get(env, lli->lli_clob, clt);
+	if (rc)
+		CDEBUG(D_INODE, "Cannot get layout for "DFID"\n",
+		       PFID(ll_inode2fid(inode)));
+
+	cl_env_put(env, &refcheck);
+	return rc;
+}
+
+static int pcc_fid2dataset_fullpath(char *buf, int sz, struct lu_fid *fid,
+				    struct pcc_dataset *dataset)
+{
+	return snprintf(buf, sz, "%s/%04x/%04x/%04x/%04x/%04x/%04x/"
+			DFID_NOBRACE,
+			dataset->pccd_pathname,
+			(fid)->f_oid       & 0xFFFF,
+			(fid)->f_oid >> 16 & 0xFFFF,
+			(unsigned int)((fid)->f_seq       & 0xFFFF),
+			(unsigned int)((fid)->f_seq >> 16 & 0xFFFF),
+			(unsigned int)((fid)->f_seq >> 32 & 0xFFFF),
+			(unsigned int)((fid)->f_seq >> 48 & 0xFFFF),
+			PFID(fid));
+}
+
+/* Must be called with pcci->pcci_lock held */
+static void pcc_inode_attach_init(struct pcc_dataset *dataset,
+				  struct pcc_inode *pcci,
+				  struct dentry *dentry,
+				  enum lu_pcc_type type)
+{
+	pcci->pcci_path.mnt = mntget(dataset->pccd_path.mnt);
+	pcci->pcci_path.dentry = dentry;
+	LASSERT(atomic_read(&pcci->pcci_refcount) == 0);
+	atomic_set(&pcci->pcci_refcount, 1);
+	pcci->pcci_type = type;
+	pcci->pcci_attr_valid = false;
+
+	if (pcc_open_attach_enabled(dataset)) {
+		struct ll_inode_info *lli = pcci->pcci_lli;
+
+		lli->lli_pcc_state |= PCC_STATE_FL_OPEN_ATTACH;
+	}
+}
+
+static inline void pcc_layout_gen_set(struct pcc_inode *pcci,
+				      u32 gen)
+{
+	pcci->pcci_layout_gen = gen;
+}
+
 static inline bool pcc_inode_has_layout(struct pcc_inode *pcci)
 {
 	return pcci->pcci_layout_gen != CL_LAYOUT_GEN_NONE;
 }
 
+static int pcc_try_dataset_attach(struct inode *inode, u32 gen,
+				  enum lu_pcc_type type,
+				  struct pcc_dataset *dataset,
+				  bool *cached)
+{
+	struct ll_inode_info *lli = ll_i2info(inode);
+	struct pcc_inode *pcci = lli->lli_pcc_inode;
+	const struct cred *old_cred;
+	struct dentry *pcc_dentry;
+	struct path path;
+	char *pathname;
+	u32 pcc_gen;
+	int rc;
+
+	if (type == LU_PCC_READWRITE &&
+	    !(dataset->pccd_flags & PCC_DATASET_RWPCC))
+		return 0;
+
+	pathname = kzalloc(PATH_MAX, GFP_KERNEL);
+	if (!pathname)
+		return -ENOMEM;
+
+	pcc_fid2dataset_fullpath(pathname, PATH_MAX, &lli->lli_fid, dataset);
+
+	old_cred = override_creds(pcc_super_cred(inode->i_sb));
+	rc = kern_path(pathname, LOOKUP_FOLLOW, &path);
+	if (rc) {
+		/* ignore this error */
+		rc = 0;
+		goto out;
+	}
+
+	pcc_dentry = path.dentry;
+	rc = __vfs_getxattr(pcc_dentry, pcc_dentry->d_inode, pcc_xattr_layout,
+			    &pcc_gen, sizeof(pcc_gen));
+	if (rc < 0) {
+		/* ignore this error */
+		rc = 0;
+		goto out_put_path;
+	}
+
+	rc = 0;
+	/* The file is still valid cached in PCC, attach it immediately. */
+	if (pcc_gen == gen) {
+		CDEBUG(D_CACHE, DFID" L.Gen (%d) consistent, auto attached.\n",
+		       PFID(&lli->lli_fid), gen);
+		if (!pcci) {
+			pcci = kmem_cache_zalloc(pcc_inode_slab, GFP_NOFS);
+			if (!pcci) {
+				rc = -ENOMEM;
+				goto out_put_path;
+			}
+
+			pcc_inode_init(pcci, lli);
+			dget(pcc_dentry);
+			pcc_inode_attach_init(dataset, pcci, pcc_dentry, type);
+		} else {
+			/*
+			 * This happened when a file was once attached into
+			 * PCC, and some processes keep this file opened
+			 * (pcci->refcount > 1) and corresponding PCC file
+			 * without any I/O activity, and then this file was
+			 * detached by the manual detach command or the
+			 * revocation of the layout lock (i.e. cached LRU lock
+			 * shrinking).
+			 */
+			pcc_inode_get(pcci);
+			pcci->pcci_type = type;
+		}
+		pcc_layout_gen_set(pcci, gen);
+		*cached = true;
+	}
+out_put_path:
+	path_put(&path);
+out:
+	revert_creds(old_cred);
+	kfree(pathname);
+	return rc;
+}
+
+static int pcc_try_datasets_attach(struct inode *inode, u32 gen,
+				   enum lu_pcc_type type, bool *cached)
+{
+	struct pcc_dataset *dataset, *tmp;
+	struct pcc_super *super = &ll_i2sbi(inode)->ll_pcc_super;
+	int rc = 0;
+
+	down_read(&super->pccs_rw_sem);
+	list_for_each_entry_safe(dataset, tmp,
+				 &super->pccs_datasets, pccd_linkage) {
+		if (!pcc_open_attach_enabled(dataset))
+			continue;
+		rc = pcc_try_dataset_attach(inode, gen, type, dataset, cached);
+		if (rc < 0 || (!rc && *cached))
+			break;
+	}
+	up_read(&super->pccs_rw_sem);
+
+	return rc;
+}
+
+static int pcc_try_open_attach(struct inode *inode, bool *cached)
+{
+	struct pcc_super *super = &ll_i2sbi(inode)->ll_pcc_super;
+	struct cl_layout clt = {
+		.cl_layout_gen = 0,
+		.cl_is_released = false,
+	};
+	int rc;
+
+	/*
+	 * Quick check whether there is PCC device.
+	 */
+	if (list_empty(&super->pccs_datasets))
+		return 0;
+
+	/*
+	 * The file layout lock was cancelled. And this open does not
+	 * obtain valid layout lock from MDT (i.e. the file is being
+	 * HSM restoring).
+	 */
+	if (ll_layout_version_get(ll_i2info(inode)) == CL_LAYOUT_GEN_NONE)
+		return 0;
+
+	rc = pcc_get_layout_info(inode, &clt);
+	if (rc)
+		return rc;
+
+	if (clt.cl_is_released)
+		rc = pcc_try_datasets_attach(inode, clt.cl_layout_gen,
+					     LU_PCC_READWRITE, cached);
+
+	return rc;
+}
+
 int pcc_file_open(struct inode *inode, struct file *file)
 {
 	struct pcc_inode *pcci;
+	struct ll_inode_info *lli = ll_i2info(inode);
 	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
 	struct pcc_file *pccf = &fd->fd_pcc_file;
 	struct file *pcc_file;
 	struct path *path;
 	struct qstr *dname;
+	bool cached = false;
 	int rc = 0;
 
 	if (!S_ISREG(inode->i_mode))
@@ -1047,13 +1313,19 @@ int pcc_file_open(struct inode *inode, struct file *file)
 
 	pcc_inode_lock(inode);
 	pcci = ll_i2pcci(inode);
-	if (!pcci)
-		goto out_unlock;
 
-	if (atomic_read(&pcci->pcci_refcount) == 0 ||
-	    !pcc_inode_has_layout(pcci))
+	if (lli->lli_pcc_state & PCC_STATE_FL_ATTACHING)
 		goto out_unlock;
 
+	if (!pcci || !pcc_inode_has_layout(pcci)) {
+		rc = pcc_try_open_attach(inode, &cached);
+		if (rc < 0 || !cached)
+			goto out_unlock;
+
+		if (!pcci)
+			pcci = ll_i2pcci(inode);
+	}
+
 	pcc_inode_get(pcci);
 	WARN_ON(pccf->pccf_file);
 
@@ -1106,12 +1378,6 @@ void pcc_file_release(struct inode *inode, struct file *file)
 	pcc_inode_unlock(inode);
 }
 
-static inline void pcc_layout_gen_set(struct pcc_inode *pcci,
-				      u32 gen)
-{
-	pcci->pcci_layout_gen = gen;
-}
-
 static void pcc_io_init(struct inode *inode, bool *cached)
 {
 	struct pcc_inode *pcci;
@@ -1439,11 +1705,20 @@ int pcc_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 	const struct vm_operations_struct *pcc_vm_ops = vma->vm_private_data;
 	int rc;
 
-	if (!pcc_file || !pcc_vm_ops || !pcc_vm_ops->page_mkwrite) {
+	if (!pcc_file || !pcc_vm_ops) {
 		*cached = false;
 		return 0;
 	}
 
+	if (!pcc_vm_ops->page_mkwrite &&
+	    page->mapping == pcc_file->f_mapping) {
+		CDEBUG(D_MMAP,
+		       "%s: PCC backend fs not support ->page_mkwrite()\n",
+		       ll_i2sbi(inode)->ll_fsname);
+		pcc_ioctl_detach(inode);
+		up_read(&mm->mmap_sem);
+		return VM_FAULT_RETRY | VM_FAULT_NOPAGE;
+	}
 	/* Pause to allow for a race with concurrent detach */
 	OBD_FAIL_TIMEOUT(OBD_FAIL_LLITE_PCC_MKWRITE_PAUSE, cfs_fail_val);
 
@@ -1465,7 +1740,7 @@ int pcc_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 		 * VM_FAULT_NOPAGE | VM_FAULT_RETRY to the caller
 		 * __do_page_fault and retry the memory fault handling.
 		 */
-		if (page->mapping == file_inode(pcc_file)->i_mapping) {
+		if (page->mapping == pcc_file->f_mapping) {
 			*cached = true;
 			up_read(&mm->mmap_sem);
 			return VM_FAULT_RETRY | VM_FAULT_NOPAGE;
@@ -1554,16 +1829,15 @@ void pcc_layout_invalidate(struct inode *inode)
 	pcc_inode_unlock(inode);
 }
 
-static int pcc_inode_remove(struct pcc_inode *pcci)
+static int pcc_inode_remove(struct inode *inode, struct dentry *pcc_dentry)
 {
-	struct dentry *dentry;
 	int rc;
 
-	dentry = pcci->pcci_path.dentry;
-	rc = vfs_unlink(dentry->d_parent->d_inode, dentry, NULL);
+	rc = vfs_unlink(pcc_dentry->d_parent->d_inode, pcc_dentry, NULL);
 	if (rc)
-		CWARN("failed to unlink PCC file %.*s, rc = %d\n",
-		      dentry->d_name.len, dentry->d_name.name, rc);
+		CWARN("%s: failed to unlink PCC file %.*s, rc = %d\n",
+		      ll_i2sbi(inode)->ll_fsname, pcc_dentry->d_name.len,
+		      pcc_dentry->d_name.name, rc);
 
 	return rc;
 }
@@ -1651,20 +1925,6 @@ static int pcc_inode_remove(struct pcc_inode *pcci)
 	return dentry;
 }
 
-/* Must be called with pcci->pcci_lock held */
-static void pcc_inode_attach_init(struct pcc_dataset *dataset,
-				  struct pcc_inode *pcci,
-				  struct dentry *dentry,
-				  enum lu_pcc_type type)
-{
-	pcci->pcci_path.mnt = mntget(dataset->pccd_path.mnt);
-	pcci->pcci_path.dentry = dentry;
-	LASSERT(atomic_read(&pcci->pcci_refcount) == 0);
-	atomic_set(&pcci->pcci_refcount, 1);
-	pcci->pcci_type = type;
-	pcci->pcci_attr_valid = false;
-}
-
 static int __pcc_inode_create(struct pcc_dataset *dataset,
 			      struct lu_fid *fid,
 			      struct dentry **dentry)
@@ -1744,38 +2004,37 @@ int pcc_inode_create_fini(struct pcc_dataset *dataset, struct inode *inode,
 	pcci = kmem_cache_zalloc(pcc_inode_slab, GFP_NOFS);
 	if (!pcci) {
 		rc = -ENOMEM;
-		goto out_unlock;
+		goto out_put;
 	}
 
 	rc = pcc_inode_store_ugpid(pcc_dentry, old_cred->suid,
 				   old_cred->sgid);
 	if (rc)
-		goto out_unlock;
+		goto out_put;
 
 	pcc_inode_init(pcci, ll_i2info(inode));
 	pcc_inode_attach_init(dataset, pcci, pcc_dentry, LU_PCC_READWRITE);
-	/* Set the layout generation of newly created file with 0 */
-	pcc_layout_gen_set(pcci, 0);
 
-out_unlock:
+	rc = pcc_layout_xattr_set(pcci, 0);
 	if (rc) {
-		int rc2;
+		(void) pcc_inode_remove(inode, pcci->pcci_path.dentry);
+		pcc_inode_put(pcci);
+		goto out_unlock;
+	}
 
-		rc2 = vfs_unlink(pcc_dentry->d_parent->d_inode,
-				 pcc_dentry, NULL);
-		if (rc2)
-			CWARN("%s: failed to unlink PCC file %.*s, rc = %d\n",
-			      ll_i2sbi(inode)->ll_fsname,
-			      pcc_dentry->d_name.len, pcc_dentry->d_name.name,
-			      rc2);
+	/* Set the layout generation of newly created file with 0 */
+	pcc_layout_gen_set(pcci, 0);
 
+out_put:
+	if (rc) {
+		(void) pcc_inode_remove(inode, pcc_dentry);
 		dput(pcc_dentry);
-	}
 
+		kmem_cache_free(pcc_inode_slab, pcci);
+	}
+out_unlock:
 	pcc_inode_unlock(inode);
 	revert_creds(old_cred);
-	if (rc)
-		kmem_cache_free(pcc_inode_slab, pcci);
 
 	return rc;
 }
@@ -1919,16 +2178,9 @@ int pcc_readwrite_attach(struct file *file, struct inode *inode,
 	fput(pcc_filp);
 out_dentry:
 	if (rc) {
-		int rc2;
-
 		old_cred = override_creds(pcc_super_cred(inode->i_sb));
-		rc2 = vfs_unlink(dentry->d_parent->d_inode, dentry, NULL);
+		(void) pcc_inode_remove(inode, dentry);
 		revert_creds(old_cred);
-		if (rc2)
-			CWARN("%s: failed to unlink PCC file %.*s, rc = %d\n",
-			      ll_i2sbi(inode)->ll_fsname, dentry->d_name.len,
-			      dentry->d_name.name, rc2);
-
 		dput(dentry);
 	}
 out_dataset_put:
@@ -1945,6 +2197,7 @@ int pcc_readwrite_attach_fini(struct file *file, struct inode *inode,
 	struct pcc_inode *pcci;
 	u32 gen2;
 
+	old_cred = override_creds(pcc_super_cred(inode->i_sb));
 	pcc_inode_lock(inode);
 	pcci = ll_i2pcci(inode);
 	lli->lli_pcc_state &= ~PCC_STATE_FL_ATTACHING;
@@ -1962,6 +2215,10 @@ int pcc_readwrite_attach_fini(struct file *file, struct inode *inode,
 	}
 
 	LASSERT(attached);
+	rc = pcc_layout_xattr_set(pcci, gen);
+	if (rc)
+		goto out_put;
+
 	rc = ll_layout_refresh(inode, &gen2);
 	if (!rc) {
 		if (gen2 == gen) {
@@ -1977,13 +2234,12 @@ int pcc_readwrite_attach_fini(struct file *file, struct inode *inode,
 
 out_put:
 	if (rc) {
-		old_cred = override_creds(pcc_super_cred(inode->i_sb));
-		pcc_inode_remove(pcci);
-		revert_creds(old_cred);
+		(void) pcc_inode_remove(inode, pcci->pcci_path.dentry);
 		pcc_inode_put(pcci);
 	}
 out_unlock:
 	pcc_inode_unlock(inode);
+	revert_creds(old_cred);
 	return rc;
 }
 
diff --git a/fs/lustre/llite/pcc.h b/fs/lustre/llite/pcc.h
index f2b57f9..4947911 100644
--- a/fs/lustre/llite/pcc.h
+++ b/fs/lustre/llite/pcc.h
@@ -91,12 +91,23 @@ struct pcc_matcher {
 	struct qstr	*pm_name;
 };
 
+enum pcc_dataset_flags {
+	PCC_DATASET_NONE	= 0x0,
+	/* Try auto attach at open, disabled by default */
+	PCC_DATASET_OPEN_ATTACH	= 0x1,
+	/* PCC backend is only used for RW-PCC */
+	PCC_DATASET_RWPCC	= 0x2,
+	/* PCC backend is only used for RO-PCC */
+	PCC_DATASET_ROPCC	= 0x4,
+	/* PCC backend provides caching services for both RW-PCC and RO-PCC */
+	PCC_DATASET_PCC_ALL	= PCC_DATASET_RWPCC | PCC_DATASET_ROPCC,
+};
+
 struct pcc_dataset {
 	u32			pccd_rwid;	 /* Archive ID */
 	u32			pccd_roid;	 /* Readonly ID */
 	struct pcc_match_rule	pccd_rule;	 /* Match rule */
-	u32			pccd_rwonly:1, /* Only use as RW-PCC */
-				pccd_roonly:1; /* Only use as RO-PCC */
+	enum pcc_dataset_flags	pccd_flags;	 /* flags of PCC backend */
 	char			pccd_pathname[PATH_MAX]; /* full path */
 	struct path		pccd_path;	 /* Root path */
 	struct list_head	pccd_linkage;  /* Linked to pccs_datasets */
@@ -105,7 +116,7 @@ struct pcc_dataset {
 
 struct pcc_super {
 	/* Protect pccs_datasets */
-	spinlock_t		 pccs_lock;
+	struct rw_semaphore	 pccs_rw_sem;
 	/* List of datasets */
 	struct list_head	 pccs_datasets;
 	/* creds of process who forced instantiation of super block */
@@ -158,6 +169,7 @@ struct pcc_cmd {
 			u32			 pccc_roid;
 			struct list_head	 pccc_conds;
 			char			*pccc_conds_str;
+			enum pcc_dataset_flags	 pccc_flags;
 		} pccc_add;
 		struct pcc_cmd_del {
 			u32			 pccc_pad;
diff --git a/fs/lustre/lov/lov_object.c b/fs/lustre/lov/lov_object.c
index 27e0ca5..792d946 100644
--- a/fs/lustre/lov/lov_object.c
+++ b/fs/lustre/lov/lov_object.c
@@ -2049,6 +2049,7 @@ static int lov_object_layout_get(const struct lu_env *env,
 	cl->cl_size = lov_comp_md_size(lsm);
 	cl->cl_layout_gen = lsm->lsm_layout_gen;
 	cl->cl_dom_comp_size = 0;
+	cl->cl_is_released = lsm->lsm_is_released;
 	if (lsm_is_composite(lsm->lsm_magic)) {
 		struct lov_stripe_md_entry *lsme = lsm->lsm_entries[0];
 
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index b024a44..2f9687e 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -2104,6 +2104,8 @@ enum lu_pcc_state_flags {
 	PCC_STATE_FL_ATTR_VALID		= 0x01,
 	/* The file is being attached into PCC */
 	PCC_STATE_FL_ATTACHING		= 0x02,
+	/* Allow to auto attach at open */
+	PCC_STATE_FL_OPEN_ATTACH	= 0x04,
 };
 
 struct lu_pcc_state {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 363/622] lustre: pcc: change detach behavior and add keep option
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (361 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 362/622] lustre: pcc: auto attach during open for valid cache James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 364/622] lustre: lov: return error if cl_env_get fails James Simmons
                   ` (259 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Qian Yingjin <qian@ddn.com>

After introducing the feature of auto-attach at open, a file that is
detached from PCC by the "lfs pcc detach" command will be attached
automatically again at the next open. This may not be what the user
wants.

To solve this problem, we change the default detach behavior and add
an option "--keep|-k" to the detach command for RW-PCC.
The manual "lfs pcc detach" command now detaches the file from PCC
permanently, and by default it also removes the PCC copy.
When the file is detached with the "keep" option, only the mapping
between the file inode and the PCC copy is removed; the PCC copy
itself is kept. Such a file may be attached automatically at the
next open while it is still valid in the cache.

Note that currently an auto detach caused by inode reclaim or by
revocation of the layout lock does not delete the PCC copy either.
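
As a hedged usage sketch (the mount point and file name here are made
up, not taken from the patch), the two detach modes look like:

```sh
# Detach permanently and remove the PCC copy (the new default):
lfs pcc detach /mnt/lustre/foo
# Detach but keep the PCC copy, so a later open may auto re-attach
# while the cached copy is still valid:
lfs pcc detach -k /mnt/lustre/foo
```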

WC-bug-id: https://jira.whamcloud.com/browse/LU-10092
Lustre-commit: 2dadefb4148f ("LU-10092 pcc: change detach behavior and add keep option")
Signed-off-by: Qian Yingjin <qian@ddn.com>
Reviewed-on: https://review.whamcloud.com/33844
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/dir.c                   |  6 +--
 fs/lustre/llite/file.c                  | 33 +++++++++++++---
 fs/lustre/llite/pcc.c                   | 68 ++++++++++++++++++++++++++++++---
 fs/lustre/llite/pcc.h                   |  2 +-
 include/uapi/linux/lustre/lustre_user.h | 16 ++++++--
 5 files changed, 107 insertions(+), 18 deletions(-)

diff --git a/fs/lustre/llite/dir.c b/fs/lustre/llite/dir.c
index 1f7ed32..2c39579 100644
--- a/fs/lustre/llite/dir.c
+++ b/fs/lustre/llite/dir.c
@@ -1918,7 +1918,7 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	case FS_IOC_FSSETXATTR:
 		return ll_ioctl_fssetxattr(inode, cmd, arg);
 	case LL_IOC_PCC_DETACH_BY_FID: {
-		struct lu_pcc_detach *detach;
+		struct lu_pcc_detach_fid *detach;
 		struct lu_fid *fid;
 		struct inode *inode2;
 		unsigned long ino;
@@ -1928,7 +1928,7 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 			return -ENOMEM;
 
 		if (copy_from_user(detach,
-				   (const struct lu_pcc_detach __user *)arg,
+				   (const struct lu_pcc_detach_fid __user *)arg,
 				   sizeof(*detach))) {
 			rc = -EFAULT;
 			goto out_detach;
@@ -1955,7 +1955,7 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 			goto out_iput;
 		}
 
-		rc = pcc_ioctl_detach(inode2);
+		rc = pcc_ioctl_detach(inode2, detach->pccd_opt);
 out_iput:
 		iput(inode2);
 out_detach:
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 96311ad..a27c06c 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -3732,14 +3732,35 @@ static int ll_heat_set(struct inode *inode, enum lu_heat_flag flags)
 		rc = ll_heat_set(inode, flags);
 		return rc;
 	}
-	case LL_IOC_PCC_DETACH:
-		if (!S_ISREG(inode->i_mode))
-			return -EINVAL;
+	case LL_IOC_PCC_DETACH: {
+		struct lu_pcc_detach *detach;
 
-		if (!inode_owner_or_capable(inode))
-			return -EPERM;
+		detach = kzalloc(sizeof(*detach), GFP_KERNEL);
+		if (!detach)
+			return -ENOMEM;
+
+		if (copy_from_user(detach,
+				   (const struct lu_pcc_detach __user *)arg,
+				   sizeof(*detach))) {
+			rc = -EFAULT;
+			goto out_detach_free;
+		}
+
+		if (!S_ISREG(inode->i_mode)) {
+			rc = -EINVAL;
+			goto out_detach_free;
+		}
 
-		return pcc_ioctl_detach(inode);
+		if (!inode_owner_or_capable(inode)) {
+			rc = -EPERM;
+			goto out_detach_free;
+		}
+
+		rc = pcc_ioctl_detach(inode, detach->pccd_opt);
+out_detach_free:
+		kfree(detach);
+		return rc;
+	}
 	case LL_IOC_PCC_STATE: {
 		struct lu_pcc_state __user *ustate =
 			(struct lu_pcc_state __user *)arg;
diff --git a/fs/lustre/llite/pcc.c b/fs/lustre/llite/pcc.c
index fc4a2a3..c8c2442 100644
--- a/fs/lustre/llite/pcc.c
+++ b/fs/lustre/llite/pcc.c
@@ -1002,6 +1002,7 @@ static void pcc_inode_init(struct pcc_inode *pcci, struct ll_inode_info *lli)
 {
 	pcci->pcci_lli = lli;
 	lli->lli_pcc_inode = pcci;
+	lli->lli_pcc_state = PCC_STATE_FL_NONE;
 	atomic_set(&pcci->pcci_refcount, 0);
 	pcci->pcci_type = LU_PCC_NONE;
 	pcci->pcci_layout_gen = CL_LAYOUT_GEN_NONE;
@@ -1715,8 +1716,9 @@ int pcc_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 		CDEBUG(D_MMAP,
 		       "%s: PCC backend fs not support ->page_mkwrite()\n",
 		       ll_i2sbi(inode)->ll_fsname);
-		pcc_ioctl_detach(inode);
+		pcc_ioctl_detach(inode, PCC_DETACH_OPT_NONE);
 		up_read(&mm->mmap_sem);
+		*cached = true;
 		return VM_FAULT_RETRY | VM_FAULT_NOPAGE;
 	}
 	/* Pause to allow for a race with concurrent detach */
@@ -1755,7 +1757,7 @@ int pcc_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 	 */
 	if (OBD_FAIL_CHECK(OBD_FAIL_LLITE_PCC_DETACH_MKWRITE)) {
 		pcc_io_fini(inode);
-		pcc_ioctl_detach(inode);
+		pcc_ioctl_detach(inode, PCC_DETACH_OPT_NONE);
 		up_read(&mm->mmap_sem);
 		return VM_FAULT_RETRY | VM_FAULT_NOPAGE;
 	}
@@ -2243,10 +2245,51 @@ int pcc_readwrite_attach_fini(struct file *file, struct inode *inode,
 	return rc;
 }
 
-int pcc_ioctl_detach(struct inode *inode)
+static int pcc_hsm_remove(struct inode *inode)
+{
+	struct hsm_user_request *hur;
+	u32 gen;
+	int len;
+	int rc;
+
+	rc = ll_layout_restore(inode, 0, OBD_OBJECT_EOF);
+	if (rc) {
+		CDEBUG(D_CACHE, DFID" RESTORE failure: %d\n",
+		       PFID(&ll_i2info(inode)->lli_fid), rc);
+		return rc;
+	}
+
+	ll_layout_refresh(inode, &gen);
+
+	len = sizeof(struct hsm_user_request) +
+	      sizeof(struct hsm_user_item);
+	hur = kzalloc(len, GFP_NOFS);
+	if (!hur)
+		return -ENOMEM;
+
+	hur->hur_request.hr_action = HUA_REMOVE;
+	hur->hur_request.hr_archive_id = 0;
+	hur->hur_request.hr_flags = 0;
+	memcpy(&hur->hur_user_item[0].hui_fid, &ll_i2info(inode)->lli_fid,
+	       sizeof(hur->hur_user_item[0].hui_fid));
+	hur->hur_user_item[0].hui_extent.offset = 0;
+	hur->hur_user_item[0].hui_extent.length = OBD_OBJECT_EOF;
+	hur->hur_request.hr_itemcount = 1;
+	rc = obd_iocontrol(LL_IOC_HSM_REQUEST, ll_i2sbi(inode)->ll_md_exp,
+			   len, hur, NULL);
+	if (rc)
+		CDEBUG(D_CACHE, DFID" HSM REMOVE failure: %d\n",
+		       PFID(&ll_i2info(inode)->lli_fid), rc);
+
+	kfree(hur);
+	return rc;
+}
+
+int pcc_ioctl_detach(struct inode *inode, u32 opt)
 {
 	struct ll_inode_info *lli = ll_i2info(inode);
 	struct pcc_inode *pcci;
+	bool hsm_remove = false;
 	int rc = 0;
 
 	pcc_inode_lock(inode);
@@ -2255,11 +2298,26 @@ int pcc_ioctl_detach(struct inode *inode)
 	    !pcc_inode_has_layout(pcci))
 		goto out_unlock;
 
-	__pcc_layout_invalidate(pcci);
-	pcc_inode_put(pcci);
+	LASSERT(atomic_read(&pcci->pcci_refcount) > 0);
+
+	if (pcci->pcci_type == LU_PCC_READWRITE) {
+		if (opt == PCC_DETACH_OPT_UNCACHE)
+			hsm_remove = true;
+
+		__pcc_layout_invalidate(pcci);
+		pcc_inode_put(pcci);
+	}
 
 out_unlock:
 	pcc_inode_unlock(inode);
+	if (hsm_remove) {
+		const struct cred *old_cred;
+
+		old_cred = override_creds(pcc_super_cred(inode->i_sb));
+		rc = pcc_hsm_remove(inode);
+		revert_creds(old_cred);
+	}
+
 	return rc;
 }
 
diff --git a/fs/lustre/llite/pcc.h b/fs/lustre/llite/pcc.h
index 4947911..c00cb0b 100644
--- a/fs/lustre/llite/pcc.h
+++ b/fs/lustre/llite/pcc.h
@@ -187,7 +187,7 @@ int pcc_readwrite_attach(struct file *file, struct inode *inode,
 int pcc_readwrite_attach_fini(struct file *file, struct inode *inode,
 			      u32 gen, bool lease_broken, int rc,
 			      bool attached);
-int pcc_ioctl_detach(struct inode *inode);
+int pcc_ioctl_detach(struct inode *inode, u32 opt);
 int pcc_ioctl_state(struct file *file, struct inode *inode,
 		    struct lu_pcc_state *state);
 void pcc_file_init(struct pcc_file *pccf);
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 2f9687e..317b236 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -357,8 +357,8 @@ struct ll_ioc_lease_id {
 #define LL_IOC_LADVISE			_IOR('f', 250, struct llapi_lu_ladvise)
 #define LL_IOC_HEAT_GET			_IOWR('f', 251, struct lu_heat)
 #define LL_IOC_HEAT_SET			_IOW('f', 251, __u64)
-#define LL_IOC_PCC_DETACH		_IO('f', 252)
-#define LL_IOC_PCC_DETACH_BY_FID	_IOW('f', 252, struct lu_pcc_detach)
+#define LL_IOC_PCC_DETACH		_IOW('f', 252, struct lu_pcc_detach)
+#define LL_IOC_PCC_DETACH_BY_FID	_IOW('f', 252, struct lu_pcc_detach_fid)
 #define LL_IOC_PCC_STATE		_IOR('f', 252, struct lu_pcc_state)
 
 #define LL_STATFS_LMV		1
@@ -2093,9 +2093,19 @@ struct lu_pcc_attach {
 	__u32 pcca_id; /* archive ID for readwrite, group ID for readonly */
 };
 
-struct lu_pcc_detach {
+enum lu_pcc_detach_opts {
+	PCC_DETACH_OPT_NONE = 0, /* Detach only, keep the PCC copy */
+	PCC_DETACH_OPT_UNCACHE, /* Remove the cached file after detach */
+};
+
+struct lu_pcc_detach_fid {
 	/* fid of the file to detach */
 	struct lu_fid	pccd_fid;
+	__u32		pccd_opt;
+};
+
+struct lu_pcc_detach {
+	__u32		pccd_opt;
 };
 
 enum lu_pcc_state_flags {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 364/622] lustre: lov: return error if cl_env_get fails
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (362 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 363/622] lustre: pcc: change detach behavior and add keep option James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 365/622] lustre: ptlrpc: Add more flags to DEBUG_REQ_FLAGS macro James Simmons
                   ` (258 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Shaun Tancheff <stancheff@cray.com>

When cl_env_get() fails with an error, return the error.

Cray-bug-id: LUS-7310
WC-bug-id: https://jira.whamcloud.com/browse/LU-12436
Lustre-commit: a7997c836bbf ("LU-12436 lov: return error if cl_env_get fails")
Signed-off-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-on: https://review.whamcloud.com/35229
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/lov/lov_io.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/lov/lov_io.c b/fs/lustre/lov/lov_io.c
index 5b28793..9cdfca1 100644
--- a/fs/lustre/lov/lov_io.c
+++ b/fs/lustre/lov/lov_io.c
@@ -120,8 +120,10 @@ static int lov_io_sub_init(const struct lu_env *env, struct lov_io *lio,
 
 	/* obtain new environment */
 	sub->sub_env = cl_env_get(&sub->sub_refcheck);
-	if (IS_ERR(sub->sub_env))
+	if (IS_ERR(sub->sub_env)) {
 		rc = PTR_ERR(sub->sub_env);
+		return rc;
+	}
 
 	sub_obj = lovsub2cl(lov_r0(lov, index)->lo_sub[stripe]);
 	sub_io = &sub->sub_io;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 365/622] lustre: ptlrpc: Add more flags to DEBUG_REQ_FLAGS macro
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (363 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 364/622] lustre: lov: return error if cl_env_get fails James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 366/622] lustre: ldlm: layout lock fixes James Simmons
                   ` (257 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Vitaly Fertman <c17818@cray.com>

Add the rq_no_reply flag to the DEBUG_REQ_FLAGS macro for debug
purposes. Also, add another debug message to check_write_rcs().

WC-bug-id: https://jira.whamcloud.com/browse/LU-12333
Lustre-commit: 3e43d06810e6 ("LU-12333 ptlrpc: Add more flags to DEBUG_REQ_FLAGS macro")
Signed-off-by: Vitaly Fertman <c17818@cray.com>
Reviewed-on: https://review.whamcloud.com/35090
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_net.h | 4 ++--
 fs/lustre/osc/osc_request.c    | 5 ++++-
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/include/lustre_net.h b/fs/lustre/include/lustre_net.h
index 383d59e..7ed2d99 100644
--- a/fs/lustre/include/lustre_net.h
+++ b/fs/lustre/include/lustre_net.h
@@ -1066,7 +1066,7 @@ static inline void lustre_set_rep_swabbed(struct ptlrpc_request *req,
 	FLAG(req->rq_err, "E"),	FLAG(req->rq_net_err, "e"),		      \
 	FLAG(req->rq_timedout, "X") /* eXpired */, FLAG(req->rq_resend, "S"), \
 	FLAG(req->rq_restart, "T"), FLAG(req->rq_replay, "P"),		      \
-	FLAG(req->rq_no_resend, "N"),					      \
+	FLAG(req->rq_no_resend, "N"), FLAG(req->rq_no_reply, "n"),	      \
 	FLAG(req->rq_waiting, "W"),					      \
 	FLAG(req->rq_wait_ctx, "C"), FLAG(req->rq_hp, "H"),		      \
 	FLAG(req->rq_committed, "M"),                                          \
@@ -1074,7 +1074,7 @@ static inline void lustre_set_rep_swabbed(struct ptlrpc_request *req,
 	FLAG(req->rq_reply_unlinked, "U"),                                     \
 	FLAG(req->rq_receiving_reply, "r")
 
-#define REQ_FLAGS_FMT "%s:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s"
+#define REQ_FLAGS_FMT "%s:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s"
 
 void _debug_req(struct ptlrpc_request *req,
 		struct libcfs_debug_msg_data *data, const char *fmt, ...)
diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index f929908..6b066e5 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -1064,8 +1064,11 @@ static int check_write_rcs(struct ptlrpc_request *req,
 
 	/* return error if any niobuf was in error */
 	for (i = 0; i < niocount; i++) {
-		if ((int)remote_rcs[i] < 0)
+		if ((int)remote_rcs[i] < 0) {
+			CDEBUG(D_INFO, "rc[%d]: %d req %p\n",
+			       i, remote_rcs[i], req);
 			return remote_rcs[i];
+		}
 
 		if (remote_rcs[i] != 0) {
 			CDEBUG(D_INFO, "rc[%d] invalid (%d) req %p\n",
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 366/622] lustre: ldlm: layout lock fixes
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (364 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 365/622] lustre: ptlrpc: Add more flags to DEBUG_REQ_FLAGS macro James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 367/622] lnet: Do not allow gateways on remote nets James Simmons
                   ` (256 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Vitaly Fertman <c17818@cray.com>

As the intent_layout operation becomes more frequent with SEL, cancel
existing layout locks in advance and reuse ELC to deliver the cancels
to the MDS.

As clients are granted LCK_EX layout locks, take this mode into
account as well in ldlm_lock_match().

Cray-bug-id: LUS-2528
WC-bug-id: https://jira.whamcloud.com/browse/LU-10070
Lustre-commit: 51f23ffa4dae ("LU-10070 ldlm: layout lock fixes")
Signed-off-by: Vitaly Fertman <c17818@cray.com>
Reviewed-on: https://review.whamcloud.com/35232
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c    |  3 ++-
 fs/lustre/mdc/mdc_locks.c | 12 ++++++++++--
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index a27c06c..9321b84 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -4978,7 +4978,8 @@ int ll_layout_refresh(struct inode *inode, u32 *gen)
 		 * match it before grabbing layout lock mutex.
 		 */
 		mode = ll_take_md_lock(inode, MDS_INODELOCK_LAYOUT, &lockh, 0,
-				       LCK_CR | LCK_CW | LCK_PR | LCK_PW);
+				       LCK_CR | LCK_CW | LCK_PR |
+				       LCK_PW | LCK_EX);
 		if (mode != 0) { /* hit cached lock */
 			rc = ll_layout_lock_set(&lockh, mode, inode);
 			if (rc == -EAGAIN)
diff --git a/fs/lustre/mdc/mdc_locks.c b/fs/lustre/mdc/mdc_locks.c
index cf6bc9d..5885bbd 100644
--- a/fs/lustre/mdc/mdc_locks.c
+++ b/fs/lustre/mdc/mdc_locks.c
@@ -580,18 +580,26 @@ static struct ptlrpc_request *mdc_intent_layout_pack(struct obd_export *exp,
 						     struct md_op_data *op_data)
 {
 	struct obd_device *obd = class_exp2obd(exp);
+	struct list_head cancels = LIST_HEAD_INIT(cancels);
 	struct ptlrpc_request *req;
 	struct ldlm_intent *lit;
 	struct layout_intent *layout;
-	int rc;
+	int count = 0, rc;
 
 	req = ptlrpc_request_alloc(class_exp2cliimp(exp),
 				   &RQF_LDLM_INTENT_LAYOUT);
 	if (!req)
 		return ERR_PTR(-ENOMEM);
 
+	if (fid_is_sane(&op_data->op_fid2) && (it->it_op & IT_LAYOUT) &&
+	    (it->it_flags & FMODE_WRITE)) {
+		count = mdc_resource_get_unused(exp, &op_data->op_fid2,
+						&cancels, LCK_EX,
+						MDS_INODELOCK_LAYOUT);
+	}
+
 	req_capsule_set_size(&req->rq_pill, &RMF_EADATA, RCL_CLIENT, 0);
-	rc = ldlm_prep_enqueue_req(exp, req, NULL, 0);
+	rc = ldlm_prep_enqueue_req(exp, req, &cancels, count);
 	if (rc) {
 		ptlrpc_request_free(req);
 		return ERR_PTR(rc);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 367/622] lnet: Do not allow gateways on remote nets
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (365 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 366/622] lustre: ldlm: layout lock fixes James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 368/622] lustre: osc: reduce lock contention in osc_unreserve_grant James Simmons
                   ` (255 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

A gateway needs to be reachable over some local interface.
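
A hedged sketch of the new behavior (the NIDs and net names below are
made up for illustration):

```sh
# This now fails with -EINVAL instead of being accepted: the gateway
# 10.10.0.1@tcp1 is on net tcp1, but this node has no local interface
# configured on tcp1, so the gateway could never be reached.
lnetctl route add --net o2ib --gateway 10.10.0.1@tcp1
```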

WC-bug-id: https://jira.whamcloud.com/browse/LU-12411
Lustre-commit: 43b35351e9ca ("LU-12411 lnet: Do not allow gateways on remote nets")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/35198
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/router.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index 81f7a94..f7b53e0 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -436,6 +436,13 @@ static void lnet_shuffle_seed(void)
 	if (lnet_islocalnet(net))
 		return -EEXIST;
 
+	if (!lnet_islocalnet(LNET_NIDNET(gateway))) {
+		CERROR("Cannot add route with gateway %s. There is no local interface configured on LNet %s\n",
+		       libcfs_nid2str(gateway),
+		       libcfs_net2str(LNET_NIDNET(gateway)));
+		return -EINVAL;
+	}
+
 	/* Assume net, route, all new */
 	route = kzalloc(sizeof(*route), GFP_NOFS);
 	rnet = kzalloc(sizeof(*rnet), GFP_NOFS);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 368/622] lustre: osc: reduce lock contention in osc_unreserve_grant
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (366 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 367/622] lnet: Do not allow gateways on remote nets James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 369/622] lnet: Change static defines to use macro for module.c James Simmons
                   ` (254 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Li Dongyang <dongyangli@ddn.com>

In osc_queue_async_io() the cl_loi_list_lock is acquired to reserve
and consume the grant, then released; right after we expand the
extent, the same lock is taken again to unreserve the grant.
We can keep holding the spinlock until we are done with the grant to
improve the throughput.

mpirun  -np 32 /root/ior-openmpi/src/ior -w -t 1m -b 8g -F -e -vv
        -o /scratch0/file -i 1
master:
Max Write: 13799.70 MiB/sec (14470.04 MB/sec)
master with 33858:
Max Write: 14339.57 MiB/sec (15036.13 MB/sec)

WC-bug-id: https://jira.whamcloud.com/browse/LU-11775
Lustre-commit: 8a1ae45a3e4f ("LU-11775 osc: reduce lock contention in osc_unreserve_grant")
Signed-off-by: Li Dongyang <dongyangli@ddn.com>
Reviewed-on: https://review.whamcloud.com/33858
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/osc_cache.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/fs/lustre/osc/osc_cache.c b/fs/lustre/osc/osc_cache.c
index 8ffd8f9..3b4c598 100644
--- a/fs/lustre/osc/osc_cache.c
+++ b/fs/lustre/osc/osc_cache.c
@@ -636,6 +636,7 @@ void osc_extent_release(const struct lu_env *env, struct osc_extent *ext)
 			 */
 			osc_extent_state_set(ext, OES_TRUNC);
 			ext->oe_trunc_pending = 0;
+			osc_object_unlock(obj);
 		} else {
 			int grant = 0;
 
@@ -648,8 +649,6 @@ void osc_extent_release(const struct lu_env *env, struct osc_extent *ext)
 				grant += cli->cl_grant_extent_tax;
 			if (!osc_extent_merge(env, ext, next_extent(ext)))
 				grant += cli->cl_grant_extent_tax;
-			if (grant > 0)
-				osc_unreserve_grant(cli, 0, grant);
 
 			if (ext->oe_urgent)
 				list_move_tail(&ext->oe_link,
@@ -658,8 +657,10 @@ void osc_extent_release(const struct lu_env *env, struct osc_extent *ext)
 				list_move_tail(&ext->oe_link,
 					       &obj->oo_full_exts);
 			}
+			osc_object_unlock(obj);
+			if (grant > 0)
+				osc_unreserve_grant(cli, 0, grant);
 		}
-		osc_object_unlock(obj);
 
 		osc_io_unplug_async(env, cli, obj);
 	}
@@ -1483,13 +1484,20 @@ static void __osc_unreserve_grant(struct client_obd *cli,
 	}
 }
 
-static void osc_unreserve_grant(struct client_obd *cli,
-				unsigned int reserved, unsigned int unused)
+static void osc_unreserve_grant_nolock(struct client_obd *cli,
+				       unsigned int reserved,
+				       unsigned int unused)
 {
-	spin_lock(&cli->cl_loi_list_lock);
 	__osc_unreserve_grant(cli, reserved, unused);
 	if (unused > 0)
 		osc_wake_cache_waiters(cli);
+}
+
+static void osc_unreserve_grant(struct client_obd *cli,
+				unsigned int reserved, unsigned int unused)
+{
+	spin_lock(&cli->cl_loi_list_lock);
+	osc_unreserve_grant_nolock(cli, reserved, unused);
 	spin_unlock(&cli->cl_loi_list_lock);
 }
 
@@ -2385,7 +2393,6 @@ int osc_queue_async_io(const struct lu_env *env, struct cl_io *io,
 			grants = 0;
 			need_release = true;
 		}
-		spin_unlock(&cli->cl_loi_list_lock);
 		if (!need_release && ext->oe_end < index) {
 			tmp = grants;
 			/* try to expand this extent */
@@ -2396,10 +2403,11 @@ int osc_queue_async_io(const struct lu_env *env, struct cl_io *io,
 			} else {
 				OSC_EXTENT_DUMP(D_CACHE, ext,
 						"expanded for %lu.\n", index);
-				osc_unreserve_grant(cli, grants, tmp);
+				osc_unreserve_grant_nolock(cli, grants, tmp);
 				grants = 0;
 			}
 		}
+		spin_unlock(&cli->cl_loi_list_lock);
 		rc = 0;
 	} else if (ext) {
 		/* index is located outside of active extent */
-- 
1.8.3.1


* [lustre-devel] [PATCH 369/622] lnet: Change static defines to use macro for module.c
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 368/622] lustre: osc: reduce lock contention in osc_unreserve_grant James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 370/622] lustre: llite, readahead: don't always use max RPC size James Simmons
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Arshad Hussain <arshad.super@gmail.com>

This patch replaces the mutex that is statically defined in
net/lnet/lnet/module.c with the kernel-provided DEFINE_MUTEX() macro,
removing the need for a separate runtime mutex_init() call.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9010
Lustre-commit: bb967468875f ("LU-9010 lnet: Change static defines to use macro for module.c")
Signed-off-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-on: https://review.whamcloud.com/33932
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/module.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/lnet/lnet/module.c b/net/lnet/lnet/module.c
index 95e1bae..5905f38 100644
--- a/net/lnet/lnet/module.c
+++ b/net/lnet/lnet/module.c
@@ -40,7 +40,7 @@
 module_param(config_on_load, int, 0444);
 MODULE_PARM_DESC(config_on_load, "configure network at module load");
 
-static struct mutex lnet_config_mutex;
+static DEFINE_MUTEX(lnet_config_mutex);
 
 static int
 lnet_configure(void *arg)
@@ -235,8 +235,6 @@ static int __init lnet_init(void)
 {
 	int rc;
 
-	mutex_init(&lnet_config_mutex);
-
 	rc = libcfs_setup();
 	if (rc)
 		return rc;
-- 
1.8.3.1


* [lustre-devel] [PATCH 370/622] lustre: llite, readahead: don't always use max RPC size
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 369/622] lnet: Change static defines to use macro for module.c James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 371/622] lustre: llite: improve single-thread read performance James Simmons
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

Since the 64M RPC support landed, @PTLRPC_MAX_BRW_PAGES is 64M,
and we always use this maximum possible RPC size to check
whether we should avoid fast IO and trigger a real context IO.

This is not good for the following reasons:

(1) Since the current default RPC size is still 4M,
most systems won't use 64M most of the time.

(2) The current default readahead size per file is still 64M,
which makes fast IO always run out of all readahead pages
before the next IO. This breaks what users really want from
readahead: grabbing pages in advance.

To fix this problem, we use 16M as a balance value if the RPC size
is smaller than 16M. The patch also fixes the problem that
@ras_rpc_size could not grow bigger, which is possible in the
following case:

1) set RPC size to 16M
2) set RPC size to 64M

In the current logic ras->ras_rpc_size would be kept at 16M, which is wrong.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12043
Lustre-commit: 7864a6854c3d ("LU-12043 llite,readahead: don't always use max RPC size")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/35033
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_internal.h | 3 +++
 fs/lustre/llite/rw.c             | 6 ++++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index d36e01e..36b620e 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -307,6 +307,9 @@ static inline struct pcc_inode *ll_i2pcci(struct inode *inode)
 	return ll_i2info(inode)->lli_pcc_inode;
 }
 
+/* default to use at least 16M for fast read if possible */
+#define RA_REMAIN_WINDOW_MIN			MiB_TO_PAGES(16UL)
+
 /* default to about 64M of readahead on a given system. */
 #define SBI_DEFAULT_READAHEAD_MAX		MiB_TO_PAGES(64UL)
 
diff --git a/fs/lustre/llite/rw.c b/fs/lustre/llite/rw.c
index c42bbab..ad55695 100644
--- a/fs/lustre/llite/rw.c
+++ b/fs/lustre/llite/rw.c
@@ -376,7 +376,7 @@ static int ras_inside_ra_window(unsigned long idx, struct ra_io_arg *ria)
 				 * update read ahead RPC size.
 				 * NB: it's racy but doesn't matter
 				 */
-				if (ras->ras_rpc_size > ra.cra_rpc_size &&
+				if (ras->ras_rpc_size != ra.cra_rpc_size &&
 				    ra.cra_rpc_size > 0)
 					ras->ras_rpc_size = ra.cra_rpc_size;
 				/* trim it to align with optimal RPC size */
@@ -1203,6 +1203,8 @@ int ll_readpage(struct file *file, struct page *vmpage)
 		struct ll_readahead_state *ras = &fd->fd_ras;
 		struct lu_env *local_env = NULL;
 		struct inode *inode = file_inode(file);
+		unsigned long fast_read_pages =
+			max(RA_REMAIN_WINDOW_MIN, ras->ras_rpc_size);
 		struct vvp_page *vpg;
 
 		result = -ENODATA;
@@ -1245,7 +1247,7 @@ int ll_readpage(struct file *file, struct page *vmpage)
 			 * a cl_io to issue the RPC.
 			 */
 			if (ras->ras_window_start + ras->ras_window_len <
-			    ras->ras_next_readahead + PTLRPC_MAX_BRW_PAGES) {
+			    ras->ras_next_readahead + fast_read_pages) {
 				/* export the page and skip io stack */
 				vpg->vpg_ra_used = 1;
 				cl_page_export(env, page, 1);
-- 
1.8.3.1


* [lustre-devel] [PATCH 371/622] lustre: llite: improve single-thread read performance
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
  2020-02-27 21:13 ` [lustre-devel] [PATCH 370/622] lustre: llite, readahead: don't always use max RPC size James Simmons
@ 2020-02-27 21:13 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 372/622] lustre: obdclass: allow per-session jobids James Simmons
From: James Simmons @ 2020-02-27 21:13 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

Here is the whole history:

Currently, for sequential read IO, we grow the window
size very quickly, until we have cached @max_readahead_per_file
pages. For the following command:

  dd if=/mnt/lustre/file of=/dev/null bs=1M

We will do something like the following:
...
64M bytes cached.
fast io for 16M bytes
readahead extra 16M to fill up window.
fast io for 16M bytes
readahead extra 16M to fill up window.
....

In this way, we could only use fast IO for 16M bytes and
then fall through to the non-fast IO mode. This is also the
reason why increasing @max_readahead_per_file doesn't improve
performance: that value only changes how much memory we cache,
and during my testing, whatever I changed the value to, I could
only get 2GB/s for a single-threaded read.

Actually, we could do better: once we have consumed more
than 16M bytes of readahead pages, submit another readahead
request in the background. Ideally, we could then always
use fast IO.

Test                                                    Patched         Unpatched
dd if=file of=/dev/null bs=1M.                          4.0G/s          1.9G/s
ior -np 192 r -t 1m -b 4g -F -e -vv -o /cache1/ior -k   11195.97        10817.02 MB/sec

Tested with OSS and client memory dropped before every run.
max_readahead_per_mb=128M, RPC size is 16M.
The dd file's size is 400G, roughly double the memory.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12043
Lustre-commit: c2791674260b ("LU-12043 llite: improve single-thread read performance")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/34095
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/cl_object.h    |   6 +-
 fs/lustre/llite/file.c           |   3 +-
 fs/lustre/llite/llite_internal.h |  27 ++++
 fs/lustre/llite/llite_lib.c      |  17 ++-
 fs/lustre/llite/lproc_llite.c    |  87 +++++++++++-
 fs/lustre/llite/rw.c             | 277 +++++++++++++++++++++++++++++++++++----
 fs/lustre/llite/vvp_io.c         |   5 +
 fs/lustre/lov/lov_io.c           |   1 +
 8 files changed, 391 insertions(+), 32 deletions(-)

diff --git a/fs/lustre/include/cl_object.h b/fs/lustre/include/cl_object.h
index d1c1413..5096025 100644
--- a/fs/lustre/include/cl_object.h
+++ b/fs/lustre/include/cl_object.h
@@ -1891,7 +1891,11 @@ struct cl_io {
 	 * mirror is inaccessible, non-delay RPC would error out quickly so
 	 * that the upper layer can try to access the next mirror.
 	 */
-				ci_ndelay:1;
+				ci_ndelay:1,
+	/**
+	 * Set if IO is triggered by async workqueue readahead.
+	 */
+				ci_async_readahead:1;
 	/**
 	 * How many times the read has retried before this one.
 	 * Set by the top level and consumed by the LOV.
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 9321b84..5d1cfa4 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -1407,7 +1407,7 @@ static bool file_is_noatime(const struct file *file)
 	return false;
 }
 
-static void ll_io_init(struct cl_io *io, const struct file *file, int write)
+void ll_io_init(struct cl_io *io, const struct file *file, int write)
 {
 	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
 	struct inode *inode = file_inode(file);
@@ -1431,6 +1431,7 @@ static void ll_io_init(struct cl_io *io, const struct file *file, int write)
 	}
 
 	io->ci_noatime = file_is_noatime(file);
+	io->ci_async_readahead = false;
 
 	/* FLR: only use non-delay I/O for read as there is only one
 	 * available mirror for write.
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index 36b620e..8d95694 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -330,6 +330,8 @@ enum ra_stat {
 	RA_STAT_MAX_IN_FLIGHT,
 	RA_STAT_WRONG_GRAB_PAGE,
 	RA_STAT_FAILED_REACH_END,
+	RA_STAT_ASYNC,
+	RA_STAT_FAILED_FAST_READ,
 	_NR_RA_STAT,
 };
 
@@ -338,6 +340,16 @@ struct ll_ra_info {
 	unsigned long	     ra_max_pages;
 	unsigned long	     ra_max_pages_per_file;
 	unsigned long	     ra_max_read_ahead_whole_pages;
+	struct workqueue_struct  *ll_readahead_wq;
+	/*
+	 * Max number of active works for readahead workqueue,
+	 * default is 0 which make workqueue init number itself,
+	 * unless there is a specific need for throttling the
+	 * number of active work items, specifying '0' is recommended.
+	 */
+	unsigned int ra_async_max_active;
+	/* Threshold to control when to trigger async readahead */
+	unsigned long ra_async_pages_per_file_threshold;
 };
 
 /* ra_io_arg will be filled in the beginning of ll_readahead with
@@ -656,6 +668,20 @@ struct ll_readahead_state {
 	 * stride read-ahead will be enable
 	 */
 	unsigned long   ras_consecutive_stride_requests;
+	/* index of the last page that async readahead starts */
+	unsigned long	ras_async_last_readpage;
+};
+
+struct ll_readahead_work {
+	/** File to readahead */
+	struct file			*lrw_file;
+	/** Start bytes */
+	unsigned long			 lrw_start;
+	/** End bytes */
+	unsigned long			 lrw_end;
+
+	/* async worker to handler read */
+	struct work_struct		 lrw_readahead_work;
 };
 
 extern struct kmem_cache *ll_file_data_slab;
@@ -757,6 +783,7 @@ int cl_get_grouplock(struct cl_object *obj, unsigned long gid, int nonblock,
 void ll_rw_stats_tally(struct ll_sb_info *sbi, pid_t pid,
 		       struct ll_file_data *file, loff_t pos,
 		       size_t count, int rw);
+void ll_io_init(struct cl_io *io, const struct file *file, int write);
 
 enum {
 	LPROC_LL_DIRTY_HITS,
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 5ac083c..33f7fdb 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -92,14 +92,25 @@ static struct ll_sb_info *ll_init_sbi(void)
 	pages = si.totalram - si.totalhigh;
 	lru_page_max = pages / 2;
 
+	sbi->ll_ra_info.ra_async_max_active = 0;
+	sbi->ll_ra_info.ll_readahead_wq =
+		alloc_workqueue("ll-readahead-wq", WQ_UNBOUND,
+				sbi->ll_ra_info.ra_async_max_active);
+	if (!sbi->ll_ra_info.ll_readahead_wq) {
+		rc = -ENOMEM;
+		goto out_pcc;
+	}
+
 	sbi->ll_cache = cl_cache_init(lru_page_max);
 	if (!sbi->ll_cache) {
 		rc = -ENOMEM;
-		goto out_pcc;
+		goto out_destroy_ra;
 	}
 
 	sbi->ll_ra_info.ra_max_pages_per_file = min(pages / 32,
 						    SBI_DEFAULT_READAHEAD_MAX);
+	sbi->ll_ra_info.ra_async_pages_per_file_threshold =
+				sbi->ll_ra_info.ra_max_pages_per_file;
 	sbi->ll_ra_info.ra_max_pages = sbi->ll_ra_info.ra_max_pages_per_file;
 	sbi->ll_ra_info.ra_max_read_ahead_whole_pages = -1;
 
@@ -138,6 +149,8 @@ static struct ll_sb_info *ll_init_sbi(void)
 	sbi->ll_heat_decay_weight = SBI_DEFAULT_HEAT_DECAY_WEIGHT;
 	sbi->ll_heat_period_second = SBI_DEFAULT_HEAT_PERIOD_SECOND;
 	return sbi;
+out_destroy_ra:
+	destroy_workqueue(sbi->ll_ra_info.ll_readahead_wq);
 out_pcc:
 	pcc_super_fini(&sbi->ll_pcc_super);
 out_sbi:
@@ -151,6 +164,8 @@ static void ll_free_sbi(struct super_block *sb)
 
 	if (!list_empty(&sbi->ll_squash.rsi_nosquash_nids))
 		cfs_free_nidlist(&sbi->ll_squash.rsi_nosquash_nids);
+	if (sbi->ll_ra_info.ll_readahead_wq)
+		destroy_workqueue(sbi->ll_ra_info.ll_readahead_wq);
 	if (sbi->ll_cache) {
 		cl_cache_decref(sbi->ll_cache);
 		sbi->ll_cache = NULL;
diff --git a/fs/lustre/llite/lproc_llite.c b/fs/lustre/llite/lproc_llite.c
index 8cb4983..02403e4 100644
--- a/fs/lustre/llite/lproc_llite.c
+++ b/fs/lustre/llite/lproc_llite.c
@@ -1059,6 +1059,87 @@ static ssize_t tiny_write_store(struct kobject *kobj,
 }
 LUSTRE_RW_ATTR(tiny_write);
 
+static ssize_t max_read_ahead_async_active_show(struct kobject *kobj,
+					       struct attribute *attr,
+					       char *buf)
+{
+	struct ll_sb_info *sbi = container_of(kobj, struct ll_sb_info,
+					      ll_kset.kobj);
+
+	return snprintf(buf, PAGE_SIZE, "%u\n",
+			sbi->ll_ra_info.ra_async_max_active);
+}
+
+static ssize_t max_read_ahead_async_active_store(struct kobject *kobj,
+						struct attribute *attr,
+						const char *buffer,
+						size_t count)
+{
+	unsigned int val;
+	int rc;
+	struct ll_sb_info *sbi = container_of(kobj, struct ll_sb_info,
+					      ll_kset.kobj);
+
+	rc = kstrtouint(buffer, 10, &val);
+	if (rc)
+		return rc;
+
+	if (val < 1 || val > WQ_UNBOUND_MAX_ACTIVE) {
+		CERROR("%s: cannot set max_read_ahead_async_active=%u %s than %u\n",
+		       sbi->ll_fsname, val,
+		       val < 1 ? "smaller" : "larger",
+		       val < 1 ? 1 : WQ_UNBOUND_MAX_ACTIVE);
+		return -ERANGE;
+	}
+
+	sbi->ll_ra_info.ra_async_max_active = val;
+	workqueue_set_max_active(sbi->ll_ra_info.ll_readahead_wq, val);
+
+	return count;
+}
+LUSTRE_RW_ATTR(max_read_ahead_async_active);
+
+static ssize_t read_ahead_async_file_threshold_mb_show(struct kobject *kobj,
+						       struct attribute *attr,
+						       char *buf)
+{
+	struct ll_sb_info *sbi = container_of(kobj, struct ll_sb_info,
+					      ll_kset.kobj);
+
+	return snprintf(buf, PAGE_SIZE, "%lu\n",
+	     PAGES_TO_MiB(sbi->ll_ra_info.ra_async_pages_per_file_threshold));
+}
+
+static ssize_t
+read_ahead_async_file_threshold_mb_store(struct kobject *kobj,
+					 struct attribute *attr,
+					 const char *buffer, size_t count)
+{
+	unsigned long pages_number;
+	unsigned long max_ra_per_file;
+	struct ll_sb_info *sbi = container_of(kobj, struct ll_sb_info,
+					      ll_kset.kobj);
+	int rc;
+
+	rc = kstrtoul(buffer, 10, &pages_number);
+	if (rc)
+		return rc;
+
+	pages_number = MiB_TO_PAGES(pages_number);
+	max_ra_per_file = sbi->ll_ra_info.ra_max_pages_per_file;
+	if (pages_number < 0 || pages_number > max_ra_per_file) {
+		CERROR("%s: can't set read_ahead_async_file_threshold_mb=%lu > max_read_readahead_per_file_mb=%lu\n",
+		       sbi->ll_fsname,
+		       PAGES_TO_MiB(pages_number),
+		       PAGES_TO_MiB(max_ra_per_file));
+		return -ERANGE;
+	}
+	sbi->ll_ra_info.ra_async_pages_per_file_threshold = pages_number;
+
+	return count;
+}
+LUSTRE_RW_ATTR(read_ahead_async_file_threshold_mb);
+
 static ssize_t fast_read_show(struct kobject *kobj,
 			      struct attribute *attr,
 			      char *buf)
@@ -1407,6 +1488,8 @@ struct lprocfs_vars lprocfs_llite_obd_vars[] = {
 	&lustre_attr_file_heat.attr,
 	&lustre_attr_heat_decay_percentage.attr,
 	&lustre_attr_heat_period_second.attr,
+	&lustre_attr_max_read_ahead_async_active.attr,
+	&lustre_attr_read_ahead_async_file_threshold_mb.attr,
 	NULL,
 };
 
@@ -1505,7 +1588,9 @@ void ll_stats_ops_tally(struct ll_sb_info *sbi, int op, int count)
 	[RA_STAT_EOF]			= "read-ahead to EOF",
 	[RA_STAT_MAX_IN_FLIGHT]		= "hit max r-a issue",
 	[RA_STAT_WRONG_GRAB_PAGE]	= "wrong page from grab_cache_page",
-	[RA_STAT_FAILED_REACH_END]	= "failed to reach end"
+	[RA_STAT_FAILED_REACH_END]	= "failed to reach end",
+	[RA_STAT_ASYNC]			= "async readahead",
+	[RA_STAT_FAILED_FAST_READ]	= "failed to fast read",
 };
 
 int ll_debugfs_register_super(struct super_block *sb, const char *name)
diff --git a/fs/lustre/llite/rw.c b/fs/lustre/llite/rw.c
index ad55695..bec26c4 100644
--- a/fs/lustre/llite/rw.c
+++ b/fs/lustre/llite/rw.c
@@ -45,6 +45,7 @@
 #include <linux/uaccess.h>
 
 #include <linux/fs.h>
+#include <linux/file.h>
 #include <linux/pagemap.h>
 /* current_is_kswapd() */
 #include <linux/swap.h>
@@ -129,16 +130,17 @@ void ll_ra_stats_inc(struct inode *inode, enum ra_stat which)
 }
 
 #define RAS_CDEBUG(ras) \
-	CDEBUG(D_READA,						      \
+	CDEBUG(D_READA,							     \
 	       "lrp %lu cr %lu cp %lu ws %lu wl %lu nra %lu rpc %lu "	     \
-	       "r %lu ri %lu csr %lu sf %lu sp %lu sl %lu\n",		     \
-	       ras->ras_last_readpage, ras->ras_consecutive_requests,	\
-	       ras->ras_consecutive_pages, ras->ras_window_start,	    \
-	       ras->ras_window_len, ras->ras_next_readahead,		 \
+	       "r %lu ri %lu csr %lu sf %lu sp %lu sl %lu lr %lu\n",	     \
+	       ras->ras_last_readpage, ras->ras_consecutive_requests,	     \
+	       ras->ras_consecutive_pages, ras->ras_window_start,	     \
+	       ras->ras_window_len, ras->ras_next_readahead,		     \
 	       ras->ras_rpc_size,					     \
-	       ras->ras_requests, ras->ras_request_index,		    \
+	       ras->ras_requests, ras->ras_request_index,		     \
 	       ras->ras_consecutive_stride_requests, ras->ras_stride_offset, \
-	       ras->ras_stride_pages, ras->ras_stride_length)
+	       ras->ras_stride_pages, ras->ras_stride_length,		     \
+	       ras->ras_async_last_readpage)
 
 static int index_in_window(unsigned long index, unsigned long point,
 			   unsigned long before, unsigned long after)
@@ -432,13 +434,177 @@ static int ras_inside_ra_window(unsigned long idx, struct ra_io_arg *ria)
 	return count;
 }
 
+static void ll_readahead_work_free(struct ll_readahead_work *work)
+{
+	fput(work->lrw_file);
+	kfree(work);
+}
+
+static void ll_readahead_handle_work(struct work_struct *wq);
+static void ll_readahead_work_add(struct inode *inode,
+				  struct ll_readahead_work *work)
+{
+	INIT_WORK(&work->lrw_readahead_work, ll_readahead_handle_work);
+	queue_work(ll_i2sbi(inode)->ll_ra_info.ll_readahead_wq,
+		   &work->lrw_readahead_work);
+}
+
+static int ll_readahead_file_kms(const struct lu_env *env,
+				struct cl_io *io, u64 *kms)
+{
+	struct cl_object *clob;
+	struct inode *inode;
+	struct cl_attr *attr = vvp_env_thread_attr(env);
+	int ret;
+
+	clob = io->ci_obj;
+	inode = vvp_object_inode(clob);
+
+	cl_object_attr_lock(clob);
+	ret = cl_object_attr_get(env, clob, attr);
+	cl_object_attr_unlock(clob);
+
+	if (ret != 0)
+		return ret;
+
+	*kms = attr->cat_kms;
+	return 0;
+}
+
+static void ll_readahead_handle_work(struct work_struct *wq)
+{
+	struct ll_readahead_work *work;
+	struct lu_env *env;
+	u16 refcheck;
+	struct ra_io_arg *ria;
+	struct inode *inode;
+	struct ll_file_data *fd;
+	struct ll_readahead_state *ras;
+	struct cl_io *io;
+	struct cl_2queue *queue;
+	pgoff_t ra_end = 0;
+	unsigned long len, mlen = 0;
+	struct file *file;
+	u64 kms;
+	int rc;
+	unsigned long end_index;
+
+	work = container_of(wq, struct ll_readahead_work,
+			    lrw_readahead_work);
+	fd = LUSTRE_FPRIVATE(work->lrw_file);
+	ras = &fd->fd_ras;
+	file = work->lrw_file;
+	inode = file_inode(file);
+
+	env = cl_env_alloc(&refcheck, LCT_NOREF);
+	if (IS_ERR(env)) {
+		rc = PTR_ERR(env);
+		goto out_free_work;
+	}
+
+	io = vvp_env_thread_io(env);
+	ll_io_init(io, file, 0);
+
+	rc = ll_readahead_file_kms(env, io, &kms);
+	if (rc != 0)
+		goto out_put_env;
+
+	if (kms == 0) {
+		ll_ra_stats_inc(inode, RA_STAT_ZERO_LEN);
+		rc = 0;
+		goto out_put_env;
+	}
+
+	ria = &ll_env_info(env)->lti_ria;
+	memset(ria, 0, sizeof(*ria));
+
+	ria->ria_start = work->lrw_start;
+	/* Truncate RA window to end of file */
+	end_index = (unsigned long)((kms - 1) >> PAGE_SHIFT);
+	if (end_index <= work->lrw_end) {
+		work->lrw_end = end_index;
+		ria->ria_eof = true;
+	}
+	if (work->lrw_end <= work->lrw_start) {
+		rc = 0;
+		goto out_put_env;
+	}
+
+	ria->ria_end = work->lrw_end;
+	len = ria->ria_end - ria->ria_start + 1;
+	ria->ria_reserved = ll_ra_count_get(ll_i2sbi(inode), ria,
+					    ria_page_count(ria), mlen);
+
+	CDEBUG(D_READA,
+	       "async reserved pages: %lu/%lu/%lu, ra_cur %d, ra_max %lu\n",
+	       ria->ria_reserved, len, mlen,
+	       atomic_read(&ll_i2sbi(inode)->ll_ra_info.ra_cur_pages),
+	       ll_i2sbi(inode)->ll_ra_info.ra_max_pages);
+
+	if (ria->ria_reserved < len) {
+		ll_ra_stats_inc(inode, RA_STAT_MAX_IN_FLIGHT);
+		if (PAGES_TO_MiB(ria->ria_reserved) < 1) {
+			ll_ra_count_put(ll_i2sbi(inode), ria->ria_reserved);
+			rc = 0;
+			goto out_put_env;
+		}
+	}
+
+	rc = cl_io_rw_init(env, io, CIT_READ, ria->ria_start, len);
+	if (rc)
+		goto out_put_env;
+
+	vvp_env_io(env)->vui_fd = fd;
+	io->ci_state = CIS_LOCKED;
+	io->ci_async_readahead = true;
+	rc = cl_io_start(env, io);
+	if (rc)
+		goto out_io_fini;
+
+	queue = &io->ci_queue;
+	cl_2queue_init(queue);
+
+	rc = ll_read_ahead_pages(env, io, &queue->c2_qin, ras, ria, &ra_end);
+	if (ria->ria_reserved != 0)
+		ll_ra_count_put(ll_i2sbi(inode), ria->ria_reserved);
+	if (queue->c2_qin.pl_nr > 0) {
+		int count = queue->c2_qin.pl_nr;
+
+		rc = cl_io_submit_rw(env, io, CRT_READ, queue);
+		if (rc == 0)
+			task_io_account_read(PAGE_SIZE * count);
+	}
+	if (ria->ria_end == ra_end && ra_end == (kms >> PAGE_SHIFT))
+		ll_ra_stats_inc(inode, RA_STAT_EOF);
+
+	if (ra_end != ria->ria_end)
+		ll_ra_stats_inc(inode, RA_STAT_FAILED_REACH_END);
+
+	/* TODO: discard all pages until page reinit route is implemented */
+	cl_page_list_discard(env, io, &queue->c2_qin);
+
+	/* Unlock unsent read pages in case of error. */
+	cl_page_list_disown(env, io, &queue->c2_qin);
+
+	cl_2queue_fini(env, queue);
+out_io_fini:
+	cl_io_end(env, io);
+	cl_io_fini(env, io);
+out_put_env:
+	cl_env_put(env, &refcheck);
+out_free_work:
+	if (ra_end > 0)
+		ll_ra_stats_inc_sbi(ll_i2sbi(inode), RA_STAT_ASYNC);
+	ll_readahead_work_free(work);
+}
+
 static int ll_readahead(const struct lu_env *env, struct cl_io *io,
 			struct cl_page_list *queue,
-			struct ll_readahead_state *ras, bool hit)
+			struct ll_readahead_state *ras, bool hit,
+			struct file *file)
 {
 	struct vvp_io *vio = vvp_env_io(env);
 	struct ll_thread_info *lti = ll_env_info(env);
-	struct cl_attr *attr = vvp_env_thread_attr(env);
 	unsigned long len, mlen = 0;
 	pgoff_t ra_end = 0, start = 0, end = 0;
 	struct inode *inode;
@@ -451,14 +617,10 @@ static int ll_readahead(const struct lu_env *env, struct cl_io *io,
 	inode = vvp_object_inode(clob);
 
 	memset(ria, 0, sizeof(*ria));
-
-	cl_object_attr_lock(clob);
-	ret = cl_object_attr_get(env, clob, attr);
-	cl_object_attr_unlock(clob);
-
+	ret = ll_readahead_file_kms(env, io, &kms);
 	if (ret != 0)
 		return ret;
-	kms = attr->cat_kms;
+
 	if (kms == 0) {
 		ll_ra_stats_inc(inode, RA_STAT_ZERO_LEN);
 		return 0;
@@ -1141,7 +1303,7 @@ int ll_io_read_page(const struct lu_env *env, struct cl_io *io,
 		int rc2;
 
 		rc2 = ll_readahead(env, io, &queue->c2_qin, ras,
-				   uptodate);
+				   uptodate, file);
 		CDEBUG(D_READA, DFID "%d pages read ahead at %lu\n",
 		       PFID(ll_inode2fid(inode)), rc2, vvp_index(vpg));
 	}
@@ -1183,6 +1345,60 @@ int ll_io_read_page(const struct lu_env *env, struct cl_io *io,
 	return rc;
 }
 
+/*
+ * Possible return value:
+ * 0 no async readahead triggered and fast read could not be used.
+ * 1 no async readahead, but fast read could be used.
+ * 2 async readahead triggered and fast read could be used too.
+ * < 0 on error.
+ */
+static int kickoff_async_readahead(struct file *file, unsigned long pages)
+{
+	struct ll_readahead_work *lrw;
+	struct inode *inode = file_inode(file);
+	struct ll_sb_info *sbi = ll_i2sbi(inode);
+	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
+	struct ll_readahead_state *ras = &fd->fd_ras;
+	struct ll_ra_info *ra = &sbi->ll_ra_info;
+	unsigned long throttle;
+	unsigned long start = ras_align(ras, ras->ras_next_readahead, NULL);
+	unsigned long end = start + pages - 1;
+
+	throttle = min(ra->ra_async_pages_per_file_threshold,
+		       ra->ra_max_pages_per_file);
+	/*
+	 * If this is strided i/o or the window is smaller than the
+	 * throttle limit, we do not do async readahead. Otherwise,
+	 * we do async readahead, allowing the user thread to do fast i/o.
+	 */
+	if (stride_io_mode(ras) || !throttle ||
+	    ras->ras_window_len < throttle)
+		return 0;
+
+	if ((atomic_read(&ra->ra_cur_pages) + pages) > ra->ra_max_pages)
+		return 0;
+
+	if (ras->ras_async_last_readpage == start)
+		return 1;
+
+	/* ll_readahead_work_free() free it */
+	lrw = kzalloc(sizeof(*lrw), GFP_NOFS);
+	if (lrw) {
+		lrw->lrw_file = get_file(file);
+		lrw->lrw_start = start;
+		lrw->lrw_end = end;
+		spin_lock(&ras->ras_lock);
+		ras->ras_next_readahead = end + 1;
+		ras->ras_async_last_readpage = start;
+		spin_unlock(&ras->ras_lock);
+		ll_readahead_work_add(inode, lrw);
+	} else {
+		return -ENOMEM;
+	}
+
+	return 2;
+}
+
 int ll_readpage(struct file *file, struct page *vmpage)
 {
 	struct cl_object *clob = ll_i2info(file_inode(file))->lli_clob;
@@ -1190,6 +1406,7 @@ int ll_readpage(struct file *file, struct page *vmpage)
 	const struct lu_env *env = NULL;
 	struct cl_io *io = NULL;
 	struct cl_page *page;
+	struct ll_sb_info *sbi = ll_i2sbi(file_inode(file));
 	int result;
 
 	lcc = ll_cl_find(file);
@@ -1216,14 +1433,10 @@ int ll_readpage(struct file *file, struct page *vmpage)
 		page = cl_vmpage_page(vmpage, clob);
 		if (!page) {
 			unlock_page(vmpage);
+			ll_ra_stats_inc_sbi(sbi, RA_STAT_FAILED_FAST_READ);
 			return result;
 		}
 
-		if (!env) {
-			local_env = cl_env_percpu_get();
-			env = local_env;
-		}
-
 		vpg = cl2vvp_page(cl_object_page_slice(page->cp_obj, page));
 		if (vpg->vpg_defer_uptodate) {
 			enum ras_update_flags flags = LL_RAS_HIT;
@@ -1236,8 +1449,7 @@ int ll_readpage(struct file *file, struct page *vmpage)
 			 * if the page is hit in cache because non cache page
 			 * case will be handled by slow read later.
 			 */
-			ras_update(ll_i2sbi(inode), inode, ras, vvp_index(vpg),
-				   flags);
+			ras_update(sbi, inode, ras, vvp_index(vpg), flags);
 			/* avoid duplicate ras_update() call */
 			vpg->vpg_ra_updated = 1;
 
@@ -1247,14 +1459,23 @@ int ll_readpage(struct file *file, struct page *vmpage)
 			 * a cl_io to issue the RPC.
 			 */
 			if (ras->ras_window_start + ras->ras_window_len <
-			    ras->ras_next_readahead + fast_read_pages) {
-				/* export the page and skip io stack */
-				vpg->vpg_ra_used = 1;
-				cl_page_export(env, page, 1);
+			    ras->ras_next_readahead + fast_read_pages ||
+			    kickoff_async_readahead(file, fast_read_pages) > 0)
 				result = 0;
-			}
 		}
 
+		if (!env) {
+			local_env = cl_env_percpu_get();
+			env = local_env;
+		}
+
+		/* export the page and skip io stack */
+		if (result == 0) {
+			vpg->vpg_ra_used = 1;
+			cl_page_export(env, page, 1);
+		} else {
+			ll_ra_stats_inc_sbi(sbi, RA_STAT_FAILED_FAST_READ);
+		}
 		/* release page refcount before unlocking the page to ensure
 		 * the object won't be destroyed in the calling path of
 		 * cl_page_put(). Please see comment in ll_releasepage().
diff --git a/fs/lustre/llite/vvp_io.c b/fs/lustre/llite/vvp_io.c
index ee44a18..68455d5 100644
--- a/fs/lustre/llite/vvp_io.c
+++ b/fs/lustre/llite/vvp_io.c
@@ -749,6 +749,11 @@ static int vvp_io_read_start(const struct lu_env *env,
 
 	down_read(&lli->lli_trunc_sem);
 
+	if (io->ci_async_readahead) {
+		file_accessed(file);
+		return 0;
+	}
+
 	if (!can_populate_pages(env, io, inode))
 		return 0;
 
diff --git a/fs/lustre/lov/lov_io.c b/fs/lustre/lov/lov_io.c
index 9cdfca1..9328240 100644
--- a/fs/lustre/lov/lov_io.c
+++ b/fs/lustre/lov/lov_io.c
@@ -136,6 +136,7 @@ static int lov_io_sub_init(const struct lu_env *env, struct lov_io *lio,
 	sub_io->ci_type = io->ci_type;
 	sub_io->ci_no_srvlock = io->ci_no_srvlock;
 	sub_io->ci_noatime = io->ci_noatime;
+	sub_io->ci_async_readahead = io->ci_async_readahead;
 	sub_io->ci_lock_no_expand = io->ci_lock_no_expand;
 	sub_io->ci_ndelay = io->ci_ndelay;
 	sub_io->ci_layout_version = io->ci_layout_version;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 372/622] lustre: obdclass: allow per-session jobids.
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (370 preceding siblings ...)
  2020-02-27 21:13 ` [lustre-devel] [PATCH 371/622] lustre: llite: improve single-thread read performance James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 373/622] lustre: llite: fix deadloop with tiny write James Simmons
                   ` (250 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

Lustre includes a jobid in all RPC messages sent to the server.  This
is used to collect per-job statistics, where a "job" can involve
multiple processes on multiple nodes in a cluster.

Nodes in a cluster can be running processes for multiple jobs, so it
is best if different processes can have different jobids, and
processes on different nodes can share the same jobid.

The current mechanism for supporting this is to use an environment
variable which the kernel extracts from the relevant process's address
space. Some kernel developers see this as an unacceptable design
choice, and the code is not likely to be accepted upstream.

This patch provides an alternate method, leveraging the concept of a
"session id", as set with setsid().  Each login session already gets a
unique sid which is preserved for all processes in that session unless
explicitly changed (with setsid(1)).
When a process in a session writes to
        /sys/fs/lustre/jobid_this_session
the string becomes the name for that session.
If jobid_var is set to "session", then the per-session jobid is used
as the jobid for all requests from processes in that session.

When a session ends, the jobid information will be purged within 5
minutes.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12330
Lustre-commit: a32ce8f50eca ("LU-12330 obdclass: allow per-session jobids.")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/34995
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lprocfs_status.h |   1 +
 fs/lustre/include/obd_class.h      |   4 +
 fs/lustre/obdclass/jobid.c         | 199 +++++++++++++++++++++++++++++++++++--
 fs/lustre/obdclass/obd_sysfs.c     |  48 +++++++++
 4 files changed, 246 insertions(+), 6 deletions(-)

diff --git a/fs/lustre/include/lprocfs_status.h b/fs/lustre/include/lprocfs_status.h
index 9f62d4e..6269bd3 100644
--- a/fs/lustre/include/lprocfs_status.h
+++ b/fs/lustre/include/lprocfs_status.h
@@ -360,6 +360,7 @@ enum {
 #define JOBSTATS_DISABLE		"disable"
 #define JOBSTATS_PROCNAME_UID		"procname_uid"
 #define JOBSTATS_NODELOCAL		"nodelocal"
+#define JOBSTATS_SESSION		"session"
 
 /* obd_config.c */
 void lustre_register_client_process_config(int (*cpc)(struct lustre_cfg *lcfg));
diff --git a/fs/lustre/include/obd_class.h b/fs/lustre/include/obd_class.h
index 58c743c..76e8201 100644
--- a/fs/lustre/include/obd_class.h
+++ b/fs/lustre/include/obd_class.h
@@ -57,6 +57,10 @@
 struct obd_device *class_exp2obd(struct obd_export *exp);
 int class_handle_ioctl(unsigned int cmd, unsigned long arg);
 int lustre_get_jobid(char *jobid, size_t len);
+void jobid_cache_fini(void);
+int jobid_cache_init(void);
+char *jobid_current(void);
+int jobid_set_current(char *jobid);
 
 struct lu_device_type;
 
diff --git a/fs/lustre/obdclass/jobid.c b/fs/lustre/obdclass/jobid.c
index 8bad859..98b3f39 100644
--- a/fs/lustre/obdclass/jobid.c
+++ b/fs/lustre/obdclass/jobid.c
@@ -46,6 +46,151 @@
 char obd_jobid_var[JOBSTATS_JOBID_VAR_MAX_LEN + 1] = JOBSTATS_DISABLE;
 char obd_jobid_name[LUSTRE_JOBID_SIZE] = "%e.%u";
 
+/*
+ * Jobid can be set for a session (see setsid(2)) by writing to
+ * a sysfs file from any process in that session.
+ * The jobids are stored in a hash table indexed by the relevant
+ * struct pid.  We periodically look for entries where the pid has
+ * no PIDTYPE_SID tasks any more, and prune them.  This happens within
+ * 5 seconds of a jobid being added, and every 5 minutes when jobids exist,
+ * but none are added.
+ */
+#define JOBID_EXPEDITED_CLEAN	(5)
+#define JOBID_BACKGROUND_CLEAN	(5 * 60)
+
+struct session_jobid {
+	struct pid		*sj_session;
+	struct rhash_head	sj_linkage;
+	struct rcu_head		sj_rcu;
+	char			sj_jobid[1];
+};
+
+static const struct rhashtable_params jobid_params = {
+	.key_len	= sizeof(struct pid *),
+	.key_offset	= offsetof(struct session_jobid, sj_session),
+	.head_offset	= offsetof(struct session_jobid, sj_linkage),
+};
+
+static struct rhashtable session_jobids;
+
+/*
+ * jobid_current must be called with rcu_read_lock held.
+ * if it returns non-NULL, the string can only be used
+ * until rcu_read_unlock is called.
+ */
+char *jobid_current(void)
+{
+	struct pid *sid = task_session(current);
+	struct session_jobid *sj;
+
+	sj = rhashtable_lookup_fast(&session_jobids, &sid, jobid_params);
+	if (sj)
+		return sj->sj_jobid;
+	return NULL;
+}
+
+static void jobid_prune_expedite(void);
+/*
+ * jobid_set_current will try to add a new entry
+ * to the table.  If one exists with the same key, the
+ * jobid will be replaced
+ */
+int jobid_set_current(char *jobid)
+{
+	struct pid *sid;
+	struct session_jobid *sj, *origsj;
+	int ret;
+	int len = strlen(jobid);
+
+	sj = kmalloc(sizeof(*sj) + len, GFP_KERNEL);
+	if (!sj)
+		return -ENOMEM;
+	rcu_read_lock();
+	sid = task_session(current);
+	sj->sj_session = get_pid(sid);
+	strncpy(sj->sj_jobid, jobid, len+1);
+	origsj = rhashtable_lookup_get_insert_fast(&session_jobids,
+						   &sj->sj_linkage,
+						   jobid_params);
+	if (!origsj) {
+		/* successful insert */
+		rcu_read_unlock();
+		jobid_prune_expedite();
+		return 0;
+	}
+
+	if (IS_ERR(origsj)) {
+		put_pid(sj->sj_session);
+		kfree(sj);
+		rcu_read_unlock();
+		return PTR_ERR(origsj);
+	}
+	ret = rhashtable_replace_fast(&session_jobids,
+				      &origsj->sj_linkage,
+				      &sj->sj_linkage,
+				      jobid_params);
+	if (ret) {
+		put_pid(sj->sj_session);
+		kfree(sj);
+		rcu_read_unlock();
+		return ret;
+	}
+	put_pid(origsj->sj_session);
+	rcu_read_unlock();
+	kfree_rcu(origsj, sj_rcu);
+	jobid_prune_expedite();
+
+	return 0;
+}
+
+static void jobid_free(void *vsj, void *arg)
+{
+	struct session_jobid *sj = vsj;
+
+	put_pid(sj->sj_session);
+	kfree(sj);
+}
+
+static void jobid_prune(struct work_struct *work);
+static DECLARE_DELAYED_WORK(jobid_prune_work, jobid_prune);
+static int jobid_prune_expedited;
+static void jobid_prune(struct work_struct *work)
+{
+	int remaining = 0;
+	struct rhashtable_iter iter;
+	struct session_jobid *sj;
+
+	jobid_prune_expedited = 0;
+	rhashtable_walk_enter(&session_jobids, &iter);
+	rhashtable_walk_start(&iter);
+	while ((sj = rhashtable_walk_next(&iter)) != NULL) {
+		if (!hlist_empty(&sj->sj_session->tasks[PIDTYPE_SID])) {
+			remaining++;
+			continue;
+		}
+		if (rhashtable_remove_fast(&session_jobids,
+					   &sj->sj_linkage,
+					   jobid_params) == 0) {
+			put_pid(sj->sj_session);
+			kfree_rcu(sj, sj_rcu);
+		}
+	}
+	rhashtable_walk_stop(&iter);
+	rhashtable_walk_exit(&iter);
+	if (remaining)
+		schedule_delayed_work(&jobid_prune_work,
+				      JOBID_BACKGROUND_CLEAN * HZ);
+}
+
+static void jobid_prune_expedite(void)
+{
+	if (!jobid_prune_expedited) {
+		jobid_prune_expedited = 1;
+		mod_delayed_work(system_wq, &jobid_prune_work,
+				 JOBID_EXPEDITED_CLEAN * HZ);
+	}
+}
+
 /* Get jobid of current process from stored variable or calculate
  * it from pid and user_id.
  *
@@ -134,14 +279,40 @@ static int jobid_interpret_string(const char *jobfmt, char *jobid,
 	return joblen < 0 ? -EOVERFLOW : 0;
 }
 
+/**
+ * Generate the job identifier string for this process for tracking purposes.
+ *
+ * Fill in @jobid string based on the value of obd_jobid_var:
+ * JOBSTATS_DISABLE:	  none
+ * JOBSTATS_NODELOCAL:	  content of obd_jobid_name (jobid_interpret_string())
+ * JOBSTATS_PROCNAME_UID: process name/UID
+ * JOBSTATS_SESSION	  per-session value set by
+ *			  /sys/fs/lustre/jobid_this_session
+ *
+ * Return -ve error number, 0 on success.
+ */
 int lustre_get_jobid(char *jobid, size_t joblen)
 {
 	char tmp_jobid[LUSTRE_JOBID_SIZE] = "";
 
+	if (unlikely(joblen < 2)) {
+		if (joblen == 1)
+			jobid[0] = '\0';
+		return -EINVAL;
+	}
+
 	/* Jobstats isn't enabled */
 	if (strcmp(obd_jobid_var, JOBSTATS_DISABLE) == 0)
 		goto out_cache_jobid;
 
+	/* Whole node dedicated to single job */
+	if (strcmp(obd_jobid_var, JOBSTATS_NODELOCAL) == 0) {
+		int rc2 = jobid_interpret_string(obd_jobid_name,
+						 tmp_jobid, joblen);
+		if (!rc2)
+			goto out_cache_jobid;
+	}
+
 	/* Use process name + fsuid as jobid */
 	if (strcmp(obd_jobid_var, JOBSTATS_PROCNAME_UID) == 0) {
 		snprintf(tmp_jobid, LUSTRE_JOBID_SIZE, "%s.%u",
@@ -150,13 +321,17 @@ int lustre_get_jobid(char *jobid, size_t joblen)
 		goto out_cache_jobid;
 	}
 
-	/* Whole node dedicated to single job */
-	if (strcmp(obd_jobid_var, JOBSTATS_NODELOCAL) == 0) {
-		int rc2 = jobid_interpret_string(obd_jobid_name,
-						 tmp_jobid, joblen);
-		if (!rc2)
-			goto out_cache_jobid;
+	if (strcmp(obd_jobid_var, JOBSTATS_SESSION) == 0) {
+		char *jid;
+
+		rcu_read_lock();
+		jid = jobid_current();
+		if (jid)
+			strlcpy(jobid, jid, joblen);
+		rcu_read_unlock();
+		goto out_cache_jobid;
 	}
+
 	return -ENOENT;
 
 out_cache_jobid:
@@ -167,3 +342,15 @@ int lustre_get_jobid(char *jobid, size_t joblen)
 	return 0;
 }
 EXPORT_SYMBOL(lustre_get_jobid);
+
+int jobid_cache_init(void)
+{
+	return rhashtable_init(&session_jobids, &jobid_params);
+}
+
+void jobid_cache_fini(void)
+{
+	cancel_delayed_work_sync(&jobid_prune_work);
+
+	rhashtable_free_and_destroy(&session_jobids, jobid_free, NULL);
+}
diff --git a/fs/lustre/obdclass/obd_sysfs.c b/fs/lustre/obdclass/obd_sysfs.c
index ca15936..8803d05 100644
--- a/fs/lustre/obdclass/obd_sysfs.c
+++ b/fs/lustre/obdclass/obd_sysfs.c
@@ -259,6 +259,44 @@ static ssize_t jobid_name_store(struct kobject *kobj, struct attribute *attr,
 	return count;
 }
 
+static ssize_t jobid_this_session_show(struct kobject *kobj,
+				       struct attribute *attr,
+				       char *buf)
+{
+	char *jid;
+	int ret = -ENOENT;
+
+	rcu_read_lock();
+	jid = jobid_current();
+	if (jid)
+		ret = snprintf(buf, PAGE_SIZE, "%s\n", jid);
+	rcu_read_unlock();
+	return ret;
+}
+
+static ssize_t jobid_this_session_store(struct kobject *kobj,
+					struct attribute *attr,
+					const char *buffer,
+					size_t count)
+{
+	char *jobid;
+	int len;
+	int ret;
+
+	if (!count || count > LUSTRE_JOBID_SIZE)
+		return -EINVAL;
+
+	jobid = kstrndup(buffer, count, GFP_KERNEL);
+	if (!jobid)
+		return -ENOMEM;
+	len = strcspn(jobid, "\n ");
+	jobid[len] = '\0';
+	ret = jobid_set_current(jobid);
+	kfree(jobid);
+
+	return ret ?: count;
+}
+
 /* Root for /sys/kernel/debug/lustre */
 struct dentry *debugfs_lustre_root;
 EXPORT_SYMBOL_GPL(debugfs_lustre_root);
@@ -268,6 +306,7 @@ static ssize_t jobid_name_store(struct kobject *kobj, struct attribute *attr,
 LUSTRE_RO_ATTR(health_check);
 LUSTRE_RW_ATTR(jobid_var);
 LUSTRE_RW_ATTR(jobid_name);
+LUSTRE_RW_ATTR(jobid_this_session);
 
 static struct attribute *lustre_attrs[] = {
 	&lustre_attr_version.attr,
@@ -275,6 +314,7 @@ static ssize_t jobid_name_store(struct kobject *kobj, struct attribute *attr,
 	&lustre_attr_health_check.attr,
 	&lustre_attr_jobid_name.attr,
 	&lustre_attr_jobid_var.attr,
+	&lustre_attr_jobid_this_session.attr,
 	&lustre_sattr_timeout.u.attr,
 	&lustre_attr_max_dirty_mb.attr,
 	&lustre_sattr_debug_peer_on_timeout.u.attr,
@@ -441,6 +481,12 @@ int class_procfs_init(void)
 		goto out;
 	}
 
+	rc = jobid_cache_init();
+	if (rc) {
+		kset_unregister(lustre_kset);
+		goto out;
+	}
+
 	debugfs_lustre_root = debugfs_create_dir("lustre", NULL);
 
 	debugfs_create_file("devices", 0444, debugfs_lustre_root, NULL,
@@ -458,6 +504,8 @@ int class_procfs_clean(void)
 
 	debugfs_lustre_root = NULL;
 
+	jobid_cache_fini();
+
 	sysfs_remove_group(&lustre_kset->kobj, &lustre_attr_group);
 
 	kset_unregister(lustre_kset);
-- 
1.8.3.1


* [lustre-devel] [PATCH 373/622] lustre: llite: fix deadloop with tiny write
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (371 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 372/622] lustre: obdclass: allow per-session jobids James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 374/622] lnet: prevent loop in LNetPrimaryNID() James Simmons
                   ` (249 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

For a small write(<4K), we will use tiny write and
__generic_file_write_iter() will be called to handle it.

On newer kernels (4.14, etc.), the function is exported and does
something like the following:

|->__generic_file_write_iter
  |->generic_perform_write()

If iov_iter_count() passed in is 0, generic_perform_write() will
loop forever, because the number of bytes copied is always
calculated as 0.

The problem is that the VFS doesn't always skip a zero-count IO
before it reaches the lower-layer read/write hooks, so we should do
it ourselves.

To fix this problem, return 0 early if there is no real IO needed.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12382
Lustre-commit: e9a543b0d303 ("LU-12382 llite: fix deadloop with tiny write")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/35058
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 5d1cfa4..1ed4b14 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -1668,6 +1668,9 @@ static ssize_t ll_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	ssize_t rc2;
 	bool cached;
 
+	if (!iov_iter_count(to))
+		return 0;
+
 	/**
 	 * Currently when PCC read failed, we do not fall back to the
 	 * normal read path, just return the error.
@@ -1778,6 +1781,11 @@ static ssize_t ll_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	bool cached;
 	int result;
 
+	if (!iov_iter_count(from)) {
+		rc_normal = 0;
+		goto out;
+	}
+
 	/**
 	 * When PCC write failed, we usually do not fall back to the normal
 	 * write path, just return the error. But there is a special case when
-- 
1.8.3.1


* [lustre-devel] [PATCH 374/622] lnet: prevent loop in LNetPrimaryNID()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (372 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 373/622] lustre: llite: fix deadloop with tiny write James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 375/622] lustre: ldlm: Fix style issues for ldlm_lib.c James Simmons
                   ` (248 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

If discovery is disabled locally or at the remote end, then attempt
discovery only once. Do not update the internal database when
discovery is disabled and do not repeat discovery.

This change prevents LNet from hanging while waiting for
discovery to complete.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12424
Lustre-commit: 439520f762b0 ("LU-12424 lnet: prevent loop in LNetPrimaryNID()")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35191
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/peer.c | 73 ++++++++++++++++++++++++++++++----------------------
 1 file changed, 42 insertions(+), 31 deletions(-)

diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 55ff01d..e5cce2f 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -1137,6 +1137,34 @@ struct lnet_peer_ni *
 	return primary_nid;
 }
 
+bool
+lnet_is_discovery_disabled_locked(struct lnet_peer *lp)
+{
+	if (lnet_peer_discovery_disabled)
+		return true;
+
+	if (!(lp->lp_state & LNET_PEER_MULTI_RAIL) ||
+	    (lp->lp_state & LNET_PEER_NO_DISCOVERY)) {
+		return true;
+	}
+
+	return false;
+}
+
+/* Peer Discovery
+ */
+bool
+lnet_is_discovery_disabled(struct lnet_peer *lp)
+{
+	bool rc = false;
+
+	spin_lock(&lp->lp_lock);
+	rc = lnet_is_discovery_disabled_locked(lp);
+	spin_unlock(&lp->lp_lock);
+
+	return rc;
+}
+
 lnet_nid_t
 LNetPrimaryNID(lnet_nid_t nid)
 {
@@ -1153,11 +1181,16 @@ struct lnet_peer_ni *
 		goto out_unlock;
 	}
 	lp = lpni->lpni_peer_net->lpn_peer;
+
 	while (!lnet_peer_is_uptodate(lp)) {
 		rc = lnet_discover_peer_locked(lpni, cpt, true);
 		if (rc)
 			goto out_decref;
 		lp = lpni->lpni_peer_net->lpn_peer;
+
+		/* Only try once if discovery is disabled */
+		if (lnet_is_discovery_disabled(lp))
+			break;
 	}
 	primary_nid = lp->lp_primary_nid;
 out_decref:
@@ -1784,35 +1817,6 @@ struct lnet_peer_ni *
 }
 
 bool
-lnet_is_discovery_disabled_locked(struct lnet_peer *lp)
-{
-	if (lnet_peer_discovery_disabled)
-		return true;
-
-	if (!(lp->lp_state & LNET_PEER_MULTI_RAIL) ||
-	    (lp->lp_state & LNET_PEER_NO_DISCOVERY)) {
-		return true;
-	}
-
-	return false;
-}
-
-/*
- * Peer Discovery
- */
-bool
-lnet_is_discovery_disabled(struct lnet_peer *lp)
-{
-	bool rc = false;
-
-	spin_lock(&lp->lp_lock);
-	rc = lnet_is_discovery_disabled_locked(lp);
-	spin_unlock(&lp->lp_lock);
-
-	return rc;
-}
-
-bool
 lnet_peer_gw_discovery(struct lnet_peer *lp)
 {
 	bool rc = false;
@@ -2157,8 +2161,6 @@ static void lnet_peer_clear_discovery_error(struct lnet_peer *lp)
 			break;
 		lnet_peer_queue_for_discovery(lp);
 
-		if (lnet_is_discovery_disabled(lp))
-			break;
 		/*
 		 * if caller requested a non-blocking operation then
 		 * return immediately. Once discovery is complete then the
@@ -2176,6 +2178,15 @@ static void lnet_peer_clear_discovery_error(struct lnet_peer *lp)
 		lnet_peer_decref_locked(lp);
 		/* Peer may have changed */
 		lp = lpni->lpni_peer_net->lpn_peer;
+
+		/* Wait for discovery to complete, but don't repeat if
+		 * discovery is disabled. This is done to ensure we can
+		 * use discovery as a standard ping as well for backwards
+		 * compatibility with routers which do not have discovery
+		 * or have discovery disabled
+		 */
+		if (lnet_is_discovery_disabled(lp))
+			break;
 	}
 	finish_wait(&lp->lp_dc_waitq, &wait);
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 375/622] lustre: ldlm: Fix style issues for ldlm_lib.c
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (373 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 374/622] lnet: prevent loop in LNetPrimaryNID() James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 376/622] lustre: obdclass: protect imp_sec using rwlock_t James Simmons
                   ` (247 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Arshad Hussain <arshad.super@gmail.com>

This patch fixes issues reported by checkpatch for
file fs/lustre/ldlm/ldlm_lib.c

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: 939cdd034e7b ("LU-6142 ldlm: Fix style issues for ldlm_lib.c")
Signed-off-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-on: https://review.whamcloud.com/34495
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_lib.c | 35 +++++++++++++++++++++++------------
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/fs/lustre/ldlm/ldlm_lib.c b/fs/lustre/ldlm/ldlm_lib.c
index 4a982ab..af74f97 100644
--- a/fs/lustre/ldlm/ldlm_lib.c
+++ b/fs/lustre/ldlm/ldlm_lib.c
@@ -48,7 +48,8 @@
 #include <lustre_sec.h>
 #include "ldlm_internal.h"
 
-/* @priority: If non-zero, move the selected connection to the list head.
+/*
+ * @priority: If non-zero, move the selected connection to the list head.
  * @create: If zero, only search in existing connections.
  */
 static int import_set_conn(struct obd_import *imp, struct obd_uuid *uuid,
@@ -223,7 +224,8 @@ int client_import_find_conn(struct obd_import *imp, lnet_nid_t peer,
 
 void client_destroy_import(struct obd_import *imp)
 {
-	/* Drop security policy instance after all RPCs have finished/aborted
+	/*
+	 * Drop security policy instance after all RPCs have finished/aborted
 	 * to let all busy contexts be released.
 	 */
 	class_import_get(imp);
@@ -233,7 +235,8 @@ void client_destroy_import(struct obd_import *imp)
 }
 EXPORT_SYMBOL(client_destroy_import);
 
-/* Configure an RPC client OBD device.
+/*
+ * Configure an RPC client OBD device.
  *
  * lcfg parameters:
  * 1 - client UUID
@@ -255,7 +258,8 @@ int client_obd_setup(struct obd_device *obddev, struct lustre_cfg *lcfg)
 	};
 	int rc;
 
-	/* In a more perfect world, we would hang a ptlrpc_client off of
+	/*
+	 * In a more perfect world, we would hang a ptlrpc_client off of
 	 * obd_type and just use the values from there.
 	 */
 	if (!strcmp(name, LUSTRE_OSC_NAME)) {
@@ -630,7 +634,8 @@ int client_disconnect_export(struct obd_export *exp)
 		goto out_disconnect;
 	}
 
-	/* Mark import deactivated now, so we don't try to reconnect if any
+	/*
+	 * Mark import deactivated now, so we don't try to reconnect if any
 	 * of the cleanup RPCs fails (e.g. LDLM cancel, etc).  We don't
 	 * fully deactivate the import, or that would drop all requests.
 	 */
@@ -638,7 +643,8 @@ int client_disconnect_export(struct obd_export *exp)
 	imp->imp_deactive = 1;
 	spin_unlock(&imp->imp_lock);
 
-	/* Some non-replayable imports (MDS's OSCs) are pinged, so just
+	/*
+	 * Some non-replayable imports (MDS's OSCs) are pinged, so just
 	 * delete it regardless.  (It's safe to delete an import that was
 	 * never added.)
 	 */
@@ -652,7 +658,8 @@ int client_disconnect_export(struct obd_export *exp)
 					  obd->obd_force);
 	}
 
-	/* There's no need to hold sem while disconnecting an import,
+	/*
+	 * There's no need to hold sem while disconnecting an import,
 	 * and it may actually cause deadlock in GSS.
 	 */
 	up_write(&cli->cl_sem);
@@ -662,7 +669,8 @@ int client_disconnect_export(struct obd_export *exp)
 	ptlrpc_invalidate_import(imp);
 
 out_disconnect:
-	/* Use server style - class_disconnect should be always called for
+	/*
+	 * Use server style - class_disconnect should be always called for
 	 * o_disconnect.
 	 */
 	err = class_disconnect(exp);
@@ -680,9 +688,10 @@ int client_disconnect_export(struct obd_export *exp)
  */
 int target_pack_pool_reply(struct ptlrpc_request *req)
 {
-	struct obd_device *obd;
+	struct obd_device *obd;
 
-	/* Check that we still have all structures alive as this may
+	/*
+	 * Check that we still have all structures alive as this may
 	 * be some late RPC at shutdown time.
 	 */
 	if (unlikely(!req->rq_export || !req->rq_export->exp_obd ||
@@ -711,7 +720,8 @@ int target_pack_pool_reply(struct ptlrpc_request *req)
 		DEBUG_REQ(D_ERROR, req, "dropping reply");
 		return -ECOMM;
 	}
-	/* We can have a null rq_reqmsg in the event of bad signature or
+	/*
+	 * We can have a null rq_reqmsg in the event of bad signature or
 	 * no context when unwrapping
 	 */
 	if (req->rq_reqmsg &&
@@ -792,7 +802,8 @@ void target_send_reply(struct ptlrpc_request *req, int rc, int fail_id)
 	atomic_inc(&svcpt->scp_nreps_difficult);
 
 	if (netrc != 0) {
-		/* error sending: reply is off the net.  Also we need +1
+		/*
+		 * error sending: reply is off the net.  Also we need +1
 		 * reply ref until ptlrpc_handle_rs() is done
 		 * with the reply state (if the send was successful, there
 		 * would have been +1 ref for the net, which
-- 
1.8.3.1


* [lustre-devel] [PATCH 376/622] lustre: obdclass: protect imp_sec using rwlock_t
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (374 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 375/622] lustre: ldlm: Fix style issues for ldlm_lib.c James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 377/622] lustre: llite: console message for disabled flock call James Simmons
                   ` (246 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Li Dongyang <dongyangli@ddn.com>

We've seen spinlock contention on imp_lock in
sptlrpc_import_sec_ref(), so introduce a new rwlock,
imp_sec_lock, to protect imp_sec instead of using imp_lock.

This patch also removes imp_sec_mutex from obd_import,
which is not needed, to avoid confusion between
imp_sec_lock/mutex.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11775
Lustre-commit: 8ed361345154 ("LU-11775 obdclass: protect imp_sec using rwlock_t")
Signed-off-by: Li Dongyang <dongyangli@ddn.com>
Reviewed-on: https://review.whamcloud.com/33861
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_import.h |  2 +-
 fs/lustre/obdclass/genops.c       |  2 +-
 fs/lustre/ptlrpc/sec.c            | 15 ++++++---------
 fs/lustre/ptlrpc/sec_config.c     |  4 ++--
 4 files changed, 10 insertions(+), 13 deletions(-)

diff --git a/fs/lustre/include/lustre_import.h b/fs/lustre/include/lustre_import.h
index f16d621..ff171d1 100644
--- a/fs/lustre/include/lustre_import.h
+++ b/fs/lustre/include/lustre_import.h
@@ -206,7 +206,7 @@ struct obd_import {
 	 * @{
 	 */
 	struct ptlrpc_sec	       *imp_sec;
-	struct mutex			imp_sec_mutex;
+	rwlock_t			imp_sec_lock;
 	time64_t			imp_sec_expire;
 	/** @} */
 
diff --git a/fs/lustre/obdclass/genops.c b/fs/lustre/obdclass/genops.c
index fd9dd96..2b1175f 100644
--- a/fs/lustre/obdclass/genops.c
+++ b/fs/lustre/obdclass/genops.c
@@ -997,7 +997,7 @@ struct obd_import *class_new_import(struct obd_device *obd)
 	imp->imp_last_success_conn = 0;
 	imp->imp_state = LUSTRE_IMP_NEW;
 	imp->imp_obd = class_incref(obd, "import", imp);
-	mutex_init(&imp->imp_sec_mutex);
+	rwlock_init(&imp->imp_sec_lock);
 	init_waitqueue_head(&imp->imp_recovery_waitq);
 	INIT_WORK(&imp->imp_zombie_work, obd_zombie_imp_cull);
 
diff --git a/fs/lustre/ptlrpc/sec.c b/fs/lustre/ptlrpc/sec.c
index 789b5cb..d82809f 100644
--- a/fs/lustre/ptlrpc/sec.c
+++ b/fs/lustre/ptlrpc/sec.c
@@ -303,13 +303,13 @@ static int import_sec_check_expire(struct obd_import *imp)
 {
 	int adapt = 0;
 
-	spin_lock(&imp->imp_lock);
+	write_lock(&imp->imp_sec_lock);
 	if (imp->imp_sec_expire &&
 	    imp->imp_sec_expire < ktime_get_real_seconds()) {
 		adapt = 1;
 		imp->imp_sec_expire = 0;
 	}
-	spin_unlock(&imp->imp_lock);
+	write_unlock(&imp->imp_sec_lock);
 
 	if (!adapt)
 		return 0;
@@ -1317,9 +1317,9 @@ struct ptlrpc_sec *sptlrpc_import_sec_ref(struct obd_import *imp)
 {
 	struct ptlrpc_sec *sec;
 
-	spin_lock(&imp->imp_lock);
+	read_lock(&imp->imp_sec_lock);
 	sec = sptlrpc_sec_get(imp->imp_sec);
-	spin_unlock(&imp->imp_lock);
+	read_unlock(&imp->imp_sec_lock);
 
 	return sec;
 }
@@ -1332,10 +1332,10 @@ static void sptlrpc_import_sec_install(struct obd_import *imp,
 
 	LASSERT_ATOMIC_POS(&sec->ps_refcount);
 
-	spin_lock(&imp->imp_lock);
+	write_lock(&imp->imp_sec_lock);
 	old_sec = imp->imp_sec;
 	imp->imp_sec = sec;
-	spin_unlock(&imp->imp_lock);
+	write_unlock(&imp->imp_sec_lock);
 
 	if (old_sec) {
 		sptlrpc_sec_kill(old_sec);
@@ -1455,8 +1455,6 @@ int sptlrpc_import_sec_adapt(struct obd_import *imp,
 		       sptlrpc_flavor2name(&sf, str, sizeof(str)));
 	}
 
-	mutex_lock(&imp->imp_sec_mutex);
-
 	newsec = sptlrpc_sec_create(imp, svc_ctx, &sf, sp);
 	if (newsec) {
 		sptlrpc_import_sec_install(imp, newsec);
@@ -1467,7 +1465,6 @@ int sptlrpc_import_sec_adapt(struct obd_import *imp,
 		rc = -EPERM;
 	}
 
-	mutex_unlock(&imp->imp_sec_mutex);
 out:
 	sptlrpc_sec_put(sec);
 	return rc;
diff --git a/fs/lustre/ptlrpc/sec_config.c b/fs/lustre/ptlrpc/sec_config.c
index e4b1a075..9ced6c7 100644
--- a/fs/lustre/ptlrpc/sec_config.c
+++ b/fs/lustre/ptlrpc/sec_config.c
@@ -846,11 +846,11 @@ void sptlrpc_conf_client_adapt(struct obd_device *obd)
 
 	imp = obd->u.cli.cl_import;
 	if (imp) {
-		spin_lock(&imp->imp_lock);
+		write_lock(&imp->imp_sec_lock);
 		if (imp->imp_sec)
 			imp->imp_sec_expire = ktime_get_real_seconds() +
 				SEC_ADAPT_DELAY;
-		spin_unlock(&imp->imp_lock);
+		write_unlock(&imp->imp_sec_lock);
 	}
 
 	up_read(&obd->u.cli.cl_sem);
-- 
1.8.3.1


* [lustre-devel] [PATCH 377/622] lustre: llite: console message for disabled flock call
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (375 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 376/622] lustre: obdclass: protect imp_sec using rwlock_t James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 378/622] lustre: ptlrpc: Add increasing XIDs CONNECT2 flag James Simmons
                   ` (245 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Li Xi <lixi@ddn.com>

When the flock option is disabled on a Lustre client, any call to
flock() or lockf() fails. For applications that don't print a
proper error message, it is hard to know that the root cause is the
missing flock mount option on the Lustre file system. Thus this
patch prints the following error message to the tty of the process
that calls flock()/lockf():
the tty that calls flock()/lockf():

"Lustre: flock disabled, mount with '-o [local]flock' to enable"

Such message will print to each file descriptor no more than
once to avoid message flood.

In order to do so, this patch adds support for CDEBUG_LIMIT(D_TTY).
It prints the message to tty. When using this macro, please
note that "\r\n" needs to be the end of the line. Otherwise,
message like "format at $FILE:$LINO:$FUNC doesn't end in '\r\n'"
will be printed to the system message for warning.

Note that LL_FILE_RMTACL should have been removed by
Commit 341f1f0affed ("staging: lustre: remove remote client support")

WC-bug-id: https://jira.whamcloud.com/browse/LU-12349
Lustre-commit: f6497eb3503b ("LU-12349 llite: console message for disabled flock call")
Signed-off-by: Li Xi <lixi@ddn.com>
Reviewed-on: https://review.whamcloud.com/34986
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Yingjin Qian <qian@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c                  | 12 ++++++++++++
 include/uapi/linux/lnet/libcfs_debug.h  |  4 ++--
 include/uapi/linux/lustre/lustre_user.h |  2 +-
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 1ed4b14..76a5074 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -4275,6 +4275,18 @@ int ll_migrate(struct inode *parent, struct file *file, struct lmv_user_md *lum,
 static int
 ll_file_noflock(struct file *file, int cmd, struct file_lock *file_lock)
 {
+	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
+
+	/*
+	 * In order to avoid flood of warning messages, only print one message
+	 * for one file. And the entire message rate on the client is limited
+	 * by CDEBUG_LIMIT too.
+	 */
+	if (!(fd->fd_flags & LL_FILE_FLOCK_WARNING)) {
+		fd->fd_flags |= LL_FILE_FLOCK_WARNING;
+		CDEBUG_LIMIT(D_TTY | D_CONSOLE,
+			     "flock disabled, mount with '-o [local]flock' to enable\r\n");
+	}
 	return -EINVAL;
 }
 
diff --git a/include/uapi/linux/lnet/libcfs_debug.h b/include/uapi/linux/lnet/libcfs_debug.h
index 1a68667..6255331 100644
--- a/include/uapi/linux/lnet/libcfs_debug.h
+++ b/include/uapi/linux/lnet/libcfs_debug.h
@@ -106,7 +106,7 @@ struct ptldebug_header {
 #define D_TRACE		0x00000001 /* ENTRY/EXIT markers */
 #define D_INODE		0x00000002
 #define D_SUPER		0x00000004
-#define D_EXT2		0x00000008 /* anything from ext2_debug */
+#define D_TTY		0x00000008 /* notification printed to TTY */
 #define D_MALLOC	0x00000010 /* print malloc, free information */
 #define D_CACHE		0x00000020 /* cache-related items */
 #define D_INFO		0x00000040 /* general information */
@@ -137,7 +137,7 @@ struct ptldebug_header {
 #define D_LAYOUT	0x80000000
 
 #define LIBCFS_DEBUG_MASKS_NAMES {					\
-	"trace", "inode", "super", "ext2", "malloc", "cache", "info",	\
+	"trace", "inode", "super", "tty", "malloc", "cache", "info",	\
 	"ioctl", "neterror", "net", "warning", "buffs", "other",	\
 	"dentry", "nettrace", "page", "dlmtrace", "error", "emerg",	\
 	"ha", "rpctrace", "vfstrace", "reada", "mmap", "config",	\
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 317b236..d43170f 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -385,7 +385,7 @@ struct ll_ioc_lease_id {
 #define LL_FILE_READAHEA	0x00000004
 #define LL_FILE_LOCKED_DIRECTIO 0x00000008 /* client-side locks with dio */
 #define LL_FILE_LOCKLESS_IO	0x00000010 /* server-side locks with cio */
-#define LL_FILE_RMTACL		0x00000020
+#define LL_FILE_FLOCK_WARNING	0x00000020 /* warned about disabled flock */
 
 #define LOV_USER_MAGIC_V1	0x0BD10BD0
 #define LOV_USER_MAGIC		LOV_USER_MAGIC_V1
-- 
1.8.3.1


* [lustre-devel] [PATCH 378/622] lustre: ptlrpc: Add increasing XIDs CONNECT2 flag
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (376 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 377/622] lustre: llite: console message for disabled flock call James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 379/622] lustre: ptlrpc: don't reset lru_resize on idle reconnect James Simmons
                   ` (244 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Andriy Skulysh <c17819@cray.com>

This patch reserves the OBD_CONNECT2 flag
for increasing XIDs.

Cray-bug-id: LUS-6272
WC-bug-id: https://jira.whamcloud.com/browse/LU-11444
Lustre-commit: b4375f5fc66c ("LU-11444 ptlrpc: Add increasing XIDs CONNECT2 flag")
Signed-off-by: Andriy Skulysh <c17819@cray.com>
Reviewed-on: https://review.whamcloud.com/35113
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/lprocfs_status.c    | 2 +-
 fs/lustre/ptlrpc/wiretest.c            | 2 ++
 include/uapi/linux/lustre/lustre_idl.h | 1 +
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/obdclass/lprocfs_status.c b/fs/lustre/obdclass/lprocfs_status.c
index c244adb..ca169ec 100644
--- a/fs/lustre/obdclass/lprocfs_status.c
+++ b/fs/lustre/obdclass/lprocfs_status.c
@@ -120,7 +120,7 @@
 	"wbc",		/* 0x40 */
 	"lock_convert",	/* 0x80 */
 	"archive_id_array",	/* 0x100 */
-	"unknown",		/* 0x200 */
+	"increasing_xid",	/* 0x200 */
 	"selinux_policy",	/* 0x400 */
 	"lsom",			/* 0x800 */
 	"pcc",			/* 0x1000 */
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index 64ccc6e..e801f2c 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1148,6 +1148,8 @@ void lustre_assert_wire_constants(void)
 		 OBD_CONNECT2_LOCK_CONVERT);
 	LASSERTF(OBD_CONNECT2_ARCHIVE_ID_ARRAY == 0x100ULL, "found 0x%.16llxULL\n",
 		 OBD_CONNECT2_ARCHIVE_ID_ARRAY);
+	LASSERTF(OBD_CONNECT2_INC_XID == 0x200ULL, "found 0x%.16llxULL\n",
+		 OBD_CONNECT2_INC_XID);
 	LASSERTF(OBD_CONNECT2_SELINUX_POLICY == 0x400ULL, "found 0x%.16llxULL\n",
 		 OBD_CONNECT2_SELINUX_POLICY);
 	LASSERTF(OBD_CONNECT2_LSOM == 0x800ULL, "found 0x%.16llxULL\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 2e54dd1..c86b188 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -806,6 +806,7 @@ struct ptlrpc_body_v2 {
 						 */
 #define OBD_CONNECT2_LOCK_CONVERT	0x80ULL /* IBITS lock convert support */
 #define OBD_CONNECT2_ARCHIVE_ID_ARRAY  0x100ULL	/* store HSM archive_id in array */
+#define OBD_CONNECT2_INC_XID	       0x200ULL /* Increasing xid */
 #define OBD_CONNECT2_SELINUX_POLICY    0x400ULL	/* has client SELinux policy */
 #define OBD_CONNECT2_LSOM	       0x800ULL	/* LSOM support */
 #define OBD_CONNECT2_PCC	       0x1000ULL /* Persistent Client Cache */
-- 
1.8.3.1


* [lustre-devel] [PATCH 379/622] lustre: ptlrpc: don't reset lru_resize on idle reconnect
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (377 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 378/622] lustre: ptlrpc: Add increasing XIDs CONNECT2 flag James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 380/622] lnet: use after free in lnet_discover_peer_locked() James Simmons
                   ` (243 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Andriy Skulysh <c17819@cray.com>

ptlrpc_disconnect_idle_interpret() clears imp_remote_handle,
so reconnect has pcaa_initial_connect set to 1.

Update only changed ns_connect_flags bits.

Fixes: 4b102da53ad ("lustre: ptlrpc: idle connections can disconnect")
Cray-bug-id: LUS-7471
WC-bug-id: https://jira.whamcloud.com/browse/LU-11518
Lustre-commit: acacc9d9b1d0 ("LU-11518 ptlrpc: don't reset lru_resize on idle reconnect")
Signed-off-by: Andriy Skulysh <c17819@cray.com>
Reviewed-by: Alexander Boyko <c17825@cray.com>
Reviewed-by: Andrew Perepechko <c17827@cray.com>
Reviewed-on: https://review.whamcloud.com/35285
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Gu Zheng <gzheng@ddn.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/import.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index 6f13ec1..f8e15f2 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -858,13 +858,17 @@ static int ptlrpc_connect_set_flags(struct obd_import *imp,
 	 * disable lru_resize, etc.
 	 */
 	if (old_connect_flags != exp_connect_flags(exp) || init_connect) {
+		struct ldlm_namespace *ns = imp->imp_obd->obd_namespace;
+		u64 changed_flags;
+
+		changed_flags =
+			ns->ns_connect_flags ^ ns->ns_orig_connect_flags;
 		CDEBUG(D_HA,
 		       "%s: Resetting ns_connect_flags to server flags: %#llx\n",
 		       imp->imp_obd->obd_name, ocd->ocd_connect_flags);
-		imp->imp_obd->obd_namespace->ns_connect_flags =
-			ocd->ocd_connect_flags;
-		imp->imp_obd->obd_namespace->ns_orig_connect_flags =
-			ocd->ocd_connect_flags;
+		ns->ns_connect_flags = (ns->ns_connect_flags & changed_flags) |
+				      (ocd->ocd_connect_flags & ~changed_flags);
+		ns->ns_orig_connect_flags = ocd->ocd_connect_flags;
 	}
 
 	if (ocd->ocd_connect_flags & OBD_CONNECT_AT)
-- 
1.8.3.1


* [lustre-devel] [PATCH 380/622] lnet: use after free in lnet_discover_peer_locked()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (378 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 379/622] lustre: ptlrpc: don't reset lru_resize on idle reconnect James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 381/622] lustre: obdclass: generate random u64 max correctly James Simmons
                   ` (242 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Olaf Weber <olaf.weber@hpe.com>

When the lnet_net_lock is unlocked, the peer attached to an
lnet_peer_ni (found via lnet_peer_ni::lpni_peer_net->lpn_peer)
can change, and the old peer deallocated. If we are really
unlucky, then all the churn could give us a new, different,
peer at the same address in memory.

Change the reference counting on the lnet_peer lp so that it
is guaranteed to be alive when we relock the lnet_net_lock for
the cpt. When the reference count is dropped lp may go away if
it was unlinked, but the new peer is guaranteed to have a
different address, so we can still correctly determine whether
the peer changed and discovery should be redone.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9971
Lustre-commit: 2b5b551b15d9 ("LU-9971 lnet: use after free in lnet_discover_peer_locked()")
Signed-off-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-on: https://review.whamcloud.com/28944
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/peer.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index e5cce2f..d167a37 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -2150,6 +2150,8 @@ static void lnet_peer_clear_discovery_error(struct lnet_peer *lp)
 	 * zombie if we race with DLC, so we must check for that.
 	 */
 	for (;;) {
+		/* Keep lp alive when the lnet_net_lock is unlocked */
+		lnet_peer_addref_locked(lp);
 		prepare_to_wait(&lp->lp_dc_waitq, &wait, TASK_INTERRUPTIBLE);
 		if (signal_pending(current))
 			break;
@@ -2161,16 +2163,14 @@ static void lnet_peer_clear_discovery_error(struct lnet_peer *lp)
 			break;
 		lnet_peer_queue_for_discovery(lp);
 
-		/*
-		 * if caller requested a non-blocking operation then
-		 * return immediately. Once discovery is complete then the
-		 * peer ref will be decremented and any pending messages
-		 * that were stopped due to discovery will be transmitted.
+		/* If caller requested a non-blocking operation then
+		 * return immediately. Once discovery is complete any
+		 * pending messages that were stopped due to discovery
+		 * will be transmitted.
 		 */
 		if (!block)
 			break;
 
-		lnet_peer_addref_locked(lp);
 		lnet_net_unlock(LNET_LOCK_EX);
 		schedule();
 		finish_wait(&lp->lp_dc_waitq, &wait);
@@ -2192,10 +2192,12 @@ static void lnet_peer_clear_discovery_error(struct lnet_peer *lp)
 
 	lnet_net_unlock(LNET_LOCK_EX);
 	lnet_net_lock(cpt);
-
-	/* If the peer has changed after we've discovered the older peer,
-	 * then we need to discovery the new peer to make sure the
-	 * interface information is up to date
+	lnet_peer_decref_locked(lp);
+	/* The peer may have changed, so re-check and rediscover if that turns
+	 * out to have been the case. The reference count on lp ensured that
+	 * even if it was unlinked from lpni the memory could not be recycled.
+	 * Thus the check below is sufficient to determine whether the peer
+	 * changed. If the peer changed, then lp must not be dereferenced.
 	 */
 	if (lp != lpni->lpni_peer_net->lpn_peer)
 		goto again;
-- 
1.8.3.1


* [lustre-devel] [PATCH 381/622] lustre: obdclass: generate random u64 max correctly
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (379 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 380/622] lnet: use after free in lnet_discover_peer_locked() James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 382/622] lnet: fix peer ref counting James Simmons
                   ` (241 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

Generate a pseudo-random number below a u64 upper bound correctly,
and move the logic into an obdclass helper, lu_prandom_u64_max().

Fixes: bcfa98a507 ("staging: lustre: replace cfs_rand() with prandom_u32_max()")

WC-bug-id: https://jira.whamcloud.com/browse/LU-12495
Lustre-commit: 645b72c5c058 ("LU-12495 obdclass: generate random u64 max correctly")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35394
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lu_object.h |  1 +
 fs/lustre/lmv/lmv_qos.c       | 26 +-------------------------
 fs/lustre/obdclass/lu_qos.c   | 36 ++++++++++++++++++++++++++++++++++++
 3 files changed, 38 insertions(+), 25 deletions(-)

diff --git a/fs/lustre/include/lu_object.h b/fs/lustre/include/lu_object.h
index 0f3e3be..6b1064a 100644
--- a/fs/lustre/include/lu_object.h
+++ b/fs/lustre/include/lu_object.h
@@ -1390,6 +1390,7 @@ struct lu_qos {
 
 int lqos_add_tgt(struct lu_qos *qos, struct lu_tgt_desc *ltd);
 int lqos_del_tgt(struct lu_qos *qos, struct lu_tgt_desc *ltd);
+u64 lu_prandom_u64_max(u64 ep_ro);
 
 /** @} lu */
 #endif /* __LUSTRE_LU_OBJECT_H */
diff --git a/fs/lustre/lmv/lmv_qos.c b/fs/lustre/lmv/lmv_qos.c
index e323398..85053d2e 100644
--- a/fs/lustre/lmv/lmv_qos.c
+++ b/fs/lustre/lmv/lmv_qos.c
@@ -370,31 +370,7 @@ struct lu_tgt_desc *lmv_locate_tgt_qos(struct lmv_obd *lmv, u32 *mdt)
 		total_weight += tgt->ltd_qos.ltq_weight;
 	}
 
-	if (total_weight) {
-#if BITS_PER_LONG == 32
-		/*
-		 * If total_weight > 32-bit, first generate the high
-		 * 32 bits of the random number, then add in the low
-		 * 32 bits (truncated to the upper limit, if needed)
-		 */
-		if (total_weight > 0xffffffffULL)
-			rand = (u64)(prandom_u32_max(
-				(unsigned int)(total_weight >> 32)) << 32;
-		else
-			rand = 0;
-
-		if (rand == (total_weight & 0xffffffff00000000ULL))
-			rand |= prandom_u32_max((unsigned int)total_weight);
-		else
-			rand |= prandom_u32();
-
-#else
-		rand = ((u64)prandom_u32() << 32 | prandom_u32()) %
-			total_weight;
-#endif
-	} else {
-		rand = 0;
-	}
+	rand = lu_prandom_u64_max(total_weight);
 
 	for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
 		tgt = lmv->tgts[i];
diff --git a/fs/lustre/obdclass/lu_qos.c b/fs/lustre/obdclass/lu_qos.c
index 4ee3f59..9fdcbc2 100644
--- a/fs/lustre/obdclass/lu_qos.c
+++ b/fs/lustre/obdclass/lu_qos.c
@@ -35,6 +35,7 @@
 
 #include <linux/module.h>
 #include <linux/list.h>
+#include <linux/random.h>
 #include <obd_class.h>
 #include <obd_support.h>
 #include <lustre_disk.h>
@@ -164,3 +165,38 @@ int lqos_del_tgt(struct lu_qos *qos, struct lu_tgt_desc *ltd)
 	return rc;
 }
 EXPORT_SYMBOL(lqos_del_tgt);
+
+/**
+ * lu_prandom_u64_max - returns a pseudo-random u64 number in interval
+ * [0, ep_ro)
+ *
+ * #ep_ro	right open interval endpoint
+ *
+ * Return:	a pseudo-random 64-bit number that is in interval [0, ep_ro).
+ */
+u64 lu_prandom_u64_max(u64 ep_ro)
+{
+	u64 rand = 0;
+
+	if (ep_ro) {
+#if BITS_PER_LONG == 32
+		/*
+		 * If ep_ro > 32-bit, first generate the high
+		 * 32 bits of the random number, then add in the low
+		 * 32 bits (truncated to the upper limit, if needed)
+		 */
+		if (ep_ro > 0xffffffffULL)
+			rand = prandom_u32_max((u32)(ep_ro >> 32)) << 32;
+
+		if (rand == (ep_ro & 0xffffffff00000000ULL))
+			rand |= prandom_u32_max((u32)ep_ro);
+		else
+			rand |= prandom_u32();
+#else
+		rand = ((u64)prandom_u32() << 32 | prandom_u32()) % ep_ro;
+#endif
+	}
+
+	return rand;
+}
+EXPORT_SYMBOL(lu_prandom_u64_max);
-- 
1.8.3.1


* [lustre-devel] [PATCH 382/622] lnet: fix peer ref counting
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (380 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 381/622] lustre: obdclass: generate random u64 max correctly James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 383/622] lustre: llite: collect debug info for ll_fsync James Simmons
                   ` (240 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Exit the loop after the peer ref count has been incremented, to
avoid an incorrect ref count.

The code makes sure that a peer is queued for discovery at most
once when discovery is disabled. This lets discovery serve as a
standard ping for gateways which lack the discovery feature or
have it disabled.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9971
Lustre-commit: dbcddb4824f0 ("LU-9971 lnet: fix peer ref counting")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35446
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/peer.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index d167a37..e33dc0e 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -2138,6 +2138,7 @@ static void lnet_peer_clear_discovery_error(struct lnet_peer *lp)
 	DEFINE_WAIT(wait);
 	struct lnet_peer *lp;
 	int rc = 0;
+	int count = 0;
 
 again:
 	lnet_net_unlock(cpt);
@@ -2157,11 +2158,20 @@ static void lnet_peer_clear_discovery_error(struct lnet_peer *lp)
 			break;
 		if (the_lnet.ln_dc_state != LNET_DC_STATE_RUNNING)
 			break;
+		/* Don't repeat discovery if discovery is disabled. This is
+		 * done to ensure we can use discovery as a standard ping as
+		 * well for backwards compatibility with routers which do not
+		 * have discovery or have discovery disabled
+		 */
+		if (lnet_is_discovery_disabled(lp) && count > 0)
+			break;
 		if (lp->lp_dc_error)
 			break;
 		if (lnet_peer_is_uptodate(lp))
 			break;
 		lnet_peer_queue_for_discovery(lp);
+		count++;
+		CDEBUG(D_NET, "Discovery attempt # %d\n", count);
 
 		/* If caller requested a non-blocking operation then
 		 * return immediately. Once discovery is complete any
@@ -2178,15 +2188,6 @@ static void lnet_peer_clear_discovery_error(struct lnet_peer *lp)
 		lnet_peer_decref_locked(lp);
 		/* Peer may have changed */
 		lp = lpni->lpni_peer_net->lpn_peer;
-
-		/* Wait for discovery to complete, but don't repeat if
-		 * discovery is disabled. This is done to ensure we can
-		 * use discovery as a standard ping as well for backwards
-		 * compatibility with routers which do not have discovery
-		 * or have discovery disabled
-		 */
-		if (lnet_is_discovery_disabled(lp))
-			break;
 	}
 	finish_wait(&lp->lp_dc_waitq, &wait);
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 383/622] lustre: llite: collect debug info for ll_fsync
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (381 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 382/622] lnet: fix peer ref counting James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 384/622] lustre: obdclass: use RCU to release lu_env_item James Simmons
                   ` (239 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

Improve ll_fsync() debug message to capture all the arguments of
the current fsync.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12462
Lustre-commit: 4cb6ce1863d0 ("LU-12462 llite: Remove old fsync versions")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35339
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 76a5074..a20896c 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -3907,8 +3907,10 @@ int ll_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 	struct ptlrpc_request *req;
 	int rc, err;
 
-	CDEBUG(D_VFSTRACE, "VFS Op:inode=" DFID "(%p)\n",
-	       PFID(ll_inode2fid(inode)), inode);
+	CDEBUG(D_VFSTRACE,
+	       "VFS Op:inode=" DFID "(%p), start %lld, end %lld, datasync %d\n",
+	       PFID(ll_inode2fid(inode)), inode, start, end, datasync);
+
 	ll_stats_ops_tally(ll_i2sbi(inode), LPROC_LL_FSYNC, 1);
 
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 384/622] lustre: obdclass: use RCU to release lu_env_item
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (382 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 383/622] lustre: llite: collect debug info for ll_fsync James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 385/622] lustre: mdt: improve IBITS lock definitions James Simmons
                   ` (238 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Alex Zhuravlev <bzzz@whamcloud.com>

Use RCU to release lu_env_item, as rhashtable_lookup_fast() is
lockless and can find just-released objects.

Fixes: c678ad5a25 ("lustre: obdclass: put all service's env on the list")
WC-bug-id: https://jira.whamcloud.com/browse/LU-12491
Lustre-commit: 87306c22e4b9 ("LU-12491 obdclass: use RCU to release lu_env_item")
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35038
Reviewed-by: Neil Brown <neilb@suse.com>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/lu_object.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/obdclass/lu_object.c b/fs/lustre/obdclass/lu_object.c
index bafd817..c94911d 100644
--- a/fs/lustre/obdclass/lu_object.c
+++ b/fs/lustre/obdclass/lu_object.c
@@ -1870,6 +1870,7 @@ struct lu_env_item {
 	struct task_struct	*lei_task;	/* rhashtable key */
 	struct rhash_head	lei_linkage;
 	struct lu_env		*lei_env;
+	struct rcu_head		lei_rcu_head;
 };
 
 static const struct rhashtable_params lu_env_rhash_params = {
@@ -1909,6 +1910,14 @@ int lu_env_add(struct lu_env *env)
 }
 EXPORT_SYMBOL(lu_env_add);
 
+static void lu_env_item_free(struct rcu_head *head)
+{
+	struct lu_env_item *lei;
+
+	lei = container_of(head, struct lu_env_item, lei_rcu_head);
+	kfree(lei);
+}
+
 void lu_env_remove(struct lu_env *env)
 {
 	struct lu_env_item *lei;
@@ -1923,13 +1932,11 @@ void lu_env_remove(struct lu_env *env)
 		}
 	}
 
-	rcu_read_lock();
 	lei = rhashtable_lookup_fast(&lu_env_rhash, &task,
 				     lu_env_rhash_params);
 	if (lei && rhashtable_remove_fast(&lu_env_rhash, &lei->lei_linkage,
 					  lu_env_rhash_params) == 0)
-		kfree(lei);
-	rcu_read_unlock();
+		call_rcu(&lei->lei_rcu_head, lu_env_item_free);
 }
 EXPORT_SYMBOL(lu_env_remove);
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 385/622] lustre: mdt: improve IBITS lock definitions
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (383 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 384/622] lustre: obdclass: use RCU to release lu_env_item James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 386/622] lustre: uapi: change "space" hash type to hash flag James Simmons
                   ` (237 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Move MDS_INODELOCK_* flags into a named enum, and add the definitions
for the newer flags into wirecheck/wiretest to ensure consistency.

Rename MDS_INODELOCK_MAXSHIFT to MDS_INODELOCK_NUMBITS so it holds the
current number of lock bits, rather than one less than that number,
since the only two places that use it expect it to be one larger than
it was.  Fix uses of MDS_INODELOCK_NUMBITS to treat it as the number of
lock bits.  This does not change the value of MDS_INODELOCK_FULL, which
is used in the protocol to exchange supported lock bits between client
and server.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11285
Lustre-commit: 3611352b699c ("LU-11285 mdt: improve IBITS lock definitions")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35045
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c                 |  2 +-
 fs/lustre/ptlrpc/wiretest.c            |  6 ++++
 include/uapi/linux/lustre/lustre_idl.h | 51 +++++++++++++++++-----------------
 3 files changed, 32 insertions(+), 27 deletions(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index a20896c..d313730 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -4323,7 +4323,7 @@ int ll_have_md_lock(struct inode *inode, u64 *bits,
 	       ldlm_lockname[mode]);
 
 	flags = LDLM_FL_BLOCK_GRANTED | LDLM_FL_CBPENDING | LDLM_FL_TEST_LOCK;
-	for (i = 0; i <= MDS_INODELOCK_MAXSHIFT && *bits != 0; i++) {
+	for (i = 0; i < MDS_INODELOCK_NUMBITS && *bits != 0; i++) {
 		policy.l_inodebits.bits = *bits & (1 << i);
 		if (policy.l_inodebits.bits == 0)
 			continue;
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index e801f2c..adc71ff 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -2185,6 +2185,12 @@ void lustre_assert_wire_constants(void)
 		 MDS_INODELOCK_OPEN);
 	LASSERTF(MDS_INODELOCK_LAYOUT == 0x000008, "found 0x%.8x\n",
 		 MDS_INODELOCK_LAYOUT);
+	LASSERTF(MDS_INODELOCK_PERM == 0x000010, "found 0x%.8x\n",
+		MDS_INODELOCK_PERM);
+	LASSERTF(MDS_INODELOCK_XATTR == 0x000020, "found 0x%.8x\n",
+		MDS_INODELOCK_XATTR);
+	LASSERTF(MDS_INODELOCK_DOM == 0x000040, "found 0x%.8x\n",
+		MDS_INODELOCK_DOM);
 
 	/* Checks for struct mdt_ioepoch */
 	LASSERTF((int)sizeof(struct mdt_ioepoch) == 24, "found %lld\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index c86b188..5acf781 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -1482,33 +1482,32 @@ enum mdt_reint_cmd {
 #define DISP_OPEN_DENY		0x10000000
 
 /* INODE LOCK PARTS */
-#define MDS_INODELOCK_LOOKUP	0x000001	/*
-						 * For namespace, dentry etc, and
-						 * also was used to protect
-						 * permission (mode, owner, group
-						 * etc) before 2.4.
-						 */
-#define MDS_INODELOCK_UPDATE	0x000002	/* size, links, timestamps */
-#define MDS_INODELOCK_OPEN	0x000004	/* For opened files */
-#define MDS_INODELOCK_LAYOUT	0x000008	/* for layout */
-
-/* The PERM bit is added int 2.4, and it is used to protect permission(mode,
- * owner, group, acl etc), so to separate the permission from LOOKUP lock.
- * Because for remote directories(in DNE), these locks will be granted by
- * different MDTs(different ldlm namespace).
- *
- * For local directory, MDT will always grant UPDATE_LOCK|PERM_LOCK together.
- * For Remote directory, the master MDT, where the remote directory is, will
- * grant UPDATE_LOCK|PERM_LOCK, and the remote MDT, where the name entry is,
- * will grant LOOKUP_LOCK.
- */
-#define MDS_INODELOCK_PERM	0x000010
-#define MDS_INODELOCK_XATTR	0x000020	/* extended attributes */
-#define MDS_INODELOCK_DOM    0x000040 /* Data for data-on-mdt files */
-
-#define MDS_INODELOCK_MAXSHIFT 6
+enum mds_ibits_locks {
+	MDS_INODELOCK_LOOKUP	= 0x000001, /* For namespace, dentry etc.  Was
+					     * used to protect permission (mode,
+					     * owner, group, etc) before 2.4.
+					     */
+	MDS_INODELOCK_UPDATE	= 0x000002, /* size, links, timestamps */
+	MDS_INODELOCK_OPEN	= 0x000004, /* For opened files */
+	MDS_INODELOCK_LAYOUT	= 0x000008, /* for layout */
+
+	/* The PERM bit is added in 2.4, and is used to protect permission
+	 * (mode, owner, group, ACL, etc.) separate from LOOKUP lock.
+	 * For remote directories (in DNE) these locks will be granted by
+	 * different MDTs (different LDLM namespace).
+	 *
+	 * For local directory, the MDT always grants UPDATE|PERM together.
+	 * For remote directory, master MDT (where remote directory is) grants
+	 * UPDATE|PERM, and remote MDT (where name entry is) grants LOOKUP_LOCK.
+	 */
+	MDS_INODELOCK_PERM	= 0x000010,
+	MDS_INODELOCK_XATTR	= 0x000020, /* non-permission extended attrs */
+	MDS_INODELOCK_DOM	= 0x000040, /* Data for Data-on-MDT files */
+	/* Do not forget to increase MDS_INODELOCK_NUMBITS when adding bits */
+};
+#define MDS_INODELOCK_NUMBITS 7
 /* This FULL lock is useful to take on unlink sort of operations */
-#define MDS_INODELOCK_FULL ((1 << (MDS_INODELOCK_MAXSHIFT + 1)) - 1)
+#define MDS_INODELOCK_FULL ((1 << MDS_INODELOCK_NUMBITS) - 1)
 /* DOM lock shouldn't be canceled early, use this macro for ELC */
 #define MDS_INODELOCK_ELC (MDS_INODELOCK_FULL & ~MDS_INODELOCK_DOM)
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 386/622] lustre: uapi: change "space" hash type to hash flag
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (384 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 385/622] lustre: mdt: improve IBITS lock definitions James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 387/622] lustre: osc: cancel osc_lock list traversal once found the lock is being used James Simmons
                   ` (236 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

Change LMV_HASH_TYPE_SPACE to LMV_HASH_FLAG_SPACE to allow more
flexibility in directory layout inheritance in the future. It is
still exposed to users as the "space" hash type in the
"lfs setdirstripe" command to keep it easy to understand.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11213
Lustre-commit: c605ef1dbeb4 ("LU-11213 uapi: change "space" hash type to hash flag")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35318
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_lmv.h          |  5 ++---
 fs/lustre/lmv/lmv_obd.c                 |  4 ++--
 fs/lustre/ptlrpc/wiretest.c             |  2 +-
 include/uapi/linux/lustre/lustre_idl.h  | 10 ----------
 include/uapi/linux/lustre/lustre_user.h | 35 ++++++++++++++++++++++++++-------
 5 files changed, 33 insertions(+), 23 deletions(-)

diff --git a/fs/lustre/include/lustre_lmv.h b/fs/lustre/include/lustre_lmv.h
index bb1efb4..b33a6ed 100644
--- a/fs/lustre/include/lustre_lmv.h
+++ b/fs/lustre/include/lustre_lmv.h
@@ -55,7 +55,6 @@ struct lmv_stripe_md {
 	struct lmv_oinfo lsm_md_oinfo[0];
 };
 
-/* NB: LMV_HASH_TYPE_SPACE is set in default LMV only */
 static inline bool lmv_is_known_hash_type(u32 type)
 {
 	return (type & LMV_HASH_TYPE_MASK) == LMV_HASH_TYPE_FNV_1A_64 ||
@@ -91,9 +90,9 @@ static inline bool lmv_dir_bad_hash(const struct lmv_stripe_md *lsm)
 }
 
 /* NB, this is checking directory default LMV */
-static inline bool lmv_dir_space_hashed(const struct lmv_stripe_md *lsm)
+static inline bool lmv_dir_qos_mkdir(const struct lmv_stripe_md *lsm)
 {
-	return lsm && lsm->lsm_md_hash_type == LMV_HASH_TYPE_SPACE;
+	return lsm && (lsm->lsm_md_hash_type & LMV_HASH_FLAG_SPACE);
 }
 
 static inline bool
diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index bd64ebc..ae799db 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -1187,7 +1187,7 @@ static u32 lmv_placement_policy(struct obd_device *obd,
 		mdt = le32_to_cpu(lum->lum_stripe_offset);
 	} else if (op_data->op_code == LUSTRE_OPC_MKDIR &&
 		   !lmv_dir_striped(op_data->op_mea1) &&
-		   lmv_dir_space_hashed(op_data->op_default_mea1)) {
+		   lmv_dir_qos_mkdir(op_data->op_default_mea1)) {
 		mdt = op_data->op_mds;
 	} else if (op_data->op_code == LUSTRE_OPC_MKDIR &&
 		   op_data->op_default_mea1 &&
@@ -1716,7 +1716,7 @@ struct lmv_tgt_desc *
 		op_data->op_mds = oinfo->lmo_mds;
 		tgt = lmv_get_target(lmv, oinfo->lmo_mds, NULL);
 	} else if (op_data->op_code == LUSTRE_OPC_MKDIR &&
-		   lmv_dir_space_hashed(op_data->op_default_mea1) &&
+		   lmv_dir_qos_mkdir(op_data->op_default_mea1) &&
 		   !lmv_dir_striped(lsm)) {
 		tgt = lmv_locate_tgt_qos(lmv, &op_data->op_mds);
 		if (tgt == ERR_PTR(-EAGAIN))
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index adc71ff..1d34b15 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1661,8 +1661,8 @@ void lustre_assert_wire_constants(void)
 	BUILD_BUG_ON(LMV_MAGIC_V1 != 0x0CD20CD0);
 	BUILD_BUG_ON(LMV_MAGIC_STRIPE != 0x0CD40CD0);
 	BUILD_BUG_ON(LMV_HASH_TYPE_MASK != 0x0000ffff);
+	BUILD_BUG_ON(LMV_HASH_FLAG_SPACE != 0x08000000);
 	BUILD_BUG_ON(LMV_HASH_FLAG_MIGRATION != 0x80000000);
-	BUILD_BUG_ON(LMV_HASH_FLAG_DEAD != 0x40000000);
 
 	/* Checks for struct obd_statfs */
 	LASSERTF((int)sizeof(struct obd_statfs) == 144, "found %lld\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 5acf781..5740d42 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -2001,16 +2001,6 @@ struct lmv_foreign_md {
 #define LMV_MAGIC_STRIPE 0x0CD40CD0	/* magic for dir sub_stripe */
 #define LMV_MAGIC_FOREIGN 0x0CD50CD0	/* magic for lmv foreign */
 
-/*
- *Right now only the lower part(0-16bits) of lmv_hash_type is being used,
- * and the higher part will be the flag to indicate the status of object,
- * for example the object is being migrated. And the hash function
- * might be interpreted differently with different flags.
- */
-#define LMV_HASH_TYPE_MASK		0x0000ffff
-
-#define LMV_HASH_FLAG_MIGRATION		0x80000000
-#define LMV_HASH_FLAG_DEAD		0x40000000
 
 /**
  * The FNV-1a hash algorithm is as follows:
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index d43170f..86f3111 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -655,16 +655,37 @@ enum lmv_hash_type {
 	LMV_HASH_TYPE_UNKNOWN	= 0,	/* 0 is reserved for testing purpose */
 	LMV_HASH_TYPE_ALL_CHARS = 1,
 	LMV_HASH_TYPE_FNV_1A_64 = 2,
-	LMV_HASH_TYPE_SPACE	= 3,	/*
-					 * distribute subdirs among all MDTs
-					 * with balanced space usage.
-					 */
 	LMV_HASH_TYPE_MAX,
 };
 
-#define LMV_HASH_NAME_ALL_CHARS		"all_char"
-#define LMV_HASH_NAME_FNV_1A_64		"fnv_1a_64"
-#define LMV_HASH_NAME_SPACE		"space"
+#define LMV_HASH_TYPE_DEFAULT LMV_HASH_TYPE_FNV_1A_64
+
+#define LMV_HASH_NAME_ALL_CHARS	"all_char"
+#define LMV_HASH_NAME_FNV_1A_64	"fnv_1a_64"
+
+/* not real hash type, but exposed to user as "space" hash type */
+#define LMV_HASH_NAME_SPACE	"space"
+
+/* Right now only the lower part(0-16bits) of lmv_hash_type is being used,
+ * and the higher part will be the flag to indicate the status of object,
+ * for example the object is being migrated. And the hash function
+ * might be interpreted differently with different flags.
+ */
+#define LMV_HASH_TYPE_MASK		0x0000ffff
+
+/* once this is set on a plain directory default layout, newly created
+ * subdirectories will be distributed on all MDTs by space usage.
+ */
+#define LMV_HASH_FLAG_SPACE		0x08000000
+
+/* The striped directory has ever lost its master LMV EA, then LFSCK
+ * re-generated it. This flag is used to indicate such case. It is an
+ * on-disk flag.
+ */
+#define LMV_HASH_FLAG_LOST_LMV		0x10000000
+
+#define LMV_HASH_FLAG_BAD_TYPE		0x20000000
+#define LMV_HASH_FLAG_MIGRATION		0x80000000
 
 struct lustre_foreign_type {
 	uint32_t lft_type;
-- 
1.8.3.1


* [lustre-devel] [PATCH 387/622] lustre: osc: cancel osc_lock list traversal once found the lock is being used
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (385 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 386/622] lustre: uapi: change "space" hash type to hash flag James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 388/622] lustre: obdclass: add comment for rcu handling in lu_env_remove James Simmons
                   ` (235 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Gu Zheng <gzheng@ddn.com>

Currently osc_ldlm_weigh_ast() walks the osc_lock list (oo_ol_list)
to check whether the target DLM lock is being used. Once the lock is
found, the traversal should stop rather than scan the remaining
entries, but it doesn't. Fix that here.

Fixes: 3f3a24dc5d7d ("LU-3259 clio: cl_lock simplification")
WC-bug-id: https://jira.whamcloud.com/browse/LU-11518
Lustre-commit: eb9aa909343b ("LU-11518 osc: cancel osc_lock list traversal once found the lock is being used")
Signed-off-by: Gu Zheng <gzheng@ddn.com>
Reviewed-on: https://review.whamcloud.com/35396
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/osc_lock.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/osc/osc_lock.c b/fs/lustre/osc/osc_lock.c
index 29d8373..e01bf5f 100644
--- a/fs/lustre/osc/osc_lock.c
+++ b/fs/lustre/osc/osc_lock.c
@@ -687,9 +687,10 @@ unsigned long osc_ldlm_weigh_ast(struct ldlm_lock *dlmlock)
 
 	spin_lock(&obj->oo_ol_spin);
 	list_for_each_entry(oscl, &obj->oo_ol_list, ols_nextlock_oscobj) {
-		if (oscl->ols_dlmlock && oscl->ols_dlmlock != dlmlock)
-			continue;
-		found = true;
+		if (oscl->ols_dlmlock == dlmlock) {
+			found = true;
+			break;
+		}
 	}
 	spin_unlock(&obj->oo_ol_spin);
 	if (found) {
-- 
1.8.3.1


* [lustre-devel] [PATCH 388/622] lustre: obdclass: add comment for rcu handling in lu_env_remove
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (386 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 387/622] lustre: osc: cancel osc_lock list traversal once found the lock is being used James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 389/622] lnet: honor discovery setting James Simmons
                   ` (234 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

During review it was explained why the RCU lock can be dropped in
lu_env_remove(), but the code itself doesn't say why. Add a comment
detailing why RCU locking is not needed.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12491
Lustre-commit: 709fbe6ee54a ("LU-12491 obdclass: add comment for rcu handling in lu_env_remove")
Signed-off-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-on: https://review.whamcloud.com/35447
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Neil Brown <neilb@suse.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/lu_object.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/lustre/obdclass/lu_object.c b/fs/lustre/obdclass/lu_object.c
index c94911d..d8bff3f 100644
--- a/fs/lustre/obdclass/lu_object.c
+++ b/fs/lustre/obdclass/lu_object.c
@@ -1932,6 +1932,11 @@ void lu_env_remove(struct lu_env *env)
 		}
 	}
 
+	/* The rcu_lock is not taking in this case since the key
+	 * used is the actual task_struct. This implies that each
+	 * object is only removed by the owning thread, so there
+	 * can never be a race on a particular object.
+	 */
 	lei = rhashtable_lookup_fast(&lu_env_rhash, &task,
 				     lu_env_rhash_params);
 	if (lei && rhashtable_remove_fast(&lu_env_rhash, &lei->lei_linkage,
-- 
1.8.3.1


* [lustre-devel] [PATCH 389/622] lnet: honor discovery setting
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (387 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 388/622] lustre: obdclass: add comment for rcu handling in lu_env_remove James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 390/622] lustre: obdclass: don't send multiple statfs RPCs James Simmons
                   ` (233 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

If discovery is off, do not push out any updates. An update could
otherwise be triggered when a gateway's interface changes.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12423
Lustre-commit: a06b656639c4 ("LU-12423 lnet: honor discovery setting")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35192
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/peer.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index e33dc0e..b0ca1de 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -877,6 +877,8 @@ struct lnet_peer_ni *
 	int cpt;
 
 	lnet_net_lock(LNET_LOCK_EX);
+	if (lnet_peer_discovery_disabled)
+		force = 0;
 	lncpt = cfs_percpt_number(the_lnet.ln_peer_tables);
 	for (cpt = 0; cpt < lncpt; cpt++) {
 		ptable = the_lnet.ln_peer_tables[cpt];
-- 
1.8.3.1


* [lustre-devel] [PATCH 390/622] lustre: obdclass: don't send multiple statfs RPCs
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (388 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 389/622] lnet: honor discovery setting James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 391/622] lustre: lov: Correct bounds checking James Simmons
                   ` (232 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

If multiple threads are racing to send a non-cached OST_STATFS or
MDS_STATFS RPC, this can cause a significant RPC storm for systems
with many-core clients and many OSTs due to amplification of the
requests, and the fact that STATFS RPCs are sent asynchronously.
Some logs have shown a few 96-core clients with 20k+ OST_STATFS RPCs
in flight concurrently, which can overload the network if many OSTs
are on the same OSS nodes (osc.*.max_rpcs_in_flight is per OST).

This was not previously a significant issue when core counts were
smaller on the clients, or with fewer OSTs per OSS.

If a thread can't use the cached statfs values, limit statfs to one
thread at a time, since the thread(s) would be blocked waiting for
the RPC replies anyway, which can't finish faster if many are sent.

Also add an llite.*.statfs_max_age parameter that can be tuned to
control the maximum age (in seconds) of the statfs cache.  This can
avoid overhead for statfs-heavy workloads, given that the filesystem
is _probably_ not running out of space this second, and even so
"statfs" does not guarantee space in parallel workloads.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12368
Lustre-commit: 1c41a6ac390b ("LU-12368 obdclass: don't send multiple statfs RPCs")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35380
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd.h          |  2 ++
 fs/lustre/include/obd_class.h    | 22 ++++++++++++++++++++--
 fs/lustre/llite/llite_internal.h |  3 +++
 fs/lustre/llite/llite_lib.c      |  5 +++--
 fs/lustre/llite/lproc_llite.c    | 31 +++++++++++++++++++++++++++++++
 5 files changed, 59 insertions(+), 4 deletions(-)

diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index f53c303..53d078e 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -379,6 +379,8 @@ struct echo_client_obd {
 
 /* allow statfs data caching for 1 second */
 #define OBD_STATFS_CACHE_SECONDS 1
+/* arbitrary maximum. larger would be useless, allows catching bogus input */
+#define OBD_STATFS_CACHE_MAX_AGE 3600 /* seconds */
 
 #define lov_tgt_desc lu_tgt_desc
 
diff --git a/fs/lustre/include/obd_class.h b/fs/lustre/include/obd_class.h
index 76e8201..b8afa5a 100644
--- a/fs/lustre/include/obd_class.h
+++ b/fs/lustre/include/obd_class.h
@@ -952,13 +952,31 @@ static inline int obd_statfs(const struct lu_env *env, struct obd_export *exp,
 	if (obd->obd_osfs_age < max_age ||
 	    ((obd->obd_osfs.os_state & OS_STATE_SUM) &&
 	     !(flags & OBD_STATFS_SUM))) {
-		rc = OBP(obd, statfs)(env, exp, osfs, max_age, flags);
+		bool update_age = false;
+		/* the RPC will block anyway, so avoid sending many at once */
+		rc = mutex_lock_interruptible(&obd->obd_dev_mutex);
+		if (rc)
+			return rc;
+		if (obd->obd_osfs_age < max_age ||
+		    ((obd->obd_osfs.os_state & OS_STATE_SUM) &&
+		     !(flags & OBD_STATFS_SUM))) {
+			rc = OBP(obd, statfs)(env, exp, osfs, max_age, flags);
+			update_age = true;
+		} else {
+			CDEBUG(D_SUPER,
+			       "%s: new %p cache blocks %llu/%llu objects %llu/%llu\n",
+			       obd->obd_name, &obd->obd_osfs,
+			       obd->obd_osfs.os_bavail, obd->obd_osfs.os_blocks,
+			       obd->obd_osfs.os_ffree, obd->obd_osfs.os_files);
+		}
 		if (rc == 0) {
 			spin_lock(&obd->obd_osfs_lock);
 			memcpy(&obd->obd_osfs, osfs, sizeof(obd->obd_osfs));
-			obd->obd_osfs_age = ktime_get_seconds();
+			if (update_age)
+				obd->obd_osfs_age = ktime_get_seconds();
 			spin_unlock(&obd->obd_osfs_lock);
 		}
+		mutex_unlock(&obd->obd_dev_mutex);
 	} else {
 		CDEBUG(D_SUPER,
 		       "%s: use %p cache blocks %llu/%llu objects %llu/%llu\n",
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index 8d95694..9d60ae5 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -568,6 +568,9 @@ struct ll_sb_info {
 	/* st_blksize returned by stat(2), when non-zero */
 	unsigned int		 ll_stat_blksize;
 
+	/* maximum relative age of cached statfs results */
+	unsigned int		  ll_statfs_max_age;
+
 	struct kset		ll_kset;	/* sysfs object */
 	struct completion	 ll_kobj_unregister;
 
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 33f7fdb..cc417d6 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -87,6 +87,7 @@ static struct ll_sb_info *ll_init_sbi(void)
 	spin_lock_init(&sbi->ll_pp_extent_lock);
 	spin_lock_init(&sbi->ll_process_lock);
 	sbi->ll_rw_stats_on = 0;
+	sbi->ll_statfs_max_age = OBD_STATFS_CACHE_SECONDS;
 
 	si_meminfo(&si);
 	pages = si.totalram - si.totalhigh;
@@ -330,7 +331,7 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 	 * available
 	 */
 	err = obd_statfs(NULL, sbi->ll_md_exp, osfs,
-			 ktime_get_seconds() - OBD_STATFS_CACHE_SECONDS,
+			 ktime_get_seconds() - sbi->ll_statfs_max_age,
 			 OBD_STATFS_FOR_MDT0);
 	if (err)
 		goto out_md_fid;
@@ -1860,7 +1861,7 @@ int ll_statfs_internal(struct ll_sb_info *sbi, struct obd_statfs *osfs,
 	time64_t max_age;
 	int rc;
 
-	max_age = ktime_get_seconds() - OBD_STATFS_CACHE_SECONDS;
+	max_age = ktime_get_seconds() - sbi->ll_statfs_max_age;
 
 	rc = obd_statfs(NULL, sbi->ll_md_exp, osfs, max_age, flags);
 	if (rc)
diff --git a/fs/lustre/llite/lproc_llite.c b/fs/lustre/llite/lproc_llite.c
index 02403e4..4cffd36 100644
--- a/fs/lustre/llite/lproc_llite.c
+++ b/fs/lustre/llite/lproc_llite.c
@@ -882,6 +882,36 @@ static ssize_t lazystatfs_store(struct kobject *kobj,
 }
 LUSTRE_RW_ATTR(lazystatfs);
 
+static ssize_t statfs_max_age_show(struct kobject *kobj, struct attribute *attr,
+				   char *buf)
+{
+	struct ll_sb_info *sbi = container_of(kobj, struct ll_sb_info,
+					      ll_kset.kobj);
+
+	return snprintf(buf, PAGE_SIZE, "%u\n", sbi->ll_statfs_max_age);
+}
+
+static ssize_t statfs_max_age_store(struct kobject *kobj,
+				    struct attribute *attr, const char *buffer,
+				    size_t count)
+{
+	struct ll_sb_info *sbi = container_of(kobj, struct ll_sb_info,
+					      ll_kset.kobj);
+	unsigned int val;
+	int rc;
+
+	rc = kstrtouint(buffer, 10, &val);
+	if (rc)
+		return rc;
+	if (val > OBD_STATFS_CACHE_MAX_AGE)
+		return -EINVAL;
+
+	sbi->ll_statfs_max_age = val;
+
+	return count;
+}
+LUSTRE_RW_ATTR(statfs_max_age);
+
 static ssize_t max_easize_show(struct kobject *kobj,
 			       struct attribute *attr,
 			       char *buf)
@@ -1480,6 +1510,7 @@ struct lprocfs_vars lprocfs_llite_obd_vars[] = {
 	&lustre_attr_statahead_max.attr,
 	&lustre_attr_statahead_agl.attr,
 	&lustre_attr_lazystatfs.attr,
+	&lustre_attr_statfs_max_age.attr,
 	&lustre_attr_max_easize.attr,
 	&lustre_attr_default_easize.attr,
 	&lustre_attr_xattr_cache.attr,
-- 
1.8.3.1


* [lustre-devel] [PATCH 391/622] lustre: lov: Correct bounds checking
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (389 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 390/622] lustre: obdclass: don't send multiple statfs RPCs James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 392/622] lustre: lu_object: Add missed qos_rr_init James Simmons
                   ` (231 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Nathaniel Clark <nclark@whamcloud.com>

While running his smatch tool against the lustre code base, Dan
Carpenter encountered the following static checker warning:

fs/lustre/lov/lov_ea.c:207 lsm_unpackmd_common()
warn: signed overflow undefined. 'min_stripe_maxbytes * stripe_count < min_stripe_maxbytes'

The current code doesn't properly handle potential overflow in the
min_stripe_maxbytes * stripe_count multiplication. This fixes the
overflow detection for maxbytes in lsme_unpack().

Fixes: 476f575cf070 ("staging: lustre: lov: Ensure correct operation for large object sizes")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-9862
Lustre-commit: 31ff883c7b0c ("LU-9862 lov: Correct bounds checking")
Signed-off-by: Nathaniel Clark <nclark@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/28484
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Petros Koutoupis <pkoutoupis@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/lov/lov_ea.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/lustre/lov/lov_ea.c b/fs/lustre/lov/lov_ea.c
index 07bfe0f..4be01bb8 100644
--- a/fs/lustre/lov/lov_ea.c
+++ b/fs/lustre/lov/lov_ea.c
@@ -274,15 +274,16 @@ void lsm_free(struct lov_stripe_md *lsm)
 	if (min_stripe_maxbytes == 0)
 		min_stripe_maxbytes = LUSTRE_EXT3_STRIPE_MAXBYTES;
 
-	lov_bytes = min_stripe_maxbytes * stripe_count;
+	if (stripe_count == 0)
+		lov_bytes = min_stripe_maxbytes;
+	else if (min_stripe_maxbytes <= LLONG_MAX / stripe_count)
+		lov_bytes = min_stripe_maxbytes * stripe_count;
+	else
+		lov_bytes = MAX_LFS_FILESIZE;
 
 out_dom:
-	if (maxbytes) {
-		if (lov_bytes < min_stripe_maxbytes) /* handle overflow */
-			*maxbytes = MAX_LFS_FILESIZE;
-		else
-			*maxbytes = lov_bytes;
-	}
+	if (maxbytes)
+		*maxbytes = min_t(loff_t, lov_bytes, MAX_LFS_FILESIZE);
 
 	return lsme;
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 392/622] lustre: lu_object: Add missed qos_rr_init
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (390 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 391/622] lustre: lov: Correct bounds checking James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 393/622] lustre: fld: let's caller to retry FLD_QUERY James Simmons
                   ` (230 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

The new LMV space hash code uses the lu_qos_rr struct but does not
fully initialize it.  Specifically, the spin lock is never
initialized, which causes failures.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12538
Lustre-commit: 5e6a30cc2f34 ("LU-12538 lod: Add missed qos_rr_init")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35490
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lu_object.h | 1 +
 fs/lustre/lmv/lmv_obd.c       | 3 ++-
 fs/lustre/obdclass/lu_qos.c   | 7 +++++++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/include/lu_object.h b/fs/lustre/include/lu_object.h
index 6b1064a..d2e84a3 100644
--- a/fs/lustre/include/lu_object.h
+++ b/fs/lustre/include/lu_object.h
@@ -1388,6 +1388,7 @@ struct lu_qos {
 				 lq_reset:1;     /* zero current penalties */
 };
 
+void lu_qos_rr_init(struct lu_qos_rr *lqr);
 int lqos_add_tgt(struct lu_qos *qos, struct lu_tgt_desc *ltd);
 int lqos_del_tgt(struct lu_qos *qos, struct lu_tgt_desc *ltd);
 u64 lu_prandom_u64_max(u64 ep_ro);
diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index ae799db..e9f9c36 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -1295,13 +1295,14 @@ static int lmv_setup(struct obd_device *obd, struct lustre_cfg *lcfg)
 	INIT_LIST_HEAD(&lmv->lmv_qos.lq_svr_list);
 	init_rwsem(&lmv->lmv_qos.lq_rw_sem);
 	lmv->lmv_qos.lq_dirty = 1;
-	lmv->lmv_qos.lq_rr.lqr_dirty = 1;
 	lmv->lmv_qos.lq_reset = 1;
 	/* Default priority is toward free space balance */
 	lmv->lmv_qos.lq_prio_free = 232;
 	/* Default threshold for rr (roughly 17%) */
 	lmv->lmv_qos.lq_threshold_rr = 43;
 
+	lu_qos_rr_init(&lmv->lmv_qos.lq_rr);
+
 	/*
 	 * initialize rr_index to lower 32bit of netid, so that client
 	 * can distribute subdirs evenly from the beginning.
diff --git a/fs/lustre/obdclass/lu_qos.c b/fs/lustre/obdclass/lu_qos.c
index 9fdcbc2..d4803e8 100644
--- a/fs/lustre/obdclass/lu_qos.c
+++ b/fs/lustre/obdclass/lu_qos.c
@@ -42,6 +42,13 @@
 #include <lustre_fid.h>
 #include <lu_object.h>
 
+void lu_qos_rr_init(struct lu_qos_rr *lqr)
+{
+	spin_lock_init(&lqr->lqr_alloc);
+	lqr->lqr_dirty = 1;
+}
+EXPORT_SYMBOL(lu_qos_rr_init);
+
 /**
  * Add a new target to Quality of Service (QoS) target table.
  *
-- 
1.8.3.1


* [lustre-devel] [PATCH 393/622] lustre: fld: let's caller to retry FLD_QUERY
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (391 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 392/622] lustre: lu_object: Add missed qos_rr_init James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 394/622] lustre: llite: make sure readahead cover current read James Simmons
                   ` (229 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Hongchao Zhang <hongchao@whamcloud.com>

In fld_client_rpc(), if the FLD_QUERY request between MDTs fails
with -EWOULDBLOCK because the connection is lost, return -EAGAIN
to notify the caller to retry.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11761
Lustre-commit: e3f6111dfd1c ("LU-11761 fld: let's caller to retry FLD_QUERY")
Signed-off-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34962
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/fld/fld_request.c     | 23 ++++++++++++++---------
 fs/lustre/include/obd_support.h |  1 +
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/fs/lustre/fld/fld_request.c b/fs/lustre/fld/fld_request.c
index 75cba18..52c148a 100644
--- a/fs/lustre/fld/fld_request.c
+++ b/fs/lustre/fld/fld_request.c
@@ -314,7 +314,6 @@ int fld_client_rpc(struct obd_export *exp,
 
 	LASSERT(exp);
 
-again:
 	imp = class_exp2cliimp(exp);
 	switch (fld_op) {
 	case FLD_QUERY:
@@ -363,17 +362,23 @@ int fld_client_rpc(struct obd_export *exp,
 	req->rq_reply_portal = MDC_REPLY_PORTAL;
 	ptlrpc_at_set_req_timeout(req);
 
-	obd_get_request_slot(&exp->exp_obd->u.cli);
-	rc = ptlrpc_queue_wait(req);
-	obd_put_request_slot(&exp->exp_obd->u.cli);
+	if (OBD_FAIL_CHECK(OBD_FAIL_FLD_QUERY_REQ && req->rq_no_delay)) {
+		/* the same error returned by ptlrpc_import_delay_req */
+		rc = -EWOULDBLOCK;
+		req->rq_status = rc;
+	} else {
+		obd_get_request_slot(&exp->exp_obd->u.cli);
+		rc = ptlrpc_queue_wait(req);
+		obd_put_request_slot(&exp->exp_obd->u.cli);
+	}
+
 	if (rc != 0) {
 		if (imp->imp_state != LUSTRE_IMP_CLOSED && !imp->imp_deactive) {
-			/* Since LWP is not replayable, so it will keep
-			 * trying unless umount happens, otherwise it would
-			 * cause unnecessary failure of the application.
+			/*
+			 * Since LWP is not replayable, so notify the caller
+			 * to retry if needed after a while.
 			 */
-			ptlrpc_req_finished(req);
-			goto again;
+			rc = -EAGAIN;
 		}
 		goto out_req;
 	}
diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index 9609dd5..23f6bae 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -424,6 +424,7 @@
 #define OBD_FAIL_FLD					0x1100
 #define OBD_FAIL_FLD_QUERY_NET				0x1101
 #define OBD_FAIL_FLD_READ_NET				0x1102
+#define OBD_FAIL_FLD_QUERY_REQ				0x1103
 
 #define OBD_FAIL_SEC_CTX				0x1200
 #define OBD_FAIL_SEC_CTX_INIT_NET			0x1201
-- 
1.8.3.1


* [lustre-devel] [PATCH 394/622] lustre: llite: make sure readahead cover current read
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (392 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 393/622] lustre: fld: let's caller to retry FLD_QUERY James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 395/622] lustre: ptlrpc: Add jobid to rpctrace debug messages James Simmons
                   ` (228 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

When doing readahead, @ria_end_min indicates how far readahead is
expected to extend in order to cover the current read.

Update @ria_end_min unconditionally with the I/O end. Also note that
@ria_end_min is a closed interval, so it should be calculated as
start + count - 1.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12043
Lustre-commit: 8fbef5ee7619 ("LU-12043 llite: make sure readahead cover current read")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/35215
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/rw.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/fs/lustre/llite/rw.c b/fs/lustre/llite/rw.c
index bec26c4..fe9a2b0 100644
--- a/fs/lustre/llite/rw.c
+++ b/fs/lustre/llite/rw.c
@@ -689,16 +689,8 @@ static int ll_readahead(const struct lu_env *env, struct cl_io *io,
 
 	/* at least to extend the readahead window to cover current read */
 	if (!hit && vio->vui_ra_valid &&
-	    vio->vui_ra_start + vio->vui_ra_count > ria->ria_start) {
-		unsigned long remainder;
-
-		/* to the end of current read window. */
-		mlen = vio->vui_ra_start + vio->vui_ra_count - ria->ria_start;
-		/* trim to RPC boundary */
-		ras_align(ras, ria->ria_start, &remainder);
-		mlen = min(mlen, ras->ras_rpc_size - remainder);
-		ria->ria_end_min = ria->ria_start + mlen;
-	}
+	    vio->vui_ra_start + vio->vui_ra_count > ria->ria_start)
+		ria->ria_end_min = vio->vui_ra_start + vio->vui_ra_count - 1;
 
 	ria->ria_reserved = ll_ra_count_get(ll_i2sbi(inode), ria, len, mlen);
 	if (ria->ria_reserved < len)
-- 
1.8.3.1


* [lustre-devel] [PATCH 395/622] lustre: ptlrpc: Add jobid to rpctrace debug messages
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (393 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 394/622] lustre: llite: make sure readahead cover current read James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 396/622] lnet: libcfs: Reduce memory frag due to HA debug msg James Simmons
                   ` (227 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Ann Koehler <amk@cray.com>

This mod adds the jobid string found in the ptlrpc_body of an RPC
to the output of rpctrace messages. If jobids are not in use, the
string will be empty. If jobids are in use, the string can be
useful in analyzing Lustre activity.

Cray-bug-id: LUS-7557
WC-bug-id: https://jira.whamcloud.com/browse/LU-12523
Lustre-commit: 9ae40e4c5ecb ("LU-12523 ptlrpc: Add jobid to rpctrace debug messages")
Signed-off-by: Ann Koehler <amk@cray.com>
Reviewed-on: https://review.whamcloud.com/35445
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_net.h  |  1 +
 fs/lustre/ptlrpc/client.c       | 15 +++++++++------
 fs/lustre/ptlrpc/pack_generic.c | 30 ++++++++++++++++++++++++++++--
 fs/lustre/ptlrpc/service.c      | 12 +++++++-----
 4 files changed, 45 insertions(+), 13 deletions(-)

diff --git a/fs/lustre/include/lustre_net.h b/fs/lustre/include/lustre_net.h
index 7ed2d99..d03e8c6 100644
--- a/fs/lustre/include/lustre_net.h
+++ b/fs/lustre/include/lustre_net.h
@@ -2074,6 +2074,7 @@ int lustre_shrink_msg(struct lustre_msg *msg, int segment,
 u32 lustre_msg_get_magic(struct lustre_msg *msg);
 u32 lustre_msg_get_timeout(struct lustre_msg *msg);
 u32 lustre_msg_get_service_time(struct lustre_msg *msg);
+char *lustre_msg_get_jobid(struct lustre_msg *msg);
 u32 lustre_msg_get_cksum(struct lustre_msg *msg);
 u32 lustre_msg_calc_cksum(struct lustre_msg *msg);
 void lustre_msg_set_handle(struct lustre_msg *msg,
diff --git a/fs/lustre/ptlrpc/client.c b/fs/lustre/ptlrpc/client.c
index ac16878..bd641cc 100644
--- a/fs/lustre/ptlrpc/client.c
+++ b/fs/lustre/ptlrpc/client.c
@@ -1639,11 +1639,12 @@ static int ptlrpc_send_new_req(struct ptlrpc_request *req)
 	}
 
 	CDEBUG(D_RPCTRACE,
-	       "Sending RPC pname:cluuid:pid:xid:nid:opc %s:%s:%d:%llu:%s:%d\n",
-	       current->comm,
+	       "Sending RPC req@%p pname:cluuid:pid:xid:nid:opc:job %s:%s:%d:%llu:%s:%d:%s\n",
+	       req, current->comm,
 	       imp->imp_obd->obd_uuid.uuid,
 	       lustre_msg_get_status(req->rq_reqmsg), req->rq_xid,
-	       obd_import_nid2str(imp), lustre_msg_get_opc(req->rq_reqmsg));
+	       obd_import_nid2str(imp), lustre_msg_get_opc(req->rq_reqmsg),
+	       lustre_msg_get_jobid(req->rq_reqmsg));
 
 	rc = ptl_send_rpc(req, 0);
 	if (rc == -ENOMEM) {
@@ -2057,12 +2058,14 @@ int ptlrpc_check_set(const struct lu_env *env, struct ptlrpc_request_set *set)
 
 		if (req->rq_reqmsg)
 			CDEBUG(D_RPCTRACE,
-			       "Completed RPC pname:cluuid:pid:xid:nid:opc %s:%s:%d:%llu:%s:%d\n",
-			       current->comm, imp->imp_obd->obd_uuid.uuid,
+			       "Completed RPC req@%p pname:cluuid:pid:xid:nid:opc:job %s:%s:%d:%llu:%s:%d:%s\n",
+			       req, current->comm,
+			       imp->imp_obd->obd_uuid.uuid,
 			       lustre_msg_get_status(req->rq_reqmsg),
 			       req->rq_xid,
 			       obd_import_nid2str(imp),
-			       lustre_msg_get_opc(req->rq_reqmsg));
+			       lustre_msg_get_opc(req->rq_reqmsg),
+			       lustre_msg_get_jobid(req->rq_reqmsg));
 
 		spin_lock(&imp->imp_lock);
 		/*
diff --git a/fs/lustre/ptlrpc/pack_generic.c b/fs/lustre/ptlrpc/pack_generic.c
index a4f28f3..f687ecc 100644
--- a/fs/lustre/ptlrpc/pack_generic.c
+++ b/fs/lustre/ptlrpc/pack_generic.c
@@ -1183,6 +1183,31 @@ u32 lustre_msg_get_service_time(struct lustre_msg *msg)
 	}
 }
 
+char *lustre_msg_get_jobid(struct lustre_msg *msg)
+{
+	switch (msg->lm_magic) {
+	case LUSTRE_MSG_MAGIC_V2: {
+		struct ptlrpc_body *pb;
+
+		/* the old pltrpc_body_v2 is smaller; doesn't include jobid */
+		if (msg->lm_buflens[MSG_PTLRPC_BODY_OFF] <
+		    sizeof(struct ptlrpc_body))
+			return NULL;
+
+		pb = lustre_msg_buf_v2(msg, MSG_PTLRPC_BODY_OFF,
+					  sizeof(struct ptlrpc_body));
+		if (!pb)
+			return NULL;
+
+		return pb->pb_jobid;
+	}
+	default:
+		CERROR("incorrect message magic: %08x\n", msg->lm_magic);
+		return NULL;
+	}
+}
+EXPORT_SYMBOL(lustre_msg_get_jobid);
+
 u32 lustre_msg_get_cksum(struct lustre_msg *msg)
 {
 	switch (msg->lm_magic) {
@@ -2337,7 +2362,7 @@ void _debug_req(struct ptlrpc_request *req,
 	vaf.fmt = fmt;
 	vaf.va = &args;
 	libcfs_debug_msg(msgdata,
-			 "%pV req@%p x%llu/t%lld(%lld) o%d->%s@%s:%d/%d lens %d/%d e %d to %lld dl %lld ref %d fl " REQ_FLAGS_FMT "/%x/%x rc %d/%d\n",
+			 "%pV req@%p x%llu/t%lld(%lld) o%d->%s@%s:%d/%d lens %d/%d e %d to %lld dl %lld ref %d fl " REQ_FLAGS_FMT "/%x/%x rc %d/%d job:'%s'\n",
 			 &vaf,
 			 req, req->rq_xid, req->rq_transno,
 			 req_ok ? lustre_msg_get_transno(req->rq_reqmsg) : 0,
@@ -2355,7 +2380,8 @@ void _debug_req(struct ptlrpc_request *req,
 			 atomic_read(&req->rq_refcount),
 			 DEBUG_REQ_FLAGS(req),
 			 req_ok ? lustre_msg_get_flags(req->rq_reqmsg) : -1,
-			 rep_flags, req->rq_status, rep_status);
+			 rep_flags, req->rq_status, rep_status,
+			 req_ok ? lustre_msg_get_jobid(req->rq_reqmsg) : "");
 	va_end(args);
 }
 EXPORT_SYMBOL(_debug_req);
diff --git a/fs/lustre/ptlrpc/service.c b/fs/lustre/ptlrpc/service.c
index 8e6013a..3132a1e 100644
--- a/fs/lustre/ptlrpc/service.c
+++ b/fs/lustre/ptlrpc/service.c
@@ -1756,15 +1756,16 @@ static int ptlrpc_server_handle_request(struct ptlrpc_service_part *svcpt,
 	}
 
 	CDEBUG(D_RPCTRACE,
-	       "Handling RPC pname:cluuid+ref:pid:xid:nid:opc %s:%s+%d:%d:x%llu:%s:%d\n",
-	       current->comm,
+	       "Handling RPC req@%p pname:cluuid+ref:pid:xid:nid:opc:job %s:%s+%d:%d:x%llu:%s:%d:%s\n",
+	       request, current->comm,
 	       (request->rq_export ?
 		(char *)request->rq_export->exp_client_uuid.uuid : "0"),
 	       (request->rq_export ?
 		refcount_read(&request->rq_export->exp_refcount) : -99),
 	       lustre_msg_get_status(request->rq_reqmsg), request->rq_xid,
 	       libcfs_id2str(request->rq_peer),
-	       lustre_msg_get_opc(request->rq_reqmsg));
+	       lustre_msg_get_opc(request->rq_reqmsg),
+	       lustre_msg_get_jobid(request->rq_reqmsg));
 
 	if (lustre_msg_get_opc(request->rq_reqmsg) != OBD_PING)
 		CFS_FAIL_TIMEOUT_MS(OBD_FAIL_PTLRPC_PAUSE_REQ, cfs_fail_val);
@@ -1796,8 +1797,8 @@ static int ptlrpc_server_handle_request(struct ptlrpc_service_part *svcpt,
 	timediff_usecs = ktime_us_delta(work_end, work_start);
 	arrived_usecs = ktime_us_delta(work_end, arrived);
 	CDEBUG(D_RPCTRACE,
-	       "Handled RPC pname:cluuid+ref:pid:xid:nid:opc %s:%s+%d:%d:x%llu:%s:%d Request processed in %lldus (%lldus total) trans %llu rc %d/%d\n",
-	       current->comm,
+	       "Handled RPC req@%p pname:cluuid+ref:pid:xid:nid:opc:job %s:%s+%d:%d:x%llu:%s:%d:%s Request processed in %lldus (%lldus total) trans %llu rc %d/%d\n",
+	       request, current->comm,
 	       (request->rq_export ?
 		(char *)request->rq_export->exp_client_uuid.uuid : "0"),
 	       (request->rq_export ?
@@ -1806,6 +1807,7 @@ static int ptlrpc_server_handle_request(struct ptlrpc_service_part *svcpt,
 	       request->rq_xid,
 	       libcfs_id2str(request->rq_peer),
 	       lustre_msg_get_opc(request->rq_reqmsg),
+	       lustre_msg_get_jobid(request->rq_reqmsg),
 	       timediff_usecs,
 	       arrived_usecs,
 	       (request->rq_repmsg ?
-- 
1.8.3.1


* [lustre-devel] [PATCH 396/622] lnet: libcfs: Reduce memory frag due to HA debug msg
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (394 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 395/622] lustre: ptlrpc: Add jobid to rpctrace debug messages James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 397/622] lustre: ptlrpc: change IMPORT_SET_* macros into real functions James Simmons
                   ` (226 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Ann Koehler <amk@cray.com>

The dynamic allocation and freeing of Lustre trace pages has been
shown to cause memory fragmentation that sometimes prevents
applications from getting the contiguous memory they need to run. In
one such occurrence over 99% of the messages were the matched open
trace messages issued by mdc_close():

DEBUG_REQ(D_HA, mod->mod_open_req, "matched open; tag %d", tag);

D_HA is included in the default set of debug flags. This has proven
to be quite useful in debugging connection issues particularly at
mount time. So removing all HA message from the default tracing is
not a good option.

However, the matched open debug message has not proven to be as
generally useful. Moving the message under a different debug flag,
one that must be explicitly enabled, reduces the amount of default
tracing and thereby helps reduce fragmentation without much loss of
functionality. D_RPCTRACE is used to match the corresponding open
debug message in mdc_set_open_replay_data().

Cray-bug-id: LUS-7560
WC-bug-id: https://jira.whamcloud.com/browse/LU-12524
Lustre-commit: 076a5961f20b ("LU-12524 libcfs: Reduce memory frag due to HA debug msg")
Signed-off-by: Ann Koehler <amk@cray.com>
Reviewed-on: https://review.whamcloud.com/35449
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/mdc/mdc_request.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index a26efa1..7bc6196 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -937,7 +937,7 @@ static int mdc_close(struct obd_export *exp, struct md_op_data *op_data,
 
 		mod->mod_close_req = req;
 
-		DEBUG_REQ(D_HA, mod->mod_open_req, "matched open");
+		DEBUG_REQ(D_RPCTRACE, mod->mod_open_req, "matched open");
 		/* We no longer want to preserve this open for replay even
 		 * though the open was committed. b=3632, b=3633
 		 */
-- 
1.8.3.1


* [lustre-devel] [PATCH 397/622] lustre: ptlrpc: change IMPORT_SET_* macros into real functions
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (395 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 396/622] lnet: libcfs: Reduce memory frag due to HA debug msg James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 398/622] lustre: uapi: add unused enum obd_statfs_state James Simmons
                   ` (225 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

Make the IMPORT_SET_STATE_NOLOCK and IMPORT_SET_STATE macros into
normal functions. Since import_set_state_nolock() is basically a
wrapper around __import_set_state() we can merge both functions.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10756
Lustre-commit: cf78502e48d6 ("LU-10756 ptlrpc: change IMPORT_SET_* macros into real functions")
Signed-off-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-on: https://review.whamcloud.com/35463
Reviewed-by: Neil Brown <neilb@suse.com>
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/import.c | 96 +++++++++++++++++++++++------------------------
 1 file changed, 48 insertions(+), 48 deletions(-)

diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index f8e15f2..98c09f6 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -58,10 +58,10 @@ struct ptlrpc_connect_async_args {
 
 /**
  * Updates import @imp current state to provided @state value
- * Helper function. Must be called under imp_lock.
+ * Helper function.
  */
-static void __import_set_state(struct obd_import *imp,
-			       enum lustre_imp_state state)
+static void import_set_state_nolock(struct obd_import *imp,
+				    enum lustre_imp_state state)
 {
 	switch (state) {
 	case LUSTRE_IMP_CLOSED:
@@ -74,6 +74,18 @@ static void __import_set_state(struct obd_import *imp,
 		break;
 	default:
 		imp->imp_replay_state = LUSTRE_IMP_REPLAY;
+		break;
+	}
+
+	/* A CLOSED import should remain so. */
+	if (state == LUSTRE_IMP_CLOSED)
+		return;
+
+	if (imp->imp_state != LUSTRE_IMP_NEW) {
+		CDEBUG(D_HA, "%p %s: changing import state from %s to %s\n",
+		       imp, obd2cli_tgt(imp->imp_obd),
+		       ptlrpc_import_state_name(imp->imp_state),
+		       ptlrpc_import_state_name(state));
 	}
 
 	imp->imp_state = state;
@@ -84,24 +96,13 @@ static void __import_set_state(struct obd_import *imp,
 		IMP_STATE_HIST_LEN;
 }
 
-/* A CLOSED import should remain so. */
-#define IMPORT_SET_STATE_NOLOCK(imp, state)				       \
-do {									       \
-	if (imp->imp_state != LUSTRE_IMP_CLOSED) {			       \
-		CDEBUG(D_HA, "%p %s: changing import state from %s to %s\n",   \
-		       imp, obd2cli_tgt(imp->imp_obd),			       \
-		       ptlrpc_import_state_name(imp->imp_state),	       \
-		       ptlrpc_import_state_name(state));		       \
-		__import_set_state(imp, state);				       \
-	}								       \
-} while (0)
-
-#define IMPORT_SET_STATE(imp, state)					\
-do {									\
-	spin_lock(&imp->imp_lock);					\
-	IMPORT_SET_STATE_NOLOCK(imp, state);				\
-	spin_unlock(&imp->imp_lock);					\
-} while (0)
+static void import_set_state(struct obd_import *imp,
+			     enum lustre_imp_state new_state)
+{
+	spin_lock(&imp->imp_lock);
+	import_set_state_nolock(imp, new_state);
+	spin_unlock(&imp->imp_lock);
+}
 
 static int ptlrpc_connect_interpret(const struct lu_env *env,
 				    struct ptlrpc_request *request,
@@ -180,7 +181,7 @@ int ptlrpc_set_import_discon(struct obd_import *imp, u32 conn_cnt)
 					   target_len, target_start,
 					   obd_import_nid2str(imp));
 		}
-		IMPORT_SET_STATE_NOLOCK(imp, LUSTRE_IMP_DISCON);
+		import_set_state_nolock(imp, LUSTRE_IMP_DISCON);
 		spin_unlock(&imp->imp_lock);
 
 		if (obd_dump_on_timeout)
@@ -629,7 +630,7 @@ int ptlrpc_connect_import(struct obd_import *imp)
 		return -EALREADY;
 	}
 
-	IMPORT_SET_STATE_NOLOCK(imp, LUSTRE_IMP_CONNECTING);
+	import_set_state_nolock(imp, LUSTRE_IMP_CONNECTING);
 
 	imp->imp_conn_cnt++;
 	imp->imp_resend_replay = 0;
@@ -742,7 +743,7 @@ int ptlrpc_connect_import(struct obd_import *imp)
 	rc = 0;
 out:
 	if (rc != 0)
-		IMPORT_SET_STATE(imp, LUSTRE_IMP_DISCON);
+		import_set_state(imp, LUSTRE_IMP_DISCON);
 
 	return rc;
 }
@@ -1094,9 +1095,9 @@ static int ptlrpc_connect_interpret(const struct lu_env *env,
 		if (msg_flags & MSG_CONNECT_RECOVERING) {
 			CDEBUG(D_HA, "connect to %s during recovery\n",
 			       obd2cli_tgt(imp->imp_obd));
-			IMPORT_SET_STATE(imp, LUSTRE_IMP_REPLAY_LOCKS);
+			import_set_state(imp, LUSTRE_IMP_REPLAY_LOCKS);
 		} else {
-			IMPORT_SET_STATE(imp, LUSTRE_IMP_FULL);
+			import_set_state(imp, LUSTRE_IMP_FULL);
 			ptlrpc_activate_import(imp);
 		}
 
@@ -1149,8 +1150,8 @@ static int ptlrpc_connect_interpret(const struct lu_env *env,
 			imp->imp_remote_handle =
 				     *lustre_msg_get_handle(request->rq_repmsg);
 
-			if (!(msg_flags & MSG_CONNECT_RECOVERING)) {
-				IMPORT_SET_STATE(imp, LUSTRE_IMP_EVICTED);
+			if (!(MSG_CONNECT_RECOVERING & msg_flags)) {
+				import_set_state(imp, LUSTRE_IMP_EVICTED);
 				rc = 0;
 				goto finish;
 			}
@@ -1162,11 +1163,10 @@ static int ptlrpc_connect_interpret(const struct lu_env *env,
 		}
 
 		if (imp->imp_invalid) {
-			CDEBUG(D_HA,
-			       "%s: reconnected but import is invalid; marking evicted\n",
-			       imp->imp_obd->obd_name);
-			IMPORT_SET_STATE(imp, LUSTRE_IMP_EVICTED);
-		} else if (msg_flags & MSG_CONNECT_RECOVERING) {
+			CDEBUG(D_HA, "%s: reconnected but import is invalid; "
+			       "marking evicted\n", imp->imp_obd->obd_name);
+			import_set_state(imp, LUSTRE_IMP_EVICTED);
+		} else if (MSG_CONNECT_RECOVERING & msg_flags) {
 			CDEBUG(D_HA, "%s: reconnected to %s during replay\n",
 			       imp->imp_obd->obd_name,
 			       obd2cli_tgt(imp->imp_obd));
@@ -1175,9 +1175,9 @@ static int ptlrpc_connect_interpret(const struct lu_env *env,
 			imp->imp_resend_replay = 1;
 			spin_unlock(&imp->imp_lock);
 
-			IMPORT_SET_STATE(imp, imp->imp_replay_state);
+			import_set_state(imp, imp->imp_replay_state);
 		} else {
-			IMPORT_SET_STATE(imp, LUSTRE_IMP_RECOVER);
+			import_set_state(imp, LUSTRE_IMP_RECOVER);
 		}
 	} else if ((msg_flags & MSG_CONNECT_RECOVERING) && !imp->imp_invalid) {
 		LASSERT(imp->imp_replayable);
@@ -1185,14 +1185,14 @@ static int ptlrpc_connect_interpret(const struct lu_env *env,
 				*lustre_msg_get_handle(request->rq_repmsg);
 		imp->imp_last_replay_transno = 0;
 		imp->imp_replay_cursor = &imp->imp_committed_list;
-		IMPORT_SET_STATE(imp, LUSTRE_IMP_REPLAY);
+		import_set_state(imp, LUSTRE_IMP_REPLAY);
 	} else {
 		DEBUG_REQ(D_HA, request,
 			  "%s: evicting (reconnect/recover flags not set: %x)",
 			  imp->imp_obd->obd_name, msg_flags);
 		imp->imp_remote_handle =
 				*lustre_msg_get_handle(request->rq_repmsg);
-		IMPORT_SET_STATE(imp, LUSTRE_IMP_EVICTED);
+		import_set_state(imp, LUSTRE_IMP_EVICTED);
 	}
 
 	/* Sanity checks for a reconnected import. */
@@ -1232,7 +1232,7 @@ static int ptlrpc_connect_interpret(const struct lu_env *env,
 		class_export_put(exp);
 
 	if (rc != 0) {
-		IMPORT_SET_STATE(imp, LUSTRE_IMP_DISCON);
+		import_set_state(imp, LUSTRE_IMP_DISCON);
 		if (rc == -EACCES) {
 			/*
 			 * Give up trying to reconnect
@@ -1268,7 +1268,7 @@ static int ptlrpc_connect_interpret(const struct lu_env *env,
 						   OBD_OCD_VERSION_FIX(ocd->ocd_version),
 						   LUSTRE_VERSION_STRING);
 				ptlrpc_deactivate_import(imp);
-				IMPORT_SET_STATE(imp, LUSTRE_IMP_CLOSED);
+				import_set_state(imp, LUSTRE_IMP_CLOSED);
 			}
 			return -EPROTO;
 		}
@@ -1367,7 +1367,7 @@ static int ptlrpc_invalidate_import_thread(void *data)
 		libcfs_debug_dumplog();
 	}
 
-	IMPORT_SET_STATE(imp, LUSTRE_IMP_RECOVER);
+	import_set_state(imp, LUSTRE_IMP_RECOVER);
 	ptlrpc_import_recovery_state_machine(imp);
 
 	class_import_put(imp);
@@ -1448,7 +1448,7 @@ int ptlrpc_import_recovery_state_machine(struct obd_import *imp)
 		rc = ptlrpc_replay_next(imp, &inflight);
 		if (inflight == 0 &&
 		    atomic_read(&imp->imp_replay_inflight) == 0) {
-			IMPORT_SET_STATE(imp, LUSTRE_IMP_REPLAY_LOCKS);
+			import_set_state(imp, LUSTRE_IMP_REPLAY_LOCKS);
 			rc = ldlm_replay_locks(imp);
 			if (rc)
 				goto out;
@@ -1458,7 +1458,7 @@ int ptlrpc_import_recovery_state_machine(struct obd_import *imp)
 
 	if (imp->imp_state == LUSTRE_IMP_REPLAY_LOCKS)
 		if (atomic_read(&imp->imp_replay_inflight) == 0) {
-			IMPORT_SET_STATE(imp, LUSTRE_IMP_REPLAY_WAIT);
+			import_set_state(imp, LUSTRE_IMP_REPLAY_WAIT);
 			rc = signal_completed_replay(imp);
 			if (rc)
 				goto out;
@@ -1466,7 +1466,7 @@ int ptlrpc_import_recovery_state_machine(struct obd_import *imp)
 
 	if (imp->imp_state == LUSTRE_IMP_REPLAY_WAIT)
 		if (atomic_read(&imp->imp_replay_inflight) == 0)
-			IMPORT_SET_STATE(imp, LUSTRE_IMP_RECOVER);
+			import_set_state(imp, LUSTRE_IMP_RECOVER);
 
 	if (imp->imp_state == LUSTRE_IMP_RECOVER) {
 		CDEBUG(D_HA, "reconnected to %s@%s\n",
@@ -1476,7 +1476,7 @@ int ptlrpc_import_recovery_state_machine(struct obd_import *imp)
 		rc = ptlrpc_resend(imp);
 		if (rc)
 			goto out;
-		IMPORT_SET_STATE(imp, LUSTRE_IMP_FULL);
+		import_set_state(imp, LUSTRE_IMP_FULL);
 		ptlrpc_activate_import(imp);
 
 		deuuidify(obd2cli_tgt(imp->imp_obd), NULL,
@@ -1536,7 +1536,7 @@ static struct ptlrpc_request *ptlrpc_disconnect_prep_req(struct obd_import *imp)
 	req->rq_timeout = min_t(int, req->rq_timeout,
 				INITIAL_CONNECT_TIMEOUT);
 
-	IMPORT_SET_STATE(imp, LUSTRE_IMP_CONNECTING);
+	import_set_state(imp, LUSTRE_IMP_CONNECTING);
 	req->rq_send_state =  LUSTRE_IMP_CONNECTING;
 	ptlrpc_request_set_replen(req);
 
@@ -1601,9 +1601,9 @@ int ptlrpc_disconnect_import(struct obd_import *imp, int noclose)
 	spin_lock(&imp->imp_lock);
 out:
 	if (noclose)
-		IMPORT_SET_STATE_NOLOCK(imp, LUSTRE_IMP_DISCON);
+		import_set_state_nolock(imp, LUSTRE_IMP_DISCON);
 	else
-		IMPORT_SET_STATE_NOLOCK(imp, LUSTRE_IMP_CLOSED);
+		import_set_state_nolock(imp, LUSTRE_IMP_CLOSED);
 	memset(&imp->imp_remote_handle, 0, sizeof(imp->imp_remote_handle));
 	spin_unlock(&imp->imp_lock);
 
@@ -1657,7 +1657,7 @@ static int ptlrpc_disconnect_idle_interpret(const struct lu_env *env,
 		if (atomic_read(&imp->imp_inflight) > 1) {
 			imp->imp_generation++;
 			imp->imp_initiated_at = imp->imp_generation;
-			IMPORT_SET_STATE_NOLOCK(imp, LUSTRE_IMP_NEW);
+			import_set_state_nolock(imp, LUSTRE_IMP_NEW);
 			ptlrpc_reset_reqs_generation(imp);
 			connect = 1;
 		}
-- 
1.8.3.1


* [lustre-devel] [PATCH 398/622] lustre: uapi: add unused enum obd_statfs_state
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (396 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 397/622] lustre: ptlrpc: change IMPORT_SET_* macros into real functions James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 399/622] lustre: llite: create obd_device with usercopy whitelist James Simmons
                   ` (224 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

The 3rd and 4th bit fields of enum obd_statfs_state are for values
that have been obsolete since Lustre 1.6. Let's make this clear
to end-user applications.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12501
Lustre-commit: e4d92a8a08ac ("LU-12501 utils: fix 'lfs df' printing loop")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35456
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/uapi/linux/lustre/lustre_user.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 86f3111..9c849ce 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -102,6 +102,8 @@ enum obd_statfs_state {
 	OS_STATE_DEGRADED	= 0x00000001, /**< RAID degraded/rebuilding */
 	OS_STATE_READONLY	= 0x00000002, /**< filesystem is read-only */
 	OS_STATE_NOPRECREATE	= 0x00000004, /**< no object precreation */
+	OS_STATE_UNUSED1	= 0x00000008, /**< obsolete 1.6, was EROFS=30 */
+	OS_STATE_UNUSED2	= 0x00000010, /**< obsolete 1.6, was EROFS=30 */
 	OS_STATE_ENOSPC		= 0x00000020, /**< not enough free space */
 	OS_STATE_ENOINO		= 0x00000040, /**< not enough inodes */
 	OS_STATE_SUM		= 0x00000100, /**< aggregated for all tagrets */
-- 
1.8.3.1


* [lustre-devel] [PATCH 399/622] lustre: llite: create obd_device with usercopy whitelist
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (397 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 398/622] lustre: uapi: add unused enum obd_statfs_state James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 400/622] lnet: warn if discovery is off James Simmons
                   ` (223 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Li Dongyang <dongyangli@ddn.com>

Hardened usercopy checking was added in kernel 4.16;
whitelist struct obd_device to silence the warning.

Bad or missing usercopy whitelist? Kernel memory exposure attempt
detected from SLUB object 'll_obd_dev_cache' (offset 1256, size 40)!
WARNING: CPU: 1 PID: 17534 at mm/usercopy.c:83 usercopy_warn+0x7d/0xa0
Call Trace:
  __check_object_size+0xfa/0x181
  lmv_iocontrol+0x1146/0x1880 [lmv]
  ll_obd_statfs+0x356/0x860 [lustre]
  ll_dir_ioctl+0x1e37/0x6760 [lustre]
  do_vfs_ioctl+0xa4/0x630

Linux-commit: 8eb8284b412906181357c2b0110d879d5af95e52

WC-bug-id: https://jira.whamcloud.com/browse/LU-12331
Lustre-commit: e34c59812abf ("LU-12331 llite: create obd_device with usercopy whitelist")
Signed-off-by: Li Dongyang <dongyangli@ddn.com>
Reviewed-on: https://review.whamcloud.com/34946
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/genops.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/obdclass/genops.c b/fs/lustre/obdclass/genops.c
index 2b1175f..49db077 100644
--- a/fs/lustre/obdclass/genops.c
+++ b/fs/lustre/obdclass/genops.c
@@ -648,9 +648,11 @@ void obd_cleanup_caches(void)
 int obd_init_caches(void)
 {
 	LASSERT(!obd_device_cachep);
-	obd_device_cachep = kmem_cache_create("ll_obd_dev_cache",
-					      sizeof(struct obd_device),
-					      0, 0, NULL);
+	obd_device_cachep = kmem_cache_create_usercopy("ll_obd_dev_cache",
+						       sizeof(struct obd_device),
+						       0, 0, 0,
+						       sizeof(struct obd_device),
+						       NULL);
 	if (!obd_device_cachep)
 		goto out;
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 400/622] lnet: warn if discovery is off
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (398 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 399/622] lustre: llite: create obd_device with usercopy whitelist James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 401/622] lustre: ldlm: always cancel aged locks regardless enabling or disabling lru resize James Simmons
                   ` (222 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

Output a warning if discovery is off and the admin is
trying either to add a route or to enable routing.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12427
Lustre-commit: c9718be06192 ("LU-12427 lnet: warn if discovery is off")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35200
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/router.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index f7b53e0..eb76c72 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -519,6 +519,8 @@ static void lnet_shuffle_seed(void)
 	if (add_route) {
 		gw->lp_health_sensitivity = sensitivity;
 		lnet_add_route_to_rnet(rnet2, route);
+		if (lnet_peer_discovery_disabled)
+			CWARN("Consider turning discovery on to enable full Multi-Rail routing functionality\n");
 	}
 
 	/* get rid of the reference on the lpni.
@@ -1379,6 +1381,9 @@ bool lnet_router_checker_active(void)
 		~LNET_PING_FEAT_RTE_DISABLED;
 	lnet_net_unlock(LNET_LOCK_EX);
 
+	if (lnet_peer_discovery_disabled)
+		CWARN("Consider turning discovery on to enable full Multi-Rail routing functionality\n");
+
 	return rc;
 }
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 401/622] lustre: ldlm: always cancel aged locks regardless enabling or disabling lru resize
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (399 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 400/622] lnet: warn if discovery is off James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 402/622] lustre: llite: cleanup stats of LPROC_LL_* James Simmons
                   ` (221 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Gu Zheng <gzheng@ddn.com>

Currently, cancelling aged locks is handled by the ldlm_pool_recalc
routine, and it only works when lru resize is enabled. This means
that if lru resize is disabled, aged locks stay cached even after
they reach ns_max_age.

But lru_max_age should behave the same whether lru resize is enabled
or not. lru_size is a hard limit on the number of locks, while
ns_max_age/lru_max_age is an elimination mechanism: once a lock
reaches lru_max_age it needs to be cancelled, regardless of whether
lru resize is enabled.

Fix this by choosing the lru flags passed to ldlm_cancel_lru, which
does the real cancel work: if lru resize is enabled, set the flag to
LDLM_LRU_FLAG_LRUR, otherwise LDLM_LRU_FLAG_AGED.

Also change lru_flags into a proper enum.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11672
Lustre-commit: e4c490bac770 ("LU-11672 ldlm: awalys cancel aged locks regardless enabling or disabling lru resize")
Signed-off-by: Gu Zheng <gzheng@ddn.com>
Reviewed-on: https://review.whamcloud.com/35467
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_internal.h |  8 +++++---
 fs/lustre/ldlm/ldlm_pool.c     | 14 +++++++-------
 fs/lustre/ldlm/ldlm_request.c  | 40 ++++++++++++++++++++++------------------
 3 files changed, 34 insertions(+), 28 deletions(-)

diff --git a/fs/lustre/ldlm/ldlm_internal.h b/fs/lustre/ldlm/ldlm_internal.h
index 3789496..4844a9b 100644
--- a/fs/lustre/ldlm/ldlm_internal.h
+++ b/fs/lustre/ldlm/ldlm_internal.h
@@ -87,7 +87,7 @@ void ldlm_namespace_move_to_inactive_locked(struct ldlm_namespace *ns,
 
 /* ldlm_request.c */
 /* Cancel lru flag, it indicates we cancel aged locks. */
-enum {
+enum ldlm_lru_flags {
 	LDLM_LRU_FLAG_AGED	= BIT(0), /* Cancel old non-LRU resize locks */
 	LDLM_LRU_FLAG_PASSED	= BIT(1), /* Cancel passed number of locks. */
 	LDLM_LRU_FLAG_SHRINK	= BIT(2), /* Cancel locks from shrinker. */
@@ -104,10 +104,12 @@ enum {
 };
 
 int ldlm_cancel_lru(struct ldlm_namespace *ns, int nr,
-		    enum ldlm_cancel_flags sync, int flags);
+		    enum ldlm_cancel_flags cancel_flags,
+		    enum ldlm_lru_flags lru_flags);
 int ldlm_cancel_lru_local(struct ldlm_namespace *ns,
 			  struct list_head *cancels, int count, int max,
-			  enum ldlm_cancel_flags cancel_flags, int flags);
+			  enum ldlm_cancel_flags cancel_flags,
+			  enum ldlm_lru_flags lru_flags);
 extern unsigned int ldlm_enqueue_min;
 extern unsigned int ldlm_cancel_unused_locks_before_replay;
 
diff --git a/fs/lustre/ldlm/ldlm_pool.c b/fs/lustre/ldlm/ldlm_pool.c
index b2b3ead..9185dc93 100644
--- a/fs/lustre/ldlm/ldlm_pool.c
+++ b/fs/lustre/ldlm/ldlm_pool.c
@@ -255,6 +255,7 @@ static void ldlm_cli_pool_pop_slv(struct ldlm_pool *pl)
 static int ldlm_cli_pool_recalc(struct ldlm_pool *pl)
 {
 	time64_t recalc_interval_sec;
+	enum ldlm_lru_flags lru_flags;
 	int ret;
 
 	recalc_interval_sec = ktime_get_real_seconds() - pl->pl_recalc_time;
@@ -279,13 +280,13 @@ static int ldlm_cli_pool_recalc(struct ldlm_pool *pl)
 	spin_unlock(&pl->pl_lock);
 
 	/*
-	 * Do not cancel locks in case lru resize is disabled for this ns.
+	 * Cancel aged locks if lru resize is disabled for this ns.
 	 */
 	if (!ns_connect_lru_resize(container_of(pl, struct ldlm_namespace,
-						ns_pool))) {
-		ret = 0;
-		goto out;
-	}
+						ns_pool)))
+		lru_flags = LDLM_LRU_FLAG_LRUR;
+	else
+		lru_flags = LDLM_LRU_FLAG_AGED;
 
 	/*
 	 * In the time of canceling locks on client we do not need to maintain
@@ -294,9 +295,8 @@ static int ldlm_cli_pool_recalc(struct ldlm_pool *pl)
 	 * take into account pl->pl_recalc_time here.
 	 */
 	ret = ldlm_cancel_lru(container_of(pl, struct ldlm_namespace, ns_pool),
-			      0, LCF_ASYNC, LDLM_LRU_FLAG_LRUR);
+			      0, LCF_ASYNC, lru_flags);
 
-out:
 	spin_lock(&pl->pl_lock);
 	/*
 	 * Time of LRU resizing might be longer than period,
diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index 5a7026d..75492f6 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -590,7 +590,8 @@ int ldlm_prep_elc_req(struct obd_export *exp, struct ptlrpc_request *req,
 	struct ldlm_namespace *ns = exp->exp_obd->obd_namespace;
 	struct req_capsule *pill = &req->rq_pill;
 	struct ldlm_request *dlm = NULL;
-	int flags, avail, to_free, pack = 0;
+	enum ldlm_lru_flags lru_flags;
+	int avail, to_free, pack = 0;
 	LIST_HEAD(head);
 	int rc;
 
@@ -601,9 +602,9 @@ int ldlm_prep_elc_req(struct obd_export *exp, struct ptlrpc_request *req,
 		req_capsule_filled_sizes(pill, RCL_CLIENT);
 		avail = ldlm_capsule_handles_avail(pill, RCL_CLIENT, canceloff);
 
-		flags = LDLM_LRU_FLAG_NO_WAIT |
-			(ns_connect_lru_resize(ns) ?
-			 LDLM_LRU_FLAG_LRUR : LDLM_LRU_FLAG_AGED);
+		lru_flags = LDLM_LRU_FLAG_NO_WAIT |
+			    (ns_connect_lru_resize(ns) ?
+			     LDLM_LRU_FLAG_LRUR : LDLM_LRU_FLAG_AGED);
 		to_free = !ns_connect_lru_resize(ns) &&
 			  opc == LDLM_ENQUEUE ? 1 : 0;
 
@@ -614,7 +615,8 @@ int ldlm_prep_elc_req(struct obd_export *exp, struct ptlrpc_request *req,
 		 */
 		if (avail > count)
 			count += ldlm_cancel_lru_local(ns, cancels, to_free,
-						       avail - count, 0, flags);
+						       avail - count, 0,
+						       lru_flags);
 		if (avail > count)
 			pack = count;
 		else
@@ -1279,7 +1281,8 @@ int ldlm_cli_cancel(const struct lustre_handle *lockh,
 		    enum ldlm_cancel_flags cancel_flags)
 {
 	struct obd_export *exp;
-	int avail, flags, count = 1;
+	enum ldlm_lru_flags lru_flags;
+	int avail, count = 1;
 	u64 rc = 0;
 	struct ldlm_namespace *ns;
 	struct ldlm_lock *lock;
@@ -1354,10 +1357,10 @@ int ldlm_cli_cancel(const struct lustre_handle *lockh,
 		LASSERT(avail > 0);
 
 		ns = ldlm_lock_to_ns(lock);
-		flags = ns_connect_lru_resize(ns) ?
-			LDLM_LRU_FLAG_LRUR : LDLM_LRU_FLAG_AGED;
+		lru_flags = ns_connect_lru_resize(ns) ?
+			    LDLM_LRU_FLAG_LRUR : LDLM_LRU_FLAG_AGED;
 		count += ldlm_cancel_lru_local(ns, &cancels, 0, avail - 1,
-					       LCF_BL_AST, flags);
+					       LCF_BL_AST, lru_flags);
 	}
 	ldlm_cli_cancel_list(&cancels, count, NULL, cancel_flags);
 	return 0;
@@ -1593,7 +1596,7 @@ typedef enum ldlm_policy_res (*ldlm_cancel_lru_policy_t)(struct ldlm_namespace *
 							 int, int, int);
 
 static ldlm_cancel_lru_policy_t
-ldlm_cancel_lru_policy(struct ldlm_namespace *ns, int lru_flags)
+ldlm_cancel_lru_policy(struct ldlm_namespace *ns, enum ldlm_lru_flags lru_flags)
 {
 	if (ns_connect_lru_resize(ns)) {
 		if (lru_flags & LDLM_LRU_FLAG_SHRINK) {
@@ -1662,16 +1665,16 @@ typedef enum ldlm_policy_res (*ldlm_cancel_lru_policy_t)(struct ldlm_namespace *
  */
 static int ldlm_prepare_lru_list(struct ldlm_namespace *ns,
 				 struct list_head *cancels, int count, int max,
-				 int flags)
+				 enum ldlm_lru_flags lru_flags)
 {
 	ldlm_cancel_lru_policy_t pf;
 	int added = 0;
-	int no_wait = flags & LDLM_LRU_FLAG_NO_WAIT;
+	int no_wait = lru_flags & LDLM_LRU_FLAG_NO_WAIT;
 
 	if (!ns_connect_lru_resize(ns))
 		count += ns->ns_nr_unused - ns->ns_max_unused;
 
-	pf = ldlm_cancel_lru_policy(ns, flags);
+	pf = ldlm_cancel_lru_policy(ns, lru_flags);
 	LASSERT(pf);
 
 	/* For any flags, stop scanning if @max is reached. */
@@ -1787,7 +1790,7 @@ static int ldlm_prepare_lru_list(struct ldlm_namespace *ns,
 		 */
 		lock->l_flags |= LDLM_FL_CBPENDING | LDLM_FL_CANCELING;
 
-		if ((flags & LDLM_LRU_FLAG_CLEANUP) &&
+		if ((lru_flags & LDLM_LRU_FLAG_CLEANUP) &&
 		    (lock->l_resource->lr_type == LDLM_EXTENT ||
 		     ldlm_has_dom(lock)) && lock->l_granted_mode == LCK_PR)
 			ldlm_set_discard_data(lock);
@@ -1811,11 +1814,12 @@ static int ldlm_prepare_lru_list(struct ldlm_namespace *ns,
 
 int ldlm_cancel_lru_local(struct ldlm_namespace *ns,
 			  struct list_head *cancels, int count, int max,
-			  enum ldlm_cancel_flags cancel_flags, int flags)
+			  enum ldlm_cancel_flags cancel_flags,
+			  enum ldlm_lru_flags lru_flags)
 {
 	int added;
 
-	added = ldlm_prepare_lru_list(ns, cancels, count, max, flags);
+	added = ldlm_prepare_lru_list(ns, cancels, count, max, lru_flags);
 	if (added <= 0)
 		return added;
 	return ldlm_cli_cancel_list_local(cancels, added, cancel_flags);
@@ -1831,7 +1835,7 @@ int ldlm_cancel_lru_local(struct ldlm_namespace *ns,
  */
 int ldlm_cancel_lru(struct ldlm_namespace *ns, int nr,
 		    enum ldlm_cancel_flags cancel_flags,
-		    int flags)
+		    enum ldlm_lru_flags lru_flags)
 {
 	LIST_HEAD(cancels);
 	int count, rc;
@@ -1840,7 +1844,7 @@ int ldlm_cancel_lru(struct ldlm_namespace *ns, int nr,
 	 * Just prepare the list of locks, do not actually cancel them yet.
 	 * Locks are cancelled later in a separate thread.
 	 */
-	count = ldlm_prepare_lru_list(ns, &cancels, nr, 0, flags);
+	count = ldlm_prepare_lru_list(ns, &cancels, nr, 0, lru_flags);
 	rc = ldlm_bl_to_thread_list(ns, NULL, &cancels, count, cancel_flags);
 	if (rc == 0)
 		return count;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 402/622] lustre: llite: cleanup stats of LPROC_LL_*
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Li Xi <lixi@ddn.com>

Some LPROC_LL_* stats have not been used for a long time. This patch
removes them. It also renames the misspelled LPROC_LL_STAFS to
LPROC_LL_STATFS.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12545
Lustre-commit: 976c1f334fcb ("LU-12545 llite: cleanup stats of LPROC_LL_*")
Signed-off-by: Li Xi <lixi@ddn.com>
Reviewed-on: https://review.whamcloud.com/35514
Reviewed-by: Gu Zheng <gzheng@ddn.com>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_internal.h | 6 +-----
 fs/lustre/llite/llite_lib.c      | 2 +-
 fs/lustre/llite/lproc_llite.c    | 8 +-------
 3 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index 9d60ae5..a0d631d 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -789,12 +789,8 @@ void ll_rw_stats_tally(struct ll_sb_info *sbi, pid_t pid,
 void ll_io_init(struct cl_io *io, const struct file *file, int write);
 
 enum {
-	LPROC_LL_DIRTY_HITS,
-	LPROC_LL_DIRTY_MISSES,
 	LPROC_LL_READ_BYTES,
 	LPROC_LL_WRITE_BYTES,
-	LPROC_LL_BRW_READ,
-	LPROC_LL_BRW_WRITE,
 	LPROC_LL_IOCTL,
 	LPROC_LL_OPEN,
 	LPROC_LL_RELEASE,
@@ -816,7 +812,7 @@ enum {
 	LPROC_LL_RMDIR,
 	LPROC_LL_MKNOD,
 	LPROC_LL_RENAME,
-	LPROC_LL_STAFS,
+	LPROC_LL_STATFS,
 	LPROC_LL_ALLOC_INODE,
 	LPROC_LL_SETXATTR,
 	LPROC_LL_GETXATTR,
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index cc417d6..e0395e5 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -1918,7 +1918,7 @@ int ll_statfs(struct dentry *de, struct kstatfs *sfs)
 	int rc;
 
 	CDEBUG(D_VFSTRACE, "VFS Op: at %llu jiffies\n", get_jiffies_64());
-	ll_stats_ops_tally(ll_s2sbi(sb), LPROC_LL_STAFS, 1);
+	ll_stats_ops_tally(ll_s2sbi(sb), LPROC_LL_STATFS, 1);
 
 	/* Some amount of caching on the client is allowed */
 	rc = ll_statfs_internal(ll_s2sbi(sb), &osfs, OBD_STATFS_SUM);
diff --git a/fs/lustre/llite/lproc_llite.c b/fs/lustre/llite/lproc_llite.c
index 4cffd36..6eb3d33 100644
--- a/fs/lustre/llite/lproc_llite.c
+++ b/fs/lustre/llite/lproc_llite.c
@@ -1543,16 +1543,10 @@ static void sbi_kobj_release(struct kobject *kobj)
 	const char	*opname;
 } llite_opcode_table[LPROC_LL_FILE_OPCODES] = {
 	/* file operation */
-	{ LPROC_LL_DIRTY_HITS,     LPROCFS_TYPE_REGS, "dirty_pages_hits" },
-	{ LPROC_LL_DIRTY_MISSES,   LPROCFS_TYPE_REGS, "dirty_pages_misses" },
 	{ LPROC_LL_READ_BYTES,     LPROCFS_CNTR_AVGMINMAX | LPROCFS_TYPE_BYTES,
 				   "read_bytes" },
 	{ LPROC_LL_WRITE_BYTES,    LPROCFS_CNTR_AVGMINMAX | LPROCFS_TYPE_BYTES,
 				   "write_bytes" },
-	{ LPROC_LL_BRW_READ,       LPROCFS_CNTR_AVGMINMAX | LPROCFS_TYPE_PAGES,
-				   "brw_read" },
-	{ LPROC_LL_BRW_WRITE,      LPROCFS_CNTR_AVGMINMAX | LPROCFS_TYPE_PAGES,
-				   "brw_write" },
 	{ LPROC_LL_IOCTL,	   LPROCFS_TYPE_REGS, "ioctl" },
 	{ LPROC_LL_OPEN,	   LPROCFS_TYPE_REGS, "open" },
 	{ LPROC_LL_RELEASE,	   LPROCFS_TYPE_REGS, "close" },
@@ -1577,7 +1571,7 @@ static void sbi_kobj_release(struct kobject *kobj)
 	{ LPROC_LL_MKNOD,	   LPROCFS_TYPE_REGS, "mknod" },
 	{ LPROC_LL_RENAME,	   LPROCFS_TYPE_REGS, "rename" },
 	/* special inode operation */
-	{ LPROC_LL_STAFS,	   LPROCFS_TYPE_REGS, "statfs" },
+	{ LPROC_LL_STATFS,	   LPROCFS_TYPE_REGS, "statfs" },
 	{ LPROC_LL_ALLOC_INODE,    LPROCFS_TYPE_REGS, "alloc_inode" },
 	{ LPROC_LL_SETXATTR,       LPROCFS_TYPE_REGS, "setxattr" },
 	{ LPROC_LL_GETXATTR,       LPROCFS_TYPE_REGS, "getxattr" },
-- 
1.8.3.1


* [lustre-devel] [PATCH 403/622] lustre: osc: Do not assert for first extent
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

In the discard case, the OSC fsync/writeback code asserts
that each OSC extent is fully covered by the fsync request.

This is not valid for the DOM case, because OSC extent
alignment requirements can create OSC extents which start
before the OST region of the layout (i.e., they cross into
the DOM region).  This is OK because the layout prevents
them from ever being used for I/O, but this same behavior
means that the OSC fsync start/end is aligned with the
layout and so does not necessarily cover that first
extent.

The simplest solution is just to not assert on the first
extent.  (There is no way at the OSC layer to recognize the
DOM case.)

WC-bug-id: https://jira.whamcloud.com/browse/LU-12462
Lustre-commit: 092ecd66127e ("LU-12462 osc: Do not assert for first extent")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35525
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Andriy Skulysh <c17819@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/osc_cache.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/osc/osc_cache.c b/fs/lustre/osc/osc_cache.c
index 3b4c598..9e2f90d 100644
--- a/fs/lustre/osc/osc_cache.c
+++ b/fs/lustre/osc/osc_cache.c
@@ -2931,10 +2931,17 @@ int osc_cache_writeback_range(const struct lu_env *env, struct osc_object *obj,
 				unplug = true;
 			} else {
 				/* the only discarder is lock cancelling, so
-				 * [start, end] must contain this extent
+				 * [start, end] must contain this extent.
+				 * However, with DOM, osc extent alignment may
+				 * cause the first extent to start before the
+				 * OST portion of the layout.  This is never
+				 * accessed for i/o, but the unused portion
+				 * will not be covered by the sync request,
+				 * so we cannot assert in that case.
 				 */
-				EASSERT(ext->oe_start >= start &&
-					ext->oe_end <= end, ext);
+				EASSERT(ergo(!(ext == first_extent(obj)),
+					ext->oe_start >= start &&
+					ext->oe_end <= end), ext);
 				osc_extent_state_set(ext, OES_LOCKING);
 				ext->oe_owner = current;
 				list_move_tail(&ext->oe_link, &discard_list);
-- 
1.8.3.1


* [lustre-devel] [PATCH 404/622] lustre: llite: MS_* flags and SB_* flags split
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Shaun Tancheff <stancheff@cray.com>

Since kernel 4.20, the MS_* flags should only be used for mount-time
flags, while the SB_* flags are for checking super_block.s_flags; the
MS_* flags have moved to a uapi header. Change the one use that was
missed, MS_NOSEC, to SB_NOSEC.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12355
Lustre-commit: 72a84970e6d2a ("LU-12355 llite: MS_* flags and SB_* flags split")
Signed-off-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-on: https://review.whamcloud.com/35019
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Petros Koutoupis <pkoutoupis@cray.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_lib.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index e0395e5..3e058d2 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -270,7 +270,7 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 	/* Setting this indicates we correctly support S_NOSEC (See kernel
 	 * commit 9e1f1de02c2275d7172e18dc4e7c2065777611bf)
 	 */
-	sb->s_flags |= MS_NOSEC;
+	sb->s_flags |= SB_NOSEC;
 
 	if (sbi->ll_flags & LL_SBI_FLOCK)
 		sbi->ll_fop = &ll_file_operations_flock;
-- 
1.8.3.1


* [lustre-devel] [PATCH 405/622] lustre: llite: improve ll_dom_lock_cancel
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Vladimir Saveliev <c17830@cray.com>

ll_dom_lock_cancel() should zero the kms attribute, similar to
mdc_ldlm_blocking_ast0().

To avoid code duplication between mdc_ldlm_blocking_ast0() and
ll_dom_lock_cancel(), add a cl_object_operations method so that the
MDC's blocking AST can be reached from the llite level.

A test illustrating the issue is added.

Cray-bug-id: LUS-7118
WC-bug-id: https://jira.whamcloud.com/browse/LU-12296
Lustre-commit: 707bab62f5d6 ("LU-12296 llite: improve ll_dom_lock_cancel")
Signed-off-by: Vladimir Saveliev <c17830@cray.com>
Reviewed-on: https://review.whamcloud.com/34858
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/cl_object.h  | 10 ++++++++++
 fs/lustre/llite/namei.c        | 25 +++++--------------------
 fs/lustre/lov/lov_object.c     | 26 +++++++++++++++++++++++++-
 fs/lustre/mdc/mdc_dev.c        | 11 +++++++++--
 fs/lustre/obdclass/cl_object.c | 17 +++++++++++++++++
 5 files changed, 66 insertions(+), 23 deletions(-)

diff --git a/fs/lustre/include/cl_object.h b/fs/lustre/include/cl_object.h
index 5096025..7ac0dd2 100644
--- a/fs/lustre/include/cl_object.h
+++ b/fs/lustre/include/cl_object.h
@@ -417,6 +417,13 @@ struct cl_object_operations {
 	void (*coo_req_attr_set)(const struct lu_env *env,
 				 struct cl_object *obj,
 				 struct cl_req_attr *attr);
+	/**
+	 * Flush \a obj data corresponding to \a lock. Used for DoM
+	 * locks in llite's cancelling blocking ast callback.
+	 */
+	int (*coo_object_flush)(const struct lu_env *env,
+				struct cl_object *obj,
+				struct ldlm_lock *lock);
 };
 
 /**
@@ -2108,6 +2115,9 @@ int cl_object_fiemap(const struct lu_env *env, struct cl_object *obj,
 int cl_object_layout_get(const struct lu_env *env, struct cl_object *obj,
 			 struct cl_layout *cl);
 loff_t cl_object_maxbytes(struct cl_object *obj);
+int cl_object_flush(const struct lu_env *env, struct cl_object *obj,
+		    struct ldlm_lock *lock);
+
 
 /**
  * Returns true, iff @o0 and @o1 are slices of the same object.
diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index 49433c9..71e757a 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -177,13 +177,12 @@ int ll_test_inode_by_fid(struct inode *inode, void *opaque)
 	return lu_fid_eq(&ll_i2info(inode)->lli_fid, opaque);
 }
 
-int ll_dom_lock_cancel(struct inode *inode, struct ldlm_lock *lock)
+static int ll_dom_lock_cancel(struct inode *inode, struct ldlm_lock *lock)
 {
 	struct lu_env *env;
 	struct ll_inode_info *lli = ll_i2info(inode);
-	struct cl_layout clt = { .cl_layout_gen = 0, };
-	int rc;
 	u16 refcheck;
+	int rc;
 
 	if (!lli->lli_clob) {
 		/* Due to DoM read on open, there may exist pages for Lustre
@@ -197,28 +196,14 @@ int ll_dom_lock_cancel(struct inode *inode, struct ldlm_lock *lock)
 	if (IS_ERR(env))
 		return PTR_ERR(env);
 
-	rc = cl_object_layout_get(env, lli->lli_clob, &clt);
-	if (rc) {
-		CDEBUG(D_INODE, "Cannot get layout for "DFID"\n",
-		       PFID(ll_inode2fid(inode)));
-		rc = -ENODATA;
-	} else if (clt.cl_size == 0 || clt.cl_dom_comp_size == 0) {
-		CDEBUG(D_INODE, "DOM lock without DOM layout for "DFID"\n",
-		       PFID(ll_inode2fid(inode)));
-	} else {
-		enum cl_fsync_mode mode;
-		loff_t end = clt.cl_dom_comp_size - 1;
+	/* reach MDC layer to flush data under  the DoM ldlm lock */
+	rc = cl_object_flush(env, lli->lli_clob, lock);
 
-		mode = ldlm_is_discard_data(lock) ?
-					CL_FSYNC_DISCARD : CL_FSYNC_LOCAL;
-		rc = cl_sync_file_range(inode, 0, end, mode, 1);
-		truncate_inode_pages_range(inode->i_mapping, 0, end);
-	}
 	cl_env_put(env, &refcheck);
 	return rc;
 }
 
-void ll_lock_cancel_bits(struct ldlm_lock *lock, u64 to_cancel)
+static void ll_lock_cancel_bits(struct ldlm_lock *lock, u64 to_cancel)
 {
 	struct inode *inode = ll_inode_from_resource_lock(lock);
 	struct ll_inode_info *lli;
diff --git a/fs/lustre/lov/lov_object.c b/fs/lustre/lov/lov_object.c
index 792d946..52d8c30 100644
--- a/fs/lustre/lov/lov_object.c
+++ b/fs/lustre/lov/lov_object.c
@@ -75,6 +75,8 @@ struct lov_layout_operations {
 			    struct cl_object *obj, struct cl_io *io);
 	int  (*llo_getattr)(const struct lu_env *env, struct cl_object *obj,
 			    struct cl_attr *attr);
+	int  (*llo_flush)(const struct lu_env *env, struct cl_object *obj,
+			  struct ldlm_lock *lock);
 };
 
 static int lov_layout_wait(const struct lu_env *env, struct lov_object *lov);
@@ -1021,7 +1023,21 @@ static int lov_attr_get_composite(const struct lu_env *env,
 	return 0;
 }
 
-static const struct lov_layout_operations lov_dispatch[] = {
+static int lov_flush_composite(const struct lu_env *env,
+			       struct cl_object *obj,
+			       struct ldlm_lock *lock)
+{
+	struct lov_object *lov = cl2lov(obj);
+	struct lovsub_object *lovsub;
+
+	if (!lsme_is_dom(lov->lo_lsm->lsm_entries[0]))
+		return -EINVAL;
+
+	lovsub = lov->u.composite.lo_entries[0].lle_dom.lo_dom;
+	return cl_object_flush(env, lovsub2cl(lovsub), lock);
+}
+
+const static struct lov_layout_operations lov_dispatch[] = {
 	[LLT_EMPTY] = {
 		.llo_init	= lov_init_empty,
 		.llo_delete	= lov_delete_empty,
@@ -1051,6 +1067,7 @@ static int lov_attr_get_composite(const struct lu_env *env,
 		.llo_lock_init	= lov_lock_init_composite,
 		.llo_io_init	= lov_io_init_composite,
 		.llo_getattr	= lov_attr_get_composite,
+		.llo_flush	= lov_flush_composite,
 	},
 	[LLT_FOREIGN] = {
 		.llo_init      = lov_init_foreign,
@@ -2083,6 +2100,12 @@ static loff_t lov_object_maxbytes(struct cl_object *obj)
 	return maxbytes;
 }
 
+static int lov_object_flush(const struct lu_env *env, struct cl_object *obj,
+			    struct ldlm_lock *lock)
+{
+	return LOV_2DISPATCH_NOLOCK(cl2lov(obj), llo_flush, env, obj, lock);
+}
+
 static const struct cl_object_operations lov_ops = {
 	.coo_page_init		= lov_page_init,
 	.coo_lock_init		= lov_lock_init,
@@ -2094,6 +2117,7 @@ static loff_t lov_object_maxbytes(struct cl_object *obj)
 	.coo_layout_get		= lov_object_layout_get,
 	.coo_maxbytes		= lov_object_maxbytes,
 	.coo_fiemap		= lov_object_fiemap,
+	.coo_object_flush	= lov_object_flush
 };
 
 static const struct lu_object_operations lov_lu_obj_ops = {
diff --git a/fs/lustre/mdc/mdc_dev.c b/fs/lustre/mdc/mdc_dev.c
index 14cece1..df8bb33 100644
--- a/fs/lustre/mdc/mdc_dev.c
+++ b/fs/lustre/mdc/mdc_dev.c
@@ -286,7 +286,7 @@ void mdc_lock_lockless_cancel(const struct lu_env *env,
  */
 static int mdc_dlm_blocking_ast0(const struct lu_env *env,
 				 struct ldlm_lock *dlmlock,
-				 void *data, int flag)
+				 int flag)
 {
 	struct cl_object *obj = NULL;
 	int result = 0;
@@ -375,7 +375,7 @@ int mdc_ldlm_blocking_ast(struct ldlm_lock *dlmlock,
 			break;
 		}
 
-		rc = mdc_dlm_blocking_ast0(env, dlmlock, data, flag);
+		rc = mdc_dlm_blocking_ast0(env, dlmlock, flag);
 		cl_env_put(env, &refcheck);
 		break;
 	}
@@ -1382,6 +1382,12 @@ int mdc_object_prune(const struct lu_env *env, struct cl_object *obj)
 	return 0;
 }
 
+static int mdc_object_flush(const struct lu_env *env, struct cl_object *obj,
+			    struct ldlm_lock *lock)
+{
+	return mdc_dlm_blocking_ast0(env, lock, LDLM_CB_CANCELING);
+}
+
 static const struct cl_object_operations mdc_ops = {
 	.coo_page_init		= osc_page_init,
 	.coo_lock_init		= mdc_lock_init,
@@ -1391,6 +1397,7 @@ int mdc_object_prune(const struct lu_env *env, struct cl_object *obj)
 	.coo_glimpse		= osc_object_glimpse,
 	.coo_req_attr_set	= mdc_req_attr_set,
 	.coo_prune		= mdc_object_prune,
+	.coo_object_flush	= mdc_object_flush
 };
 
 static const struct osc_object_operations mdc_object_ops = {
diff --git a/fs/lustre/obdclass/cl_object.c b/fs/lustre/obdclass/cl_object.c
index f0ae34f..b323eb4 100644
--- a/fs/lustre/obdclass/cl_object.c
+++ b/fs/lustre/obdclass/cl_object.c
@@ -389,6 +389,23 @@ loff_t cl_object_maxbytes(struct cl_object *obj)
 }
 EXPORT_SYMBOL(cl_object_maxbytes);
 
+int cl_object_flush(const struct lu_env *env, struct cl_object *obj,
+			 struct ldlm_lock *lock)
+{
+	struct lu_object_header *top = obj->co_lu.lo_header;
+	int rc = 0;
+
+	list_for_each_entry(obj, &top->loh_layers, co_lu.lo_linkage) {
+		if (obj->co_ops->coo_object_flush) {
+			rc = obj->co_ops->coo_object_flush(env, obj, lock);
+			if (rc)
+				break;
+		}
+	}
+	return rc;
+}
+EXPORT_SYMBOL(cl_object_flush);
+
 /**
  * Helper function removing all object locks, and marking object for
  * deletion. All object pages must have been deleted at this point.
-- 
1.8.3.1


* [lustre-devel] [PATCH 406/622] lustre: llite: swab LOV EA user data
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Jian Yu <yujian@whamcloud.com>

Many sub-tests failed with "Invalid argument" errors on PPC clients
because of an endianness issue.

This patch fixes the issue by adding a common function,
lustre_swab_lov_user_md(), to swab the LOV EA user data.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10100
Lustre-commit: 9d17996766e0 ("LU-10100 llite: swab LOV EA user data")
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35291
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_swab.h |  1 +
 fs/lustre/llite/dir.c           | 65 ++++++++++--------------------------
 fs/lustre/llite/file.c          | 46 ++++++++++++-------------
 fs/lustre/llite/llite_lib.c     |  4 +--
 fs/lustre/llite/xattr.c         | 25 ++++++++++++--
 fs/lustre/ptlrpc/pack_generic.c | 74 +++++++++++++++++++++++++++++++++--------
 6 files changed, 126 insertions(+), 89 deletions(-)

diff --git a/fs/lustre/include/lustre_swab.h b/fs/lustre/include/lustre_swab.h
index 7e96640..e99e16d 100644
--- a/fs/lustre/include/lustre_swab.h
+++ b/fs/lustre/include/lustre_swab.h
@@ -86,6 +86,7 @@
 void lustre_swab_lov_comp_md_v1(struct lov_comp_md_v1 *lum);
 void lustre_swab_lov_user_md_objects(struct lov_user_ost_data *lod,
 				     int stripe_count);
+void lustre_swab_lov_user_md(struct lov_user_md *lum);
 void lustre_swab_lov_mds_md(struct lov_mds_md *lmm);
 void lustre_swab_lustre_capa(struct lustre_capa *c);
 void lustre_swab_lustre_capa_key(struct lustre_capa_key *k);
diff --git a/fs/lustre/llite/dir.c b/fs/lustre/llite/dir.c
index 2c39579..f87ddd2 100644
--- a/fs/lustre/llite/dir.c
+++ b/fs/lustre/llite/dir.c
@@ -525,60 +525,46 @@ int ll_dir_setstripe(struct inode *inode, struct lov_user_md *lump,
 	int lum_size;
 
 	if (lump) {
-		/*
-		 * This is coming from userspace, so should be in
-		 * local endian.  But the MDS would like it in little
-		 * endian, so we swab it before we send it.
-		 */
 		switch (lump->lmm_magic) {
-		case LOV_USER_MAGIC_V1: {
-			if (lump->lmm_magic != cpu_to_le32(LOV_USER_MAGIC_V1))
-				lustre_swab_lov_user_md_v1(lump);
+		case LOV_USER_MAGIC_V1:
 			lum_size = sizeof(struct lov_user_md_v1);
 			break;
-		}
-		case LOV_USER_MAGIC_V3: {
-			if (lump->lmm_magic != cpu_to_le32(LOV_USER_MAGIC_V3))
-				lustre_swab_lov_user_md_v3((struct lov_user_md_v3 *)lump);
+		case LOV_USER_MAGIC_V3:
 			lum_size = sizeof(struct lov_user_md_v3);
 			break;
-		}
-		case LOV_USER_MAGIC_COMP_V1: {
-			if (lump->lmm_magic !=
-			    cpu_to_le32(LOV_USER_MAGIC_COMP_V1))
-				lustre_swab_lov_comp_md_v1((struct lov_comp_md_v1 *)lump);
-			lum_size = le32_to_cpu(((struct lov_comp_md_v1 *)lump)->lcm_size);
+		case LOV_USER_MAGIC_COMP_V1:
+			lum_size = ((struct lov_comp_md_v1 *)lump)->lcm_size;
 			break;
-		}
-		case LMV_USER_MAGIC: {
+		case LMV_USER_MAGIC:
 			if (lump->lmm_magic != cpu_to_le32(LMV_USER_MAGIC))
 				lustre_swab_lmv_user_md((struct lmv_user_md *)lump);
 			lum_size = sizeof(struct lmv_user_md);
 			break;
-		}
 		case LOV_USER_MAGIC_SPECIFIC: {
 			struct lov_user_md_v3 *v3 =
-					(struct lov_user_md_v3 *)lump;
+				(struct lov_user_md_v3 *)lump;
 			if (v3->lmm_stripe_count > LOV_MAX_STRIPE_COUNT)
 				return -EINVAL;
-			if (lump->lmm_magic !=
-			    cpu_to_le32(LOV_USER_MAGIC_SPECIFIC)) {
-				lustre_swab_lov_user_md_v3(v3);
-				lustre_swab_lov_user_md_objects(v3->lmm_objects,
-						v3->lmm_stripe_count);
-			}
 			lum_size = lov_user_md_size(v3->lmm_stripe_count,
 						    LOV_USER_MAGIC_SPECIFIC);
 			break;
 		}
-		default: {
+		default:
 			CDEBUG(D_IOCTL,
 			       "bad userland LOV MAGIC: %#08x != %#08x nor %#08x\n",
 			       lump->lmm_magic, LOV_USER_MAGIC_V1,
 			       LOV_USER_MAGIC_V3);
 			return -EINVAL;
 		}
-		}
+
+		/*
+		 * This is coming from userspace, so should be in
+		 * local endian.  But the MDS would like it in little
+		 * endian, so we swab it before we send it.
+		 */
+		if ((__swab32(lump->lmm_magic) & le32_to_cpu(LOV_MAGIC_MASK)) ==
+		    le32_to_cpu(LOV_MAGIC_MAGIC))
+			lustre_swab_lov_user_md(lump);
 	} else {
 		lum_size = sizeof(struct lov_user_md_v1);
 	}
@@ -706,16 +692,11 @@ int ll_dir_getstripe(struct inode *inode, void **plmm, int *plmm_size,
 	/* We don't swab objects for directories */
 	switch (le32_to_cpu(lmm->lmm_magic)) {
 	case LOV_MAGIC_V1:
-		if (cpu_to_le32(LOV_MAGIC) != LOV_MAGIC)
-			lustre_swab_lov_user_md_v1((struct lov_user_md_v1 *)lmm);
-		break;
 	case LOV_MAGIC_V3:
-		if (cpu_to_le32(LOV_MAGIC) != LOV_MAGIC)
-			lustre_swab_lov_user_md_v3((struct lov_user_md_v3 *)lmm);
-		break;
 	case LOV_MAGIC_COMP_V1:
+	case LOV_USER_MAGIC_SPECIFIC:
 		if (cpu_to_le32(LOV_MAGIC) != LOV_MAGIC)
-			lustre_swab_lov_comp_md_v1((struct lov_comp_md_v1 *)lmm);
+			lustre_swab_lov_user_md((struct lov_user_md *)lmm);
 		break;
 	case LMV_MAGIC_V1:
 		if (cpu_to_le32(LMV_MAGIC) != LMV_MAGIC)
@@ -725,16 +706,6 @@ int ll_dir_getstripe(struct inode *inode, void **plmm, int *plmm_size,
 		if (cpu_to_le32(LMV_USER_MAGIC) != LMV_USER_MAGIC)
 			lustre_swab_lmv_user_md((struct lmv_user_md *)lmm);
 		break;
-	case LOV_USER_MAGIC_SPECIFIC: {
-		struct lov_user_md_v3 *v3 = (struct lov_user_md_v3 *)lmm;
-
-		if (cpu_to_le32(LOV_MAGIC) != LOV_MAGIC) {
-			lustre_swab_lov_user_md_v3(v3);
-			lustre_swab_lov_user_md_objects(v3->lmm_objects,
-							v3->lmm_stripe_count);
-			}
-		}
-		break;
 	case LMV_MAGIC_FOREIGN: {
 		struct lmv_foreign_md *lfm = (struct lmv_foreign_md *)lmm;
 
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index d313730..5a3e80e 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -1852,6 +1852,12 @@ int ll_lov_setstripe_ea_info(struct inode *inode, struct dentry *dentry,
 	};
 	int rc = 0;
 
+	if ((__swab32(lum->lmm_magic) & le32_to_cpu(LOV_MAGIC_MASK)) ==
+	    le32_to_cpu(LOV_MAGIC_MAGIC)) {
+		/* this code will only exist for big-endian systems */
+		lustre_swab_lov_user_md(lum);
+	}
+
 	ll_inode_size_lock(inode);
 	rc = ll_intent_file_open(dentry, lum, lum_size, &oit);
 	if (rc < 0)
@@ -1920,8 +1926,9 @@ int ll_lov_getstripe_ea_info(struct inode *inode, const char *filename,
 	 * little endian. We convert it to host endian before
 	 * passing it to userspace.
 	 */
-	if (cpu_to_le32(LOV_MAGIC) != LOV_MAGIC) {
-		int stripe_count;
+	if ((lmm->lmm_magic & __swab32(LOV_MAGIC_MAGIC)) ==
+	    __swab32(LOV_MAGIC_MAGIC)) {
+		int stripe_count = 0;
 
 		if (lmm->lmm_magic == cpu_to_le32(LOV_MAGIC_V1) ||
 		    lmm->lmm_magic == cpu_to_le32(LOV_MAGIC_V3)) {
@@ -1931,31 +1938,20 @@ int ll_lov_getstripe_ea_info(struct inode *inode, const char *filename,
 				stripe_count = 0;
 		}
 
+		lustre_swab_lov_user_md((struct lov_user_md *)lmm);
+
 		/* if function called for directory - we should
 		 * avoid swab not existent lsm objects
 		 */
-		if (lmm->lmm_magic == cpu_to_le32(LOV_MAGIC_V1)) {
-			lustre_swab_lov_user_md_v1((struct lov_user_md_v1 *)lmm);
-			if (S_ISREG(body->mbo_mode))
-				lustre_swab_lov_user_md_objects(((struct lov_user_md_v1 *)lmm)->lmm_objects,
-								stripe_count);
-		} else if (lmm->lmm_magic == cpu_to_le32(LOV_MAGIC_V3)) {
-			lustre_swab_lov_user_md_v3((struct lov_user_md_v3 *)lmm);
-			if (S_ISREG(body->mbo_mode))
-				lustre_swab_lov_user_md_objects(((struct lov_user_md_v3 *)lmm)->lmm_objects,
-								stripe_count);
-		} else if (lmm->lmm_magic == cpu_to_le32(LOV_MAGIC_COMP_V1)) {
-			lustre_swab_lov_comp_md_v1((struct lov_comp_md_v1 *)lmm);
-		} else if (lmm->lmm_magic ==
-			   cpu_to_le32(LOV_MAGIC_FOREIGN)) {
-			struct lov_foreign_md *lfm;
-
-			lfm = (struct lov_foreign_md *)lmm;
-			__swab32s(&lfm->lfm_magic);
-			__swab32s(&lfm->lfm_length);
-			__swab32s(&lfm->lfm_type);
-			__swab32s(&lfm->lfm_flags);
-		}
+		if (lmm->lmm_magic == LOV_MAGIC_V1 && S_ISREG(body->mbo_mode))
+			lustre_swab_lov_user_md_objects(
+				((struct lov_user_md_v1 *)lmm)->lmm_objects,
+				stripe_count);
+		else if (lmm->lmm_magic == LOV_MAGIC_V3 &&
+			 S_ISREG(body->mbo_mode))
+			lustre_swab_lov_user_md_objects(
+				((struct lov_user_md_v3 *)lmm)->lmm_objects,
+				stripe_count);
 	}
 
 out:
@@ -2040,7 +2036,7 @@ static int ll_lov_setstripe(struct inode *inode, struct file *file,
 
 	cl_lov_delay_create_clear(&file->f_flags);
 out:
-	kfree(klum);
+	kvfree(klum);
 	return rc;
 }
 
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 3e058d2..86be562 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -2757,14 +2757,14 @@ ssize_t ll_copy_user_md(const struct lov_user_md __user *md,
 	if (lum_size < 0)
 		goto no_kbuf;
 
-	*kbuf = kzalloc(lum_size, GFP_NOFS);
+	*kbuf = kvzalloc(lum_size, GFP_NOFS);
 	if (!*kbuf) {
 		lum_size = -ENOMEM;
 		goto no_kbuf;
 	}
 
 	if (copy_from_user(*kbuf, md, lum_size) != 0) {
-		kfree(*kbuf);
+		kvfree(*kbuf);
 		*kbuf = NULL;
 		lum_size = -EFAULT;
 	}
diff --git a/fs/lustre/llite/xattr.c b/fs/lustre/llite/xattr.c
index 9707e78..cf1cfd2 100644
--- a/fs/lustre/llite/xattr.c
+++ b/fs/lustre/llite/xattr.c
@@ -40,6 +40,7 @@
 
 #include <obd_support.h>
 #include <lustre_dlm.h>
+#include <lustre_swab.h>
 
 #include "llite_internal.h"
 
@@ -316,6 +317,11 @@ static int ll_xattr_set(const struct xattr_handler *handler,
 		return 0;
 	}
 
+	if (strncmp(name, "lov.", 4) == 0 &&
+	    (__swab32(((struct lov_user_md *)value)->lmm_magic) &
+	    le32_to_cpu(LOV_MAGIC_MASK)) == le32_to_cpu(LOV_MAGIC_MAGIC))
+		lustre_swab_lov_user_md((struct lov_user_md *)value);
+
 	return ll_xattr_set_common(handler, dentry, inode, name, value, size,
 				   flags);
 }
@@ -485,10 +491,25 @@ static ssize_t ll_getxattr_lov(struct inode *inode, void *buf, size_t buf_size)
 		 * file is restored. See LU-2809.
 		 */
 		magic = ((struct lov_mds_md *)buf)->lmm_magic;
-		if (magic == LOV_MAGIC_COMP_V1 || magic == LOV_MAGIC_FOREIGN)
+		if ((magic & __swab32(LOV_MAGIC_MAGIC)) ==
+		    __swab32(LOV_MAGIC_MAGIC))
+			magic = __swab32(magic);
+
+		switch (magic) {
+		case LOV_MAGIC_V1:
+		case LOV_MAGIC_V3:
+		case LOV_MAGIC_SPECIFIC:
+			((struct lov_mds_md *)buf)->lmm_layout_gen = 0;
+			break;
+		case LOV_MAGIC_COMP_V1:
+		case LOV_MAGIC_FOREIGN:
+			goto out_env;
+		default:
+			CERROR("Invalid LOV magic %08x\n", magic);
+			rc = -EINVAL;
 			goto out_env;
+		}
 
-		((struct lov_mds_md *)buf)->lmm_layout_gen = 0;
 out_env:
 		cl_env_put(env, &refcheck);
 
diff --git a/fs/lustre/ptlrpc/pack_generic.c b/fs/lustre/ptlrpc/pack_generic.c
index f687ecc..7acb4a8 100644
--- a/fs/lustre/ptlrpc/pack_generic.c
+++ b/fs/lustre/ptlrpc/pack_generic.c
@@ -2004,6 +2004,8 @@ void lustre_swab_lmv_user_md(struct lmv_user_md *lum)
 	if (lum->lum_magic == LMV_MAGIC_FOREIGN) {
 		__swab32s(&lum->lum_magic);
 		__swab32s(&((struct lmv_foreign_md *)lum)->lfm_length);
+		__swab32s(&((struct lmv_foreign_md *)lum)->lfm_type);
+		__swab32s(&((struct lmv_foreign_md *)lum)->lfm_flags);
 		return;
 	}
 
@@ -2132,18 +2134,6 @@ void lustre_swab_lov_comp_md_v1(struct lov_comp_md_v1 *lum)
 }
 EXPORT_SYMBOL(lustre_swab_lov_comp_md_v1);
 
-void lustre_swab_lov_mds_md(struct lov_mds_md *lmm)
-{
-	CDEBUG(D_IOCTL, "swabbing lov_mds_md\n");
-	__swab32s(&lmm->lmm_magic);
-	__swab32s(&lmm->lmm_pattern);
-	lustre_swab_lmm_oi(&lmm->lmm_oi);
-	__swab32s(&lmm->lmm_stripe_size);
-	__swab16s(&lmm->lmm_stripe_count);
-	__swab16s(&lmm->lmm_layout_gen);
-}
-EXPORT_SYMBOL(lustre_swab_lov_mds_md);
-
 void lustre_swab_lov_user_md_objects(struct lov_user_ost_data *lod,
 				     int stripe_count)
 {
@@ -2157,9 +2147,67 @@ void lustre_swab_lov_user_md_objects(struct lov_user_ost_data *lod,
 }
 EXPORT_SYMBOL(lustre_swab_lov_user_md_objects);
 
+void lustre_swab_lov_user_md(struct lov_user_md *lum)
+{
+	CDEBUG(D_IOCTL, "swabbing lov_user_md\n");
+	switch (lum->lmm_magic) {
+	case __swab32(LOV_MAGIC_V1):
+	case LOV_USER_MAGIC_V1:
+		lustre_swab_lov_user_md_v1((struct lov_user_md_v1 *)lum);
+		break;
+	case __swab32(LOV_MAGIC_V3):
+	case LOV_USER_MAGIC_V3:
+		lustre_swab_lov_user_md_v3((struct lov_user_md_v3 *)lum);
+		break;
+	case __swab32(LOV_USER_MAGIC_SPECIFIC):
+	case LOV_USER_MAGIC_SPECIFIC:
+	{
+		struct lov_user_md_v3 *v3 = (struct lov_user_md_v3 *)lum;
+		u16 stripe_count = v3->lmm_stripe_count;
+
+		if (lum->lmm_magic != LOV_USER_MAGIC_SPECIFIC)
+			__swab16s(&stripe_count);
+
+		lustre_swab_lov_user_md_v3(v3);
+		lustre_swab_lov_user_md_objects(v3->lmm_objects, stripe_count);
+		break;
+	}
+	case __swab32(LOV_MAGIC_COMP_V1):
+	case LOV_USER_MAGIC_COMP_V1:
+		lustre_swab_lov_comp_md_v1((struct lov_comp_md_v1 *)lum);
+		break;
+	case __swab32(LOV_MAGIC_FOREIGN):
+	case LOV_USER_MAGIC_FOREIGN:
+	{
+		struct lov_foreign_md *lfm = (struct lov_foreign_md *)lum;
+
+		__swab32s(&lfm->lfm_magic);
+		__swab32s(&lfm->lfm_length);
+		__swab32s(&lfm->lfm_type);
+		__swab32s(&lfm->lfm_flags);
+		break;
+	}
+	default:
+		CDEBUG(D_IOCTL, "Invalid LOV magic %08x\n", lum->lmm_magic);
+	}
+}
+EXPORT_SYMBOL(lustre_swab_lov_user_md);
+
+void lustre_swab_lov_mds_md(struct lov_mds_md *lmm)
+{
+	CDEBUG(D_IOCTL, "swabbing lov_mds_md\n");
+	__swab32s(&lmm->lmm_magic);
+	__swab32s(&lmm->lmm_pattern);
+	lustre_swab_lmm_oi(&lmm->lmm_oi);
+	__swab32s(&lmm->lmm_stripe_size);
+	__swab16s(&lmm->lmm_stripe_count);
+	__swab16s(&lmm->lmm_layout_gen);
+}
+EXPORT_SYMBOL(lustre_swab_lov_mds_md);
+
 static void lustre_swab_ldlm_res_id(struct ldlm_res_id *id)
 {
-	int i;
+	int  i;
 
 	for (i = 0; i < RES_NAME_SIZE; i++)
 		__swab64s(&id->name[i]);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

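The byte-order detection used throughout the patch above, testing whether `__swab32(magic)` masked with `LOV_MAGIC_MASK` equals `LOV_MAGIC_MAGIC`, can be sketched in portable C. The constant values below are modelled on Lustre's `lustre_user.h` but should be treated as assumptions of this sketch, not authoritative definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values modelled on Lustre's headers; all LOV magics
 * share the low 16 bits 0x0BD0, which is what makes the mask test
 * below work for every layout variant at once. */
#define LOV_MAGIC_MAGIC 0x0BD0u
#define LOV_MAGIC_MASK  0xFFFFu
#define LOV_MAGIC_V1    0x0BD10BD0u

/* Stand-in for the kernel's __swab32(). */
static uint32_t bswap32(uint32_t x)
{
	return (x >> 24) | ((x >> 8) & 0xFF00u) |
	       ((x << 8) & 0xFF0000u) | (x << 24);
}

/* A magic needs swabbing when byte-swapping it yields a value whose
 * masked bits match LOV_MAGIC_MAGIC, i.e. the buffer was written in
 * the opposite endianness from the host reading it. */
static bool lov_magic_needs_swab(uint32_t magic)
{
	return (bswap32(magic) & LOV_MAGIC_MASK) == LOV_MAGIC_MAGIC;
}
```

This is why the patch can collapse the per-magic `cpu_to_le32(...) != ...` checks into a single `lustre_swab_lov_user_md()` call: one mask test identifies a foreign-endian buffer regardless of which LOV variant it holds.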
* [lustre-devel] [PATCH 407/622] lustre: clio: support custom csi_end_io handler
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (405 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 406/622] lustre: llite: swab LOV EA user data James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 408/622] lustre: llite: release active extent on sync write commit James Simmons
                   ` (215 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Shaun Tancheff <stancheff@cray.com>

Provide an initializer that supports a custom end_io handler.
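
The counter-plus-callback pattern this patch adds can be sketched in plain C. The struct layout and names below are illustrative stand-ins for the kernel types (the real code decrements under the wait queue's spinlock and wakes waiters, which this sketch omits):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sync_anchor;
typedef void (sync_end_cb)(struct sync_anchor *);

/* Hypothetical stand-in for struct cl_sync_io. */
struct sync_anchor {
	atomic_int	 nr_pending;	/* transfers still in flight */
	int		 sync_rc;	/* first error seen */
	sync_end_cb	*end_io;	/* optional completion hook */
};

static int end_fired;
static void mark_end(struct sync_anchor *a) { (void)a; end_fired = 1; }

/* Mirrors cl_sync_io_init_notify(): caller owns the anchor, nr is the
 * number of transfers initially pending, end may be NULL. */
static void sync_anchor_init(struct sync_anchor *a, int nr, sync_end_cb *end)
{
	atomic_init(&a->nr_pending, nr);
	a->sync_rc = 0;
	a->end_io = end;
}

/* Mirrors the cl_sync_io_note() change: record the first error and
 * invoke the optional hook exactly once, when the last transfer
 * completes. */
static void sync_anchor_note(struct sync_anchor *a, int ioret)
{
	if (ioret < 0 && a->sync_rc == 0)
		a->sync_rc = ioret;
	if (atomic_fetch_sub(&a->nr_pending, 1) == 1 && a->end_io)
		a->end_io(a);
}
```

The old `cl_sync_io_init()` then becomes the degenerate case with a NULL callback, which is exactly how the patch re-expresses it as a static inline wrapper.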

Cray-bug-id: LUS-7330
WC-bug-id: https://jira.whamcloud.com/browse/LU-12431
Lustre-commit: 6ee742fd5c56 ("LU-12431 clio: remove default csi_end_io handler")
Signed-off-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-on: https://review.whamcloud.com/35400
Reviewed-by: Neil Brown <neilb@suse.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/cl_object.h | 24 ++++++++++++++++++------
 fs/lustre/obdclass/cl_io.c    | 19 ++++++++++++++++---
 2 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/fs/lustre/include/cl_object.h b/fs/lustre/include/cl_object.h
index 7ac0dd2..71ca283 100644
--- a/fs/lustre/include/cl_object.h
+++ b/fs/lustre/include/cl_object.h
@@ -2457,6 +2457,22 @@ void cl_req_attr_set(const struct lu_env *env, struct cl_object *obj,
  * @{
  */
 
+struct cl_sync_io;
+
+typedef void (cl_sync_io_end_t)(const struct lu_env *, struct cl_sync_io *);
+
+void cl_sync_io_init_notify(struct cl_sync_io *anchor, int nr,
+			    cl_sync_io_end_t *end);
+
+int cl_sync_io_wait(const struct lu_env *env, struct cl_sync_io *anchor,
+		    long timeout);
+void cl_sync_io_note(const struct lu_env *env, struct cl_sync_io *anchor,
+		     int ioret);
+static inline void cl_sync_io_init(struct cl_sync_io *anchor, int nr)
+{
+	cl_sync_io_init_notify(anchor, nr, NULL);
+}
+
 /**
  * Anchor for synchronous transfer. This is allocated on a stack by thread
  * doing synchronous transfer, and a pointer to this structure is set up in
@@ -2470,14 +2486,10 @@ struct cl_sync_io {
 	int			csi_sync_rc;
 	/** completion to be signaled when transfer is complete. */
 	wait_queue_head_t	csi_waitq;
+	/** callback to invoke when this IO is finished */
+	cl_sync_io_end_t	*csi_end_io;
 };
 
-void cl_sync_io_init(struct cl_sync_io *anchor, int nr);
-int  cl_sync_io_wait(const struct lu_env *env, struct cl_sync_io *anchor,
-		     long timeout);
-void cl_sync_io_note(const struct lu_env *env, struct cl_sync_io *anchor,
-		     int ioret);
-
 /** @} cl_sync_io */
 
 /** \defgroup cl_env cl_env
diff --git a/fs/lustre/obdclass/cl_io.c b/fs/lustre/obdclass/cl_io.c
index 4278bc0..14849ed 100644
--- a/fs/lustre/obdclass/cl_io.c
+++ b/fs/lustre/obdclass/cl_io.c
@@ -1024,16 +1024,26 @@ void cl_req_attr_set(const struct lu_env *env, struct cl_object *obj,
 EXPORT_SYMBOL(cl_req_attr_set);
 
 /**
- * Initialize synchronous io wait anchor
+ * Initialize synchronous io wait @anchor for @nr pages with optional
+ * @end handler.
+ *
+ * @anchor	owned by caller, initialized here.
+ * @nr		number of pages initially pending in sync.
+ * @end		optional callback on sync_io completion, can be used to
+ *		trigger erasure coding, integrity, dedupe, or similar
+ *		operation. @end is called with a spinlock on
+ *		anchor->csi_waitq.lock
  */
-void cl_sync_io_init(struct cl_sync_io *anchor, int nr)
+void cl_sync_io_init_notify(struct cl_sync_io *anchor, int nr,
+			    cl_sync_io_end_t *end)
 {
 	memset(anchor, 0, sizeof(*anchor));
 	init_waitqueue_head(&anchor->csi_waitq);
 	atomic_set(&anchor->csi_sync_nr, nr);
 	anchor->csi_sync_rc = 0;
+	anchor->csi_end_io = end;
 }
-EXPORT_SYMBOL(cl_sync_io_init);
+EXPORT_SYMBOL(cl_sync_io_init_notify);
 
 /**
  * Wait until all IO completes. Transfer completion routine has to call
@@ -1088,6 +1098,7 @@ void cl_sync_io_note(const struct lu_env *env, struct cl_sync_io *anchor,
 	LASSERT(atomic_read(&anchor->csi_sync_nr) > 0);
 	if (atomic_dec_and_lock(&anchor->csi_sync_nr,
 				&anchor->csi_waitq.lock)) {
+		cl_sync_io_end_t *end_io = anchor->csi_end_io;
 
 		/*
 		 * Holding the lock across both the decrement and
@@ -1095,6 +1106,8 @@ void cl_sync_io_note(const struct lu_env *env, struct cl_sync_io *anchor,
 		 * before the wakeup completes.
 		 */
 		wake_up_all_locked(&anchor->csi_waitq);
+		if (end_io)
+			end_io(env, anchor);
 		spin_unlock(&anchor->csi_waitq.lock);
 
 		/* Can't access anchor any more */
-- 
1.8.3.1



* [lustre-devel] [PATCH 408/622] lustre: llite: release active extent on sync write commit
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (406 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 407/622] lustre: clio: support custom csi_end_io handler James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 409/622] lustre: obd: harden debugfs handling James Simmons
                   ` (214 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Ann Koehler <amk@cray.com>

Processes can wait forever in osc_extent_wait() for the extent state
to change because the extent write is not started before the wait
begins. A 4.7 kernel change to generic_write_sync() modified it to
check IOCB_DSYNC instead of O_SYNC. Thus an active extent is not
released (written) in osc_io_commit_async() in the synchronous case.
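
The flag logic the fix ends up with can be sketched as a small predicate. The flag values below are illustrative placeholders, not the real `<linux/fs.h>` constants:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits; the real values come from <linux/fs.h>
 * and the iocb handed to the write path. */
#define O_SYNC_FLAG	0x1u
#define O_DIRECT_FLAG	0x2u
#define IOCB_DSYNC_FLAG	0x4u

/* Before the fix only the file flags were consulted; since the 4.7
 * kernel change generic_write_sync() keys off IOCB_DSYNC, so the
 * iocb flags must be checked too or the extent write never starts. */
static bool write_is_sync(unsigned int f_flags, unsigned int ki_flags,
			  bool inode_is_sync)
{
	bool sync = (f_flags & O_SYNC_FLAG) || (f_flags & O_DIRECT_FLAG) ||
		    inode_is_sync;

	sync |= !!(ki_flags & IOCB_DSYNC_FLAG);
	return sync;
}
```

This is why `ll_io_init()` now takes the `vvp_io_args` pointer: it is the only way for the io setup code to see the iocb's `ki_flags`, while callers without an iocb (such as the readahead path) simply pass NULL.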

Cray-bug-id: LUS-7435
WC-bug-id: https://jira.whamcloud.com/browse/LU-12536
Lustre-commit: a9af7100ce72 ("LU-12536 llite: release active extent on sync write commit")
Signed-off-by: Ann Koehler <amk@cray.com>
Reviewed-on: https://review.whamcloud.com/35472
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c           | 9 +++++++--
 fs/lustre/llite/llite_internal.h | 4 +++-
 fs/lustre/llite/rw.c             | 2 +-
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 5a3e80e..6f418e0 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -1407,7 +1407,8 @@ static bool file_is_noatime(const struct file *file)
 	return false;
 }
 
-void ll_io_init(struct cl_io *io, const struct file *file, int write)
+void ll_io_init(struct cl_io *io, const struct file *file, int write,
+		struct vvp_io_args *args)
 {
 	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
 	struct inode *inode = file_inode(file);
@@ -1420,7 +1421,11 @@ void ll_io_init(struct cl_io *io, const struct file *file, int write)
 		io->u.ci_wr.wr_sync = file->f_flags & O_SYNC ||
 				      file->f_flags & O_DIRECT ||
 				      IS_SYNC(inode);
+		io->u.ci_wr.wr_sync |= !!(args &&
+					  (args->u.normal.via_iocb->ki_flags &
+					   IOCB_DSYNC));
 	}
+
 	io->ci_obj = ll_i2info(inode)->lli_clob;
 	io->ci_lockreq = CILR_MAYBE;
 	if (ll_file_nolock(file)) {
@@ -1491,7 +1496,7 @@ static void ll_heat_add(struct inode *inode, enum cl_io_type iot,
 
 restart:
 	io = vvp_env_thread_io(env);
-	ll_io_init(io, file, iot == CIT_WRITE);
+	ll_io_init(io, file, iot == CIT_WRITE, args);
 	io->ci_ndelay_tried = retried;
 
 	if (cl_io_rw_init(env, io, iot, *ppos, count) == 0) {
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index a0d631d..49c0c78 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -786,7 +786,6 @@ int cl_get_grouplock(struct cl_object *obj, unsigned long gid, int nonblock,
 void ll_rw_stats_tally(struct ll_sb_info *sbi, pid_t pid,
 		       struct ll_file_data *file, loff_t pos,
 		       size_t count, int rw);
-void ll_io_init(struct cl_io *io, const struct file *file, int write);
 
 enum {
 	LPROC_LL_READ_BYTES,
@@ -1056,6 +1055,9 @@ static inline struct vvp_io_args *ll_env_args(const struct lu_env *env)
 	return &ll_env_info(env)->lti_args;
 }
 
+void ll_io_init(struct cl_io *io, const struct file *file, int write,
+		struct vvp_io_args *args);
+
 /* llite/llite_mmap.c */
 
 int ll_teardown_mmaps(struct address_space *mapping, u64 first, u64 last);
diff --git a/fs/lustre/llite/rw.c b/fs/lustre/llite/rw.c
index fe9a2b0..9c4b89f 100644
--- a/fs/lustre/llite/rw.c
+++ b/fs/lustre/llite/rw.c
@@ -503,7 +503,7 @@ static void ll_readahead_handle_work(struct work_struct *wq)
 	}
 
 	io = vvp_env_thread_io(env);
-	ll_io_init(io, file, 0);
+	ll_io_init(io, file, 0, NULL);
 
 	rc = ll_readahead_file_kms(env, io, &kms);
 	if (rc != 0)
-- 
1.8.3.1


* [lustre-devel] [PATCH 409/622] lustre: obd: harden debugfs handling
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (407 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 408/622] lustre: llite: release active extent on sync write commit James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 410/622] lustre: obd: add rmfid support James Simmons
                   ` (213 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

While the seq_file private data shouldn't disappear from under
us, as a precaution always test whether the private field is set.
If it is not, return -ENODEV for debugfs read and write operations.
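
The guard the macros gain is the standard defensive pattern below; the struct and handler here are simplified stand-ins for the kernel's seq_file machinery:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's struct seq_file. */
struct seq_file {
	void *private;
};

/* Mirrors the generated *_seq_show handlers after the patch: bail
 * out with -ENODEV if the private data is gone instead of passing a
 * NULL pointer down to the type-specific read helper. */
static int demo_seq_show(struct seq_file *m)
{
	if (!m->private)
		return -ENODEV;
	return 0;	/* the real handler would format via m->private */
}
```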

WC-bug-id: https://jira.whamcloud.com/browse/LU-8066
Lustre-commit: 44d450890f43 ("LU-8066 obd: harden debugfs handling")
Signed-off-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-on: https://review.whamcloud.com/35575
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lprocfs_status.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/lustre/include/lprocfs_status.h b/fs/lustre/include/lprocfs_status.h
index 6269bd3..fdc1b19 100644
--- a/fs/lustre/include/lprocfs_status.h
+++ b/fs/lustre/include/lprocfs_status.h
@@ -519,6 +519,8 @@ void lprocfs_stats_collect(struct lprocfs_stats *stats, int idx,
 #define LPROC_SEQ_FOPS_RO_TYPE(name, type)				\
 	static int name##_##type##_seq_show(struct seq_file *m, void *v)\
 	{								\
+		if (!m->private)					\
+			return -ENODEV;					\
 		return lprocfs_rd_##type(m, m->private);		\
 	}								\
 	LPROC_SEQ_FOPS_RO(name##_##type)
@@ -526,6 +528,8 @@ void lprocfs_stats_collect(struct lprocfs_stats *stats, int idx,
 #define LPROC_SEQ_FOPS_RW_TYPE(name, type)				\
 	static int name##_##type##_seq_show(struct seq_file *m, void *v)\
 	{								\
+		if (!m->private)					\
+			return -ENODEV;					\
 		return lprocfs_rd_##type(m, m->private);		\
 	}								\
 	static ssize_t name##_##type##_seq_write(struct file *file,	\
@@ -533,6 +537,9 @@ void lprocfs_stats_collect(struct lprocfs_stats *stats, int idx,
 						loff_t *off)		\
 	{								\
 		struct seq_file *seq = file->private_data;		\
+									\
+		if (!seq->private)					\
+			return -ENODEV;					\
 		return lprocfs_wr_##type(file, buffer,			\
 					 count, seq->private);		\
 	}								\
-- 
1.8.3.1


* [lustre-devel] [PATCH 410/622] lustre: obd: add rmfid support
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (408 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 409/622] lustre: obd: harden debugfs handling James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 411/622] lnet: Convert noisy timeout error to cdebug James Simmons
                   ` (212 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Alex Zhuravlev <bzzz@whamcloud.com>

This patch introduces a new RPC, RPC_REINT_RMFID. It is meant
to be used with the corresponding llapi_rmfid() to unlink a
batch of MDS files by their FIDs. The caller must have
permission to modify the parent dir(s) and the objects
themselves.
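
Both `ll_rmfid()` and `lmv_rmfid()` size their buffers with `offsetof(struct fid_array, fa_fids[nr])`, the usual flexible-array-member idiom. A standalone sketch of that sizing, with a struct layout modelled on (but not guaranteed to match) `lustre_user.h`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative shapes; field names and padding are assumptions. */
struct lu_fid {
	uint64_t f_seq;
	uint32_t f_oid;
	uint32_t f_ver;
};

struct fid_array {
	uint32_t	fa_nr;
	uint32_t	fa_padding0;
	uint64_t	fa_padding1;
	struct lu_fid	fa_fids[];	/* flexible array member */
};

/* Allocate room for the header plus nr trailing FIDs, exactly as the
 * ioctl handler sizes its copy_from_user() buffer. */
static struct fid_array *fid_array_alloc(unsigned int nr)
{
	size_t size = offsetof(struct fid_array, fa_fids[nr]);
	struct fid_array *fa = calloc(1, size);

	if (fa)
		fa->fa_nr = nr;
	return fa;
}
```

Note how the patch also reuses `f_ver` of each returned FID to carry the per-file result code back to userspace, which is why `ll_rmfid()` copies the whole array back after the RPC completes.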

WC-bug-id: https://jira.whamcloud.com/browse/LU-12090
Lustre-commit: 1fd63fcb045c ("LU-12090 utils: lfs rmfid")
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34449
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_req_layout.h   |  2 +
 fs/lustre/include/obd.h                 |  2 +
 fs/lustre/include/obd_class.h           | 12 ++++
 fs/lustre/include/obd_support.h         |  1 +
 fs/lustre/llite/dir.c                   | 54 +++++++++++++++++-
 fs/lustre/lmv/lmv_obd.c                 | 98 +++++++++++++++++++++++++++++++++
 fs/lustre/mdc/mdc_request.c             | 76 ++++++++++++++++++++++++-
 fs/lustre/ptlrpc/layout.c               | 25 +++++++++
 fs/lustre/ptlrpc/lproc_ptlrpc.c         |  1 +
 fs/lustre/ptlrpc/wiretest.c             |  4 +-
 include/uapi/linux/lustre/lustre_idl.h  |  1 +
 include/uapi/linux/lustre/lustre_user.h | 10 ++++
 12 files changed, 283 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/include/lustre_req_layout.h b/fs/lustre/include/lustre_req_layout.h
index dca4ef4..feb5e77 100644
--- a/fs/lustre/include/lustre_req_layout.h
+++ b/fs/lustre/include/lustre_req_layout.h
@@ -165,6 +165,7 @@ void req_capsule_shrink(struct req_capsule *pill,
 extern struct req_format RQF_MDS_SWAP_LAYOUTS;
 extern struct req_format RQF_MDS_REINT_MIGRATE;
 extern struct req_format RQF_MDS_REINT_RESYNC;
+extern struct req_format RQF_MDS_RMFID;
 /* MDS hsm formats */
 extern struct req_format RQF_MDS_HSM_STATE_GET;
 extern struct req_format RQF_MDS_HSM_STATE_SET;
@@ -236,6 +237,7 @@ void req_capsule_shrink(struct req_capsule *pill,
 extern struct req_msg_field RMF_CLOSE_DATA;
 extern struct req_msg_field RMF_FILE_SECCTX_NAME;
 extern struct req_msg_field RMF_FILE_SECCTX;
+extern struct req_msg_field RMF_FID_ARRAY;
 
 /*
  * connection handle received in MDS_CONNECT request.
diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index 53d078e..886c697 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -1039,6 +1039,8 @@ struct md_ops {
 
 	int (*unpackmd)(struct obd_export *exp, struct lmv_stripe_md **plsm,
 			const union lmv_mds_md *lmv, size_t lmv_size);
+	int (*rmfid)(struct obd_export *exp, struct fid_array *fa, int *rcs,
+		     struct ptlrpc_request_set *set);
 };
 
 static inline struct md_open_data *obd_mod_alloc(void)
diff --git a/fs/lustre/include/obd_class.h b/fs/lustre/include/obd_class.h
index b8afa5a..bc01eca 100644
--- a/fs/lustre/include/obd_class.h
+++ b/fs/lustre/include/obd_class.h
@@ -1663,6 +1663,18 @@ static inline int md_unpackmd(struct obd_export *exp,
 	return MDP(exp->exp_obd, unpackmd)(exp, plsm, lmm, lmm_size);
 }
 
+static inline int md_rmfid(struct obd_export *exp, struct fid_array *fa,
+			   int *rcs, struct ptlrpc_request_set *set)
+{
+	int rc;
+
+	rc = exp_check_ops(exp);
+	if (rc)
+		return rc;
+
+	return MDP(exp->exp_obd, rmfid)(exp, fa, rcs, set);
+}
+
 /* OBD Metadata Support */
 
 int obd_init_caches(void);
diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index 23f6bae..c66b61a 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -194,6 +194,7 @@
 #define OBD_FAIL_MDS_CHANGELOG_INIT			0x151
 #define OBD_FAIL_MDS_REINT_MULTI_NET			0x159
 #define OBD_FAIL_MDS_REINT_MULTI_NET_REP		0x15a
+#define OBD_FAIL_MDS_RMFID_NET		 0x166
 
 /* layout lock */
 #define OBD_FAIL_MDS_NO_LL_GETATTR			0x170
diff --git a/fs/lustre/llite/dir.c b/fs/lustre/llite/dir.c
index f87ddd2..3540c18 100644
--- a/fs/lustre/llite/dir.c
+++ b/fs/lustre/llite/dir.c
@@ -1180,6 +1180,57 @@ static int quotactl_ioctl(struct ll_sb_info *sbi, struct if_quotactl *qctl)
 	return rc;
 }
 
+int ll_rmfid(struct file *file, void __user *arg)
+{
+	const struct fid_array __user *ufa = arg;
+	struct fid_array *lfa = NULL;
+	size_t size;
+	unsigned int nr;
+	int i, rc, *rcs = NULL;
+
+	if (!capable(CAP_DAC_READ_SEARCH) &&
+	    !(ll_i2sbi(file_inode(file))->ll_flags & LL_SBI_USER_FID2PATH))
+		return -EPERM;
+	/* Only need to get the buflen */
+	if (get_user(nr, &ufa->fa_nr))
+		return -EFAULT;
+	/* DoS protection */
+	if (nr > OBD_MAX_FIDS_IN_ARRAY)
+		return -E2BIG;
+
+	size = offsetof(struct fid_array, fa_fids[nr]);
+	lfa = kzalloc(size, GFP_NOFS);
+	if (!lfa)
+		return -ENOMEM;
+	rcs = kcalloc(nr, sizeof(int), GFP_NOFS);
+	if (!rcs) {
+		rc = -ENOMEM;
+		goto free_lfa;
+	}
+
+	if (copy_from_user(lfa, arg, size)) {
+		rc = -EFAULT;
+		goto free_rcs;
+	}
+
+	/* Call the md_ops rmfid method */
+	rc = md_rmfid(ll_i2mdexp(file_inode(file)), lfa, rcs, NULL);
+	if (!rc) {
+		for (i = 0; i < nr; i++)
+			if (rcs[i])
+				lfa->fa_fids[i].f_ver = rcs[i];
+		if (copy_to_user(arg, lfa, size))
+			rc = -EFAULT;
+	}
+
+free_rcs:
+	kfree(rcs);
+free_lfa:
+	kfree(lfa);
+
+	return rc;
+}
+
 /* This function tries to get a single name component,
  * to send to the server. No actual path traversal involved,
  * so we limit to NAME_MAX
@@ -1544,7 +1595,8 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		ptlrpc_req_finished(request);
 		return rc;
 	}
-
+	case LL_IOC_RMFID:
+		return ll_rmfid(file, (void __user *)arg);
 	case LL_IOC_LOV_SWAP_LAYOUTS:
 		return -EPERM;
 	case IOC_OBD_STATFS:
diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index e9f9c36..d323250 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -2930,6 +2930,103 @@ static int lmv_get_info(const struct lu_env *env, struct obd_export *exp,
 	return -EINVAL;
 }
 
+static int lmv_rmfid(struct obd_export *exp, struct fid_array *fa,
+		     int *__rcs, struct ptlrpc_request_set *_set)
+{
+	struct obd_device *obddev = class_exp2obd(exp);
+	struct ptlrpc_request_set *set = _set;
+	struct lmv_obd *lmv = &obddev->u.lmv;
+	int tgt_count = lmv->desc.ld_tgt_count;
+	struct fid_array *fat, **fas = NULL;
+	int i, rc, **rcs = NULL;
+
+	if (!set) {
+		set = ptlrpc_prep_set();
+		if (!set)
+			return -ENOMEM;
+	}
+
+	/* split FIDs by targets */
+	fas = kcalloc(tgt_count, sizeof(fas), GFP_NOFS);
+	if (!fas) {
+		rc = -ENOMEM;
+		goto out;
+	}
+	rcs = kcalloc(tgt_count, sizeof(int *), GFP_NOFS);
+	if (!rcs) {
+		rc = -ENOMEM;
+		goto out_fas;
+	}
+
+	for (i = 0; i < fa->fa_nr; i++) {
+		unsigned int idx;
+
+		rc = lmv_fld_lookup(lmv, &fa->fa_fids[i], &idx);
+		if (rc) {
+			CDEBUG(D_OTHER, "can't lookup "DFID": rc = %d\n",
+			       PFID(&fa->fa_fids[i]), rc);
+			continue;
+		}
+		LASSERT(idx < tgt_count);
+		if (!fas[idx])
+			fas[idx] = kzalloc(offsetof(struct fid_array,
+						    fa_fids[fa->fa_nr]),
+					   GFP_NOFS);
+		if (!fas[idx]) {
+			rc = -ENOMEM;
+			goto out;
+		}
+		if (!rcs[idx])
+			rcs[idx] = kcalloc(fa->fa_nr, sizeof(int), GFP_NOFS);
+		if (!rcs[idx]) {
+			rc = -ENOMEM;
+			goto out;
+		}
+
+		fat = fas[idx];
+		fat->fa_fids[fat->fa_nr++] = fa->fa_fids[i];
+	}
+
+	for (i = 0; i < tgt_count; i++) {
+		fat = fas[i];
+		if (!fat || fat->fa_nr == 0)
+			continue;
+		rc = md_rmfid(lmv->tgts[i]->ltd_exp, fat, rcs[i], set);
+	}
+
+	rc = ptlrpc_set_wait(NULL, set);
+	if (rc == 0) {
+		int j = 0;
+
+		for (i = 0; i < tgt_count; i++) {
+			fat = fas[i];
+			if (!fat || fat->fa_nr == 0)
+				continue;
+			/* copy FIDs back */
+			memcpy(fa->fa_fids + j, fat->fa_fids,
+			       fat->fa_nr * sizeof(struct lu_fid));
+			/* copy rcs back */
+			memcpy(__rcs + j, rcs[i], fat->fa_nr * sizeof(**rcs));
+			j += fat->fa_nr;
+		}
+	}
+	if (set != _set)
+		ptlrpc_set_destroy(set);
+
+out:
+	for (i = 0; i < tgt_count; i++) {
+		if (fas)
+			kfree(fas[i]);
+		if (rcs)
+			kfree(rcs[i]);
+	}
+	kfree(rcs);
+out_fas:
+	kfree(fas);
+
+	return rc;
+}
+
 /**
  * Asynchronously set by key a value associated with a LMV device.
  *
@@ -3517,6 +3614,7 @@ static int lmv_merge_attr(struct obd_export *exp,
 	.revalidate_lock	= lmv_revalidate_lock,
 	.get_fid_from_lsm	= lmv_get_fid_from_lsm,
 	.unpackmd		= lmv_unpackmd,
+	.rmfid			= lmv_rmfid,
 };
 
 static int __init lmv_init(void)
diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index 7bc6196..693c455 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -2585,6 +2585,79 @@ static int mdc_fsync(struct obd_export *exp, const struct lu_fid *fid,
 	return rc;
 }
 
+struct mdc_rmfid_args {
+	int *mra_rcs;
+	int mra_nr;
+};
+
+int mdc_rmfid_interpret(const struct lu_env *env, struct ptlrpc_request *req,
+			  void *args, int rc)
+{
+	struct mdc_rmfid_args *aa;
+	int *rcs, size;
+
+	if (!rc) {
+		aa = ptlrpc_req_async_args(aa, req);
+
+		size = req_capsule_get_size(&req->rq_pill, &RMF_RCS,
+					    RCL_SERVER);
+		LASSERT(size == sizeof(int) * aa->mra_nr);
+		rcs = req_capsule_server_get(&req->rq_pill, &RMF_RCS);
+		LASSERT(rcs);
+		LASSERT(aa->mra_rcs);
+		LASSERT(aa->mra_nr);
+		memcpy(aa->mra_rcs, rcs, size);
+	}
+
+	return rc;
+}
+
+static int mdc_rmfid(struct obd_export *exp, struct fid_array *fa,
+		     int *rcs, struct ptlrpc_request_set *set)
+{
+	struct ptlrpc_request *req;
+	struct mdc_rmfid_args *aa;
+	struct mdt_body *b;
+	struct lu_fid *tmp;
+	int rc, flen;
+
+	req = ptlrpc_request_alloc(class_exp2cliimp(exp), &RQF_MDS_RMFID);
+	if (!req)
+		return -ENOMEM;
+
+	flen = fa->fa_nr * sizeof(struct lu_fid);
+	req_capsule_set_size(&req->rq_pill, &RMF_FID_ARRAY,
+			     RCL_CLIENT, flen);
+	req_capsule_set_size(&req->rq_pill, &RMF_FID_ARRAY,
+			     RCL_SERVER, flen);
+	req_capsule_set_size(&req->rq_pill, &RMF_RCS,
+			     RCL_SERVER, fa->fa_nr * sizeof(u32));
+	rc = ptlrpc_request_pack(req, LUSTRE_MDS_VERSION, MDS_RMFID);
+	if (rc) {
+		ptlrpc_request_free(req);
+		return rc;
+	}
+	tmp = req_capsule_client_get(&req->rq_pill, &RMF_FID_ARRAY);
+	memcpy(tmp, fa->fa_fids, flen);
+
+	mdc_pack_body(req, NULL, 0, 0, -1, 0);
+	b = req_capsule_client_get(&req->rq_pill, &RMF_MDT_BODY);
+	b->mbo_ctime = ktime_get_real_seconds();
+
+	ptlrpc_request_set_replen(req);
+
+	LASSERT(rcs);
+	aa = ptlrpc_req_async_args(aa, req);
+	aa->mra_rcs = rcs;
+	aa->mra_nr = fa->fa_nr;
+	req->rq_interpret_reply = mdc_rmfid_interpret;
+
+	ptlrpc_set_add_req(set, req);
+	ptlrpc_check_set(NULL, set);
+
+	return rc;
+}
+
 static int mdc_import_event(struct obd_device *obd, struct obd_import *imp,
 			    enum obd_import_event event)
 {
@@ -2886,7 +2959,8 @@ static int mdc_cleanup(struct obd_device *obd)
 	.set_open_replay_data	= mdc_set_open_replay_data,
 	.clear_open_replay_data	= mdc_clear_open_replay_data,
 	.intent_getattr_async	= mdc_intent_getattr_async,
-	.revalidate_lock	= mdc_revalidate_lock
+	.revalidate_lock	= mdc_revalidate_lock,
+	.rmfid			= mdc_rmfid,
 };
 
 static int __init mdc_init(void)
diff --git a/fs/lustre/ptlrpc/layout.c b/fs/lustre/ptlrpc/layout.c
index c10b593..fb60558 100644
--- a/fs/lustre/ptlrpc/layout.c
+++ b/fs/lustre/ptlrpc/layout.c
@@ -318,6 +318,21 @@
 	&RMF_DLM_REQ
 };
 
+static const struct req_msg_field *mds_rmfid_client[] = {
+	&RMF_PTLRPC_BODY,
+	&RMF_MDT_BODY,
+	&RMF_FID_ARRAY,
+	&RMF_CAPA1,
+	&RMF_CAPA2,
+};
+
+static const struct req_msg_field *mds_rmfid_server[] = {
+	&RMF_PTLRPC_BODY,
+	&RMF_MDT_BODY,
+	&RMF_FID_ARRAY,
+	&RMF_RCS,
+};
+
 static const struct req_msg_field *obd_connect_client[] = {
 	&RMF_PTLRPC_BODY,
 	&RMF_TGTUUID,
@@ -731,6 +746,7 @@
 	&RQF_MDS_HSM_ACTION,
 	&RQF_MDS_HSM_REQUEST,
 	&RQF_MDS_SWAP_LAYOUTS,
+	&RQF_MDS_RMFID,
 	&RQF_OST_CONNECT,
 	&RQF_OST_DISCONNECT,
 	&RQF_OST_QUOTACTL,
@@ -929,6 +945,10 @@ struct req_msg_field RMF_NAME =
 	DEFINE_MSGF("name", RMF_F_STRING, -1, NULL, NULL);
 EXPORT_SYMBOL(RMF_NAME);
 
+struct req_msg_field RMF_FID_ARRAY =
+	DEFINE_MSGF("fid_array", 0, -1, NULL, NULL);
+EXPORT_SYMBOL(RMF_FID_ARRAY);
+
 struct req_msg_field RMF_SYMTGT =
 	DEFINE_MSGF("symtgt", RMF_F_STRING, -1, NULL, NULL);
 EXPORT_SYMBOL(RMF_SYMTGT);
@@ -1511,6 +1531,11 @@ struct req_format RQF_MDS_WRITEPAGE =
 			mdt_body_capa, mdt_body_only);
 EXPORT_SYMBOL(RQF_MDS_WRITEPAGE);
 
+struct req_format RQF_MDS_RMFID =
+	DEFINE_REQ_FMT0("MDS_RMFID", mds_rmfid_client,
+			mds_rmfid_server);
+EXPORT_SYMBOL(RQF_MDS_RMFID);
+
 struct req_format RQF_LLOG_ORIGIN_HANDLE_CREATE =
 	DEFINE_REQ_FMT0("LLOG_ORIGIN_HANDLE_CREATE",
 			llog_origin_handle_create_client, llogd_body_only);
diff --git a/fs/lustre/ptlrpc/lproc_ptlrpc.c b/fs/lustre/ptlrpc/lproc_ptlrpc.c
index 700e109..d52a08a 100644
--- a/fs/lustre/ptlrpc/lproc_ptlrpc.c
+++ b/fs/lustre/ptlrpc/lproc_ptlrpc.c
@@ -96,6 +96,7 @@
 	{ MDS_HSM_CT_REGISTER,			"mds_hsm_ct_register" },
 	{ MDS_HSM_CT_UNREGISTER,		"mds_hsm_ct_unregister" },
 	{ MDS_SWAP_LAYOUTS,			"mds_swap_layouts" },
+	{ MDS_RMFID,				"mds_rmfid" },
 	{ LDLM_ENQUEUE,				"ldlm_enqueue" },
 	{ LDLM_CONVERT,				"ldlm_convert" },
 	{ LDLM_CANCEL,				"ldlm_cancel" },
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index 1d34b15..9298c97 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -178,7 +178,9 @@ void lustre_assert_wire_constants(void)
 		 (long long)MDS_HSM_CT_UNREGISTER);
 	LASSERTF(MDS_SWAP_LAYOUTS == 61, "found %lld\n",
 		 (long long)MDS_SWAP_LAYOUTS);
-	LASSERTF(MDS_LAST_OPC == 62, "found %lld\n",
+	LASSERTF(MDS_RMFID == 62, "found %lld\n",
+		 (long long)MDS_RMFID);
+	LASSERTF(MDS_LAST_OPC == 63, "found %lld\n",
 		 (long long)MDS_LAST_OPC);
 	LASSERTF(REINT_SETATTR == 1, "found %lld\n",
 		 (long long)REINT_SETATTR);
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 5740d42..87251ee 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -1443,6 +1443,7 @@ enum mds_cmd {
 	MDS_HSM_CT_REGISTER	= 59,
 	MDS_HSM_CT_UNREGISTER	= 60,
 	MDS_SWAP_LAYOUTS	= 61,
+	MDS_RMFID		= 62,
 	MDS_LAST_OPC
 };
 
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 9c849ce..db36ce5 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -348,6 +348,7 @@ struct ll_ioc_lease_id {
 
 #define LL_IOC_LMV_SETSTRIPE		_IOWR('f', 240, struct lmv_user_md)
 #define LL_IOC_LMV_GETSTRIPE		_IOWR('f', 241, struct lmv_user_md)
+#define LL_IOC_RMFID			_IOR('f', 242, struct fid_array)
 #define LL_IOC_SET_LEASE		_IOWR('f', 243, struct ll_ioc_lease)
 #define LL_IOC_SET_LEASE_OLD		_IOWR('f', 243, long)
 #define LL_IOC_GET_LEASE		_IO('f', 244)
@@ -2149,6 +2150,15 @@ struct lu_pcc_state {
 	char	pccs_path[PATH_MAX];
 };
 
+struct fid_array {
+	__u32 fa_nr;
+	/* make header's size equal lu_fid */
+	__u32 fa_padding0;
+	__u64 fa_padding1;
+	struct lu_fid fa_fids[0];
+};
+#define OBD_MAX_FIDS_IN_ARRAY	4096
+
 /** @} lustreuser */
 
 #endif /* _LUSTRE_USER_H */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 411/622] lnet: Convert noisy timeout error to cdebug
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (409 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 410/622] lustre: obd: add rmfid support James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 412/622] lnet: Misleading error from lnet_is_health_check James Simmons
                   ` (211 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

This error message in lnet_finalize_expired_responses is very noisy
when nodes go down or are rebooted, and it does not provide much value
to system administrators. Convert it to a CDEBUG instead.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12439
Lustre-commit: bd3ed8cb7165 ("LU-12439 lnet: Convert noisy timeout error to cdebug")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/35233
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-move.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 629856c..9a4c426 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -2636,8 +2636,9 @@ struct lnet_mt_event_info {
 
 				nid = rspt->rspt_next_hop_nid;
 
-				CNETERR("Response timed out: md = %p: nid = %s\n",
-					md, libcfs_nid2str(nid));
+				CDEBUG(D_NET,
+				       "Response timeout: md = %p: nid = %s\n",
+				       md, libcfs_nid2str(nid));
 				LNetMDUnlink(rspt->rspt_mdh);
 				lnet_rspt_free(rspt, i);
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 412/622] lnet: Misleading error from lnet_is_health_check
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (410 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 411/622] lnet: Convert noisy timeout error to cdebug James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 413/622] lustre: llite: do not cache write open lock for exec file James Simmons
                   ` (210 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

In the case of sending to 0@lo we never set msg_txpeer nor
msg_rxpeer. This results in failing this lnet_is_health_check
condition and a misleading error message. The condition is only an
error if the msg status is non-zero.

An additional case where we can have msg_rx_committed, but not
msg_rxpeer is for optimized GETs. In this case we allocate a reply
message but do not set msg_rxpeer.  We cannot perform further health
checking on this message, but it is not an error condition.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12440
Lustre-commit: 6caa6ed07df0 ("LU-12440 lnet: Misleading error from lnet_is_health_check")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/35235
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-msg.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index 9ffd874..b70a6c9 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -848,8 +848,13 @@
 
 	if ((msg->msg_tx_committed && !msg->msg_txpeer) ||
 	    (msg->msg_rx_committed && !msg->msg_rxpeer)) {
-		CDEBUG(D_NET, "msg %p failed too early to retry and send\n",
-		       msg);
+		/* The optimized GET case does not set msg_rxpeer, but status
+		 * could be zero. Only print the error message if we have a
+		 * non-zero status.
+		 */
+		if (status)
+			CDEBUG(D_NET, "msg %p status %d cannot retry\n", msg,
+			       status);
 		return false;
 	}
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 413/622] lustre: llite: do not cache write open lock for exec file
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (411 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 412/622] lnet: Misleading error from lnet_is_health_check James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 414/622] lustre: mdc: polling mode for changelog reader James Simmons
                   ` (209 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Jinshan Xiong <jinshan.xiong@uber.com>

This avoids the problem where the MDT needs an extra lock
revocation to make the file executable.

WC-bug-id: https://jira.whamcloud.com/browse/LU-4398
Lustre-commit: 6dd9d57bc006 ("LU-4398 llite: do not cache write open lock for exec file")
Signed-off-by: Jinshan Xiong <jinshan.xiong@uber.com>
Signed-off-by: Gu Zheng <gzheng@ddn.com>
Reviewed-on: https://review.whamcloud.com/32265
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 6f418e0..35e31ad 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -360,7 +360,9 @@ static int ll_md_close(struct inode *inode, struct file *file)
 	}
 	mutex_unlock(&lli->lli_och_mutex);
 
-	if (!md_lock_match(ll_i2mdexp(inode), flags, ll_inode2fid(inode),
+	/* LU-4398: do not cache write open lock if the file has exec bit */
+	if ((lockmode == LCK_CW && inode->i_mode & 0111) ||
+	    !md_lock_match(ll_i2mdexp(inode), flags, ll_inode2fid(inode),
 			   LDLM_IBITS, &policy, lockmode, &lockh))
 		rc = ll_md_real_close(inode, fd->fd_omode);
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 414/622] lustre: mdc: polling mode for changelog reader
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (412 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 413/622] lustre: llite: do not cache write open lock for exec file James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 415/622] lnet: Sync the start of discovery and monitor threads James Simmons
                   ` (208 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Alex Zhuravlev <bzzz@whamcloud.com>

This allows a user (such as lsom_sync and similar tools) to follow
the changelog without rescanning and picking up duplicate records.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12553
Lustre-commit: e215002883d5 ("LU-12553  mdc: polling mode for changelog reader")
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35262
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/mdc/mdc_changelog.c            | 37 +++++++++++++++++++++++++++++++-
 include/uapi/linux/lustre/lustre_ioctl.h |  1 +
 2 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/mdc/mdc_changelog.c b/fs/lustre/mdc/mdc_changelog.c
index fb0de68..ea74bab 100644
--- a/fs/lustre/mdc/mdc_changelog.c
+++ b/fs/lustre/mdc/mdc_changelog.c
@@ -37,6 +37,7 @@
 #include <linux/miscdevice.h>
 
 #include <lustre_log.h>
+#include <uapi/linux/lustre/lustre_ioctl.h>
 
 #include "mdc_internal.h"
 
@@ -88,6 +89,9 @@ struct chlg_reader_state {
 	u64			 crs_rec_count;
 	/* List of prefetched enqueued_record::enq_linkage_items */
 	struct list_head	 crs_rec_queue;
+	unsigned int		 crs_last_catidx;
+	unsigned int		 crs_last_idx;
+	bool			 crs_poll;
 };
 
 struct chlg_rec_entry {
@@ -132,6 +136,9 @@ static int chlg_read_cat_process_cb(const struct lu_env *env,
 
 	rec = container_of(hdr, struct llog_changelog_rec, cr_hdr);
 
+	crs->crs_last_catidx = llh->lgh_hdr->llh_cat_idx;
+	crs->crs_last_idx = hdr->lrh_index;
+
 	if (rec->cr_hdr.lrh_type != CHANGELOG_REC) {
 		rc = -EINVAL;
 		CERROR("%s: not a changelog rec %x/%d in llog : rc = %d\n",
@@ -225,6 +232,10 @@ static int chlg_load(void *args)
 		goto err_out;
 	}
 
+	crs->crs_last_catidx = -1;
+	crs->crs_last_idx = 0;
+
+again:
 	rc = llog_open(NULL, ctx, &llh, NULL, CHANGELOG_CATALOG,
 		       LLOG_OPEN_EXISTS);
 	if (rc) {
@@ -248,12 +259,18 @@ static int chlg_load(void *args)
 		goto err_out;
 	}
 
-	rc = llog_cat_process(NULL, llh, chlg_read_cat_process_cb, crs, 0, 0);
+	rc = llog_cat_process(NULL, llh, chlg_read_cat_process_cb, crs,
+				crs->crs_last_catidx, crs->crs_last_idx);
 	if (rc < 0) {
 		CERROR("%s: fail to process llog: rc = %d\n",
 		       obd->obd_name, rc);
 		goto err_out;
 	}
+	if (!kthread_should_stop() && crs->crs_poll) {
+		llog_cat_close(NULL, llh);
+		schedule_timeout_interruptible(HZ);
+		goto again;
+	}
 
 	crs->crs_eof = true;
 
@@ -602,6 +619,23 @@ static unsigned int chlg_poll(struct file *file, poll_table *wait)
 	return mask;
 }
 
+static long chlg_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	struct chlg_reader_state *crs = file->private_data;
+	int rc;
+
+	switch (cmd) {
+	case OBD_IOC_CHLG_POLL:
+		crs->crs_poll = !!arg;
+		rc = 0;
+		break;
+	default:
+		rc = -EINVAL;
+		break;
+	}
+	return rc;
+}
+
 static const struct file_operations chlg_fops = {
 	.owner		= THIS_MODULE,
 	.llseek		= chlg_llseek,
@@ -610,6 +644,7 @@ static unsigned int chlg_poll(struct file *file, poll_table *wait)
 	.open		= chlg_open,
 	.release	= chlg_release,
 	.poll		= chlg_poll,
+	.unlocked_ioctl	= chlg_ioctl,
 };
 
 /**
diff --git a/include/uapi/linux/lustre/lustre_ioctl.h b/include/uapi/linux/lustre/lustre_ioctl.h
index b067cc6..53dd34f 100644
--- a/include/uapi/linux/lustre/lustre_ioctl.h
+++ b/include/uapi/linux/lustre/lustre_ioctl.h
@@ -221,6 +221,7 @@ static inline __u32 obd_ioctl_packlen(struct obd_ioctl_data *data)
 #define OBD_IOC_START_LFSCK	_IOWR('f', 230, OBD_IOC_DATA_TYPE)
 #define OBD_IOC_STOP_LFSCK	_IOW('f', 231, OBD_IOC_DATA_TYPE)
 #define OBD_IOC_QUERY_LFSCK	_IOR('f', 232, struct obd_ioctl_data)
+#define OBD_IOC_CHLG_POLL	_IOR('f', 233, long)
 /*	lustre/lustre_user.h	240-249 */
 /* was LIBCFS_IOC_DEBUG_MASK   _IOWR('f', 250, long) until 2.11 */
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 415/622] lnet: Sync the start of discovery and monitor threads
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (413 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 414/622] lustre: mdc: polling mode for changelog reader James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 416/622] lustre: llite: don't check vmpage refcount in ll_releasepage() James Simmons
                   ` (207 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

The discovery thread starts up before the monitor thread so it may
issue PUTs or GETs before the monitor thread has a chance to
initialize its data structures (namely the_lnet.ln_mt_rstq). This can
result in an oops when we attempt to attach response trackers to MDs.

Introduce a completion to synchronize the startup of these threads.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12537
Lustre-commit: 9283e2ed6655 ("LU-12537 lnet: Sync the start of discovery and monitor threads")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/35478
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-types.h |  5 +++++
 net/lnet/lnet/api-ni.c         |  3 +++
 net/lnet/lnet/lib-move.c       |  1 +
 net/lnet/lnet/peer.c           | 11 ++++++++++-
 4 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index b240361..1009a69 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -1161,6 +1161,11 @@ struct lnet {
 	/* recovery eq handler */
 	struct lnet_handle_eq		ln_mt_eqh;
 
+	/*
+	 * Completed when the discovery and monitor threads can enter their
+	 * work loops
+	 */
+	struct completion		ln_started;
 };
 
 #endif
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 65f1f17..aa5ca52 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -1062,6 +1062,7 @@ struct lnet_libhandle *
 	INIT_LIST_HEAD(&the_lnet.ln_mt_peerNIRecovq);
 	init_waitqueue_head(&the_lnet.ln_dc_waitq);
 	LNetInvalidateEQHandle(&the_lnet.ln_mt_eqh);
+	init_completion(&the_lnet.ln_started);
 
 	rc = lnet_descriptor_setup();
 	if (rc != 0)
@@ -2583,6 +2584,8 @@ void lnet_lib_exit(void)
 
 	mutex_unlock(&the_lnet.ln_api_mutex);
 
+	complete_all(&the_lnet.ln_started);
+
 	/* wait for all routers to start */
 	lnet_wait_router_start();
 
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 9a4c426..413397c 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -3529,6 +3529,7 @@ void lnet_monitor_thr_stop(void)
 
 	lnet_build_msg_event(msg, LNET_EVENT_PUT);
 
+	wait_for_completion(&the_lnet.ln_started);
 	/*
 	 * Must I ACK?  If so I'll grab the ack_wmd out of the header and put
 	 * it back into the ACK during lnet_finalize()
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index b0ca1de..49da7a1 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -3258,6 +3258,8 @@ static int lnet_peer_discovery(void *arg)
 	struct lnet_peer *lp;
 	int rc;
 
+	wait_for_completion(&the_lnet.ln_started);
+
 	CDEBUG(D_NET, "started\n");
 
 	for (;;) {
@@ -3429,7 +3431,14 @@ void lnet_peer_discovery_stop(void)
 
 	LASSERT(the_lnet.ln_dc_state == LNET_DC_STATE_RUNNING);
 	the_lnet.ln_dc_state = LNET_DC_STATE_STOPPING;
-	wake_up(&the_lnet.ln_dc_waitq);
+
+	/* In the LNetNIInit() path we may be stopping discovery before it
+	 * entered its work loop
+	 */
+	if (!completion_done(&the_lnet.ln_started))
+		complete(&the_lnet.ln_started);
+	else
+		wake_up(&the_lnet.ln_dc_waitq);
 
 	wait_event(the_lnet.ln_dc_waitq,
 		   the_lnet.ln_dc_state == LNET_DC_STATE_SHUTDOWN);
-- 
1.8.3.1


* [lustre-devel] [PATCH 416/622] lustre: llite: don't check vmpage refcount in ll_releasepage()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (414 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 415/622] lnet: Sync the start of discovery and monitor threads James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 417/622] lnet: Deprecate live and dead router check params James Simmons
                   ` (206 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

We cannot use the vmpage refcount to check whether a page can be
released, because doing so breaks invalidate_complete_page2():

See comments:
/*
 * This is like invalidate_complete_page(), except it ignores the page's
 * refcount.  We do this because invalidate_inode_pages2() needs stronger
 * invalidation guarantees, and cannot afford to leave pages behind because
 * shrink_page_list() has a temp ref on them, or because they're transiently
 * sitting in the lru_cache_add() pagevecs.
 */

So checking refcount > 3 might be wrong here; one common case is a
page transiently sitting in lru_cache_add().

Since we already check whether the vmpage is used by a cl_page later
in the function, and the vmpage is locked before ll_releasepage() is
called, it should be safe to remove the vmpage refcount check.

One current problem is that the following DIO sequence mostly
falls back to buffered I/O:

 $ dd if=/dev/zero of=data bs=1M count=1
 $ dd if=/dev/zero of=data bs=1M count=1 oflag=direct conv=notrunc

This is because DIO first tries to write back and invalidate the
clean pages, which fails because the vmpage refcount can be 4 here.

Function calls come from:

|->generic_file_direct_write()
  |->filemap_write_and_wait_range()
    |->invalidate_inode_pages2_range()
          |->invalidate_complete_page2() If a page can not be invalidated,
                                         return 0 to fall back to buffered write.
                |->try_to_release_page()
                  |->ll_releasepage()
                        return 0 because of vmpage count is 4 > 3
   |->generic_file_buffered_write

WC-bug-id: https://jira.whamcloud.com/browse/LU-12587
Lustre-commit: e59f0c9a245f ("LU-12587 llite: don't check vmpage refcount in ll_releasepage()")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/35610
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Li Dongyang <dongyangli@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/rw26.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/fs/lustre/llite/rw26.c b/fs/lustre/llite/rw26.c
index f5c1479..75348bf 100644
--- a/fs/lustre/llite/rw26.c
+++ b/fs/lustre/llite/rw26.c
@@ -119,10 +119,6 @@ static int ll_releasepage(struct page *vmpage, gfp_t gfp_mask)
 	if (!obj)
 		return 1;
 
-	/* 1 for caller, 1 for cl_page and 1 for page cache */
-	if (page_count(vmpage) > 3)
-		return 0;
-
 	page = cl_vmpage_page(vmpage, obj);
 	if (!page)
 		return 1;
-- 
1.8.3.1


* [lustre-devel] [PATCH 417/622] lnet: Deprecate live and dead router check params
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (415 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 416/622] lustre: llite: don't check vmpage refcount in ll_releasepage() James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 418/622] lnet: Detach rspt when md_threshold is infinite James Simmons
                   ` (205 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

Rather than delete these params, let's deprecate them for one release
and print a warning to the console if the user sets them.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12492
Lustre-commit: fca1a999899a ("LU-12492 lnet: Deprecate live and dead router check params")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/35387
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h | 2 ++
 net/lnet/lnet/module.c        | 4 ++++
 net/lnet/lnet/router.c        | 8 ++++++++
 3 files changed, 14 insertions(+)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index 3dd56a2..dd0075b 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -501,6 +501,8 @@ struct lnet_ni *
 extern unsigned int lnet_drop_asym_route;
 extern unsigned int router_sensitivity_percentage;
 extern int alive_router_check_interval;
+extern int live_router_check_interval;
+extern int dead_router_check_interval;
 extern int portal_rotor;
 
 int lnet_lib_init(void);
diff --git a/net/lnet/lnet/module.c b/net/lnet/lnet/module.c
index 5905f38..939c255 100644
--- a/net/lnet/lnet/module.c
+++ b/net/lnet/lnet/module.c
@@ -245,6 +245,10 @@ static int __init lnet_init(void)
 		return rc;
 	}
 
+	if (live_router_check_interval != INT_MIN ||
+	    dead_router_check_interval != INT_MIN)
+		LCONSOLE_WARN("live_router_check_interval and dead_router_check_interval have been deprecated. Use alive_router_check_interval instead. Ignoring these deprecated parameters.\n");
+
 	rc = blocking_notifier_chain_register(&libcfs_ioctl_list,
 					      &lnet_ioctl_handler);
 	LASSERT(!rc);
diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index eb76c72..892164b 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -78,6 +78,14 @@
 module_param(avoid_asym_router_failure, int, 0644);
 MODULE_PARM_DESC(avoid_asym_router_failure, "Avoid asymmetrical router failures (0 to disable)");
 
+int dead_router_check_interval = INT_MIN;
+module_param(dead_router_check_interval, int, 0444);
+MODULE_PARM_DESC(dead_router_check_interval, "(DEPRECATED - Use alive_router_check_interval)");
+
+int live_router_check_interval = INT_MIN;
+module_param(live_router_check_interval, int, 0444);
+MODULE_PARM_DESC(live_router_check_interval, "(DEPRECATED - Use alive_router_check_interval)");
+
 int alive_router_check_interval = 60;
 module_param(alive_router_check_interval, int, 0644);
 MODULE_PARM_DESC(alive_router_check_interval, "Seconds between live router health checks (<= 0 to disable)");
-- 
1.8.3.1


* [lustre-devel] [PATCH 418/622] lnet: Detach rspt when md_threshold is infinite
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (416 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 417/622] lnet: Deprecate live and dead router check params James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 419/622] lnet: Return EHOSTUNREACH for unreachable gateway James Simmons
                   ` (204 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

MDs for pings use the infinite threshold on MD operations.
As such they aren't normally unlinkable as determined by
lnet_md_unlinkable(). We can cover this case by checking whether the
refcount is zero and threshold is LNET_MD_THRESH_INF.

Cray-bug-id: LUS-7366
WC-bug-id: https://jira.whamcloud.com/browse/LU-12441
Lustre-commit: ebbf909a1c2d ("LU-12441 lnet: Detach rspt when md_threshold is infinite")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/35452
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-msg.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index b70a6c9..805d5b9 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -825,10 +825,12 @@
 		lnet_eq_enqueue_event(md->md_eq, &msg->msg_ev);
 	}
 
-	if (unlink) {
+	if (unlink || (md->md_refcount == 0 &&
+		       md->md_threshold == LNET_MD_THRESH_INF))
 		lnet_detach_rsp_tracker(md, cpt);
+
+	if (unlink)
 		lnet_md_unlink(md);
-	}
 
 	msg->msg_md = NULL;
 }
-- 
1.8.3.1


* [lustre-devel] [PATCH 419/622] lnet: Return EHOSTUNREACH for unreachable gateway
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (417 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 418/622] lnet: Detach rspt when md_threshold is infinite James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 420/622] lustre: ptlrpc: Don't get jobid in body_v2 James Simmons
                   ` (203 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

Commit f1d0660a5bbe ("lnet: Do not allow gateways on remote nets")
contains a flaw in that it shouldn't be a fatal error to
encounter an unreachable gateway when parsing routes.  Parsing should
continue in case there are any valid, reachable routes that are being
added.  Returning EINVAL here will cause a failure to load the LNet
module.  lnet_parse_route() explicitly allows for lnet_add_route() to
return EHOSTUNREACH for just this purpose.

Fixes: f1d0660a5bbe ("lnet: Do not allow gateways on remote nets")
WC-bug-id: https://jira.whamcloud.com/browse/LU-12595
Lustre-commit: 7c12c24c8a10 ("LU-12595 lnet: Return EHOSTUNREACH for unreachable gateway")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/35630
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/router.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index 892164b..4ab587d 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -448,7 +448,7 @@ static void lnet_shuffle_seed(void)
 		CERROR("Cannot add route with gateway %s. There is no local interface configured on LNet %s\n",
 		       libcfs_nid2str(gateway),
 		       libcfs_net2str(LNET_NIDNET(gateway)));
-		return -EINVAL;
+		return -EHOSTUNREACH;
 	}
 
 	/* Assume net, route, all new */
-- 
1.8.3.1


* [lustre-devel] [PATCH 420/622] lustre: ptlrpc: Don't get jobid in body_v2
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (418 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 419/622] lnet: Return EHOSTUNREACH for unreachable gateway James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 421/622] lnet: Defer rspt cleanup when MD queued for unlink James Simmons
                   ` (202 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

Some Lustre messages are still sent with ptlrpc_body_v2,
which does not have space for the jobid.

This results in errors like this when getting the jobid
from these messages, which we do now that the jobid is in
all RPC debug:
LustreError: 6817:0:(pack_generic.c:425:lustre_msg_buf_v2()) msg
000000005c83b7a2 buffer[0] size 152 too small (required 184, opc=-1)

While we should stop sending ptlrpc_body_v2 messages, we
still have to support these messages from older servers.
So put a check in lustre_msg_get_jobid so it won't try to
get the jobid if it's the old, smaller RPC body.

Fixes: 9eabc4eaba47 ("lustre: ptlrpc: Add jobid to rpctrace debug messages")
WC-bug-id: https://jira.whamcloud.com/browse/LU-12523
Lustre-commit: 544701a782fb ("LU-12523 ptlrpc: Don't get jobid in body_v2")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35584
Reviewed-by: Ann Koehler <amk@cray.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/client.c       | 4 ++--
 fs/lustre/ptlrpc/pack_generic.c | 3 ++-
 fs/lustre/ptlrpc/service.c      | 4 ++--
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/lustre/ptlrpc/client.c b/fs/lustre/ptlrpc/client.c
index bd641cc..9920a95 100644
--- a/fs/lustre/ptlrpc/client.c
+++ b/fs/lustre/ptlrpc/client.c
@@ -1644,7 +1644,7 @@ static int ptlrpc_send_new_req(struct ptlrpc_request *req)
 	       imp->imp_obd->obd_uuid.uuid,
 	       lustre_msg_get_status(req->rq_reqmsg), req->rq_xid,
 	       obd_import_nid2str(imp), lustre_msg_get_opc(req->rq_reqmsg),
-	       lustre_msg_get_jobid(req->rq_reqmsg));
+	       lustre_msg_get_jobid(req->rq_reqmsg) ?: "");
 
 	rc = ptl_send_rpc(req, 0);
 	if (rc == -ENOMEM) {
@@ -2065,7 +2065,7 @@ int ptlrpc_check_set(const struct lu_env *env, struct ptlrpc_request_set *set)
 			       req->rq_xid,
 			       obd_import_nid2str(imp),
 			       lustre_msg_get_opc(req->rq_reqmsg),
-			       lustre_msg_get_jobid(req->rq_reqmsg));
+			       lustre_msg_get_jobid(req->rq_reqmsg) ?: "");
 
 		spin_lock(&imp->imp_lock);
 		/*
diff --git a/fs/lustre/ptlrpc/pack_generic.c b/fs/lustre/ptlrpc/pack_generic.c
index 7acb4a8..b066113 100644
--- a/fs/lustre/ptlrpc/pack_generic.c
+++ b/fs/lustre/ptlrpc/pack_generic.c
@@ -2429,7 +2429,8 @@ void _debug_req(struct ptlrpc_request *req,
 			 DEBUG_REQ_FLAGS(req),
 			 req_ok ? lustre_msg_get_flags(req->rq_reqmsg) : -1,
 			 rep_flags, req->rq_status, rep_status,
-			 req_ok ? lustre_msg_get_jobid(req->rq_reqmsg) : "");
+			 req_ok ? lustre_msg_get_jobid(req->rq_reqmsg) ?: ""
+				: "");
 	va_end(args);
 }
 EXPORT_SYMBOL(_debug_req);
diff --git a/fs/lustre/ptlrpc/service.c b/fs/lustre/ptlrpc/service.c
index 3132a1e..f40cb8d 100644
--- a/fs/lustre/ptlrpc/service.c
+++ b/fs/lustre/ptlrpc/service.c
@@ -1765,7 +1765,7 @@ static int ptlrpc_server_handle_request(struct ptlrpc_service_part *svcpt,
 	       lustre_msg_get_status(request->rq_reqmsg), request->rq_xid,
 	       libcfs_id2str(request->rq_peer),
 	       lustre_msg_get_opc(request->rq_reqmsg),
-	       lustre_msg_get_jobid(request->rq_reqmsg));
+	       lustre_msg_get_jobid(request->rq_reqmsg) ?: "");
 
 	if (lustre_msg_get_opc(request->rq_reqmsg) != OBD_PING)
 		CFS_FAIL_TIMEOUT_MS(OBD_FAIL_PTLRPC_PAUSE_REQ, cfs_fail_val);
@@ -1807,7 +1807,7 @@ static int ptlrpc_server_handle_request(struct ptlrpc_service_part *svcpt,
 	       request->rq_xid,
 	       libcfs_id2str(request->rq_peer),
 	       lustre_msg_get_opc(request->rq_reqmsg),
-	       lustre_msg_get_jobid(request->rq_reqmsg),
+	       lustre_msg_get_jobid(request->rq_reqmsg) ?: "",
 	       timediff_usecs,
 	       arrived_usecs,
 	       (request->rq_repmsg ?
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 421/622] lnet: Defer rspt cleanup when MD queued for unlink
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (419 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 420/622] lustre: ptlrpc: Don't get jobid in body_v2 James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 422/622] lustre: lov: Correct write_intent end for trunc James Simmons
                   ` (201 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

When an MD is queued for unlink its lnet_libhandle is invalidated so
that future lookups of the MD fail. As a result, the monitor thread
cannot detach the response tracker from such an MD, and instead must
wait for the remaining operations on the MD to complete before it can
safely free the response tracker and remove it from the list. Freeing
the memory while there are pending operations on the MD can result
in a use after free situation when the final operation on the MD
completes and we attempt to remove the response tracker from the MD
via the lnet_msg_detach_md()->lnet_detach_rsp_tracker() call chain.

Here we introduce zombie lists for such response trackers. This will
allow us to also handle the case where there are response trackers
on the monitor queue during LNet shutdown. In this instance the
zombie response trackers will be freed when either all the operations
on the MD have completed (this freeing is performed by
lnet_detach_rsp_tracker()) or after the LND Nets have shut down, since
we are assured there will not be any more operations on the
associated MDs (this freeing is performed by
lnet_clean_zombie_rstqs()).
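The deferred-free ("zombie list") idea described above can be sketched in isolation. This is an illustrative userspace model, not the real LNet API: `md_busy` stands in for outstanding operations on the MD, `retire()` for the monitor thread finding an un-lookupable MD, and `clean_zombies()` for `lnet_clean_zombie_rstqs()`:

```c
#include <stdlib.h>
#include <assert.h>

/* A tracker whose owner still has pending operations cannot be freed
 * immediately; it is parked on a zombie list, and whichever side
 * finishes last -- the final MD operation or shutdown -- frees it. */
struct tracker {
	struct tracker *next;
	int md_busy;		/* pending operations on the associated MD */
};

static struct tracker *zombies;

static void retire(struct tracker *t)
{
	if (t->md_busy) {	/* cannot free yet: defer to zombie list */
		t->next = zombies;
		zombies = t;
	} else {
		free(t);
	}
}

/* e.g. called after all LND Nets have shut down, when no further
 * operations on the MDs are possible; returns how many were freed */
static int clean_zombies(void)
{
	int n = 0;

	while (zombies) {
		struct tracker *t = zombies;

		zombies = t->next;
		free(t);
		n++;
	}
	return n;
}
```

The real code additionally makes the lists per-CPT and protects them with the resource lock, since `lnet_detach_rsp_tracker()` may race with the monitor thread.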

Three other small changes are included in this patch:
 - When deleting the response tracker from the monitor's list we
   should use list_del() rather than list_del_init() since we'll
   be freeing the response tracker after removing it from the list.
 - Perform a single ktime_get() call for each local queue.
 - Move the check of whether the local queue is empty outside of
   the net lock.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12568
Lustre-commit: 4a4ac34de42c ("LU-12568 lnet: Defer rspt cleanup when MD queued for unlink")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/35576
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h  |   3 +
 include/linux/lnet/lib-types.h |   7 +++
 net/lnet/lnet/api-ni.c         |  31 ++++++++++
 net/lnet/lnet/lib-move.c       | 134 +++++++++++++++++++++++++++--------------
 4 files changed, 131 insertions(+), 44 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index dd0075b..b1407b3 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -571,6 +571,8 @@ int lnet_send_ping(lnet_nid_t dest_nid, struct lnet_handle_md *mdh, int nnis,
 void lnet_schedule_blocked_locked(struct lnet_rtrbufpool *rbp);
 void lnet_drop_routed_msgs_locked(struct list_head *list, int cpt);
 
+struct list_head **lnet_create_array_of_queues(void);
+
 /* portals functions */
 /* portals attributes */
 static inline int
@@ -641,6 +643,7 @@ struct lnet_msg *lnet_create_reply_msg(struct lnet_ni *ni,
 void lnet_set_reply_msg_len(struct lnet_ni *ni, struct lnet_msg *msg,
 			    unsigned int len);
 void lnet_detach_rsp_tracker(struct lnet_libmd *md, int cpt);
+void lnet_clean_zombie_rstqs(void);
 
 void lnet_finalize(struct lnet_msg *msg, int rc);
 bool lnet_send_error_simulation(struct lnet_msg *msg,
diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index 1009a69..904ef7a 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -1158,6 +1158,13 @@ struct lnet {
 	 * based on the mdh cookie.
 	 */
 	struct list_head		**ln_mt_rstq;
+	/*
+	 * A response tracker becomes a zombie when the associated MD is queued
+	 * for unlink before the response tracker is detached from the MD. An
+	 * entry on a zombie list can be freed when either the remaining
+	 * operations on the MD complete or when LNet has shut down.
+	 */
+	struct list_head		**ln_mt_zombie_rstqs;
 	/* recovery eq handler */
 	struct lnet_handle_eq		ln_mt_eqh;
 
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index aa5ca52..e773839 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -1028,6 +1028,26 @@ struct lnet_libhandle *
 	list_add(&lh->lh_hash_chain, &rec->rec_lh_hash[hash]);
 }
 
+struct list_head **
+lnet_create_array_of_queues(void)
+{
+	struct list_head **qs;
+	struct list_head *q;
+	int i;
+
+	qs = cfs_percpt_alloc(lnet_cpt_table(),
+			      sizeof(struct list_head));
+	if (!qs) {
+		CERROR("Failed to allocate queues\n");
+		return NULL;
+	}
+
+	cfs_percpt_for_each(q, i, qs)
+		INIT_LIST_HEAD(q);
+
+	return qs;
+}
+
 static int lnet_unprepare(void);
 
 static int
@@ -1120,6 +1140,12 @@ struct lnet_libhandle *
 		goto failed;
 	}
 
+	the_lnet.ln_mt_zombie_rstqs = lnet_create_array_of_queues();
+	if (!the_lnet.ln_mt_zombie_rstqs) {
+		rc = -ENOMEM;
+		goto failed;
+	}
+
 	return 0;
 
 failed:
@@ -1144,6 +1170,11 @@ struct lnet_libhandle *
 	LASSERT(list_empty(&the_lnet.ln_test_peers));
 	LASSERT(list_empty(&the_lnet.ln_nets));
 
+	if (the_lnet.ln_mt_zombie_rstqs) {
+		lnet_clean_zombie_rstqs();
+		the_lnet.ln_mt_zombie_rstqs = NULL;
+	}
+
 	if (!LNetEQHandleIsInvalid(the_lnet.ln_mt_eqh)) {
 		rc = LNetEQFree(the_lnet.ln_mt_eqh);
 		LNetInvalidateEQHandle(&the_lnet.ln_mt_eqh);
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 413397c..322998a 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -2556,24 +2556,55 @@ struct lnet_mt_event_info {
 		return;
 
 	rspt = md->md_rspt_ptr;
-	md->md_rspt_ptr = NULL;
 
 	/* debug code */
 	LASSERT(rspt->rspt_cpt == cpt);
 
-	/* invalidate the handle to indicate that a response has been
-	 * received, which will then lead the monitor thread to clean up
-	 * the rspt block.
-	 */
-	LNetInvalidateMDHandle(&rspt->rspt_mdh);
+	md->md_rspt_ptr = NULL;
+
+	if (LNetMDHandleIsInvalid(rspt->rspt_mdh)) {
+		/* The monitor thread has invalidated this handle because the
+		 * response timed out, but it failed to lookup the MD. That
+		 * means this response tracker is on the zombie list. We can
+		 * safely remove it under the resource lock (held by caller) and
+		 * free the response tracker block.
+		 */
+		list_del(&rspt->rspt_on_list);
+		lnet_rspt_free(rspt, cpt);
+	} else {
+		/* invalidate the handle to indicate that a response has been
+		 * received, which will then lead the monitor thread to clean up
+		 * the rspt block.
+		 */
+		LNetInvalidateMDHandle(&rspt->rspt_mdh);
+	}
+}
+
+void
+lnet_clean_zombie_rstqs(void)
+{
+	struct lnet_rsp_tracker *rspt, *tmp;
+	int i;
+
+	cfs_cpt_for_each(i, lnet_cpt_table()) {
+		list_for_each_entry_safe(rspt, tmp,
+					 the_lnet.ln_mt_zombie_rstqs[i],
+					 rspt_on_list) {
+			list_del(&rspt->rspt_on_list);
+			lnet_rspt_free(rspt, i);
+		}
+	}
+
+	cfs_percpt_free(the_lnet.ln_mt_zombie_rstqs);
 }
 
 static void
-lnet_finalize_expired_responses(bool force)
+lnet_finalize_expired_responses(void)
 {
 	struct lnet_libmd *md;
 	struct list_head local_queue;
 	struct lnet_rsp_tracker *rspt, *tmp;
+	ktime_t now;
 	int i;
 
 	if (!the_lnet.ln_mt_rstq)
@@ -2590,6 +2621,8 @@ struct lnet_mt_event_info {
 		list_splice_init(the_lnet.ln_mt_rstq[i], &local_queue);
 		lnet_net_unlock(i);
 
+		now = ktime_get();
+
 		list_for_each_entry_safe(rspt, tmp, &local_queue,
 					 rspt_on_list) {
 			/* The rspt mdh will be invalidated when a response
@@ -2605,42 +2638,74 @@ struct lnet_mt_event_info {
 			lnet_res_lock(i);
 			if (LNetMDHandleIsInvalid(rspt->rspt_mdh)) {
 				lnet_res_unlock(i);
-				list_del_init(&rspt->rspt_on_list);
+				list_del(&rspt->rspt_on_list);
 				lnet_rspt_free(rspt, i);
 				continue;
 			}
 
-			if (ktime_compare(ktime_get(),
-					  rspt->rspt_deadline) >= 0 ||
-			    force) {
+			if (ktime_compare(now, rspt->rspt_deadline) >= 0 ||
+			    the_lnet.ln_mt_state == LNET_MT_STATE_SHUTDOWN) {
 				struct lnet_peer_ni *lpni;
 				lnet_nid_t nid;
 
 				md = lnet_handle2md(&rspt->rspt_mdh);
 				if (!md) {
+					/* MD has been queued for unlink, but
+					 * rspt hasn't been detached (Note we've
+					 * checked above that the rspt_mdh is
+					 * valid). Since we cannot lookup the MD
+					 * we're unable to detach the rspt
+					 * ourselves. Thus, move the rspt to the
+					 * zombie list where we'll wait for
+					 * either:
+					 *   1. The remaining operations on the
+					 *   MD to complete. In this case the
+					 *   final operation will result in
+					 *   lnet_msg_detach_md()->
+					 *   lnet_detach_rsp_tracker() where
+					 *   we will clean up this response
+					 *   tracker.
+					 *   2. LNet to shutdown. In this case
+					 *   we'll wait until after all LND Nets
+					 *   have shutdown and then we can
+					 *   safely free any remaining response
+					 *   tracker blocks on the zombie list.
+					 * Note: We need to hold the resource
+					 * lock when adding to the zombie list
+					 * because we may have concurrent access
+					 * with lnet_detach_rsp_tracker().
+					 */
 					LNetInvalidateMDHandle(&rspt->rspt_mdh);
+					list_move(&rspt->rspt_on_list,
+						  the_lnet.ln_mt_zombie_rstqs[i]);
 					lnet_res_unlock(i);
-					list_del_init(&rspt->rspt_on_list);
-					lnet_rspt_free(rspt, i);
 					continue;
 				}
 				LASSERT(md->md_rspt_ptr == rspt);
 				md->md_rspt_ptr = NULL;
 				lnet_res_unlock(i);
 
+				LNetMDUnlink(rspt->rspt_mdh);
+
+				nid = rspt->rspt_next_hop_nid;
+
+				list_del(&rspt->rspt_on_list);
+				lnet_rspt_free(rspt, i);
+
+				/* If we're shutting down we just want to clean
+				 * up the rspt blocks
+				 */
+				if (the_lnet.ln_mt_state ==
+				    LNET_MT_STATE_SHUTDOWN)
+					continue;
+
 				lnet_net_lock(i);
 				the_lnet.ln_counters[i]->lct_health.lch_response_timeout_count++;
 				lnet_net_unlock(i);
 
-				list_del_init(&rspt->rspt_on_list);
-
-				nid = rspt->rspt_next_hop_nid;
-
 				CDEBUG(D_NET,
 				       "Response timeout: md = %p: nid = %s\n",
 				       md, libcfs_nid2str(nid));
-				LNetMDUnlink(rspt->rspt_mdh);
-				lnet_rspt_free(rspt, i);
 
 				/* If there is a timeout on the response
 				 * from the next hop decrement its health
@@ -2659,10 +2724,11 @@ struct lnet_mt_event_info {
 			}
 		}
 
-		lnet_net_lock(i);
-		if (!list_empty(&local_queue))
+		if (!list_empty(&local_queue)) {
+			lnet_net_lock(i);
 			list_splice(&local_queue, the_lnet.ln_mt_rstq[i]);
-		lnet_net_unlock(i);
+			lnet_net_unlock(i);
+		}
 	}
 }
 
@@ -2927,26 +2993,6 @@ struct lnet_mt_event_info {
 	lnet_net_unlock(0);
 }
 
-static struct list_head **
-lnet_create_array_of_queues(void)
-{
-	struct list_head **qs;
-	struct list_head *q;
-	int i;
-
-	qs = cfs_percpt_alloc(lnet_cpt_table(),
-			      sizeof(struct list_head));
-	if (!qs) {
-		CERROR("Failed to allocate queues\n");
-		return NULL;
-	}
-
-	cfs_percpt_for_each(q, i, qs)
-		INIT_LIST_HEAD(q);
-
-	return qs;
-}
-
 static int
 lnet_resendqs_create(void)
 {
@@ -3204,7 +3250,7 @@ struct lnet_mt_event_info {
 		lnet_resend_pending_msgs();
 
 		if (now >= rsp_timeout) {
-			lnet_finalize_expired_responses(false);
+			lnet_finalize_expired_responses();
 			rsp_timeout = now + (lnet_transaction_timeout / 2);
 		}
 
@@ -3422,7 +3468,7 @@ struct lnet_mt_event_info {
 static void
 lnet_rsp_tracker_clean(void)
 {
-	lnet_finalize_expired_responses(true);
+	lnet_finalize_expired_responses();
 
 	cfs_percpt_free(the_lnet.ln_mt_rstq);
 	the_lnet.ln_mt_rstq = NULL;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 422/622] lustre: lov: Correct write_intent end for trunc
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (420 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 421/622] lnet: Defer rspt cleanup when MD queued for unlink James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 423/622] lustre: mdc: hold lock while walking changelog dev list James Simmons
                   ` (200 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

When instantiating a layout, the server interprets the
write intent from the client as the range [start, end), not
including the last byte.

This is correct for writes because the last byte given for
a write is actually 'endpos', the resulting file pointer
position, and so is not included.

However, truncate is specifying a size, not an endpos, so
truncate is [start, size].  To make this work with the
[start, end) processing for write_intents, we have to add
1 to the size when sending a write intent.
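The off-by-one can be shown with a half-open interval overlap check. A minimal sketch under the stated convention (the `overlaps` helper and the numbers are illustrative, not Lustre code): a component starting at offset 100 is missed by a truncate to size 100 unless the intent end is size + 1.

```c
#include <stdbool.h>
#include <assert.h>

/* The server instantiates components overlapping the half-open range
 * [intent_start, intent_end). A write supplies 'endpos' (one past the
 * last byte), which fits directly; truncate supplies a size, whose
 * last affected byte is offset 'size' itself, so it must send
 * size + 1 as the intent end. */
static bool overlaps(long comp_start, long comp_end,
		     long intent_start, long intent_end)
{
	return intent_start < comp_end && comp_start < intent_end;
}
```

With a component extent [100, 200), a truncate to size 100 touches byte 100, the component's first byte.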

Without this, a truncate operation to the first byte of a
new layout component fails silently because the component
is not instantiated.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12586
Lustre-commit: c32c7401426d ("LU-12586 lov: Correct write_intent end for trunc")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35607
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/lov/lov_io.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/lov/lov_io.c b/fs/lustre/lov/lov_io.c
index 9328240..6e86efa 100644
--- a/fs/lustre/lov/lov_io.c
+++ b/fs/lustre/lov/lov_io.c
@@ -555,7 +555,15 @@ static int lov_io_slice_init(struct lov_io *lio, struct lov_object *obj,
 	 */
 	if (cl_io_is_trunc(io)) {
 		io->ci_write_intent.e_start = 0;
-		io->ci_write_intent.e_end = io->u.ci_setattr.sa_attr.lvb_size;
+		/* for writes, e_end is endpos, the location of the file
+		 * pointer after the write is completed, so it is not accessed.
+		 * For truncate, 'end' is the size, and *is* acccessed.
+		 * In other words, writes are [start, end), but truncate is
+		 * [start, size], where both are included.  So add 1 to the
+		 * size when creating the write intent to account for this.
+		 */
+		io->ci_write_intent.e_end =
+			io->u.ci_setattr.sa_attr.lvb_size + 1;
 	} else {
 		io->ci_write_intent.e_start = lio->lis_pos;
 		io->ci_write_intent.e_end = lio->lis_endpos;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 423/622] lustre: mdc: hold lock while walking changelog dev list
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (421 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 422/622] lustre: lov: Correct write_intent end for trunc James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 424/622] lustre: import: fix race between imp_state & imp_invalid James Simmons
                   ` (199 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

In mdc_changelog_cdev_finish() we need chlg_registered_dev_lock
while walking and changing entries on the chlg_registered_devices
and ced_obds lists in chlg_registered_dev_find_by_obd().

Move the calling of chlg_registered_dev_find_by_obd() under the
mutex, and add assertions to the places where the lists are walked
and changed that the mutex is held.
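The two halves of the fix — asserting lock ownership inside the walker, and taking the lock before (not after) the lookup — can be sketched without the Lustre types. All names here are illustrative; a plain flag models `mutex_is_locked()`:

```c
#include <stddef.h>
#include <assert.h>

struct dev { struct dev *next; int id; };

static struct dev *devices;
static int dev_lock_held;	/* stands in for chlg_registered_dev_lock */

/* Walkers assert the caller holds the guarding lock, like the
 * LASSERT(mutex_is_locked(...)) calls added by this patch. */
static struct dev *find_dev(int id)
{
	assert(dev_lock_held);
	for (struct dev *d = devices; d; d = d->next)
		if (d->id == id)
			return d;
	return NULL;
}

/* The caller locks *before* the lookup, so the walk and the later
 * list modification happen in one critical section. */
static struct dev *locked_lookup(int id)
{
	struct dev *d;

	dev_lock_held = 1;	/* mutex_lock(&chlg_registered_dev_lock) */
	d = find_dev(id);
	dev_lock_held = 0;	/* mutex_unlock(...) */
	return d;
}
```

The bug fixed below was exactly the unlocked variant: the lookup ran before `mutex_lock()`, so the list could change under the walker.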

Fixes: dfecb064ac1f ("lustre: mdc: expose changelog through char devices")
WC-bug-id: https://jira.whamcloud.com/browse/LU-12566
Lustre-commit: a260c530801d ("LU-12566 mdc: hold lock while walking changelog dev list")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35668
Reviewed-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-by: Quentin Bouget <quentin.bouget@cea.fr>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/mdc/mdc_changelog.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/mdc/mdc_changelog.c b/fs/lustre/mdc/mdc_changelog.c
index ea74bab..9af0541 100644
--- a/fs/lustre/mdc/mdc_changelog.c
+++ b/fs/lustre/mdc/mdc_changelog.c
@@ -677,6 +677,7 @@ static void get_chlg_name(char *name, size_t name_len, struct obd_device *obd)
 {
 	struct chlg_registered_dev *dit;
 
+	LASSERT(mutex_is_locked(&chlg_registered_dev_lock));
 	list_for_each_entry(dit, &chlg_registered_devices, ced_link)
 		if (strcmp(name, dit->ced_name) == 0)
 			return dit;
@@ -695,6 +696,7 @@ static void get_chlg_name(char *name, size_t name_len, struct obd_device *obd)
 	struct chlg_registered_dev *dit;
 	struct obd_device *oit;
 
+	LASSERT(mutex_is_locked(&chlg_registered_dev_lock));
 	list_for_each_entry(dit, &chlg_registered_devices, ced_link)
 		list_for_each_entry(oit, &dit->ced_obds,
 				    u.cli.cl_chg_dev_linkage)
@@ -768,6 +770,7 @@ static void chlg_dev_clear(struct kref *kref)
 	struct chlg_registered_dev *entry = container_of(kref,
 							 struct chlg_registered_dev,
 							 ced_refs);
+	LASSERT(mutex_is_locked(&chlg_registered_dev_lock));
 	list_del(&entry->ced_link);
 	misc_deregister(&entry->ced_misc);
 	kfree(entry);
@@ -778,9 +781,10 @@ static void chlg_dev_clear(struct kref *kref)
  */
 void mdc_changelog_cdev_finish(struct obd_device *obd)
 {
-	struct chlg_registered_dev *dev = chlg_registered_dev_find_by_obd(obd);
+	struct chlg_registered_dev *dev;
 
 	mutex_lock(&chlg_registered_dev_lock);
+	dev = chlg_registered_dev_find_by_obd(obd);
 	list_del_init(&obd->u.cli.cl_chg_dev_linkage);
 	kref_put(&dev->ced_refs, chlg_dev_clear);
 	mutex_unlock(&chlg_registered_dev_lock);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 424/622] lustre: import: fix race between imp_state & imp_invalid
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (422 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 423/622] lustre: mdc: hold lock while walking changelog dev list James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 425/622] lnet: support non-default network namespace James Simmons
                   ` (198 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Yang Sheng <ys@whamcloud.com>

We set the import to LUSTRE_IMP_DISCON and then deactivate it
when it is unreplayable. Another thread may bring this import back
up between those two operations, leaving us with an invalid import
in FULL state.
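The shape of the fix is a classic check-then-act repair: merge the two state transitions into one critical section so no other thread can observe (or overwrite) the intermediate state. A simplified sketch with illustrative types — not the real `obd_import` — where the comments mark what the real locking calls would be:

```c
#include <assert.h>

enum imp_state { IMP_FULL, IMP_DISCON };

struct import {
	enum imp_state state;
	int invalid;
};

/* Before the fix, "state = DISCON" and "invalid = 1" ran in two
 * separate imp_lock sections; a reconnect could slip in between and
 * set the state back to FULL, yielding an invalid import that claims
 * to be FULL. Doing both under one lock closes that window. */
static void set_discon_and_deactivate(struct import *imp)
{
	/* spin_lock(&imp->imp_lock) */
	imp->state = IMP_DISCON;	/* both updates in one critical */
	imp->invalid = 1;		/* section: no FULL+invalid window */
	/* spin_unlock(&imp->imp_lock) */
}
```

This is why the diff below converts `import_set_state()`/`ptlrpc_deactivate_import()` pairs into `_nolock` variants called under a single `spin_lock(&imp->imp_lock)`.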

WC-bug-id: https://jira.whamcloud.com/browse/LU-11542
Lustre-commit: 29904135df67 ("LU-11542 import: fix race between imp_state & imp_invalid")
Signed-off-by: Yang Sheng <ys@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33395
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_ha.h      |   2 +-
 fs/lustre/lov/lov_obd.c            |   2 +-
 fs/lustre/ptlrpc/client.c          |   3 +-
 fs/lustre/ptlrpc/import.c          | 104 ++++++++++++++++++++++++-------------
 fs/lustre/ptlrpc/pinger.c          |  13 ++---
 fs/lustre/ptlrpc/ptlrpc_internal.h |   3 +-
 fs/lustre/ptlrpc/recover.c         |  14 ++---
 7 files changed, 80 insertions(+), 61 deletions(-)

diff --git a/fs/lustre/include/lustre_ha.h b/fs/lustre/include/lustre_ha.h
index af92a56..c914ef6 100644
--- a/fs/lustre/include/lustre_ha.h
+++ b/fs/lustre/include/lustre_ha.h
@@ -50,7 +50,7 @@
 void ptlrpc_wake_delayed(struct obd_import *imp);
 int ptlrpc_recover_import(struct obd_import *imp, char *new_uuid, int async);
 int ptlrpc_set_import_active(struct obd_import *imp, int active);
-void ptlrpc_activate_import(struct obd_import *imp);
+void ptlrpc_activate_import(struct obd_import *imp, bool set_state_full);
 void ptlrpc_deactivate_import(struct obd_import *imp);
 void ptlrpc_invalidate_import(struct obd_import *imp);
 void ptlrpc_fail_import(struct obd_import *imp, u32 conn_cnt);
diff --git a/fs/lustre/lov/lov_obd.c b/fs/lustre/lov/lov_obd.c
index 234b556..3348380 100644
--- a/fs/lustre/lov/lov_obd.c
+++ b/fs/lustre/lov/lov_obd.c
@@ -157,7 +157,7 @@ int lov_connect_osc(struct obd_device *obd, u32 index, int activate,
 		/* FIXME this is probably supposed to be
 		 * ptlrpc_set_import_active.  Horrible naming.
 		 */
-		ptlrpc_activate_import(imp);
+		ptlrpc_activate_import(imp, false);
 	}
 
 	rc = obd_register_observer(tgt_obd, obd);
diff --git a/fs/lustre/ptlrpc/client.c b/fs/lustre/ptlrpc/client.c
index 9920a95..dcc5e6b 100644
--- a/fs/lustre/ptlrpc/client.c
+++ b/fs/lustre/ptlrpc/client.c
@@ -3033,7 +3033,7 @@ void ptlrpc_abort_inflight(struct obd_import *imp)
 	 * ptlrpc_{queue,set}_wait must (and does) hold imp_lock while testing
 	 * this flag and then putting requests on sending_list or delayed_list.
 	 */
-	spin_lock(&imp->imp_lock);
+	assert_spin_locked(&imp->imp_lock);
 
 	/*
 	 * XXX locking?  Maybe we should remove each request with the list
@@ -3071,7 +3071,6 @@ void ptlrpc_abort_inflight(struct obd_import *imp)
 	if (imp->imp_replayable)
 		ptlrpc_free_committed(imp);
 
-	spin_unlock(&imp->imp_lock);
 }
 
 /**
diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index 98c09f6..0ade41e 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -144,6 +144,17 @@ static void deuuidify(char *uuid, const char *prefix, char **uuid_start,
 		*uuid_len -= strlen(UUID_STR);
 }
 
+/* Must be called with imp_lock held! */
+static void ptlrpc_deactivate_import_nolock(struct obd_import *imp)
+{
+	assert_spin_locked(&imp->imp_lock);
+	CDEBUG(D_HA, "setting import %s INVALID\n", obd2cli_tgt(imp->imp_obd));
+	imp->imp_invalid = 1;
+	imp->imp_generation++;
+
+	ptlrpc_abort_inflight(imp);
+}
+
 /**
  * Returns true if import was FULL, false if import was already not
  * connected.
@@ -154,8 +165,10 @@ static void deuuidify(char *uuid, const char *prefix, char **uuid_start,
  *	     bulk requests) and if one has already caused a reconnection
  *	     (increasing the import->conn_cnt) the older failure should
  *	     not also cause a reconnection.  If zero it forces a reconnect.
+ * @invalid - set import invalid flag
  */
-int ptlrpc_set_import_discon(struct obd_import *imp, u32 conn_cnt)
+int ptlrpc_set_import_discon(struct obd_import *imp,
+			     u32 conn_cnt, bool invalid)
 {
 	int rc = 0;
 
@@ -165,10 +178,12 @@ int ptlrpc_set_import_discon(struct obd_import *imp, u32 conn_cnt)
 	    (conn_cnt == 0 || conn_cnt == imp->imp_conn_cnt)) {
 		char *target_start;
 		int   target_len;
+		bool inact = false;
 
 		deuuidify(obd2cli_tgt(imp->imp_obd), NULL,
 			  &target_start, &target_len);
 
+		import_set_state_nolock(imp, LUSTRE_IMP_DISCON);
 		if (imp->imp_replayable) {
 			LCONSOLE_WARN("%s: Connection to %.*s (at %s) was lost; in progress operations using this service will wait for recovery to complete\n",
 				      imp->imp_obd->obd_name,
@@ -180,14 +195,25 @@ int ptlrpc_set_import_discon(struct obd_import *imp, u32 conn_cnt)
 					   imp->imp_obd->obd_name,
 					   target_len, target_start,
 					   obd_import_nid2str(imp));
+			if (invalid) {
+				CDEBUG(D_HA,
+				       "import %s@%s for %s not replayable, auto-deactivating\n",
+				       obd2cli_tgt(imp->imp_obd),
+				       imp->imp_connection->c_remote_uuid.uuid,
+				       imp->imp_obd->obd_name);
+				ptlrpc_deactivate_import_nolock(imp);
+				inact = true;
+			}
 		}
-		import_set_state_nolock(imp, LUSTRE_IMP_DISCON);
 		spin_unlock(&imp->imp_lock);
 
 		if (obd_dump_on_timeout)
 			libcfs_debug_dumplog();
 
 		obd_import_event(imp->imp_obd, imp, IMP_EVENT_DISCON);
+
+		if (inact)
+			obd_import_event(imp->imp_obd, imp, IMP_EVENT_INACTIVE);
 		rc = 1;
 	} else {
 		spin_unlock(&imp->imp_lock);
@@ -211,11 +237,9 @@ void ptlrpc_deactivate_import(struct obd_import *imp)
 	CDEBUG(D_HA, "setting import %s INVALID\n", obd2cli_tgt(imp->imp_obd));
 
 	spin_lock(&imp->imp_lock);
-	imp->imp_invalid = 1;
-	imp->imp_generation++;
+	ptlrpc_deactivate_import_nolock(imp);
 	spin_unlock(&imp->imp_lock);
 
-	ptlrpc_abort_inflight(imp);
 	obd_import_event(imp->imp_obd, imp, IMP_EVENT_INACTIVE);
 }
 EXPORT_SYMBOL(ptlrpc_deactivate_import);
@@ -379,17 +403,23 @@ void ptlrpc_invalidate_import(struct obd_import *imp)
 EXPORT_SYMBOL(ptlrpc_invalidate_import);
 
 /* unset imp_invalid */
-void ptlrpc_activate_import(struct obd_import *imp)
+void ptlrpc_activate_import(struct obd_import *imp, bool set_state_full)
 {
 	struct obd_device *obd = imp->imp_obd;
 
 	spin_lock(&imp->imp_lock);
 	if (imp->imp_deactive != 0) {
+		LASSERT(imp->imp_state != LUSTRE_IMP_FULL);
+		if (imp->imp_state != LUSTRE_IMP_DISCON)
+			import_set_state_nolock(imp, LUSTRE_IMP_DISCON);
 		spin_unlock(&imp->imp_lock);
 		return;
 	}
+	if (set_state_full)
+		import_set_state_nolock(imp, LUSTRE_IMP_FULL);
 
 	imp->imp_invalid = 0;
+
 	spin_unlock(&imp->imp_lock);
 	obd_import_event(obd, imp, IMP_EVENT_ACTIVE);
 }
@@ -413,18 +443,8 @@ void ptlrpc_fail_import(struct obd_import *imp, u32 conn_cnt)
 {
 	LASSERT(!imp->imp_dlm_fake);
 
-	if (ptlrpc_set_import_discon(imp, conn_cnt)) {
-		if (!imp->imp_replayable) {
-			CDEBUG(D_HA,
-			       "import %s@%s for %s not replayable, auto-deactivating\n",
-			       obd2cli_tgt(imp->imp_obd),
-			       imp->imp_connection->c_remote_uuid.uuid,
-			       imp->imp_obd->obd_name);
-			ptlrpc_deactivate_import(imp);
-		}
-
+	if (ptlrpc_set_import_discon(imp, conn_cnt, true))
 		ptlrpc_pinger_force(imp);
-	}
 }
 
 int ptlrpc_reconnect_import(struct obd_import *imp)
@@ -1073,12 +1093,10 @@ static int ptlrpc_connect_interpret(const struct lu_env *env,
 		spin_lock(&imp->imp_lock);
 		if (msg_flags & MSG_CONNECT_REPLAYABLE) {
 			imp->imp_replayable = 1;
-			spin_unlock(&imp->imp_lock);
 			CDEBUG(D_HA, "connected to replayable target: %s\n",
 			       obd2cli_tgt(imp->imp_obd));
 		} else {
 			imp->imp_replayable = 0;
-			spin_unlock(&imp->imp_lock);
 		}
 
 		/* if applies, adjust the imp->imp_msg_magic here
@@ -1095,10 +1113,11 @@ static int ptlrpc_connect_interpret(const struct lu_env *env,
 		if (msg_flags & MSG_CONNECT_RECOVERING) {
 			CDEBUG(D_HA, "connect to %s during recovery\n",
 			       obd2cli_tgt(imp->imp_obd));
-			import_set_state(imp, LUSTRE_IMP_REPLAY_LOCKS);
+			import_set_state_nolock(imp, LUSTRE_IMP_REPLAY_LOCKS);
+			spin_unlock(&imp->imp_lock);
 		} else {
-			import_set_state(imp, LUSTRE_IMP_FULL);
-			ptlrpc_activate_import(imp);
+			spin_unlock(&imp->imp_lock);
+			ptlrpc_activate_import(imp, true);
 		}
 
 		rc = 0;
@@ -1223,31 +1242,33 @@ static int ptlrpc_connect_interpret(const struct lu_env *env,
 	}
 
 out:
+	if (exp)
+		class_export_put(exp);
+
 	spin_lock(&imp->imp_lock);
 	imp->imp_connected = 0;
 	imp->imp_connect_tried = 1;
-	spin_unlock(&imp->imp_lock);
 
-	if (exp)
-		class_export_put(exp);
+	if (rc) {
+		bool inact = false;
 
-	if (rc != 0) {
-		import_set_state(imp, LUSTRE_IMP_DISCON);
+		import_set_state_nolock(imp, LUSTRE_IMP_DISCON);
 		if (rc == -EACCES) {
 			/*
 			 * Give up trying to reconnect
 			 * EACCES means client has no permission for connection
 			 */
 			imp->imp_obd->obd_no_recov = 1;
-			ptlrpc_deactivate_import(imp);
-		}
-
-		if (rc == -EPROTO) {
+			ptlrpc_deactivate_import_nolock(imp);
+			inact = true;
+		} else if (rc == -EPROTO) {
 			struct obd_connect_data *ocd;
 
 			/* reply message might not be ready */
-			if (!request->rq_repmsg)
+			if (!request->rq_repmsg) {
+				spin_unlock(&imp->imp_lock);
 				return -EPROTO;
+			}
 
 			ocd = req_capsule_server_get(&request->rq_pill,
 						     &RMF_CONNECT_DATA);
@@ -1267,17 +1288,26 @@ static int ptlrpc_connect_interpret(const struct lu_env *env,
 						   OBD_OCD_VERSION_PATCH(ocd->ocd_version),
 						   OBD_OCD_VERSION_FIX(ocd->ocd_version),
 						   LUSTRE_VERSION_STRING);
-				ptlrpc_deactivate_import(imp);
-				import_set_state(imp, LUSTRE_IMP_CLOSED);
+				ptlrpc_deactivate_import_nolock(imp);
+				import_set_state_nolock(imp, LUSTRE_IMP_CLOSED);
+				inact = true;
 			}
-			return -EPROTO;
 		}
+		spin_unlock(&imp->imp_lock);
+
+		if (inact)
+			obd_import_event(imp->imp_obd, imp, IMP_EVENT_INACTIVE);
+
+		if (rc == -EPROTO)
+			return rc;
 
 		ptlrpc_maybe_ping_import_soon(imp);
 
 		CDEBUG(D_HA, "recovery of %s on %s failed (%d)\n",
 		       obd2cli_tgt(imp->imp_obd),
 		       (char *)imp->imp_connection->c_remote_uuid.uuid, rc);
+	} else {
+		spin_unlock(&imp->imp_lock);
 	}
 
 	wake_up_all(&imp->imp_recovery_waitq);
@@ -1476,8 +1506,7 @@ int ptlrpc_import_recovery_state_machine(struct obd_import *imp)
 		rc = ptlrpc_resend(imp);
 		if (rc)
 			goto out;
-		import_set_state(imp, LUSTRE_IMP_FULL);
-		ptlrpc_activate_import(imp);
+		ptlrpc_activate_import(imp, true);
 
 		deuuidify(obd2cli_tgt(imp->imp_obd), NULL,
 			  &target_start, &target_len);
@@ -1684,6 +1713,7 @@ int ptlrpc_disconnect_and_idle_import(struct obd_import *imp)
 		return 0;
 
 	spin_lock(&imp->imp_lock);
+
 	if (imp->imp_state != LUSTRE_IMP_FULL) {
 		spin_unlock(&imp->imp_lock);
 		return 0;
diff --git a/fs/lustre/ptlrpc/pinger.c b/fs/lustre/ptlrpc/pinger.c
index c3fbddc..a812942 100644
--- a/fs/lustre/ptlrpc/pinger.c
+++ b/fs/lustre/ptlrpc/pinger.c
@@ -217,8 +217,6 @@ static void ptlrpc_pinger_process_import(struct obd_import *imp,
 
 	imp->imp_force_next_verify = 0;
 
-	spin_unlock(&imp->imp_lock);
-
 	CDEBUG(level == LUSTRE_IMP_FULL ? D_INFO : D_HA,
 	       "%s->%s: level %s/%u force %u force_next %u deactive %u pingable %u suppress %u\n",
 	       imp->imp_obd->obd_uuid.uuid, obd2cli_tgt(imp->imp_obd),
@@ -228,22 +226,21 @@ static void ptlrpc_pinger_process_import(struct obd_import *imp,
 	if (level == LUSTRE_IMP_DISCON && !imp_is_deactive(imp)) {
 		/* wait for a while before trying recovery again */
 		imp->imp_next_ping = ptlrpc_next_reconnect(imp);
+		spin_unlock(&imp->imp_lock);
 		if (!imp->imp_no_pinger_recover ||
 		    imp->imp_connect_error == -EAGAIN)
 			ptlrpc_initiate_recovery(imp);
-	} else if (level != LUSTRE_IMP_FULL ||
-		   imp->imp_obd->obd_no_recov ||
+	} else if (level != LUSTRE_IMP_FULL || imp->imp_obd->obd_no_recov ||
 		   imp_is_deactive(imp)) {
 		CDEBUG(D_HA,
 		       "%s->%s: not pinging (in recovery or recovery disabled: %s)\n",
 		       imp->imp_obd->obd_uuid.uuid, obd2cli_tgt(imp->imp_obd),
 		       ptlrpc_import_state_name(level));
-		if (force) {
-			spin_lock(&imp->imp_lock);
+		if (force)
 			imp->imp_force_verify = 1;
-			spin_unlock(&imp->imp_lock);
-		}
+		spin_unlock(&imp->imp_lock);
 	} else if ((imp->imp_pingable && !suppress) || force_next || force) {
+		spin_unlock(&imp->imp_lock);
 		ptlrpc_ping(imp);
 	}
 }
diff --git a/fs/lustre/ptlrpc/ptlrpc_internal.h b/fs/lustre/ptlrpc/ptlrpc_internal.h
index f84d278..9e74d71 100644
--- a/fs/lustre/ptlrpc/ptlrpc_internal.h
+++ b/fs/lustre/ptlrpc/ptlrpc_internal.h
@@ -83,7 +83,8 @@ void ptlrpc_set_add_new_req(struct ptlrpcd_ctl *pc,
 void ptlrpc_request_handle_notconn(struct ptlrpc_request *req);
 void lustre_assert_wire_constants(void);
 int ptlrpc_import_in_recovery(struct obd_import *imp);
-int ptlrpc_set_import_discon(struct obd_import *imp, u32 conn_cnt);
+int ptlrpc_set_import_discon(struct obd_import *imp, u32 conn_cnt,
+			     bool invalid);
 int ptlrpc_replay_next(struct obd_import *imp, int *inflight);
 void ptlrpc_initiate_recovery(struct obd_import *imp);
 
diff --git a/fs/lustre/ptlrpc/recover.c b/fs/lustre/ptlrpc/recover.c
index e26612d..e6e6661 100644
--- a/fs/lustre/ptlrpc/recover.c
+++ b/fs/lustre/ptlrpc/recover.c
@@ -224,21 +224,13 @@ void ptlrpc_wake_delayed(struct obd_import *imp)
 void ptlrpc_request_handle_notconn(struct ptlrpc_request *failed_req)
 {
 	struct obd_import *imp = failed_req->rq_import;
+	int conn = lustre_msg_get_conn_cnt(failed_req->rq_reqmsg);
 
 	CDEBUG(D_HA, "import %s of %s@%s abruptly disconnected: reconnecting\n",
 	       imp->imp_obd->obd_name, obd2cli_tgt(imp->imp_obd),
 	       imp->imp_connection->c_remote_uuid.uuid);
 
-	if (ptlrpc_set_import_discon(imp,
-			      lustre_msg_get_conn_cnt(failed_req->rq_reqmsg))) {
-		if (!imp->imp_replayable) {
-			CDEBUG(D_HA,
-			       "import %s@%s for %s not replayable, auto-deactivating\n",
-			       obd2cli_tgt(imp->imp_obd),
-			       imp->imp_connection->c_remote_uuid.uuid,
-			       imp->imp_obd->obd_name);
-			ptlrpc_deactivate_import(imp);
-		}
+	if (ptlrpc_set_import_discon(imp, conn, true)) {
 		/* to control recovery via lctl {disable|enable}_recovery */
 		if (imp->imp_deactive == 0)
 			ptlrpc_connect_import(imp);
@@ -317,7 +309,7 @@ int ptlrpc_recover_import(struct obd_import *imp, char *new_uuid, int async)
 		goto out;
 
 	/* force import to be disconnected. */
-	ptlrpc_set_import_discon(imp, 0);
+	ptlrpc_set_import_discon(imp, 0, false);
 
 	if (new_uuid) {
 		struct obd_uuid uuid;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 425/622] lnet: support non-default network namespace
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (423 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 424/622] lustre: import: fix race between imp_state & imp_invalid James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 426/622] lustre: obdclass: 0-nlink race in lu_object_find_at() James Simmons
                   ` (197 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Aurelien Degremont <degremoa@amazon.com>

Replace hard-coded references to the default root network
namespace (&init_net) in the LNET code (LNET core, socklnd and
o2iblnd).

When a network interface is created, Lustre records the current
network namespace. This patch improves the LNET code to use
this recorded namespace most of the time instead of the root
network namespace. When using lctl, lnetctl or insmod, we use
the current process's network namespace.
When starting the listening acceptor, we use the namespace of
the process that triggered the start.

An additional patch is needed for RPCSEC GSS support.
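The pattern the patch follows can be sketched outside the kernel.
Everything below is invented for illustration (the toy struct net_model
and the record_ns()/effective_ns() helpers are not the patch's real API;
in the kernel the real objects are struct net, &init_net and
current->nsproxy->net_ns): record the configuring process's namespace
once, then pass it down explicitly to every former user of &init_net,
falling back to the root namespace only when nothing was recorded.

```c
#include <stddef.h>

/* Toy model of the namespace plumbing; all names are invented. */
struct net_model { int id; };

static struct net_model root_ns = { 0 };   /* stands in for &init_net */
static struct net_model *recorded_ns;      /* set at configuration time */

/* Called from a configuration path (module load, lctl/lnetctl, or
 * acceptor start): remember the caller's namespace. */
static void record_ns(struct net_model *current_ns)
{
	recorded_ns = current_ns;
}

/* Every former &init_net user now takes the namespace explicitly and
 * only falls back to the root namespace when none was recorded. */
static struct net_model *effective_ns(struct net_model *ns)
{
	if (ns)
		return ns;
	return recorded_ns ? recorded_ns : &root_ns;
}
```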

WC-bug-id: https://jira.whamcloud.com/browse/LU-12236
Lustre-commit: 93b08edfb1c6 ("LU-12236 lnet: support non-default network namespace")
Signed-off-by: Aurelien Degremont <degremoa@amazon.com>
Reviewed-on: https://review.whamcloud.com/34768
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h       |  9 +++++----
 net/lnet/klnds/o2iblnd/o2iblnd.c    | 22 +++++++++++-----------
 net/lnet/klnds/o2iblnd/o2iblnd.h    |  9 ++++-----
 net/lnet/klnds/o2iblnd/o2iblnd_cb.c |  8 +++++---
 net/lnet/klnds/socklnd/socklnd.c    |  2 +-
 net/lnet/klnds/socklnd/socklnd_cb.c |  3 ++-
 net/lnet/lnet/acceptor.c            | 11 +++++++----
 net/lnet/lnet/config.c              |  6 +++---
 net/lnet/lnet/lib-socket.c          | 13 +++++++------
 9 files changed, 45 insertions(+), 38 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index b1407b3..b889af2 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -717,7 +717,7 @@ void lnet_copy_kiov2iter(struct iov_iter *to,
 void lnet_unregister_lnd(struct lnet_lnd *lnd);
 
 int lnet_connect(struct socket **sockp, lnet_nid_t peer_nid,
-		 u32 local_ip, u32 peer_ip, int peer_port);
+		 u32 local_ip, u32 peer_ip, int peer_port, struct net *ns);
 void lnet_connect_console_error(int rc, lnet_nid_t peer_nid,
 				u32 peer_ip, int port);
 int lnet_count_acceptor_nets(void);
@@ -738,18 +738,19 @@ struct lnet_inetdev {
 	char	li_name[IFNAMSIZ];
 };
 
-int lnet_inet_enumerate(struct lnet_inetdev **dev_list);
+int lnet_inet_enumerate(struct lnet_inetdev **dev_list, struct net *ns);
 int lnet_sock_setbuf(struct socket *socket, int txbufsize, int rxbufsize);
 int lnet_sock_getbuf(struct socket *socket, int *txbufsize, int *rxbufsize);
 int lnet_sock_getaddr(struct socket *socket, bool remote, u32 *ip, int *port);
 int lnet_sock_write(struct socket *sock, void *buffer, int nob, int timeout);
 int lnet_sock_read(struct socket *sock, void *buffer, int nob, int timeout);
 
-int lnet_sock_listen(struct socket **sockp, u32 ip, int port, int backlog);
+int lnet_sock_listen(struct socket **sockp, u32 ip, int port, int backlog,
+		     struct net *ns);
 int lnet_sock_accept(struct socket **newsockp, struct socket *sock);
 int lnet_sock_connect(struct socket **sockp, int *fatal,
 		      u32 local_ip, int local_port,
-		      u32 peer_ip, int peer_port);
+		      u32 peer_ip, int peer_port, struct net *ns);
 void libcfs_sock_release(struct socket *sock);
 
 int lnet_peers_start_down(void);
diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.c b/net/lnet/klnds/o2iblnd/o2iblnd.c
index bb7590f..f3176e1 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.c
@@ -2358,7 +2358,7 @@ static int kiblnd_dummy_callback(struct rdma_cm_id *cmid,
 	return 0;
 }
 
-static int kiblnd_dev_need_failover(struct kib_dev *dev)
+static int kiblnd_dev_need_failover(struct kib_dev *dev, struct net *ns)
 {
 	struct rdma_cm_id *cmid;
 	struct sockaddr_in srcaddr;
@@ -2382,8 +2382,8 @@ static int kiblnd_dev_need_failover(struct kib_dev *dev)
 	 * a. rdma_bind_addr(), it will conflict with listener cmid
 	 * b. rdma_resolve_addr() to zero addr
 	 */
-	cmid = kiblnd_rdma_create_id(kiblnd_dummy_callback, dev, RDMA_PS_TCP,
-				     IB_QPT_RC);
+	cmid = kiblnd_rdma_create_id(ns, kiblnd_dummy_callback, dev,
+				     RDMA_PS_TCP, IB_QPT_RC);
 	if (IS_ERR(cmid)) {
 		rc = PTR_ERR(cmid);
 		CERROR("Failed to create cmid for failover: %d\n", rc);
@@ -2412,7 +2412,7 @@ static int kiblnd_dev_need_failover(struct kib_dev *dev)
 	return rc;
 }
 
-int kiblnd_dev_failover(struct kib_dev *dev)
+int kiblnd_dev_failover(struct kib_dev *dev, struct net *ns)
 {
 	LIST_HEAD(zombie_tpo);
 	LIST_HEAD(zombie_ppo);
@@ -2429,7 +2429,7 @@ int kiblnd_dev_failover(struct kib_dev *dev)
 	LASSERT(*kiblnd_tunables.kib_dev_failover > 1 ||
 		dev->ibd_can_failover || !dev->ibd_hdev);
 
-	rc = kiblnd_dev_need_failover(dev);
+	rc = kiblnd_dev_need_failover(dev, ns);
 	if (rc <= 0)
 		goto out;
 
@@ -2454,7 +2454,7 @@ int kiblnd_dev_failover(struct kib_dev *dev)
 		rdma_destroy_id(cmid);
 	}
 
-	cmid = kiblnd_rdma_create_id(kiblnd_cm_callback, dev, RDMA_PS_TCP,
+	cmid = kiblnd_rdma_create_id(ns, kiblnd_cm_callback, dev, RDMA_PS_TCP,
 				     IB_QPT_RC);
 	if (IS_ERR(cmid)) {
 		rc = PTR_ERR(cmid);
@@ -2683,7 +2683,7 @@ static void kiblnd_shutdown(struct lnet_ni *ni)
 		kiblnd_base_shutdown();
 }
 
-static int kiblnd_base_startup(void)
+static int kiblnd_base_startup(struct net *ns)
 {
 	struct kib_sched_info *sched;
 	int rc;
@@ -2758,7 +2758,7 @@ static int kiblnd_base_startup(void)
 	}
 
 	if (*kiblnd_tunables.kib_dev_failover)
-		rc = kiblnd_thread_start(kiblnd_failover_thread, NULL,
+		rc = kiblnd_thread_start(kiblnd_failover_thread, ns,
 					 "kiblnd_failover");
 
 	if (rc) {
@@ -2856,7 +2856,7 @@ static int kiblnd_startup(struct lnet_ni *ni)
 	LASSERT(ni->ni_net->net_lnd == &the_o2iblnd);
 
 	if (kiblnd_data.kib_init == IBLND_INIT_NOTHING) {
-		rc = kiblnd_base_startup();
+		rc = kiblnd_base_startup(ni->ni_net_ns);
 		if (rc)
 			return rc;
 	}
@@ -2894,7 +2894,7 @@ static int kiblnd_startup(struct lnet_ni *ni)
 		goto failed;
 	}
 
-	rc = lnet_inet_enumerate(&ifaces);
+	rc = lnet_inet_enumerate(&ifaces, ni->ni_net_ns);
 	if (rc < 0)
 		goto failed;
 
@@ -2925,7 +2925,7 @@ static int kiblnd_startup(struct lnet_ni *ni)
 	INIT_LIST_HEAD(&ibdev->ibd_fail_list);
 
 	/* initialize the device */
-	rc = kiblnd_dev_failover(ibdev);
+	rc = kiblnd_dev_failover(ibdev, ni->ni_net_ns);
 	if (rc) {
 		CERROR("ko2iblnd: Can't initialize device: rc = %d\n", rc);
 		goto failed;
diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.h b/net/lnet/klnds/o2iblnd/o2iblnd.h
index 2f7ca52..1285ab1 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.h
@@ -109,10 +109,9 @@ struct kib_tunables {
 					IBLND_CREDIT_HIGHWATER_V1 : \
 					t->lnd_peercredits_hiw)
 
-#define kiblnd_rdma_create_id(cb, dev, ps, qpt) rdma_create_id(current->nsproxy->net_ns, \
-							       cb, dev, \
-							       ps, qpt)
-
+# define kiblnd_rdma_create_id(ns, cb, dev, ps, qpt) rdma_create_id(ns, cb, \
+								    dev, ps, \
+								    qpt)
 /* 2 OOB shall suffice for 1 keepalive and 1 returning credits */
 #define IBLND_OOB_CAPABLE(v)	((v) != IBLND_MSG_VERSION_1)
 #define IBLND_OOB_MSGS(v)	(IBLND_OOB_CAPABLE(v) ? 2 : 0)
@@ -1030,7 +1029,7 @@ int kiblnd_cm_callback(struct rdma_cm_id *cmid,
 		       struct rdma_cm_event *event);
 int kiblnd_translate_mtu(int value);
 
-int kiblnd_dev_failover(struct kib_dev *dev);
+int kiblnd_dev_failover(struct kib_dev *dev, struct net *ns);
 int kiblnd_create_peer(struct lnet_ni *ni, struct kib_peer_ni **peerp,
 		       lnet_nid_t nid);
 void kiblnd_destroy_peer(struct kib_peer_ni *peer_ni);
diff --git a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
index 69918cf..1110553 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -1330,8 +1330,9 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 	LASSERT(net);
 	LASSERT(peer_ni->ibp_connecting > 0);
 
-	cmid = kiblnd_rdma_create_id(kiblnd_cm_callback, peer_ni, RDMA_PS_TCP,
-				     IB_QPT_RC);
+	cmid = kiblnd_rdma_create_id(peer_ni->ibp_ni->ni_net_ns,
+				     kiblnd_cm_callback, peer_ni,
+				     RDMA_PS_TCP, IB_QPT_RC);
 
 	if (IS_ERR(cmid)) {
 		CERROR("Can't create CMID for %s: %ld\n",
@@ -3830,6 +3831,7 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 {
 	rwlock_t *glock = &kiblnd_data.kib_global_lock;
 	struct kib_dev *dev;
+	struct net *ns = arg;
 	wait_queue_entry_t wait;
 	unsigned long flags;
 	int rc;
@@ -3856,7 +3858,7 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 			dev->ibd_failover = 1;
 			write_unlock_irqrestore(glock, flags);
 
-			rc = kiblnd_dev_failover(dev);
+			rc = kiblnd_dev_failover(dev, ns);
 
 			write_lock_irqsave(glock, flags);
 
diff --git a/net/lnet/klnds/socklnd/socklnd.c b/net/lnet/klnds/socklnd/socklnd.c
index 0f5c7fc..78f6c7e 100644
--- a/net/lnet/klnds/socklnd/socklnd.c
+++ b/net/lnet/klnds/socklnd/socklnd.c
@@ -2718,7 +2718,7 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 		net_tunables->lct_peer_rtr_credits =
 			*ksocknal_tunables.ksnd_peerrtrcredits;
 
-	rc = lnet_inet_enumerate(&ifaces);
+	rc = lnet_inet_enumerate(&ifaces, ni->ni_net_ns);
 	if (rc < 0)
 		goto fail_1;
 
diff --git a/net/lnet/klnds/socklnd/socklnd_cb.c b/net/lnet/klnds/socklnd/socklnd_cb.c
index 581f734..0132727 100644
--- a/net/lnet/klnds/socklnd/socklnd_cb.c
+++ b/net/lnet/klnds/socklnd/socklnd_cb.c
@@ -1871,7 +1871,8 @@ void ksocknal_write_callback(struct ksock_conn *conn)
 
 		rc = lnet_connect(&sock, peer_ni->ksnp_id.nid,
 				  route->ksnr_myipaddr,
-				  route->ksnr_ipaddr, route->ksnr_port);
+				  route->ksnr_ipaddr, route->ksnr_port,
+				  peer_ni->ksnp_ni->ni_net_ns);
 		if (rc)
 			goto failed;
 
diff --git a/net/lnet/lnet/acceptor.c b/net/lnet/lnet/acceptor.c
index 1854347..23b5bf0 100644
--- a/net/lnet/lnet/acceptor.c
+++ b/net/lnet/lnet/acceptor.c
@@ -44,6 +44,7 @@
 	int			pta_shutdown;
 	struct socket		*pta_sock;
 	struct completion	pta_signal;
+	struct net		*pta_ns;
 } lnet_acceptor_state = {
 	.pta_shutdown = 1
 };
@@ -142,7 +143,7 @@
 
 int
 lnet_connect(struct socket **sockp, lnet_nid_t peer_nid,
-	     u32 local_ip, u32 peer_ip, int peer_port)
+	     u32 local_ip, u32 peer_ip, int peer_port, struct net *ns)
 {
 	struct lnet_acceptor_connreq cr;
 	struct socket *sock;
@@ -158,7 +159,7 @@
 		/* Iterate through reserved ports. */
 
 		rc = lnet_sock_connect(&sock, &fatal, local_ip, port, peer_ip,
-				       peer_port);
+				       peer_port, ns);
 		if (rc) {
 			if (fatal)
 				goto failed;
@@ -335,8 +336,9 @@
 
 	LASSERT(!lnet_acceptor_state.pta_sock);
 
-	rc = lnet_sock_listen(&lnet_acceptor_state.pta_sock, 0, accept_port,
-			      accept_backlog);
+	rc = lnet_sock_listen(&lnet_acceptor_state.pta_sock,
+			      0, accept_port, accept_backlog,
+			      lnet_acceptor_state.pta_ns);
 	if (rc) {
 		if (rc == -EADDRINUSE)
 			LCONSOLE_ERROR_MSG(0x122, "Can't start acceptor on port %d: port already in use\n",
@@ -457,6 +459,7 @@
 	if (!lnet_count_acceptor_nets())  /* not required */
 		return 0;
 
+	lnet_acceptor_state.pta_ns = current->nsproxy->net_ns;
 	task = kthread_run(lnet_acceptor, (void *)(uintptr_t)secure,
 			   "acceptor_%03ld", secure);
 	if (IS_ERR(task)) {
diff --git a/net/lnet/lnet/config.c b/net/lnet/lnet/config.c
index a2a9c79..2c8edcd 100644
--- a/net/lnet/lnet/config.c
+++ b/net/lnet/lnet/config.c
@@ -1563,7 +1563,7 @@ struct lnet_ni *
 	return count;
 }
 
-int lnet_inet_enumerate(struct lnet_inetdev **dev_list)
+int lnet_inet_enumerate(struct lnet_inetdev **dev_list, struct net *ns)
 {
 	struct lnet_inetdev *ifaces = NULL;
 	struct net_device *dev;
@@ -1571,7 +1571,7 @@ int lnet_inet_enumerate(struct lnet_inetdev **dev_list)
 	int nip = 0;
 
 	rtnl_lock();
-	for_each_netdev(&init_net, dev) {
+	for_each_netdev(ns, dev) {
 		int flags = dev_get_flags(dev);
 		const struct in_ifaddr *ifa;
 		struct in_device *in_dev;
@@ -1642,7 +1642,7 @@ int lnet_inet_enumerate(struct lnet_inetdev **dev_list)
 	int rc;
 	int i;
 
-	nip = lnet_inet_enumerate(&ifaces);
+	nip = lnet_inet_enumerate(&ifaces, current->nsproxy->net_ns);
 	if (nip < 0) {
 		if (nip != -ENOENT) {
 			LCONSOLE_ERROR_MSG(0x117,
diff --git a/net/lnet/lnet/lib-socket.c b/net/lnet/lnet/lib-socket.c
index d430d6f..046bd2d 100644
--- a/net/lnet/lnet/lib-socket.c
+++ b/net/lnet/lnet/lib-socket.c
@@ -156,7 +156,7 @@
 
 static int
 lnet_sock_create(struct socket **sockp, int *fatal, u32 local_ip,
-		 int local_port)
+		 int local_port, struct net *ns)
 {
 	struct sockaddr_in locaddr;
 	struct socket *sock;
@@ -166,7 +166,7 @@
 	/* All errors are fatal except bind failure if the port is in use */
 	*fatal = 1;
 
-	rc = sock_create_kern(&init_net, PF_INET, SOCK_STREAM, 0, &sock);
+	rc = sock_create_kern(ns, PF_INET, SOCK_STREAM, 0, &sock);
 	*sockp = sock;
 	if (rc) {
 		CERROR("Can't create socket: %d\n", rc);
@@ -282,12 +282,12 @@
 
 int
 lnet_sock_listen(struct socket **sockp, u32 local_ip, int local_port,
-		 int backlog)
+		 int backlog, struct net *ns)
 {
 	int fatal;
 	int rc;
 
-	rc = lnet_sock_create(sockp, &fatal, local_ip, local_port);
+	rc = lnet_sock_create(sockp, &fatal, local_ip, local_port, ns);
 	if (rc) {
 		if (!fatal)
 			CERROR("Can't create socket: port %d already in use\n",
@@ -347,12 +347,13 @@
 
 int
 lnet_sock_connect(struct socket **sockp, int *fatal, u32 local_ip,
-		  int local_port, u32 peer_ip, int peer_port)
+		  int local_port, u32 peer_ip, int peer_port,
+		  struct net *ns)
 {
 	struct sockaddr_in srvaddr;
 	int rc;
 
-	rc = lnet_sock_create(sockp, fatal, local_ip, local_port);
+	rc = lnet_sock_create(sockp, fatal, local_ip, local_port, ns);
 	if (rc)
 		return rc;
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 426/622] lustre: obdclass: 0-nlink race in lu_object_find_at()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (424 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 425/622] lnet: support non-default network namespace James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 427/622] lustre: osc: reserve lru pages for read in batch James Simmons
                   ` (196 subsequent siblings)
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

There is a race in lu_object_find_at(): in the gap between
lu_object_alloc() and hash insertion, another thread may
have allocated another object for the same file and unlinked
it, so we may get an object with 0-nlink, which triggers an
assertion in osd_object_release().

To avoid this race, initialize the object after hash insertion.
This may cause an uninitialized object to be found in the cache;
if so, wait for the object to be initialized by the allocator.
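The "insert first, initialize after" scheme can be modelled in plain C
with a condition variable. The names below (struct obj, obj_start(),
obj_find_wait()) are invented and only mirror the roles of
lu_object_start(), LU_OBJECT_INITED, LU_OBJECT_HEARD_BANSHEE and the
per-bucket lsb_waitq in the patch; this is a sketch of the idea, not
the kernel implementation:

```c
#include <pthread.h>
#include <stdbool.h>

/* Minimal model: a looked-up object may not be initialized yet, so
 * the finder waits until the allocator either finishes initializing
 * it or marks it dying. */
struct obj {
	pthread_mutex_t lock;
	pthread_cond_t  waitq;   /* models lsb_waitq */
	bool inited;             /* models LU_OBJECT_INITED */
	bool dying;              /* models LU_OBJECT_HEARD_BANSHEE */
};

/* Allocator side: initialize after (hash) insertion, then wake waiters.
 * A failed init marks the object dying so waiters can bail out. */
static void obj_start(struct obj *o, bool ok)
{
	pthread_mutex_lock(&o->lock);
	if (ok)
		o->inited = true;
	else
		o->dying = true;
	pthread_cond_broadcast(&o->waitq);
	pthread_mutex_unlock(&o->lock);
}

/* Finder side: returns 0 if a usable object was found, -1 if it died. */
static int obj_find_wait(struct obj *o)
{
	int rc;

	pthread_mutex_lock(&o->lock);
	while (!o->inited && !o->dying)
		pthread_cond_wait(&o->waitq, &o->lock);
	rc = o->dying ? -1 : 0;
	pthread_mutex_unlock(&o->lock);
	return rc;
}
```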

To reproduce the race, introduce cfs_race_wait() and
cfs_race_wakeup(): cfs_race_wait() causes the calling thread to
wait on the race, while cfs_race_wakeup() wakes up the waiting
thread. As with cfs_race(), CFS_FAIL_ONCE should be set together
with fail_loc.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12485
Lustre-commit: 2ff420913b97 ("LU-12485 obdclass: 0-nlink race in lu_object_find_at()")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35360
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lu_object.h      |  15 ++++-
 fs/lustre/include/obd_support.h    |   1 +
 fs/lustre/obdclass/lu_object.c     | 127 ++++++++++++++++++++++++++++---------
 include/linux/libcfs/libcfs_fail.h |  40 +++++++++++-
 4 files changed, 151 insertions(+), 32 deletions(-)

diff --git a/fs/lustre/include/lu_object.h b/fs/lustre/include/lu_object.h
index d2e84a3..1c1a60f 100644
--- a/fs/lustre/include/lu_object.h
+++ b/fs/lustre/include/lu_object.h
@@ -461,7 +461,12 @@ enum lu_object_header_flags {
 	/**
 	 * Mark this object has already been taken out of cache.
 	 */
-	LU_OBJECT_UNHASHED = 1,
+	LU_OBJECT_UNHASHED	= 1,
+	/**
+	 * Object is initialized. When an object is found in the cache, it
+	 * may not be initialized yet; the object allocator will initialize it.
+	 */
+	LU_OBJECT_INITED	= 2
 };
 
 enum lu_object_header_attr {
@@ -656,6 +661,14 @@ static inline int lu_object_is_dying(const struct lu_object_header *h)
 	return test_bit(LU_OBJECT_HEARD_BANSHEE, &h->loh_flags);
 }
 
+/**
+ * Return true if object is initialized.
+ */
+static inline int lu_object_is_inited(const struct lu_object_header *h)
+{
+	return test_bit(LU_OBJECT_INITED, &h->loh_flags);
+}
+
 void lu_object_put(const struct lu_env *env, struct lu_object *o);
 void lu_object_unhash(const struct lu_env *env, struct lu_object *o);
 int lu_site_purge_objects(const struct lu_env *env, struct lu_site *s, int nr,
diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index c66b61a..506535b 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -371,6 +371,7 @@
 #define OBD_FAIL_OBD_IDX_READ_BREAK			0x608
 #define OBD_FAIL_OBD_NO_LRU				0x609
 #define OBD_FAIL_OBDCLASS_MODULE_LOAD			0x60a
+#define OBD_FAIL_OBD_ZERO_NLINK_RACE			0x60b
 
 #define OBD_FAIL_TGT_REPLY_NET				0x700
 #define OBD_FAIL_TGT_CONN_RACE				0x701
diff --git a/fs/lustre/obdclass/lu_object.c b/fs/lustre/obdclass/lu_object.c
index d8bff3f..6fea1f3 100644
--- a/fs/lustre/obdclass/lu_object.c
+++ b/fs/lustre/obdclass/lu_object.c
@@ -67,13 +67,14 @@ struct lu_site_bkt_data {
 	struct list_head		lsb_lru;
 	/**
 	 * Wait-queue signaled when an object in this site is ultimately
-	 * destroyed (lu_object_free()). It is used by lu_object_find() to
-	 * wait before re-trying when object in the process of destruction is
-	 * found in the hash table.
+	 * destroyed (lu_object_free()) or initialized (lu_object_start()).
+	 * It is used by lu_object_find() to wait before re-trying when
+	 * object in the process of destruction is found in the hash table,
+	 * or to wait for the object to be initialized by the allocator.
 	 *
 	 * \see htable_lookup().
 	 */
-	wait_queue_head_t		lsb_marche_funebre;
+	wait_queue_head_t		lsb_waitq;
 };
 
 enum {
@@ -116,7 +117,7 @@ enum {
 
 	cfs_hash_bd_get(site->ls_obj_hash, fid, &bd);
 	bkt = cfs_hash_bd_extra_get(site->ls_obj_hash, &bd);
-	return &bkt->lsb_marche_funebre;
+	return &bkt->lsb_waitq;
 }
 EXPORT_SYMBOL(lu_site_wq_from_fid);
 
@@ -168,7 +169,7 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 			 * somebody may be waiting for this, currently only
 			 * used for cl_object, see cl_object_put_last().
 			 */
-			wake_up_all(&bkt->lsb_marche_funebre);
+			wake_up_all(&bkt->lsb_waitq);
 		}
 		return;
 	}
@@ -255,16 +256,9 @@ void lu_object_unhash(const struct lu_env *env, struct lu_object *o)
  */
 static struct lu_object *lu_object_alloc(const struct lu_env *env,
 					 struct lu_device *dev,
-					 const struct lu_fid *f,
-					 const struct lu_object_conf *conf)
+					 const struct lu_fid *f)
 {
-	struct lu_object *scan;
 	struct lu_object *top;
-	struct list_head *layers;
-	unsigned int init_mask = 0;
-	unsigned int init_flag;
-	int clean;
-	int result;
 
 	/*
 	 * Create top-level object slice. This will also create
@@ -280,6 +274,27 @@ static struct lu_object *lu_object_alloc(const struct lu_env *env,
 	 * after this point.
 	 */
 	top->lo_header->loh_fid = *f;
+
+	return top;
+}
+
+/**
+ * Initialize object.
+ *
+ * This is called after object hash insertion to avoid returning an object with
+ * stale attributes.
+ */
+static int lu_object_start(const struct lu_env *env, struct lu_device *dev,
+			   struct lu_object *top,
+			   const struct lu_object_conf *conf)
+{
+	struct lu_object *scan;
+	struct list_head *layers;
+	unsigned int init_mask = 0;
+	unsigned int init_flag;
+	int clean;
+	int result;
+
 	layers = &top->lo_header->loh_layers;
 
 	do {
@@ -295,10 +310,9 @@ static struct lu_object *lu_object_alloc(const struct lu_env *env,
 			clean = 0;
 			scan->lo_header = top->lo_header;
 			result = scan->lo_ops->loo_object_init(env, scan, conf);
-			if (result != 0) {
-				lu_object_free(env, top);
-				return ERR_PTR(result);
-			}
+			if (result)
+				return result;
+
 			init_mask |= init_flag;
 next:
 			init_flag <<= 1;
@@ -308,15 +322,16 @@ static struct lu_object *lu_object_alloc(const struct lu_env *env,
 	list_for_each_entry_reverse(scan, layers, lo_linkage) {
 		if (scan->lo_ops->loo_object_start) {
 			result = scan->lo_ops->loo_object_start(env, scan);
-			if (result != 0) {
-				lu_object_free(env, top);
-				return ERR_PTR(result);
-			}
+			if (result)
+				return result;
 		}
 	}
 
 	lprocfs_counter_incr(dev->ld_site->ls_stats, LU_SS_CREATED);
-	return top;
+
+	set_bit(LU_OBJECT_INITED, &top->lo_header->loh_flags);
+
+	return 0;
 }
 
 /**
@@ -598,7 +613,6 @@ static struct lu_object *htable_lookup(struct lu_site *s,
 				       const struct lu_fid *f,
 				       u64 *version)
 {
-	struct lu_site_bkt_data *bkt;
 	struct lu_object_header *h;
 	struct hlist_node *hnode;
 	u64 ver = cfs_hash_bd_version_get(bd);
@@ -607,7 +621,6 @@ static struct lu_object *htable_lookup(struct lu_site *s,
 		return ERR_PTR(-ENOENT);
 
 	*version = ver;
-	bkt = cfs_hash_bd_extra_get(s->ls_obj_hash, bd);
 	/* cfs_hash_bd_peek_locked is a somehow "internal" function
 	 * of cfs_hash, it doesn't add refcount on object.
 	 */
@@ -681,7 +694,9 @@ struct lu_object *lu_object_find_at(const struct lu_env *env,
 	struct lu_site *s;
 	struct cfs_hash	*hs;
 	struct cfs_hash_bd bd;
+	struct lu_site_bkt_data *bkt;
 	u64 version = 0;
+	int rc;
 
 	/*
 	 * This uses standard index maintenance protocol:
@@ -703,26 +718,50 @@ struct lu_object *lu_object_find_at(const struct lu_env *env,
 	 */
 	s  = dev->ld_site;
 	hs = s->ls_obj_hash;
+	if (unlikely(OBD_FAIL_PRECHECK(OBD_FAIL_OBD_ZERO_NLINK_RACE)))
+		lu_site_purge(env, s, -1);
 
 	cfs_hash_bd_get(hs, f, &bd);
+	bkt = cfs_hash_bd_extra_get(s->ls_obj_hash, &bd);
 	if (!(conf && conf->loc_flags & LOC_F_NEW)) {
 		cfs_hash_bd_lock(hs, &bd, 1);
 		o = htable_lookup(s, &bd, f, &version);
 		cfs_hash_bd_unlock(hs, &bd, 1);
 
-		if (!IS_ERR(o) || PTR_ERR(o) != -ENOENT)
+		if (!IS_ERR(o)) {
+			if (likely(lu_object_is_inited(o->lo_header)))
+				return o;
+
+			wait_event_idle(bkt->lsb_waitq,
+					lu_object_is_inited(o->lo_header) ||
+					lu_object_is_dying(o->lo_header));
+
+			if (lu_object_is_dying(o->lo_header)) {
+				lu_object_put(env, o);
+
+				return ERR_PTR(-ENOENT);
+			}
+
+			return o;
+		}
+
+		if (PTR_ERR(o) != -ENOENT)
 			return o;
 	}
+
 	/*
-	 * Allocate new object. This may result in rather complicated
-	 * operations, including fld queries, inode loading, etc.
+	 * Allocate new object. NB: the object is left uninitialized so
+	 * that, if the file is changed between allocation and hash
+	 * insertion, an object with stale attributes is not returned.
 	 */
-	o = lu_object_alloc(env, dev, f, conf);
+	o = lu_object_alloc(env, dev, f);
 	if (IS_ERR(o))
 		return o;
 
 	LASSERT(lu_fid_eq(lu_object_fid(o), f));
 
+	CFS_RACE_WAIT(OBD_FAIL_OBD_ZERO_NLINK_RACE);
+
 	cfs_hash_bd_lock(hs, &bd, 1);
 
 	if (conf && conf->loc_flags & LOC_F_NEW)
@@ -733,6 +772,20 @@ struct lu_object *lu_object_find_at(const struct lu_env *env,
 		cfs_hash_bd_add_locked(hs, &bd, &o->lo_header->loh_hash);
 		cfs_hash_bd_unlock(hs, &bd, 1);
 
+		/*
+		 * This may result in rather complicated operations, including
+		 * fld queries, inode loading, etc.
+		 */
+		rc = lu_object_start(env, dev, o, conf);
+		if (rc) {
+			set_bit(LU_OBJECT_HEARD_BANSHEE,
+				&o->lo_header->loh_flags);
+			lu_object_put(env, o);
+			return ERR_PTR(rc);
+		}
+
+		wake_up_all(&bkt->lsb_waitq);
+
 		lu_object_limit(env, dev);
 
 		return o;
@@ -741,6 +794,20 @@ struct lu_object *lu_object_find_at(const struct lu_env *env,
 	lprocfs_counter_incr(s->ls_stats, LU_SS_CACHE_RACE);
 	cfs_hash_bd_unlock(hs, &bd, 1);
 	lu_object_free(env, o);
+
+	if (!(conf && conf->loc_flags & LOC_F_NEW) &&
+	    !lu_object_is_inited(shadow->lo_header)) {
+		wait_event_idle(bkt->lsb_waitq,
+				lu_object_is_inited(shadow->lo_header) ||
+				lu_object_is_dying(shadow->lo_header));
+
+		if (lu_object_is_dying(shadow->lo_header)) {
+			lu_object_put(env, shadow);
+
+			return ERR_PTR(-ENOENT);
+		}
+	}
+
 	return shadow;
 }
 EXPORT_SYMBOL(lu_object_find_at);
@@ -998,7 +1065,7 @@ int lu_site_init(struct lu_site *s, struct lu_device *top)
 	cfs_hash_for_each_bucket(s->ls_obj_hash, &bd, i) {
 		bkt = cfs_hash_bd_extra_get(s->ls_obj_hash, &bd);
 		INIT_LIST_HEAD(&bkt->lsb_lru);
-		init_waitqueue_head(&bkt->lsb_marche_funebre);
+		init_waitqueue_head(&bkt->lsb_waitq);
 	}
 
 	s->ls_stats = lprocfs_alloc_stats(LU_SS_LAST_STAT, 0);
diff --git a/include/linux/libcfs/libcfs_fail.h b/include/linux/libcfs/libcfs_fail.h
index c341567..45166c5 100644
--- a/include/linux/libcfs/libcfs_fail.h
+++ b/include/linux/libcfs/libcfs_fail.h
@@ -187,7 +187,7 @@ static inline void cfs_race(u32 id)
 			CERROR("cfs_race id %x sleeping\n", id);
 			rc = wait_event_interruptible(cfs_race_waitq,
 						      !!cfs_race_state);
-			CERROR("cfs_fail_race id %x awake, rc=%d\n", id, rc);
+			CERROR("cfs_fail_race id %x awake: rc=%d\n", id, rc);
 		} else {
 			CERROR("cfs_fail_race id %x waking\n", id);
 			cfs_race_state = 1;
@@ -198,4 +198,42 @@ static inline void cfs_race(u32 id)
 
 #define CFS_RACE(id) cfs_race(id)
 
+/**
+ * Wait on race.
+ *
+ * The first thread that calls this with a matching fail_loc is put to sleep,
+ * but subsequent callers of this won't sleep. When another thread calls
+ * cfs_race_wakeup(), the first thread will be woken up and continue.
+ */
+static inline void cfs_race_wait(u32 id)
+{
+	if (CFS_FAIL_PRECHECK(id)) {
+		if (unlikely(__cfs_fail_check_set(id, 0, CFS_FAIL_LOC_NOSET))) {
+			int rc;
+
+			cfs_race_state = 0;
+			CERROR("cfs_race id %x sleeping\n", id);
+			rc = wait_event_interruptible(cfs_race_waitq,
+						      cfs_race_state != 0);
+			CERROR("cfs_fail_race id %x awake: rc=%d\n", id, rc);
+		}
+	}
+}
+#define CFS_RACE_WAIT(id) cfs_race_wait(id)
+
+/**
+ * Wake up the thread that is waiting on the matching fail_loc.
+ */
+static inline void cfs_race_wakeup(u32 id)
+{
+	if (CFS_FAIL_PRECHECK(id)) {
+		if (likely(!__cfs_fail_check_set(id, 0, CFS_FAIL_LOC_NOSET))) {
+			CERROR("cfs_fail_race id %x waking\n", id);
+			cfs_race_state = 1;
+			wake_up(&cfs_race_waitq);
+		}
+	}
+}
+#define CFS_RACE_WAKEUP(id) cfs_race_wakeup(id)
+
 #endif /* _LIBCFS_FAIL_H */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 427/622] lustre: osc: reserve lru pages for read in batch
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (425 preceding siblings ...)
  2020-02-27 21:14 ` [lustre-devel] [PATCH 426/622] lustre: obdclass: 0-nlink race in lu_object_find_at() James Simmons
@ 2020-02-27 21:14 ` James Simmons
  2020-02-27 21:14 ` [lustre-devel] [PATCH 428/622] lustre: uapi: Make lustre_user.h c++-legal James Simmons
                   ` (195 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

The benefit of doing this is to reduce contention on the
atomic counter cl_lru_left by changing it from per-page
access to per-IO access.

This optimization was already done for writes; do it for
reads too.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12520
Lustre-commit: 0692dadfba87 ("LU-12520 osc: reserve lru pages for read in batch")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/35440
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Li Dongyang <dongyangli@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_osc.h |  6 +++---
 fs/lustre/mdc/mdc_dev.c        |  8 ++++----
 fs/lustre/osc/osc_io.c         | 20 ++++++++++----------
 3 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/lustre/include/lustre_osc.h b/fs/lustre/include/lustre_osc.h
index 1c5af80..37e56ef 100644
--- a/fs/lustre/include/lustre_osc.h
+++ b/fs/lustre/include/lustre_osc.h
@@ -685,9 +685,9 @@ int osc_io_commit_async(const struct lu_env *env,
 int osc_io_iter_init(const struct lu_env *env, const struct cl_io_slice *ios);
 void osc_io_iter_fini(const struct lu_env *env,
 		      const struct cl_io_slice *ios);
-int osc_io_write_iter_init(const struct lu_env *env,
-			   const struct cl_io_slice *ios);
-void osc_io_write_iter_fini(const struct lu_env *env,
+int osc_io_rw_iter_init(const struct lu_env *env,
+			const struct cl_io_slice *ios);
+void osc_io_rw_iter_fini(const struct lu_env *env,
 			    const struct cl_io_slice *ios);
 int osc_io_fault_start(const struct lu_env *env, const struct cl_io_slice *ios);
 void osc_io_setattr_end(const struct lu_env *env,
diff --git a/fs/lustre/mdc/mdc_dev.c b/fs/lustre/mdc/mdc_dev.c
index df8bb33..b49509c 100644
--- a/fs/lustre/mdc/mdc_dev.c
+++ b/fs/lustre/mdc/mdc_dev.c
@@ -1257,13 +1257,13 @@ static void mdc_io_data_version_end(const struct lu_env *env,
 static struct cl_io_operations mdc_io_ops = {
 	.op = {
 		[CIT_READ] = {
-			.cio_iter_init	= osc_io_iter_init,
-			.cio_iter_fini	= osc_io_iter_fini,
+			.cio_iter_init	= osc_io_rw_iter_init,
+			.cio_iter_fini	= osc_io_rw_iter_fini,
 			.cio_start	= osc_io_read_start,
 		},
 		[CIT_WRITE] = {
-			.cio_iter_init	= osc_io_write_iter_init,
-			.cio_iter_fini	= osc_io_write_iter_fini,
+			.cio_iter_init	= osc_io_rw_iter_init,
+			.cio_iter_fini	= osc_io_rw_iter_fini,
 			.cio_start	= osc_io_write_start,
 			.cio_end	= osc_io_end,
 		},
diff --git a/fs/lustre/osc/osc_io.c b/fs/lustre/osc/osc_io.c
index dfdf064..4f46b95 100644
--- a/fs/lustre/osc/osc_io.c
+++ b/fs/lustre/osc/osc_io.c
@@ -375,8 +375,8 @@ int osc_io_iter_init(const struct lu_env *env, const struct cl_io_slice *ios)
 }
 EXPORT_SYMBOL(osc_io_iter_init);
 
-int osc_io_write_iter_init(const struct lu_env *env,
-			   const struct cl_io_slice *ios)
+int osc_io_rw_iter_init(const struct lu_env *env,
+			const struct cl_io_slice *ios)
 {
 	struct cl_io *io = ios->cis_io;
 	struct osc_io *oio = osc_env_io(env);
@@ -394,7 +394,7 @@ int osc_io_write_iter_init(const struct lu_env *env,
 
 	return osc_io_iter_init(env, ios);
 }
-EXPORT_SYMBOL(osc_io_write_iter_init);
+EXPORT_SYMBOL(osc_io_rw_iter_init);
 
 void osc_io_iter_fini(const struct lu_env *env,
 		      const struct cl_io_slice *ios)
@@ -412,8 +412,8 @@ void osc_io_iter_fini(const struct lu_env *env,
 }
 EXPORT_SYMBOL(osc_io_iter_fini);
 
-void osc_io_write_iter_fini(const struct lu_env *env,
-			    const struct cl_io_slice *ios)
+void osc_io_rw_iter_fini(const struct lu_env *env,
+			 const struct cl_io_slice *ios)
 {
 	struct osc_io *oio = osc_env_io(env);
 	struct osc_object *osc = cl2osc(ios->cis_obj);
@@ -426,7 +426,7 @@ void osc_io_write_iter_fini(const struct lu_env *env,
 
 	osc_io_iter_fini(env, ios);
 }
-EXPORT_SYMBOL(osc_io_write_iter_fini);
+EXPORT_SYMBOL(osc_io_rw_iter_fini);
 
 int osc_io_fault_start(const struct lu_env *env, const struct cl_io_slice *ios)
 {
@@ -970,14 +970,14 @@ void osc_io_end(const struct lu_env *env, const struct cl_io_slice *slice)
 static const struct cl_io_operations osc_io_ops = {
 	.op = {
 		[CIT_READ] = {
-			.cio_iter_init	= osc_io_iter_init,
-			.cio_iter_fini	= osc_io_iter_fini,
+			.cio_iter_init	= osc_io_rw_iter_init,
+			.cio_iter_fini	= osc_io_rw_iter_fini,
 			.cio_start	= osc_io_read_start,
 			.cio_fini	= osc_io_fini
 		},
 		[CIT_WRITE] = {
-			.cio_iter_init	= osc_io_write_iter_init,
-			.cio_iter_fini	= osc_io_write_iter_fini,
+			.cio_iter_init	= osc_io_rw_iter_init,
+			.cio_iter_fini	= osc_io_rw_iter_fini,
 			.cio_start	= osc_io_write_start,
 			.cio_end	= osc_io_end,
 			.cio_fini	= osc_io_fini
-- 
1.8.3.1


* [lustre-devel] [PATCH 428/622] lustre: uapi: Make lustre_user.h c++-legal
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Rob Latham <robl@mcs.anl.gov>

Recent C++ compilers did not like some of the C idioms used in this
header:
  - C++ checks the types of enums more forcefully than C does
  - signed vs unsigned comparisons will generate a warning under g++
  - "invalid suffix on literal" warning: Lustre is not trying to
    generate a new literal identifier.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12527
Lustre-commit: 14b11dc3526a ("LU-12527 utils: Make lustre_user.h c++-legal")
Signed-off-by: Rob Latham <robl@mcs.anl.gov>
Reviewed-on: https://review.whamcloud.com/35471
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/uapi/linux/lustre/lustre_user.h | 139 +++++++++++++++++++-------------
 1 file changed, 84 insertions(+), 55 deletions(-)

diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index db36ce5..3016b73 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -885,7 +885,7 @@ static inline void obd_uuid2fsname(char *buf, char *uuid, int buflen)
 
 #define ALLQUOTA 255	/* set all quota */
 
-static inline char *qtype_name(int qtype)
+static inline const char *qtype_name(int qtype)
 {
 	switch (qtype) {
 	case USRQUOTA:
@@ -1206,7 +1206,8 @@ static inline enum hsm_event hsm_get_cl_event(__u16 flags)
 static inline void hsm_set_cl_event(enum changelog_rec_flags *clf_flags,
 				    enum hsm_event he)
 {
-	*clf_flags |= (he << CLF_HSM_EVENT_L);
+	*clf_flags = (enum changelog_rec_flags)
+		(*clf_flags | (he << CLF_HSM_EVENT_L));
 }
 
 static inline __u16 hsm_get_cl_flags(enum changelog_rec_flags clf_flags)
@@ -1217,7 +1218,8 @@ static inline __u16 hsm_get_cl_flags(enum changelog_rec_flags clf_flags)
 static inline void hsm_set_cl_flags(enum changelog_rec_flags *clf_flags,
 				    unsigned int bits)
 {
-	*clf_flags |= (bits << CLF_HSM_FLAG_L);
+	*clf_flags = (enum changelog_rec_flags)
+		(*clf_flags | (bits << CLF_HSM_FLAG_L));
 }
 
 static inline int hsm_get_cl_error(enum changelog_rec_flags clf_flags)
@@ -1228,7 +1230,8 @@ static inline int hsm_get_cl_error(enum changelog_rec_flags clf_flags)
 static inline void hsm_set_cl_error(enum changelog_rec_flags *clf_flags,
 				    unsigned int error)
 {
-	*clf_flags |= (error << CLF_HSM_ERR_L);
+	*clf_flags = (enum changelog_rec_flags)
+		(*clf_flags | (error << CLF_HSM_ERR_L));
 }
 
 enum changelog_rec_extra_flags {
@@ -1370,9 +1373,11 @@ static inline size_t changelog_rec_size(struct changelog_rec *rec)
 	enum changelog_rec_extra_flags cref = CLFE_INVALID;
 
 	if (rec->cr_flags & CLF_EXTRA_FLAGS)
-		cref = changelog_rec_extra_flags(rec)->cr_extra_flags;
+		cref = (enum changelog_rec_extra_flags)
+		       changelog_rec_extra_flags(rec)->cr_extra_flags;
 
-	return changelog_rec_offset(rec->cr_flags, cref);
+	return changelog_rec_offset((enum changelog_rec_flags)rec->cr_flags,
+				    cref);
 }
 
 static inline size_t changelog_rec_varsize(struct changelog_rec *rec)
@@ -1383,7 +1388,8 @@ static inline size_t changelog_rec_varsize(struct changelog_rec *rec)
 static inline
 struct changelog_ext_rename *changelog_rec_rename(struct changelog_rec *rec)
 {
-	enum changelog_rec_flags crf = rec->cr_flags & CLF_VERSION;
+	enum changelog_rec_flags crf = (enum changelog_rec_flags)
+				       (rec->cr_flags & CLF_VERSION);
 
 	return (struct changelog_ext_rename *)((char *)rec +
 					       changelog_rec_offset(crf,
@@ -1394,8 +1400,8 @@ struct changelog_ext_rename *changelog_rec_rename(struct changelog_rec *rec)
 static inline
 struct changelog_ext_jobid *changelog_rec_jobid(struct changelog_rec *rec)
 {
-	enum changelog_rec_flags crf = rec->cr_flags &
-				       (CLF_VERSION | CLF_RENAME);
+	enum changelog_rec_flags crf = (enum changelog_rec_flags)
+				       (rec->cr_flags & (CLF_VERSION | CLF_RENAME));
 
 	return (struct changelog_ext_jobid *)((char *)rec +
 					      changelog_rec_offset(crf,
@@ -1407,8 +1413,8 @@ struct changelog_ext_jobid *changelog_rec_jobid(struct changelog_rec *rec)
 struct changelog_ext_extra_flags *changelog_rec_extra_flags(
 	const struct changelog_rec *rec)
 {
-	enum changelog_rec_flags crf = rec->cr_flags &
-		(CLF_VERSION | CLF_RENAME | CLF_JOBID);
+	enum changelog_rec_flags crf = (enum changelog_rec_flags)
+		(rec->cr_flags & (CLF_VERSION | CLF_RENAME | CLF_JOBID));
 
 	return (struct changelog_ext_extra_flags *)((char *)rec +
 						 changelog_rec_offset(crf,
@@ -1420,8 +1426,9 @@ struct changelog_ext_extra_flags *changelog_rec_extra_flags(
 struct changelog_ext_uidgid *changelog_rec_uidgid(
 	const struct changelog_rec *rec)
 {
-	enum changelog_rec_flags crf = rec->cr_flags &
-		(CLF_VERSION | CLF_RENAME | CLF_JOBID | CLF_EXTRA_FLAGS);
+	enum changelog_rec_flags crf = (enum changelog_rec_flags)
+		(rec->cr_flags &
+		 (CLF_VERSION | CLF_RENAME | CLF_JOBID | CLF_EXTRA_FLAGS));
 
 	return (struct changelog_ext_uidgid *)((char *)rec +
 					       changelog_rec_offset(crf,
@@ -1432,13 +1439,15 @@ struct changelog_ext_uidgid *changelog_rec_uidgid(
 static inline
 struct changelog_ext_nid *changelog_rec_nid(const struct changelog_rec *rec)
 {
-	enum changelog_rec_flags crf = rec->cr_flags &
-		(CLF_VERSION | CLF_RENAME | CLF_JOBID | CLF_EXTRA_FLAGS);
+	enum changelog_rec_flags crf = (enum changelog_rec_flags)
+		(rec->cr_flags &
+		 (CLF_VERSION | CLF_RENAME | CLF_JOBID | CLF_EXTRA_FLAGS));
 	enum changelog_rec_extra_flags cref = CLFE_INVALID;
 
 	if (rec->cr_flags & CLF_EXTRA_FLAGS)
-		cref = changelog_rec_extra_flags(rec)->cr_extra_flags &
-		       CLFE_UIDGID;
+		cref = (enum changelog_rec_extra_flags)
+			(changelog_rec_extra_flags(rec)->cr_extra_flags &
+			 CLFE_UIDGID);
 
 	return (struct changelog_ext_nid *)((char *)rec +
 					    changelog_rec_offset(crf, cref));
@@ -1449,13 +1458,16 @@ struct changelog_ext_nid *changelog_rec_nid(const struct changelog_rec *rec)
 struct changelog_ext_openmode *changelog_rec_openmode(
 	const struct changelog_rec *rec)
 {
-	enum changelog_rec_flags crf = rec->cr_flags &
-		(CLF_VERSION | CLF_RENAME | CLF_JOBID | CLF_EXTRA_FLAGS);
+	enum changelog_rec_flags crf = (enum changelog_rec_flags)
+		(rec->cr_flags &
+		 (CLF_VERSION | CLF_RENAME | CLF_JOBID | CLF_EXTRA_FLAGS));
 	enum changelog_rec_extra_flags cref = CLFE_INVALID;
 
-	if (rec->cr_flags & CLF_EXTRA_FLAGS)
-		cref = changelog_rec_extra_flags(rec)->cr_extra_flags &
-		       (CLFE_UIDGID | CLFE_NID);
+	if (rec->cr_flags & CLF_EXTRA_FLAGS) {
+		cref = (enum changelog_rec_extra_flags)
+			(changelog_rec_extra_flags(rec)->cr_extra_flags &
+			 (CLFE_UIDGID | CLFE_NID));
+	}
 
 	return (struct changelog_ext_openmode *)((char *)rec +
 						 changelog_rec_offset(crf, cref));
@@ -1466,13 +1478,15 @@ struct changelog_ext_openmode *changelog_rec_openmode(
 struct changelog_ext_xattr *changelog_rec_xattr(
 	const struct changelog_rec *rec)
 {
-	enum changelog_rec_flags crf = rec->cr_flags &
-		(CLF_VERSION | CLF_RENAME | CLF_JOBID | CLF_EXTRA_FLAGS);
+	enum changelog_rec_flags crf = (enum changelog_rec_flags)
+		(rec->cr_flags &
+		 (CLF_VERSION | CLF_RENAME | CLF_JOBID | CLF_EXTRA_FLAGS));
 	enum changelog_rec_extra_flags cref = CLFE_INVALID;
 
 	if (rec->cr_flags & CLF_EXTRA_FLAGS)
-		cref = changelog_rec_extra_flags(rec)->cr_extra_flags &
-		       (CLFE_UIDGID | CLFE_NID | CLFE_OPEN);
+		cref = (enum changelog_rec_extra_flags)
+			(changelog_rec_extra_flags(rec)->cr_extra_flags &
+		         (CLFE_UIDGID | CLFE_NID | CLFE_OPEN));
 
 	return (struct changelog_ext_xattr *)((char *)rec +
 					      changelog_rec_offset(crf, cref));
@@ -1484,10 +1498,12 @@ static inline char *changelog_rec_name(struct changelog_rec *rec)
 	enum changelog_rec_extra_flags cref = CLFE_INVALID;
 
 	if (rec->cr_flags & CLF_EXTRA_FLAGS)
-		cref = changelog_rec_extra_flags(rec)->cr_extra_flags;
+		cref = (enum changelog_rec_extra_flags)
+			changelog_rec_extra_flags(rec)->cr_extra_flags;
 
-	return (char *)rec + changelog_rec_offset(rec->cr_flags & CLF_SUPPORTED,
-						  cref & CLFE_SUPPORTED);
+	return (char *)rec + changelog_rec_offset(
+		(enum changelog_rec_flags)(rec->cr_flags & CLF_SUPPORTED),
+		(enum changelog_rec_extra_flags)(cref & CLFE_SUPPORTED));
 }
 
 static inline size_t changelog_rec_snamelen(struct changelog_rec *rec)
@@ -1535,8 +1551,10 @@ static inline void changelog_remap_rec(struct changelog_rec *rec,
 	char *jid_mov, *rnm_mov;
 	enum changelog_rec_extra_flags cref = CLFE_INVALID;
 
-	crf_wanted &= CLF_SUPPORTED;
-	cref_want &= CLFE_SUPPORTED;
+	crf_wanted = (enum changelog_rec_flags)
+		      (crf_wanted & CLF_SUPPORTED);
+	cref_want = (enum changelog_rec_extra_flags)
+		     (cref_want & CLFE_SUPPORTED);
 
 	if ((rec->cr_flags & CLF_SUPPORTED) == crf_wanted) {
 		if (!(rec->cr_flags & CLF_EXTRA_FLAGS) ||
@@ -1554,38 +1572,49 @@ static inline void changelog_remap_rec(struct changelog_rec *rec,
 	/* Locations of extensions in the remapped record */
 	if (rec->cr_flags & CLF_EXTRA_FLAGS) {
 		xattr_mov = (char *)rec +
-			changelog_rec_offset(crf_wanted & CLF_SUPPORTED,
-					     cref_want & ~CLFE_XATTR);
+			    changelog_rec_offset((enum changelog_rec_flags)
+						  (crf_wanted & CLF_SUPPORTED),
+						 (enum changelog_rec_extra_flags)
+						  (cref_want & ~CLFE_XATTR));
 		omd_mov = (char *)rec +
-			changelog_rec_offset(crf_wanted & CLF_SUPPORTED,
-					     cref_want & ~(CLFE_OPEN |
-							   CLFE_XATTR));
+			  changelog_rec_offset((enum changelog_rec_flags)
+						(crf_wanted & CLF_SUPPORTED),
+					       (enum changelog_rec_extra_flags)
+					        (cref_want & ~(CLFE_OPEN |
+							       CLFE_XATTR)));
 		nid_mov = (char *)rec +
-			  changelog_rec_offset(crf_wanted & CLF_SUPPORTED,
-					       cref_want & ~(CLFE_NID |
-							     CLFE_OPEN |
-							     CLFE_XATTR));
+			  changelog_rec_offset((enum changelog_rec_flags)
+						(crf_wanted & CLF_SUPPORTED),
+					       (enum changelog_rec_extra_flags)
+					        (cref_want & ~(CLFE_NID |
+							       CLFE_OPEN |
+							       CLFE_XATTR)));
 		uidgid_mov = (char *)rec +
-			changelog_rec_offset(crf_wanted & CLF_SUPPORTED,
-					     cref_want & ~(CLFE_UIDGID |
-							   CLFE_NID |
-							   CLFE_OPEN |
-							   CLFE_XATTR));
-		cref = changelog_rec_extra_flags(rec)->cr_extra_flags;
+			     changelog_rec_offset((enum changelog_rec_flags)
+						   (crf_wanted & CLF_SUPPORTED),
+					          (enum changelog_rec_extra_flags)
+					           (cref_want & ~(CLFE_UIDGID |
+								  CLFE_NID |
+								  CLFE_OPEN |
+								  CLFE_XATTR)));
+		cref = (enum changelog_rec_extra_flags)
+		       changelog_rec_extra_flags(rec)->cr_extra_flags;
 	}
 
 	ef_mov  = (char *)rec +
-		  changelog_rec_offset(crf_wanted & ~CLF_EXTRA_FLAGS,
-				       CLFE_INVALID);
+		  changelog_rec_offset((enum changelog_rec_flags)
+					(crf_wanted & ~CLF_EXTRA_FLAGS),
+				        CLFE_INVALID);
 	jid_mov = (char *)rec +
-		  changelog_rec_offset(crf_wanted &
-				       ~(CLF_EXTRA_FLAGS | CLF_JOBID),
+		  changelog_rec_offset((enum changelog_rec_flags)
+					(crf_wanted & ~(CLF_EXTRA_FLAGS |
+							CLF_JOBID)),
 				       CLFE_INVALID);
 	rnm_mov = (char *)rec +
-		  changelog_rec_offset(crf_wanted &
-				       ~(CLF_EXTRA_FLAGS |
-					 CLF_JOBID |
-					 CLF_RENAME),
+		  changelog_rec_offset((enum changelog_rec_flags)
+					(crf_wanted & ~(CLF_EXTRA_FLAGS |
+							CLF_JOBID |
+							CLF_RENAME)),
 				       CLFE_INVALID);
 
 	/* Move the extension fields to the desired positions */
@@ -1824,7 +1853,7 @@ static inline ssize_t hur_len(struct hsm_user_request *hur)
 		(__u64)hur->hur_request.hr_itemcount *
 		sizeof(hur->hur_user_item[0]) + hur->hur_request.hr_data_len;
 
-	if (size != (ssize_t)size)
+	if ((ssize_t)size < 0)
 		return -1;
 
 	return size;
-- 
1.8.3.1


* [lustre-devel] [PATCH 429/622] lnet: create existing net returns EEXIST
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Olaf Faaland <faaland1@llnl.gov>

When "lnetctl net add" is called for an interface/net pair that
already exists, the error returned should be EEXIST, so the
user knows that the net is already configured.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12626
Lustre-commit: 4aa71267cc03 ("LU-12626 lnet: create existing net returns EEXIST")
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-on: https://review.whamcloud.com/35681
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/api-ni.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index e773839..79deaac 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -2301,7 +2301,7 @@ static void lnet_push_target_fini(void)
 		 * up is actually unique. if it's not fail. */
 		if (!lnet_ni_unique_net(&net_l->net_ni_list,
 					ni->ni_interfaces[0])) {
-			rc = -EINVAL;
+			rc = -EEXIST;
 			goto failed1;
 		}
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 430/622] lustre: obdecho: reuse an cl env cache for obdecho survey
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Alexey Lyashkov <c17817@cray.com>

The obdecho environment is already of CL_THREAD type, so it is easy
to reuse the cl_env cache instead of allocating an env on each ioctl
call. This reduces CPU usage dramatically.

Cray-bug-id: LUS-7552
WC-bug-id: https://jira.whamcloud.com/browse/LU-12578
Lustre-commit: 55c33b70c46f ("LU-12578 obdecho: reuse an cl env cache for obdecho survey")
Signed-off-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-on: https://review.whamcloud.com/35700
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lu_object.h   |  9 ++++++
 fs/lustre/obdclass/cl_object.c  |  6 ++--
 fs/lustre/obdclass/lu_object.c  | 68 +++++++++++++++++++++++++++++++++++++++--
 fs/lustre/obdecho/echo_client.c | 28 +++++++++++------
 4 files changed, 97 insertions(+), 14 deletions(-)

diff --git a/fs/lustre/include/lu_object.h b/fs/lustre/include/lu_object.h
index 1c1a60f..b00fad8 100644
--- a/fs/lustre/include/lu_object.h
+++ b/fs/lustre/include/lu_object.h
@@ -1208,6 +1208,14 @@ void *lu_context_key_get(const struct lu_context *ctx,
 void lu_context_key_revive_many(struct lu_context_key *k, ...);
 void lu_context_key_quiesce_many(struct lu_context_key *k, ...);
 
+/*
+ * update/clear ctx/ses tags.
+ */
+void lu_context_tags_update(u32 tags);
+void lu_context_tags_clear(u32 tags);
+void lu_session_tags_update(u32 tags);
+void lu_session_tags_clear(u32 tags);
+
 /**
  * Environment.
  */
@@ -1225,6 +1233,7 @@ struct lu_env {
 int lu_env_init(struct lu_env *env, u32 tags);
 void lu_env_fini(struct lu_env *env);
 int lu_env_refill(struct lu_env *env);
+int lu_env_refill_by_tags(struct lu_env *env, u32 ctags, u32 stags);
 
 struct lu_env *lu_env_find(void);
 int lu_env_add(struct lu_env *env);
diff --git a/fs/lustre/obdclass/cl_object.c b/fs/lustre/obdclass/cl_object.c
index b323eb4..57b3a9a 100644
--- a/fs/lustre/obdclass/cl_object.c
+++ b/fs/lustre/obdclass/cl_object.c
@@ -788,8 +788,10 @@ void cl_env_put(struct lu_env *env, u16 *refcheck)
 		 * with the standard tags.
 		 */
 		if (cl_envs[cpu].cec_count < cl_envs_cached_max &&
-		    (env->le_ctx.lc_tags & ~LCT_HAS_EXIT) == LCT_CL_THREAD &&
-		    (env->le_ses->lc_tags & ~LCT_HAS_EXIT) == LCT_SESSION) {
+		    (env->le_ctx.lc_tags & ~LCT_HAS_EXIT) ==
+			lu_context_tags_default &&
+		    (env->le_ses->lc_tags & ~LCT_HAS_EXIT) ==
+			lu_session_tags_default) {
 			read_lock(&cl_envs[cpu].cec_guard);
 			list_add(&cle->ce_linkage, &cl_envs[cpu].cec_envs);
 			cl_envs[cpu].cec_count++;
diff --git a/fs/lustre/obdclass/lu_object.c b/fs/lustre/obdclass/lu_object.c
index 6fea1f3..dccff91 100644
--- a/fs/lustre/obdclass/lu_object.c
+++ b/fs/lustre/obdclass/lu_object.c
@@ -1778,8 +1778,44 @@ int lu_context_refill(struct lu_context *ctx)
  * predefined when the lu_device type are registered, during the module probe
  * phase.
  */
-u32 lu_context_tags_default;
-u32 lu_session_tags_default;
+u32 lu_context_tags_default = LCT_CL_THREAD;
+u32 lu_session_tags_default = LCT_SESSION;
+
+void lu_context_tags_update(__u32 tags)
+{
+	spin_lock(&lu_context_remembered_guard);
+	lu_context_tags_default |= tags;
+	atomic_inc(&key_set_version);
+	spin_unlock(&lu_context_remembered_guard);
+}
+EXPORT_SYMBOL(lu_context_tags_update);
+
+void lu_context_tags_clear(__u32 tags)
+{
+	spin_lock(&lu_context_remembered_guard);
+	lu_context_tags_default &= ~tags;
+	atomic_inc(&key_set_version);
+	spin_unlock(&lu_context_remembered_guard);
+}
+EXPORT_SYMBOL(lu_context_tags_clear);
+
+void lu_session_tags_update(__u32 tags)
+{
+	spin_lock(&lu_context_remembered_guard);
+	lu_session_tags_default |= tags;
+	atomic_inc(&key_set_version);
+	spin_unlock(&lu_context_remembered_guard);
+}
+EXPORT_SYMBOL(lu_session_tags_update);
+
+void lu_session_tags_clear(__u32 tags)
+{
+	spin_lock(&lu_context_remembered_guard);
+	lu_session_tags_default &= ~tags;
+	atomic_inc(&key_set_version);
+	spin_unlock(&lu_context_remembered_guard);
+}
+EXPORT_SYMBOL(lu_session_tags_clear);
 
 int lu_env_init(struct lu_env *env, u32 tags)
 {
@@ -1801,6 +1837,34 @@ void lu_env_fini(struct lu_env *env)
 }
 EXPORT_SYMBOL(lu_env_fini);
 
+/**
+ * Currently, this API will only be used by echo client.
+ * Because echo client and normal lustre client will share
+ * same cl_env cache. So echo client needs to refresh
+ * the env context after it get one from the cache, especially
+ * when normal client and echo client co-exist in the same client.
+ */
+int lu_env_refill_by_tags(struct lu_env *env, u32 ctags,
+			  u32 stags)
+{
+	int result;
+
+	if ((env->le_ctx.lc_tags & ctags) != ctags) {
+		env->le_ctx.lc_version = 0;
+		env->le_ctx.lc_tags |= ctags;
+	}
+
+	if (env->le_ses && (env->le_ses->lc_tags & stags) != stags) {
+		env->le_ses->lc_version = 0;
+		env->le_ses->lc_tags |= stags;
+	}
+
+	result = lu_env_refill(env);
+
+	return result;
+}
+EXPORT_SYMBOL(lu_env_refill_by_tags);
+
 int lu_env_refill(struct lu_env *env)
 {
 	int result;
diff --git a/fs/lustre/obdecho/echo_client.c b/fs/lustre/obdecho/echo_client.c
index 01d8c04..84823ec 100644
--- a/fs/lustre/obdecho/echo_client.c
+++ b/fs/lustre/obdecho/echo_client.c
@@ -50,6 +50,10 @@
  * @{
  */
 
+/* the echo thread key has a CL_THREAD flag, which cl_env functions set directly */
+#define ECHO_DT_CTX_TAG (LCT_REMEMBER | LCT_DT_THREAD)
+#define ECHO_SES_TAG    (LCT_REMEMBER | LCT_SESSION | LCT_SERVER_SESSION)
+
 struct echo_device {
 	struct cl_device		ed_cl;
 	struct echo_client_obd	       *ed_ec;
@@ -1481,6 +1485,7 @@ static int echo_client_brw_ioctl(const struct lu_env *env, int rw,
 	struct echo_object *eco;
 	struct obd_ioctl_data *data = karg;
 	struct lu_env *env;
+	u16 refcheck;
 	struct obdo *oa;
 	struct lu_fid fid;
 	int rw = OBD_BRW_READ;
@@ -1497,16 +1502,14 @@ static int echo_client_brw_ioctl(const struct lu_env *env, int rw,
 	if (rc < 0)
 		return rc;
 
-	env = kzalloc(sizeof(*env), GFP_NOFS);
-	if (!env)
-		return -ENOMEM;
+	env = cl_env_get(&refcheck);
+	if (IS_ERR(env))
+		return PTR_ERR(env);
 
-	rc = lu_env_init(env, LCT_DT_THREAD);
-	if (rc) {
-		rc = -ENOMEM;
-		goto out;
-	}
 	lu_env_add(env);
+	rc = lu_env_refill_by_tags(env, ECHO_DT_CTX_TAG, ECHO_SES_TAG);
+	if (rc != 0)
+		goto out;
 
 	switch (cmd) {
 	case OBD_IOC_CREATE:		/* may create echo object */
@@ -1574,8 +1577,7 @@ static int echo_client_brw_ioctl(const struct lu_env *env, int rw,
 
 out:
 	lu_env_remove(env);
-	lu_env_fini(env);
-	kfree(env);
+	cl_env_put(env, &refcheck);
 
 	return rc;
 }
@@ -1606,6 +1608,9 @@ static int echo_client_setup(const struct lu_env *env,
 	INIT_LIST_HEAD(&ec->ec_locks);
 	ec->ec_unique = 0;
 
+	lu_context_tags_update(ECHO_DT_CTX_TAG);
+	lu_session_tags_update(ECHO_SES_TAG);
+
 	ocd = kzalloc(sizeof(*ocd), GFP_NOFS);
 	if (!ocd)
 		return -ENOMEM;
@@ -1642,6 +1647,9 @@ static int echo_client_cleanup(struct obd_device *obddev)
 		return -EBUSY;
 	}
 
+	lu_session_tags_clear(ECHO_SES_TAG & ~LCT_SESSION);
+	lu_context_tags_clear(ECHO_DT_CTX_TAG);
+
 	LASSERT(refcount_read(&ec->ec_exp->exp_refcount) > 0);
 	rc = obd_disconnect(ec->ec_exp);
 	if (rc != 0)
-- 
1.8.3.1


* [lustre-devel] [PATCH 431/622] lustre: mdc: dir page ldp_hash_end mistakenly adjusted
From: James Simmons @ 2020-02-27 21:14 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

On systems with PAGE_SIZE > 4KB, mdc_adjust_dirpages() adjusts the
dir page end hash with a le64_to_cpu() value, but the field should
stay little-endian.

Fixes: 4f76f0ec093 ("staging: lustre: llite: move dir cache to MDC layer")

WC-bug-id: https://jira.whamcloud.com/browse/LU-10094
Lustre-commit: d8b19ae66177 ("LU-10094 mdc: dir page ldp_hash_end mistakenly adjusted")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35517
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/mdc/mdc_request.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index 693c455..162ace7 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -1259,8 +1259,8 @@ static void mdc_adjust_dirpages(struct page **pages, int cfs_pgs, int lu_pgs)
 
 	for (i = 0; i < cfs_pgs; i++) {
 		struct lu_dirpage *dp = kmap(pages[i]);
-		u64 hash_end = le64_to_cpu(dp->ldp_hash_end);
-		u32 flags = le32_to_cpu(dp->ldp_flags);
+		u64 hash_end = dp->ldp_hash_end;
+		u32 flags = dp->ldp_flags;
 		struct lu_dirpage *first = dp;
 
 		while (--lu_pgs > 0) {
@@ -1279,8 +1279,8 @@ static void mdc_adjust_dirpages(struct page **pages, int cfs_pgs, int lu_pgs)
 				break;
 
 			/* Save the hash and flags of this lu_dirpage. */
-			hash_end = le64_to_cpu(dp->ldp_hash_end);
-			flags = le32_to_cpu(dp->ldp_flags);
+			hash_end = dp->ldp_hash_end;
+			flags = dp->ldp_flags;
 
 			/* Check if lu_dirpage contains no entries. */
 			if (!end_dirent)
-- 
1.8.3.1


* [lustre-devel] [PATCH 432/622] lnet: handle unlink before send completes
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

If LNetMDUnlink() is called on an md with md->md_refcount > 0 then
the eq callback isn't called.
There is a scenario where the response times out before the send
completes, so we still hold a refcount on the MD and the unlink
callback gets dropped on the floor. The send then completes, but
because we have already timed out, the REPLY for the GET is dropped.
Now we're left with a peer in the following state:
LNET_PEER_MULTI_RAIL
LNET_PEER_DISCOVERING
LNET_PEER_PING_SENT
But no more events arrive for it, and the discovery never
completes.

This scenario can get RPCs stuck as well if the response times out
before the send completes.

The solution is to set the event status to -ETIMEDOUT to inform
the send event handler that it should not expect a reply.
WC-bug-id: https://jira.whamcloud.com/browse/LU-10931
Lustre-commit: d8fc5c23fe54 ("LU-10931 lnet: handle unlink before send completes")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35444
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-msg.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index 805d5b9..0d6c363 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -820,7 +820,12 @@
 
 	unlink = lnet_md_unlinkable(md);
 	if (md->md_eq) {
-		msg->msg_ev.status = status;
+		if ((md->md_flags & LNET_MD_FLAG_ABORTED) && !status) {
+			msg->msg_ev.status = -ETIMEDOUT;
+			CDEBUG(D_NET, "md 0x%p already unlinked\n", md);
+		} else {
+			msg->msg_ev.status = status;
+		}
 		msg->msg_ev.unlinked = unlink;
 		lnet_eq_enqueue_event(md->md_eq, &msg->msg_ev);
 	}
-- 
1.8.3.1


* [lustre-devel] [PATCH 433/622] lustre: osc: layout and chunkbits alignment mismatch
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (431 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 432/622] lnet: handle unlink before send completes James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 434/622] lnet: handle recursion in resend James Simmons
                   ` (189 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Vitaly Fertman <c17818@cray.com>

In the discard case, the OSC fsync/writeback code asserts
that each OSC extent is fully covered by the fsync request.

It may happen that the start (or end) of a component does not match
the start (or end) of the first (or last) OSC object extent, which is
aligned to cl_chunkbits, which in turn depends on the OST block size.

The alignment requirement for a component is LOV_MIN_STRIPE_SIZE,
which is 64K, while the ZFS block size can be several MBs.

In the assertion, use the fsync region aligned by chunk size.

Fixes: 58c252e47d ("lustre: osc: Do not assert for first extent")

WC-bug-id: https://jira.whamcloud.com/browse/LU-12462
Lustre-commit: 7a9f7dec700c ("LU-12462 osc: layout and chunkbits alignment mismatch")
Signed-off-by: Vitaly Fertman <c17818@cray.com>
Cray-bug-id: LUS-7498
Reviewed-on: https://review.whamcloud.com/35733
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/osc_cache.c | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/fs/lustre/osc/osc_cache.c b/fs/lustre/osc/osc_cache.c
index 9e2f90d..3d47c02 100644
--- a/fs/lustre/osc/osc_cache.c
+++ b/fs/lustre/osc/osc_cache.c
@@ -2930,18 +2930,25 @@ int osc_cache_writeback_range(const struct lu_env *env, struct osc_object *obj,
 					list_move_tail(&ext->oe_link, list);
 				unplug = true;
 			} else {
+				struct client_obd *cli = osc_cli(obj);
+				int pcc_bits = cli->cl_chunkbits - PAGE_SHIFT;
+				pgoff_t align_by = (1 << pcc_bits);
+				pgoff_t a_start = round_down(start, align_by);
+				pgoff_t a_end = round_up(end, align_by);
+
+				/* overflow case */
+				if (end && !a_end)
+					a_end = CL_PAGE_EOF;
 				/* the only discarder is lock cancelling, so
-				 * [start, end] must contain this extent.
-				 * However, with DOM, osc extent alignment may
-				 * cause the first extent to start before the
-				 * OST portion of the layout.  This is never
-				 * accessed for i/o, but the unused portion
-				 * will not be covered by the sync request,
-				 * so we cannot assert in that case.
+				 * [start, end], aligned by chunk size, must
+				 * contain this extent
 				 */
-				EASSERT(ergo(!(ext == first_extent(obj)),
-					ext->oe_start >= start &&
-					ext->oe_end <= end), ext);
+				LASSERTF(ext->oe_start >= a_start &&
+					 ext->oe_end <= a_end,
+					 "ext [%lu, %lu] reg [%lu, %lu] orig [%lu %lu] align %lu bits %d\n",
+					 ext->oe_start, ext->oe_end,
+					 a_start, a_end, start, end,
+					 align_by, pcc_bits);
 				osc_extent_state_set(ext, OES_LOCKING);
 				ext->oe_owner = current;
 				list_move_tail(&ext->oe_link, &discard_list);
-- 
1.8.3.1


* [lustre-devel] [PATCH 434/622] lnet: handle recursion in resend
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (432 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 433/622] lustre: osc: layout and chunkbits alignment mismatch James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 435/622] lustre: llite: forget cached ACLs properly James Simmons
                   ` (188 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

When we're resending a message we have to decommit it first. This
could potentially result in another message being picked up from the
queue and sent, which could fail immediately and be finalized, causing
recursion. This problem was observed when a router was being shut
down.

This patch uses the same mechanism used in lnet_finalize() to limit
recursion. If a thread is already finalizing a message and gets into
a path where it starts finalizing a second one, that second message
is queued and handled later.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12402
Lustre-commit: ad9243693c9a ("LU-12402 lnet: handle recursion in resend")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35431
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-types.h |   4 +
 net/lnet/lnet/lib-msg.c        | 292 +++++++++++++++++++++++++++--------------
 2 files changed, 194 insertions(+), 102 deletions(-)

diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index 904ef7a..3f81928 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -985,9 +985,13 @@ struct lnet_msg_container {
 	int			  msc_nfinalizers;
 	/* msgs waiting to complete finalizing */
 	struct list_head	  msc_finalizing;
+	/* msgs waiting to be resent */
+	struct list_head	  msc_resending;
 	struct list_head	  msc_active;	/* active message list */
 	/* threads doing finalization */
 	void			**msc_finalizers;
+	/* threads doing resends */
+	void			**msc_resenders;
 };
 
 /* Peer Discovery states */
diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index 0d6c363..5c39ce3 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -597,6 +597,168 @@
 	}
 }
 
+static void
+lnet_resend_msg_locked(struct lnet_msg *msg)
+{
+	msg->msg_retry_count++;
+
+	/* remove message from the active list and reset it to prepare
+	 * for a resend. Two exceptions to this
+	 *
+	 * 1. the router case. When a message is being routed it is
+	 * committed for rx when received and committed for tx when
+	 * forwarded. We don't want to remove it from the active list, since
+	 * code which handles receiving expects it to remain on the active
+	 * list.
+	 *
+	 * 2. The REPLY case. Reply messages use the same message
+	 * structure for the GET that was received.
+	 */
+	if (!msg->msg_routing && msg->msg_type != LNET_MSG_REPLY) {
+		list_del_init(&msg->msg_activelist);
+		msg->msg_onactivelist = 0;
+	}
+
+	/* The msg_target.nid which was originally set
+	 * when calling LNetGet() or LNetPut() might've
+	 * been overwritten if we're routing this message.
+	 * Call lnet_msg_decommit_tx() to return the credit
+	 * this message consumed. The message will
+	 * consume another credit when it gets resent.
+	 */
+	msg->msg_target.nid = msg->msg_hdr.dest_nid;
+	lnet_msg_decommit_tx(msg, -EAGAIN);
+	msg->msg_sending = 0;
+	msg->msg_receiving = 0;
+	msg->msg_target_is_router = 0;
+
+	CDEBUG(D_NET, "%s->%s:%s:%s - queuing msg (%p) for resend\n",
+	       libcfs_nid2str(msg->msg_hdr.src_nid),
+	       libcfs_nid2str(msg->msg_hdr.dest_nid),
+	       lnet_msgtyp2str(msg->msg_type),
+	       lnet_health_error2str(msg->msg_health_status), msg);
+
+	list_add_tail(&msg->msg_list, the_lnet.ln_mt_resendqs[msg->msg_tx_cpt]);
+
+	wake_up(&the_lnet.ln_mt_waitq);
+}
+
+int
+lnet_check_finalize_recursion_locked(struct lnet_msg *msg,
+				     struct list_head *containerq,
+				     int nworkers, void **workers)
+{
+	int my_slot = -1;
+	int i;
+
+	list_add_tail(&msg->msg_list, containerq);
+
+	for (i = 0; i < nworkers; i++) {
+		if (workers[i] == current)
+			break;
+
+		if (my_slot < 0 && !workers[i])
+			my_slot = i;
+	}
+
+	if (i < nworkers || my_slot < 0)
+		return -1;
+
+	workers[my_slot] = current;
+
+	return my_slot;
+}
+
+int
+lnet_attempt_msg_resend(struct lnet_msg *msg)
+{
+	struct lnet_msg_container *container;
+	int my_slot;
+	int cpt;
+
+	/* we can only resend tx_committed messages */
+	LASSERT(msg->msg_tx_committed);
+
+	/* don't resend recovery messages */
+	if (msg->msg_recovery) {
+		CDEBUG(D_NET, "msg %s->%s is a recovery ping. retry# %d\n",
+		       libcfs_nid2str(msg->msg_from),
+		       libcfs_nid2str(msg->msg_target.nid),
+		       msg->msg_retry_count);
+		return -ENOTRECOVERABLE;
+	}
+
+	/* if we explicitly indicated we don't want to resend then just
+	 * return
+	 */
+	if (msg->msg_no_resend) {
+		CDEBUG(D_NET, "msg %s->%s requested no resend. retry# %d\n",
+		       libcfs_nid2str(msg->msg_from),
+		       libcfs_nid2str(msg->msg_target.nid),
+		       msg->msg_retry_count);
+		return -ENOTRECOVERABLE;
+	}
+
+	/* check if the message has exceeded the number of retries */
+	if (msg->msg_retry_count >= lnet_retry_count) {
+		CNETERR("msg %s->%s exceeded retry count %d\n",
+			libcfs_nid2str(msg->msg_from),
+			libcfs_nid2str(msg->msg_target.nid),
+			msg->msg_retry_count);
+		return -ENOTRECOVERABLE;
+	}
+
+	cpt = msg->msg_tx_cpt;
+	lnet_net_lock(cpt);
+
+	/* check again under lock */
+	if (the_lnet.ln_mt_state != LNET_MT_STATE_RUNNING) {
+		lnet_net_unlock(cpt);
+		return -ESHUTDOWN;
+	}
+
+	container = the_lnet.ln_msg_containers[cpt];
+	my_slot = lnet_check_finalize_recursion_locked(msg,
+						       &container->msc_resending,
+						       container->msc_nfinalizers,
+						       container->msc_resenders);
+	/* enough threads are resending */
+	if (my_slot == -1) {
+		lnet_net_unlock(cpt);
+		return 0;
+	}
+
+	while (!list_empty(&container->msc_resending)) {
+		msg = list_entry(container->msc_resending.next,
+				 struct lnet_msg, msg_list);
+		list_del(&msg->msg_list);
+
+		/* resending the message will require us to call
+		 * lnet_msg_decommit_tx() which will return the credit
+		 * which this message holds. This could trigger another
+		 * queued message to be sent. If that message fails and
+		 * requires a resend we will recurse.
+		 * But since at this point the slot is taken, the message
+		 * will be queued in the container and dealt with
+		 * later. This breaks the recursion.
+		 */
+		lnet_resend_msg_locked(msg);
+	}
+
+	/* msc_resenders is an array of process pointers. Each entry holds
+	 * a pointer to the current process operating on the message. An
+	 * array entry is created per CPT. If the array slot is already
+	 * set, then it means that there is a thread on the CPT currently
+	 * resending a message.
+	 * Once the thread finishes clear the slot to enable the thread to
+	 * take on more resend work.
+	 */
+	container->msc_resenders[my_slot] = NULL;
+	lnet_net_unlock(cpt);
+
+	return 0;
+}
+
 /* Do a health check on the message:
  * return -1 if we're not going to handle the error or
  *   if we've reached the maximum number of retries.
@@ -607,9 +769,9 @@
 lnet_health_check(struct lnet_msg *msg)
 {
 	enum lnet_msg_hstatus hstatus = msg->msg_health_status;
-	bool lo = false;
-	struct lnet_ni *ni;
 	struct lnet_peer_ni *lpni;
+	struct lnet_ni *ni;
+	bool lo = false;
 
 	/* if we're shutting down no point in handling health. */
 	if (the_lnet.ln_mt_state != LNET_MT_STATE_RUNNING)
@@ -697,7 +859,7 @@
 		lnet_handle_local_failure(ni);
 		if (msg->msg_tx_committed)
 			/* add to the re-send queue */
-			goto resend;
+			return lnet_attempt_msg_resend(msg);
 		break;
 
 	/* These errors will not trigger a resend so simply
@@ -713,7 +875,7 @@
 	case LNET_MSG_STATUS_REMOTE_DROPPED:
 		lnet_handle_remote_failure(lpni);
 		if (msg->msg_tx_committed)
-			goto resend;
+			return lnet_attempt_msg_resend(msg);
 		break;
 
 	case LNET_MSG_STATUS_REMOTE_ERROR:
@@ -725,87 +887,8 @@
 		LBUG();
 	}
 
-resend:
-	/* we can only resend tx_committed messages */
-	LASSERT(msg->msg_tx_committed);
-
-	/* don't resend recovery messages */
-	if (msg->msg_recovery) {
-		CDEBUG(D_NET, "msg %s->%s is a recovery ping. retry# %d\n",
-		       libcfs_nid2str(msg->msg_from),
-		       libcfs_nid2str(msg->msg_target.nid),
-		       msg->msg_retry_count);
-		return -1;
-	}
-
-	/* if we explicitly indicated we don't want to resend then just
-	 * return
-	 */
-	if (msg->msg_no_resend) {
-		CDEBUG(D_NET, "msg %s->%s requested no resend. retry# %d\n",
-		       libcfs_nid2str(msg->msg_from),
-		       libcfs_nid2str(msg->msg_target.nid),
-		       msg->msg_retry_count);
-		return -1;
-	}
-
-	/* check if the message has exceeded the number of retries */
-	if (msg->msg_retry_count >= lnet_retry_count) {
-		CNETERR("msg %s->%s exceeded retry count %d\n",
-			libcfs_nid2str(msg->msg_from),
-			libcfs_nid2str(msg->msg_target.nid),
-			msg->msg_retry_count);
-		return -1;
-	}
-	msg->msg_retry_count++;
-
-	lnet_net_lock(msg->msg_tx_cpt);
-
-	/* check again under lock */
-	if (the_lnet.ln_mt_state != LNET_MT_STATE_RUNNING) {
-		lnet_net_unlock(msg->msg_tx_cpt);
-		return -1;
-	}
-
-	/* remove message from the active list and reset it in preparation
-	 * for a resend. Two exception to this
-	 *
-	 * 1. the router case, when a message is committed for rx when
-	 * received, then tx when it is sent. When committed to both tx and
-	 * rx we don't want to remove it from the active list.
-	 *
-	 * 2. The REPLY case since it uses the same msg block for the GET
-	 * that was received.
-	 */
-	if (!msg->msg_routing && msg->msg_type != LNET_MSG_REPLY) {
-		list_del_init(&msg->msg_activelist);
-		msg->msg_onactivelist = 0;
-	}
-
-	/* The msg_target.nid which was originally set
-	 * when calling LNetGet() or LNetPut() might've
-	 * been overwritten if we're routing this message.
-	 * Call lnet_return_tx_credits_locked() to return
-	 * the credit this message consumed. The message will
-	 * consume another credit when it gets resent.
-	 */
-	msg->msg_target.nid = msg->msg_hdr.dest_nid;
-	lnet_msg_decommit_tx(msg, -EAGAIN);
-	msg->msg_sending = 0;
-	msg->msg_receiving = 0;
-	msg->msg_target_is_router = 0;
-
-	CDEBUG(D_NET, "%s->%s:%s:%s - queuing for resend\n",
-	       libcfs_nid2str(msg->msg_hdr.src_nid),
-	       libcfs_nid2str(msg->msg_hdr.dest_nid),
-	       lnet_msgtyp2str(msg->msg_type),
-	       lnet_health_error2str(hstatus));
-
-	list_add_tail(&msg->msg_list, the_lnet.ln_mt_resendqs[msg->msg_tx_cpt]);
-	lnet_net_unlock(msg->msg_tx_cpt);
-
-	wake_up(&the_lnet.ln_mt_waitq);
-	return 0;
+	/* no resend is needed */
+	return -1;
 }
 
 static void
@@ -945,7 +1028,6 @@
 	int my_slot;
 	int cpt;
 	int rc;
-	int i;
 
 	LASSERT(!in_interrupt());
 
@@ -967,7 +1049,6 @@
 		 * put on the resend queue.
 		 */
 		if (!lnet_health_check(msg))
-			/* Message is queued for resend */
 			return;
 	}
 
@@ -998,28 +1079,20 @@
 	lnet_net_lock(cpt);
 
 	container = the_lnet.ln_msg_containers[cpt];
-	list_add_tail(&msg->msg_list, &container->msc_finalizing);
 
-	/*
-	 * Recursion breaker.  Don't complete the message here if I am (or
+	/* Recursion breaker.  Don't complete the message here if I am (or
 	 * enough other threads are) already completing messages
 	 */
-	my_slot = -1;
-	for (i = 0; i < container->msc_nfinalizers; i++) {
-		if (container->msc_finalizers[i] == current)
-			break;
-
-		if (my_slot < 0 && !container->msc_finalizers[i])
-			my_slot = i;
-	}
-
-	if (i < container->msc_nfinalizers || my_slot < 0) {
+	my_slot = lnet_check_finalize_recursion_locked(msg,
+						       &container->msc_finalizing,
+						       container->msc_nfinalizers,
+						       container->msc_finalizers);
+	/* enough threads are resending */
+	if (my_slot == -1) {
 		lnet_net_unlock(cpt);
 		return;
 	}
 
-	container->msc_finalizers[my_slot] = current;
-
 	rc = 0;
 	while ((msg = list_first_entry_or_null(&container->msc_finalizing,
 					       struct lnet_msg,
@@ -1073,6 +1146,10 @@
 
 	kvfree(container->msc_finalizers);
 	container->msc_finalizers = NULL;
+
+	kfree(container->msc_resenders);
+	container->msc_resenders = NULL;
+
 	container->msc_init = 0;
 }
 
@@ -1083,6 +1160,7 @@
 
 	INIT_LIST_HEAD(&container->msc_active);
 	INIT_LIST_HEAD(&container->msc_finalizing);
+	INIT_LIST_HEAD(&container->msc_resending);
 
 	/* number of CPUs */
 	container->msc_nfinalizers = cfs_cpt_weight(lnet_cpt_table(), cpt);
@@ -1099,6 +1177,16 @@
 		return -ENOMEM;
 	}
 
+	container->msc_resenders = kzalloc_cpt(container->msc_nfinalizers *
+					       sizeof(*container->msc_resenders),
+					       GFP_KERNEL, cpt);
+
+	if (!container->msc_resenders) {
+		CERROR("Failed to allocate message resenders\n");
+		lnet_msg_container_cleanup(container);
+		return -ENOMEM;
+	}
+
 	return 0;
 }
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 435/622] lustre: llite: forget cached ACLs properly
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (433 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 434/622] lnet: handle recursion in resend James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 436/622] lustre: osc: Fix dom handling in weight_ast James Simmons
                   ` (187 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Alex Zhuravlev <bzzz@whamcloud.com>

Lustre with linux-4.* kernels fails ACL tests (e.g. sanity/103 and
sanityn/25) because ll_lock_cancel_bits() does not reset i_acl and
i_default_acl to their initial state. Use the kernel's
forget_all_cached_acls() to do so.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12657
Lustre-commit: 3df034f8f46b ("LU-12657 llite: forget cached ACLs properly")
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35756
Reviewed-by: Neil Brown <neilb@suse.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/namei.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index 71e757a..de01a73 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -361,6 +361,9 @@ static void ll_lock_cancel_bits(struct ldlm_lock *lock, u64 to_cancel)
 	    !is_root_inode(inode))
 		ll_invalidate_aliases(inode);
 
+	if (bits & (MDS_INODELOCK_LOOKUP | MDS_INODELOCK_PERM))
+		forget_all_cached_acls(inode);
+
 	iput(inode);
 }
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 436/622] lustre: osc: Fix dom handling in weight_ast
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (434 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 435/622] lustre: llite: forget cached ACLs properly James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 437/622] lustre: llite: Fix extents_stats James Simmons
                   ` (186 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

The DOM bit can be cancelled at any time during calls to
weigh_ast, so:

1. We cannot assert that it is present
2. We cannot use it to identify the !LDLM_EXTENT case when
calling osc_lock_weight

WC-bug-id: https://jira.whamcloud.com/browse/LU-12343
Lustre-commit: 92c4ad14d4b1 ("LU-12343 osc: Fix dom handling in weight_ast")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34966
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/osc_lock.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/lustre/osc/osc_lock.c b/fs/lustre/osc/osc_lock.c
index e01bf5f..33fdc7e7 100644
--- a/fs/lustre/osc/osc_lock.c
+++ b/fs/lustre/osc/osc_lock.c
@@ -673,7 +673,8 @@ unsigned long osc_ldlm_weigh_ast(struct ldlm_lock *dlmlock)
 		return 1;
 
 	LASSERT(dlmlock->l_resource->lr_type == LDLM_EXTENT ||
-		ldlm_has_dom(dlmlock));
+		dlmlock->l_resource->lr_type == LDLM_IBITS);
+
 	lock_res_and_lock(dlmlock);
 	obj = dlmlock->l_ast_data;
 	if (obj)
@@ -701,12 +702,17 @@ unsigned long osc_ldlm_weigh_ast(struct ldlm_lock *dlmlock)
 		goto out;
 	}
 
-	if (ldlm_has_dom(dlmlock))
-		weight = osc_lock_weight(env, obj, 0, OBD_OBJECT_EOF);
-	else
+	if (dlmlock->l_resource->lr_type == LDLM_EXTENT)
 		weight = osc_lock_weight(env, obj,
 					 dlmlock->l_policy_data.l_extent.start,
 					 dlmlock->l_policy_data.l_extent.end);
+	else if (ldlm_has_dom(dlmlock))
+		weight = osc_lock_weight(env, obj, 0, OBD_OBJECT_EOF);
+	/* The DOM bit can be cancelled at any time; in that case, we know
+	 * there are no pages, so just return weight of 0
+	 */
+	else
+		weight = 0;
 
 out:
 	if (obj)
-- 
1.8.3.1


* [lustre-devel] [PATCH 437/622] lustre: llite: Fix extents_stats
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (435 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 436/622] lustre: osc: Fix dom handling in weight_ast James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 438/622] lustre: llite: don't miss every first stride page James Simmons
                   ` (185 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

Patch 32517 from LU-8066 that landed in the OpenSFS branch changed:
        (1 << LL_HIST_START << i)

to

        BIT(LL_HIST_START << i)

But these are not equivalent because this changes the order
of operations.  The earlier one does the operations in this
order:
        (1 << LL_HIST_START) << i

The new one is this order:
        1 << (LL_HIST_START << i)

This is quite different, as it left-shifts LL_HIST_START itself,
and LL_HIST_START is a number of bits.

The goal is really just to start with BIT(LL_HIST_START)
and left shift by one (going from 4K, to 8K, etc) each
time, so just use:
        BIT(LL_HIST_START + i)

The result of this was that all I/Os over 8K were placed in
the 4K-8K stat bucket, because the loop exited early.

Also add mmap'ed reads & writes to extents_stats.

Add test for extents_stats.

This was only broken in the OpenSFS branch but we want the
improvements.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12394
Lustre-commit: d31a4dad4e69 ("LU-12394 llite: Fix extents_stats")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35075
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c        | 23 +++++++++++++++++------
 fs/lustre/llite/llite_mmap.c  | 11 +++++++++++
 fs/lustre/llite/lproc_llite.c |  6 +++---
 fs/lustre/llite/vvp_io.c      |  4 ----
 4 files changed, 31 insertions(+), 13 deletions(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 35e31ad..fa61b09 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -1670,6 +1670,7 @@ static ssize_t ll_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	struct lu_env *env;
 	struct vvp_io_args *args;
+	struct file *file = iocb->ki_filp;
 	ssize_t result;
 	u16 refcheck;
 	ssize_t rc2;
@@ -1693,7 +1694,7 @@ static ssize_t ll_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	if (cached)
 		return result;
 
-	ll_ras_enter(iocb->ki_filp);
+	ll_ras_enter(file);
 
 	result = ll_do_fast_read(iocb, to);
 	if (result < 0 || iov_iter_count(to) == 0)
@@ -1707,7 +1708,7 @@ static ssize_t ll_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	args->u.normal.via_iter = to;
 	args->u.normal.via_iocb = iocb;
 
-	rc2 = ll_file_io_generic(env, args, iocb->ki_filp, CIT_READ,
+	rc2 = ll_file_io_generic(env, args, file, CIT_READ,
 				 &iocb->ki_pos, iov_iter_count(to));
 	if (rc2 > 0)
 		result += rc2;
@@ -1716,6 +1717,11 @@ static ssize_t ll_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 
 	cl_env_put(env, &refcheck);
 out:
+	if (result > 0)
+		ll_rw_stats_tally(ll_i2sbi(file_inode(file)), current->pid,
+				  LUSTRE_FPRIVATE(file), iocb->ki_pos, result,
+				  READ);
+
 	return result;
 }
 
@@ -1784,6 +1790,7 @@ static ssize_t ll_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	struct lu_env *env;
 	struct vvp_io_args *args;
 	ssize_t rc_tiny = 0, rc_normal;
+	struct file *file = iocb->ki_filp;
 	u16 refcheck;
 	bool cached;
 	int result;
@@ -1812,8 +1819,8 @@ static ssize_t ll_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	 * pages, and we can't do append writes because we can't guarantee the
 	 * required DLM locks are held to protect file size.
 	 */
-	if (ll_sbi_has_tiny_write(ll_i2sbi(file_inode(iocb->ki_filp))) &&
-	    !(iocb->ki_filp->f_flags & (O_DIRECT | O_SYNC | O_APPEND)))
+	if (ll_sbi_has_tiny_write(ll_i2sbi(file_inode(file))) &&
+	    !(file->f_flags & (O_DIRECT | O_SYNC | O_APPEND)))
 		rc_tiny = ll_do_tiny_write(iocb, from);
 
 	/* In case of error, go on and try normal write - Only stop if tiny
@@ -1832,8 +1839,8 @@ static ssize_t ll_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	args->u.normal.via_iter = from;
 	args->u.normal.via_iocb = iocb;
 
-	rc_normal = ll_file_io_generic(env, args, iocb->ki_filp, CIT_WRITE,
-				    &iocb->ki_pos, iov_iter_count(from));
+	rc_normal = ll_file_io_generic(env, args, file, CIT_WRITE,
+				       &iocb->ki_pos, iov_iter_count(from));
 
 	/* On success, combine bytes written. */
 	if (rc_tiny >= 0 && rc_normal > 0)
@@ -1846,6 +1853,10 @@ static ssize_t ll_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 	cl_env_put(env, &refcheck);
 out:
+	if (rc_normal > 0)
+		ll_rw_stats_tally(ll_i2sbi(file_inode(file)), current->pid,
+				  LUSTRE_FPRIVATE(file), iocb->ki_pos,
+				  rc_normal, WRITE);
 	return rc_normal;
 }
 
diff --git a/fs/lustre/llite/llite_mmap.c b/fs/lustre/llite/llite_mmap.c
index 71799cd..5c13164 100644
--- a/fs/lustre/llite/llite_mmap.c
+++ b/fs/lustre/llite/llite_mmap.c
@@ -406,6 +406,12 @@ static vm_fault_t ll_fault(struct vm_fault *vmf)
 		result = VM_FAULT_LOCKED;
 	}
 	sigprocmask(SIG_SETMASK, &old, NULL);
+
+	if (vmf->page && result == VM_FAULT_LOCKED)
+		ll_rw_stats_tally(ll_i2sbi(file_inode(vma->vm_file)),
+				  current->pid, LUSTRE_FPRIVATE(vma->vm_file),
+				  cl_offset(NULL, vmf->page->index), PAGE_SIZE,
+				  READ);
 	return result;
 }
 
@@ -459,6 +465,11 @@ static vm_fault_t ll_page_mkwrite(struct vm_fault *vmf)
 		break;
 	}
 
+	if (ret == VM_FAULT_LOCKED)
+		ll_rw_stats_tally(ll_i2sbi(file_inode(vma->vm_file)),
+				  current->pid, LUSTRE_FPRIVATE(vma->vm_file),
+				  cl_offset(NULL, vmf->page->index), PAGE_SIZE,
+				  WRITE);
 	return ret;
 }
 
diff --git a/fs/lustre/llite/lproc_llite.c b/fs/lustre/llite/lproc_llite.c
index 6eb3d33..c2ec3fb 100644
--- a/fs/lustre/llite/lproc_llite.c
+++ b/fs/lustre/llite/lproc_llite.c
@@ -1937,7 +1937,7 @@ void ll_rw_stats_tally(struct ll_sb_info *sbi, pid_t pid,
 		lprocfs_oh_clear(&io_extents->pp_extents[cur].pp_w_hist);
 	}
 
-	for (i = 0; (count >= (1 << LL_HIST_START << i)) &&
+	for (i = 0; (count >= BIT(LL_HIST_START + i)) &&
 	     (i < (LL_HIST_MAX - 1)); i++)
 		;
 	if (rw == 0) {
@@ -2032,7 +2032,7 @@ static int ll_rw_offset_stats_seq_show(struct seq_file *seq, void *v)
 	for (i = 0; i < LL_OFFSET_HIST_MAX; i++) {
 		if (offset[i].rw_pid != 0)
 			seq_printf(seq,
-				   "%3c %10d %14llu %14llu %17lu %17lu %14llu\n",
+				   "%3c %10d %14llu %14llu %17lu %17lu %14lld\n",
 				   offset[i].rw_op == READ ? 'R' : 'W',
 				   offset[i].rw_pid,
 				   offset[i].rw_range_start,
@@ -2045,7 +2045,7 @@ static int ll_rw_offset_stats_seq_show(struct seq_file *seq, void *v)
 	for (i = 0; i < LL_PROCESS_HIST_MAX; i++) {
 		if (process[i].rw_pid != 0)
 			seq_printf(seq,
-				   "%3c %10d %14llu %14llu %17lu %17lu %14llu\n",
+				   "%3c %10d %14llu %14llu %17lu %17lu %14lld\n",
 				   process[i].rw_op == READ ? 'R' : 'W',
 				   process[i].rw_pid,
 				   process[i].rw_range_start,
diff --git a/fs/lustre/llite/vvp_io.c b/fs/lustre/llite/vvp_io.c
index 68455d5..847fb5e 100644
--- a/fs/lustre/llite/vvp_io.c
+++ b/fs/lustre/llite/vvp_io.c
@@ -791,8 +791,6 @@ static int vvp_io_read_start(const struct lu_env *env,
 		if (result < cnt)
 			io->ci_continue = 0;
 		io->ci_nob += result;
-		ll_rw_stats_tally(ll_i2sbi(inode), current->pid,
-				  vio->vui_fd, pos, result, READ);
 		result = 0;
 	}
 	return result;
@@ -1069,8 +1067,6 @@ static int vvp_io_write_start(const struct lu_env *env,
 
 		if (result < cnt)
 			io->ci_continue = 0;
-		ll_rw_stats_tally(ll_i2sbi(inode), current->pid,
-				  vio->vui_fd, pos, result, WRITE);
 		result = 0;
 	}
 	return result;
-- 
1.8.3.1


* [lustre-devel] [PATCH 438/622] lustre: llite: don't miss every first stride page
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (436 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 437/622] lustre: llite: Fix extents_stats James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 439/622] lustre: llite: swab LOV EA data in ll_getxattr_lov() James Simmons
                   ` (184 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

Whenever we need to skip some pages for a stride I/O read, we
calculate the next start page index; however, that page index is
itself skipped every time, because the loop continues from
index + 1.
Testing command: iozone -w -c -i 5 -t1 -j 2 -s 100m -r 1m -F data
Without patch: 587384.69 kB/sec

                        read                    write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
1:                      16  19  19   |          0   0   0
2:                       0   0  19   |          0   0   0
4:                       0   0  19   |          0   0   0
8:                       0   0  19   |          0   0   0
16:                      0   0  19   |          0   0   0
32:                      0   0  19   |          0   0   0
64:                      0   0  19   |          0   0   0
128:                     0   0  19   |          0   0   0
256:                     0   0  19   |          0   0   0
512:                    22  26  46   |          0   0   0
1024:                   44  53 100   |          0   0   0

With patch: 744635.56 kB/sec
                        read                    write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
1:                       0   0   0   |          0   0   0
2:                       0   0   0   |          0   0   0
4:                       0   0   0   |          0   0   0
8:                       0   0   0   |          0   0   0
16:                      0   0   0   |          0   0   0
32:                      0   0   0   |          0   0   0
64:                      0   0   0   |          0   0   0
128:                     0   0   0   |          0   0   0
256:                     0   0   0   |          0   0   0
512:                     8  13  13   |          0   0   0
1024:                   50  86 100   |          0   0   0

We get ~27% better read performance here, and all 1-page RPCs
disappear.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12043
Lustre-commit: 29d8eb5ee7df ("LU-12043 llite: don't miss every first stride page")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/35216
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/rw.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/lustre/llite/rw.c b/fs/lustre/llite/rw.c
index 9c4b89f..4fec9a6 100644
--- a/fs/lustre/llite/rw.c
+++ b/fs/lustre/llite/rw.c
@@ -407,12 +407,12 @@ static int ras_inside_ra_window(unsigned long idx, struct ra_io_arg *ria)
 		} else if (stride_ria) {
 			/* If it is not in the read-ahead window, and it is
 			 * read-ahead mode, then check whether it should skip
-			 * the stride gap
+			 * the stride gap.
 			 */
 			pgoff_t offset;
-			/* FIXME: This assertion only is valid when it is for
-			 * forward read-ahead, it will be fixed when backward
-			 * read-ahead is implemented
+			/* NOTE: This assertion only is valid when it is for
+			 * forward read-ahead, must adjust if backward
+			 * readahead is implemented.
 			 */
 			LASSERTF(page_idx >= ria->ria_stoff,
 				 "Invalid page_idx %lu rs %lu re %lu ro %lu rl %lu rp %lu\n",
@@ -421,10 +421,11 @@ static int ras_inside_ra_window(unsigned long idx, struct ra_io_arg *ria)
 				 ria->ria_length, ria->ria_pages);
 			offset = page_idx - ria->ria_stoff;
 			offset = offset % (ria->ria_length);
-			if (offset > ria->ria_pages) {
-				page_idx += ria->ria_length - offset;
-				CDEBUG(D_READA, "i %lu skip %lu\n", page_idx,
-				       ria->ria_length - offset);
+			if (offset >= ria->ria_pages) {
+				page_idx += ria->ria_length - offset - 1;
+				CDEBUG(D_READA,
+				       "Stride: jump %lu pages to %lu\n",
+				       ria->ria_length - offset, page_idx);
 				continue;
 			}
 		}
-- 
1.8.3.1


* [lustre-devel] [PATCH 439/622] lustre: llite: swab LOV EA data in ll_getxattr_lov()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (437 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 438/622] lustre: llite: don't miss every first stride page James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 440/622] lustre: llite: Mark lustre_inode_cache as reclaimable James Simmons
                   ` (183 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Jian Yu <yujian@whamcloud.com>

On a PPC client, the LOV EA data returned by getfattr from an x86_64
server was not swabbed to host endianness. While running setfattr, the
data was swabbed in ll_lov_setstripe_ea_info(), which caused a magic
mismatch in ll_lov_user_md_size() and then ll_setstripe_ea() returned
-ERANGE.

This patch fixes the above issue by swabbing the LOV EA data in
ll_getxattr_lov().

WC-bug-id: https://jira.whamcloud.com/browse/LU-12589
Lustre-commit: 5590f5aa94a5 ("LU-12589 llite: swab LOV EA data in ll_getxattr_lov()")
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35626
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_swab.h |  2 +-
 fs/lustre/llite/dir.c           |  4 ++--
 fs/lustre/llite/file.c          |  4 ++--
 fs/lustre/llite/xattr.c         | 16 ++++++++--------
 fs/lustre/ptlrpc/pack_generic.c | 40 ++++++++++++++++++++++++++++++++++------
 5 files changed, 47 insertions(+), 19 deletions(-)

diff --git a/fs/lustre/include/lustre_swab.h b/fs/lustre/include/lustre_swab.h
index e99e16d..dd3c50c 100644
--- a/fs/lustre/include/lustre_swab.h
+++ b/fs/lustre/include/lustre_swab.h
@@ -86,7 +86,7 @@
 void lustre_swab_lov_comp_md_v1(struct lov_comp_md_v1 *lum);
 void lustre_swab_lov_user_md_objects(struct lov_user_ost_data *lod,
 				     int stripe_count);
-void lustre_swab_lov_user_md(struct lov_user_md *lum);
+void lustre_swab_lov_user_md(struct lov_user_md *lum, size_t size);
 void lustre_swab_lov_mds_md(struct lov_mds_md *lmm);
 void lustre_swab_lustre_capa(struct lustre_capa *c);
 void lustre_swab_lustre_capa_key(struct lustre_capa_key *k);
diff --git a/fs/lustre/llite/dir.c b/fs/lustre/llite/dir.c
index 3540c18..812f535 100644
--- a/fs/lustre/llite/dir.c
+++ b/fs/lustre/llite/dir.c
@@ -564,7 +564,7 @@ int ll_dir_setstripe(struct inode *inode, struct lov_user_md *lump,
 		 */
 		if ((__swab32(lump->lmm_magic) & le32_to_cpu(LOV_MAGIC_MASK)) ==
 		    le32_to_cpu(LOV_MAGIC_MAGIC))
-			lustre_swab_lov_user_md(lump);
+			lustre_swab_lov_user_md(lump, 0);
 	} else {
 		lum_size = sizeof(struct lov_user_md_v1);
 	}
@@ -696,7 +696,7 @@ int ll_dir_getstripe(struct inode *inode, void **plmm, int *plmm_size,
 	case LOV_MAGIC_COMP_V1:
 	case LOV_USER_MAGIC_SPECIFIC:
 		if (cpu_to_le32(LOV_MAGIC) != LOV_MAGIC)
-			lustre_swab_lov_user_md((struct lov_user_md *)lmm);
+			lustre_swab_lov_user_md((struct lov_user_md *)lmm, 0);
 		break;
 	case LMV_MAGIC_V1:
 		if (cpu_to_le32(LMV_MAGIC) != LMV_MAGIC)
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index fa61b09..6c5b9eb 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -1873,7 +1873,7 @@ int ll_lov_setstripe_ea_info(struct inode *inode, struct dentry *dentry,
 	if ((__swab32(lum->lmm_magic) & le32_to_cpu(LOV_MAGIC_MASK)) ==
 	    le32_to_cpu(LOV_MAGIC_MAGIC)) {
 		/* this code will only exist for big-endian systems */
-		lustre_swab_lov_user_md(lum);
+		lustre_swab_lov_user_md(lum, 0);
 	}
 
 	ll_inode_size_lock(inode);
@@ -1956,7 +1956,7 @@ int ll_lov_getstripe_ea_info(struct inode *inode, const char *filename,
 				stripe_count = 0;
 		}
 
-		lustre_swab_lov_user_md((struct lov_user_md *)lmm);
+		lustre_swab_lov_user_md((struct lov_user_md *)lmm, 0);
 
 		/* if function called for directory - we should
 		 * avoid swab not existent lsm objects
diff --git a/fs/lustre/llite/xattr.c b/fs/lustre/llite/xattr.c
index cf1cfd2..4e1ce34 100644
--- a/fs/lustre/llite/xattr.c
+++ b/fs/lustre/llite/xattr.c
@@ -320,7 +320,7 @@ static int ll_xattr_set(const struct xattr_handler *handler,
 	if (strncmp(name, "lov.", 4) == 0 &&
 	    (__swab32(((struct lov_user_md *)value)->lmm_magic) &
 	    le32_to_cpu(LOV_MAGIC_MASK)) == le32_to_cpu(LOV_MAGIC_MAGIC))
-		lustre_swab_lov_user_md((struct lov_user_md *)value);
+		lustre_swab_lov_user_md((struct lov_user_md *)value, 0);
 
 	return ll_xattr_set_common(handler, dentry, inode, name, value, size,
 				   flags);
@@ -459,7 +459,6 @@ static ssize_t ll_getxattr_lov(struct inode *inode, void *buf, size_t buf_size)
 		};
 		struct lu_env *env;
 		u16 refcheck;
-		u32 magic;
 
 		if (!obj)
 			return -ENODATA;
@@ -490,12 +489,12 @@ static ssize_t ll_getxattr_lov(struct inode *inode, void *buf, size_t buf_size)
 		 * recognizing layout gen as stripe offset when the
 		 * file is restored. See LU-2809.
 		 */
-		magic = ((struct lov_mds_md *)buf)->lmm_magic;
-		if ((magic & __swab32(LOV_MAGIC_MAGIC)) ==
-		    __swab32(LOV_MAGIC_MAGIC))
-			magic = __swab32(magic);
+		if ((((struct lov_mds_md *)buf)->lmm_magic &
+		    __swab32(LOV_MAGIC_MAGIC)) == __swab32(LOV_MAGIC_MAGIC))
+			lustre_swab_lov_user_md((struct lov_user_md *)buf,
+						cl.cl_size);
 
-		switch (magic) {
+		switch (((struct lov_mds_md *)buf)->lmm_magic) {
 		case LOV_MAGIC_V1:
 		case LOV_MAGIC_V3:
 		case LOV_MAGIC_SPECIFIC:
@@ -505,7 +504,8 @@ static ssize_t ll_getxattr_lov(struct inode *inode, void *buf, size_t buf_size)
 		case LOV_MAGIC_FOREIGN:
 			goto out_env;
 		default:
-			CERROR("Invalid LOV magic %08x\n", magic);
+			CERROR("Invalid LOV magic %08x\n",
+			       ((struct lov_mds_md *)buf)->lmm_magic);
 			rc = -EINVAL;
 			goto out_env;
 		}
diff --git a/fs/lustre/ptlrpc/pack_generic.c b/fs/lustre/ptlrpc/pack_generic.c
index b066113..6a4ea7a 100644
--- a/fs/lustre/ptlrpc/pack_generic.c
+++ b/fs/lustre/ptlrpc/pack_generic.c
@@ -2147,23 +2147,51 @@ void lustre_swab_lov_user_md_objects(struct lov_user_ost_data *lod,
 }
 EXPORT_SYMBOL(lustre_swab_lov_user_md_objects);
 
-void lustre_swab_lov_user_md(struct lov_user_md *lum)
+void lustre_swab_lov_user_md(struct lov_user_md *lum, size_t size)
 {
+	struct lov_user_md_v1 *v1;
+	struct lov_user_md_v3 *v3;
+	struct lov_foreign_md *lfm;
+	u16 stripe_count;
+
 	CDEBUG(D_IOCTL, "swabbing lov_user_md\n");
 	switch (lum->lmm_magic) {
 	case __swab32(LOV_MAGIC_V1):
 	case LOV_USER_MAGIC_V1:
-		lustre_swab_lov_user_md_v1((struct lov_user_md_v1 *)lum);
+	{
+		v1 = (struct lov_user_md_v1 *)lum;
+		stripe_count = v1->lmm_stripe_count;
+
+		if (lum->lmm_magic != LOV_USER_MAGIC_V1)
+			__swab16s(&stripe_count);
+
+		lustre_swab_lov_user_md_v1(v1);
+		if (size > sizeof(*v1))
+			lustre_swab_lov_user_md_objects(v1->lmm_objects,
+							stripe_count);
+
 		break;
+	}
 	case __swab32(LOV_MAGIC_V3):
 	case LOV_USER_MAGIC_V3:
-		lustre_swab_lov_user_md_v3((struct lov_user_md_v3 *)lum);
+	{
+		v3 = (struct lov_user_md_v3 *)lum;
+		stripe_count = v3->lmm_stripe_count;
+
+		if (lum->lmm_magic != LOV_USER_MAGIC_V3)
+			__swab16s(&stripe_count);
+
+		lustre_swab_lov_user_md_v3(v3);
+		if (size > sizeof(*v3))
+			lustre_swab_lov_user_md_objects(v3->lmm_objects,
+							stripe_count);
 		break;
+	}
 	case __swab32(LOV_USER_MAGIC_SPECIFIC):
 	case LOV_USER_MAGIC_SPECIFIC:
 	{
-		struct lov_user_md_v3 *v3 = (struct lov_user_md_v3 *)lum;
-		u16 stripe_count = v3->lmm_stripe_count;
+		v3 = (struct lov_user_md_v3 *)lum;
+		stripe_count = v3->lmm_stripe_count;
 
 		if (lum->lmm_magic != LOV_USER_MAGIC_SPECIFIC)
 			__swab16s(&stripe_count);
@@ -2179,7 +2207,7 @@ void lustre_swab_lov_user_md(struct lov_user_md *lum)
 	case __swab32(LOV_MAGIC_FOREIGN):
 	case LOV_USER_MAGIC_FOREIGN:
 	{
-		struct lov_foreign_md *lfm = (struct lov_foreign_md *)lum;
+		lfm = (struct lov_foreign_md *)lum;
 
 		__swab32s(&lfm->lfm_magic);
 		__swab32s(&lfm->lfm_length);
-- 
1.8.3.1


* [lustre-devel] [PATCH 440/622] lustre: llite: Mark lustre_inode_cache as reclaimable
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (438 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 439/622] lustre: llite: swab LOV EA data in ll_getxattr_lov() James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 441/622] lustre: osc: add preferred checksum type support James Simmons
                   ` (182 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Jacek Tomaka <jacek.tomaka@poczta.fm>

This is required for proper accounting of available kernel memory.
Without it, memory allocated to lustre_inode_cache appears as
SUnreclaim when in reality it should appear as SReclaimable.
This affects MemAvailable as well (it is lower than it should be).

WC-bug-id: https://jira.whamcloud.com/browse/LU-12313
Lustre-commit: b09e63db24e5 ("LU-12313 llite: Mark lustre_inode_cache as reclaimable")
Signed-off-by: Jacek Tomaka <jacek.tomaka@poczta.fm>
Reviewed-on: https://review.whamcloud.com/35790
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Neil Brown <neilb@suse.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/super25.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/llite/super25.c b/fs/lustre/llite/super25.c
index afd51a6..38d60b0 100644
--- a/fs/lustre/llite/super25.c
+++ b/fs/lustre/llite/super25.c
@@ -211,7 +211,11 @@ static int __init lustre_init(void)
 	rc = -ENOMEM;
 	ll_inode_cachep = kmem_cache_create("lustre_inode_cache",
 					    sizeof(struct ll_inode_info), 0,
-					    SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT,
+					    SLAB_HWCACHE_ALIGN |
+					    SLAB_RECLAIM_ACCOUNT |
+					    SLAB_ACCOUNT |
+					    SLAB_MEM_SPREAD |
+					    SLAB_ACCOUNT,
 					    NULL);
 	if (!ll_inode_cachep)
 		goto out_cache;
-- 
1.8.3.1


* [lustre-devel] [PATCH 441/622] lustre: osc: add preferred checksum type support
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (439 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 440/622] lustre: llite: Mark lustre_inode_cache as reclaimable James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 442/622] lustre: ptlrpc: Stop sending ptlrpc_body_v2 James Simmons
                   ` (181 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Li Xi <lixi@ddn.com>

Some checksum types might not work correctly even though they are
available options and have the best speeds during test. In these
circumstances, users might want to use a certain checksum type which
is known to be functional. However, "lctl conf_param XXX-YYY.osc.
checksum_type=ZZZ" won't help to enforce a certain checksum type,
because the selected checksum type is determined during OSC
connection, which will overwrite the LLOG parameter.

To solve this problem, whenever a valid checksum type is set by "lctl
conf_param" or "lctl set_param", it is remembered as the preferred
checksum type for the OSC. During the connection process, if that
checksum type is available, it will be selected as the RPC checksum
type regardless of its speed.

The semantics of the /proc/fs/lustre/osc/*/checksum_type interface
change slightly. If an invalid checksum name is written into this
entry, -EINVAL is returned as before. If the written string is a
valid checksum name, then even though the checksum type is not
supported by this OSC/OST pair, it will still be remembered as the
preferred checksum type, and the return value will be -ENOTSUPP.
Whenever connecting/reconnecting happens, if the preferred checksum
type is available, it will be used for the RPC checksum.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11011
Lustre-commit: 9b6b5e479828 ("LU-11011 osc: add preferred checksum type support")
Signed-off-by: Li Xi <lixi@ddn.com>
Reviewed-on: https://review.whamcloud.com/32349
Reviewed-by: Li Dongyang <dongyangli@ddn.com>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd.h       |  2 ++
 fs/lustre/include/obd_cksum.h | 13 ++++++++++---
 fs/lustre/ldlm/ldlm_lib.c     |  1 +
 fs/lustre/osc/lproc_osc.c     | 19 ++++++++++++-------
 fs/lustre/ptlrpc/import.c     |  3 ++-
 5 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index 886c697..70dbaaf 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -339,6 +339,8 @@ struct client_obd {
 	u32			cl_supp_cksum_types;
 	/* checksum algorithm to be used */
 	enum cksum_type		cl_cksum_type;
+	/* preferred checksum algorithm to be used */
+	enum cksum_type		cl_preferred_cksum_type;
 
 	/* also protected by the poorly named _loi_list_lock lock above */
 	struct osc_async_rc     cl_ar;
diff --git a/fs/lustre/include/obd_cksum.h b/fs/lustre/include/obd_cksum.h
index cc47c44..c03d0e6 100644
--- a/fs/lustre/include/obd_cksum.h
+++ b/fs/lustre/include/obd_cksum.h
@@ -109,10 +109,17 @@ static inline enum cksum_type obd_cksum_types_supported_client(void)
  * Caution is advised, however, since what is fastest on a single client may
  * not be the fastest or most efficient algorithm on the server.
  */
-static inline enum cksum_type
-obd_cksum_type_select(const char *obd_name, enum cksum_type cksum_types)
+static inline
+enum cksum_type obd_cksum_type_select(const char *obd_name,
+				       enum cksum_type cksum_types,
+				       enum cksum_type preferred)
 {
-	u32 flag = obd_cksum_type_pack(obd_name, cksum_types);
+	u32 flag;
+
+	if (preferred & cksum_types)
+		return preferred;
+
+	flag = obd_cksum_type_pack(obd_name, cksum_types);
 
 	return obd_cksum_type_unpack(flag);
 }
diff --git a/fs/lustre/ldlm/ldlm_lib.c b/fs/lustre/ldlm/ldlm_lib.c
index af74f97..127ed32 100644
--- a/fs/lustre/ldlm/ldlm_lib.c
+++ b/fs/lustre/ldlm/ldlm_lib.c
@@ -364,6 +364,7 @@ int client_obd_setup(struct obd_device *obddev, struct lustre_cfg *lcfg)
 	atomic_set(&cli->cl_destroy_in_flight, 0);
 
 	cli->cl_supp_cksum_types = OBD_CKSUM_CRC32;
+	cli->cl_preferred_cksum_type = 0;
 	/* Turn on checksumming by default. */
 	cli->cl_checksum = 1;
 	/*
diff --git a/fs/lustre/osc/lproc_osc.c b/fs/lustre/osc/lproc_osc.c
index 775bf74..8e0088b 100644
--- a/fs/lustre/osc/lproc_osc.c
+++ b/fs/lustre/osc/lproc_osc.c
@@ -415,6 +415,7 @@ static ssize_t osc_checksum_type_seq_write(struct file *file,
 	DECLARE_CKSUM_NAME;
 	char kernbuf[10];
 	int i;
+	int rc = -EINVAL;
 
 	if (!obd)
 		return 0;
@@ -423,22 +424,26 @@ static ssize_t osc_checksum_type_seq_write(struct file *file,
 		return -EINVAL;
 	if (copy_from_user(kernbuf, buffer, count))
 		return -EFAULT;
+
 	if (count > 0 && kernbuf[count - 1] == '\n')
 		kernbuf[count - 1] = '\0';
 	else
 		kernbuf[count] = '\0';
 
 	for (i = 0; i < ARRAY_SIZE(cksum_name); i++) {
-		if (((1 << i) & obd->u.cli.cl_supp_cksum_types) == 0)
-			continue;
-		if (!strcmp(kernbuf, cksum_name[i])) {
-			obd->u.cli.cl_cksum_type = 1 << i;
-			return count;
+		if (strcmp(kernbuf, cksum_name[i]) == 0) {
+			obd->u.cli.cl_preferred_cksum_type = BIT(i);
+			if (obd->u.cli.cl_supp_cksum_types & BIT(i)) {
+				obd->u.cli.cl_cksum_type = BIT(i);
+				rc = count;
+			} else {
+				rc = -ENOTSUPP;
+			}
+			break;
 		}
 	}
-	return -EINVAL;
+	return rc;
 }
-
 LPROC_SEQ_FOPS(osc_checksum_type);
 
 static ssize_t resend_count_show(struct kobject *kobj,
diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index 0ade41e..a6d0b32 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -846,7 +846,8 @@ static int ptlrpc_connect_set_flags(struct obd_import *imp,
 		cli->cl_supp_cksum_types = OBD_CKSUM_ADLER;
 	}
 	cli->cl_cksum_type = obd_cksum_type_select(imp->imp_obd->obd_name,
-						   cli->cl_supp_cksum_types);
+						   cli->cl_supp_cksum_types,
+						   cli->cl_preferred_cksum_type);
 
 	if (ocd->ocd_connect_flags & OBD_CONNECT_BRW_SIZE)
 		cli->cl_max_pages_per_rpc =
-- 
1.8.3.1


* [lustre-devel] [PATCH 442/622] lustre: ptlrpc: Stop sending ptlrpc_body_v2
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (440 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 441/622] lustre: osc: add preferred checksum type support James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 443/622] lnet: Fix style issues for selftest/rpc.c James Simmons
                   ` (180 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

ptlrpc_body_v2 does not include space for jobids, which means
that when we added the jobid to the RPC debug messages, we
started getting errors like this:

LustreError: 6817:0:(pack_generic.c:425:lustre_msg_buf_v2()) msg
000000005c83b7a2 buffer[0] size 152 too small (required 184, opc=-1)

This happened every time we tried to print a ptlrpc_body_v2
message.

body_v2 is still sent on some RPCs for compatibility with
very old versions of Lustre, but we no longer support
interop with those versions (latest reported is 2.3).

So, stop sending ptlrpc_body_v2 on any RPCs.

Note that we need to retain the ptlrpc_body_v2 definitions
and parsing capability for interop with servers which still
use them for some messages, i.e. all servers prior to this
patch.

One further note:
This does *not* fix the case of newer clients collecting
rpctrace with older servers.  They will still see the
error message for some RPCs.  That could be fixed with
tweaks to the debug printing code.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12523
Lustre-commit: fb18c05c0f5e ("LU-12523 ptlrpc: Stop sending ptlrpc_body_v2")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35583
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/client.c       | 27 +--------------------------
 fs/lustre/ptlrpc/niobuf.c       | 11 -----------
 fs/lustre/ptlrpc/pack_generic.c | 16 ++--------------
 3 files changed, 3 insertions(+), 51 deletions(-)

diff --git a/fs/lustre/ptlrpc/client.c b/fs/lustre/ptlrpc/client.c
index dcc5e6b..c750a4e 100644
--- a/fs/lustre/ptlrpc/client.c
+++ b/fs/lustre/ptlrpc/client.c
@@ -817,32 +817,7 @@ int ptlrpc_request_bufs_pack(struct ptlrpc_request *request,
 int ptlrpc_request_pack(struct ptlrpc_request *request,
 			u32 version, int opcode)
 {
-	int rc;
-
-	rc = ptlrpc_request_bufs_pack(request, version, opcode, NULL, NULL);
-	if (rc)
-		return rc;
-
-	/*
-	 * For some old 1.8 clients (< 1.8.7), they will LASSERT the size of
-	 * ptlrpc_body sent from server equal to local ptlrpc_body size, so we
-	 * have to send old ptlrpc_body to keep interoperability with these
-	 * clients.
-	 *
-	 * Only three kinds of server->client RPCs so far:
-	 *  - LDLM_BL_CALLBACK
-	 *  - LDLM_CP_CALLBACK
-	 *  - LDLM_GL_CALLBACK
-	 *
-	 * XXX This should be removed whenever we drop the interoperability with
-	 *     the these old clients.
-	 */
-	if (opcode == LDLM_BL_CALLBACK || opcode == LDLM_CP_CALLBACK ||
-	    opcode == LDLM_GL_CALLBACK)
-		req_capsule_shrink(&request->rq_pill, &RMF_PTLRPC_BODY,
-				   sizeof(struct ptlrpc_body_v2), RCL_CLIENT);
-
-	return rc;
+	return ptlrpc_request_bufs_pack(request, version, opcode, NULL, NULL);
 }
 EXPORT_SYMBOL(ptlrpc_request_pack);
 
diff --git a/fs/lustre/ptlrpc/niobuf.c b/fs/lustre/ptlrpc/niobuf.c
index 2e866fe..9d9e94c 100644
--- a/fs/lustre/ptlrpc/niobuf.c
+++ b/fs/lustre/ptlrpc/niobuf.c
@@ -388,17 +388,6 @@ int ptlrpc_send_reply(struct ptlrpc_request *req, int flags)
 		       req->rq_export->exp_obd->obd_minor);
 	}
 
-	/* In order to keep interoperability with the client (< 2.3) which
-	 * doesn't have pb_jobid in ptlrpc_body, We have to shrink the
-	 * ptlrpc_body in reply buffer to ptlrpc_body_v2, otherwise, the
-	 * reply buffer on client will be overflow.
-	 *
-	 * XXX Remove this whenever we drop the interoperability with
-	 * such client.
-	 */
-	req->rq_replen = lustre_shrink_msg(req->rq_repmsg, 0,
-					   sizeof(struct ptlrpc_body_v2), 1);
-
 	if (req->rq_type != PTL_RPC_MSG_ERR)
 		req->rq_type = PTL_RPC_MSG_REPLY;
 
diff --git a/fs/lustre/ptlrpc/pack_generic.c b/fs/lustre/ptlrpc/pack_generic.c
index 6a4ea7a..e63720b 100644
--- a/fs/lustre/ptlrpc/pack_generic.c
+++ b/fs/lustre/ptlrpc/pack_generic.c
@@ -91,21 +91,9 @@ bool ptlrpc_buf_need_swab(struct ptlrpc_request *req, const int inout,
 /* early reply size */
 u32 lustre_msg_early_size(void)
 {
-	static u32 size;
-
-	if (!size) {
-		/* Always reply old ptlrpc_body_v2 to keep interoperability
-		 * with the old client (< 2.3) which doesn't have pb_jobid
-		 * in the ptlrpc_body.
-		 *
-		 * XXX Remove this whenever we drop interoperability with such
-		 *     client.
-		 */
-		u32 pblen = sizeof(struct ptlrpc_body_v2);
+	u32 pblen = sizeof(struct ptlrpc_body);
 
-		size = lustre_msg_size(LUSTRE_MSG_MAGIC_V2, 1, &pblen);
-	}
-	return size;
+	return lustre_msg_size(LUSTRE_MSG_MAGIC_V2, 1, &pblen);
 }
 EXPORT_SYMBOL(lustre_msg_early_size);
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 443/622] lnet: Fix style issues for selftest/rpc.c
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (441 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 442/622] lustre: ptlrpc: Stop sending ptlrpc_body_v2 James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 444/622] lnet: Fix style issues for module.c conctl.c James Simmons
                   ` (179 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Shaun Tancheff <stancheff@cray.com>

This patch fixes issues reported by checkpatch for the file
selftest/rpc.c. Linux 5.3 enforces the use of 'fallthrough',
which is also suggested by checkpatch.

Cray-bug-id: LUS-7690
WC-bug-id: https://jira.whamcloud.com/browse/LU-12635
Lustre-commit: 4bfe21d09c39 ("LU-12635 lnet: Fix style issues for selftest/rpc.c")
Signed-off-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-on: https://review.whamcloud.com/35800
Reviewed-by: Petros Koutoupis <pkoutoupis@cray.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/selftest/rpc.c | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/net/lnet/selftest/rpc.c b/net/lnet/selftest/rpc.c
index a5941e4..4645f04 100644
--- a/net/lnet/selftest/rpc.c
+++ b/net/lnet/selftest/rpc.c
@@ -141,7 +141,8 @@ struct srpc_bulk *
 		struct page *pg;
 		int nob;
 
-		pg = alloc_pages_node(cfs_cpt_spread_node(lnet_cpt_table(), cpt),
+		pg = alloc_pages_node(cfs_cpt_spread_node(lnet_cpt_table(),
+							  cpt),
 				      GFP_KERNEL, 0);
 		if (!pg) {
 			CERROR("Can't allocate page %d of %d\n", i, bulk_npg);
@@ -386,7 +387,8 @@ struct srpc_bulk *
 		return -ENOMEM;
 	}
 
-	CDEBUG(D_NET, "Posted passive RDMA: peer %s, portal %d, matchbits %#llx\n",
+	CDEBUG(D_NET,
+	       "Posted passive RDMA: peer %s, portal %d, matchbits %#llx\n",
 	       libcfs_id2str(peer), portal, matchbits);
 	return 0;
 }
@@ -440,7 +442,8 @@ struct srpc_bulk *
 		rc = LNetMDUnlink(*mdh);
 		LASSERT(!rc);
 	} else {
-		CDEBUG(D_NET, "Posted active RDMA: peer %s, portal %u, matchbits %#llx\n",
+		CDEBUG(D_NET,
+		       "Posted active RDMA: peer %s, portal %u, matchbits %#llx\n",
 		       libcfs_id2str(peer), portal, matchbits);
 	}
 	return 0;
@@ -515,7 +518,8 @@ struct srpc_bulk *
 void
 srpc_add_buffer(struct swi_workitem *wi)
 {
-	struct srpc_service_cd *scd = container_of(wi, struct srpc_service_cd, scd_buf_wi);
+	struct srpc_service_cd *scd = container_of(wi, struct srpc_service_cd,
+						   scd_buf_wi);
 	struct srpc_buffer *buf;
 	int rc = 0;
 
@@ -662,7 +666,8 @@ struct srpc_bulk *
 		spin_lock(&scd->scd_lock);
 
 		if (scd->scd_buf_nposted > 0) {
-			CDEBUG(D_NET, "waiting for %d posted buffers to unlink\n",
+			CDEBUG(D_NET,
+			       "waiting for %d posted buffers to unlink\n",
 			       scd->scd_buf_nposted);
 			spin_unlock(&scd->scd_lock);
 			return 0;
@@ -960,7 +965,8 @@ struct srpc_bulk *
 void
 srpc_handle_rpc(struct swi_workitem *wi)
 {
-	struct srpc_server_rpc *rpc = container_of(wi, struct srpc_server_rpc, srpc_wi);
+	struct srpc_server_rpc *rpc = container_of(wi, struct srpc_server_rpc,
+						   srpc_wi);
 	struct srpc_service_cd *scd = rpc->srpc_scd;
 	struct srpc_service *sv = scd->scd_svc;
 	struct srpc_event *ev = &rpc->srpc_ev;
@@ -1398,7 +1404,9 @@ struct srpc_client_rpc *
 	return rc;
 }
 
-/* when in kernel always called with lnet_net_lock() held, and in thread context */
+/* when in kernel always called with lnet_net_lock() held,
+ * and in thread context
+ */
 static void
 srpc_lnet_ev_handler(struct lnet_event *ev)
 {
@@ -1451,7 +1459,8 @@ struct srpc_client_rpc *
 			       rpcev, crpc, &crpc->crpc_reqstev,
 			       &crpc->crpc_replyev, &crpc->crpc_bulkev);
 			CERROR("Bad event: status %d, type %d, lnet %d\n",
-			       rpcev->ev_status, rpcev->ev_type, rpcev->ev_lnet);
+			       rpcev->ev_status, rpcev->ev_type,
+			       rpcev->ev_lnet);
 			LBUG();
 		}
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 444/622] lnet: Fix style issues for module.c conctl.c
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (442 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 443/622] lnet: Fix style issues for selftest/rpc.c James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 445/622] lustre: ptlrpc: check lm_bufcount and lm_buflen James Simmons
                   ` (178 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Shaun Tancheff <stancheff@cray.com>

This patch fixes issues reported by checkpatch for the files
selftest/module.c and selftest/conctl.c. Linux 5.3 enforces the
use of 'fallthrough', which is also suggested by checkpatch.

Cray-bug-id: LUS-7690
WC-bug-id: https://jira.whamcloud.com/browse/LU-12635
Lustre-commit: ebff8aba3392 ("LU-12635 lnet: Fix style issues for module.c conctl.c")
Signed-off-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-on: https://review.whamcloud.com/35802
Reviewed-by: Petros Koutoupis <pkoutoupis@cray.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/selftest/conctl.c | 4 ++--
 net/lnet/selftest/module.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/lnet/selftest/conctl.c b/net/lnet/selftest/conctl.c
index 906d82d..ed9eab9 100644
--- a/net/lnet/selftest/conctl.c
+++ b/net/lnet/selftest/conctl.c
@@ -121,7 +121,6 @@
 		return -EINVAL;
 
 	if (args->lstio_dbg_namep) {
-
 		if (copy_from_user(name, args->lstio_dbg_namep,
 				   args->lstio_dbg_nmlen))
 			return -EFAULT;
@@ -727,7 +726,8 @@ static int lst_test_add_ioctl(struct lstio_test_args *args)
 		goto out;
 	}
 
-	memset(&console_session.ses_trans_stat, 0, sizeof(struct lstcon_trans_stat));
+	memset(&console_session.ses_trans_stat,
+	       0, sizeof(struct lstcon_trans_stat));
 
 	switch (opc) {
 	case LSTIO_SESSION_NEW:
diff --git a/net/lnet/selftest/module.c b/net/lnet/selftest/module.c
index 9ba6532..2de2b59 100644
--- a/net/lnet/selftest/module.c
+++ b/net/lnet/selftest/module.c
@@ -105,7 +105,7 @@ enum {
 
 	nscheds = cfs_cpt_number(lnet_cpt_table());
 	lst_test_wq = kvmalloc_array(nscheds, sizeof(lst_test_wq[0]),
-					GFP_KERNEL | __GFP_ZERO);
+				     GFP_KERNEL | __GFP_ZERO);
 	if (!lst_test_wq) {
 		rc = -ENOMEM;
 		goto error;
-- 
1.8.3.1


* [lustre-devel] [PATCH 445/622] lustre: ptlrpc: check lm_bufcount and lm_buflen
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (443 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 444/622] lnet: Fix style issues for module.c conctl.c James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 446/622] lustre: uapi: Remove unused CONNECT flag James Simmons
                   ` (177 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Emoly Liu <emoly@whamcloud.com>

Check lm_bufcount before it is used by lustre_msg_hdr_size_v2(), and
validate the individual and total buffer lengths in
lustre_unpack_msg_v2(), to prevent any out-of-bounds read.
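A rough standalone sketch of this kind of bounds checking (the struct layout, limits, and helper names here are simplified stand-ins, not the real Lustre definitions):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified model of the lm_bufcount/lm_buflen checks. */
#define MAX_BUFCOUNT	32
#define MAX_BUFLEN	(1024 * 1024)

struct msg_v2 {
	uint32_t lm_bufcount;
	uint32_t lm_buflens[MAX_BUFCOUNT];
};

/* Round up to 8 bytes, as cfs_size_round() does. */
static uint32_t size_round(uint32_t len)
{
	return (len + 7) & ~7U;
}

/* Return 0 if the message fits in 'len' received bytes, -1 otherwise. */
static int unpack_check(const struct msg_v2 *m, size_t len)
{
	size_t required = offsetof(struct msg_v2, lm_buflens);
	uint32_t i;

	/* reject bogus buffer counts before sizing the header */
	if (m->lm_bufcount == 0 || m->lm_bufcount > MAX_BUFCOUNT)
		return -1;

	required += m->lm_bufcount * sizeof(uint32_t);
	for (i = 0; i < m->lm_bufcount; i++) {
		uint32_t raw = m->lm_buflens[i];

		/* bound each buffer before rounding, so a huge value
		 * cannot wrap around to a small "required" total */
		if (raw > MAX_BUFLEN)
			return -1;
		required += size_round(raw);
	}
	return len < required ? -1 : 0;
}
```

The point, as in the patch, is that every attacker-controlled count and length is validated before it feeds any size arithmetic or buffer access.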

Reported-by: Alibaba Cloud <yunye.ry@alibaba-inc.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-12590
Lustre-commit: 268edb13d769 ("LU-12590 ptlrpc: check lm_bufcount and lm_buflen")
Signed-off-by: Emoly Liu <emoly@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35783
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-by: Yunye Ry <yunye.ry@alibaba-inc.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_net.h  | 40 ++++++++++++++++++++++++++++++++++++++++
 fs/lustre/ptlrpc/pack_generic.c | 29 +++++++++++++++++++++++------
 2 files changed, 63 insertions(+), 6 deletions(-)

diff --git a/fs/lustre/include/lustre_net.h b/fs/lustre/include/lustre_net.h
index d03e8c6..caf766d 100644
--- a/fs/lustre/include/lustre_net.h
+++ b/fs/lustre/include/lustre_net.h
@@ -238,6 +238,34 @@
  *
  */
 
+/**
+ * This is the size of a maximum REINT_SETXATTR request:
+ *
+ *   lustre_msg		 56 (32 + 4 x 5 + 4)
+ *   ptlrpc_body	184
+ *   mdt_rec_setxattr	136
+ *   lustre_capa	120
+ *   name		256 (XATTR_NAME_MAX)
+ *   value	      65536 (XATTR_SIZE_MAX)
+ */
+#define MDS_EA_MAXREQSIZE	66288
+
+/**
+ * These are the maximum request and reply sizes (rounded up to 1 KB
+ * boundaries) for the "regular" MDS_REQUEST_PORTAL and MDS_REPLY_PORTAL.
+ */
+#define MDS_REG_MAXREQSIZE	(((max(MDS_EA_MAXREQSIZE, \
+				       MDS_LOV_MAXREQSIZE) + 1023) >> 10) << 10)
+#define MDS_REG_MAXREPSIZE	MDS_REG_MAXREQSIZE
+
+/**
+ * The update request includes all of updates from the create, which might
+ * include linkea (4K maxim), together with other updates, we set it to 1000K:
+ * lustre_msg + ptlrpc_body + OUT_UPDATE_BUFFER_SIZE_MAX
+ */
+#define OUT_MAXREQSIZE	(1000 * 1024)
+#define OUT_MAXREPSIZE	MDS_MAXREPSIZE
+
  /*
   * LDLM threads constants:
   *
@@ -291,6 +319,12 @@
 				 (DT_MAX_BRW_PAGES - 1)))
 
 /**
+ * MDS incoming request with LOV EA
+ * 24 = sizeof(struct lov_ost_data), i.e: replay of opencreate
+ */
+#define MDS_LOV_MAXREQSIZE	max(MDS_MAXREQSIZE, \
+				    362 + LOV_MAX_STRIPE_COUNT * 24)
+/**
  * FIEMAP request can be 4K+ for now
  */
 #define OST_MAXREQSIZE		(16UL * 1024UL)
@@ -2017,6 +2051,12 @@ struct ptlrpc_service *ptlrpc_register_service(struct ptlrpc_service_conf *conf,
  *
  * @{
  */
+#define PTLRPC_MAX_BUFCOUNT \
+	(sizeof(((struct ptlrpc_request *)0)->rq_req_swab_mask) * 8)
+#define MD_MAX_BUFLEN		(MDS_REG_MAXREQSIZE > OUT_MAXREQSIZE ? \
+				 MDS_REG_MAXREQSIZE : OUT_MAXREQSIZE)
+#define PTLRPC_MAX_BUFLEN	(OST_IO_MAXREQSIZE > MD_MAX_BUFLEN ? \
+				 OST_IO_MAXREQSIZE : MD_MAX_BUFLEN)
 bool ptlrpc_buf_need_swab(struct ptlrpc_request *req, const int inout,
 			  u32 index);
 void ptlrpc_buf_set_swabbed(struct ptlrpc_request *req, const int inout,
diff --git a/fs/lustre/ptlrpc/pack_generic.c b/fs/lustre/ptlrpc/pack_generic.c
index e63720b..4a0856a 100644
--- a/fs/lustre/ptlrpc/pack_generic.c
+++ b/fs/lustre/ptlrpc/pack_generic.c
@@ -60,6 +60,8 @@ static inline u32 lustre_msg_hdr_size_v2(u32 count)
 
 u32 lustre_msg_hdr_size(u32 magic, u32 count)
 {
+	LASSERT(count > 0);
+
 	switch (magic) {
 	case LUSTRE_MSG_MAGIC_V2:
 		return lustre_msg_hdr_size_v2(count);
@@ -102,6 +104,7 @@ u32 lustre_msg_size_v2(int count, u32 *lengths)
 	u32 size;
 	int i;
 
+	LASSERT(count > 0);
 	size = lustre_msg_hdr_size_v2(count);
 	for (i = 0; i < count; i++)
 		size += cfs_size_round(lengths[i]);
@@ -159,6 +162,8 @@ void lustre_init_msg_v2(struct lustre_msg_v2 *msg, int count, u32 *lens,
 	char *ptr;
 	int i;
 
+	LASSERT(count > 0);
+
 	msg->lm_bufcount = count;
 	/* XXX: lm_secflvr uninitialized here */
 	msg->lm_magic = LUSTRE_MSG_MAGIC_V2;
@@ -291,6 +296,7 @@ int lustre_pack_reply_v2(struct ptlrpc_request *req, int count,
 	int msg_len, rc;
 
 	LASSERT(!req->rq_reply_state);
+	LASSERT(count > 0);
 
 	if ((flags & LPRFL_EARLY_REPLY) == 0) {
 		spin_lock(&req->rq_lock);
@@ -366,6 +372,9 @@ void *lustre_msg_buf_v2(struct lustre_msg_v2 *m, u32 n, u32 min_size)
 {
 	u32 i, offset, buflen, bufcount;
 
+	LASSERT(m);
+	LASSERT(m->lm_bufcount > 0);
+
 	bufcount = m->lm_bufcount;
 	if (unlikely(n >= bufcount)) {
 		CDEBUG(D_INFO, "msg %p buffer[%d] not present (count %d)\n",
@@ -479,7 +488,7 @@ void lustre_free_reply_state(struct ptlrpc_reply_state *rs)
 
 static int lustre_unpack_msg_v2(struct lustre_msg_v2 *m, int len)
 {
-	int swabbed, required_len, i;
+	int swabbed, required_len, i, buflen;
 
 	/* Now we know the sender speaks my language. */
 	required_len = lustre_msg_hdr_size_v2(0);
@@ -502,6 +511,10 @@ static int lustre_unpack_msg_v2(struct lustre_msg_v2 *m, int len)
 		BUILD_BUG_ON(offsetof(typeof(*m), lm_padding_3) == 0);
 	}
 
+	if (m->lm_bufcount == 0 || m->lm_bufcount > PTLRPC_MAX_BUFCOUNT) {
+		CERROR("message bufcount %d is not valid\n", m->lm_bufcount);
+		return -EINVAL;
+	}
 	required_len = lustre_msg_hdr_size_v2(m->lm_bufcount);
 	if (len < required_len) {
 		/* didn't receive all the buffer lengths */
@@ -513,12 +526,16 @@ static int lustre_unpack_msg_v2(struct lustre_msg_v2 *m, int len)
 	for (i = 0; i < m->lm_bufcount; i++) {
 		if (swabbed)
 			__swab32s(&m->lm_buflens[i]);
-		required_len += cfs_size_round(m->lm_buflens[i]);
+		buflen = cfs_size_round(m->lm_buflens[i]);
+		if (buflen < 0 || buflen > PTLRPC_MAX_BUFLEN) {
+			CERROR("buffer %d length %d is not valid\n", i, buflen);
+			return -EINVAL;
+		}
+		required_len += buflen;
 	}
-
-	if (len < required_len) {
-		CERROR("len: %d, required_len %d\n", len, required_len);
-		CERROR("bufcount: %d\n", m->lm_bufcount);
+	if (len < required_len || required_len > PTLRPC_MAX_BUFLEN) {
+		CERROR("len: %d, required_len %d, bufcount: %d\n",
+		       len, required_len, m->lm_bufcount);
 		for (i = 0; i < m->lm_bufcount; i++)
 			CERROR("buffer %d length %d\n", i, m->lm_buflens[i]);
 		return -EINVAL;
-- 
1.8.3.1


* [lustre-devel] [PATCH 446/622] lustre: uapi: Remove unused CONNECT flag
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (444 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 445/622] lustre: ptlrpc: check lm_bufcount and lm_buflen James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 447/622] lustre: lmv: disable remote file statahead James Simmons
                   ` (176 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

The plain layout connect flag was added as part of an
earlier implementation of LU-11213, but the design was
improved before landing and the flag was not needed.

Let's remove it.  Since it was never actually marked as
supported in any client/server version, we can just remove
it entirely, leaving the flag bit open for future use.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11213
Lustre-commit: 11eba11fe045 ("LU-11213 uapi: Remove unused CONNECT flag")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/36008
Reviewed-by: Shilong Wang <wshilong@ddn.com>
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/wiretest.c            | 2 --
 include/uapi/linux/lustre/lustre_idl.h | 1 -
 2 files changed, 3 deletions(-)

diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index 9298c97..c0b4ad9 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1158,8 +1158,6 @@ void lustre_assert_wire_constants(void)
 		 OBD_CONNECT2_LSOM);
 	LASSERTF(OBD_CONNECT2_PCC == 0x1000ULL, "found 0x%.16llxULL\n",
 		 OBD_CONNECT2_PCC);
-	LASSERTF(OBD_CONNECT2_PLAIN_LAYOUT == 0x2000ULL, "found 0x%.16llxULL\n",
-		 OBD_CONNECT2_PLAIN_LAYOUT);
 	LASSERTF(OBD_CONNECT2_ASYNC_DISCARD == 0x4000ULL, "found 0x%.16llxULL\n",
 		 OBD_CONNECT2_ASYNC_DISCARD);
 	LASSERTF(OBD_CKSUM_CRC32 == 0x00000001UL, "found 0x%.8xUL\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 87251ee..47321ae 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -810,7 +810,6 @@ struct ptlrpc_body_v2 {
 #define OBD_CONNECT2_SELINUX_POLICY    0x400ULL	/* has client SELinux policy */
 #define OBD_CONNECT2_LSOM	       0x800ULL	/* LSOM support */
 #define OBD_CONNECT2_PCC	       0x1000ULL /* Persistent Client Cache */
-#define OBD_CONNECT2_PLAIN_LAYOUT      0x2000ULL /* Plain Directory Layout */
 #define OBD_CONNECT2_ASYNC_DISCARD     0x4000ULL /* support async DoM data
 						  * discard
 						  */
-- 
1.8.3.1


* [lustre-devel] [PATCH 447/622] lustre: lmv: disable remote file statahead
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (445 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 446/622] lustre: uapi: Remove unused CONNECT flag James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 448/622] lustre: llite: Fix page count for unaligned reads James Simmons
                   ` (175 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

Remote file statahead is not supported, because such a file needs
two RPCs to fetch both the LOOKUP and GETATTR locks; on LOOKUP success
we only know the file FID, so we can't prepare an inode correctly.

Disable it to avoid noisy messages and confusion.

Update sanity.sh test_60g.
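A hypothetical sketch of the decision the patch makes in lmv_intent_getattr_async(): the MDT indices and the helper name below are illustrative, not the real Lustre types.

```c
#define EREMOTE_SK 66	/* stand-in for -EREMOTE */

/* Refuse statahead when the child object lives on a different MDT
 * than the parent: that is the "remote file" case, which would need
 * two RPCs (LOOKUP on the parent MDT, GETATTR on the child MDT). */
static int statahead_allowed(int parent_mdt, int child_mdt)
{
	if (parent_mdt != child_mdt)
		return -EREMOTE_SK;
	return 0;
}
```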

WC-bug-id: https://jira.whamcloud.com/browse/LU-11681
Lustre-commit: 02b5a407081c ("LU-11681 lmv: disable remote file statahead")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33930
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/lmv/lmv_obd.c | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index d323250..26021bb 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -3416,25 +3416,28 @@ static int lmv_intent_getattr_async(struct obd_export *exp,
 	struct md_op_data *op_data = &minfo->mi_data;
 	struct obd_device *obd = exp->exp_obd;
 	struct lmv_obd *lmv = &obd->u.lmv;
-	struct lmv_tgt_desc *tgt = NULL;
+	struct lmv_tgt_desc *ptgt = NULL;
+	struct lmv_tgt_desc *ctgt;
 
 	if (!fid_is_sane(&op_data->op_fid2))
 		return -EINVAL;
 
-	tgt = lmv_find_target(lmv, &op_data->op_fid1);
-	if (IS_ERR(tgt))
-		return PTR_ERR(tgt);
+	ptgt = lmv_locate_tgt(lmv, op_data);
+	if (IS_ERR(ptgt))
+		return PTR_ERR(ptgt);
+
+	ctgt = lmv_find_target(lmv, &op_data->op_fid2);
+	if (IS_ERR(ctgt))
+		return PTR_ERR(ctgt);
 
 	/*
-	 * no special handle for remote dir, which needs to fetch both LOOKUP
-	 * lock on parent, and then UPDATE lock on child MDT, which makes all
-	 * complicated because this is done async. So only LOOKUP lock is
-	 * fetched for remote dir, but considering remote dir is rare case,
-	 * and not supporting it in statahead won't cause any issue, just leave
-	 * it as is.
+	 * remote object needs two RPCs to lookup and getattr, considering the
+	 * complexity don't support statahead for now.
 	 */
+	if (ctgt != ptgt)
+		return -EREMOTE;
 
-	return md_intent_getattr_async(tgt->ltd_exp, minfo);
+	return md_intent_getattr_async(ptgt->ltd_exp, minfo);
 }
 
 static int lmv_revalidate_lock(struct obd_export *exp, struct lookup_intent *it,
-- 
1.8.3.1


* [lustre-devel] [PATCH 448/622] lustre: llite: Fix page count for unaligned reads
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (446 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 447/622] lustre: lmv: disable remote file statahead James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 449/622] lnet: discovery off route state update James Simmons
                   ` (174 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

When a read is unaligned on both the first and last page, the
calculation used to determine the count of pages for readahead
misses the fact that we access both of those pages.

Increase the calculated count by 1 in this case.

This case is covered by the generic readahead tests added
in LU-12645.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12367
Lustre-commit: d4a54de84c05 ("LU-12367 llite: Fix page count for unaligned reads")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35015
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/vvp_io.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/lustre/llite/vvp_io.c b/fs/lustre/llite/vvp_io.c
index 847fb5e..e676e62 100644
--- a/fs/lustre/llite/vvp_io.c
+++ b/fs/lustre/llite/vvp_io.c
@@ -778,6 +778,14 @@ static int vvp_io_read_start(const struct lu_env *env,
 		vio->vui_ra_valid = true;
 		vio->vui_ra_start = cl_index(obj, pos);
 		vio->vui_ra_count = cl_index(obj, tot + PAGE_SIZE - 1);
+		/* If both start and end are unaligned, we read one more page
+		 * than the index math suggests.
+		 */
+		if (pos % PAGE_SIZE != 0 && (pos + tot) % PAGE_SIZE != 0)
+			vio->vui_ra_count++;
+
+		CDEBUG(D_READA, "tot %ld, ra_start %lu, ra_count %lu\n", tot,
+		       vio->vui_ra_start, vio->vui_ra_count);
 	}
 
 	/* BUG: 5972 */
-- 
1.8.3.1


* [lustre-devel] [PATCH 449/622] lnet: discovery off route state update
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (447 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 448/622] lustre: llite: Fix page count for unaligned reads James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 450/622] lustre: llite: prevent mulitple group locks James Simmons
                   ` (173 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

When discovery is off, rely on the discovery ping response only,
rather than on the internal peer database, to determine route state.
With discovery off, the internal peer database is not updated with
all of the gateway's interfaces.
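The cached-aliveness idea can be sketched in isolation (a hypothetical standalone model; the struct and helper names mirror the patch but are simplified):

```c
#include <stdbool.h>

/* With discovery disabled, route state comes only from the cached
 * value recorded when the last discovery ping reply was processed. */
struct route_sk {
	bool lr_alive;		/* cached aliveness */
	int  transitions;	/* counts logged up/down changes */
};

/* Update the cache, noting state changes (the real code logs them). */
static void set_route_aliveness(struct route_sk *r, bool alive)
{
	if (r->lr_alive != alive) {
		r->transitions++;
		r->lr_alive = alive;
	}
}

static bool route_is_alive(const struct route_sk *r,
			   bool discovery_disabled, bool live_probe_result)
{
	if (discovery_disabled)
		return r->lr_alive;	/* trust the cached value */
	return live_probe_result;	/* otherwise use fresh state */
}
```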

WC-bug-id: https://jira.whamcloud.com/browse/LU-12422
Lustre-commit: e35be987da57 ("LU-12422 lnet: discovery off route state update")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/35199
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h  |   1 +
 include/linux/lnet/lib-types.h |   4 +-
 net/lnet/lnet/peer.c           |   8 +++
 net/lnet/lnet/router.c         | 134 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 146 insertions(+), 1 deletion(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index b889af2..f2f5455 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -758,6 +758,7 @@ int lnet_sock_connect(struct socket **sockp, int *fatal,
 void lnet_consolidate_routes_locked(struct lnet_peer *orig_lp,
 				    struct lnet_peer *new_lp);
 void lnet_router_discovery_complete(struct lnet_peer *lp);
+void lnet_router_discovery_ping_reply(struct lnet_peer *lp);
 
 int lnet_monitor_thr_start(void);
 void lnet_monitor_thr_stop(void);
diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index 3f81928..22c2bc6 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -611,7 +611,7 @@ struct lnet_peer {
 	/* number of NIDs on this peer */
 	int			lp_nnis;
 
-	/* # refs from lnet_route_t::lr_gateway */
+	/* # refs from lnet_route::lr_gateway */
 	int			lp_rtr_refcount;
 
 	/*
@@ -822,6 +822,8 @@ struct lnet_route {
 	u32			lr_hops;
 	/* route priority */
 	unsigned int		lr_priority;
+	/* cached route aliveness */
+	bool			lr_alive;
 };
 
 #define LNET_REMOTE_NETS_HASH_DEFAULT	(1U << 7)
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 49da7a1..088bb62 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -2398,6 +2398,14 @@ static void lnet_peer_clear_discovery_error(struct lnet_peer *lp)
 out:
 	lp->lp_state &= ~LNET_PEER_PING_SENT;
 	spin_unlock(&lp->lp_lock);
+
+	lnet_net_lock(LNET_LOCK_EX);
+	/* If this peer is a gateway, call the routing callback to
+	 * handle the ping reply
+	 */
+	if (lp->lp_rtr_refcount > 0)
+		lnet_router_discovery_ping_reply(lp);
+	lnet_net_unlock(LNET_LOCK_EX);
 }
 
 /*
diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index 4ab587d..bc9494d 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -221,6 +221,15 @@ bool lnet_is_route_alive(struct lnet_route *route)
 	struct lnet_peer_net *rlpn;
 	bool route_alive;
 
+	/* if discovery is disabled then rely on the cached aliveness
+	 * information. This is handicapped information which we log when
+	 * we receive the discovery ping response. The most uptodate
+	 * aliveness information can only be obtained when discovery is
+	 * enabled.
+	 */
+	if (lnet_peer_discovery_disabled)
+		return route->lr_alive;
+
 	/* check the gateway's interfaces on the route rnet to make sure
 	 * that the gateway is viable.
 	 */
@@ -279,10 +288,125 @@ bool lnet_is_route_alive(struct lnet_route *route)
 	}
 }
 
+static inline void
+lnet_set_route_aliveness(struct lnet_route *route, bool alive)
+{
+	/* Log when there's a state change */
+	if (route->lr_alive != alive) {
+		CERROR("route to %s through %s has gone from %s to %s\n",
+		       libcfs_net2str(route->lr_net),
+		       libcfs_nid2str(route->lr_gateway->lp_primary_nid),
+		       (route->lr_alive) ? "up" : "down",
+		       alive ? "up" : "down");
+		route->lr_alive = alive;
+	}
+}
+
+void
+lnet_router_discovery_ping_reply(struct lnet_peer *lp)
+{
+	struct lnet_ping_buffer *pbuf = lp->lp_data;
+	struct lnet_remotenet *rnet;
+	struct lnet_peer_net *llpn;
+	struct lnet_route *route;
+	bool net_up = false;
+	unsigned int lp_state;
+	u32 net, net2;
+	int i, j;
+
+	spin_lock(&lp->lp_lock);
+	lp_state = lp->lp_state;
+	spin_unlock(&lp->lp_lock);
+
+	/* only handle replies if discovery is disabled. */
+	if (!lnet_peer_discovery_disabled)
+		return;
+
+	if (lp_state & LNET_PEER_PING_FAILED) {
+		CDEBUG(D_NET,
+		       "Ping failed with %d. Set routes down for gw %s\n",
+		       lp->lp_ping_error, libcfs_nid2str(lp->lp_primary_nid));
+		/* If the ping failed then mark the routes served by this
+		 * peer down
+		 */
+		list_for_each_entry(route, &lp->lp_routes, lr_gwlist)
+			lnet_set_route_aliveness(route, false);
+		return;
+	}
+
+	CDEBUG(D_NET, "Discovery is disabled. Processing reply for gw: %s\n",
+	       libcfs_nid2str(lp->lp_primary_nid));
+
+	/* examine the ping response:
+	 * For each NID in the ping response, extract the net
+	 * if the net exists on our remote net list then
+	 * iterate over the routes on the rnet and if:
+	 *	The route's local net is healthy and
+	 *	The remote net status is UP, then mark the route up
+	 * otherwise mark the route down
+	 */
+	for (i = 1; i < pbuf->pb_info.pi_nnis; i++) {
+		net = LNET_NIDNET(pbuf->pb_info.pi_ni[i].ns_nid);
+		rnet = lnet_find_rnet_locked(net);
+		if (!rnet)
+			continue;
+		list_for_each_entry(route, &rnet->lrn_routes, lr_list) {
+			/* check if this is the route's gateway */
+			if (lp->lp_primary_nid !=
+			    route->lr_gateway->lp_primary_nid)
+				continue;
+
+			/* gateway has the routing feature disabled */
+			if (pbuf->pb_info.pi_features &
+			      LNET_PING_FEAT_RTE_DISABLED) {
+				lnet_set_route_aliveness(route, false);
+				continue;
+			}
+
+			llpn = lnet_peer_get_net_locked(lp, route->lr_lnet);
+			if (!llpn) {
+				lnet_set_route_aliveness(route, false);
+				continue;
+			}
+
+			if (!lnet_is_gateway_net_alive(llpn)) {
+				lnet_set_route_aliveness(route, false);
+				continue;
+			}
+
+			if (avoid_asym_router_failure &&
+			    pbuf->pb_info.pi_ni[i].ns_status !=
+				LNET_NI_STATUS_UP) {
+				net_up = false;
+
+				/* revisit all previous NIDs and check if
+				 * any on the network we're examining is
+				 * up. If at least one is up then we consider
+				 * the route to be alive.
+				 */
+				for (j = 1; j < i; j++) {
+					net2 = LNET_NIDNET(pbuf->pb_info.pi_ni[j].ns_nid);
+					if (net2 == net &&
+					    pbuf->pb_info.pi_ni[j].ns_status ==
+					    LNET_NI_STATUS_UP)
+						net_up = true;
+				}
+				if (!net_up) {
+					lnet_set_route_aliveness(route, false);
+					continue;
+				}
+			}
+
+			lnet_set_route_aliveness(route, true);
+		}
+	}
+}
+
 void
 lnet_router_discovery_complete(struct lnet_peer *lp)
 {
 	struct lnet_peer_ni *lpni = NULL;
+	struct lnet_route *route;
 
 	spin_lock(&lp->lp_lock);
 	lp->lp_state &= ~LNET_PEER_RTR_DISCOVERY;
@@ -306,6 +430,9 @@ bool lnet_is_route_alive(struct lnet_route *route)
 	       libcfs_nid2str(lp->lp_primary_nid), lp->lp_dc_error);
 	while ((lpni = lnet_get_next_peer_ni_locked(lp, NULL, lpni)) != NULL)
 		lpni->lpni_ns_status = LNET_NI_STATUS_DOWN;
+
+	list_for_each_entry(route, &lp->lp_routes, lr_gwlist)
+		lnet_set_route_aliveness(route, false);
 }
 
 static void
@@ -1431,6 +1558,8 @@ bool lnet_router_checker_active(void)
 	    time64_t when)
 {
 	struct lnet_peer_ni *lpni = NULL;
+	struct lnet_route *route;
+	struct lnet_peer *lp;
 	time64_t now = ktime_get_seconds();
 	int cpt;
 
@@ -1499,6 +1628,11 @@ bool lnet_router_checker_active(void)
 	cpt = lpni->lpni_cpt;
 	lnet_net_lock(cpt);
 	lnet_peer_ni_decref_locked(lpni);
+	if (lpni && lpni->lpni_peer_net && lpni->lpni_peer_net->lpn_peer) {
+		lp = lpni->lpni_peer_net->lpn_peer;
+		list_for_each_entry(route, &lp->lp_routes, lr_gwlist)
+			lnet_set_route_aliveness(route, alive);
+	}
 	lnet_net_unlock(cpt);
 
 	return 0;
-- 
1.8.3.1


* [lustre-devel] [PATCH 450/622] lustre: llite: prevent mulitple group locks
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (448 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 449/622] lnet: discovery off route state update James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 451/622] lustre: ptlrpc: make DEBUG_REQ messages consistent James Simmons
                   ` (172 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Alexander Boyko <c17825@cray.com>

The patch adds a mutex for group lock enqueue. It also makes group
lock users on the same client node wait for one another. This
prevents multiple locks on the same resource and fixes a bug where
two locks cover the same dirty pages.

The patch adds test sanity 244b. It creates threads which open the
file, take the group lock, write data, put the group lock, and close.
It recreates the problem where a client holds two or more group locks
for a single file, which leads to wrong behaviour on flush etc.:
(osc_cache_writeback_range()) ASSERTION( hp == 0 && discard == 0 ) failed
One more test covers group lock with an open file and fork. It checks
that the child doesn't unlock the file until the last close.
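The accounting the patch introduces can be sketched as a single-threaded model (hypothetical names; the real code serializes with lli_group_mutex and waits with wait_var_event()):

```c
/* One group-lock gid may be active per inode at a time; a request for
 * a different gid while users remain either waits or, with O_NONBLOCK
 * semantics, fails with -EAGAIN. */
struct lli_sk {
	unsigned long group_gid;	/* active group lock id, 0 if none */
	unsigned long group_users;	/* opens holding that gid */
};

#define EAGAIN_SK 11	/* stand-in for EAGAIN */

static int get_grouplock(struct lli_sk *lli, unsigned long gid, int nonblock)
{
	if (gid != lli->group_gid && lli->group_users != 0)
		return nonblock ? -EAGAIN_SK : -1; /* -1: would block */
	if (lli->group_users == 0)
		lli->group_gid = gid;	/* first user sets the gid */
	lli->group_users++;
	return 0;
}

static void put_grouplock(struct lli_sk *lli)
{
	if (--lli->group_users == 0)
		lli->group_gid = 0;	/* last user wakes any waiters */
}
```

Multiple users of the same gid stack on one lock; a second gid is only admitted once the user count drops to zero, which is exactly what prevents two group locks covering the same dirty pages.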

Cray-bug-id: LUS-7232
WC-bug-id: https://jira.whamcloud.com/browse/LU-9964
Lustre-commit: aba68250a67a ("LU-9964 llite: prevent mulitple group locks")
Signed-off-by: Alexander Boyko <c17825@cray.com>
Reviewed-on: https://review.whamcloud.com/35791
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Andriy Skulysh <c17819@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_request.c    |  3 +-
 fs/lustre/llite/file.c           | 75 ++++++++++++++++++++++++++--------------
 fs/lustre/llite/llite_internal.h |  3 ++
 fs/lustre/llite/llite_lib.c      |  3 ++
 fs/lustre/osc/osc_lock.c         |  2 ++
 5 files changed, 60 insertions(+), 26 deletions(-)

diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index 75492f6..0dd9fea 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -750,7 +750,8 @@ int ldlm_cli_enqueue(struct obd_export *exp, struct ptlrpc_request **reqp,
 	lock->l_conn_export = exp;
 	lock->l_export = NULL;
 	lock->l_blocking_ast = einfo->ei_cb_bl;
-	lock->l_flags |= (*flags & (LDLM_FL_NO_LRU | LDLM_FL_EXCL));
+	lock->l_flags |= (*flags & (LDLM_FL_NO_LRU | LDLM_FL_EXCL |
+				    LDLM_FL_ATOMIC_CB));
 	lock->l_activity = ktime_get_real_seconds();
 
 	/* lock not sent to server yet */
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 6c5b9eb..856aa64 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -2075,15 +2075,30 @@ static int ll_lov_setstripe(struct inode *inode, struct file *file,
 	if (ll_file_nolock(file))
 		return -EOPNOTSUPP;
 
-	spin_lock(&lli->lli_lock);
+retry:
+	if (file->f_flags & O_NONBLOCK) {
+		if (!mutex_trylock(&lli->lli_group_mutex))
+			return -EAGAIN;
+	} else
+		mutex_lock(&lli->lli_group_mutex);
+
 	if (fd->fd_flags & LL_FILE_GROUP_LOCKED) {
 		CWARN("group lock already existed with gid %lu\n",
 		      fd->fd_grouplock.lg_gid);
-		spin_unlock(&lli->lli_lock);
-		return -EINVAL;
+		rc = -EINVAL;
+		goto out;
+	}
+	if (arg != lli->lli_group_gid && lli->lli_group_users != 0) {
+		if (file->f_flags & O_NONBLOCK) {
+			rc = -EAGAIN;
+			goto out;
+		}
+		mutex_unlock(&lli->lli_group_mutex);
+		wait_var_event(&lli->lli_group_users, !lli->lli_group_users);
+		rc = 0;
+		goto retry;
 	}
 	LASSERT(!fd->fd_grouplock.lg_lock);
-	spin_unlock(&lli->lli_lock);
 
 	/**
 	 * XXX: group lock needs to protect all OST objects while PFL
@@ -2102,8 +2117,10 @@ static int ll_lov_setstripe(struct inode *inode, struct file *file,
 		u16 refcheck;
 
 		env = cl_env_get(&refcheck);
-		if (IS_ERR(env))
-			return PTR_ERR(env);
+		if (IS_ERR(env)) {
+			rc = PTR_ERR(env);
+			goto out;
+		}
 
 		rc = cl_object_layout_get(env, obj, &cl);
 		if (!rc && cl.cl_is_composite)
@@ -2112,28 +2129,26 @@ static int ll_lov_setstripe(struct inode *inode, struct file *file,
 
 		cl_env_put(env, &refcheck);
 		if (rc)
-			return rc;
+			goto out;
 	}
 
 	rc = cl_get_grouplock(ll_i2info(inode)->lli_clob,
 			      arg, (file->f_flags & O_NONBLOCK), &grouplock);
-	if (rc)
-		return rc;
 
-	spin_lock(&lli->lli_lock);
-	if (fd->fd_flags & LL_FILE_GROUP_LOCKED) {
-		spin_unlock(&lli->lli_lock);
-		CERROR("another thread just won the race\n");
-		cl_put_grouplock(&grouplock);
-		return -EINVAL;
-	}
+	if (rc)
+		goto out;
 
 	fd->fd_flags |= LL_FILE_GROUP_LOCKED;
 	fd->fd_grouplock = grouplock;
-	spin_unlock(&lli->lli_lock);
+	if (lli->lli_group_users == 0)
+		lli->lli_group_gid = grouplock.lg_gid;
+	lli->lli_group_users++;
 
 	CDEBUG(D_INFO, "group lock %lu obtained\n", arg);
-	return 0;
+out:
+	mutex_unlock(&lli->lli_group_mutex);
+
+	return rc;
 }
 
 static int ll_put_grouplock(struct inode *inode, struct file *file,
@@ -2142,30 +2157,40 @@ static int ll_put_grouplock(struct inode *inode, struct file *file,
 	struct ll_inode_info *lli = ll_i2info(inode);
 	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
 	struct ll_grouplock grouplock;
+	int rc;
 
-	spin_lock(&lli->lli_lock);
+	mutex_lock(&lli->lli_group_mutex);
 	if (!(fd->fd_flags & LL_FILE_GROUP_LOCKED)) {
-		spin_unlock(&lli->lli_lock);
 		CWARN("no group lock held\n");
-		return -EINVAL;
+		rc = -EINVAL;
+		goto out;
 	}
 	LASSERT(fd->fd_grouplock.lg_lock);
 
 	if (fd->fd_grouplock.lg_gid != arg) {
 		CWARN("group lock %lu doesn't match current id %lu\n",
 		      arg, fd->fd_grouplock.lg_gid);
-		spin_unlock(&lli->lli_lock);
-		return -EINVAL;
+		rc = -EINVAL;
+		goto out;
 	}
 
 	grouplock = fd->fd_grouplock;
 	memset(&fd->fd_grouplock, 0, sizeof(fd->fd_grouplock));
 	fd->fd_flags &= ~LL_FILE_GROUP_LOCKED;
-	spin_unlock(&lli->lli_lock);
 
 	cl_put_grouplock(&grouplock);
+
+	lli->lli_group_users--;
+	if (lli->lli_group_users == 0) {
+		lli->lli_group_gid = 0;
+		wake_up_var(&lli->lli_group_users);
+	}
 	CDEBUG(D_INFO, "group lock %lu released\n", arg);
-	return 0;
+	rc = 0;
+out:
+	mutex_unlock(&lli->lli_group_mutex);
+
+	return rc;
 }
 
 /**
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index 49c0c78..232fb0a 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -210,6 +210,9 @@ struct ll_inode_info {
 			struct mutex		 lli_pcc_lock;
 			enum lu_pcc_state_flags	 lli_pcc_state;
 			struct pcc_inode	*lli_pcc_inode;
+			struct mutex			lli_group_mutex;
+			u64				lli_group_users;
+			unsigned long			lli_group_gid;
 		};
 	};
 
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 86be562..8946dc6 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -983,6 +983,9 @@ void ll_lli_init(struct ll_inode_info *lli)
 		mutex_init(&lli->lli_pcc_lock);
 		lli->lli_pcc_state = PCC_STATE_FL_NONE;
 		lli->lli_pcc_inode = NULL;
+		mutex_init(&lli->lli_group_mutex);
+		lli->lli_group_users = 0;
+		lli->lli_group_gid = 0;
 	}
 	mutex_init(&lli->lli_layout_mutex);
 	memset(lli->lli_jobid, 0, sizeof(lli->lli_jobid));
diff --git a/fs/lustre/osc/osc_lock.c b/fs/lustre/osc/osc_lock.c
index 33fdc7e7..c748e58 100644
--- a/fs/lustre/osc/osc_lock.c
+++ b/fs/lustre/osc/osc_lock.c
@@ -1182,6 +1182,8 @@ int osc_lock_init(const struct lu_env *env,
 
 	oscl->ols_flags = osc_enq2ldlm_flags(enqflags);
 	oscl->ols_speculative = !!(enqflags & CEF_SPECULATIVE);
+	if (lock->cll_descr.cld_mode == CLM_GROUP)
+		oscl->ols_flags |= LDLM_FL_ATOMIC_CB;
 
 	if (oscl->ols_flags & LDLM_FL_HAS_INTENT) {
 		oscl->ols_flags |= LDLM_FL_BLOCK_GRANTED;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 451/622] lustre: ptlrpc: make DEBUG_REQ messages consistent
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (449 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 450/622] lustre: llite: prevent mulitple group locks James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 452/622] lustre: ptlrpc: check buffer length in lustre_msg_string() James Simmons
                   ` (171 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Remove linefeed from DEBUG_REQ() messages, since this results in
debug logs that are split across multiple lines and do not start
with the proper timestamp or other standard fields.  This makes
post-processing difficult.

Some error and debug messages are checked for explicitly in tests.
Add a comment by those lines in the code to alert the reader that
changes to those messages may cause test failures, and make the
tests more forgiving in case of minor changes to the formatting.

Fix several tests to check for the actual error message.  Some tests
have been broken for so long (since 1.5/1.8) that there is no point in
also checking for the old messages, so use only the new messages.

The EINPROGRESS messages should not use D_ERROR, since they can be
hit under normal usage (e.g. while LFSCK is running), so use D_WARNING
at most. Don't print every one to the console, as that would be too
verbose.

Fix code style of affected lines.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12368
Lustre-commit: c0fa0ba4a8ef ("LU-12368 ptlrpc: make DEBUG_REQ messages consistent")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35311
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_net.h |  2 +-
 fs/lustre/ldlm/ldlm_request.c  |  4 ++--
 fs/lustre/llite/llite_lib.c    |  1 +
 fs/lustre/mdc/mdc_locks.c      |  5 ++--
 fs/lustre/mdc/mdc_request.c    | 11 +++++----
 fs/lustre/osc/osc_request.c    | 21 ++++++++++-------
 fs/lustre/ptlrpc/client.c      | 53 ++++++++++++++++++++++--------------------
 fs/lustre/ptlrpc/events.c      |  2 +-
 fs/lustre/ptlrpc/import.c      | 15 +++++++++---
 fs/lustre/ptlrpc/layout.c      |  4 ++--
 fs/lustre/ptlrpc/niobuf.c      |  6 ++---
 fs/lustre/ptlrpc/ptlrpcd.c     |  2 +-
 fs/lustre/ptlrpc/recover.c     |  2 +-
 fs/lustre/ptlrpc/sec.c         |  6 ++++-
 fs/lustre/ptlrpc/service.c     | 18 ++++++++------
 net/lnet/libcfs/module.c       |  1 +
 16 files changed, 90 insertions(+), 63 deletions(-)

diff --git a/fs/lustre/include/lustre_net.h b/fs/lustre/include/lustre_net.h
index caf766d..f16c6d3 100644
--- a/fs/lustre/include/lustre_net.h
+++ b/fs/lustre/include/lustre_net.h
@@ -2202,7 +2202,7 @@ static inline int ptlrpc_status_ntoh(int n)
 			atomic_dec(&req->rq_import->imp_unregistering);
 	}
 
-	DEBUG_REQ(D_INFO, req, "move req \"%s\" -> \"%s\"",
+	DEBUG_REQ(D_INFO, req, "move request phase from %s to %s",
 		  ptlrpc_rqphase2str(req), ptlrpc_phase2str(new_phase));
 
 	req->rq_phase = new_phase;
diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index 0dd9fea..20bdba4 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -777,7 +777,7 @@ int ldlm_cli_enqueue(struct obd_export *exp, struct ptlrpc_request **reqp,
 	}
 
 	if (*flags & LDLM_FL_NDELAY) {
-		DEBUG_REQ(D_DLMTRACE, req, "enque lock with no delay\n");
+		DEBUG_REQ(D_DLMTRACE, req, "enqueue lock with no delay");
 		req->rq_no_resend = req->rq_no_delay = 1;
 		/*
 		 * probably set a shorter timeout value and handle ETIMEDOUT
@@ -1248,7 +1248,7 @@ int ldlm_cli_update_pool(struct ptlrpc_request *req)
 	if (lustre_msg_get_slv(req->rq_repmsg) == 0 ||
 	    lustre_msg_get_limit(req->rq_repmsg) == 0) {
 		DEBUG_REQ(D_HA, req,
-			  "Zero SLV or Limit found (SLV: %llu, Limit: %u)",
+			  "Zero SLV or limit found (SLV=%llu, limit=%u)",
 			  lustre_msg_get_slv(req->rq_repmsg),
 			  lustre_msg_get_limit(req->rq_repmsg));
 		return 0;
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 8946dc6..217268e 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -2731,6 +2731,7 @@ void ll_dirty_page_discard_warn(struct page *page, int ioret)
 			path = dentry_path_raw(dentry, buf, PAGE_SIZE);
 	}
 
+	/* The below message is checked in recovery-small.sh test_24b */
 	CDEBUG(D_WARNING,
 	       "%s: dirty page discard: %s/fid: " DFID "/%s may get corrupted (rc %d)\n",
 	       ll_i2sbi(inode)->ll_fsname,
diff --git a/fs/lustre/mdc/mdc_locks.c b/fs/lustre/mdc/mdc_locks.c
index 5885bbd..b91c162 100644
--- a/fs/lustre/mdc/mdc_locks.c
+++ b/fs/lustre/mdc/mdc_locks.c
@@ -198,7 +198,8 @@ static inline void mdc_clear_replay_flag(struct ptlrpc_request *req, int rc)
 		spin_unlock(&req->rq_lock);
 	}
 	if (rc && req->rq_transno != 0) {
-		DEBUG_REQ(D_ERROR, req, "transno returned on error rc %d", rc);
+		DEBUG_REQ(D_ERROR, req, "transno returned on error: rc = %d",
+			  rc);
 		LBUG();
 	}
 }
@@ -710,7 +711,7 @@ static int mdc_finish_enqueue(struct obd_export *exp,
 	    (!it_disposition(it, DISP_OPEN_OPEN) || it->it_status != 0))
 		mdc_clear_replay_flag(req, it->it_status);
 
-	DEBUG_REQ(D_RPCTRACE, req, "op: %x disposition: %x, status: %d",
+	DEBUG_REQ(D_RPCTRACE, req, "op=%x disposition=%x, status=%d",
 		  it->it_op, it->it_disposition, it->it_status);
 
 	/* We know what to expect, so we do any byte flipping required here */
diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index 162ace7..34cf177 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -450,6 +450,7 @@ static int mdc_getxattr(struct obd_export *exp, const struct lu_fid *fid,
 	LASSERT(obd_md_valid == OBD_MD_FLXATTR ||
 		obd_md_valid == OBD_MD_FLXATTRLS);
 
+	/* The below message is checked in sanity-selinux.sh test_20d */
 	CDEBUG(D_INFO, "%s: get xattr '%s' for " DFID "\n",
 	       exp->exp_obd->obd_name, name, PFID(fid));
 	rc = mdc_xattr_common(exp, &RQF_MDS_GETXATTR, fid, MDS_GETXATTR,
@@ -695,7 +696,7 @@ void mdc_replay_open(struct ptlrpc_request *req)
 
 	if (!mod) {
 		DEBUG_REQ(D_ERROR, req,
-			  "Can't properly replay without open data.");
+			  "cannot properly replay without open data");
 		return;
 	}
 
@@ -794,7 +795,7 @@ int mdc_set_open_replay_data(struct obd_export *exp,
 		mod = obd_mod_alloc();
 		if (!mod) {
 			DEBUG_REQ(D_ERROR, open_req,
-				  "Can't allocate md_open_data");
+				  "cannot allocate md_open_data");
 			return 0;
 		}
 
@@ -848,7 +849,7 @@ static void mdc_free_open(struct md_open_data *mod)
 	 * The worst thing is eviction if the client gets open lock
 	 */
 	DEBUG_REQ(D_RPCTRACE, mod->mod_open_req,
-		  "free open request rq_replay = %d\n",
+		  "free open request, rq_replay=%d\n",
 		   mod->mod_open_req->rq_replay);
 
 	ptlrpc_request_committed(mod->mod_open_req, committed);
@@ -993,7 +994,7 @@ static int mdc_close(struct obd_export *exp, struct md_op_data *op_data,
 	mdc_put_mod_rpc_slot(req, NULL);
 
 	if (!req->rq_repmsg) {
-		CDEBUG(D_RPCTRACE, "request failed to send: %p, %d\n", req,
+		CDEBUG(D_RPCTRACE, "request %p failed to send: rc = %d\n", req,
 		       req->rq_status);
 		if (rc == 0)
 			rc = req->rq_status ?: -EIO;
@@ -1003,7 +1004,7 @@ static int mdc_close(struct obd_export *exp, struct md_op_data *op_data,
 		rc = lustre_msg_get_status(req->rq_repmsg);
 		if (lustre_msg_get_type(req->rq_repmsg) == PTL_RPC_MSG_ERR) {
 			DEBUG_REQ(D_ERROR, req,
-				  "type == PTL_RPC_MSG_ERR, err = %d", rc);
+				  "type = PTL_RPC_MSG_ERR: rc = %d", rc);
 			if (rc > 0)
 				rc = -rc;
 		}
diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index 6b066e5..75e0823 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -1735,14 +1735,14 @@ static int osc_brw_fini_request(struct ptlrpc_request *req, int rc)
 	u32 client_cksum = 0;
 
 	if (rc < 0 && rc != -EDQUOT) {
-		DEBUG_REQ(D_INFO, req, "Failed request with rc = %d\n", rc);
+		DEBUG_REQ(D_INFO, req, "Failed request: rc = %d", rc);
 		return rc;
 	}
 
 	LASSERTF(req->rq_repmsg, "rc = %d\n", rc);
 	body = req_capsule_server_get(&req->rq_pill, &RMF_OST_BODY);
 	if (!body) {
-		DEBUG_REQ(D_INFO, req, "Can't unpack body\n");
+		DEBUG_REQ(D_INFO, req, "cannot unpack body");
 		return -EPROTO;
 	}
 
@@ -1770,7 +1770,8 @@ static int osc_brw_fini_request(struct ptlrpc_request *req, int rc)
 
 	if (lustre_msg_get_opc(req->rq_reqmsg) == OST_WRITE) {
 		if (rc > 0) {
-			CERROR("Unexpected +ve rc %d\n", rc);
+			CERROR("%s: unexpected positive size %d\n",
+			       obd_name, rc);
 			return -EPROTO;
 		}
 
@@ -1805,13 +1806,13 @@ static int osc_brw_fini_request(struct ptlrpc_request *req, int rc)
 	}
 
 	if (rc > aa->aa_requested_nob) {
-		CERROR("Unexpected rc %d (%d requested)\n", rc,
-		       aa->aa_requested_nob);
+		CERROR("%s: unexpected size %d, requested %d\n", obd_name,
+		       rc, aa->aa_requested_nob);
 		return -EPROTO;
 	}
 
 	if (req->rq_bulk && rc != req->rq_bulk->bd_nob_transferred) {
-		CERROR("Unexpected rc %d (%d transferred)\n",
+		CERROR("%s: unexpected size %d, transferred %d\n", obd_name,
 		       rc, req->rq_bulk->bd_nob_transferred);
 		return -EPROTO;
 	}
@@ -1916,8 +1917,9 @@ static int osc_brw_fini_request(struct ptlrpc_request *req, int rc)
 
 		cksum_missed++;
 		if ((cksum_missed & (-cksum_missed)) == cksum_missed)
-			CERROR("Checksum %u requested from %s but not sent\n",
-			       cksum_missed, libcfs_nid2str(peer->nid));
+			CERROR("%s: checksum %u requested from %s but not sent\n",
+			       obd_name, cksum_missed,
+			       libcfs_nid2str(peer->nid));
 	} else {
 		rc = 0;
 	}
@@ -1936,6 +1938,7 @@ static int osc_brw_redo_request(struct ptlrpc_request *request,
 	struct osc_brw_async_args *new_aa;
 	struct osc_async_page *oap;
 
+	/* The below message is checked in replay-ost-single.sh test_8ae */
 	DEBUG_REQ(rc == -EINPROGRESS ? D_RPCTRACE : D_ERROR, request,
 		  "redo for recoverable error %d", rc);
 
@@ -2346,7 +2349,7 @@ int osc_build_rpc(const struct lu_env *env, struct client_obd *cli,
 	}
 	spin_unlock(&cli->cl_loi_list_lock);
 
-	DEBUG_REQ(D_INODE, req, "%d pages, aa %p. now %ur/%dw in flight",
+	DEBUG_REQ(D_INODE, req, "%d pages, aa %p, now %ur/%dw in flight",
 		  page_count, aa, cli->cl_r_in_flight,
 		  cli->cl_w_in_flight);
 	OBD_FAIL_TIMEOUT(OBD_FAIL_OSC_DELAY_IO, cfs_fail_val);
diff --git a/fs/lustre/ptlrpc/client.c b/fs/lustre/ptlrpc/client.c
index c750a4e..d2e5e04 100644
--- a/fs/lustre/ptlrpc/client.c
+++ b/fs/lustre/ptlrpc/client.c
@@ -424,14 +424,16 @@ static int unpack_reply(struct ptlrpc_request *req)
 	if (SPTLRPC_FLVR_POLICY(req->rq_flvr.sf_rpc) != SPTLRPC_POLICY_NULL) {
 		rc = ptlrpc_unpack_rep_msg(req, req->rq_replen);
 		if (rc) {
-			DEBUG_REQ(D_ERROR, req, "unpack_rep failed: %d", rc);
+			DEBUG_REQ(D_ERROR, req, "unpack_rep failed: rc = %d",
+				  rc);
 			return -EPROTO;
 		}
 	}
 
 	rc = lustre_unpack_rep_ptlrpc_body(req, MSG_PTLRPC_BODY_OFF);
 	if (rc) {
-		DEBUG_REQ(D_ERROR, req, "unpack ptlrpc body failed: %d", rc);
+		DEBUG_REQ(D_ERROR, req, "unpack ptlrpc body failed: rc = %d",
+			  rc);
 		return -EPROTO;
 	}
 	return 0;
@@ -491,6 +493,8 @@ static int ptlrpc_at_recv_early_reply(struct ptlrpc_request *req)
 	req->rq_deadline = req->rq_sent + req->rq_timeout +
 			   ptlrpc_at_get_net_latency(req);
 
+	/* The below message is checked in replay-single.sh test_65{a,b} */
+	/* The below message is checked in sanity-{gss,krb5} test_8 */
 	DEBUG_REQ(D_ADAPTTO, req,
 		  "Early reply #%d, new deadline in %lds (%lds)",
 		  req->rq_early_count,
@@ -1163,18 +1167,18 @@ static int ptlrpc_import_delay_req(struct obd_import *imp,
 	if (req->rq_ctx_init || req->rq_ctx_fini) {
 		/* always allow ctx init/fini rpc go through */
 	} else if (imp->imp_state == LUSTRE_IMP_NEW) {
-		DEBUG_REQ(D_ERROR, req, "Uninitialized import.");
+		DEBUG_REQ(D_ERROR, req, "Uninitialized import");
 		*status = -EIO;
 	} else if (imp->imp_state == LUSTRE_IMP_CLOSED) {
 		/* pings may safely race with umount */
 		DEBUG_REQ(lustre_msg_get_opc(req->rq_reqmsg) == OBD_PING ?
-			  D_HA : D_ERROR, req, "IMP_CLOSED ");
+			  D_HA : D_ERROR, req, "IMP_CLOSED");
 		*status = -EIO;
 	} else if (ptlrpc_send_limit_expired(req)) {
 		/* probably doesn't need to be a D_ERROR after initial
 		 * testing
 		 */
-		DEBUG_REQ(D_HA, req, "send limit expired ");
+		DEBUG_REQ(D_HA, req, "send limit expired");
 		*status = -ETIMEDOUT;
 	} else if (req->rq_send_state == LUSTRE_IMP_CONNECTING &&
 		   imp->imp_state == LUSTRE_IMP_CONNECTING) {
@@ -1204,7 +1208,7 @@ static int ptlrpc_import_delay_req(struct obd_import *imp,
 			   imp->imp_state == LUSTRE_IMP_REPLAY_LOCKS ||
 			   imp->imp_state == LUSTRE_IMP_REPLAY_WAIT ||
 			   imp->imp_state == LUSTRE_IMP_RECOVER)) {
-			DEBUG_REQ(D_HA, req, "allow during recovery.\n");
+			DEBUG_REQ(D_HA, req, "allow during recovery");
 		} else {
 			delay = 1;
 		}
@@ -1258,9 +1262,9 @@ static bool ptlrpc_console_allow(struct ptlrpc_request *req)
  */
 static int ptlrpc_check_status(struct ptlrpc_request *req)
 {
-	int err;
+	int rc;
 
-	err = lustre_msg_get_status(req->rq_repmsg);
+	rc = lustre_msg_get_status(req->rq_repmsg);
 	if (lustre_msg_get_type(req->rq_repmsg) == PTL_RPC_MSG_ERR) {
 		struct obd_import *imp = req->rq_import;
 		lnet_nid_t nid = imp->imp_connection->c_peer.nid;
@@ -1268,22 +1272,19 @@ static int ptlrpc_check_status(struct ptlrpc_request *req)
 
 		/* -EAGAIN is normal when using POSIX flocks */
 		if (ptlrpc_console_allow(req) &&
-		    !(opc == LDLM_ENQUEUE && err == -EAGAIN))
+		    !(opc == LDLM_ENQUEUE && rc == -EAGAIN))
 			LCONSOLE_ERROR_MSG(0x011,
 					   "%s: operation %s to node %s failed: rc = %d\n",
 					   imp->imp_obd->obd_name,
 					   ll_opcode2str(opc),
-					   libcfs_nid2str(nid), err);
-		return err < 0 ? err : -EINVAL;
+					   libcfs_nid2str(nid), rc);
+		return rc < 0 ? rc : -EINVAL;
 	}
 
-	if (err < 0)
-		DEBUG_REQ(D_INFO, req, "status is %d", err);
-	else if (err > 0)
-		/* XXX: translate this error from net to host */
-		DEBUG_REQ(D_INFO, req, "status is %d", err);
+	if (rc)
+		DEBUG_REQ(D_INFO, req, "check status: rc = %d", rc);
 
-	return err;
+	return rc;
 }
 
 /**
@@ -1347,7 +1348,7 @@ static int after_reply(struct ptlrpc_request *req)
 	if (req->rq_reply_truncated) {
 		if (ptlrpc_no_resend(req)) {
 			DEBUG_REQ(D_ERROR, req,
-				  "reply buffer overflow, expected: %d, actual size: %d",
+				  "reply buffer overflow, expected=%d, actual size=%d",
 				  req->rq_nob_received, req->rq_repbuf_len);
 			return -EOVERFLOW;
 		}
@@ -1375,7 +1376,7 @@ static int after_reply(struct ptlrpc_request *req)
 	 */
 	rc = sptlrpc_cli_unwrap_reply(req);
 	if (rc) {
-		DEBUG_REQ(D_ERROR, req, "unwrap reply failed (%d):", rc);
+		DEBUG_REQ(D_ERROR, req, "unwrap reply failed: rc = %d", rc);
 		return rc;
 	}
 
@@ -1392,7 +1393,8 @@ static int after_reply(struct ptlrpc_request *req)
 	    ptlrpc_no_resend(req) == 0 && !req->rq_no_retry_einprogress) {
 		time64_t now = ktime_get_real_seconds();
 
-		DEBUG_REQ(D_RPCTRACE, req, "Resending request on EINPROGRESS");
+		DEBUG_REQ((req->rq_nr_resend % 8 == 1 ? D_WARNING : 0) |
+			  D_RPCTRACE, req, "resending request on EINPROGRESS");
 		spin_lock(&req->rq_lock);
 		req->rq_resend = 1;
 		spin_unlock(&req->rq_lock);
@@ -1634,7 +1636,8 @@ static int ptlrpc_send_new_req(struct ptlrpc_request *req)
 		return rc;
 	}
 	if (rc) {
-		DEBUG_REQ(D_HA, req, "send failed (%d); expect timeout", rc);
+		DEBUG_REQ(D_HA, req, "send failed, expect timeout: rc = %d",
+			  rc);
 		spin_lock(&req->rq_lock);
 		req->rq_net_err = 1;
 		spin_unlock(&req->rq_lock);
@@ -2875,7 +2878,7 @@ static int ptlrpc_replay_interpret(const struct lu_env *env,
 	if (!ptlrpc_client_replied(req) ||
 	    (req->rq_bulk &&
 	     lustre_msg_get_status(req->rq_repmsg) == -ETIMEDOUT)) {
-		DEBUG_REQ(D_ERROR, req, "request replay timed out.\n");
+		DEBUG_REQ(D_ERROR, req, "request replay timed out");
 		rc = -ETIMEDOUT;
 		goto out;
 	}
@@ -2890,7 +2893,7 @@ static int ptlrpc_replay_interpret(const struct lu_env *env,
 	/** VBR: check version failure */
 	if (lustre_msg_get_status(req->rq_repmsg) == -EOVERFLOW) {
 		/** replay was failed due to version mismatch */
-		DEBUG_REQ(D_WARNING, req, "Version mismatch during replay\n");
+		DEBUG_REQ(D_WARNING, req, "Version mismatch during replay");
 		spin_lock(&imp->imp_lock);
 		imp->imp_vbr_failed = 1;
 		spin_unlock(&imp->imp_lock);
@@ -2913,14 +2916,14 @@ static int ptlrpc_replay_interpret(const struct lu_env *env,
 	/* transaction number shouldn't be bigger than the latest replayed */
 	if (req->rq_transno > lustre_msg_get_transno(req->rq_reqmsg)) {
 		DEBUG_REQ(D_ERROR, req,
-			  "Reported transno %llu is bigger than the replayed one: %llu",
+			  "Reported transno=%llu is bigger than replayed=%llu",
 			  req->rq_transno,
 			  lustre_msg_get_transno(req->rq_reqmsg));
 		rc = -EINVAL;
 		goto out;
 	}
 
-	DEBUG_REQ(D_HA, req, "got rep");
+	DEBUG_REQ(D_HA, req, "got reply");
 
 	/* let the callback do fixups, possibly including in the request */
 	if (req->rq_replay_cb)
diff --git a/fs/lustre/ptlrpc/events.c b/fs/lustre/ptlrpc/events.c
index 87c0ab7..e6a49db 100644
--- a/fs/lustre/ptlrpc/events.c
+++ b/fs/lustre/ptlrpc/events.c
@@ -132,7 +132,7 @@ void reply_in_callback(struct lnet_event *ev)
 	    ((lustre_msghdr_get_flags(req->rq_reqmsg) & MSGHDR_AT_SUPPORT))) {
 		/* Early reply */
 		DEBUG_REQ(D_ADAPTTO, req,
-			  "Early reply received: mlen=%u offset=%d replen=%d replied=%d unlinked=%d",
+			  "Early reply received, mlen=%u offset=%d replen=%d replied=%d unlinked=%d",
 			  ev->mlength, ev->offset,
 			  req->rq_replen, req->rq_replied, ev->unlinked);
 
diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index a6d0b32..ff1b810 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -567,6 +567,7 @@ static int import_select_connection(struct obd_import *imp)
 		imp->imp_conn_current = imp_conn;
 	}
 
+	/* The below message is checked in conf-sanity.sh test_35[ab] */
 	CDEBUG(D_HA, "%s: import %p using connection %s/%s\n",
 	       imp->imp_obd->obd_name, imp, imp_conn->oic_uuid.uuid,
 	       libcfs_nid2str(imp_conn->oic_conn->c_peer.nid));
@@ -1221,10 +1222,18 @@ static int ptlrpc_connect_interpret(const struct lu_env *env,
 
 	if (lustre_msg_get_last_committed(request->rq_repmsg) > 0 &&
 	    lustre_msg_get_last_committed(request->rq_repmsg) <
-	    aa->pcaa_peer_committed)
-		CERROR("%s went back in time (transno %lld was previously committed, server now claims %lld)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646\n",
+	    aa->pcaa_peer_committed) {
+		static bool printed;
+
+		/* The below message is checked in recovery-small.sh test_54 */
+		CERROR("%s: went back in time (transno %lld was previously committed, server now claims %lld)!\n",
 		       obd2cli_tgt(imp->imp_obd), aa->pcaa_peer_committed,
 		       lustre_msg_get_last_committed(request->rq_repmsg));
+		if (!printed) {
+			CERROR("For further information, see http://doc.lustre.org/lustre_manual.xhtml#went_back_in_time\n");
+			printed = true;
+		}
+	}
 
 finish:
 	ptlrpc_prepare_replay(imp);
@@ -1668,7 +1677,7 @@ static int ptlrpc_disconnect_idle_interpret(const struct lu_env *env,
 	struct obd_import *imp = req->rq_import;
 	int connect = 0;
 
-	DEBUG_REQ(D_HA, req, "inflight=%d, refcount=%d: rc = %d ",
+	DEBUG_REQ(D_HA, req, "inflight=%d, refcount=%d: rc = %d",
 		  atomic_read(&imp->imp_inflight),
 		  atomic_read(&imp->imp_refcount), rc);
 
diff --git a/fs/lustre/ptlrpc/layout.c b/fs/lustre/ptlrpc/layout.c
index fb60558..67a7cd5 100644
--- a/fs/lustre/ptlrpc/layout.c
+++ b/fs/lustre/ptlrpc/layout.c
@@ -1825,7 +1825,7 @@ int req_capsule_server_pack(struct req_capsule *pill)
 			       pill->rc_area[RCL_SERVER], NULL);
 	if (rc != 0) {
 		DEBUG_REQ(D_ERROR, pill->rc_req,
-			  "Cannot pack %d fields in format `%s': ",
+			  "Cannot pack %d fields in format '%s'",
 			  count, fmt->rf_name);
 	}
 	return rc;
@@ -1988,7 +1988,7 @@ static void *__req_capsule_get(struct req_capsule *pill,
 
 	if (!value) {
 		DEBUG_REQ(D_ERROR, pill->rc_req,
-			  "Wrong buffer for field `%s' (%u of %u) in format `%s': %u vs. %u (%s)\n",
+			  "Wrong buffer for field '%s' (%u of %u) in format '%s', %u vs. %u (%s)",
 			  field->rmf_name, offset, lustre_msg_bufcount(msg),
 			  fmt->rf_name, lustre_msg_buflen(msg, offset), len,
 			  rcl_names[loc]);
diff --git a/fs/lustre/ptlrpc/niobuf.c b/fs/lustre/ptlrpc/niobuf.c
index 9d9e94c..12a9a5e 100644
--- a/fs/lustre/ptlrpc/niobuf.c
+++ b/fs/lustre/ptlrpc/niobuf.c
@@ -540,7 +540,7 @@ int ptl_send_rpc(struct ptlrpc_request *request, int noreply)
 
 		lustre_msg_set_last_xid(request->rq_reqmsg, min_xid);
 		DEBUG_REQ(D_RPCTRACE, request,
-			  "Allocating new xid for resend on EINPROGRESS");
+			  "Allocating new XID for resend on EINPROGRESS");
 	}
 
 	if (request->rq_bulk) {
@@ -551,7 +551,7 @@ int ptl_send_rpc(struct ptlrpc_request *request, int noreply)
 	if (list_empty(&request->rq_unreplied_list) ||
 	    request->rq_xid <= imp->imp_known_replied_xid) {
 		DEBUG_REQ(D_ERROR, request,
-			  "xid: %llu, replied: %llu, list_empty:%d\n",
+			  "xid=%llu, replied=%llu, list_empty=%d",
 			  request->rq_xid, imp->imp_known_replied_xid,
 			  list_empty(&request->rq_unreplied_list));
 		LBUG();
@@ -689,7 +689,7 @@ int ptl_send_rpc(struct ptlrpc_request *request, int noreply)
 
 	ptlrpc_pinger_sending_on_import(imp);
 
-	DEBUG_REQ(D_INFO, request, "send flg=%x",
+	DEBUG_REQ(D_INFO, request, "send flags=%x",
 		  lustre_msg_get_flags(request->rq_reqmsg));
 	rc = ptl_send_buf(&request->rq_req_md_h,
 			  request->rq_reqbuf, request->rq_reqdata_len,
diff --git a/fs/lustre/ptlrpc/ptlrpcd.c b/fs/lustre/ptlrpc/ptlrpcd.c
index bcf1e46..1a1fa05 100644
--- a/fs/lustre/ptlrpc/ptlrpcd.c
+++ b/fs/lustre/ptlrpc/ptlrpcd.c
@@ -256,7 +256,7 @@ void ptlrpcd_add_req(struct ptlrpc_request *req)
 
 	pc = ptlrpcd_select_pc(req);
 
-	DEBUG_REQ(D_INFO, req, "add req [%p] to pc [%s:%d]",
+	DEBUG_REQ(D_INFO, req, "add req [%p] to pc [%s+%d]",
 		  req, pc->pc_name, pc->pc_index);
 
 	ptlrpc_set_add_new_req(pc, req);
diff --git a/fs/lustre/ptlrpc/recover.c b/fs/lustre/ptlrpc/recover.c
index e6e6661..09ea3b3 100644
--- a/fs/lustre/ptlrpc/recover.c
+++ b/fs/lustre/ptlrpc/recover.c
@@ -143,7 +143,7 @@ int ptlrpc_replay_next(struct obd_import *imp, int *inflight)
 	 * unreplied list.
 	 */
 	if (req && list_empty(&req->rq_unreplied_list)) {
-		DEBUG_REQ(D_HA, req, "resend_replay: %d, last_transno: %llu\n",
+		DEBUG_REQ(D_HA, req, "resend_replay=%d, last_transno=%llu",
 			  imp->imp_resend_replay, last_transno);
 		ptlrpc_add_unreplied(req);
 		imp->imp_known_replied_xid = ptlrpc_known_replied_xid(imp);
diff --git a/fs/lustre/ptlrpc/sec.c b/fs/lustre/ptlrpc/sec.c
index d82809f..15667454 100644
--- a/fs/lustre/ptlrpc/sec.c
+++ b/fs/lustre/ptlrpc/sec.c
@@ -1151,7 +1151,7 @@ int sptlrpc_cli_unwrap_early_reply(struct ptlrpc_request *req,
 	rc = do_cli_unwrap_reply(early_req);
 	if (rc) {
 		DEBUG_REQ(D_ADAPTTO, early_req,
-			  "error %d unwrap early reply", rc);
+			  "unwrap early reply: rc = %d", rc);
 		goto err_ctx;
 	}
 
@@ -2037,18 +2037,21 @@ static int sptlrpc_svc_check_from(struct ptlrpc_request *req, int svc_rc)
 	switch (req->rq_sp_from) {
 	case LUSTRE_SP_CLI:
 		if (req->rq_auth_usr_mdt || req->rq_auth_usr_ost) {
+			/* The below message is checked in sanity-sec test_33 */
 			DEBUG_REQ(D_ERROR, req, "faked source CLI");
 			svc_rc = SECSVC_DROP;
 		}
 		break;
 	case LUSTRE_SP_MDT:
 		if (!req->rq_auth_usr_mdt) {
+			/* The below message is checked in sanity-sec test_33 */
 			DEBUG_REQ(D_ERROR, req, "faked source MDT");
 			svc_rc = SECSVC_DROP;
 		}
 		break;
 	case LUSTRE_SP_OST:
 		if (!req->rq_auth_usr_ost) {
+			/* The below message is checked in sanity-sec test_33 */
 			DEBUG_REQ(D_ERROR, req, "faked source OST");
 			svc_rc = SECSVC_DROP;
 		}
@@ -2057,6 +2060,7 @@ static int sptlrpc_svc_check_from(struct ptlrpc_request *req, int svc_rc)
 	case LUSTRE_SP_MGC:
 		if (!req->rq_auth_usr_root && !req->rq_auth_usr_mdt &&
 		    !req->rq_auth_usr_ost) {
+			/* The below message is checked in sanity-sec test_33 */
 			DEBUG_REQ(D_ERROR, req, "faked source MGC/MGS");
 			svc_rc = SECSVC_DROP;
 		}
diff --git a/fs/lustre/ptlrpc/service.c b/fs/lustre/ptlrpc/service.c
index f40cb8d..c66c690 100644
--- a/fs/lustre/ptlrpc/service.c
+++ b/fs/lustre/ptlrpc/service.c
@@ -1072,6 +1072,7 @@ static int ptlrpc_at_send_early_reply(struct ptlrpc_request *req)
 		return 0;
 
 	if (olddl < 0) {
+		/* below message is checked in replay-ost-single.sh test_9 */
 		DEBUG_REQ(D_WARNING, req,
 			  "Already past deadline (%+llds), not sending early reply. Consider increasing at_early_margin (%d)?",
 			  (s64)olddl, at_early_margin);
@@ -1104,7 +1105,8 @@ static int ptlrpc_at_send_early_reply(struct ptlrpc_request *req)
 	 * we may be past adaptive_max
 	 */
 	if (req->rq_deadline >= newdl) {
-		DEBUG_REQ(D_WARNING, req, "Couldn't add any time (%ld/%lld), not sending early reply\n",
+		DEBUG_REQ(D_WARNING, req,
+			  "Could not add any time (%ld/%lld), not sending early reply",
 			  olddl, newdl - ktime_get_real_seconds());
 		return -ETIMEDOUT;
 	}
@@ -1140,10 +1142,10 @@ static int ptlrpc_at_send_early_reply(struct ptlrpc_request *req)
 	}
 
 	LASSERT(atomic_read(&req->rq_refcount));
-	/** if it is last refcount then early reply isn't needed */
+	/* if it is last refcount then early reply isn't needed */
 	if (atomic_read(&req->rq_refcount) == 1) {
 		DEBUG_REQ(D_ADAPTTO, reqcopy,
-			  "Normal reply already sent out, abort sending early reply\n");
+			  "Normal reply already sent, abort early reply");
 		rc = -EINVAL;
 		goto out;
 	}
@@ -1174,7 +1176,7 @@ static int ptlrpc_at_send_early_reply(struct ptlrpc_request *req)
 		req->rq_deadline = newdl;
 		req->rq_early_count++; /* number sent, server side */
 	} else {
-		DEBUG_REQ(D_ERROR, req, "Early reply send failed %d", rc);
+		DEBUG_REQ(D_ERROR, req, "Early reply send failed: rc = %d", rc);
 	}
 
 	/*
@@ -1628,7 +1630,7 @@ static int ptlrpc_server_handle_req_in(struct ptlrpc_service_part *svcpt,
 			rc = sptlrpc_target_export_check(req->rq_export, req);
 			if (rc)
 				DEBUG_REQ(D_ERROR, req,
-					  "DROPPING req with illegal security flavor,");
+					  "DROPPING req with illegal security flavor");
 		}
 
 		if (rc)
@@ -1747,7 +1749,7 @@ static int ptlrpc_server_handle_request(struct ptlrpc_service_part *svcpt,
 	 */
 	if (ktime_get_real_seconds() > request->rq_deadline) {
 		DEBUG_REQ(D_ERROR, request,
-			  "Dropping timed-out request from %s: deadline %lld:%llds ago\n",
+			  "Dropping timed-out request from %s: deadline %lld/%llds ago",
 			  libcfs_id2str(request->rq_peer),
 			  request->rq_deadline -
 			  request->rq_arrival_time.tv_sec,
@@ -1787,7 +1789,7 @@ static int ptlrpc_server_handle_request(struct ptlrpc_service_part *svcpt,
 put_conn:
 	if (unlikely(ktime_get_real_seconds() > request->rq_deadline)) {
 		DEBUG_REQ(D_WARNING, request,
-			  "Request took longer than estimated (%lld:%llds); client may timeout.",
+			  "Request took longer than estimated (%lld/%llds); client may timeout",
 			  (s64)request->rq_deadline -
 			       request->rq_arrival_time.tv_sec,
 			  (s64)ktime_get_real_seconds() - request->rq_deadline);
@@ -2061,12 +2063,14 @@ static void ptlrpc_watchdog_fire(struct work_struct *w)
 	u32 ms_frac = do_div(ms_lapse, MSEC_PER_SEC);
 
 	if (!__ratelimit(&watchdog_limit)) {
+		/* below message is checked in sanity-quota.sh test_6,18 */
 		LCONSOLE_WARN("%s: service thread pid %u was inactive for %llu.%.03u seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:\n",
 			      thread->t_task->comm, thread->t_task->pid,
 			      ms_lapse, ms_frac);
 
 		libcfs_debug_dumpstack(thread->t_task);
 	} else {
+		/* below message is checked in sanity-quota.sh test_6,18 */
 		LCONSOLE_WARN("%s: service thread pid %u was inactive for %llu.%.03u seconds. Watchdog stack traces are limited to 3 per %u seconds, skipping this one.\n",
 			      thread->t_task->comm, thread->t_task->pid,
 			      ms_lapse, ms_frac, libcfs_watchdog_ratelimit);
diff --git a/net/lnet/libcfs/module.c b/net/lnet/libcfs/module.c
index 2e803d6..20d4302 100644
--- a/net/lnet/libcfs/module.c
+++ b/net/lnet/libcfs/module.c
@@ -791,6 +791,7 @@ static void libcfs_exit(void)
 
 	cfs_cpu_fini();
 
+	/* the below message is checked in test-framework.sh check_mem_leak() */
 	rc = libcfs_debug_cleanup();
 	if (rc)
 		pr_err("LustreError: libcfs_debug_cleanup: %d\n", rc);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 452/622] lustre: ptlrpc: check buffer length in lustre_msg_string()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (450 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 451/622] lustre: ptlrpc: make DEBUG_REQ messages consistent James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 453/622] lustre: uapi: fix building fail against Power9 little endian James Simmons
                   ` (170 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Emoly Liu <emoly@whamcloud.com>

Check the buffer length in lustre_msg_string() to guard against any
invalid access.
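The shape of the added guard can be sketched in isolation (MAX_BUFLEN and
msg_string_checked() below are illustrative stand-ins for PTLRPC_MAX_BUFLEN
and lustre_msg_string(); the real function also validates the string length
against the caller's expected length, which is omitted here):

```c
#include <stddef.h>

/* Stand-in for PTLRPC_MAX_BUFLEN: an upper bound on any sane message
 * buffer, used to reject corrupt length fields before they are used
 * to index into the buffer. */
#define MAX_BUFLEN 4096

const char *msg_string_checked(const char *buf, size_t blen)
{
	if (blen == 0 || blen > MAX_BUFLEN)
		return NULL;		/* invalid length, refuse access */
	if (buf[blen - 1] != '\0')
		return NULL;		/* not NUL-terminated */
	return buf;
}
```

The key point is that the length is checked before the buffer is
dereferenced, so a corrupt on-wire length can no longer steer the
`buf[blen - 1]` read out of bounds.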

Reported-by: Alibaba Cloud <yunye.ry@alibaba-inc.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-12613
Lustre-commit: 728c58d60fae ("LU-12613 ptlrpc: check buffer length in lustre_msg_string()")
Signed-off-by: Emoly Liu <emoly@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35932
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-by: Yunye Ry <yunye.ry@alibaba-inc.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/pack_generic.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/lustre/ptlrpc/pack_generic.c b/fs/lustre/ptlrpc/pack_generic.c
index 4a0856a..9b28624 100644
--- a/fs/lustre/ptlrpc/pack_generic.c
+++ b/fs/lustre/ptlrpc/pack_generic.c
@@ -712,6 +712,11 @@ char *lustre_msg_string(struct lustre_msg *m, u32 index, u32 max_len)
 		       m, index, blen);
 		return NULL;
 	}
+	if (blen > PTLRPC_MAX_BUFLEN) {
+		CERROR("buffer length of msg %p buffer[%d] is invalid(%d)\n",
+		       m, index, blen);
+		return NULL;
+	}
 
 	if (max_len == 0) {
 		if (slen != blen - 1) {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 453/622] lustre: uapi: fix building fail against Power9 little endian
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (451 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 452/622] lustre: ptlrpc: check buffer length in lustre_msg_string() James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 454/622] lustre: ptlrpc: fix reply buffers shrinking and growing James Simmons
                   ` (169 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Gu Zheng <gzheng@ddn.com>

We use "%ll[dux]" as the format specifier for __u64 variables,
which can cause build errors on architectures that use "long" for
64-bit types, for example Power9 little endian. Add the necessary
casts (long long/unsigned long long) to make the build succeed.
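The portability issue can be reproduced outside the kernel. In this
hedged sketch, struct extent and fmt_extent() are stand-ins for
lu_extent and the DEXT/PEXT macros; the point is that an explicit
cast to unsigned long long makes "%#llx" correct whether the
platform's 64-bit type is "long" or "long long":

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for Lustre's lu_extent; __u64 maps to uint64_t. */
struct extent {
	uint64_t e_start, e_end;
};

/* Format an extent the way the DEXT/PEXT macros do. Without the
 * casts, uint64_t may be "unsigned long" (e.g. Power9 LE), and
 * "%#llx" would then trigger a -Wformat build error. */
static int fmt_extent(char *buf, size_t len, const struct extent *ext)
{
	return snprintf(buf, len, "[%#llx, %#llx)",
			(unsigned long long)ext->e_start,
			(unsigned long long)ext->e_end);
}
```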

WC-bug-id: https://jira.whamcloud.com/browse/LU-12705
Lustre-commit: 4eddf36ac360 ("LU-12705 build: fix building fail against Power9 little endian")
Signed-off-by: Gu Zheng <gzheng@ddn.com>
Reviewed-on: https://review.whamcloud.com/36007
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Li Xi <lixi@ddn.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/uapi/linux/lustre/lustre_user.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 3016b73..695ceb2 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -512,7 +512,7 @@ struct lu_extent {
 };
 
 #define DEXT "[%#llx, %#llx)"
-#define PEXT(ext) (ext)->e_start, (ext)->e_end
+#define PEXT(ext) (unsigned long long)(ext)->e_start, (unsigned long long)(ext)->e_end
 
 static inline bool lu_extent_is_overlapped(struct lu_extent *e1,
 					    struct lu_extent *e2)
-- 
1.8.3.1


* [lustre-devel] [PATCH 454/622] lustre: ptlrpc: fix reply buffers shrinking and growing
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (452 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 453/622] lustre: uapi: fix building fail against Power9 little endian James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 455/622] lustre: dom: manual OST-to-DOM migration via mirroring James Simmons
                   ` (168 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

req_capsule_shrink() doesn't update the capsule itself with the
new buffer lengths after shrinking. Usually that is not needed
because the reply is already packed, but if the reply buffers are
re-allocated by req_capsule_server_grow(), the stale lengths from
the capsule are used, producing a bigger reply message. That may
force a client buffer re-allocation and a resend.

The patch does the following:
- update the capsule length after shrinking
- introduce lustre_grow_msg() to grow a msg field in place
- update req_capsule_server_grow() to use the generic
  lustre_grow_msg() and make it able to grow a reply without
  re-allocation if the reply buffer is already big enough
- update sanity test 271f to use a bigger file size that exceeds
  the current maximum reply buffer size allocated on the client.
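The failure mode is easiest to see with a toy model (not the real
req_capsule API): a capsule records per-field lengths, shrink resizes
the packed message, and a later grow re-packs from the recorded
lengths. If shrink forgets to update the record, the grow re-packs at
the old, larger size:

```c
#include <assert.h>

/* Toy capsule: one field, with a recorded and a packed length. */
struct capsule {
	int recorded_len;	/* what a later re-pack will use */
	int packed_len;		/* what is actually in the message */
};

/* Shrink the packed field AND keep the recorded length in sync,
 * mirroring the fix: req_capsule_set_size() after lustre_shrink_msg(). */
static void capsule_shrink(struct capsule *c, int newlen)
{
	c->packed_len = newlen;
	c->recorded_len = newlen;	/* the update the old code lacked */
}

/* Grow re-packs the message from the recorded lengths; with a stale
 * (un-shrunk) record it would return the old, larger size. */
static int capsule_grow(struct capsule *c, int newlen)
{
	if (newlen > c->recorded_len)
		c->recorded_len = newlen;
	c->packed_len = c->recorded_len;
	return c->packed_len;
}
```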

WC-bug-id: https://jira.whamcloud.com/browse/LU-12443
Lustre-commit: cedbb25e984c ("LU-12443 ptlrpc: fix reply buffers shrinking and growing")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35243
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/layout.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/ptlrpc/layout.c b/fs/lustre/ptlrpc/layout.c
index 67a7cd5..dd04eee 100644
--- a/fs/lustre/ptlrpc/layout.c
+++ b/fs/lustre/ptlrpc/layout.c
@@ -2309,11 +2309,16 @@ void req_capsule_shrink(struct req_capsule *pill,
 	LASSERTF(newlen <= len, "%s:%s, oldlen=%u, newlen=%u\n",
 		 fmt->rf_name, field->rmf_name, len, newlen);
 
-	if (loc == RCL_CLIENT)
+	if (loc == RCL_CLIENT) {
 		pill->rc_req->rq_reqlen = lustre_shrink_msg(msg, offset, newlen,
 							    1);
-	else
+	} else {
 		pill->rc_req->rq_replen = lustre_shrink_msg(msg, offset, newlen,
 							    1);
+		/* update also field size in reply lenghts arrays for possible
+		 * reply re-pack due to req_capsule_server_grow() call.
+		 */
+		req_capsule_set_size(pill, field, loc, newlen);
+	}
 }
 EXPORT_SYMBOL(req_capsule_shrink);
-- 
1.8.3.1


* [lustre-devel] [PATCH 455/622] lustre: dom: manual OST-to-DOM migration via mirroring
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (453 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 454/622] lustre: ptlrpc: fix reply buffers shrinking and growing James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 456/622] lustre: fld: remove fci_no_shrink field James Simmons
                   ` (167 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

Allow DOM mirroring: update the LOV/LOD code to check not just
the first component for the DOM pattern but to cycle through all
mirrors, if any. Sanity checks allow one DOM component per mirror,
and it must be the first one. Multiple DOM components are allowed
only if they have the same size, for now.

Migrate an OST file to the MDT by using FLR. That can't be done
by layout swapping, because the MDT data would be tied to a
temporary volatile file, while we want to keep the data with the
original file. Mirroring allows it with the following steps:
- extend the layout with a new mirror on the MDT; no data is
  copied, and the new mirror stays in the 'stale' state (for the
  same volatile-file reason).
- resync the mirrors; the new DOM layout is now filled with data.
- remove the first mirror.
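The sanity rule above — every DOM component across mirrors must share
one size — reduces to a first-seen comparison. A sketch with plain
integers standing in for the real lsme extent structures
(check_dom_sizes() is a hypothetical helper, not the Lustre function):

```c
#include <assert.h>
#include <stddef.h>

/* Return 0 if all DOM end offsets agree (or there are none), -1 on
 * a mismatch; models the dom_size check in lov_init_composite(). */
static int check_dom_sizes(const unsigned long long *dom_end, size_t n)
{
	unsigned long long dom_size = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!dom_size)
			dom_size = dom_end[i];	/* first DOM entry seen */
		else if (dom_size != dom_end[i])
			return -1;		/* sizes differ: reject layout */
	}
	return 0;
}
```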

WC-bug-id: https://jira.whamcloud.com/browse/LU-11421
Lustre-commit: 44a721b8c106 ("LU-11421 dom: manual OST-to-DOM migration via mirroring")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35359
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/lov/lov_object.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/lov/lov_object.c b/fs/lustre/lov/lov_object.c
index 52d8c30..5c4d8f9 100644
--- a/fs/lustre/lov/lov_object.c
+++ b/fs/lustre/lov/lov_object.c
@@ -543,7 +543,13 @@ static int lov_init_dom(const struct lu_env *env, struct lov_device *dev,
 	u32 idx = 0;
 	int rc;
 
-	LASSERT(index == 0);
+	/* DOM entry may be not zero index due to FLR but must start from 0 */
+	if (unlikely(lle->lle_extent->e_start != 0)) {
+		CERROR("%s: DOM entry must be the first stripe in a mirror\n",
+		       lov2obd(dev->ld_lov)->obd_name);
+		dump_lsm(D_ERROR, lov->lo_lsm);
+		return -EINVAL;
+	}
 
 	/* find proper MDS device */
 	rc = lov_fld_lookup(dev, fid, &idx);
@@ -636,6 +642,7 @@ static int lov_init_composite(const struct lu_env *env, struct lov_device *dev,
 	int result = 0;
 	unsigned int seq;
 	int i, j;
+	bool dom_size = 0;
 
 	LASSERT(lsm->lsm_entry_count > 0);
 	LASSERT(!lov->lo_lsm);
@@ -679,6 +686,18 @@ static int lov_init_composite(const struct lu_env *env, struct lov_device *dev,
 			lle->lle_comp_ops = &raid0_ops;
 			break;
 		case LOV_PATTERN_MDT:
+			/* Allowed to have several DOM stripes in different
+			 * mirrors with the same DoM size.
+			 */
+			if (!dom_size) {
+				dom_size = lle->lle_lsme->lsme_extent.e_end;
+			} else if (dom_size !=
+				   lle->lle_lsme->lsme_extent.e_end) {
+				CERROR("%s: DOM entries with different sizes\n",
+				       lov2obd(dev->ld_lov)->obd_name);
+				dump_lsm(D_ERROR, lsm);
+				return -EINVAL;
+			}
 			lle->lle_comp_ops = &dom_ops;
 			break;
 		default:
@@ -869,7 +888,8 @@ static void lov_fini_composite(const struct lu_env *env,
 		struct lov_layout_entry *entry;
 
 		lov_foreach_layout_entry(lov, entry)
-			entry->lle_comp_ops->lco_fini(env, entry);
+			if (entry->lle_comp_ops)
+				entry->lle_comp_ops->lco_fini(env, entry);
 
 		kvfree(comp->lo_entries);
 		comp->lo_entries = NULL;
-- 
1.8.3.1


* [lustre-devel] [PATCH 456/622] lustre: fld: remove fci_no_shrink field.
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (454 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 455/622] lustre: dom: manual OST-to-DOM migration via mirroring James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 457/622] lustre: lustre: remove ldt_obd_type field of lu_device_type James Simmons
                   ` (166 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

This field is never set, so it is always zero.
Remove it, along with the one place where it is tested.

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: e669586775c6 ("LU-6142 fld: remove fci_no_shrink field.")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35875
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/fld/fld_cache.c    | 3 +--
 fs/lustre/fld/fld_internal.h | 1 -
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/lustre/fld/fld_cache.c b/fs/lustre/fld/fld_cache.c
index 5267ba2..79b10bb 100644
--- a/fs/lustre/fld/fld_cache.c
+++ b/fs/lustre/fld/fld_cache.c
@@ -381,8 +381,7 @@ static int fld_cache_insert_nolock(struct fld_cache *cache,
 	 * insertion loop.
 	 */
 
-	if (!cache->fci_no_shrink)
-		fld_cache_shrink(cache);
+	fld_cache_shrink(cache);
 
 	head = &cache->fci_entries_head;
 
diff --git a/fs/lustre/fld/fld_internal.h b/fs/lustre/fld/fld_internal.h
index 465c6ccf..53648d2 100644
--- a/fs/lustre/fld/fld_internal.h
+++ b/fs/lustre/fld/fld_internal.h
@@ -109,7 +109,6 @@ struct fld_cache {
 
 	/** Cache name used for debug and messages. */
 	char			fci_name[LUSTRE_MDT_MAXNAMELEN];
-	unsigned int		fci_no_shrink:1;
 };
 
 enum {
-- 
1.8.3.1


* [lustre-devel] [PATCH 457/622] lustre: lustre: remove ldt_obd_type field of lu_device_type
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (455 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 456/622] lustre: fld: remove fci_no_shrink field James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 458/622] lustre: lustre: remove imp_no_timeout field James Simmons
                   ` (165 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

This field is never set, so it is always NULL.
Remove it, along with the one place it is used and a variable
that will now never be set.

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: 5274e833f5e6 ("LU-6142 lustre: remove ldt_obd_type field of lu_device_type")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35876
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lu_object.h  | 5 +----
 fs/lustre/obdclass/lu_object.c | 6 ------
 2 files changed, 1 insertion(+), 10 deletions(-)

diff --git a/fs/lustre/include/lu_object.h b/fs/lustre/include/lu_object.h
index b00fad8..aed0d4b 100644
--- a/fs/lustre/include/lu_object.h
+++ b/fs/lustre/include/lu_object.h
@@ -43,6 +43,7 @@
 struct seq_file;
 struct lustre_cfg;
 struct lprocfs_stats;
+struct obd_type;
 
 /** \defgroup lu lu
  * lu_* data-types represent server-side entities shared by data and meta-data
@@ -319,10 +320,6 @@ struct lu_device_type {
 	 */
 	const struct lu_device_type_operations	*ldt_ops;
 	/**
-	 * \todo XXX: temporary pointer to associated obd_type.
-	 */
-	struct obd_type				*ldt_obd_type;
-	/**
 	 * \todo XXX: temporary: context tags used by obd_*() calls.
 	 */
 	u32					ldt_ctx_tags;
diff --git a/fs/lustre/obdclass/lu_object.c b/fs/lustre/obdclass/lu_object.c
index dccff91..38c04c7 100644
--- a/fs/lustre/obdclass/lu_object.c
+++ b/fs/lustre/obdclass/lu_object.c
@@ -1336,14 +1336,8 @@ void lu_stack_fini(const struct lu_env *env, struct lu_device *top)
 
 	for (scan = top; scan; scan = next) {
 		const struct lu_device_type *ldt = scan->ld_type;
-		struct obd_type *type;
 
 		next = ldt->ldt_ops->ldto_device_free(env, scan);
-		type = ldt->ldt_obd_type;
-		if (type) {
-			atomic_dec(&type->typ_refcnt);
-			class_put_type(type);
-		}
 	}
 }
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 458/622] lustre: lustre: remove imp_no_timeout field
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (456 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 457/622] lustre: lustre: remove ldt_obd_type field of lu_device_type James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 459/622] lustre: llog: remove olg_cat_processing field James Simmons
                   ` (164 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

This field is never set and never used.  Remove it.

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: b9dd17681bfa ("LU-6142 lustre: remove imp_no_timeout field")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35877
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_import.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/lustre/include/lustre_import.h b/fs/lustre/include/lustre_import.h
index ff171d1..c2f98e6 100644
--- a/fs/lustre/include/lustre_import.h
+++ b/fs/lustre/include/lustre_import.h
@@ -273,8 +273,7 @@ struct obd_import {
 	spinlock_t			imp_lock;
 
 	/* flags */
-	unsigned long			imp_no_timeout:1, /* timeouts are disabled */
-					imp_invalid:1,    /* evicted */
+	unsigned long			imp_invalid:1,    /* evicted */
 					/* administratively disabled */
 					imp_deactive:1,
 					/* try to recover the import */
-- 
1.8.3.1


* [lustre-devel] [PATCH 459/622] lustre: llog: remove olg_cat_processing field.
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (457 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 458/622] lustre: lustre: remove imp_no_timeout field James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 460/622] lustre: ptlrpc: remove struct ptlrpc_bulk_page James Simmons
                   ` (163 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

This mutex is initialized but never used.
Remove it.

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: 2801ef81f1d0 ("LU-6142 llog: remove olg_cat_processing field.")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35878
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_log.h | 1 -
 fs/lustre/include/obd.h        | 1 -
 2 files changed, 2 deletions(-)

diff --git a/fs/lustre/include/lustre_log.h b/fs/lustre/include/lustre_log.h
index 99c6305..9c784ac 100644
--- a/fs/lustre/include/lustre_log.h
+++ b/fs/lustre/include/lustre_log.h
@@ -288,7 +288,6 @@ static inline void llog_group_init(struct obd_llog_group *olg)
 {
 	init_waitqueue_head(&olg->olg_waitq);
 	spin_lock_init(&olg->olg_lock);
-	mutex_init(&olg->olg_cat_processing);
 }
 
 static inline int llog_group_set_ctxt(struct obd_llog_group *olg,
diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index 70dbaaf..ef37f78 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -527,7 +527,6 @@ struct obd_llog_group {
 	struct llog_ctxt       *olg_ctxts[LLOG_MAX_CTXTS];
 	wait_queue_head_t	olg_waitq;
 	spinlock_t		olg_lock;
-	struct mutex		olg_cat_processing;
 };
 
 /* corresponds to one of the obd's */
-- 
1.8.3.1


* [lustre-devel] [PATCH 460/622] lustre: ptlrpc: remove struct ptlrpc_bulk_page
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (458 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 459/622] lustre: llog: remove olg_cat_processing field James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 461/622] lustre: ptlrpc: remove bd_import_generation field James Simmons
                   ` (162 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

This structure is never used, so remove it.

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: 2b9bf4c00bce ("LU-6142 ptlrpc: remove struct ptlrpc_bulk_page")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35879
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_net.h | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/fs/lustre/include/lustre_net.h b/fs/lustre/include/lustre_net.h
index f16c6d3..faf15e9 100644
--- a/fs/lustre/include/lustre_net.h
+++ b/fs/lustre/include/lustre_net.h
@@ -1146,22 +1146,6 @@ void _debug_req(struct ptlrpc_request *req,
 } while (0)
 /** @} */
 
-/**
- * Structure that defines a single page of a bulk transfer
- */
-struct ptlrpc_bulk_page {
-	/** Linkage to list of pages in a bulk */
-	struct list_head	bp_link;
-	/**
-	 * Number of bytes in a page to transfer starting from @bp_pageoffset
-	 */
-	int			bp_buflen;
-	/** offset within a page */
-	int			bp_pageoffset;
-	/** The page itself */
-	struct page		*bp_page;
-};
-
 enum ptlrpc_bulk_op_type {
 	PTLRPC_BULK_OP_ACTIVE	= 0x00000001,
 	PTLRPC_BULK_OP_PASSIVE	= 0x00000002,
-- 
1.8.3.1


* [lustre-devel] [PATCH 461/622] lustre: ptlrpc: remove bd_import_generation field.
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (459 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 460/622] lustre: ptlrpc: remove struct ptlrpc_bulk_page James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 462/622] lustre: ptlrpc: remove srv_threads from struct ptlrpc_service James Simmons
                   ` (161 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

This field is set, but never accessed. So remove it.

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: 531bbc669d66 ("LU-6142 ptlrpc: remove bd_import_generation field.")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35880
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_net.h | 2 --
 fs/lustre/ptlrpc/client.c      | 1 -
 2 files changed, 3 deletions(-)

diff --git a/fs/lustre/include/lustre_net.h b/fs/lustre/include/lustre_net.h
index faf15e9..bec92cf 100644
--- a/fs/lustre/include/lustre_net.h
+++ b/fs/lustre/include/lustre_net.h
@@ -1251,8 +1251,6 @@ struct ptlrpc_bulk_desc {
 	unsigned long			bd_registered:1;
 	/** For serialization with callback */
 	spinlock_t			bd_lock;
-	/** Import generation when request for this bulk was sent */
-	int				bd_import_generation;
 	/** {put,get}{source,sink}{kvec,kiov} */
 	enum ptlrpc_bulk_op_type	bd_type;
 	/** LNet portal for this bulk */
diff --git a/fs/lustre/ptlrpc/client.c b/fs/lustre/ptlrpc/client.c
index d2e5e04..478ba85 100644
--- a/fs/lustre/ptlrpc/client.c
+++ b/fs/lustre/ptlrpc/client.c
@@ -210,7 +210,6 @@ struct ptlrpc_bulk_desc *ptlrpc_prep_bulk_imp(struct ptlrpc_request *req,
 	if (!desc)
 		return NULL;
 
-	desc->bd_import_generation = req->rq_import_generation;
 	desc->bd_import = class_import_get(imp);
 	desc->bd_req = req;
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 462/622] lustre: ptlrpc: remove srv_threads from struct ptlrpc_service
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (460 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 461/622] lustre: ptlrpc: remove bd_import_generation field James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 463/622] lustre: ptlrpc: remove scp_nthrs_stopping field James Simmons
                   ` (160 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

The threads are not stored here - nothing is.
Threads are stored in svcpt->scp_threads.
So remove the field and update the comment.

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: 6d1062cdffca ("LU-6142 ptlrpc: remove srv_threads from struct ptlrpc_service")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35881
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_net.h | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/lustre/include/lustre_net.h b/fs/lustre/include/lustre_net.h
index bec92cf..68db603 100644
--- a/fs/lustre/include/lustre_net.h
+++ b/fs/lustre/include/lustre_net.h
@@ -1314,7 +1314,7 @@ enum {
  */
 struct ptlrpc_thread {
 	/**
-	 * List of active threads in svc->srv_threads
+	 * List of active threads in svcpt->scp_threads
 	 */
 	struct list_head		t_link;
 	/**
@@ -1474,8 +1474,6 @@ struct ptlrpc_service {
 	char				*srv_name;
 	/** only statically allocated strings here; we don't clean them */
 	char				*srv_thread_name;
-	/** service thread list */
-	struct list_head		srv_threads;
 	/** threads # should be created for each partition on initializing */
 	int				srv_nthrs_cpt_init;
 	/** limit of threads number for each partition */
-- 
1.8.3.1


* [lustre-devel] [PATCH 463/622] lustre: ptlrpc: remove scp_nthrs_stopping field.
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (461 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 462/622] lustre: ptlrpc: remove srv_threads from struct ptlrpc_service James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 464/622] lustre: ldlm: remove unused ldlm_server_conn James Simmons
                   ` (159 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

This field is unused, so remove it.
If "shrinking threads" is ever needed, any extra fields
required can be added then.

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: 7233248e565f ("LU-6142 ptlrpc: remove scp_nthrs_stopping field.")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35882
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_net.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/lustre/include/lustre_net.h b/fs/lustre/include/lustre_net.h
index 68db603..aaf5cb8 100644
--- a/fs/lustre/include/lustre_net.h
+++ b/fs/lustre/include/lustre_net.h
@@ -1557,8 +1557,6 @@ struct ptlrpc_service_part {
 	int				scp_thr_nextid;
 	/** # of starting threads */
 	int				scp_nthrs_starting;
-	/** # of stopping threads, reserved for shrinking threads */
-	int				scp_nthrs_stopping;
 	/** # running threads */
 	int				scp_nthrs_running;
 	/** service threads list */
-- 
1.8.3.1


* [lustre-devel] [PATCH 464/622] lustre: ldlm: remove unused ldlm_server_conn
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (462 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 463/622] lustre: ptlrpc: remove scp_nthrs_stopping field James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 465/622] lustre: llite: remove lli_readdir_mutex James Simmons
                   ` (158 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

This field is never set or used, so remove it.

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: 047a6185a1ed ("LU-6142 ldlm: remove unused ldlm_server_conn")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35883
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_internal.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/lustre/ldlm/ldlm_internal.h b/fs/lustre/ldlm/ldlm_internal.h
index 4844a9b..336d9b7 100644
--- a/fs/lustre/ldlm/ldlm_internal.h
+++ b/fs/lustre/ldlm/ldlm_internal.h
@@ -197,7 +197,6 @@ struct ldlm_state {
 	struct ptlrpc_service *ldlm_cb_service;
 	struct ptlrpc_service *ldlm_cancel_service;
 	struct ptlrpc_client *ldlm_client;
-	struct ptlrpc_connection *ldlm_server_conn;
 	struct ldlm_bl_pool *ldlm_bl_pool;
 };
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 465/622] lustre: llite: remove lli_readdir_mutex
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (463 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 464/622] lustre: ldlm: remove unused ldlm_server_conn James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 466/622] lustre: llite: remove ll_umounting field James Simmons
                   ` (157 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

This mutex is initialized but never used, so
remove it.

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: 26bf41c177a5 ("LU-6142 llite: remove lli_readdir_mutex")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35884
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_internal.h | 3 ---
 fs/lustre/llite/llite_lib.c      | 1 -
 2 files changed, 4 deletions(-)

diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index 232fb0a..77854a5 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -143,9 +143,6 @@ struct ll_inode_info {
 	union {
 		/* for directory */
 		struct {
-			/* serialize normal readdir and statahead-readdir. */
-			struct mutex			lli_readdir_mutex;
-
 			/* metadata statahead */
 			/* since parent-child threads can share the same @file
 			 * struct, "opendir_key" is the token when dir close for
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 217268e..7d83ee3 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -960,7 +960,6 @@ void ll_lli_init(struct ll_inode_info *lli)
 
 	LASSERT(lli->lli_vfs_inode.i_mode != 0);
 	if (S_ISDIR(lli->lli_vfs_inode.i_mode)) {
-		mutex_init(&lli->lli_readdir_mutex);
 		lli->lli_opendir_key = NULL;
 		lli->lli_sai = NULL;
 		spin_lock_init(&lli->lli_sa_lock);
-- 
1.8.3.1


* [lustre-devel] [PATCH 466/622] lustre: llite: remove ll_umounting field
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (464 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 465/622] lustre: llite: remove lli_readdir_mutex James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 467/622] lustre: llite: align field names in ll_sb_info James Simmons
                   ` (156 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

This field is set but never accessed, so remove it.

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: 15b83c9b7b28 ("LU-6142 llite: remove ll_umounting field")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35885
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_internal.h | 3 +--
 fs/lustre/llite/llite_lib.c      | 1 -
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index 77854a5..6186720 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -505,8 +505,7 @@ struct ll_sb_info {
 	struct lu_fid	     ll_root_fid; /* root object fid */
 
 	int		       ll_flags;
-	unsigned int		  ll_umounting:1,
-				  ll_xattr_cache_enabled:1,
+	unsigned int		  ll_xattr_cache_enabled:1,
 				ll_xattr_cache_set:1, /* already set to 0/1 */
 				  ll_client_common_fill_super_succeeded:1,
 				  ll_checksum_set:1;
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 7d83ee3..ad7c2e2 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -785,7 +785,6 @@ void ll_kill_super(struct super_block *sb)
 	 */
 	if (sbi) {
 		sb->s_dev = sbi->ll_sdev_orig;
-		sbi->ll_umounting = 1;
 
 		/* wait running statahead threads to quit */
 		while (atomic_read(&sbi->ll_sa_running) > 0)
-- 
1.8.3.1

* [lustre-devel] [PATCH 467/622] lustre: llite: align field names in ll_sb_info
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (465 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 466/622] lustre: llite: remove ll_umounting field James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 468/622] lustre: llite: remove lti_iter field James Simmons
                   ` (155 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

Align field names and most comments in
struct ll_sb_info.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_internal.h | 74 ++++++++++++++++++++--------------------
 1 file changed, 37 insertions(+), 37 deletions(-)

diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index 6186720..bb5f519 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -493,26 +493,26 @@ struct ll_sb_info {
 	/* this protects pglist and ra_info.  It isn't safe to
 	 * grab from interrupt contexts
 	 */
-	spinlock_t		  ll_lock;
-	spinlock_t		  ll_pp_extent_lock; /* pp_extent entry*/
-	spinlock_t		  ll_process_lock; /* ll_rw_process_info */
-	struct obd_uuid	   ll_sb_uuid;
+	spinlock_t		ll_lock;
+	spinlock_t		ll_pp_extent_lock; /* pp_extent entry*/
+	spinlock_t		ll_process_lock; /* ll_rw_process_info */
+	struct obd_uuid		ll_sb_uuid;
 	struct obd_export	*ll_md_exp;
 	struct obd_export	*ll_dt_exp;
 	struct obd_device	*ll_md_obd;
 	struct obd_device	*ll_dt_obd;
 	struct dentry		*ll_debugfs_entry;
-	struct lu_fid	     ll_root_fid; /* root object fid */
+	struct lu_fid		ll_root_fid; /* root object fid */
 
-	int		       ll_flags;
-	unsigned int		  ll_xattr_cache_enabled:1,
+	int			ll_flags;
+	unsigned int		ll_xattr_cache_enabled:1,
 				ll_xattr_cache_set:1, /* already set to 0/1 */
-				  ll_client_common_fill_super_succeeded:1,
-				  ll_checksum_set:1;
+				ll_client_common_fill_super_succeeded:1,
+				ll_checksum_set:1;
 
-	struct lustre_client_ocd  ll_lco;
+	struct lustre_client_ocd ll_lco;
 
-	struct lprocfs_stats     *ll_stats; /* lprocfs stats counter */
+	struct lprocfs_stats	*ll_stats; /* lprocfs stats counter */
 
 	/*
 	 * Used to track "unstable" pages on a client, and maintain a
@@ -520,58 +520,58 @@ struct ll_sb_info {
 	 * any page which is sent to a server as part of a bulk request,
 	 * but is uncommitted to stable storage.
 	 */
-	struct cl_client_cache    *ll_cache;
+	struct cl_client_cache	*ll_cache;
 
-	struct lprocfs_stats     *ll_ra_stats;
+	struct lprocfs_stats	*ll_ra_stats;
 
-	struct ll_ra_info	 ll_ra_info;
-	unsigned int	      ll_namelen;
+	struct ll_ra_info	ll_ra_info;
+	unsigned int		ll_namelen;
 	const struct file_operations	*ll_fop;
 
-	struct lu_site	   *ll_site;
-	struct cl_device	 *ll_cl;
+	struct lu_site		*ll_site;
+	struct cl_device	*ll_cl;
 	/* Statistics */
 	struct ll_rw_extents_info ll_rw_extents_info;
-	int		       ll_extent_process_count;
+	int			ll_extent_process_count;
 	struct ll_rw_process_info ll_rw_process_info[LL_PROCESS_HIST_MAX];
-	unsigned int	      ll_offset_process_count;
+	unsigned int		ll_offset_process_count;
 	struct ll_rw_process_info ll_rw_offset_info[LL_OFFSET_HIST_MAX];
-	unsigned int	      ll_rw_offset_entry_count;
-	int		       ll_stats_track_id;
-	enum stats_track_type     ll_stats_track_type;
-	int		       ll_rw_stats_on;
+	unsigned int		ll_rw_offset_entry_count;
+	int			ll_stats_track_id;
+	enum stats_track_type	ll_stats_track_type;
+	int			ll_rw_stats_on;
 
 	/* metadata stat-ahead */
 	unsigned int		ll_sa_running_max; /* max concurrent
 						    * statahead instances
 						    */
-	unsigned int	      ll_sa_max;     /* max statahead RPCs */
-	atomic_t		  ll_sa_total;   /* statahead thread started
-						  * count
-						  */
-	atomic_t		  ll_sa_wrong;   /* statahead thread stopped for
-						  * low hit ratio
-						  */
+	unsigned int		ll_sa_max;	/* max statahead RPCs */
+	atomic_t		ll_sa_total;	/* statahead thread started
+						 * count
+						 */
+	atomic_t		ll_sa_wrong;	/* statahead thread stopped for
+						 * low hit ratio
+						 */
 	atomic_t		ll_sa_running;	/* running statahead thread
 						 * count
 						 */
-	atomic_t		  ll_agl_total;  /* AGL thread started count */
+	atomic_t		ll_agl_total;	/* AGL thread started count */
 
-	dev_t			  ll_sdev_orig; /* save s_dev before assign for
+	dev_t			ll_sdev_orig;	/* save s_dev before assign for
 						 * clustered nfs
 						 */
 	/* root squash */
-	struct root_squash_info	  ll_squash;
-	struct path		 ll_mnt;
+	struct root_squash_info	ll_squash;
+	struct path		ll_mnt;
 
 	/* st_blksize returned by stat(2), when non-zero */
-	unsigned int		 ll_stat_blksize;
+	unsigned int		ll_stat_blksize;
 
 	/* maximum relative age of cached statfs results */
-	unsigned int		  ll_statfs_max_age;
+	unsigned int		ll_statfs_max_age;
 
 	struct kset		ll_kset;	/* sysfs object */
-	struct completion	 ll_kobj_unregister;
+	struct completion	ll_kobj_unregister;
 
 	/* File heat */
 	unsigned int		ll_heat_decay_weight;
-- 
1.8.3.1

* [lustre-devel] [PATCH 468/622] lustre: llite: remove lti_iter field
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (466 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 467/622] lustre: llite: align field names in ll_sb_info James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 469/622] lustre: llite: remove ft_mtime field James Simmons
                   ` (154 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

This field is never used, so remove it.

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: 0140f50c1287 ("LU-6142 llite: remove lti_iter field")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35886
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_internal.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index bb5f519..025d33e 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -1033,7 +1033,6 @@ struct ll_cl_context {
 };
 
 struct ll_thread_info {
-	struct iov_iter		lti_iter;
 	struct vvp_io_args	lti_args;
 	struct ra_io_arg	lti_ria;
 	struct ll_cl_context	lti_io_ctx;
-- 
1.8.3.1

* [lustre-devel] [PATCH 469/622] lustre: llite: remove ft_mtime field
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (467 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 468/622] lustre: llite: remove lti_iter field James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 470/622] lustre: llite: remove sub_reenter field James Simmons
                   ` (153 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

This field is set but never accessed, so remove it.

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: b674c418fa04 ("LU-6142 llite: remove ft_mtime field")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35887
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/vvp_internal.h | 5 -----
 fs/lustre/llite/vvp_io.c       | 1 -
 2 files changed, 6 deletions(-)

diff --git a/fs/lustre/llite/vvp_internal.h b/fs/lustre/llite/vvp_internal.h
index 7a463cb..1cc152f 100644
--- a/fs/lustre/llite/vvp_internal.h
+++ b/fs/lustre/llite/vvp_internal.h
@@ -66,11 +66,6 @@ struct vvp_io {
 
 	union {
 		struct vvp_fault_io {
-			/**
-			 * Inode modification time that is checked across DLM
-			 * lock request.
-			 */
-			time64_t		ft_mtime;
 			struct vm_area_struct	*ft_vma;
 			/**
 			 *  locked page returned from vvp_io
diff --git a/fs/lustre/llite/vvp_io.c b/fs/lustre/llite/vvp_io.c
index e676e62..d0d8b1f 100644
--- a/fs/lustre/llite/vvp_io.c
+++ b/fs/lustre/llite/vvp_io.c
@@ -271,7 +271,6 @@ static int vvp_io_fault_iter_init(const struct lu_env *env,
 	struct inode *inode = vvp_object_inode(ios->cis_obj);
 
 	LASSERT(inode == file_inode(vio->vui_fd->fd_file));
-	vio->u.fault.ft_mtime = inode->i_mtime.tv_sec;
 	return 0;
 }
 
-- 
1.8.3.1

* [lustre-devel] [PATCH 470/622] lustre: llite: remove sub_reenter field.
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (468 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 469/622] lustre: llite: remove ft_mtime field James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 471/622] lustre: osc: remove oti_descr oti_handle oti_plist James Simmons
                   ` (152 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

This field is never set or accessed, so
remove it.

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: 3118333b9664 ("LU-6142 llite: remove sub_reenter field.")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35888
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/lov/lov_cl_internal.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/lustre/lov/lov_cl_internal.h b/fs/lustre/lov/lov_cl_internal.h
index 6fea0f5..40bb6f0 100644
--- a/fs/lustre/lov/lov_cl_internal.h
+++ b/fs/lustre/lov/lov_cl_internal.h
@@ -509,7 +509,6 @@ struct lov_io_sub {
 	 * \see cl_env_get()
 	 */
 	u16			sub_refcheck;
-	u16			sub_reenter;
 };
 
 /**
-- 
1.8.3.1

* [lustre-devel] [PATCH 471/622] lustre: osc: remove oti_descr oti_handle oti_plist
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (469 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 470/622] lustre: llite: remove sub_reenter field James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 472/622] lustre: osc: remove oe_next_page James Simmons
                   ` (151 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

These three fields in 'struct osc_thread_info' are
unused, so remove them.

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: 3bc9a5e32542 ("LU-6142 osc: remove oti_descr oti_handle oti_plist")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35889
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_osc.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/lustre/include/lustre_osc.h b/fs/lustre/include/lustre_osc.h
index 37e56ef..044185d 100644
--- a/fs/lustre/include/lustre_osc.h
+++ b/fs/lustre/include/lustre_osc.h
@@ -176,10 +176,7 @@ struct osc_session {
 struct osc_thread_info {
 	struct ldlm_res_id	oti_resname;
 	union ldlm_policy_data	oti_policy;
-	struct cl_lock_descr	oti_descr;
 	struct cl_attr		oti_attr;
-	struct lustre_handle	oti_handle;
-	struct cl_page_list	oti_plist;
 	struct cl_io		oti_io;
 	struct pagevec		oti_pagevec;
 	void			*oti_pvec[OTI_PVEC_SIZE];
-- 
1.8.3.1

* [lustre-devel] [PATCH 472/622] lustre: osc: remove oe_next_page
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (470 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 471/622] lustre: osc: remove oti_descr oti_handle oti_plist James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 473/622] lnet: o2iblnd: remove some unused fields James Simmons
                   ` (150 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

As the comment says, this field is unused.  So remove it.

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: d1b08c58b43e ("LU-6142 osc: remove oe_next_page")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35890
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_osc.h | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/fs/lustre/include/lustre_osc.h b/fs/lustre/include/lustre_osc.h
index 044185d..de7ccd6 100644
--- a/fs/lustre/include/lustre_osc.h
+++ b/fs/lustre/include/lustre_osc.h
@@ -946,11 +946,6 @@ struct osc_extent {
 	unsigned int		oe_nr_pages;
 	/* list of pending oap pages. Pages in this list are NOT sorted. */
 	struct list_head	oe_pages;
-	/* Since an extent has to be written out in atomic, this is used to
-	 * remember the next page need to be locked to write this extent out.
-	 * Not used right now.
-	 */
-	struct osc_page		*oe_next_page;
 	/* start and end index of this extent, include start and end
 	 * themselves. Page offset here is the page index of osc_pages.
 	 * oe_start is used as keyword for red-black tree.
-- 
1.8.3.1

* [lustre-devel] [PATCH 473/622] lnet: o2iblnd: remove some unused fields.
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (471 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 472/622] lustre: osc: remove oe_next_page James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 474/622] lnet: socklnd: remove ksnp_sharecount James Simmons
                   ` (149 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

Fields kib_min_reconnect_interval, kib_max_reconnect_interval, and
kib_ntx are never used or set.

ibh_mr_shift is set but never used;
rx_status is used (in a debug message) but never set.

Remove them all.

We could possibly remove ibh_mr_size too. It is only used
for an error message.

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: 68c04b8fdd5d ("LU-6142 o2iblnd: remove some unused fields.")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35891
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/o2iblnd/o2iblnd.c | 7 +------
 net/lnet/klnds/o2iblnd/o2iblnd.h | 5 -----
 2 files changed, 1 insertion(+), 11 deletions(-)

diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.c b/net/lnet/klnds/o2iblnd/o2iblnd.c
index f3176e1..278823f 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.c
@@ -2303,7 +2303,6 @@ static int kiblnd_net_init_pools(struct kib_net *net, struct lnet_ni *ni,
 static int kiblnd_hdev_get_attr(struct kib_hca_dev *hdev)
 {
 	struct ib_device_attr *dev_attr = &hdev->ibh_ibdev->attrs;
-	int rc = 0;
 
 	/*
 	 * It's safe to assume a HCA can handle a page size
@@ -2326,15 +2325,11 @@ static int kiblnd_hdev_get_attr(struct kib_hca_dev *hdev)
 			hdev->ibh_dev->ibd_dev_caps |= IBLND_DEV_CAPS_FASTREG_GAPS_SUPPORT;
 	} else {
 		CERROR("IB device does not support FMRs nor FastRegs, can't register memory: %d\n",
-		       rc);
+		       -ENXIO);
 		return -ENXIO;
 	}
 
 	hdev->ibh_mr_size = dev_attr->max_mr_size;
-	if (hdev->ibh_mr_size == ~0ULL) {
-		hdev->ibh_mr_shift = 64;
-		return 0;
-	}
 
 	CERROR("Invalid mr size: %#llx\n", hdev->ibh_mr_size);
 	return -EINVAL;
diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.h b/net/lnet/klnds/o2iblnd/o2iblnd.h
index 1285ab1..2f2337a 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.h
@@ -76,12 +76,9 @@
 struct kib_tunables {
 	int *kib_dev_failover;		/* HCA failover */
 	unsigned int *kib_service;	/* IB service number */
-	int *kib_min_reconnect_interval; /* first failed connection retry... */
-	int *kib_max_reconnect_interval; /* exponentially increasing to this */
 	int *kib_cksum;			/* checksum struct kib_msg? */
 	int *kib_timeout;		/* comms timeout (seconds) */
 	int *kib_keepalive;		/* keepalive timeout (seconds) */
-	int *kib_ntx;			/* # tx descs */
 	char **kib_default_ipif;	/* default IPoIB interface */
 	int *kib_retry_count;
 	int *kib_rnr_retry_count;
@@ -178,7 +175,6 @@ struct kib_hca_dev {
 	int			ibh_page_shift;	/* page shift of current HCA */
 	int			ibh_page_size;	/* page size of current HCA */
 	u64			ibh_page_mask;	/* page mask of current HCA */
-	int			ibh_mr_shift;	/* bits shift of max MR size */
 	u64			ibh_mr_size;	/* size of MR */
 	struct ib_pd		*ibh_pd;	/* PD */
 	struct kib_dev		*ibh_dev;	/* owner */
@@ -492,7 +488,6 @@ struct kib_rx {					/* receive message */
 	struct list_head	rx_list;	/* queue for attention */
 	struct kib_conn        *rx_conn;	/* owning conn */
 	int			rx_nob;		/* # bytes received (-1 while posted) */
-	enum ib_wc_status	rx_status;	/* completion status */
 	struct kib_msg	       *rx_msg;		/* message buffer (host vaddr) */
 	u64			rx_msgaddr;	/* message buffer (I/O addr) */
 	DEFINE_DMA_UNMAP_ADDR(rx_msgunmap);	/* for dma_unmap_single() */
-- 
1.8.3.1

* [lustre-devel] [PATCH 474/622] lnet: socklnd: remove ksnp_sharecount
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (472 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 473/622] lnet: o2iblnd: remove some unused fields James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 475/622] lustre: llite: extend readahead locks for striped file James Simmons
                   ` (148 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

This field is never set, though its value is printed.
Remove it.

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: 408a5a527567 ("LU-6142 socklnd: remove ksnp_sharecount")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35892
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/socklnd/socklnd.c | 4 ++--
 net/lnet/klnds/socklnd/socklnd.h | 1 -
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/lnet/klnds/socklnd/socklnd.c b/net/lnet/klnds/socklnd/socklnd.c
index 78f6c7e..e2a9819 100644
--- a/net/lnet/klnds/socklnd/socklnd.c
+++ b/net/lnet/klnds/socklnd/socklnd.c
@@ -2471,10 +2471,10 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 			if (peer_ni->ksnp_ni != ni)
 				continue;
 
-			CWARN("Active peer_ni on shutdown: %s, ref %d, scnt %d, closing %d, accepting %d, err %d, zcookie %llu, txq %d, zc_req %d\n",
+			CWARN("Active peer_ni on shutdown: %s, ref %d, closing %d, accepting %d, err %d, zcookie %llu, txq %d, zc_req %d\n",
 			      libcfs_id2str(peer_ni->ksnp_id),
 			      atomic_read(&peer_ni->ksnp_refcount),
-			      peer_ni->ksnp_sharecount, peer_ni->ksnp_closing,
+			      peer_ni->ksnp_closing,
 			      peer_ni->ksnp_accepting, peer_ni->ksnp_error,
 			      peer_ni->ksnp_zc_next_cookie,
 			      !list_empty(&peer_ni->ksnp_tx_queue),
diff --git a/net/lnet/klnds/socklnd/socklnd.h b/net/lnet/klnds/socklnd/socklnd.h
index 80c2e19..efdd02e 100644
--- a/net/lnet/klnds/socklnd/socklnd.h
+++ b/net/lnet/klnds/socklnd/socklnd.h
@@ -415,7 +415,6 @@ struct ksock_peer {
 							 */
 	struct lnet_process_id	ksnp_id;		/* who's on the other end(s) */
 	atomic_t		ksnp_refcount;		/* # users */
-	int			ksnp_sharecount;	/* lconf usage counter */
 	int			ksnp_closing;		/* being closed */
 	int			ksnp_accepting;		/* # passive connections pending
 							 */
-- 
1.8.3.1

* [lustre-devel] [PATCH 475/622] lustre: llite: extend readahead locks for striped file
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (473 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 474/622] lnet: socklnd: remove ksnp_sharecount James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 476/622] lustre: llite: Improve readahead RPC issuance James Simmons
                   ` (147 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

Currently cl_io_read_ahead() cannot return locks that cross a
stripe boundary in one call, so readahead stops at every such
boundary.

This is really bad: we stop readahead every time we hit a stripe
boundary. With the default stripe size of 1M this can hurt
performance badly, especially with async readahead introduced.

So try to use existing locks aggressively if there is no lock
contention; otherwise the lock should cover no less than the
requested extent.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12043
Lustre-commit: cfbeae97d736 ("LU-12043 llite: extend readahead locks for striped file")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/35438
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/cl_object.h |  2 ++
 fs/lustre/llite/rw.c          | 14 ++++++++++++--
 fs/lustre/osc/osc_io.c        |  2 ++
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/include/cl_object.h b/fs/lustre/include/cl_object.h
index 71ca283..65fdab9 100644
--- a/fs/lustre/include/cl_object.h
+++ b/fs/lustre/include/cl_object.h
@@ -1474,6 +1474,8 @@ struct cl_read_ahead {
 	void (*cra_release)(const struct lu_env *env, void *cbdata);
 	/* Callback data for cra_release routine */
 	void				*cra_cbdata;
+	/* whether lock is in contention */
+	bool				cra_contention;
 };
 
 static inline void cl_read_ahead_release(const struct lu_env *env,
diff --git a/fs/lustre/llite/rw.c b/fs/lustre/llite/rw.c
index 4fec9a6..7c2dbdc 100644
--- a/fs/lustre/llite/rw.c
+++ b/fs/lustre/llite/rw.c
@@ -369,6 +369,18 @@ static int ras_inside_ra_window(unsigned long idx, struct ra_io_arg *ria)
 				if (rc < 0)
 					break;
 
+				/* Do not shrink the ria_end@any case until
+				 * the minimum end of current read is covered.
+				 * And only shrink the ria_end if the matched
+				 * LDLM lock doesn't cover more.
+				 */
+				if (page_idx > ra.cra_end ||
+				    (ra.cra_contention &&
+				     page_idx > ria->ria_end_min)) {
+					ria->ria_end = ra.cra_end;
+					break;
+				}
+
 				CDEBUG(D_READA, "idx: %lu, ra: %lu, rpc: %lu\n",
 				       page_idx, ra.cra_end, ra.cra_rpc_size);
 				LASSERTF(ra.cra_end >= page_idx,
@@ -387,8 +399,6 @@ static int ras_inside_ra_window(unsigned long idx, struct ra_io_arg *ria)
 					ria->ria_end = end - 1;
 				if (ria->ria_end < ria->ria_end_min)
 					ria->ria_end = ria->ria_end_min;
-				if (ria->ria_end > ra.cra_end)
-					ria->ria_end = ra.cra_end;
 			}
 
 			/* If the page is inside the read-ahead window */
diff --git a/fs/lustre/osc/osc_io.c b/fs/lustre/osc/osc_io.c
index 4f46b95..8e299d4 100644
--- a/fs/lustre/osc/osc_io.c
+++ b/fs/lustre/osc/osc_io.c
@@ -92,6 +92,8 @@ static int osc_io_read_ahead(const struct lu_env *env,
 				       dlmlock->l_policy_data.l_extent.end);
 		ra->cra_release = osc_read_ahead_release;
 		ra->cra_cbdata = dlmlock;
+		if (ra->cra_end != CL_PAGE_EOF)
+			ra->cra_contention = true;
 		result = 0;
 	}
 
-- 
1.8.3.1

* [lustre-devel] [PATCH 476/622] lustre: llite: Improve readahead RPC issuance
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (474 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 475/622] lustre: llite: extend readahead locks for striped file James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 477/622] lustre: lov: Move page index to top level James Simmons
                   ` (146 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

lov_io_submit receives a range of pages, then adds pages into a
batch until it hits a page that is not in the stripe associated
with this lov object.  This means that if a readahead page range
hits the same stripe more than once, we will issue multiple I/Os,
even if the pages would fit in one RPC.

This is unnecessary - just submit all of these pages at once.

mpirun -n 2 $IOR -s 2000 -t 47K -b 47K -k -r -E -o $FILE

Without patch:
osc.lustre-OST0001-osc-ffff8fe82c952000.rpc_stats=

                        read                    write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
1:                     118  56  56   |          0   0   0
2:                       0   0  56   |          0   0   0
4:                       0   0  56   |          0   0   0
8:                       0   0  56   |          0   0   0
16:                      5   2  58   |          0   0   0
32:                      0   0  58   |          0   0   0
64:                      0   0  58   |          0   0   0
128:                    21  10  68   |          0   0   0
256:                    25  11  80   |          0   0   0
512:                    10   4  85   |          0   0   0
1024:                   31  14 100   |          0   0   0

osc.lustre-OST0002-osc-ffff8fe82c952000.rpc_stats=
                        read                    write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
1:                       5   6   6   |          0   0   0
2:                       0   0   6   |          0   0   0
4:                       0   0   6   |          0   0   0
8:                       0   0   6   |          0   0   0
16:                      0   0   6   |          0   0   0
32:                      0   0   6   |          0   0   0
64:                      0   0   6   |          0   0   0
128:                    19  23  29   |          0   0   0
256:                    19  23  52   |          0   0   0
512:                     5   6  58   |          0   0   0
1024:                   34  41 100   |          0   0   0

With patch:
osc.lustre-OST0001-osc-ffff8fe7a7227800.rpc_stats=
                        read                    write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
1:                      12  17  17   |          0   0   0
2:                       0   0  17   |          0   0   0
4:                       0   0  17   |          0   0   0
8:                       0   0  17   |          0   0   0
16:                      5   7  24   |          0   0   0
32:                      0   0  24   |          0   0   0
64:                      5   7  31   |          0   0   0
128:                     6   8  40   |          0   0   0
256:                     1   1  42   |          0   0   0
512:                     2   2  44   |          0   0   0
1024:                   38  55 100   |          0   0   0

osc.lustre-OST0002-osc-ffff8fe7a7227800.rpc_stats=
                        read                    write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
1:                       0   0   0   |          0   0   0
2:                       0   0   0   |          0   0   0
4:                       0   0   0   |          0   0   0
8:                       0   0   0   |          0   0   0
16:                      0   0   0   |          0   0   0
32:                      0   0   0   |          0   0   0
64:                      4   7   7   |          0   0   0
128:                     7  13  21   |          0   0   0
256:                     0   0  21   |          0   0   0
512:                     3   5  26   |          0   0   0
1024:                   38  73 100   |          0   0   0

Note the much larger # of smaller RPCs issued without the patch.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12533
Lustre-commit: 05b9da4fd124 ("LU-12533 llite: Improve readahead RPC issuance")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35458
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/lov/lov_io.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/lov/lov_io.c b/fs/lustre/lov/lov_io.c
index 6e86efa..fbed3de 100644
--- a/fs/lustre/lov/lov_io.c
+++ b/fs/lustre/lov/lov_io.c
@@ -1081,6 +1081,7 @@ static int lov_io_submit(const struct lu_env *env,
 	struct lov_io_sub *sub;
 	struct cl_page_list *plist = &lov_env_info(env)->lti_plist;
 	struct cl_page *page;
+	struct cl_page *tmp;
 	int index;
 	int rc = 0;
 
@@ -1105,10 +1106,10 @@ static int lov_io_submit(const struct lu_env *env,
 		cl_page_list_move(&cl2q->c2_qin, qin, page);
 
 		index = lov_page_index(page);
-		while (qin->pl_nr > 0) {
-			page = cl_page_list_first(qin);
+		cl_page_list_for_each_safe(page, tmp, qin) {
+			/* this page is not on this stripe */
 			if (index != lov_page_index(page))
-				break;
+				continue;
 
 			cl_page_list_move(&cl2q->c2_qin, qin, page);
 		}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 477/622] lustre: lov: Move page index to top level
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (475 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 476/622] lustre: llite: Improve readahead RPC issuance James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 478/622] lustre: readahead: convert stride page index to byte James Simmons
                   ` (145 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

When doing readahead, we see a surprising amount of time
(~5-8%) spent just looking up the page index in the lov layer.

In particular, this is more than half the time spent
submitting pages:
         - 14.14% cl_io_submit_rw
            - 13.40% lov_io_submit
               - 8.24% lov_page_index

This requires several indirections, all of which can be
avoided by moving this up to the cl_page struct.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12535
Lustre-commit: 8d6d2914cf85 ("LU-12535 lov: Move page index to top level")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35470
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/cl_object.h   |  2 ++
 fs/lustre/lov/lov_cl_internal.h |  2 --
 fs/lustre/lov/lov_io.c          | 21 +++++----------------
 fs/lustre/lov/lov_page.c        | 10 +++++-----
 4 files changed, 12 insertions(+), 23 deletions(-)

diff --git a/fs/lustre/include/cl_object.h b/fs/lustre/include/cl_object.h
index 65fdab9..4c68d7b 100644
--- a/fs/lustre/include/cl_object.h
+++ b/fs/lustre/include/cl_object.h
@@ -762,6 +762,8 @@ struct cl_page {
 	struct lu_ref_link		 cp_queue_ref;
 	/** Assigned if doing a sync_io */
 	struct cl_sync_io		*cp_sync_io;
+	/** layout_entry + stripe index, composed using lov_comp_index() */
+	unsigned int			 cp_lov_index;
 };
 
 /**
diff --git a/fs/lustre/lov/lov_cl_internal.h b/fs/lustre/lov/lov_cl_internal.h
index 40bb6f0..8791e69 100644
--- a/fs/lustre/lov/lov_cl_internal.h
+++ b/fs/lustre/lov/lov_cl_internal.h
@@ -440,8 +440,6 @@ struct lov_lock {
 
 struct lov_page {
 	struct cl_page_slice	lps_cl;
-	/** layout_entry + stripe index, composed using lov_comp_index() */
-	unsigned int		lps_index;
 	/* the layout gen when this page was created */
 	u32			lps_layout_gen;
 };
diff --git a/fs/lustre/lov/lov_io.c b/fs/lustre/lov/lov_io.c
index fbed3de..56e4a982 100644
--- a/fs/lustre/lov/lov_io.c
+++ b/fs/lustre/lov/lov_io.c
@@ -189,17 +189,6 @@ struct lov_io_sub *lov_sub_get(const struct lu_env *env,
  * Lov io operations.
  *
  */
-static int lov_page_index(const struct cl_page *page)
-{
-	const struct cl_page_slice *slice;
-
-	slice = cl_page_at(page, &lov_device_type);
-	LASSERT(slice);
-	LASSERT(slice->cpl_obj);
-
-	return cl2lov_page(slice)->lps_index;
-}
-
 static int lov_io_subio_init(const struct lu_env *env, struct lov_io *lio,
 			     struct cl_io *io)
 {
@@ -1105,10 +1094,10 @@ static int lov_io_submit(const struct lu_env *env,
 		cl_2queue_init(cl2q);
 		cl_page_list_move(&cl2q->c2_qin, qin, page);
 
-		index = lov_page_index(page);
+		index = page->cp_lov_index;
 		cl_page_list_for_each_safe(page, tmp, qin) {
 			/* this page is not on this stripe */
-			if (index != lov_page_index(page))
+			if (index != page->cp_lov_index)
 				continue;
 
 			cl_page_list_move(&cl2q->c2_qin, qin, page);
@@ -1171,10 +1160,10 @@ static int lov_io_commit_async(const struct lu_env *env,
 
 		cl_page_list_move(plist, queue, page);
 
-		index = lov_page_index(page);
+		index = page->cp_lov_index;
 		while (queue->pl_nr > 0) {
 			page = cl_page_list_first(queue);
-			if (index != lov_page_index(page))
+			if (index != page->cp_lov_index)
 				break;
 
 			cl_page_list_move(plist, queue, page);
@@ -1218,7 +1207,7 @@ static int lov_io_fault_start(const struct lu_env *env,
 
 	fio = &ios->cis_io->u.ci_fault;
 	lio = cl2lov_io(env, ios);
-	sub = lov_sub_get(env, lio, lov_page_index(fio->ft_page));
+	sub = lov_sub_get(env, lio, fio->ft_page->cp_lov_index);
 	if (IS_ERR(sub))
 		return PTR_ERR(sub);
 	sub->sub_io.u.ci_fault.ft_nob = fio->ft_nob;
diff --git a/fs/lustre/lov/lov_page.c b/fs/lustre/lov/lov_page.c
index c3337706..e73b5ff 100644
--- a/fs/lustre/lov/lov_page.c
+++ b/fs/lustre/lov/lov_page.c
@@ -57,8 +57,8 @@ static int lov_comp_page_print(const struct lu_env *env,
 	struct lov_page *lp = cl2lov_page(slice);
 
 	return (*printer)(env, cookie,
-			  LUSTRE_LOV_NAME "-page@%p, comp index: %x, gen: %u\n",
-			  lp, lp->lps_index, lp->lps_layout_gen);
+			  LUSTRE_LOV_NAME "-page@%p, gen: %u\n",
+			  lp, lp->lps_layout_gen);
 }
 
 static const struct cl_page_operations lov_comp_page_ops = {
@@ -95,11 +95,11 @@ int lov_page_init_composite(const struct lu_env *env, struct cl_object *obj,
 	rc = lov_stripe_offset(loo->lo_lsm, entry, offset, stripe, &suboff);
 	LASSERT(rc == 0);
 
-	lpg->lps_index = lov_comp_index(entry, stripe);
+	page->cp_lov_index = lov_comp_index(entry, stripe);
 	lpg->lps_layout_gen = loo->lo_lsm->lsm_layout_gen;
 	cl_page_slice_add(page, &lpg->lps_cl, obj, index, &lov_comp_page_ops);
 
-	sub = lov_sub_get(env, lio, lpg->lps_index);
+	sub = lov_sub_get(env, lio, page->cp_lov_index);
 	if (IS_ERR(sub))
 		return PTR_ERR(sub);
 
@@ -136,7 +136,7 @@ int lov_page_init_empty(const struct lu_env *env, struct cl_object *obj,
 	struct lov_page *lpg = cl_object_page_slice(obj, page);
 	void *addr;
 
-	lpg->lps_index = ~0;
+	page->cp_lov_index = ~0;
 	cl_page_slice_add(page, &lpg->lps_cl, obj, index, &lov_empty_page_ops);
 	addr = kmap(page->cp_vmpage);
 	memset(addr, 0, cl_page_size(obj));
-- 
1.8.3.1


* [lustre-devel] [PATCH 478/622] lustre: readahead: convert stride page index to byte
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (476 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 477/622] lustre: lov: Move page index to top level James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 479/622] lustre: osc: prevent use after free James Simmons
                   ` (144 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

This is a preparatory patch to support unaligned stride readahead.
Some detection variables are converted to byte units so that
possibly unaligned stride reads can be detected.

Since we still need to read pages by page index, those variables
are kept in page units. To make things clearer, change them to use
pgoff_t rather than unsigned long.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12518
Lustre-commit: 0923e4055116 ("LU-12518 readahead: convert stride page index to byte")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/35829
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_internal.h |  60 +++++-----
 fs/lustre/llite/rw.c             | 243 ++++++++++++++++++++-------------------
 2 files changed, 153 insertions(+), 150 deletions(-)

diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index 025d33e..d84f50c 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -358,22 +358,22 @@ struct ll_ra_info {
  * counted by page index.
  */
 struct ra_io_arg {
-	unsigned long ria_start;	/* start offset of read-ahead*/
-	unsigned long ria_end;		/* end offset of read-ahead*/
-	unsigned long ria_reserved;	/* reserved pages for read-ahead */
-	unsigned long ria_end_min;	/* minimum end to cover current read */
-	bool ria_eof;			/* reach end of file */
+	pgoff_t		ria_start;	/* start offset of read-ahead*/
+	pgoff_t		ria_end;	/* end offset of read-ahead*/
+	unsigned long	ria_reserved;	/* reserved pages for read-ahead */
+	pgoff_t		ria_end_min;	/* minimum end to cover current read */
+	bool		ria_eof;	/* reach end of file */
 	/* If stride read pattern is detected, ria_stoff means where
 	 * stride read is started. Note: for normal read-ahead, the
 	 * value here is meaningless, and also it will not be accessed
 	 */
-	pgoff_t ria_stoff;
-	/* ria_length and ria_pages are the length and pages length in the
+	unsigned long	ria_stoff;
+	/* ria_length and ria_bytes are the length and pages length in the
 	 * stride I/O mode. And they will also be used to check whether
 	 * it is stride I/O read-ahead in the read-ahead pages
 	 */
-	unsigned long ria_length;
-	unsigned long ria_pages;
+	unsigned long	ria_length;
+	unsigned long	ria_bytes;
 };
 
 /* LL_HIST_MAX=32 causes an overflow */
@@ -592,16 +592,10 @@ struct ll_sb_info {
  */
 struct ll_readahead_state {
 	spinlock_t  ras_lock;
+	/* End byte that read(2) try to read.  */
+	unsigned long	ras_last_read_end;
 	/*
-	 * index of the last page that read(2) needed and that wasn't in the
-	 * cache. Used by ras_update() to detect seeks.
-	 *
-	 * XXX nikita: if access seeks into cached region, Lustre doesn't see
-	 * this.
-	 */
-	unsigned long   ras_last_readpage;
-	/*
-	 * number of pages read after last read-ahead window reset. As window
+	 * number of bytes read after last read-ahead window reset. As window
 	 * is reset on each seek, this is effectively a number of consecutive
 	 * accesses. Maybe ->ras_accessed_in_window is better name.
 	 *
@@ -610,13 +604,13 @@ struct ll_readahead_state {
 	 * case, it probably doesn't make sense to expand window to
 	 * PTLRPC_MAX_BRW_PAGES on the third access.
 	 */
-	unsigned long   ras_consecutive_pages;
+	unsigned long	ras_consecutive_bytes;
 	/*
 	 * number of read requests after the last read-ahead window reset
 	 * As window is reset on each seek, this is effectively the number
 	 * on consecutive read request and is used to trigger read-ahead.
 	 */
-	unsigned long   ras_consecutive_requests;
+	unsigned long	ras_consecutive_requests;
 	/*
 	 * Parameters of current read-ahead window. Handled by
 	 * ras_update(). On the initial access to the file or after a seek,
@@ -624,7 +618,7 @@ struct ll_readahead_state {
 	 * expanded to PTLRPC_MAX_BRW_PAGES. Afterwards, window is enlarged by
 	 * PTLRPC_MAX_BRW_PAGES chunks up to ->ra_max_pages.
 	 */
-	unsigned long   ras_window_start, ras_window_len;
+	pgoff_t		ras_window_start, ras_window_len;
 	/*
 	 * Optimal RPC size. It decides how many pages will be sent
 	 * for each read-ahead.
@@ -637,41 +631,41 @@ struct ll_readahead_state {
 	 * ->ra_max_pages (see ll_ra_count_get()), 2. client cannot read pages
 	 * not covered by DLM lock.
 	 */
-	unsigned long   ras_next_readahead;
+	pgoff_t		ras_next_readahead;
 	/*
 	 * Total number of ll_file_read requests issued, reads originating
 	 * due to mmap are not counted in this total.  This value is used to
 	 * trigger full file read-ahead after multiple reads to a small file.
 	 */
-	unsigned long   ras_requests;
+	unsigned long	ras_requests;
 	/*
 	 * Page index with respect to the current request, these value
 	 * will not be accurate when dealing with reads issued via mmap.
 	 */
-	unsigned long   ras_request_index;
+	unsigned long	ras_request_index;
 	/*
 	 * The following 3 items are used for detecting the stride I/O
 	 * mode.
 	 * In stride I/O mode,
 	 * ...............|-----data-----|****gap*****|--------|******|....
-	 *    offset      |-stride_pages-|-stride_gap-|
+	 *    offset      |-stride_bytes-|-stride_gap-|
 	 * ras_stride_offset = offset;
-	 * ras_stride_length = stride_pages + stride_gap;
-	 * ras_stride_pages = stride_pages;
-	 * Note: all these three items are counted by pages.
+	 * ras_stride_length = stride_bytes + stride_gap;
+	 * ras_stride_bytes = stride_bytes;
+	 * Note: all these three items are counted by bytes.
 	 */
-	unsigned long   ras_stride_length;
-	unsigned long   ras_stride_pages;
-	pgoff_t	 ras_stride_offset;
+	unsigned long	ras_stride_length;
+	unsigned long	ras_stride_bytes;
+	unsigned long	ras_stride_offset;
 	/*
 	 * number of consecutive stride request count, and it is similar as
 	 * ras_consecutive_requests, but used for stride I/O mode.
 	 * Note: only more than 2 consecutive stride request are detected,
 	 * stride read-ahead will be enable
 	 */
-	unsigned long   ras_consecutive_stride_requests;
+	unsigned long	ras_consecutive_stride_requests;
 	/* index of the last page that async readahead starts */
-	unsigned long	ras_async_last_readpage;
+	pgoff_t		ras_async_last_readpage;
 };
 
 struct ll_readahead_work {
diff --git a/fs/lustre/llite/rw.c b/fs/lustre/llite/rw.c
index 7c2dbdc..38f7aa2c 100644
--- a/fs/lustre/llite/rw.c
+++ b/fs/lustre/llite/rw.c
@@ -131,19 +131,18 @@ void ll_ra_stats_inc(struct inode *inode, enum ra_stat which)
 
 #define RAS_CDEBUG(ras) \
 	CDEBUG(D_READA,							     \
-	       "lrp %lu cr %lu cp %lu ws %lu wl %lu nra %lu rpc %lu "	     \
-	       "r %lu ri %lu csr %lu sf %lu sp %lu sl %lu lr %lu\n",	     \
-	       ras->ras_last_readpage, ras->ras_consecutive_requests,	     \
-	       ras->ras_consecutive_pages, ras->ras_window_start,	     \
+	       "lre %lu cr %lu cb %lu ws %lu wl %lu nra %lu rpc %lu r %lu ri %lu csr %lu sf %lu sb %lu sl %lu lr %lu\n", \
+	       ras->ras_last_read_end, ras->ras_consecutive_requests,	     \
+	       ras->ras_consecutive_bytes, ras->ras_window_start,	     \
 	       ras->ras_window_len, ras->ras_next_readahead,		     \
 	       ras->ras_rpc_size,					     \
 	       ras->ras_requests, ras->ras_request_index,		     \
 	       ras->ras_consecutive_stride_requests, ras->ras_stride_offset, \
-	       ras->ras_stride_pages, ras->ras_stride_length,		     \
+	       ras->ras_stride_bytes, ras->ras_stride_length,		     \
 	       ras->ras_async_last_readpage)
 
-static int index_in_window(unsigned long index, unsigned long point,
-			   unsigned long before, unsigned long after)
+static int pos_in_window(unsigned long pos, unsigned long point,
+			 unsigned long before, unsigned long after)
 {
 	unsigned long start = point - before, end = point + after;
 
@@ -152,7 +151,7 @@ static int index_in_window(unsigned long index, unsigned long point,
 	if (end < point)
 		end = ~0;
 
-	return start <= index && index <= end;
+	return start <= pos && pos <= end;
 }
 
 void ll_ras_enter(struct file *f)
@@ -242,10 +241,10 @@ static int ll_read_ahead_page(const struct lu_env *env, struct cl_io *io,
 	return rc;
 }
 
-#define RIA_DEBUG(ria)						       \
-	CDEBUG(D_READA, "rs %lu re %lu ro %lu rl %lu rp %lu\n",       \
-	ria->ria_start, ria->ria_end, ria->ria_stoff, ria->ria_length,\
-	ria->ria_pages)
+#define RIA_DEBUG(ria)						\
+	CDEBUG(D_READA, "rs %lu re %lu ro %lu rl %lu rb %lu\n",	\
+	       ria->ria_start, ria->ria_end, ria->ria_stoff,	\
+	       ria->ria_length, ria->ria_bytes)
 
 static inline int stride_io_mode(struct ll_readahead_state *ras)
 {
@@ -255,72 +254,76 @@ static inline int stride_io_mode(struct ll_readahead_state *ras)
 /* The function calculates how much pages will be read in
  * [off, off + length], in such stride IO area,
  * stride_offset = st_off, stride_length = st_len,
- * stride_pages = st_pgs
+ * stride_bytes = st_bytes
  *
  *   |------------------|*****|------------------|*****|------------|*****|....
  * st_off
- *   |--- st_pgs     ---|
+ *   |--- st_bytes     ---|
  *   |-----     st_len   -----|
  *
- *	      How many pages it should read in such pattern
+ *	      How many bytes it should read in such pattern
  *	      |-------------------------------------------------------------|
  *	      off
  *	      |<------		  length		      ------->|
  *
  *	  =   |<----->|  +  |-------------------------------------| +   |---|
- *	     start_left		 st_pgs * i		    end_left
+ *	       start_left                 st_bytes * i                 end_left
  */
 static unsigned long
-stride_pg_count(pgoff_t st_off, unsigned long st_len, unsigned long st_pgs,
-		unsigned long off, unsigned long length)
+stride_byte_count(unsigned long st_off, unsigned long st_len,
+		  unsigned long st_bytes, unsigned long off,
+		  unsigned long length)
 {
 	u64 start = off > st_off ? off - st_off : 0;
 	u64 end = off + length > st_off ? off + length - st_off : 0;
 	unsigned long start_left = 0;
 	unsigned long end_left = 0;
-	unsigned long pg_count;
+	unsigned long bytes_count;
 
 	if (st_len == 0 || length == 0 || end == 0)
 		return length;
 
 	start_left = do_div(start, st_len);
-	if (start_left < st_pgs)
-		start_left = st_pgs - start_left;
+	if (start_left < st_bytes)
+		start_left = st_bytes - start_left;
 	else
 		start_left = 0;
 
 	end_left = do_div(end, st_len);
-	if (end_left > st_pgs)
-		end_left = st_pgs;
+	if (end_left > st_bytes)
+		end_left = st_bytes;
 
 	CDEBUG(D_READA, "start %llu, end %llu start_left %lu end_left %lu\n",
 	       start, end, start_left, end_left);
 
 	if (start == end)
-		pg_count = end_left - (st_pgs - start_left);
+		bytes_count = end_left - (st_bytes - start_left);
 	else
-		pg_count = start_left + st_pgs * (end - start - 1) + end_left;
+		bytes_count = start_left +
+			st_bytes * (end - start - 1) + end_left;
 
 	CDEBUG(D_READA,
-	       "st_off %lu, st_len %lu st_pgs %lu off %lu length %lu pgcount %lu\n",
-	       st_off, st_len, st_pgs, off, length, pg_count);
+	       "st_off %lu, st_len %lu st_bytes %lu off %lu length %lu bytescount %lu\n",
+	       st_off, st_len, st_bytes, off, length, bytes_count);
 
-	return pg_count;
+	return bytes_count;
 }
 
 static int ria_page_count(struct ra_io_arg *ria)
 {
 	u64 length = ria->ria_end >= ria->ria_start ?
 		     ria->ria_end - ria->ria_start + 1 : 0;
+	unsigned int bytes_count;
+
+	bytes_count = stride_byte_count(ria->ria_stoff, ria->ria_length,
+					 ria->ria_bytes, ria->ria_start,
+					 length << PAGE_SHIFT);
+	return (bytes_count + PAGE_SIZE - 1) >> PAGE_SHIFT;
 
-	return stride_pg_count(ria->ria_stoff, ria->ria_length,
-			       ria->ria_pages, ria->ria_start,
-			       length);
 }
 
 static unsigned long ras_align(struct ll_readahead_state *ras,
-			       unsigned long index,
-			       unsigned long *remainder)
+			       pgoff_t index, unsigned long *remainder)
 {
 	unsigned long rem = index % ras->ras_rpc_size;
 
@@ -337,9 +340,9 @@ static int ras_inside_ra_window(unsigned long idx, struct ra_io_arg *ria)
 	 * For stride I/O mode, just check whether the idx is inside
 	 * the ria_pages.
 	 */
-	return ria->ria_length == 0 || ria->ria_length == ria->ria_pages ||
+	return ria->ria_length == 0 || ria->ria_length == ria->ria_bytes ||
 	       (idx >= ria->ria_stoff && (idx - ria->ria_stoff) %
-		ria->ria_length < ria->ria_pages);
+		ria->ria_length < ria->ria_bytes);
 }
 
 static unsigned long
@@ -356,7 +359,7 @@ static int ras_inside_ra_window(unsigned long idx, struct ra_io_arg *ria)
 	LASSERT(ria);
 	RIA_DEBUG(ria);
 
-	stride_ria = ria->ria_length > ria->ria_pages && ria->ria_pages > 0;
+	stride_ria = ria->ria_length > ria->ria_bytes && ria->ria_bytes > 0;
 	for (page_idx = ria->ria_start;
 	     page_idx <= ria->ria_end && ria->ria_reserved > 0; page_idx++) {
 		if (ras_inside_ra_window(page_idx, ria)) {
@@ -419,20 +422,13 @@ static int ras_inside_ra_window(unsigned long idx, struct ra_io_arg *ria)
 			 * read-ahead mode, then check whether it should skip
 			 * the stride gap.
 			 */
-			pgoff_t offset;
-			/* NOTE: This assertion only is valid when it is for
-			 * forward read-ahead, must adjust if backward
-			 * readahead is implemented.
-			 */
-			LASSERTF(page_idx >= ria->ria_stoff,
-				 "Invalid page_idx %lu rs %lu re %lu ro %lu rl %lu rp %lu\n",
-				 page_idx,
-				 ria->ria_start, ria->ria_end, ria->ria_stoff,
-				 ria->ria_length, ria->ria_pages);
-			offset = page_idx - ria->ria_stoff;
-			offset = offset % (ria->ria_length);
-			if (offset >= ria->ria_pages) {
-				page_idx += ria->ria_length - offset - 1;
+			unsigned long offset;
+			unsigned long pos = page_idx << PAGE_SHIFT;
+
+			offset = (pos - ria->ria_stoff) % ria->ria_length;
+			if (offset >= ria->ria_bytes) {
+				pos += (ria->ria_length - offset);
+				page_idx = (pos >> PAGE_SHIFT) - 1;
 				CDEBUG(D_READA,
 				       "Stride: jump %lu pages to %lu\n",
 				       ria->ria_length - offset, page_idx);
@@ -647,7 +643,8 @@ static int ll_readahead(const struct lu_env *env, struct cl_io *io,
 	 * so that stride read ahead can work correctly.
 	 */
 	if (stride_io_mode(ras))
-		start = max(ras->ras_next_readahead, ras->ras_stride_offset);
+		start = max(ras->ras_next_readahead,
+			    ras->ras_stride_offset >> PAGE_SHIFT);
 	else
 		start = ras->ras_next_readahead;
 
@@ -676,7 +673,7 @@ static int ll_readahead(const struct lu_env *env, struct cl_io *io,
 	if (stride_io_mode(ras)) {
 		ria->ria_stoff = ras->ras_stride_offset;
 		ria->ria_length = ras->ras_stride_length;
-		ria->ria_pages = ras->ras_stride_pages;
+		ria->ria_bytes = ras->ras_stride_bytes;
 	}
 	spin_unlock(&ras->ras_lock);
 
@@ -739,21 +736,18 @@ static int ll_readahead(const struct lu_env *env, struct cl_io *io,
 	return ret;
 }
 
-static void ras_set_start(struct inode *inode, struct ll_readahead_state *ras,
-			  unsigned long index)
+static void ras_set_start(struct ll_readahead_state *ras, pgoff_t index)
 {
 	ras->ras_window_start = ras_align(ras, index, NULL);
 }
 
 /* called with the ras_lock held or from places where it doesn't matter */
-static void ras_reset(struct inode *inode, struct ll_readahead_state *ras,
-		      unsigned long index)
+static void ras_reset(struct ll_readahead_state *ras, pgoff_t index)
 {
-	ras->ras_last_readpage = index;
 	ras->ras_consecutive_requests = 0;
-	ras->ras_consecutive_pages = 0;
+	ras->ras_consecutive_bytes = 0;
 	ras->ras_window_len = 0;
-	ras_set_start(inode, ras, index);
+	ras_set_start(ras, index);
 	ras->ras_next_readahead = max(ras->ras_window_start, index + 1);
 
 	RAS_CDEBUG(ras);
@@ -764,7 +758,7 @@ static void ras_stride_reset(struct ll_readahead_state *ras)
 {
 	ras->ras_consecutive_stride_requests = 0;
 	ras->ras_stride_length = 0;
-	ras->ras_stride_pages = 0;
+	ras->ras_stride_bytes = 0;
 	RAS_CDEBUG(ras);
 }
 
@@ -772,56 +766,59 @@ void ll_readahead_init(struct inode *inode, struct ll_readahead_state *ras)
 {
 	spin_lock_init(&ras->ras_lock);
 	ras->ras_rpc_size = PTLRPC_MAX_BRW_PAGES;
-	ras_reset(inode, ras, 0);
+	ras_reset(ras, 0);
+	ras->ras_last_read_end = 0;
 	ras->ras_requests = 0;
 }
 
 /*
  * Check whether the read request is in the stride window.
- * If it is in the stride window, return 1, otherwise return 0.
+ * If it is in the stride window, return true, otherwise return false.
  */
-static int index_in_stride_window(struct ll_readahead_state *ras,
-				  unsigned long index)
+static bool index_in_stride_window(struct ll_readahead_state *ras,
+				   pgoff_t index)
 {
 	unsigned long stride_gap;
+	unsigned long pos = index << PAGE_SHIFT;
 
-	if (ras->ras_stride_length == 0 || ras->ras_stride_pages == 0 ||
-	    ras->ras_stride_pages == ras->ras_stride_length)
-		return 0;
+	if (ras->ras_stride_length == 0 || ras->ras_stride_bytes == 0 ||
+	    ras->ras_stride_bytes == ras->ras_stride_length)
+		return false;
 
-	stride_gap = index - ras->ras_last_readpage - 1;
+	stride_gap = pos - ras->ras_last_read_end - 1;
 
 	/* If it is contiguous read */
 	if (stride_gap == 0)
-		return ras->ras_consecutive_pages + 1 <= ras->ras_stride_pages;
+		return ras->ras_consecutive_bytes + PAGE_SIZE <=
+			ras->ras_stride_bytes;
 
 	/* Otherwise check the stride by itself */
-	return (ras->ras_stride_length - ras->ras_stride_pages) == stride_gap &&
-		ras->ras_consecutive_pages == ras->ras_stride_pages;
+	return (ras->ras_stride_length - ras->ras_stride_bytes) == stride_gap &&
+		ras->ras_consecutive_bytes == ras->ras_stride_bytes;
 }
 
-static void ras_update_stride_detector(struct ll_readahead_state *ras,
-				       unsigned long index)
+static void ras_init_stride_detector(struct ll_readahead_state *ras,
+				     unsigned long pos, unsigned long count)
 {
-	unsigned long stride_gap = index - ras->ras_last_readpage - 1;
+	unsigned long stride_gap = pos - ras->ras_last_read_end - 1;
 
 	if ((stride_gap != 0 || ras->ras_consecutive_stride_requests == 0) &&
 	    !stride_io_mode(ras)) {
-		ras->ras_stride_pages = ras->ras_consecutive_pages;
-		ras->ras_stride_length = ras->ras_consecutive_pages +
+		ras->ras_stride_bytes = ras->ras_consecutive_bytes;
+		ras->ras_stride_length =  ras->ras_consecutive_bytes +
 					 stride_gap;
 	}
 	LASSERT(ras->ras_request_index == 0);
 	LASSERT(ras->ras_consecutive_stride_requests == 0);
 
-	if (index <= ras->ras_last_readpage) {
+	if (pos <= ras->ras_last_read_end) {
 		/*Reset stride window for forward read*/
 		ras_stride_reset(ras);
 		return;
 	}
 
-	ras->ras_stride_pages = ras->ras_consecutive_pages;
-	ras->ras_stride_length = stride_gap + ras->ras_consecutive_pages;
+	ras->ras_stride_bytes = ras->ras_consecutive_bytes;
+	ras->ras_stride_length = stride_gap + ras->ras_consecutive_bytes;
 
 	RAS_CDEBUG(ras);
 }
@@ -835,36 +832,42 @@ static void ras_stride_increase_window(struct ll_readahead_state *ras,
 {
 	unsigned long left, step, window_len;
 	unsigned long stride_len;
+	unsigned long end = ras->ras_window_start + ras->ras_window_len;
 
 	LASSERT(ras->ras_stride_length > 0);
-	LASSERTF(ras->ras_window_start + ras->ras_window_len >=
-		 ras->ras_stride_offset,
+	LASSERTF(end >= (ras->ras_stride_offset >> PAGE_SHIFT),
 		 "window_start %lu, window_len %lu stride_offset %lu\n",
-		 ras->ras_window_start,
-		 ras->ras_window_len, ras->ras_stride_offset);
+		 ras->ras_window_start, ras->ras_window_len,
+		 ras->ras_stride_offset);
 
-	stride_len = ras->ras_window_start + ras->ras_window_len -
-		     ras->ras_stride_offset;
+	end <<= PAGE_SHIFT;
+	if (end < ras->ras_stride_offset)
+		stride_len = 0;
+	else
+		stride_len = end - ras->ras_stride_offset;
 
 	left = stride_len % ras->ras_stride_length;
-	window_len = ras->ras_window_len - left;
+	window_len = (ras->ras_window_len << PAGE_SHIFT) - left;
 
-	if (left < ras->ras_stride_pages)
+	if (left < ras->ras_stride_bytes)
 		left += inc_len;
 	else
-		left = ras->ras_stride_pages + inc_len;
+		left = ras->ras_stride_bytes + inc_len;
 
-	LASSERT(ras->ras_stride_pages != 0);
+	LASSERT(ras->ras_stride_bytes != 0);
 
-	step = left / ras->ras_stride_pages;
-	left %= ras->ras_stride_pages;
+	step = left / ras->ras_stride_bytes;
+	left %= ras->ras_stride_bytes;
 
 	window_len += step * ras->ras_stride_length + left;
 
-	if (stride_pg_count(ras->ras_stride_offset, ras->ras_stride_length,
-			    ras->ras_stride_pages, ras->ras_stride_offset,
-			    window_len) <= ra->ra_max_pages_per_file)
-		ras->ras_window_len = window_len;
+	if (DIV_ROUND_UP(stride_byte_count(ras->ras_stride_offset,
+					   ras->ras_stride_length,
+					   ras->ras_stride_bytes,
+					   ras->ras_stride_offset,
+					   window_len), PAGE_SIZE)
+	    <= ra->ra_max_pages_per_file)
+		ras->ras_window_len = (window_len >> PAGE_SHIFT);
 
 	RAS_CDEBUG(ras);
 }
@@ -878,7 +881,8 @@ static void ras_increase_window(struct inode *inode,
 	 * information from lower layer. FIXME later
 	 */
 	if (stride_io_mode(ras)) {
-		ras_stride_increase_window(ras, ra, ras->ras_rpc_size);
+		ras_stride_increase_window(ras, ra,
+				ras->ras_rpc_size << PAGE_SHIFT);
 	} else {
 		unsigned long wlen;
 
@@ -897,6 +901,7 @@ static void ras_update(struct ll_sb_info *sbi, struct inode *inode,
 {
 	struct ll_ra_info *ra = &sbi->ll_ra_info;
 	int zero = 0, stride_detect = 0, ra_miss = 0;
+	unsigned long pos = index << PAGE_SHIFT;
 	bool hit = flags & LL_RAS_HIT;
 
 	spin_lock(&ras->ras_lock);
@@ -913,13 +918,14 @@ static void ras_update(struct ll_sb_info *sbi, struct inode *inode,
 	 * be a symptom of there being so many read-ahead pages that the VM is
 	 * reclaiming it before we get to it.
 	 */
-	if (!index_in_window(index, ras->ras_last_readpage, 8, 8)) {
+	if (!pos_in_window(pos, ras->ras_last_read_end,
+			   8 << PAGE_SHIFT, 8 << PAGE_SHIFT)) {
 		zero = 1;
 		ll_ra_stats_inc_sbi(sbi, RA_STAT_DISTANT_READPAGE);
 	} else if (!hit && ras->ras_window_len &&
 		   index < ras->ras_next_readahead &&
-		   index_in_window(index, ras->ras_window_start, 0,
-				   ras->ras_window_len)) {
+		   pos_in_window(index, ras->ras_window_start, 0,
+				 ras->ras_window_len)) {
 		ra_miss = 1;
 		ll_ra_stats_inc_sbi(sbi, RA_STAT_MISS_IN_WINDOW);
 	}
@@ -955,16 +961,16 @@ static void ras_update(struct ll_sb_info *sbi, struct inode *inode,
 		if (!index_in_stride_window(ras, index)) {
 			if (ras->ras_consecutive_stride_requests == 0 &&
 			    ras->ras_request_index == 0) {
-				ras_update_stride_detector(ras, index);
+				ras_init_stride_detector(ras, pos, PAGE_SIZE);
 				ras->ras_consecutive_stride_requests++;
 			} else {
 				ras_stride_reset(ras);
 			}
-			ras_reset(inode, ras, index);
-			ras->ras_consecutive_pages++;
+			ras_reset(ras, index);
+			ras->ras_consecutive_bytes += PAGE_SIZE;
 			goto out_unlock;
 		} else {
-			ras->ras_consecutive_pages = 0;
+			ras->ras_consecutive_bytes = 0;
 			ras->ras_consecutive_requests = 0;
 			if (++ras->ras_consecutive_stride_requests > 1)
 				stride_detect = 1;
@@ -974,9 +980,10 @@ static void ras_update(struct ll_sb_info *sbi, struct inode *inode,
 		if (ra_miss) {
 			if (index_in_stride_window(ras, index) &&
 			    stride_io_mode(ras)) {
-				if (index != ras->ras_last_readpage + 1)
-					ras->ras_consecutive_pages = 0;
-				ras_reset(inode, ras, index);
+				if (index != (ras->ras_last_read_end >>
+					      PAGE_SHIFT) + 1)
+					ras->ras_consecutive_bytes = 0;
+				ras_reset(ras, index);
 
 				/* If stride-RA hit cache miss, the stride
 				 * detector will not be reset to avoid the
@@ -986,15 +993,15 @@ static void ras_update(struct ll_sb_info *sbi, struct inode *inode,
 				 * read-ahead window.
 				 */
 				if (ras->ras_window_start <
-				    ras->ras_stride_offset)
+				    (ras->ras_stride_offset >> PAGE_SHIFT))
 					ras_stride_reset(ras);
 				RAS_CDEBUG(ras);
 			} else {
 				/* Reset both stride window and normal RA
 				 * window
 				 */
-				ras_reset(inode, ras, index);
-				ras->ras_consecutive_pages++;
+				ras_reset(ras, index);
+				ras->ras_consecutive_bytes += PAGE_SIZE;
 				ras_stride_reset(ras);
 				goto out_unlock;
 			}
@@ -1011,9 +1018,8 @@ static void ras_update(struct ll_sb_info *sbi, struct inode *inode,
 			}
 		}
 	}
-	ras->ras_consecutive_pages++;
-	ras->ras_last_readpage = index;
-	ras_set_start(inode, ras, index);
+	ras->ras_consecutive_bytes += PAGE_SIZE;
+	ras_set_start(ras, index);
 
 	if (stride_io_mode(ras)) {
 		/* Since stride readahead is sensitive to the offset
@@ -1022,8 +1028,9 @@ static void ras_update(struct ll_sb_info *sbi, struct inode *inode,
 		 */
 		ras->ras_next_readahead = max(index + 1,
 					      ras->ras_next_readahead);
-		ras->ras_window_start = max(ras->ras_stride_offset,
-					    ras->ras_window_start);
+		ras->ras_window_start =
+				max(ras->ras_stride_offset >> PAGE_SHIFT,
+				    ras->ras_window_start);
 	} else {
 		if (ras->ras_next_readahead < ras->ras_window_start)
 			ras->ras_next_readahead = ras->ras_window_start;
@@ -1035,13 +1042,14 @@ static void ras_update(struct ll_sb_info *sbi, struct inode *inode,
 	/* Trigger RA in the mmap case where ras_consecutive_requests
 	 * is not incremented and thus can't be used to trigger RA
 	 */
-	if (ras->ras_consecutive_pages >= 4 && flags & LL_RAS_MMAP) {
+	if (ras->ras_consecutive_bytes >= (4 << PAGE_SHIFT) &&
+	    flags & LL_RAS_MMAP) {
 		ras_increase_window(inode, ras, ra);
 		/*
 		 * reset consecutive pages so that the readahead window can
 		 * grow gradually.
 		 */
-		ras->ras_consecutive_pages = 0;
+		ras->ras_consecutive_bytes = 0;
 		goto out_unlock;
 	}
 
@@ -1052,7 +1060,7 @@ static void ras_update(struct ll_sb_info *sbi, struct inode *inode,
 		 * reset to make sure next_readahead > stride offset
 		 */
 		ras->ras_next_readahead = max(index, ras->ras_next_readahead);
-		ras->ras_stride_offset = index;
+		ras->ras_stride_offset = index << PAGE_SHIFT;
 		ras->ras_window_start = max(index, ras->ras_window_start);
 	}
 
@@ -1066,6 +1074,7 @@ static void ras_update(struct ll_sb_info *sbi, struct inode *inode,
 out_unlock:
 	RAS_CDEBUG(ras);
 	ras->ras_request_index++;
+	ras->ras_last_read_end = pos + PAGE_SIZE - 1;
 	spin_unlock(&ras->ras_lock);
 }
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 479/622] lustre: osc: prevent use after free
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (477 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 478/622] lustre: readahead: convert stride page index to byte James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 480/622] lustre: mdc: hold obd while processing changelog James Simmons
                   ` (143 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Bobi Jam <bobijam@whamcloud.com>

Clear aa_oa after it's been freed to prevent use after free.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12581
Lustre-commit: 61c9f8797771 ("LU-12581 osc: prevent use after free")
Signed-off-by: Bobi Jam <bobijam@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35601
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/osc_request.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index 75e0823..7ba9ea5 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -748,6 +748,7 @@ static int osc_shrink_grant_interpret(const struct lu_env *env,
 	osc_update_grant(cli, body);
 out:
 	kmem_cache_free(osc_obdo_kmem, aa->aa_oa);
+	aa->aa_oa = NULL;
 
 	return rc;
 }
@@ -2131,6 +2132,7 @@ static int brw_interpret(const struct lu_env *env,
 		cl_object_attr_unlock(obj);
 	}
 	kmem_cache_free(osc_obdo_kmem, aa->aa_oa);
+	aa->aa_oa = NULL;
 
 	if (lustre_msg_get_opc(req->rq_reqmsg) == OST_WRITE && rc == 0)
 		osc_inc_unstable_pages(req);
-- 
1.8.3.1

* [lustre-devel] [PATCH 480/622] lustre: mdc: hold obd while processing changelog
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (478 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 479/622] lustre: osc: prevent use after free James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 481/622] lnet: change ln_mt_waitq to a completion James Simmons
                   ` (142 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Hongchao Zhang <hongchao@whamcloud.com>

While reading or clearing the changelog, a reference on the corresponding
obd_device must be held to protect it from being released by umount.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11626
Lustre-commit: d7bb6647cd4d ("LU-11626 mdc: hold obd while processing changelog")
Signed-off-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35784
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Emoly Liu <emoly@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/mdc/mdc_changelog.c | 121 ++++++++++++++++++++++++------------------
 1 file changed, 70 insertions(+), 51 deletions(-)

diff --git a/fs/lustre/mdc/mdc_changelog.c b/fs/lustre/mdc/mdc_changelog.c
index 9af0541..043549d 100644
--- a/fs/lustre/mdc/mdc_changelog.c
+++ b/fs/lustre/mdc/mdc_changelog.c
@@ -69,8 +69,10 @@ struct chlg_registered_dev {
 };
 
 struct chlg_reader_state {
-	/* Device this state is associated with */
-	struct chlg_registered_dev *crs_dev;
+	/* Shortcut to the corresponding OBD device */
+	struct obd_device	*crs_obd;
+	/* the corresponding chlg_registered_dev */
+	struct chlg_registered_dev *crs_ced;
 	/* Producer thread (if any) */
 	struct task_struct	*crs_prod_task;
 	/* An error occurred that prevents from reading further */
@@ -109,6 +111,41 @@ enum {
 };
 
 /**
+ * Deregister a changelog character device whose refcount has reached zero.
+ */
+static void chlg_dev_clear(struct kref *kref)
+{
+	struct chlg_registered_dev *entry = container_of(kref,
+						struct chlg_registered_dev,
+						ced_refs);
+
+	list_del(&entry->ced_link);
+	misc_deregister(&entry->ced_misc);
+	kfree(entry);
+}
+
+static inline struct obd_device *chlg_obd_get(struct chlg_registered_dev *dev)
+{
+	struct obd_device *obd;
+
+	mutex_lock(&chlg_registered_dev_lock);
+	if (list_empty(&dev->ced_obds))
+		return NULL;
+
+	obd = list_first_entry(&dev->ced_obds, struct obd_device,
+			       u.cli.cl_chg_dev_linkage);
+	class_incref(obd, "changelog", dev);
+	mutex_unlock(&chlg_registered_dev_lock);
+	return obd;
+}
+
+static inline void chlg_obd_put(struct chlg_registered_dev *dev,
+			 struct obd_device *obd)
+{
+	class_decref(obd, "changelog", dev);
+}
+
+/**
  * ChangeLog catalog processing callback invoked on each record.
  * If the current record is eligible to userland delivery, push
  * it into the crs_rec_queue where the consumer code will fetch it.
@@ -142,7 +179,7 @@ static int chlg_read_cat_process_cb(const struct lu_env *env,
 	if (rec->cr_hdr.lrh_type != CHANGELOG_REC) {
 		rc = -EINVAL;
 		CERROR("%s: not a changelog rec %x/%d in llog : rc = %d\n",
-		       crs->crs_dev->ced_name, rec->cr_hdr.lrh_type,
+		       crs->crs_obd->obd_name, rec->cr_hdr.lrh_type,
 		       rec->cr.cr_type, rc);
 		return rc;
 	}
@@ -193,17 +230,6 @@ static void enq_record_delete(struct chlg_rec_entry *rec)
 	kfree(rec);
 }
 
-/*
- * Find any OBD device associated with this reader
- * chlg_registered_dev_lock is held.
- */
-static inline struct obd_device *chlg_obd_get(struct chlg_registered_dev *dev)
-{
-	return list_first_entry_or_null(&dev->ced_obds,
-					struct obd_device,
-					u.cli.cl_chg_dev_linkage);
-}
-
 /**
  * Record prefetch thread entry point. Opens the changelog catalog and starts
  * reading records.
@@ -215,27 +241,28 @@ static inline struct obd_device *chlg_obd_get(struct chlg_registered_dev *dev)
 static int chlg_load(void *args)
 {
 	struct chlg_reader_state *crs = args;
+	struct chlg_registered_dev *ced = crs->crs_ced;
 	struct obd_device *obd;
 	struct llog_ctxt *ctx = NULL;
 	struct llog_handle *llh = NULL;
 	int rc;
 
-	mutex_lock(&chlg_registered_dev_lock);
-	obd = chlg_obd_get(crs->crs_dev);
-	if (!obd) {
-		rc = -ENOENT;
-		goto err_out;
-	}
+	crs->crs_last_catidx = -1;
+	crs->crs_last_idx = 0;
+
+again:
+	obd = chlg_obd_get(ced);
+	if (!obd)
+		return -ENODEV;
+
+	crs->crs_obd = obd;
+
 	ctx = llog_get_context(obd, LLOG_CHANGELOG_REPL_CTXT);
 	if (!ctx) {
 		rc = -ENOENT;
 		goto err_out;
 	}
 
-	crs->crs_last_catidx = -1;
-	crs->crs_last_idx = 0;
-
-again:
 	rc = llog_open(NULL, ctx, &llh, NULL, CHANGELOG_CATALOG,
 		       LLOG_OPEN_EXISTS);
 	if (rc) {
@@ -268,6 +295,8 @@ static int chlg_load(void *args)
 	}
 	if (!kthread_should_stop() && crs->crs_poll) {
 		llog_cat_close(NULL, llh);
+		llog_ctxt_put(ctx);
+		class_decref(obd, "changelog", crs);
 		schedule_timeout_interruptible(HZ);
 		goto again;
 	}
@@ -275,7 +304,6 @@ static int chlg_load(void *args)
 	crs->crs_eof = true;
 
 err_out:
-	mutex_unlock(&chlg_registered_dev_lock);
 	if (rc < 0)
 		crs->crs_err = rc;
 
@@ -287,6 +315,8 @@ static int chlg_load(void *args)
 	if (ctx)
 		llog_ctxt_put(ctx);
 
+	crs->crs_obd = NULL;
+	chlg_obd_put(ced, obd);
 	wait_event_idle(crs->crs_waitq_prod, kthread_should_stop());
 
 	return rc;
@@ -454,19 +484,19 @@ static int chlg_clear(struct chlg_reader_state *crs, u32 reader, u64 record)
 		.cs_recno = record,
 		.cs_id    = reader
 	};
-	int ret;
+	int rc;
 
-	mutex_lock(&chlg_registered_dev_lock);
-	obd = chlg_obd_get(crs->crs_dev);
+	obd = chlg_obd_get(crs->crs_ced);
 	if (!obd)
-		ret = -ENOENT;
-	else
-		ret = obd_set_info_async(NULL, obd->obd_self_export,
-					 strlen(KEY_CHANGELOG_CLEAR),
-					 KEY_CHANGELOG_CLEAR, sizeof(cs),
-					 &cs, NULL);
-	mutex_unlock(&chlg_registered_dev_lock);
-	return ret;
+		return -ENODEV;
+
+	rc = obd_set_info_async(NULL, obd->obd_self_export,
+				strlen(KEY_CHANGELOG_CLEAR),
+				KEY_CHANGELOG_CLEAR, sizeof(cs),
+				&cs, NULL);
+	chlg_obd_put(crs->crs_ced, obd);
+
+	return rc;
 }
 
 /** Maximum changelog control command size */
@@ -540,7 +570,8 @@ static int chlg_open(struct inode *inode, struct file *file)
 	if (!crs)
 		return -ENOMEM;
 
-	crs->crs_dev = dev;
+	kref_get(&dev->ced_refs);
+	crs->crs_ced = dev;
 	crs->crs_err = false;
 	crs->crs_eof = false;
 
@@ -564,6 +595,7 @@ static int chlg_open(struct inode *inode, struct file *file)
 	return 0;
 
 err_crs:
+	kref_put(&dev->ced_refs, chlg_dev_clear);
 	kfree(crs);
 	return rc;
 }
@@ -589,6 +621,7 @@ static int chlg_release(struct inode *inode, struct file *file)
 	list_for_each_entry_safe(rec, tmp, &crs->crs_rec_queue, enq_linkage)
 		enq_record_delete(rec);
 
+	kref_put(&crs->crs_ced->ced_refs, chlg_dev_clear);
 	kfree(crs);
 	return rc;
 }
@@ -763,20 +796,6 @@ int mdc_changelog_cdev_init(struct obd_device *obd)
 }
 
 /**
- * Deregister a changelog character device whose refcount has reached zero.
- */
-static void chlg_dev_clear(struct kref *kref)
-{
-	struct chlg_registered_dev *entry = container_of(kref,
-							 struct chlg_registered_dev,
-							 ced_refs);
-	LASSERT(mutex_is_locked(&chlg_registered_dev_lock));
-	list_del(&entry->ced_link);
-	misc_deregister(&entry->ced_misc);
-	kfree(entry);
-}
-
-/**
  * Release OBD, decrease reference count of the corresponding changelog device.
  */
 void mdc_changelog_cdev_finish(struct obd_device *obd)
-- 
1.8.3.1

* [lustre-devel] [PATCH 481/622] lnet: change ln_mt_waitq to a completion.
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (479 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 480/622] lustre: mdc: hold obd while processing changelog James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 482/622] lustre: obdclass: align to T10 sector size when generating guard James Simmons
                   ` (141 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

ln_mt_waitq is only waited on by a call to
  wait_event_interruptible_timeout(..., false, timeout);

As 'false' is never 'true', this will always wait for the full
timeout to expire.  So the waitq is effectively pointless.

To achieve the apparent intent of the waitq, change it to a
completion.  A completion adds a 'done' flag to a waitq so we can
wait until a timeout expires or until a wakeup is requested.

With this, a longer timeout could be used, but that is left to
a later patch.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12686
Lustre-commit: b81bcc6c6f0c ("LU-12686 lnet: change ln_mt_waitq to a completion.")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35874
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-types.h |  5 +++--
 net/lnet/lnet/api-ni.c         |  2 +-
 net/lnet/lnet/lib-move.c       | 10 +++++++---
 net/lnet/lnet/lib-msg.c        |  2 +-
 net/lnet/lnet/router.c         |  4 ++--
 5 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index 22c2bc6..18d4e4e 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -1145,10 +1145,11 @@ struct lnet {
 	 */
 	bool				ln_nis_from_mod_params;
 
-	/* waitq for the monitor thread. The monitor thread takes care of
+	/*
+	 * completion for the monitor thread. The monitor thread takes care of
 	 * checking routes, timedout messages and resending messages.
 	 */
-	wait_queue_head_t		ln_mt_waitq;
+	struct completion		ln_mt_wait_complete;
 
 	/* per-cpt resend queues */
 	struct list_head		**ln_mt_resendqs;
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 79deaac..e66d9dc7 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -486,7 +486,7 @@ static int lnet_discover(struct lnet_process_id id, u32 force,
 	spin_lock_init(&the_lnet.ln_eq_wait_lock);
 	spin_lock_init(&the_lnet.ln_msg_resend_lock);
 	init_waitqueue_head(&the_lnet.ln_eq_waitq);
-	init_waitqueue_head(&the_lnet.ln_mt_waitq);
+	init_completion(&the_lnet.ln_mt_wait_complete);
 	mutex_init(&the_lnet.ln_lnd_mutex);
 }
 
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 322998a..2f31f06 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -3276,8 +3276,12 @@ struct lnet_mt_event_info {
 			       min((unsigned int)alive_router_check_interval /
 					lnet_current_net_count,
 				   lnet_transaction_timeout / 2));
-		wait_event_interruptible_timeout(the_lnet.ln_mt_waitq,
-						 false, HZ * interval);
+		wait_for_completion_interruptible_timeout(&the_lnet.ln_mt_wait_complete,
+							  interval * HZ);
+		/* Must re-init the completion before testing anything,
+		 * including ln_mt_state.
+		 */
+		reinit_completion(&the_lnet.ln_mt_wait_complete);
 	}
 
 	/* Shutting down */
@@ -3539,7 +3543,7 @@ void lnet_monitor_thr_stop(void)
 	lnet_net_unlock(LNET_LOCK_EX);
 
 	/* tell the monitor thread that we're shutting down */
-	wake_up(&the_lnet.ln_mt_waitq);
+	complete(&the_lnet.ln_mt_wait_complete);
 
 	/* block until monitor thread signals that it's done */
 	wait_for_completion(&the_lnet.ln_mt_signal);
diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index 5c39ce3..d74ff53 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -640,7 +640,7 @@
 
 	list_add_tail(&msg->msg_list, the_lnet.ln_mt_resendqs[msg->msg_tx_cpt]);
 
-	wake_up(&the_lnet.ln_mt_waitq);
+	complete(&the_lnet.ln_mt_wait_complete);
 }
 
 int
diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index bc9494d..7246eea 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -674,7 +674,7 @@ static void lnet_shuffle_seed(void)
 		kfree(rnet);
 
 	/* kick start the monitor thread to handle the added route */
-	wake_up(&the_lnet.ln_mt_waitq);
+	complete(&the_lnet.ln_mt_wait_complete);
 
 	return rc;
 }
@@ -1419,7 +1419,7 @@ bool lnet_router_checker_active(void)
 	lnet_net_lock(LNET_LOCK_EX);
 	the_lnet.ln_routing = 1;
 	lnet_net_unlock(LNET_LOCK_EX);
-	wake_up(&the_lnet.ln_mt_waitq);
+	complete(&the_lnet.ln_mt_wait_complete);
 	return 0;
 
 failed:
-- 
1.8.3.1

* [lustre-devel] [PATCH 482/622] lustre: obdclass: align to T10 sector size when generating guard
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (480 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 481/622] lnet: change ln_mt_waitq to a completion James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 483/622] lustre: ptlrpc: Hold imp lock for idle reconnect James Simmons
                   ` (140 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Otherwise the client and server would compute different checksums
when their page sizes differ.

Improve test_810 to verify all available checksum types.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11729
Lustre-commit: 98ceaf854bb4 ("LU-11729 obdclass: align to T10 sector size when generating guard")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Li Xi <lixi@ddn.com>
Signed-off-by: Li Dongyang <dongyangli@ddn.com>
Reviewed-on: https://review.whamcloud.com/34043
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/integrity.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/lustre/obdclass/integrity.c b/fs/lustre/obdclass/integrity.c
index 5cb9a25..2d5760d 100644
--- a/fs/lustre/obdclass/integrity.c
+++ b/fs/lustre/obdclass/integrity.c
@@ -50,26 +50,26 @@ int obd_page_dif_generate_buffer(const char *obd_name, struct page *page,
 				 int *used_number, int sector_size,
 				 obd_dif_csum_fn *fn)
 {
-	unsigned int i;
+	unsigned int i = offset;
+	unsigned int end = offset + length;
 	char *data_buf;
 	u16 *guard_buf = guard_start;
 	unsigned int data_size;
 	int used = 0;
 
 	data_buf = kmap(page) + offset;
-	for (i = 0; i < length; i += sector_size) {
+	while (i < end) {
 		if (used >= guard_number) {
 			CERROR("%s: unexpected used guard number of DIF %u/%u, data length %u, sector size %u: rc = %d\n",
 			       obd_name, used, guard_number, length,
 			       sector_size, -E2BIG);
 			return -E2BIG;
 		}
-		data_size = length - i;
-		if (data_size > sector_size)
-			data_size = sector_size;
+		data_size = min(round_up(i + 1, sector_size), end) - i;
 		*guard_buf = fn(data_buf, data_size);
 		guard_buf++;
 		data_buf += data_size;
+		i += data_size;
 		used++;
 	}
 	kunmap(page);
-- 
1.8.3.1

* [lustre-devel] [PATCH 483/622] lustre: ptlrpc: Hold imp lock for idle reconnect
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (481 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 482/622] lustre: obdclass: align to T10 sector size when generating guard James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 484/622] lustre: osc: glimpse - search for active lock James Simmons
                   ` (139 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

Idle reconnect sets import state to IMP_NEW, then releases
the import lock before calling ptlrpc_connect_import.  This
creates a gap where an import in IMP_NEW state is exposed,
which can cause new requests to fail with EIO.

Hold the lock across the call so as not to expose imports
in this state.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12559
Lustre-commit: e9472c54ac82 ("LU-12559 ptlrpc: Hold imp lock for idle reconnect")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35530
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_net.h |  1 +
 fs/lustre/ptlrpc/client.c      | 13 ++++++-------
 fs/lustre/ptlrpc/import.c      | 19 +++++++++++++++----
 3 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/fs/lustre/include/lustre_net.h b/fs/lustre/include/lustre_net.h
index aaf5cb8..8dad08e 100644
--- a/fs/lustre/include/lustre_net.h
+++ b/fs/lustre/include/lustre_net.h
@@ -2015,6 +2015,7 @@ struct ptlrpc_service *ptlrpc_register_service(struct ptlrpc_service_conf *conf,
  * @{
  */
 int ptlrpc_connect_import(struct obd_import *imp);
+int ptlrpc_connect_import_locked(struct obd_import *imp);
 int ptlrpc_init_import(struct obd_import *imp);
 int ptlrpc_disconnect_import(struct obd_import *imp, int noclose);
 int ptlrpc_disconnect_and_idle_import(struct obd_import *imp);
diff --git a/fs/lustre/ptlrpc/client.c b/fs/lustre/ptlrpc/client.c
index 478ba85..c359ac0 100644
--- a/fs/lustre/ptlrpc/client.c
+++ b/fs/lustre/ptlrpc/client.c
@@ -870,7 +870,6 @@ struct ptlrpc_request *__ptlrpc_request_alloc(struct obd_import *imp,
 			      const struct req_format *format)
 {
 	struct ptlrpc_request *request;
-	int connect = 0;
 
 	request = __ptlrpc_request_alloc(imp, pool);
 	if (!request)
@@ -890,17 +889,17 @@ struct ptlrpc_request *__ptlrpc_request_alloc(struct obd_import *imp,
 		if (imp->imp_state == LUSTRE_IMP_IDLE) {
 			imp->imp_generation++;
 			imp->imp_initiated_at = imp->imp_generation;
-			imp->imp_state =  LUSTRE_IMP_NEW;
-			connect = 1;
-		}
-		spin_unlock(&imp->imp_lock);
-		if (connect) {
-			rc = ptlrpc_connect_import(imp);
+			imp->imp_state = LUSTRE_IMP_NEW;
+
+			/* connect_import_locked releases imp_lock */
+			rc = ptlrpc_connect_import_locked(imp);
 			if (rc < 0) {
 				ptlrpc_request_free(request);
 				return NULL;
 			}
 			ptlrpc_pinger_add_import(imp);
+		} else {
+			spin_unlock(&imp->imp_lock);
 		}
 	}
 
diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index ff1b810..c4a732d 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -611,13 +611,22 @@ static int ptlrpc_first_transno(struct obd_import *imp, u64 *transno)
 	return 0;
 }
 
+int ptlrpc_connect_import(struct obd_import *imp)
+{
+	spin_lock(&imp->imp_lock);
+	return ptlrpc_connect_import_locked(imp);
+}
+
 /**
  * Attempt to (re)connect import @imp. This includes all preparations,
  * initializing CONNECT RPC request and passing it to ptlrpcd for
  * actual sending.
+ *
+ * Assumes imp->imp_lock is held, and releases it.
+ *
  * Returns 0 on success or error code.
  */
-int ptlrpc_connect_import(struct obd_import *imp)
+int ptlrpc_connect_import_locked(struct obd_import *imp)
 {
 	struct obd_device *obd = imp->imp_obd;
 	int initial_connect = 0;
@@ -634,7 +643,8 @@ int ptlrpc_connect_import(struct obd_import *imp)
 	struct ptlrpc_connect_async_args *aa;
 	int rc;
 
-	spin_lock(&imp->imp_lock);
+	assert_spin_locked(&imp->imp_lock);
+
 	if (imp->imp_state == LUSTRE_IMP_CLOSED) {
 		spin_unlock(&imp->imp_lock);
 		CERROR("can't connect to a closed import\n");
@@ -1701,12 +1711,13 @@ static int ptlrpc_disconnect_idle_interpret(const struct lu_env *env,
 			connect = 1;
 		}
 	}
-	spin_unlock(&imp->imp_lock);
 
 	if (connect) {
-		rc = ptlrpc_connect_import(imp);
+		rc = ptlrpc_connect_import_locked(imp);
 		if (rc >= 0)
 			ptlrpc_pinger_add_import(imp);
+	} else {
+		spin_unlock(&imp->imp_lock);
 	}
 
 	return 0;
-- 
1.8.3.1

* [lustre-devel] [PATCH 484/622] lustre: osc: glimpse - search for active lock
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (482 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 483/622] lustre: ptlrpc: Hold imp lock for idle reconnect James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 485/622] lustre: lmv: use lu_tgt_descs to manage tgts James Simmons
                   ` (138 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

When there are lock-ahead write locks on a file, the server
sends one glimpse AST RPC to each client holding such locks (a
client may hold many). This callback is sent to the lock
covering the highest offset.

The client's glimpse callback goes up to the clio layers and
gets the global (not lock-specific) view of the size.  The clio
layers are connected to the extent lock through
l_ast_data (which points to the OSC object).
Speculative locks (AGL, lockahead) do not have l_ast_data
initialised until an IO happens under the lock. Thus, some
speculative locks may not have l_ast_data initialized.

It is possible for the client to do a write using one lock
(changing file size), but for the glimpse AST to be sent to
another lock without l_ast_data initialized.  Currently, a
lock with no l_ast_data set returns ELDLM_NO_LOCK_DATA to
the server.  In this case, this means we do not return the
updated size.

The solution is to search the granted lock tree for any lock
with initialized l_ast_data (it points to the OSC object
which is the same for all the extent locks) and to reach the
clio layers for the size through this lock instead.

cray-bug-id: LUS-6747
WC-bug-id: https://jira.whamcloud.com/browse/LU-11670
Lustre-commit: b3461d11dcb0 ("LU-11670 osc: glimpse - search for active lock")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/33660
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_dlm.h  | 17 ++++++++++++++++-
 fs/lustre/include/obd_support.h |  1 +
 fs/lustre/ldlm/ldlm_lock.c      | 39 ++++++++++++++++++++-------------------
 fs/lustre/osc/osc_lock.c        | 41 ++++++++++++++++++++++++++++++++++++-----
 4 files changed, 73 insertions(+), 25 deletions(-)

diff --git a/fs/lustre/include/lustre_dlm.h b/fs/lustre/include/lustre_dlm.h
index 4060bb4..f7d2d9c 100644
--- a/fs/lustre/include/lustre_dlm.h
+++ b/fs/lustre/include/lustre_dlm.h
@@ -809,6 +809,20 @@ struct ldlm_lock {
 };
 
 /**
+ * Describe the overlap between two locks.  itree_overlap_cb data.
+ */
+struct ldlm_match_data {
+	struct ldlm_lock	*lmd_old;
+	struct ldlm_lock	*lmd_lock;
+	enum ldlm_mode		*lmd_mode;
+	union ldlm_policy_data	*lmd_policy;
+	u64			 lmd_flags;
+	u64			 lmd_skip_flags;
+	int			 lmd_unref;
+	bool			 lmd_has_ast_data;
+};
+
+/**
  * LDLM resource description.
  * Basically, resource is a representation for a single object.
  * Object has a name which is currently 4 64-bit integers. LDLM user is
@@ -1163,7 +1177,8 @@ static inline enum ldlm_mode ldlm_lock_match(struct ldlm_namespace *ns,
 	return ldlm_lock_match_with_skip(ns, flags, 0, res_id, type, policy,
 					 mode, lh, unref);
 }
-
+struct ldlm_lock *search_itree(struct ldlm_resource *res,
+			       struct ldlm_match_data *data);
 enum ldlm_mode ldlm_revalidate_lock_handle(const struct lustre_handle *lockh,
 					   u64 *bits);
 void ldlm_lock_cancel(struct ldlm_lock *lock);
diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index 506535b..acfd098 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -330,6 +330,7 @@
 #define OBD_FAIL_OSC_DELAY_SETTIME			0x412
 #define OBD_FAIL_OSC_CONNECT_GRANT_PARAM		0x413
 #define OBD_FAIL_OSC_DELAY_IO				0x414
+#define OBD_FAIL_OSC_NO_SIZE_DATA			0x415
 
 #define OBD_FAIL_PTLRPC					0x500
 #define OBD_FAIL_PTLRPC_ACK				0x501
diff --git a/fs/lustre/ldlm/ldlm_lock.c b/fs/lustre/ldlm/ldlm_lock.c
index b6c49c5..d14221a 100644
--- a/fs/lustre/ldlm/ldlm_lock.c
+++ b/fs/lustre/ldlm/ldlm_lock.c
@@ -1045,19 +1045,6 @@ void ldlm_grant_lock(struct ldlm_lock *lock, struct list_head *work_list)
 }
 
 /**
- * Describe the overlap between two locks.  itree_overlap_cb data.
- */
-struct lock_match_data {
-	struct ldlm_lock	*lmd_old;
-	struct ldlm_lock	*lmd_lock;
-	enum ldlm_mode		*lmd_mode;
-	union ldlm_policy_data	*lmd_policy;
-	u64			 lmd_flags;
-	u64			 lmd_skip_flags;
-	int			 lmd_unref;
-};
-
-/**
  * Check if the given @lock meets the criteria for a match.
  * A reference on the lock is taken if matched.
  *
@@ -1066,9 +1053,9 @@ struct lock_match_data {
  */
 static bool lock_matches(struct ldlm_lock *lock, void *vdata)
 {
-	struct lock_match_data *data = vdata;
+	struct ldlm_match_data *data = vdata;
 	union ldlm_policy_data *lpol = &lock->l_policy_data;
-	enum ldlm_mode match;
+	enum ldlm_mode match = LCK_MINMODE;
 
 	if (lock == data->lmd_old)
 		return true;
@@ -1098,6 +1085,17 @@ static bool lock_matches(struct ldlm_lock *lock, void *vdata)
 
 	if (!(lock->l_req_mode & *data->lmd_mode))
 		return false;
+
+	/* When we search for ast_data, we are not doing a traditional match,
+	 * so we don't worry about IBITS or extent matching.
+	 */
+	if (data->lmd_has_ast_data) {
+		if (!lock->l_ast_data)
+			return false;
+
+		goto matched;
+	}
+
 	match = lock->l_req_mode;
 
 	switch (lock->l_resource->lr_type) {
@@ -1138,6 +1136,7 @@ static bool lock_matches(struct ldlm_lock *lock, void *vdata)
 	if (data->lmd_skip_flags & lock->l_flags)
 		return false;
 
+matched:
 	if (data->lmd_flags & LDLM_FL_TEST_LOCK) {
 		LDLM_LOCK_GET(lock);
 		ldlm_lock_touch_in_lru(lock);
@@ -1159,8 +1158,8 @@ static bool lock_matches(struct ldlm_lock *lock, void *vdata)
  *
  * Return:	a referenced lock or NULL.
  */
-static struct ldlm_lock *search_itree(struct ldlm_resource *res,
-				      struct lock_match_data *data)
+struct ldlm_lock *search_itree(struct ldlm_resource *res,
+			       struct ldlm_match_data *data)
 {
 	int idx;
 
@@ -1185,6 +1184,7 @@ static struct ldlm_lock *search_itree(struct ldlm_resource *res,
 
 	return NULL;
 }
+EXPORT_SYMBOL(search_itree);
 
 /*
  * Search for a lock with given properties in a queue.
@@ -1195,7 +1195,7 @@ static struct ldlm_lock *search_itree(struct ldlm_resource *res,
  * Return:	a referenced lock or NULL.
  */
 static struct ldlm_lock *search_queue(struct list_head *queue,
-				      struct lock_match_data *data)
+				      struct ldlm_match_data *data)
 {
 	struct ldlm_lock *lock;
 
@@ -1280,7 +1280,7 @@ enum ldlm_mode ldlm_lock_match_with_skip(struct ldlm_namespace *ns,
 					 enum ldlm_mode mode,
 					 struct lustre_handle *lockh, int unref)
 {
-	struct lock_match_data data = {
+	struct ldlm_match_data data = {
 		.lmd_old	= NULL,
 		.lmd_lock	= NULL,
 		.lmd_mode	= &mode,
@@ -1288,6 +1288,7 @@ enum ldlm_mode ldlm_lock_match_with_skip(struct ldlm_namespace *ns,
 		.lmd_flags	= flags,
 		.lmd_skip_flags	= skip_flags,
 		.lmd_unref	= unref,
+		.lmd_has_ast_data = false,
 	};
 	struct ldlm_resource *res;
 	struct ldlm_lock *lock;
diff --git a/fs/lustre/osc/osc_lock.c b/fs/lustre/osc/osc_lock.c
index c748e58..dcddf17 100644
--- a/fs/lustre/osc/osc_lock.c
+++ b/fs/lustre/osc/osc_lock.c
@@ -549,6 +549,10 @@ int osc_ldlm_glimpse_ast(struct ldlm_lock *dlmlock, void *data)
 	struct ost_lvb *lvb;
 	struct req_capsule *cap;
 	struct cl_object *obj = NULL;
+	struct ldlm_resource *res = dlmlock->l_resource;
+	struct ldlm_match_data matchdata = { 0 };
+	union ldlm_policy_data policy;
+	enum ldlm_mode mode = LCK_PW | LCK_GROUP | LCK_PR;
 	int result;
 	u16 refcheck;
 
@@ -559,13 +563,40 @@ int osc_ldlm_glimpse_ast(struct ldlm_lock *dlmlock, void *data)
 		result = PTR_ERR(env);
 		goto out;
 	}
+	policy.l_extent.start = 0;
+	policy.l_extent.end = LUSTRE_EOF;
 
-	lock_res_and_lock(dlmlock);
-	if (dlmlock->l_ast_data) {
-		obj = osc2cl(dlmlock->l_ast_data);
-		cl_object_get(obj);
+	matchdata.lmd_mode = &mode;
+	matchdata.lmd_policy = &policy;
+	matchdata.lmd_flags = LDLM_FL_TEST_LOCK | LDLM_FL_CBPENDING;
+	matchdata.lmd_unref = 1;
+	matchdata.lmd_has_ast_data = true;
+
+	LDLM_LOCK_GET(dlmlock);
+
+	/* If any dlmlock has l_ast_data set, we must find it or we risk
+	 * missing a size update done under a different lock.
+	 */
+	while (dlmlock) {
+		lock_res_and_lock(dlmlock);
+		if (dlmlock->l_ast_data) {
+			obj = osc2cl(dlmlock->l_ast_data);
+			cl_object_get(obj);
+		}
+		unlock_res_and_lock(dlmlock);
+		LDLM_LOCK_PUT(dlmlock);
+
+		dlmlock = NULL;
+
+		if (!obj && res->lr_type == LDLM_EXTENT) {
+			if (OBD_FAIL_CHECK(OBD_FAIL_OSC_NO_SIZE_DATA))
+				break;
+
+			lock_res(res);
+			dlmlock = search_itree(res, &matchdata);
+			unlock_res(res);
+		}
 	}
-	unlock_res_and_lock(dlmlock);
 
 	if (obj) {
 		/* Do not grab the mutex of cl_lock for glimpse.
-- 
1.8.3.1

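The new glimpse path in the hunk above no longer trusts the incoming lock to carry `l_ast_data`; it keeps searching other locks on the same resource until one does (or the search is exhausted). A minimal user-space sketch of that control flow, with hypothetical type and function names and a plain linked list standing in for the interval-tree search done by `search_itree()`:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's ldlm_lock / l_ast_data. */
struct lock {
	void *ast_data;		/* size data attached to this lock, or NULL */
	struct lock *next;	/* stands in for the itree search on the resource */
};

/*
 * Start from the lock the glimpse AST arrived on; if it carries no
 * ast_data, fall through to the next candidate lock on the resource.
 * Returns the first ast_data found, or NULL if no lock has any.
 */
static void *find_size_data(struct lock *lk)
{
	while (lk) {
		if (lk->ast_data)
			return lk->ast_data;
		lk = lk->next;
	}
	return NULL;
}
```

The point of the kernel change is the same as this loop's: a size update may have been recorded under a different lock than the one the AST was delivered on, so giving up after the first lock risks returning stale size information.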
* [lustre-devel] [PATCH 485/622] lustre: lmv: use lu_tgt_descs to manage tgts
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

Like LOD, use lu_tgt_descs to manage tgts, so that LMV and LOD can
share the tgt management code.

TODO: use the same tgt management code for LOV/LFSCK.
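The shared table this patch introduces is a paged, two-level pointer array indexed by target number (see TGT_PTRS / TGT_PTRS_PER_BLOCK and the LTD_TGT() macro in the diff below). As a rough user-space illustration of the index arithmetic only, with made-up type names and a tiny block size (the kernel uses PAGE_SIZE / sizeof(void *) entries per level):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins; the kernel's lu_tgt_desc_idx holds
 * PAGE_SIZE / sizeof(void *) slots per block. */
#define PTRS_PER_BLOCK 8

struct tgt_desc { unsigned int index; };
struct tgt_idx  { struct tgt_desc *tgt[PTRS_PER_BLOCK]; };

/*
 * Two-level lookup: the high part of the index selects a block, the
 * low part selects a slot within it -- the same division/modulo as
 * the LTD_TGT() macro. Returns NULL for an unpopulated slot.
 */
static struct tgt_desc *ltd_tgt(struct tgt_idx **idx, unsigned int index)
{
	struct tgt_idx *blk = idx[index / PTRS_PER_BLOCK];

	return blk ? blk->tgt[index % PTRS_PER_BLOCK] : NULL;
}
```

The design choice mirrors LOD: because only first-level pointers are swapped when the table grows, targets can be added without reallocating and copying one large flat array, which is what the removed lmv_add_target() realloc code used to do.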

WC-bug-id: https://jira.whamcloud.com/browse/LU-11213
Lustre-commit: 59fc1218fccf ("LU-11213 lmv: use lu_tgt_descs to manage tgts")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35218
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lu_object.h     |  68 +++++
 fs/lustre/include/obd.h           |   5 +-
 fs/lustre/lmv/lmv_intent.c        |  10 +-
 fs/lustre/lmv/lmv_internal.h      |  86 +++---
 fs/lustre/lmv/lmv_obd.c           | 540 +++++++++++++++-----------------------
 fs/lustre/lmv/lmv_qos.c           |  57 ++--
 fs/lustre/lmv/lproc_lmv.c         |  20 +-
 fs/lustre/obdclass/Makefile       |   2 +-
 fs/lustre/obdclass/lu_tgt_descs.c | 192 ++++++++++++++
 9 files changed, 572 insertions(+), 408 deletions(-)
 create mode 100644 fs/lustre/obdclass/lu_tgt_descs.c

diff --git a/fs/lustre/include/lu_object.h b/fs/lustre/include/lu_object.h
index aed0d4b..c30c06d 100644
--- a/fs/lustre/include/lu_object.h
+++ b/fs/lustre/include/lu_object.h
@@ -1392,6 +1392,38 @@ struct lu_tgt_desc {
 			ltd_connecting:1;  /* target is connecting */
 };
 
+/* number of pointers at 1st level */
+#define TGT_PTRS		(PAGE_SIZE / sizeof(void *))
+/* number of pointers at 2nd level */
+#define TGT_PTRS_PER_BLOCK	(PAGE_SIZE / sizeof(void *))
+
+struct lu_tgt_desc_idx {
+	struct lu_tgt_desc *ldi_tgt[TGT_PTRS_PER_BLOCK];
+};
+
+struct lu_tgt_descs {
+	/* list of known TGTs */
+	struct lu_tgt_desc_idx	*ltd_tgt_idx[TGT_PTRS];
+	/* Size of the lu_tgts array, granted to be a power of 2 */
+	u32			ltd_tgts_size;
+	/* number of registered TGTs */
+	u32			ltd_tgtnr;
+	/* bitmap of TGTs available */
+	unsigned long		*ltd_tgt_bitmap;
+	/* TGTs scheduled to be deleted */
+	u32			ltd_death_row;
+	/* Table refcount used for delayed deletion */
+	int			ltd_refcount;
+	/* mutex to serialize concurrent updates to the tgt table */
+	struct mutex		ltd_mutex;
+	/* read/write semaphore used for array relocation */
+	struct rw_semaphore	ltd_rw_sem;
+};
+
+#define LTD_TGT(ltd, index)						\
+	((ltd)->ltd_tgt_idx[(index) / TGT_PTRS_PER_BLOCK]		\
+				->ldi_tgt[(index) % TGT_PTRS_PER_BLOCK])
+
 /* QoS data for LOD/LMV */
 struct lu_qos {
 	struct list_head	 lq_svr_list;	/* lu_svr_qos list */
@@ -1412,5 +1444,41 @@ struct lu_qos {
 int lqos_del_tgt(struct lu_qos *qos, struct lu_tgt_desc *ltd);
 u64 lu_prandom_u64_max(u64 ep_ro);
 
+int lu_tgt_descs_init(struct lu_tgt_descs *ltd);
+void lu_tgt_descs_fini(struct lu_tgt_descs *ltd);
+int lu_tgt_descs_add(struct lu_tgt_descs *ltd, struct lu_tgt_desc *tgt);
+void lu_tgt_descs_del(struct lu_tgt_descs *ltd, struct lu_tgt_desc *tgt);
+
+static inline struct lu_tgt_desc *ltd_first_tgt(struct lu_tgt_descs *ltd)
+{
+	int index;
+
+	index = find_first_bit(ltd->ltd_tgt_bitmap,
+			       ltd->ltd_tgts_size);
+	return (index < ltd->ltd_tgts_size) ? LTD_TGT(ltd, index) : NULL;
+}
+
+static inline struct lu_tgt_desc *ltd_next_tgt(struct lu_tgt_descs *ltd,
+					       struct lu_tgt_desc *tgt)
+{
+	int index;
+
+	if (!tgt)
+		return NULL;
+
+	index = tgt->ltd_index;
+	LASSERT(index < ltd->ltd_tgts_size);
+	index = find_next_bit(ltd->ltd_tgt_bitmap,
+			      ltd->ltd_tgts_size, index + 1);
+	return (index < ltd->ltd_tgts_size) ? LTD_TGT(ltd, index) : NULL;
+}
+
+#define ltd_foreach_tgt(ltd, tgt) \
+	for (tgt = ltd_first_tgt(ltd); tgt; tgt = ltd_next_tgt(ltd, tgt))
+
+#define ltd_foreach_tgt_safe(ltd, tgt, tmp)				  \
+	for (tgt = ltd_first_tgt(ltd), tmp = ltd_next_tgt(ltd, tgt); tgt; \
+	     tgt = tmp, tmp = ltd_next_tgt(ltd, tgt))
+
 /** @} lu */
 #endif /* __LUSTRE_LU_OBJECT_H */
diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index ef37f78..41431f9 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -424,14 +424,13 @@ struct lmv_obd {
 	spinlock_t		lmv_lock;
 	struct lmv_desc		desc;
 
-	struct mutex		lmv_init_mutex;
 	int			connected;
 	int			max_easize;
 	int			max_def_easize;
 	u32			lmv_statfs_start;
 
-	u32			tgts_size; /* size of tgts array */
-	struct lmv_tgt_desc	**tgts;
+	struct lu_tgt_descs     lmv_mdt_descs;
+
 	struct obd_connect_data	conn_data;
 	struct kobject		*lmv_tgts_kobj;
 	void			*lmv_cache;
diff --git a/fs/lustre/lmv/lmv_intent.c b/fs/lustre/lmv/lmv_intent.c
index f62cd7c..542b16d 100644
--- a/fs/lustre/lmv/lmv_intent.c
+++ b/fs/lustre/lmv/lmv_intent.c
@@ -83,7 +83,7 @@ static int lmv_intent_remote(struct obd_export *exp, struct lookup_intent *it,
 
 	LASSERT(fid_is_sane(&body->mbo_fid1));
 
-	tgt = lmv_find_target(lmv, &body->mbo_fid1);
+	tgt = lmv_fid2tgt(lmv, &body->mbo_fid1);
 	if (IS_ERR(tgt)) {
 		rc = PTR_ERR(tgt);
 		goto out;
@@ -199,9 +199,9 @@ int lmv_revalidate_slaves(struct obd_export *exp,
 		op_data->op_fid1 = fid;
 		op_data->op_fid2 = fid;
 
-		tgt = lmv_get_target(lmv, lsm->lsm_md_oinfo[i].lmo_mds, NULL);
-		if (IS_ERR(tgt)) {
-			rc = PTR_ERR(tgt);
+		tgt = lmv_tgt(lmv, lsm->lsm_md_oinfo[i].lmo_mds);
+		if (!tgt) {
+			rc = -ENODEV;
 			goto cleanup;
 		}
 
@@ -349,7 +349,7 @@ static int lmv_intent_open(struct obd_export *exp, struct md_op_data *op_data,
 		if (lmv_dir_striped(op_data->op_mea1))
 			op_data->op_fid1 = op_data->op_fid2;
 
-		tgt = lmv_find_target(lmv, &op_data->op_fid2);
+		tgt = lmv_fid2tgt(lmv, &op_data->op_fid2);
 		if (IS_ERR(tgt))
 			return PTR_ERR(tgt);
 
diff --git a/fs/lustre/lmv/lmv_internal.h b/fs/lustre/lmv/lmv_internal.h
index c673656..e0c3ba0 100644
--- a/fs/lustre/lmv/lmv_internal.h
+++ b/fs/lustre/lmv/lmv_internal.h
@@ -70,57 +70,81 @@ static inline struct obd_device *lmv2obd_dev(struct lmv_obd *lmv)
 	return container_of_safe(lmv, struct obd_device, u.lmv);
 }
 
-static inline struct lmv_tgt_desc *
-lmv_get_target(struct lmv_obd *lmv, u32 mdt_idx, int *index)
+static inline struct lu_tgt_desc *
+lmv_tgt(struct lmv_obd *lmv, u32 index)
 {
-	int i;
+	return index < lmv->lmv_mdt_descs.ltd_tgts_size ?
+		LTD_TGT(&lmv->lmv_mdt_descs, index) : NULL;
+}
 
-	for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
-		if (!lmv->tgts[i])
-			continue;
+static inline bool
+lmv_mdt0_inited(struct lmv_obd *lmv)
+{
+	return lmv->lmv_mdt_descs.ltd_tgt_bitmap &&
+	       test_bit(0, lmv->lmv_mdt_descs.ltd_tgt_bitmap);
+}
 
-		if (lmv->tgts[i]->ltd_index == mdt_idx) {
-			if (index)
-				*index = i;
-			return lmv->tgts[i];
-		}
-	}
+#define lmv_foreach_tgt(lmv, tgt) ltd_foreach_tgt(&(lmv)->lmv_mdt_descs, tgt)
+
+#define lmv_foreach_tgt_safe(lmv, tgt, tmp) \
+	ltd_foreach_tgt_safe(&(lmv)->lmv_mdt_descs, tgt, tmp)
+
+static inline
+struct lu_tgt_desc *lmv_first_connected_tgt(struct lmv_obd *lmv)
+{
+	struct lu_tgt_desc *tgt;
 
-	return ERR_PTR(-ENODEV);
+	tgt = ltd_first_tgt(&lmv->lmv_mdt_descs);
+	while (tgt && !tgt->ltd_exp)
+		tgt = ltd_next_tgt(&lmv->lmv_mdt_descs, tgt);
+
+	return tgt;
 }
 
-static inline int
-lmv_find_target_index(struct lmv_obd *lmv, const struct lu_fid *fid)
+static inline
+struct lu_tgt_desc *lmv_next_connected_tgt(struct lmv_obd *lmv,
+					   struct lu_tgt_desc *tgt)
 {
-	struct lmv_tgt_desc *ltd;
-	u32 mdt_idx = 0;
-	int index = 0;
+	do {
+		tgt = ltd_next_tgt(&lmv->lmv_mdt_descs, tgt);
+	} while (tgt && !tgt->ltd_exp);
 
-	if (lmv->desc.ld_tgt_count > 1) {
-		int rc;
+	return tgt;
+}
 
-		rc = lmv_fld_lookup(lmv, fid, &mdt_idx);
-		if (rc < 0)
-			return rc;
-	}
+#define lmv_foreach_connected_tgt(lmv, tgt) \
+	for (tgt = lmv_first_connected_tgt(lmv); tgt; \
+	     tgt = lmv_next_connected_tgt(lmv, tgt))
 
-	ltd = lmv_get_target(lmv, mdt_idx, &index);
-	if (IS_ERR(ltd))
-		return PTR_ERR(ltd);
+static inline int
+lmv_fid2tgt_index(struct lmv_obd *lmv, const struct lu_fid *fid)
+{
+	u32 mdt_idx;
+	int rc;
+
+	if (lmv->desc.ld_tgt_count < 2)
+		return 0;
 
-	return index;
+	rc = lmv_fld_lookup(lmv, fid, &mdt_idx);
+	if (rc < 0)
+		return rc;
+
+	return mdt_idx;
 }
 
 static inline struct lmv_tgt_desc *
-lmv_find_target(struct lmv_obd *lmv, const struct lu_fid *fid)
+lmv_fid2tgt(struct lmv_obd *lmv, const struct lu_fid *fid)
 {
+	struct lu_tgt_desc *tgt;
 	int index;
 
-	index = lmv_find_target_index(lmv, fid);
+	index = lmv_fid2tgt_index(lmv, fid);
 	if (index < 0)
 		return ERR_PTR(index);
 
-	return lmv->tgts[index];
+	tgt = lmv_tgt(lmv, index);
+
+	return tgt ? tgt : ERR_PTR(-ENODEV);
 }
 
 static inline int lmv_stripe_md_size(int stripe_count)
diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index 26021bb..8d682b4 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -78,28 +78,24 @@ void lmv_activate_target(struct lmv_obd *lmv, struct lmv_tgt_desc *tgt,
 static int lmv_set_mdc_active(struct lmv_obd *lmv, const struct obd_uuid *uuid,
 			      int activate)
 {
-	struct lmv_tgt_desc *tgt = NULL;
+	struct lu_tgt_desc *tgt = NULL;
 	struct obd_device *obd;
-	u32 i;
 	int rc = 0;
 
 	CDEBUG(D_INFO, "Searching in lmv %p for uuid %s (activate=%d)\n",
 	       lmv, uuid->uuid, activate);
 
 	spin_lock(&lmv->lmv_lock);
-	for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
-		tgt = lmv->tgts[i];
-		if (!tgt || !tgt->ltd_exp)
-			continue;
-
-		CDEBUG(D_INFO, "Target idx %d is %s conn %#llx\n", i,
-		       tgt->ltd_uuid.uuid, tgt->ltd_exp->exp_handle.h_cookie);
+	lmv_foreach_connected_tgt(lmv, tgt) {
+		CDEBUG(D_INFO, "Target idx %d is %s conn %#llx\n",
+		       tgt->ltd_index, tgt->ltd_uuid.uuid,
+		       tgt->ltd_exp->exp_handle.h_cookie);
 
 		if (obd_uuid_equals(uuid, &tgt->ltd_uuid))
 			break;
 	}
 
-	if (i == lmv->desc.ld_tgt_count) {
+	if (!tgt) {
 		rc = -EINVAL;
 		goto out_lmv_lock;
 	}
@@ -112,7 +108,7 @@ static int lmv_set_mdc_active(struct lmv_obd *lmv, const struct obd_uuid *uuid,
 
 	CDEBUG(D_INFO, "Found OBD %s=%s device %d (%p) type %s at LMV idx %d\n",
 	       obd->obd_name, obd->obd_uuid.uuid, obd->obd_minor, obd,
-	       obd->obd_type->typ_name, i);
+	       obd->obd_type->typ_name, tgt->ltd_index);
 	LASSERT(strcmp(obd->obd_type->typ_name, LUSTRE_MDC_NAME) == 0);
 
 	if (tgt->ltd_active == activate) {
@@ -133,7 +129,7 @@ static int lmv_set_mdc_active(struct lmv_obd *lmv, const struct obd_uuid *uuid,
 static struct obd_uuid *lmv_get_uuid(struct obd_export *exp)
 {
 	struct lmv_obd *lmv = &exp->exp_obd->u.lmv;
-	struct lmv_tgt_desc *tgt = lmv->tgts[0];
+	struct lmv_tgt_desc *tgt = lmv_tgt(lmv, 0);
 
 	return tgt ? obd_get_uuid(tgt->ltd_exp) : NULL;
 }
@@ -235,9 +231,9 @@ static int lmv_init_ea_size(struct obd_export *exp, u32 easize, u32 def_easize)
 {
 	struct obd_device *obd = exp->exp_obd;
 	struct lmv_obd *lmv = &obd->u.lmv;
-	u32 i;
-	int rc = 0;
+	struct lmv_tgt_desc *tgt;
 	int change = 0;
+	int rc = 0;
 
 	if (lmv->max_easize < easize) {
 		lmv->max_easize = easize;
@@ -254,20 +250,14 @@ static int lmv_init_ea_size(struct obd_export *exp, u32 easize, u32 def_easize)
 	if (lmv->connected == 0)
 		return 0;
 
-	for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
-		struct lmv_tgt_desc *tgt = lmv->tgts[i];
-
-		if (!tgt || !tgt->ltd_exp) {
-			CWARN("%s: NULL export for %d\n", obd->obd_name, i);
-			continue;
-		}
+	lmv_foreach_connected_tgt(lmv, tgt) {
 		if (!tgt->ltd_active)
 			continue;
 
 		rc = md_init_ea_size(tgt->ltd_exp, easize, def_easize);
 		if (rc) {
 			CERROR("%s: obd_init_ea_size() failed on MDT target %d: rc = %d\n",
-			       obd->obd_name, i, rc);
+			       obd->obd_name, tgt->ltd_index, rc);
 			break;
 		}
 	}
@@ -364,15 +354,12 @@ static int lmv_connect_mdc(struct obd_device *obd, struct lmv_tgt_desc *tgt)
 	return 0;
 }
 
-static void lmv_del_target(struct lmv_obd *lmv, int index)
+static void lmv_del_target(struct lmv_obd *lmv, struct lu_tgt_desc *tgt)
 {
-	if (!lmv->tgts[index])
-		return;
-
-	lqos_del_tgt(&lmv->lmv_qos, lmv->tgts[index]);
-
-	kfree(lmv->tgts[index]);
-	lmv->tgts[index] = NULL;
+	LASSERT(tgt);
+	lqos_del_tgt(&lmv->lmv_qos, tgt);
+	lu_tgt_descs_del(&lmv->lmv_mdt_descs, tgt);
+	kfree(tgt);
 }
 
 static int lmv_add_target(struct obd_device *obd, struct obd_uuid *uuidp,
@@ -381,6 +368,7 @@ static int lmv_add_target(struct obd_device *obd, struct obd_uuid *uuidp,
 	struct lmv_obd *lmv = &obd->u.lmv;
 	struct obd_device *mdc_obd;
 	struct lmv_tgt_desc *tgt;
+	struct lu_tgt_descs *ltd = &lmv->lmv_mdt_descs;
 	int orig_tgt_count = 0;
 	int rc = 0;
 
@@ -394,78 +382,36 @@ static int lmv_add_target(struct obd_device *obd, struct obd_uuid *uuidp,
 		return -EINVAL;
 	}
 
-	mutex_lock(&lmv->lmv_init_mutex);
-
-	if ((index < lmv->tgts_size) && lmv->tgts[index]) {
-		tgt = lmv->tgts[index];
-		CERROR("%s: UUID %s already assigned at LMV target index %d: rc = %d\n",
-		       obd->obd_name,
-		       obd_uuid2str(&tgt->ltd_uuid), index, -EEXIST);
-		mutex_unlock(&lmv->lmv_init_mutex);
-		return -EEXIST;
-	}
-
-	if (index >= lmv->tgts_size) {
-		/* We need to reallocate the lmv target array. */
-		struct lmv_tgt_desc **newtgts, **old = NULL;
-		u32 newsize = 1;
-		u32 oldsize = 0;
-
-		while (newsize < index + 1)
-			newsize <<= 1;
-		newtgts = kcalloc(newsize, sizeof(*newtgts), GFP_NOFS);
-		if (!newtgts) {
-			mutex_unlock(&lmv->lmv_init_mutex);
-			return -ENOMEM;
-		}
-
-		if (lmv->tgts_size) {
-			memcpy(newtgts, lmv->tgts,
-			       sizeof(*newtgts) * lmv->tgts_size);
-			old = lmv->tgts;
-			oldsize = lmv->tgts_size;
-		}
-
-		lmv->tgts = newtgts;
-		lmv->tgts_size = newsize;
-		smp_rmb();
-		kfree(old);
-
-		CDEBUG(D_CONFIG, "tgts: %p size: %d\n", lmv->tgts,
-		       lmv->tgts_size);
-	}
-
 	tgt = kzalloc(sizeof(*tgt), GFP_NOFS);
-	if (!tgt) {
-		mutex_unlock(&lmv->lmv_init_mutex);
+	if (!tgt)
 		return -ENOMEM;
-	}
 
 	mutex_init(&tgt->ltd_fid_mutex);
 	tgt->ltd_index = index;
 	tgt->ltd_uuid = *uuidp;
 	tgt->ltd_active = 0;
-	lmv->tgts[index] = tgt;
-	if (index >= lmv->desc.ld_tgt_count) {
+
+	mutex_lock(&ltd->ltd_mutex);
+	rc = lu_tgt_descs_add(ltd, tgt);
+	if (!rc && index >= lmv->desc.ld_tgt_count) {
 		orig_tgt_count = lmv->desc.ld_tgt_count;
 		lmv->desc.ld_tgt_count = index + 1;
 	}
+	mutex_unlock(&ltd->ltd_mutex);
 
-	if (!lmv->connected) {
+	if (rc)
+		goto out_tgt;
+
+	if (!lmv->connected)
 		/* lmv_check_connect() will connect this target. */
-		mutex_unlock(&lmv->lmv_init_mutex);
 		return rc;
-	}
 
-	/* Otherwise let's connect it ourselves */
-	mutex_unlock(&lmv->lmv_init_mutex);
 	rc = lmv_connect_mdc(obd, tgt);
 	if (rc) {
-		spin_lock(&lmv->lmv_lock);
-		if (lmv->desc.ld_tgt_count == index + 1)
-			lmv->desc.ld_tgt_count = orig_tgt_count;
+		mutex_lock(&ltd->ltd_mutex);
+		lmv->desc.ld_tgt_count = orig_tgt_count;
 		memset(tgt, 0, sizeof(*tgt));
-		spin_unlock(&lmv->lmv_lock);
+		mutex_unlock(&ltd->ltd_mutex);
 	} else {
 		int easize = sizeof(struct lmv_stripe_md) +
 			     lmv->desc.ld_tgt_count * sizeof(struct lu_fid);
@@ -473,47 +419,46 @@ static int lmv_add_target(struct obd_device *obd, struct obd_uuid *uuidp,
 	}
 
 	return rc;
+
+out_tgt:
+	kfree(tgt);
+	return rc;
 }
 
 static int lmv_check_connect(struct obd_device *obd)
 {
 	struct lmv_obd *lmv = &obd->u.lmv;
 	struct lmv_tgt_desc *tgt;
-	u32 i;
-	int rc;
 	int easize;
+	int rc;
 
 	if (lmv->connected)
 		return 0;
 
-	mutex_lock(&lmv->lmv_init_mutex);
+	mutex_lock(&lmv->lmv_mdt_descs.ltd_mutex);
 	if (lmv->connected) {
-		mutex_unlock(&lmv->lmv_init_mutex);
-		return 0;
+		rc = 0;
+		goto unlock;
 	}
 
 	if (lmv->desc.ld_tgt_count == 0) {
-		mutex_unlock(&lmv->lmv_init_mutex);
-		CERROR("%s: no targets configured.\n", obd->obd_name);
-		return -EINVAL;
+		CERROR("%s: no targets configured: rc = -EINVAL\n",
+		       obd->obd_name);
+		rc = -EINVAL;
+		goto unlock;
 	}
 
-	LASSERT(lmv->tgts);
-
-	if (!lmv->tgts[0]) {
-		mutex_unlock(&lmv->lmv_init_mutex);
-		CERROR("%s: no target configured for index 0.\n",
+	if (!lmv_mdt0_inited(lmv)) {
+		CERROR("%s: no target configured for index 0: rc = -EINVAL.\n",
 		       obd->obd_name);
-		return -EINVAL;
+		rc = -EINVAL;
+		goto unlock;
 	}
 
 	CDEBUG(D_CONFIG, "Time to connect %s to %s\n",
 	       obd->obd_uuid.uuid, obd->obd_name);
 
-	for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
-		tgt = lmv->tgts[i];
-		if (!tgt)
-			continue;
+	lmv_foreach_tgt(lmv, tgt) {
 		rc = lmv_connect_mdc(obd, tgt);
 		if (rc)
 			goto out_disc;
@@ -522,29 +467,22 @@ static int lmv_check_connect(struct obd_device *obd)
 	lmv->connected = 1;
 	easize = lmv_mds_md_size(lmv->desc.ld_tgt_count, LMV_MAGIC);
 	lmv_init_ea_size(obd->obd_self_export, easize, 0);
-	mutex_unlock(&lmv->lmv_init_mutex);
-	return 0;
+unlock:
+	mutex_unlock(&lmv->lmv_mdt_descs.ltd_mutex);
 
-out_disc:
-	while (i-- > 0) {
-		int rc2;
+	return rc;
 
-		tgt = lmv->tgts[i];
-		if (!tgt)
-			continue;
+out_disc:
+	lmv_foreach_tgt(lmv, tgt) {
 		tgt->ltd_active = 0;
-		if (tgt->ltd_exp) {
-			--lmv->desc.ld_active_tgt_count;
-			rc2 = obd_disconnect(tgt->ltd_exp);
-			if (rc2) {
-				CERROR("LMV target %s disconnect on MDC idx %d: error %d\n",
-				       tgt->ltd_uuid.uuid, i, rc2);
-			}
-		}
+		if (!tgt->ltd_exp)
+			continue;
+
+		--lmv->desc.ld_active_tgt_count;
+		obd_disconnect(tgt->ltd_exp);
 	}
 
-	mutex_unlock(&lmv->lmv_init_mutex);
-	return rc;
+	goto unlock;
 }
 
 static int lmv_disconnect_mdc(struct obd_device *obd, struct lmv_tgt_desc *tgt)
@@ -591,27 +529,15 @@ static int lmv_disconnect(struct obd_export *exp)
 {
 	struct obd_device *obd = class_exp2obd(exp);
 	struct lmv_obd *lmv = &obd->u.lmv;
+	struct lmv_tgt_desc *tgt;
 	int rc;
-	u32 i;
 
-	if (!lmv->tgts)
-		goto out_local;
-
-	for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
-		if (!lmv->tgts[i] || !lmv->tgts[i]->ltd_exp)
-			continue;
-
-		lmv_disconnect_mdc(obd, lmv->tgts[i]);
-	}
+	lmv_foreach_connected_tgt(lmv, tgt)
+		lmv_disconnect_mdc(obd, tgt);
 
 	if (lmv->lmv_tgts_kobj)
 		kobject_put(lmv->lmv_tgts_kobj);
 
-out_local:
-	/*
-	 * This is the case when no real connection is established by
-	 * lmv_check_connect().
-	 */
 	if (!lmv->connected)
 		class_export_put(exp);
 	rc = class_disconnect(exp);
@@ -631,7 +557,7 @@ static int lmv_fid2path(struct obd_export *exp, int len, void *karg,
 	int remote_gf_size = 0;
 	int rc;
 
-	tgt = lmv_find_target(lmv, &gf->gf_fid);
+	tgt = lmv_fid2tgt(lmv, &gf->gf_fid);
 	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
@@ -696,7 +622,7 @@ static int lmv_fid2path(struct obd_export *exp, int len, void *karg,
 		goto out_fid2path;
 	}
 
-	tgt = lmv_find_target(lmv, &gf->gf_fid);
+	tgt = lmv_fid2tgt(lmv, &gf->gf_fid);
 	if (IS_ERR(tgt)) {
 		rc = -EINVAL;
 		goto out_fid2path;
@@ -719,12 +645,13 @@ static int lmv_hsm_req_count(struct lmv_obd *lmv,
 			     const struct hsm_user_request *hur,
 			     const struct lmv_tgt_desc *tgt_mds)
 {
-	u32 i, nr = 0;
 	struct lmv_tgt_desc *curr_tgt;
+	u32 i;
+	int nr = 0;
 
 	/* count how many requests must be sent to the given target */
 	for (i = 0; i < hur->hur_request.hr_itemcount; i++) {
-		curr_tgt = lmv_find_target(lmv, &hur->hur_user_item[i].hui_fid);
+		curr_tgt = lmv_fid2tgt(lmv, &hur->hur_user_item[i].hui_fid);
 		if (IS_ERR(curr_tgt))
 			return PTR_ERR(curr_tgt);
 		if (obd_uuid_equals(&curr_tgt->ltd_uuid, &tgt_mds->ltd_uuid))
@@ -736,17 +663,16 @@ static int lmv_hsm_req_count(struct lmv_obd *lmv,
 static int lmv_hsm_req_build(struct lmv_obd *lmv,
 			     struct hsm_user_request *hur_in,
 			     const struct lmv_tgt_desc *tgt_mds,
-			     struct hsm_user_request *hur_out)
+			      struct hsm_user_request *hur_out)
 {
-	int i, nr_out;
+	u32 i, nr_out;
 	struct lmv_tgt_desc *curr_tgt;
 
 	/* build the hsm_user_request for the given target */
 	hur_out->hur_request = hur_in->hur_request;
 	nr_out = 0;
 	for (i = 0; i < hur_in->hur_request.hr_itemcount; i++) {
-		curr_tgt = lmv_find_target(lmv,
-					   &hur_in->hur_user_item[i].hui_fid);
+		curr_tgt = lmv_fid2tgt(lmv, &hur_in->hur_user_item[i].hui_fid);
 		if (IS_ERR(curr_tgt))
 			return PTR_ERR(curr_tgt);
 		if (obd_uuid_equals(&curr_tgt->ltd_uuid, &tgt_mds->ltd_uuid)) {
@@ -767,20 +693,14 @@ static int lmv_hsm_ct_unregister(struct obd_device *obd, unsigned int cmd,
 				 void __user *uarg)
 {
 	struct lmv_obd *lmv = &obd->u.lmv;
-	u32 i;
+	struct lu_tgt_desc *tgt;
 
 	/* unregister request (call from llapi_hsm_copytool_fini) */
-	for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
-		struct lmv_tgt_desc *tgt = lmv->tgts[i];
-
-		if (!tgt || !tgt->ltd_exp)
-			continue;
-
+	lmv_foreach_connected_tgt(lmv, tgt)
 		/* best effort: try to clean as much as possible
 		 * (continue on error)
 		 */
-		obd_iocontrol(cmd, lmv->tgts[i]->ltd_exp, len, lk, uarg);
-	}
+		obd_iocontrol(cmd, tgt->ltd_exp, len, lk, uarg);
 
 	/* Whatever the result, remove copytool from kuc groups.
 	 * Unreached coordinators will get EPIPE on next requests
@@ -795,11 +715,12 @@ static int lmv_hsm_ct_register(struct obd_device *obd, unsigned int cmd,
 {
 	struct lmv_obd *lmv = &obd->u.lmv;
 	struct file *filp;
-	u32 i, j;
-	int err;
 	bool any_set = false;
 	struct kkuc_ct_data *kcd;
 	size_t kcd_size;
+	struct lu_tgt_desc *tgt;
+	u32 i;
+	int err;
 	int rc = 0;
 
 	filp = fget(lk->lk_wfd);
@@ -838,26 +759,22 @@ static int lmv_hsm_ct_register(struct obd_device *obd, unsigned int cmd,
 	 * In case of failure, unregister from previous MDS,
 	 * except if it because of inactive target.
 	 */
-	for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
-		struct lmv_tgt_desc *tgt = lmv->tgts[i];
-
-		if (!tgt || !tgt->ltd_exp)
-			continue;
-
+	lmv_foreach_connected_tgt(lmv, tgt) {
 		err = obd_iocontrol(cmd, tgt->ltd_exp, len, lk, uarg);
 		if (err) {
 			if (tgt->ltd_active) {
 				/* permanent error */
 				CERROR("error: iocontrol MDC %s on MDTidx %d cmd %x: err = %d\n",
-				       tgt->ltd_uuid.uuid, i, cmd, err);
+				       tgt->ltd_uuid.uuid, tgt->ltd_index, cmd,
+				       err);
 				rc = err;
 				lk->lk_flags |= LK_FLG_STOP;
+				i = tgt->ltd_index;
 				/* unregister from previous MDS */
-				for (j = 0; j < i; j++) {
-					tgt = lmv->tgts[j];
+				lmv_foreach_connected_tgt(lmv, tgt) {
+					if (tgt->ltd_index >= i)
+						break;
 
-					if (!tgt || !tgt->ltd_exp)
-						continue;
 					obd_iocontrol(cmd, tgt->ltd_exp, len,
 						      lk, uarg);
 				}
@@ -891,11 +808,10 @@ static int lmv_iocontrol(unsigned int cmd, struct obd_export *exp,
 {
 	struct obd_device *obddev = class_exp2obd(exp);
 	struct lmv_obd *lmv = &obddev->u.lmv;
-	struct lmv_tgt_desc *tgt = NULL;
-	u32 i = 0;
-	int rc = 0;
+	struct lu_tgt_desc *tgt = NULL;
 	int set = 0;
 	u32 count = lmv->desc.ld_tgt_count;
+	int rc = 0;
 
 	if (count == 0)
 		return -ENOTTY;
@@ -911,7 +827,7 @@ static int lmv_iocontrol(unsigned int cmd, struct obd_export *exp,
 		if (index >= count)
 			return -ENODEV;
 
-		tgt = lmv->tgts[index];
+		tgt = lmv_tgt(lmv, index);
 		if (!tgt || !tgt->ltd_active)
 			return -ENODATA;
 
@@ -944,14 +860,11 @@ static int lmv_iocontrol(unsigned int cmd, struct obd_export *exp,
 			if (count <= qctl->qc_idx)
 				return -EINVAL;
 
-			tgt = lmv->tgts[qctl->qc_idx];
+			tgt = lmv_tgt(lmv, qctl->qc_idx);
 			if (!tgt || !tgt->ltd_exp)
 				return -EINVAL;
 		} else if (qctl->qc_valid == QC_UUID) {
-			for (i = 0; i < count; i++) {
-				tgt = lmv->tgts[i];
-				if (!tgt)
-					continue;
+			lmv_foreach_tgt(lmv, tgt) {
 				if (!obd_uuid_equals(&tgt->ltd_uuid,
 						     &qctl->obd_uuid))
 					continue;
@@ -965,11 +878,11 @@ static int lmv_iocontrol(unsigned int cmd, struct obd_export *exp,
 			return -EINVAL;
 		}
 
-		if (i >= count)
+		if (tgt->ltd_index >= count)
 			return -EAGAIN;
 
 		LASSERT(tgt && tgt->ltd_exp);
-		oqctl = kzalloc(sizeof(*oqctl), GFP_NOFS);
+		oqctl = kzalloc(sizeof(*oqctl), GFP_KERNEL);
 		if (!oqctl)
 			return -ENOMEM;
 
@@ -984,11 +897,11 @@ static int lmv_iocontrol(unsigned int cmd, struct obd_export *exp,
 		break;
 	}
 	case LL_IOC_GET_CONNECT_FLAGS: {
-		tgt = lmv->tgts[0];
+		tgt = lmv_tgt(lmv, 0);
+		rc = -ENODATA;
 
-		if (!tgt || !tgt->ltd_exp)
-			return -ENODATA;
-		rc = obd_iocontrol(cmd, tgt->ltd_exp, len, karg, uarg);
+		if (tgt && tgt->ltd_exp)
+			rc = obd_iocontrol(cmd, tgt->ltd_exp, len, karg, uarg);
 		break;
 	}
 	case LL_IOC_FID2MDTIDX: {
@@ -1015,7 +928,7 @@ static int lmv_iocontrol(unsigned int cmd, struct obd_export *exp,
 	case LL_IOC_HSM_ACTION: {
 		struct md_op_data *op_data = karg;
 
-		tgt = lmv_find_target(lmv, &op_data->op_fid1);
+		tgt = lmv_fid2tgt(lmv, &op_data->op_fid1);
 		if (IS_ERR(tgt))
 			return PTR_ERR(tgt);
 
@@ -1028,7 +941,7 @@ static int lmv_iocontrol(unsigned int cmd, struct obd_export *exp,
 	case LL_IOC_HSM_PROGRESS: {
 		const struct hsm_progress_kernel *hpk = karg;
 
-		tgt = lmv_find_target(lmv, &hpk->hpk_fid);
+		tgt = lmv_fid2tgt(lmv, &hpk->hpk_fid);
 		if (IS_ERR(tgt))
 			return PTR_ERR(tgt);
 		rc = obd_iocontrol(cmd, tgt->ltd_exp, len, karg, uarg);
@@ -1046,22 +959,17 @@ static int lmv_iocontrol(unsigned int cmd, struct obd_export *exp,
 		 * the request.
 		 */
 		if (reqcount == 1 || count == 1) {
-			tgt = lmv_find_target(lmv,
-					      &hur->hur_user_item[0].hui_fid);
+			tgt = lmv_fid2tgt(lmv, &hur->hur_user_item[0].hui_fid);
 			if (IS_ERR(tgt))
 				return PTR_ERR(tgt);
 			rc = obd_iocontrol(cmd, tgt->ltd_exp, len, karg, uarg);
 		} else {
 			/* split fid list to their respective MDS */
-			for (i = 0; i < count; i++) {
+			lmv_foreach_connected_tgt(lmv, tgt) {
 				struct hsm_user_request *req;
 				size_t reqlen;
 				int nr, rc1;
 
-				tgt = lmv->tgts[i];
-				if (!tgt || !tgt->ltd_exp)
-					continue;
-
 				nr = lmv_hsm_req_count(lmv, hur, tgt);
 				if (nr < 0)
 					return nr;
@@ -1094,11 +1002,11 @@ static int lmv_iocontrol(unsigned int cmd, struct obd_export *exp,
 		struct md_op_data *op_data = karg;
 		struct lmv_tgt_desc *tgt1, *tgt2;
 
-		tgt1 = lmv_find_target(lmv, &op_data->op_fid1);
+		tgt1 = lmv_fid2tgt(lmv, &op_data->op_fid1);
 		if (IS_ERR(tgt1))
 			return PTR_ERR(tgt1);
 
-		tgt2 = lmv_find_target(lmv, &op_data->op_fid2);
+		tgt2 = lmv_fid2tgt(lmv, &op_data->op_fid2);
 		if (IS_ERR(tgt2))
 			return PTR_ERR(tgt2);
 
@@ -1122,13 +1030,10 @@ static int lmv_iocontrol(unsigned int cmd, struct obd_export *exp,
 		break;
 	}
 	default:
-		for (i = 0; i < count; i++) {
+		lmv_foreach_connected_tgt(lmv, tgt) {
 			struct obd_device *mdc_obd;
 			int err;
 
-			tgt = lmv->tgts[i];
-			if (!tgt || !tgt->ltd_exp)
-				continue;
 			/* ll_umount_begin() sets force flag but for lmv, not
 			 * mdc. Let's pass it through
 			 */
@@ -1139,7 +1044,8 @@ static int lmv_iocontrol(unsigned int cmd, struct obd_export *exp,
 				if (tgt->ltd_active) {
 					CERROR("%s: error: iocontrol MDC %s on MDTidx %d cmd %x: err = %d\n",
 					       lmv2obd_dev(lmv)->obd_name,
-					       tgt->ltd_uuid.uuid, i, cmd, err);
+					       tgt->ltd_uuid.uuid,
+					       tgt->ltd_index, cmd, err);
 					if (!rc)
 						rc = err;
 				}
@@ -1207,9 +1113,9 @@ int __lmv_fid_alloc(struct lmv_obd *lmv, struct lu_fid *fid, u32 mds)
 	struct lmv_tgt_desc *tgt;
 	int rc;
 
-	tgt = lmv_get_target(lmv, mds, NULL);
-	if (IS_ERR(tgt))
-		return PTR_ERR(tgt);
+	tgt = lmv_tgt(lmv, mds);
+	if (!tgt)
+		return -ENODEV;
 
 	/*
 	 * New seq alloc and FLD setup should be atomic. Otherwise we may find
@@ -1276,11 +1182,6 @@ static int lmv_setup(struct obd_device *obd, struct lustre_cfg *lcfg)
 		return -EINVAL;
 	}
 
-	lmv->tgts_size = 32U;
-	lmv->tgts = kcalloc(lmv->tgts_size, sizeof(*lmv->tgts), GFP_NOFS);
-	if (!lmv->tgts)
-		return -ENOMEM;
-
 	obd_str2uuid(&lmv->desc.ld_uuid, desc->ld_uuid.uuid);
 	lmv->desc.ld_tgt_count = 0;
 	lmv->desc.ld_active_tgt_count = 0;
@@ -1289,7 +1190,6 @@ static int lmv_setup(struct obd_device *obd, struct lustre_cfg *lcfg)
 	lmv->max_easize = 0;
 
 	spin_lock_init(&lmv->lmv_lock);
-	mutex_init(&lmv->lmv_init_mutex);
 
 	/* Set up allocation policy (QoS and RR) */
 	INIT_LIST_HEAD(&lmv->lmv_qos.lq_svr_list);
@@ -1321,30 +1221,28 @@ static int lmv_setup(struct obd_device *obd, struct lustre_cfg *lcfg)
 
 	rc = fld_client_init(&lmv->lmv_fld, obd->obd_name,
 			     LUSTRE_CLI_FLD_HASH_DHT);
-	if (rc) {
+	if (rc)
 		CERROR("Can't init FLD, err %d\n", rc);
-		return rc;
-	}
 
-	return 0;
+	rc = lu_tgt_descs_init(&lmv->lmv_mdt_descs);
+	if (rc)
+		CWARN("%s: error initialize target table: rc = %d\n",
+		      obd->obd_name, rc);
+
+	return rc;
 }
 
 static int lmv_cleanup(struct obd_device *obd)
 {
 	struct lmv_obd *lmv = &obd->u.lmv;
+	struct lu_tgt_desc *tgt;
+	struct lu_tgt_desc *tmp;
 
 	fld_client_fini(&lmv->lmv_fld);
-	if (lmv->tgts) {
-		int i;
+	lmv_foreach_tgt_safe(lmv, tgt, tmp)
+		lmv_del_target(lmv, tgt);
+	lu_tgt_descs_fini(&lmv->lmv_mdt_descs);
 
-		for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
-			if (!lmv->tgts[i])
-				continue;
-			lmv_del_target(lmv, i);
-		}
-		kfree(lmv->tgts);
-		lmv->tgts_size = 0;
-	}
 	return 0;
 }
 
@@ -1423,8 +1321,10 @@ static int lmv_statfs(const struct lu_env *env, struct obd_export *exp,
 	struct obd_device *obd = class_exp2obd(exp);
 	struct lmv_obd *lmv = &obd->u.lmv;
 	struct obd_statfs *temp;
+	struct lu_tgt_desc *tgt;
+	u32 i;
+	u32 idx;
 	int rc = 0;
-	u32 i, idx;
 
 	temp = kzalloc(sizeof(*temp), GFP_NOFS);
 	if (!temp)
@@ -1435,15 +1335,14 @@ static int lmv_statfs(const struct lu_env *env, struct obd_export *exp,
 
 	for (i = 0; i < lmv->desc.ld_tgt_count; i++, idx++) {
 		idx = idx % lmv->desc.ld_tgt_count;
-		if (!lmv->tgts[idx] || !lmv->tgts[idx]->ltd_exp)
+		tgt = lmv_tgt(lmv, idx);
+		if (!tgt || !tgt->ltd_exp)
 			continue;
 
-		rc = obd_statfs(env, lmv->tgts[idx]->ltd_exp, temp,
-				max_age, flags);
+		rc = obd_statfs(env, tgt->ltd_exp, temp, max_age, flags);
 		if (rc) {
 			CERROR("%s: can't stat MDS #%d: rc = %d\n",
-			       lmv->tgts[idx]->ltd_exp->exp_obd->obd_name, i,
-			       rc);
+			       tgt->ltd_exp->exp_obd->obd_name, i, rc);
 			goto out_free_temp;
 		}
 
@@ -1524,8 +1423,12 @@ static int lmv_get_root(struct obd_export *exp, const char *fileset,
 {
 	struct obd_device *obd = exp->exp_obd;
 	struct lmv_obd *lmv = &obd->u.lmv;
+	struct lu_tgt_desc *tgt = lmv_tgt(lmv, 0);
+
+	if (!tgt)
+		return -ENODEV;
 
-	return md_get_root(lmv->tgts[0]->ltd_exp, fileset, fid);
+	return md_get_root(tgt->ltd_exp, fileset, fid);
 }
 
 static int lmv_getxattr(struct obd_export *exp, const struct lu_fid *fid,
@@ -1536,7 +1439,7 @@ static int lmv_getxattr(struct obd_export *exp, const struct lu_fid *fid,
 	struct lmv_obd *lmv = &obd->u.lmv;
 	struct lmv_tgt_desc *tgt;
 
-	tgt = lmv_find_target(lmv, fid);
+	tgt = lmv_fid2tgt(lmv, fid);
 	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
@@ -1554,7 +1457,7 @@ static int lmv_setxattr(struct obd_export *exp, const struct lu_fid *fid,
 	struct lmv_obd *lmv = &obd->u.lmv;
 	struct lmv_tgt_desc *tgt;
 
-	tgt = lmv_find_target(lmv, fid);
+	tgt = lmv_fid2tgt(lmv, fid);
 	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
@@ -1569,7 +1472,7 @@ static int lmv_getattr(struct obd_export *exp, struct md_op_data *op_data,
 	struct lmv_obd *lmv = &obd->u.lmv;
 	struct lmv_tgt_desc *tgt;
 
-	tgt = lmv_find_target(lmv, &op_data->op_fid1);
+	tgt = lmv_fid2tgt(lmv, &op_data->op_fid1);
 	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
@@ -1585,7 +1488,7 @@ static int lmv_null_inode(struct obd_export *exp, const struct lu_fid *fid)
 {
 	struct obd_device *obd = exp->exp_obd;
 	struct lmv_obd *lmv = &obd->u.lmv;
-	u32 i;
+	struct lu_tgt_desc *tgt;
 
 	CDEBUG(D_INODE, "CBDATA for " DFID "\n", PFID(fid));
 
@@ -1594,11 +1497,8 @@ static int lmv_null_inode(struct obd_export *exp, const struct lu_fid *fid)
 	 * lookup lock in space of MDT storing direntry and update/open lock in
 	 * space of MDT storing inode.
 	 */
-	for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
-		if (!lmv->tgts[i] || !lmv->tgts[i]->ltd_exp)
-			continue;
-		md_null_inode(lmv->tgts[i]->ltd_exp, fid);
-	}
+	lmv_foreach_connected_tgt(lmv, tgt)
+		md_null_inode(tgt->ltd_exp, fid);
 
 	return 0;
 }
@@ -1610,7 +1510,7 @@ static int lmv_close(struct obd_export *exp, struct md_op_data *op_data,
 	struct lmv_obd *lmv = &obd->u.lmv;
 	struct lmv_tgt_desc *tgt;
 
-	tgt = lmv_find_target(lmv, &op_data->op_fid1);
+	tgt = lmv_fid2tgt(lmv, &op_data->op_fid1);
 	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
@@ -1627,7 +1527,7 @@ static int lmv_close(struct obd_export *exp, struct md_op_data *op_data,
 	struct lmv_tgt_desc *tgt;
 
 	if (!lmv_dir_striped(lsm) || !namelen) {
-		tgt = lmv_find_target(lmv, fid);
+		tgt = lmv_fid2tgt(lmv, fid);
 		if (IS_ERR(tgt))
 			return tgt;
 
@@ -1648,11 +1548,11 @@ static int lmv_close(struct obd_export *exp, struct md_op_data *op_data,
 
 	*fid = oinfo->lmo_fid;
 	*mds = oinfo->lmo_mds;
-	tgt = lmv_get_target(lmv, oinfo->lmo_mds, NULL);
+	tgt = lmv_tgt(lmv, oinfo->lmo_mds);
 
 	CDEBUG(D_INODE, "locate MDT %u parent " DFID "\n", *mds, PFID(fid));
 
-	return tgt;
+	return tgt ? tgt : ERR_PTR(-ENODEV);
 }
 
 /**
@@ -1690,9 +1590,9 @@ struct lmv_tgt_desc *
 	 */
 	if (op_data->op_bias & MDS_CREATE_VOLATILE &&
 	    (int)op_data->op_mds != -1) {
-		tgt = lmv_get_target(lmv, op_data->op_mds, NULL);
-		if (IS_ERR(tgt))
-			return tgt;
+		tgt = lmv_tgt(lmv, op_data->op_mds);
+		if (!tgt)
+			return ERR_PTR(-ENODEV);
 
 		if (lmv_dir_striped(lsm)) {
 			int i;
@@ -1715,7 +1615,10 @@ struct lmv_tgt_desc *
 
 		op_data->op_fid1 = oinfo->lmo_fid;
 		op_data->op_mds = oinfo->lmo_mds;
-		tgt = lmv_get_target(lmv, oinfo->lmo_mds, NULL);
+
+		tgt = lmv_tgt(lmv, oinfo->lmo_mds);
+		if (!tgt)
+			tgt = ERR_PTR(-ENODEV);
 	} else if (op_data->op_code == LUSTRE_OPC_MKDIR &&
 		   lmv_dir_qos_mkdir(op_data->op_default_mea1) &&
 		   !lmv_dir_striped(lsm)) {
@@ -1847,7 +1750,7 @@ int lmv_create(struct obd_export *exp, struct md_op_data *op_data,
 		 * Send the create request to the MDT where the object
 		 * will be located
 		 */
-		tgt = lmv_find_target(lmv, &op_data->op_fid2);
+		tgt = lmv_fid2tgt(lmv, &op_data->op_fid2);
 		if (IS_ERR(tgt))
 			return PTR_ERR(tgt);
 
@@ -1881,7 +1784,7 @@ int lmv_create(struct obd_export *exp, struct md_op_data *op_data,
 
 	CDEBUG(D_INODE, "ENQUEUE on " DFID "\n", PFID(&op_data->op_fid1));
 
-	tgt = lmv_find_target(lmv, &op_data->op_fid1);
+	tgt = lmv_fid2tgt(lmv, &op_data->op_fid1);
 	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
@@ -1958,7 +1861,7 @@ static int lmv_early_cancel(struct obd_export *exp, struct lmv_tgt_desc *tgt,
 		return 0;
 
 	if (!tgt) {
-		tgt = lmv_find_target(lmv, fid);
+		tgt = lmv_fid2tgt(lmv, fid);
 		if (IS_ERR(tgt))
 			return PTR_ERR(tgt);
 	}
@@ -2041,7 +1944,7 @@ static int lmv_migrate(struct obd_export *exp, struct md_op_data *op_data,
 	op_data->op_fsgid = from_kgid(&init_user_ns, current_fsgid());
 	op_data->op_cap = current_cap();
 
-	parent_tgt = lmv_find_target(lmv, &op_data->op_fid1);
+	parent_tgt = lmv_fid2tgt(lmv, &op_data->op_fid1);
 	if (IS_ERR(parent_tgt))
 		return PTR_ERR(parent_tgt);
 
@@ -2068,10 +1971,9 @@ static int lmv_migrate(struct obd_export *exp, struct md_op_data *op_data,
 
 		/* save it in fid4 temporarily for early cancel */
 		op_data->op_fid4 = lsm->lsm_md_oinfo[rc].lmo_fid;
-		sp_tgt = lmv_get_target(lmv, lsm->lsm_md_oinfo[rc].lmo_mds,
-					NULL);
-		if (IS_ERR(sp_tgt))
-			return PTR_ERR(sp_tgt);
+		sp_tgt = lmv_tgt(lmv, lsm->lsm_md_oinfo[rc].lmo_mds);
+		if (!sp_tgt)
+			return -ENODEV;
 
 		/*
 		 * if parent is being migrated too, fill op_fid2 with target
@@ -2088,17 +1990,15 @@ static int lmv_migrate(struct obd_export *exp, struct md_op_data *op_data,
 				return rc;
 
 			op_data->op_fid2 = lsm->lsm_md_oinfo[rc].lmo_fid;
-			tp_tgt = lmv_get_target(lmv,
-						lsm->lsm_md_oinfo[rc].lmo_mds,
-						NULL);
-			if (IS_ERR(tp_tgt))
-				return PTR_ERR(tp_tgt);
+			tp_tgt = lmv_tgt(lmv, lsm->lsm_md_oinfo[rc].lmo_mds);
+			if (!tp_tgt)
+				return -ENODEV;
 		}
 	} else {
 		sp_tgt = parent_tgt;
 	}
 
-	child_tgt = lmv_find_target(lmv, &op_data->op_fid3);
+	child_tgt = lmv_fid2tgt(lmv, &op_data->op_fid3);
 	if (IS_ERR(child_tgt))
 		return PTR_ERR(child_tgt);
 
@@ -2121,7 +2021,7 @@ static int lmv_migrate(struct obd_export *exp, struct md_op_data *op_data,
 	 */
 	if (S_ISDIR(op_data->op_mode) &&
 	    (exp_connect_flags2(exp) & OBD_CONNECT2_DIR_MIGRATE)) {
-		tgt = lmv_find_target(lmv, &target_fid);
+		tgt = lmv_fid2tgt(lmv, &target_fid);
 		if (IS_ERR(tgt))
 			return PTR_ERR(tgt);
 	} else {
@@ -2219,7 +2119,7 @@ static int lmv_rename(struct obd_export *exp, struct md_op_data *op_data,
 	 * then it will send the request to the target parent
 	 */
 	if (fid_is_sane(&op_data->op_fid4)) {
-		tgt = lmv_find_target(lmv, &op_data->op_fid4);
+		tgt = lmv_fid2tgt(lmv, &op_data->op_fid4);
 		if (IS_ERR(tgt))
 			return PTR_ERR(tgt);
 	} else {
@@ -2247,7 +2147,7 @@ static int lmv_rename(struct obd_export *exp, struct md_op_data *op_data,
 	}
 
 	if (fid_is_sane(&op_data->op_fid3)) {
-		src_tgt = lmv_find_target(lmv, &op_data->op_fid3);
+		src_tgt = lmv_fid2tgt(lmv, &op_data->op_fid3);
 		if (IS_ERR(src_tgt))
 			return PTR_ERR(src_tgt);
 
@@ -2313,7 +2213,7 @@ static int lmv_rename(struct obd_export *exp, struct md_op_data *op_data,
 	ptlrpc_req_finished(*request);
 	*request = NULL;
 
-	tgt = lmv_find_target(lmv, &op_data->op_fid4);
+	tgt = lmv_fid2tgt(lmv, &op_data->op_fid4);
 	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
@@ -2344,7 +2244,7 @@ static int lmv_setattr(struct obd_export *exp, struct md_op_data *op_data,
 		op_data->op_xvalid);
 
 	op_data->op_flags |= MF_MDC_CANCEL_FID1;
-	tgt = lmv_find_target(lmv, &op_data->op_fid1);
+	tgt = lmv_fid2tgt(lmv, &op_data->op_fid1);
 	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
@@ -2358,7 +2258,7 @@ static int lmv_fsync(struct obd_export *exp, const struct lu_fid *fid,
 	struct lmv_obd *lmv = &obd->u.lmv;
 	struct lmv_tgt_desc *tgt;
 
-	tgt = lmv_find_target(lmv, fid);
+	tgt = lmv_fid2tgt(lmv, fid);
 	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
@@ -2465,9 +2365,9 @@ static struct lu_dirent *stripe_dirent_load(struct lmv_dir_ctxt *ctxt,
 			break;
 		}
 
-		tgt = lmv_get_target(ctxt->ldc_lmv, oinfo->lmo_mds, NULL);
-		if (IS_ERR(tgt)) {
-			rc = PTR_ERR(tgt);
+		tgt = lmv_tgt(ctxt->ldc_lmv, oinfo->lmo_mds);
+		if (!tgt) {
+			rc = -ENODEV;
 			break;
 		}
 
@@ -2516,7 +2416,7 @@ static int lmv_file_resync(struct obd_export *exp, struct md_op_data *data)
 	if (rc != 0)
 		return rc;
 
-	tgt = lmv_find_target(lmv, &data->op_fid1);
+	tgt = lmv_fid2tgt(lmv, &data->op_fid1);
 	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
@@ -2741,7 +2641,7 @@ static int lmv_read_page(struct obd_export *exp, struct md_op_data *op_data,
 					     offset, ppage);
 	}
 
-	tgt = lmv_find_target(lmv, &op_data->op_fid1);
+	tgt = lmv_fid2tgt(lmv, &op_data->op_fid1);
 	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
@@ -2792,7 +2692,7 @@ static int lmv_unlink(struct obd_export *exp, struct md_op_data *op_data,
 		return PTR_ERR(parent_tgt);
 
 	if (likely(!fid_is_zero(&op_data->op_fid2))) {
-		tgt = lmv_find_target(lmv, &op_data->op_fid2);
+		tgt = lmv_fid2tgt(lmv, &op_data->op_fid2);
 		if (IS_ERR(tgt))
 			return PTR_ERR(tgt);
 	} else {
@@ -2845,7 +2745,7 @@ static int lmv_unlink(struct obd_export *exp, struct md_op_data *op_data,
 	ptlrpc_req_finished(*request);
 	*request = NULL;
 
-	tgt = lmv_find_target(lmv, &op_data->op_fid2);
+	tgt = lmv_fid2tgt(lmv, &op_data->op_fid2);
 	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
@@ -2881,6 +2781,7 @@ static int lmv_get_info(const struct lu_env *env, struct obd_export *exp,
 {
 	struct obd_device *obd;
 	struct lmv_obd *lmv;
+	struct lu_tgt_desc *tgt;
 	int rc = 0;
 
 	obd = class_exp2obd(exp);
@@ -2892,18 +2793,8 @@ static int lmv_get_info(const struct lu_env *env, struct obd_export *exp,
 
 	lmv = &obd->u.lmv;
 	if (keylen >= strlen("remote_flag") && !strcmp(key, "remote_flag")) {
-		int i;
-
 		LASSERT(*vallen == sizeof(u32));
-		for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
-			struct lmv_tgt_desc *tgt = lmv->tgts[i];
-
-			/*
-			 * All tgts should be connected when this gets called.
-			 */
-			if (!tgt || !tgt->ltd_exp)
-				continue;
-
+		lmv_foreach_connected_tgt(lmv, tgt) {
 			if (!obd_get_info(env, tgt->ltd_exp, keylen, key,
 					  vallen, val))
 				return 0;
@@ -2916,8 +2807,11 @@ static int lmv_get_info(const struct lu_env *env, struct obd_export *exp,
 		 * Forwarding this request to first MDS, it should know LOV
 		 * desc.
 		 */
-		rc = obd_get_info(env, lmv->tgts[0]->ltd_exp, keylen, key,
-				  vallen, val);
+		tgt = lmv_tgt(lmv, 0);
+		if (!tgt)
+			return -ENODEV;
+
+		rc = obd_get_info(env, tgt->ltd_exp, keylen, key, vallen, val);
 		if (!rc && KEY_IS(KEY_CONN_DATA))
 			exp->exp_connect_data = *(struct obd_connect_data *)val;
 		return rc;
@@ -2937,6 +2831,7 @@ static int lmv_rmfid(struct obd_export *exp, struct fid_array *fa,
 	struct ptlrpc_request_set *set = _set;
 	struct lmv_obd *lmv = &obddev->u.lmv;
 	int tgt_count = lmv->desc.ld_tgt_count;
+	struct lu_tgt_desc *tgt;
 	struct fid_array *fat, **fas = NULL;
 	int i, rc, **rcs = NULL;
 
@@ -2987,11 +2882,11 @@ static int lmv_rmfid(struct obd_export *exp, struct fid_array *fa,
 		fat->fa_fids[fat->fa_nr++] = fa->fa_fids[i];
 	}
 
-	for (i = 0; i < tgt_count; i++) {
-		fat = fas[i];
+	lmv_foreach_connected_tgt(lmv, tgt) {
+		fat = fas[tgt->ltd_index];
 		if (!fat || fat->fa_nr == 0)
 			continue;
-		rc = md_rmfid(lmv->tgts[i]->ltd_exp, fat, rcs[i], set);
+		rc = md_rmfid(tgt->ltd_exp, fat, rcs[tgt->ltd_index], set);
 	}
 
 	rc = ptlrpc_set_wait(NULL, set);
@@ -3062,14 +2957,9 @@ static int lmv_set_info_async(const struct lu_env *env, struct obd_export *exp,
 
 	if (KEY_IS(KEY_READ_ONLY) || KEY_IS(KEY_FLUSH_CTX) ||
 	    KEY_IS(KEY_DEFAULT_EASIZE)) {
-		int i, err = 0;
-
-		for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
-			tgt = lmv->tgts[i];
-
-			if (!tgt || !tgt->ltd_exp)
-				continue;
+		int err = 0;
 
+		lmv_foreach_connected_tgt(lmv, tgt) {
 			err = obd_set_info_async(env, tgt->ltd_exp,
 						 keylen, key, vallen, val, set);
 			if (err && rc == 0)
@@ -3272,16 +3162,14 @@ static int lmv_cancel_unused(struct obd_export *exp, const struct lu_fid *fid,
 {
 	struct obd_device *obd = exp->exp_obd;
 	struct lmv_obd *lmv = &obd->u.lmv;
-	int rc = 0;
+	struct lu_tgt_desc *tgt;
 	int err;
-	u32 i;
+	int rc = 0;
 
 	LASSERT(fid);
 
-	for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
-		struct lmv_tgt_desc *tgt = lmv->tgts[i];
-
-		if (!tgt || !tgt->ltd_exp || !tgt->ltd_active)
+	lmv_foreach_connected_tgt(lmv, tgt) {
+		if (!tgt->ltd_active)
 			continue;
 
 		err = md_cancel_unused(tgt->ltd_exp, fid, policy, mode, flags,
@@ -3297,7 +3185,7 @@ static int lmv_set_lock_data(struct obd_export *exp,
 			     void *data, u64 *bits)
 {
 	struct lmv_obd *lmv = &exp->exp_obd->u.lmv;
-	struct lmv_tgt_desc *tgt = lmv->tgts[0];
+	struct lmv_tgt_desc *tgt = lmv_tgt(lmv, 0);
 
 	if (!tgt || !tgt->ltd_exp)
 		return -EINVAL;
@@ -3315,8 +3203,9 @@ static enum ldlm_mode lmv_lock_match(struct obd_export *exp, u64 flags,
 	struct obd_device *obd = exp->exp_obd;
 	struct lmv_obd *lmv = &obd->u.lmv;
 	enum ldlm_mode rc;
-	int tgt;
-	u32 i;
+	struct lu_tgt_desc *tgt;
+	int i;
+	int index;
 
 	CDEBUG(D_INODE, "Lock match for " DFID "\n", PFID(fid));
 
@@ -3326,21 +3215,21 @@ static enum ldlm_mode lmv_lock_match(struct obd_export *exp, u64 flags,
 	 * space of MDT storing inode.  Try the MDT that the FID maps to first,
 	 * since this can be easily found, and only try others if that fails.
 	 */
-	for (i = 0, tgt = lmv_find_target_index(lmv, fid);
+	for (i = 0, index = lmv_fid2tgt_index(lmv, fid);
 	     i < lmv->desc.ld_tgt_count;
-	     i++, tgt = (tgt + 1) % lmv->desc.ld_tgt_count) {
-		if (tgt < 0) {
+	     i++, index = (index + 1) % lmv->desc.ld_tgt_count) {
+		if (index < 0) {
 			CDEBUG(D_HA, "%s: " DFID " is inaccessible: rc = %d\n",
-			       obd->obd_name, PFID(fid), tgt);
-			tgt = 0;
+			       obd->obd_name, PFID(fid), index);
+			index = 0;
 		}
 
-		if (!lmv->tgts[tgt] || !lmv->tgts[tgt]->ltd_exp ||
-		    !lmv->tgts[tgt]->ltd_active)
+		tgt = lmv_tgt(lmv, index);
+		if (!tgt || !tgt->ltd_exp || !tgt->ltd_active)
 			continue;
 
-		rc = md_lock_match(lmv->tgts[tgt]->ltd_exp, flags, fid,
-				   type, policy, mode, lockh);
+		rc = md_lock_match(tgt->ltd_exp, flags, fid, type, policy, mode,
+				   lockh);
 		if (rc)
 			return rc;
 	}
@@ -3355,7 +3244,7 @@ static int lmv_get_lustre_md(struct obd_export *exp,
 			     struct lustre_md *md)
 {
 	struct lmv_obd *lmv = &exp->exp_obd->u.lmv;
-	struct lmv_tgt_desc *tgt = lmv->tgts[0];
+	struct lmv_tgt_desc *tgt = lmv_tgt(lmv, 0);
 
 	if (!tgt || !tgt->ltd_exp)
 		return -EINVAL;
@@ -3366,7 +3255,7 @@ static int lmv_free_lustre_md(struct obd_export *exp, struct lustre_md *md)
 {
 	struct obd_device *obd = exp->exp_obd;
 	struct lmv_obd *lmv = &obd->u.lmv;
-	struct lmv_tgt_desc *tgt = lmv->tgts[0];
+	struct lmv_tgt_desc *tgt = lmv_tgt(lmv, 0);
 
 	if (md->default_lmv) {
 		lmv_free_memmd(md->default_lmv);
@@ -3389,7 +3278,7 @@ static int lmv_set_open_replay_data(struct obd_export *exp,
 	struct lmv_obd *lmv = &obd->u.lmv;
 	struct lmv_tgt_desc *tgt;
 
-	tgt = lmv_find_target(lmv, &och->och_fid);
+	tgt = lmv_fid2tgt(lmv, &och->och_fid);
 	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
@@ -3403,7 +3292,7 @@ static int lmv_clear_open_replay_data(struct obd_export *exp,
 	struct lmv_obd *lmv = &obd->u.lmv;
 	struct lmv_tgt_desc *tgt;
 
-	tgt = lmv_find_target(lmv, &och->och_fid);
+	tgt = lmv_fid2tgt(lmv, &och->och_fid);
 	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
@@ -3426,7 +3315,7 @@ static int lmv_intent_getattr_async(struct obd_export *exp,
 	if (IS_ERR(ptgt))
 		return PTR_ERR(ptgt);
 
-	ctgt = lmv_find_target(lmv, &op_data->op_fid2);
+	ctgt = lmv_fid2tgt(lmv, &op_data->op_fid1);
 	if (IS_ERR(ctgt))
 		return PTR_ERR(ctgt);
 
@@ -3447,7 +3336,7 @@ static int lmv_revalidate_lock(struct obd_export *exp, struct lookup_intent *it,
 	struct lmv_obd *lmv = &obd->u.lmv;
 	struct lmv_tgt_desc *tgt;
 
-	tgt = lmv_find_target(lmv, fid);
+	tgt = lmv_fid2tgt(lmv, fid);
 	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
@@ -3482,13 +3371,11 @@ static int lmv_quotactl(struct obd_device *unused, struct obd_export *exp,
 {
 	struct obd_device *obd = class_exp2obd(exp);
 	struct lmv_obd *lmv = &obd->u.lmv;
-	struct lmv_tgt_desc *tgt = lmv->tgts[0];
+	struct lmv_tgt_desc *tgt = lmv_tgt(lmv, 0);
 	int rc = 0;
 	u64 curspace = 0, curinodes = 0;
-	u32 i;
 
-	if (!tgt || !tgt->ltd_exp || !tgt->ltd_active ||
-	    !lmv->desc.ld_tgt_count) {
+	if (!tgt || !tgt->ltd_exp || !tgt->ltd_active) {
 		CERROR("master lmv inactive\n");
 		return -EIO;
 	}
@@ -3496,17 +3383,16 @@ static int lmv_quotactl(struct obd_device *unused, struct obd_export *exp,
 	if (oqctl->qc_cmd != Q_GETOQUOTA)
 		return obd_quotactl(tgt->ltd_exp, oqctl);
 
-	for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
+	lmv_foreach_connected_tgt(lmv, tgt) {
 		int err;
 
-		tgt = lmv->tgts[i];
-
-		if (!tgt || !tgt->ltd_exp || !tgt->ltd_active)
+		if (!tgt->ltd_active)
 			continue;
 
 		err = obd_quotactl(tgt->ltd_exp, oqctl);
 		if (err) {
-			CERROR("getquota on mdt %d failed. %d\n", i, err);
+			CERROR("getquota on mdt %d failed. %d\n",
+			       tgt->ltd_index, err);
 			if (!rc)
 				rc = err;
 		} else {
diff --git a/fs/lustre/lmv/lmv_qos.c b/fs/lustre/lmv/lmv_qos.c
index 85053d2e..0bee7c0 100644
--- a/fs/lustre/lmv/lmv_qos.c
+++ b/fs/lustre/lmv/lmv_qos.c
@@ -77,7 +77,6 @@ static int lmv_qos_calc_ppts(struct lmv_obd *lmv)
 	u64 ba_max, ba_min, ba;
 	u64 ia_max, ia_min, ia;
 	u32 num_active;
-	unsigned int i;
 	int prio_wide;
 	time64_t now, age;
 	u32 maxage = lmv->desc.ld_qos_maxage;
@@ -114,9 +113,8 @@ static int lmv_qos_calc_ppts(struct lmv_obd *lmv)
 	now = ktime_get_real_seconds();
 
 	/* Calculate server penalty per object */
-	for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
-		tgt = lmv->tgts[i];
-		if (!tgt || !tgt->ltd_exp || !tgt->ltd_active)
+	lmv_foreach_tgt(lmv, tgt) {
+		if (!tgt->ltd_exp || !tgt->ltd_active)
 			continue;
 
 		/* bavail >> 16 to avoid overflow */
@@ -164,9 +162,8 @@ static int lmv_qos_calc_ppts(struct lmv_obd *lmv)
 		 * we have to double the MDT penalty
 		 */
 		num_active = 2;
-		for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
-			tgt = lmv->tgts[i];
-			if (!tgt || !tgt->ltd_exp || !tgt->ltd_active)
+		lmv_foreach_tgt(lmv, tgt) {
+			if (!tgt->ltd_exp || !tgt->ltd_active)
 				continue;
 
 			tgt->ltd_qos.ltq_penalty_per_obj <<= 1;
@@ -265,7 +262,6 @@ static int lmv_qos_used(struct lmv_obd *lmv, struct lu_tgt_desc *tgt,
 {
 	struct lu_tgt_qos *ltq;
 	struct lu_svr_qos *svr;
-	unsigned int i;
 
 	ltq = &tgt->ltd_qos;
 	LASSERT(ltq);
@@ -301,9 +297,8 @@ static int lmv_qos_used(struct lmv_obd *lmv, struct lu_tgt_desc *tgt,
 
 	*total_wt = 0;
 	/* Decrease all MDT penalties */
-	for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
-		ltq = &lmv->tgts[i]->ltd_qos;
-		if (!tgt || !tgt->ltd_exp || !tgt->ltd_active)
+	lmv_foreach_tgt(lmv, tgt) {
+		if (!tgt->ltd_exp || !tgt->ltd_active)
 			continue;
 
 		if (ltq->ltq_penalty < ltq->ltq_penalty_per_obj)
@@ -311,7 +306,7 @@ static int lmv_qos_used(struct lmv_obd *lmv, struct lu_tgt_desc *tgt,
 		else
 			ltq->ltq_penalty -= ltq->ltq_penalty_per_obj;
 
-		lmv_qos_calc_weight(lmv->tgts[i]);
+		lmv_qos_calc_weight(tgt);
 
 		/* Recalc the total weight of usable osts */
 		if (ltq->ltq_usable)
@@ -319,7 +314,7 @@ static int lmv_qos_used(struct lmv_obd *lmv, struct lu_tgt_desc *tgt,
 
 		CDEBUG(D_OTHER,
 		       "recalc tgt %d usable=%d avail=%llu tgtppo=%llu tgtp=%llu svrppo=%llu svrp=%llu wt=%llu\n",
-		       i, ltq->ltq_usable,
+		       tgt->ltd_index, ltq->ltq_usable,
 		       tgt_statfs_bavail(tgt) >> 10,
 		       ltq->ltq_penalty_per_obj >> 10,
 		       ltq->ltq_penalty >> 10,
@@ -337,7 +332,6 @@ struct lu_tgt_desc *lmv_locate_tgt_qos(struct lmv_obd *lmv, u32 *mdt)
 	u64 total_weight = 0;
 	u64 cur_weight = 0;
 	u64 rand;
-	int i;
 	int rc;
 
 	if (!lmv_qos_is_usable(lmv))
@@ -356,11 +350,7 @@ struct lu_tgt_desc *lmv_locate_tgt_qos(struct lmv_obd *lmv, u32 *mdt)
 		goto unlock;
 	}
 
-	for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
-		tgt = lmv->tgts[i];
-		if (!tgt)
-			continue;
-
+	lmv_foreach_tgt(lmv, tgt) {
 		tgt->ltd_qos.ltq_usable = 0;
 		if (!tgt->ltd_exp || !tgt->ltd_active)
 			continue;
@@ -372,10 +362,8 @@ struct lu_tgt_desc *lmv_locate_tgt_qos(struct lmv_obd *lmv, u32 *mdt)
 
 	rand = lu_prandom_u64_max(total_weight);
 
-	for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
-		tgt = lmv->tgts[i];
-
-		if (!tgt || !tgt->ltd_qos.ltq_usable)
+	lmv_foreach_tgt(lmv, tgt) {
+		if (!tgt->ltd_qos.ltq_usable)
 			continue;
 
 		cur_weight += tgt->ltd_qos.ltq_weight;
@@ -404,17 +392,18 @@ struct lu_tgt_desc *lmv_locate_tgt_rr(struct lmv_obd *lmv, u32 *mdt)
 
 	spin_lock(&lmv->lmv_qos.lq_rr.lqr_alloc);
 	for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
-		tgt = lmv->tgts[(i + lmv->lmv_qos_rr_index) %
-				lmv->desc.ld_tgt_count];
-		if (tgt && tgt->ltd_exp && tgt->ltd_active) {
-			*mdt = tgt->ltd_index;
-			lmv->lmv_qos_rr_index =
-				(i + lmv->lmv_qos_rr_index + 1) %
-				lmv->desc.ld_tgt_count;
-			spin_unlock(&lmv->lmv_qos.lq_rr.lqr_alloc);
-
-			return tgt;
-		}
+		tgt = lmv_tgt(lmv,
+			(i + lmv->lmv_qos_rr_index) % lmv->desc.ld_tgt_count);
+		if (!tgt || !tgt->ltd_exp || !tgt->ltd_active)
+			continue;
+
+		*mdt = tgt->ltd_index;
+		lmv->lmv_qos_rr_index =
+			(i + lmv->lmv_qos_rr_index + 1) %
+			lmv->desc.ld_tgt_count;
+		spin_unlock(&lmv->lmv_qos.lq_rr.lqr_alloc);
+
+		return tgt;
 	}
 	spin_unlock(&lmv->lmv_qos.lq_rr.lqr_alloc);
 
diff --git a/fs/lustre/lmv/lproc_lmv.c b/fs/lustre/lmv/lproc_lmv.c
index 659ebeb..af670f8 100644
--- a/fs/lustre/lmv/lproc_lmv.c
+++ b/fs/lustre/lmv/lproc_lmv.c
@@ -183,14 +183,17 @@ static void *lmv_tgt_seq_start(struct seq_file *p, loff_t *pos)
 {
 	struct obd_device *dev = p->private;
 	struct lmv_obd *lmv = &dev->u.lmv;
+	struct lu_tgt_desc *tgt;
+
+	while (*pos < lmv->lmv_mdt_descs.ltd_tgts_size) {
+		tgt = lmv_tgt(lmv, (u32)*pos);
+		if (tgt)
+			return tgt;
 
-	while (*pos < lmv->tgts_size) {
-		if (lmv->tgts[*pos])
-			return lmv->tgts[*pos];
 		++*pos;
 	}
 
-	return  NULL;
+	return NULL;
 }
 
 static void lmv_tgt_seq_stop(struct seq_file *p, void *v)
@@ -201,11 +204,14 @@ static void *lmv_tgt_seq_next(struct seq_file *p, void *v, loff_t *pos)
 {
 	struct obd_device *dev = p->private;
 	struct lmv_obd *lmv = &dev->u.lmv;
+	struct lu_tgt_desc *tgt;
 
 	++*pos;
-	while (*pos < lmv->tgts_size) {
-		if (lmv->tgts[*pos])
-			return lmv->tgts[*pos];
+	while (*pos < lmv->lmv_mdt_descs.ltd_tgts_size) {
+		tgt = lmv_tgt(lmv, (u32)*pos);
+		if (tgt)
+			return tgt;
+
 		++*pos;
 	}
 
diff --git a/fs/lustre/obdclass/Makefile b/fs/lustre/obdclass/Makefile
index 6d762ed..5718a6d 100644
--- a/fs/lustre/obdclass/Makefile
+++ b/fs/lustre/obdclass/Makefile
@@ -8,4 +8,4 @@ obdclass-y := llog.o llog_cat.o llog_obd.o llog_swab.o class_obd.o \
 	      lustre_handles.o lustre_peer.o statfs_pack.o linkea.o \
 	      obdo.o obd_config.o obd_mount.o lu_object.o lu_ref.o \
 	      cl_object.o cl_page.o cl_lock.o cl_io.o kernelcomm.o \
-	      jobid.o integrity.o obd_cksum.o lu_qos.o
+	      jobid.o integrity.o obd_cksum.o lu_qos.o lu_tgt_descs.o
diff --git a/fs/lustre/obdclass/lu_tgt_descs.c b/fs/lustre/obdclass/lu_tgt_descs.c
new file mode 100644
index 0000000..04d6acc
--- /dev/null
+++ b/fs/lustre/obdclass/lu_tgt_descs.c
@@ -0,0 +1,192 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * GPL HEADER START
+ *
+ * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 only,
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License version 2 for more details (a copy is included
+ * in the LICENSE file that accompanied this code).
+ *
+ * You should have received a copy of the GNU General Public License
+ * version 2 along with this program; If not, see
+ * http://www.gnu.org/licenses/gpl-2.0.html
+ *
+ * GPL HEADER END
+ */
+/*
+ * This file is part of Lustre, http://www.lustre.org/
+ *
+ * lustre/obdclass/lu_tgt_descs.c
+ *
+ * Lustre target descriptions
+ * These are the only exported functions, they provide some generic
+ * infrastructure for target description management used by LOD/LMV
+ *
+ */
+
+#define DEBUG_SUBSYSTEM S_CLASS
+
+#include <linux/module.h>
+#include <linux/list.h>
+#include <obd_class.h>
+#include <obd_support.h>
+#include <lustre_disk.h>
+#include <lustre_fid.h>
+#include <lu_object.h>
+
+/**
+ * Allocate and initialize target table.
+ *
+ * A helper function to initialize the target table and allocate
+ * a bitmap of the available targets.
+ *
+ * @ltd		target's table to initialize
+ *
+ * Return:	0 on success
+ *		negated errno on error
+ **/
+int lu_tgt_descs_init(struct lu_tgt_descs *ltd)
+{
+	mutex_init(&ltd->ltd_mutex);
+	init_rwsem(&ltd->ltd_rw_sem);
+
+	/*
+	 * the tgt array and bitmap are allocated/grown dynamically as tgts are
+	 * added to the LOD/LMV, see lu_tgt_descs_add()
+	 */
+	ltd->ltd_tgt_bitmap = bitmap_zalloc(BITS_PER_LONG, GFP_NOFS);
+	if (!ltd->ltd_tgt_bitmap)
+		return -ENOMEM;
+
+	ltd->ltd_tgts_size  = BITS_PER_LONG;
+	ltd->ltd_tgtnr      = 0;
+
+	ltd->ltd_death_row = 0;
+	ltd->ltd_refcount  = 0;
+
+	return 0;
+}
+EXPORT_SYMBOL(lu_tgt_descs_init);
+
+/**
+ * Free bitmap and target table pages.
+ *
+ * @ltd		target table
+ */
+void lu_tgt_descs_fini(struct lu_tgt_descs *ltd)
+{
+	int i;
+
+	bitmap_free(ltd->ltd_tgt_bitmap);
+	for (i = 0; i < TGT_PTRS; i++)
+		kfree(ltd->ltd_tgt_idx[i]);
+	ltd->ltd_tgts_size = 0;
+}
+EXPORT_SYMBOL(lu_tgt_descs_fini);
+
+/**
+ * Expand size of target table.
+ *
+ * When the target table is full, we have to extend the table. To do so,
+ * we allocate new memory with some reserve, move data from the old table
+ * to the new one and release memory consumed by the old table.
+ *
+ * @ltd		target table
+ * @newsize	new size of the table
+ *
+ * Return:	0 on success
+ *		-ENOMEM if reallocation failed
+ */
+static int lu_tgt_descs_resize(struct lu_tgt_descs *ltd, u32 newsize)
+{
+	unsigned long *new_bitmap, *old_bitmap = NULL;
+
+	/* someone else has already resized the array */
+	if (newsize <= ltd->ltd_tgts_size)
+		return 0;
+
+	new_bitmap = bitmap_zalloc(newsize, GFP_NOFS);
+	if (!new_bitmap)
+		return -ENOMEM;
+
+	if (ltd->ltd_tgts_size > 0) {
+		/* the bitmap already exists, copy data from old one */
+		bitmap_copy(new_bitmap, ltd->ltd_tgt_bitmap,
+			    ltd->ltd_tgts_size);
+		old_bitmap = ltd->ltd_tgt_bitmap;
+	}
+
+	ltd->ltd_tgts_size  = newsize;
+	ltd->ltd_tgt_bitmap = new_bitmap;
+
+	bitmap_free(old_bitmap);
+
+	CDEBUG(D_CONFIG, "tgt size: %d\n", ltd->ltd_tgts_size);
+
+	return 0;
+}
+
+/**
+ * Add new target to target table.
+ *
+ * Extend target table if it's full, update target table and bitmap.
+ * Notice we need to take ltd_rw_sem exclusively before entry to ensure
+ * atomic switch.
+ *
+ * @ltd		target table
+ * @tgt		new target desc
+ *
+ * Return:	0 on success
+ *		-ENOMEM if reallocation failed
+ *		-EEXIST if target existed
+ */
+int lu_tgt_descs_add(struct lu_tgt_descs *ltd, struct lu_tgt_desc *tgt)
+{
+	u32 index = tgt->ltd_index;
+	int rc;
+
+	if (index >= ltd->ltd_tgts_size) {
+		u32 newsize = 1;
+
+		while (newsize < index + 1)
+			newsize = newsize << 1;
+
+		rc = lu_tgt_descs_resize(ltd, newsize);
+		if (rc)
+			return rc;
+	} else if (test_bit(index, ltd->ltd_tgt_bitmap)) {
+		return -EEXIST;
+	}
+
+	if (ltd->ltd_tgt_idx[index / TGT_PTRS_PER_BLOCK] == NULL) {
+		ltd->ltd_tgt_idx[index / TGT_PTRS_PER_BLOCK] =
+			kzalloc(sizeof(*ltd->ltd_tgt_idx[0]), GFP_NOFS);
+		if (ltd->ltd_tgt_idx[index / TGT_PTRS_PER_BLOCK] == NULL)
+			return -ENOMEM;
+	}
+
+	LTD_TGT(ltd, tgt->ltd_index) = tgt;
+	set_bit(tgt->ltd_index, ltd->ltd_tgt_bitmap);
+	ltd->ltd_tgtnr++;
+
+	return 0;
+}
+EXPORT_SYMBOL(lu_tgt_descs_add);
+
+/**
+ * Delete target from target table
+ */
+void lu_tgt_descs_del(struct lu_tgt_descs *ltd, struct lu_tgt_desc *tgt)
+{
+	LTD_TGT(ltd, tgt->ltd_index) = NULL;
+	clear_bit(tgt->ltd_index, ltd->ltd_tgt_bitmap);
+	ltd->ltd_tgtnr--;
+}
+EXPORT_SYMBOL(lu_tgt_descs_del);
-- 
1.8.3.1
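The sparse target table introduced by the patch above grows by powers of two as targets register (see the `newsize` loop in lu_tgt_descs_add()). A minimal standalone sketch of that growth policy follows; tgt_table_newsize() is an illustrative helper name, not a function from the patch:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch of the table-growth policy in lu_tgt_descs_add():
 * when a target index falls outside the current table, the table is
 * resized to the smallest power of two that covers that index.
 * tgt_table_newsize() is hypothetical, for this sketch only.
 */
uint32_t tgt_table_newsize(uint32_t index)
{
	uint32_t newsize = 1;

	while (newsize < index + 1)
		newsize <<= 1;

	return newsize;
}
```

Doubling keeps reallocation cost amortized while the bitmap (`ltd_tgt_bitmap`) tracks which slots in the enlarged table are actually occupied.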


* [lustre-devel] [PATCH 486/622] lustre: lmv: share object alloc QoS code with LMV
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (484 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 485/622] lustre: lmv: use lu_tgt_descs to manage tgts James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 487/622] lustre: import: Fix missing spin_unlock() James Simmons
                   ` (136 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

Move object alloc QoS code to obdclass, so that LMV and LOD
can share the same code.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11213
Lustre-commit: d3090bb2b486 ("LU-11213 lod: share object alloc QoS code with LMV")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35219
Reviewed-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
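
The weighted-random selection that the shared QoS code performs (see lmv_locate_tgt_qos() in the diff below) reduces to: draw a random value in [0, total_weight) and take the first usable target whose cumulative weight reaches it. A standalone sketch, with qos_pick() as a hypothetical stand-in for the walk over lu_tgt_desc entries:

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Illustrative model of the QoS weighted-random pick.  qos_pick() is
 * not an API from this patch; the real code iterates connected targets
 * and recalculates penalties/weights around the selection.
 */
int qos_pick(const uint64_t *weights, size_t n, uint64_t rnd)
{
	uint64_t cur_weight = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		cur_weight += weights[i];
		if (cur_weight < rnd)
			continue;
		/* heavier targets are selected proportionally more often */
		return (int)i;
	}

	return -1;	/* no usable target; the patch returns -EAGAIN here */
}
```

With weights {10, 20, 30}, draws in [0,10] land on target 0, (10,30] on target 1, and (30,59] on target 2, matching the proportional balancing the commit message describes.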
---
 fs/lustre/include/lu_object.h |   7 +
 fs/lustre/lmv/Makefile        |   2 +-
 fs/lustre/lmv/lmv_internal.h  |   4 -
 fs/lustre/lmv/lmv_obd.c       |  87 +++++++++
 fs/lustre/lmv/lmv_qos.c       | 411 ------------------------------------------
 fs/lustre/obdclass/lu_qos.c   | 303 +++++++++++++++++++++++++++++++
 6 files changed, 398 insertions(+), 416 deletions(-)
 delete mode 100644 fs/lustre/lmv/lmv_qos.c

diff --git a/fs/lustre/include/lu_object.h b/fs/lustre/include/lu_object.h
index c30c06d..eaf20ea 100644
--- a/fs/lustre/include/lu_object.h
+++ b/fs/lustre/include/lu_object.h
@@ -1442,6 +1442,13 @@ struct lu_qos {
 void lu_qos_rr_init(struct lu_qos_rr *lqr);
 int lqos_add_tgt(struct lu_qos *qos, struct lu_tgt_desc *ltd);
 int lqos_del_tgt(struct lu_qos *qos, struct lu_tgt_desc *ltd);
+bool lqos_is_usable(struct lu_qos *qos, u32 active_tgt_nr);
+int lqos_calc_penalties(struct lu_qos *qos, struct lu_tgt_descs *ltd,
+			u32 active_tgt_nr, u32 maxage, bool is_mdt);
+void lqos_calc_weight(struct lu_tgt_desc *tgt);
+int lqos_recalc_weight(struct lu_qos *qos, struct lu_tgt_descs *ltd,
+		       struct lu_tgt_desc *tgt, u32 active_tgt_nr,
+		       u64 *total_wt);
 u64 lu_prandom_u64_max(u64 ep_ro);
 
 int lu_tgt_descs_init(struct lu_tgt_descs *ltd);
diff --git a/fs/lustre/lmv/Makefile b/fs/lustre/lmv/Makefile
index 6f9a19c..ad470bf 100644
--- a/fs/lustre/lmv/Makefile
+++ b/fs/lustre/lmv/Makefile
@@ -1,4 +1,4 @@
 ccflags-y += -I$(srctree)/$(src)/../include
 
 obj-$(CONFIG_LUSTRE_FS) += lmv.o
-lmv-y := lmv_obd.o lmv_intent.o lmv_fld.o lproc_lmv.o lmv_qos.o
+lmv-y := lmv_obd.o lmv_intent.o lmv_fld.o lproc_lmv.o
diff --git a/fs/lustre/lmv/lmv_internal.h b/fs/lustre/lmv/lmv_internal.h
index e0c3ba0..d95fa3f 100644
--- a/fs/lustre/lmv/lmv_internal.h
+++ b/fs/lustre/lmv/lmv_internal.h
@@ -218,10 +218,6 @@ static inline bool lmv_dir_retry_check_update(struct md_op_data *op_data)
 struct lmv_tgt_desc *lmv_locate_tgt(struct lmv_obd *lmv,
 				    struct md_op_data *op_data);
 
-/* lmv_qos.c */
-struct lu_tgt_desc *lmv_locate_tgt_qos(struct lmv_obd *lmv, u32 *mdt);
-struct lu_tgt_desc *lmv_locate_tgt_rr(struct lmv_obd *lmv, u32 *mdt);
-
 /* lproc_lmv.c */
 int lmv_tunables_init(struct obd_device *obd);
 
diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index 8d682b4..2959b18 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -1518,6 +1518,93 @@ static int lmv_close(struct obd_export *exp, struct md_op_data *op_data,
 	return md_close(tgt->ltd_exp, op_data, mod, request);
 }
 
+static struct lu_tgt_desc *lmv_locate_tgt_qos(struct lmv_obd *lmv, u32 *mdt)
+{
+	struct lu_tgt_desc *tgt;
+	u64 total_weight = 0;
+	u64 cur_weight = 0;
+	u64 rand;
+	int rc;
+
+	if (!lqos_is_usable(&lmv->lmv_qos, lmv->desc.ld_active_tgt_count))
+		return ERR_PTR(-EAGAIN);
+
+	down_write(&lmv->lmv_qos.lq_rw_sem);
+
+	if (!lqos_is_usable(&lmv->lmv_qos, lmv->desc.ld_active_tgt_count)) {
+		tgt = ERR_PTR(-EAGAIN);
+		goto unlock;
+	}
+
+	rc = lqos_calc_penalties(&lmv->lmv_qos, &lmv->lmv_mdt_descs,
+				 lmv->desc.ld_active_tgt_count,
+				 lmv->desc.ld_qos_maxage, true);
+	if (rc) {
+		tgt = ERR_PTR(rc);
+		goto unlock;
+	}
+
+	lmv_foreach_tgt(lmv, tgt) {
+		tgt->ltd_qos.ltq_usable = 0;
+		if (!tgt->ltd_exp || !tgt->ltd_active)
+			continue;
+
+		tgt->ltd_qos.ltq_usable = 1;
+		lqos_calc_weight(tgt);
+		total_weight += tgt->ltd_qos.ltq_weight;
+	}
+
+	rand = lu_prandom_u64_max(total_weight);
+
+	lmv_foreach_connected_tgt(lmv, tgt) {
+		if (!tgt->ltd_qos.ltq_usable)
+			continue;
+
+		cur_weight += tgt->ltd_qos.ltq_weight;
+		if (cur_weight < rand)
+			continue;
+
+		*mdt = tgt->ltd_index;
+		lqos_recalc_weight(&lmv->lmv_qos, &lmv->lmv_mdt_descs, tgt,
+				   lmv->desc.ld_active_tgt_count,
+				   &total_weight);
+		rc = 0;
+		goto unlock;
+	}
+
+	/* no proper target found */
+	tgt = ERR_PTR(-EAGAIN);
+	goto unlock;
+unlock:
+	up_write(&lmv->lmv_qos.lq_rw_sem);
+
+	return tgt;
+}
+
+static struct lu_tgt_desc *lmv_locate_tgt_rr(struct lmv_obd *lmv, u32 *mdt)
+{
+	struct lu_tgt_desc *tgt;
+	int i;
+	int index;
+
+	spin_lock(&lmv->lmv_qos.lq_rr.lqr_alloc);
+	for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
+		index = (i + lmv->lmv_qos_rr_index) % lmv->desc.ld_tgt_count;
+		tgt = lmv_tgt(lmv, index);
+		if (!tgt || !tgt->ltd_exp || !tgt->ltd_active)
+			continue;
+
+		*mdt = tgt->ltd_index;
+		lmv->lmv_qos_rr_index = (*mdt + 1) % lmv->desc.ld_tgt_count;
+		spin_unlock(&lmv->lmv_qos.lq_rr.lqr_alloc);
+
+		return tgt;
+	}
+	spin_unlock(&lmv->lmv_qos.lq_rr.lqr_alloc);
+
+	return ERR_PTR(-ENODEV);
+}
+
 static struct lmv_tgt_desc *
 lmv_locate_tgt_by_name(struct lmv_obd *lmv, struct lmv_stripe_md *lsm,
 		       const char *name, int namelen, struct lu_fid *fid,
diff --git a/fs/lustre/lmv/lmv_qos.c b/fs/lustre/lmv/lmv_qos.c
deleted file mode 100644
index 0bee7c0..0000000
--- a/fs/lustre/lmv/lmv_qos.c
+++ /dev/null
@@ -1,411 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/*
- * GPL HEADER START
- *
- * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 only,
- * as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License version 2 for more details (a copy is included
- * in the LICENSE file that accompanied this code).
- *
- * You should have received a copy of the GNU General Public License
- * version 2 along with this program; If not, see
- * http://www.gnu.org/licenses/gpl-2.0.html
- *
- * GPL HEADER END
- */
-/*
- * This file is part of Lustre, http://www.lustre.org/
- *
- * lustre/lmv/lmv_qos.c
- *
- * LMV QoS.
- * These are the only exported functions, they provide some generic
- * infrastructure for object allocation QoS
- *
- */
-
-#define DEBUG_SUBSYSTEM S_LMV
-
-#include <asm/div64.h>
-#include <uapi/linux/lustre/lustre_idl.h>
-#include <lustre_swab.h>
-#include <obd_class.h>
-
-#include "lmv_internal.h"
-
-static inline u64 tgt_statfs_bavail(struct lu_tgt_desc *tgt)
-{
-	struct obd_statfs *statfs = &tgt->ltd_statfs;
-
-	return statfs->os_bavail * statfs->os_bsize;
-}
-
-static inline u64 tgt_statfs_iavail(struct lu_tgt_desc *tgt)
-{
-	return tgt->ltd_statfs.os_ffree;
-}
-
-/**
- * Calculate penalties per-tgt and per-server
- *
- * Re-calculate penalties when the configuration changes, active targets
- * change and after statfs refresh (all these are reflected by lq_dirty flag).
- * On every MDT and MDS: decay the penalty by half for every 8x the update
- * interval that the device has been idle. That gives lots of time for the
- * statfs information to be updated (which the penalty is only a proxy for),
- * and avoids penalizing MDS/MDTs under light load.
- * See lmv_qos_calc_weight() for how penalties are factored into the weight.
- *
- * @lmv			LMV device
- *
- * Return:		0 on success
- *			-EAGAIN	if the number of MDTs isn't enough or all
- *			MDT spaces are almost the same
- */
-static int lmv_qos_calc_ppts(struct lmv_obd *lmv)
-{
-	struct lu_qos *qos = &lmv->lmv_qos;
-	struct lu_tgt_desc *tgt;
-	struct lu_svr_qos *svr;
-	u64 ba_max, ba_min, ba;
-	u64 ia_max, ia_min, ia;
-	u32 num_active;
-	int prio_wide;
-	time64_t now, age;
-	u32 maxage = lmv->desc.ld_qos_maxage;
-	int rc = 0;
-
-
-	if (!qos->lq_dirty)
-		goto out;
-
-	num_active = lmv->desc.ld_active_tgt_count;
-	if (num_active < 2) {
-		rc = -EAGAIN;
-		goto out;
-	}
-
-	/* find bavail on each server */
-	list_for_each_entry(svr, &qos->lq_svr_list, lsq_svr_list) {
-		svr->lsq_bavail = 0;
-		svr->lsq_iavail = 0;
-	}
-	qos->lq_active_svr_count = 0;
-
-	/*
-	 * How badly user wants to select targets "widely" (not recently chosen
-	 * and not on recent MDS's).  As opposed to "freely" (free space avail.)
-	 * 0-256
-	 */
-	prio_wide = 256 - qos->lq_prio_free;
-
-	ba_min = (u64)(-1);
-	ba_max = 0;
-	ia_min = (u64)(-1);
-	ia_max = 0;
-	now = ktime_get_real_seconds();
-
-	/* Calculate server penalty per object */
-	lmv_foreach_tgt(lmv, tgt) {
-		if (!tgt->ltd_exp || !tgt->ltd_active)
-			continue;
-
-		/* bavail >> 16 to avoid overflow */
-		ba = tgt_statfs_bavail(tgt) >> 16;
-		if (!ba)
-			continue;
-
-		ba_min = min(ba, ba_min);
-		ba_max = max(ba, ba_max);
-
-		/* iavail >> 8 to avoid overflow */
-		ia = tgt_statfs_iavail(tgt) >> 8;
-		if (!ia)
-			continue;
-
-		ia_min = min(ia, ia_min);
-		ia_max = max(ia, ia_max);
-
-		/* Count the number of usable MDS's */
-		if (tgt->ltd_qos.ltq_svr->lsq_bavail == 0)
-			qos->lq_active_svr_count++;
-		tgt->ltd_qos.ltq_svr->lsq_bavail += ba;
-		tgt->ltd_qos.ltq_svr->lsq_iavail += ia;
-
-		/*
-		 * per-MDT penalty is
-		 * prio * bavail * iavail / (num_tgt - 1) / 2
-		 */
-		tgt->ltd_qos.ltq_penalty_per_obj = prio_wide * ba * ia;
-		do_div(tgt->ltd_qos.ltq_penalty_per_obj, num_active - 1);
-		tgt->ltd_qos.ltq_penalty_per_obj >>= 1;
-
-		age = (now - tgt->ltd_qos.ltq_used) >> 3;
-		if (qos->lq_reset || age > 32 * maxage)
-			tgt->ltd_qos.ltq_penalty = 0;
-		else if (age > maxage)
-			/* Decay tgt penalty. */
-			tgt->ltd_qos.ltq_penalty >>= (age / maxage);
-	}
-
-	num_active = qos->lq_active_svr_count;
-	if (num_active < 2) {
-		/*
-		 * If there's only 1 MDS, we can't penalize it, so instead
-		 * we have to double the MDT penalty
-		 */
-		num_active = 2;
-		lmv_foreach_tgt(lmv, tgt) {
-			if (!tgt->ltd_exp || !tgt->ltd_active)
-				continue;
-
-			tgt->ltd_qos.ltq_penalty_per_obj <<= 1;
-		}
-	}
-
-	/*
-	 * Per-MDS penalty is
-	 * prio * bavail * iavail / server_tgts / (num_svr - 1) / 2
-	 */
-	list_for_each_entry(svr, &qos->lq_svr_list, lsq_svr_list) {
-		ba = svr->lsq_bavail;
-		ia = svr->lsq_iavail;
-		svr->lsq_penalty_per_obj = prio_wide * ba  * ia;
-		do_div(ba, svr->lsq_tgt_count * (num_active - 1));
-		svr->lsq_penalty_per_obj >>= 1;
-
-		age = (now - svr->lsq_used) >> 3;
-		if (qos->lq_reset || age > 32 * maxage)
-			svr->lsq_penalty = 0;
-		else if (age > maxage)
-			/* Decay server penalty. */
-			svr->lsq_penalty >>= age / maxage;
-	}
-
-	qos->lq_dirty = 0;
-	qos->lq_reset = 0;
-
-	/*
-	 * If each MDT has almost same free space, do rr allocation for better
-	 * creation performance
-	 */
-	qos->lq_same_space = 0;
-	if ((ba_max * (256 - qos->lq_threshold_rr)) >> 8 < ba_min &&
-	    (ia_max * (256 - qos->lq_threshold_rr)) >> 8 < ia_min) {
-		qos->lq_same_space = 1;
-		/* Reset weights for the next time we enter qos mode */
-		qos->lq_reset = 1;
-	}
-	rc = 0;
-
-out:
-	if (!rc && qos->lq_same_space)
-		return -EAGAIN;
-
-	return rc;
-}
-
-static inline bool lmv_qos_is_usable(struct lmv_obd *lmv)
-{
-	if (!lmv->lmv_qos.lq_dirty && lmv->lmv_qos.lq_same_space)
-		return false;
-
-	if (lmv->desc.ld_active_tgt_count < 2)
-		return false;
-
-	return true;
-}
-
-/**
- * Calculate weight for a given MDT.
- *
- * The final MDT weight is bavail >> 16 * iavail >> 8 minus the MDT and MDS
- * penalties.  See lmv_qos_calc_ppts() for how penalties are calculated.
- *
- * \param[in] tgt	MDT target descriptor
- */
-static void lmv_qos_calc_weight(struct lu_tgt_desc *tgt)
-{
-	struct lu_tgt_qos *ltq = &tgt->ltd_qos;
-	u64 temp, temp2;
-
-	temp = (tgt_statfs_bavail(tgt) >> 16) * (tgt_statfs_iavail(tgt) >> 8);
-	temp2 = ltq->ltq_penalty + ltq->ltq_svr->lsq_penalty;
-	if (temp < temp2)
-		ltq->ltq_weight = 0;
-	else
-		ltq->ltq_weight = temp - temp2;
-}
-
-/**
- * Re-calculate weights.
- *
- * The function is called when some target was used for a new object. In
- * this case we should re-calculate all the weights to keep new allocations
- * balanced well.
- *
- * \param[in] lmv	LMV device
- * \param[in] tgt	target where a new object was placed
- * \param[out] total_wt	new total weight for the pool
- *
- * \retval		0
- */
-static int lmv_qos_used(struct lmv_obd *lmv, struct lu_tgt_desc *tgt,
-			u64 *total_wt)
-{
-	struct lu_tgt_qos *ltq;
-	struct lu_svr_qos *svr;
-
-	ltq = &tgt->ltd_qos;
-	LASSERT(ltq);
-
-	/* Don't allocate on this device anymore, until the next alloc_qos */
-	ltq->ltq_usable = 0;
-
-	svr = ltq->ltq_svr;
-
-	/*
-	 * Decay old penalty by half (we're adding max penalty, and don't
-	 * want it to run away.)
-	 */
-	ltq->ltq_penalty >>= 1;
-	svr->lsq_penalty >>= 1;
-
-	/* mark the MDS and MDT as recently used */
-	ltq->ltq_used = svr->lsq_used = ktime_get_real_seconds();
-
-	/* Set max penalties for this MDT and MDS */
-	ltq->ltq_penalty += ltq->ltq_penalty_per_obj *
-			    lmv->desc.ld_active_tgt_count;
-	svr->lsq_penalty += svr->lsq_penalty_per_obj *
-		lmv->lmv_qos.lq_active_svr_count;
-
-	/* Decrease all MDS penalties */
-	list_for_each_entry(svr, &lmv->lmv_qos.lq_svr_list, lsq_svr_list) {
-		if (svr->lsq_penalty < svr->lsq_penalty_per_obj)
-			svr->lsq_penalty = 0;
-		else
-			svr->lsq_penalty -= svr->lsq_penalty_per_obj;
-	}
-
-	*total_wt = 0;
-	/* Decrease all MDT penalties */
-	lmv_foreach_tgt(lmv, tgt) {
-		if (!tgt->ltd_exp || !tgt->ltd_active)
-			continue;
-
-		if (ltq->ltq_penalty < ltq->ltq_penalty_per_obj)
-			ltq->ltq_penalty = 0;
-		else
-			ltq->ltq_penalty -= ltq->ltq_penalty_per_obj;
-
-		lmv_qos_calc_weight(tgt);
-
-		/* Recalc the total weight of usable osts */
-		if (ltq->ltq_usable)
-			*total_wt += ltq->ltq_weight;
-
-		CDEBUG(D_OTHER,
-		       "recalc tgt %d usable=%d avail=%llu tgtppo=%llu tgtp=%llu svrppo=%llu svrp=%llu wt=%llu\n",
-		       tgt->ltd_index, ltq->ltq_usable,
-		       tgt_statfs_bavail(tgt) >> 10,
-		       ltq->ltq_penalty_per_obj >> 10,
-		       ltq->ltq_penalty >> 10,
-		       ltq->ltq_svr->lsq_penalty_per_obj >> 10,
-		       ltq->ltq_svr->lsq_penalty >> 10,
-		       ltq->ltq_weight >> 10);
-	}
-
-	return 0;
-}
-
-struct lu_tgt_desc *lmv_locate_tgt_qos(struct lmv_obd *lmv, u32 *mdt)
-{
-	struct lu_tgt_desc *tgt;
-	u64 total_weight = 0;
-	u64 cur_weight = 0;
-	u64 rand;
-	int rc;
-
-	if (!lmv_qos_is_usable(lmv))
-		return ERR_PTR(-EAGAIN);
-
-	down_write(&lmv->lmv_qos.lq_rw_sem);
-
-	if (!lmv_qos_is_usable(lmv)) {
-		tgt = ERR_PTR(-EAGAIN);
-		goto unlock;
-	}
-
-	rc = lmv_qos_calc_ppts(lmv);
-	if (rc) {
-		tgt = ERR_PTR(rc);
-		goto unlock;
-	}
-
-	lmv_foreach_tgt(lmv, tgt) {
-		tgt->ltd_qos.ltq_usable = 0;
-		if (!tgt->ltd_exp || !tgt->ltd_active)
-			continue;
-
-		tgt->ltd_qos.ltq_usable = 1;
-		lmv_qos_calc_weight(tgt);
-		total_weight += tgt->ltd_qos.ltq_weight;
-	}
-
-	rand = lu_prandom_u64_max(total_weight);
-
-	lmv_foreach_tgt(lmv, tgt) {
-		if (!tgt->ltd_qos.ltq_usable)
-			continue;
-
-		cur_weight += tgt->ltd_qos.ltq_weight;
-		if (cur_weight < rand)
-			continue;
-
-		*mdt = tgt->ltd_index;
-		lmv_qos_used(lmv, tgt, &total_weight);
-		rc = 0;
-		goto unlock;
-	}
-
-	/* no proper target found */
-	tgt = ERR_PTR(-EAGAIN);
-	goto unlock;
-unlock:
-	up_write(&lmv->lmv_qos.lq_rw_sem);
-
-	return tgt;
-}
-
-struct lu_tgt_desc *lmv_locate_tgt_rr(struct lmv_obd *lmv, u32 *mdt)
-{
-	struct lu_tgt_desc *tgt;
-	int i;
-
-	spin_lock(&lmv->lmv_qos.lq_rr.lqr_alloc);
-	for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
-		tgt = lmv_tgt(lmv,
-			(i + lmv->lmv_qos_rr_index) % lmv->desc.ld_tgt_count);
-		if (!tgt || !tgt->ltd_exp || !tgt->ltd_active)
-			continue;
-
-		*mdt = tgt->ltd_index;
-		lmv->lmv_qos_rr_index =
-			(i + lmv->lmv_qos_rr_index + 1) %
-			lmv->desc.ld_tgt_count;
-		spin_unlock(&lmv->lmv_qos.lq_rr.lqr_alloc);
-
-		return tgt;
-	}
-	spin_unlock(&lmv->lmv_qos.lq_rr.lqr_alloc);
-
-	return ERR_PTR(-ENODEV);
-}
diff --git a/fs/lustre/obdclass/lu_qos.c b/fs/lustre/obdclass/lu_qos.c
index d4803e8..e77e81d 100644
--- a/fs/lustre/obdclass/lu_qos.c
+++ b/fs/lustre/obdclass/lu_qos.c
@@ -207,3 +207,306 @@ u64 lu_prandom_u64_max(u64 ep_ro)
 	return rand;
 }
 EXPORT_SYMBOL(lu_prandom_u64_max);
+
+static inline u64 tgt_statfs_bavail(struct lu_tgt_desc *tgt)
+{
+	struct obd_statfs *statfs = &tgt->ltd_statfs;
+
+	return statfs->os_bavail * statfs->os_bsize;
+}
+
+static inline u64 tgt_statfs_iavail(struct lu_tgt_desc *tgt)
+{
+	return tgt->ltd_statfs.os_ffree;
+}
+
+/**
+ * Calculate penalties per-tgt and per-server
+ *
+ * Re-calculate penalties when the configuration changes, active targets
+ * change and after statfs refresh (all these are reflected by lq_dirty flag).
+ * On every tgt and server: decay the penalty by half for every 8x the update
+ * interval that the device has been idle. That gives lots of time for the
+ * statfs information to be updated (which the penalty is only a proxy for),
+ * and avoids penalizing server/tgt under light load.
+ * See lqos_calc_weight() for how penalties are factored into the weight.
+ *
+ * @qos			lu_qos
+ * @ltd			lu_tgt_descs
+ * @active_tgt_nr	active tgt number
+ * @maxage		qos max age
+ * @is_mdt		MDT will count inode usage
+ *
+ * Return:		0 on success
+ *			-EAGAIN if the number of tgts isn't enough or all
+ *			tgt spaces are almost the same
+ */
+int lqos_calc_penalties(struct lu_qos *qos, struct lu_tgt_descs *ltd,
+			u32 active_tgt_nr, u32 maxage, bool is_mdt)
+{
+	struct lu_tgt_desc *tgt;
+	struct lu_svr_qos *svr;
+	u64 ba_max, ba_min, ba;
+	u64 ia_max, ia_min, ia = 1;
+	u32 num_active;
+	int prio_wide;
+	time64_t now, age;
+	int rc;
+
+	if (!qos->lq_dirty) {
+		rc = 0;
+		goto out;
+	}
+
+	num_active = active_tgt_nr - 1;
+	if (num_active < 1) {
+		rc = -EAGAIN;
+		goto out;
+	}
+
+	/* find bavail on each server */
+	list_for_each_entry(svr, &qos->lq_svr_list, lsq_svr_list) {
+		svr->lsq_bavail = 0;
+		/* if inode is not counted, set to 1 to ignore */
+		svr->lsq_iavail = is_mdt ? 0 : 1;
+	}
+	qos->lq_active_svr_count = 0;
+
+	/*
+	 * How badly user wants to select targets "widely" (not recently chosen
+	 * and not on recent MDS's).  As opposed to "freely" (free space avail.)
+	 * 0-256
+	 */
+	prio_wide = 256 - qos->lq_prio_free;
+
+	ba_min = (u64)(-1);
+	ba_max = 0;
+	ia_min = (u64)(-1);
+	ia_max = 0;
+	now = ktime_get_real_seconds();
+
+	/* Calculate server penalty per object */
+	ltd_foreach_tgt(ltd, tgt) {
+		if (!tgt->ltd_active)
+			continue;
+
+		/* when inode is counted, bavail >> 16 to avoid overflow */
+		ba = tgt_statfs_bavail(tgt);
+		if (is_mdt)
+			ba >>= 16;
+		else
+			ba >>= 8;
+		if (!ba)
+			continue;
+
+		ba_min = min(ba, ba_min);
+		ba_max = max(ba, ba_max);
+
+		/* Count the number of usable servers */
+		if (tgt->ltd_qos.ltq_svr->lsq_bavail == 0)
+			qos->lq_active_svr_count++;
+		tgt->ltd_qos.ltq_svr->lsq_bavail += ba;
+
+		if (is_mdt) {
+			/* iavail >> 8 to avoid overflow */
+			ia = tgt_statfs_iavail(tgt) >> 8;
+			if (!ia)
+				continue;
+
+			ia_min = min(ia, ia_min);
+			ia_max = max(ia, ia_max);
+
+			tgt->ltd_qos.ltq_svr->lsq_iavail += ia;
+		}
+
+		/*
+		 * per-tgt penalty is
+		 * prio * bavail * iavail / (num_tgt - 1) / 2
+		 */
+		tgt->ltd_qos.ltq_penalty_per_obj = prio_wide * ba * ia;
+		do_div(tgt->ltd_qos.ltq_penalty_per_obj, num_active);
+		tgt->ltd_qos.ltq_penalty_per_obj >>= 1;
+
+		age = (now - tgt->ltd_qos.ltq_used) >> 3;
+		if (qos->lq_reset || age > 32 * maxage)
+			tgt->ltd_qos.ltq_penalty = 0;
+		else if (age > maxage)
+			/* Decay tgt penalty. */
+			tgt->ltd_qos.ltq_penalty >>= (age / maxage);
+	}
+
+	num_active = qos->lq_active_svr_count - 1;
+	if (num_active < 1) {
+		/*
+		 * If there's only 1 server, we can't penalize it, so instead
+		 * we have to double the tgt penalty
+		 */
+		num_active = 1;
+		ltd_foreach_tgt(ltd, tgt) {
+			if (!tgt->ltd_active)
+				continue;
+
+			tgt->ltd_qos.ltq_penalty_per_obj <<= 1;
+		}
+	}
+
+	/*
+	 * Per-server penalty is
+	 * prio * bavail * iavail / server_tgts / (num_svr - 1) / 2
+	 */
+	list_for_each_entry(svr, &qos->lq_svr_list, lsq_svr_list) {
+		ba = svr->lsq_bavail;
+		ia = svr->lsq_iavail;
+		svr->lsq_penalty_per_obj = prio_wide * ba  * ia;
+		do_div(ba, svr->lsq_tgt_count * num_active);
+		svr->lsq_penalty_per_obj >>= 1;
+
+		age = (now - svr->lsq_used) >> 3;
+		if (qos->lq_reset || age > 32 * maxage)
+			svr->lsq_penalty = 0;
+		else if (age > maxage)
+			/* Decay server penalty. */
+			svr->lsq_penalty >>= age / maxage;
+	}
+
+	qos->lq_dirty = 0;
+	qos->lq_reset = 0;
+
+	/*
+	 * If each tgt has almost same free space, do rr allocation for better
+	 * creation performance
+	 */
+	qos->lq_same_space = 0;
+	if ((ba_max * (256 - qos->lq_threshold_rr)) >> 8 < ba_min &&
+	    (ia_max * (256 - qos->lq_threshold_rr)) >> 8 < ia_min) {
+		qos->lq_same_space = 1;
+		/* Reset weights for the next time we enter qos mode */
+		qos->lq_reset = 1;
+	}
+	rc = 0;
+
+out:
+	if (!rc && qos->lq_same_space)
+		return -EAGAIN;
+
+	return rc;
+}
+EXPORT_SYMBOL(lqos_calc_penalties);
+
+bool lqos_is_usable(struct lu_qos *qos, u32 active_tgt_nr)
+{
+	if (!qos->lq_dirty && qos->lq_same_space)
+		return false;
+
+	if (active_tgt_nr < 2)
+		return false;
+
+	return true;
+}
+EXPORT_SYMBOL(lqos_is_usable);
+
+/**
+ * Calculate weight for a given tgt.
+ *
+ * The final tgt weight is bavail >> 16 * iavail >> 8 minus the tgt and server
+ * penalties.  See lqos_calc_ppts() for how penalties are calculated.
+ *
+ * @tgt		target descriptor
+ */
+void lqos_calc_weight(struct lu_tgt_desc *tgt)
+{
+	struct lu_tgt_qos *ltq = &tgt->ltd_qos;
+	u64 temp, temp2;
+
+	temp = (tgt_statfs_bavail(tgt) >> 16) * (tgt_statfs_iavail(tgt) >> 8);
+	temp2 = ltq->ltq_penalty + ltq->ltq_svr->lsq_penalty;
+	if (temp < temp2)
+		ltq->ltq_weight = 0;
+	else
+		ltq->ltq_weight = temp - temp2;
+}
+EXPORT_SYMBOL(lqos_calc_weight);
+
+/**
+ * Re-calculate weights.
+ *
+ * The function is called when some target was used for a new object. In
+ * this case we should re-calculate all the weights to keep new allocations
+ * balanced well.
+ *
+ * @qos			lu_qos
+ * @ltd			lu_tgt_descs
+ * @tgt			target where a new object was placed
+ * @active_tgt_nr	active tgt number
+ * @total_wt		new total weight for the pool
+ *
+ * Return:		0
+ */
+int lqos_recalc_weight(struct lu_qos *qos, struct lu_tgt_descs *ltd,
+		       struct lu_tgt_desc *tgt, u32 active_tgt_nr,
+		       u64 *total_wt)
+{
+	struct lu_tgt_qos *ltq;
+	struct lu_svr_qos *svr;
+
+	ltq = &tgt->ltd_qos;
+	LASSERT(ltq);
+
+	/* Don't allocate on this device anymore, until the next alloc_qos */
+	ltq->ltq_usable = 0;
+
+	svr = ltq->ltq_svr;
+
+	/*
+	 * Decay old penalty by half (we're adding max penalty, and don't
+	 * want it to run away.)
+	 */
+	ltq->ltq_penalty >>= 1;
+	svr->lsq_penalty >>= 1;
+
+	/* mark the server and tgt as recently used */
+	ltq->ltq_used = svr->lsq_used = ktime_get_real_seconds();
+
+	/* Set max penalties for this tgt and server */
+	ltq->ltq_penalty += ltq->ltq_penalty_per_obj * active_tgt_nr;
+	svr->lsq_penalty += svr->lsq_penalty_per_obj * active_tgt_nr;
+
+	/* Decrease all MDS penalties */
+	list_for_each_entry(svr, &qos->lq_svr_list, lsq_svr_list) {
+		if (svr->lsq_penalty < svr->lsq_penalty_per_obj)
+			svr->lsq_penalty = 0;
+		else
+			svr->lsq_penalty -= svr->lsq_penalty_per_obj;
+	}
+
+	*total_wt = 0;
+	/* Decrease all tgt penalties */
+	ltd_foreach_tgt(ltd, tgt) {
+		if (!tgt->ltd_active)
+			continue;
+
+		if (ltq->ltq_penalty < ltq->ltq_penalty_per_obj)
+			ltq->ltq_penalty = 0;
+		else
+			ltq->ltq_penalty -= ltq->ltq_penalty_per_obj;
+
+		lqos_calc_weight(tgt);
+
+		/* Recalc the total weight of usable osts */
+		if (ltq->ltq_usable)
+			*total_wt += ltq->ltq_weight;
+
+		CDEBUG(D_OTHER,
+		       "recalc tgt %d usable=%d avail=%llu tgtppo=%llu tgtp=%llu svrppo=%llu svrp=%llu wt=%llu\n",
+		       tgt->ltd_index, ltq->ltq_usable,
+		       tgt_statfs_bavail(tgt) >> 10,
+		       ltq->ltq_penalty_per_obj >> 10,
+		       ltq->ltq_penalty >> 10,
+		       ltq->ltq_svr->lsq_penalty_per_obj >> 10,
+		       ltq->ltq_svr->lsq_penalty >> 10,
+		       ltq->ltq_weight >> 10);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(lqos_recalc_weight);
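For reference, the weight computation that moves into the shared lqos_calc_weight() reduces to a small pure function. The sketch below uses plain integers in place of struct lu_tgt_desc and illustrative names; it mirrors the shifts and the clamp-at-zero in the patch:

```c
#include <assert.h>

typedef unsigned long long u64;

/* Hedged model of lqos_calc_weight(): a target's weight is
 * (bavail >> 16) * (iavail >> 8) minus the sum of its per-tgt and
 * per-server penalties, clamped at zero so a heavily penalized
 * target simply drops out of the weighted random selection. */
static u64 calc_weight(u64 bavail, u64 iavail, u64 tgt_penalty,
		       u64 svr_penalty)
{
	u64 free_space = (bavail >> 16) * (iavail >> 8);
	u64 penalty = tgt_penalty + svr_penalty;

	return free_space < penalty ? 0 : free_space - penalty;
}
```

The shifts are the overflow guards noted in the code comments ("bavail >> 16 to avoid overflow", "iavail >> 8"), so the returned weight is in coarse units, not bytes.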
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 487/622] lustre: import: Fix missing spin_unlock()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (485 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 486/622] lustre: lmv: share object alloc QoS code with LMV James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:15 ` [lustre-devel] [PATCH 488/622] lnet: o2iblnd: Make credits hiw connection aware James Simmons
                   ` (135 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

A recent patch moved the spin_unlock() down into
each branch of an 'if', but missed the final 'else'.
Add the spin_unlock in the else.
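The invariant the fix restores can be shown with a tiny userspace model (hypothetical names; a plain counter stands in for the real spinlock): once the unlock is pushed into each branch, every path through the if/else chain, including the final else, must drop the lock exactly once.

```c
#include <assert.h>

static int lock_depth;               /* models imp->imp_lock */

static void fake_lock(void)   { lock_depth++; }
static void fake_unlock(void) { lock_depth--; }

/* Sketch of the branch structure in ptlrpc_pinger_process_import()
 * after the unlock was moved into each branch. */
static int process_import(int level_connect, int pingable)
{
	fake_lock();
	if (level_connect) {
		fake_unlock();
		return 1;            /* reconnect path */
	} else if (pingable) {
		fake_unlock();
		return 2;            /* ptlrpc_ping() path */
	} else {
		fake_unlock();       /* the branch the fix adds */
		return 0;
	}
}
```

Without the unlock in the else, any import that is neither connecting nor pingable would leave lock_depth at 1, i.e. return with the spinlock still held.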

Fixes: 428ed8100580 ("lustre: import: fix race between imp_state & imp_invalid")
WC-bug-id: https://jira.whamcloud.com/browse/LU-11542
Lustre-commit: 3dbdd38a6adc ("LU-11542 import: Fix missing spin_unlock()")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35999
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/pinger.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/lustre/ptlrpc/pinger.c b/fs/lustre/ptlrpc/pinger.c
index a812942..f584fc6 100644
--- a/fs/lustre/ptlrpc/pinger.c
+++ b/fs/lustre/ptlrpc/pinger.c
@@ -242,6 +242,8 @@ static void ptlrpc_pinger_process_import(struct obd_import *imp,
 	} else if ((imp->imp_pingable && !suppress) || force_next || force) {
 		spin_unlock(&imp->imp_lock);
 		ptlrpc_ping(imp);
+	} else {
+		spin_unlock(&imp->imp_lock);
 	}
 }
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 488/622] lnet: o2iblnd: Make credits hiw connection aware
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

The IBLND_CREDITS_HIGHWATER mark check currently looks only
at the global peer credits tunable, ignoring the connection
specific queue depth when determining the threshold at
which to send a NOOP message to return credits.

This is incorrect because while connection queue depth
defaults to the same as peer credits, it can be less than
that global value for specific connections.

So we must check for this case when setting the threshold.
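The corrected threshold reduces to a min() over the two limits. A hedged sketch of the post-patch IBLND_CREDITS_HIGHWATER logic as a plain function (illustrative names, version-1 special case omitted):

```c
#include <assert.h>

typedef unsigned int u32;

/* Model of the fixed high-water check: the point at which a NOOP is
 * sent to return credits is the smaller of the global peer-credits
 * high-water tunable and this connection's queue depth minus one.
 * Before the fix, only the global tunable was consulted, so a
 * connection negotiated with a smaller queue depth could never
 * reach the threshold. */
static u32 credits_highwater(u32 peercredits_hiw, u32 conn_queue_depth)
{
	u32 conn_hiw = conn_queue_depth - 1;

	return peercredits_hiw < conn_hiw ? peercredits_hiw : conn_hiw;
}
```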

WC-bug-id: https://jira.whamcloud.com/browse/LU-12569
Lustre-commit: 1b87e8f61781 ("LU-12569 o2iblnd: Make credits hiw connection aware")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35578
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/o2iblnd/o2iblnd.h | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.h b/net/lnet/klnds/o2iblnd/o2iblnd.h
index 2f2337a..bc79874 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.h
@@ -102,9 +102,11 @@ struct kib_tunables {
 #define IBLND_CREDITS_MAX	  ((typeof(((struct kib_msg *)0)->ibm_credits)) - 1)
 
 /* when eagerly to return credits */
-#define IBLND_CREDITS_HIGHWATER(t, v)	((v) == IBLND_MSG_VERSION_1 ? \
-					IBLND_CREDIT_HIGHWATER_V1 : \
-					t->lnd_peercredits_hiw)
+#define IBLND_CREDITS_HIGHWATER(t, conn)			\
+	(((conn)->ibc_version) == IBLND_MSG_VERSION_1 ?		\
+	 IBLND_CREDIT_HIGHWATER_V1 :				\
+	 min((t)->lnd_peercredits_hiw,				\
+	     (u32)(conn)->ibc_queue_depth - 1))
 
 # define kiblnd_rdma_create_id(ns, cb, dev, ps, qpt) rdma_create_id(ns, cb, \
 								    dev, ps, \
@@ -791,7 +793,7 @@ struct kib_peer_ni {
 	tunables = &ni->ni_lnd_tunables.lnd_tun_u.lnd_o2ib;
 
 	if (conn->ibc_outstanding_credits <
-	    IBLND_CREDITS_HIGHWATER(tunables, conn->ibc_version) &&
+	    IBLND_CREDITS_HIGHWATER(tunables, conn) &&
 	    !kiblnd_send_keepalive(conn))
 		return 0; /* No need to send NOOP */
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 489/622] lustre: obdecho: avoid panic with partially object init
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Alexey Lyashkov <c17817@cray.com>

In some cases (like ENOMEM) the init function may never be called,
so any init-related cleanup should be placed in the object delete
handler, not in the object free handler.
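The shape of the fix can be modeled in miniature: the delete handler runs unconditionally, so it must bail out when init never completed, while free only releases the memory itself. A hedged userspace sketch (hypothetical struct and names, free() in place of kfree/kmem_cache_free):

```c
#include <assert.h>
#include <stdlib.h>

/* eco->eo_dev is only set once init succeeded, so it doubles as the
 * "was this object initialized?" flag in the delete handler. */
struct obj {
	void *dev;       /* set by init on success */
	void *oinfo;     /* allocated during init */
};

static int cleanups_run;

static void obj_delete(struct obj *o)
{
	if (!o->dev)     /* delete is called even for half-built objects */
		return;
	free(o->oinfo);  /* init-dependent teardown lives here */
	cleanups_run++;
}

static void obj_free(struct obj *o)
{
	free(o);         /* always safe, whether init ran or not */
}
```

A partially constructed object (dev == NULL) now passes through delete harmlessly instead of dereferencing uninitialized state, which is the panic the patch avoids.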

WC-bug-id: https://jira.whamcloud.com/browse/LU-12707
Lustre-commit: 1a9ca8417c60 ("LU-12707 obdecho: avoid panic with partially object init")
Signed-off-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-on: https://review.whamcloud.com/35950
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdecho/echo_client.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/lustre/obdecho/echo_client.c b/fs/lustre/obdecho/echo_client.c
index 84823ec..172fe11 100644
--- a/fs/lustre/obdecho/echo_client.c
+++ b/fs/lustre/obdecho/echo_client.c
@@ -444,10 +444,16 @@ static int echo_object_init(const struct lu_env *env, struct lu_object *obj,
 	return 0;
 }
 
-static void echo_object_free(const struct lu_env *env, struct lu_object *obj)
+static void echo_object_delete(const struct lu_env *env, struct lu_object *obj)
 {
 	struct echo_object *eco = cl2echo_obj(lu2cl(obj));
-	struct echo_client_obd *ec = eco->eo_dev->ed_ec;
+	struct echo_client_obd *ec;
+
+	/* object delete is called unconditionally - layer init or not */
+	if (!eco->eo_dev)
+		return;
+
+	ec = eco->eo_dev->ed_ec;
 
 	LASSERT(atomic_read(&eco->eo_npages) == 0);
 
@@ -455,10 +461,16 @@ static void echo_object_free(const struct lu_env *env, struct lu_object *obj)
 	list_del_init(&eco->eo_obj_chain);
 	spin_unlock(&ec->ec_lock);
 
+	kfree(eco->eo_oinfo);
+}
+
+static void echo_object_free(const struct lu_env *env, struct lu_object *obj)
+{
+	struct echo_object *eco = cl2echo_obj(lu2cl(obj));
+
 	lu_object_fini(obj);
 	lu_object_header_fini(obj->lo_header);
 
-	kfree(eco->eo_oinfo);
 	kmem_cache_free(echo_object_kmem, eco);
 }
 
@@ -472,7 +484,7 @@ static int echo_object_print(const struct lu_env *env, void *cookie,
 
 static const struct lu_object_operations echo_lu_obj_ops = {
 	.loo_object_init	= echo_object_init,
-	.loo_object_delete	= NULL,
+	.loo_object_delete	= echo_object_delete,
 	.loo_object_release	= NULL,
 	.loo_object_free	= echo_object_free,
 	.loo_object_print	= echo_object_print,
-- 
1.8.3.1


* [lustre-devel] [PATCH 490/622] lnet: o2iblnd: cache max_qp_wr
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

When the device is created, the maximum number of work requests that
can be allocated per QP is already known. Cache that value internally,
and when creating the QP make sure its max_send_wr does not exceed
that maximum. If it does, cap max_send_wr to max_qp_wr and
recalculate the connection's queue depth based on max_qp_wr.
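The capping logic replaces the old retry loop around rdma_create_qp(). A hedged sketch of the post-patch kiblnd_send_wrs() arithmetic as a pure function (plain parameters instead of struct kib_conn; the fastreg flag stands in for IBLND_DEV_CAPS_FASTREG_ENABLED):

```c
#include <assert.h>

/* Compute the send work requests for a connection and, when the
 * device's cached max_qp_wr would be exceeded, shrink the queue
 * depth up front instead of probing rdma_create_qp() in a loop. */
static int send_wrs(int *queue_depth, int max_frags, int fastreg,
		    int max_qp_wr)
{
	int multiplier = 1 + max_frags;   /* 1 WR for the LNet message */

	if (fastreg)
		multiplier += 2;          /* map + invalidate WRs */

	/* a maximum of queue_depth transfers may be in flight */
	int ret = multiplier * *queue_depth;

	if (ret > max_qp_wr)
		*queue_depth = max_qp_wr / multiplier;

	/* never go beyond what the device can handle */
	return ret < max_qp_wr ? ret : max_qp_wr;
}
```

As in the patch, the function both caps the returned max_send_wr at the device limit and rewrites the connection's queue depth so later accounting matches what the QP can actually carry.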

WC-bug-id: https://jira.whamcloud.com/browse/LU-12621
Lustre-commit: 7ee319ed7f9d ("LU-12621 o2iblnd: cache max_qp_wr")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/36073
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/o2iblnd/o2iblnd.c | 42 ++++++++++++++++++++++++----------------
 net/lnet/klnds/o2iblnd/o2iblnd.h |  1 +
 2 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.c b/net/lnet/klnds/o2iblnd/o2iblnd.c
index 278823f..d4d5d4f 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.c
@@ -656,16 +656,28 @@ static unsigned int kiblnd_send_wrs(struct kib_conn *conn)
 	 * One WR for the LNet message
 	 * And ibc_max_frags for the transfer WRs
 	 */
+	int ret;
+	int multiplier = 1 + conn->ibc_max_frags;
 	enum kib_dev_caps dev_caps = conn->ibc_hdev->ibh_dev->ibd_dev_caps;
-	unsigned int ret = 1 + conn->ibc_max_frags;
 
 	/* FastReg needs two extra WRs for map and invalidate */
 	if (dev_caps & IBLND_DEV_CAPS_FASTREG_ENABLED)
-		ret += 2;
+		multiplier += 2;
 
 	/* account for a maximum of ibc_queue_depth in-flight transfers */
-	ret *= conn->ibc_queue_depth;
-	return ret;
+	ret = multiplier * conn->ibc_queue_depth;
+
+	if (ret > conn->ibc_hdev->ibh_max_qp_wr) {
+		CDEBUG(D_NET,
+		       "peer_credits %u will result in send work request size %d larger than maximum %d device can handle\n",
+		       conn->ibc_queue_depth, ret,
+		       conn->ibc_hdev->ibh_max_qp_wr);
+		conn->ibc_queue_depth =
+			conn->ibc_hdev->ibh_max_qp_wr / multiplier;
+	}
+
+	/* don't go beyond the maximum the device can handle */
+	return min(ret, conn->ibc_hdev->ibh_max_qp_wr);
 }
 
 struct kib_conn *kiblnd_create_conn(struct kib_peer_ni *peer_ni,
@@ -814,20 +826,13 @@ struct kib_conn *kiblnd_create_conn(struct kib_peer_ni *peer_ni,
 	init_qp_attr->qp_type = IB_QPT_RC;
 	init_qp_attr->send_cq = cq;
 	init_qp_attr->recv_cq = cq;
+	/* kiblnd_send_wrs() can change the connection's queue depth if
+	 * the maximum work requests for the device is maxed out
+	 */
+	init_qp_attr->cap.max_send_wr = kiblnd_send_wrs(conn);
+	init_qp_attr->cap.max_recv_wr = IBLND_RECV_WRS(conn);
 
-	conn->ibc_sched = sched;
-
-	do {
-		init_qp_attr->cap.max_send_wr = kiblnd_send_wrs(conn);
-		init_qp_attr->cap.max_recv_wr = IBLND_RECV_WRS(conn);
-
-		rc = rdma_create_qp(cmid, conn->ibc_hdev->ibh_pd, init_qp_attr);
-		if (!rc || conn->ibc_queue_depth < 2)
-			break;
-
-		conn->ibc_queue_depth--;
-	} while (rc);
-
+	rc = rdma_create_qp(cmid, conn->ibc_hdev->ibh_pd, init_qp_attr);
 	if (rc) {
 		CERROR("Can't create QP: %d, send_wr: %d, recv_wr: %d, send_sge: %d, recv_sge: %d\n",
 		       rc, init_qp_attr->cap.max_send_wr,
@@ -837,6 +842,8 @@ struct kib_conn *kiblnd_create_conn(struct kib_peer_ni *peer_ni,
 		goto failed_2;
 	}
 
+	conn->ibc_sched = sched;
+
 	if (conn->ibc_queue_depth != peer_ni->ibp_queue_depth)
 		CWARN("peer %s - queue depth reduced from %u to %u  to allow for qp creation\n",
 		      libcfs_nid2str(peer_ni->ibp_nid),
@@ -2330,6 +2337,7 @@ static int kiblnd_hdev_get_attr(struct kib_hca_dev *hdev)
 	}
 
 	hdev->ibh_mr_size = dev_attr->max_mr_size;
+	hdev->ibh_max_qp_wr = dev_attr->max_qp_wr;
 
 	CERROR("Invalid mr size: %#llx\n", hdev->ibh_mr_size);
 	return -EINVAL;
diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.h b/net/lnet/klnds/o2iblnd/o2iblnd.h
index bc79874..ac91757 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.h
@@ -178,6 +178,7 @@ struct kib_hca_dev {
 	int			ibh_page_size;	/* page size of current HCA */
 	u64			ibh_page_mask;	/* page mask of current HCA */
 	u64			ibh_mr_size;	/* size of MR */
+	int			ibh_max_qp_wr;	/* maximum work requests size */
 	struct ib_pd		*ibh_pd;	/* PD */
 	struct kib_dev		*ibh_dev;	/* owner */
 	atomic_t		ibh_ref;	/* refcount */
-- 
1.8.3.1


* [lustre-devel] [PATCH 491/622] lustre: som: integrate LSOM with lfs find
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (489 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 490/622] lnet: o2iblnd: cache max_qp_wr James Simmons
@ 2020-02-27 21:15 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 492/622] lustre: llite: error handling of ll_och_fill() James Simmons
                   ` (131 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:15 UTC (permalink / raw)
  To: lustre-devel

From: Qian Yingjin <qian@ddn.com>

The patch integrates LSOM functionality with lfs find so that it
is possible to use LSOM functionality directly on the client. The
MDS fills in the mbo_size and mbo_blocks fields from the LSOM
xattr if the actual size/blocks are not available, and then sets
the new OBD_MD_FLLAZYSIZE and OBD_MD_FLLAZYBLOCKS flags in the
reply so that the client knows these fields are valid.

The lfs find command adds "-l|--lazy" option to allow the use of
LSOM data from the MDS.

Add a new version of ioctl(LL_IOC_MDC_GETINFO) call that also returns
valid flags from the MDS RPC to userspace in struct lov_user_mds_data
so that it is possible to determine whether the size and blocks are
returned by the call.  The old LL_IOC_MDC_GETINFO ioctl number is
renamed to LL_IOC_MDC_GETINFO_OLD and is binary compatible, but
newly-compiled applications will use the new struct lov_user_mds_data.

New llapi interfaces llapi_get_lum_file(), llapi_get_lum_dir(),
llapi_get_lum_file_fd(), llapi_get_lum_dir_fd() are added to fetch
valid stat() attributes and LOV info to the user.

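As a sketch of how a newly-compiled caller might interpret the returned lmd_flags, the helper below mirrors the mask-trimming logic from the diff (the OBD_MD_* and STATX_* values are copied from lustre_idl.h and <linux/stat.h>; the helper itself is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* values match include/uapi/linux/stat.h and lustre_idl.h */
#define STATX_BASIC_STATS	0x000007ffU
#define STATX_SIZE		0x00000200U
#define STATX_BLOCKS		0x00000400U
#define OBD_MD_FLSIZE		0x0000000000000010ULL
#define OBD_MD_FLBLOCKS		0x0000000000000020ULL

/* Given the lmd_flags word returned by LL_IOC_MDC_GETINFO, compute
 * which statx fields the caller may trust: size/blocks are dropped
 * from the mask when the MDS did not mark them valid. */
static uint32_t lsom_statx_mask(uint64_t valid)
{
	uint32_t mask = STATX_BASIC_STATS;

	if (!(valid & OBD_MD_FLSIZE))
		mask &= ~STATX_SIZE;
	if (!(valid & OBD_MD_FLBLOCKS))
		mask &= ~STATX_BLOCKS;
	return mask;
}
```

This is the same trimming the kernel applies to stx_mask before copying struct statx back to userspace.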
WC-bug-id: https://jira.whamcloud.com/browse/LU-11367
Lustre-commit: 11aa7f8704c4 ("LU-11367 som: integrate LSOM with lfs find")
Signed-off-by: Qian Yingjin <qian@ddn.com>
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35167
Reviewed-by: Li Xi <lixi@ddn.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/dir.c                   | 97 +++++++++++++++++++++++++++++++--
 include/uapi/linux/lustre/lustre_idl.h  |  3 +
 include/uapi/linux/lustre/lustre_user.h | 17 +++++-
 3 files changed, 108 insertions(+), 9 deletions(-)

diff --git a/fs/lustre/llite/dir.c b/fs/lustre/llite/dir.c
index 812f535..4dccd24 100644
--- a/fs/lustre/llite/dir.c
+++ b/fs/lustre/llite/dir.c
@@ -1604,16 +1604,24 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	case LL_IOC_LOV_GETSTRIPE:
 	case LL_IOC_LOV_GETSTRIPE_NEW:
 	case LL_IOC_MDC_GETINFO:
+	case LL_IOC_MDC_GETINFO_OLD:
 	case IOC_MDC_GETFILEINFO:
+	case IOC_MDC_GETFILEINFO_OLD:
 	case IOC_MDC_GETFILESTRIPE: {
 		struct ptlrpc_request *request = NULL;
 		struct lov_user_md __user *lump;
 		struct lov_mds_md *lmm = NULL;
 		struct mdt_body *body;
 		char *filename = NULL;
+		lstat_t __user *statp = NULL;
+		struct statx __user *stxp = NULL;
+		u64 __user *flagsp = NULL;
+		u32 __user *lmmsizep = NULL;
+		struct lu_fid __user *fidp = NULL;
 		int lmmsize;
 
-		if (cmd == IOC_MDC_GETFILEINFO ||
+		if (cmd == IOC_MDC_GETFILEINFO_OLD ||
+		    cmd == IOC_MDC_GETFILEINFO ||
 		    cmd == IOC_MDC_GETFILESTRIPE) {
 			filename = ll_getname((const char __user *)arg);
 			if (IS_ERR(filename))
@@ -1635,7 +1643,9 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		}
 
 		if (rc == -ENODATA && (cmd == IOC_MDC_GETFILEINFO ||
-				       cmd == LL_IOC_MDC_GETINFO)) {
+				       cmd == LL_IOC_MDC_GETINFO ||
+				       cmd == IOC_MDC_GETFILEINFO_OLD ||
+				       cmd == LL_IOC_MDC_GETINFO_OLD)) {
 			lmmsize = 0;
 			rc = 0;
 		}
@@ -1647,10 +1657,21 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		    cmd == LL_IOC_LOV_GETSTRIPE ||
 		    cmd == LL_IOC_LOV_GETSTRIPE_NEW) {
 			lump = (struct lov_user_md __user *)arg;
+		} else if (cmd == IOC_MDC_GETFILEINFO_OLD ||
+			   cmd == LL_IOC_MDC_GETINFO_OLD){
+			struct lov_user_mds_data_v1 __user *lmdp;
+
+			lmdp = (struct lov_user_mds_data_v1 __user *)arg;
+			statp = &lmdp->lmd_st;
+			lump = &lmdp->lmd_lmm;
 		} else {
 			struct lov_user_mds_data __user *lmdp;
 
 			lmdp = (struct lov_user_mds_data __user *)arg;
+			fidp = &lmdp->lmd_fid;
+			stxp = &lmdp->lmd_stx;
+			flagsp = &lmdp->lmd_flags;
+			lmmsizep = &lmdp->lmd_lmmsize;
 			lump = &lmdp->lmd_lmm;
 		}
 
@@ -1670,8 +1691,8 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 			rc = -EOVERFLOW;
 		}
 
-		if (cmd == IOC_MDC_GETFILEINFO || cmd == LL_IOC_MDC_GETINFO) {
-			struct lov_user_mds_data __user *lmdp;
+		if (cmd == IOC_MDC_GETFILEINFO_OLD ||
+		    cmd == LL_IOC_MDC_GETINFO_OLD) {
 			lstat_t st = { 0 };
 
 			st.st_dev = inode->i_sb->s_dev;
@@ -1690,8 +1711,72 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 						     sbi->ll_flags &
 						     LL_SBI_32BIT_API);
 
-			lmdp = (struct lov_user_mds_data __user *)arg;
-			if (copy_to_user(&lmdp->lmd_st, &st, sizeof(st))) {
+			if (copy_to_user(statp, &st, sizeof(st))) {
+				rc = -EFAULT;
+				goto out_req;
+			}
+		} else if (cmd == IOC_MDC_GETFILEINFO ||
+			   cmd == LL_IOC_MDC_GETINFO) {
+			struct statx stx = { 0 };
+			u64 valid = body->mbo_valid;
+
+			stx.stx_blksize = PAGE_SIZE;
+			stx.stx_nlink = body->mbo_nlink;
+			stx.stx_uid = body->mbo_uid;
+			stx.stx_gid = body->mbo_gid;
+			stx.stx_mode = body->mbo_mode;
+			stx.stx_ino = cl_fid_build_ino(&body->mbo_fid1,
+						       sbi->ll_flags &
+						       LL_SBI_32BIT_API);
+			stx.stx_size = body->mbo_size;
+			stx.stx_blocks = body->mbo_blocks;
+			stx.stx_atime.tv_sec = body->mbo_atime;
+			stx.stx_ctime.tv_sec = body->mbo_ctime;
+			stx.stx_mtime.tv_sec = body->mbo_mtime;
+			stx.stx_rdev_major = MAJOR(body->mbo_rdev);
+			stx.stx_rdev_minor = MINOR(body->mbo_rdev);
+			stx.stx_dev_major = MAJOR(inode->i_sb->s_dev);
+			stx.stx_dev_minor = MINOR(inode->i_sb->s_dev);
+			stx.stx_mask |= STATX_BASIC_STATS;
+
+			/*
+			 * For a striped directory, the size and blocks returned
+			 * from the MDT are not correct.
+			 * The size and blocks are aggregated by the client
+			 * across all stripes.
+			 * Thus for a striped directory, do not return the valid
+			 * FLSIZE and FLBLOCKS flags to the caller.
+			 * However, this would be better decided by the MDS
+			 * instead of the client.
+			 */
+			if (cmd == LL_IOC_MDC_GETINFO &&
+			    ll_i2info(inode)->lli_lsm_md)
+				valid &= ~(OBD_MD_FLSIZE | OBD_MD_FLBLOCKS);
+
+			if (flagsp && copy_to_user(flagsp, &valid,
+						   sizeof(*flagsp))) {
+				rc = -EFAULT;
+				goto out_req;
+			}
+
+			if (fidp && copy_to_user(fidp, &body->mbo_fid1,
+						 sizeof(*fidp))) {
+				rc = -EFAULT;
+				goto out_req;
+			}
+
+			if (!(valid & OBD_MD_FLSIZE))
+				stx.stx_mask &= ~STATX_SIZE;
+			if (!(valid & OBD_MD_FLBLOCKS))
+				stx.stx_mask &= ~STATX_BLOCKS;
+
+			if (stxp && copy_to_user(stxp, &stx, sizeof(stx))) {
+				rc = -EFAULT;
+				goto out_req;
+			}
+
+			if (lmmsizep && copy_to_user(lmmsizep, &lmmsize,
+						     sizeof(*lmmsizep))) {
 				rc = -EFAULT;
 				goto out_req;
 			}
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 47321ae..d4b29d8 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -1211,6 +1211,9 @@ static inline __u32 lov_mds_md_size(__u16 stripes, __u32 lmm_magic)
 #define OBD_MD_FLPROJID		(0x0100000000000000ULL) /* project ID */
 #define OBD_MD_SECCTX        (0x0200000000000000ULL) /* embed security xattr */
 
+#define OBD_MD_FLLAZYSIZE    (0x0400000000000000ULL) /* Lazy size */
+#define OBD_MD_FLLAZYBLOCKS  (0x0800000000000000ULL) /* Lazy blocks */
+
 #define OBD_MD_FLALLQUOTA (OBD_MD_FLUSRQUOTA | \
 			   OBD_MD_FLGRPQUOTA | \
 			   OBD_MD_FLPRJQUOTA)
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 695ceb2..06a691b 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -371,8 +371,10 @@ struct ll_ioc_lease_id {
 #define IOC_MDC_TYPE		'i'
 #define IOC_MDC_LOOKUP		_IOWR(IOC_MDC_TYPE, 20, struct obd_device *)
 #define IOC_MDC_GETFILESTRIPE	_IOWR(IOC_MDC_TYPE, 21, struct lov_user_md *)
-#define IOC_MDC_GETFILEINFO	_IOWR(IOC_MDC_TYPE, 22, struct lov_user_mds_data *)
-#define LL_IOC_MDC_GETINFO	_IOWR(IOC_MDC_TYPE, 23, struct lov_user_mds_data *)
+#define IOC_MDC_GETFILEINFO_OLD	_IOWR(IOC_MDC_TYPE, 22, struct lov_user_mds_data_v1 *)
+#define IOC_MDC_GETFILEINFO	_IOWR(IOC_MDC_TYPE, 22, struct lov_user_mds_data)
+#define LL_IOC_MDC_GETINFO_OLD	_IOWR(IOC_MDC_TYPE, 23, struct lov_user_mds_data_v1 *)
+#define LL_IOC_MDC_GETINFO	_IOWR(IOC_MDC_TYPE, 23, struct lov_user_mds_data)
 
 #define MAX_OBD_NAME 128 /* If this changes, a NEW ioctl must be added */
 
@@ -636,12 +638,21 @@ static inline __u32 lov_user_md_size(__u16 stripes, __u32 lmm_magic)
  * is possible the application has already #included <sys/stat.h>.
  */
 #ifdef HAVE_LOV_USER_MDS_DATA
-#define lov_user_mds_data lov_user_mds_data_v1
+#define lov_user_mds_data lov_user_mds_data_v2
 struct lov_user_mds_data_v1 {
 	lstat_t lmd_st;			/* MDS stat struct */
 	struct lov_user_md_v1 lmd_lmm;	/* LOV EA V1 user data */
 } __packed;
 
+struct lov_user_mds_data_v2 {
+	struct lu_fid lmd_fid;		/* Lustre FID */
+	struct statx lmd_stx;		/* MDS statx struct */
+	__u64 lmd_flags;		/* MDS stat flags */
+	__u32 lmd_lmmsize;		/* LOV EA size */
+	__u32 lmd_padding;		/* unused */
+	struct lov_user_md_v1 lmd_lmm;	/* LOV EA user data */
+} __attribute__((packed));
+
 struct lov_user_mds_data_v3 {
 	lstat_t lmd_st;			/* MDS stat struct */
 	struct lov_user_md_v3 lmd_lmm;	/* LOV EA V3 user data */
-- 
1.8.3.1


* [lustre-devel] [PATCH 492/622] lustre: llite: error handling of ll_och_fill()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (490 preceding siblings ...)
  2020-02-27 21:15 ` [lustre-devel] [PATCH 491/622] lustre: som: integrate LSOM with lfs find James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 493/622] lnet: Don't queue msg when discovery has completed James Simmons
                   ` (130 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Bobi Jam <bobijam@whamcloud.com>

The error returned by ll_och_fill() should be handled.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12690
Lustre-commit: 4d6d58575d3d ("LU-12690 llite: error handling of ll_och_fill()")
Signed-off-by: Bobi Jam <bobijam@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35913
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 856aa64..31d7dce 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -1091,7 +1091,9 @@ static int ll_lease_och_release(struct inode *inode, struct file *file)
 		goto out_release_it;
 
 	LASSERT(it_disposition(&it, DISP_ENQ_OPEN_REF));
-	ll_och_fill(sbi->ll_md_exp, &it, och);
+	rc = ll_och_fill(sbi->ll_md_exp, &it, och);
+	if (rc)
+		goto out_release_it;
 
 	if (!it_disposition(&it, DISP_OPEN_LEASE)) /* old server? */ {
 		rc = -EOPNOTSUPP;
@@ -2225,7 +2227,9 @@ int ll_release_openhandle(struct inode *inode, struct lookup_intent *it)
 		goto out;
 	}
 
-	ll_och_fill(ll_i2sbi(inode)->ll_md_exp, it, och);
+	rc = ll_och_fill(ll_i2sbi(inode)->ll_md_exp, it, och);
+	if (rc)
+		goto out;
 
 	rc = ll_close_inode_openhandle(inode, och, 0, NULL);
 out:
-- 
1.8.3.1


* [lustre-devel] [PATCH 493/622] lnet: Don't queue msg when discovery has completed
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (491 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 492/622] lustre: llite: error handling of ll_och_fill() James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 494/622] lnet: Use alternate ping processing for non-mr peers James Simmons
                   ` (129 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

In lnet_initiate_peer_discovery(), it is possible for the peer object
to change after the call to lnet_discover_peer_locked(), and it is
also possible for the peer to complete discovery between the first
call to lnet_peer_is_uptodate() and our placing the lnet_msg onto
the peer's lp_dc_pendq. After the call to lnet_discover_peer_locked(),
check whether the (potentially new) peer object is up to date while
holding lp_lock. If the peer is up to date, we needn't queue the
message. Otherwise, continue to hold the lock while placing the
message on the peer's lp_dc_pendq.

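The race fix boils down to performing the up-to-date check and the enqueue under a single hold of lp_lock. A minimal userspace model of that pattern (a pthread mutex standing in for the kernel spinlock; struct and function names are illustrative, not LNet symbols):

```c
#include <assert.h>
#include <pthread.h>

struct peer {
	pthread_mutex_t lock;	/* stands in for lp_lock */
	int uptodate;		/* discovery finished? */
	int queued;		/* stand-in for lp_dc_pendq length */
};

/* Returns 1 if the message was queued to wait for discovery,
 * 0 if discovery had already completed and the send may proceed. */
static int queue_if_discovering(struct peer *p)
{
	int queued = 0;

	pthread_mutex_lock(&p->lock);
	/* check and enqueue under the SAME lock hold: discovery cannot
	 * complete in between, so a queued message is never stranded */
	if (!p->uptodate) {
		p->queued++;	/* list_add_tail(..., &lp_dc_pendq) */
		queued = 1;
	}
	pthread_mutex_unlock(&p->lock);
	return queued;
}
```

The buggy pattern did the up-to-date check, dropped the lock, and re-took it to enqueue, leaving a window in which discovery could finish and drain lp_dc_pendq before the message was added.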
Cray-bug-id: LUS-7596
WC-bug-id: https://jira.whamcloud.com/browse/LU-12739
Lustre-commit: 4ef62976448d ("LU-12739 lnet: Don't queue msg when discovery has completed")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/36139
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h |  1 +
 net/lnet/lnet/lib-move.c      | 19 +++++++++++++------
 net/lnet/lnet/peer.c          | 16 +++++++++++++---
 3 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index f2f5455..db1b7e5 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -876,6 +876,7 @@ int lnet_get_peer_ni_info(u32 peer_index, u64 *nid,
 }
 
 bool lnet_peer_is_uptodate(struct lnet_peer *lp);
+bool lnet_peer_is_uptodate_locked(struct lnet_peer *lp);
 bool lnet_is_discovery_disabled(struct lnet_peer *lp);
 bool lnet_peer_gw_discovery(struct lnet_peer *lp);
 
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 2f31f06..6da0be4 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -1807,15 +1807,21 @@ struct lnet_ni *
 	}
 	/* The peer may have changed. */
 	peer = lpni->lpni_peer_net->lpn_peer;
+	spin_lock(&peer->lp_lock);
+	if (lnet_peer_is_uptodate_locked(peer)) {
+		spin_unlock(&peer->lp_lock);
+		lnet_peer_ni_decref_locked(lpni);
+		return 0;
+	}
 	/* queue message and return */
 	msg->msg_rtr_nid_param = rtr_nid;
 	msg->msg_sending = 0;
 	msg->msg_txpeer = NULL;
-	spin_lock(&peer->lp_lock);
 	list_add_tail(&msg->msg_list, &peer->lp_dc_pendq);
+	primary_nid = peer->lp_primary_nid;
 	spin_unlock(&peer->lp_lock);
+
 	lnet_peer_ni_decref_locked(lpni);
-	primary_nid = peer->lp_primary_nid;
 
 	CDEBUG(D_NET, "msg %p delayed. %s pending discovery\n",
 	       msg, libcfs_nid2str(primary_nid));
@@ -2428,11 +2434,10 @@ struct lnet_ni *
 	 */
 	msg->msg_src_nid_param = src_nid;
 
-	/* Now that we have a peer_ni, check if we want to discover
-	 * the peer. Traffic to the LNET_RESERVED_PORTAL should not
-	 * trigger discovery.
+	/* If necessary, perform discovery on the peer that owns this peer_ni.
+	 * Note, this can result in the ownership of this peer_ni changing
+	 * to another peer object.
 	 */
-	peer = lpni->lpni_peer_net->lpn_peer;
 	rc = lnet_initiate_peer_discovery(lpni, msg, rtr_nid, cpt);
 	if (rc) {
 		lnet_peer_ni_decref_locked(lpni);
@@ -2441,6 +2446,8 @@ struct lnet_ni *
 	}
 	lnet_peer_ni_decref_locked(lpni);
 
+	peer = lpni->lpni_peer_net->lpn_peer;
+
 	/* Identify the different send cases
 	 */
 	if (src_nid == LNET_NID_ANY)
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 088bb62..0d33ade 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -1831,6 +1831,17 @@ struct lnet_peer_ni *
 	return rc;
 }
 
+bool
+lnet_peer_is_uptodate(struct lnet_peer *lp)
+{
+	bool rc;
+
+	spin_lock(&lp->lp_lock);
+	rc = lnet_peer_is_uptodate_locked(lp);
+	spin_unlock(&lp->lp_lock);
+	return rc;
+}
+
 /*
  * Is a peer uptodate from the point of view of discovery?
  *
@@ -1840,11 +1851,11 @@ struct lnet_peer_ni *
  * Otherwise look at whether the peer needs rediscovering.
  */
 bool
-lnet_peer_is_uptodate(struct lnet_peer *lp)
+lnet_peer_is_uptodate_locked(struct lnet_peer *lp)
+__must_hold(&lp->lp_lock)
 {
 	bool rc;
 
-	spin_lock(&lp->lp_lock);
 	if (lp->lp_state & (LNET_PEER_DISCOVERING |
 			    LNET_PEER_FORCE_PING |
 			    LNET_PEER_FORCE_PUSH)) {
@@ -1861,7 +1872,6 @@ struct lnet_peer_ni *
 	} else {
 		rc = false;
 	}
-	spin_unlock(&lp->lp_lock);
 
 	return rc;
 }
-- 
1.8.3.1


* [lustre-devel] [PATCH 494/622] lnet: Use alternate ping processing for non-mr peers
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (492 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 493/622] lnet: Don't queue msg when discovery has completed James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 495/622] lustre: obdclass: qos penalties miscalculated James Simmons
                   ` (128 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

Router peers without multi-rail capabilities (i.e. older Lustre
versions) or router peers that have discovery disabled need to use
the alternate ping processing introduced by LU-12422. Otherwise,
these peers go through the normal discovery processing, but their
remote network interfaces are never added to the peer object. This
causes routes through these peers to be considered down when
avoid_asym_router_failure is enabled.

Cray-bug-id: LUS-7866
WC-bug-id: https://jira.whamcloud.com/browse/LU-12763
Lustre-commit: 010f6b1819b9 ("LU-12763 lnet: Use alternate ping processing for non-mr peers")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/36182
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h | 1 +
 net/lnet/lnet/peer.c          | 1 +
 net/lnet/lnet/router.c        | 9 ++++++---
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index db1b7e5..56556fd 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -878,6 +878,7 @@ int lnet_get_peer_ni_info(u32 peer_index, u64 *nid,
 bool lnet_peer_is_uptodate(struct lnet_peer *lp);
 bool lnet_peer_is_uptodate_locked(struct lnet_peer *lp);
 bool lnet_is_discovery_disabled(struct lnet_peer *lp);
+bool lnet_is_discovery_disabled_locked(struct lnet_peer *lp);
 bool lnet_peer_gw_discovery(struct lnet_peer *lp);
 
 static inline bool
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 0d33ade..a067136 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -1141,6 +1141,7 @@ struct lnet_peer_ni *
 
 bool
 lnet_is_discovery_disabled_locked(struct lnet_peer *lp)
+__must_hold(&lp->lp_lock)
 {
 	if (lnet_peer_discovery_disabled)
 		return true;
diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index 7246eea..a5e4af0 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -227,7 +227,7 @@ bool lnet_is_route_alive(struct lnet_route *route)
 	 * aliveness information can only be obtained when discovery is
 	 * enabled.
 	 */
-	if (lnet_peer_discovery_disabled)
+	if (lnet_is_discovery_disabled(gw))
 		return route->lr_alive;
 
 	/* check the gateway's interfaces on the route rnet to make sure
@@ -316,11 +316,14 @@ bool lnet_is_route_alive(struct lnet_route *route)
 
 	spin_lock(&lp->lp_lock);
 	lp_state = lp->lp_state;
-	spin_unlock(&lp->lp_lock);
 
 	/* only handle replies if discovery is disabled. */
-	if (!lnet_peer_discovery_disabled)
+	if (!lnet_is_discovery_disabled_locked(lp)) {
+		spin_unlock(&lp->lp_lock);
 		return;
+	}
+
+	spin_unlock(&lp->lp_lock);
 
 	if (lp_state & LNET_PEER_PING_FAILED) {
 		CDEBUG(D_NET,
-- 
1.8.3.1


* [lustre-devel] [PATCH 495/622] lustre: obdclass: qos penalties miscalculated
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (493 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 494/622] lnet: Use alternate ping processing for non-mr peers James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 496/622] lustre: osc: wrong cache of LVB attrs James Simmons
                   ` (127 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

In lqos_calc_penalties(), penalty_per_obj is miscalculated: it is
missing the ">> 8" shift that compensates for the prio_wide scaling,
so the computed penalties are too large.

Fixes: e6dd0ec9bcd2 ("lustre: lmv: share object alloc QoS code with LMV")

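A worked model of the corrected arithmetic, assuming prio_wide carries an extra factor of 256 (8 fractional bits), as the new ">> 8" implies. The values are made up for the example, and do_div() is modeled with plain division:

```c
#include <assert.h>
#include <stdint.h>

/* Per-target penalty: prio * bavail * iavail / num_active / 2,
 * where prio_wide is fixed-point scaled by 256. */
static uint64_t penalty_per_obj(uint64_t prio_wide, uint64_t ba,
				uint64_t ia, uint64_t num_active)
{
	/* drop the x256 fixed-point scale carried by prio_wide */
	uint64_t p = (prio_wide * ba * ia) >> 8;

	p /= num_active;	/* stands in for do_div() */
	return p >> 1;		/* halve, per the formula above */
}
```

Without the shift, every penalty comes out 256 times larger than the available-space/inode terms justify, skewing QoS allocation.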
WC-bug-id: https://jira.whamcloud.com/browse/LU-12495
Lustre-commit: 9130d05de4e2 ("LU-12495 obdclass: qos penalties miscalculated")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/36269
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/lu_qos.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/obdclass/lu_qos.c b/fs/lustre/obdclass/lu_qos.c
index e77e81d..13ab4a7 100644
--- a/fs/lustre/obdclass/lu_qos.c
+++ b/fs/lustre/obdclass/lu_qos.c
@@ -323,7 +323,7 @@ int lqos_calc_penalties(struct lu_qos *qos, struct lu_tgt_descs *ltd,
 		 * per-tgt penalty is
 		 * prio * bavail * iavail / (num_tgt - 1) / 2
 		 */
-		tgt->ltd_qos.ltq_penalty_per_obj = prio_wide * ba * ia;
+		tgt->ltd_qos.ltq_penalty_per_obj = prio_wide * ba * ia >> 8;
 		do_div(tgt->ltd_qos.ltq_penalty_per_obj, num_active);
 		tgt->ltd_qos.ltq_penalty_per_obj >>= 1;
 
@@ -357,7 +357,7 @@ int lqos_calc_penalties(struct lu_qos *qos, struct lu_tgt_descs *ltd,
 	list_for_each_entry(svr, &qos->lq_svr_list, lsq_svr_list) {
 		ba = svr->lsq_bavail;
 		ia = svr->lsq_iavail;
-		svr->lsq_penalty_per_obj = prio_wide * ba  * ia;
+		svr->lsq_penalty_per_obj = prio_wide * ba  * ia >> 8;
 		do_div(ba, svr->lsq_tgt_count * num_active);
 		svr->lsq_penalty_per_obj >>= 1;
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 496/622] lustre: osc: wrong cache of LVB attrs
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (494 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 495/622] lustre: obdclass: qos penalties miscalculated James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 497/622] lustre: osc: wrong cache of LVB attrs, part2 James Simmons
                   ` (126 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Vitaly Fertman <c17818@cray.com>

osc object keeps the cache of LVB, obtained on lock enqueue, in
lov_oinfo. This cache gets all the modifications happening on
the client, whereas the original LVB in locks does not get them.
At the same time, this cache is lost on object destroy, which
may occur on a layout change in particular.

ldlm locks are left in LRU and could be matched on next operations.
The first enqueue does not match a lock in LRU due to @kms_ignore in
enqueue_base; however, if the lock is obtained on a small offset
while locks on larger offsets exist in LRU, the obtained size
will be cut by the policy region when set to KMS.

A second enqueue can already match and add stale data to oinfo. Thus
the OSC cache is left with a small KMS. However, the logic that
prepares a partial page checks the KMS to decide whether to read a
page, and as the KMS is small, the page is not read, so the non-read
part of the page is zeroed.

The object destroy detaches dlm locks from the osc object and
offloads the current osc oinfo cache to all the locks, so that it can
be reconstructed for the next osc oinfo. Introduce a per-lock flag to
control the cached attribute status and drop the re-enqueue after osc
object replacement.

This patch also fixes the handling of KMS_IGNORE added in LU-11964.
It is used only to skip the self lock in a search; there is no other
logic for it, and it is not needed for DOM locks at all - all the
relevant semantics is supposed to be accomplished by the cbpending
flag.

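The update-once behaviour the new flag enables can be modeled as below (the flag value matches the lustre_dlm_flags.h hunk; the struct and helper are illustrative, not the actual mdc/osc code):

```c
#include <assert.h>
#include <stdint.h>

#define LDLM_FL_LVB_CACHED	0x0800000000000000ULL	/* bit 59 */

struct lock_flags {
	uint64_t l_flags;
};

/* Copy the lock's LVB into the osc object cache only once per lock:
 * the first grant/match does the copy and marks the lock; later
 * matches see the flag and leave the (possibly newer) client-side
 * cache alone instead of overwriting it with stale attributes. */
static int maybe_update_lvb(struct lock_flags *l)
{
	if (l->l_flags & LDLM_FL_LVB_CACHED)
		return 0;	/* cache already holds these attrs */
	/* ... mdc_lock_lvb_update() would run here ... */
	l->l_flags |= LDLM_FL_LVB_CACHED;
	return 1;
}
```

On object destroy the flag is cleared as the cache is offloaded back to the locks, so the next osc object can reconstruct its oinfo from a matched lock exactly once.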
WC-bug-id: https://jira.whamcloud.com/browse/LU-12681
Lustre-commit: 8ac020df4592 ("LU-12681 osc: wrong cache of LVB attrs")
Signed-off-by: Vitaly Fertman <c17818@cray.com>
Cray-bug-id: LUS-7731
Reviewed-on: https://review.whamcloud.com/36199
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_dlm_flags.h |  8 ++++++
 fs/lustre/llite/namei.c              |  3 ---
 fs/lustre/mdc/mdc_dev.c              | 47 ++++++++++++++++++++++--------------
 fs/lustre/osc/osc_internal.h         |  3 +--
 fs/lustre/osc/osc_lock.c             | 15 ++++++------
 fs/lustre/osc/osc_object.c           | 24 +++++++++++++++++-
 fs/lustre/osc/osc_request.c          | 15 ++----------
 7 files changed, 70 insertions(+), 45 deletions(-)

diff --git a/fs/lustre/include/lustre_dlm_flags.h b/fs/lustre/include/lustre_dlm_flags.h
index 3d69c49..06337d3 100644
--- a/fs/lustre/include/lustre_dlm_flags.h
+++ b/fs/lustre/include/lustre_dlm_flags.h
@@ -399,6 +399,14 @@
 #define ldlm_is_ndelay(_l)		LDLM_TEST_FLAG((_l), 1ULL << 58)
 #define ldlm_set_ndelay(_l)		LDLM_SET_FLAG((_l), 1ULL << 58)
 
+/**
+ * LVB from this lock is cached in osc object
+ */
+#define LDLM_FL_LVB_CACHED              0x0800000000000000ULL /* bit  59 */
+#define ldlm_is_lvb_cached(_l)          LDLM_TEST_FLAG((_l), 1ULL << 59)
+#define ldlm_set_lvb_cached(_l)         LDLM_SET_FLAG((_l), 1ULL << 59)
+#define ldlm_clear_lvb_cached(_l)       LDLM_CLEAR_FLAG((_l), 1ULL << 59)
+
 /** l_flags bits marked as "ast" bits */
 #define LDLM_FL_AST_MASK		(LDLM_FL_FLOCK_DEADLOCK		|\
 					 LDLM_FL_DISCARD_DATA)
diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index de01a73..ce72910 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -276,9 +276,6 @@ static void ll_lock_cancel_bits(struct ldlm_lock *lock, u64 to_cancel)
 			CDEBUG(D_INODE, "cannot flush DoM data "
 			       DFID": rc = %d\n",
 			       PFID(ll_inode2fid(inode)), rc);
-		lock_res_and_lock(lock);
-		ldlm_set_kms_ignore(lock);
-		unlock_res_and_lock(lock);
 	}
 
 	if (bits & MDS_INODELOCK_LAYOUT) {
diff --git a/fs/lustre/mdc/mdc_dev.c b/fs/lustre/mdc/mdc_dev.c
index b49509c..d589f97 100644
--- a/fs/lustre/mdc/mdc_dev.c
+++ b/fs/lustre/mdc/mdc_dev.c
@@ -312,7 +312,6 @@ static int mdc_dlm_blocking_ast0(const struct lu_env *env,
 		dlmlock->l_ast_data = NULL;
 		cl_object_get(obj);
 	}
-	ldlm_set_kms_ignore(dlmlock);
 	unlock_res_and_lock(dlmlock);
 
 	/* if l_ast_data is NULL, the dlmlock was enqueued by AGL or
@@ -432,7 +431,7 @@ void mdc_lock_lvb_update(const struct lu_env *env, struct osc_object *osc,
 }
 
 static void mdc_lock_granted(const struct lu_env *env, struct osc_lock *oscl,
-			     struct lustre_handle *lockh, bool lvb_update)
+			     struct lustre_handle *lockh)
 {
 	struct ldlm_lock *dlmlock;
 
@@ -473,10 +472,11 @@ static void mdc_lock_granted(const struct lu_env *env, struct osc_lock *oscl,
 		descr->cld_end = CL_PAGE_EOF;
 
 		/* no lvb update for matched lock */
-		if (lvb_update) {
+		if (!ldlm_is_lvb_cached(dlmlock)) {
 			LASSERT(oscl->ols_flags & LDLM_FL_LVB_READY);
 			mdc_lock_lvb_update(env, cl2osc(oscl->ols_cl.cls_obj),
 					    dlmlock, NULL);
+			ldlm_set_lvb_cached(dlmlock);
 		}
 	}
 	unlock_res_and_lock(dlmlock);
@@ -514,7 +514,7 @@ static int mdc_lock_upcall(void *cookie, struct lustre_handle *lockh,
 
 	CDEBUG(D_INODE, "rc %d, err %d\n", rc, errcode);
 	if (rc == 0)
-		mdc_lock_granted(env, oscl, lockh, errcode == ELDLM_OK);
+		mdc_lock_granted(env, oscl, lockh);
 
 	/* Error handling, some errors are tolerable. */
 	if (oscl->ols_locklessable && rc == -EUSERS) {
@@ -685,10 +685,8 @@ int mdc_enqueue_send(const struct lu_env *env, struct obd_export *exp,
 	 * LVB information, e.g. canceled locks or locks of just pruned object,
 	 * such locks should be skipped.
 	 */
-	mode = ldlm_lock_match_with_skip(obd->obd_namespace, match_flags,
-					 LDLM_FL_KMS_IGNORE, res_id,
-					 einfo->ei_type, policy, mode,
-					 &lockh, 0);
+	mode = ldlm_lock_match(obd->obd_namespace, match_flags, res_id,
+			       einfo->ei_type, policy, mode, &lockh, 0);
 	if (mode) {
 		struct ldlm_lock *matched;
 
@@ -696,13 +694,6 @@ int mdc_enqueue_send(const struct lu_env *env, struct obd_export *exp,
 			return ELDLM_OK;
 
 		matched = ldlm_handle2lock(&lockh);
-		/* this shouldn't happen but this check is kept to make
-		 * related test fail if problem occurs
-		 */
-		if (unlikely(ldlm_is_kms_ignore(matched))) {
-			LDLM_ERROR(matched, "matched lock has KMS ignore flag");
-			goto no_match;
-		}
 
 		if (OBD_FAIL_CHECK(OBD_FAIL_MDC_GLIMPSE_DDOS))
 			ldlm_set_kms_ignore(matched);
@@ -717,7 +708,6 @@ int mdc_enqueue_send(const struct lu_env *env, struct obd_export *exp,
 			LDLM_LOCK_PUT(matched);
 			return ELDLM_OK;
 		}
-no_match:
 		ldlm_lock_decref(&lockh, mode);
 		LDLM_LOCK_PUT(matched);
 	}
@@ -1362,9 +1352,30 @@ static int mdc_attr_get(const struct lu_env *env, struct cl_object *obj,
 
 static int mdc_object_ast_clear(struct ldlm_lock *lock, void *data)
 {
-	if (lock->l_ast_data == data)
+	struct osc_object *osc = (struct osc_object *)data;
+	struct ost_lvb *lvb = &lock->l_ost_lvb;
+	struct lov_oinfo *oinfo;
+
+	if (lock->l_ast_data == data) {
 		lock->l_ast_data = NULL;
-	ldlm_set_kms_ignore(lock);
+
+		LASSERT(osc);
+		LASSERT(osc->oo_oinfo);
+		LASSERT(lvb);
+
+		/* Updates lvb in lock by the cached oinfo */
+		oinfo = osc->oo_oinfo;
+		cl_object_attr_lock(&osc->oo_cl);
+		memcpy(lvb, &oinfo->loi_lvb, sizeof(oinfo->loi_lvb));
+		cl_object_attr_unlock(&osc->oo_cl);
+
+		LDLM_DEBUG(lock,
+			   "update lvb size %llu blocks %llu [cma]time: %llu %llu %llu",
+			   lvb->lvb_size, lvb->lvb_blocks,
+			   lvb->lvb_ctime, lvb->lvb_mtime, lvb->lvb_atime);
+
+		ldlm_clear_lvb_cached(lock);
+	}
 	return LDLM_ITER_CONTINUE;
 }
 
diff --git a/fs/lustre/osc/osc_internal.h b/fs/lustre/osc/osc_internal.h
index 6f71d8d..b3b365a 100644
--- a/fs/lustre/osc/osc_internal.h
+++ b/fs/lustre/osc/osc_internal.h
@@ -54,8 +54,7 @@ int osc_lock_discard_pages(const struct lu_env *env, struct osc_object *osc,
 
 int osc_enqueue_base(struct obd_export *exp, struct ldlm_res_id *res_id,
 		     u64 *flags, union ldlm_policy_data *policy,
-		     struct ost_lvb *lvb, int kms_valid,
-		     osc_enqueue_upcall_f upcall,
+		     struct ost_lvb *lvb, osc_enqueue_upcall_f upcall,
 		     void *cookie, struct ldlm_enqueue_info *einfo,
 		     struct ptlrpc_request_set *rqset, int async,
 		     bool speculative);
diff --git a/fs/lustre/osc/osc_lock.c b/fs/lustre/osc/osc_lock.c
index dcddf17..02d3436 100644
--- a/fs/lustre/osc/osc_lock.c
+++ b/fs/lustre/osc/osc_lock.c
@@ -143,9 +143,6 @@ static void osc_lock_build_policy(const struct lu_env *env,
  * with the DLM lock reply from the server. Copy of osc_update_enqueue()
  * logic.
  *
- * This can be optimized to not update attributes when lock is a result of a
- * local match.
- *
  * Called under lock and resource spin-locks.
  */
 static void osc_lock_lvb_update(const struct lu_env *env,
@@ -197,7 +194,7 @@ static void osc_lock_lvb_update(const struct lu_env *env,
 }
 
 static void osc_lock_granted(const struct lu_env *env, struct osc_lock *oscl,
-			     struct lustre_handle *lockh, bool lvb_update)
+			     struct lustre_handle *lockh)
 {
 	struct ldlm_lock *dlmlock;
 
@@ -240,10 +237,11 @@ static void osc_lock_granted(const struct lu_env *env, struct osc_lock *oscl,
 		descr->cld_gid = ext->gid;
 
 		/* no lvb update for matched lock */
-		if (lvb_update) {
+		if (!ldlm_is_lvb_cached(dlmlock)) {
 			LASSERT(oscl->ols_flags & LDLM_FL_LVB_READY);
 			osc_lock_lvb_update(env, cl2osc(oscl->ols_cl.cls_obj),
 					    dlmlock, NULL);
+			ldlm_set_lvb_cached(dlmlock);
 		}
 		LINVRNT(osc_lock_invariant(oscl));
 	}
@@ -281,7 +279,7 @@ static int osc_lock_upcall(void *cookie, struct lustre_handle *lockh,
 	}
 
 	if (rc == 0)
-		osc_lock_granted(env, oscl, lockh, errcode == ELDLM_OK);
+		osc_lock_granted(env, oscl, lockh);
 
 	/* Error handling, some errors are tolerable. */
 	if (oscl->ols_locklessable && rc == -EUSERS) {
@@ -338,7 +336,9 @@ static int osc_lock_upcall_speculative(void *cookie,
 	lock_res_and_lock(dlmlock);
 	LASSERT(ldlm_is_granted(dlmlock));
 
-	/* there is no osc_lock associated with speculative lock */
+	/* there is no osc_lock associated with speculative lock
+	 * thus no need to set LDLM_FL_LVB_CACHED
+	 */
 	osc_lock_lvb_update(env, osc, dlmlock, NULL);
 
 	unlock_res_and_lock(dlmlock);
@@ -1022,7 +1022,6 @@ static int osc_lock_enqueue(const struct lu_env *env,
 	}
 	result = osc_enqueue_base(exp, resname, &oscl->ols_flags,
 				  policy, &oscl->ols_lvb,
-				  osc->oo_oinfo->loi_kms_valid,
 				  upcall, cookie,
 				  &oscl->ols_einfo, PTLRPCD_SET, async,
 				  oscl->ols_speculative);
diff --git a/fs/lustre/osc/osc_object.c b/fs/lustre/osc/osc_object.c
index fdee8fa..d2206e8 100644
--- a/fs/lustre/osc/osc_object.c
+++ b/fs/lustre/osc/osc_object.c
@@ -196,8 +196,30 @@ int osc_object_glimpse(const struct lu_env *env,
 
 static int osc_object_ast_clear(struct ldlm_lock *lock, void *data)
 {
-	if (lock->l_ast_data == data)
+	struct osc_object *osc = (struct osc_object *)data;
+	struct ost_lvb *lvb = lock->l_lvb_data;
+	struct lov_oinfo *oinfo;
+
+	if (lock->l_ast_data == data) {
 		lock->l_ast_data = NULL;
+
+		LASSERT(osc);
+		LASSERT(osc->oo_oinfo);
+		LASSERT(lvb);
+
+		/* Updates lvb in lock by the cached oinfo */
+		oinfo = osc->oo_oinfo;
+		cl_object_attr_lock(&osc->oo_cl);
+		memcpy(lvb, &oinfo->loi_lvb, sizeof(oinfo->loi_lvb));
+		cl_object_attr_unlock(&osc->oo_cl);
+
+		LDLM_DEBUG(lock,
+			   "update lvb size %llu blocks %llu [cma]time: %llu %llu %llu",
+			   lvb->lvb_size, lvb->lvb_blocks,
+			   lvb->lvb_ctime, lvb->lvb_mtime, lvb->lvb_atime);
+
+		ldlm_clear_lvb_cached(lock);
+	}
 	return LDLM_ITER_CONTINUE;
 }
 
diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index 7ba9ea5..0e32496 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -2496,9 +2496,8 @@ int osc_enqueue_interpret(const struct lu_env *env, struct ptlrpc_request *req,
  */
 int osc_enqueue_base(struct obd_export *exp, struct ldlm_res_id *res_id,
 		     u64 *flags, union ldlm_policy_data *policy,
-		     struct ost_lvb *lvb, int kms_valid,
-		     osc_enqueue_upcall_f upcall, void *cookie,
-		     struct ldlm_enqueue_info *einfo,
+		     struct ost_lvb *lvb, osc_enqueue_upcall_f upcall,
+		     void *cookie, struct ldlm_enqueue_info *einfo,
 		     struct ptlrpc_request_set *rqset, int async,
 		     bool speculative)
 {
@@ -2516,15 +2515,6 @@ int osc_enqueue_base(struct obd_export *exp, struct ldlm_res_id *res_id,
 	policy->l_extent.start -= policy->l_extent.start & ~PAGE_MASK;
 	policy->l_extent.end |= ~PAGE_MASK;
 
-	/*
-	 * kms is not valid when either object is completely fresh (so that no
-	 * locks are cached), or object was evicted. In the latter case cached
-	 * lock cannot be used, because it would prime inode state with
-	 * potentially stale LVB.
-	 */
-	if (!kms_valid)
-		goto no_match;
-
 	/* Next, search for already existing extent locks that will cover us */
 	/* If we're trying to read, we also search for an existing PW lock.  The
 	 * VFS and page cache already protect us locally, so lots of readers/
@@ -2589,7 +2579,6 @@ int osc_enqueue_base(struct obd_export *exp, struct ldlm_res_id *res_id,
 		LDLM_LOCK_PUT(matched);
 	}
 
-no_match:
 	if (*flags & (LDLM_FL_TEST_LOCK | LDLM_FL_MATCH_LOCK))
 		return -ENOLCK;
 	if (intent) {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 497/622] lustre: osc: wrong cache of LVB attrs, part2
@ 2020-02-27 21:16 ` James Simmons
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Vitaly Fertman <c17818@cray.com>

It may happen that the osc oinfo LVB cache has size < kms.

This occurs when a reply re-ordering happens and an older size is
applied to the oinfo unconditionally.

Another possibility is readahead (RA), where osc_match_base() attaches
the DLM lock to the osc object but does not cache the LVB. The next
layout change then overwrites the lock LVB with the oinfo cache
(the previous LUS-7731 fix), presumably with smaller values. As a
result, the next re-use of the lock may run into a problem with a
partial page write that wrongly concludes the preliminary read is
not needed.

Do not let the cached oinfo LVB size become less than kms.
Also, cache the lock's LVB in the oinfo on osc_match_base().
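The kms clamp described above can be sketched as a standalone model
(hypothetical struct and function names, not the kernel's): kms is only
ever raised by an LVB update, and the cached size is never allowed to
drop below kms, so a re-ordered older reply cannot shrink the size.

```c
#include <assert.h>

/*
 * Minimal model of the clamp this patch adds. All names here are
 * illustrative; the real code lives in osc_lock_lvb_update() and
 * mdc_lock_lvb_update().
 */
typedef unsigned long long u64;

struct lvb_model  { u64 size; };
struct attr_model { u64 size; u64 kms; };

static void lvb_update_model(struct attr_model *attr,
			     const struct lvb_model *lvb)
{
	attr->size = lvb->size;
	if (lvb->size >= attr->kms)
		attr->kms = lvb->size;	/* raise kms to the lock's size */
	/* The size should not be less than the kms */
	if (attr->size < attr->kms)
		attr->size = attr->kms;
}
```

Without the final clamp, the second (re-ordered, older) reply in the
scenario below would leave size = 4096 while kms = 8192 -- exactly the
inconsistency the commit message describes.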

WC-bug-id: https://jira.whamcloud.com/browse/LU-12681
Lustre-commit: 40319db5bc64 ("LU-12681 osc: wrong cache of LVB attrs, part2")
Signed-off-by: Vitaly Fertman <c17818@cray.com>
Cray-bug-id: LUS-7731
Reviewed-on: https://review.whamcloud.com/36200
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/mdc/mdc_dev.c      | 72 +++++++++++++++++++++++++++-----------------
 fs/lustre/osc/osc_internal.h | 12 ++++++--
 fs/lustre/osc/osc_lock.c     | 40 +++++++++++++-----------
 fs/lustre/osc/osc_object.c   | 16 ++++++----
 fs/lustre/osc/osc_request.c  | 19 +++++++++---
 5 files changed, 100 insertions(+), 59 deletions(-)

diff --git a/fs/lustre/mdc/mdc_dev.c b/fs/lustre/mdc/mdc_dev.c
index d589f97..312e527 100644
--- a/fs/lustre/mdc/mdc_dev.c
+++ b/fs/lustre/mdc/mdc_dev.c
@@ -69,21 +69,17 @@ static void mdc_lock_lvb_update(const struct lu_env *env,
 				struct ldlm_lock *dlmlock,
 				struct ost_lvb *lvb);
 
-static int mdc_set_dom_lock_data(const struct lu_env *env,
-				 struct ldlm_lock *lock, void *data)
+static int mdc_set_dom_lock_data(struct ldlm_lock *lock, void *data)
 {
-	struct osc_object *obj = data;
 	int set = 0;
 
 	LASSERT(lock);
 	LASSERT(lock->l_glimpse_ast == mdc_ldlm_glimpse_ast);
 
 	lock_res_and_lock(lock);
-	if (!lock->l_ast_data) {
-		lock->l_ast_data = data;
-		mdc_lock_lvb_update(env, obj, lock, NULL);
-	}
 
+	if (!lock->l_ast_data)
+		lock->l_ast_data = data;
 	if (lock->l_ast_data == data)
 		set = 1;
 
@@ -93,9 +89,9 @@ static int mdc_set_dom_lock_data(const struct lu_env *env,
 }
 
 int mdc_dom_lock_match(const struct lu_env *env, struct obd_export *exp,
-		       struct ldlm_res_id *res_id,
-		       enum ldlm_type type, union ldlm_policy_data *policy,
-		       enum ldlm_mode mode, u64 *flags, void *data,
+		       struct ldlm_res_id *res_id, enum ldlm_type type,
+		       union ldlm_policy_data *policy, enum ldlm_mode mode,
+		       u64 *flags, struct osc_object *obj,
 		       struct lustre_handle *lockh, int unref)
 {
 	struct obd_device *obd = exp->exp_obd;
@@ -107,11 +103,19 @@ int mdc_dom_lock_match(const struct lu_env *env, struct obd_export *exp,
 	if (rc == 0 || lflags & LDLM_FL_TEST_LOCK)
 		return rc;
 
-	if (data) {
+	if (obj) {
 		struct ldlm_lock *lock = ldlm_handle2lock(lockh);
 
 		LASSERT(lock);
-		if (!mdc_set_dom_lock_data(env, lock, data)) {
+		if (mdc_set_dom_lock_data(lock, obj)) {
+			lock_res_and_lock(lock);
+			if (!ldlm_is_lvb_cached(lock)) {
+				LASSERT(lock->l_ast_data == obj);
+				mdc_lock_lvb_update(env, obj, lock, NULL);
+				ldlm_set_lvb_cached(lock);
+			}
+			unlock_res_and_lock(lock);
+		} else {
 			ldlm_lock_decref(lockh, rc);
 			rc = 0;
 		}
@@ -400,6 +404,7 @@ void mdc_lock_lvb_update(const struct lu_env *env, struct osc_object *osc,
 	struct cl_attr *attr = &osc_env_info(env)->oti_attr;
 	unsigned int valid = CAT_BLOCKS | CAT_ATIME | CAT_CTIME | CAT_MTIME |
 			     CAT_SIZE;
+	unsigned int setkms = 0;
 
 	if (!lvb) {
 		LASSERT(dlmlock);
@@ -415,17 +420,23 @@ void mdc_lock_lvb_update(const struct lu_env *env, struct osc_object *osc,
 		size = lvb->lvb_size;
 
 		if (size >= oinfo->loi_kms) {
-			LDLM_DEBUG(dlmlock,
-				   "lock acquired, setting rss=%llu, kms=%llu",
-				   lvb->lvb_size, size);
 			valid |= CAT_KMS;
 			attr->cat_kms = size;
-		} else {
-			LDLM_DEBUG(dlmlock,
-				   "lock acquired, setting rss=%llu, leaving kms=%llu",
-				   lvb->lvb_size, oinfo->loi_kms);
+			setkms = 1;
 		}
 	}
+
+	/* The size should not be less than the kms */
+	if (attr->cat_size < oinfo->loi_kms)
+		attr->cat_size = oinfo->loi_kms;
+
+	LDLM_DEBUG(dlmlock,
+		   "acquired size %llu, setting rss=%llu;%s kms=%llu, end=%llu",
+		   lvb->lvb_size, attr->cat_size,
+		   setkms ? "" : " leaving",
+		   setkms ? attr->cat_kms : oinfo->loi_kms,
+		   dlmlock ? dlmlock->l_policy_data.l_extent.end : -1ull);
+
 	cl_object_attr_update(env, obj, attr, valid);
 	cl_object_attr_unlock(obj);
 }
@@ -433,6 +444,7 @@ void mdc_lock_lvb_update(const struct lu_env *env, struct osc_object *osc,
 static void mdc_lock_granted(const struct lu_env *env, struct osc_lock *oscl,
 			     struct lustre_handle *lockh)
 {
+	struct osc_object *osc = cl2osc(oscl->ols_cl.cls_obj);
 	struct ldlm_lock *dlmlock;
 
 	dlmlock = ldlm_handle2lock_long(lockh, 0);
@@ -474,8 +486,8 @@ static void mdc_lock_granted(const struct lu_env *env, struct osc_lock *oscl,
 		/* no lvb update for matched lock */
 		if (!ldlm_is_lvb_cached(dlmlock)) {
 			LASSERT(oscl->ols_flags & LDLM_FL_LVB_READY);
-			mdc_lock_lvb_update(env, cl2osc(oscl->ols_cl.cls_obj),
-					    dlmlock, NULL);
+			LASSERT(osc == dlmlock->l_ast_data);
+			mdc_lock_lvb_update(env, osc, dlmlock, NULL);
 			ldlm_set_lvb_cached(dlmlock);
 		}
 	}
@@ -698,7 +710,7 @@ int mdc_enqueue_send(const struct lu_env *env, struct obd_export *exp,
 		if (OBD_FAIL_CHECK(OBD_FAIL_MDC_GLIMPSE_DDOS))
 			ldlm_set_kms_ignore(matched);
 
-		if (mdc_set_dom_lock_data(env, matched, einfo->ei_cbdata)) {
+		if (mdc_set_dom_lock_data(matched, einfo->ei_cbdata)) {
 			*flags |= LDLM_FL_LVB_READY;
 
 			/* We already have a lock, and it's referenced. */
@@ -1365,15 +1377,19 @@ static int mdc_object_ast_clear(struct ldlm_lock *lock, void *data)
 
 		/* Updates lvb in lock by the cached oinfo */
 		oinfo = osc->oo_oinfo;
-		cl_object_attr_lock(&osc->oo_cl);
-		memcpy(lvb, &oinfo->loi_lvb, sizeof(oinfo->loi_lvb));
-		cl_object_attr_unlock(&osc->oo_cl);
 
 		LDLM_DEBUG(lock,
-			   "update lvb size %llu blocks %llu [cma]time: %llu %llu %llu",
-			   lvb->lvb_size, lvb->lvb_blocks,
-			   lvb->lvb_ctime, lvb->lvb_mtime, lvb->lvb_atime);
+			   "update lock size %llu blocks %llu [cma]time: %llu %llu %llu by oinfo size %llu blocks %llu [cma]time %llu %llu %llu",
+			   lvb->lvb_size,
+			   lvb->lvb_blocks, lvb->lvb_ctime, lvb->lvb_mtime,
+			   lvb->lvb_atime, oinfo->loi_lvb.lvb_size,
+			   oinfo->loi_lvb.lvb_blocks, oinfo->loi_lvb.lvb_ctime,
+			   oinfo->loi_lvb.lvb_mtime, oinfo->loi_lvb.lvb_atime);
+		LASSERT(oinfo->loi_lvb.lvb_size >= oinfo->loi_kms);
 
+		cl_object_attr_lock(&osc->oo_cl);
+		memcpy(lvb, &oinfo->loi_lvb, sizeof(oinfo->loi_lvb));
+		cl_object_attr_unlock(&osc->oo_cl);
 		ldlm_clear_lvb_cached(lock);
 	}
 	return LDLM_ITER_CONTINUE;
diff --git a/fs/lustre/osc/osc_internal.h b/fs/lustre/osc/osc_internal.h
index b3b365a..492c60d 100644
--- a/fs/lustre/osc/osc_internal.h
+++ b/fs/lustre/osc/osc_internal.h
@@ -52,6 +52,11 @@ int osc_extent_finish(const struct lu_env *env, struct osc_extent *ext,
 int osc_lock_discard_pages(const struct lu_env *env, struct osc_object *osc,
 			   pgoff_t start, pgoff_t end, bool discard);
 
+void osc_lock_lvb_update(const struct lu_env *env,
+			 struct osc_object *osc,
+			 struct ldlm_lock *dlmlock,
+			 struct ost_lvb *lvb);
+
 int osc_enqueue_base(struct obd_export *exp, struct ldlm_res_id *res_id,
 		     u64 *flags, union ldlm_policy_data *policy,
 		     struct ost_lvb *lvb, osc_enqueue_upcall_f upcall,
@@ -59,9 +64,10 @@ int osc_enqueue_base(struct obd_export *exp, struct ldlm_res_id *res_id,
 		     struct ptlrpc_request_set *rqset, int async,
 		     bool speculative);
 
-int osc_match_base(struct obd_export *exp, struct ldlm_res_id *res_id,
-		   enum ldlm_type type, union ldlm_policy_data *policy,
-		   enum ldlm_mode mode, u64 *flags, void *data,
+int osc_match_base(const struct lu_env *env, struct obd_export *exp,
+		   struct ldlm_res_id *res_id, enum ldlm_type type,
+		   union ldlm_policy_data *policy, enum ldlm_mode mode,
+		   u64 *flags, struct osc_object *obj,
 		   struct lustre_handle *lockh, int unref);
 
 int osc_setattr_async(struct obd_export *exp, struct obdo *oa,
diff --git a/fs/lustre/osc/osc_lock.c b/fs/lustre/osc/osc_lock.c
index 02d3436..ce592d7 100644
--- a/fs/lustre/osc/osc_lock.c
+++ b/fs/lustre/osc/osc_lock.c
@@ -145,15 +145,16 @@ static void osc_lock_build_policy(const struct lu_env *env,
  *
  * Called under lock and resource spin-locks.
  */
-static void osc_lock_lvb_update(const struct lu_env *env,
-				struct osc_object *osc,
-				struct ldlm_lock *dlmlock,
-				struct ost_lvb *lvb)
+void osc_lock_lvb_update(const struct lu_env *env,
+			 struct osc_object *osc,
+			 struct ldlm_lock *dlmlock,
+			 struct ost_lvb *lvb)
 {
 	struct cl_object *obj = osc2cl(osc);
 	struct lov_oinfo *oinfo = osc->oo_oinfo;
 	struct cl_attr *attr = &osc_env_info(env)->oti_attr;
 	unsigned int valid;
+	unsigned int setkms = 0;
 
 	valid = CAT_BLOCKS | CAT_ATIME | CAT_CTIME | CAT_MTIME | CAT_SIZE;
 	if (!lvb)
@@ -175,20 +176,24 @@ static void osc_lock_lvb_update(const struct lu_env *env,
 		if (size > dlmlock->l_policy_data.l_extent.end)
 			size = dlmlock->l_policy_data.l_extent.end + 1;
 		if (size >= oinfo->loi_kms) {
-			LDLM_DEBUG(dlmlock,
-				   "lock acquired, setting rss=%llu, kms=%llu",
-				   lvb->lvb_size, size);
 			valid |= CAT_KMS;
 			attr->cat_kms = size;
-		} else {
-			LDLM_DEBUG(dlmlock,
-				   "lock acquired, setting rss=%llu; leaving kms=%llu, end=%llu",
-				   lvb->lvb_size, oinfo->loi_kms,
-				   dlmlock->l_policy_data.l_extent.end);
+			setkms = 1;
 		}
 		ldlm_lock_allow_match_locked(dlmlock);
 	}
 
+	/* The size should not be less than the kms */
+	if (attr->cat_size < oinfo->loi_kms)
+		attr->cat_size = oinfo->loi_kms;
+
+	LDLM_DEBUG(dlmlock,
+		   "acquired size %llu, setting rss=%llu;%s kms=%llu, end=%llu",
+		   lvb->lvb_size, attr->cat_size,
+		   setkms ? "" : " leaving",
+		   setkms ? attr->cat_kms : oinfo->loi_kms,
+		   dlmlock ? dlmlock->l_policy_data.l_extent.end : -1ull);
+
 	cl_object_attr_update(env, obj, attr, valid);
 	cl_object_attr_unlock(obj);
 }
@@ -196,6 +201,7 @@ static void osc_lock_lvb_update(const struct lu_env *env,
 static void osc_lock_granted(const struct lu_env *env, struct osc_lock *oscl,
 			     struct lustre_handle *lockh)
 {
+	struct osc_object *osc = cl2osc(oscl->ols_cl.cls_obj);
 	struct ldlm_lock *dlmlock;
 
 	dlmlock = ldlm_handle2lock_long(lockh, 0);
@@ -239,8 +245,8 @@ static void osc_lock_granted(const struct lu_env *env, struct osc_lock *oscl,
 		/* no lvb update for matched lock */
 		if (!ldlm_is_lvb_cached(dlmlock)) {
 			LASSERT(oscl->ols_flags & LDLM_FL_LVB_READY);
-			osc_lock_lvb_update(env, cl2osc(oscl->ols_cl.cls_obj),
-					    dlmlock, NULL);
+			LASSERT(osc == dlmlock->l_ast_data);
+			osc_lock_lvb_update(env, osc, dlmlock, NULL);
 			ldlm_set_lvb_cached(dlmlock);
 		}
 		LINVRNT(osc_lock_invariant(oscl));
@@ -1271,9 +1277,9 @@ struct ldlm_lock *osc_obj_dlmlock_at_pgoff(const struct lu_env *env,
 	 * with a uniq gid and it conflicts with all other lock modes too
 	 */
 again:
-	mode = osc_match_base(osc_export(obj), resname, LDLM_EXTENT, policy,
-			      LCK_PR | LCK_PW | LCK_GROUP, &flags, obj, &lockh,
-			      dap_flags & OSC_DAP_FL_CANCELING);
+	mode = osc_match_base(env, osc_export(obj), resname, LDLM_EXTENT,
+			      policy, LCK_PR | LCK_PW | LCK_GROUP, &flags,
+			      obj, &lockh, dap_flags & OSC_DAP_FL_CANCELING);
 	if (mode != 0) {
 		lock = ldlm_handle2lock(&lockh);
 		/* RACE: the lock is cancelled so let's try again */
diff --git a/fs/lustre/osc/osc_object.c b/fs/lustre/osc/osc_object.c
index d2206e8..6d24cd3 100644
--- a/fs/lustre/osc/osc_object.c
+++ b/fs/lustre/osc/osc_object.c
@@ -209,15 +209,19 @@ static int osc_object_ast_clear(struct ldlm_lock *lock, void *data)
 
 		/* Updates lvb in lock by the cached oinfo */
 		oinfo = osc->oo_oinfo;
-		cl_object_attr_lock(&osc->oo_cl);
-		memcpy(lvb, &oinfo->loi_lvb, sizeof(oinfo->loi_lvb));
-		cl_object_attr_unlock(&osc->oo_cl);
 
 		LDLM_DEBUG(lock,
-			   "update lvb size %llu blocks %llu [cma]time: %llu %llu %llu",
-			   lvb->lvb_size, lvb->lvb_blocks,
-			   lvb->lvb_ctime, lvb->lvb_mtime, lvb->lvb_atime);
+			   "update lock size %llu blocks %llu [cma]time: %llu %llu %llu by oinfo size %llu blocks %llu [cma]time %llu %llu %llu",
+			   lvb->lvb_size,
+			   lvb->lvb_blocks, lvb->lvb_ctime, lvb->lvb_mtime,
+			   lvb->lvb_atime, oinfo->loi_lvb.lvb_size,
+			   oinfo->loi_lvb.lvb_blocks, oinfo->loi_lvb.lvb_ctime,
+			   oinfo->loi_lvb.lvb_mtime, oinfo->loi_lvb.lvb_atime);
+		LASSERT(oinfo->loi_lvb.lvb_size >= oinfo->loi_kms);
 
+		cl_object_attr_lock(&osc->oo_cl);
+		memcpy(lvb, &oinfo->loi_lvb, sizeof(oinfo->loi_lvb));
+		cl_object_attr_unlock(&osc->oo_cl);
 		ldlm_clear_lvb_cached(lock);
 	}
 	return LDLM_ITER_CONTINUE;
diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index 0e32496..95e09ce 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -2643,9 +2643,10 @@ int osc_enqueue_base(struct obd_export *exp, struct ldlm_res_id *res_id,
 	return rc;
 }
 
-int osc_match_base(struct obd_export *exp, struct ldlm_res_id *res_id,
-		   enum ldlm_type type, union ldlm_policy_data *policy,
-		   enum ldlm_mode mode, u64 *flags, void *data,
+int osc_match_base(const struct lu_env *env, struct obd_export *exp,
+		   struct ldlm_res_id *res_id, enum ldlm_type type,
+		   union ldlm_policy_data *policy, enum ldlm_mode mode,
+		   u64 *flags, struct osc_object *obj,
 		   struct lustre_handle *lockh, int unref)
 {
 	struct obd_device *obd = exp->exp_obd;
@@ -2674,11 +2675,19 @@ int osc_match_base(struct obd_export *exp, struct ldlm_res_id *res_id,
 	if (!rc || lflags & LDLM_FL_TEST_LOCK)
 		return rc;
 
-	if (data) {
+	if (obj) {
 		struct ldlm_lock *lock = ldlm_handle2lock(lockh);
 
 		LASSERT(lock);
-		if (!osc_set_lock_data(lock, data)) {
+		if (osc_set_lock_data(lock, obj)) {
+			lock_res_and_lock(lock);
+			if (!ldlm_is_lvb_cached(lock)) {
+				LASSERT(lock->l_ast_data == obj);
+				osc_lock_lvb_update(env, obj, lock, NULL);
+				ldlm_set_lvb_cached(lock);
+			}
+			unlock_res_and_lock(lock);
+		} else {
 			ldlm_lock_decref(lockh, rc);
 			rc = 0;
 		}
-- 
1.8.3.1


* [lustre-devel] [PATCH 498/622] lustre: vvp: dirty pages with pagevec
@ 2020-02-27 21:16 ` James Simmons
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

When doing i/o from multiple writers to a single file, the
per-file page cache lock (the mapping lock) becomes a
bottleneck.

Most current uses operate on a single page at a time. This converts
one prominent use, marking pages dirty, to use a pagevec.

When many threads are writing to one file, this improves
write performance by around 25%.

This requires implementing our own version of the
set_page_dirty-->__set_page_dirty_nobuffers functions.

This was modeled on upstream tip of tree:
v5.2-rc4-224-ge01e060fe0 (7/13/2019)

The relevant code is unchanged since Linux 4.17, and has
changed only minimally since before Linux 2.6.
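The contention argument can be sketched with a toy lock-accounting
model (assumed names, not kernel code; PVEC_PAGES mirrors the
PAGEVEC_SIZE of 15 used by kernels of this era): the per-file i_pages
lock is taken once per batch instead of once per page.

```c
#include <assert.h>

/* Toy model of why batching dirty-marking reduces mapping-lock traffic. */
#define PVEC_PAGES 15

struct mapping_model {
	int lock_round_trips;	/* xa_lock_irqsave/xa_unlock_irqrestore pairs */
	int dirty_pages;
};

/* old path: set_page_dirty() per page, one lock round-trip each */
static void dirty_one_by_one(struct mapping_model *m, int npages)
{
	int i;

	for (i = 0; i < npages; i++) {
		m->lock_round_trips++;	/* lock, mark one page, unlock */
		m->dirty_pages++;
	}
}

/* new path: one lock round-trip per batch of up to PVEC_PAGES pages,
 * as vvp_set_pagevec_dirty() does for a full pagevec
 */
static void dirty_by_pagevec(struct mapping_model *m, int npages)
{
	while (npages > 0) {
		int batch = npages < PVEC_PAGES ? npages : PVEC_PAGES;

		m->lock_round_trips++;	/* lock once, mark whole batch, unlock */
		m->dirty_pages += batch;
		npages -= batch;
	}
}
```

For 60 pages this is 60 lock round-trips versus 4; with many writers
hammering one file, those saved acquisitions are where the ~25%
improvement comes from.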

WC-bug-id: https://jira.whamcloud.com/browse/LU-9920
Lustre-commit: a7299cb012f8 ("LU-9920 vvp: dirty pages with pagevec")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/28711
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Li Dongyang <dongyangli@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/cl_object.h   |   2 +-
 fs/lustre/include/lustre_osc.h  |   6 +--
 fs/lustre/llite/llite_lib.c     |   5 +-
 fs/lustre/llite/vvp_io.c        | 102 +++++++++++++++++++++++++++++++++++-----
 fs/lustre/mdc/mdc_request.c     |   7 +--
 fs/lustre/obdecho/echo_client.c |  11 ++++-
 fs/lustre/osc/osc_cache.c       |  13 ++++-
 fs/lustre/osc/osc_io.c          |  23 +++++++--
 fs/lustre/osc/osc_page.c        |   7 ++-
 mm/page-writeback.c             |   1 +
 10 files changed, 144 insertions(+), 33 deletions(-)

diff --git a/fs/lustre/include/cl_object.h b/fs/lustre/include/cl_object.h
index 4c68d7b..75ece62 100644
--- a/fs/lustre/include/cl_object.h
+++ b/fs/lustre/include/cl_object.h
@@ -1458,7 +1458,7 @@ struct cl_io_slice {
 };
 
 typedef void (*cl_commit_cbt)(const struct lu_env *, struct cl_io *,
-			      struct cl_page *);
+			      struct pagevec *);
 
 struct cl_read_ahead {
 	/*
diff --git a/fs/lustre/include/lustre_osc.h b/fs/lustre/include/lustre_osc.h
index de7ccd6..2cd23f2 100644
--- a/fs/lustre/include/lustre_osc.h
+++ b/fs/lustre/include/lustre_osc.h
@@ -584,9 +584,9 @@ int osc_set_async_flags(struct osc_object *obj, struct osc_page *opg,
 int osc_prep_async_page(struct osc_object *osc, struct osc_page *ops,
 			struct page *page, loff_t offset);
 int osc_queue_async_io(const struct lu_env *env, struct cl_io *io,
-		       struct osc_page *ops);
-int osc_page_cache_add(const struct lu_env *env,
-		       const struct cl_page_slice *slice, struct cl_io *io);
+		       struct osc_page *ops, cl_commit_cbt cb);
+int osc_page_cache_add(const struct lu_env *env, struct osc_page *opg,
+		       struct cl_io *io, cl_commit_cbt cb);
 int osc_teardown_async_page(const struct lu_env *env, struct osc_object *obj,
 			    struct osc_page *ops);
 int osc_flush_async_page(const struct lu_env *env, struct cl_io *io,
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index ad7c2e2..5d74f30 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -2149,6 +2149,7 @@ void ll_delete_inode(struct inode *inode)
 	struct ll_inode_info *lli = ll_i2info(inode);
 	struct address_space *mapping = &inode->i_data;
 	unsigned long nrpages;
+	unsigned long flags;
 
 	if (S_ISREG(inode->i_mode) && lli->lli_clob) {
 		/* It is last chance to write out dirty pages,
@@ -2172,9 +2173,9 @@ void ll_delete_inode(struct inode *inode)
 	 */
 	nrpages = mapping->nrpages;
 	if (nrpages) {
-		xa_lock_irq(&mapping->i_pages);
+		xa_lock_irqsave(&mapping->i_pages, flags);
 		nrpages = mapping->nrpages;
-		xa_unlock_irq(&mapping->i_pages);
+		xa_unlock_irqrestore(&mapping->i_pages, flags);
 	} /* Workaround end */
 
 	LASSERTF(nrpages == 0,
diff --git a/fs/lustre/llite/vvp_io.c b/fs/lustre/llite/vvp_io.c
index d0d8b1f..aa8f2e1 100644
--- a/fs/lustre/llite/vvp_io.c
+++ b/fs/lustre/llite/vvp_io.c
@@ -39,7 +39,8 @@
 #define DEBUG_SUBSYSTEM S_LLITE
 
 #include <obd.h>
-
+#include <linux/pagevec.h>
+#include <linux/memcontrol.h>
 #include "llite_internal.h"
 #include "vvp_internal.h"
 
@@ -860,19 +861,98 @@ static int vvp_io_commit_sync(const struct lu_env *env, struct cl_io *io,
 	return bytes > 0 ? bytes : rc;
 }
 
+/* Taken from kernel set_page_dirty, __set_page_dirty_nobuffers
+ * Last change to this area: b93b016313b3ba8003c3b8bb71f569af91f19fc7
+ *
+ * Current with Linus tip of tree (7/13/2019):
+ * v5.2-rc4-224-ge01e060fe0
+ *
+ */
+void vvp_set_pagevec_dirty(struct pagevec *pvec)
+{
+	struct page *page = pvec->pages[0];
+	struct address_space *mapping = page->mapping;
+	unsigned long flags;
+	int count = pagevec_count(pvec);
+	int dirtied = 0;
+	int i = 0;
+
+	/* From set_page_dirty */
+	for (i = 0; i < count; i++)
+		ClearPageReclaim(pvec->pages[i]);
+
+	LASSERTF(page->mapping,
+		 "mapping must be set. page %p, page->private (cl_page) %p",
+		 page, (void *) page->private);
+
+	/* Rest of code derived from __set_page_dirty_nobuffers */
+	xa_lock_irqsave(&mapping->i_pages, flags);
+
+	/* Notes on differences with __set_page_dirty_nobuffers:
+	 * 1. We don't need to call page_mapping because we know this is a page
+	 * cache page.
+	 * 2. We have the pages locked, so there is no need for the careful
+	 * mapping/mapping2 dance.
+	 * 3. No mapping is impossible. (Race w/truncate mentioned in
+	 * dirty_nobuffers should be impossible because we hold the page lock.)
+	 * 4. All mappings are the same because i/o is only to one file.
+	 * 5. We invert the lock order on lock_page_memcg(page) and the mapping
+	 * xa_lock, but this is the only function that should use that pair of
+	 * locks and it can't race because Lustre locks pages throughout i/o.
+	 */
+	for (i = 0; i < count; i++) {
+		page = pvec->pages[i];
+		lock_page_memcg(page);
+		if (TestSetPageDirty(page)) {
+			unlock_page_memcg(page);
+			continue;
+		}
+		LASSERTF(page->mapping == mapping,
+			 "all pages must have the same mapping.  page %p, mapping %p, first mapping %p\n",
+			 page, page->mapping, mapping);
+		WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
+		account_page_dirtied(page, mapping);
+		__xa_set_mark(&mapping->i_pages, page_index(page),
+			      PAGECACHE_TAG_DIRTY);
+		dirtied++;
+		unlock_page_memcg(page);
+	}
+	xa_unlock_irqrestore(&mapping->i_pages, flags);
+
+	CDEBUG(D_VFSTRACE, "mapping %p, count %d, dirtied %d\n", mapping,
+	       count, dirtied);
+
+	if (mapping->host && dirtied) {
+		/* !PageAnon && !swapper_space */
+		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+	}
+}
+
 static void write_commit_callback(const struct lu_env *env, struct cl_io *io,
-				  struct cl_page *page)
+				  struct pagevec *pvec)
 {
-	struct page *vmpage = page->cp_vmpage;
+	struct cl_page *page;
+	struct page *vmpage;
+	int count = 0;
+	int i = 0;
 
-	SetPageUptodate(vmpage);
-	set_page_dirty(vmpage);
+	count = pagevec_count(pvec);
+	LASSERT(count > 0);
 
-	cl_page_disown(env, io, page);
+	for (i = 0; i < count; i++) {
+		vmpage = pvec->pages[i];
+		SetPageUptodate(vmpage);
+	}
+
+	vvp_set_pagevec_dirty(pvec);
 
-	/* held in ll_cl_init() */
-	lu_ref_del(&page->cp_reference, "cl_io", cl_io_top(io));
-	cl_page_put(env, page);
+	for (i = 0; i < count; i++) {
+		vmpage = pvec->pages[i];
+		page = (struct cl_page *) vmpage->private;
+		cl_page_disown(env, io, page);
+		lu_ref_del(&page->cp_reference, "cl_io", cl_io_top(io));
+		cl_page_put(env, page);
+	}
 }
 
 /* make sure the page list is contiguous */
@@ -1128,9 +1208,9 @@ static int vvp_io_kernel_fault(struct vvp_fault_io *cfio)
 }
 
 static void mkwrite_commit_callback(const struct lu_env *env, struct cl_io *io,
-				    struct cl_page *page)
+				    struct pagevec *pvec)
 {
-	set_page_dirty(page->cp_vmpage);
+	vvp_set_pagevec_dirty(pvec);
 }
 
 static int vvp_io_fault_start(const struct lu_env *env,
diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index 34cf177..287013f 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -1138,16 +1138,17 @@ static struct page *mdc_page_locate(struct address_space *mapping, u64 *hash,
 	 */
 	unsigned long offset = hash_x_index(*hash, hash64);
 	struct page *page;
+	unsigned long flags;
 	int found;
 
-	xa_lock_irq(&mapping->i_pages);
+	xa_lock_irqsave(&mapping->i_pages, flags);
 	found = radix_tree_gang_lookup(&mapping->i_pages,
 				       (void **)&page, offset, 1);
 	if (found > 0 && !xa_is_value(page)) {
 		struct lu_dirpage *dp;
 
 		get_page(page);
-		xa_unlock_irq(&mapping->i_pages);
+		xa_unlock_irqrestore(&mapping->i_pages, flags);
 		/*
 		 * In contrast to find_lock_page() we are sure that directory
 		 * page cannot be truncated (while DLM lock is held) and,
@@ -1197,7 +1198,7 @@ static struct page *mdc_page_locate(struct address_space *mapping, u64 *hash,
 			page = ERR_PTR(-EIO);
 		}
 	} else {
-		xa_unlock_irq(&mapping->i_pages);
+		xa_unlock_irqrestore(&mapping->i_pages, flags);
 		page = NULL;
 	}
 	return page;
diff --git a/fs/lustre/obdecho/echo_client.c b/fs/lustre/obdecho/echo_client.c
index 172fe11..8e04636 100644
--- a/fs/lustre/obdecho/echo_client.c
+++ b/fs/lustre/obdecho/echo_client.c
@@ -998,16 +998,23 @@ static int __cl_echo_cancel(struct lu_env *env, struct echo_device *ed,
 }
 
 static void echo_commit_callback(const struct lu_env *env, struct cl_io *io,
-				 struct cl_page *page)
+				 struct pagevec *pvec)
 {
 	struct echo_thread_info *info;
 	struct cl_2queue *queue;
+	int i = 0;
 
 	info = echo_env_info(env);
 	LASSERT(io == &info->eti_io);
 
 	queue = &info->eti_queue;
-	cl_page_list_add(&queue->c2_qout, page);
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *vmpage = pvec->pages[i];
+		struct cl_page *page = (struct cl_page *)vmpage->private;
+
+		cl_page_list_add(&queue->c2_qout, page);
+	}
 }
 
 static int cl_echo_object_brw(struct echo_object *eco, int rw, u64 offset,
diff --git a/fs/lustre/osc/osc_cache.c b/fs/lustre/osc/osc_cache.c
index 3d47c02..dde03bd 100644
--- a/fs/lustre/osc/osc_cache.c
+++ b/fs/lustre/osc/osc_cache.c
@@ -2303,13 +2303,14 @@ int osc_prep_async_page(struct osc_object *osc, struct osc_page *ops,
 EXPORT_SYMBOL(osc_prep_async_page);
 
 int osc_queue_async_io(const struct lu_env *env, struct cl_io *io,
-		       struct osc_page *ops)
+		       struct osc_page *ops, cl_commit_cbt cb)
 {
 	struct osc_io *oio = osc_env_io(env);
 	struct osc_extent *ext = NULL;
 	struct osc_async_page *oap = &ops->ops_oap;
 	struct client_obd *cli = oap->oap_cli;
 	struct osc_object *osc = oap->oap_obj;
+	struct pagevec        *pvec = &osc_env_info(env)->oti_pagevec;
 	pgoff_t index;
 	unsigned int grants = 0, tmp;
 	int brw_flags = OBD_BRW_ASYNC;
@@ -2431,7 +2432,15 @@ int osc_queue_async_io(const struct lu_env *env, struct cl_io *io,
 
 		rc = 0;
 		if (grants == 0) {
-			/* we haven't allocated grant for this page. */
+			/* We haven't allocated grant for this page, and we
+			 * must not hold a page lock while we do enter_cache,
+			 * so we must mark dirty & unlock any pages in the
+			 * write commit pagevec.
+			 */
+			if (pagevec_count(pvec)) {
+				cb(env, io, pvec);
+				pagevec_reinit(pvec);
+			}
 			rc = osc_enter_cache(env, cli, oap, tmp);
 			if (rc == 0)
 				grants = tmp;
diff --git a/fs/lustre/osc/osc_io.c b/fs/lustre/osc/osc_io.c
index 8e299d4..f340266 100644
--- a/fs/lustre/osc/osc_io.c
+++ b/fs/lustre/osc/osc_io.c
@@ -40,6 +40,7 @@
 
 #include <lustre_obdo.h>
 #include <lustre_osc.h>
+#include <linux/pagevec.h>
 
 #include "osc_internal.h"
 
@@ -288,6 +289,7 @@ int osc_io_commit_async(const struct lu_env *env,
 	struct cl_page *page;
 	struct cl_page *last_page;
 	struct osc_page *opg;
+	struct pagevec  *pvec = &osc_env_info(env)->oti_pagevec;
 	int result = 0;
 
 	LASSERT(qin->pl_nr > 0);
@@ -306,6 +308,8 @@ int osc_io_commit_async(const struct lu_env *env,
 		}
 	}
 
+	pagevec_init(pvec);
+
 	while (qin->pl_nr > 0) {
 		struct osc_async_page *oap;
 
@@ -325,7 +329,7 @@ int osc_io_commit_async(const struct lu_env *env,
 
 		/* The page may be already in dirty cache. */
 		if (list_empty(&oap->oap_pending_item)) {
-			result = osc_page_cache_add(env, &opg->ops_cl, io);
+			result = osc_page_cache_add(env, opg, io, cb);
 			if (result != 0)
 				break;
 		}
@@ -335,12 +339,21 @@ int osc_io_commit_async(const struct lu_env *env,
 
 		cl_page_list_del(env, qin, page);
 
-		(*cb)(env, io, page);
-		/* Can't access page any more. Page can be in transfer and
-		 * complete at any time.
-		 */
+		/* if there are no more slots, do the callback & reinit */
+		if (pagevec_add(pvec, page->cp_vmpage) == 0) {
+			(*cb)(env, io, pvec);
+			pagevec_reinit(pvec);
+		}
 	}
 
+	/* Clean up any partially full pagevecs */
+	if (pagevec_count(pvec) != 0)
+		(*cb)(env, io, pvec);
+
+	/* Can't access these pages any more. Page can be in transfer and
+	 * complete at any time.
+	 */
+
 	/* for sync write, kernel will wait for this page to be flushed before
 	 * osc_io_end() is called, so release it earlier.
 	 * for mkwrite(), it's known there is no further pages.
diff --git a/fs/lustre/osc/osc_page.c b/fs/lustre/osc/osc_page.c
index 0910f3a..6685968 100644
--- a/fs/lustre/osc/osc_page.c
+++ b/fs/lustre/osc/osc_page.c
@@ -92,14 +92,13 @@ static void osc_page_transfer_add(const struct lu_env *env,
 	osc_lru_use(osc_cli(obj), opg);
 }
 
-int osc_page_cache_add(const struct lu_env *env,
-		       const struct cl_page_slice *slice, struct cl_io *io)
+int osc_page_cache_add(const struct lu_env *env, struct osc_page *opg,
+		       struct cl_io *io, cl_commit_cbt cb)
 {
-	struct osc_page *opg = cl2osc_page(slice);
 	int result;
 
 	osc_page_transfer_get(opg, "transfer\0cache");
-	result = osc_queue_async_io(env, io, opg);
+	result = osc_queue_async_io(env, io, opg, cb);
 	if (result != 0)
 		osc_page_transfer_put(env, opg);
 	else
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 50055d2..3b5a43d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2433,6 +2433,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 		mem_cgroup_track_foreign_dirty(page, wb);
 	}
 }
+EXPORT_SYMBOL(account_page_dirtied);
 
 /*
  * Helper function for deaccounting dirty page without writeback.
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 499/622] lustre: ptlrpc: resend may corrupt the data
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (497 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 498/622] lustre: vvp: dirty pages with pagevec James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 500/622] lnet: eliminate uninitialized warning James Simmons
                   ` (123 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Andriy Skulysh <c17819@cray.com>

A late resend that arrives much later than another modification RPC
already handled on the same slot may still be applied, and would then
override the newer change.

Send RPCs from the client in increasing xid order for each tag, and
check that order on the server to detect a late resend.

A slot can be reused by a client after a kill while the server
continues to rely on it.

Add a flag for such obsolete requests; here we trust the client and
perform the xid check for all in-progress requests.
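The ordering rule described above can be sketched in plain userspace C. This is an illustrative model only: the struct and function names below are invented, not the actual ptlrpc data structures.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TAGS 16

/* Hypothetical per-slot state: the last xid handled for each modify-RPC tag. */
struct slot_state {
	uint64_t last_xid[MAX_TAGS];
};

/*
 * Accept a request only if its xid is strictly greater than the last one
 * handled on the same tag.  A late resend carries an older xid, so it is
 * rejected instead of silently overwriting the newer modification.
 * Returns 1 if accepted, 0 if the request is stale.
 */
static int slot_accept(struct slot_state *s, unsigned int tag, uint64_t xid)
{
	if (tag >= MAX_TAGS || xid <= s->last_xid[tag])
		return 0;
	s->last_xid[tag] = xid;
	return 1;
}
```

With this convention a resend of an already-superseded request (same tag, older xid) is simply dropped by the server-side check.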

Cray-bug-id: LUS-6272, LUS-7277, LUS-7339
WC-bug-id: https://jira.whamcloud.com/browse/LU-11444
Lustre-commit: 23773b32bfe1 ("LU-11444 ptlrpc: resend may corrupt the data")
Signed-off-by: Andriy Skulysh <c17819@cray.com>
Reviewed-on: https://review.whamcloud.com/35114
Reviewed-by: Vitaly Fertman <c17818@cray.com>
Reviewed-by: Andrew Perepechko <c17827@cray.com>
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_mdc.h |  1 +
 fs/lustre/include/lustre_net.h |  1 +
 fs/lustre/llite/llite_lib.c    |  4 +++-
 fs/lustre/obdclass/genops.c    |  6 ++++++
 fs/lustre/ptlrpc/client.c      | 10 ++++++++++
 fs/lustre/ptlrpc/service.c     | 11 ++++++++---
 6 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/fs/lustre/include/lustre_mdc.h b/fs/lustre/include/lustre_mdc.h
index aecb6ee..f57783d 100644
--- a/fs/lustre/include/lustre_mdc.h
+++ b/fs/lustre/include/lustre_mdc.h
@@ -70,6 +70,7 @@ static inline void mdc_get_mod_rpc_slot(struct ptlrpc_request *req,
 	opc = lustre_msg_get_opc(req->rq_reqmsg);
 	tag = obd_get_mod_rpc_slot(cli, opc, it);
 	lustre_msg_set_tag(req->rq_reqmsg, tag);
+	ptlrpc_reassign_next_xid(req);
 }
 
 static inline void mdc_put_mod_rpc_slot(struct ptlrpc_request *req,
diff --git a/fs/lustre/include/lustre_net.h b/fs/lustre/include/lustre_net.h
index 8dad08e..40c1ae8 100644
--- a/fs/lustre/include/lustre_net.h
+++ b/fs/lustre/include/lustre_net.h
@@ -1916,6 +1916,7 @@ void ptlrpc_retain_replayable_request(struct ptlrpc_request *req,
 u64 ptlrpc_next_xid(void);
 u64 ptlrpc_sample_next_xid(void);
 u64 ptlrpc_req_xid(struct ptlrpc_request *request);
+void ptlrpc_reassign_next_xid(struct ptlrpc_request *req);
 
 /* Set of routines to run a function in ptlrpcd context */
 void *ptlrpcd_alloc_work(struct obd_import *imp,
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 5d74f30..4580be3 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -240,6 +240,7 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 				   OBD_CONNECT2_FLR |
 				   OBD_CONNECT2_LOCK_CONVERT |
 				   OBD_CONNECT2_ARCHIVE_ID_ARRAY |
+				   OBD_CONNECT2_INC_XID |
 				   OBD_CONNECT2_LSOM |
 				   OBD_CONNECT2_ASYNC_DISCARD |
 				   OBD_CONNECT2_PCC;
@@ -459,7 +460,8 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 	if (data->ocd_version < OBD_OCD_VERSION(2, 12, 50, 0))
 		data->ocd_connect_flags |= OBD_CONNECT_LOCKAHEAD_OLD;
 
-	data->ocd_connect_flags2 = OBD_CONNECT2_LOCKAHEAD;
+	data->ocd_connect_flags2 = OBD_CONNECT2_LOCKAHEAD |
+				   OBD_CONNECT2_INC_XID;
 
 	if (!OBD_FAIL_CHECK(OBD_FAIL_OSC_CONNECT_GRANT_PARAM))
 		data->ocd_connect_flags |= OBD_CONNECT_GRANT_PARAM;
diff --git a/fs/lustre/obdclass/genops.c b/fs/lustre/obdclass/genops.c
index 49db077..5d4e421 100644
--- a/fs/lustre/obdclass/genops.c
+++ b/fs/lustre/obdclass/genops.c
@@ -1550,6 +1550,12 @@ u16 obd_get_mod_rpc_slot(struct client_obd *cli, u32 opc,
 			LASSERT(!test_and_set_bit(i, cli->cl_mod_tag_bitmap));
 			spin_unlock(&cli->cl_mod_rpcs_lock);
 			/* tag 0 is reserved for non-modify RPCs */
+
+			CDEBUG(D_RPCTRACE,
+			       "%s: modify RPC slot %u is allocated opc %u, max %hu\n",
+			       cli->cl_import->imp_obd->obd_name,
+			       i + 1, opc, max);
+
 			return i + 1;
 		}
 		spin_unlock(&cli->cl_mod_rpcs_lock);
diff --git a/fs/lustre/ptlrpc/client.c b/fs/lustre/ptlrpc/client.c
index c359ac0..8d874f2 100644
--- a/fs/lustre/ptlrpc/client.c
+++ b/fs/lustre/ptlrpc/client.c
@@ -717,6 +717,16 @@ static inline void ptlrpc_assign_next_xid(struct ptlrpc_request *req)
 
 static atomic64_t ptlrpc_last_xid;
 
+void ptlrpc_reassign_next_xid(struct ptlrpc_request *req)
+{
+	spin_lock(&req->rq_import->imp_lock);
+	list_del_init(&req->rq_unreplied_list);
+	ptlrpc_assign_next_xid_nolock(req);
+	spin_unlock(&req->rq_import->imp_lock);
+	DEBUG_REQ(D_RPCTRACE, req, "reassign xid");
+}
+EXPORT_SYMBOL(ptlrpc_reassign_next_xid);
+
 int ptlrpc_request_bufs_pack(struct ptlrpc_request *request,
 			     u32 version, int opcode, char **bufs,
 			     struct ptlrpc_cli_ctx *ctx)
diff --git a/fs/lustre/ptlrpc/service.c b/fs/lustre/ptlrpc/service.c
index c66c690..b2a33a3 100644
--- a/fs/lustre/ptlrpc/service.c
+++ b/fs/lustre/ptlrpc/service.c
@@ -864,6 +864,13 @@ static void ptlrpc_server_drop_request(struct ptlrpc_request *req)
 	}
 }
 
+static void ptlrpc_del_exp_list(struct ptlrpc_request *req)
+{
+	spin_lock(&req->rq_export->exp_rpc_lock);
+	list_del_init(&req->rq_exp_list);
+	spin_unlock(&req->rq_export->exp_rpc_lock);
+}
+
 /**
  * to finish a request: stop sending more early replies, and release
  * the request.
@@ -1367,9 +1374,7 @@ static void ptlrpc_server_hpreq_fini(struct ptlrpc_request *req)
 		if (req->rq_ops->hpreq_fini)
 			req->rq_ops->hpreq_fini(req);
 
-		spin_lock(&req->rq_export->exp_rpc_lock);
-		list_del_init(&req->rq_exp_list);
-		spin_unlock(&req->rq_export->exp_rpc_lock);
+		ptlrpc_del_exp_list(req);
 	}
 }
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 500/622] lnet: eliminate uninitialized warning
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

lustre-release/net/lnet/lnet/router.c: In function 'lnet_del_route':
include/linux/compiler.h:177:26: error: 'lp' may be used uninitialized
in this function [-Werror=maybe-uninitialized]
  case 8: *(__u64 *)res = *(volatile __u64 *)p; break;  \

lustre-release/net/lnet/lnet/router.c:754:20: note: 'lp' was declared here
  struct lnet_peer *lp;

lustre-release/net/lnet/lnet/router.c: At top level:
cc1: error: unrecognized command line option '-Wno-stringop-overflow' [-Werror]
cc1: error: unrecognized command line option '-Wno-stringop-truncation' [-Werror]
cc1: error: unrecognized command line option '-Wno-format-truncation' [-Werror]
cc1: all warnings being treated as errors

The code logic guarantees that @lp and @lpni are initialized at the
same time, but let's initialize @lp to make gcc happy.
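The warning pattern is easy to reproduce in a small userspace sketch (the names here are invented for illustration): a pointer is only assigned on a path that the compiler cannot prove is taken whenever the guard flag is set, so -Wmaybe-uninitialized fires unless the pointer starts as NULL.

```c
#include <assert.h>
#include <stddef.h>

struct peer { int id; };

/*
 * Toy version of the pattern gcc trips over: @lp is assigned only when
 * @found is also set, but the compiler cannot always prove that, so it
 * may warn at the *out store unless @lp is initialized up front.
 */
static int find_peer(int want, struct peer *table, int n, struct peer **out)
{
	struct peer *lp = NULL;	/* init purely to silence the warning */
	int found = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (table[i].id == want) {
			lp = &table[i];
			found = 1;
			break;
		}
	}
	if (!found)
		return -1;
	*out = lp;	/* without the init, gcc may warn here */
	return 0;
}
```

The NULL initialization does not change behavior on any reachable path; it only makes the initialization visible to the compiler's flow analysis.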

WC-bug-id: https://jira.whamcloud.com/browse/LU-12764
Lustre-commit: a8fbaa1b998f ("LU-12764 lnet: eliminate uninitialized warning")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/36189
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Li Dongyang <dongyangli@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/router.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index a5e4af0..447706d 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -721,7 +721,7 @@ static void lnet_shuffle_seed(void)
 	struct lnet_peer_ni *lpni;
 	struct lnet_route *route;
 	struct list_head zombies;
-	struct lnet_peer *lp;
+	struct lnet_peer *lp = NULL;
 	int i = 0;
 
 	INIT_LIST_HEAD(&rnet_zombies);
-- 
1.8.3.1


* [lustre-devel] [PATCH 501/622] lnet: o2ib: Record rc in debug log on startup failure
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

Since kiblnd_startup() returns -ENETDOWN on failure, let's record the
rc value for the failure case in the debug log.

Cray-bug-id: LUS-7935
WC-bug-id: https://jira.whamcloud.com/browse/LU-12824
Lustre-commit: 99f85541a685 ("LU-12824 o2ib: Record rc in debug log on startup failure")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/36325
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/o2iblnd/o2iblnd.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.c b/net/lnet/klnds/o2iblnd/o2iblnd.c
index d4d5d4f..d162b0a7 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.c
@@ -2848,10 +2848,10 @@ static int kiblnd_dev_start_threads(struct kib_dev *dev, u32 *cpts, int ncpts)
 
 static int kiblnd_startup(struct lnet_ni *ni)
 {
-	char *ifname;
+	char *ifname = NULL;
 	struct lnet_inetdev *ifaces = NULL;
 	struct kib_dev *ibdev = NULL;
-	struct kib_net *net;
+	struct kib_net *net = NULL;
 	unsigned long flags;
 	int rc;
 	int i;
@@ -2866,8 +2866,10 @@ static int kiblnd_startup(struct lnet_ni *ni)
 
 	net = kzalloc(sizeof(*net), GFP_NOFS);
 	ni->ni_data = net;
-	if (!net)
+	if (!net) {
+		rc = -ENOMEM;
 		goto net_failed;
+	}
 
 	net->ibn_incarnation = ktime_get_real_ns() / NSEC_PER_USEC;
 
@@ -2884,6 +2886,7 @@ static int kiblnd_startup(struct lnet_ni *ni)
 
 		if (ni->ni_interfaces[1]) {
 			CERROR("ko2iblnd: Multiple interfaces not supported\n");
+			rc = -EINVAL;
 			goto failed;
 		}
 
@@ -2894,6 +2897,7 @@ static int kiblnd_startup(struct lnet_ni *ni)
 
 	if (strlen(ifname) >= sizeof(ibdev->ibd_ifname)) {
 		CERROR("IPoIB interface name too long: %s\n", ifname);
+		rc = -E2BIG;
 		goto failed;
 	}
 
@@ -2968,7 +2972,9 @@ static int kiblnd_startup(struct lnet_ni *ni)
 net_failed:
 	kiblnd_shutdown(ni);
 
-	CDEBUG(D_NET, "%s failed\n", __func__);
+	CDEBUG(D_NET, "Configuration of device %s failed: rc = %d\n",
+	       ifname ? ifname : "", rc);
+
 	return -ENETDOWN;
 }
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 502/622] lnet: o2ib: Reintroduce kiblnd_dev_search
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

If we add an interface to multiple nets then we need to re-use the
struct ib_dev object for each of the nets.

Cray-bug-id: LUS-7935
Fixes: 3aa523159321 ("lnet: consoldate secondary IP address handling")
WC-bug-id: https://jira.whamcloud.com/browse/LU-12824
Lustre-commit: e25e45c612a0 ("LU-12824 o2ib: Reintroduce kiblnd_dev_search")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/36326
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/o2iblnd/o2iblnd.c | 85 +++++++++++++++++++++++++++++-----------
 1 file changed, 63 insertions(+), 22 deletions(-)

diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.c b/net/lnet/klnds/o2iblnd/o2iblnd.c
index d162b0a7..1cc5358 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.c
@@ -2821,7 +2821,8 @@ static int kiblnd_start_schedulers(struct kib_sched_info *sched)
 	return rc;
 }
 
-static int kiblnd_dev_start_threads(struct kib_dev *dev, u32 *cpts, int ncpts)
+static int kiblnd_dev_start_threads(struct kib_dev *dev, bool newdev, u32 *cpts,
+				    int ncpts)
 {
 	int cpt;
 	int rc;
@@ -2833,7 +2834,7 @@ static int kiblnd_dev_start_threads(struct kib_dev *dev, u32 *cpts, int ncpts)
 		cpt = !cpts ? i : cpts[i];
 		sched = kiblnd_data.kib_scheds[cpt];
 
-		if (sched->ibs_nthreads > 0)
+		if (!newdev && sched->ibs_nthreads > 0)
 			continue;
 
 		rc = kiblnd_start_schedulers(kiblnd_data.kib_scheds[cpt]);
@@ -2846,6 +2847,39 @@ static int kiblnd_dev_start_threads(struct kib_dev *dev, u32 *cpts, int ncpts)
 	return 0;
 }
 
+static struct kib_dev *
+kiblnd_dev_search(char *ifname)
+{
+	struct kib_dev *alias = NULL;
+	struct kib_dev *dev;
+	char            *colon;
+	char            *colon2;
+
+	colon = strchr(ifname, ':');
+	list_for_each_entry(dev, &kiblnd_data.kib_devs, ibd_list) {
+		if (strcmp(&dev->ibd_ifname[0], ifname) == 0)
+			return dev;
+
+		if (alias)
+			continue;
+
+		colon2 = strchr(dev->ibd_ifname, ':');
+		if (colon)
+			*colon = 0;
+		if (colon2)
+			*colon2 = 0;
+
+		if (strcmp(&dev->ibd_ifname[0], ifname) == 0)
+			alias = dev;
+
+		if (colon)
+			*colon = ':';
+		if (colon2)
+			*colon2 = ':';
+	}
+	return alias;
+}
+
 static int kiblnd_startup(struct lnet_ni *ni)
 {
 	char *ifname = NULL;
@@ -2855,6 +2889,7 @@ static int kiblnd_startup(struct lnet_ni *ni)
 	unsigned long flags;
 	int rc;
 	int i;
+	bool newdev;
 
 	LASSERT(ni->ni_net->net_lnd == &the_o2iblnd);
 
@@ -2916,36 +2951,42 @@ static int kiblnd_startup(struct lnet_ni *ni)
 		goto failed;
 	}
 
-	ibdev = kzalloc(sizeof(*ibdev), GFP_KERNEL);
-	if (!ibdev) {
-		rc = -ENOMEM;
-		goto failed;
-	}
+	ibdev = kiblnd_dev_search(ifname);
+	newdev = !ibdev;
+	/* hmm...create kib_dev even for alias */
+	if (!ibdev || strcmp(&ibdev->ibd_ifname[0], ifname) != 0) {
+		ibdev = kzalloc(sizeof(*ibdev), GFP_NOFS);
+		if (!ibdev) {
+			rc = -ENOMEM;
+			goto failed;
+		}
 
-	ibdev->ibd_ifip = ifaces[i].li_ipaddr;
-	strlcpy(ibdev->ibd_ifname, ifaces[i].li_name,
-		sizeof(ibdev->ibd_ifname));
-	ibdev->ibd_can_failover = !!(ifaces[i].li_flags & IFF_MASTER);
+		ibdev->ibd_ifip = ifaces[i].li_ipaddr;
+		strlcpy(ibdev->ibd_ifname, ifaces[i].li_name,
+			sizeof(ibdev->ibd_ifname));
+		ibdev->ibd_can_failover = !!(ifaces[i].li_flags & IFF_MASTER);
 
-	INIT_LIST_HEAD(&ibdev->ibd_nets);
-	INIT_LIST_HEAD(&ibdev->ibd_list); /* not yet in kib_devs */
-	INIT_LIST_HEAD(&ibdev->ibd_fail_list);
+		INIT_LIST_HEAD(&ibdev->ibd_nets);
+		INIT_LIST_HEAD(&ibdev->ibd_list); /* not yet in kib_devs */
+		INIT_LIST_HEAD(&ibdev->ibd_fail_list);
 
-	/* initialize the device */
-	rc = kiblnd_dev_failover(ibdev, ni->ni_net_ns);
-	if (rc) {
-		CERROR("ko2iblnd: Can't initialize device: rc = %d\n", rc);
-		goto failed;
-	}
+		/* initialize the device */
+		rc = kiblnd_dev_failover(ibdev, ni->ni_net_ns);
+		if (rc) {
+			CERROR("ko2iblnd: Can't initialize device: rc = %d\n",
+			       rc);
+			goto failed;
+		}
 
-	list_add_tail(&ibdev->ibd_list, &kiblnd_data.kib_devs);
+		list_add_tail(&ibdev->ibd_list, &kiblnd_data.kib_devs);
+	}
 
 	net->ibn_dev = ibdev;
 	ni->ni_nid = LNET_MKNID(LNET_NIDNET(ni->ni_nid), ibdev->ibd_ifip);
 
 	ni->ni_dev_cpt = ifaces[i].li_cpt;
 
-	rc = kiblnd_dev_start_threads(ibdev, ni->ni_cpts, ni->ni_ncpts);
+	rc = kiblnd_dev_start_threads(ibdev, newdev, ni->ni_cpts, ni->ni_ncpts);
 	if (rc)
 		goto failed;
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 503/622] lustre: ptlrpc: fix watchdog ratelimit logic
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

The ptlrpc-level watchdog ratelimiting is broken. The kernel prints:

    mdt00_009: service thread pid 18935 was inactive for 72s.
    Watchdog stack traces are limited to 3 per 300s, skipping...

even though there hasn't been any stack trace printed before.

It looks like the __ratelimit() return value is backward from
what one would expect from normal English grammar, namely that
if __ratelimit() returns true the action should NOT be limited.

Fix the logic checking the __ratelimit() return value, and add a
check in sanity test_422 (which forces a service thread timeout)
to ensure that the watchdog sometimes prints a full stack.
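The corrected call-site convention can be illustrated with a userspace stand-in for __ratelimit(). Everything below is an invented toy (toy_ratelimit and its functions are not kernel APIs); only the return-value convention mirrors the kernel's, where true means the action is allowed.

```c
#include <assert.h>

/*
 * Userspace stand-in for the kernel's __ratelimit(): like the real one
 * it returns nonzero when the action is ALLOWED (i.e. NOT rate limited),
 * the opposite of what the English reading "is rate limited?" suggests.
 */
struct toy_ratelimit {
	int burst;	/* actions allowed per window */
	int used;	/* actions already consumed */
};

static int toy_ratelimit_allow(struct toy_ratelimit *rl)
{
	if (rl->used >= rl->burst)
		return 0;	/* limited: suppress */
	rl->used++;
	return 1;		/* not limited: proceed */
}

/* Correct call site: fire the watchdog dump when the return value is true. */
static int watchdog_would_dump(struct toy_ratelimit *rl)
{
	return toy_ratelimit_allow(rl) ? 1 : 0;
}
```

The original bug was equivalent to writing `if (!toy_ratelimit_allow(rl))` at the call site, which dumps only after the budget is exhausted and prints the "stack traces are limited" message on the very first fire.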

Fixes: aeaf46886c7b ("lustre: ptlrpc: add watchdog for ptlrpc service threads")
WC-bug-id: https://jira.whamcloud.com/browse/LU-12838
Lustre-commit: 594c79f2f855 ("LU-12838 ptlrpc: fix watchdog ratelimit logic")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/36409
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/service.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/ptlrpc/service.c b/fs/lustre/ptlrpc/service.c
index b2a33a3..fe0e108 100644
--- a/fs/lustre/ptlrpc/service.c
+++ b/fs/lustre/ptlrpc/service.c
@@ -2067,7 +2067,8 @@ static void ptlrpc_watchdog_fire(struct work_struct *w)
 	s64 ms_lapse = ktime_ms_delta(ktime_get(), thread->t_touched);
 	u32 ms_frac = do_div(ms_lapse, MSEC_PER_SEC);
 
-	if (!__ratelimit(&watchdog_limit)) {
+	/* ___ratelimit() returns true if the action is NOT ratelimited */
+	if (__ratelimit(&watchdog_limit)) {
 		/* below message is checked in sanity-quota.sh test_6,18 */
 		LCONSOLE_WARN("%s: service thread pid %u was inactive for %llu.%.03u seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:\n",
 			      thread->t_task->comm, thread->t_task->pid,
-- 
1.8.3.1


* [lustre-devel] [PATCH 504/622] lustre: flr: avoid reading unhealthy mirror
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Bobi Jam <bobijam@whamcloud.com>

* Fix an error in lov_io_mirror_init() which would wait unnecessarily
  if we're retrying the last mirror of the file.

* In osc_io_iter_init() we'd check its OSC import status so that the
  read path can quickly switch another mirror.
  sanity-flr test_33b is added to test this case.

* And once all mirrors have been tried, we'd turn off the quick switch
  so that when all mirrors contain bad OSTs, the read will still try
  its best to get partial data from a component before trying another
  mirror.
  sanity-flr test_33c is added to test this case.
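The round-counting behind ci_tried_all_mirrors can be modelled in a few lines. This is a simplified sketch, not the actual lov code; the factor of 4 matches the "several rounds" in the patch, and the helper name is invented.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the retry pacing in lov_io_mirror_init(): the client
 * fast-switches mirrors on unhealthy imports, but once it has cycled
 * through all mirrors 4 full rounds it stops fast-switching and tries
 * to read through unhealthy targets for partial data.
 */
static bool tried_all_mirrors(unsigned int ndelay_tried,
			      unsigned int mirror_count)
{
	if (ndelay_tried == 0 || mirror_count == 0)
		return false;
	return ndelay_tried % (mirror_count * 4) == 0;
}
```

So with 3 mirrors the flag first turns on at the 12th delayed retry and clears again on the next attempt, matching the modulo expression in the diff above.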

Fixes: 4b102da53ad ("lustre: ptlrpc: idle connections can disconnect")
WC-bug-id: https://jira.whamcloud.com/browse/LU-12328
Lustre-commit: 39da3c06275e ("LU-12328 flr: avoid reading unhealthy mirror")
Signed-off-by: Bobi Jam <bobijam@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34952
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/cl_object.h |  8 +++++++-
 fs/lustre/lov/lov_io.c        | 25 ++++++++++++++++---------
 fs/lustre/osc/osc_io.c        | 16 +++++++++++++++-
 3 files changed, 38 insertions(+), 11 deletions(-)

diff --git a/fs/lustre/include/cl_object.h b/fs/lustre/include/cl_object.h
index 75ece62..c3376a4 100644
--- a/fs/lustre/include/cl_object.h
+++ b/fs/lustre/include/cl_object.h
@@ -1906,7 +1906,13 @@ struct cl_io {
 	/**
 	 * Set if IO is triggered by async workqueue readahead.
 	 */
-				ci_async_readahead:1;
+				ci_async_readahead:1,
+	/**
+	 * Set if we've tried all mirrors for this read IO, if it's not set,
+	 * the read IO will check to-be-read OSCs' status, and make fast-switch
+	 * another mirror if some of the OSTs are not healthy.
+	 */
+				ci_tried_all_mirrors:1;
 	/**
 	 * How many times the read has retried before this one.
 	 * Set by the top level and consumed by the LOV.
diff --git a/fs/lustre/lov/lov_io.c b/fs/lustre/lov/lov_io.c
index 56e4a982..971f9ba 100644
--- a/fs/lustre/lov/lov_io.c
+++ b/fs/lustre/lov/lov_io.c
@@ -140,6 +140,7 @@ static int lov_io_sub_init(const struct lu_env *env, struct lov_io *lio,
 	sub_io->ci_lock_no_expand = io->ci_lock_no_expand;
 	sub_io->ci_ndelay = io->ci_ndelay;
 	sub_io->ci_layout_version = io->ci_layout_version;
+	sub_io->ci_tried_all_mirrors = io->ci_tried_all_mirrors;
 
 	rc = cl_io_sub_init(sub->sub_env, sub_io, io->ci_type, sub_obj);
 	if (rc < 0)
@@ -395,13 +396,13 @@ static int lov_io_mirror_init(struct lov_io *lio, struct lov_object *obj,
 				found = true;
 				break;
 			}
-		}
-
+		} /* each component of the mirror */
 		if (found) {
 			index = (index + i) % comp->lo_mirror_count;
 			break;
 		}
-	}
+	} /* each mirror */
+
 	if (i == comp->lo_mirror_count) {
 		CERROR(DFID ": failed to find a component covering I/O region at %llu\n",
 		       PFID(lu_object_fid(lov2lu(obj))), lio->lis_pos);
@@ -423,16 +424,21 @@ static int lov_io_mirror_init(struct lov_io *lio, struct lov_object *obj,
 	 * of this client has been partitioned. We should relinquish CPU for
 	 * a while before trying again.
 	 */
-	++io->ci_ndelay_tried;
-	if (io->ci_ndelay && io->ci_ndelay_tried >= comp->lo_mirror_count) {
-		set_current_state(TASK_INTERRUPTIBLE);
-		schedule_timeout(msecs_to_jiffies(MSEC_PER_SEC)); /* 10ms */
+	if (io->ci_ndelay && io->ci_ndelay_tried > 0 &&
+	    (io->ci_ndelay_tried % comp->lo_mirror_count == 0)) {
+		schedule_timeout_interruptible(HZ / 100 + 1); /* 10ms */
 		if (signal_pending(current))
 			return -EINTR;
 
-		/* reset retry counter */
-		io->ci_ndelay_tried = 1;
+		/**
+		 * we'd set ci_tried_all_mirrors to turn off fast mirror
+		 * switching for read after we've tried all mirrors several
+		 * rounds.
+		 */
+		io->ci_tried_all_mirrors = io->ci_ndelay_tried %
+					   (comp->lo_mirror_count * 4) == 0;
 	}
+	++io->ci_ndelay_tried;
 
 	CDEBUG(D_VFSTRACE, "use %sdelayed RPC state for this IO\n",
 	       io->ci_ndelay ? "non-" : "");
@@ -668,6 +674,7 @@ static void lov_io_sub_inherit(struct lov_io_sub *sub, struct lov_io *lio,
 	case CIT_READ:
 	case CIT_WRITE: {
 		io->u.ci_wr.wr_sync = cl_io_is_sync_write(parent);
+		io->ci_tried_all_mirrors = parent->ci_tried_all_mirrors;
 		if (cl_io_is_append(parent)) {
 			io->u.ci_wr.wr_append = 1;
 		} else {
diff --git a/fs/lustre/osc/osc_io.c b/fs/lustre/osc/osc_io.c
index f340266..1ff2df2 100644
--- a/fs/lustre/osc/osc_io.c
+++ b/fs/lustre/osc/osc_io.c
@@ -368,6 +368,13 @@ int osc_io_commit_async(const struct lu_env *env,
 }
 EXPORT_SYMBOL(osc_io_commit_async);
 
+static bool osc_import_not_healthy(struct obd_import *imp)
+{
+	return imp->imp_invalid || imp->imp_deactive ||
+	       !(imp->imp_state == LUSTRE_IMP_FULL ||
+		 imp->imp_state == LUSTRE_IMP_IDLE);
+}
+
 int osc_io_iter_init(const struct lu_env *env, const struct cl_io_slice *ios)
 {
 	struct osc_object *osc = cl2osc(ios->cis_obj);
@@ -376,7 +383,14 @@ int osc_io_iter_init(const struct lu_env *env, const struct cl_io_slice *ios)
 	int rc = -EIO;
 
 	spin_lock(&imp->imp_lock);
-	if (likely(!imp->imp_invalid)) {
+	/**
+	 * check whether this OSC device is available for non-delay read,
+	 * fast switching mirror if we haven't tried all mirrors.
+	 */
+	if (ios->cis_io->ci_type == CIT_READ && ios->cis_io->ci_ndelay &&
+	    !ios->cis_io->ci_tried_all_mirrors && osc_import_not_healthy(imp)) {
+		rc = -EWOULDBLOCK;
+	} else if (likely(!imp->imp_invalid)) {
 		atomic_inc(&osc->oo_nr_ios);
 		oio->oi_is_active = 1;
 		rc = 0;
-- 
1.8.3.1


* [lustre-devel] [PATCH 505/622] lustre: obdclass: lu_tgt_descs cleanup
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

This patch cleans up the lu_tgt_descs code so that it is easier
to add MDT object QoS allocation support:
* rename struct ost_pool to lu_tgt_pool.
* put struct lu_qos, lmv_desc/lov_desc and lu_tgt_pool into struct
  lu_tgt_descs because it's more natural to manage these data there
  and fewer arguments are needed to pass around in related functions.
* remove lu_tgt_descs.ltd_tgtnr and use
  lu_tgt_descs.ltd_lov_desc.ld_tgt_count instead, because the two
  fields duplicate each other.
* other cleanups.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12624
Lustre-commit: 45222b2ef279 ("LU-12624 obdclass: lu_tgt_descs cleanup")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35824
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lu_object.h     |  81 +++---
 fs/lustre/include/obd.h           |   7 +-
 fs/lustre/lmv/lmv_fld.c           |   6 +-
 fs/lustre/lmv/lmv_internal.h      |   2 +-
 fs/lustre/lmv/lmv_obd.c           | 118 ++++-----
 fs/lustre/lmv/lproc_lmv.c         |  19 +-
 fs/lustre/lov/lov_internal.h      |  14 +-
 fs/lustre/lov/lov_pool.c          |  10 +-
 fs/lustre/obdclass/Makefile       |   2 +-
 fs/lustre/obdclass/lu_qos.c       | 512 --------------------------------------
 fs/lustre/obdclass/lu_tgt_descs.c | 509 ++++++++++++++++++++++++++++++++++++-
 11 files changed, 618 insertions(+), 662 deletions(-)
 delete mode 100644 fs/lustre/obdclass/lu_qos.c

diff --git a/fs/lustre/include/lu_object.h b/fs/lustre/include/lu_object.h
index eaf20ea..e92f12f 100644
--- a/fs/lustre/include/lu_object.h
+++ b/fs/lustre/include/lu_object.h
@@ -1322,14 +1322,14 @@ struct lu_kmem_descr {
 extern u32 lu_context_tags_default;
 extern u32 lu_session_tags_default;
 
-/* Generic subset of OSTs */
-struct ost_pool {
+/* Generic subset of tgts */
+struct lu_tgt_pool {
 	u32		   *op_array;	/* array of index of
 					 * lov_obd->lov_tgts
 					 */
-	unsigned int	    op_count;	/* number of OSTs in the array */
-	unsigned int	    op_size;	/* allocated size of lp_array */
-	struct rw_semaphore op_rw_sem;	/* to protect ost_pool use */
+	unsigned int	    op_count;	/* number of tgts in the array */
+	unsigned int	    op_size;	/* allocated size of op_array */
+	struct rw_semaphore op_rw_sem;	/* to protect lu_tgt_pool use */
 };
 
 /* round-robin QoS data for LOD/LMV */
@@ -1338,7 +1338,7 @@ struct lu_qos_rr {
 	u32			 lqr_start_idx;	/* start index of new inode */
 	u32			 lqr_offset_idx;/* aliasing for start_idx */
 	int			 lqr_start_count;/* reseed counter */
-	struct ost_pool		 lqr_pool;	/* round-robin optimized list */
+	struct lu_tgt_pool	 lqr_pool;	/* round-robin optimized list */
 	unsigned long		 lqr_dirty:1;	/* recalc round-robin list */
 };
 
@@ -1401,13 +1401,30 @@ struct lu_tgt_desc_idx {
 	struct lu_tgt_desc *ldi_tgt[TGT_PTRS_PER_BLOCK];
 };
 
+/* QoS data for LOD/LMV */
+struct lu_qos {
+	struct list_head	 lq_svr_list;	 /* lu_svr_qos list */
+	struct rw_semaphore	 lq_rw_sem;
+	u32			 lq_active_svr_count;
+	unsigned int		 lq_prio_free;	 /* priority for free space */
+	unsigned int		 lq_threshold_rr;/* priority for rr */
+	struct lu_qos_rr	 lq_rr;		 /* round robin qos data */
+	unsigned long		 lq_dirty:1,	 /* recalc qos data */
+				 lq_same_space:1,/* the servers all have approx.
+						  * the same space avail
+						  */
+				 lq_reset:1;	 /* zero current penalties */
+};
+
 struct lu_tgt_descs {
+	union {
+		struct lov_desc		ltd_lov_desc;
+		struct lmv_desc		ltd_lmv_desc;
+	};
 	/* list of known TGTs */
 	struct lu_tgt_desc_idx	*ltd_tgt_idx[TGT_PTRS];
 	/* Size of the lu_tgts array, guaranteed to be a power of 2 */
 	u32			ltd_tgts_size;
-	/* number of registered TGTs */
-	u32			ltd_tgtnr;
 	/* bitmap of TGTs available */
 	unsigned long		*ltd_tgt_bitmap;
 	/* TGTs scheduled to be deleted */
@@ -1418,43 +1435,31 @@ struct lu_tgt_descs {
 	struct mutex		ltd_mutex;
 	/* read/write semaphore used for array relocation */
 	struct rw_semaphore	ltd_rw_sem;
+	/* QoS */
+	struct lu_qos		ltd_qos;
+	/* all tgts in a packed array */
+	struct lu_tgt_pool	ltd_tgt_pool;
+	/* true if tgt is MDT */
+	bool			ltd_is_mdt;
 };
 
 #define LTD_TGT(ltd, index)						\
-	((ltd)->ltd_tgt_idx[(index) / TGT_PTRS_PER_BLOCK]		\
-				->ldi_tgt[(index) % TGT_PTRS_PER_BLOCK])
+	 (ltd)->ltd_tgt_idx[(index) / TGT_PTRS_PER_BLOCK]		\
+			->ldi_tgt[(index) % TGT_PTRS_PER_BLOCK]
 
-/* QoS data for LOD/LMV */
-struct lu_qos {
-	struct list_head	 lq_svr_list;	/* lu_svr_qos list */
-	struct rw_semaphore	 lq_rw_sem;
-	u32			 lq_active_svr_count;
-	unsigned int		 lq_prio_free;   /* priority for free space */
-	unsigned int		 lq_threshold_rr;/* priority for rr */
-	struct lu_qos_rr	 lq_rr;          /* round robin qos data */
-	unsigned long		 lq_dirty:1,     /* recalc qos data */
-				 lq_same_space:1,/* the servers all have approx.
-						  * the same space avail
-						  */
-				 lq_reset:1;     /* zero current penalties */
-};
-
-void lu_qos_rr_init(struct lu_qos_rr *lqr);
-int lqos_add_tgt(struct lu_qos *qos, struct lu_tgt_desc *ltd);
-int lqos_del_tgt(struct lu_qos *qos, struct lu_tgt_desc *ltd);
-bool lqos_is_usable(struct lu_qos *qos, u32 active_tgt_nr);
-int lqos_calc_penalties(struct lu_qos *qos, struct lu_tgt_descs *ltd,
-			u32 active_tgt_nr, u32 maxage, bool is_mdt);
-void lqos_calc_weight(struct lu_tgt_desc *tgt);
-int lqos_recalc_weight(struct lu_qos *qos, struct lu_tgt_descs *ltd,
-		       struct lu_tgt_desc *tgt, u32 active_tgt_nr,
-		       u64 *total_wt);
 u64 lu_prandom_u64_max(u64 ep_ro);
+void lu_qos_rr_init(struct lu_qos_rr *lqr);
+int lu_qos_add_tgt(struct lu_qos *qos, struct lu_tgt_desc *ltd);
+void lu_tgt_qos_weight_calc(struct lu_tgt_desc *tgt);
 
-int lu_tgt_descs_init(struct lu_tgt_descs *ltd);
+int lu_tgt_descs_init(struct lu_tgt_descs *ltd, bool is_mdt);
 void lu_tgt_descs_fini(struct lu_tgt_descs *ltd);
-int lu_tgt_descs_add(struct lu_tgt_descs *ltd, struct lu_tgt_desc *tgt);
-void lu_tgt_descs_del(struct lu_tgt_descs *ltd, struct lu_tgt_desc *tgt);
+int ltd_add_tgt(struct lu_tgt_descs *ltd, struct lu_tgt_desc *tgt);
+void ltd_del_tgt(struct lu_tgt_descs *ltd, struct lu_tgt_desc *tgt);
+bool ltd_qos_is_usable(struct lu_tgt_descs *ltd);
+int ltd_qos_penalties_calc(struct lu_tgt_descs *ltd);
+int ltd_qos_update(struct lu_tgt_descs *ltd, struct lu_tgt_desc *tgt,
+		   u64 *total_wt);
 
 static inline struct lu_tgt_desc *ltd_first_tgt(struct lu_tgt_descs *ltd)
 {
diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index 41431f9..4ba70c7 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -394,7 +394,7 @@ struct lov_md_tgt_desc {
 struct lov_obd {
 	struct lov_desc		desc;
 	struct lov_tgt_desc   **lov_tgts;	/* sparse array */
-	struct ost_pool		lov_packed;	/* all OSTs in a packed array */
+	struct lu_tgt_pool	lov_packed;	/* all OSTs in a packed array */
 	struct mutex		lov_lock;
 	struct obd_connect_data lov_ocd;
 	atomic_t		lov_refcount;
@@ -422,7 +422,6 @@ struct lov_obd {
 struct lmv_obd {
 	struct lu_client_fld	lmv_fld;
 	spinlock_t		lmv_lock;
-	struct lmv_desc		desc;
 
 	int			connected;
 	int			max_easize;
@@ -435,10 +434,12 @@ struct lmv_obd {
 	struct kobject		*lmv_tgts_kobj;
 	void			*lmv_cache;
 
-	struct lu_qos		lmv_qos;
 	u32			lmv_qos_rr_index;
 };
 
+#define lmv_mdt_count	lmv_mdt_descs.ltd_lmv_desc.ld_tgt_count
+#define lmv_qos		lmv_mdt_descs.ltd_qos
+
 struct niobuf_local {
 	u64			lnb_file_offset;
 	u32			lnb_page_offset;
diff --git a/fs/lustre/lmv/lmv_fld.c b/fs/lustre/lmv/lmv_fld.c
index ef2c866..ea1ef72 100644
--- a/fs/lustre/lmv/lmv_fld.c
+++ b/fs/lustre/lmv/lmv_fld.c
@@ -75,11 +75,11 @@ int lmv_fld_lookup(struct lmv_obd *lmv, const struct lu_fid *fid, u32 *mds)
 	CDEBUG(D_INODE, "FLD lookup got mds #%x for fid=" DFID "\n",
 	       *mds, PFID(fid));
 
-	if (*mds >= lmv->desc.ld_tgt_count) {
+	if (*mds >= lmv->lmv_mdt_descs.ltd_tgts_size) {
 		rc = -EINVAL;
 		CERROR("%s: FLD lookup got invalid mds #%x (max: %x) for fid=" DFID ": rc = %d\n",
-		       obd->obd_name, *mds, lmv->desc.ld_tgt_count, PFID(fid),
-		       rc);
+		       obd->obd_name, *mds, lmv->lmv_mdt_descs.ltd_tgts_size,
+		       PFID(fid), rc);
 	}
 	return rc;
 }
diff --git a/fs/lustre/lmv/lmv_internal.h b/fs/lustre/lmv/lmv_internal.h
index d95fa3f..70d86676 100644
--- a/fs/lustre/lmv/lmv_internal.h
+++ b/fs/lustre/lmv/lmv_internal.h
@@ -122,7 +122,7 @@ struct lu_tgt_desc *lmv_next_connected_tgt(struct lmv_obd *lmv,
 	u32 mdt_idx;
 	int rc;
 
-	if (lmv->desc.ld_tgt_count < 2)
+	if (lmv->lmv_mdt_count < 2)
 		return 0;
 
 	rc = lmv_fld_lookup(lmv, fid, &mdt_idx);
diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index 2959b18..84be905 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -64,7 +64,8 @@ void lmv_activate_target(struct lmv_obd *lmv, struct lmv_tgt_desc *tgt,
 		return;
 
 	tgt->ltd_active = activate;
-	lmv->desc.ld_active_tgt_count += (activate ? 1 : -1);
+	lmv->lmv_mdt_descs.ltd_lmv_desc.ld_active_tgt_count +=
+		(activate ? 1 : -1);
 	tgt->ltd_exp->exp_obd->obd_inactive = !activate;
 }
 
@@ -330,11 +331,11 @@ static int lmv_connect_mdc(struct obd_device *obd, struct lmv_tgt_desc *tgt)
 
 	tgt->ltd_active = 1;
 	tgt->ltd_exp = mdc_exp;
-	lmv->desc.ld_active_tgt_count++;
+	lmv->lmv_mdt_descs.ltd_lmv_desc.ld_active_tgt_count++;
 
 	md_init_ea_size(tgt->ltd_exp, lmv->max_easize, lmv->max_def_easize);
 
-	rc = lqos_add_tgt(&lmv->lmv_qos, tgt);
+	rc = lu_qos_add_tgt(&lmv->lmv_qos, tgt);
 	if (rc) {
 		obd_disconnect(mdc_exp);
 		return rc;
@@ -357,8 +358,7 @@ static int lmv_connect_mdc(struct obd_device *obd, struct lmv_tgt_desc *tgt)
 static void lmv_del_target(struct lmv_obd *lmv, struct lu_tgt_desc *tgt)
 {
 	LASSERT(tgt);
-	lqos_del_tgt(&lmv->lmv_qos, tgt);
-	lu_tgt_descs_del(&lmv->lmv_mdt_descs, tgt);
+	ltd_del_tgt(&lmv->lmv_mdt_descs, tgt);
 	kfree(tgt);
 }
 
@@ -369,7 +369,6 @@ static int lmv_add_target(struct obd_device *obd, struct obd_uuid *uuidp,
 	struct obd_device *mdc_obd;
 	struct lmv_tgt_desc *tgt;
 	struct lu_tgt_descs *ltd = &lmv->lmv_mdt_descs;
-	int orig_tgt_count = 0;
 	int rc = 0;
 
 	CDEBUG(D_CONFIG, "Target uuid: %s. index %d\n", uuidp->uuid, index);
@@ -392,11 +391,7 @@ static int lmv_add_target(struct obd_device *obd, struct obd_uuid *uuidp,
 	tgt->ltd_active = 0;
 
 	mutex_lock(&ltd->ltd_mutex);
-	rc = lu_tgt_descs_add(ltd, tgt);
-	if (!rc && index >= lmv->desc.ld_tgt_count) {
-		orig_tgt_count = lmv->desc.ld_tgt_count;
-		lmv->desc.ld_tgt_count = index + 1;
-	}
+	rc = ltd_add_tgt(ltd, tgt);
 	mutex_unlock(&ltd->ltd_mutex);
 
 	if (rc)
@@ -407,14 +402,10 @@ static int lmv_add_target(struct obd_device *obd, struct obd_uuid *uuidp,
 		return rc;
 
 	rc = lmv_connect_mdc(obd, tgt);
-	if (rc) {
-		mutex_lock(&ltd->ltd_mutex);
-		lmv->desc.ld_tgt_count = orig_tgt_count;
-		memset(tgt, 0, sizeof(*tgt));
-		mutex_unlock(&ltd->ltd_mutex);
-	} else {
+	if (!rc) {
 		int easize = sizeof(struct lmv_stripe_md) +
-			     lmv->desc.ld_tgt_count * sizeof(struct lu_fid);
+			     lmv->lmv_mdt_count * sizeof(struct lu_fid);
+
 		lmv_init_ea_size(obd->obd_self_export, easize, 0);
 	}
 
@@ -441,7 +432,7 @@ static int lmv_check_connect(struct obd_device *obd)
 		goto unlock;
 	}
 
-	if (lmv->desc.ld_tgt_count == 0) {
+	if (!lmv->lmv_mdt_count) {
 		CERROR("%s: no targets configured: rc = -EINVAL\n",
 		       obd->obd_name);
 		rc = -EINVAL;
@@ -465,7 +456,7 @@ static int lmv_check_connect(struct obd_device *obd)
 	}
 
 	lmv->connected = 1;
-	easize = lmv_mds_md_size(lmv->desc.ld_tgt_count, LMV_MAGIC);
+	easize = lmv_mds_md_size(lmv->lmv_mdt_count, LMV_MAGIC);
 	lmv_init_ea_size(obd->obd_self_export, easize, 0);
 unlock:
 	mutex_unlock(&lmv->lmv_mdt_descs.ltd_mutex);
@@ -478,7 +469,7 @@ static int lmv_check_connect(struct obd_device *obd)
 		if (!tgt->ltd_exp)
 			continue;
 
-		--lmv->desc.ld_active_tgt_count;
+		--lmv->lmv_mdt_descs.ltd_lmv_desc.ld_active_tgt_count;
 		obd_disconnect(tgt->ltd_exp);
 	}
 
@@ -810,7 +801,7 @@ static int lmv_iocontrol(unsigned int cmd, struct obd_export *exp,
 	struct lmv_obd *lmv = &obddev->u.lmv;
 	struct lu_tgt_desc *tgt = NULL;
 	int set = 0;
-	u32 count = lmv->desc.ld_tgt_count;
+	u32 count = lmv->lmv_mdt_count;
 	int rc = 0;
 
 	if (count == 0)
@@ -824,7 +815,8 @@ static int lmv_iocontrol(unsigned int cmd, struct obd_export *exp,
 		u32 index;
 
 		memcpy(&index, data->ioc_inlbuf2, sizeof(u32));
-		if (index >= count)
+
+		if (index >= lmv->lmv_mdt_descs.ltd_tgts_size)
 			return -ENODEV;
 
 		tgt = lmv_tgt(lmv, index);
@@ -857,12 +849,7 @@ static int lmv_iocontrol(unsigned int cmd, struct obd_export *exp,
 		struct obd_quotactl *oqctl;
 
 		if (qctl->qc_valid == QC_MDTIDX) {
-			if (count <= qctl->qc_idx)
-				return -EINVAL;
-
 			tgt = lmv_tgt(lmv, qctl->qc_idx);
-			if (!tgt || !tgt->ltd_exp)
-				return -EINVAL;
 		} else if (qctl->qc_valid == QC_UUID) {
 			lmv_foreach_tgt(lmv, tgt) {
 				if (!obd_uuid_equals(&tgt->ltd_uuid,
@@ -878,10 +865,9 @@ static int lmv_iocontrol(unsigned int cmd, struct obd_export *exp,
 			return -EINVAL;
 		}
 
-		if (tgt->ltd_index >= count)
-			return -EAGAIN;
+		if (!tgt || !tgt->ltd_exp)
+			return -EINVAL;
 
-		LASSERT(tgt && tgt->ltd_exp);
 		oqctl = kzalloc(sizeof(*oqctl), GFP_KERNEL);
 		if (!oqctl)
 			return -ENOMEM;
@@ -1069,7 +1055,7 @@ static u32 lmv_placement_policy(struct obd_device *obd,
 	struct lmv_user_md *lum;
 	u32 mdt;
 
-	if (lmv->desc.ld_tgt_count == 1)
+	if (lmv->lmv_mdt_count == 1)
 		return 0;
 
 	lum = op_data->op_data;
@@ -1182,27 +1168,17 @@ static int lmv_setup(struct obd_device *obd, struct lustre_cfg *lcfg)
 		return -EINVAL;
 	}
 
-	obd_str2uuid(&lmv->desc.ld_uuid, desc->ld_uuid.uuid);
-	lmv->desc.ld_tgt_count = 0;
-	lmv->desc.ld_active_tgt_count = 0;
-	lmv->desc.ld_qos_maxage = LMV_DESC_QOS_MAXAGE_DEFAULT;
+	obd_str2uuid(&lmv->lmv_mdt_descs.ltd_lmv_desc.ld_uuid,
+		     desc->ld_uuid.uuid);
+	lmv->lmv_mdt_descs.ltd_lmv_desc.ld_tgt_count = 0;
+	lmv->lmv_mdt_descs.ltd_lmv_desc.ld_active_tgt_count = 0;
+	lmv->lmv_mdt_descs.ltd_lmv_desc.ld_qos_maxage =
+		LMV_DESC_QOS_MAXAGE_DEFAULT;
 	lmv->max_def_easize = 0;
 	lmv->max_easize = 0;
 
 	spin_lock_init(&lmv->lmv_lock);
 
-	/* Set up allocation policy (QoS and RR) */
-	INIT_LIST_HEAD(&lmv->lmv_qos.lq_svr_list);
-	init_rwsem(&lmv->lmv_qos.lq_rw_sem);
-	lmv->lmv_qos.lq_dirty = 1;
-	lmv->lmv_qos.lq_reset = 1;
-	/* Default priority is toward free space balance */
-	lmv->lmv_qos.lq_prio_free = 232;
-	/* Default threshold for rr (roughly 17%) */
-	lmv->lmv_qos.lq_threshold_rr = 43;
-
-	lu_qos_rr_init(&lmv->lmv_qos.lq_rr);
-
 	/*
 	 * initialize rr_index to lower 32bit of netid, so that client
 	 * can distribute subdirs evenly from the beginning.
@@ -1224,7 +1200,7 @@ static int lmv_setup(struct obd_device *obd, struct lustre_cfg *lcfg)
 	if (rc)
 		CERROR("Can't init FLD, err %d\n", rc);
 
-	rc = lu_tgt_descs_init(&lmv->lmv_mdt_descs);
+	rc = lu_tgt_descs_init(&lmv->lmv_mdt_descs, true);
 	if (rc)
 		CWARN("%s: error initialize target table: rc = %d\n",
 		      obd->obd_name, rc);
@@ -1292,7 +1268,7 @@ static int lmv_select_statfs_mdt(struct lmv_obd *lmv, u32 flags)
 	if (flags & OBD_STATFS_FOR_MDT0)
 		return 0;
 
-	if (lmv->lmv_statfs_start || lmv->desc.ld_tgt_count == 1)
+	if (lmv->lmv_statfs_start || lmv->lmv_mdt_count == 1)
 		return lmv->lmv_statfs_start;
 
 	/* choose initial MDT for this client */
@@ -1306,8 +1282,8 @@ static int lmv_select_statfs_mdt(struct lmv_obd *lmv, u32 flags)
 			/* We dont need a full 64-bit modulus, just enough
 			 * to distribute the requests across MDTs evenly.
 			 */
-			lmv->lmv_statfs_start =
-				(u32)lnet_id.nid % lmv->desc.ld_tgt_count;
+			lmv->lmv_statfs_start = (u32)lnet_id.nid %
+						lmv->lmv_mdt_count;
 			break;
 		}
 	}
@@ -1333,8 +1309,8 @@ static int lmv_statfs(const struct lu_env *env, struct obd_export *exp,
 	/* distribute statfs among MDTs */
 	idx = lmv_select_statfs_mdt(lmv, flags);
 
-	for (i = 0; i < lmv->desc.ld_tgt_count; i++, idx++) {
-		idx = idx % lmv->desc.ld_tgt_count;
+	for (i = 0; i < lmv->lmv_mdt_descs.ltd_tgts_size; i++, idx++) {
+		idx = idx % lmv->lmv_mdt_descs.ltd_tgts_size;
 		tgt = lmv_tgt(lmv, idx);
 		if (!tgt || !tgt->ltd_exp)
 			continue;
@@ -1410,7 +1386,7 @@ int lmv_statfs_check_update(struct obd_device *obd, struct lmv_tgt_desc *tgt)
 	int rc;
 
 	if (ktime_get_seconds() - tgt->ltd_statfs_age <
-	    obd->u.lmv.desc.ld_qos_maxage)
+	    obd->u.lmv.lmv_mdt_descs.ltd_lmv_desc.ld_qos_maxage)
 		return 0;
 
 	rc = obd_statfs_async(tgt->ltd_exp, &oinfo, 0, NULL);
@@ -1526,19 +1502,17 @@ static struct lu_tgt_desc *lmv_locate_tgt_qos(struct lmv_obd *lmv, u32 *mdt)
 	u64 rand;
 	int rc;
 
-	if (!lqos_is_usable(&lmv->lmv_qos, lmv->desc.ld_active_tgt_count))
+	if (!ltd_qos_is_usable(&lmv->lmv_mdt_descs))
 		return ERR_PTR(-EAGAIN);
 
 	down_write(&lmv->lmv_qos.lq_rw_sem);
 
-	if (!lqos_is_usable(&lmv->lmv_qos, lmv->desc.ld_active_tgt_count)) {
+	if (!ltd_qos_is_usable(&lmv->lmv_mdt_descs)) {
 		tgt = ERR_PTR(-EAGAIN);
 		goto unlock;
 	}
 
-	rc = lqos_calc_penalties(&lmv->lmv_qos, &lmv->lmv_mdt_descs,
-				 lmv->desc.ld_active_tgt_count,
-				 lmv->desc.ld_qos_maxage, true);
+	rc = ltd_qos_penalties_calc(&lmv->lmv_mdt_descs);
 	if (rc) {
 		tgt = ERR_PTR(rc);
 		goto unlock;
@@ -1550,7 +1524,7 @@ static struct lu_tgt_desc *lmv_locate_tgt_qos(struct lmv_obd *lmv, u32 *mdt)
 			continue;
 
 		tgt->ltd_qos.ltq_usable = 1;
-		lqos_calc_weight(tgt);
+		lu_tgt_qos_weight_calc(tgt);
 		total_weight += tgt->ltd_qos.ltq_weight;
 	}
 
@@ -1565,9 +1539,7 @@ static struct lu_tgt_desc *lmv_locate_tgt_qos(struct lmv_obd *lmv, u32 *mdt)
 			continue;
 
 		*mdt = tgt->ltd_index;
-		lqos_recalc_weight(&lmv->lmv_qos, &lmv->lmv_mdt_descs, tgt,
-				   lmv->desc.ld_active_tgt_count,
-				   &total_weight);
+		ltd_qos_update(&lmv->lmv_mdt_descs, tgt, &total_weight);
 		rc = 0;
 		goto unlock;
 	}
@@ -1588,14 +1560,16 @@ static struct lu_tgt_desc *lmv_locate_tgt_rr(struct lmv_obd *lmv, u32 *mdt)
 	int index;
 
 	spin_lock(&lmv->lmv_qos.lq_rr.lqr_alloc);
-	for (i = 0; i < lmv->desc.ld_tgt_count; i++) {
-		index = (i + lmv->lmv_qos_rr_index) % lmv->desc.ld_tgt_count;
+	for (i = 0; i < lmv->lmv_mdt_descs.ltd_tgts_size; i++) {
+		index = (i + lmv->lmv_qos_rr_index) %
+			lmv->lmv_mdt_descs.ltd_tgts_size;
 		tgt = lmv_tgt(lmv, index);
 		if (!tgt || !tgt->ltd_exp || !tgt->ltd_active)
 			continue;
 
 		*mdt = tgt->ltd_index;
-		lmv->lmv_qos_rr_index = (*mdt + 1) % lmv->desc.ld_tgt_count;
+		lmv->lmv_qos_rr_index = (*mdt + 1) %
+					lmv->lmv_mdt_descs.ltd_tgts_size;
 		spin_unlock(&lmv->lmv_qos.lq_rr.lqr_alloc);
 
 		return tgt;
@@ -1791,7 +1765,7 @@ int lmv_create(struct obd_export *exp, struct md_op_data *op_data,
 	struct lmv_tgt_desc *tgt;
 	int rc;
 
-	if (!lmv->desc.ld_active_tgt_count)
+	if (!lmv->lmv_mdt_descs.ltd_lmv_desc.ld_active_tgt_count)
 		return -EIO;
 
 	if (lmv_dir_bad_hash(op_data->op_mea1))
@@ -2903,7 +2877,7 @@ static int lmv_get_info(const struct lu_env *env, struct obd_export *exp,
 			exp->exp_connect_data = *(struct obd_connect_data *)val;
 		return rc;
 	} else if (KEY_IS(KEY_TGT_COUNT)) {
-		*((int *)val) = lmv->desc.ld_tgt_count;
+		*((int *)val) = lmv->lmv_mdt_descs.ltd_tgts_size;
 		return 0;
 	}
 
@@ -2917,7 +2891,7 @@ static int lmv_rmfid(struct obd_export *exp, struct fid_array *fa,
 	struct obd_device *obddev = class_exp2obd(exp);
 	struct ptlrpc_request_set *set = _set;
 	struct lmv_obd *lmv = &obddev->u.lmv;
-	int tgt_count = lmv->desc.ld_tgt_count;
+	int tgt_count = lmv->lmv_mdt_count;
 	struct lu_tgt_desc *tgt;
 	struct fid_array *fat, **fas = NULL;
 	int i, rc, **rcs = NULL;
@@ -3303,8 +3277,8 @@ static enum ldlm_mode lmv_lock_match(struct obd_export *exp, u64 flags,
 	 * since this can be easily found, and only try others if that fails.
 	 */
 	for (i = 0, index = lmv_fid2tgt_index(lmv, fid);
-	     i < lmv->desc.ld_tgt_count;
-	     i++, index = (index + 1) % lmv->desc.ld_tgt_count) {
+	     i < lmv->lmv_mdt_descs.ltd_tgts_size;
+	     i++, index = (index + 1) % lmv->lmv_mdt_descs.ltd_tgts_size) {
 		if (index < 0) {
 			CDEBUG(D_HA, "%s: " DFID " is inaccessible: rc = %d\n",
 			       obd->obd_name, PFID(fid), index);
diff --git a/fs/lustre/lmv/lproc_lmv.c b/fs/lustre/lmv/lproc_lmv.c
index af670f8..79e27b3 100644
--- a/fs/lustre/lmv/lproc_lmv.c
+++ b/fs/lustre/lmv/lproc_lmv.c
@@ -45,10 +45,8 @@ static ssize_t numobd_show(struct kobject *kobj, struct attribute *attr,
 {
 	struct obd_device *dev = container_of(kobj, struct obd_device,
 					      obd_kset.kobj);
-	struct lmv_desc *desc;
 
-	desc = &dev->u.lmv.desc;
-	return sprintf(buf, "%u\n", desc->ld_tgt_count);
+	return sprintf(buf, "%u\n", dev->u.lmv.lmv_mdt_count);
 }
 LUSTRE_RO_ATTR(numobd);
 
@@ -57,10 +55,9 @@ static ssize_t activeobd_show(struct kobject *kobj, struct attribute *attr,
 {
 	struct obd_device *dev = container_of(kobj, struct obd_device,
 					      obd_kset.kobj);
-	struct lmv_desc *desc;
 
-	desc = &dev->u.lmv.desc;
-	return sprintf(buf, "%u\n", desc->ld_active_tgt_count);
+	return sprintf(buf, "%u\n",
+		     dev->u.lmv.lmv_mdt_descs.ltd_lmv_desc.ld_active_tgt_count);
 }
 LUSTRE_RO_ATTR(activeobd);
 
@@ -69,10 +66,9 @@ static ssize_t desc_uuid_show(struct kobject *kobj, struct attribute *attr,
 {
 	struct obd_device *dev = container_of(kobj, struct obd_device,
 					      obd_kset.kobj);
-	struct lmv_desc *desc;
 
-	desc = &dev->u.lmv.desc;
-	return sprintf(buf, "%s\n", desc->ld_uuid.uuid);
+	return sprintf(buf, "%s\n",
+		       dev->u.lmv.lmv_mdt_descs.ltd_lmv_desc.ld_uuid.uuid);
 }
 LUSTRE_RO_ATTR(desc_uuid);
 
@@ -83,7 +79,8 @@ static ssize_t qos_maxage_show(struct kobject *kobj,
 	struct obd_device *dev = container_of(kobj, struct obd_device,
 					      obd_kset.kobj);
 
-	return sprintf(buf, "%u\n", dev->u.lmv.desc.ld_qos_maxage);
+	return sprintf(buf, "%u\n",
+		       dev->u.lmv.lmv_mdt_descs.ltd_lmv_desc.ld_qos_maxage);
 }
 
 static ssize_t qos_maxage_store(struct kobject *kobj,
@@ -100,7 +97,7 @@ static ssize_t qos_maxage_store(struct kobject *kobj,
 	if (rc)
 		return rc;
 
-	dev->u.lmv.desc.ld_qos_maxage = val;
+	dev->u.lmv.lmv_mdt_descs.ltd_lmv_desc.ld_qos_maxage = val;
 
 	return count;
 }
diff --git a/fs/lustre/lov/lov_internal.h b/fs/lustre/lov/lov_internal.h
index d235abe..3725d1e 100644
--- a/fs/lustre/lov/lov_internal.h
+++ b/fs/lustre/lov/lov_internal.h
@@ -221,7 +221,7 @@ struct lsm_operations {
 
 struct pool_desc {
 	char			 pool_name[LOV_MAXPOOLNAME + 1];
-	struct ost_pool		 pool_obds;
+	struct lu_tgt_pool	 pool_obds;
 	atomic_t		 pool_refcount;
 	struct rhash_head	 pool_hash;		/* access by poolname */
 	union {
@@ -322,12 +322,12 @@ struct lov_stripe_md *lov_unpackmd(struct lov_obd *lov, void *buf,
 
 #define LOV_MDC_TGT_MAX 256
 
-/* ost_pool methods */
-int lov_ost_pool_init(struct ost_pool *op, unsigned int count);
-int lov_ost_pool_extend(struct ost_pool *op, unsigned int min_count);
-int lov_ost_pool_add(struct ost_pool *op, u32 idx, unsigned int min_count);
-int lov_ost_pool_remove(struct ost_pool *op, u32 idx);
-int lov_ost_pool_free(struct ost_pool *op);
+/* lu_tgt_pool methods */
+int lov_ost_pool_init(struct lu_tgt_pool *op, unsigned int count);
+int lov_ost_pool_extend(struct lu_tgt_pool *op, unsigned int min_count);
+int lov_ost_pool_add(struct lu_tgt_pool *op, u32 idx, unsigned int min_count);
+int lov_ost_pool_remove(struct lu_tgt_pool *op, u32 idx);
+int lov_ost_pool_free(struct lu_tgt_pool *op);
 
 /* high level pool methods */
 int lov_pool_new(struct obd_device *obd, char *poolname);
diff --git a/fs/lustre/lov/lov_pool.c b/fs/lustre/lov/lov_pool.c
index a0552fb..9ab81cb 100644
--- a/fs/lustre/lov/lov_pool.c
+++ b/fs/lustre/lov/lov_pool.c
@@ -231,7 +231,7 @@ static int pool_proc_open(struct inode *inode, struct file *file)
 };
 
 #define LOV_POOL_INIT_COUNT 2
-int lov_ost_pool_init(struct ost_pool *op, unsigned int count)
+int lov_ost_pool_init(struct lu_tgt_pool *op, unsigned int count)
 {
 	if (count == 0)
 		count = LOV_POOL_INIT_COUNT;
@@ -249,7 +249,7 @@ int lov_ost_pool_init(struct ost_pool *op, unsigned int count)
 }
 
 /* Caller must hold write op_rwlock */
-int lov_ost_pool_extend(struct ost_pool *op, unsigned int min_count)
+int lov_ost_pool_extend(struct lu_tgt_pool *op, unsigned int min_count)
 {
 	int new_count;
 	u32 *new;
@@ -273,7 +273,7 @@ int lov_ost_pool_extend(struct ost_pool *op, unsigned int min_count)
 	return 0;
 }
 
-int lov_ost_pool_add(struct ost_pool *op, u32 idx, unsigned int min_count)
+int lov_ost_pool_add(struct lu_tgt_pool *op, u32 idx, unsigned int min_count)
 {
 	int rc = 0, i;
 
@@ -298,7 +298,7 @@ int lov_ost_pool_add(struct ost_pool *op, u32 idx, unsigned int min_count)
 	return rc;
 }
 
-int lov_ost_pool_remove(struct ost_pool *op, u32 idx)
+int lov_ost_pool_remove(struct lu_tgt_pool *op, u32 idx)
 {
 	int i;
 
@@ -318,7 +318,7 @@ int lov_ost_pool_remove(struct ost_pool *op, u32 idx)
 	return -EINVAL;
 }
 
-int lov_ost_pool_free(struct ost_pool *op)
+int lov_ost_pool_free(struct lu_tgt_pool *op)
 {
 	if (op->op_size == 0)
 		return 0;
diff --git a/fs/lustre/obdclass/Makefile b/fs/lustre/obdclass/Makefile
index 5718a6d..9693a5e 100644
--- a/fs/lustre/obdclass/Makefile
+++ b/fs/lustre/obdclass/Makefile
@@ -8,4 +8,4 @@ obdclass-y := llog.o llog_cat.o llog_obd.o llog_swab.o class_obd.o \
 	      lustre_handles.o lustre_peer.o statfs_pack.o linkea.o \
 	      obdo.o obd_config.o obd_mount.o lu_object.o lu_ref.o \
 	      cl_object.o cl_page.o cl_lock.o cl_io.o kernelcomm.o \
-	      jobid.o integrity.o obd_cksum.o lu_qos.o lu_tgt_descs.o
+	      jobid.o integrity.o obd_cksum.o lu_tgt_descs.o
diff --git a/fs/lustre/obdclass/lu_qos.c b/fs/lustre/obdclass/lu_qos.c
deleted file mode 100644
index 13ab4a7..0000000
--- a/fs/lustre/obdclass/lu_qos.c
+++ /dev/null
@@ -1,512 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/*
- * GPL HEADER START
- *
- * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 only,
- * as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License version 2 for more details (a copy is included
- * in the LICENSE file that accompanied this code).
- *
- * You should have received a copy of the GNU General Public License
- * version 2 along with this program; If not, see
- * http://www.gnu.org/licenses/gpl-2.0.html
- *
- * GPL HEADER END
- */
-/*
- * This file is part of Lustre, http://www.lustre.org/
- *
- * lustre/obdclass/lu_qos.c
- *
- * Lustre QoS.
- * These are the only exported functions, they provide some generic
- * infrastructure for object allocation QoS
- *
- */
-
-#define DEBUG_SUBSYSTEM S_CLASS
-
-#include <linux/module.h>
-#include <linux/list.h>
-#include <linux/random.h>
-#include <obd_class.h>
-#include <obd_support.h>
-#include <lustre_disk.h>
-#include <lustre_fid.h>
-#include <lu_object.h>
-
-void lu_qos_rr_init(struct lu_qos_rr *lqr)
-{
-	spin_lock_init(&lqr->lqr_alloc);
-	lqr->lqr_dirty = 1;
-}
-EXPORT_SYMBOL(lu_qos_rr_init);
-
-/**
- * Add a new target to Quality of Service (QoS) target table.
- *
- * Add a new MDT/OST target to the structure representing an OSS. Resort the
- * list of known MDSs/OSSs by the number of MDTs/OSTs attached to each MDS/OSS.
- * The MDS/OSS list is protected internally and no external locking is required.
- *
- * @qos		lu_qos data
- * @ltd		target description
- *
- * Return:	0 on success
- *		-ENOMEM	on error
- */
-int lqos_add_tgt(struct lu_qos *qos, struct lu_tgt_desc *ltd)
-{
-	struct lu_svr_qos *svr = NULL;
-	struct lu_svr_qos *tempsvr;
-	struct obd_export *exp = ltd->ltd_exp;
-	int found = 0;
-	u32 id = 0;
-	int rc = 0;
-
-	down_write(&qos->lq_rw_sem);
-	/*
-	 * a bit hacky approach to learn NID of corresponding connection
-	 * but there is no official API to access information like this
-	 * with OSD API.
-	 */
-	list_for_each_entry(svr, &qos->lq_svr_list, lsq_svr_list) {
-		if (obd_uuid_equals(&svr->lsq_uuid,
-				    &exp->exp_connection->c_remote_uuid)) {
-			found++;
-			break;
-		}
-		if (svr->lsq_id > id)
-			id = svr->lsq_id;
-	}
-
-	if (!found) {
-		svr = kmalloc(sizeof(*svr), GFP_NOFS);
-		if (!svr) {
-			rc = -ENOMEM;
-			goto out;
-		}
-		memcpy(&svr->lsq_uuid, &exp->exp_connection->c_remote_uuid,
-		       sizeof(svr->lsq_uuid));
-		++id;
-		svr->lsq_id = id;
-	} else {
-		/* Assume we have to move this one */
-		list_del(&svr->lsq_svr_list);
-	}
-
-	svr->lsq_tgt_count++;
-	ltd->ltd_qos.ltq_svr = svr;
-
-	CDEBUG(D_OTHER, "add tgt %s to server %s (%d targets)\n",
-	       obd_uuid2str(&ltd->ltd_uuid), obd_uuid2str(&svr->lsq_uuid),
-	       svr->lsq_tgt_count);
-
-	/*
-	 * Add sorted by # of tgts.  Find the first entry that we're
-	 * bigger than...
-	 */
-	list_for_each_entry(tempsvr, &qos->lq_svr_list, lsq_svr_list) {
-		if (svr->lsq_tgt_count > tempsvr->lsq_tgt_count)
-			break;
-	}
-	/*
-	 * ...and add before it.  If we're the first or smallest, tempsvr
-	 * points to the list head, and we add to the end.
-	 */
-	list_add_tail(&svr->lsq_svr_list, &tempsvr->lsq_svr_list);
-
-	qos->lq_dirty = 1;
-	qos->lq_rr.lqr_dirty = 1;
-
-out:
-	up_write(&qos->lq_rw_sem);
-	return rc;
-}
-EXPORT_SYMBOL(lqos_add_tgt);
-
-/**
- * Remove MDT/OST target from QoS table.
- *
- * Removes given MDT/OST target from QoS table and releases related
- * MDS/OSS structure if no target remain on the MDS/OSS.
- *
- * @qos		lu_qos data
- * @ltd		target description
- *
- * Return:	0 on success
- *		-ENOENT	if no server was found
- */
-int lqos_del_tgt(struct lu_qos *qos, struct lu_tgt_desc *ltd)
-{
-	struct lu_svr_qos *svr;
-	int rc = 0;
-
-	down_write(&qos->lq_rw_sem);
-	svr = ltd->ltd_qos.ltq_svr;
-	if (!svr) {
-		rc = -ENOENT;
-		goto out;
-	}
-
-	svr->lsq_tgt_count--;
-	if (svr->lsq_tgt_count == 0) {
-		CDEBUG(D_OTHER, "removing server %s\n",
-		       obd_uuid2str(&svr->lsq_uuid));
-		list_del(&svr->lsq_svr_list);
-		ltd->ltd_qos.ltq_svr = NULL;
-		kfree(svr);
-	}
-
-	qos->lq_dirty = 1;
-	qos->lq_rr.lqr_dirty = 1;
-out:
-	up_write(&qos->lq_rw_sem);
-	return rc;
-}
-EXPORT_SYMBOL(lqos_del_tgt);
-
-/**
- * lu_prandom_u64_max - returns a pseudo-random u64 number in interval
- * [0, ep_ro)
- *
- * #ep_ro	right open interval endpoint
- *
- * Return:	a pseudo-random 64-bit number that is in interval [0, ep_ro).
- */
-u64 lu_prandom_u64_max(u64 ep_ro)
-{
-	u64 rand = 0;
-
-	if (ep_ro) {
-#if BITS_PER_LONG == 32
-		/*
-		 * If ep_ro > 32-bit, first generate the high
-		 * 32 bits of the random number, then add in the low
-		 * 32 bits (truncated to the upper limit, if needed)
-		 */
-		if (ep_ro > 0xffffffffULL)
-			rand = prandom_u32_max((u32)(ep_ro >> 32)) << 32;
-
-		if (rand == (ep_ro & 0xffffffff00000000ULL))
-			rand |= prandom_u32_max((u32)ep_ro);
-		else
-			rand |= prandom_u32();
-#else
-		rand = ((u64)prandom_u32() << 32 | prandom_u32()) % ep_ro;
-#endif
-	}
-
-	return rand;
-}
-EXPORT_SYMBOL(lu_prandom_u64_max);
-
-static inline u64 tgt_statfs_bavail(struct lu_tgt_desc *tgt)
-{
-	struct obd_statfs *statfs = &tgt->ltd_statfs;
-
-	return statfs->os_bavail * statfs->os_bsize;
-}
-
-static inline u64 tgt_statfs_iavail(struct lu_tgt_desc *tgt)
-{
-	return tgt->ltd_statfs.os_ffree;
-}
-
-/**
- * Calculate penalties per-tgt and per-server
- *
- * Re-calculate penalties when the configuration changes, active targets
- * change and after statfs refresh (all these are reflected by lq_dirty flag).
- * On every tgt and server: decay the penalty by half for every 8x the update
- * interval that the device has been idle. That gives lots of time for the
- * statfs information to be updated (which the penalty is only a proxy for),
- * and avoids penalizing server/tgt under light load.
- * See lqos_calc_weight() for how penalties are factored into the weight.
- *
- * @qos			lu_qos
- * @ltd			lu_tgt_descs
- * @active_tgt_nr	active tgt number
- * @ maxage		qos max age
- * @is_mdt		MDT will count inode usage
- *
- * Return:		0 on success
- *			-EAGAIN the number of tgt isn't enough or all
- *			tgt spaces are almost the same
- */
-int lqos_calc_penalties(struct lu_qos *qos, struct lu_tgt_descs *ltd,
-			u32 active_tgt_nr, u32 maxage, bool is_mdt)
-{
-	struct lu_tgt_desc *tgt;
-	struct lu_svr_qos *svr;
-	u64 ba_max, ba_min, ba;
-	u64 ia_max, ia_min, ia = 1;
-	u32 num_active;
-	int prio_wide;
-	time64_t now, age;
-	int rc;
-
-	if (!qos->lq_dirty) {
-		rc = 0;
-		goto out;
-	}
-
-	num_active = active_tgt_nr - 1;
-	if (num_active < 1) {
-		rc = -EAGAIN;
-		goto out;
-	}
-
-	/* find bavail on each server */
-	list_for_each_entry(svr, &qos->lq_svr_list, lsq_svr_list) {
-		svr->lsq_bavail = 0;
-		/* if inode is not counted, set to 1 to ignore */
-		svr->lsq_iavail = is_mdt ? 0 : 1;
-	}
-	qos->lq_active_svr_count = 0;
-
-	/*
-	 * How badly user wants to select targets "widely" (not recently chosen
-	 * and not on recent MDS's).  As opposed to "freely" (free space avail.)
-	 * 0-256
-	 */
-	prio_wide = 256 - qos->lq_prio_free;
-
-	ba_min = (u64)(-1);
-	ba_max = 0;
-	ia_min = (u64)(-1);
-	ia_max = 0;
-	now = ktime_get_real_seconds();
-
-	/* Calculate server penalty per object */
-	ltd_foreach_tgt(ltd, tgt) {
-		if (!tgt->ltd_active)
-			continue;
-
-		/* when inode is counted, bavail >> 16 to avoid overflow */
-		ba = tgt_statfs_bavail(tgt);
-		if (is_mdt)
-			ba >>= 16;
-		else
-			ba >>= 8;
-		if (!ba)
-			continue;
-
-		ba_min = min(ba, ba_min);
-		ba_max = max(ba, ba_max);
-
-		/* Count the number of usable servers */
-		if (tgt->ltd_qos.ltq_svr->lsq_bavail == 0)
-			qos->lq_active_svr_count++;
-		tgt->ltd_qos.ltq_svr->lsq_bavail += ba;
-
-		if (is_mdt) {
-			/* iavail >> 8 to avoid overflow */
-			ia = tgt_statfs_iavail(tgt) >> 8;
-			if (!ia)
-				continue;
-
-			ia_min = min(ia, ia_min);
-			ia_max = max(ia, ia_max);
-
-			tgt->ltd_qos.ltq_svr->lsq_iavail += ia;
-		}
-
-		/*
-		 * per-tgt penalty is
-		 * prio * bavail * iavail / (num_tgt - 1) / 2
-		 */
-		tgt->ltd_qos.ltq_penalty_per_obj = prio_wide * ba * ia >> 8;
-		do_div(tgt->ltd_qos.ltq_penalty_per_obj, num_active);
-		tgt->ltd_qos.ltq_penalty_per_obj >>= 1;
-
-		age = (now - tgt->ltd_qos.ltq_used) >> 3;
-		if (qos->lq_reset || age > 32 * maxage)
-			tgt->ltd_qos.ltq_penalty = 0;
-		else if (age > maxage)
-			/* Decay tgt penalty. */
-			tgt->ltd_qos.ltq_penalty >>= (age / maxage);
-	}
-
-	num_active = qos->lq_active_svr_count - 1;
-	if (num_active < 1) {
-		/*
-		 * If there's only 1 server, we can't penalize it, so instead
-		 * we have to double the tgt penalty
-		 */
-		num_active = 1;
-		ltd_foreach_tgt(ltd, tgt) {
-			if (!tgt->ltd_active)
-				continue;
-
-			tgt->ltd_qos.ltq_penalty_per_obj <<= 1;
-		}
-	}
-
-	/*
-	 * Per-server penalty is
-	 * prio * bavail * iavail / server_tgts / (num_svr - 1) / 2
-	 */
-	list_for_each_entry(svr, &qos->lq_svr_list, lsq_svr_list) {
-		ba = svr->lsq_bavail;
-		ia = svr->lsq_iavail;
-		svr->lsq_penalty_per_obj = prio_wide * ba  * ia >> 8;
-		do_div(ba, svr->lsq_tgt_count * num_active);
-		svr->lsq_penalty_per_obj >>= 1;
-
-		age = (now - svr->lsq_used) >> 3;
-		if (qos->lq_reset || age > 32 * maxage)
-			svr->lsq_penalty = 0;
-		else if (age > maxage)
-			/* Decay server penalty. */
-			svr->lsq_penalty >>= age / maxage;
-	}
-
-	qos->lq_dirty = 0;
-	qos->lq_reset = 0;
-
-	/*
-	 * If each tgt has almost same free space, do rr allocation for better
-	 * creation performance
-	 */
-	qos->lq_same_space = 0;
-	if ((ba_max * (256 - qos->lq_threshold_rr)) >> 8 < ba_min &&
-	    (ia_max * (256 - qos->lq_threshold_rr)) >> 8 < ia_min) {
-		qos->lq_same_space = 1;
-		/* Reset weights for the next time we enter qos mode */
-		qos->lq_reset = 1;
-	}
-	rc = 0;
-
-out:
-	if (!rc && qos->lq_same_space)
-		return -EAGAIN;
-
-	return rc;
-}
-EXPORT_SYMBOL(lqos_calc_penalties);
-
-bool lqos_is_usable(struct lu_qos *qos, u32 active_tgt_nr)
-{
-	if (!qos->lq_dirty && qos->lq_same_space)
-		return false;
-
-	if (active_tgt_nr < 2)
-		return false;
-
-	return true;
-}
-EXPORT_SYMBOL(lqos_is_usable);
-
-/**
- * Calculate weight for a given tgt.
- *
- * The final tgt weight is bavail >> 16 * iavail >> 8 minus the tgt and server
- * penalties.  See lqos_calc_ppts() for how penalties are calculated.
- *
- * @tgt		target descriptor
- */
-void lqos_calc_weight(struct lu_tgt_desc *tgt)
-{
-	struct lu_tgt_qos *ltq = &tgt->ltd_qos;
-	u64 temp, temp2;
-
-	temp = (tgt_statfs_bavail(tgt) >> 16) * (tgt_statfs_iavail(tgt) >> 8);
-	temp2 = ltq->ltq_penalty + ltq->ltq_svr->lsq_penalty;
-	if (temp < temp2)
-		ltq->ltq_weight = 0;
-	else
-		ltq->ltq_weight = temp - temp2;
-}
-EXPORT_SYMBOL(lqos_calc_weight);
-
-/**
- * Re-calculate weights.
- *
- * The function is called when some target was used for a new object. In
- * this case we should re-calculate all the weights to keep new allocations
- * balanced well.
- *
- * @qos			lu_qos
- * @ltd			lu_tgt_descs
- * @tgt			target where a new object was placed
- * @active_tgt_nr	active tgt number
- * @total_wt		new total weight for the pool
- *
- * Return:		0
- */
-int lqos_recalc_weight(struct lu_qos *qos, struct lu_tgt_descs *ltd,
-		       struct lu_tgt_desc *tgt, u32 active_tgt_nr,
-		       u64 *total_wt)
-{
-	struct lu_tgt_qos *ltq;
-	struct lu_svr_qos *svr;
-
-	ltq = &tgt->ltd_qos;
-	LASSERT(ltq);
-
-	/* Don't allocate on this device anymore, until the next alloc_qos */
-	ltq->ltq_usable = 0;
-
-	svr = ltq->ltq_svr;
-
-	/*
-	 * Decay old penalty by half (we're adding max penalty, and don't
-	 * want it to run away.)
-	 */
-	ltq->ltq_penalty >>= 1;
-	svr->lsq_penalty >>= 1;
-
-	/* mark the server and tgt as recently used */
-	ltq->ltq_used = svr->lsq_used = ktime_get_real_seconds();
-
-	/* Set max penalties for this tgt and server */
-	ltq->ltq_penalty += ltq->ltq_penalty_per_obj * active_tgt_nr;
-	svr->lsq_penalty += svr->lsq_penalty_per_obj * active_tgt_nr;
-
-	/* Decrease all MDS penalties */
-	list_for_each_entry(svr, &qos->lq_svr_list, lsq_svr_list) {
-		if (svr->lsq_penalty < svr->lsq_penalty_per_obj)
-			svr->lsq_penalty = 0;
-		else
-			svr->lsq_penalty -= svr->lsq_penalty_per_obj;
-	}
-
-	*total_wt = 0;
-	/* Decrease all tgt penalties */
-	ltd_foreach_tgt(ltd, tgt) {
-		if (!tgt->ltd_active)
-			continue;
-
-		if (ltq->ltq_penalty < ltq->ltq_penalty_per_obj)
-			ltq->ltq_penalty = 0;
-		else
-			ltq->ltq_penalty -= ltq->ltq_penalty_per_obj;
-
-		lqos_calc_weight(tgt);
-
-		/* Recalc the total weight of usable osts */
-		if (ltq->ltq_usable)
-			*total_wt += ltq->ltq_weight;
-
-		CDEBUG(D_OTHER,
-		       "recalc tgt %d usable=%d avail=%llu tgtppo=%llu tgtp=%llu svrppo=%llu svrp=%llu wt=%llu\n",
-		       tgt->ltd_index, ltq->ltq_usable,
-		       tgt_statfs_bavail(tgt) >> 10,
-		       ltq->ltq_penalty_per_obj >> 10,
-		       ltq->ltq_penalty >> 10,
-		       ltq->ltq_svr->lsq_penalty_per_obj >> 10,
-		       ltq->ltq_svr->lsq_penalty >> 10,
-		       ltq->ltq_weight >> 10);
-	}
-
-	return 0;
-}
-EXPORT_SYMBOL(lqos_recalc_weight);
diff --git a/fs/lustre/obdclass/lu_tgt_descs.c b/fs/lustre/obdclass/lu_tgt_descs.c
index 04d6acc..60c50a0 100644
--- a/fs/lustre/obdclass/lu_tgt_descs.c
+++ b/fs/lustre/obdclass/lu_tgt_descs.c
@@ -35,6 +35,7 @@
 
 #include <linux/module.h>
 #include <linux/list.h>
+#include <linux/random.h>
 #include <obd_class.h>
 #include <obd_support.h>
 #include <lustre_disk.h>
@@ -42,17 +43,221 @@
 #include <lu_object.h>
 
 /**
+ * lu_prandom_u64_max - returns a pseudo-random u64 number in interval
+ * [0, ep_ro)
+ *
+ * @ep_ro	right open interval endpoint
+ *
+ * Return:	a pseudo-random 64-bit number that is in interval [0, ep_ro).
+ */
+u64 lu_prandom_u64_max(u64 ep_ro)
+{
+	u64 rand = 0;
+
+	if (ep_ro) {
+#if BITS_PER_LONG == 32
+		/*
+		 * If ep_ro > 32-bit, first generate the high
+		 * 32 bits of the random number, then add in the low
+		 * 32 bits (truncated to the upper limit, if needed)
+		 */
+		if (ep_ro > 0xffffffffULL)
+			rand = prandom_u32_max((u32)(ep_ro >> 32)) << 32;
+
+		if (rand == (ep_ro & 0xffffffff00000000ULL))
+			rand |= prandom_u32_max((u32)ep_ro);
+		else
+			rand |= prandom_u32();
+#else
+		rand = ((u64)prandom_u32() << 32 | prandom_u32()) % ep_ro;
+#endif
+	}
+
+	return rand;
+}
+EXPORT_SYMBOL(lu_prandom_u64_max);
+
+void lu_qos_rr_init(struct lu_qos_rr *lqr)
+{
+	spin_lock_init(&lqr->lqr_alloc);
+	lqr->lqr_dirty = 1;
+}
+EXPORT_SYMBOL(lu_qos_rr_init);
+
+/**
+ * Add a new target to Quality of Service (QoS) target table.
+ *
+ * Add a new MDT/OST target to the structure representing an OSS. Resort the
+ * list of known MDSs/OSSs by the number of MDTs/OSTs attached to each MDS/OSS.
+ * The MDS/OSS list is protected internally and no external locking is required.
+ *
+ * @qos		lu_qos data
+ * @tgt		target description
+ *
+ * Return:	0 on success
+ *		-ENOMEM on error
+ */
+int lu_qos_add_tgt(struct lu_qos *qos, struct lu_tgt_desc *tgt)
+{
+	struct lu_svr_qos *svr = NULL;
+	struct lu_svr_qos *tempsvr;
+	struct obd_export *exp = tgt->ltd_exp;
+	int found = 0;
+	u32 id = 0;
+	int rc = 0;
+
+	/* tgt not connected, this function will be called again later */
+	if (!exp)
+		return 0;
+
+	down_write(&qos->lq_rw_sem);
+	/*
+	 * a bit hacky approach to learn NID of corresponding connection
+	 * but there is no official API to access information like this
+	 * with OSD API.
+	 */
+	list_for_each_entry(svr, &qos->lq_svr_list, lsq_svr_list) {
+		if (obd_uuid_equals(&svr->lsq_uuid,
+				    &exp->exp_connection->c_remote_uuid)) {
+			found++;
+			break;
+		}
+		if (svr->lsq_id > id)
+			id = svr->lsq_id;
+	}
+
+	if (!found) {
+		svr = kzalloc(sizeof(*svr), GFP_NOFS);
+		if (!svr) {
+			rc = -ENOMEM;
+			goto out;
+		}
+		memcpy(&svr->lsq_uuid, &exp->exp_connection->c_remote_uuid,
+		       sizeof(svr->lsq_uuid));
+		++id;
+		svr->lsq_id = id;
+	} else {
+		/* Assume we have to move this one */
+		list_del(&svr->lsq_svr_list);
+	}
+
+	svr->lsq_tgt_count++;
+	tgt->ltd_qos.ltq_svr = svr;
+
+	CDEBUG(D_OTHER, "add tgt %s to server %s (%d targets)\n",
+	       obd_uuid2str(&tgt->ltd_uuid), obd_uuid2str(&svr->lsq_uuid),
+	       svr->lsq_tgt_count);
+
+	/*
+	 * Add sorted by # of tgts.  Find the first entry that we're
+	 * bigger than...
+	 */
+	list_for_each_entry(tempsvr, &qos->lq_svr_list, lsq_svr_list) {
+		if (svr->lsq_tgt_count > tempsvr->lsq_tgt_count)
+			break;
+	}
+	/*
+	 * ...and add before it.  If we're the first or smallest, tempsvr
+	 * points to the list head, and we add to the end.
+	 */
+	list_add_tail(&svr->lsq_svr_list, &tempsvr->lsq_svr_list);
+
+	qos->lq_dirty = 1;
+	qos->lq_rr.lqr_dirty = 1;
+
+out:
+	up_write(&qos->lq_rw_sem);
+	return rc;
+}
+EXPORT_SYMBOL(lu_qos_add_tgt);
+
+/**
+ * Remove MDT/OST target from QoS table.
+ *
+ * Removes given MDT/OST target from QoS table and releases related
+ * MDS/OSS structure if no target remain on the MDS/OSS.
+ *
+ * @qos		lu_qos data
+ * @ltd		target description
+ *
+ * Return:	0 on success
+ *		-ENOENT if no server was found
+ */
+static int lu_qos_del_tgt(struct lu_qos *qos, struct lu_tgt_desc *ltd)
+{
+	struct lu_svr_qos *svr;
+	int rc = 0;
+
+	down_write(&qos->lq_rw_sem);
+	svr = ltd->ltd_qos.ltq_svr;
+	if (!svr) {
+		rc = -ENOENT;
+		goto out;
+	}
+
+	svr->lsq_tgt_count--;
+	if (svr->lsq_tgt_count == 0) {
+		CDEBUG(D_OTHER, "removing server %s\n",
+		       obd_uuid2str(&svr->lsq_uuid));
+		list_del(&svr->lsq_svr_list);
+		ltd->ltd_qos.ltq_svr = NULL;
+		kfree(svr);
+	}
+
+	qos->lq_dirty = 1;
+	qos->lq_rr.lqr_dirty = 1;
+out:
+	up_write(&qos->lq_rw_sem);
+	return rc;
+}
+
+static inline u64 tgt_statfs_bavail(struct lu_tgt_desc *tgt)
+{
+	struct obd_statfs *statfs = &tgt->ltd_statfs;
+
+	return statfs->os_bavail * statfs->os_bsize;
+}
+
+static inline u64 tgt_statfs_iavail(struct lu_tgt_desc *tgt)
+{
+	return tgt->ltd_statfs.os_ffree;
+}
+
+/**
+ * Calculate weight for a given tgt.
+ *
+ * The final tgt weight is bavail >> 16 * iavail >> 8 minus the tgt and server
+ * penalties.  See ltd_qos_penalties_calc() for how penalties are calculated.
+ *
+ * @tgt		target descriptor
+ */
+void lu_tgt_qos_weight_calc(struct lu_tgt_desc *tgt)
+{
+	struct lu_tgt_qos *ltq = &tgt->ltd_qos;
+	u64 temp, temp2;
+
+	temp = (tgt_statfs_bavail(tgt) >> 16) * (tgt_statfs_iavail(tgt) >> 8);
+	temp2 = ltq->ltq_penalty + ltq->ltq_svr->lsq_penalty;
+	if (temp < temp2)
+		ltq->ltq_weight = 0;
+	else
+		ltq->ltq_weight = temp - temp2;
+}
+EXPORT_SYMBOL(lu_tgt_qos_weight_calc);
+
+/**
  * Allocate and initialize target table.
  *
  * A helper function to initialize the target table and allocate
  * a bitmap of the available targets.
  *
  * @ltd		target's table to initialize
+ * @is_mdt	target table for MDTs
  *
  * Return:	0 on success
  *		negated errno on error
  **/
-int lu_tgt_descs_init(struct lu_tgt_descs *ltd)
+int lu_tgt_descs_init(struct lu_tgt_descs *ltd, bool is_mdt)
 {
 	mutex_init(&ltd->ltd_mutex);
 	init_rwsem(&ltd->ltd_rw_sem);
@@ -66,11 +271,22 @@ int lu_tgt_descs_init(struct lu_tgt_descs *ltd)
 		return -ENOMEM;
 
 	ltd->ltd_tgts_size  = BITS_PER_LONG;
-	ltd->ltd_tgtnr      = 0;
-
 	ltd->ltd_death_row = 0;
 	ltd->ltd_refcount  = 0;
 
+	/* Set up allocation policy (QoS and RR) */
+	INIT_LIST_HEAD(&ltd->ltd_qos.lq_svr_list);
+	init_rwsem(&ltd->ltd_qos.lq_rw_sem);
+	ltd->ltd_qos.lq_dirty = 1;
+	ltd->ltd_qos.lq_reset = 1;
+	/* Default priority is toward free space balance */
+	ltd->ltd_qos.lq_prio_free = 232;
+	/* Default threshold for rr (roughly 17%) */
+	ltd->ltd_qos.lq_threshold_rr = 43;
+	ltd->ltd_is_mdt = is_mdt;
+
+	lu_qos_rr_init(&ltd->ltd_qos.lq_rr);
+
 	return 0;
 }
 EXPORT_SYMBOL(lu_tgt_descs_init);
@@ -147,7 +363,7 @@ static int lu_tgt_descs_resize(struct lu_tgt_descs *ltd, u32 newsize)
  *		-ENOMEM if reallocation failed
  *		-EEXIST if target existed
  */
-int lu_tgt_descs_add(struct lu_tgt_descs *ltd, struct lu_tgt_desc *tgt)
+int ltd_add_tgt(struct lu_tgt_descs *ltd, struct lu_tgt_desc *tgt)
 {
 	u32 index = tgt->ltd_index;
 	int rc;
@@ -174,19 +390,294 @@ int lu_tgt_descs_add(struct lu_tgt_descs *ltd, struct lu_tgt_desc *tgt)
 
 	LTD_TGT(ltd, tgt->ltd_index) = tgt;
 	set_bit(tgt->ltd_index, ltd->ltd_tgt_bitmap);
-	ltd->ltd_tgtnr++;
+
+	ltd->ltd_lov_desc.ld_tgt_count++;
+	if (tgt->ltd_active)
+		ltd->ltd_lov_desc.ld_active_tgt_count++;
 
 	return 0;
 }
-EXPORT_SYMBOL(lu_tgt_descs_add);
+EXPORT_SYMBOL(ltd_add_tgt);
 
 /**
  * Delete target from target table
  */
-void lu_tgt_descs_del(struct lu_tgt_descs *ltd, struct lu_tgt_desc *tgt)
+void ltd_del_tgt(struct lu_tgt_descs *ltd, struct lu_tgt_desc *tgt)
 {
+	lu_qos_del_tgt(&ltd->ltd_qos, tgt);
 	LTD_TGT(ltd, tgt->ltd_index) = NULL;
 	clear_bit(tgt->ltd_index, ltd->ltd_tgt_bitmap);
-	ltd->ltd_tgtnr--;
+	ltd->ltd_lov_desc.ld_tgt_count--;
+	if (tgt->ltd_active)
+		ltd->ltd_lov_desc.ld_active_tgt_count--;
+}
+EXPORT_SYMBOL(ltd_del_tgt);
+
+/**
+ * Whether QoS data is up-to-date and QoS can be applied.
+ */
+bool ltd_qos_is_usable(struct lu_tgt_descs *ltd)
+{
+	if (!ltd->ltd_qos.lq_dirty && ltd->ltd_qos.lq_same_space)
+		return false;
+
+	if (ltd->ltd_lov_desc.ld_active_tgt_count < 2)
+		return false;
+
+	return true;
+}
+EXPORT_SYMBOL(ltd_qos_is_usable);
+
+/**
+ * Calculate penalties per-tgt and per-server
+ *
+ * Re-calculate penalties when the configuration changes, active targets
+ * change and after statfs refresh (all these are reflected by lq_dirty flag).
+ * On every tgt and server: decay the penalty by half for every 8x the update
+ * interval that the device has been idle. That gives lots of time for the
+ * statfs information to be updated (which the penalty is only a proxy for),
+ * and avoids penalizing server/tgt under light load.
+ * See lu_qos_tgt_weight_calc() for how penalties are factored into the weight.
+ *
+ * \param[in] ltd		lu_tgt_descs
+ *
+ * \retval 0		on success
+ * \retval -EAGAIN	the number of tgt isn't enough or all tgt spaces are
+ *			almost the same
+ */
+int ltd_qos_penalties_calc(struct lu_tgt_descs *ltd)
+{
+	struct lu_qos *qos = &ltd->ltd_qos;
+	struct lov_desc *desc = &ltd->ltd_lov_desc;
+	struct lu_tgt_desc *tgt;
+	struct lu_svr_qos *svr;
+	u64 ba_max, ba_min, ba;
+	u64 ia_max, ia_min, ia = 1;
+	u32 num_active;
+	int prio_wide;
+	time64_t now, age;
+	int rc;
+
+	if (!qos->lq_dirty) {
+		rc = 0;
+		goto out;
+	}
+
+	num_active = desc->ld_active_tgt_count - 1;
+	if (num_active < 1) {
+		rc = -EAGAIN;
+		goto out;
+	}
+
+	/* find bavail on each server */
+	list_for_each_entry(svr, &qos->lq_svr_list, lsq_svr_list) {
+		svr->lsq_bavail = 0;
+		/* if inode is not counted, set to 1 to ignore */
+		svr->lsq_iavail = ltd->ltd_is_mdt ? 0 : 1;
+	}
+	qos->lq_active_svr_count = 0;
+
+	/*
+	 * How badly user wants to select targets "widely" (not recently chosen
+	 * and not on recent MDS's).  As opposed to "freely" (free space avail.)
+	 * 0-256
+	 */
+	prio_wide = 256 - qos->lq_prio_free;
+
+	ba_min = (u64)(-1);
+	ba_max = 0;
+	ia_min = (u64)(-1);
+	ia_max = 0;
+	now = ktime_get_real_seconds();
+
+	/* Calculate server penalty per object */
+	ltd_foreach_tgt(ltd, tgt) {
+		if (!tgt->ltd_active)
+			continue;
+
+		/* when inode is counted, bavail >> 16 to avoid overflow */
+		ba = tgt_statfs_bavail(tgt);
+		if (ltd->ltd_is_mdt)
+			ba >>= 16;
+		else
+			ba >>= 8;
+		if (!ba)
+			continue;
+
+		ba_min = min(ba, ba_min);
+		ba_max = max(ba, ba_max);
+
+		/* Count the number of usable servers */
+		if (tgt->ltd_qos.ltq_svr->lsq_bavail == 0)
+			qos->lq_active_svr_count++;
+		tgt->ltd_qos.ltq_svr->lsq_bavail += ba;
+
+		if (ltd->ltd_is_mdt) {
+			/* iavail >> 8 to avoid overflow */
+			ia = tgt_statfs_iavail(tgt) >> 8;
+			if (!ia)
+				continue;
+
+			ia_min = min(ia, ia_min);
+			ia_max = max(ia, ia_max);
+
+			tgt->ltd_qos.ltq_svr->lsq_iavail += ia;
+		}
+
+		/*
+		 * per-tgt penalty is
+		 * prio * bavail * iavail / (num_tgt - 1) / 2
+		 */
+		tgt->ltd_qos.ltq_penalty_per_obj = prio_wide * ba * ia;
+		do_div(tgt->ltd_qos.ltq_penalty_per_obj, num_active);
+		tgt->ltd_qos.ltq_penalty_per_obj >>= 1;
+
+		age = (now - tgt->ltd_qos.ltq_used) >> 3;
+		if (qos->lq_reset || age > 32 * desc->ld_qos_maxage)
+			tgt->ltd_qos.ltq_penalty = 0;
+		else if (age > desc->ld_qos_maxage)
+			/* Decay tgt penalty. */
+			tgt->ltd_qos.ltq_penalty >>= age / desc->ld_qos_maxage;
+	}
+
+	num_active = qos->lq_active_svr_count - 1;
+	if (num_active < 1) {
+		/*
+		 * If there's only 1 server, we can't penalize it, so instead
+		 * we have to double the tgt penalty
+		 */
+		num_active = 1;
+		ltd_foreach_tgt(ltd, tgt) {
+			if (!tgt->ltd_active)
+				continue;
+
+			tgt->ltd_qos.ltq_penalty_per_obj <<= 1;
+		}
+	}
+
+	/*
+	 * Per-server penalty is
+	 * prio * bavail * iavail / server_tgts / (num_svr - 1) / 2
+	 */
+	list_for_each_entry(svr, &qos->lq_svr_list, lsq_svr_list) {
+		ba = svr->lsq_bavail;
+		ia = svr->lsq_iavail;
+		svr->lsq_penalty_per_obj = prio_wide * ba  * ia;
+		do_div(ba, svr->lsq_tgt_count * num_active);
+		svr->lsq_penalty_per_obj >>= 1;
+
+		age = (now - svr->lsq_used) >> 3;
+		if (qos->lq_reset || age > 32 * desc->ld_qos_maxage)
+			svr->lsq_penalty = 0;
+		else if (age > desc->ld_qos_maxage)
+			/* Decay server penalty. */
+			svr->lsq_penalty >>= age / desc->ld_qos_maxage;
+	}
+
+	qos->lq_dirty = 0;
+	qos->lq_reset = 0;
+
+	/*
+	 * If each tgt has almost same free space, do rr allocation for better
+	 * creation performance
+	 */
+	qos->lq_same_space = 0;
+	if ((ba_max * (256 - qos->lq_threshold_rr)) >> 8 < ba_min &&
+	    (ia_max * (256 - qos->lq_threshold_rr)) >> 8 < ia_min) {
+		qos->lq_same_space = 1;
+		/* Reset weights for the next time we enter qos mode */
+		qos->lq_reset = 1;
+	}
+	rc = 0;
+
+out:
+	if (!rc && qos->lq_same_space)
+		return -EAGAIN;
+
+	return rc;
+}
+EXPORT_SYMBOL(ltd_qos_penalties_calc);
+
+/**
+ * Re-calculate penalties and weights of all tgts.
+ *
+ * The function is called when some target was used for a new object. In
+ * this case we should re-calculate all the weights to keep new allocations
+ * balanced well.
+ *
+ * \param[in] ltd		lu_tgt_descs
+ * \param[in] tgt		recently used tgt
+ * \param[out] total_wt		new total weight for the pool
+ *
+ * \retval		0
+ */
+int ltd_qos_update(struct lu_tgt_descs *ltd, struct lu_tgt_desc *tgt,
+		   u64 *total_wt)
+{
+	struct lu_qos *qos = &ltd->ltd_qos;
+	struct lu_tgt_qos *ltq;
+	struct lu_svr_qos *svr;
+
+	ltq = &tgt->ltd_qos;
+	LASSERT(ltq);
+
+	/* Don't allocate on this device anymore, until the next alloc_qos */
+	ltq->ltq_usable = 0;
+
+	svr = ltq->ltq_svr;
+
+	/*
+	 * Decay old penalty by half (we're adding max penalty, and don't
+	 * want it to run away.)
+	 */
+	ltq->ltq_penalty >>= 1;
+	svr->lsq_penalty >>= 1;
+
+	/* mark the server and tgt as recently used */
+	ltq->ltq_used = svr->lsq_used = ktime_get_real_seconds();
+
+	/* Set max penalties for this tgt and server */
+	ltq->ltq_penalty += ltq->ltq_penalty_per_obj *
+			    ltd->ltd_lov_desc.ld_active_tgt_count;
+	svr->lsq_penalty += svr->lsq_penalty_per_obj *
+			    ltd->ltd_lov_desc.ld_active_tgt_count;
+
+	/* Decrease all MDS penalties */
+	list_for_each_entry(svr, &qos->lq_svr_list, lsq_svr_list) {
+		if (svr->lsq_penalty < svr->lsq_penalty_per_obj)
+			svr->lsq_penalty = 0;
+		else
+			svr->lsq_penalty -= svr->lsq_penalty_per_obj;
+	}
+
+	*total_wt = 0;
+	/* Decrease all tgt penalties */
+	ltd_foreach_tgt(ltd, tgt) {
+		if (!tgt->ltd_active)
+			continue;
+
+		if (ltq->ltq_penalty < ltq->ltq_penalty_per_obj)
+			ltq->ltq_penalty = 0;
+		else
+			ltq->ltq_penalty -= ltq->ltq_penalty_per_obj;
+
+		lu_tgt_qos_weight_calc(tgt);
+
+		/* Recalc the total weight of usable osts */
+		if (ltq->ltq_usable)
+			*total_wt += ltq->ltq_weight;
+
+		CDEBUG(D_OTHER,
+		       "recalc tgt %d usable=%d avail=%llu tgtppo=%llu tgtp=%llu svrppo=%llu svrp=%llu wt=%llu\n",
+		       tgt->ltd_index, ltq->ltq_usable,
+		       tgt_statfs_bavail(tgt) >> 10,
+		       ltq->ltq_penalty_per_obj >> 10,
+		       ltq->ltq_penalty >> 10,
+		       ltq->ltq_svr->lsq_penalty_per_obj >> 10,
+		       ltq->ltq_svr->lsq_penalty >> 10,
+		       ltq->ltq_weight >> 10);
+	}
+
+	return 0;
 }
-EXPORT_SYMBOL(lu_tgt_descs_del);
+EXPORT_SYMBOL(ltd_qos_update);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 506/622] lustre: ptlrpc: Properly swab ll_fiemap_info_key
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (504 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 505/622] lustre: obdclass: lu_tgt_descs cleanup James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 507/622] lustre: llite: clear flock when using localflock James Simmons
                   ` (116 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Oleg Drokin <green@whamcloud.com>

It was using lustre_swab_fiemap, which is incorrect since the
structures don't match.

Added lustre_swab_fiemap_info_key that swabs embedded
obdo and ll_fiemap_info_key structures.
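
The problem can be illustrated with a standalone sketch. The struct
layouts below are hypothetical stand-ins, not the actual Lustre
definitions: the point is only that an outer key type embeds extra
fields before the inner struct, so running the inner type's swabber
over the outer type would byte-swap the wrong offsets.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simplified layouts -- not the real Lustre structs. */
struct hdr {
	uint64_t start;
	uint32_t flags;
	uint32_t count;
};

struct info_key {
	uint64_t oa_id;   /* extra field before the embedded header */
	struct hdr h;
};

static uint64_t swab64(uint64_t v) { return __builtin_bswap64(v); }
static uint32_t swab32(uint32_t v) { return __builtin_bswap32(v); }

static void swab_hdr(struct hdr *h)
{
	h->start = swab64(h->start);
	h->flags = swab32(h->flags);
	h->count = swab32(h->count);
}

/* Correct approach, mirroring the patch: swab each embedded member
 * with its own swabber instead of reusing the inner type's swabber
 * on the whole outer struct.
 */
static void swab_info_key(struct info_key *k)
{
	k->oa_id = swab64(k->oa_id);
	swab_hdr(&k->h);
}
```

Swabbing is an involution, so applying the correct swabber twice must
round-trip every field; that property fails if the wrong struct layout
is used.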

WC-bug-id: https://jira.whamcloud.com/browse/LU-11997
Lustre-commit: 2b905746ee3b ("LU-11997 ptlrpc: Properly swab ll_fiemap_info_key")
Signed-off-by: Oleg Drokin <green@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/36308
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Li Xi <lixi@ddn.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_swab.h |  1 +
 fs/lustre/ptlrpc/layout.c       |  4 ++--
 fs/lustre/ptlrpc/pack_generic.c | 17 ++++++++++++++---
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/lustre/include/lustre_swab.h b/fs/lustre/include/lustre_swab.h
index dd3c50c..a5c1de5 100644
--- a/fs/lustre/include/lustre_swab.h
+++ b/fs/lustre/include/lustre_swab.h
@@ -81,6 +81,7 @@
 void lustre_swab_ost_body(struct ost_body *b);
 void lustre_swab_ost_last_id(u64 *id);
 void lustre_swab_fiemap(struct fiemap *fiemap);
+void lustre_swab_fiemap_info_key(struct ll_fiemap_info_key *fiemap_info);
 void lustre_swab_lov_user_md_v1(struct lov_user_md_v1 *lum);
 void lustre_swab_lov_user_md_v3(struct lov_user_md_v3 *lum);
 void lustre_swab_lov_comp_md_v1(struct lov_comp_md_v1 *lum);
diff --git a/fs/lustre/ptlrpc/layout.c b/fs/lustre/ptlrpc/layout.c
index dd04eee..06db86d 100644
--- a/fs/lustre/ptlrpc/layout.c
+++ b/fs/lustre/ptlrpc/layout.c
@@ -1134,8 +1134,8 @@ struct req_msg_field RMF_OST_ID =
 EXPORT_SYMBOL(RMF_OST_ID);
 
 struct req_msg_field RMF_FIEMAP_KEY =
-	DEFINE_MSGF("fiemap", 0, sizeof(struct ll_fiemap_info_key),
-		    lustre_swab_fiemap, NULL);
+	DEFINE_MSGF("fiemap_key", 0, sizeof(struct ll_fiemap_info_key),
+		    lustre_swab_fiemap_info_key, NULL);
 EXPORT_SYMBOL(RMF_FIEMAP_KEY);
 
 struct req_msg_field RMF_FIEMAP_VAL =
diff --git a/fs/lustre/ptlrpc/pack_generic.c b/fs/lustre/ptlrpc/pack_generic.c
index 9b28624..b569d57 100644
--- a/fs/lustre/ptlrpc/pack_generic.c
+++ b/fs/lustre/ptlrpc/pack_generic.c
@@ -1913,21 +1913,32 @@ static void lustre_swab_fiemap_extent(struct fiemap_extent *fm_extent)
 	__swab32s(&fm_extent->fe_device);
 }
 
-void lustre_swab_fiemap(struct fiemap *fiemap)
+static void lustre_swab_fiemap_hdr(struct fiemap *fiemap)
 {
-	u32 i;
-
 	__swab64s(&fiemap->fm_start);
 	__swab64s(&fiemap->fm_length);
 	__swab32s(&fiemap->fm_flags);
 	__swab32s(&fiemap->fm_mapped_extents);
 	__swab32s(&fiemap->fm_extent_count);
 	__swab32s(&fiemap->fm_reserved);
+}
+
+void lustre_swab_fiemap(struct fiemap *fiemap)
+{
+	u32 i;
+
+	lustre_swab_fiemap_hdr(fiemap);
 
 	for (i = 0; i < fiemap->fm_mapped_extents; i++)
 		lustre_swab_fiemap_extent(&fiemap->fm_extents[i]);
 }
 
+void lustre_swab_fiemap_info_key(struct ll_fiemap_info_key *fiemap_info)
+{
+	lustre_swab_obdo(&fiemap_info->lfik_oa);
+	lustre_swab_fiemap_hdr(&fiemap_info->lfik_fiemap);
+}
+
 void lustre_swab_mdt_rec_reint (struct mdt_rec_reint *rr)
 {
 	__swab32s(&rr->rr_opcode);
-- 
1.8.3.1


* [lustre-devel] [PATCH 507/622] lustre: llite: clear flock when using localflock
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (505 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 506/622] lustre: ptlrpc: Properly swab ll_fiemap_info_key James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 508/622] lustre: sec: reserve flags for client side encryption James Simmons
                   ` (115 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

When mounting a client with "-o localflock" or equivalent option in
/etc/fstab, it does not clear out the "flock" mount option flag from
the superblock.  This results in "flock" still being the option used
and it displays both options in the /proc/mounts output:

  10.0.0.1@o2ib:/lfs on /mnt/lfs type lustre (rw,flock,localflock)

Mount a client with both "flock,localflock" as mount options and
verify that the "flock" option is cleared by "localflock", and
vice versa.  Verify that "noflock" clears both options.
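
The mutual exclusion the fix enforces can be sketched in isolation
(the flag values here are illustrative, not Lustre's actual bits):
setting one flock mode clears the other, just as the patched code does
with `*flags = (*flags & ~LL_SBI_LOCALFLOCK) | tmp;`.

```c
#include <assert.h>

/* Hypothetical flag values -- illustrative only. */
#define SBI_FLOCK      0x01
#define SBI_LOCALFLOCK 0x02

/* Each setter clears the opposing mode before setting its own bit,
 * so the last option on the mount command line wins.
 */
static void set_flock(int *flags)
{
	*flags = (*flags & ~SBI_LOCALFLOCK) | SBI_FLOCK;
}

static void set_localflock(int *flags)
{
	*flags = (*flags & ~SBI_FLOCK) | SBI_LOCALFLOCK;
}

static void set_noflock(int *flags)
{
	*flags &= ~(SBI_FLOCK | SBI_LOCALFLOCK);
}
```

Without the clearing mask, "flock,localflock" would leave both bits
set, which is exactly the double "rw,flock,localflock" output shown
above.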

Remove the "remount_client()" helper in conf-sanity.sh, since this
shadows a helper function of the same name in test-framework.sh and
is confusing.  Instead, use "mount_client()" now that it can accept
mount options, and just pass "remount" explicitly in a few places.

Fixes: 083c51418b67 ("lustre: llite: enable flock mount option by default")
WC-bug-id: https://jira.whamcloud.com/browse/LU-12859
Lustre-commit: 22ee4a1f64ec ("LU-12859 llite: clear flock when using localflock")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/36452
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_lib.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 4580be3..49490ee 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -823,12 +823,12 @@ static int ll_options(char *options, struct ll_sb_info *sbi)
 		}
 		tmp = ll_set_opt("flock", s1, LL_SBI_FLOCK);
 		if (tmp) {
-			*flags |= tmp;
+			*flags = (*flags & ~LL_SBI_LOCALFLOCK) | tmp;
 			goto next;
 		}
 		tmp = ll_set_opt("localflock", s1, LL_SBI_LOCALFLOCK);
 		if (tmp) {
-			*flags |= tmp;
+			*flags = (*flags & ~LL_SBI_FLOCK) | tmp;
 			goto next;
 		}
 		tmp = ll_set_opt("noflock", s1,
@@ -2672,11 +2672,16 @@ int ll_show_options(struct seq_file *seq, struct dentry *dentry)
 	if (sbi->ll_flags & LL_SBI_NOLCK)
 		seq_puts(seq, ",nolock");
 
+	/* "flock" is the default since 2.13, but it wasn't for many years,
+	 * so it is still useful to print this to show it is enabled.
+	 * Start to print "noflock" so it is now clear when flock is disabled.
+	 */
 	if (sbi->ll_flags & LL_SBI_FLOCK)
 		seq_puts(seq, ",flock");
-
-	if (sbi->ll_flags & LL_SBI_LOCALFLOCK)
+	else if (sbi->ll_flags & LL_SBI_LOCALFLOCK)
 		seq_puts(seq, ",localflock");
+	else
+		seq_puts(seq, ",noflock");
 
 	if (sbi->ll_flags & LL_SBI_USER_XATTR)
 		seq_puts(seq, ",user_xattr");
-- 
1.8.3.1


* [lustre-devel] [PATCH 508/622] lustre: sec: reserve flags for client side encryption
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (506 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 507/622] lustre: llite: clear flock when using localflock James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 509/622] lustre: llite: limit max xattr size by kernel value James Simmons
                   ` (114 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Sebastien Buisson <sbuisson@ddn.com>

Reserve the OBD_CONNECT2_ENC connection flag so that the 'encrypt' and
'test_dummy_encryption' client mount options can only be used if the
server side knows how to handle encrypted object sizes properly.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12275
Lustre-commit: 4f9632f97011 ("LU-12275 sec: reserve flags for client side encryption")
Signed-off-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-on: https://review.whamcloud.com/36360
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/lprocfs_status.c    | 1 +
 fs/lustre/ptlrpc/wiretest.c            | 2 ++
 include/uapi/linux/lustre/lustre_idl.h | 8 ++++----
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/lustre/obdclass/lprocfs_status.c b/fs/lustre/obdclass/lprocfs_status.c
index ca169ec..98d1e3b 100644
--- a/fs/lustre/obdclass/lprocfs_status.c
+++ b/fs/lustre/obdclass/lprocfs_status.c
@@ -126,6 +126,7 @@
 	"pcc",			/* 0x1000 */
 	"plain_layout",		/* 0x2000 */
 	"async_discard",	/* 0x4000 */
+	"client_encryption",	/* 0x8000 */
 	NULL
 };
 
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index c0b4ad9..da51dc1 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1160,6 +1160,8 @@ void lustre_assert_wire_constants(void)
 		 OBD_CONNECT2_PCC);
 	LASSERTF(OBD_CONNECT2_ASYNC_DISCARD == 0x4000ULL, "found 0x%.16llxULL\n",
 		 OBD_CONNECT2_ASYNC_DISCARD);
+	LASSERTF(OBD_CONNECT2_ENCRYPT == 0x8000ULL, "found 0x%.16llxULL\n",
+		 OBD_CONNECT2_ENCRYPT);
 	LASSERTF(OBD_CKSUM_CRC32 == 0x00000001UL, "found 0x%.8xUL\n",
 		 (unsigned int)OBD_CKSUM_CRC32);
 	LASSERTF(OBD_CKSUM_ADLER == 0x00000002UL, "found 0x%.8xUL\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index d4b29d8..4277ac6 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -813,15 +813,15 @@ struct ptlrpc_body_v2 {
 #define OBD_CONNECT2_ASYNC_DISCARD     0x4000ULL /* support async DoM data
 						  * discard
 						  */
-
+#define OBD_CONNECT2_ENCRYPT	       0x8000ULL /* client-to-disk encrypt */
 /* XXX README XXX:
  * Please DO NOT add flag values here before first ensuring that this same
  * flag value is not in use on some other branch.  Please clear any such
  * changes with senior engineers before starting to use a new flag.  Then,
  * submit a small patch against EVERY branch that ONLY adds the new flag,
- * updates obd_connect_names[] for lprocfs_rd_connect_flags(), adds the
- * flag to check_obd_connect_data(), and updates wiretests accordingly, so it
- * can be approved and landed easily to reserve the flag for future use.
+ * updates obd_connect_names[], adds the flag to check_obd_connect_data(),
+ * and updates wiretests accordingly, so it can be approved and landed easily
+ * to reserve the flag for future use.
  */
 
 /* The MNE_SWAB flag is overloading the MDS_MDS bit only for the MGS
-- 
1.8.3.1


* [lustre-devel] [PATCH 509/622] lustre: llite: limit max xattr size by kernel value
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (507 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 508/622] lustre: sec: reserve flags for client side encryption James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 510/622] lustre: ptlrpc: return proper error code James Simmons
                   ` (113 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Limit the maximum xattr size returned to userspace from the MDS to
what the currently-running kernel supports (typically
XATTR_SIZE_MAX = 65536 bytes). While a Lustre backing filesystem may
store xattrs larger than this, users would not be able to access such
larger xattrs via the kernel xattr interfaces.

This fixes interop problems when newer clients and tests are running
against older servers:

  sanity.sh: line 8946: /usr/bin/setfattr: Argument list too long

Skip subtests for new features in 2.13 so 2.12 interop testing passes.

Fix test-framework.sh::large_xattr_enabled() to return true for ZFS.
Fix test-framework.sh::max_xattr_size() to return the actual value
returned from the MDS rather than computing it locally.

Fixes: 4c9f501e6d5 ("lustre: osd: Set max ea size to XATTR_SIZE_MAX")
WC-bug-id: https://jira.whamcloud.com/browse/LU-12784
Lustre-commit: 84097792f56c ("LU-12784 llite: limit max xattr size by kernel value")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/36240
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/lproc_llite.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/llite/lproc_llite.c b/fs/lustre/llite/lproc_llite.c
index c2ec3fb..439c096 100644
--- a/fs/lustre/llite/lproc_llite.c
+++ b/fs/lustre/llite/lproc_llite.c
@@ -925,7 +925,9 @@ static ssize_t max_easize_show(struct kobject *kobj,
 	if (rc)
 		return rc;
 
-	return sprintf(buf, "%u\n", ealen);
+	/* Limit xattr size returned to userspace based on kernel maximum */
+	return snprintf(buf, PAGE_SIZE, "%u\n",
+			ealen > XATTR_SIZE_MAX ? XATTR_SIZE_MAX : ealen);
 }
 LUSTRE_RO_ATTR(max_easize);
 
@@ -954,7 +956,9 @@ static ssize_t default_easize_show(struct kobject *kobj,
 	if (rc)
 		return rc;
 
-	return sprintf(buf, "%u\n", ealen);
+	/* Limit xattr size returned to userspace based on kernel maximum */
+	return snprintf(buf, PAGE_SIZE, "%u\n",
+			ealen > XATTR_SIZE_MAX ? XATTR_SIZE_MAX : ealen);
 }
 
 /**
-- 
1.8.3.1


* [lustre-devel] [PATCH 510/622] lustre: ptlrpc: return proper error code
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (508 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 509/622] lustre: llite: limit max xattr size by kernel value James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 511/622] lnet: fix peer_ni selection James Simmons
                   ` (112 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Alex Zhuravlev <bzzz@whamcloud.com>

Return a proper error code from ptlrpc_disconnect_prep_req() using
ERR_PTR(), as the callers expect.

Fixes: 4b102da53ad ("lustre: ptlrpc: idle connections can disconnect")
WC-bug-id: https://jira.whamcloud.com/browse/LU-12799
Lustre-commit: 9e2620d75cce ("LU-12799 ptlrpc: return proper error code")
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/36282
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/import.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index c4a732d..76a40be 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -1571,7 +1571,7 @@ static struct ptlrpc_request *ptlrpc_disconnect_prep_req(struct obd_import *imp)
 	req = ptlrpc_request_alloc_pack(imp, &RQF_MDS_DISCONNECT,
 					LUSTRE_OBD_VERSION, rq_opc);
 	if (!req)
-		return NULL;
+		return ERR_PTR(-ENOMEM);
 
 	/* We are disconnecting, do not retry a failed DISCONNECT rpc if
 	 * it fails.  We can get through the above with a down server
-- 
1.8.3.1


* [lustre-devel] [PATCH 511/622] lnet: fix peer_ni selection
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (509 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 510/622] lustre: ptlrpc: return proper error code James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 512/622] lustre: pcc: Auto attach for PCC during IO James Simmons
                   ` (111 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

When selecting a peer_ni we must use the same peer NID for all
the messages which belong to the same RPC. This is necessary to
ensure the RDMA is done over the optimal interface.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12893
Lustre-commit: 94ee26738884 ("LU-12893 lnet: fix peer_ni selection")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/36552
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-move.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 6da0be4..b8278ad 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -1710,8 +1710,11 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
  * Local Destination
  * MR Peer
  *
- * Run the selection algorithm on the peer NIs unless we're sending
- * a response, in this case just send to the destination
+ * Don't run the selection algorithm on the peer NIs. By specifying the
+ * local NID, we're also saying that we should always use the destination NID
+ * provided. This handles the case where we should be using the same
+ * destination NID for all the messages which belong to the same RPC
+ * request.
  */
 static int
 lnet_handle_spec_local_mr_dst(struct lnet_send_data *sd)
@@ -1724,16 +1727,6 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 		return -EINVAL;
 	}
 
-	/* only run the selection algorithm to pick the peer_ni if we're
-	 * sending a GET or a PUT. Responses are sent to the same
-	 * destination NID provided.
-	 */
-	if (!(sd->sd_send_case & SND_RESP)) {
-		sd->sd_best_lpni =
-		  lnet_find_best_lpni_on_net(sd, sd->sd_peer,
-					     sd->sd_best_ni->ni_net->net_id);
-	}
-
 	if (sd->sd_best_lpni &&
 	    sd->sd_best_lpni->lpni_nid == the_lnet.ln_loni->ni_nid)
 		return lnet_handle_lo_send(sd);
-- 
1.8.3.1


* [lustre-devel] [PATCH 512/622] lustre: pcc: Auto attach for PCC during IO
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (510 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 511/622] lnet: fix peer_ni selection James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 513/622] lustre: lmv: alloc dir stripes by QoS James Simmons
                   ` (110 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Qian Yingjin <qian@ddn.com>

PCC uses the layout lock to protect the cache validity. Currently
PCC only supports auto attach at the next open. However, the
layout lock can be revoked at any time by LRU/manual lock
shrinking or a lock conflict callback.

For example, the layout lock can be revoked while performing I/O
after the file has been opened. When that happens, the cached file
is detached involuntarily, and I/O originally directed at PCC is
redirected to the OSTs once the data has been restored into the
OSTs' objects. The cost of this unwanted behavior can be high.

To avoid this problem, this patch implements auto attach for PCC
even during I/O (not only at open time).

For debugging purposes, there are now three auto attach options:
- open_attach: auto attach at the next open;
- io_attach: auto attach during IO;
- stat_attach: auto attach at stat().

The reason for adding the stat_attach option is that checking the
PCC state via "lfs pcc state" will not only open the file but also
call stat() on it; to verify auto attach during IO, both
open_attach and stat_attach need to be disabled.

And all these auto attach options are enabled by default.

This patch also fixes a bug in auto caching at create time:
in current Lustre, a truncate operation revokes the LOOKUP ibits
lock and invalidates the file's dentry cache. The following open
with the O_CREAT flag then calls into ->atomic_open, where the
file is wrongly treated as a newly created file and auto caching
is attempted. So once the client knows the open was not a
DISP_OPEN_CREATE, it should clean up the already-created PCC copy.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12526
Lustre-commit: a120bb135257 ("LU-12526 pcc: Auto attach for PCC during IO")
Signed-off-by: Qian Yingjin <qian@ddn.com>
Reviewed-on: https://review.whamcloud.com/36005
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/namei.c                 |  43 +++++----
 fs/lustre/llite/pcc.c                   | 157 ++++++++++++++++++++++++++------
 fs/lustre/llite/pcc.h                   |  45 +++++++--
 include/uapi/linux/lustre/lustre_idl.h  |   1 +
 include/uapi/linux/lustre/lustre_user.h |   8 ++
 5 files changed, 199 insertions(+), 55 deletions(-)

diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index ce72910..f4ca16e 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -696,11 +696,6 @@ static int ll_lookup_it_finish(struct ptlrpc_request *request,
 	return rc;
 }
 
-struct pcc_create_attach {
-	struct pcc_dataset *pca_dataset;
-	struct dentry *pca_dentry;
-};
-
 static struct dentry *ll_lookup_it(struct inode *parent, struct dentry *dentry,
 				   struct lookup_intent *it, void **secctx,
 				   u32 *secctxlen,
@@ -950,8 +945,7 @@ static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
 	u32 secctxlen = 0;
 	struct dentry *de;
 	struct ll_sb_info *sbi;
-	struct pcc_create_attach pca = {NULL, NULL};
-	struct pcc_dataset *dataset = NULL;
+	struct pcc_create_attach pca = { NULL, NULL };
 	int rc = 0;
 
 	CDEBUG(D_VFSTRACE,
@@ -988,6 +982,7 @@ static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
 		if (!filename_is_volatile(dentry->d_name.name,
 					  dentry->d_name.len, NULL)) {
 			struct pcc_matcher item;
+			struct pcc_dataset *dataset;
 
 			item.pm_uid = from_kuid(&init_user_ns, current_uid());
 			item.pm_gid = from_kgid(&init_user_ns, current_gid());
@@ -1020,18 +1015,30 @@ static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
 					dput(de);
 				goto out_release;
 			}
-			if (dataset && dentry->d_inode) {
-				rc = pcc_inode_create_fini(dataset,
-							   dentry->d_inode,
-							   pca.pca_dentry);
-				if (rc) {
-					if (de)
-						dput(de);
-					goto out_release;
-				}
+
+			rc = pcc_inode_create_fini(dentry->d_inode, &pca);
+			if (rc) {
+				if (de)
+					dput(de);
+				goto out_release;
 			}
 
 			file->f_mode |= FMODE_CREATED;
+		} else {
+			/* Open the file with O_CREAT, but the file already
+			 * existed on MDT. This may happened in the case that
+			 * the LOOKUP ibits lock is revoked and the
+			 * corresponding dentry cache is deleted.
+			 * i.e. In the current Lustre, the truncate operation
+			 * will revoke the LOOKUP ibits lock, and the file
+			 * dentry cache will be invalidated. The following open
+			 * with O_CREAT flag will call into ->atomic_open, the
+			 * file was wrongly thought to be newly created and we
+			 * try to auto cache the file. So after the client knows
+			 * it is not a DISP_OPEN_CREATE, it should clean up the
+			 * already created PCC copy.
+			 */
+			pcc_create_attach_cleanup(dir->i_sb, &pca);
 		}
 
 		if (d_really_is_positive(dentry) &&
@@ -1055,11 +1062,11 @@ static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
 		} else {
 			rc = finish_no_open(file, de);
 		}
+	} else {
+		pcc_create_attach_cleanup(dir->i_sb, &pca);
 	}
 
 out_release:
-	if (dataset)
-		pcc_dataset_put(dataset);
 	ll_intent_release(it);
 	kfree(it);
 
diff --git a/fs/lustre/llite/pcc.c b/fs/lustre/llite/pcc.c
index c8c2442..b926f87 100644
--- a/fs/lustre/llite/pcc.c
+++ b/fs/lustre/llite/pcc.c
@@ -472,12 +472,30 @@ static int pcc_id_parse(struct pcc_cmd *cmd, const char *id)
 		if (id <= 0)
 			return -EINVAL;
 		cmd->u.pccc_add.pccc_roid = id;
+	} else if (strcmp(key, "auto_attach") == 0) {
+		rc = kstrtoul(val, 10, &id);
+		if (rc)
+			return rc;
+		if (id == 0)
+			cmd->u.pccc_add.pccc_flags &= ~PCC_DATASET_AUTO_ATTACH;
 	} else if (strcmp(key, "open_attach") == 0) {
 		rc = kstrtoul(val, 10, &id);
 		if (rc)
 			return rc;
-		if (id > 0)
-			cmd->u.pccc_add.pccc_flags |= PCC_DATASET_OPEN_ATTACH;
+		if (id == 0)
+			cmd->u.pccc_add.pccc_flags &= ~PCC_DATASET_OPEN_ATTACH;
+	} else if (strcmp(key, "io_attach") == 0) {
+		rc = kstrtoul(val, 10, &id);
+		if (rc)
+			return rc;
+		if (id == 0)
+			cmd->u.pccc_add.pccc_flags &= ~PCC_DATASET_IO_ATTACH;
+	} else if (strcmp(key, "stat_attach") == 0) {
+		rc = kstrtoul(val, 10, &id);
+		if (rc)
+			return rc;
+		if (id == 0)
+			cmd->u.pccc_add.pccc_flags &= ~PCC_DATASET_STAT_ATTACH;
 	} else if (strcmp(key, "rwpcc") == 0) {
 		rc = kstrtoul(val, 10, &id);
 		if (rc)
@@ -504,6 +522,18 @@ static int pcc_id_parse(struct pcc_cmd *cmd, const char *id)
 	char *token;
 	int rc;
 
+	switch (cmd->pccc_cmd) {
+	case PCC_ADD_DATASET:
+		/* Enable auto attach by default */
+		cmd->u.pccc_add.pccc_flags |= PCC_DATASET_AUTO_ATTACH;
+		break;
+	case PCC_DEL_DATASET:
+	case PCC_CLEAR_ALL:
+		break;
+	default:
+		return -EINVAL;
+	}
+
 	val = buffer;
 	while (val && strlen(val) != 0) {
 		token = strsep(&val, " ");
@@ -1002,7 +1032,6 @@ static void pcc_inode_init(struct pcc_inode *pcci, struct ll_inode_info *lli)
 {
 	pcci->pcci_lli = lli;
 	lli->lli_pcc_inode = pcci;
-	lli->lli_pcc_state = PCC_STATE_FL_NONE;
 	atomic_set(&pcci->pcci_refcount, 0);
 	pcci->pcci_type = LU_PCC_NONE;
 	pcci->pcci_layout_gen = CL_LAYOUT_GEN_NONE;
@@ -1072,9 +1101,9 @@ void pcc_file_init(struct pcc_file *pccf)
 	pccf->pccf_type = LU_PCC_NONE;
 }
 
-static inline bool pcc_open_attach_enabled(struct pcc_dataset *dataset)
+static inline bool pcc_auto_attach_enabled(struct pcc_dataset *dataset)
 {
-	return dataset->pccd_flags & PCC_DATASET_OPEN_ATTACH;
+	return dataset->pccd_flags & PCC_DATASET_AUTO_ATTACH;
 }
 
 static const char pcc_xattr_layout[] = XATTR_USER_PREFIX "PCC.layout";
@@ -1085,7 +1114,7 @@ static int pcc_layout_xattr_set(struct pcc_inode *pcci, u32 gen)
 	struct ll_inode_info *lli = pcci->pcci_lli;
 	int rc;
 
-	if (!(lli->lli_pcc_state & PCC_STATE_FL_OPEN_ATTACH))
+	if (!(lli->lli_pcc_state & PCC_STATE_FL_AUTO_ATTACH))
 		return 0;
 
 	rc = __vfs_setxattr(pcc_dentry, pcc_dentry->d_inode, pcc_xattr_layout,
@@ -1137,6 +1166,8 @@ static void pcc_inode_attach_init(struct pcc_dataset *dataset,
 				  struct dentry *dentry,
 				  enum lu_pcc_type type)
 {
+	struct ll_inode_info *lli = pcci->pcci_lli;
+
 	pcci->pcci_path.mnt = mntget(dataset->pccd_path.mnt);
 	pcci->pcci_path.dentry = dentry;
 	LASSERT(atomic_read(&pcci->pcci_refcount) == 0);
@@ -1144,11 +1175,12 @@ static void pcc_inode_attach_init(struct pcc_dataset *dataset,
 	pcci->pcci_type = type;
 	pcci->pcci_attr_valid = false;
 
-	if (pcc_open_attach_enabled(dataset)) {
-		struct ll_inode_info *lli = pcci->pcci_lli;
-
+	if (dataset->pccd_flags & PCC_DATASET_OPEN_ATTACH)
 		lli->lli_pcc_state |= PCC_STATE_FL_OPEN_ATTACH;
-	}
+	if (dataset->pccd_flags & PCC_DATASET_IO_ATTACH)
+		lli->lli_pcc_state |= PCC_STATE_FL_IO_ATTACH;
+	if (dataset->pccd_flags & PCC_DATASET_STAT_ATTACH)
+		lli->lli_pcc_state |= PCC_STATE_FL_STAT_ATTACH;
 }
 
 static inline void pcc_layout_gen_set(struct pcc_inode *pcci,
@@ -1252,7 +1284,7 @@ static int pcc_try_datasets_attach(struct inode *inode, u32 gen,
 	down_read(&super->pccs_rw_sem);
 	list_for_each_entry_safe(dataset, tmp,
 				 &super->pccs_datasets, pccd_linkage) {
-		if (!pcc_open_attach_enabled(dataset))
+		if (!pcc_auto_attach_enabled(dataset))
 			continue;
 		rc = pcc_try_dataset_attach(inode, gen, type, dataset, cached);
 		if (rc < 0 || (!rc && *cached))
@@ -1263,13 +1295,15 @@ static int pcc_try_datasets_attach(struct inode *inode, u32 gen,
 	return rc;
 }
 
-static int pcc_try_open_attach(struct inode *inode, bool *cached)
+static int pcc_try_auto_attach(struct inode *inode, bool *cached, bool is_open)
 {
 	struct pcc_super *super = &ll_i2sbi(inode)->ll_pcc_super;
 	struct cl_layout clt = {
 		.cl_layout_gen = 0,
 		.cl_is_released = false,
 	};
+	struct ll_inode_info *lli = ll_i2info(inode);
+	u32 gen;
 	int rc;
 
 	/*
@@ -1283,13 +1317,25 @@ static int pcc_try_open_attach(struct inode *inode, bool *cached)
 	 * obtain valid layout lock from MDT (i.e. the file is being
 	 * HSM restoring).
 	 */
-	if (ll_layout_version_get(ll_i2info(inode)) == CL_LAYOUT_GEN_NONE)
-		return 0;
+	if (is_open) {
+		if (ll_layout_version_get(lli) == CL_LAYOUT_GEN_NONE)
+			return 0;
+	} else {
+		rc = ll_layout_refresh(inode, &gen);
+		if (rc)
+			return rc;
+	}
 
 	rc = pcc_get_layout_info(inode, &clt);
 	if (rc)
 		return rc;
 
+	if (!is_open && gen != clt.cl_layout_gen) {
+		CDEBUG(D_CACHE, DFID" layout changed from %d to %d.\n",
+		       PFID(ll_inode2fid(inode)), gen, clt.cl_layout_gen);
+		return -EINVAL;
+	}
+
 	if (clt.cl_is_released)
 		rc = pcc_try_datasets_attach(inode, clt.cl_layout_gen,
 					     LU_PCC_READWRITE, cached);
@@ -1319,7 +1365,9 @@ int pcc_file_open(struct inode *inode, struct file *file)
 		goto out_unlock;
 
 	if (!pcci || !pcc_inode_has_layout(pcci)) {
-		rc = pcc_try_open_attach(inode, &cached);
+		if (lli->lli_pcc_state & PCC_STATE_FL_OPEN_ATTACH)
+			rc = pcc_try_auto_attach(inode, &cached, true);
+
 		if (rc < 0 || !cached)
 			goto out_unlock;
 
@@ -1379,8 +1427,9 @@ void pcc_file_release(struct inode *inode, struct file *file)
 	pcc_inode_unlock(inode);
 }
 
-static void pcc_io_init(struct inode *inode, bool *cached)
+static void pcc_io_init(struct inode *inode, enum pcc_io_type iot, bool *cached)
 {
+	struct ll_inode_info *lli = ll_i2info(inode);
 	struct pcc_inode *pcci;
 
 	pcc_inode_lock(inode);
@@ -1391,6 +1440,17 @@ static void pcc_io_init(struct inode *inode, bool *cached)
 		*cached = true;
 	} else {
 		*cached = false;
+		if ((lli->lli_pcc_state & PCC_STATE_FL_IO_ATTACH &&
+		     iot != PIT_GETATTR) ||
+		    (iot == PIT_GETATTR &&
+		     lli->lli_pcc_state & PCC_STATE_FL_STAT_ATTACH)) {
+			(void) pcc_try_auto_attach(inode, cached, false);
+			if (*cached) {
+				pcci = ll_i2pcci(inode);
+				LASSERT(atomic_read(&pcci->pcci_refcount) > 0);
+				atomic_inc(&pcci->pcci_active_ios);
+			}
+		}
 	}
 	pcc_inode_unlock(inode);
 }
@@ -1418,7 +1478,7 @@ ssize_t pcc_file_read_iter(struct kiocb *iocb,
 		return 0;
 	}
 
-	pcc_io_init(inode, cached);
+	pcc_io_init(inode, PIT_READ, cached);
 	if (!*cached)
 		return 0;
 
@@ -1453,7 +1513,7 @@ ssize_t pcc_file_write_iter(struct kiocb *iocb,
 		return -EAGAIN;
 	}
 
-	pcc_io_init(inode, cached);
+	pcc_io_init(inode, PIT_WRITE, cached);
 	if (!*cached)
 		return 0;
 
@@ -1489,7 +1549,7 @@ int pcc_inode_setattr(struct inode *inode, struct iattr *attr,
 		return 0;
 	}
 
-	pcc_io_init(inode, cached);
+	pcc_io_init(inode, PIT_GETATTR, cached);
 	if (!*cached)
 		return 0;
 
@@ -1523,7 +1583,7 @@ int pcc_inode_getattr(struct inode *inode, bool *cached)
 		return 0;
 	}
 
-	pcc_io_init(inode, cached);
+	pcc_io_init(inode, PIT_SETATTR, cached);
 	if (!*cached)
 		return 0;
 
@@ -1585,7 +1645,7 @@ ssize_t pcc_file_splice_read(struct file *in_file, loff_t *ppos,
 	if (!file_inode(pcc_file)->i_fop->splice_read)
 		return -ENOTSUPP;
 
-	pcc_io_init(inode, cached);
+	pcc_io_init(inode, PIT_SPLICE_READ, cached);
 	if (!*cached)
 		return 0;
 
@@ -1610,7 +1670,7 @@ int pcc_fsync(struct file *file, loff_t start, loff_t end,
 		return 0;
 	}
 
-	pcc_io_init(inode, cached);
+	pcc_io_init(inode, PIT_FSYNC, cached);
 	if (!*cached)
 		return 0;
 
@@ -1716,7 +1776,7 @@ int pcc_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 		CDEBUG(D_MMAP,
 		       "%s: PCC backend fs not support ->page_mkwrite()\n",
 		       ll_i2sbi(inode)->ll_fsname);
-		pcc_ioctl_detach(inode, PCC_DETACH_OPT_NONE);
+		pcc_ioctl_detach(inode, PCC_DETACH_OPT_UNCACHE);
 		up_read(&mm->mmap_sem);
 		*cached = true;
 		return VM_FAULT_RETRY | VM_FAULT_NOPAGE;
@@ -1724,7 +1784,7 @@ int pcc_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 	/* Pause to allow for a race with concurrent detach */
 	OBD_FAIL_TIMEOUT(OBD_FAIL_LLITE_PCC_MKWRITE_PAUSE, cfs_fail_val);
 
-	pcc_io_init(inode, cached);
+	pcc_io_init(inode, PIT_PAGE_MKWRITE, cached);
 	if (!*cached) {
 		/* This happens when the file is detached from PCC after got
 		 * the fault page via ->fault() on the inode of the PCC copy.
@@ -1757,7 +1817,7 @@ int pcc_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 	 */
 	if (OBD_FAIL_CHECK(OBD_FAIL_LLITE_PCC_DETACH_MKWRITE)) {
 		pcc_io_fini(inode);
-		pcc_ioctl_detach(inode, PCC_DETACH_OPT_NONE);
+		pcc_ioctl_detach(inode, PCC_DETACH_OPT_UNCACHE);
 		up_read(&mm->mmap_sem);
 		return VM_FAULT_RETRY | VM_FAULT_NOPAGE;
 	}
@@ -1785,7 +1845,7 @@ int pcc_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		return 0;
 	}
 
-	pcc_io_init(inode, cached);
+	pcc_io_init(inode, PIT_FAULT, cached);
 	if (!*cached)
 		return 0;
 
@@ -1993,13 +2053,21 @@ int pcc_inode_create(struct super_block *sb, struct pcc_dataset *dataset,
 	return rc;
 }
 
-int pcc_inode_create_fini(struct pcc_dataset *dataset, struct inode *inode,
-			  struct dentry *pcc_dentry)
+int pcc_inode_create_fini(struct inode *inode, struct pcc_create_attach *pca)
 {
+	struct dentry *pcc_dentry = pca->pca_dentry;
 	const struct cred *old_cred;
 	struct pcc_inode *pcci;
 	int rc = 0;
 
+	if (!pca->pca_dataset)
+		return 0;
+
+	if (!inode)
+		goto out_dataset_put;
+
+	LASSERT(pcc_dentry);
+
 	old_cred = override_creds(pcc_super_cred(inode->i_sb));
 	pcc_inode_lock(inode);
 	LASSERT(!ll_i2pcci(inode));
@@ -2015,7 +2083,8 @@ int pcc_inode_create_fini(struct pcc_dataset *dataset, struct inode *inode,
 		goto out_put;
 
 	pcc_inode_init(pcci, ll_i2info(inode));
-	pcc_inode_attach_init(dataset, pcci, pcc_dentry, LU_PCC_READWRITE);
+	pcc_inode_attach_init(pca->pca_dataset, pcci, pcc_dentry,
+			      LU_PCC_READWRITE);
 
 	rc = pcc_layout_xattr_set(pcci, 0);
 	if (rc) {
@@ -2038,9 +2107,36 @@ int pcc_inode_create_fini(struct pcc_dataset *dataset, struct inode *inode,
 	pcc_inode_unlock(inode);
 	revert_creds(old_cred);
 
+out_dataset_put:
+	pcc_dataset_put(pca->pca_dataset);
 	return rc;
 }
 
+void pcc_create_attach_cleanup(struct super_block *sb,
+			       struct pcc_create_attach *pca)
+{
+	if (!pca->pca_dataset)
+		return;
+
+	if (pca->pca_dentry) {
+		const struct cred *old_cred;
+		int rc;
+
+		old_cred = override_creds(pcc_super_cred(sb));
+		rc = vfs_unlink(pca->pca_dentry->d_parent->d_inode,
+				pca->pca_dentry, NULL);
+		if (rc)
+			CWARN("failed to unlink PCC file %.*s, rc = %d\n",
+			      pca->pca_dentry->d_name.len,
+			      pca->pca_dentry->d_name.name, rc);
+		/* ignore the unlink failure */
+		revert_creds(old_cred);
+		dput(pca->pca_dentry);
+	}
+
+	pcc_dataset_put(pca->pca_dataset);
+}
+
 static int pcc_filp_write(struct file *filp, const void *buf, ssize_t count,
 			  loff_t *offset)
 {
@@ -2202,7 +2298,6 @@ int pcc_readwrite_attach_fini(struct file *file, struct inode *inode,
 	old_cred = override_creds(pcc_super_cred(inode->i_sb));
 	pcc_inode_lock(inode);
 	pcci = ll_i2pcci(inode);
-	lli->lli_pcc_state &= ~PCC_STATE_FL_ATTACHING;
 	if (rc || lease_broken) {
 		if (attached && pcci)
 			pcc_inode_put(pcci);
@@ -2221,6 +2316,7 @@ int pcc_readwrite_attach_fini(struct file *file, struct inode *inode,
 	if (rc)
 		goto out_put;
 
+	LASSERT(lli->lli_pcc_state & PCC_STATE_FL_ATTACHING);
 	rc = ll_layout_refresh(inode, &gen2);
 	if (!rc) {
 		if (gen2 == gen) {
@@ -2240,6 +2336,7 @@ int pcc_readwrite_attach_fini(struct file *file, struct inode *inode,
 		pcc_inode_put(pcci);
 	}
 out_unlock:
+	lli->lli_pcc_state &= ~PCC_STATE_FL_ATTACHING;
 	pcc_inode_unlock(inode);
 	revert_creds(old_cred);
 	return rc;
diff --git a/fs/lustre/llite/pcc.h b/fs/lustre/llite/pcc.h
index c00cb0b..a221ef6 100644
--- a/fs/lustre/llite/pcc.h
+++ b/fs/lustre/llite/pcc.h
@@ -93,12 +93,19 @@ struct pcc_matcher {
 
 enum pcc_dataset_flags {
 	PCC_DATASET_NONE	= 0x0,
-	/* Try auto attach at open, disabled by default */
-	PCC_DATASET_OPEN_ATTACH	= 0x1,
+	/* Try auto attach at open, enabled by default */
+	PCC_DATASET_OPEN_ATTACH	= 0x01,
+	/* Try auto attach during IO when layout refresh, enabled by default */
+	PCC_DATASET_IO_ATTACH	= 0x02,
+	/* Try auto attach at stat */
+	PCC_DATASET_STAT_ATTACH	= 0x04,
+	PCC_DATASET_AUTO_ATTACH	= PCC_DATASET_OPEN_ATTACH |
+				  PCC_DATASET_IO_ATTACH |
+				  PCC_DATASET_STAT_ATTACH,
 	/* PCC backend is only used for RW-PCC */
-	PCC_DATASET_RWPCC	= 0x2,
+	PCC_DATASET_RWPCC	= 0x08,
 	/* PCC backend is only used for RO-PCC */
-	PCC_DATASET_ROPCC	= 0x4,
+	PCC_DATASET_ROPCC	= 0x10,
 	/* PCC backend provides caching services for both RW-PCC and RO-PCC */
 	PCC_DATASET_PCC_ALL	= PCC_DATASET_RWPCC | PCC_DATASET_ROPCC,
 };
@@ -154,6 +161,25 @@ struct pcc_file {
 	enum lu_pcc_type	 pccf_type;
 };
 
+enum pcc_io_type {
+	/* read system call */
+	PIT_READ = 1,
+	/* write system call */
+	PIT_WRITE,
+	/* truncate, utime system calls */
+	PIT_SETATTR,
+	/* stat system call */
+	PIT_GETATTR,
+	/* mmap write handling */
+	PIT_PAGE_MKWRITE,
+	/* page fault handling */
+	PIT_FAULT,
+	/* fsync system call handling */
+	PIT_FSYNC,
+	/* splice_read system call */
+	PIT_SPLICE_READ
+};
+
 enum pcc_cmd_type {
 	PCC_ADD_DATASET = 0,
 	PCC_DEL_DATASET,
@@ -177,6 +203,11 @@ struct pcc_cmd {
 	} u;
 };
 
+struct pcc_create_attach {
+	struct pcc_dataset *pca_dataset;
+	struct dentry *pca_dentry;
+};
+
 int pcc_super_init(struct pcc_super *super);
 void pcc_super_fini(struct pcc_super *super);
 int pcc_cmd_handle(char *buffer, unsigned long count,
@@ -212,12 +243,12 @@ int pcc_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 		     bool *cached);
 int pcc_inode_create(struct super_block *sb, struct pcc_dataset *dataset,
 		     struct lu_fid *fid, struct dentry **pcc_dentry);
-int pcc_inode_create_fini(struct pcc_dataset *dataset, struct inode *inode,
-			   struct dentry *pcc_dentry);
+int pcc_inode_create_fini(struct inode *inode, struct pcc_create_attach *pca);
+void pcc_create_attach_cleanup(struct super_block *sb,
+			       struct pcc_create_attach *pca);
 struct pcc_dataset *pcc_dataset_match_get(struct pcc_super *super,
 					  struct pcc_matcher *matcher);
 void pcc_dataset_put(struct pcc_dataset *dataset);
 void pcc_inode_free(struct inode *inode);
 void pcc_layout_invalidate(struct inode *inode);
-
 #endif /* LLITE_PCC_H */
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 4277ac6..a74d979 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -1723,6 +1723,7 @@ enum mds_op_bias {
 	MDS_CLOSE_LAYOUT_SPLIT	= 1 << 17,
 	MDS_TRUNC_KEEP_LEASE	= 1 << 18,
 	MDS_PCC_ATTACH		= 1 << 19,
+	MDS_CLOSE_UPDATE_TIMES	= 1 << 20,
 };
 
 #define MDS_CLOSE_INTENT (MDS_HSM_RELEASE | MDS_CLOSE_LAYOUT_SWAP |         \
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 06a691b..2178666 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -2180,6 +2180,14 @@ enum lu_pcc_state_flags {
 	PCC_STATE_FL_ATTACHING		= 0x02,
 	/* Allow to auto attach at open */
 	PCC_STATE_FL_OPEN_ATTACH	= 0x04,
+	/* Allow to auto attach during I/O after layout lock revocation */
+	PCC_STATE_FL_IO_ATTACH		= 0x08,
+	/* Allow to auto attach at stat */
+	PCC_STATE_FL_STAT_ATTACH	= 0x10,
+	/* Allow to auto attach at the next open or layout refresh */
+	PCC_STATE_FL_AUTO_ATTACH	= PCC_STATE_FL_OPEN_ATTACH |
+					  PCC_STATE_FL_IO_ATTACH |
+					  PCC_STATE_FL_STAT_ATTACH,
 };
 
 struct lu_pcc_state {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 513/622] lustre: lmv: alloc dir stripes by QoS
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (511 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 512/622] lustre: pcc: Auto attach for PCC during IO James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 514/622] lustre: llite: Don't clear d_fsdata in ll_release() James Simmons
                   ` (109 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

Similar to file OST object allocation, introduce directory stripe
allocation by space usage. The two don't share the same code because
of the many differences between them: a file has mirrors, PFL and
object precreation, while for a directory the first stripe is always
on the same MDT as its master object. The changes include:
* add lod_mdt_alloc_qos() to allocate stripes by space/inode usage.
* add lod_mdt_alloc_rr() to allocate stripes round-robin.
* add lod_mdt_alloc_specific() to allocate stripes in the old way.
* add sysfs support for the lmv_desc field in the LOD structure, and
  move the entries remaining in procfs to sysfs.

This patch also changes LMV QoS code:
* mkdir by QoS if the user runs 'lfs mkdir -i -1 ...', or if the
  parent directory's default LMV starting MDT index is -1.
* with the above change, the 'space' hash flag is useless, so remove
  all related code.
* previously the 'lfs mkdir -i -1' QoS code was in lfs_setdirstripe(),
  but now it's done in LMV, so remove the old code.

Update sanity tests 413a and 413b to support QoS mkdir of both plain
and striped directories.

Update the lfs-setdirstripe man page to reflect the changes.
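
The space-balanced choice can be illustrated with a small userspace
sketch. This is only an illustration, not the actual
lod_mdt_alloc_qos() logic: the struct, the names and the plain
bavail * iavail weighting are stand-ins for the real per-target and
per-server penalty calculation.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for an MDT descriptor: only the fields the
 * sketch needs.
 */
struct mdt_sim {
	unsigned int index;
	unsigned long long bavail;	/* free blocks */
	unsigned long long iavail;	/* free inodes */
	int active;
};

/* Pick the active MDT with the largest bavail * iavail weight.
 * Returns -1 when no target is usable; the caller would then fall
 * back to round-robin, as the patch does via lmv_locate_tgt_rr().
 */
static int mdt_pick_qos(const struct mdt_sim *tgts, size_t n)
{
	unsigned long long best_weight = 0;
	int best = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		unsigned long long w;

		if (!tgts[i].active)
			continue;
		w = tgts[i].bavail * tgts[i].iavail;
		if (best < 0 || w > best_weight) {
			best = (int)tgts[i].index;
			best_weight = w;
		}
	}
	return best;
}
```

As in the patch, a failed weighted pick falls back to round-robin, so
allocation still succeeds when statfs data is missing or stale.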

WC-bug-id: https://jira.whamcloud.com/browse/LU-12624
Lustre-commit: c1d0a355a6a6 ("LU-12624 lod: alloc dir stripes by QoS")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35825
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradaed.org>
---
 fs/lustre/include/lustre_lmv.h          |  12 --
 fs/lustre/lmv/lmv_intent.c              |  16 +-
 fs/lustre/lmv/lmv_internal.h            |   4 +-
 fs/lustre/lmv/lmv_obd.c                 | 279 ++++++++++++++++----------------
 fs/lustre/obdclass/lu_tgt_descs.c       |  17 +-
 fs/lustre/ptlrpc/wiretest.c             |   1 -
 include/uapi/linux/lustre/lustre_user.h |  10 +-
 7 files changed, 154 insertions(+), 185 deletions(-)

diff --git a/fs/lustre/include/lustre_lmv.h b/fs/lustre/include/lustre_lmv.h
index b33a6ed..a538559 100644
--- a/fs/lustre/include/lustre_lmv.h
+++ b/fs/lustre/include/lustre_lmv.h
@@ -55,12 +55,6 @@ struct lmv_stripe_md {
 	struct lmv_oinfo lsm_md_oinfo[0];
 };
 
-static inline bool lmv_is_known_hash_type(u32 type)
-{
-	return (type & LMV_HASH_TYPE_MASK) == LMV_HASH_TYPE_FNV_1A_64 ||
-	       (type & LMV_HASH_TYPE_MASK) == LMV_HASH_TYPE_ALL_CHARS;
-}
-
 static inline bool lmv_dir_striped(const struct lmv_stripe_md *lsm)
 {
 	return lsm && lsm->lsm_md_magic == LMV_MAGIC;
@@ -89,12 +83,6 @@ static inline bool lmv_dir_bad_hash(const struct lmv_stripe_md *lsm)
 	return !lmv_is_known_hash_type(lsm->lsm_md_hash_type);
 }
 
-/* NB, this is checking directory default LMV */
-static inline bool lmv_dir_qos_mkdir(const struct lmv_stripe_md *lsm)
-{
-	return lsm && (lsm->lsm_md_hash_type & LMV_HASH_FLAG_SPACE);
-}
-
 static inline bool
 lsm_md_eq(const struct lmv_stripe_md *lsm1, const struct lmv_stripe_md *lsm2)
 {
diff --git a/fs/lustre/lmv/lmv_intent.c b/fs/lustre/lmv/lmv_intent.c
index 542b16d..ca9bbe8 100644
--- a/fs/lustre/lmv/lmv_intent.c
+++ b/fs/lustre/lmv/lmv_intent.c
@@ -306,22 +306,10 @@ static int lmv_intent_open(struct obd_export *exp, struct md_op_data *op_data,
 				/*
 				 * open(O_CREAT | O_EXCL) needs to check
 				 * existing name, which should be done on both
-				 * old and new layout, to avoid creating new
-				 * file under old layout, check old layout on
+				 * old and new layout, check old layout on
 				 * client side.
 				 */
-				tgt = lmv_locate_tgt(lmv, op_data);
-				if (IS_ERR(tgt))
-					return PTR_ERR(tgt);
-
-				rc = md_getattr_name(tgt->ltd_exp, op_data,
-						     reqp);
-				if (!rc) {
-					ptlrpc_req_finished(*reqp);
-					*reqp = NULL;
-					return -EEXIST;
-				}
-
+				rc = lmv_migrate_existence_check(lmv, op_data);
 				if (rc != -ENOENT)
 					return rc;
 
diff --git a/fs/lustre/lmv/lmv_internal.h b/fs/lustre/lmv/lmv_internal.h
index 70d86676..e23eb37 100644
--- a/fs/lustre/lmv/lmv_internal.h
+++ b/fs/lustre/lmv/lmv_internal.h
@@ -49,7 +49,6 @@ int lmv_intent_lock(struct obd_export *exp, struct md_op_data *op_data,
 		    u64 extra_lock_flags);
 
 int lmv_fld_lookup(struct lmv_obd *lmv, const struct lu_fid *fid, u32 *mds);
-int __lmv_fid_alloc(struct lmv_obd *lmv, struct lu_fid *fid, u32 mds);
 int lmv_fid_alloc(const struct lu_env *env, struct obd_export *exp,
 		  struct lu_fid *fid, struct md_op_data *op_data);
 
@@ -217,8 +216,9 @@ static inline bool lmv_dir_retry_check_update(struct md_op_data *op_data)
 
 struct lmv_tgt_desc *lmv_locate_tgt(struct lmv_obd *lmv,
 				    struct md_op_data *op_data);
+int lmv_migrate_existence_check(struct lmv_obd *lmv,
+				struct md_op_data *op_data);
 
 /* lproc_lmv.c */
 int lmv_tunables_init(struct obd_device *obd);
-
 #endif
diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index 84be905..e92be25 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -1045,106 +1045,36 @@ static int lmv_iocontrol(unsigned int cmd, struct obd_export *exp,
 	return rc;
 }
 
-/**
- * This is _inode_ placement policy function (not name).
- */
-static u32 lmv_placement_policy(struct obd_device *obd,
-				struct md_op_data *op_data)
+int lmv_fid_alloc(const struct lu_env *env, struct obd_export *exp,
+		  struct lu_fid *fid, struct md_op_data *op_data)
 {
+	struct obd_device *obd = class_exp2obd(exp);
 	struct lmv_obd *lmv = &obd->u.lmv;
-	struct lmv_user_md *lum;
-	u32 mdt;
-
-	if (lmv->lmv_mdt_count == 1)
-		return 0;
-
-	lum = op_data->op_data;
-	/*
-	 * Choose MDT by
-	 * 1. See if the stripe offset is specified by lum.
-	 * 2. If parent has default LMV, and its hash type is "space", choose
-	 *    MDT with QoS. (see lmv_locate_tgt_qos()).
-	 * 3. Then check if default LMV stripe offset is not -1.
-	 * 4. Finally choose MDS by name hash if the parent
-	 *    is striped directory. (see lmv_locate_tgt()).
-	 *
-	 * presently explicit MDT location is not supported
-	 * for foreign dirs (as it can't be embedded into free
-	 * format LMV, like with lum_stripe_offset), so we only
-	 * rely on default stripe offset or then name hashing.
-	 */
-	if (op_data->op_cli_flags & CLI_SET_MEA && lum &&
-	    le32_to_cpu(lum->lum_magic != LMV_MAGIC_FOREIGN) &&
-	    le32_to_cpu(lum->lum_stripe_offset) != (u32)-1) {
-		mdt = le32_to_cpu(lum->lum_stripe_offset);
-	} else if (op_data->op_code == LUSTRE_OPC_MKDIR &&
-		   !lmv_dir_striped(op_data->op_mea1) &&
-		   lmv_dir_qos_mkdir(op_data->op_default_mea1)) {
-		mdt = op_data->op_mds;
-	} else if (op_data->op_code == LUSTRE_OPC_MKDIR &&
-		   op_data->op_default_mea1 &&
-		   op_data->op_default_mea1->lsm_md_master_mdt_index !=
-			(u32)-1) {
-		mdt = op_data->op_default_mea1->lsm_md_master_mdt_index;
-		op_data->op_mds = mdt;
-	} else {
-		mdt = op_data->op_mds;
-	}
-
-	return mdt;
-}
-
-int __lmv_fid_alloc(struct lmv_obd *lmv, struct lu_fid *fid, u32 mds)
-{
 	struct lmv_tgt_desc *tgt;
 	int rc;
 
-	tgt = lmv_tgt(lmv, mds);
+	LASSERT(op_data);
+	LASSERT(fid);
+
+	tgt = lmv_tgt(lmv, op_data->op_mds);
 	if (!tgt)
 		return -ENODEV;
 
+	if (!tgt->ltd_active || !tgt->ltd_exp)
+		return -ENODEV;
+
 	/*
 	 * New seq alloc and FLD setup should be atomic. Otherwise we may find
 	 * on server that seq in new allocated fid is not yet known.
 	 */
 	mutex_lock(&tgt->ltd_fid_mutex);
-
-	if (tgt->ltd_active == 0 || !tgt->ltd_exp) {
-		rc = -ENODEV;
-		goto out;
-	}
-
-	/*
-	 * Asking underlaying tgt layer to allocate new fid.
-	 */
 	rc = obd_fid_alloc(NULL, tgt->ltd_exp, fid, NULL);
+	mutex_unlock(&tgt->ltd_fid_mutex);
 	if (rc > 0) {
 		LASSERT(fid_is_sane(fid));
 		rc = 0;
 	}
 
-out:
-	mutex_unlock(&tgt->ltd_fid_mutex);
-	return rc;
-}
-
-int lmv_fid_alloc(const struct lu_env *env, struct obd_export *exp,
-		  struct lu_fid *fid, struct md_op_data *op_data)
-{
-	struct obd_device *obd = class_exp2obd(exp);
-	struct lmv_obd *lmv = &obd->u.lmv;
-	u32 mds;
-	int rc;
-
-	LASSERT(op_data);
-	LASSERT(fid);
-
-	mds = lmv_placement_policy(obd, op_data);
-
-	rc = __lmv_fid_alloc(lmv, fid, mds);
-	if (rc)
-		CERROR("Can't alloc new fid, rc %d\n", rc);
-
 	return rc;
 }
 
@@ -1624,8 +1554,7 @@ static struct lu_tgt_desc *lmv_locate_tgt_rr(struct lmv_obd *lmv, u32 *mdt)
  * which is set outside, and if dir is migrating, 'op_data->op_post_migrate'
  * indicates whether old or new layout is used to locate.
  *
- * For plain direcotry, normally it will locate MDT by FID, but if this
- * directory has default LMV, and its hash type is "space", locate MDT with QoS.
+ * For a plain directory, it just locates the MDT of op_data->op_fid1.
  *
  * @lmv:	LMV device
  * @op_data:	client MD stack parameters, name, namelen
@@ -1650,7 +1579,7 @@ struct lmv_tgt_desc *
 	 * ct_restore().
 	 */
 	if (op_data->op_bias & MDS_CREATE_VOLATILE &&
-	    (int)op_data->op_mds != -1) {
+	    op_data->op_mds != LMV_OFFSET_DEFAULT) {
 		tgt = lmv_tgt(lmv, op_data->op_mds);
 		if (!tgt)
 			return ERR_PTR(-ENODEV);
@@ -1679,30 +1608,7 @@ struct lmv_tgt_desc *
 
 		tgt = lmv_tgt(lmv, oinfo->lmo_mds);
 		if (!tgt)
-			tgt = ERR_PTR(-ENODEV);
-	} else if (op_data->op_code == LUSTRE_OPC_MKDIR &&
-		   lmv_dir_qos_mkdir(op_data->op_default_mea1) &&
-		   !lmv_dir_striped(lsm)) {
-		tgt = lmv_locate_tgt_qos(lmv, &op_data->op_mds);
-		if (tgt == ERR_PTR(-EAGAIN))
-			tgt = lmv_locate_tgt_rr(lmv, &op_data->op_mds);
-		/*
-		 * only update statfs when mkdir under dir with "space" hash,
-		 * this means the cached statfs may be stale, and current mkdir
-		 * may not follow QoS accurately, but it's not serious, and it
-		 * avoids periodic statfs when client doesn't mkdir under
-		 * "space" hashed directories.
-		 *
-		 * TODO: after MDT support QoS object allocation, also update
-		 * statfs for 'lfs mkdir -i -1 ...", currently it's done in user
-		 * space.
-		 */
-		if (!IS_ERR(tgt)) {
-			struct obd_device *obd;
-
-			obd = container_of(lmv, struct obd_device, u.lmv);
-			lmv_statfs_check_update(obd, tgt);
-		}
+			return ERR_PTR(-ENODEV);
 	} else {
 		tgt = lmv_locate_tgt_by_name(lmv, op_data->op_mea1,
 				op_data->op_name, op_data->op_namelen,
@@ -1755,6 +1661,78 @@ struct lmv_tgt_desc *
 				&op_data->op_mds, true);
 }
 
+int lmv_migrate_existence_check(struct lmv_obd *lmv, struct md_op_data *op_data)
+{
+	struct lu_tgt_desc *tgt;
+	struct ptlrpc_request *request;
+	int rc;
+
+	LASSERT(lmv_dir_migrating(op_data->op_mea1));
+
+	tgt = lmv_locate_tgt(lmv, op_data);
+	if (IS_ERR(tgt))
+		return PTR_ERR(tgt);
+
+	rc = md_getattr_name(tgt->ltd_exp, op_data, &request);
+	if (!rc) {
+		ptlrpc_req_finished(request);
+		return -EEXIST;
+	}
+
+	return rc;
+}
+
+/* mkdir by QoS in two cases:
+ * 1. 'lfs mkdir -i -1'
+ * 2. parent default LMV master_mdt_index is -1
+ *
+ * NB, mkdir by QoS only if parent is not striped, this is to avoid remote
+ * directories under striped directory.
+ */
+static inline bool lmv_op_qos_mkdir(const struct md_op_data *op_data)
+{
+	const struct lmv_stripe_md *lsm = op_data->op_default_mea1;
+	const struct lmv_user_md *lum = op_data->op_data;
+
+	if (op_data->op_code != LUSTRE_OPC_MKDIR)
+		return false;
+
+	if (lmv_dir_striped(op_data->op_mea1))
+		return false;
+
+	if (op_data->op_cli_flags & CLI_SET_MEA && lum &&
+	    (le32_to_cpu(lum->lum_magic) == LMV_USER_MAGIC ||
+	     le32_to_cpu(lum->lum_magic) == LMV_USER_MAGIC_SPECIFIC) &&
+	    le32_to_cpu(lum->lum_stripe_offset) == LMV_OFFSET_DEFAULT)
+		return true;
+
+	if (lsm && lsm->lsm_md_master_mdt_index == LMV_OFFSET_DEFAULT)
+		return true;
+
+	return false;
+}
+
+/* 'lfs mkdir -i <specific_MDT>' */
+static inline bool lmv_op_user_specific_mkdir(const struct md_op_data *op_data)
+{
+	const struct lmv_user_md *lum = op_data->op_data;
+
+	return op_data->op_code == LUSTRE_OPC_MKDIR &&
+	       op_data->op_cli_flags & CLI_SET_MEA && lum &&
+	       (le32_to_cpu(lum->lum_magic) == LMV_USER_MAGIC ||
+		le32_to_cpu(lum->lum_magic) == LMV_USER_MAGIC_SPECIFIC) &&
+	       le32_to_cpu(lum->lum_stripe_offset) != LMV_OFFSET_DEFAULT;
+}
+
+/* parent default LMV master_mdt_index is not -1. */
+static inline bool
+lmv_op_default_specific_mkdir(const struct md_op_data *op_data)
+{
+	return op_data->op_code == LUSTRE_OPC_MKDIR &&
+	       op_data->op_default_mea1 &&
+	       op_data->op_default_mea1->lsm_md_master_mdt_index !=
+			LMV_OFFSET_DEFAULT;
+}
 int lmv_create(struct obd_export *exp, struct md_op_data *op_data,
 		const void *data, size_t datalen, umode_t mode, uid_t uid,
 		gid_t gid, kernel_cap_t cap_effective, u64 rdev,
@@ -1774,20 +1752,9 @@ int lmv_create(struct obd_export *exp, struct md_op_data *op_data,
 	if (lmv_dir_migrating(op_data->op_mea1)) {
 		/*
 		 * if parent is migrating, create() needs to lookup existing
-		 * name, to avoid creating new file under old layout of
-		 * migrating directory, check old layout here.
+		 * name in both old and new layout, check old layout on client.
 		 */
-		tgt = lmv_locate_tgt(lmv, op_data);
-		if (IS_ERR(tgt))
-			return PTR_ERR(tgt);
-
-		rc = md_getattr_name(tgt->ltd_exp, op_data, request);
-		if (!rc) {
-			ptlrpc_req_finished(*request);
-			*request = NULL;
-			return -EEXIST;
-		}
-
+		rc = lmv_migrate_existence_check(lmv, op_data);
 		if (rc != -ENOENT)
 			return rc;
 
@@ -1798,28 +1765,44 @@ int lmv_create(struct obd_export *exp, struct md_op_data *op_data,
 	if (IS_ERR(tgt))
 		return PTR_ERR(tgt);
 
-	CDEBUG(D_INODE, "CREATE name '%.*s' on " DFID " -> mds #%x\n",
-	       (int)op_data->op_namelen, op_data->op_name,
-	       PFID(&op_data->op_fid1), op_data->op_mds);
-
-	rc = lmv_fid_alloc(NULL, exp, &op_data->op_fid2, op_data);
-	if (rc)
-		return rc;
-
-	if (exp_connect_flags(exp) & OBD_CONNECT_DIR_STRIPE) {
+	if (lmv_op_qos_mkdir(op_data)) {
+		tgt = lmv_locate_tgt_qos(lmv, &op_data->op_mds);
+		if (tgt == ERR_PTR(-EAGAIN))
+			tgt = lmv_locate_tgt_rr(lmv, &op_data->op_mds);
 		/*
-		 * Send the create request to the MDT where the object
-		 * will be located
+		 * only update statfs after QoS mkdir, this means the cached
+		 * statfs may be stale, and current mkdir may not follow QoS
+		 * accurately, but it's not serious, and avoids periodic statfs
+		 * when client doesn't mkdir by QoS.
 		 */
-		tgt = lmv_fid2tgt(lmv, &op_data->op_fid2);
-		if (IS_ERR(tgt))
-			return PTR_ERR(tgt);
+		if (!IS_ERR(tgt))
+			lmv_statfs_check_update(obd, tgt);
+	} else if (lmv_op_user_specific_mkdir(op_data)) {
+		struct lmv_user_md *lum = op_data->op_data;
 
-		op_data->op_mds = tgt->ltd_index;
+		op_data->op_mds = le32_to_cpu(lum->lum_stripe_offset);
+		tgt = lmv_tgt(lmv, op_data->op_mds);
+		if (!tgt)
+			return -ENODEV;
+	} else if (lmv_op_default_specific_mkdir(op_data)) {
+		op_data->op_mds =
+			op_data->op_default_mea1->lsm_md_master_mdt_index;
+		tgt = lmv_tgt(lmv, op_data->op_mds);
+		if (!tgt)
+			return -ENODEV;
 	}
 
-	CDEBUG(D_INODE, "CREATE obj " DFID " -> mds #%x\n",
-	       PFID(&op_data->op_fid1), op_data->op_mds);
+	if (IS_ERR(tgt))
+		return PTR_ERR(tgt);
+
+	rc = lmv_fid_alloc(NULL, exp, &op_data->op_fid2, op_data);
+	if (rc)
+		return rc;
+
+	CDEBUG(D_INODE, "CREATE name '%.*s' "DFID" on " DFID " -> mds #%x\n",
+		(int)op_data->op_namelen, op_data->op_name,
+		PFID(&op_data->op_fid2), PFID(&op_data->op_fid1),
+		op_data->op_mds);
 
 	op_data->op_flags |= MF_MDC_CANCEL_FID1;
 	rc = md_create(tgt->ltd_exp, op_data, data, datalen, mode, uid, gid,
@@ -2063,10 +2046,20 @@ static int lmv_migrate(struct obd_export *exp, struct md_op_data *op_data,
 	if (IS_ERR(child_tgt))
 		return PTR_ERR(child_tgt);
 
-	if (!S_ISDIR(op_data->op_mode) && tp_tgt)
-		rc = __lmv_fid_alloc(lmv, &target_fid, tp_tgt->ltd_index);
-	else
-		rc = lmv_fid_alloc(NULL, exp, &target_fid, op_data);
+	/* for directory, migrate to MDT specified by lum_stripe_offset;
+	 * otherwise migrate to the target stripe of parent, but parent
+	 * directory may have finished migration (normally current file too),
+	 * allocate FID on MDT lum_stripe_offset, and server will check
+	 * whether file was migrated already.
+	 */
+	if (S_ISDIR(op_data->op_mode) || !tp_tgt) {
+		struct lmv_user_md *lum = op_data->op_data;
+
+		op_data->op_mds = le32_to_cpu(lum->lum_stripe_offset);
+	} else  {
+		op_data->op_mds = tp_tgt->ltd_index;
+	}
+	rc = lmv_fid_alloc(NULL, exp, &target_fid, op_data);
 	if (rc)
 		return rc;
 
@@ -3071,7 +3064,7 @@ static int lmv_unpack_md_v1(struct obd_export *exp, struct lmv_stripe_md *lsm,
 		 * set default value -1, so lmv_locate_tgt() knows this stripe
 		 * target is not initialized.
 		 */
-		lsm->lsm_md_oinfo[i].lmo_mds = (u32)-1;
+		lsm->lsm_md_oinfo[i].lmo_mds = LMV_OFFSET_DEFAULT;
 		if (!fid_is_sane(&lsm->lsm_md_oinfo[i].lmo_fid))
 			continue;
 
diff --git a/fs/lustre/obdclass/lu_tgt_descs.c b/fs/lustre/obdclass/lu_tgt_descs.c
index 60c50a0..5a141ce 100644
--- a/fs/lustre/obdclass/lu_tgt_descs.c
+++ b/fs/lustre/obdclass/lu_tgt_descs.c
@@ -106,10 +106,6 @@ int lu_qos_add_tgt(struct lu_qos *qos, struct lu_tgt_desc *tgt)
 	u32 id = 0;
 	int rc = 0;
 
-	/* tgt not connected, this function will be called again later */
-	if (!exp)
-		return 0;
-
 	down_write(&qos->lq_rw_sem);
 	/*
 	 * a bit hacky approach to learn NID of corresponding connection
@@ -528,7 +524,7 @@ int ltd_qos_penalties_calc(struct lu_tgt_descs *ltd)
 		 * per-tgt penalty is
 		 * prio * bavail * iavail / (num_tgt - 1) / 2
 		 */
-		tgt->ltd_qos.ltq_penalty_per_obj = prio_wide * ba * ia;
+		tgt->ltd_qos.ltq_penalty_per_obj = prio_wide * ba * ia >> 8;
 		do_div(tgt->ltd_qos.ltq_penalty_per_obj, num_active);
 		tgt->ltd_qos.ltq_penalty_per_obj >>= 1;
 
@@ -562,8 +558,9 @@ int ltd_qos_penalties_calc(struct lu_tgt_descs *ltd)
 	list_for_each_entry(svr, &qos->lq_svr_list, lsq_svr_list) {
 		ba = svr->lsq_bavail;
 		ia = svr->lsq_iavail;
-		svr->lsq_penalty_per_obj = prio_wide * ba  * ia;
-		do_div(ba, svr->lsq_tgt_count * num_active);
+		svr->lsq_penalty_per_obj = prio_wide * ba  * ia >> 8;
+		do_div(svr->lsq_penalty_per_obj,
+		       svr->lsq_tgt_count * num_active);
 		svr->lsq_penalty_per_obj >>= 1;
 
 		age = (now - svr->lsq_used) >> 3;
@@ -656,6 +653,7 @@ int ltd_qos_update(struct lu_tgt_descs *ltd, struct lu_tgt_desc *tgt,
 		if (!tgt->ltd_active)
 			continue;
 
+		ltq = &tgt->ltd_qos;
 		if (ltq->ltq_penalty < ltq->ltq_penalty_per_obj)
 			ltq->ltq_penalty = 0;
 		else
@@ -668,9 +666,10 @@ int ltd_qos_update(struct lu_tgt_descs *ltd, struct lu_tgt_desc *tgt,
 			*total_wt += ltq->ltq_weight;
 
 		CDEBUG(D_OTHER,
-		       "recalc tgt %d usable=%d avail=%llu tgtppo=%llu tgtp=%llu svrppo=%llu svrp=%llu wt=%llu\n",
+		       "recalc tgt %d usable=%d bavail=%llu ffree=%llu tgtppo=%llu tgtp=%llu svrppo=%llu svrp=%llu wt=%llu\n",
 		       tgt->ltd_index, ltq->ltq_usable,
-		       tgt_statfs_bavail(tgt) >> 10,
+		       tgt_statfs_bavail(tgt) >> 16,
+			  tgt_statfs_iavail(tgt) >> 8,
 		       ltq->ltq_penalty_per_obj >> 10,
 		       ltq->ltq_penalty >> 10,
 		       ltq->ltq_svr->lsq_penalty_per_obj >> 10,
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index da51dc1..671878d 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1663,7 +1663,6 @@ void lustre_assert_wire_constants(void)
 	BUILD_BUG_ON(LMV_MAGIC_V1 != 0x0CD20CD0);
 	BUILD_BUG_ON(LMV_MAGIC_STRIPE != 0x0CD40CD0);
 	BUILD_BUG_ON(LMV_HASH_TYPE_MASK != 0x0000ffff);
-	BUILD_BUG_ON(LMV_HASH_FLAG_SPACE != 0x08000000);
 	BUILD_BUG_ON(LMV_HASH_FLAG_MIGRATION != 0x80000000);
 
 	/* Checks for struct obd_statfs */
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 2178666..b46f52b 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -429,6 +429,7 @@ static inline bool lov_pattern_supported_normal_comp(__u32 pattern)
 #define LOV_MAXPOOLNAME 15
 #define LOV_POOLNAMEF "%.15s"
 #define LOV_OFFSET_DEFAULT      ((__u16)-1)
+#define LMV_OFFSET_DEFAULT      ((__u32)-1)
 
 #define LOV_MIN_STRIPE_BITS	16	/* maximum PAGE_SIZE (ia64), power of 2 */
 #define LOV_MIN_STRIPE_SIZE	(1 << LOV_MIN_STRIPE_BITS)
@@ -687,10 +688,11 @@ enum lmv_hash_type {
  */
 #define LMV_HASH_TYPE_MASK		0x0000ffff
 
-/* once this is set on a plain directory default layout, newly created
- * subdirectories will be distributed on all MDTs by space usage.
- */
-#define LMV_HASH_FLAG_SPACE		0x08000000
+static inline bool lmv_is_known_hash_type(__u32 type)
+{
+	return (type & LMV_HASH_TYPE_MASK) == LMV_HASH_TYPE_FNV_1A_64 ||
+	       (type & LMV_HASH_TYPE_MASK) == LMV_HASH_TYPE_ALL_CHARS;
+}
 
 /* The striped directory has ever lost its master LMV EA, then LFSCK
  * re-generated it. This flag is used to indicate such case. It is an
-- 
1.8.3.1


* [lustre-devel] [PATCH 514/622] lustre: llite: Don't clear d_fsdata in ll_release()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (512 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 513/622] lustre: lmv: alloc dir stripes by QoS James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 515/622] lustre: llite: move agl_thread cleanup out of thread James Simmons
                   ` (108 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: NeilBrown <neilb@suse.de>

The whole point of using rcu_free() is that some code might
still be accessing the dentry (e.g. lockless lookup) and so
the dentry cannot be freed until the end of the grace
period.

As lockless lookup can access d_fsdata -- ll_dcompare calls
d_lustre_invalid() -- we also mustn't clear d_fsdata before
the end of the grace period.
We don't need to clear it at all - by the time it is freed,
the inode will no longer be accessed.
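
The ordering argument can be sketched in a single-threaded userspace
simulation. The structs and names below are simplified stand-ins, and
the "deferred" pointer stands in for call_rcu(); real RCU of course
involves concurrent readers and deferred reclamation.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the dentry and its Lustre private data. */
struct lld_sim {
	int lld_invalid;
};

struct dentry_sim {
	struct lld_sim *d_fsdata;
};

/* What a lockless ->d_compare() path does: it may look at d_fsdata at
 * any point before the RCU grace period ends.  If release had cleared
 * d_fsdata early, a reader here would wrongly see the dentry as
 * invalid (or dereference freed state).
 */
static int d_lustre_invalid_sim(const struct dentry_sim *de)
{
	return de->d_fsdata == NULL || de->d_fsdata->lld_invalid;
}

/* Fixed release order: leave d_fsdata alone; only the deferred free
 * touches it, after readers are guaranteed to have finished.
 */
static void ll_release_sim(struct dentry_sim *de, struct lld_sim **deferred)
{
	*deferred = de->d_fsdata;	/* stands in for call_rcu() */
	/* note: no "de->d_fsdata = NULL" before the grace period */
}
```

With the buggy order (clearing d_fsdata before the deferred free), the
reader check above would flip result inside the grace period.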

Fixes: 7126bc2e8d60c ("lustre: switch to use of ->d_init()")

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: James Simmons <jsimmons@infradaed.org>
---
 fs/lustre/llite/dcache.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/lustre/llite/dcache.c b/fs/lustre/llite/dcache.c
index 2dfe12a..3230d32 100644
--- a/fs/lustre/llite/dcache.c
+++ b/fs/lustre/llite/dcache.c
@@ -63,7 +63,6 @@ static void ll_release(struct dentry *de)
 		kfree(lld->lld_it);
 	}
 
-	de->d_fsdata = NULL;
 	call_rcu(&lld->lld_rcu_head, free_dentry_data);
 }
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 515/622] lustre: llite: move agl_thread cleanup out of thread.
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (513 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 514/622] lustre: llite: Don't clear d_fsdata in ll_release() James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 516/622] lustre/lnet: remove unnecessary use of msecs_to_jiffies() James Simmons
                   ` (107 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: NeilBrown <neilb@suse.de>

When we start a thread with kthread_create() and later stop
it with kthread_stop(), there is no guarantee that the thread
function runs at all.  So it is not safe to leave cleanup
to the thread.

So move the cleanup code to a separate function which
stops the thread and then cleans up.

Fixes: c044fb0f835c ("staging: lustre: remove 'ptlrpc_thread usage' for sai_agl_thread")
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: James Simmons <jsimmons@infradaed.org>
---
 fs/lustre/llite/statahead.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/fs/lustre/llite/statahead.c b/fs/lustre/llite/statahead.c
index 497aba3..1639408 100644
--- a/fs/lustre/llite/statahead.c
+++ b/fs/lustre/llite/statahead.c
@@ -915,7 +915,19 @@ static int ll_agl_thread(void *arg)
 			schedule();
 		__set_current_state(TASK_RUNNING);
 	}
+	return 0;
+}
+
+static void ll_stop_agl(struct ll_statahead_info *sai)
+{
+	struct ll_inode_info *plli = ll_i2info(sai->sai_dentry->d_inode);
+	struct ll_inode_info *clli;
 
+	CDEBUG(D_READA, "stop agl thread: sai %p pid %u\n",
+	       sai, (unsigned int)sai->sai_agl_task->pid);
+	kthread_stop(sai->sai_agl_task);
+
+	sai->sai_agl_task = NULL;
 	spin_lock(&plli->lli_agl_lock);
 	sai->sai_agl_valid = 0;
 	while ((clli = list_first_entry_or_null(&sai->sai_agls,
@@ -929,9 +941,8 @@ static int ll_agl_thread(void *arg)
 	}
 	spin_unlock(&plli->lli_agl_lock);
 	CDEBUG(D_READA, "agl thread stopped: sai %p, parent %pd\n",
-	       sai, parent);
+	       sai, sai->sai_dentry);
 	ll_sai_put(sai);
-	return 0;
 }
 
 /* start agl thread */
@@ -1134,13 +1145,9 @@ static int ll_statahead_thread(void *arg)
 		__set_current_state(TASK_RUNNING);
 	}
 out:
-	if (sai->sai_agl_task) {
-		kthread_stop(sai->sai_agl_task);
+	if (sai->sai_agl_task)
+		ll_stop_agl(sai);
 
-		CDEBUG(D_READA, "stop agl thread: sai %p pid %u\n",
-		       sai, (unsigned int)sai->sai_agl_task->pid);
-		sai->sai_agl_task = NULL;
-	}
 	/*
 	 * wait for inflight statahead RPCs to finish, and then we can free sai
 	 * safely because statahead RPC will access sai data
-- 
1.8.3.1


* [lustre-devel] [PATCH 516/622] lustre/lnet: remove unnecessary use of msecs_to_jiffies()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (514 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 515/622] lustre: llite: move agl_thread cleanup out of thread James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 517/622] lnet: net_fault: don't pass struct member to do_div() James Simmons
                   ` (106 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: NeilBrown <neilb@suse.de>

msecs_to_jiffies() is useful when you have a number of milliseconds,
but when you have a number of seconds,
   sec * HZ
is simpler than
   msecs_to_jiffies(sec * MSEC_PER_SEC)

Similarly for small fractions of a second (e.g. HZ/4).

So change all calls to msecs_to_jiffies() that reference MSEC_PER_SEC
to simple multiplications by HZ.
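
The equivalence can be checked in userspace with stand-in definitions.
HZ is configuration dependent in the kernel; 250 below is just an
example value, and msecs_to_jiffies_sim() models only the exact,
non-rounding case relevant here (multiples of a second and divisors
of HZ), not the full kernel helper.

```c
#include <assert.h>

/* Userspace stand-ins for the kernel definitions. */
#define HZ 250
#define MSEC_PER_SEC 1000L

/* The common case of the kernel helper: for multiples of a second
 * there is no rounding, so the conversion is exact and "sec * HZ"
 * gives the same answer with less code.
 */
static long msecs_to_jiffies_sim(long msec)
{
	return msec * HZ / MSEC_PER_SEC;
}
```

This is why the patch is a pure simplification: for whole seconds the
two spellings are identical, so no timeout behaviour changes.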

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: James Simmons <jsimmons@infradaed.org>
---
 fs/lustre/mgc/mgc_request.c         | 8 ++++----
 fs/lustre/obdclass/integrity.c      | 2 +-
 fs/lustre/osc/osc_request.c         | 5 ++---
 net/lnet/klnds/o2iblnd/o2iblnd_cb.c | 2 +-
 net/lnet/libcfs/linux-crypto.c      | 2 +-
 net/lnet/lnet/lib-socket.c          | 4 ++--
 6 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/lustre/mgc/mgc_request.c b/fs/lustre/mgc/mgc_request.c
index 5bfa1b7..28064fd 100644
--- a/fs/lustre/mgc/mgc_request.c
+++ b/fs/lustre/mgc/mgc_request.c
@@ -555,12 +555,12 @@ static int mgc_requeue_thread(void *data)
 		 * caused the lock revocation to finish its setup, plus some
 		 * random so everyone doesn't try to reconnect at once.
 		 */
-		to = msecs_to_jiffies(MGC_TIMEOUT_MIN_SECONDS * MSEC_PER_SEC);
-		/* rand is centi-seconds */
-		to += msecs_to_jiffies(rand * MSEC_PER_SEC / 100);
+		/* rand is centi-seconds, "to" is in centi-HZ */
+		to = MGC_TIMEOUT_MIN_SECONDS * HZ * 100;
+		to += rand * HZ;
 		wait_event_idle_timeout(rq_waitq,
 					rq_state & (RQ_STOP | RQ_PRECLEANUP),
-					to);
+					to/100);
 
 		/*
 		 * iterate & processing through the list. for each cld, process
diff --git a/fs/lustre/obdclass/integrity.c b/fs/lustre/obdclass/integrity.c
index 2d5760d..230e1a5 100644
--- a/fs/lustre/obdclass/integrity.c
+++ b/fs/lustre/obdclass/integrity.c
@@ -226,7 +226,7 @@ static void obd_t10_performance_test(const char *obd_name,
 	memset(buf, 0xAD, PAGE_SIZE);
 	kunmap(page);
 
-	for (start = jiffies, end = start + msecs_to_jiffies(MSEC_PER_SEC / 4),
+	for (start = jiffies, end = start + HZ / 4,
 	     bcount = 0; time_before(jiffies, end) && rc == 0; bcount++) {
 		rc = __obd_t10_performance_test(obd_name, cksum_type, page,
 						buf_len / PAGE_SIZE);
diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index 95e09ce..9c43756 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -901,9 +901,8 @@ static void osc_grant_work_handler(struct work_struct *data)
 		return;
 
 	if (next_shrink > ktime_get_seconds())
-		schedule_delayed_work(&work, msecs_to_jiffies(
-					(next_shrink - ktime_get_seconds()) *
-					MSEC_PER_SEC));
+		schedule_delayed_work(&work,
+				      (next_shrink - ktime_get_seconds()) * HZ);
 	else
 		schedule_work(&work.work);
 }
diff --git a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
index 1110553..fcd9db2 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -3550,7 +3550,7 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 					     kiblnd_data.kib_peer_hash_size;
 			}
 
-			deadline += msecs_to_jiffies(p * MSEC_PER_SEC);
+			deadline += p * HZ;
 			spin_lock_irqsave(lock, flags);
 		}
 
diff --git a/net/lnet/libcfs/linux-crypto.c b/net/lnet/libcfs/linux-crypto.c
index 532fab4..add4e79 100644
--- a/net/lnet/libcfs/linux-crypto.c
+++ b/net/lnet/libcfs/linux-crypto.c
@@ -346,7 +346,7 @@ static void cfs_crypto_performance_test(enum cfs_crypto_hash_alg hash_alg)
 	memset(buf, 0xAD, PAGE_SIZE);
 	kunmap(page);
 
-	for (start = jiffies, end = start + msecs_to_jiffies(MSEC_PER_SEC / 4),
+	for (start = jiffies, end = start + HZ / 4,
 	     bcount = 0; time_before(jiffies, end) && err == 0; bcount++) {
 		struct ahash_request *hdesc;
 		int i;
diff --git a/net/lnet/lnet/lib-socket.c b/net/lnet/lnet/lib-socket.c
index 046bd2d..0c65dc9 100644
--- a/net/lnet/lnet/lib-socket.c
+++ b/net/lnet/lnet/lib-socket.c
@@ -47,7 +47,7 @@
 lnet_sock_write(struct socket *sock, void *buffer, int nob, int timeout)
 {
 	int rc;
-	long jiffies_left = timeout * msecs_to_jiffies(MSEC_PER_SEC);
+	long jiffies_left = timeout * HZ;
 	unsigned long then;
 	struct timeval tv;
 	struct __kernel_sock_timeval ktv;
@@ -105,7 +105,7 @@
 lnet_sock_read(struct socket *sock, void *buffer, int nob, int timeout)
 {
 	int rc;
-	long jiffies_left = timeout * msecs_to_jiffies(MSEC_PER_SEC);
+	long jiffies_left = timeout * HZ;
 	unsigned long then;
 	struct timeval tv;
 	struct __kernel_sock_timeval ktv;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 517/622] lnet: net_fault: don't pass struct member to do_div()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (515 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 516/622] lustre/lnet: remove unnecessary use of msecs_to_jiffies() James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 518/622] lustre: obd: discard unused enum James Simmons
                   ` (105 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: NeilBrown <neilb@suse.de>

do_div() changes its first argument, so passing a struct member
is not a good idea unless we really want the struct to change,
which we don't in these cases.
So copy the value to a local variable and call do_div() on
that.
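
The hazard and the fix can be sketched in userspace. do_div() below is a
stand-in macro that only mimics the kernel semantics (divide the 64-bit
lvalue in place, evaluate to the remainder); remainder_keeping_count() is
a hypothetical helper, not a function from net_fault.c:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's do_div(): divides the 64-bit
 * lvalue n by base IN PLACE and evaluates to the remainder.  The
 * in-place update is exactly the hazard this patch avoids. */
#define do_div(n, base) ({				\
	uint32_t __rem = (uint32_t)((n) % (base));	\
	(n) = (n) / (base);				\
	__rem;						\
})

/* Mirrors the fixed pattern: copy the counter to a local before
 * calling do_div(), so the stored statistic is not clobbered. */
uint64_t remainder_keeping_count(uint64_t *fs_count, uint32_t rate)
{
	uint64_t count = *fs_count;	/* local copy, as in the patch */

	return do_div(count, rate);	/* *fs_count is left untouched */
}
```

Had do_div() been applied to *fs_count directly, the counter would silently
become the quotient, corrupting the rate statistics.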

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/net_fault.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/net/lnet/lnet/net_fault.c b/net/lnet/lnet/net_fault.c
index 9f78e43..e43b1e1 100644
--- a/net/lnet/lnet/net_fault.c
+++ b/net/lnet/lnet/net_fault.c
@@ -394,9 +394,11 @@ struct lnet_drop_rule {
 		}
 
 	} else { /* rate based drop */
-		drop = rule->dr_stat.fs_count++ == rule->dr_drop_at;
+		u64 count;
 
-		if (!do_div(rule->dr_stat.fs_count, attr->u.drop.da_rate)) {
+		drop = rule->dr_stat.fs_count++ == rule->dr_drop_at;
+		count = rule->dr_stat.fs_count;
+		if (!do_div(count, attr->u.drop.da_rate)) {
 			rule->dr_drop_at = rule->dr_stat.fs_count +
 				prandom_u32_max(attr->u.drop.da_rate);
 			CDEBUG(D_NET, "Drop Rule %s->%s: next drop: %lu\n",
@@ -563,9 +565,12 @@ struct delay_daemon_data {
 		}
 
 	} else { /* rate based delay */
+		u64 count;
+
 		delay = rule->dl_stat.fs_count++ == rule->dl_delay_at;
+		count = rule->dl_stat.fs_count;
 		/* generate the next random rate sequence */
-		if (!do_div(rule->dl_stat.fs_count, attr->u.delay.la_rate)) {
+		if (!do_div(count, attr->u.delay.la_rate)) {
 			rule->dl_delay_at = rule->dl_stat.fs_count +
 				prandom_u32_max(attr->u.delay.la_rate);
 			CDEBUG(D_NET, "Delay Rule %s->%s: next delay: %lu\n",
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 518/622] lustre: obd: discard unused enum
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (516 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 517/622] lnet: net_fault: don't pass struct member to do_div() James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 519/622] lustre: update version to 2.13.50 James Simmons
                   ` (104 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: NeilBrown <neilb@suse.de>

The values in this enum are never used, so discard it.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd.h | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index 4ba70c7..5f5a595 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -133,14 +133,6 @@ struct timeout_item {
 #define OSC_MAX_DIRTY_MB_MAX	2048	/* arbitrary, but < MAX_LONG bytes */
 #define OSC_DEFAULT_RESENDS	10
 
-/* possible values for fo_sync_lock_cancel */
-enum {
-	NEVER_SYNC_ON_CANCEL	= 0,
-	BLOCKING_SYNC_ON_CANCEL	= 1,
-	ALWAYS_SYNC_ON_CANCEL	= 2,
-	NUM_SYNC_ON_CANCEL_STATES
-};
-
 enum obd_cl_sem_lock_class {
 	OBD_CLI_SEM_NORMAL,
 	OBD_CLI_SEM_MGC,
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 519/622] lustre: update version to 2.13.50
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (517 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 518/622] lustre: obd: discard unused enum James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 520/622] lustre: llite: report latency for filesystem ops James Simmons
                   ` (103 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

With all of the missing patches from the lustre 2.13 version
merged upstream, it is time to update the upstream client's version.
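
As a quick sanity check, the OBD_OCD_VERSION() macro from lustre_ver.h packs
major.minor.patch.fix into one 32-bit code, so version comparisons are plain
integer comparisons. This sketch just replicates that macro to show the
encoding of the new 2.13.50 version:

```c
#include <assert.h>
#include <stdint.h>

/* Same packing as OBD_OCD_VERSION() in lustre_ver.h: one byte each
 * for major, minor, patch, and fix, most significant first. */
#define OBD_OCD_VERSION(major, minor, patch, fix)			\
	(((major) << 24) + ((minor) << 16) + ((patch) << 8) + (fix))

/* Returns nonzero when version a is newer than version b. */
static int version_newer(uint32_t a, uint32_t b)
{
	return a > b;	/* packed codes compare in version order */
}
```

Because each field gets its own byte, 2.13.50.0 encodes as 0x020d3200 and
orders correctly after the old 2.11.99.0 code.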

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/uapi/linux/lustre/lustre_ver.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/lustre/lustre_ver.h b/include/uapi/linux/lustre/lustre_ver.h
index 8ceb57d..0f07260 100644
--- a/include/uapi/linux/lustre/lustre_ver.h
+++ b/include/uapi/linux/lustre/lustre_ver.h
@@ -2,10 +2,10 @@
 #define _LUSTRE_VER_H_
 
 #define LUSTRE_MAJOR 2
-#define LUSTRE_MINOR 11
-#define LUSTRE_PATCH 99
+#define LUSTRE_MINOR 13
+#define LUSTRE_PATCH 50
 #define LUSTRE_FIX 0
-#define LUSTRE_VERSION_STRING "2.11.99"
+#define LUSTRE_VERSION_STRING "2.13.50"
 
 #define OBD_OCD_VERSION(major, minor, patch, fix)			\
 	(((major) << 24) + ((minor) << 16) + ((patch) << 8) + (fix))
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 520/622] lustre: llite: report latency for filesystem ops
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (518 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 519/622] lustre: update version to 2.13.50 James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 521/622] lustre: osc: don't re-enable grant shrink on reconnect James Simmons
                   ` (102 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Add the elapsed time of VFS operations to the llite stats
counter, instead of just tracking the number of operations,
to allow tracking of operation round-trip latency.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12631
Lustre-commit: ea58c4cfb0fc ("LU-12631 llite: report latency for filesystem ops")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/36078
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lprocfs_status.h |  4 +-
 fs/lustre/llite/dir.c              |  4 +-
 fs/lustre/llite/file.c             | 69 ++++++++++++++++++++++++---------
 fs/lustre/llite/llite_internal.h   |  7 ++--
 fs/lustre/llite/llite_lib.c        | 15 ++++++--
 fs/lustre/llite/llite_mmap.c       | 36 ++++++++++++------
 fs/lustre/llite/lproc_llite.c      | 78 ++++++++++++++++++++------------------
 fs/lustre/llite/namei.c            | 39 ++++++++++++++-----
 fs/lustre/llite/pcc.c              |  4 +-
 fs/lustre/llite/pcc.h              |  4 +-
 fs/lustre/llite/super25.c          |  1 -
 fs/lustre/llite/xattr.c            | 49 ++++++++++++++----------
 12 files changed, 199 insertions(+), 111 deletions(-)

diff --git a/fs/lustre/include/lprocfs_status.h b/fs/lustre/include/lprocfs_status.h
index fdc1b19..ac62560 100644
--- a/fs/lustre/include/lprocfs_status.h
+++ b/fs/lustre/include/lprocfs_status.h
@@ -138,10 +138,10 @@ enum {
 	LPROCFS_CNTR_STDDEV		= 0x0004,
 
 	/* counter data type */
-	LPROCFS_TYPE_REGS		= 0x0100,
+	LPROCFS_TYPE_REQS		= 0x0100,
 	LPROCFS_TYPE_BYTES		= 0x0200,
 	LPROCFS_TYPE_PAGES		= 0x0400,
-	LPROCFS_TYPE_CYCLE		= 0x0800,
+	LPROCFS_TYPE_USEC		= 0x0800,
 };
 
 #define LC_MIN_INIT ((~(u64)0) >> 1)
diff --git a/fs/lustre/llite/dir.c b/fs/lustre/llite/dir.c
index 4dccd24..c38862e 100644
--- a/fs/lustre/llite/dir.c
+++ b/fs/lustre/llite/dir.c
@@ -298,6 +298,7 @@ static int ll_readdir(struct file *filp, struct dir_context *ctx)
 	bool api32 = ll_need_32bit_api(sbi);
 	struct md_op_data *op_data;
 	struct lu_fid pfid = { 0 };
+	ktime_t kstart = ktime_get();
 	int rc;
 
 	CDEBUG(D_VFSTRACE,
@@ -374,7 +375,8 @@ static int ll_readdir(struct file *filp, struct dir_context *ctx)
 	ll_finish_md_op_data(op_data);
 out:
 	if (!rc)
-		ll_stats_ops_tally(sbi, LPROC_LL_READDIR, 1);
+		ll_stats_ops_tally(sbi, LPROC_LL_READDIR,
+				   ktime_us_delta(ktime_get(), kstart));
 
 	return rc;
 }
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 31d7dce..92eead1 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -383,13 +383,12 @@ int ll_file_release(struct inode *inode, struct file *file)
 	struct ll_file_data *fd;
 	struct ll_sb_info *sbi = ll_i2sbi(inode);
 	struct ll_inode_info *lli = ll_i2info(inode);
+	ktime_t kstart = ktime_get();
 	int rc;
 
 	CDEBUG(D_VFSTRACE, "VFS Op:inode=" DFID "(%p)\n",
 	       PFID(ll_inode2fid(inode)), inode);
 
-	if (!is_root_inode(inode))
-		ll_stats_ops_tally(sbi, LPROC_LL_RELEASE, 1);
 	fd = LUSTRE_FPRIVATE(file);
 	LASSERT(fd);
 
@@ -402,7 +401,8 @@ int ll_file_release(struct inode *inode, struct file *file)
 	if (is_root_inode(inode)) {
 		LUSTRE_FPRIVATE(file) = NULL;
 		ll_file_data_put(fd);
-		return 0;
+		rc = 0;
+		goto out;
 	}
 
 	pcc_file_release(inode, file);
@@ -418,6 +418,10 @@ int ll_file_release(struct inode *inode, struct file *file)
 	if (CFS_FAIL_TIMEOUT_MS(OBD_FAIL_PTLRPC_DUMP_LOG, cfs_fail_val))
 		libcfs_debug_dumplog();
 
+out:
+	if (!rc && inode->i_sb->s_root != file_dentry(file))
+		ll_stats_ops_tally(sbi, LPROC_LL_RELEASE,
+				   ktime_us_delta(ktime_get(), kstart));
 	return rc;
 }
 
@@ -699,6 +703,7 @@ int ll_file_open(struct inode *inode, struct file *file)
 	struct obd_client_handle **och_p = NULL;
 	u64 *och_usecount = NULL;
 	struct ll_file_data *fd;
+	ktime_t kstart = ktime_get();
 	int rc = 0;
 
 	CDEBUG(D_VFSTRACE, "VFS Op:inode=" DFID "(%p), flags %o\n",
@@ -896,7 +901,8 @@ int ll_file_open(struct inode *inode, struct file *file)
 		if (fd)
 			ll_file_data_put(fd);
 	} else {
-		ll_stats_ops_tally(ll_i2sbi(inode), LPROC_LL_OPEN, 1);
+		ll_stats_ops_tally(ll_i2sbi(inode), LPROC_LL_OPEN,
+				   ktime_us_delta(ktime_get(), kstart));
 	}
 
 out_nofiledata:
@@ -1676,6 +1682,7 @@ static ssize_t ll_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	ssize_t result;
 	u16 refcheck;
 	ssize_t rc2;
+	ktime_t kstart = ktime_get();
 	bool cached;
 
 	if (!iov_iter_count(to))
@@ -1694,7 +1701,7 @@ static ssize_t ll_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	 */
 	result = pcc_file_read_iter(iocb, to, &cached);
 	if (cached)
-		return result;
+		goto out;
 
 	ll_ras_enter(file);
 
@@ -1719,10 +1726,13 @@ static ssize_t ll_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 
 	cl_env_put(env, &refcheck);
 out:
-	if (result > 0)
+	if (result > 0) {
 		ll_rw_stats_tally(ll_i2sbi(file_inode(file)), current->pid,
 				  LUSTRE_FPRIVATE(file), iocb->ki_pos, result,
 				  READ);
+		ll_stats_ops_tally(ll_i2sbi(file_inode(file)), LPROC_LL_READ,
+				   ktime_us_delta(ktime_get(), kstart));
+	}
 
 	return result;
 }
@@ -1795,6 +1805,7 @@ static ssize_t ll_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	struct file *file = iocb->ki_filp;
 	u16 refcheck;
 	bool cached;
+	ktime_t kstart = ktime_get();
 	int result;
 
 	if (!iov_iter_count(from)) {
@@ -1813,8 +1824,10 @@ static ssize_t ll_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	 * from PCC cache automatically.
 	 */
 	result = pcc_file_write_iter(iocb, from, &cached);
-	if (cached && result != -ENOSPC && result != -EDQUOT)
-		return result;
+	if (cached && result != -ENOSPC && result != -EDQUOT) {
+		rc_normal = result;
+		goto out;
+	}
 
 	/* NB: we can't do direct IO for tiny writes because they use the page
 	 * cache, we can't do sync writes because tiny writes can't flush
@@ -1855,10 +1868,14 @@ static ssize_t ll_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 	cl_env_put(env, &refcheck);
 out:
-	if (rc_normal > 0)
+	if (rc_normal > 0) {
 		ll_rw_stats_tally(ll_i2sbi(file_inode(file)), current->pid,
 				  LUSTRE_FPRIVATE(file), iocb->ki_pos,
 				  rc_normal, WRITE);
+		ll_stats_ops_tally(ll_i2sbi(file_inode(file)), LPROC_LL_WRITE,
+				   ktime_us_delta(ktime_get(), kstart));
+	}
+
 	return rc_normal;
 }
 
@@ -3850,12 +3867,12 @@ static loff_t ll_file_seek(struct file *file, loff_t offset, int origin)
 {
 	struct inode *inode = file_inode(file);
 	loff_t retval, eof = 0;
+	ktime_t kstart = ktime_get();
 
 	retval = offset + ((origin == SEEK_END) ? i_size_read(inode) :
 			   (origin == SEEK_CUR) ? file->f_pos : 0);
 	CDEBUG(D_VFSTRACE, "VFS Op:inode=" DFID "(%p), to=%llu=%#llx(%d)\n",
 	       PFID(ll_inode2fid(inode)), inode, retval, retval, origin);
-	ll_stats_ops_tally(ll_i2sbi(inode), LPROC_LL_LLSEEK, 1);
 
 	if (origin == SEEK_END || origin == SEEK_HOLE || origin == SEEK_DATA) {
 		retval = ll_glimpse_size(inode);
@@ -3864,8 +3881,12 @@ static loff_t ll_file_seek(struct file *file, loff_t offset, int origin)
 		eof = i_size_read(inode);
 	}
 
-	return generic_file_llseek_size(file, offset, origin,
-					ll_file_maxbytes(inode), eof);
+	retval = generic_file_llseek_size(file, offset, origin,
+					  ll_file_maxbytes(inode), eof);
+	if (retval >= 0)
+		ll_stats_ops_tally(ll_i2sbi(inode), LPROC_LL_LLSEEK,
+				   ktime_us_delta(ktime_get(), kstart));
+	return retval;
 }
 
 static int ll_flush(struct file *file, fl_owner_t id)
@@ -3948,14 +3969,13 @@ int ll_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 	struct inode *inode = file_inode(file);
 	struct ll_inode_info *lli = ll_i2info(inode);
 	struct ptlrpc_request *req;
+	ktime_t kstart = ktime_get();
 	int rc, err;
 
 	CDEBUG(D_VFSTRACE,
 	       "VFS Op:inode=" DFID "(%p), start %lld, end %lld, datasync %d\n",
 	       PFID(ll_inode2fid(inode)), inode, start, end, datasync);
 
-	ll_stats_ops_tally(ll_i2sbi(inode), LPROC_LL_FSYNC, 1);
-
 
 	rc = file_write_and_wait_range(file, start, end);
 	inode_lock(inode);
@@ -4002,6 +4022,10 @@ int ll_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 	}
 
 	inode_unlock(inode);
+
+	if (!rc)
+		ll_stats_ops_tally(ll_i2sbi(inode), LPROC_LL_FSYNC,
+				   ktime_us_delta(ktime_get(), kstart));
 	return rc;
 }
 
@@ -4019,6 +4043,7 @@ int ll_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 	struct lustre_handle lockh = {0};
 	union ldlm_policy_data flock = { { 0 } };
 	int fl_type = file_lock->fl_type;
+	ktime_t kstart = ktime_get();
 	u64 flags = 0;
 	int rc;
 	int rc2 = 0;
@@ -4026,7 +4051,6 @@ int ll_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 	CDEBUG(D_VFSTRACE, "VFS Op:inode=" DFID " file_lock=%p\n",
 	       PFID(ll_inode2fid(inode)), file_lock);
 
-	ll_stats_ops_tally(ll_i2sbi(inode), LPROC_LL_FLOCK, 1);
 
 	if (file_lock->fl_flags & FL_FLOCK)
 		LASSERT((cmd == F_SETLKW) || (cmd == F_SETLK));
@@ -4122,6 +4146,9 @@ int ll_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 
 	ll_finish_md_op_data(op_data);
 
+	if (!rc)
+		ll_stats_ops_tally(ll_i2sbi(inode), LPROC_LL_FLOCK,
+				   ktime_us_delta(ktime_get(), kstart));
 	return rc;
 }
 
@@ -4515,10 +4542,9 @@ int ll_getattr(const struct path *path, struct kstat *stat,
 	struct inode *inode = d_inode(path->dentry);
 	struct ll_sb_info *sbi = ll_i2sbi(inode);
 	struct ll_inode_info *lli = ll_i2info(inode);
+	ktime_t kstart = ktime_get();
 	int rc;
 
-	ll_stats_ops_tally(sbi, LPROC_LL_GETATTR, 1);
-
 	rc = ll_inode_revalidate(path->dentry, IT_GETATTR);
 	if (rc < 0)
 		return rc;
@@ -4582,6 +4608,9 @@ int ll_getattr(const struct path *path, struct kstat *stat,
 	stat->size = i_size_read(inode);
 	stat->blocks = inode->i_blocks;
 
+	ll_stats_ops_tally(sbi, LPROC_LL_GETATTR,
+			   ktime_us_delta(ktime_get(), kstart));
+
 	return 0;
 }
 
@@ -4634,6 +4663,7 @@ int ll_inode_permission(struct inode *inode, int mask)
 	const struct cred *old_cred = NULL;
 	struct cred *cred = NULL;
 	bool squash_id = false;
+	ktime_t kstart = ktime_get();
 	int rc = 0;
 
 	if (mask & MAY_NOT_BLOCK)
@@ -4682,7 +4712,6 @@ int ll_inode_permission(struct inode *inode, int mask)
 		old_cred = override_creds(cred);
 	}
 
-	ll_stats_ops_tally(ll_i2sbi(inode), LPROC_LL_INODE_PERM, 1);
 	rc = generic_permission(inode, mask);
 
 	/* restore current process's credentials and FS capability */
@@ -4691,6 +4720,10 @@ int ll_inode_permission(struct inode *inode, int mask)
 		put_cred(cred);
 	}
 
+	if (!rc)
+		ll_stats_ops_tally(sbi, LPROC_LL_INODE_PERM,
+				   ktime_us_delta(ktime_get(), kstart));
+
 	return rc;
 }
 
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index d84f50c..205ea50 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -775,7 +775,7 @@ int cl_get_grouplock(struct cl_object *obj, unsigned long gid, int nonblock,
 /* llite/lproc_llite.c */
 int ll_debugfs_register_super(struct super_block *sb, const char *name);
 void ll_debugfs_unregister_super(struct super_block *sb);
-void ll_stats_ops_tally(struct ll_sb_info *sbi, int op, int count);
+void ll_stats_ops_tally(struct ll_sb_info *sbi, int op, long count);
 void ll_rw_stats_tally(struct ll_sb_info *sbi, pid_t pid,
 		       struct ll_file_data *file, loff_t pos,
 		       size_t count, int rw);
@@ -783,10 +783,12 @@ void ll_rw_stats_tally(struct ll_sb_info *sbi, pid_t pid,
 enum {
 	LPROC_LL_READ_BYTES,
 	LPROC_LL_WRITE_BYTES,
+	LPROC_LL_READ,
+	LPROC_LL_WRITE,
 	LPROC_LL_IOCTL,
 	LPROC_LL_OPEN,
 	LPROC_LL_RELEASE,
-	LPROC_LL_MAP,
+	LPROC_LL_MMAP,
 	LPROC_LL_FAULT,
 	LPROC_LL_MKWRITE,
 	LPROC_LL_LLSEEK,
@@ -805,7 +807,6 @@ enum {
 	LPROC_LL_MKNOD,
 	LPROC_LL_RENAME,
 	LPROC_LL_STATFS,
-	LPROC_LL_ALLOC_INODE,
 	LPROC_LL_SETXATTR,
 	LPROC_LL_GETXATTR,
 	LPROC_LL_GETXATTR_HITS,
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 49490ee..84472fb 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -1644,6 +1644,7 @@ int ll_setattr_raw(struct dentry *dentry, struct iattr *attr,
 	struct inode *inode = d_inode(dentry);
 	struct ll_inode_info *lli = ll_i2info(inode);
 	struct md_op_data *op_data = NULL;
+	ktime_t kstart = ktime_get();
 	int rc = 0;
 
 	CDEBUG(D_VFSTRACE, "%s: setattr inode " DFID "(%p) from %llu to %llu, valid %x, hsm_import %d\n",
@@ -1820,8 +1821,10 @@ int ll_setattr_raw(struct dentry *dentry, struct iattr *attr,
 		inode_has_no_xattr(inode);
 	}
 
-	ll_stats_ops_tally(ll_i2sbi(inode), (attr->ia_valid & ATTR_SIZE) ?
-			LPROC_LL_TRUNC : LPROC_LL_SETATTR, 1);
+	if (!rc)
+		ll_stats_ops_tally(ll_i2sbi(inode), attr->ia_valid & ATTR_SIZE ?
+					LPROC_LL_TRUNC : LPROC_LL_SETATTR,
+				   ktime_us_delta(ktime_get(), kstart));
 
 	return rc;
 }
@@ -1918,10 +1921,10 @@ int ll_statfs(struct dentry *de, struct kstatfs *sfs)
 	struct super_block *sb = de->d_sb;
 	struct obd_statfs osfs;
 	u64 fsid = huge_encode_dev(sb->s_dev);
+	ktime_t kstart = ktime_get();
 	int rc;
 
-	CDEBUG(D_VFSTRACE, "VFS Op: at %llu jiffies\n", get_jiffies_64());
-	ll_stats_ops_tally(ll_s2sbi(sb), LPROC_LL_STATFS, 1);
+	CDEBUG(D_VFSTRACE, "VFS Op:sb=%s (%p)\n", sb->s_id, sb);
 
 	/* Some amount of caching on the client is allowed */
 	rc = ll_statfs_internal(ll_s2sbi(sb), &osfs, OBD_STATFS_SUM);
@@ -1950,6 +1953,10 @@ int ll_statfs(struct dentry *de, struct kstatfs *sfs)
 	sfs->f_bavail = osfs.os_bavail;
 	sfs->f_fsid.val[0] = (u32)fsid;
 	sfs->f_fsid.val[1] = (u32)(fsid >> 32);
+
+	ll_stats_ops_tally(ll_s2sbi(sb), LPROC_LL_STATFS,
+			   ktime_us_delta(ktime_get(), kstart));
+
 	return 0;
 }
 
diff --git a/fs/lustre/llite/llite_mmap.c b/fs/lustre/llite/llite_mmap.c
index 5c13164..b955756e 100644
--- a/fs/lustre/llite/llite_mmap.c
+++ b/fs/lustre/llite/llite_mmap.c
@@ -363,13 +363,11 @@ static vm_fault_t ll_fault(struct vm_fault *vmf)
 	bool cached;
 	vm_fault_t result;
 	sigset_t old, new;
-
-	ll_stats_ops_tally(ll_i2sbi(file_inode(vma->vm_file)),
-			   LPROC_LL_FAULT, 1);
+	ktime_t kstart = ktime_get();
 
 	result = pcc_fault(vma, vmf, &cached);
 	if (cached)
-		return result;
+		goto out;
 
 	/* Only SIGKILL and SIGTERM are allowed for fault/nopage/mkwrite
 	 * so that it can be killed by admin but not cause segfault by
@@ -407,11 +405,17 @@ static vm_fault_t ll_fault(struct vm_fault *vmf)
 	}
 	sigprocmask(SIG_SETMASK, &old, NULL);
 
-	if (vmf->page && result == VM_FAULT_LOCKED)
+out:
+	if (vmf->page && result == VM_FAULT_LOCKED) {
 		ll_rw_stats_tally(ll_i2sbi(file_inode(vma->vm_file)),
 				  current->pid, LUSTRE_FPRIVATE(vma->vm_file),
 				  cl_offset(NULL, vmf->page->index), PAGE_SIZE,
 				  READ);
+		ll_stats_ops_tally(ll_i2sbi(file_inode(vma->vm_file)),
+				   LPROC_LL_FAULT,
+				   ktime_us_delta(ktime_get(), kstart));
+	}
+
 	return result;
 }
 
@@ -424,13 +428,11 @@ static vm_fault_t ll_page_mkwrite(struct vm_fault *vmf)
 	bool cached;
 	int err;
 	vm_fault_t ret;
+	ktime_t kstart = ktime_get();
 
-	ll_stats_ops_tally(ll_i2sbi(file_inode(vma->vm_file)),
-			   LPROC_LL_MKWRITE, 1);
-
-	err = pcc_page_mkwrite(vma, vmf, &cached);
+	ret = pcc_page_mkwrite(vma, vmf, &cached);
 	if (cached)
-		return err;
+		goto out;
 
 	file_update_time(vma->vm_file);
 	do {
@@ -465,11 +467,17 @@ static vm_fault_t ll_page_mkwrite(struct vm_fault *vmf)
 		break;
 	}
 
-	if (ret == VM_FAULT_LOCKED)
+out:
+	if (ret == VM_FAULT_LOCKED) {
 		ll_rw_stats_tally(ll_i2sbi(file_inode(vma->vm_file)),
 				  current->pid, LUSTRE_FPRIVATE(vma->vm_file),
 				  cl_offset(NULL, vmf->page->index), PAGE_SIZE,
 				  WRITE);
+		ll_stats_ops_tally(ll_i2sbi(file_inode(vma->vm_file)),
+				   LPROC_LL_MKWRITE,
+				   ktime_us_delta(ktime_get(), kstart));
+	}
+
 	return ret;
 }
 
@@ -527,6 +535,7 @@ int ll_teardown_mmaps(struct address_space *mapping, u64 first, u64 last)
 int ll_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct inode *inode = file_inode(file);
+	ktime_t kstart = ktime_get();
 	bool cached;
 	int rc;
 
@@ -537,7 +546,6 @@ int ll_file_mmap(struct file *file, struct vm_area_struct *vma)
 	if (cached && rc != 0)
 		return rc;
 
-	ll_stats_ops_tally(ll_i2sbi(inode), LPROC_LL_MAP, 1);
 	rc = generic_file_mmap(file, vma);
 	if (rc == 0) {
 		vma->vm_ops = &ll_file_vm_ops;
@@ -547,5 +555,9 @@ int ll_file_mmap(struct file *file, struct vm_area_struct *vma)
 			rc = ll_glimpse_size(inode);
 	}
 
+	if (!rc)
+		ll_stats_ops_tally(ll_i2sbi(inode), LPROC_LL_MMAP,
+				   ktime_us_delta(ktime_get(), kstart));
+
 	return rc;
 }
diff --git a/fs/lustre/llite/lproc_llite.c b/fs/lustre/llite/lproc_llite.c
index 439c096..82c5e5c 100644
--- a/fs/lustre/llite/lproc_llite.c
+++ b/fs/lustre/llite/lproc_llite.c
@@ -1541,54 +1541,58 @@ static void sbi_kobj_release(struct kobject *kobj)
 	.release	= sbi_kobj_release,
 };
 
+#define LPROCFS_TYPE_LATENCY \
+	(LPROCFS_TYPE_USEC | LPROCFS_CNTR_AVGMINMAX | LPROCFS_CNTR_STDDEV)
 static const struct llite_file_opcode {
 	u32		opcode;
 	u32		type;
 	const char	*opname;
 } llite_opcode_table[LPROC_LL_FILE_OPCODES] = {
 	/* file operation */
-	{ LPROC_LL_READ_BYTES,     LPROCFS_CNTR_AVGMINMAX | LPROCFS_TYPE_BYTES,
-				   "read_bytes" },
-	{ LPROC_LL_WRITE_BYTES,    LPROCFS_CNTR_AVGMINMAX | LPROCFS_TYPE_BYTES,
-				   "write_bytes" },
-	{ LPROC_LL_IOCTL,	   LPROCFS_TYPE_REGS, "ioctl" },
-	{ LPROC_LL_OPEN,	   LPROCFS_TYPE_REGS, "open" },
-	{ LPROC_LL_RELEASE,	   LPROCFS_TYPE_REGS, "close" },
-	{ LPROC_LL_MAP,		   LPROCFS_TYPE_REGS, "mmap" },
-	{ LPROC_LL_FAULT,	   LPROCFS_TYPE_REGS, "page_fault" },
-	{ LPROC_LL_MKWRITE,	   LPROCFS_TYPE_REGS, "page_mkwrite" },
-	{ LPROC_LL_LLSEEK,	   LPROCFS_TYPE_REGS, "seek" },
-	{ LPROC_LL_FSYNC,	   LPROCFS_TYPE_REGS, "fsync" },
-	{ LPROC_LL_READDIR,	   LPROCFS_TYPE_REGS, "readdir" },
+	{ LPROC_LL_READ_BYTES,	LPROCFS_CNTR_AVGMINMAX | LPROCFS_TYPE_BYTES,
+		"read_bytes" },
+	{ LPROC_LL_WRITE_BYTES,	LPROCFS_CNTR_AVGMINMAX | LPROCFS_TYPE_BYTES,
+		"write_bytes" },
+	{ LPROC_LL_READ,	LPROCFS_TYPE_LATENCY,	"read" },
+	{ LPROC_LL_WRITE,	LPROCFS_TYPE_LATENCY,	"write" },
+	{ LPROC_LL_IOCTL,	LPROCFS_TYPE_REQS,	"ioctl" },
+	{ LPROC_LL_OPEN,	LPROCFS_TYPE_LATENCY,	"open" },
+	{ LPROC_LL_RELEASE,	LPROCFS_TYPE_LATENCY,	"close" },
+	{ LPROC_LL_MMAP,	LPROCFS_TYPE_LATENCY,	"mmap" },
+	{ LPROC_LL_FAULT,	LPROCFS_TYPE_LATENCY,	"page_fault" },
+	{ LPROC_LL_MKWRITE,	LPROCFS_TYPE_LATENCY,	"page_mkwrite" },
+	{ LPROC_LL_LLSEEK,	LPROCFS_TYPE_LATENCY,	"seek" },
+	{ LPROC_LL_FSYNC,	LPROCFS_TYPE_LATENCY,	"fsync" },
+	{ LPROC_LL_READDIR,	LPROCFS_TYPE_LATENCY,	"readdir" },
 	/* inode operation */
-	{ LPROC_LL_SETATTR,	   LPROCFS_TYPE_REGS, "setattr" },
-	{ LPROC_LL_TRUNC,	   LPROCFS_TYPE_REGS, "truncate" },
-	{ LPROC_LL_FLOCK,	   LPROCFS_TYPE_REGS, "flock" },
-	{ LPROC_LL_GETATTR,	   LPROCFS_TYPE_REGS, "getattr" },
+	{ LPROC_LL_SETATTR,	LPROCFS_TYPE_LATENCY,	"setattr" },
+	{ LPROC_LL_TRUNC,	LPROCFS_TYPE_LATENCY,	"truncate" },
+	{ LPROC_LL_FLOCK,	LPROCFS_TYPE_LATENCY,	"flock" },
+	{ LPROC_LL_GETATTR,	LPROCFS_TYPE_LATENCY,	"getattr" },
 	/* dir inode operation */
-	{ LPROC_LL_CREATE,	   LPROCFS_TYPE_REGS, "create" },
-	{ LPROC_LL_LINK,	   LPROCFS_TYPE_REGS, "link" },
-	{ LPROC_LL_UNLINK,	   LPROCFS_TYPE_REGS, "unlink" },
-	{ LPROC_LL_SYMLINK,	   LPROCFS_TYPE_REGS, "symlink" },
-	{ LPROC_LL_MKDIR,	   LPROCFS_TYPE_REGS, "mkdir" },
-	{ LPROC_LL_RMDIR,	   LPROCFS_TYPE_REGS, "rmdir" },
-	{ LPROC_LL_MKNOD,	   LPROCFS_TYPE_REGS, "mknod" },
-	{ LPROC_LL_RENAME,	   LPROCFS_TYPE_REGS, "rename" },
+	{ LPROC_LL_CREATE,	LPROCFS_TYPE_LATENCY,	"create" },
+	{ LPROC_LL_LINK,	LPROCFS_TYPE_LATENCY,	"link" },
+	{ LPROC_LL_UNLINK,	LPROCFS_TYPE_LATENCY,	"unlink" },
+	{ LPROC_LL_SYMLINK,	LPROCFS_TYPE_LATENCY,	"symlink" },
+	{ LPROC_LL_MKDIR,	LPROCFS_TYPE_LATENCY,	"mkdir" },
+	{ LPROC_LL_RMDIR,	LPROCFS_TYPE_LATENCY,	"rmdir" },
+	{ LPROC_LL_MKNOD,	LPROCFS_TYPE_LATENCY,	"mknod" },
+	{ LPROC_LL_RENAME,	LPROCFS_TYPE_LATENCY,	"rename" },
 	/* special inode operation */
-	{ LPROC_LL_STATFS,	   LPROCFS_TYPE_REGS, "statfs" },
-	{ LPROC_LL_ALLOC_INODE,    LPROCFS_TYPE_REGS, "alloc_inode" },
-	{ LPROC_LL_SETXATTR,       LPROCFS_TYPE_REGS, "setxattr" },
-	{ LPROC_LL_GETXATTR,       LPROCFS_TYPE_REGS, "getxattr" },
-	{ LPROC_LL_GETXATTR_HITS,  LPROCFS_TYPE_REGS, "getxattr_hits" },
-	{ LPROC_LL_LISTXATTR,      LPROCFS_TYPE_REGS, "listxattr" },
-	{ LPROC_LL_REMOVEXATTR,    LPROCFS_TYPE_REGS, "removexattr" },
-	{ LPROC_LL_INODE_PERM,     LPROCFS_TYPE_REGS, "inode_permission" },
+	{ LPROC_LL_STATFS,	LPROCFS_TYPE_LATENCY,	"statfs" },
+	{ LPROC_LL_SETXATTR,	LPROCFS_TYPE_LATENCY,	"setxattr" },
+	{ LPROC_LL_GETXATTR,	LPROCFS_TYPE_LATENCY,	"getxattr" },
+	{ LPROC_LL_GETXATTR_HITS, LPROCFS_TYPE_REQS,	"getxattr_hits" },
+	{ LPROC_LL_LISTXATTR,	LPROCFS_TYPE_LATENCY,	"listxattr" },
+	{ LPROC_LL_REMOVEXATTR,	LPROCFS_TYPE_LATENCY,	"removexattr" },
+	{ LPROC_LL_INODE_PERM,	LPROCFS_TYPE_LATENCY,	"inode_permission" },
 };
 
-void ll_stats_ops_tally(struct ll_sb_info *sbi, int op, int count)
+void ll_stats_ops_tally(struct ll_sb_info *sbi, int op, long count)
 {
 	if (!sbi->ll_stats)
 		return;
+
 	if (sbi->ll_stats_track_type == STATS_TRACK_ALL)
 		lprocfs_counter_add(sbi->ll_stats, op, count);
 	else if (sbi->ll_stats_track_type == STATS_TRACK_PID &&
@@ -1661,12 +1665,14 @@ int ll_debugfs_register_super(struct super_block *sb, const char *name)
 		u32 type = llite_opcode_table[id].type;
 		void *ptr = NULL;
 
-		if (type & LPROCFS_TYPE_REGS)
-			ptr = "regs";
+		if (type & LPROCFS_TYPE_REQS)
+			ptr = "reqs";
 		else if (type & LPROCFS_TYPE_BYTES)
 			ptr = "bytes";
 		else if (type & LPROCFS_TYPE_PAGES)
 			ptr = "pages";
+		else if (type & LPROCFS_TYPE_USEC)
+			ptr = "usec";
 		lprocfs_counter_init(sbi->ll_stats,
 				     llite_opcode_table[id].opcode,
 				     (type & LPROCFS_CNTR_AVGMINMAX),
diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index f4ca16e..5b9f3a7 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -1322,6 +1322,7 @@ static int ll_new_node(struct inode *dir, struct dentry *dentry,
 static int ll_mknod(struct inode *dir, struct dentry *dchild,
 		    umode_t mode, dev_t rdev)
 {
+	ktime_t kstart = ktime_get();
 	int err;
 
 	CDEBUG(D_VFSTRACE, "VFS Op:name=%pd, dir=" DFID "(%p) mode %o dev %x\n",
@@ -1353,7 +1354,8 @@ static int ll_mknod(struct inode *dir, struct dentry *dchild,
 	}
 
 	if (!err)
-		ll_stats_ops_tally(ll_i2sbi(dir), LPROC_LL_MKNOD, 1);
+		ll_stats_ops_tally(ll_i2sbi(dir), LPROC_LL_MKNOD,
+				   ktime_us_delta(ktime_get(), kstart));
 
 	return err;
 }
@@ -1364,6 +1366,7 @@ static int ll_mknod(struct inode *dir, struct dentry *dchild,
 static int ll_create_nd(struct inode *dir, struct dentry *dentry,
 			umode_t mode, bool want_excl)
 {
+	ktime_t kstart = ktime_get();
 	int rc;
 
 	CDEBUG(D_VFSTRACE,
@@ -1372,11 +1375,13 @@ static int ll_create_nd(struct inode *dir, struct dentry *dentry,
 
 	rc = ll_mknod(dir, dentry, mode, 0);
 
-	ll_stats_ops_tally(ll_i2sbi(dir), LPROC_LL_CREATE, 1);
-
 	CDEBUG(D_VFSTRACE, "VFS Op:name=%pd, unhashed %d\n",
 	       dentry, d_unhashed(dentry));
 
+	if (!rc)
+		ll_stats_ops_tally(ll_i2sbi(dir), LPROC_LL_CREATE,
+				   ktime_us_delta(ktime_get(), kstart));
+
 	return rc;
 }
 
@@ -1385,6 +1390,7 @@ static int ll_unlink(struct inode *dir, struct dentry *dchild)
 	struct ptlrpc_request *request = NULL;
 	struct md_op_data *op_data;
 	struct mdt_body *body;
+	ktime_t kstart = ktime_get();
 	int rc;
 
 	CDEBUG(D_VFSTRACE, "VFS Op:name=%pd,dir=%lu/%u(%p)\n",
@@ -1414,7 +1420,8 @@ static int ll_unlink(struct inode *dir, struct dentry *dchild)
 		set_nlink(dchild->d_inode, body->mbo_nlink);
 
 	ll_update_times(request, dir);
-	ll_stats_ops_tally(ll_i2sbi(dir), LPROC_LL_UNLINK, 1);
+	ll_stats_ops_tally(ll_i2sbi(dir), LPROC_LL_UNLINK,
+				   ktime_us_delta(ktime_get(), kstart));
 
  out:
 	ptlrpc_req_finished(request);
@@ -1423,6 +1430,7 @@ static int ll_unlink(struct inode *dir, struct dentry *dchild)
 
 static int ll_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 {
+	ktime_t kstart = ktime_get();
 	int err;
 
 	CDEBUG(D_VFSTRACE, "VFS Op:name=%pd, dir" DFID "(%p)\n",
@@ -1434,13 +1442,15 @@ static int ll_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 
 	err = ll_new_node(dir, dentry, NULL, mode, 0, LUSTRE_OPC_MKDIR);
 	if (!err)
-		ll_stats_ops_tally(ll_i2sbi(dir), LPROC_LL_MKDIR, 1);
+		ll_stats_ops_tally(ll_i2sbi(dir), LPROC_LL_MKDIR,
+				   ktime_us_delta(ktime_get(), kstart));
 
 	return err;
 }
 
 static int ll_rmdir(struct inode *dir, struct dentry *dchild)
 {
+	ktime_t kstart = ktime_get();
 	struct ptlrpc_request *request = NULL;
 	struct md_op_data *op_data;
 	int rc;
@@ -1463,7 +1473,8 @@ static int ll_rmdir(struct inode *dir, struct dentry *dchild)
 	ll_finish_md_op_data(op_data);
 	if (rc == 0) {
 		ll_update_times(request, dir);
-		ll_stats_ops_tally(ll_i2sbi(dir), LPROC_LL_RMDIR, 1);
+		ll_stats_ops_tally(ll_i2sbi(dir), LPROC_LL_RMDIR,
+				   ktime_us_delta(ktime_get(), kstart));
 	}
 
 	ptlrpc_req_finished(request);
@@ -1473,6 +1484,7 @@ static int ll_rmdir(struct inode *dir, struct dentry *dchild)
 static int ll_symlink(struct inode *dir, struct dentry *dentry,
 		      const char *oldname)
 {
+	ktime_t kstart = ktime_get();
 	int err;
 
 	CDEBUG(D_VFSTRACE, "VFS Op:name=%pd, dir=" DFID "(%p),target=%.*s\n",
@@ -1482,7 +1494,8 @@ static int ll_symlink(struct inode *dir, struct dentry *dentry,
 			  0, LUSTRE_OPC_SYMLINK);
 
 	if (!err)
-		ll_stats_ops_tally(ll_i2sbi(dir), LPROC_LL_SYMLINK, 1);
+		ll_stats_ops_tally(ll_i2sbi(dir), LPROC_LL_SYMLINK,
+				   ktime_us_delta(ktime_get(), kstart));
 
 	return err;
 }
@@ -1494,6 +1507,7 @@ static int ll_link(struct dentry *old_dentry, struct inode *dir,
 	struct ll_sb_info *sbi = ll_i2sbi(dir);
 	struct ptlrpc_request *request = NULL;
 	struct md_op_data *op_data;
+	ktime_t kstart = ktime_get();
 	int err;
 
 	CDEBUG(D_VFSTRACE,
@@ -1513,7 +1527,8 @@ static int ll_link(struct dentry *old_dentry, struct inode *dir,
 		goto out;
 
 	ll_update_times(request, dir);
-	ll_stats_ops_tally(sbi, LPROC_LL_LINK, 1);
+	ll_stats_ops_tally(sbi, LPROC_LL_LINK,
+			   ktime_us_delta(ktime_get(), kstart));
 out:
 	ptlrpc_req_finished(request);
 	return err;
@@ -1526,6 +1541,7 @@ static int ll_rename(struct inode *src, struct dentry *src_dchild,
 	struct ptlrpc_request *request = NULL;
 	struct ll_sb_info *sbi = ll_i2sbi(src);
 	struct md_op_data *op_data;
+	ktime_t kstart = ktime_get();
 	int err;
 
 	if (flags)
@@ -1555,12 +1571,15 @@ static int ll_rename(struct inode *src, struct dentry *src_dchild,
 	if (!err) {
 		ll_update_times(request, src);
 		ll_update_times(request, tgt);
-		ll_stats_ops_tally(sbi, LPROC_LL_RENAME, 1);
 	}
 
 	ptlrpc_req_finished(request);
-	if (!err)
+	if (!err) {
 		d_move(src_dchild, tgt_dchild);
+		ll_stats_ops_tally(sbi, LPROC_LL_RENAME,
+				   ktime_us_delta(ktime_get(), kstart));
+	}
+
 	return err;
 }
 
diff --git a/fs/lustre/llite/pcc.c b/fs/lustre/llite/pcc.c
index b926f87..a40f242 100644
--- a/fs/lustre/llite/pcc.c
+++ b/fs/lustre/llite/pcc.c
@@ -1754,8 +1754,8 @@ void pcc_vm_close(struct vm_area_struct *vma)
 	pcc_inode_unlock(inode);
 }
 
-int pcc_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
-		     bool *cached)
+vm_fault_t pcc_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
+			    bool *cached)
 {
 	struct page *page = vmf->page;
 	struct mm_struct *mm = vma->vm_mm;
diff --git a/fs/lustre/llite/pcc.h b/fs/lustre/llite/pcc.h
index a221ef6..ec2e421 100644
--- a/fs/lustre/llite/pcc.h
+++ b/fs/lustre/llite/pcc.h
@@ -239,8 +239,8 @@ int pcc_fsync(struct file *file, loff_t start, loff_t end,
 void pcc_vm_open(struct vm_area_struct *vma);
 void pcc_vm_close(struct vm_area_struct *vma);
 int pcc_fault(struct vm_area_struct *mva, struct vm_fault *vmf, bool *cached);
-int pcc_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
-		     bool *cached);
+vm_fault_t pcc_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
+			    bool *cached);
 int pcc_inode_create(struct super_block *sb, struct pcc_dataset *dataset,
 		     struct lu_fid *fid, struct dentry **pcc_dentry);
 int pcc_inode_create_fini(struct inode *inode, struct pcc_create_attach *pca);
diff --git a/fs/lustre/llite/super25.c b/fs/lustre/llite/super25.c
index 38d60b0..006be6b 100644
--- a/fs/lustre/llite/super25.c
+++ b/fs/lustre/llite/super25.c
@@ -50,7 +50,6 @@ static struct inode *ll_alloc_inode(struct super_block *sb)
 {
 	struct ll_inode_info *lli;
 
-	ll_stats_ops_tally(ll_s2sbi(sb), LPROC_LL_ALLOC_INODE, 1);
 	lli = kmem_cache_zalloc(ll_inode_cachep, GFP_NOFS);
 	if (!lli)
 		return NULL;
diff --git a/fs/lustre/llite/xattr.c b/fs/lustre/llite/xattr.c
index 4e1ce34..7134f10 100644
--- a/fs/lustre/llite/xattr.c
+++ b/fs/lustre/llite/xattr.c
@@ -91,6 +91,7 @@ static int ll_xattr_set_common(const struct xattr_handler *handler,
 	struct ptlrpc_request *req = NULL;
 	const char *pv = value;
 	char *fullname;
+	ktime_t kstart = ktime_get();
 	u64 valid;
 	int rc;
 
@@ -98,13 +99,10 @@ static int ll_xattr_set_common(const struct xattr_handler *handler,
 	 * unconditionally replaced by "". When removexattr() is
 	 * called we get a NULL value and XATTR_REPLACE for flags.
 	 */
-	if (!value && flags == XATTR_REPLACE) {
-		ll_stats_ops_tally(ll_i2sbi(inode), LPROC_LL_REMOVEXATTR, 1);
+	if (!value && flags == XATTR_REPLACE)
 		valid = OBD_MD_FLXATTRRM;
-	} else {
-		ll_stats_ops_tally(ll_i2sbi(inode), LPROC_LL_SETXATTR, 1);
+	else
 		valid = OBD_MD_FLXATTR;
-	}
 
 	rc = xattr_type_filter(sbi, handler);
 	if (rc)
@@ -153,6 +151,11 @@ static int ll_xattr_set_common(const struct xattr_handler *handler,
 	}
 
 	ptlrpc_req_finished(req);
+
+	ll_stats_ops_tally(ll_i2sbi(inode), valid == OBD_MD_FLXATTRRM ?
+				LPROC_LL_REMOVEXATTR : LPROC_LL_SETXATTR,
+			   ktime_us_delta(ktime_get(), kstart));
+
 	return 0;
 }
 
@@ -294,6 +297,11 @@ static int ll_xattr_set(const struct xattr_handler *handler,
 			const char *name, const void *value, size_t size,
 			int flags)
 {
+	ktime_t kstart = ktime_get();
+	int op_type = flags == XATTR_REPLACE ? LPROC_LL_REMOVEXATTR :
+					       LPROC_LL_SETXATTR;
+	int rc;
+
 	LASSERT(inode);
 	LASSERT(name);
 
@@ -302,18 +310,14 @@ static int ll_xattr_set(const struct xattr_handler *handler,
 
 	/* lustre/trusted.lov.xxx would be passed through xattr API */
 	if (!strcmp(name, "lov")) {
-		int op_type = flags == XATTR_REPLACE ? LPROC_LL_REMOVEXATTR :
-						       LPROC_LL_SETXATTR;
-
-		ll_stats_ops_tally(ll_i2sbi(inode), op_type, 1);
-
-		return ll_setstripe_ea(dentry, (struct lov_user_md *)value,
+		rc = ll_setstripe_ea(dentry, (struct lov_user_md *)value,
 				       size);
+		ll_stats_ops_tally(ll_i2sbi(inode), op_type,
+				   ktime_us_delta(ktime_get(), kstart));
+		return rc;
 	} else if (!strcmp(name, "lma") || !strcmp(name, "link")) {
-		int op_type = flags == XATTR_REPLACE ? LPROC_LL_REMOVEXATTR :
-						       LPROC_LL_SETXATTR;
-
-		ll_stats_ops_tally(ll_i2sbi(inode), op_type, 1);
+		ll_stats_ops_tally(ll_i2sbi(inode), op_type,
+				   ktime_us_delta(ktime_get(), kstart));
 		return 0;
 	}
 
@@ -402,14 +406,13 @@ static int ll_xattr_get_common(const struct xattr_handler *handler,
 			       const char *name, void *buffer, size_t size)
 {
 	struct ll_sb_info *sbi = ll_i2sbi(inode);
+	ktime_t kstart = ktime_get();
 	char *fullname;
 	int rc;
 
 	CDEBUG(D_VFSTRACE, "VFS Op:inode=" DFID "(%p)\n",
 	       PFID(ll_inode2fid(inode)), inode);
 
-	ll_stats_ops_tally(ll_i2sbi(inode), LPROC_LL_GETXATTR, 1);
-
 	rc = xattr_type_filter(sbi, handler);
 	if (rc)
 		return rc;
@@ -444,6 +447,9 @@ static int ll_xattr_get_common(const struct xattr_handler *handler,
 	rc = ll_xattr_list(inode, fullname, handler->flags, buffer, size,
 			   OBD_MD_FLXATTR);
 	kfree(fullname);
+	ll_stats_ops_tally(ll_i2sbi(inode), LPROC_LL_GETXATTR,
+			   ktime_us_delta(ktime_get(), kstart));
+
 	return rc;
 }
 
@@ -569,6 +575,7 @@ ssize_t ll_listxattr(struct dentry *dentry, char *buffer, size_t size)
 {
 	struct inode *inode = d_inode(dentry);
 	struct ll_sb_info *sbi = ll_i2sbi(inode);
+	ktime_t kstart = ktime_get();
 	char *xattr_name;
 	ssize_t rc, rc2;
 	size_t len, rem;
@@ -578,8 +585,6 @@ ssize_t ll_listxattr(struct dentry *dentry, char *buffer, size_t size)
 	CDEBUG(D_VFSTRACE, "VFS Op:inode=" DFID "(%p)\n",
 	       PFID(ll_inode2fid(inode)), inode);
 
-	ll_stats_ops_tally(ll_i2sbi(inode), LPROC_LL_LISTXATTR, 1);
-
 	rc = ll_xattr_list(inode, NULL, XATTR_OTHER_T, buffer, size,
 			   OBD_MD_FLXATTRLS);
 	if (rc < 0)
@@ -591,7 +596,7 @@ ssize_t ll_listxattr(struct dentry *dentry, char *buffer, size_t size)
 	 * exists.
 	 */
 	if (!size)
-		return rc + sizeof(XATTR_LUSTRE_LOV);
+		goto out;
 
 	xattr_name = buffer;
 	rem = rc;
@@ -625,6 +630,10 @@ ssize_t ll_listxattr(struct dentry *dentry, char *buffer, size_t size)
 
 	memcpy(buffer + rc, XATTR_LUSTRE_LOV, sizeof(XATTR_LUSTRE_LOV));
 
+out:
+	ll_stats_ops_tally(ll_i2sbi(inode), LPROC_LL_LISTXATTR,
+			   ktime_us_delta(ktime_get(), kstart));
+
 	return rc + sizeof(XATTR_LUSTRE_LOV);
 }
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 521/622] lustre: osc: don't re-enable grant shrink on reconnect
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (519 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 520/622] lustre: llite: report latency for filesystem ops James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 522/622] lustre: llite: statfs to use NODELAY with MDS James Simmons
                   ` (101 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Alexander Zarochentsev <c17826@cray.com>

The client requests grant shrinking support on each
reconnect and re-enables the capability even if it was
explicitly disabled by lctl set_param.

Cray-bug-id: LUS-7585
WC-bug-id: https://jira.whamcloud.com/browse/LU-12759
Lustre-commit: efa3425c5f5a ("LU-12759 osc: don't re-enable grant shrink on reconnect")
Signed-off-by: Alexander Zarochentsev <c17826@cray.com>
Reviewed-on: https://review.whamcloud.com/36177
Reviewed-by: Andrew Perepechko <c17827@cray.com>
Reviewed-by: Andriy Skulysh <c17819@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_import.h |  4 +++-
 fs/lustre/osc/lproc_osc.c         | 32 +++++++++-----------------------
 fs/lustre/osc/osc_request.c       |  4 ++--
 3 files changed, 14 insertions(+), 26 deletions(-)

diff --git a/fs/lustre/include/lustre_import.h b/fs/lustre/include/lustre_import.h
index c2f98e6..501a896 100644
--- a/fs/lustre/include/lustre_import.h
+++ b/fs/lustre/include/lustre_import.h
@@ -303,7 +303,9 @@ struct obd_import {
 					/* import has tried to connect with server */
 					imp_connect_tried:1,
 					/* connected but not FULL yet */
-					imp_connected:1;
+					imp_connected:1,
+				  /* grant shrink disabled */
+				  imp_grant_shrink_disabled:1;
 
 	u32				imp_connect_op;
 	u32				imp_idle_timeout;
diff --git a/fs/lustre/osc/lproc_osc.c b/fs/lustre/osc/lproc_osc.c
index 8e0088b..2bc7047 100644
--- a/fs/lustre/osc/lproc_osc.c
+++ b/fs/lustre/osc/lproc_osc.c
@@ -695,18 +695,17 @@ static ssize_t grant_shrink_show(struct kobject *kobj, struct attribute *attr,
 {
 	struct obd_device *obd = container_of(kobj, struct obd_device,
 					      obd_kset.kobj);
-	struct client_obd *cli = &obd->u.cli;
-	struct obd_connect_data *ocd;
+	struct obd_import *imp;
 	ssize_t len;
 
 	len = lprocfs_climp_check(obd);
 	if (len)
 		return len;
 
-	ocd = &cli->cl_import->imp_connect_data;
-
+	imp = obd->u.cli.cl_import;
 	len = snprintf(buf, PAGE_SIZE, "%d\n",
-		       !!OCD_HAS_FLAG(ocd, GRANT_SHRINK));
+		       !imp->imp_grant_shrink_disabled &&
+		       OCD_HAS_FLAG(&imp->imp_connect_data, GRANT_SHRINK));
 	up_read(&obd->u.cli.cl_sem);
 
 	return len;
@@ -717,8 +716,7 @@ static ssize_t grant_shrink_store(struct kobject *kobj, struct attribute *attr,
 {
 	struct obd_device *dev = container_of(kobj, struct obd_device,
 					      obd_kset.kobj);
-	struct client_obd *cli = &dev->u.cli;
-	struct obd_connect_data *ocd;
+	struct obd_import *imp;
 	bool val;
 	int rc;
 
@@ -733,22 +731,10 @@ static ssize_t grant_shrink_store(struct kobject *kobj, struct attribute *attr,
 	if (rc)
 		return rc;
 
-	ocd = &cli->cl_import->imp_connect_data;
-
-	if (!val) {
-		if (OCD_HAS_FLAG(ocd, GRANT_SHRINK))
-			ocd->ocd_connect_flags &= ~OBD_CONNECT_GRANT_SHRINK;
-	} else {
-		/**
-		 * server replied obd_connect_data is always bigger, so
-		 * client's imp_connect_flags_orig are always supported
-		 * by the server
-		 */
-		if (!OCD_HAS_FLAG(ocd, GRANT_SHRINK) &&
-		    cli->cl_import->imp_connect_flags_orig &
-		    OBD_CONNECT_GRANT_SHRINK)
-			ocd->ocd_connect_flags |= OBD_CONNECT_GRANT_SHRINK;
-	}
+	imp = dev->u.cli.cl_import;
+	spin_lock(&imp->imp_lock);
+	imp->imp_grant_shrink_disabled = !val;
+	spin_unlock(&imp->imp_lock);
 
 	up_read(&dev->u.cli.cl_sem);
 
diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index 9c43756..39cac7d 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -844,8 +844,8 @@ static int osc_should_shrink_grant(struct client_obd *client)
 	if (!client->cl_import)
 		return 0;
 
-	if ((client->cl_import->imp_connect_data.ocd_connect_flags &
-	     OBD_CONNECT_GRANT_SHRINK) == 0)
+	if (!OCD_HAS_FLAG(&client->cl_import->imp_connect_data, GRANT_SHRINK) ||
+	    client->cl_import->imp_grant_shrink_disabled)
 		return 0;
 
 	if (ktime_get_seconds() >= next_shrink - 5) {
-- 
1.8.3.1


* [lustre-devel] [PATCH 522/622] lustre: llite: statfs to use NODELAY with MDS
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (520 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 521/622] lustre: osc: don't re-enable grant shrink on reconnect James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 523/622] lustre: ptlrpc: grammar fix James Simmons
                   ` (100 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Alex Zhuravlev <bzzz@whamcloud.com>

Otherwise a client umount can get stuck if the MDS is down
for some reason. recovery-small/110k simulates this.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12809
Lustre-commit: a7ae8da24229 ("LU-12809 llite: statfs to use NODELAY with MDS")
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/36297
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_lib.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 84472fb..1245336 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -1869,6 +1869,9 @@ int ll_statfs_internal(struct ll_sb_info *sbi, struct obd_statfs *osfs,
 
 	max_age = ktime_get_seconds() - sbi->ll_statfs_max_age;
 
+	if (sbi->ll_flags & LL_SBI_LAZYSTATFS)
+		flags |= OBD_STATFS_NODELAY;
+
 	rc = obd_statfs(NULL, sbi->ll_md_exp, osfs, max_age, flags);
 	if (rc)
 		return rc;
@@ -1882,9 +1885,6 @@ int ll_statfs_internal(struct ll_sb_info *sbi, struct obd_statfs *osfs,
 	if (osfs->os_state & OS_STATE_SUM)
 		goto out;
 
-	if (sbi->ll_flags & LL_SBI_LAZYSTATFS)
-		flags |= OBD_STATFS_NODELAY;
-
 	rc = obd_statfs(NULL, sbi->ll_dt_exp, &obd_osfs, max_age, flags);
 	if (rc) {
 		/* Possibly a filesystem with no OSTs.  Report MDT totals. */
-- 
1.8.3.1


* [lustre-devel] [PATCH 523/622] lustre: ptlrpc: grammar fix.
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (521 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 522/622] lustre: llite: statfs to use NODELAY with MDS James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 524/622] lustre: lov: check all entries in lov_flush_composite James Simmons
                   ` (99 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Alexander Zarochentsev <c17826@cray.com>

ptlrpc_invalidate_import() error message grammar fix.

Cray-bug-id: LUS-4015
WC-bug-id: https://jira.whamcloud.com/browse/LU-12370
Lustre-commit: 316eddce9382 ("LU-12370 ptlrpc: grammar fix.")
Signed-off-by: Alexander Zarochentsev <c17826@cray.com>
Reviewed-on: https://review.whamcloud.com/36508
Reviewed-by: Andrew Perepechko <c17827@cray.com>
Reviewed-by: Stephan Thiell <sthiell@stanford.edu>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-by: Colin Faber <cfaber@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/import.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index 76a40be..813d3c8 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -381,7 +381,7 @@ void ptlrpc_invalidate_import(struct obd_import *imp)
 						  "still on delayed list");
 				}
 
-				CERROR("%s: Unregistering RPCs found (%d). Network is sluggish? Waiting them to error out.\n",
+				CERROR("%s: Unregistering RPCs found (%d). Network is sluggish? Waiting for them to error out.\n",
 				       cli_tgt,
 				       atomic_read(&imp->imp_unregistering));
 			}
-- 
1.8.3.1


* [lustre-devel] [PATCH 524/622] lustre: lov: check all entries in lov_flush_composite
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (522 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 523/622] lustre: ptlrpc: grammar fix James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 525/622] lustre: pcc: Incorrect size after re-attach James Simmons
                   ` (98 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Vladimir Saveliev <c17830@cray.com>

Check all layout entries for a DoM layout and exit with
-ENODATA if none exists. The caller considers that a valid
case due to a layout change.

Define llo_flush methods for all layouts as required
by lov_dispatch().

The patch also cleans up the cl_dom_comp_size field in cl_layout,
which was used in the previous ll_dom_lock_cancel() implementation.

Run lov_flush_composite() under down_read of lov->lo_type_guard to
avoid racing with a layout change.

Fixes: 865a95df36 ("lustre: llite: improve ll_dom_lock_cancel")

WC-bug-id: https://jira.whamcloud.com/browse/LU-12704
Lustre-commit: 44460570fd21 ("LU-12704 lov: check all entries in lov_flush_composite")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Signed-off-by: Vladimir Saveliev <c17830@cray.com>
Reviewed-on: https://review.whamcloud.com/36368
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/cl_object.h |  2 --
 fs/lustre/llite/namei.c       |  6 ++++++
 fs/lustre/lov/lov_object.c    | 42 +++++++++++++++++++++++-------------------
 3 files changed, 29 insertions(+), 21 deletions(-)

diff --git a/fs/lustre/include/cl_object.h b/fs/lustre/include/cl_object.h
index c3376a4..67731b0 100644
--- a/fs/lustre/include/cl_object.h
+++ b/fs/lustre/include/cl_object.h
@@ -287,8 +287,6 @@ struct cl_layout {
 	struct lu_buf		cl_buf;
 	/** size of layout in lov_mds_md format. */
 	size_t			cl_size;
-	/** size of DoM component if exists or zero otherwise */
-	u64			cl_dom_comp_size;
 	/** Layout generation. */
 	u32			cl_layout_gen;
 	/** whether layout is a composite one */
diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index 5b9f3a7..c87653d 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -198,6 +198,12 @@ static int ll_dom_lock_cancel(struct inode *inode, struct ldlm_lock *lock)
 
 	/* reach MDC layer to flush data under  the DoM ldlm lock */
 	rc = cl_object_flush(env, lli->lli_clob, lock);
+	if (rc == -ENODATA) {
+		CDEBUG(D_INODE, "inode "DFID" layout has no DoM stripe\n",
+		       PFID(ll_inode2fid(inode)));
+		/* most likely result of layout change, do nothing */
+		rc = 0;
+	}
 
 	cl_env_put(env, &refcheck);
 	return rc;
diff --git a/fs/lustre/lov/lov_object.c b/fs/lustre/lov/lov_object.c
index 5c4d8f9..f2c7bc2 100644
--- a/fs/lustre/lov/lov_object.c
+++ b/fs/lustre/lov/lov_object.c
@@ -1048,13 +1048,23 @@ static int lov_flush_composite(const struct lu_env *env,
 			       struct ldlm_lock *lock)
 {
 	struct lov_object *lov = cl2lov(obj);
-	struct lovsub_object *lovsub;
+	struct lov_layout_entry *lle;
+	int rc = -ENODATA;
 
-	if (!lsme_is_dom(lov->lo_lsm->lsm_entries[0]))
-		return -EINVAL;
+	lov_foreach_layout_entry(lov, lle) {
+		if (!lsme_is_dom(lle->lle_lsme))
+			continue;
+		rc = cl_object_flush(env, lovsub2cl(lle->lle_dom.lo_dom), lock);
+		break;
+	}
+
+	return rc;
+}
 
-	lovsub = lov->u.composite.lo_entries[0].lle_dom.lo_dom;
-	return cl_object_flush(env, lovsub2cl(lovsub), lock);
+static int lov_flush_empty(const struct lu_env *env, struct cl_object *obj,
+			   struct ldlm_lock *lock)
+{
+	return 0;
 }
 
 const static struct lov_layout_operations lov_dispatch[] = {
@@ -1066,7 +1076,8 @@ static int lov_flush_composite(const struct lu_env *env,
 		.llo_page_init	= lov_page_init_empty,
 		.llo_lock_init	= lov_lock_init_empty,
 		.llo_io_init	= lov_io_init_empty,
-		.llo_getattr	= lov_attr_get_empty
+		.llo_getattr	= lov_attr_get_empty,
+		.llo_flush	= lov_flush_empty,
 	},
 	[LLT_RELEASED] = {
 		.llo_init	= lov_init_released,
@@ -1076,7 +1087,8 @@ static int lov_flush_composite(const struct lu_env *env,
 		.llo_page_init	= lov_page_init_empty,
 		.llo_lock_init	= lov_lock_init_empty,
 		.llo_io_init	= lov_io_init_released,
-		.llo_getattr	= lov_attr_get_empty
+		.llo_getattr	= lov_attr_get_empty,
+		.llo_flush	= lov_flush_empty,
 	},
 	[LLT_COMP] = {
 		.llo_init	= lov_init_composite,
@@ -1098,6 +1110,7 @@ static int lov_flush_composite(const struct lu_env *env,
 		.llo_lock_init = lov_lock_init_empty,
 		.llo_io_init   = lov_io_init_empty,
 		.llo_getattr   = lov_attr_get_empty,
+		.llo_flush	= lov_flush_empty,
 	},
 };
 
@@ -2085,18 +2098,8 @@ static int lov_object_layout_get(const struct lu_env *env,
 
 	cl->cl_size = lov_comp_md_size(lsm);
 	cl->cl_layout_gen = lsm->lsm_layout_gen;
-	cl->cl_dom_comp_size = 0;
 	cl->cl_is_released = lsm->lsm_is_released;
-	if (lsm_is_composite(lsm->lsm_magic)) {
-		struct lov_stripe_md_entry *lsme = lsm->lsm_entries[0];
-
-		cl->cl_is_composite = true;
-
-		if (lsme_is_dom(lsme))
-			cl->cl_dom_comp_size = lsme->lsme_extent.e_end;
-	} else {
-		cl->cl_is_composite = false;
-	}
+	cl->cl_is_composite = lsm_is_composite(lsm->lsm_magic);
 
 	rc = lov_lsm_pack(lsm, buf->lb_buf, buf->lb_len);
 	lov_lsm_put(lsm);
@@ -2123,7 +2126,8 @@ static loff_t lov_object_maxbytes(struct cl_object *obj)
 static int lov_object_flush(const struct lu_env *env, struct cl_object *obj,
 			    struct ldlm_lock *lock)
 {
-	return LOV_2DISPATCH_NOLOCK(cl2lov(obj), llo_flush, env, obj, lock);
+	return LOV_2DISPATCH_MAYLOCK(cl2lov(obj), llo_flush, true, env, obj,
+				     lock);
 }
 
 static const struct cl_object_operations lov_ops = {
-- 
1.8.3.1


* [lustre-devel] [PATCH 525/622] lustre: pcc: Incorrect size after re-attach
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (523 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 524/622] lustre: lov: check all entries in lov_flush_composite James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 526/622] lustre: pcc: auto attach not work after client cache clear James Simmons
                   ` (97 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Qian Yingjin <qian@ddn.com>

The following test case will result in incorrect size for PCC copy:

- Attach a file with size s1 (s1 > 0) into PCC;
- Detach this file with the --keep option, and the data will be
  retained in PCC;
- Truncate this file locally or on a remote client to a new size
  s2 (s2 < s1);
- Re-attach the file again. The size of the PCC copy is still s1.

To solve this problem, the PCC copy must be truncated to the same
size as the Lustre copy, which will be HSM released later after the
data copy (archive) phase finishes.
This patch also adds handling of pending signals so that the attach
process can be killed by an administrator.

WC-bug-id: https://jira.whamcloud.com/browse/LU-13023
Lustre-commit: 7a810496c2c ("LU-13023 pcc: Incorrect size after re-attach")
Signed-off-by: Qian Yingjin <qian@ddn.com>
Reviewed-on: https://review.whamcloud.com/36884
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/pcc.c | 55 ++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 37 insertions(+), 18 deletions(-)

diff --git a/fs/lustre/llite/pcc.c b/fs/lustre/llite/pcc.c
index a40f242..550045b 100644
--- a/fs/lustre/llite/pcc.c
+++ b/fs/lustre/llite/pcc.c
@@ -2023,16 +2023,21 @@ static int __pcc_inode_create(struct pcc_dataset *dataset,
 	return rc;
 }
 
-/* TODO: Set the project ID for PCC copy */
-int pcc_inode_store_ugpid(struct dentry *dentry, kuid_t uid, kgid_t gid)
+/*
+ * Reset uid, gid or size for the PCC copy masked by @valid.
+ * TODO: Set the project ID for PCC copy.
+ */
+int pcc_inode_reset_iattr(struct dentry *dentry, unsigned int valid,
+			  kuid_t uid, kgid_t gid, loff_t size)
 {
 	struct inode *inode = dentry->d_inode;
 	struct iattr attr;
 	int rc;
 
-	attr.ia_valid = ATTR_UID | ATTR_GID;
+	attr.ia_valid = valid;
 	attr.ia_uid = uid;
 	attr.ia_gid = gid;
+	attr.ia_size = size;
 
 	inode_lock(inode);
 	rc = notify_change(dentry, &attr, NULL);
@@ -2077,8 +2082,8 @@ int pcc_inode_create_fini(struct inode *inode, struct pcc_create_attach *pca)
 		goto out_put;
 	}
 
-	rc = pcc_inode_store_ugpid(pcc_dentry, old_cred->suid,
-				   old_cred->sgid);
+	rc = pcc_inode_reset_iattr(pcc_dentry, ATTR_UID | ATTR_GID,
+				   old_cred->suid, old_cred->sgid, 0);
 	if (rc)
 		goto out_put;
 
@@ -2152,9 +2157,9 @@ static int pcc_filp_write(struct file *filp, const void *buf, ssize_t count,
 	return 0;
 }
 
-static int pcc_copy_data(struct file *src, struct file *dst)
+static ssize_t pcc_copy_data(struct file *src, struct file *dst)
 {
-	int rc = 0;
+	ssize_t rc = 0;
 	ssize_t rc2;
 	loff_t pos, offset = 0;
 	size_t buf_len = 1048576;
@@ -2165,6 +2170,10 @@ static int pcc_copy_data(struct file *src, struct file *dst)
 		return -ENOMEM;
 
 	while (1) {
+		if (signal_pending(current)) {
+			rc = -EINTR;
+			goto out_free;
+		}
 		pos = offset;
 		rc2 = kernel_read(src, buf, buf_len, &pos);
 		if (rc2 < 0) {
@@ -2180,6 +2189,7 @@ static int pcc_copy_data(struct file *src, struct file *dst)
 		offset += rc2;
 	}
 
+	rc = offset;
 out_free:
 	kvfree(buf);
 	return rc;
@@ -2219,6 +2229,7 @@ int pcc_readwrite_attach(struct file *file, struct inode *inode,
 	struct dentry *dentry;
 	struct file *pcc_filp;
 	struct path path;
+	ssize_t ret;
 	int rc;
 
 	rc = pcc_attach_allowed_check(inode);
@@ -2232,27 +2243,35 @@ int pcc_readwrite_attach(struct file *file, struct inode *inode,
 
 	old_cred = override_creds(pcc_super_cred(inode->i_sb));
 	rc = __pcc_inode_create(dataset, &lli->lli_fid, &dentry);
-	if (rc) {
-		revert_creds(old_cred);
+	if (rc)
 		goto out_dataset_put;
-	}
 
 	path.mnt = dataset->pccd_path.mnt;
 	path.dentry = dentry;
-	pcc_filp = dentry_open(&path, O_TRUNC | O_WRONLY | O_LARGEFILE,
-			       current_cred());
+	pcc_filp = dentry_open(&path, O_WRONLY | O_LARGEFILE, current_cred());
 	if (IS_ERR_OR_NULL(pcc_filp)) {
 		rc = pcc_filp ? PTR_ERR(pcc_filp) : -EINVAL;
-		revert_creds(old_cred);
 		goto out_dentry;
 	}
 
-	rc = pcc_inode_store_ugpid(dentry, old_cred->uid, old_cred->gid);
-	revert_creds(old_cred);
+	rc = pcc_inode_reset_iattr(dentry, ATTR_UID | ATTR_GID,
+				   old_cred->uid, old_cred->gid, 0);
 	if (rc)
 		goto out_fput;
 
-	rc = pcc_copy_data(file, pcc_filp);
+	ret = pcc_copy_data(file, pcc_filp);
+	if (ret < 0) {
+		rc = ret;
+		goto out_fput;
+	}
+
+	/*
+	 * The PCC copy must be truncated to the same size as the Lustre
+	 * copy after copying the data. Otherwise, the file may have a
+	 * wrong size after re-attach. See LU-13023 for details.
+	 */
+	rc = pcc_inode_reset_iattr(dentry, ATTR_SIZE, KUIDT_INIT(0),
+				   KGIDT_INIT(0), ret);
 	if (rc)
 		goto out_fput;
 
@@ -2276,13 +2295,13 @@ int pcc_readwrite_attach(struct file *file, struct inode *inode,
 	fput(pcc_filp);
 out_dentry:
 	if (rc) {
-		old_cred = override_creds(pcc_super_cred(inode->i_sb));
 		(void) pcc_inode_remove(inode, dentry);
-		revert_creds(old_cred);
 		dput(dentry);
 	}
 out_dataset_put:
 	pcc_dataset_put(dataset);
+	revert_creds(old_cred);
+
 	return rc;
 }
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 526/622] lustre: pcc: auto attach not work after client cache clear
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (524 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 525/622] lustre: pcc: Incorrect size after re-attach James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 527/622] lustre: pcc: Init saved dataset flags properly James Simmons
                   ` (96 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Qian Yingjin <qian@ddn.com>

When the inode of a PCC-cached file in unused state is evicted
from the icache due to memory pressure or manual icache cleanup (i.e.
"echo 3 > /proc/sys/vm/drop_caches"), the file is also detached
from PCC, and all PCC state for it is cleared.
In the current design, PCC only tries to auto attach a file that was
once attached into PCC according to the in-memory PCC state. Thus
later I/O for the file is not directed to PCC and will trigger a
data restore.

If this is not the desired result for the user, then we need to try
to auto attach files that were never attached into PCC, or were once
attached but detached as a result of their inodes being shrunk from
the icache.

Although this increases the number of auto-attach candidates, only
files in the HSM released state (which can be determined directly
from the file layout) will be checked.

This bug is easily reproduced on RHEL 8. It seems that the command
"echo 3 > /proc/sys/vm/drop_caches" drops all unused inodes from
the icache there, but this is not the case on RHEL 7.
This patch also adds the check for the input parameter @rwid,
which should be non zero value and same as the archive ID.

WC-bug-id: https://jira.whamcloud.com/browse/LU-13030
Lustre-commit: a5ef2d6e068e ("LU-13030 pcc: auto attach not work after client cache clear")
Signed-off-by: Qian Yingjin <qian@ddn.com>
Reviewed-on: https://review.whamcloud.com/36892
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_internal.h        |  27 ++++-
 fs/lustre/llite/llite_lib.c             |   2 +
 fs/lustre/llite/pcc.c                   | 194 ++++++++++++++++++++++++++------
 fs/lustre/llite/pcc.h                   |  26 +++--
 include/uapi/linux/lustre/lustre_user.h |  10 --
 5 files changed, 204 insertions(+), 55 deletions(-)

diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index 205ea50..8e7b949 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -206,7 +206,22 @@ struct ll_inode_info {
 
 			struct mutex		 lli_pcc_lock;
 			enum lu_pcc_state_flags	 lli_pcc_state;
-			struct pcc_inode	*lli_pcc_inode;
+			/*
+			 * @lli_pcc_generation saves the global PCC generation
+			 * when the file was successfully attached into PCC.
+			 * The flags of the PCC dataset are saved in
+			 * @lli_pcc_dsflags.
+			 * The global PCC generation will be increased when
+			 * adding or deleting a PCC backend, or changing the
+			 * configuration parameters for PCC.
+			 * If @lli_pcc_generation is the same as the global PCC
+			 * generation, we can use the saved flags of the PCC
+			 * dataset to safely determine whether we need to try
+			 * auto attach.
+			 */
+			u64				lli_pcc_generation;
+			enum pcc_dataset_flags		lli_pcc_dsflags;
+			struct pcc_inode		*lli_pcc_inode;
 			struct mutex			lli_group_mutex;
 			u64				lli_group_users;
 			unsigned long			lli_group_gid;
@@ -1432,4 +1447,14 @@ int cl_setattr_ost(struct cl_object *obj, const struct iattr *attr,
 u64 cl_fid_build_ino(const struct lu_fid *fid, bool api32);
 u32 cl_fid_build_gen(const struct lu_fid *fid);
 
+static inline struct pcc_super *ll_i2pccs(struct inode *inode)
+{
+	return &ll_i2sbi(inode)->ll_pcc_super;
+}
+
+static inline struct pcc_super *ll_info2pccs(struct ll_inode_info *lli)
+{
+	return ll_i2pccs(ll_info2i(lli));
+}
+
 #endif /* LLITE_INTERNAL_H */
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 1245336..c2baf6a 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -983,6 +983,8 @@ void ll_lli_init(struct ll_inode_info *lli)
 		mutex_init(&lli->lli_pcc_lock);
 		lli->lli_pcc_state = PCC_STATE_FL_NONE;
 		lli->lli_pcc_inode = NULL;
+		lli->lli_pcc_dsflags = PCC_DATASET_NONE;
+		lli->lli_pcc_generation = 0;
 		mutex_init(&lli->lli_group_mutex);
 		lli->lli_group_users = 0;
 		lli->lli_group_gid = 0;
diff --git a/fs/lustre/llite/pcc.c b/fs/lustre/llite/pcc.c
index 550045b..a0e31c8 100644
--- a/fs/lustre/llite/pcc.c
+++ b/fs/lustre/llite/pcc.c
@@ -126,6 +126,7 @@ int pcc_super_init(struct pcc_super *super)
 	cap_lower(cred->cap_effective, CAP_SYS_RESOURCE);
 	init_rwsem(&super->pccs_rw_sem);
 	INIT_LIST_HEAD(&super->pccs_datasets);
+	super->pccs_generation = 1;
 
 	return 0;
 }
@@ -553,6 +554,12 @@ static int pcc_id_parse(struct pcc_cmd *cmd, const char *id)
 		 */
 		if ((cmd->u.pccc_add.pccc_flags & PCC_DATASET_PCC_ALL) == 0)
 			cmd->u.pccc_add.pccc_flags |= PCC_DATASET_PCC_ALL;
+
+		/* For RW-PCC, the value of @rwid must be non-zero. */
+		if (cmd->u.pccc_add.pccc_flags & PCC_DATASET_RWPCC &&
+		    cmd->u.pccc_add.pccc_rwid == 0)
+			return -EINVAL;
+
 		break;
 	case PCC_DEL_DATASET:
 	case PCC_CLEAR_ALL:
@@ -840,6 +847,7 @@ struct pcc_dataset *
 		if (strcmp(dataset->pccd_pathname, pathname) == 0) {
 			list_del_init(&dataset->pccd_linkage);
 			pcc_dataset_put(dataset);
+			super->pccs_generation++;
 			rc = 0;
 			break;
 		}
@@ -880,6 +888,7 @@ static void pcc_remove_datasets(struct pcc_super *super)
 		list_del(&dataset->pccd_linkage);
 		pcc_dataset_put(dataset);
 	}
+	super->pccs_generation++;
 	up_write(&super->pccs_rw_sem);
 }
 
@@ -1101,9 +1110,15 @@ void pcc_file_init(struct pcc_file *pccf)
 	pccf->pccf_type = LU_PCC_NONE;
 }
 
-static inline bool pcc_auto_attach_enabled(struct pcc_dataset *dataset)
+static inline bool pcc_auto_attach_enabled(enum pcc_dataset_flags flags,
+					   enum pcc_io_type iot)
 {
-	return dataset->pccd_flags & PCC_DATASET_AUTO_ATTACH;
+	if (iot == PIT_OPEN)
+		return flags & PCC_DATASET_OPEN_ATTACH;
+	if (iot == PIT_GETATTR)
+		return flags & PCC_DATASET_STAT_ATTACH;
+	else
+		return flags & PCC_DATASET_AUTO_ATTACH;
 }
 
 static const char pcc_xattr_layout[] = XATTR_USER_PREFIX "PCC.layout";
@@ -1114,7 +1129,7 @@ static int pcc_layout_xattr_set(struct pcc_inode *pcci, u32 gen)
 	struct ll_inode_info *lli = pcci->pcci_lli;
 	int rc;
 
-	if (!(lli->lli_pcc_state & PCC_STATE_FL_AUTO_ATTACH))
+	if (!(lli->lli_pcc_dsflags & PCC_DATASET_AUTO_ATTACH))
 		return 0;
 
 	rc = __vfs_setxattr(pcc_dentry, pcc_dentry->d_inode, pcc_xattr_layout,
@@ -1166,21 +1181,33 @@ static void pcc_inode_attach_init(struct pcc_dataset *dataset,
 				  struct dentry *dentry,
 				  enum lu_pcc_type type)
 {
-	struct ll_inode_info *lli = pcci->pcci_lli;
-
 	pcci->pcci_path.mnt = mntget(dataset->pccd_path.mnt);
 	pcci->pcci_path.dentry = dentry;
 	LASSERT(atomic_read(&pcci->pcci_refcount) == 0);
 	atomic_set(&pcci->pcci_refcount, 1);
 	pcci->pcci_type = type;
 	pcci->pcci_attr_valid = false;
+}
 
-	if (dataset->pccd_flags & PCC_DATASET_OPEN_ATTACH)
-		lli->lli_pcc_state |= PCC_STATE_FL_OPEN_ATTACH;
-	if (dataset->pccd_flags & PCC_DATASET_IO_ATTACH)
-		lli->lli_pcc_state |= PCC_STATE_FL_IO_ATTACH;
-	if (dataset->pccd_flags & PCC_DATASET_STAT_ATTACH)
-		lli->lli_pcc_state |= PCC_STATE_FL_STAT_ATTACH;
+static inline void pcc_inode_dsflags_set(struct ll_inode_info *lli,
+					 struct pcc_dataset *dataset)
+{
+	lli->lli_pcc_generation = ll_info2pccs(lli)->pccs_generation;
+	lli->lli_pcc_dsflags = dataset->pccd_flags;
+}
+
+static void pcc_inode_attach_set(struct pcc_super *super,
+				 struct pcc_dataset *dataset,
+				 struct ll_inode_info *lli,
+				 struct pcc_inode *pcci,
+				 struct dentry *dentry,
+				 enum lu_pcc_type type)
+{
+	pcc_inode_init(pcci, lli);
+	pcc_inode_attach_init(dataset, pcci, dentry, type);
+	down_read(&super->pccs_rw_sem);
+	pcc_inode_dsflags_set(lli, dataset);
+	up_read(&super->pccs_rw_sem);
 }
 
 static inline void pcc_layout_gen_set(struct pcc_inode *pcci,
@@ -1263,6 +1290,7 @@ static int pcc_try_dataset_attach(struct inode *inode, u32 gen,
 			pcc_inode_get(pcci);
 			pcci->pcci_type = type;
 		}
+		pcc_inode_dsflags_set(lli, dataset);
 		pcc_layout_gen_set(pcci, gen);
 		*cached = true;
 	}
@@ -1274,28 +1302,83 @@ static int pcc_try_dataset_attach(struct inode *inode, u32 gen,
 	return rc;
 }
 
-static int pcc_try_datasets_attach(struct inode *inode, u32 gen,
-				   enum lu_pcc_type type, bool *cached)
+static int pcc_try_datasets_attach(struct inode *inode, enum pcc_io_type iot,
+				   u32 gen, enum lu_pcc_type type,
+				   bool *cached)
 {
-	struct pcc_dataset *dataset, *tmp;
 	struct pcc_super *super = &ll_i2sbi(inode)->ll_pcc_super;
+	struct ll_inode_info *lli = ll_i2info(inode);
+	struct pcc_dataset *dataset = NULL, *tmp;
 	int rc = 0;
 
 	down_read(&super->pccs_rw_sem);
 	list_for_each_entry_safe(dataset, tmp,
 				 &super->pccs_datasets, pccd_linkage) {
-		if (!pcc_auto_attach_enabled(dataset))
-			continue;
+		if (!pcc_auto_attach_enabled(dataset->pccd_flags, iot))
+			break;
+
 		rc = pcc_try_dataset_attach(inode, gen, type, dataset, cached);
 		if (rc < 0 || (!rc && *cached))
 			break;
 	}
+
+	/*
+	 * Update the saved dataset flags for the inode if the attach failed.
+	 */
+	if (!rc && !*cached) {
+		/*
+		 * Currently the auto attach strategy for a PCC backend is
+		 * unchangeable once it was added into the PCC datasets on
+		 * a client, as the support to change the auto attach
+		 * strategy is not implemented yet.
+		 */
+		/*
+		 * If tried to attach from one PCC backend:
+		 * @lli_pcc_generation > 0:
+		 * 1) The file was once attached into PCC, but now the
+		 * corresponding PCC backend should be removed from the client;
+		 * 2) The layout generation was changed, the data has been
+		 * restored;
+		 * 3) The corresponding PCC copy is not existed on PCC
+		 * @lli_pcc_generation == 0:
+		 * The file is never attached into PCC but in a HSM released
+		 * state, or once attached into PCC but the inode was evicted
+		 * from icache later.
+		 * Set the saved dataset flags with PCC_DATASET_NONE. Then this
+		 * file will skip from the candidates to try auto attach until
+		 * the file is attached ninto PCC again.
+		 *
+		 * If the file was never attached into PCC, or once attached but
+		 * its inode was evicted from icache (lli_pcc_generation == 0),
+		 * set the saved dataset flags with PCC_DATASET_NONE.
+		 *
+		 * If the file was once attached into PCC but the corresponding
+		 * dataset was removed from the client, set the saved dataset
+		 * flags with PCC_DATASET_NONE.
+		 *
+		 * TODO: If the file was once attached into PCC but not try to
+		 * auto attach due to the change of the configuration parameters
+		 * for this dataset (i.e. change from auto attach enabled to
+		 * auto attach disabled for this dataset), update the saved
+		 * dataset flags witha the found one.
+		 */
+		lli->lli_pcc_dsflags = PCC_DATASET_NONE;
+	}
 	up_read(&super->pccs_rw_sem);
 
 	return rc;
 }
 
-static int pcc_try_auto_attach(struct inode *inode, bool *cached, bool is_open)
+/*
+ * TODO: For RW-PCC, it is desirable to store HSM info as a layout (LU-10606).
+ * Thus the client can get the archive ID from the layout directly. When
+ * trying to auto attach a file that is in HSM released state (according to
+ * LOV_PATTERN_F_RELEASED in the layout), it can determine more precisely
+ * whether the file is validly cached on PCC, according to the @rwid
+ * (archive ID) in the PCC dataset and the archive ID in the HSM attrs.
+ */
+static int pcc_try_auto_attach(struct inode *inode, bool *cached,
+			       enum pcc_io_type iot)
 {
 	struct pcc_super *super = &ll_i2sbi(inode)->ll_pcc_super;
 	struct cl_layout clt = {
@@ -1317,7 +1400,7 @@ static int pcc_try_auto_attach(struct inode *inode, bool *cached, bool is_open)
 	 * obtain valid layout lock from MDT (i.e. the file is being
 	 * HSM restoring).
 	 */
-	if (is_open) {
+	if (iot == PIT_OPEN) {
 		if (ll_layout_version_get(lli) == CL_LAYOUT_GEN_NONE)
 			return 0;
 	} else {
@@ -1330,19 +1413,54 @@ static int pcc_try_auto_attach(struct inode *inode, bool *cached, bool is_open)
 	if (rc)
 		return rc;
 
-	if (!is_open && gen != clt.cl_layout_gen) {
+	if (iot != PIT_OPEN && gen != clt.cl_layout_gen) {
 		CDEBUG(D_CACHE, DFID" layout changed from %d to %d.\n",
 		       PFID(ll_inode2fid(inode)), gen, clt.cl_layout_gen);
 		return -EINVAL;
 	}
 
 	if (clt.cl_is_released)
-		rc = pcc_try_datasets_attach(inode, clt.cl_layout_gen,
+		rc = pcc_try_datasets_attach(inode, iot, clt.cl_layout_gen,
 					     LU_PCC_READWRITE, cached);
 
 	return rc;
 }
 
+static inline bool pcc_may_auto_attach(struct inode *inode,
+				       enum pcc_io_type iot)
+{
+	struct ll_inode_info *lli = ll_i2info(inode);
+	struct pcc_super *super = ll_i2pccs(inode);
+
+	/* The file is known not to be in any PCC backend. */
+	if (lli->lli_pcc_dsflags & PCC_DATASET_NONE)
+		return false;
+
+	/*
+	 * lli_pcc_generation = 0 means that the file was never attached into
+	 * PCC, or may be once attached into PCC but detached as the inode is
+	 * evicted from icache (i.e. "echo 3 > /proc/sys/vm/drop_caches" or
+	 * icache shrinking due to the memory pressure), which will cause the
+	 * file detach from PCC when releasing the inode from icache.
+	 * In either case, we still try to attach.
+	 */
+	/* lli_pcc_generation == 0, or the PCC setting was changed,
+	 * or there is no PCC setup on the client and the try will return
+	 * immediately in pcc_try_auto_attch().
+	 */
+	if (super->pccs_generation != lli->lli_pcc_generation)
+		return true;
+
+	/* The cached setting @lli_pcc_dsflags is valid */
+	if (iot == PIT_OPEN)
+		return lli->lli_pcc_dsflags & PCC_DATASET_OPEN_ATTACH;
+
+	if (iot == PIT_GETATTR)
+		return lli->lli_pcc_dsflags & PCC_DATASET_STAT_ATTACH;
+
+	return lli->lli_pcc_dsflags & PCC_DATASET_IO_ATTACH;
+}
+
 int pcc_file_open(struct inode *inode, struct file *file)
 {
 	struct pcc_inode *pcci;
@@ -1365,8 +1483,8 @@ int pcc_file_open(struct inode *inode, struct file *file)
 		goto out_unlock;
 
 	if (!pcci || !pcc_inode_has_layout(pcci)) {
-		if (lli->lli_pcc_state & PCC_STATE_FL_OPEN_ATTACH)
-			rc = pcc_try_auto_attach(inode, &cached, true);
+		if (pcc_may_auto_attach(inode, PIT_OPEN))
+			rc = pcc_try_auto_attach(inode, &cached, PIT_OPEN);
 
 		if (rc < 0 || !cached)
 			goto out_unlock;
@@ -1429,7 +1547,6 @@ void pcc_file_release(struct inode *inode, struct file *file)
 
 static void pcc_io_init(struct inode *inode, enum pcc_io_type iot, bool *cached)
 {
-	struct ll_inode_info *lli = ll_i2info(inode);
 	struct pcc_inode *pcci;
 
 	pcc_inode_lock(inode);
@@ -1440,11 +1557,8 @@ static void pcc_io_init(struct inode *inode, enum pcc_io_type iot, bool *cached)
 		*cached = true;
 	} else {
 		*cached = false;
-		if ((lli->lli_pcc_state & PCC_STATE_FL_IO_ATTACH &&
-		     iot != PIT_GETATTR) ||
-		    (iot == PIT_GETATTR &&
-		     lli->lli_pcc_state & PCC_STATE_FL_STAT_ATTACH)) {
-			(void) pcc_try_auto_attach(inode, cached, false);
+		if (pcc_may_auto_attach(inode, iot)) {
+			(void) pcc_try_auto_attach(inode, cached, iot);
 			if (*cached) {
 				pcci = ll_i2pcci(inode);
 				LASSERT(atomic_read(&pcci->pcci_refcount) > 0);
@@ -2061,6 +2175,7 @@ int pcc_inode_create(struct super_block *sb, struct pcc_dataset *dataset,
 int pcc_inode_create_fini(struct inode *inode, struct pcc_create_attach *pca)
 {
 	struct dentry *pcc_dentry = pca->pca_dentry;
+	struct pcc_super *super = ll_i2pccs(inode);
 	const struct cred *old_cred;
 	struct pcc_inode *pcci;
 	int rc = 0;
@@ -2073,7 +2188,7 @@ int pcc_inode_create_fini(struct inode *inode, struct pcc_create_attach *pca)
 
 	LASSERT(pcc_dentry);
 
-	old_cred = override_creds(pcc_super_cred(inode->i_sb));
+	old_cred = override_creds(super->pccs_cred);
 	pcc_inode_lock(inode);
 	LASSERT(!ll_i2pcci(inode));
 	pcci = kmem_cache_zalloc(pcc_inode_slab, GFP_NOFS);
@@ -2087,9 +2202,8 @@ int pcc_inode_create_fini(struct inode *inode, struct pcc_create_attach *pca)
 	if (rc)
 		goto out_put;
 
-	pcc_inode_init(pcci, ll_i2info(inode));
-	pcc_inode_attach_init(pca->pca_dataset, pcci, pcc_dentry,
-			      LU_PCC_READWRITE);
+	pcc_inode_attach_set(super, pca->pca_dataset, ll_i2info(inode),
+			     pcci, pcc_dentry, LU_PCC_READWRITE);
 
 	rc = pcc_layout_xattr_set(pcci, 0);
 	if (rc) {
@@ -2224,6 +2338,7 @@ int pcc_readwrite_attach(struct file *file, struct inode *inode,
 {
 	struct pcc_dataset *dataset;
 	struct ll_inode_info *lli = ll_i2info(inode);
+	struct pcc_super *super = ll_i2pccs(inode);
 	struct pcc_inode *pcci;
 	const struct cred *old_cred;
 	struct dentry *dentry;
@@ -2241,7 +2356,7 @@ int pcc_readwrite_attach(struct file *file, struct inode *inode,
 	if (!dataset)
 		return -ENOENT;
 
-	old_cred = override_creds(pcc_super_cred(inode->i_sb));
+	old_cred = override_creds(super->pccs_cred);
 	rc = __pcc_inode_create(dataset, &lli->lli_fid, &dentry);
 	if (rc)
 		goto out_dataset_put;
@@ -2287,8 +2402,8 @@ int pcc_readwrite_attach(struct file *file, struct inode *inode,
 		goto out_unlock;
 	}
 
-	pcc_inode_init(pcci, lli);
-	pcc_inode_attach_init(dataset, pcci, dentry, LU_PCC_READWRITE);
+	pcc_inode_attach_set(super, dataset, lli, pcci,
+			     dentry, LU_PCC_READWRITE);
 out_unlock:
 	pcc_inode_unlock(inode);
 out_fput:
@@ -2417,8 +2532,15 @@ int pcc_ioctl_detach(struct inode *inode, u32 opt)
 	LASSERT(atomic_read(&pcci->pcci_refcount) > 0);
 
 	if (pcci->pcci_type == LU_PCC_READWRITE) {
-		if (opt == PCC_DETACH_OPT_UNCACHE)
+		if (opt == PCC_DETACH_OPT_UNCACHE) {
 			hsm_remove = true;
+			/*
+			 * The file will be removed from PCC, so set the flags
+			 * to PCC_DATASET_NONE even if the later removal of the
+			 * PCC copy fails.
+			 */
+			lli->lli_pcc_dsflags = PCC_DATASET_NONE;
+		}
 
 		__pcc_layout_invalidate(pcci);
 		pcc_inode_put(pcci);
diff --git a/fs/lustre/llite/pcc.h b/fs/lustre/llite/pcc.h
index ec2e421..60f9bea 100644
--- a/fs/lustre/llite/pcc.h
+++ b/fs/lustre/llite/pcc.h
@@ -92,20 +92,22 @@ struct pcc_matcher {
 };
 
 enum pcc_dataset_flags {
-	PCC_DATASET_NONE	= 0x0,
+	PCC_DATASET_INVALID	= 0x0,
+	/* Indicates that the file is known not to be in PCC. */
+	PCC_DATASET_NONE	= 0x1,
 	/* Try auto attach at open, enabled by default */
-	PCC_DATASET_OPEN_ATTACH	= 0x01,
+	PCC_DATASET_OPEN_ATTACH	= 0x02,
 	/* Try auto attach during IO when layout refresh, enabled by default */
-	PCC_DATASET_IO_ATTACH	= 0x02,
+	PCC_DATASET_IO_ATTACH	= 0x04,
 	/* Try auto attach at stat */
-	PCC_DATASET_STAT_ATTACH	= 0x04,
+	PCC_DATASET_STAT_ATTACH	= 0x08,
 	PCC_DATASET_AUTO_ATTACH	= PCC_DATASET_OPEN_ATTACH |
 				  PCC_DATASET_IO_ATTACH |
 				  PCC_DATASET_STAT_ATTACH,
 	/* PCC backend is only used for RW-PCC */
-	PCC_DATASET_RWPCC	= 0x08,
+	PCC_DATASET_RWPCC	= 0x10,
 	/* PCC backend is only used for RO-PCC */
-	PCC_DATASET_ROPCC	= 0x10,
+	PCC_DATASET_ROPCC	= 0x20,
 	/* PCC backend provides caching services for both RW-PCC and RO-PCC */
 	PCC_DATASET_PCC_ALL	= PCC_DATASET_RWPCC | PCC_DATASET_ROPCC,
 };
@@ -114,7 +116,7 @@ struct pcc_dataset {
 	u32			pccd_rwid;	 /* Archive ID */
 	u32			pccd_roid;	 /* Readonly ID */
 	struct pcc_match_rule	pccd_rule;	 /* Match rule */
-	enum pcc_dataset_flags	pccd_flags;	 /* flags of PCC backend */
+	enum pcc_dataset_flags	pccd_flags;	 /* Flags of PCC backend */
 	char			pccd_pathname[PATH_MAX]; /* full path */
 	struct path		pccd_path;	 /* Root path */
 	struct list_head	pccd_linkage;  /* Linked to pccs_datasets */
@@ -128,6 +130,12 @@ struct pcc_super {
 	struct list_head	 pccs_datasets;
 	/* creds of process who forced instantiation of super block */
 	const struct cred	*pccs_cred;
+	/*
+	 * Global PCC generation: it is increased whenever the configuration
+	 * for PCC is changed, i.e. a PCC backend is added or deleted, or
+	 * the parameters for PCC are modified.
+	 */
+	u64			pccs_generation;
 };
 
 struct pcc_inode {
@@ -177,7 +185,9 @@ enum pcc_io_type {
 	/* fsync system call handling */
 	PIT_FSYNC,
 	/* splice_read system call */
-	PIT_SPLICE_READ
+	PIT_SPLICE_READ,
+	/* open system call */
+	PIT_OPEN,
 };
 
 enum pcc_cmd_type {
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index b46f52b..12b1f78 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -2180,16 +2180,6 @@ enum lu_pcc_state_flags {
 	PCC_STATE_FL_ATTR_VALID		= 0x01,
 	/* The file is being attached into PCC */
 	PCC_STATE_FL_ATTACHING		= 0x02,
-	/* Allow to auto attach at open */
-	PCC_STATE_FL_OPEN_ATTACH	= 0x04,
-	/* Allow to auto attach during I/O after layout lock revocation */
-	PCC_STATE_FL_IO_ATTACH		= 0x08,
-	/* Allow to auto attach at stat */
-	PCC_STATE_FL_STAT_ATTACH	= 0x10,
-	/* Allow to auto attach at the next open or layout refresh */
-	PCC_STATE_FL_AUTO_ATTACH	= PCC_STATE_FL_OPEN_ATTACH |
-					  PCC_STATE_FL_IO_ATTACH |
-					  PCC_STATE_FL_STAT_ATTACH,
 };
 
 struct lu_pcc_state {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 527/622] lustre: pcc: Init saved dataset flags properly
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (525 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 526/622] lustre: pcc: auto attach not work after client cache clear James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 528/622] lustre: use simple sleep in some cases James Simmons
                   ` (95 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Qian Yingjin <qian@ddn.com>

When initializing a new inode, the saved dataset flags are wrongly
set to PCC_DATASET_NONE, which means that the file is known to be in
none of the PCC datasets.
This patch corrects the initial value to PCC_DATASET_INVALID.

WC-bug-id: https://jira.whamcloud.com/browse/LU-13030
Lustre-commit: e467a421c7aa ("LU-13030 pcc: Init saved dataset flags properly")
Signed-off-by: Qian Yingjin <qian@ddn.com>
Reviewed-on: https://review.whamcloud.com/36923
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Li Xi <lixi@ddn.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_lib.c |  2 +-
 fs/lustre/llite/pcc.c       | 13 +++++--------
 2 files changed, 6 insertions(+), 9 deletions(-)

diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index c2baf6a..384b55b 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -983,7 +983,7 @@ void ll_lli_init(struct ll_inode_info *lli)
 		mutex_init(&lli->lli_pcc_lock);
 		lli->lli_pcc_state = PCC_STATE_FL_NONE;
 		lli->lli_pcc_inode = NULL;
-		lli->lli_pcc_dsflags = PCC_DATASET_NONE;
+		lli->lli_pcc_dsflags = PCC_DATASET_INVALID;
 		lli->lli_pcc_generation = 0;
 		mutex_init(&lli->lli_group_mutex);
 		lli->lli_group_users = 0;
diff --git a/fs/lustre/llite/pcc.c b/fs/lustre/llite/pcc.c
index a0e31c8..3a2c8f2 100644
--- a/fs/lustre/llite/pcc.c
+++ b/fs/lustre/llite/pcc.c
@@ -1346,21 +1346,18 @@ static int pcc_try_datasets_attach(struct inode *inode, enum pcc_io_type iot,
 		 * from icache later.
 		 * Set the saved dataset flags with PCC_DATASET_NONE. Then this
 		 * file will skip from the candidates to try auto attach until
-		 * the file is attached ninto PCC again.
+		 * the file is attached into PCC again.
 		 *
 		 * If the file was never attached into PCC, or once attached but
 		 * its inode was evicted from icache (lli_pcc_generation == 0),
+		 * or the corresponding dataset was removed from the client,
 		 * set the saved dataset flags with PCC_DATASET_NONE.
 		 *
-		 * If the file was once attached into PCC but the corresponding
-		 * dataset was removed from the client, set the saved dataset
-		 * flags with PCC_DATASET_NONE.
-		 *
 		 * TODO: If the file was once attached into PCC but not try to
 		 * auto attach due to the change of the configuration parameters
 		 * for this dataset (i.e. change from auto attach enabled to
 		 * auto attach disabled for this dataset), update the saved
-		 * dataset flags witha the found one.
+		 * dataset flags with the found one.
 		 */
 		lli->lli_pcc_dsflags = PCC_DATASET_NONE;
 	}
@@ -1437,7 +1434,7 @@ static inline bool pcc_may_auto_attach(struct inode *inode,
 		return false;
 
 	/*
-	 * lli_pcc_generation = 0 means that the file was never attached into
+	 * lli_pcc_generation == 0 means that the file was never attached into
 	 * PCC, or may be once attached into PCC but detached as the inode is
 	 * evicted from icache (i.e. "echo 3 > /proc/sys/vm/drop_caches" or
 	 * icache shrinking due to the memory pressure), which will cause the
@@ -1446,7 +1443,7 @@ static inline bool pcc_may_auto_attach(struct inode *inode,
 	 */
 	/* lli_pcc_generation == 0, or the PCC setting was changed,
 	 * or there is no PCC setup on the client and the try will return
-	 * immediately in pcc_try_auto_attch().
+	 * immediately in pcc_try_auto_attach().
 	 */
 	if (super->pccs_generation != lli->lli_pcc_generation)
 		return true;
-- 
1.8.3.1


* [lustre-devel] [PATCH 528/622] lustre: use simple sleep in some cases
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (526 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 527/622] lustre: pcc: Init saved dataset flags properly James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 529/622] lustre: lov: use wait_event() in lov_subobject_kill() James Simmons
                   ` (94 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

To match the OpenSFS branch, change schedule_timeout_uninterruptible()
to ssleep(). In mdc_request.c the change to ssleep() lets us remove
a wait queue. In seq_client_alloc_meta(), wait 2 seconds before
attempting to run seq_client_rpc() again.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10467
Lustre-commit: 077b35568be5 ("LU-10467 lustre: don't use l_wait_event() for simple sleep.")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35966
Lustre-commit: d0ca764a1a91 ("LU-10467 lustre: don't use l_wait_event() for poll loops.")
Reviewed-on: https://review.whamcloud.com/35968
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Petros Koutoupis <pkoutoupis@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/fid/fid_request.c | 7 +++++++
 fs/lustre/llite/llite_lib.c | 3 ++-
 fs/lustre/lov/lov_request.c | 4 +++-
 fs/lustre/mdc/mdc_request.c | 5 ++---
 fs/lustre/ptlrpc/events.c   | 7 +++----
 5 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/fs/lustre/fid/fid_request.c b/fs/lustre/fid/fid_request.c
index a54d1e5..6cede30 100644
--- a/fs/lustre/fid/fid_request.c
+++ b/fs/lustre/fid/fid_request.c
@@ -40,6 +40,7 @@
 #define DEBUG_SUBSYSTEM S_FID
 
 #include <linux/module.h>
+#include <linux/delay.h>
 
 #include <obd.h>
 #include <obd_class.h>
@@ -155,6 +156,12 @@ static int seq_client_alloc_meta(const struct lu_env *env,
 		 */
 		rc = seq_client_rpc(seq, &seq->lcs_space,
 				    SEQ_ALLOC_META, "meta");
+		if (rc == -EINPROGRESS || rc == -EAGAIN)
+			/* MDT0 is not ready, let's wait for 2
+			 * seconds and retry.
+			 */
+			ssleep(2);
+
 	} while (rc == -EINPROGRESS || rc == -EAGAIN);
 
 	return rc;
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 384b55b..7e128f0 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -42,6 +42,7 @@
 #include <linux/statfs.h>
 #include <linux/types.h>
 #include <linux/mm.h>
+#include <linux/delay.h>
 #include <linux/uuid.h>
 #include <linux/random.h>
 #include <linux/security.h>
@@ -2344,7 +2345,7 @@ void ll_umount_begin(struct super_block *sb)
 	 * to decrement mnt_cnt and hope to finish it within 10sec.
 	 */
 	while (cnt < 10 && !may_umount(sbi->ll_mnt.mnt)) {
-		schedule_timeout_uninterruptible(HZ);
+		ssleep(1);
 		cnt++;
 	}
 
diff --git a/fs/lustre/lov/lov_request.c b/fs/lustre/lov/lov_request.c
index added19..d263cec 100644
--- a/fs/lustre/lov/lov_request.c
+++ b/fs/lustre/lov/lov_request.c
@@ -33,6 +33,8 @@
 
 #define DEBUG_SUBSYSTEM S_LOV
 
+#include <linux/delay.h>
+
 #include <obd_class.h>
 #include <uapi/linux/lustre/lustre_idl.h>
 #include "lov_internal.h"
@@ -130,7 +132,7 @@ static int lov_check_and_wait_active(struct lov_obd *lov, int ost_idx)
 	mutex_unlock(&lov->lov_lock);
 
 	while (cnt < obd_timeout && !lov_check_set(lov, ost_idx)) {
-		schedule_timeout_uninterruptible(HZ);
+		ssleep(1);
 		cnt++;
 	}
 	if (tgt->ltd_active)
diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index 287013f..54f6d15 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -37,6 +37,7 @@
 #include <linux/pagemap.h>
 #include <linux/init.h>
 #include <linux/utsname.h>
+#include <linux/delay.h>
 #include <linux/file.h>
 #include <linux/kthread.h>
 #include <linux/prefetch.h>
@@ -1043,13 +1044,11 @@ static int mdc_getpage(struct obd_export *exp, const struct lu_fid *fid,
 {
 	struct ptlrpc_bulk_desc *desc;
 	struct ptlrpc_request *req;
-	wait_queue_head_t waitq;
 	int resends = 0;
 	int rc;
 	int i;
 
 	*request = NULL;
-	init_waitqueue_head(&waitq);
 
 restart_bulk:
 	req = ptlrpc_request_alloc(class_exp2cliimp(exp), &RQF_MDS_READPAGE);
@@ -1093,7 +1092,7 @@ static int mdc_getpage(struct obd_export *exp, const struct lu_fid *fid,
 			       exp->exp_obd->obd_name, -EIO);
 			return -EIO;
 		}
-		wait_event_idle_timeout(waitq, 0, resends * HZ);
+		ssleep(resends);
 
 		goto restart_bulk;
 	}
diff --git a/fs/lustre/ptlrpc/events.c b/fs/lustre/ptlrpc/events.c
index e6a49db..ce13aa6 100644
--- a/fs/lustre/ptlrpc/events.c
+++ b/fs/lustre/ptlrpc/events.c
@@ -34,9 +34,8 @@
 #define DEBUG_SUBSYSTEM S_RPC
 
 #include <linux/libcfs/libcfs.h>
-# ifdef __mips64__
-#  include <linux/kernel.h>
-# endif
+#include <linux/kernel.h>
+#include <linux/delay.h>
 
 #include <obd_class.h>
 #include <lustre_net.h>
@@ -522,7 +521,7 @@ static void ptlrpc_ni_fini(void)
 			if (retries != 0)
 				CWARN("Event queue still busy\n");
 
-			schedule_timeout_uninterruptible(2 * HZ);
+			ssleep(2);
 			break;
 		}
 	}
-- 
1.8.3.1


* [lustre-devel] [PATCH 529/622] lustre: lov: use wait_event() in lov_subobject_kill()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (527 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 528/622] lustre: use simple sleep in some cases James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 530/622] lustre: llite: use wait_event in cl_object_put_last() James Simmons
                   ` (93 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: NeilBrown <neilb@suse.com>

lov_subobject_kill() has an open-coded version
of wait_event(). Change it to use the macro.

There is no need to take a spinlock just to check if a variable has
changed value. If there were, the first test would be protected too.

"lti_waiter" now has no users and can be removed from lov_thread_info.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10467
Lustre-commit: c0894d1d32670 ("LU-10467 lov: use wait_event() in lov_subobject_kill()")
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/36343
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/lov/lov_cl_internal.h |  1 -
 fs/lustre/lov/lov_object.c      | 24 +-----------------------
 2 files changed, 1 insertion(+), 24 deletions(-)

diff --git a/fs/lustre/lov/lov_cl_internal.h b/fs/lustre/lov/lov_cl_internal.h
index 8791e69..e21439d 100644
--- a/fs/lustre/lov/lov_cl_internal.h
+++ b/fs/lustre/lov/lov_cl_internal.h
@@ -474,7 +474,6 @@ struct lov_thread_info {
 	struct ost_lvb		lti_lvb;
 	struct cl_2queue	lti_cl2q;
 	struct cl_page_list     lti_plist;
-	wait_queue_entry_t	lti_waiter;
 };
 
 /**
diff --git a/fs/lustre/lov/lov_object.c b/fs/lustre/lov/lov_object.c
index f2c7bc2..2a35993 100644
--- a/fs/lustre/lov/lov_object.c
+++ b/fs/lustre/lov/lov_object.c
@@ -287,7 +287,6 @@ static void lov_subobject_kill(const struct lu_env *env, struct lov_object *lov,
 	struct cl_object *sub;
 	struct lu_site *site;
 	wait_queue_head_t *wq;
-	wait_queue_entry_t *waiter;
 
 	LASSERT(r0->lo_sub[idx] == los);
 
@@ -303,28 +302,7 @@ static void lov_subobject_kill(const struct lu_env *env, struct lov_object *lov,
 	/* ... wait until it is actually destroyed---sub-object clears its
 	 * ->lo_sub[] slot in lovsub_object_free()
 	 */
-	if (r0->lo_sub[idx] == los) {
-		waiter = &lov_env_info(env)->lti_waiter;
-		init_waitqueue_entry(waiter, current);
-		add_wait_queue(wq, waiter);
-		set_current_state(TASK_UNINTERRUPTIBLE);
-		while (1) {
-			/* this wait-queue is signaled at the end of
-			 * lu_object_free().
-			 */
-			set_current_state(TASK_UNINTERRUPTIBLE);
-			spin_lock(&r0->lo_sub_lock);
-			if (r0->lo_sub[idx] == los) {
-				spin_unlock(&r0->lo_sub_lock);
-				schedule();
-			} else {
-				spin_unlock(&r0->lo_sub_lock);
-				set_current_state(TASK_RUNNING);
-				break;
-			}
-		}
-		remove_wait_queue(wq, waiter);
-	}
+	wait_event(*wq, r0->lo_sub[idx] != los);
 	LASSERT(!r0->lo_sub[idx]);
 }
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 530/622] lustre: llite: use wait_event in cl_object_put_last()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (528 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 529/622] lustre: lov: use wait_event() in lov_subobject_kill() James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 531/622] lustre: modules: Use LIST_HEAD for declaring list_heads James Simmons
                   ` (92 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: NeilBrown <neilb@suse.com>

cl_object_put_last() contains an open-coded version
of wait_event().
Replace it with the library macro.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10467
Lustre-commit: f963f19c94b5 ("LU-10467 llite: use wait_event in cl_object_put_last()")
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/36345
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Petros Koutoupis <pkoutoupis@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/lcommon_cl.c | 14 +-------------
 1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/fs/lustre/llite/lcommon_cl.c b/fs/lustre/llite/lcommon_cl.c
index 3129316..76f76a0 100644
--- a/fs/lustre/llite/lcommon_cl.c
+++ b/fs/lustre/llite/lcommon_cl.c
@@ -221,7 +221,6 @@ int cl_file_inode_init(struct inode *inode, struct lustre_md *md)
 static void cl_object_put_last(struct lu_env *env, struct cl_object *obj)
 {
 	struct lu_object_header *header = obj->co_lu.lo_header;
-	wait_queue_entry_t waiter;
 
 	if (unlikely(atomic_read(&header->loh_ref) != 1)) {
 		struct lu_site *site = obj->co_lu.lo_dev->ld_site;
@@ -229,18 +228,7 @@ static void cl_object_put_last(struct lu_env *env, struct cl_object *obj)
 
 		wq = lu_site_wq_from_fid(site, &header->loh_fid);
 
-		init_waitqueue_entry(&waiter, current);
-		add_wait_queue(wq, &waiter);
-
-		while (1) {
-			set_current_state(TASK_UNINTERRUPTIBLE);
-			if (atomic_read(&header->loh_ref) == 1)
-				break;
-			schedule();
-		}
-
-		set_current_state(TASK_RUNNING);
-		remove_wait_queue(wq, &waiter);
+		wait_event(*wq, atomic_read(&header->loh_ref) == 1);
 	}
 
 	cl_object_put(env, obj);
-- 
1.8.3.1


* [lustre-devel] [PATCH 531/622] lustre: modules: Use LIST_HEAD for declaring list_heads
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (529 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 530/622] lustre: llite: use wait_event in cl_object_put_last() James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 532/622] lustre: handle: move refcount into the lustre_handle James Simmons
                   ` (91 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

Rather than
  struct list_head foo = LIST_HEAD_INIT(foo);
use
  LIST_HEAD(foo);

This is shorter and more in-keeping with upstream style.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9679
Lustre-commit: 546993d587c5 ("LU-9679 modules: Use LIST_HEAD for declaring list_heads")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/36669
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/mdc/mdc_locks.c   | 2 +-
 fs/lustre/mdc/mdc_reint.c   | 2 +-
 fs/lustre/osc/osc_request.c | 2 +-
 fs/lustre/ptlrpc/pinger.c   | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/lustre/mdc/mdc_locks.c b/fs/lustre/mdc/mdc_locks.c
index b91c162..4d40087 100644
--- a/fs/lustre/mdc/mdc_locks.c
+++ b/fs/lustre/mdc/mdc_locks.c
@@ -581,7 +581,7 @@ static struct ptlrpc_request *mdc_intent_layout_pack(struct obd_export *exp,
 						     struct md_op_data *op_data)
 {
 	struct obd_device *obd = class_exp2obd(exp);
-	struct list_head cancels = LIST_HEAD_INIT(cancels);
+	LIST_HEAD(cancels);
 	struct ptlrpc_request *req;
 	struct ldlm_intent *lit;
 	struct layout_intent *layout;
diff --git a/fs/lustre/mdc/mdc_reint.c b/fs/lustre/mdc/mdc_reint.c
index d26e27d..0dc0de4 100644
--- a/fs/lustre/mdc/mdc_reint.c
+++ b/fs/lustre/mdc/mdc_reint.c
@@ -470,7 +470,7 @@ int mdc_rename(struct obd_export *exp, struct md_op_data *op_data,
 
 int mdc_file_resync(struct obd_export *exp, struct md_op_data *op_data)
 {
-	struct list_head cancels = LIST_HEAD_INIT(cancels);
+	LIST_HEAD(cancels);
 	struct ptlrpc_request *req;
 	struct ldlm_lock *lock;
 	struct mdt_rec_resync *rec;
diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index 39cac7d..d6761dd 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -3382,7 +3382,7 @@ int osc_cleanup_common(struct obd_device *obd)
 	.quotactl	= osc_quotactl,
 };
 
-struct list_head osc_shrink_list = LIST_HEAD_INIT(osc_shrink_list);
+LIST_HEAD(osc_shrink_list);
 DEFINE_SPINLOCK(osc_shrink_lock);
 
 static struct shrinker osc_cache_shrinker = {
diff --git a/fs/lustre/ptlrpc/pinger.c b/fs/lustre/ptlrpc/pinger.c
index f584fc6..d8f57bb 100644
--- a/fs/lustre/ptlrpc/pinger.c
+++ b/fs/lustre/ptlrpc/pinger.c
@@ -43,7 +43,7 @@
 
 struct mutex pinger_mutex;
 static LIST_HEAD(pinger_imports);
-static struct list_head timeout_list = LIST_HEAD_INIT(timeout_list);
+static LIST_HEAD(timeout_list);
 
 struct ptlrpc_request *
 ptlrpc_prep_ping(struct obd_import *imp)
-- 
1.8.3.1


* [lustre-devel] [PATCH 532/622] lustre: handle: move refcount into the lustre_handle.
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (530 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 531/622] lustre: modules: Use LIST_HEAD for declaring list_heads James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 533/622] lustre: llite: support page unaligned stride readahead James Simmons
                   ` (90 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: NeilBrown <neilb@suse.com>

Most objects with a lustre_handle have a refcount. The exception
is mdt_mfd, which uses locking (med_open_lock) to manage its
lifetime. The lustre_handles code currently needs a call-out to
increment its refcount. To simplify things, move the refcount
into the lustre_handle (which will be largely ignored by mdt_mfd)
and discard the call-out.

To avoid warnings when refcount debugging is enabled, the refcount
of mdt_mfd is initialized to 1, and decremented after any
class_handle2object() call which would have incremented it.

In order to preserve the same debug messages, we store an object type
name in the portals_handle_ops, and use that in a CDEBUG() when
incrementing the ref count.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12542
Lustre-commit: 1d1f6c8908b3 ("LU-12542 handle: move refcount into the lustre_handle.")
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35794
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_dlm.h      |  6 ------
 fs/lustre/include/lustre_export.h   |  3 +--
 fs/lustre/include/lustre_handles.h  |  4 +++-
 fs/lustre/ldlm/ldlm_lock.c          | 36 +++++++++++++++---------------------
 fs/lustre/obdclass/genops.c         | 25 ++++++++++---------------
 fs/lustre/obdclass/lustre_handles.c |  5 ++++-
 fs/lustre/obdecho/echo_client.c     |  2 +-
 fs/lustre/ptlrpc/service.c          |  4 ++--
 8 files changed, 36 insertions(+), 49 deletions(-)

diff --git a/fs/lustre/include/lustre_dlm.h b/fs/lustre/include/lustre_dlm.h
index f7d2d9c..7621d1e 100644
--- a/fs/lustre/include/lustre_dlm.h
+++ b/fs/lustre/include/lustre_dlm.h
@@ -598,12 +598,6 @@ struct ldlm_lock {
 	 */
 	struct portals_handle		l_handle;
 	/**
-	 * Lock reference count.
-	 * This is how many users have pointers to actual structure, so that
-	 * we do not accidentally free lock structure that is in use.
-	 */
-	atomic_t			l_refc;
-	/**
 	 * Internal spinlock protects l_resource.  We should hold this lock
 	 * first before taking res_lock.
 	 */
diff --git a/fs/lustre/include/lustre_export.h b/fs/lustre/include/lustre_export.h
index 967ce37..878dedd 100644
--- a/fs/lustre/include/lustre_export.h
+++ b/fs/lustre/include/lustre_export.h
@@ -67,12 +67,11 @@ struct obd_export {
 	 * what export they are talking to.
 	 */
 	struct portals_handle		exp_handle;
-	refcount_t			exp_refcount;
 	/**
 	 * Set of counters below is to track where export references are
 	 * kept. The exp_rpc_count is used for reconnect handling also,
 	 * the cb_count and locks_count are for debug purposes only for now.
-	 * The sum of them should be less than exp_refcount by 3
+	 * The sum of them should be less than exp_handle.href by 3
 	 */
 	atomic_t			exp_rpc_count; /* RPC references */
 	atomic_t			exp_cb_count; /* Commit callback references */
diff --git a/fs/lustre/include/lustre_handles.h b/fs/lustre/include/lustre_handles.h
index 0440970..7c93d72 100644
--- a/fs/lustre/include/lustre_handles.h
+++ b/fs/lustre/include/lustre_handles.h
@@ -46,8 +46,9 @@
 #include <linux/types.h>
 
 struct portals_handle_ops {
-	void (*hop_addref)(void *object);
 	void (*hop_free)(void *object, int size);
+	/* hop_type is used for some debugging messages */
+	char *hop_type;
 };
 
 /* These handles are most easily used by having them appear at the very top of
@@ -66,6 +67,7 @@ struct portals_handle {
 	struct list_head		h_link;
 	u64				h_cookie;
 	const struct portals_handle_ops	*h_ops;
+	refcount_t			h_ref;
 
 	/* newly added fields to handle the RCU issue. -jxiong */
 	struct rcu_head			h_rcu;
diff --git a/fs/lustre/ldlm/ldlm_lock.c b/fs/lustre/ldlm/ldlm_lock.c
index d14221a..62d2c1d 100644
--- a/fs/lustre/ldlm/ldlm_lock.c
+++ b/fs/lustre/ldlm/ldlm_lock.c
@@ -148,7 +148,7 @@ const char *ldlm_it2str(enum ldlm_intent_flags it)
  */
 struct ldlm_lock *ldlm_lock_get(struct ldlm_lock *lock)
 {
-	atomic_inc(&lock->l_refc);
+	refcount_inc(&lock->l_handle.h_ref);
 	return lock;
 }
 EXPORT_SYMBOL(ldlm_lock_get);
@@ -161,8 +161,8 @@ struct ldlm_lock *ldlm_lock_get(struct ldlm_lock *lock)
 void ldlm_lock_put(struct ldlm_lock *lock)
 {
 	LASSERT(lock->l_resource != LP_POISON);
-	LASSERT(atomic_read(&lock->l_refc) > 0);
-	if (atomic_dec_and_test(&lock->l_refc)) {
+	LASSERT(refcount_read(&lock->l_handle.h_ref) > 0);
+	if (refcount_dec_and_test(&lock->l_handle.h_ref)) {
 		struct ldlm_resource *res;
 
 		LDLM_DEBUG(lock,
@@ -358,12 +358,6 @@ void ldlm_lock_destroy_nolock(struct ldlm_lock *lock)
 	}
 }
 
-/* this is called by portals_handle2object with the handle lock taken */
-static void lock_handle_addref(void *lock)
-{
-	LDLM_LOCK_GET((struct ldlm_lock *)lock);
-}
-
 static void lock_handle_free(void *lock, int size)
 {
 	LASSERT(size == sizeof(struct ldlm_lock));
@@ -371,8 +365,8 @@ static void lock_handle_free(void *lock, int size)
 }
 
 static struct portals_handle_ops lock_handle_ops = {
-	.hop_addref = lock_handle_addref,
 	.hop_free   = lock_handle_free,
+	.hop_type   = "ldlm",
 };
 
 /**
@@ -397,7 +391,7 @@ static struct ldlm_lock *ldlm_lock_new(struct ldlm_resource *resource)
 	lock->l_resource = resource;
 	lu_ref_add(&resource->lr_reference, "lock", lock);
 
-	atomic_set(&lock->l_refc, 2);
+	refcount_set(&lock->l_handle.h_ref, 2);
 	INIT_LIST_HEAD(&lock->l_res_link);
 	INIT_LIST_HEAD(&lock->l_lru);
 	INIT_LIST_HEAD(&lock->l_pending_chain);
@@ -1896,13 +1890,13 @@ void _ldlm_lock_debug(struct ldlm_lock *lock,
 				 &vaf,
 				 lock,
 				 lock->l_handle.h_cookie,
-				 atomic_read(&lock->l_refc),
+				 refcount_read(&lock->l_handle.h_ref),
 				 lock->l_readers, lock->l_writers,
 				 ldlm_lockname[lock->l_granted_mode],
 				 ldlm_lockname[lock->l_req_mode],
 				 lock->l_flags, nid,
 				 lock->l_remote_handle.cookie,
-				 exp ? refcount_read(&exp->exp_refcount) : -99,
+				 exp ? refcount_read(&exp->exp_handle.h_ref) : -99,
 				 lock->l_pid, lock->l_callback_timeout,
 				 lock->l_lvb_type);
 		va_end(args);
@@ -1916,7 +1910,7 @@ void _ldlm_lock_debug(struct ldlm_lock *lock,
 				 &vaf,
 				 ldlm_lock_to_ns_name(lock), lock,
 				 lock->l_handle.h_cookie,
-				 atomic_read(&lock->l_refc),
+				 refcount_read(&lock->l_handle.h_ref),
 				 lock->l_readers, lock->l_writers,
 				 ldlm_lockname[lock->l_granted_mode],
 				 ldlm_lockname[lock->l_req_mode],
@@ -1929,7 +1923,7 @@ void _ldlm_lock_debug(struct ldlm_lock *lock,
 				 lock->l_req_extent.end,
 				 lock->l_flags, nid,
 				 lock->l_remote_handle.cookie,
-				 exp ? refcount_read(&exp->exp_refcount) : -99,
+				 exp ? refcount_read(&exp->exp_handle.h_ref) : -99,
 				 lock->l_pid, lock->l_callback_timeout,
 				 lock->l_lvb_type);
 		break;
@@ -1940,7 +1934,7 @@ void _ldlm_lock_debug(struct ldlm_lock *lock,
 				 &vaf,
 				 ldlm_lock_to_ns_name(lock), lock,
 				 lock->l_handle.h_cookie,
-				 atomic_read(&lock->l_refc),
+				 refcount_read(&lock->l_handle.h_ref),
 				 lock->l_readers, lock->l_writers,
 				 ldlm_lockname[lock->l_granted_mode],
 				 ldlm_lockname[lock->l_req_mode],
@@ -1952,7 +1946,7 @@ void _ldlm_lock_debug(struct ldlm_lock *lock,
 				 lock->l_policy_data.l_flock.end,
 				 lock->l_flags, nid,
 				 lock->l_remote_handle.cookie,
-				 exp ? refcount_read(&exp->exp_refcount) : -99,
+				 exp ? refcount_read(&exp->exp_handle.h_ref) : -99,
 				 lock->l_pid, lock->l_callback_timeout);
 		break;
 
@@ -1962,7 +1956,7 @@ void _ldlm_lock_debug(struct ldlm_lock *lock,
 				 &vaf,
 				 ldlm_lock_to_ns_name(lock),
 				 lock, lock->l_handle.h_cookie,
-				 atomic_read(&lock->l_refc),
+				 refcount_read(&lock->l_handle.h_ref),
 				 lock->l_readers, lock->l_writers,
 				 ldlm_lockname[lock->l_granted_mode],
 				 ldlm_lockname[lock->l_req_mode],
@@ -1972,7 +1966,7 @@ void _ldlm_lock_debug(struct ldlm_lock *lock,
 				 ldlm_typename[resource->lr_type],
 				 lock->l_flags, nid,
 				 lock->l_remote_handle.cookie,
-				 exp ? refcount_read(&exp->exp_refcount) : -99,
+				 exp ? refcount_read(&exp->exp_handle.h_ref) : -99,
 				 lock->l_pid, lock->l_callback_timeout,
 				 lock->l_lvb_type);
 		break;
@@ -1983,7 +1977,7 @@ void _ldlm_lock_debug(struct ldlm_lock *lock,
 				 &vaf,
 				 ldlm_lock_to_ns_name(lock),
 				 lock, lock->l_handle.h_cookie,
-				 atomic_read(&lock->l_refc),
+				 refcount_read(&lock->l_handle.h_ref),
 				 lock->l_readers, lock->l_writers,
 				 ldlm_lockname[lock->l_granted_mode],
 				 ldlm_lockname[lock->l_req_mode],
@@ -1992,7 +1986,7 @@ void _ldlm_lock_debug(struct ldlm_lock *lock,
 				 ldlm_typename[resource->lr_type],
 				 lock->l_flags, nid,
 				 lock->l_remote_handle.cookie,
-				 exp ? refcount_read(&exp->exp_refcount) : -99,
+				 exp ? refcount_read(&exp->exp_handle.h_ref) : -99,
 				 lock->l_pid, lock->l_callback_timeout,
 				 lock->l_lvb_type);
 		break;
diff --git a/fs/lustre/obdclass/genops.c b/fs/lustre/obdclass/genops.c
index 5d4e421..7f841d5 100644
--- a/fs/lustre/obdclass/genops.c
+++ b/fs/lustre/obdclass/genops.c
@@ -708,7 +708,7 @@ static void class_export_destroy(struct obd_export *exp)
 {
 	struct obd_device *obd = exp->exp_obd;
 
-	LASSERT(refcount_read(&exp->exp_refcount) == 0);
+	LASSERT(refcount_read(&exp->exp_handle.h_ref) == 0);
 	LASSERT(obd);
 
 	CDEBUG(D_IOCTL, "destroying export %p/%s for %s\n", exp,
@@ -732,33 +732,28 @@ static void class_export_destroy(struct obd_export *exp)
 	OBD_FREE_RCU(exp, sizeof(*exp), &exp->exp_handle);
 }
 
-static void export_handle_addref(void *export)
-{
-	class_export_get(export);
-}
-
 static struct portals_handle_ops export_handle_ops = {
-	.hop_addref	= export_handle_addref,
 	.hop_free	= NULL,
+	.hop_type	= "export",
 };
 
 struct obd_export *class_export_get(struct obd_export *exp)
 {
-	refcount_inc(&exp->exp_refcount);
-	CDEBUG(D_INFO, "GETting export %p : new refcount %d\n", exp,
-	       refcount_read(&exp->exp_refcount));
+	refcount_inc(&exp->exp_handle.h_ref);
+	CDEBUG(D_INFO, "GET export %p refcount=%d\n", exp,
+	       refcount_read(&exp->exp_handle.h_ref));
 	return exp;
 }
 EXPORT_SYMBOL(class_export_get);
 
 void class_export_put(struct obd_export *exp)
 {
-	LASSERT(refcount_read(&exp->exp_refcount) >  0);
-	LASSERT(refcount_read(&exp->exp_refcount) < LI_POISON);
+	LASSERT(refcount_read(&exp->exp_handle.h_ref) >  0);
+	LASSERT(refcount_read(&exp->exp_handle.h_ref) < LI_POISON);
 	CDEBUG(D_INFO, "PUTting export %p : new refcount %d\n", exp,
-	       refcount_read(&exp->exp_refcount) - 1);
+	       refcount_read(&exp->exp_handle.h_ref) - 1);
 
-	if (refcount_dec_and_test(&exp->exp_refcount)) {
+	if (refcount_dec_and_test(&exp->exp_handle.h_ref)) {
 		struct obd_device *obd = exp->exp_obd;
 
 		CDEBUG(D_IOCTL, "final put %p/%s\n",
@@ -809,7 +804,7 @@ static struct obd_export *__class_new_export(struct obd_device *obd,
 
 	export->exp_conn_cnt = 0;
 	/* 2 = class_handle_hash + last */
-	refcount_set(&export->exp_refcount, 2);
+	refcount_set(&export->exp_handle.h_ref, 2);
 	atomic_set(&export->exp_rpc_count, 0);
 	atomic_set(&export->exp_cb_count, 0);
 	atomic_set(&export->exp_locks_count, 0);
diff --git a/fs/lustre/obdclass/lustre_handles.c b/fs/lustre/obdclass/lustre_handles.c
index 7fa3ef6..95a34db 100644
--- a/fs/lustre/obdclass/lustre_handles.c
+++ b/fs/lustre/obdclass/lustre_handles.c
@@ -152,7 +152,10 @@ void *class_handle2object(u64 cookie, const struct portals_handle_ops *ops)
 
 		spin_lock(&h->h_lock);
 		if (likely(h->h_in != 0)) {
-			h->h_ops->hop_addref(h);
+			refcount_inc(&h->h_ref);
+			CDEBUG(D_INFO, "GET %s %p refcount=%d\n",
+			       h->h_ops->hop_type, h,
+			       refcount_read(&h->h_ref));
 			retval = h;
 		}
 		spin_unlock(&h->h_lock);
diff --git a/fs/lustre/obdecho/echo_client.c b/fs/lustre/obdecho/echo_client.c
index 8e04636..c473f547 100644
--- a/fs/lustre/obdecho/echo_client.c
+++ b/fs/lustre/obdecho/echo_client.c
@@ -1669,7 +1669,7 @@ static int echo_client_cleanup(struct obd_device *obddev)
 	lu_session_tags_clear(ECHO_SES_TAG & ~LCT_SESSION);
 	lu_context_tags_clear(ECHO_DT_CTX_TAG);
 
-	LASSERT(refcount_read(&ec->ec_exp->exp_refcount) > 0);
+	LASSERT(refcount_read(&ec->ec_exp->exp_handle.h_ref) > 0);
 	rc = obd_disconnect(ec->ec_exp);
 	if (rc != 0)
 		CERROR("fail to disconnect device: %d\n", rc);
diff --git a/fs/lustre/ptlrpc/service.c b/fs/lustre/ptlrpc/service.c
index fe0e108..c874487 100644
--- a/fs/lustre/ptlrpc/service.c
+++ b/fs/lustre/ptlrpc/service.c
@@ -1768,7 +1768,7 @@ static int ptlrpc_server_handle_request(struct ptlrpc_service_part *svcpt,
 	       (request->rq_export ?
 		(char *)request->rq_export->exp_client_uuid.uuid : "0"),
 	       (request->rq_export ?
-		refcount_read(&request->rq_export->exp_refcount) : -99),
+		refcount_read(&request->rq_export->exp_handle.h_ref) : -99),
 	       lustre_msg_get_status(request->rq_reqmsg), request->rq_xid,
 	       libcfs_id2str(request->rq_peer),
 	       lustre_msg_get_opc(request->rq_reqmsg),
@@ -1809,7 +1809,7 @@ static int ptlrpc_server_handle_request(struct ptlrpc_service_part *svcpt,
 	       (request->rq_export ?
 		(char *)request->rq_export->exp_client_uuid.uuid : "0"),
 	       (request->rq_export ?
-		refcount_read(&request->rq_export->exp_refcount) : -99),
+		refcount_read(&request->rq_export->exp_handle.h_ref) : -99),
 	       lustre_msg_get_status(request->rq_reqmsg),
 	       request->rq_xid,
 	       libcfs_id2str(request->rq_peer),
-- 
1.8.3.1


* [lustre-devel] [PATCH 533/622] lustre: llite: support page unaligned stride readahead
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (531 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 532/622] lustre: handle: move refcount into the lustre_handle James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 534/622] lustre: ptlrpc: ptlrpc_register_bulk LBUG on ENOMEM James Simmons
                   ` (89 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

Currently, Lustre works well for aligned I/O, but performance
is pretty bad for unaligned stride reads, so it is worth some
effort to improve this situation.

One of the main problems with the current stride read code is
that it is based on page index, so when we hit the unaligned
page case, stride read detection does not work well. To support
unaligned page stride reads, change the page index to a byte
offset so that stride read pattern detection works and we avoid
many small-page RPCs and readahead window resets. At the same
time, we keep as much performance as possible for existing cases
and make sure there are no obvious regressions for
aligned-stride and sequential reads.

Benchmark numbers:
iozone -w -c -i 5 -t1 -j 2 -s 1G -r 43k -F /mnt/lustre/data

Patched                 Unpatched
1386630.75 kB/sec       152002.50 kB/sec

Performance improved by more than ~800%.

Benchmarked with IOR from ihara:
	FPP Read(MB/sec)        SSF Read(MB/sec)
Unpatched 44,636                7,731

Patched   44,318                20,745

That is a ~250% performance improvement for the ior_hard_read workload.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12518
Lustre-commit: 91d264551508 ("LU-12518 llite: support page unaligned stride readahead")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/35437
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Li Xi <lixi@ddn.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c           |   2 +-
 fs/lustre/llite/llite_internal.h |  11 +-
 fs/lustre/llite/rw.c             | 388 ++++++++++++++++++++++-----------------
 3 files changed, 228 insertions(+), 173 deletions(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 92eead1..d196da8 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -1703,7 +1703,7 @@ static ssize_t ll_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	if (cached)
 		goto out;
 
-	ll_ras_enter(file);
+	ll_ras_enter(file, iocb->ki_pos, iov_iter_count(to));
 
 	result = ll_do_fast_read(iocb, to);
 	if (result < 0 || iov_iter_count(to) == 0)
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index 8e7b949..fe9d568 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -654,11 +654,6 @@ struct ll_readahead_state {
 	 */
 	unsigned long	ras_requests;
 	/*
-	 * Page index with respect to the current request, these value
-	 * will not be accurate when dealing with reads issued via mmap.
-	 */
-	unsigned long	ras_request_index;
-	/*
 	 * The following 3 items are used for detecting the stride I/O
 	 * mode.
 	 * In stride I/O mode,
@@ -681,6 +676,10 @@ struct ll_readahead_state {
 	unsigned long	ras_consecutive_stride_requests;
 	/* index of the last page that async readahead starts */
 	pgoff_t		ras_async_last_readpage;
+	/* whether we should increase readahead window */
+	bool		ras_need_increase_window;
+	/* whether ra miss check should be skipped */
+	bool		ras_no_miss_check;
 };
 
 struct ll_readahead_work {
@@ -778,7 +777,7 @@ static inline bool ll_sbi_has_file_heat(struct ll_sb_info *sbi)
 	return !!(sbi->ll_flags & LL_SBI_FILE_HEAT);
 }
 
-void ll_ras_enter(struct file *f);
+void ll_ras_enter(struct file *f, unsigned long pos, unsigned long count);
 
 /* llite/lcommon_misc.c */
 int cl_ocd_update(struct obd_device *host, struct obd_device *watched,
diff --git a/fs/lustre/llite/rw.c b/fs/lustre/llite/rw.c
index 38f7aa2c..bf91ae1 100644
--- a/fs/lustre/llite/rw.c
+++ b/fs/lustre/llite/rw.c
@@ -131,12 +131,11 @@ void ll_ra_stats_inc(struct inode *inode, enum ra_stat which)
 
 #define RAS_CDEBUG(ras) \
 	CDEBUG(D_READA,							     \
-	       "lre %lu cr %lu cb %lu ws %lu wl %lu nra %lu rpc %lu r %lu ri %lu csr %lu sf %lu sb %lu sl %lu lr %lu\n", \
+	       "lre %lu cr %lu cb %lu ws %lu wl %lu nra %lu rpc %lu r %lu csr %lu sf %lu sb %lu sl %lu lr %lu\n", \
 	       ras->ras_last_read_end, ras->ras_consecutive_requests,	     \
 	       ras->ras_consecutive_bytes, ras->ras_window_start,	     \
 	       ras->ras_window_len, ras->ras_next_readahead,		     \
-	       ras->ras_rpc_size,					     \
-	       ras->ras_requests, ras->ras_request_index,		     \
+	       ras->ras_rpc_size, ras->ras_requests,			     \
 	       ras->ras_consecutive_stride_requests, ras->ras_stride_offset, \
 	       ras->ras_stride_bytes, ras->ras_stride_length,		     \
 	       ras->ras_async_last_readpage)
@@ -154,18 +153,6 @@ static int pos_in_window(unsigned long pos, unsigned long point,
 	return start <= pos && pos <= end;
 }
 
-void ll_ras_enter(struct file *f)
-{
-	struct ll_file_data *fd = LUSTRE_FPRIVATE(f);
-	struct ll_readahead_state *ras = &fd->fd_ras;
-
-	spin_lock(&ras->ras_lock);
-	ras->ras_requests++;
-	ras->ras_request_index = 0;
-	ras->ras_consecutive_requests++;
-	spin_unlock(&ras->ras_lock);
-}
-
 /**
  * Initiates read-ahead of a page with given index.
  *
@@ -311,15 +298,23 @@ static inline int stride_io_mode(struct ll_readahead_state *ras)
 
 static int ria_page_count(struct ra_io_arg *ria)
 {
-	u64 length = ria->ria_end >= ria->ria_start ?
-		     ria->ria_end - ria->ria_start + 1 : 0;
-	unsigned int bytes_count;
-
+	u64 length_bytes = ria->ria_end >= ria->ria_start ?
+			   (ria->ria_end - ria->ria_start + 1) << PAGE_SHIFT : 0;
+	unsigned int bytes_count, pg_count;
+
+	if (ria->ria_length > ria->ria_bytes && ria->ria_bytes &&
+	    (ria->ria_length % PAGE_SIZE || ria->ria_bytes % PAGE_SIZE ||
+	     ria->ria_stoff % PAGE_SIZE)) {
+		/* Over-estimate un-aligned page stride read */
+		pg_count = ((ria->ria_bytes + PAGE_SIZE - 1) >> PAGE_SHIFT) + 1;
+		pg_count *= length_bytes / ria->ria_length + 1;
+
+		return pg_count;
+	}
 	bytes_count = stride_byte_count(ria->ria_stoff, ria->ria_length,
 					 ria->ria_bytes, ria->ria_start,
-					 length << PAGE_SHIFT);
+					 length_bytes);
 	return (bytes_count + PAGE_SIZE - 1) >> PAGE_SHIFT;
-
 }
 
 static unsigned long ras_align(struct ll_readahead_state *ras,
@@ -333,16 +328,28 @@ static unsigned long ras_align(struct ll_readahead_state *ras,
 }
 
 /*Check whether the index is in the defined ra-window */
-static int ras_inside_ra_window(unsigned long idx, struct ra_io_arg *ria)
+static bool ras_inside_ra_window(unsigned long idx, struct ra_io_arg *ria)
 {
+	unsigned long pos = idx << PAGE_SHIFT;
+	unsigned long offset;
+
 	/* If ria_length == ria_pages, it means non-stride I/O mode,
 	 * idx should always inside read-ahead window in this case
 	 * For stride I/O mode, just check whether the idx is inside
 	 * the ria_pages.
 	 */
-	return ria->ria_length == 0 || ria->ria_length == ria->ria_bytes ||
-	       (idx >= ria->ria_stoff && (idx - ria->ria_stoff) %
-		ria->ria_length < ria->ria_bytes);
+	if (ria->ria_length == 0 || ria->ria_length == ria->ria_bytes)
+		return true;
+
+	if (pos >= ria->ria_stoff) {
+		offset = (pos - ria->ria_stoff) % ria->ria_length;
+		if (offset < ria->ria_bytes ||
+		    (ria->ria_length - offset) < PAGE_SIZE)
+			return true;
+	} else if (pos + PAGE_SIZE > ria->ria_stoff)
+		return true;
+
+	return false;
 }
 
 static unsigned long
@@ -351,7 +358,6 @@ static int ras_inside_ra_window(unsigned long idx, struct ra_io_arg *ria)
 		    struct ra_io_arg *ria, pgoff_t *ra_end)
 {
 	struct cl_read_ahead ra = { 0 };
-	bool stride_ria;
 	pgoff_t page_idx;
 	int count = 0;
 	int rc;
@@ -359,7 +365,6 @@ static int ras_inside_ra_window(unsigned long idx, struct ra_io_arg *ria)
 	LASSERT(ria);
 	RIA_DEBUG(ria);
 
-	stride_ria = ria->ria_length > ria->ria_bytes && ria->ria_bytes > 0;
 	for (page_idx = ria->ria_start;
 	     page_idx <= ria->ria_end && ria->ria_reserved > 0; page_idx++) {
 		if (ras_inside_ra_window(page_idx, ria)) {
@@ -417,7 +422,7 @@ static int ras_inside_ra_window(unsigned long idx, struct ra_io_arg *ria)
 				ria->ria_reserved--;
 				count++;
 			}
-		} else if (stride_ria) {
+		} else if (stride_io_mode(ras)) {
 			/* If it is not in the read-ahead window, and it is
 			 * read-ahead mode, then check whether it should skip
 			 * the stride gap.
@@ -428,7 +433,8 @@ static int ras_inside_ra_window(unsigned long idx, struct ra_io_arg *ria)
 			offset = (pos - ria->ria_stoff) % ria->ria_length;
 			if (offset >= ria->ria_bytes) {
 				pos += (ria->ria_length - offset);
-				page_idx = (pos >> PAGE_SHIFT) - 1;
+				if ((pos >> PAGE_SHIFT) >= page_idx + 1)
+					page_idx = (pos >> PAGE_SHIFT) - 1;
 				CDEBUG(D_READA,
 				       "Stride: jump %lu pages to %lu\n",
 				       ria->ria_length - offset, page_idx);
@@ -775,11 +781,10 @@ void ll_readahead_init(struct inode *inode, struct ll_readahead_state *ras)
  * Check whether the read request is in the stride window.
  * If it is in the stride window, return true, otherwise return false.
  */
-static bool index_in_stride_window(struct ll_readahead_state *ras,
-				   pgoff_t index)
+static bool read_in_stride_window(struct ll_readahead_state *ras,
+				  unsigned long pos, unsigned long count)
 {
 	unsigned long stride_gap;
-	unsigned long pos = index << PAGE_SHIFT;
 
 	if (ras->ras_stride_length == 0 || ras->ras_stride_bytes == 0 ||
 	    ras->ras_stride_bytes == ras->ras_stride_length)
@@ -789,12 +794,13 @@ static bool index_in_stride_window(struct ll_readahead_state *ras,
 
 	/* If it is contiguous read */
 	if (stride_gap == 0)
-		return ras->ras_consecutive_bytes + PAGE_SIZE <=
+		return ras->ras_consecutive_bytes + count <=
 			ras->ras_stride_bytes;
 
 	/* Otherwise check the stride by itself */
 	return (ras->ras_stride_length - ras->ras_stride_bytes) == stride_gap &&
-		ras->ras_consecutive_bytes == ras->ras_stride_bytes;
+		ras->ras_consecutive_bytes == ras->ras_stride_bytes &&
+		count <= ras->ras_stride_bytes;
 }
 
 static void ras_init_stride_detector(struct ll_readahead_state *ras,
@@ -802,13 +808,6 @@ static void ras_init_stride_detector(struct ll_readahead_state *ras,
 {
 	unsigned long stride_gap = pos - ras->ras_last_read_end - 1;
 
-	if ((stride_gap != 0 || ras->ras_consecutive_stride_requests == 0) &&
-	    !stride_io_mode(ras)) {
-		ras->ras_stride_bytes = ras->ras_consecutive_bytes;
-		ras->ras_stride_length =  ras->ras_consecutive_bytes +
-					 stride_gap;
-	}
-	LASSERT(ras->ras_request_index == 0);
 	LASSERT(ras->ras_consecutive_stride_requests == 0);
 
 	if (pos <= ras->ras_last_read_end) {
@@ -819,6 +818,8 @@ static void ras_init_stride_detector(struct ll_readahead_state *ras,
 
 	ras->ras_stride_bytes = ras->ras_consecutive_bytes;
 	ras->ras_stride_length = stride_gap + ras->ras_consecutive_bytes;
+	ras->ras_consecutive_stride_requests++;
+	ras->ras_stride_offset = pos;
 
 	RAS_CDEBUG(ras);
 }
@@ -895,49 +896,97 @@ static void ras_increase_window(struct inode *inode,
 	}
 }
 
-static void ras_update(struct ll_sb_info *sbi, struct inode *inode,
-		       struct ll_readahead_state *ras, unsigned long index,
-		       enum ras_update_flags flags)
+/**
+ * Seek within 8 pages are considered as sequential read for now.
+ */
+static inline bool is_loose_seq_read(struct ll_readahead_state *ras,
+				     unsigned long pos)
 {
-	struct ll_ra_info *ra = &sbi->ll_ra_info;
-	int zero = 0, stride_detect = 0, ra_miss = 0;
-	unsigned long pos = index << PAGE_SHIFT;
-	bool hit = flags & LL_RAS_HIT;
-
-	spin_lock(&ras->ras_lock);
-
-	if (!hit)
-		CDEBUG(D_READA, DFID " pages at %lu miss.\n",
-		       PFID(ll_inode2fid(inode)), index);
+	return pos_in_window(pos, ras->ras_last_read_end,
+			     8 << PAGE_SHIFT, 8 << PAGE_SHIFT);
+}
 
-	ll_ra_stats_inc_sbi(sbi, hit ? RA_STAT_HIT : RA_STAT_MISS);
+static void ras_detect_read_pattern(struct ll_readahead_state *ras,
+				    struct ll_sb_info *sbi,
+				    unsigned long pos, unsigned long count,
+				    bool mmap)
+{
+	bool stride_detect = false;
+	unsigned long index = pos >> PAGE_SHIFT;
 
-	/* reset the read-ahead window in two cases.  First when the app seeks
-	 * or reads to some other part of the file.  Secondly if we get a
-	 * read-ahead miss that we think we've previously issued.  This can
-	 * be a symptom of there being so many read-ahead pages that the VM is
-	 * reclaiming it before we get to it.
+	/*
+	 * Reset the read-ahead window in two cases. First when the app seeks
+	 * or reads to some other part of the file. Secondly if we get a
+	 * read-ahead miss that we think we've previously issued. This can
+	 * be a symptom of there being so many read-ahead pages that the VM
+	 * is reclaiming it before we get to it.
 	 */
-	if (!pos_in_window(pos, ras->ras_last_read_end,
-			   8 << PAGE_SHIFT, 8 << PAGE_SHIFT)) {
-		zero = 1;
+	if (!is_loose_seq_read(ras, pos)) {
+		/* Check whether it is in stride I/O mode */
+		if (!read_in_stride_window(ras, pos, count)) {
+			if (ras->ras_consecutive_stride_requests == 0)
+				ras_init_stride_detector(ras, pos, count);
+			else
+				ras_stride_reset(ras);
+			ras->ras_consecutive_bytes = 0;
+			ras_reset(ras, index);
+		} else {
+			ras->ras_consecutive_bytes = 0;
+			ras->ras_consecutive_requests = 0;
+			if (++ras->ras_consecutive_stride_requests > 1)
+				stride_detect = true;
+			RAS_CDEBUG(ras);
+		}
 		ll_ra_stats_inc_sbi(sbi, RA_STAT_DISTANT_READPAGE);
-	} else if (!hit && ras->ras_window_len &&
-		   index < ras->ras_next_readahead &&
-		   pos_in_window(index, ras->ras_window_start, 0,
-				 ras->ras_window_len)) {
-		ra_miss = 1;
-		ll_ra_stats_inc_sbi(sbi, RA_STAT_MISS_IN_WINDOW);
+	} else if (stride_io_mode(ras)) {
+		/*
+		 * If this is contiguous read but in stride I/O mode
+		 * currently, check whether stride step still is valid,
+		 * if invalid, it will reset the stride ra window to
+		 * be zero.
+		 */
+		if (!read_in_stride_window(ras, pos, count)) {
+			ras_stride_reset(ras);
+			ras->ras_window_len = 0;
+			ras->ras_next_readahead = index;
+		}
 	}
 
-	/* On the second access to a file smaller than the tunable
+	ras->ras_consecutive_bytes += count;
+	if (mmap) {
+		unsigned int idx = (ras->ras_consecutive_bytes >> PAGE_SHIFT);
+
+		if ((idx >= 4 && idx % 4 == 0) || stride_detect)
+			ras->ras_need_increase_window = true;
+	} else if ((ras->ras_consecutive_requests > 1 || stride_detect)) {
+		ras->ras_need_increase_window = true;
+	}
+
+	ras->ras_last_read_end = pos + count - 1;
+}
+
+void ll_ras_enter(struct file *f, unsigned long pos, unsigned long count)
+{
+	struct ll_file_data *fd = LUSTRE_FPRIVATE(f);
+	struct ll_readahead_state *ras = &fd->fd_ras;
+	struct inode *inode = file_inode(f);
+	unsigned long index = pos >> PAGE_SHIFT;
+	struct ll_sb_info *sbi = ll_i2sbi(inode);
+
+	spin_lock(&ras->ras_lock);
+	ras->ras_requests++;
+	ras->ras_consecutive_requests++;
+	ras->ras_need_increase_window = false;
+	ras->ras_no_miss_check = false;
+	/*
+	 * On the second access to a file smaller than the tunable
 	 * ra_max_read_ahead_whole_pages trigger RA on all pages in the
 	 * file up to ra_max_pages_per_file.  This is simply a best effort
-	 * and only occurs once per open file.  Normal RA behavior is reverted
-	 * to for subsequent IO.  The mmap case does not increment
-	 * ras_requests and thus can never trigger this behavior.
+	 * and only occurs once per open file. Normal RA behavior is reverted
+	 * to for subsequent IO.
 	 */
-	if (ras->ras_requests >= 2 && !ras->ras_request_index) {
+	if (ras->ras_requests >= 2) {
+		struct ll_ra_info *ra = &sbi->ll_ra_info;
 		u64 kms_pages;
 
 		kms_pages = (i_size_read(inode) + PAGE_SIZE - 1) >>
@@ -952,73 +1001,111 @@ static void ras_update(struct ll_sb_info *sbi, struct inode *inode,
 			ras->ras_window_start = 0;
 			ras->ras_next_readahead = index + 1;
 			ras->ras_window_len = min(ra->ra_max_pages_per_file,
-				ra->ra_max_read_ahead_whole_pages);
+						  ra->ra_max_read_ahead_whole_pages);
+			ras->ras_no_miss_check = true;
 			goto out_unlock;
 		}
 	}
-	if (zero) {
-		/* check whether it is in stride I/O mode*/
-		if (!index_in_stride_window(ras, index)) {
-			if (ras->ras_consecutive_stride_requests == 0 &&
-			    ras->ras_request_index == 0) {
-				ras_init_stride_detector(ras, pos, PAGE_SIZE);
-				ras->ras_consecutive_stride_requests++;
-			} else {
-				ras_stride_reset(ras);
-			}
+	ras_detect_read_pattern(ras, sbi, pos, count, false);
+out_unlock:
+	spin_unlock(&ras->ras_lock);
+}
+
+static bool index_in_stride_window(struct ll_readahead_state *ras,
+				   unsigned int index)
+{
+	unsigned long pos = index << PAGE_SHIFT;
+	unsigned long offset;
+
+	if (ras->ras_stride_length == 0 || ras->ras_stride_bytes == 0 ||
+	    ras->ras_stride_bytes == ras->ras_stride_length)
+		return false;
+
+	if (pos >= ras->ras_stride_offset) {
+		offset = (pos - ras->ras_stride_offset) %
+			 ras->ras_stride_length;
+		if (offset < ras->ras_stride_bytes ||
+		    ras->ras_stride_length - offset < PAGE_SIZE)
+			return true;
+	} else if (ras->ras_stride_offset - pos < PAGE_SIZE) {
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * ll_ras_enter() is used to detect read pattern according to
+ * pos and count.
+ *
+ * ras_update() is used to detect cache miss and
+ * reset window or increase window accordingly
+ */
+static void ras_update(struct ll_sb_info *sbi, struct inode *inode,
+		       struct ll_readahead_state *ras, unsigned long index,
+		       enum ras_update_flags flags)
+{
+	struct ll_ra_info *ra = &sbi->ll_ra_info;
+	bool hit = flags & LL_RAS_HIT;
+
+	spin_lock(&ras->ras_lock);
+
+	if (!hit)
+		CDEBUG(D_READA, DFID " pages at %lu miss.\n",
+		       PFID(ll_inode2fid(inode)), index);
+	ll_ra_stats_inc_sbi(sbi, hit ? RA_STAT_HIT : RA_STAT_MISS);
+
+	/*
+	 * The readahead window has been expanded to cover whole
+	 * file size, we don't care whether ra miss happen or not.
+	 * Because we will read whole file to page cache even if
+	 * some pages missed.
+	 */
+	if (ras->ras_no_miss_check)
+		goto out_unlock;
+
+	if (flags & LL_RAS_MMAP)
+		ras_detect_read_pattern(ras, sbi, index << PAGE_SHIFT,
+					PAGE_SIZE, true);
+
+	if (!hit && ras->ras_window_len &&
+	    index < ras->ras_next_readahead &&
+	    pos_in_window(index, ras->ras_window_start, 0,
+			  ras->ras_window_len)) {
+		ll_ra_stats_inc_sbi(sbi, RA_STAT_MISS_IN_WINDOW);
+		ras->ras_need_increase_window = false;
+
+		if (index_in_stride_window(ras, index) &&
+		    stride_io_mode(ras)) {
+			/*
+			 * if (index != ras->ras_last_readpage + 1)
+			 *      ras->ras_consecutive_pages = 0;
+			 */
 			ras_reset(ras, index);
-			ras->ras_consecutive_bytes += PAGE_SIZE;
-			goto out_unlock;
-		} else {
-			ras->ras_consecutive_bytes = 0;
-			ras->ras_consecutive_requests = 0;
-			if (++ras->ras_consecutive_stride_requests > 1)
-				stride_detect = 1;
-			RAS_CDEBUG(ras);
-		}
-	} else {
-		if (ra_miss) {
-			if (index_in_stride_window(ras, index) &&
-			    stride_io_mode(ras)) {
-				if (index != (ras->ras_last_read_end >>
-					      PAGE_SHIFT) + 1)
-					ras->ras_consecutive_bytes = 0;
-				ras_reset(ras, index);
-
-				/* If stride-RA hit cache miss, the stride
-				 * detector will not be reset to avoid the
-				 * overhead of redetecting read-ahead mode,
-				 * but on the condition that the stride window
-				 * is still intersect with normal sequential
-				 * read-ahead window.
-				 */
-				if (ras->ras_window_start <
-				    (ras->ras_stride_offset >> PAGE_SHIFT))
-					ras_stride_reset(ras);
-				RAS_CDEBUG(ras);
-			} else {
-				/* Reset both stride window and normal RA
-				 * window
-				 */
-				ras_reset(ras, index);
-				ras->ras_consecutive_bytes += PAGE_SIZE;
-				ras_stride_reset(ras);
-				goto out_unlock;
-			}
-		} else if (stride_io_mode(ras)) {
-			/* If this is contiguous read but in stride I/O mode
-			 * currently, check whether stride step still is valid,
-			 * if invalid, it will reset the stride ra window
+			/*
+			 * If stride-RA hit cache miss, the stride
+			 * detector will not be reset to avoid the
+			 * overhead of redetecting read-ahead mode,
+			 * but on the condition that the stride window
+			 * is still intersect with normal sequential
+			 * read-ahead window.
 			 */
-			if (!index_in_stride_window(ras, index)) {
-				/* Shrink stride read-ahead window to be zero */
+			if (ras->ras_window_start <
+			    ras->ras_stride_offset)
 				ras_stride_reset(ras);
-				ras->ras_window_len = 0;
-				ras->ras_next_readahead = index;
-			}
+			RAS_CDEBUG(ras);
+		} else {
+			/*
+			 * Reset both stride window and normal RA
+			 * window.
+			 */
+			ras_reset(ras, index);
+			/* ras->ras_consecutive_pages++; */
+			ras->ras_consecutive_bytes = 0;
+			ras_stride_reset(ras);
+			goto out_unlock;
 		}
 	}
-	ras->ras_consecutive_bytes += PAGE_SIZE;
 	ras_set_start(ras, index);
 
 	if (stride_io_mode(ras)) {
@@ -1037,44 +1124,13 @@ static void ras_update(struct ll_sb_info *sbi, struct inode *inode,
 		if (!hit)
 			ras->ras_next_readahead = index + 1;
 	}
-	RAS_CDEBUG(ras);
 
-	/* Trigger RA in the mmap case where ras_consecutive_requests
-	 * is not incremented and thus can't be used to trigger RA
-	 */
-	if (ras->ras_consecutive_bytes >= (4 << PAGE_SHIFT) &&
-	    flags & LL_RAS_MMAP) {
+	if (ras->ras_need_increase_window) {
 		ras_increase_window(inode, ras, ra);
-		/*
-		 * reset consecutive pages so that the readahead window can
-		 * grow gradually.
-		 */
-		ras->ras_consecutive_bytes = 0;
-		goto out_unlock;
-	}
-
-	/* Initially reset the stride window offset to next_readahead*/
-	if (ras->ras_consecutive_stride_requests == 2 && stride_detect) {
-		/**
-		 * Once stride IO mode is detected, next_readahead should be
-		 * reset to make sure next_readahead > stride offset
-		 */
-		ras->ras_next_readahead = max(index, ras->ras_next_readahead);
-		ras->ras_stride_offset = index << PAGE_SHIFT;
-		ras->ras_window_start = max(index, ras->ras_window_start);
+		ras->ras_need_increase_window = false;
 	}
 
-	/* The initial ras_window_len is set to the request size.  To avoid
-	 * uselessly reading and discarding pages for random IO the window is
-	 * only increased once per consecutive request received.
-	 */
-	if ((ras->ras_consecutive_requests > 1 || stride_detect) &&
-	    !ras->ras_request_index)
-		ras_increase_window(inode, ras, ra);
 out_unlock:
-	RAS_CDEBUG(ras);
-	ras->ras_request_index++;
-	ras->ras_last_read_end = pos + PAGE_SIZE - 1;
 	spin_unlock(&ras->ras_lock);
 }
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 534/622] lustre: ptlrpc: ptlrpc_register_bulk LBUG on ENOMEM
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (532 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 533/622] lustre: llite: support page unaligned stride readahead James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 535/622] lustre: osc: allow increasing osc.*.short_io_bytes James Simmons
                   ` (88 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Ann Koehler <amk@cray.com>

Another path through ptl_send_rpc() can trigger the assertion
reported in LU-10643. The assertion on !desc->bd_registered in
ptlrpc_register_bulk() fails when an RPC is resent after the first
send attempt failed to attach the reply buffer: the bulk error
cleanup in ptl_send_rpc() does not reset the bd_registered flag.

Cray-bug-id: LUS-7946
WC-bug-id: https://jira.whamcloud.com/browse/LU-12816
Lustre-commit: e6225c07ce4c ("LU-12816 ptlrpc: ptlrpc_register_bulk LBUG on ENOMEM")
Signed-off-by: Ann Koehler <amk@cray.com>
Reviewed-on: https://review.whamcloud.com/36309
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/niobuf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/lustre/ptlrpc/niobuf.c b/fs/lustre/ptlrpc/niobuf.c
index 12a9a5e..fcf7bfa 100644
--- a/fs/lustre/ptlrpc/niobuf.c
+++ b/fs/lustre/ptlrpc/niobuf.c
@@ -720,6 +720,8 @@ int ptl_send_rpc(struct ptlrpc_request *request, int noreply)
 	 * the chance to have long unlink to sluggish net is smaller here.
 	 */
 	ptlrpc_unregister_bulk(request, 0);
+	if (request->rq_bulk)
+		request->rq_bulk->bd_registered = 0;
 out:
 	if (rc == -ENOMEM) {
 		/*
-- 
1.8.3.1


* [lustre-devel] [PATCH 535/622] lustre: osc: allow increasing osc.*.short_io_bytes
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (533 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 534/622] lustre: ptlrpc: ptlrpc_register_bulk LBUG on ENOMEM James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 536/622] lnet: remove pt_number from lnet_peer_table James Simmons
                   ` (87 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

The osc.*.short_io_bytes parameter was mixing up the default and
maximum parameter values, and did not allow increasing the parameter
beyond the default.

Allow it to be increased up to the maximum value, which depends on
the client PAGE_SIZE and the amount of free space in the
maximally-sized OST RPC.  Since the maximum size is system dependent,
allow some grace when setting the parameter, so that a single tunable
value can work on a variety of different systems.

However, if it is larger than the maximum RDMA size (which is already
too large), return an error, as it means something is wrong.

Add a test case to exercise the osc.*.short_io_bytes parameter.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12910
Lustre-commit: cedc7f361a6e ("LU-12910 osc: allow increasing osc.*.short_io_bytes")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/36587
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Olaf Faaland-LLNL <faaland1@llnl.gov>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_net.h      | 25 ++++++++++++++-----------
 fs/lustre/ldlm/ldlm_lib.c           |  2 +-
 fs/lustre/obdclass/lprocfs_status.c | 20 ++++++++++++--------
 3 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/fs/lustre/include/lustre_net.h b/fs/lustre/include/lustre_net.h
index 40c1ae8..87e1d60 100644
--- a/fs/lustre/include/lustre_net.h
+++ b/fs/lustre/include/lustre_net.h
@@ -306,17 +306,19 @@
  *	DT_MAX_BRW_PAGES * niobuf_remote
  *
  * - single object with 16 pages is 512 bytes
- * - OST_IO_MAXREQSIZE must be at least 1 page of cookies plus some spillover
+ * - OST_IO_MAXREQSIZE must be at least 1 niobuf per page of data
  * - Must be a multiple of 1024
+ * - should allow a reasonably large SHORT_IO_BYTES size (64KB)
  */
 #define _OST_MAXREQSIZE_BASE ((unsigned long)(sizeof(struct lustre_msg) + \
-				 sizeof(struct ptlrpc_body) + \
-				 sizeof(struct obdo) + \
-				 sizeof(struct obd_ioobj) + \
-				 sizeof(struct niobuf_remote)))
-#define _OST_MAXREQSIZE_SUM ((unsigned long)(_OST_MAXREQSIZE_BASE + \
-				 sizeof(struct niobuf_remote) * \
-				 (DT_MAX_BRW_PAGES - 1)))
+			     /* lm_buflens */ sizeof(u32) * 4 +		  \
+					      sizeof(struct ptlrpc_body) +\
+					      sizeof(struct obdo) +	  \
+					      sizeof(struct obd_ioobj) +  \
+					      sizeof(struct niobuf_remote)))
+#define _OST_MAXREQSIZE_SUM ((unsigned long)(_OST_MAXREQSIZE_BASE +	    \
+					     sizeof(struct niobuf_remote) * \
+					     DT_MAX_BRW_PAGES))
 
 /**
  * MDS incoming request with LOV EA
@@ -335,14 +337,15 @@
 /* Safe estimate of free space in standard RPC, provides upper limit for # of
  * bytes of i/o to pack in RPC (skipping bulk transfer).
  */
-#define OST_SHORT_IO_SPACE	(OST_IO_MAXREQSIZE - _OST_MAXREQSIZE_BASE)
+#define OST_MAX_SHORT_IO_BYTES	((OST_IO_MAXREQSIZE - _OST_MAXREQSIZE_BASE) & \
+				 PAGE_MASK)
 
 /* Actual size used for short i/o buffer.  Calculation means this:
  * At least one page (for large PAGE_SIZE), or 16 KiB, but not more
  * than the available space aligned to a page boundary.
  */
-#define OBD_MAX_SHORT_IO_BYTES	(min(max(PAGE_SIZE, 16UL * 1024UL), \
-					 OST_SHORT_IO_SPACE & PAGE_MASK))
+#define OBD_DEF_SHORT_IO_BYTES	min(max(PAGE_SIZE, 16UL * 1024UL), \
+					OST_MAX_SHORT_IO_BYTES)
 
 /* Macro to hide a typecast and BUILD_BUG. */
 #define ptlrpc_req_async_args(_var, req) ({				\
diff --git a/fs/lustre/ldlm/ldlm_lib.c b/fs/lustre/ldlm/ldlm_lib.c
index 127ed32..58919d3 100644
--- a/fs/lustre/ldlm/ldlm_lib.c
+++ b/fs/lustre/ldlm/ldlm_lib.c
@@ -381,7 +381,7 @@ int client_obd_setup(struct obd_device *obddev, struct lustre_cfg *lcfg)
 	 */
 	cli->cl_max_pages_per_rpc = PTLRPC_MAX_BRW_PAGES;
 
-	cli->cl_max_short_io_bytes = OBD_MAX_SHORT_IO_BYTES;
+	cli->cl_max_short_io_bytes = OBD_DEF_SHORT_IO_BYTES;
 
 	/*
 	 * set cl_chunkbits default value to PAGE_CACHE_SHIFT,
diff --git a/fs/lustre/obdclass/lprocfs_status.c b/fs/lustre/obdclass/lprocfs_status.c
index 98d1e3b..806d6517 100644
--- a/fs/lustre/obdclass/lprocfs_status.c
+++ b/fs/lustre/obdclass/lprocfs_status.c
@@ -1894,18 +1894,24 @@ ssize_t short_io_bytes_store(struct kobject *kobj, struct attribute *attr,
 	struct obd_device *dev = container_of(kobj, struct obd_device,
 					      obd_kset.kobj);
 	struct client_obd *cli = &dev->u.cli;
-	u32 val;
+	unsigned long long val;
+	char *endp;
 	int rc;
 
 	rc = lprocfs_climp_check(dev);
 	if (rc)
 		return rc;
 
-	rc = kstrtouint(buffer, 0, &val);
-	if (rc)
+	val = memparse(buffer, &endp);
+	if (*endp) {
+		rc = -EINVAL;
 		goto out;
+	}
+
+	if (val == -1)
+		val = OBD_DEF_SHORT_IO_BYTES;
 
-	if (val && (val < MIN_SHORT_IO_BYTES || val > OBD_MAX_SHORT_IO_BYTES)) {
+	if (val && (val < MIN_SHORT_IO_BYTES || val > LNET_MTU)) {
 		rc = -ERANGE;
 		goto out;
 	}
@@ -1913,10 +1919,8 @@ ssize_t short_io_bytes_store(struct kobject *kobj, struct attribute *attr,
 	rc = count;
 
 	spin_lock(&cli->cl_loi_list_lock);
-	if (val > (cli->cl_max_pages_per_rpc << PAGE_SHIFT))
-		rc = -ERANGE;
-	else
-		cli->cl_max_short_io_bytes = val;
+	cli->cl_max_short_io_bytes = min_t(unsigned long long,
+					   val, OST_MAX_SHORT_IO_BYTES);
 	spin_unlock(&cli->cl_loi_list_lock);
 
 out:
-- 
1.8.3.1


* [lustre-devel] [PATCH 536/622] lnet: remove pt_number from lnet_peer_table.
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (534 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 535/622] lustre: osc: allow increasing osc.*.short_io_bytes James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 537/622] lnet: Optimize check for routing feature flag James Simmons
                   ` (86 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

This field is no longer used, except in an LASSERT().
It did have a use once, but that was removed in
commit 21602c7db4cf ("staging: lustre: Dynamic LNet
                      Configuration (DLC)")

WC-bug-id: https://jira.whamcloud.com/browse/LU-12936
Lustre-commit: e9c9e2103a78 ("LU-12936 lnet: remove pt_number from lnet_peer_table.")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/36671
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-types.h | 3 ---
 net/lnet/lnet/peer.c           | 3 ---
 2 files changed, 6 deletions(-)

diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index 18d4e4e..51cc9ce 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -765,7 +765,6 @@ struct lnet_peer_net {
  *
  * protected by lnet_net_lock/EX for update
  *    pt_version
- *    pt_number
  *    pt_hash[...]
  *    pt_peer_list
  *    pt_peers
@@ -778,8 +777,6 @@ struct lnet_peer_net {
 struct lnet_peer_table {
 	/* /proc validity stamp */
 	int			 pt_version;
-	/* # peers extant */
-	atomic_t		 pt_number;
 	/* peers */
 	struct list_head	 pt_peer_list;
 	/* # peers */
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index a067136..4f0da4b 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -354,8 +354,6 @@
 
 	/* decrement the ref count on the peer table */
 	ptable = the_lnet.ln_peer_tables[lpni->lpni_cpt];
-	LASSERT(atomic_read(&ptable->pt_number) > 0);
-	atomic_dec(&ptable->pt_number);
 
 	/*
 	 * The peer_ni can no longer be found with a lookup. But there
@@ -1246,7 +1244,6 @@ struct lnet_peer_net *
 		ptable = the_lnet.ln_peer_tables[lpni->lpni_cpt];
 		list_add_tail(&lpni->lpni_hashlist, &ptable->pt_hash[hash]);
 		ptable->pt_version++;
-		atomic_inc(&ptable->pt_number);
 		/* This is the 1st refcount on lpni. */
 		atomic_inc(&lpni->lpni_refcount);
 	}
-- 
1.8.3.1


* [lustre-devel] [PATCH 537/622] lnet: Optimize check for routing feature flag
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (535 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 536/622] lnet: remove pt_number from lnet_peer_table James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 538/622] lustre: llite: file write pos mimatch James Simmons
                   ` (85 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

Check the routing feature flag outside of the loop.

Cray-bug-id: LUS-7862
WC-bug-id: https://jira.whamcloud.com/browse/LU-12942
Lustre-commit: 7a99dc0b2f27 ("LU-12942 lnet: Optimize check for routing feature flag")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/36679
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/router.c | 21 ++++++++-------------
 1 file changed, 8 insertions(+), 13 deletions(-)

diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index 447706d..41d0eb0 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -325,12 +325,14 @@ bool lnet_is_route_alive(struct lnet_route *route)
 
 	spin_unlock(&lp->lp_lock);
 
-	if (lp_state & LNET_PEER_PING_FAILED) {
-		CDEBUG(D_NET,
-		       "Ping failed with %d. Set routes down for gw %s\n",
-		       lp->lp_ping_error, libcfs_nid2str(lp->lp_primary_nid));
-		/* If the ping failed then mark the routes served by this
-		 * peer down
+	if (lp_state & LNET_PEER_PING_FAILED ||
+	    pbuf->pb_info.pi_features & LNET_PING_FEAT_RTE_DISABLED) {
+		CDEBUG(D_NET, "Set routes down for gw %s because %s %d\n",
+		       libcfs_nid2str(lp->lp_primary_nid),
+		       lp_state & LNET_PEER_PING_FAILED ? "ping failed" :
+		       "route feature is disabled", lp->lp_ping_error);
+		/* If the ping failed or the peer has routing disabled then
+		 * mark the routes served by this peer down
 		 */
 		list_for_each_entry(route, &lp->lp_routes, lr_gwlist)
 			lnet_set_route_aliveness(route, false);
@@ -359,13 +361,6 @@ bool lnet_is_route_alive(struct lnet_route *route)
 			    route->lr_gateway->lp_primary_nid)
 				continue;
 
-			/* gateway has the routing feature disabled */
-			if (pbuf->pb_info.pi_features &
-			      LNET_PING_FEAT_RTE_DISABLED) {
-				lnet_set_route_aliveness(route, false);
-				continue;
-			}
-
 			llpn = lnet_peer_get_net_locked(lp, route->lr_lnet);
 			if (!llpn) {
 				lnet_set_route_aliveness(route, false);
-- 
1.8.3.1


* [lustre-devel] [PATCH 538/622] lustre: llite: file write pos mimatch
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (536 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 537/622] lnet: Optimize check for routing feature flag James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 539/622] lustre: ldlm: FLOCK request can be processed twice James Simmons
                   ` (84 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Bobi Jam <bobijam@whamcloud.com>

In vvp_io_write_start(), after data has been successfully written,
it may happen (e.g. when out of quota) that the data is not
committed, or is only partially committed. The file's write position
(kiocb->ki_pos) would then be pushed forward incorrectly, and the
next iteration of the write loop fails the assertion

ASSERTION( io->u.ci_rw.rw_iocb.ki_pos == range->cir_pos )

This patch corrects ki_pos when this scenario happens.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12503
Lustre-commit: 1d2aa1513dc4 ("LU-12503 llite: file write pos mimatch")
Signed-off-by: Bobi Jam <bobijam@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/36021
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/vvp_io.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/fs/lustre/llite/vvp_io.c b/fs/lustre/llite/vvp_io.c
index aa8f2e1..b3f628c 100644
--- a/fs/lustre/llite/vvp_io.c
+++ b/fs/lustre/llite/vvp_io.c
@@ -1068,9 +1068,12 @@ static int vvp_io_write_start(const struct lu_env *env,
 	struct cl_object *obj = io->ci_obj;
 	struct inode *inode = vvp_object_inode(obj);
 	struct ll_inode_info *lli = ll_i2info(inode);
+	struct file *file = vio->vui_fd->fd_file;
 	bool lock_inode = !inode_is_locked(inode) && !IS_NOSEC(inode);
 	loff_t pos = io->u.ci_wr.wr.crw_pos;
 	size_t cnt = io->u.ci_wr.wr.crw_count;
+	size_t nob = io->ci_nob;
+	size_t written = 0;
 	ssize_t result = 0;
 
 	down_read(&lli->lli_trunc_sem);
@@ -1135,6 +1138,7 @@ static int vvp_io_write_start(const struct lu_env *env,
 		if (unlikely(lock_inode))
 			inode_unlock(inode);
 
+		written = result;
 		if (result > 0 || result == -EIOCBQUEUED)
 			result = generic_write_sync(vio->vui_iocb, result);
 	}
@@ -1149,6 +1153,15 @@ static int vvp_io_write_start(const struct lu_env *env,
 			       io->ci_nob, result);
 		}
 	}
+	if (vio->vui_iocb->ki_pos != (pos + io->ci_nob - nob)) {
+		CDEBUG(D_VFSTRACE,
+		       "%s: write position mismatch: ki_pos %lld vs. pos %lld, written %ld, commit %ld rc %ld\n",
+		       file_dentry(file)->d_name.name,
+		       vio->vui_iocb->ki_pos, pos + io->ci_nob - nob,
+		       written, io->ci_nob - nob, result);
+		/* rewind ki_pos to where it has successfully committed */
+		vio->vui_iocb->ki_pos = pos + io->ci_nob - nob;
+	}
 	if (result > 0) {
 		set_bit(LLIF_DATA_MODIFIED, &(ll_i2info(inode))->lli_flags);
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 539/622] lustre: ldlm: FLOCK request can be processed twice
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (537 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 538/622] lustre: llite: file write pos mimatch James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 540/622] lnet: timers: correctly offset mod_timer James Simmons
                   ` (83 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Andriy Skulysh <c17819@cray.com>

The original request can be processed after the resent request,
so it can create a lock on the MDT without a matching client lock,
or unlock another lock.

Make the flock enqueue use the modify RPC slot.

Cray-bug-id: LUS-5739
WC-bug-id: https://jira.whamcloud.com/browse/LU-12828
Lustre-commit: 85a12c6c8d7a ("LU-12828 ldlm: FLOCK request can be processed twice")
Signed-off-by: Andriy Skulysh <c17819@cray.com>
Signed-off-by: Vitaly Fertman <c17818@cray.com>
Reviewed-by: Alexander Boyko <c17825@cray.com>
Reviewed-by: Andrew Perepechko <c17827@cray.com>
Reviewed-on: https://review.whamcloud.com/36340
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_dlm.h |  3 +++
 fs/lustre/include/lustre_mdc.h | 25 -----------------------
 fs/lustre/include/lustre_net.h |  3 ++-
 fs/lustre/include/obd_class.h  |  6 ++----
 fs/lustre/ldlm/ldlm_request.c  | 34 ++++++++++++++++++++++++++++---
 fs/lustre/mdc/mdc_locks.c      | 45 +++++++++++++-----------------------------
 fs/lustre/mdc/mdc_reint.c      |  4 ++--
 fs/lustre/mdc/mdc_request.c    | 20 +++++++++----------
 fs/lustre/obdclass/genops.c    | 30 +++++-----------------------
 fs/lustre/ptlrpc/client.c      | 29 +++++++++++++++++++++++++--
 10 files changed, 96 insertions(+), 103 deletions(-)

diff --git a/fs/lustre/include/lustre_dlm.h b/fs/lustre/include/lustre_dlm.h
index 7621d1e..31d360e 100644
--- a/fs/lustre/include/lustre_dlm.h
+++ b/fs/lustre/include/lustre_dlm.h
@@ -959,6 +959,8 @@ struct ldlm_enqueue_info {
 	void			*ei_cbdata;
 	/* whether enqueue slave stripes */
 	unsigned int		ei_enq_slave:1;
+	/* whether acquire rpc slot */
+	unsigned int		ei_enq_slot:1;
 };
 
 extern struct obd_ops ldlm_obd_ops;
@@ -1279,6 +1281,7 @@ int ldlm_prep_elc_req(struct obd_export *exp,
 		      int version, int opc, int canceloff,
 		      struct list_head *cancels, int count);
 
+struct ptlrpc_request *ldlm_enqueue_pack(struct obd_export *exp, int lvb_len);
 int ldlm_cli_enqueue_fini(struct obd_export *exp, struct ptlrpc_request *req,
 			  enum ldlm_type type, u8 with_policy,
 			  enum ldlm_mode mode,
diff --git a/fs/lustre/include/lustre_mdc.h b/fs/lustre/include/lustre_mdc.h
index f57783d..d7b6e4a 100644
--- a/fs/lustre/include/lustre_mdc.h
+++ b/fs/lustre/include/lustre_mdc.h
@@ -60,31 +60,6 @@
 struct ptlrpc_request;
 struct obd_device;
 
-static inline void mdc_get_mod_rpc_slot(struct ptlrpc_request *req,
-					struct lookup_intent *it)
-{
-	struct client_obd *cli = &req->rq_import->imp_obd->u.cli;
-	u32 opc;
-	u16 tag;
-
-	opc = lustre_msg_get_opc(req->rq_reqmsg);
-	tag = obd_get_mod_rpc_slot(cli, opc, it);
-	lustre_msg_set_tag(req->rq_reqmsg, tag);
-	ptlrpc_reassign_next_xid(req);
-}
-
-static inline void mdc_put_mod_rpc_slot(struct ptlrpc_request *req,
-					struct lookup_intent *it)
-{
-	struct client_obd *cli = &req->rq_import->imp_obd->u.cli;
-	u32 opc;
-	u16 tag;
-
-	opc = lustre_msg_get_opc(req->rq_reqmsg);
-	tag = lustre_msg_get_tag(req->rq_reqmsg);
-	obd_put_mod_rpc_slot(cli, opc, it, tag);
-}
-
 /**
  * Update the maximum possible easize.
  *
diff --git a/fs/lustre/include/lustre_net.h b/fs/lustre/include/lustre_net.h
index 87e1d60..90a0b01 100644
--- a/fs/lustre/include/lustre_net.h
+++ b/fs/lustre/include/lustre_net.h
@@ -1919,7 +1919,8 @@ void ptlrpc_retain_replayable_request(struct ptlrpc_request *req,
 u64 ptlrpc_next_xid(void);
 u64 ptlrpc_sample_next_xid(void);
 u64 ptlrpc_req_xid(struct ptlrpc_request *request);
-void ptlrpc_reassign_next_xid(struct ptlrpc_request *req);
+void ptlrpc_get_mod_rpc_slot(struct ptlrpc_request *req);
+void ptlrpc_put_mod_rpc_slot(struct ptlrpc_request *req);
 
 /* Set of routines to run a function in ptlrpcd context */
 void *ptlrpcd_alloc_work(struct obd_import *imp,
diff --git a/fs/lustre/include/obd_class.h b/fs/lustre/include/obd_class.h
index bc01eca..a099768 100644
--- a/fs/lustre/include/obd_class.h
+++ b/fs/lustre/include/obd_class.h
@@ -115,10 +115,8 @@ static inline char *obd_import_nid2str(struct obd_import *imp)
 int obd_set_max_mod_rpcs_in_flight(struct client_obd *cli, u16 max);
 int obd_mod_rpc_stats_seq_show(struct client_obd *cli, struct seq_file *seq);
 
-u16 obd_get_mod_rpc_slot(struct client_obd *cli, u32 opc,
-			 struct lookup_intent *it);
-void obd_put_mod_rpc_slot(struct client_obd *cli, u32 opc,
-			  struct lookup_intent *it, u16 tag);
+u16 obd_get_mod_rpc_slot(struct client_obd *cli, u32 opc);
+void obd_put_mod_rpc_slot(struct client_obd *cli, u32 opc, u16 tag);
 
 struct llog_handle;
 struct llog_rec_hdr;
diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index 20bdba4..6df057d 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -347,6 +347,11 @@ static void failed_lock_cleanup(struct ldlm_namespace *ns,
 	}
 }
 
+static bool ldlm_request_slot_needed(enum ldlm_type type)
+{
+	return type == LDLM_FLOCK || type == LDLM_IBITS;
+}
+
 /**
  * Finishing portion of client lock enqueue code.
  *
@@ -365,6 +370,11 @@ int ldlm_cli_enqueue_fini(struct obd_export *exp, struct ptlrpc_request *req,
 	struct ldlm_reply *reply;
 	int cleanup_phase = 1;
 
+	if (ldlm_request_slot_needed(type))
+		obd_put_request_slot(&req->rq_import->imp_obd->u.cli);
+
+	ptlrpc_put_mod_rpc_slot(req);
+
 	lock = ldlm_handle2lock(lockh);
 	/* ldlm_cli_enqueue is holding a reference on this lock. */
 	if (!lock) {
@@ -662,8 +672,7 @@ int ldlm_prep_enqueue_req(struct obd_export *exp, struct ptlrpc_request *req,
 }
 EXPORT_SYMBOL(ldlm_prep_enqueue_req);
 
-static struct ptlrpc_request *ldlm_enqueue_pack(struct obd_export *exp,
-						int lvb_len)
+struct ptlrpc_request *ldlm_enqueue_pack(struct obd_export *exp, int lvb_len)
 {
 	struct ptlrpc_request *req;
 	int rc;
@@ -682,6 +691,7 @@ static struct ptlrpc_request *ldlm_enqueue_pack(struct obd_export *exp,
 	ptlrpc_request_set_replen(req);
 	return req;
 }
+EXPORT_SYMBOL(ldlm_enqueue_pack);
 
 /**
  * Client-side lock enqueue.
@@ -814,6 +824,24 @@ int ldlm_cli_enqueue(struct obd_export *exp, struct ptlrpc_request **reqp,
 					     LDLM_GLIMPSE_ENQUEUE);
 	}
 
+	/* It is important to obtain modify RPC slot first (if applicable), so
+	 * that threads that are waiting for a modify RPC slot are not polluting
+	 * our rpcs in flight counter.
+	 */
+	if (einfo->ei_enq_slot)
+		ptlrpc_get_mod_rpc_slot(req);
+
+	if (ldlm_request_slot_needed(einfo->ei_type)) {
+		rc = obd_get_request_slot(&req->rq_import->imp_obd->u.cli);
+		if (rc) {
+			if (einfo->ei_enq_slot)
+				ptlrpc_put_mod_rpc_slot(req);
+			failed_lock_cleanup(ns, lock, einfo->ei_mode);
+			LDLM_LOCK_RELEASE(lock);
+			goto out;
+		}
+	}
+
 	if (async) {
 		LASSERT(reqp);
 		return 0;
@@ -835,7 +863,7 @@ int ldlm_cli_enqueue(struct obd_export *exp, struct ptlrpc_request **reqp,
 		LDLM_LOCK_RELEASE(lock);
 	else
 		rc = err;
-
+out:
 	if (!req_passed_in && req) {
 		ptlrpc_req_finished(req);
 		if (reqp)
diff --git a/fs/lustre/mdc/mdc_locks.c b/fs/lustre/mdc/mdc_locks.c
index 4d40087..60bbae1 100644
--- a/fs/lustre/mdc/mdc_locks.c
+++ b/fs/lustre/mdc/mdc_locks.c
@@ -856,6 +856,16 @@ static int mdc_finish_enqueue(struct obd_export *exp,
 	return rc;
 }
 
+static inline bool mdc_skip_mod_rpc_slot(const struct lookup_intent *it)
+{
+	if (it &&
+	    (it->it_op == IT_GETATTR || it->it_op == IT_LOOKUP ||
+	     it->it_op == IT_READDIR ||
+	     (it->it_op == IT_LAYOUT && !(it->it_flags & MDS_FMODE_WRITE))))
+		return true;
+	return false;
+}
+
 /* We always reserve enough space in the reply packet for a stripe MD, because
  * we don't know in advance the file type.
  */
@@ -877,7 +887,7 @@ int mdc_enqueue_base(struct obd_export *exp, struct ldlm_enqueue_info *einfo,
 		.l_inodebits = { MDS_INODELOCK_XATTR }
 	};
 	struct obd_device *obddev = class_exp2obd(exp);
-	struct ptlrpc_request *req = NULL;
+	struct ptlrpc_request *req;
 	u64 flags, saved_flags = extra_lock_flags;
 	struct ldlm_res_id res_id;
 	int generation, resends = 0;
@@ -920,6 +930,7 @@ int mdc_enqueue_base(struct obd_export *exp, struct ldlm_enqueue_info *einfo,
 		LASSERTF(einfo->ei_type == LDLM_FLOCK, "lock type %d\n",
 			 einfo->ei_type);
 		res_id.name[3] = LDLM_FLOCK;
+		req = ldlm_enqueue_pack(exp, 0);
 	} else if (it->it_op & IT_OPEN) {
 		req = mdc_intent_open_pack(exp, it, op_data, acl_bufsize);
 	} else if (it->it_op & (IT_GETATTR | IT_LOOKUP)) {
@@ -947,21 +958,7 @@ int mdc_enqueue_base(struct obd_export *exp, struct ldlm_enqueue_info *einfo,
 		req->rq_sent = ktime_get_real_seconds() + resends;
 	}
 
-	/* It is important to obtain modify RPC slot first (if applicable), so
-	 * that threads that are waiting for a modify RPC slot are not polluting
-	 * our rpcs in flight counter.
-	 * We do not do flock request limiting, though
-	 */
-	if (it) {
-		mdc_get_mod_rpc_slot(req, it);
-		rc = obd_get_request_slot(&obddev->u.cli);
-		if (rc != 0) {
-			mdc_put_mod_rpc_slot(req, it);
-			mdc_clear_replay_flag(req, 0);
-			ptlrpc_req_finished(req);
-			return rc;
-		}
-	}
+	einfo->ei_enq_slot = !mdc_skip_mod_rpc_slot(it);
 
 	/* With Data-on-MDT the glimpse callback is needed too.
 	 * It is set here in advance but not in mdc_finish_enqueue()
@@ -987,12 +984,10 @@ int mdc_enqueue_base(struct obd_export *exp, struct ldlm_enqueue_info *einfo,
 		    (einfo->ei_type == LDLM_FLOCK) &&
 		    (einfo->ei_mode == LCK_NL))
 			goto resend;
+		ptlrpc_req_finished(req);
 		return rc;
 	}
 
-	obd_put_request_slot(&obddev->u.cli);
-	mdc_put_mod_rpc_slot(req, it);
-
 	if (rc < 0) {
 		CDEBUG(D_INFO,
 		       "%s: ldlm_cli_enqueue " DFID ":" DFID "=%s failed: rc = %d\n",
@@ -1343,16 +1338,12 @@ static int mdc_intent_getattr_async_interpret(const struct lu_env *env,
 	struct ldlm_enqueue_info *einfo = &minfo->mi_einfo;
 	struct lookup_intent *it;
 	struct lustre_handle *lockh;
-	struct obd_device *obddev;
 	struct ldlm_reply *lockrep;
 	u64 flags = LDLM_FL_HAS_INTENT;
 
 	it = &minfo->mi_it;
 	lockh = &minfo->mi_lockh;
 
-	obddev = class_exp2obd(exp);
-
-	obd_put_request_slot(&obddev->u.cli);
 	if (OBD_FAIL_CHECK(OBD_FAIL_MDC_GETATTR_ENQUEUE))
 		rc = -ETIMEDOUT;
 
@@ -1387,7 +1378,6 @@ int mdc_intent_getattr_async(struct obd_export *exp,
 	struct lookup_intent *it = &minfo->mi_it;
 	struct ptlrpc_request *req;
 	struct mdc_getattr_args *ga;
-	struct obd_device *obddev = class_exp2obd(exp);
 	struct ldlm_res_id res_id;
 	union ldlm_policy_data policy = {
 		.l_inodebits = { MDS_INODELOCK_LOOKUP | MDS_INODELOCK_UPDATE }
@@ -1409,12 +1399,6 @@ int mdc_intent_getattr_async(struct obd_export *exp,
 	if (IS_ERR(req))
 		return PTR_ERR(req);
 
-	rc = obd_get_request_slot(&obddev->u.cli);
-	if (rc != 0) {
-		ptlrpc_req_finished(req);
-		return rc;
-	}
-
 	/* With Data-on-MDT the glimpse callback is needed too.
 	 * It is set here in advance but not in mdc_finish_enqueue()
 	 * to avoid possible races. It is safe to have glimpse handler
@@ -1426,7 +1410,6 @@ int mdc_intent_getattr_async(struct obd_export *exp,
 	rc = ldlm_cli_enqueue(exp, &req, &minfo->mi_einfo, &res_id, &policy,
 			      &flags, NULL, 0, LVB_T_NONE, &minfo->mi_lockh, 1);
 	if (rc < 0) {
-		obd_put_request_slot(&obddev->u.cli);
 		ptlrpc_req_finished(req);
 		return rc;
 	}
diff --git a/fs/lustre/mdc/mdc_reint.c b/fs/lustre/mdc/mdc_reint.c
index 0dc0de4..dade5686 100644
--- a/fs/lustre/mdc/mdc_reint.c
+++ b/fs/lustre/mdc/mdc_reint.c
@@ -47,9 +47,9 @@ static int mdc_reint(struct ptlrpc_request *request, int level)
 
 	request->rq_send_state = level;
 
-	mdc_get_mod_rpc_slot(request, NULL);
+	ptlrpc_get_mod_rpc_slot(request);
 	rc = ptlrpc_queue_wait(request);
-	mdc_put_mod_rpc_slot(request, NULL);
+	ptlrpc_put_mod_rpc_slot(request);
 	if (rc)
 		CDEBUG(D_INFO, "error in handling %d\n", rc);
 	else if (!req_capsule_server_get(&request->rq_pill, &RMF_MDT_BODY))
diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index 54f6d15..8569858 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -412,12 +412,12 @@ static int mdc_xattr_common(struct obd_export *exp,
 
 	/* make rpc */
 	if (opcode == MDS_REINT)
-		mdc_get_mod_rpc_slot(req, NULL);
+		ptlrpc_get_mod_rpc_slot(req);
 
 	rc = ptlrpc_queue_wait(req);
 
 	if (opcode == MDS_REINT)
-		mdc_put_mod_rpc_slot(req, NULL);
+		ptlrpc_put_mod_rpc_slot(req);
 
 	if (rc)
 		ptlrpc_req_finished(req);
@@ -990,9 +990,9 @@ static int mdc_close(struct obd_export *exp, struct md_op_data *op_data,
 
 	ptlrpc_request_set_replen(req);
 
-	mdc_get_mod_rpc_slot(req, NULL);
+	ptlrpc_get_mod_rpc_slot(req);
 	rc = ptlrpc_queue_wait(req);
-	mdc_put_mod_rpc_slot(req, NULL);
+	ptlrpc_put_mod_rpc_slot(req);
 
 	if (!req->rq_repmsg) {
 		CDEBUG(D_RPCTRACE, "request %p failed to send: rc = %d\n", req,
@@ -1779,9 +1779,9 @@ static int mdc_ioc_hsm_progress(struct obd_export *exp,
 
 	ptlrpc_request_set_replen(req);
 
-	mdc_get_mod_rpc_slot(req, NULL);
+	ptlrpc_get_mod_rpc_slot(req);
 	rc = ptlrpc_queue_wait(req);
-	mdc_put_mod_rpc_slot(req, NULL);
+	ptlrpc_put_mod_rpc_slot(req);
 out:
 	ptlrpc_req_finished(req);
 	return rc;
@@ -1984,9 +1984,9 @@ static int mdc_ioc_hsm_state_set(struct obd_export *exp,
 
 	ptlrpc_request_set_replen(req);
 
-	mdc_get_mod_rpc_slot(req, NULL);
+	ptlrpc_get_mod_rpc_slot(req);
 	rc = ptlrpc_queue_wait(req);
-	mdc_put_mod_rpc_slot(req, NULL);
+	ptlrpc_put_mod_rpc_slot(req);
 out:
 	ptlrpc_req_finished(req);
 	return rc;
@@ -2049,9 +2049,9 @@ static int mdc_ioc_hsm_request(struct obd_export *exp,
 
 	ptlrpc_request_set_replen(req);
 
-	mdc_get_mod_rpc_slot(req, NULL);
+	ptlrpc_get_mod_rpc_slot(req);
 	rc = ptlrpc_queue_wait(req);
-	mdc_put_mod_rpc_slot(req, NULL);
+	ptlrpc_put_mod_rpc_slot(req);
 out:
 	ptlrpc_req_finished(req);
 	return rc;
diff --git a/fs/lustre/obdclass/genops.c b/fs/lustre/obdclass/genops.c
index 7f841d5..bceb055 100644
--- a/fs/lustre/obdclass/genops.c
+++ b/fs/lustre/obdclass/genops.c
@@ -1495,36 +1495,18 @@ static inline bool obd_mod_rpc_slot_avail(struct client_obd *cli,
 	return avail;
 }
 
-static inline bool obd_skip_mod_rpc_slot(const struct lookup_intent *it)
-{
-	if (it &&
-	    (it->it_op == IT_GETATTR || it->it_op == IT_LOOKUP ||
-	     it->it_op == IT_READDIR ||
-	     (it->it_op == IT_LAYOUT && !(it->it_flags & MDS_FMODE_WRITE))))
-		return true;
-	return false;
-}
-
 /* Get a modify RPC slot from the obd client @cli according
- * to the kind of operation @opc that is going to be sent
- * and the intent @it of the operation if it applies.
+ * to the kind of operation @opc that is going to be sent.
  * If the maximum number of modify RPCs in flight is reached
  * the thread is put to sleep.
  * Returns the tag to be set in the request message. Tag 0
  * is reserved for non-modifying requests.
  */
-u16 obd_get_mod_rpc_slot(struct client_obd *cli, u32 opc,
-			 struct lookup_intent *it)
+u16 obd_get_mod_rpc_slot(struct client_obd *cli, u32 opc)
 {
 	bool close_req = false;
 	u16 i, max;
 
-	/* read-only metadata RPCs don't consume a slot on MDT
-	 * for reply reconstruction
-	 */
-	if (obd_skip_mod_rpc_slot(it))
-		return 0;
-
 	if (opc == MDS_CLOSE)
 		close_req = true;
 
@@ -1567,15 +1549,13 @@ u16 obd_get_mod_rpc_slot(struct client_obd *cli, u32 opc,
 
 /*
  * Put a modify RPC slot from the obd client @cli according
- * to the kind of operation @opc that has been sent and the
- * intent @it of the operation if it applies.
+ * to the kind of operation @opc that has been sent.
  */
-void obd_put_mod_rpc_slot(struct client_obd *cli, u32 opc,
-			  struct lookup_intent *it, u16 tag)
+void obd_put_mod_rpc_slot(struct client_obd *cli, u32 opc, u16 tag)
 {
 	bool close_req = false;
 
-	if (obd_skip_mod_rpc_slot(it))
+	if (tag == 0)
 		return;
 
 	if (opc == MDS_CLOSE)
diff --git a/fs/lustre/ptlrpc/client.c b/fs/lustre/ptlrpc/client.c
index 8d874f2..632ddf1 100644
--- a/fs/lustre/ptlrpc/client.c
+++ b/fs/lustre/ptlrpc/client.c
@@ -717,7 +717,7 @@ static inline void ptlrpc_assign_next_xid(struct ptlrpc_request *req)
 
 static atomic64_t ptlrpc_last_xid;
 
-void ptlrpc_reassign_next_xid(struct ptlrpc_request *req)
+static void ptlrpc_reassign_next_xid(struct ptlrpc_request *req)
 {
 	spin_lock(&req->rq_import->imp_lock);
 	list_del_init(&req->rq_unreplied_list);
@@ -725,7 +725,32 @@ void ptlrpc_reassign_next_xid(struct ptlrpc_request *req)
 	spin_unlock(&req->rq_import->imp_lock);
 	DEBUG_REQ(D_RPCTRACE, req, "reassign xid");
 }
-EXPORT_SYMBOL(ptlrpc_reassign_next_xid);
+
+void ptlrpc_get_mod_rpc_slot(struct ptlrpc_request *req)
+{
+	struct client_obd *cli = &req->rq_import->imp_obd->u.cli;
+	u32 opc;
+	u16 tag;
+
+	opc = lustre_msg_get_opc(req->rq_reqmsg);
+	tag = obd_get_mod_rpc_slot(cli, opc);
+	lustre_msg_set_tag(req->rq_reqmsg, tag);
+	ptlrpc_reassign_next_xid(req);
+}
+EXPORT_SYMBOL(ptlrpc_get_mod_rpc_slot);
+
+void ptlrpc_put_mod_rpc_slot(struct ptlrpc_request *req)
+{
+	u16 tag = lustre_msg_get_tag(req->rq_reqmsg);
+
+	if (tag != 0) {
+		struct client_obd *cli = &req->rq_import->imp_obd->u.cli;
+		u32 opc = lustre_msg_get_opc(req->rq_reqmsg);
+
+		obd_put_mod_rpc_slot(cli, opc, tag);
+	}
+}
+EXPORT_SYMBOL(ptlrpc_put_mod_rpc_slot);
 
 int ptlrpc_request_bufs_pack(struct ptlrpc_request *request,
 			     u32 version, int opcode, char **bufs,
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 540/622] lnet: timers: correctly offset mod_timer.
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (538 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 539/622] lustre: ldlm: FLOCK request can be processed twice James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 541/622] lustre: ptlrpc: update wiretest for new values James Simmons
                   ` (82 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

During a high-level code review of the lustre time code it was
discovered that some of the mod_timer() calls were missing the
current jiffies value in the timeout, which is converted to jiffies
from seconds. Add the proper offset.

Fixes: 5109c2502543 ("staging: lustre: lnet: move ping and delay injection to time64_t")

WC-bug-id: https://jira.whamcloud.com/browse/LU-12931
Lustre-commit: e150810faa5 ("LU-12931 timers: correctly offset mod_timer.")
Signed-off-by: James Simmons <jsimmons@infradead.org>
Reviewed-on: https://review.whamcloud.com/36688
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/net_fault.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/lnet/lnet/net_fault.c b/net/lnet/lnet/net_fault.c
index e43b1e1..8408e93 100644
--- a/net/lnet/lnet/net_fault.c
+++ b/net/lnet/lnet/net_fault.c
@@ -487,7 +487,7 @@ struct lnet_delay_rule {
 	/** baseline to caculate dl_delay_time */
 	time64_t		dl_time_base;
 	/** jiffies to send the next delayed message */
-	unsigned long		dl_msg_send;
+	time64_t		dl_msg_send;
 	/** delayed message list */
 	struct list_head	dl_msg_list;
 	/** statistic of delayed messages */
@@ -592,7 +592,7 @@ struct delay_daemon_data {
 	msg->msg_delay_send = ktime_get_seconds() + attr->u.delay.la_latency;
 	if (rule->dl_msg_send == -1) {
 		rule->dl_msg_send = msg->msg_delay_send;
-		mod_timer(&rule->dl_timer, rule->dl_msg_send);
+		mod_timer(&rule->dl_timer, jiffies + rule->dl_msg_send * HZ);
 	}
 
 	spin_unlock(&rule->dl_lock);
@@ -664,7 +664,7 @@ struct delay_daemon_data {
 		msg = list_first_entry(&rule->dl_msg_list,
 				       struct lnet_msg, msg_list);
 		rule->dl_msg_send = msg->msg_delay_send;
-		mod_timer(&rule->dl_timer, rule->dl_msg_send);
+		mod_timer(&rule->dl_timer, jiffies + rule->dl_msg_send * HZ);
 	}
 	spin_unlock(&rule->dl_lock);
 }
-- 
1.8.3.1


* [lustre-devel] [PATCH 541/622] lustre: ptlrpc: update wiretest for new values
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (539 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 540/622] lnet: timers: correctly offset mod_timer James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 542/622] lustre: ptlrpc: do lu_env_refill for any new request James Simmons
                   ` (81 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Update the wiretest.c file to fix issues with some #defines that were
changed to named enums. There is no need to wire-check posix acl
structures if CONFIG_FS_POSIX_ACL is disabled.

Fixes: cd7fd3b2e230 ("lustre: obd: add rmfid support")
Fixes: c52da9b97ee0 ("lustre: introduce CONFIG_LUSTRE_FS_POSIX_ACL")
Fixes: 0b75bfcd14ac ("lustre: uapi: Add nonrotational flag to statfs")

WC-bug-id: https://jira.whamcloud.com/browse/LU-12937
Lustre-commit: bc2e23e1cd80 ("LU-12937 utils: update wirecheck for new values")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/36706
Reviewed-by: Artem Blagodarenko <c17828@cray.com>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/wiretest.c | 30 ++++++++++++++++--------------
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index 671878d..9fc7a5b 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1748,19 +1748,19 @@ void lustre_assert_wire_constants(void)
 		 (long long)(int)offsetof(struct obd_statfs, os_spare9));
 	LASSERTF((int)sizeof(((struct obd_statfs *)0)->os_spare9) == 4, "found %lld\n",
 		 (long long)(int)sizeof(((struct obd_statfs *)0)->os_spare9));
-	LASSERTF(OS_STATE_DEGRADED == 0x1, "found %lld\n",
+	LASSERTF(OS_STATE_DEGRADED == 0x00000001UL, "found %lld\n",
 		 (long long)OS_STATE_DEGRADED);
-	LASSERTF(OS_STATE_READONLY == 0x2, "found %lld\n",
+	LASSERTF(OS_STATE_READONLY == 0x00000002UL, "found %lld\n",
 		 (long long)OS_STATE_READONLY);
-	LASSERTF(OS_STATE_NOPRECREATE == 0x4, "found %lld\n",
+	LASSERTF(OS_STATE_NOPRECREATE == 0x00000004UL, "found %lld\n",
 		 (long long)OS_STATE_NOPRECREATE);
-	LASSERTF(OS_STATE_ENOSPC == 0x20, "found %lld\n",
+	LASSERTF(OS_STATE_ENOSPC == 0x00000020UL, "found %lld\n",
 		 (long long)OS_STATE_ENOSPC);
-	LASSERTF(OS_STATE_ENOINO == 0x40, "found %lld\n",
+	LASSERTF(OS_STATE_ENOINO == 0x00000040UL, "found %lld\n",
 		 (long long)OS_STATE_ENOINO);
-	LASSERTF(OS_STATE_SUM == 0x100, "found %lld\n",
+	LASSERTF(OS_STATE_SUM == 0x00000100UL, "found %lld\n",
 		 (long long)OS_STATE_SUM);
-	LASSERTF(OS_STATE_NONROT == 0x200, "found %lld\n",
+	LASSERTF(OS_STATE_NONROT == 0x00000200UL, "found %lld\n",
 		 (long long)OS_STATE_NONROT);
 
 	/* Checks for struct obd_ioobj */
@@ -2178,19 +2178,19 @@ void lustre_assert_wire_constants(void)
 		 LUSTRE_DIRECTIO_FL);
 	LASSERTF(LUSTRE_INLINE_DATA_FL == 0x10000000, "found 0x%.8x\n",
 		 LUSTRE_INLINE_DATA_FL);
-	LASSERTF(MDS_INODELOCK_LOOKUP == 0x000001, "found 0x%.8x\n",
+	LASSERTF(MDS_INODELOCK_LOOKUP == 0x00000001UL, "found 0x%.8x\n",
 		 MDS_INODELOCK_LOOKUP);
-	LASSERTF(MDS_INODELOCK_UPDATE == 0x000002, "found 0x%.8x\n",
+	LASSERTF(MDS_INODELOCK_UPDATE == 0x00000002UL, "found 0x%.8x\n",
 		 MDS_INODELOCK_UPDATE);
-	LASSERTF(MDS_INODELOCK_OPEN == 0x000004, "found 0x%.8x\n",
+	LASSERTF(MDS_INODELOCK_OPEN == 0x00000004UL, "found 0x%.8x\n",
 		 MDS_INODELOCK_OPEN);
-	LASSERTF(MDS_INODELOCK_LAYOUT == 0x000008, "found 0x%.8x\n",
+	LASSERTF(MDS_INODELOCK_LAYOUT == 0x00000008UL, "found 0x%.8x\n",
 		 MDS_INODELOCK_LAYOUT);
-	LASSERTF(MDS_INODELOCK_PERM == 0x000010, "found 0x%.8x\n",
+	LASSERTF(MDS_INODELOCK_PERM == 0x00000010UL, "found 0x%.8x\n",
 		MDS_INODELOCK_PERM);
-	LASSERTF(MDS_INODELOCK_XATTR == 0x000020, "found 0x%.8x\n",
+	LASSERTF(MDS_INODELOCK_XATTR == 0x00000020UL, "found 0x%.8x\n",
 		MDS_INODELOCK_XATTR);
-	LASSERTF(MDS_INODELOCK_DOM == 0x000040, "found 0x%.8x\n",
+	LASSERTF(MDS_INODELOCK_DOM == 0x00000040UL, "found 0x%.8x\n",
 		MDS_INODELOCK_DOM);
 
 	/* Checks for struct mdt_ioepoch */
@@ -4176,6 +4176,7 @@ void lustre_assert_wire_constants(void)
 	BUILD_BUG_ON(FIEMAP_EXTENT_NO_DIRECT != 0x40000000);
 	BUILD_BUG_ON(FIEMAP_EXTENT_NET != 0x80000000);
 
+#ifdef CONFIG_FS_POSIX_ACL
 	/* Checks for type posix_acl_xattr_entry */
 	LASSERTF((int)sizeof(struct posix_acl_xattr_entry) == 8, "found %lld\n",
 		 (long long)(int)sizeof(struct posix_acl_xattr_entry));
@@ -4199,6 +4200,7 @@ void lustre_assert_wire_constants(void)
 		 (long long)(int)offsetof(struct posix_acl_xattr_header, a_version));
 	LASSERTF((int)sizeof(((struct posix_acl_xattr_header *)0)->a_version) == 4, "found %lld\n",
 		 (long long)(int)sizeof(((struct posix_acl_xattr_header *)0)->a_version));
+#endif /* CONFIG_FS_POSIX_ACL */
 
 	/* Checks for struct link_ea_header */
 	LASSERTF((int)sizeof(struct link_ea_header) == 24, "found %lld\n",
-- 
1.8.3.1


* [lustre-devel] [PATCH 542/622] lustre: ptlrpc: do lu_env_refill for any new request
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (540 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 541/622] lustre: ptlrpc: update wiretest for new values James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 543/622] lustre: obd: perform proper division James Simmons
                   ` (80 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

Perform lu_env_refill() prior to any new request handling. That was
already done server side by tgt_request_handle() and is now moved to
ptlrpc_main() so that it works for any handler as well,
e.g. ldlm_cancel_handler().

WC-bug-id: https://jira.whamcloud.com/browse/LU-12741
Lustre-commit: 3f304b75d24a ("LU-12741 ptlrpc: do lu_env_refill for new request")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/36714
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/service.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/lustre/ptlrpc/service.c b/fs/lustre/ptlrpc/service.c
index c874487..f65d5c5 100644
--- a/fs/lustre/ptlrpc/service.c
+++ b/fs/lustre/ptlrpc/service.c
@@ -2281,6 +2281,12 @@ static int ptlrpc_main(void *arg)
 			ptlrpc_start_thread(svcpt, 0);
 		}
 
+		/* reset le_ses to initial state */
+		env->le_ses = NULL;
+		/* Refill the context before execution to make sure
+		 * all thread keys are allocated
+		 */
+		lu_env_refill(env);
 		/* Process all incoming reqs before handling any */
 		if (ptlrpc_server_request_incoming(svcpt)) {
 			lu_context_enter(&env->le_ctx);
-- 
1.8.3.1


* [lustre-devel] [PATCH 543/622] lustre: obd: perform proper division
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (541 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 542/622] lustre: ptlrpc: do lu_env_refill for any new request James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 544/622] lustre: uapi: introduce OBD_CONNECT2_CRUSH James Simmons
                   ` (79 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

Lustre stats have two fields, lc_sum and lc_count, which are both
s64, so using do_div() (whose dividend must be u64) is wrong. Use
div64_s64() instead.

WC-bug-id: https://jira.whamcloud.com/browse/LU-6174
Lustre-commit: e8f793f620f4 ("LU-6174 obd: perform proper division")
Signed-off-by: James Simmons <jsimmons@infradead.org>
Reviewed-on: https://review.whamcloud.com/36751
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/lprocfs_status.c | 23 +++++------------------
 1 file changed, 5 insertions(+), 18 deletions(-)

diff --git a/fs/lustre/obdclass/lprocfs_status.c b/fs/lustre/obdclass/lprocfs_status.c
index 806d6517..893f06d 100644
--- a/fs/lustre/obdclass/lprocfs_status.c
+++ b/fs/lustre/obdclass/lprocfs_status.c
@@ -799,15 +799,10 @@ int lprocfs_rd_import(struct seq_file *m, void *data)
 
 	header = &obd->obd_svc_stats->ls_cnt_header[PTLRPC_REQWAIT_CNTR];
 	lprocfs_stats_collect(obd->obd_svc_stats, PTLRPC_REQWAIT_CNTR, &ret);
-	if (ret.lc_count != 0) {
-		/* first argument to do_div MUST be u64 */
-		u64 sum = ret.lc_sum;
-
-		do_div(sum, ret.lc_count);
-		ret.lc_sum = sum;
-	} else {
+	if (ret.lc_count != 0)
+		ret.lc_sum = div64_s64(ret.lc_sum, ret.lc_count);
+	else
 		ret.lc_sum = 0;
-	}
 	seq_printf(m,
 		   "    rpcs:\n"
 		   "       inflight: %u\n"
@@ -848,11 +843,7 @@ int lprocfs_rd_import(struct seq_file *m, void *data)
 				      PTLRPC_LAST_CNTR + BRW_READ_BYTES + rw,
 				      &ret);
 		if (ret.lc_sum > 0 && ret.lc_count > 0) {
-			/* first argument to do_div MUST be u64 */
-			u64 sum = ret.lc_sum;
-
-			do_div(sum, ret.lc_count);
-			ret.lc_sum = sum;
+			ret.lc_sum = div64_s64(ret.lc_sum, ret.lc_count);
 			seq_printf(m,
 				   "    %s_data_averages:\n"
 				   "       bytes_per_rpc: %llu\n",
@@ -864,11 +855,7 @@ int lprocfs_rd_import(struct seq_file *m, void *data)
 		header = &obd->obd_svc_stats->ls_cnt_header[j];
 		lprocfs_stats_collect(obd->obd_svc_stats, j, &ret);
 		if (ret.lc_sum > 0 && ret.lc_count != 0) {
-			/* first argument to do_div MUST be u64 */
-			u64 sum = ret.lc_sum;
-
-			do_div(sum, ret.lc_count);
-			ret.lc_sum = sum;
+			ret.lc_sum = div64_s64(ret.lc_sum, ret.lc_count);
 			seq_printf(m,
 				   "       %s_per_rpc: %llu\n",
 				   header->lc_units, ret.lc_sum);
-- 
1.8.3.1


* [lustre-devel] [PATCH 544/622] lustre: uapi: introduce OBD_CONNECT2_CRUSH
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (542 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 543/622] lustre: obd: perform proper division James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 545/622] lnet: Wait for single discovery attempt of routers James Simmons
                   ` (78 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

Introduce a new connect flag OBD_CONNECT2_CRUSH to indicate whether
the client or server supports the new directory hash type 'crush'.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11025
Lustre-commit: dbafa9df0f8f ("LU-11025 uapi: introduce OBD_CONNECT2_CRUSH")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/36774
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Olaf Faaland-LLNL <faaland1@llnl.gov>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/lprocfs_status.c    | 2 +-
 fs/lustre/ptlrpc/wiretest.c            | 2 ++
 include/uapi/linux/lustre/lustre_idl.h | 2 ++
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/obdclass/lprocfs_status.c b/fs/lustre/obdclass/lprocfs_status.c
index 893f06d..9772194 100644
--- a/fs/lustre/obdclass/lprocfs_status.c
+++ b/fs/lustre/obdclass/lprocfs_status.c
@@ -124,7 +124,7 @@
 	"selinux_policy",	/* 0x400 */
 	"lsom",			/* 0x800 */
 	"pcc",			/* 0x1000 */
-	"plain_layout",		/* 0x2000 */
+	"crush",		/* 0x2000 */
 	"async_discard",	/* 0x4000 */
 	"client_encryption",	/* 0x8000 */
 	NULL
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index 9fc7a5b..6c66815 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1158,6 +1158,8 @@ void lustre_assert_wire_constants(void)
 		 OBD_CONNECT2_LSOM);
 	LASSERTF(OBD_CONNECT2_PCC == 0x1000ULL, "found 0x%.16llxULL\n",
 		 OBD_CONNECT2_PCC);
+	LASSERTF(OBD_CONNECT2_CRUSH == 0x2000ULL, "found 0x%.16llxULL\n",
+		 OBD_CONNECT2_CRUSH);
 	LASSERTF(OBD_CONNECT2_ASYNC_DISCARD == 0x4000ULL, "found 0x%.16llxULL\n",
 		 OBD_CONNECT2_ASYNC_DISCARD);
 	LASSERTF(OBD_CONNECT2_ENCRYPT == 0x8000ULL, "found 0x%.16llxULL\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index a74d979..a69d49a 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -810,6 +810,8 @@ struct ptlrpc_body_v2 {
 #define OBD_CONNECT2_SELINUX_POLICY    0x400ULL	/* has client SELinux policy */
 #define OBD_CONNECT2_LSOM	       0x800ULL	/* LSOM support */
 #define OBD_CONNECT2_PCC	       0x1000ULL /* Persistent Client Cache */
+#define OBD_CONNECT2_CRUSH	       0x2000ULL /* crush hash striped directory
+						  */
 #define OBD_CONNECT2_ASYNC_DISCARD     0x4000ULL /* support async DoM data
 						  * discard
 						  */
-- 
1.8.3.1


* [lustre-devel] [PATCH 545/622] lnet: Wait for single discovery attempt of routers
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (543 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 544/622] lustre: uapi: introduce OBD_CONNECT2_CRUSH James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 546/622] lustre: mgc: config lock leak James Simmons
                   ` (77 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

Historically, check_routers_before_use would cause LNet
initialization to pause until all routers had been ping'd once.

This behavior was changed in commit
fe17e9b8370affe063769b880f02b9190584baaa from LU-11298. Now, LNet
will wait indefinitely until discovery completes on all routers.
This is problematic, because if even one router is down then LNet
will stall forever.

Introduce a new lnet_peer state to indicate whether a router has
been discovered (either successfully or not) to restore the historic
behavior.

Fixes: fe17e9b8370a ("LU-11298 lnet: use peer for gateway")

Cray-bug-id: LUS-8184
WC-bug-id: https://jira.whamcloud.com/browse/LU-13001
Lustre-commit: d45a032d9a5c ("LU-13001 lnet: Wait for single discovery attempt of routers")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/36820
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-types.h | 2 ++
 net/lnet/lnet/router.c         | 3 ++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index 51cc9ce..4b110eb 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -732,6 +732,8 @@ struct lnet_peer {
 
 /* gw undergoing alive discovery */
 #define LNET_PEER_RTR_DISCOVERY	BIT(16)
+/* gw has undergone discovery (does not indicate success or failure) */
+#define LNET_PEER_RTR_DISCOVERED BIT(17)
 
 struct lnet_peer_net {
 	/* chain on lp_peer_nets */
diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index 41d0eb0..71ba951 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -408,6 +408,7 @@ bool lnet_is_route_alive(struct lnet_route *route)
 
 	spin_lock(&lp->lp_lock);
 	lp->lp_state &= ~LNET_PEER_RTR_DISCOVERY;
+	lp->lp_state |= LNET_PEER_RTR_DISCOVERED;
 	spin_unlock(&lp->lp_lock);
 
 	/* Router discovery successful? All peer information would've been
@@ -882,7 +883,7 @@ int lnet_get_rtr_pool_cfg(int cpt, struct lnet_ioctl_pool_cfg *pool_cfg)
 		list_for_each_entry(rtr, &the_lnet.ln_routers, lp_rtr_list) {
 			spin_lock(&rtr->lp_lock);
 
-			if (!(rtr->lp_state & LNET_PEER_DISCOVERED)) {
+			if (!(rtr->lp_state & LNET_PEER_RTR_DISCOVERED)) {
 				all_known = 0;
 				spin_unlock(&rtr->lp_lock);
 				break;
-- 
1.8.3.1


* [lustre-devel] [PATCH 546/622] lustre: mgc: config lock leak
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (544 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 545/622] lnet: Wait for single discovery attempt of routers James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 547/622] lnet: check if current->nsproxy is NULL before using James Simmons
                   ` (76 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Alexey Lyashkov <c17817@cray.com>

Regression introduced by "LU-580: update mgc llog process code".
It takes an additional cld reference for the lock, but the lock
cancel is forgotten during normal shutdown, so this lock holds the
cld on the list for a long time. Any config modification then needs
to cancel each lock separately.

Cray-bug-id: LUS-6253
Fixes: d7e09d0397e8 ("LU-580: update mgc llog process code")

WC-bug-id: https://jira.whamcloud.com/browse/LU-11185
Lustre-commit: 0ad54d597773 ("LU-11185 mgc: config lock leak")
Signed-off-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-on: https://review.whamcloud.com/32890
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_class.h |  1 +
 fs/lustre/ldlm/ldlm_lock.c    |  3 +++
 fs/lustre/mgc/mgc_request.c   | 57 ++++++++++++++++++++++++++-----------------
 3 files changed, 39 insertions(+), 22 deletions(-)

diff --git a/fs/lustre/include/obd_class.h b/fs/lustre/include/obd_class.h
index a099768..85fe129 100644
--- a/fs/lustre/include/obd_class.h
+++ b/fs/lustre/include/obd_class.h
@@ -197,6 +197,7 @@ int class_config_parse_llog(const struct lu_env *env, struct llog_ctxt *ctxt,
 /* list of active configuration logs  */
 struct config_llog_data {
 	struct ldlm_res_id		cld_resid;
+	struct lustre_handle		cld_lockh;
 	struct config_llog_instance	cld_cfg;
 	struct list_head		cld_list_chain; /* on config_llog_list */
 	atomic_t			cld_refcount;
diff --git a/fs/lustre/ldlm/ldlm_lock.c b/fs/lustre/ldlm/ldlm_lock.c
index 62d2c1d..2471e30 100644
--- a/fs/lustre/ldlm/ldlm_lock.c
+++ b/fs/lustre/ldlm/ldlm_lock.c
@@ -512,6 +512,9 @@ struct ldlm_lock *__ldlm_handle2lock(const struct lustre_handle *handle,
 
 	LASSERT(handle);
 
+	if (!lustre_handle_is_used(handle))
+		return NULL;
+
 	lock = class_handle2object(handle->cookie, &lock_handle_ops);
 	if (!lock)
 		return NULL;
diff --git a/fs/lustre/mgc/mgc_request.c b/fs/lustre/mgc/mgc_request.c
index 28064fd..b2c296e 100644
--- a/fs/lustre/mgc/mgc_request.c
+++ b/fs/lustre/mgc/mgc_request.c
@@ -122,7 +122,7 @@ static int mgc_logname2resid(char *logname, struct ldlm_res_id *res_id,
 static int config_log_get(struct config_llog_data *cld)
 {
 	atomic_inc(&cld->cld_refcount);
-	CDEBUG(D_INFO, "log %s refs %d\n", cld->cld_logname,
+	CDEBUG(D_INFO, "log %s (%p) refs %d\n", cld->cld_logname, cld,
 	       atomic_read(&cld->cld_refcount));
 	return 0;
 }
@@ -135,7 +135,7 @@ static void config_log_put(struct config_llog_data *cld)
 	if (!cld)
 		return;
 
-	CDEBUG(D_INFO, "log %s refs %d\n", cld->cld_logname,
+	CDEBUG(D_INFO, "log %s(%p) refs %d\n", cld->cld_logname, cld,
 	       atomic_read(&cld->cld_refcount));
 	LASSERT(atomic_read(&cld->cld_refcount) > 0);
 
@@ -379,16 +379,26 @@ struct config_llog_data *do_config_log_add(struct obd_device *obd,
 	return ERR_PTR(rc);
 }
 
-static inline void config_mark_cld_stop(struct config_llog_data *cld)
-{
-	if (!cld)
-		return;
+DEFINE_MUTEX(llog_process_lock);
 
-	mutex_lock(&cld->cld_lock);
+static inline void config_mark_cld_stop_nolock(struct config_llog_data *cld)
+{
 	spin_lock(&config_list_lock);
 	cld->cld_stopping = 1;
 	spin_unlock(&config_list_lock);
-	mutex_unlock(&cld->cld_lock);
+
+	CDEBUG(D_INFO, "lockh %#llx\n", cld->cld_lockh.cookie);
+	if (!ldlm_lock_addref_try(&cld->cld_lockh, LCK_CR))
+		ldlm_lock_decref_and_cancel(&cld->cld_lockh, LCK_CR);
+}
+
+static inline void config_mark_cld_stop(struct config_llog_data *cld)
+{
+	if (cld) {
+		mutex_lock(&cld->cld_lock);
+		config_mark_cld_stop_nolock(cld);
+		mutex_unlock(&cld->cld_lock);
+	}
 }
 
 /** Stop watching for updates on this log.
@@ -420,10 +430,6 @@ static int config_log_end(char *logname, struct config_llog_instance *cfg)
 		return rc;
 	}
 
-	spin_lock(&config_list_lock);
-	cld->cld_stopping = 1;
-	spin_unlock(&config_list_lock);
-
 	cld_recover = cld->cld_recover;
 	cld->cld_recover = NULL;
 
@@ -431,21 +437,22 @@ static int config_log_end(char *logname, struct config_llog_instance *cfg)
 	cld->cld_params = NULL;
 	cld_sptlrpc = cld->cld_sptlrpc;
 	cld->cld_sptlrpc = NULL;
+
+	config_mark_cld_stop_nolock(cld);
 	mutex_unlock(&cld->cld_lock);
 
 	config_mark_cld_stop(cld_recover);
-	config_log_put(cld_recover);
-
 	config_mark_cld_stop(cld_params);
-	config_log_put(cld_params);
+	config_mark_cld_stop(cld_sptlrpc);
 
+	config_log_put(cld_params);
+	config_log_put(cld_recover);
 	config_log_put(cld_sptlrpc);
 
 	/* drop the ref from the find */
 	config_log_put(cld);
 	/* drop the start ref */
 	config_log_put(cld);
-
 	CDEBUG(D_MGC, "end config log %s (%d)\n", logname ? logname : "client",
 	       rc);
 	return rc;
@@ -627,9 +634,14 @@ static void mgc_requeue_add(struct config_llog_data *cld)
 	       cld->cld_stopping, rq_state);
 	LASSERT(atomic_read(&cld->cld_refcount) > 0);
 
+	/* lets cancel an existent lock to mark cld as "lostlock" */
+	CDEBUG(D_INFO, "lockh %#llx\n", cld->cld_lockh.cookie);
+	if (!ldlm_lock_addref_try(&cld->cld_lockh, LCK_CR))
+		ldlm_lock_decref_and_cancel(&cld->cld_lockh, LCK_CR);
+
 	mutex_lock(&cld->cld_lock);
 	spin_lock(&config_list_lock);
-	if (!(rq_state & RQ_STOP) && !cld->cld_stopping && !cld->cld_lostlock) {
+	if (!(rq_state & RQ_STOP) && !cld->cld_stopping) {
 		cld->cld_lostlock = 1;
 		rq_state |= RQ_NOW;
 		wakeup = true;
@@ -803,6 +815,7 @@ static int mgc_blocking_ast(struct ldlm_lock *lock, struct ldlm_lock_desc *desc,
 		LASSERT(atomic_read(&cld->cld_refcount) > 0);
 
 		lock->l_ast_data = NULL;
+		cld->cld_lockh.cookie = 0;
 		/* Are we done with this log? */
 		if (cld->cld_stopping) {
 			CDEBUG(D_MGC, "log %s: stopping, won't requeue\n",
@@ -1616,9 +1629,12 @@ int mgc_process_log(struct obd_device *mgc, struct config_llog_data *cld)
 		/* Get the cld, it will be released in mgc_blocking_ast. */
 		config_log_get(cld);
 		rc = ldlm_lock_set_data(&lockh, (void *)cld);
+		LASSERT(!lustre_handle_is_used(&cld->cld_lockh));
 		LASSERT(rc == 0);
+		cld->cld_lockh = lockh;
 	} else {
 		CDEBUG(D_MGC, "Can't get cfg lock: %d\n", rcl);
+		cld->cld_lockh.cookie = 0;
 
 		if (rcl == -ESHUTDOWN &&
 		    atomic_read(&mgc->u.cli.cl_mgc_refcount) > 0 && !retry) {
@@ -1673,9 +1689,6 @@ int mgc_process_log(struct obd_device *mgc, struct config_llog_data *cld)
 				CERROR("%s: recover log %s failed: rc = %d not fatal.\n",
 				       mgc->obd_name, cld->cld_logname, rc);
 				rc = 0;
-				spin_lock(&config_list_lock);
-				cld->cld_lostlock = 1;
-				spin_unlock(&config_list_lock);
 			}
 		}
 	} else {
@@ -1685,12 +1698,12 @@ int mgc_process_log(struct obd_device *mgc, struct config_llog_data *cld)
 	CDEBUG(D_MGC, "%s: configuration from log '%s' %sed (%d).\n",
 	       mgc->obd_name, cld->cld_logname, rc ? "fail" : "succeed", rc);
 
-	mutex_unlock(&cld->cld_lock);
-
 	/* Now drop the lock so MGS can revoke it */
 	if (!rcl)
 		ldlm_lock_decref(&lockh, LCK_CR);
 
+	mutex_unlock(&cld->cld_lock);
+
 	return rc;
 }
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 547/622] lnet: check if current->nsproxy is NULL before using
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (545 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 546/622] lustre: mgc: config lock leak James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 548/622] lustre: ptlrpc: always reset generation for idle reconnect James Simmons
                   ` (75 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Sonia Sharma <sharmaso@whamcloud.com>

A crash is seen at a few sites in the function
rdma_create_id(current->nsproxy->net_ns, cb, dev, ps, qpt).
The issue lies in the first parameter of this
function, current->nsproxy->net_ns. There is a
possibility that this value is NULL, resulting in a
"kernel NULL pointer dereference" crash.

Handle the case of NULL value gracefully by adding
a check and using init_net if current or
current->nsproxy is NULL.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11385
Lustre-commit: ef1783e282f6 ("LU-11385 lnet: check if current->nsproxy is NULL before using")
Signed-off-by: Sonia Sharma <sharmaso@whamcloud.com>
Signed-off-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/34577
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/o2iblnd/o2iblnd.h | 6 +++---
 net/lnet/lnet/acceptor.c         | 7 ++++---
 net/lnet/lnet/config.c           | 9 ++++++---
 net/lnet/lnet/lib-move.c         | 4 ++--
 4 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.h b/net/lnet/klnds/o2iblnd/o2iblnd.h
index ac91757..2169fdd 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.h
@@ -108,9 +108,9 @@ struct kib_tunables {
 	 min((t)->lnd_peercredits_hiw,				\
 	     (u32)(conn)->ibc_queue_depth - 1))
 
-# define kiblnd_rdma_create_id(ns, cb, dev, ps, qpt) rdma_create_id(ns, cb, \
-								    dev, ps, \
-								    qpt)
+# define kiblnd_rdma_create_id(ns, cb, dev, ps, qpt) \
+	 rdma_create_id((ns) ? (ns) : &init_net, cb, dev, ps, qpt)
+
 /* 2 OOB shall suffice for 1 keepalive and 1 returning credits */
 #define IBLND_OOB_CAPABLE(v)	((v) != IBLND_MSG_VERSION_1)
 #define IBLND_OOB_MSGS(v)	(IBLND_OOB_CAPABLE(v) ? 2 : 0)
diff --git a/net/lnet/lnet/acceptor.c b/net/lnet/lnet/acceptor.c
index 23b5bf0..acd1d75 100644
--- a/net/lnet/lnet/acceptor.c
+++ b/net/lnet/lnet/acceptor.c
@@ -458,14 +458,15 @@
 
 	if (!lnet_count_acceptor_nets())  /* not required */
 		return 0;
-
-	lnet_acceptor_state.pta_ns = current->nsproxy->net_ns;
+	if (current->nsproxy && current->nsproxy->net_ns)
+		lnet_acceptor_state.pta_ns = current->nsproxy->net_ns;
+	else
+		lnet_acceptor_state.pta_ns = &init_net;
 	task = kthread_run(lnet_acceptor, (void *)(uintptr_t)secure,
 			   "acceptor_%03ld", secure);
 	if (IS_ERR(task)) {
 		rc2 = PTR_ERR(task);
 		CERROR("Can't start acceptor thread: %ld\n", rc2);
-
 		return -ESRCH;
 	}
 
diff --git a/net/lnet/lnet/config.c b/net/lnet/lnet/config.c
index 2c8edcd..f521b0b 100644
--- a/net/lnet/lnet/config.c
+++ b/net/lnet/lnet/config.c
@@ -464,10 +464,10 @@ struct lnet_net *
 	ni->ni_nid = LNET_MKNID(net->net_id, 0);
 
 	/* Store net namespace in which current ni is being created */
-	if (current->nsproxy->net_ns)
+	if (current->nsproxy && current->nsproxy->net_ns)
 		ni->ni_net_ns = get_net(current->nsproxy->net_ns);
 	else
-		ni->ni_net_ns = NULL;
+		ni->ni_net_ns = get_net(&init_net);
 
 	ni->ni_state = LNET_NI_STATE_INIT;
 	list_add_tail(&ni->ni_netlist, &net->net_ni_added);
@@ -1642,7 +1642,10 @@ int lnet_inet_enumerate(struct lnet_inetdev **dev_list, struct net *ns)
 	int rc;
 	int i;
 
-	nip = lnet_inet_enumerate(&ifaces, current->nsproxy->net_ns);
+	if (current->nsproxy && current->nsproxy->net_ns)
+		nip = lnet_inet_enumerate(&ifaces, current->nsproxy->net_ns);
+	else
+		nip = lnet_inet_enumerate(&ifaces, &init_net);
 	if (nip < 0) {
 		if (nip != -ENOENT) {
 			LCONSOLE_ERROR_MSG(0x117,
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index b8278ad..ca0009c 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -4826,9 +4826,9 @@ struct lnet_msg *
 			 * If not, assign order above 0xffff0000,
 			 * to make this ni not a priority.
 			 */
-			if (!net_eq(ni->ni_net_ns, current->nsproxy->net_ns))
+			if (current->nsproxy &&
+			    !net_eq(ni->ni_net_ns, current->nsproxy->net_ns))
 				order += 0xffff0000;
-
 			if (srcnidp)
 				*srcnidp = ni->ni_nid;
 			if (orderp)
-- 
1.8.3.1


* [lustre-devel] [PATCH 548/622] lustre: ptlrpc: always reset generation for idle reconnect
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (546 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 547/622] lnet: check if current->nsproxy is NULL before using James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 549/622] lustre: obdclass: Allow read-ahead for write requests James Simmons
                   ` (74 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

Idle reconnection is a common case and reconnections are mostly
quick, so always reset the generation in this case; otherwise
applications could fail just because of the idle reconnection
feature.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12378
Lustre-commit: 94fbe511ba96 ("LU-12378 ptlrpc: always reset generation for idle reconnect")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/35052
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/import.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index 813d3c8..028dd65 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -1674,7 +1674,8 @@ static void ptlrpc_reset_reqs_generation(struct obd_import *imp)
 			rq_list) {
 		spin_lock(&old->rq_lock);
 		if (old->rq_import_generation == imp->imp_generation - 1 &&
-		    !old->rq_no_resend)
+		    ((imp->imp_initiated_at == imp->imp_generation) ||
+		     !old->rq_no_resend))
 			old->rq_import_generation = imp->imp_generation;
 		spin_unlock(&old->rq_lock);
 	}
-- 
1.8.3.1


* [lustre-devel] [PATCH 549/622] lustre: obdclass: Allow read-ahead for write requests
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (547 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 548/622] lustre: ptlrpc: always reset generation for idle reconnect James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 550/622] lustre: ldlm: separate buckets from ldlm hash table James Simmons
                   ` (73 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

cl_io_read_ahead asserts that read-ahead can only happen
due to CIT_READ or CIT_FAULT requests.
Since LU-9618, we expect CIT_WRITE requests to also
sometimes trigger read-ahead.
So the LINVRNT() needs to be extended to acknowledge
that.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12718
Lustre-commit: 514bd936d061 ("LU-12718 obdclass: Allow read-ahead for write requests")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/36000
Reviewed-by: Shilong Wang <wshilong@ddn.com>
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/cl_io.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/obdclass/cl_io.c b/fs/lustre/obdclass/cl_io.c
index 14849ed..3bc9097 100644
--- a/fs/lustre/obdclass/cl_io.c
+++ b/fs/lustre/obdclass/cl_io.c
@@ -554,7 +554,9 @@ int cl_io_read_ahead(const struct lu_env *env, struct cl_io *io,
 	const struct cl_io_slice *scan;
 	int result = 0;
 
-	LINVRNT(io->ci_type == CIT_READ || io->ci_type == CIT_FAULT);
+	LINVRNT(io->ci_type == CIT_READ ||
+		io->ci_type == CIT_FAULT ||
+		io->ci_type == CIT_WRITE);
 	LINVRNT(cl_io_invariant(io));
 
 	list_for_each_entry(scan, &io->ci_layers, cis_linkage) {
-- 
1.8.3.1


* [lustre-devel] [PATCH 550/622] lustre: ldlm: separate buckets from ldlm hash table
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (548 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 549/622] lustre: obdclass: Allow read-ahead for write requests James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:16 ` [lustre-devel] [PATCH 551/622] lustre: llite: don't cache MDS_OPEN_LOCK for volatile files James Simmons
                   ` (72 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

From: NeilBrown <neilb@suse.com>

ldlm maintains a per-namespace hashtable of resources.
With these hash tables it stores per-bucket 'struct adaptive_timeout'
structures.

Presumably having a single struct for the whole table results in
too much contention, while having one per resource results in very
little adaptation.

A future patch will change ldlm to use rhashtable which does not
support per-bucket data, so we need to manage the data separately.

There is no need for the multiple adaptive_timeout to align with the
hash chains, and trying to do this has resulted in a rather complex
hash function.
The purpose of ldlm_res_hop_fid_hash() appears to be to keep
resources with the same fid in the same hash bucket, so they use
the same adaptive timeout.  However it fails at doing this
because it puts the fid-specific bits in the wrong part of the hash.
If that is not the purpose, then I can see no point to the
complexity.

This patch creates a completely separate array of adaptive timeouts
(and other less interesting data) and uses a hash of the fid to index
that, meaning that a simple hash can be used for the hash table.

In the previous code, two namespaces use the same value for
nsd_all_bits and nsd_bkt_bits. This results in zero bits being
used to choose a bucket, so there is only one bucket.
This looks odd and would confuse hash_32(), so I've adjusted the
numbers so there is always at least 1 bit (2 buckets).

WC-bug-id: https://jira.whamcloud.com/browse/LU-8130
Lustre-commit: d234e2cf5f55 ("LU-8130 ldlm: separate buckets from ldlm hash table")
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/36218
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_dlm.h |  2 ++
 fs/lustre/ldlm/ldlm_resource.c | 56 ++++++++++++++++++------------------------
 2 files changed, 26 insertions(+), 32 deletions(-)

diff --git a/fs/lustre/include/lustre_dlm.h b/fs/lustre/include/lustre_dlm.h
index 31d360e..cc4b8b0 100644
--- a/fs/lustre/include/lustre_dlm.h
+++ b/fs/lustre/include/lustre_dlm.h
@@ -364,6 +364,8 @@ struct ldlm_namespace {
 
 	/** Resource hash table for namespace. */
 	struct cfs_hash		*ns_rs_hash;
+	struct ldlm_ns_bucket	*ns_rs_buckets;
+	unsigned int		ns_bucket_bits;
 
 	/** serialize */
 	spinlock_t		ns_lock;
diff --git a/fs/lustre/ldlm/ldlm_resource.c b/fs/lustre/ldlm/ldlm_resource.c
index 14e03bc..65ff32c 100644
--- a/fs/lustre/ldlm/ldlm_resource.c
+++ b/fs/lustre/ldlm/ldlm_resource.c
@@ -452,10 +452,9 @@ static unsigned int ldlm_res_hop_hash(struct cfs_hash *hs,
 	return val & mask;
 }
 
-static unsigned int ldlm_res_hop_fid_hash(struct cfs_hash *hs,
-					  const void *key, unsigned int mask)
+static unsigned int ldlm_res_hop_fid_hash(const struct ldlm_res_id *id,
+					  unsigned int bits)
 {
-	const struct ldlm_res_id *id = key;
 	struct lu_fid fid;
 	u32 hash;
 	u32 val;
@@ -468,18 +467,11 @@ static unsigned int ldlm_res_hop_fid_hash(struct cfs_hash *hs,
 	hash += (hash >> 4) + (hash << 12); /* mixing oid and seq */
 	if (id->name[LUSTRE_RES_ID_HSH_OFF] != 0) {
 		val = id->name[LUSTRE_RES_ID_HSH_OFF];
-		hash += (val >> 5) + (val << 11);
 	} else {
 		val = fid_oid(&fid);
 	}
-	hash = hash_long(hash, hs->hs_bkt_bits);
-	/* give me another random factor */
-	hash -= hash_long((unsigned long)hs, val % 11 + 3);
-
-	hash <<= hs->hs_cur_bits - hs->hs_bkt_bits;
-	hash |= ldlm_res_hop_hash(hs, key, CFS_HASH_NBKT(hs) - 1);
-
-	return hash & mask;
+	hash += (val >> 5) + (val << 11);
+	return hash_32(hash, bits);
 }
 
 static void *ldlm_res_hop_key(struct hlist_node *hnode)
@@ -531,16 +523,6 @@ static void ldlm_res_hop_put(struct cfs_hash *hs, struct hlist_node *hnode)
 	.hs_put		= ldlm_res_hop_put
 };
 
-static struct cfs_hash_ops ldlm_ns_fid_hash_ops = {
-	.hs_hash	= ldlm_res_hop_fid_hash,
-	.hs_key		= ldlm_res_hop_key,
-	.hs_keycmp      = ldlm_res_hop_keycmp,
-	.hs_keycpy      = NULL,
-	.hs_object      = ldlm_res_hop_object,
-	.hs_get		= ldlm_res_hop_get_locked,
-	.hs_put		= ldlm_res_hop_put
-};
-
 struct ldlm_ns_hash_def {
 	enum ldlm_ns_type	nsd_type;
 	/** hash bucket bits */
@@ -556,13 +538,13 @@ struct ldlm_ns_hash_def {
 		.nsd_type       = LDLM_NS_TYPE_MDC,
 		.nsd_bkt_bits   = 11,
 		.nsd_all_bits   = 16,
-		.nsd_hops       = &ldlm_ns_fid_hash_ops,
+		.nsd_hops       = &ldlm_ns_hash_ops,
 	},
 	{
 		.nsd_type       = LDLM_NS_TYPE_MDT,
 		.nsd_bkt_bits   = 14,
 		.nsd_all_bits   = 21,
-		.nsd_hops       = &ldlm_ns_fid_hash_ops,
+		.nsd_hops       = &ldlm_ns_hash_ops,
 	},
 	{
 		.nsd_type       = LDLM_NS_TYPE_OSC,
@@ -578,13 +560,13 @@ struct ldlm_ns_hash_def {
 	},
 	{
 		.nsd_type       = LDLM_NS_TYPE_MGC,
-		.nsd_bkt_bits   = 4,
+		.nsd_bkt_bits   = 3,
 		.nsd_all_bits   = 4,
 		.nsd_hops       = &ldlm_ns_hash_ops,
 	},
 	{
 		.nsd_type       = LDLM_NS_TYPE_MGT,
-		.nsd_bkt_bits   = 4,
+		.nsd_bkt_bits   = 3,
 		.nsd_all_bits   = 4,
 		.nsd_hops       = &ldlm_ns_hash_ops,
 	},
@@ -613,9 +595,7 @@ struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
 					  enum ldlm_ns_type ns_type)
 {
 	struct ldlm_namespace *ns = NULL;
-	struct ldlm_ns_bucket *nsb;
 	struct ldlm_ns_hash_def *nsd;
-	struct cfs_hash_bd bd;
 	int idx;
 	int rc;
 
@@ -644,7 +624,7 @@ struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
 
 	ns->ns_rs_hash = cfs_hash_create(name,
 					 nsd->nsd_all_bits, nsd->nsd_all_bits,
-					 nsd->nsd_bkt_bits, sizeof(*nsb),
+					 nsd->nsd_bkt_bits, 0,
 					 CFS_HASH_MIN_THETA,
 					 CFS_HASH_MAX_THETA,
 					 nsd->nsd_hops,
@@ -655,8 +635,16 @@ struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
 	if (!ns->ns_rs_hash)
 		goto out_ns;
 
-	cfs_hash_for_each_bucket(ns->ns_rs_hash, &bd, idx) {
-		nsb = cfs_hash_bd_extra_get(ns->ns_rs_hash, &bd);
+	ns->ns_bucket_bits = nsd->nsd_all_bits - nsd->nsd_bkt_bits;
+	ns->ns_rs_buckets = kvmalloc(BIT(ns->ns_bucket_bits) *
+				     sizeof(ns->ns_rs_buckets[0]),
+				     GFP_KERNEL);
+	if (!ns->ns_rs_buckets)
+		goto out_hash;
+
+	for (idx = 0; idx < (1 << ns->ns_bucket_bits); idx++) {
+		struct ldlm_ns_bucket *nsb = &ns->ns_rs_buckets[idx];
+
 		at_init(&nsb->nsb_at_estimate, ldlm_enqueue_min, 0);
 		nsb->nsb_namespace = ns;
 	}
@@ -711,6 +699,7 @@ struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
 	ldlm_namespace_sysfs_unregister(ns);
 	ldlm_namespace_cleanup(ns, 0);
 out_hash:
+	kvfree(ns->ns_rs_buckets);
 	kfree(ns->ns_name);
 	cfs_hash_putref(ns->ns_rs_hash);
 out_ns:
@@ -973,6 +962,7 @@ void ldlm_namespace_free_post(struct ldlm_namespace *ns)
 	ldlm_namespace_debugfs_unregister(ns);
 	ldlm_namespace_sysfs_unregister(ns);
 	cfs_hash_putref(ns->ns_rs_hash);
+	kvfree(ns->ns_rs_buckets);
 	kfree(ns->ns_name);
 	/* Namespace @ns should be not on list at this time, otherwise
 	 * this will cause issues related to using freed @ns in poold
@@ -1087,6 +1077,7 @@ struct ldlm_resource *
 	struct cfs_hash_bd bd;
 	u64 version;
 	int ns_refcount = 0;
+	int hash;
 
 	LASSERT(!parent);
 	LASSERT(ns->ns_rs_hash);
@@ -1111,7 +1102,8 @@ struct ldlm_resource *
 	if (!res)
 		return ERR_PTR(-ENOMEM);
 
-	res->lr_ns_bucket = cfs_hash_bd_extra_get(ns->ns_rs_hash, &bd);
+	hash = ldlm_res_hop_fid_hash(name, ns->ns_bucket_bits);
+	res->lr_ns_bucket = &ns->ns_rs_buckets[hash];
 	res->lr_name = *name;
 	res->lr_type = type;
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 551/622] lustre: llite: don't cache MDS_OPEN_LOCK for volatile files
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (549 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 550/622] lustre: ldlm: separate buckets from ldlm hash table James Simmons
@ 2020-02-27 21:16 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 552/622] lnet: discard lnd_refcount James Simmons
                   ` (71 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:16 UTC (permalink / raw)
  To: lustre-devel

The kernel's knfsd constantly opens and closes files for each
access, which can result in a continuous stream of open+close RPCs
being sent to the MDS. To avoid this, Lustre created a special
flag, lld_nfs_dentry, which enables caching of the MDS_OPEN_LOCK
on the client. The fhandles API also uses the same exportfs layer
as NFS, which indirectly ends up caching the MDS_OPEN_LOCK as well.
This is okay for normal files, but not for Lustre's special volatile
files that are used for HSM restore. It is expected that after the
last close of a Lustre volatile file it is no longer accessible.
To keep this behavior, don't cache MDS_OPEN_LOCK for
volatile files.

WC-bug-id: https://jira.whamcloud.com/browse/LU-8585
Lustre-commit: 6a3a842add0e ("LU-8585 llite: don't cache MDS_OPEN_LOCK for volatile files")
Signed-off-by: James Simmons <jsimmons@infradead.org>
Reviewed-on: https://review.whamcloud.com/36641
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Quentin Bouget <quentin.bouget@cea.fr>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index d196da8..a3c36a7 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -798,6 +798,7 @@ int ll_file_open(struct inode *inode, struct file *file)
 	} else {
 		LASSERT(*och_usecount == 0);
 		if (!it->it_disposition) {
+			struct dentry *dentry = file_dentry(file);
 			struct ll_dentry_data *ldd;
 
 			/* We cannot just request lock handle now, new ELC code
@@ -822,10 +823,13 @@ int ll_file_open(struct inode *inode, struct file *file)
 			 * lookup path only, since ll_iget_for_nfs always calls
 			 * ll_d_init().
 			 */
-			ldd = ll_d2d(file->f_path.dentry);
+			ldd = ll_d2d(dentry);
 			if (ldd && ldd->lld_nfs_dentry) {
 				ldd->lld_nfs_dentry = 0;
-				it->it_flags |= MDS_OPEN_LOCK;
+				if (!filename_is_volatile(dentry->d_name.name,
+							  dentry->d_name.len,
+							  NULL))
+					it->it_flags |= MDS_OPEN_LOCK;
 			}
 
 			/*
@@ -833,8 +837,7 @@ int ll_file_open(struct inode *inode, struct file *file)
 			 * to get file with different fid.
 			 */
 			it->it_flags |= MDS_OPEN_BY_FID;
-			rc = ll_intent_file_open(file->f_path.dentry,
-						 NULL, 0, it);
+			rc = ll_intent_file_open(dentry, NULL, 0, it);
 			if (rc)
 				goto out_openerr;
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 552/622] lnet: discard lnd_refcount
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (550 preceding siblings ...)
  2020-02-27 21:16 ` [lustre-devel] [PATCH 551/622] lustre: llite: don't cache MDS_OPEN_LOCK for volatile files James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 553/622] lnet: socklnd: rename struct ksock_peer to struct ksock_peer_ni James Simmons
                   ` (70 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

The lnd_refcount in 'struct lnet_lnd' is never tested (except
in an ASSERT()), so it cannot be needed.  Let's remove it.

Each individual lnd keeps track of how many lnet_ni are
registered for that lnd, e.g. ksocklnd has a counter in ksnd_nnets
and o2iblnd has a linked list in kib_devs.
They hold a reference on the module while there are registered
devices, and the lnd is only freed (and the lnd_refcount checked)
when the module is unloaded.  This confirms that lnd_refcount
adds no value.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12678
Lustre-commit: 606299929509 ("LU-12678 lnet: discard lnd_refcount")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/36829
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-types.h |  1 -
 net/lnet/lnet/api-ni.c         | 18 ------------------
 2 files changed, 19 deletions(-)

diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index 4b110eb..e105308 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -246,7 +246,6 @@ struct lnet_test_peer {
 struct lnet_lnd {
 	/* fields managed by portals */
 	struct list_head	lnd_list;	/* stash in the LND table */
-	int			lnd_refcount;	/* # active instances */
 
 	/* fields initialised by the LND */
 	u32			lnd_type;
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index e66d9dc7..6c913b5 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -758,7 +758,6 @@ static void lnet_assert_wire_constants(void)
 	LASSERT(!lnet_find_lnd_by_type(lnd->lnd_type));
 
 	list_add_tail(&lnd->lnd_list, &the_lnet.ln_lnds);
-	lnd->lnd_refcount = 0;
 
 	CDEBUG(D_NET, "%s LND registered\n", libcfs_lnd2str(lnd->lnd_type));
 
@@ -772,7 +771,6 @@ static void lnet_assert_wire_constants(void)
 	mutex_lock(&the_lnet.ln_lnd_mutex);
 
 	LASSERT(lnet_find_lnd_by_type(lnd->lnd_type) == lnd);
-	LASSERT(!lnd->lnd_refcount);
 
 	list_del(&lnd->lnd_list);
 	CDEBUG(D_NET, "%s LND unregistered\n", libcfs_lnd2str(lnd->lnd_type));
@@ -2045,15 +2043,6 @@ static void lnet_push_target_fini(void)
 	/* Do peer table cleanup for this net */
 	lnet_peer_tables_cleanup(net);
 
-	lnet_net_lock(LNET_LOCK_EX);
-	/*
-	 * decrement ref count on lnd only when the entire network goes
-	 * away
-	 */
-	net->net_lnd->lnd_refcount--;
-
-	lnet_net_unlock(LNET_LOCK_EX);
-
 	lnet_net_free(net);
 }
 
@@ -2134,9 +2123,6 @@ static void lnet_push_target_fini(void)
 	if (rc) {
 		LCONSOLE_ERROR_MSG(0x105, "Error %d starting up LNI %s\n",
 				   rc, libcfs_lnd2str(net->net_lnd->lnd_type));
-		lnet_net_lock(LNET_LOCK_EX);
-		net->net_lnd->lnd_refcount--;
-		lnet_net_unlock(LNET_LOCK_EX);
 		goto failed0;
 	}
 
@@ -2247,10 +2233,6 @@ static void lnet_push_target_fini(void)
 			}
 		}
 
-		lnet_net_lock(LNET_LOCK_EX);
-		lnd->lnd_refcount++;
-		lnet_net_unlock(LNET_LOCK_EX);
-
 		net->net_lnd = lnd;
 
 		mutex_unlock(&the_lnet.ln_lnd_mutex);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 553/622] lnet: socklnd: rename struct ksock_peer to struct ksock_peer_ni
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (551 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 552/622] lnet: discard lnd_refcount James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 554/622] lnet: change ksocknal_create_peer() to return pointer James Simmons
                   ` (69 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

In the OpenSFS tree, when typedefs were removed from the socklnd
driver, all ksock peers were renamed to struct ksock_peer_ni.
This didn't happen for the Linux client, so let's bring both
trees in sync.

WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Lustre-commit: 93090d9b8250 ("LU-6142 socklnd: remove typedefs from ksocklnd")
Signed-off-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-on: https://review.whamcloud.com/28275
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/socklnd/socklnd.c       | 78 +++++++++++++++++-----------------
 net/lnet/klnds/socklnd/socklnd.h       | 38 ++++++++---------
 net/lnet/klnds/socklnd/socklnd_cb.c    | 24 +++++------
 net/lnet/klnds/socklnd/socklnd_proto.c |  4 +-
 4 files changed, 72 insertions(+), 72 deletions(-)

diff --git a/net/lnet/klnds/socklnd/socklnd.c b/net/lnet/klnds/socklnd/socklnd.c
index e2a9819..79068f3 100644
--- a/net/lnet/klnds/socklnd/socklnd.c
+++ b/net/lnet/klnds/socklnd/socklnd.c
@@ -99,12 +99,12 @@
 }
 
 static int
-ksocknal_create_peer(struct ksock_peer **peerp, struct lnet_ni *ni,
+ksocknal_create_peer(struct ksock_peer_ni **peerp, struct lnet_ni *ni,
 		     struct lnet_process_id id)
 {
 	int cpt = lnet_cpt_of_nid(id.nid, ni);
 	struct ksock_net *net = ni->ni_data;
-	struct ksock_peer *peer_ni;
+	struct ksock_peer_ni *peer_ni;
 
 	LASSERT(id.nid != LNET_NID_ANY);
 	LASSERT(id.pid != LNET_PID_ANY);
@@ -148,7 +148,7 @@
 }
 
 void
-ksocknal_destroy_peer(struct ksock_peer *peer_ni)
+ksocknal_destroy_peer(struct ksock_peer_ni *peer_ni)
 {
 	struct ksock_net *net = peer_ni->ksnp_ni->ni_data;
 
@@ -175,11 +175,11 @@
 	spin_unlock_bh(&net->ksnn_lock);
 }
 
-struct ksock_peer *
+struct ksock_peer_ni *
 ksocknal_find_peer_locked(struct lnet_ni *ni, struct lnet_process_id id)
 {
 	struct list_head *peer_list = ksocknal_nid2peerlist(id.nid);
-	struct ksock_peer *peer_ni;
+	struct ksock_peer_ni *peer_ni;
 
 	list_for_each_entry(peer_ni, peer_list, ksnp_list) {
 		LASSERT(!peer_ni->ksnp_closing);
@@ -199,10 +199,10 @@ struct ksock_peer *
 	return NULL;
 }
 
-struct ksock_peer *
+struct ksock_peer_ni *
 ksocknal_find_peer(struct lnet_ni *ni, struct lnet_process_id id)
 {
-	struct ksock_peer *peer_ni;
+	struct ksock_peer_ni *peer_ni;
 
 	read_lock(&ksocknal_data.ksnd_global_lock);
 	peer_ni = ksocknal_find_peer_locked(ni, id);
@@ -214,7 +214,7 @@ struct ksock_peer *
 }
 
 static void
-ksocknal_unlink_peer_locked(struct ksock_peer *peer_ni)
+ksocknal_unlink_peer_locked(struct ksock_peer_ni *peer_ni)
 {
 	int i;
 	u32 ip;
@@ -250,7 +250,7 @@ struct ksock_peer *
 		       struct lnet_process_id *id, u32 *myip, u32 *peer_ip,
 		       int *port, int *conn_count, int *share_count)
 {
-	struct ksock_peer *peer_ni;
+	struct ksock_peer_ni *peer_ni;
 	struct ksock_route *route;
 	int i;
 	int j;
@@ -318,7 +318,7 @@ struct ksock_peer *
 ksocknal_associate_route_conn_locked(struct ksock_route *route,
 				     struct ksock_conn *conn)
 {
-	struct ksock_peer *peer_ni = route->ksnr_peer;
+	struct ksock_peer_ni *peer_ni = route->ksnr_peer;
 	int type = conn->ksnc_type;
 	struct ksock_interface *iface;
 
@@ -362,7 +362,7 @@ struct ksock_peer *
 }
 
 static void
-ksocknal_add_route_locked(struct ksock_peer *peer_ni, struct ksock_route *route)
+ksocknal_add_route_locked(struct ksock_peer_ni *peer_ni, struct ksock_route *route)
 {
 	struct ksock_conn *conn;
 	struct ksock_route *route2;
@@ -400,7 +400,7 @@ struct ksock_peer *
 static void
 ksocknal_del_route_locked(struct ksock_route *route)
 {
-	struct ksock_peer *peer_ni = route->ksnr_peer;
+	struct ksock_peer_ni *peer_ni = route->ksnr_peer;
 	struct ksock_interface *iface;
 	struct ksock_conn *conn;
 	struct list_head *ctmp;
@@ -443,8 +443,8 @@ struct ksock_peer *
 ksocknal_add_peer(struct lnet_ni *ni, struct lnet_process_id id, u32 ipaddr,
 		  int port)
 {
-	struct ksock_peer *peer_ni;
-	struct ksock_peer *peer2;
+	struct ksock_peer_ni *peer_ni;
+	struct ksock_peer_ni *peer2;
 	struct ksock_route *route;
 	struct ksock_route *route2;
 	int rc;
@@ -497,7 +497,7 @@ struct ksock_peer *
 }
 
 static void
-ksocknal_del_peer_locked(struct ksock_peer *peer_ni, u32 ip)
+ksocknal_del_peer_locked(struct ksock_peer_ni *peer_ni, u32 ip)
 {
 	struct ksock_conn *conn;
 	struct ksock_route *route;
@@ -556,8 +556,8 @@ struct ksock_peer *
 ksocknal_del_peer(struct lnet_ni *ni, struct lnet_process_id id, u32 ip)
 {
 	LIST_HEAD(zombies);
-	struct ksock_peer *pnxt;
-	struct ksock_peer *peer_ni;
+	struct ksock_peer_ni *pnxt;
+	struct ksock_peer_ni *peer_ni;
 	int lo;
 	int hi;
 	int i;
@@ -615,7 +615,7 @@ struct ksock_peer *
 static struct ksock_conn *
 ksocknal_get_conn_by_idx(struct lnet_ni *ni, int index)
 {
-	struct ksock_peer *peer_ni;
+	struct ksock_peer_ni *peer_ni;
 	struct ksock_conn *conn;
 	int i;
 
@@ -729,7 +729,7 @@ struct ksock_peer *
 }
 
 static int
-ksocknal_select_ips(struct ksock_peer *peer_ni, u32 *peerips, int n_peerips)
+ksocknal_select_ips(struct ksock_peer_ni *peer_ni, u32 *peerips, int n_peerips)
 {
 	rwlock_t *global_lock = &ksocknal_data.ksnd_global_lock;
 	struct ksock_net *net = peer_ni->ksnp_ni->ni_data;
@@ -844,7 +844,7 @@ struct ksock_peer *
 }
 
 static void
-ksocknal_create_routes(struct ksock_peer *peer_ni, int port,
+ksocknal_create_routes(struct ksock_peer_ni *peer_ni, int port,
 		       u32 *peer_ipaddrs, int npeer_ipaddrs)
 {
 	struct ksock_route *newroute = NULL;
@@ -984,7 +984,7 @@ struct ksock_peer *
 }
 
 static int
-ksocknal_connecting(struct ksock_peer *peer_ni, u32 ipaddr)
+ksocknal_connecting(struct ksock_peer_ni *peer_ni, u32 ipaddr)
 {
 	struct ksock_route *route;
 
@@ -1005,8 +1005,8 @@ struct ksock_peer *
 	u64 incarnation;
 	struct ksock_conn *conn;
 	struct ksock_conn *conn2;
-	struct ksock_peer *peer_ni = NULL;
-	struct ksock_peer *peer2;
+	struct ksock_peer_ni *peer_ni = NULL;
+	struct ksock_peer_ni *peer2;
 	struct ksock_sched *sched;
 	struct ksock_hello_msg *hello;
 	int cpt;
@@ -1422,7 +1422,7 @@ struct ksock_peer *
 	 * connection for the reaper to terminate.
 	 * Caller holds ksnd_global_lock exclusively in irq context
 	 */
-	struct ksock_peer *peer_ni = conn->ksnc_peer;
+	struct ksock_peer_ni *peer_ni = conn->ksnc_peer;
 	struct ksock_route *route;
 	struct ksock_conn *conn2;
 
@@ -1495,7 +1495,7 @@ struct ksock_peer *
 }
 
 void
-ksocknal_peer_failed(struct ksock_peer *peer_ni)
+ksocknal_peer_failed(struct ksock_peer_ni *peer_ni)
 {
 	int notify = 0;
 	time64_t last_alive = 0;
@@ -1525,7 +1525,7 @@ struct ksock_peer *
 void
 ksocknal_finalize_zcreq(struct ksock_conn *conn)
 {
-	struct ksock_peer *peer_ni = conn->ksnc_peer;
+	struct ksock_peer_ni *peer_ni = conn->ksnc_peer;
 	struct ksock_tx *tx;
 	struct ksock_tx *tmp;
 	LIST_HEAD(zlist);
@@ -1569,7 +1569,7 @@ struct ksock_peer *
 	 * ksnc_refcount will eventually hit zero, and then the reaper will
 	 * destroy it.
 	 */
-	struct ksock_peer *peer_ni = conn->ksnc_peer;
+	struct ksock_peer_ni *peer_ni = conn->ksnc_peer;
 	struct ksock_sched *sched = conn->ksnc_scheduler;
 	int failed = 0;
 
@@ -1703,7 +1703,7 @@ struct ksock_peer *
 }
 
 int
-ksocknal_close_peer_conns_locked(struct ksock_peer *peer_ni,
+ksocknal_close_peer_conns_locked(struct ksock_peer_ni *peer_ni,
 				 u32 ipaddr, int why)
 {
 	struct ksock_conn *conn;
@@ -1726,7 +1726,7 @@ struct ksock_peer *
 int
 ksocknal_close_conn_and_siblings(struct ksock_conn *conn, int why)
 {
-	struct ksock_peer *peer_ni = conn->ksnc_peer;
+	struct ksock_peer_ni *peer_ni = conn->ksnc_peer;
 	u32 ipaddr = conn->ksnc_ipaddr;
 	int count;
 
@@ -1742,8 +1742,8 @@ struct ksock_peer *
 int
 ksocknal_close_matching_conns(struct lnet_process_id id, u32 ipaddr)
 {
-	struct ksock_peer *peer_ni;
-	struct ksock_peer *pnxt;
+	struct ksock_peer_ni *peer_ni;
+	struct ksock_peer_ni *pnxt;
 	int lo;
 	int hi;
 	int i;
@@ -1816,7 +1816,7 @@ struct ksock_peer *
 	int connect = 1;
 	time64_t last_alive = 0;
 	time64_t now = ktime_get_seconds();
-	struct ksock_peer *peer_ni = NULL;
+	struct ksock_peer_ni *peer_ni = NULL;
 	rwlock_t *glock = &ksocknal_data.ksnd_global_lock;
 	struct lnet_process_id id = {
 		.nid = nid,
@@ -1872,7 +1872,7 @@ struct ksock_peer *
 }
 
 static void
-ksocknal_push_peer(struct ksock_peer *peer_ni)
+ksocknal_push_peer(struct ksock_peer_ni *peer_ni)
 {
 	int index;
 	int i;
@@ -1921,7 +1921,7 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 		int peer_off; /* searching offset in peer_ni hash table */
 
 		for (peer_off = 0; ; peer_off++) {
-			struct ksock_peer *peer_ni;
+			struct ksock_peer_ni *peer_ni;
 			int i = 0;
 
 			read_lock(&ksocknal_data.ksnd_global_lock);
@@ -1958,7 +1958,7 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 	int rc;
 	int i;
 	int j;
-	struct ksock_peer *peer_ni;
+	struct ksock_peer_ni *peer_ni;
 	struct ksock_route *route;
 
 	if (!ipaddress || !netmask)
@@ -2014,7 +2014,7 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 }
 
 static void
-ksocknal_peer_del_interface_locked(struct ksock_peer *peer_ni, u32 ipaddr)
+ksocknal_peer_del_interface_locked(struct ksock_peer_ni *peer_ni, u32 ipaddr)
 {
 	struct list_head *tmp;
 	struct list_head *nxt;
@@ -2059,8 +2059,8 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 {
 	struct ksock_net *net = ni->ni_data;
 	int rc = -ENOENT;
-	struct ksock_peer *nxt;
-	struct ksock_peer *peer_ni;
+	struct ksock_peer_ni *nxt;
+	struct ksock_peer_ni *peer_ni;
 	u32 this_ip;
 	int i;
 	int j;
@@ -2457,7 +2457,7 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 static void
 ksocknal_debug_peerhash(struct lnet_ni *ni)
 {
-	struct ksock_peer *peer_ni = NULL;
+	struct ksock_peer_ni *peer_ni = NULL;
 	int i;
 
 	read_lock(&ksocknal_data.ksnd_global_lock);
diff --git a/net/lnet/klnds/socklnd/socklnd.h b/net/lnet/klnds/socklnd/socklnd.h
index efdd02e..1e10663 100644
--- a/net/lnet/klnds/socklnd/socklnd.h
+++ b/net/lnet/klnds/socklnd/socklnd.h
@@ -262,7 +262,7 @@ struct ksock_nal_data {
  * what the header matched or whether the message needs forwarding.
  */
 struct ksock_conn;  /* forward ref */
-struct ksock_peer;  /* forward ref */
+struct ksock_peer_ni;  /* forward ref */
 struct ksock_route; /* forward ref */
 struct ksock_proto; /* forward ref */
 
@@ -311,7 +311,7 @@ struct ksock_tx {				/* transmit packet */
 #define SOCKNAL_RX_SLOP		6 /* skipping body */
 
 struct ksock_conn {
-	struct ksock_peer      *ksnc_peer;		/* owning peer_ni */
+	struct ksock_peer_ni   *ksnc_peer;		/* owning peer_ni */
 	struct ksock_route     *ksnc_route;		/* owning route */
 	struct list_head	ksnc_list;		/* stash on peer_ni's conn list */
 	struct socket	       *ksnc_sock;		/* actual socket */
@@ -383,7 +383,7 @@ struct ksock_conn {
 struct ksock_route {
 	struct list_head	ksnr_list;		/* chain on peer_ni route list */
 	struct list_head	ksnr_connd_list;	/* chain on ksnr_connd_routes */
-	struct ksock_peer      *ksnr_peer;		/* owning peer_ni */
+	struct ksock_peer_ni   *ksnr_peer;		/* owning peer_ni */
 	atomic_t		ksnr_refcount;		/* # users */
 	time64_t		ksnr_timeout;		/* when (in secs) reconnection
 							 * can happen next
@@ -408,7 +408,7 @@ struct ksock_route {
 
 #define SOCKNAL_KEEPALIVE_PING	1	/* cookie for keepalive ping */
 
-struct ksock_peer {
+struct ksock_peer_ni {
 	struct list_head	ksnp_list;		/* stash on global peer_ni list */
 	time64_t		ksnp_last_alive;	/* when (in seconds) I was last
 							 * alive
@@ -607,16 +607,16 @@ struct ksock_proto {
 }
 
 static inline void
-ksocknal_peer_addref(struct ksock_peer *peer_ni)
+ksocknal_peer_addref(struct ksock_peer_ni *peer_ni)
 {
 	LASSERT(atomic_read(&peer_ni->ksnp_refcount) > 0);
 	atomic_inc(&peer_ni->ksnp_refcount);
 }
 
-void ksocknal_destroy_peer(struct ksock_peer *peer_ni);
+void ksocknal_destroy_peer(struct ksock_peer_ni *peer_ni);
 
 static inline void
-ksocknal_peer_decref(struct ksock_peer *peer_ni)
+ksocknal_peer_decref(struct ksock_peer_ni *peer_ni)
 {
 	LASSERT(atomic_read(&peer_ni->ksnp_refcount) > 0);
 	if (atomic_dec_and_test(&peer_ni->ksnp_refcount))
@@ -633,21 +633,21 @@ int ksocknal_recv(struct lnet_ni *ni, void *private, struct lnet_msg *lntmsg,
 
 int ksocknal_add_peer(struct lnet_ni *ni, struct lnet_process_id id, u32 ip,
 		      int port);
-struct ksock_peer *ksocknal_find_peer_locked(struct lnet_ni *ni,
-					     struct lnet_process_id id);
-struct ksock_peer *ksocknal_find_peer(struct lnet_ni *ni,
-				      struct lnet_process_id id);
-void ksocknal_peer_failed(struct ksock_peer *peer_ni);
+struct ksock_peer_ni *ksocknal_find_peer_locked(struct lnet_ni *ni,
+					        struct lnet_process_id id);
+struct ksock_peer_ni *ksocknal_find_peer(struct lnet_ni *ni,
+				         struct lnet_process_id id);
+void ksocknal_peer_failed(struct ksock_peer_ni *peer_ni);
 int ksocknal_create_conn(struct lnet_ni *ni, struct ksock_route *route,
 			 struct socket *sock, int type);
 void ksocknal_close_conn_locked(struct ksock_conn *conn, int why);
 void ksocknal_terminate_conn(struct ksock_conn *conn);
 void ksocknal_destroy_conn(struct ksock_conn *conn);
-int ksocknal_close_peer_conns_locked(struct ksock_peer *peer_ni,
+int ksocknal_close_peer_conns_locked(struct ksock_peer_ni *peer_ni,
 				     u32 ipaddr, int why);
 int ksocknal_close_conn_and_siblings(struct ksock_conn *conn, int why);
 int ksocknal_close_matching_conns(struct lnet_process_id id, u32 ipaddr);
-struct ksock_conn *ksocknal_find_conn_locked(struct ksock_peer *peer_ni,
+struct ksock_conn *ksocknal_find_conn_locked(struct ksock_peer_ni *peer_ni,
 					     struct ksock_tx *tx, int nonblk);
 
 int ksocknal_launch_packet(struct lnet_ni *ni, struct ksock_tx *tx,
@@ -662,11 +662,11 @@ int ksocknal_launch_packet(struct lnet_ni *ni, struct ksock_tx *tx,
 void ksocknal_query(struct lnet_ni *ni, lnet_nid_t nid, time64_t *when);
 int ksocknal_thread_start(int (*fn)(void *arg), void *arg, char *name);
 void ksocknal_thread_fini(void);
-void ksocknal_launch_all_connections_locked(struct ksock_peer *peer_ni);
-struct ksock_route *ksocknal_find_connectable_route_locked(
-	struct ksock_peer *peer_ni);
-struct ksock_route *ksocknal_find_connecting_route_locked(
-	struct ksock_peer *peer_ni);
+void ksocknal_launch_all_connections_locked(struct ksock_peer_ni *peer_ni);
+struct ksock_route *
+ksocknal_find_connectable_route_locked(struct ksock_peer_ni *peer_ni);
+struct ksock_route *
+ksocknal_find_connecting_route_locked(struct ksock_peer_ni *peer_ni);
 int ksocknal_new_packet(struct ksock_conn *conn, int skip);
 int ksocknal_scheduler(void *arg);
 int ksocknal_connd(void *arg);
diff --git a/net/lnet/klnds/socklnd/socklnd_cb.c b/net/lnet/klnds/socklnd/socklnd_cb.c
index 0132727..2b93331 100644
--- a/net/lnet/klnds/socklnd/socklnd_cb.c
+++ b/net/lnet/klnds/socklnd/socklnd_cb.c
@@ -394,7 +394,7 @@ struct ksock_tx *
 ksocknal_check_zc_req(struct ksock_tx *tx)
 {
 	struct ksock_conn *conn = tx->tx_conn;
-	struct ksock_peer *peer_ni = conn->ksnc_peer;
+	struct ksock_peer_ni *peer_ni = conn->ksnc_peer;
 
 	/*
 	 * Set tx_msg.ksm_zc_cookies[0] to a unique non-zero cookie and add tx
@@ -440,7 +440,7 @@ struct ksock_tx *
 static void
 ksocknal_uncheck_zc_req(struct ksock_tx *tx)
 {
-	struct ksock_peer *peer_ni = tx->tx_conn->ksnc_peer;
+	struct ksock_peer_ni *peer_ni = tx->tx_conn->ksnc_peer;
 
 	LASSERT(tx->tx_msg.ksm_type != KSOCK_MSG_NOOP);
 	LASSERT(tx->tx_zc_capable);
@@ -581,7 +581,7 @@ struct ksock_tx *
 }
 
 void
-ksocknal_launch_all_connections_locked(struct ksock_peer *peer_ni)
+ksocknal_launch_all_connections_locked(struct ksock_peer_ni *peer_ni)
 {
 	struct ksock_route *route;
 
@@ -597,7 +597,7 @@ struct ksock_tx *
 }
 
 struct ksock_conn *
-ksocknal_find_conn_locked(struct ksock_peer *peer_ni, struct ksock_tx *tx,
+ksocknal_find_conn_locked(struct ksock_peer_ni *peer_ni, struct ksock_tx *tx,
 			  int nonblk)
 {
 	struct ksock_conn *c;
@@ -763,7 +763,7 @@ struct ksock_conn *
 }
 
 struct ksock_route *
-ksocknal_find_connectable_route_locked(struct ksock_peer *peer_ni)
+ksocknal_find_connectable_route_locked(struct ksock_peer_ni *peer_ni)
 {
 	time64_t now = ktime_get_seconds();
 	struct ksock_route *route;
@@ -797,7 +797,7 @@ struct ksock_route *
 }
 
 struct ksock_route *
-ksocknal_find_connecting_route_locked(struct ksock_peer *peer_ni)
+ksocknal_find_connecting_route_locked(struct ksock_peer_ni *peer_ni)
 {
 	struct ksock_route *route;
 
@@ -815,7 +815,7 @@ struct ksock_route *
 ksocknal_launch_packet(struct lnet_ni *ni, struct ksock_tx *tx,
 		       struct lnet_process_id id)
 {
-	struct ksock_peer *peer_ni;
+	struct ksock_peer_ni *peer_ni;
 	struct ksock_conn *conn;
 	rwlock_t *g_lock;
 	int retry;
@@ -1806,7 +1806,7 @@ void ksocknal_write_callback(struct ksock_conn *conn)
 ksocknal_connect(struct ksock_route *route)
 {
 	LIST_HEAD(zombies);
-	struct ksock_peer *peer_ni = route->ksnr_peer;
+	struct ksock_peer_ni *peer_ni = route->ksnr_peer;
 	int type;
 	int wanted;
 	struct socket *sock;
@@ -2213,7 +2213,7 @@ void ksocknal_write_callback(struct ksock_conn *conn)
 }
 
 static struct ksock_conn *
-ksocknal_find_timed_out_conn(struct ksock_peer *peer_ni)
+ksocknal_find_timed_out_conn(struct ksock_peer_ni *peer_ni)
 {
 	/* We're called with a shared lock on ksnd_global_lock */
 	struct ksock_conn *conn;
@@ -2296,7 +2296,7 @@ void ksocknal_write_callback(struct ksock_conn *conn)
 }
 
 static inline void
-ksocknal_flush_stale_txs(struct ksock_peer *peer_ni)
+ksocknal_flush_stale_txs(struct ksock_peer_ni *peer_ni)
 {
 	struct ksock_tx *tx;
 	LIST_HEAD(stale_txs);
@@ -2322,7 +2322,7 @@ void ksocknal_write_callback(struct ksock_conn *conn)
 }
 
 static int
-ksocknal_send_keepalive_locked(struct ksock_peer *peer_ni)
+ksocknal_send_keepalive_locked(struct ksock_peer_ni *peer_ni)
 	__must_hold(&ksocknal_data.ksnd_global_lock)
 {
 	struct ksock_sched *sched;
@@ -2388,7 +2388,7 @@ void ksocknal_write_callback(struct ksock_conn *conn)
 ksocknal_check_peer_timeouts(int idx)
 {
 	struct list_head *peers = &ksocknal_data.ksnd_peers[idx];
-	struct ksock_peer *peer_ni;
+	struct ksock_peer_ni *peer_ni;
 	struct ksock_conn *conn;
 	struct ksock_tx *tx;
 
diff --git a/net/lnet/klnds/socklnd/socklnd_proto.c b/net/lnet/klnds/socklnd/socklnd_proto.c
index 64c0c74..c6ea302 100644
--- a/net/lnet/klnds/socklnd/socklnd_proto.c
+++ b/net/lnet/klnds/socklnd/socklnd_proto.c
@@ -367,7 +367,7 @@
 static int
 ksocknal_handle_zcreq(struct ksock_conn *c, u64 cookie, int remote)
 {
-	struct ksock_peer *peer_ni = c->ksnc_peer;
+	struct ksock_peer_ni *peer_ni = c->ksnc_peer;
 	struct ksock_conn *conn;
 	struct ksock_tx *tx;
 	int rc;
@@ -411,7 +411,7 @@
 static int
 ksocknal_handle_zcack(struct ksock_conn *conn, u64 cookie1, u64 cookie2)
 {
-	struct ksock_peer *peer_ni = conn->ksnc_peer;
+	struct ksock_peer_ni *peer_ni = conn->ksnc_peer;
 	struct ksock_tx *tx;
 	struct ksock_tx *tmp;
 	LIST_HEAD(zlist);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 554/622] lnet: change ksocknal_create_peer() to return pointer
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (552 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 553/622] lnet: socklnd: rename struct ksock_peer to struct ksock_peer_ni James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 555/622] lnet: discard ksnn_lock James Simmons
                   ` (68 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

ksocknal_create_peer() currently returns an error status, and if that
is 0, a pointer is stored in a by-reference argument.  The preferred
pattern in the kernel is to return the pointer, or the error code
encoded with ERR_PTR().

WC-bug-id: https://jira.whamcloud.com/browse/LU-12678
Lustre-commit: 049683bc0fc0 ("LU-12678 lnet: change ksocknal_create_peer() to return pointer")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/36833
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/socklnd/socklnd.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/net/lnet/klnds/socklnd/socklnd.c b/net/lnet/klnds/socklnd/socklnd.c
index 79068f3..3e69d9c 100644
--- a/net/lnet/klnds/socklnd/socklnd.c
+++ b/net/lnet/klnds/socklnd/socklnd.c
@@ -98,9 +98,8 @@
 	kfree(route);
 }
 
-static int
-ksocknal_create_peer(struct ksock_peer_ni **peerp, struct lnet_ni *ni,
-		     struct lnet_process_id id)
+static struct ksock_peer_ni *
+ksocknal_create_peer(struct lnet_ni *ni, struct lnet_process_id id)
 {
 	int cpt = lnet_cpt_of_nid(id.nid, ni);
 	struct ksock_net *net = ni->ni_data;
@@ -112,7 +111,7 @@
 
 	peer_ni = kzalloc_cpt(sizeof(*peer_ni), GFP_NOFS, cpt);
 	if (!peer_ni)
-		return -ENOMEM;
+		return ERR_PTR(-ENOMEM);
 
 	peer_ni->ksnp_ni = ni;
 	peer_ni->ksnp_id = id;
@@ -136,15 +135,14 @@
 
 		kfree(peer_ni);
 		CERROR("Can't create peer_ni: network shutdown\n");
-		return -ESHUTDOWN;
+		return ERR_PTR(-ESHUTDOWN);
 	}
 
 	net->ksnn_npeers++;
 
 	spin_unlock_bh(&net->ksnn_lock);
 
-	*peerp = peer_ni;
-	return 0;
+	return peer_ni;
 }
 
 void
@@ -447,16 +445,15 @@ struct ksock_peer_ni *
 	struct ksock_peer_ni *peer2;
 	struct ksock_route *route;
 	struct ksock_route *route2;
-	int rc;
 
 	if (id.nid == LNET_NID_ANY ||
 	    id.pid == LNET_PID_ANY)
 		return -EINVAL;
 
 	/* Have a brand new peer_ni ready... */
-	rc = ksocknal_create_peer(&peer_ni, ni, id);
-	if (rc)
-		return rc;
+	peer_ni = ksocknal_create_peer(ni, id);
+	if (IS_ERR(peer_ni))
+		return PTR_ERR(peer_ni);
 
 	route = ksocknal_create_route(ipaddr, port);
 	if (!route) {
@@ -1114,9 +1111,11 @@ struct ksock_peer_ni *
 		ksocknal_peer_addref(peer_ni);
 		write_lock_bh(global_lock);
 	} else {
-		rc = ksocknal_create_peer(&peer_ni, ni, peerid);
-		if (rc)
+		peer_ni = ksocknal_create_peer(ni, peerid);
+		if (IS_ERR(peer_ni)) {
+			rc = PTR_ERR(peer_ni);
 			goto failed_1;
+		}
 
 		write_lock_bh(global_lock);
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 555/622] lnet: discard ksnn_lock
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (553 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 554/622] lnet: change ksocknal_create_peer() to return pointer James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 556/622] lnet: discard LNetMEInsert James Simmons
                   ` (67 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

This lock in 'struct ksock_net' is being taken in places where it
isn't needed, so it is worth cleaning up.

It isn't needed when checking whether ksnn_npeers has reached
0, because at that point in the code the value can only
decrement to zero and then stay there.

It is only needed:
 - to ensure concurrent updates to ksnn_npeers don't race, and
 - to ensure that no more peers are added after the net is shutdown.

The first is best achieved using atomic_t.
The second is more easily achieved by replacing the ksnn_shutdown
flag with a large negative bias on ksnn_npeers, and using
atomic_inc_unless_negative().

So change ksnn_npeers to atomic_t and discard ksnn_lock
and ksnn_shutdown.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12678
Lustre-commit: fb983bbebf81 ("LU-12678 lnet: discard ksnn_lock")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/36834
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/socklnd/socklnd.c | 46 +++++++++++++---------------------------
 net/lnet/klnds/socklnd/socklnd.h |  9 +++++---
 2 files changed, 21 insertions(+), 34 deletions(-)

diff --git a/net/lnet/klnds/socklnd/socklnd.c b/net/lnet/klnds/socklnd/socklnd.c
index 3e69d9c..1d0bedb 100644
--- a/net/lnet/klnds/socklnd/socklnd.c
+++ b/net/lnet/klnds/socklnd/socklnd.c
@@ -109,9 +109,16 @@
 	LASSERT(id.pid != LNET_PID_ANY);
 	LASSERT(!in_interrupt());
 
+	if (!atomic_inc_unless_negative(&net->ksnn_npeers)) {
+		CERROR("Can't create peer_ni: network shutdown\n");
+		return ERR_PTR(-ESHUTDOWN);
+	}
+
 	peer_ni = kzalloc_cpt(sizeof(*peer_ni), GFP_NOFS, cpt);
-	if (!peer_ni)
+	if (!peer_ni) {
+		atomic_dec(&net->ksnn_npeers);
 		return ERR_PTR(-ENOMEM);
+	}
 
 	peer_ni->ksnp_ni = ni;
 	peer_ni->ksnp_id = id;
@@ -128,20 +135,6 @@
 	INIT_LIST_HEAD(&peer_ni->ksnp_zc_req_list);
 	spin_lock_init(&peer_ni->ksnp_lock);
 
-	spin_lock_bh(&net->ksnn_lock);
-
-	if (net->ksnn_shutdown) {
-		spin_unlock_bh(&net->ksnn_lock);
-
-		kfree(peer_ni);
-		CERROR("Can't create peer_ni: network shutdown\n");
-		return ERR_PTR(-ESHUTDOWN);
-	}
-
-	net->ksnn_npeers++;
-
-	spin_unlock_bh(&net->ksnn_lock);
-
 	return peer_ni;
 }
 
@@ -168,9 +161,7 @@
 	 * do with this peer_ni has been cleaned up when its refcount drops to
 	 * zero.
 	 */
-	spin_lock_bh(&net->ksnn_lock);
-	net->ksnn_npeers--;
-	spin_unlock_bh(&net->ksnn_lock);
+	atomic_dec(&net->ksnn_npeers);
 }
 
 struct ksock_peer_ni *
@@ -464,7 +455,7 @@ struct ksock_peer_ni *
 	write_lock_bh(&ksocknal_data.ksnd_global_lock);
 
 	/* always called with a ref on ni, so shutdown can't have started */
-	LASSERT(!((struct ksock_net *)ni->ni_data)->ksnn_shutdown);
+	LASSERT(atomic_read(&((struct ksock_net *)ni->ni_data)->ksnn_npeers) >= 0);
 
 	peer2 = ksocknal_find_peer_locked(ni, id);
 	if (peer2) {
@@ -1120,7 +1111,7 @@ struct ksock_peer_ni *
 		write_lock_bh(global_lock);
 
 		/* called with a ref on ni, so shutdown can't have started */
-		LASSERT(!((struct ksock_net *)ni->ni_data)->ksnn_shutdown);
+		LASSERT(atomic_read(&((struct ksock_net *)ni->ni_data)->ksnn_npeers) >= 0);
 
 		peer2 = ksocknal_find_peer_locked(ni, peerid);
 		if (!peer2) {
@@ -2516,30 +2507,24 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 	LASSERT(ksocknal_data.ksnd_init == SOCKNAL_INIT_ALL);
 	LASSERT(ksocknal_data.ksnd_nnets > 0);
 
-	spin_lock_bh(&net->ksnn_lock);
-	net->ksnn_shutdown = 1;		/* prevent new peers */
-	spin_unlock_bh(&net->ksnn_lock);
+	/* prevent new peers */
+	atomic_add(SOCKNAL_SHUTDOWN_BIAS, &net->ksnn_npeers);
 
 	/* Delete all peers */
 	ksocknal_del_peer(ni, anyid, 0);
 
 	/* Wait for all peer_ni state to clean up */
 	i = 2;
-	spin_lock_bh(&net->ksnn_lock);
-	while (net->ksnn_npeers) {
-		spin_unlock_bh(&net->ksnn_lock);
-
+	while (atomic_read(&net->ksnn_npeers) > SOCKNAL_SHUTDOWN_BIAS) {
 		i++;
 		CDEBUG(((i & (-i)) == i) ? D_WARNING : D_NET, /* power of 2? */
 		       "waiting for %d peers to disconnect\n",
-		       net->ksnn_npeers);
+		       atomic_read(&net->ksnn_npeers) - SOCKNAL_SHUTDOWN_BIAS);
 		schedule_timeout_uninterruptible(HZ);
 
 		ksocknal_debug_peerhash(ni);
 
-		spin_lock_bh(&net->ksnn_lock);
 	}
-	spin_unlock_bh(&net->ksnn_lock);
 
 	for (i = 0; i < net->ksnn_ninterfaces; i++) {
 		LASSERT(!net->ksnn_interfaces[i].ksni_npeers);
@@ -2691,7 +2676,6 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 	if (!net)
 		goto fail_0;
 
-	spin_lock_init(&net->ksnn_lock);
 	net->ksnn_incarnation = ktime_get_real_ns();
 	ni->ni_data = net;
 	net_tunables = &ni->ni_net->net_tunables;
diff --git a/net/lnet/klnds/socklnd/socklnd.h b/net/lnet/klnds/socklnd/socklnd.h
index 1e10663..832bc08 100644
--- a/net/lnet/klnds/socklnd/socklnd.h
+++ b/net/lnet/klnds/socklnd/socklnd.h
@@ -166,14 +166,17 @@ struct ksock_tunables {
 
 struct ksock_net {
 	u64			ksnn_incarnation;	/* my epoch */
-	spinlock_t		ksnn_lock;		/* serialise */
 	struct list_head	ksnn_list;		/* chain on global list */
-	int			ksnn_npeers;		/* # peers */
-	int			ksnn_shutdown;		/* shutting down? */
+	atomic_t		ksnn_npeers;		/* # peers */
 	int			ksnn_ninterfaces;	/* IP interfaces */
 	struct ksock_interface	ksnn_interfaces[LNET_INTERFACES_NUM];
 };
 
+/* When the ksock_net is shut down, this bias is added to
+ * ksnn_npeers, which prevents new peers from being added.
+ */
+#define SOCKNAL_SHUTDOWN_BIAS	(INT_MIN + 1)
+
 /** connd timeout */
 #define SOCKNAL_CONND_TIMEOUT	120
 /** reserved thread for accepting & creating new connd */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 556/622] lnet: discard LNetMEInsert
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (554 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 555/622] lnet: discard ksnn_lock James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 557/622] lustre: lmv: fix to return correct MDT count James Simmons
                   ` (66 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

This function has never been used.
It is not used by cray-dvs, the other user of LNet.

So discard it.

Lustre-commit: bd5e458cc5fc ("LU-12678 lnet: discard LNetMEInsert")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/36858
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/api.h | 10 +-----
 net/lnet/lnet/lib-me.c   | 91 ++----------------------------------------------
 2 files changed, 3 insertions(+), 98 deletions(-)

diff --git a/include/linux/lnet/api.h b/include/linux/lnet/api.h
index 4b152c8..ac602fc 100644
--- a/include/linux/lnet/api.h
+++ b/include/linux/lnet/api.h
@@ -91,7 +91,7 @@
  * and a set of match criteria. The match criteria can be used to reject
  * incoming requests based on process ID or the match bits provided in the
  * request. MEs can be dynamically inserted into a match list by LNetMEAttach()
- * and LNetMEInsert(), and removed from its list by LNetMEUnlink().
+ * and removed from its list by LNetMEUnlink().
  * @{
  */
 int LNetMEAttach(unsigned int portal,
@@ -102,14 +102,6 @@ int LNetMEAttach(unsigned int portal,
 		 enum lnet_ins_pos pos_in,
 		 struct lnet_handle_me *handle_out);
 
-int LNetMEInsert(struct lnet_handle_me current_in,
-		 struct lnet_process_id match_id_in,
-		 u64 match_bits_in,
-		 u64 ignore_bits_in,
-		 enum lnet_unlink unlink_in,
-		 enum lnet_ins_pos position_in,
-		 struct lnet_handle_me *handle_out);
-
 int LNetMEUnlink(struct lnet_handle_me current_in);
 /** @} lnet_me */
 
diff --git a/net/lnet/lnet/lib-me.c b/net/lnet/lnet/lib-me.c
index 4fe6991..47cf498 100644
--- a/net/lnet/lnet/lib-me.c
+++ b/net/lnet/lnet/lib-me.c
@@ -63,8 +63,8 @@
  *		appended to the match list. Allowed constants: LNET_INS_BEFORE,
  *		LNET_INS_AFTER.
  * @handle	On successful returns, a handle to the newly created ME object
- *		is saved here. This handle can be used later in LNetMEInsert(),
- *		LNetMEUnlink(), or LNetMDAttach() functions.
+ *		is saved here. This handle can be used later in LNetMEUnlink(),
+ *		or LNetMDAttach() functions.
  *
  * Return:	0 On success.
  *		-EINVAL If @portal is invalid.
@@ -125,93 +125,6 @@
 EXPORT_SYMBOL(LNetMEAttach);
 
 /**
- * Create and a match entry and insert it before or after the ME pointed to by
- * @current_meh. The new ME is empty, i.e. not associated with a memory
- * descriptor. LNetMDAttach() can be used to attach a MD to an empty ME.
- *
- * This function is identical to LNetMEAttach() except for the position
- * where the new ME is inserted.
- *
- * @current_meh		A handle for a ME. The new ME will be inserted
- *			immediately before or immediately after this ME.
- * @match_id		See the discussion for LNetMEAttach().
- * @match_bits
- * @ignore_bits
- * @unlink
- * @pos
- * @handle
- *
- * Return:		0 On success.
- *			-ENOMEM If new ME object cannot be allocated.
- *			-ENOENT If @current_meh does not point to a valid match entry.
- */
-int
-LNetMEInsert(struct lnet_handle_me current_meh,
-	     struct lnet_process_id match_id,
-	     u64 match_bits, u64 ignore_bits,
-	     enum lnet_unlink unlink, enum lnet_ins_pos pos,
-	     struct lnet_handle_me *handle)
-{
-	struct lnet_me *current_me;
-	struct lnet_me *new_me;
-	struct lnet_portal *ptl;
-	int cpt;
-
-	LASSERT(the_lnet.ln_refcount > 0);
-
-	if (pos == LNET_INS_LOCAL)
-		return -EPERM;
-
-	new_me = kzalloc(sizeof(*new_me), GFP_NOFS);
-	if (!new_me)
-		return -ENOMEM;
-
-	cpt = lnet_cpt_of_cookie(current_meh.cookie);
-
-	lnet_res_lock(cpt);
-
-	current_me = lnet_handle2me(&current_meh);
-	if (!current_me) {
-		kfree(new_me);
-
-		lnet_res_unlock(cpt);
-		return -ENOENT;
-	}
-
-	LASSERT(current_me->me_portal < the_lnet.ln_nportals);
-
-	ptl = the_lnet.ln_portals[current_me->me_portal];
-	if (lnet_ptl_is_unique(ptl)) {
-		/* nosense to insertion on unique portal */
-		kfree(new_me);
-		lnet_res_unlock(cpt);
-		return -EPERM;
-	}
-
-	new_me->me_pos = current_me->me_pos;
-	new_me->me_portal = current_me->me_portal;
-	new_me->me_match_id = match_id;
-	new_me->me_match_bits = match_bits;
-	new_me->me_ignore_bits = ignore_bits;
-	new_me->me_unlink = unlink;
-	new_me->me_md = NULL;
-
-	lnet_res_lh_initialize(the_lnet.ln_me_containers[cpt], &new_me->me_lh);
-
-	if (pos == LNET_INS_AFTER)
-		list_add(&new_me->me_list, &current_me->me_list);
-	else
-		list_add_tail(&new_me->me_list, &current_me->me_list);
-
-	lnet_me2handle(handle, new_me);
-
-	lnet_res_unlock(cpt);
-
-	return 0;
-}
-EXPORT_SYMBOL(LNetMEInsert);
-
-/**
  * Unlink a match entry from its match list.
  *
  * This operation also releases any resources associated with the ME. If a
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 557/622] lustre: lmv: fix to return correct MDT count
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (555 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 556/622] lnet: discard LNetMEInsert James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 558/622] lustre: obdclass: remove assertion for imp_refcount James Simmons
                   ` (65 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Wang Shilong <wshilong@ddn.com>

@ltd_tgts_size could be larger than the actual MDT count,
as we preallocate ltd_tgts and resize it if necessary.

Fix it to use @ld_tgt_count instead.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12951
Lustre-commit: 3aa8826aabc7 ("LU-12951 lmv: fix to return correct MDT count")
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-on: https://review.whamcloud.com/36713
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Olaf Faaland-LLNL <faaland1@llnl.gov>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/lmv/lmv_obd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index e92be25..ee52bba 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -2870,7 +2870,7 @@ static int lmv_get_info(const struct lu_env *env, struct obd_export *exp,
 			exp->exp_connect_data = *(struct obd_connect_data *)val;
 		return rc;
 	} else if (KEY_IS(KEY_TGT_COUNT)) {
-		*((int *)val) = lmv->lmv_mdt_descs.ltd_tgts_size;
+		*((int *)val) = lmv->lmv_mdt_descs.ltd_lmv_desc.ld_tgt_count;
 		return 0;
 	}
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 558/622] lustre: obdclass: remove assertion for imp_refcount
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (556 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 557/622] lustre: lmv: fix to return correct MDT count James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 559/622] lnet: Prefer route specified by rtr_nid James Simmons
                   ` (64 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Li Dongyang <dongyangli@ddn.com>

After calling obd_zombie_import_add(), obd_import could
be freed by obd_zombie before we check imp_refcount with
LASSERT_ATOMIC_GE_LT. It's a use after free and could
crash the box.
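
The hazard can be sketched in userspace C (all names here are invented stand-ins, not the Lustre API): once the final reference drop hands the object off for deferred destruction, the caller no longer owns it, so any sanity check must use the counter value read before the drop rather than dereferencing the object afterwards.

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative sketch of a safe final-put ordering. */
struct obj {
	atomic_int refcount;
	int freed;		/* stand-in for "handed to obd_zombie" */
};

static void zombie_add(struct obj *o)
{
	o->freed = 1;		/* in the kernel this frees the import */
}

static int refcount_put(struct obj *o)
{
	int old = atomic_fetch_sub(&o->refcount, 1);

	assert(old > 0);	/* checks the pre-drop value: safe */
	if (old == 1)
		zombie_add(o);	/* last ref gone: o is dead to us now */
	return old - 1;
}
```

Reading o->refcount after refcount_put() returns 0, as the removed LASSERT did, would race with the free.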

WC-bug-id: https://jira.whamcloud.com/browse/LU-12965
Lustre-commit: dd71e74fecf4 ("LU-12965 obdclass: remove assertion for imp_refcount")
Signed-off-by: Li Dongyang <dongyangli@ddn.com>
Reviewed-on: https://review.whamcloud.com/36743
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/genops.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/lustre/obdclass/genops.c b/fs/lustre/obdclass/genops.c
index bceb055..a31e9ce 100644
--- a/fs/lustre/obdclass/genops.c
+++ b/fs/lustre/obdclass/genops.c
@@ -945,9 +945,6 @@ void class_import_put(struct obd_import *imp)
 		CDEBUG(D_INFO, "final put import %p\n", imp);
 		obd_zombie_import_add(imp);
 	}
-
-	/* catch possible import put race */
-	LASSERT_ATOMIC_GE_LT(&imp->imp_refcount, 0, LI_POISON);
 }
 EXPORT_SYMBOL(class_import_put);
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 559/622] lnet: Prefer route specified by rtr_nid
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (557 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 558/622] lustre: obdclass: remove assertion for imp_refcount James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 560/622] lustre: all: prefer sizeof(*var) for alloc James Simmons
                   ` (63 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

Restore an optimization that was initially added under LU-11413. For
routed REPLY and ACK we should preferably use the same router from
which the GET/PUT was received.

Cray-bug-id: LUS-8008
WC-bug-id: https://jira.whamcloud.com/browse/LU-12646
Lustre-commit: ca8958189198 ("LU-12646 lnet: Prefer route specified by rtr_nid")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/35737
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-move.c | 131 +++++++++++++++++++++++++++--------------------
 net/lnet/lnet/lib-msg.c  |   4 --
 2 files changed, 76 insertions(+), 59 deletions(-)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index ca0009c..6a2833c 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -1330,7 +1330,7 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 
 static struct lnet_route *
 lnet_find_route_locked(struct lnet_net *net, u32 remote_net,
-		       lnet_nid_t rtr_nid, struct lnet_route **prev_route,
+		       struct lnet_route **prev_route,
 		       struct lnet_peer_ni **gwni)
 {
 	struct lnet_peer_ni *best_gw_ni = NULL;
@@ -1342,10 +1342,6 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	struct lnet_peer *lp;
 	int rc;
 
-	/*
-	 * If @rtr_nid is not LNET_NID_ANY, return the gateway with
-	 * rtr_nid nid, otherwise find the best gateway I can use
-	 */
 	rnet = lnet_find_rnet_locked(remote_net);
 	if (!rnet)
 		return NULL;
@@ -1652,13 +1648,14 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 
 	rc = lnet_post_send_locked(msg, 0);
 	if (!rc)
-		CDEBUG(D_NET, "TRACE: %s(%s:%s) -> %s(%s:%s) : %s try# %d\n",
+		CDEBUG(D_NET, "TRACE: %s(%s:%s) -> %s(%s:%s) %s : %s try# %d\n",
 		       libcfs_nid2str(msg->msg_hdr.src_nid),
 		       libcfs_nid2str(msg->msg_txni->ni_nid),
 		       libcfs_nid2str(sd->sd_src_nid),
 		       libcfs_nid2str(msg->msg_hdr.dest_nid),
 		       libcfs_nid2str(sd->sd_dst_nid),
 		       libcfs_nid2str(msg->msg_txpeer->lpni_nid),
+		       libcfs_nid2str(sd->sd_rtr_nid),
 		       lnet_msgtyp2str(msg->msg_type), msg->msg_retry_count);
 
 	return rc;
@@ -1829,70 +1826,91 @@ struct lnet_ni *
 			     struct lnet_peer **gw_peer)
 {
 	int rc;
+	u32 local_lnet;
 	struct lnet_peer *gw;
 	struct lnet_peer *lp;
 	struct lnet_peer_net *lpn;
 	struct lnet_peer_net *best_lpn = NULL;
 	struct lnet_remotenet *rnet;
-	struct lnet_route *best_route;
-	struct lnet_route *last_route;
+	struct lnet_route *best_route = NULL;
+	struct lnet_route *last_route = NULL;
 	struct lnet_peer_ni *lpni = NULL;
 	struct lnet_peer_ni *gwni = NULL;
 	lnet_nid_t src_nid = sd->sd_src_nid;
 
-	/* we've already looked up the initial lpni using dst_nid */
-	lpni = sd->sd_best_lpni;
-	/* the peer tree must be in existence */
-	LASSERT(lpni && lpni->lpni_peer_net && lpni->lpni_peer_net->lpn_peer);
-	lp = lpni->lpni_peer_net->lpn_peer;
+	/* If a router nid was specified then we are replying to a GET or
+	 * sending an ACK. In this case we use the gateway associated with the
+	 * specified router nid.
+	 */
+	if (sd->sd_rtr_nid != LNET_NID_ANY) {
+		gwni = lnet_find_peer_ni_locked(sd->sd_rtr_nid);
+		if (!gwni) {
+			CERROR("No peer NI for gateway %s\n",
+			       libcfs_nid2str(sd->sd_rtr_nid));
+			return -EHOSTUNREACH;
+		}
+		gw = gwni->lpni_peer_net->lpn_peer;
+		lnet_peer_ni_decref_locked(gwni);
+		local_lnet = LNET_NIDNET(sd->sd_rtr_nid);
+	} else {
+		/* we've already looked up the initial lpni using dst_nid */
+		lpni = sd->sd_best_lpni;
+		/* the peer tree must be in existence */
+		LASSERT(lpni && lpni->lpni_peer_net &&
+			lpni->lpni_peer_net->lpn_peer);
+		lp = lpni->lpni_peer_net->lpn_peer;
+
+		list_for_each_entry(lpn, &lp->lp_peer_nets, lpn_peer_nets) {
+			/* is this remote network reachable?  */
+			rnet = lnet_find_rnet_locked(lpn->lpn_net_id);
+			if (!rnet)
+				continue;
 
-	list_for_each_entry(lpn, &lp->lp_peer_nets, lpn_peer_nets) {
-		/* is this remote network reachable?  */
-		rnet = lnet_find_rnet_locked(lpn->lpn_net_id);
-		if (!rnet)
-			continue;
+			if (!best_lpn)
+				best_lpn = lpn;
+
+			if (best_lpn->lpn_seq <= lpn->lpn_seq)
+				continue;
 
-		if (!best_lpn)
 			best_lpn = lpn;
+		}
 
-		if (best_lpn->lpn_seq <= lpn->lpn_seq)
-			continue;
+		if (!best_lpn) {
+			CERROR("peer %s has no available nets\n",
+			       libcfs_nid2str(sd->sd_dst_nid));
+			return -EHOSTUNREACH;
+		}
 
-		best_lpn = lpn;
-	}
+		sd->sd_best_lpni = lnet_find_best_lpni_on_net(sd, lp,
+							      best_lpn->lpn_net_id);
+		if (!sd->sd_best_lpni) {
+			CERROR("peer %s down\n",
+			       libcfs_nid2str(sd->sd_dst_nid));
+			return -EHOSTUNREACH;
+		}
 
-	if (!best_lpn) {
-		CERROR("peer %s has no available nets\n",
-		       libcfs_nid2str(sd->sd_dst_nid));
-		return -EHOSTUNREACH;
-	}
+		best_route = lnet_find_route_locked(NULL, best_lpn->lpn_net_id,
+						    &last_route, &gwni);
+		if (!best_route) {
+			CERROR("no route to %s from %s\n",
+			       libcfs_nid2str(dst_nid),
+			       libcfs_nid2str(src_nid));
+			return -EHOSTUNREACH;
+		}
 
-	sd->sd_best_lpni = lnet_find_best_lpni_on_net(sd, lp,
-						      best_lpn->lpn_net_id);
-	if (!sd->sd_best_lpni) {
-		CERROR("peer %s down\n", libcfs_nid2str(sd->sd_dst_nid));
-		return -EHOSTUNREACH;
-	}
+		if (!gwni) {
+			CERROR("Internal Error. Route expected to %s from %s\n",
+			       libcfs_nid2str(dst_nid),
+			       libcfs_nid2str(src_nid));
+			return -EFAULT;
+		}
 
-	best_route = lnet_find_route_locked(NULL, best_lpn->lpn_net_id,
-					    sd->sd_rtr_nid, &last_route,
-					    &gwni);
-	if (!best_route) {
-		CERROR("no route to %s from %s\n",
-		       libcfs_nid2str(dst_nid), libcfs_nid2str(src_nid));
-		return -EHOSTUNREACH;
-	}
+		gw = best_route->lr_gateway;
+		LASSERT(gw == gwni->lpni_peer_net->lpn_peer);
+		local_lnet = best_route->lr_lnet;
 
-	if (!gwni) {
-		CERROR("Internal Error. Route expected to %s from %s\n",
-		       libcfs_nid2str(dst_nid),
-		       libcfs_nid2str(src_nid));
-		return -EFAULT;
 	}
 
-	gw = best_route->lr_gateway;
-	LASSERT(gw == gwni->lpni_peer_net->lpn_peer);
-
 	/* Discover this gateway if it hasn't already been discovered.
 	 * This means we might delay the message until discovery has
 	 * completed
@@ -1906,14 +1924,15 @@ struct lnet_ni *
 	if (!sd->sd_best_ni) {
 		struct lnet_peer_net *lpeer;
 
-		lpeer = lnet_peer_get_net_locked(gw, best_route->lr_lnet);
+		lpeer = lnet_peer_get_net_locked(gw, local_lnet);
 		sd->sd_best_ni = lnet_find_best_ni_on_spec_net(NULL, gw, lpeer,
 							       sd->sd_md_cpt,
 							       true);
 	}
+
 	if (!sd->sd_best_ni) {
 		CERROR("Internal Error. Expected local ni on %s but non found :%s\n",
-		       libcfs_net2str(best_route->lr_lnet),
+		       libcfs_net2str(local_lnet),
 		       libcfs_nid2str(sd->sd_src_nid));
 		return -EFAULT;
 	}
@@ -1924,9 +1943,11 @@ struct lnet_ni *
 	/* increment the sequence numbers since now we're sure we're
 	 * going to use this path
 	 */
-	LASSERT(best_route && last_route);
-	best_route->lr_seq = last_route->lr_seq + 1;
-	best_lpn->lpn_seq++;
+	if (sd->sd_rtr_nid == LNET_NID_ANY) {
+		LASSERT(best_route && last_route);
+		best_route->lr_seq = last_route->lr_seq + 1;
+		best_lpn->lpn_seq++;
+	}
 
 	return 0;
 }
diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index d74ff53..86ac692 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -397,10 +397,6 @@
 		msg->msg_hdr.msg.ack.match_bits = msg->msg_ev.match_bits;
 		msg->msg_hdr.msg.ack.mlength = cpu_to_le32(msg->msg_ev.mlength);
 
-		/*
-		 * NB: we probably want to use NID of msg::msg_from as 3rd
-		 * parameter (router NID) if it's routed message
-		 */
 		rc = lnet_send(msg->msg_ev.target.nid, msg, msg->msg_from);
 
 		lnet_net_lock(cpt);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 560/622] lustre: all: prefer sizeof(*var) for alloc
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (558 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 559/622] lnet: Prefer route specified by rtr_nid James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 561/622] lustre: handle: discard OBD_FREE_RCU James Simmons
                   ` (62 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

The construct
   var = kzalloc(sizeof(*var), GFP...);
is more obviously correct than
   var = kzalloc(sizeof(struct something), GFP...);
and is preferred.

So convert allocations and frees that use sizeof(struct..)
to use one of the simpler constructs.

For cfs_percpt_alloc() allocations, we are allocating an
array of pointers, so sizeof(*var[0]) is best.
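
The idiom can be shown in userspace C (the struct and helper names are illustrative, not from the patch): taking the size from the variable itself keeps the allocation correct even if the pointer's type later changes.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative element type. */
struct item {
	long key;
	long val;
};

/* Allocate a table of n pointers, each pointing at one struct item,
 * mirroring the array-of-pointers shape used by cfs_percpt_alloc().
 */
static struct item **make_table(size_t n)
{
	/* n pointers: element size is sizeof(*arr) */
	struct item **arr = calloc(n, sizeof(*arr));

	if (!arr)
		return NULL;
	for (size_t i = 0; i < n; i++) {
		/* one struct per slot: sizeof(*arr[i]) names it */
		arr[i] = calloc(1, sizeof(*arr[i]));
		if (!arr[i])
			return NULL;	/* sketch: skip full cleanup */
	}
	return arr;
}
```

Neither sizeof expression mentions struct item, so a later type change cannot silently desynchronize the allocation size.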

WC-bug-id: https://jira.whamcloud.com/browse/LU-9679
Lustre-commit: 11f2c86650fd ("LU-9679 all: prefer sizeof(*var) for ALLOC/FREE")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/36661
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/lprocfs_status.c | 2 +-
 fs/lustre/obdecho/echo_client.c     | 2 +-
 net/lnet/klnds/o2iblnd/o2iblnd.c    | 9 +++++----
 net/lnet/libcfs/libcfs_lock.c       | 2 +-
 net/lnet/lnet/api-ni.c              | 7 +++----
 net/lnet/lnet/lib-eq.c              | 2 +-
 net/lnet/lnet/lib-ptl.c             | 2 +-
 net/lnet/lnet/router.c              | 2 +-
 net/lnet/selftest/rpc.c             | 2 +-
 9 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/fs/lustre/obdclass/lprocfs_status.c b/fs/lustre/obdclass/lprocfs_status.c
index 9772194..4fc35c5 100644
--- a/fs/lustre/obdclass/lprocfs_status.c
+++ b/fs/lustre/obdclass/lprocfs_status.c
@@ -1137,7 +1137,7 @@ struct lprocfs_stats *lprocfs_alloc_stats(unsigned int num,
 
 	/* alloc num of counter headers */
 	stats->ls_cnt_header = kvmalloc_array(stats->ls_num,
-					      sizeof(struct lprocfs_counter_header),
+					      sizeof(*stats->ls_cnt_header),
 					      GFP_KERNEL | __GFP_ZERO);
 	if (!stats->ls_cnt_header)
 		goto fail;
diff --git a/fs/lustre/obdecho/echo_client.c b/fs/lustre/obdecho/echo_client.c
index c473f547..84dea56 100644
--- a/fs/lustre/obdecho/echo_client.c
+++ b/fs/lustre/obdecho/echo_client.c
@@ -1367,7 +1367,7 @@ static int echo_client_prep_commit(const struct lu_env *env,
 	npages = batch >> PAGE_SHIFT;
 	tot_pages = count >> PAGE_SHIFT;
 
-	lnb = kvmalloc_array(npages, sizeof(struct niobuf_local),
+	lnb = kvmalloc_array(npages, sizeof(*lnb),
 			     GFP_NOFS | __GFP_ZERO);
 	if (!lnb) {
 		ret = -ENOMEM;
diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.c b/net/lnet/klnds/o2iblnd/o2iblnd.c
index 1cc5358..04e121b 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.c
@@ -852,7 +852,8 @@ struct kib_conn *kiblnd_create_conn(struct kib_peer_ni *peer_ni,
 
 	kfree(init_qp_attr);
 
-	conn->ibc_rxs = kzalloc_cpt(IBLND_RX_MSGS(conn) * sizeof(struct kib_rx),
+	conn->ibc_rxs = kzalloc_cpt(IBLND_RX_MSGS(conn) *
+				    sizeof(*conn->ibc_rxs),
 				    GFP_NOFS, cpt);
 	if (!conn->ibc_rxs) {
 		CERROR("Cannot allocate RX buffers\n");
@@ -2119,7 +2120,7 @@ static int kiblnd_create_tx_pool(struct kib_poolset *ps, int size,
 		return -ENOMEM;
 	}
 
-	tpo->tpo_tx_descs = kzalloc_cpt(size * sizeof(struct kib_tx),
+	tpo->tpo_tx_descs = kzalloc_cpt(size * sizeof(*tpo->tpo_tx_descs),
 					GFP_NOFS, ps->ps_cpt);
 	if (!tpo->tpo_tx_descs) {
 		CERROR("Can't allocate %d tx descriptors\n", size);
@@ -2251,7 +2252,7 @@ static int kiblnd_net_init_pools(struct kib_net *net, struct lnet_ni *ni,
 	 * number of CPTs that exist, i.e net->ibn_fmr_ps[cpt].
 	 */
 	net->ibn_fmr_ps = cfs_percpt_alloc(lnet_cpt_table(),
-					   sizeof(struct kib_fmr_poolset));
+					   sizeof(*net->ibn_fmr_ps[0]));
 	if (!net->ibn_fmr_ps) {
 		CERROR("Failed to allocate FMR pool array\n");
 		rc = -ENOMEM;
@@ -2278,7 +2279,7 @@ static int kiblnd_net_init_pools(struct kib_net *net, struct lnet_ni *ni,
 	 * number of CPTs that exist, i.e net->ibn_tx_ps[cpt].
 	 */
 	net->ibn_tx_ps = cfs_percpt_alloc(lnet_cpt_table(),
-					  sizeof(struct kib_tx_poolset));
+					  sizeof(*net->ibn_tx_ps[0]));
 	if (!net->ibn_tx_ps) {
 		CERROR("Failed to allocate tx pool array\n");
 		rc = -ENOMEM;
diff --git a/net/lnet/libcfs/libcfs_lock.c b/net/lnet/libcfs/libcfs_lock.c
index 3d5157f..313aa95 100644
--- a/net/lnet/libcfs/libcfs_lock.c
+++ b/net/lnet/libcfs/libcfs_lock.c
@@ -66,7 +66,7 @@ struct cfs_percpt_lock *
 		return NULL;
 
 	pcl->pcl_cptab = cptab;
-	pcl->pcl_locks = cfs_percpt_alloc(cptab, sizeof(*lock));
+	pcl->pcl_locks = cfs_percpt_alloc(cptab, sizeof(*pcl->pcl_locks[0]));
 	if (!pcl->pcl_locks) {
 		kfree(pcl);
 		return NULL;
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 6c913b5..0020ffd 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -970,7 +970,7 @@ static void lnet_assert_wire_constants(void)
 	int rc;
 	int i;
 
-	recs = cfs_percpt_alloc(lnet_cpt_table(), sizeof(*rec));
+	recs = cfs_percpt_alloc(lnet_cpt_table(), sizeof(*recs[0]));
 	if (!recs) {
 		CERROR("Failed to allocate %s resource containers\n",
 		       lnet_res_type2str(type));
@@ -1033,8 +1033,7 @@ struct list_head **
 	struct list_head *q;
 	int i;
 
-	qs = cfs_percpt_alloc(lnet_cpt_table(),
-			      sizeof(struct list_head));
+	qs = cfs_percpt_alloc(lnet_cpt_table(), sizeof(*qs[0]));
 	if (!qs) {
 		CERROR("Failed to allocate queues\n");
 		return NULL;
@@ -1096,7 +1095,7 @@ struct list_head **
 	the_lnet.ln_interface_cookie = ktime_get_real_ns();
 
 	the_lnet.ln_counters = cfs_percpt_alloc(lnet_cpt_table(),
-						sizeof(struct lnet_counters));
+						sizeof(*the_lnet.ln_counters[0]));
 	if (!the_lnet.ln_counters) {
 		CERROR("Failed to allocate counters for LNet\n");
 		rc = -ENOMEM;
diff --git a/net/lnet/lnet/lib-eq.c b/net/lnet/lnet/lib-eq.c
index 01b8ee3..25af2bd 100644
--- a/net/lnet/lnet/lib-eq.c
+++ b/net/lnet/lnet/lib-eq.c
@@ -95,7 +95,7 @@
 		return -ENOMEM;
 
 	if (count) {
-		eq->eq_events = kvmalloc_array(count, sizeof(struct lnet_event),
+		eq->eq_events = kvmalloc_array(count, sizeof(*eq->eq_events),
 					       GFP_KERNEL | __GFP_ZERO);
 		if (!eq->eq_events)
 			goto failed;
diff --git a/net/lnet/lnet/lib-ptl.c b/net/lnet/lnet/lib-ptl.c
index bb92f37..ae38bc3 100644
--- a/net/lnet/lnet/lib-ptl.c
+++ b/net/lnet/lnet/lib-ptl.c
@@ -793,7 +793,7 @@ struct list_head *
 	int j;
 
 	ptl->ptl_mtables = cfs_percpt_alloc(lnet_cpt_table(),
-					    sizeof(struct lnet_match_table));
+					    sizeof(*ptl->ptl_mtables[0]));
 	if (!ptl->ptl_mtables) {
 		CERROR("Failed to create match table for portal %d\n", index);
 		return -ENOMEM;
diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index 71ba951..b8f7aba0 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -1386,7 +1386,7 @@ bool lnet_router_checker_active(void)
 
 	the_lnet.ln_rtrpools = cfs_percpt_alloc(lnet_cpt_table(),
 						LNET_NRBPOOLS *
-						sizeof(struct lnet_rtrbufpool));
+						sizeof(*the_lnet.ln_rtrpools[0]));
 	if (!the_lnet.ln_rtrpools) {
 		LCONSOLE_ERROR_MSG(0x10c,
 				   "Failed to initialize router buffe pool\n");
diff --git a/net/lnet/selftest/rpc.c b/net/lnet/selftest/rpc.c
index 4645f04..7a8226c 100644
--- a/net/lnet/selftest/rpc.c
+++ b/net/lnet/selftest/rpc.c
@@ -256,7 +256,7 @@ struct srpc_bulk *
 	svc->sv_shuttingdown = 0;
 
 	svc->sv_cpt_data = cfs_percpt_alloc(lnet_cpt_table(),
-					    sizeof(**svc->sv_cpt_data));
+					    sizeof(*svc->sv_cpt_data[0]));
 	if (!svc->sv_cpt_data)
 		return -ENOMEM;
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 561/622] lustre: handle: discard OBD_FREE_RCU
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (559 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 560/622] lustre: all: prefer sizeof(*var) for alloc James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 562/622] lnet: use list_move where appropriate James Simmons
                   ` (61 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: NeilBrown <neilb@suse.com>

OBD_FREE_RCU and the hop_free call-back together form an overly
complex mechanism equivalent to kfree_rcu() or call_rcu(...).
Discard them and use the simpler approach.

This removes the only use for the field h_size, so discard
that too.
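
The mechanism being simplified rests on recovering the enclosing object from an embedded head, which can be sketched in userspace C (the container_of macro and all names here are illustrative stand-ins, not the kernel API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct fake_rcu_head {
	void (*func)(struct fake_rcu_head *);
};

struct lock_obj {
	int cookie;
	struct fake_rcu_head h_rcu;	/* embedded, like l_handle.h_rcu */
};

static int freed_cookie;

/* Analog of the new lock_handle_free() callback in the patch: the
 * callback receives the embedded head and maps back to the object.
 */
static void lock_handle_free(struct fake_rcu_head *rcu)
{
	struct lock_obj *lock = container_of(rcu, struct lock_obj, h_rcu);

	freed_cookie = lock->cookie;
	free(lock);
}

/* Stand-in for call_rcu(): real RCU defers the callback until a
 * grace period elapses; here we invoke it immediately.
 */
static void fake_call_rcu(struct fake_rcu_head *head,
			  void (*func)(struct fake_rcu_head *))
{
	func(head);
}
```

kfree_rcu(exp, exp_handle.h_rcu) is the degenerate case where the callback is plain kfree() on the recovered object, which is why export_handle_ops no longer needs any hop_free at all.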

WC-bug-id: https://jira.whamcloud.com/browse/LU-12542
Lustre-commit: 48830f888b6 ("LU-12542 handle: discard OBD_FREE_RCU")
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35797
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Petros Koutoupis <pkoutoupis@cray.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_handles.h  |  3 ---
 fs/lustre/include/obd_support.h     | 10 ----------
 fs/lustre/ldlm/ldlm_lock.c          | 16 ++++++++--------
 fs/lustre/obdclass/genops.c         |  3 +--
 fs/lustre/obdclass/lustre_handles.c | 15 ---------------
 5 files changed, 9 insertions(+), 38 deletions(-)

diff --git a/fs/lustre/include/lustre_handles.h b/fs/lustre/include/lustre_handles.h
index 7c93d72..8f733fd 100644
--- a/fs/lustre/include/lustre_handles.h
+++ b/fs/lustre/include/lustre_handles.h
@@ -46,7 +46,6 @@
 #include <linux/types.h>
 
 struct portals_handle_ops {
-	void (*hop_free)(void *object, int size);
 	/* hop_type is used for some debugging messages */
 	char *hop_type;
 };
@@ -72,7 +71,6 @@ struct portals_handle {
 	/* newly added fields to handle the RCU issue. -jxiong */
 	struct rcu_head			h_rcu;
 	spinlock_t			h_lock;
-	unsigned int			h_size:31;
 	unsigned int			h_in:1;
 };
 
@@ -83,7 +81,6 @@ void class_handle_hash(struct portals_handle *,
 		       const struct portals_handle_ops *ops);
 void class_handle_unhash(struct portals_handle *);
 void *class_handle2object(u64 cookie, const struct portals_handle_ops *ops);
-void class_handle_free_cb(struct rcu_head *rcu);
 int class_handle_init(void);
 void class_handle_cleanup(void);
 
diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index acfd098..5969b6b 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -533,16 +533,6 @@
 #define POISON_PAGE(page, val) do { } while (0)
 #endif
 
-#define OBD_FREE_RCU(ptr, size, handle)			\
-do {							\
-	struct portals_handle *__h = (handle);		\
-							\
-	__h->h_cookie = (unsigned long)(ptr);		\
-	__h->h_size = (size);				\
-	call_rcu(&__h->h_rcu, class_handle_free_cb);	\
-	POISON_PTR(ptr);				\
-} while (0)
-
 #define KEY_IS(str)					\
 	(keylen >= (sizeof(str) - 1) &&			\
 	memcmp(key, str, (sizeof(str) - 1)) == 0)
diff --git a/fs/lustre/ldlm/ldlm_lock.c b/fs/lustre/ldlm/ldlm_lock.c
index 2471e30..61bf028 100644
--- a/fs/lustre/ldlm/ldlm_lock.c
+++ b/fs/lustre/ldlm/ldlm_lock.c
@@ -153,6 +153,13 @@ struct ldlm_lock *ldlm_lock_get(struct ldlm_lock *lock)
 }
 EXPORT_SYMBOL(ldlm_lock_get);
 
+static void lock_handle_free(struct rcu_head *rcu)
+{
+	struct ldlm_lock *lock = container_of(rcu, struct ldlm_lock,
+					      l_handle.h_rcu);
+	kmem_cache_free(ldlm_lock_slab, lock);
+}
+
 /**
  * Release lock reference.
  *
@@ -186,7 +193,7 @@ void ldlm_lock_put(struct ldlm_lock *lock)
 		kvfree(lock->l_lvb_data);
 
 		lu_ref_fini(&lock->l_reference);
-		OBD_FREE_RCU(lock, sizeof(*lock), &lock->l_handle);
+		call_rcu(&lock->l_handle.h_rcu, lock_handle_free);
 	}
 }
 EXPORT_SYMBOL(ldlm_lock_put);
@@ -358,14 +365,7 @@ void ldlm_lock_destroy_nolock(struct ldlm_lock *lock)
 	}
 }
 
-static void lock_handle_free(void *lock, int size)
-{
-	LASSERT(size == sizeof(struct ldlm_lock));
-	kmem_cache_free(ldlm_lock_slab, lock);
-}
-
 static struct portals_handle_ops lock_handle_ops = {
-	.hop_free   = lock_handle_free,
 	.hop_type   = "ldlm",
 };
 
diff --git a/fs/lustre/obdclass/genops.c b/fs/lustre/obdclass/genops.c
index a31e9ce..15bea0d 100644
--- a/fs/lustre/obdclass/genops.c
+++ b/fs/lustre/obdclass/genops.c
@@ -729,11 +729,10 @@ static void class_export_destroy(struct obd_export *exp)
 	if (exp != obd->obd_self_export)
 		class_decref(obd, "export", exp);
 
-	OBD_FREE_RCU(exp, sizeof(*exp), &exp->exp_handle);
+	kfree_rcu(exp, exp_handle.h_rcu);
 }
 
 static struct portals_handle_ops export_handle_ops = {
-	.hop_free	= NULL,
 	.hop_type	= "export",
 };
 
diff --git a/fs/lustre/obdclass/lustre_handles.c b/fs/lustre/obdclass/lustre_handles.c
index 95a34db..99c68fe 100644
--- a/fs/lustre/obdclass/lustre_handles.c
+++ b/fs/lustre/obdclass/lustre_handles.c
@@ -167,21 +167,6 @@ void *class_handle2object(u64 cookie, const struct portals_handle_ops *ops)
 }
 EXPORT_SYMBOL(class_handle2object);
 
-void class_handle_free_cb(struct rcu_head *rcu)
-{
-	struct portals_handle *h;
-	void *ptr;
-
-	h = container_of(rcu, struct portals_handle, h_rcu);
-	ptr = (void *)(unsigned long)h->h_cookie;
-
-	if (h->h_ops->hop_free)
-		h->h_ops->hop_free(ptr, h->h_size);
-	else
-		kfree(ptr);
-}
-EXPORT_SYMBOL(class_handle_free_cb);
-
 int class_handle_init(void)
 {
 	struct handle_bucket *bucket;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 562/622] lnet: use list_move where appropriate.
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (560 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 561/622] lustre: handle: discard OBD_FREE_RCU James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 563/622] lnet: libcfs: provide an scnprintf and start using it James Simmons
                   ` (60 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: NeilBrown <neilb@suse.com>

There are several places in lustre where "list_del" (or occasionally
"list_del_init") is followed by "list_add" or "list_add_tail" which
moves the object to a different list.
These can be combined into "list_move" or "list_move_tail".

WC-bug-id: https://jira.whamcloud.com/browse/LU-12678
Lustre-commit: 590089790fee ("LU-12678 lnet: use list_move where appropriate.")
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/36339
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Petros Koutoupis <pkoutoupis@cray.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/o2iblnd/o2iblnd.c       | 10 ++++------
 net/lnet/klnds/o2iblnd/o2iblnd_cb.c    |  6 ++----
 net/lnet/klnds/socklnd/socklnd.c       |  3 +--
 net/lnet/klnds/socklnd/socklnd_cb.c    |  3 +--
 net/lnet/klnds/socklnd/socklnd_proto.c |  3 +--
 net/lnet/lnet/config.c                 |  3 +--
 net/lnet/lnet/lib-move.c               |  9 +++------
 net/lnet/selftest/console.c            |  6 ++----
 8 files changed, 15 insertions(+), 28 deletions(-)

diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.c b/net/lnet/klnds/o2iblnd/o2iblnd.c
index 04e121b..37d8235 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.c
@@ -1565,11 +1565,10 @@ static void kiblnd_fail_fmr_poolset(struct kib_fmr_poolset *fps,
 					       struct kib_fmr_pool,
 					       fpo_list)) != NULL) {
 		fpo->fpo_failed = 1;
-		list_del(&fpo->fpo_list);
 		if (!fpo->fpo_map_count)
-			list_add(&fpo->fpo_list, zombies);
+			list_move(&fpo->fpo_list, zombies);
 		else
-			list_add(&fpo->fpo_list, &fps->fps_failed_pool_list);
+			list_move(&fpo->fpo_list, &fps->fps_failed_pool_list);
 	}
 
 	spin_unlock(&fps->fps_lock);
@@ -1887,11 +1886,10 @@ static void kiblnd_fail_poolset(struct kib_poolset *ps, struct list_head *zombie
 					      struct kib_pool,
 					      po_list)) == NULL) {
 		po->po_failed = 1;
-		list_del(&po->po_list);
 		if (!po->po_allocated)
-			list_add(&po->po_list, zombies);
+			list_move(&po->po_list, zombies);
 		else
-			list_add(&po->po_list, &ps->ps_failed_pool_list);
+			list_move(&po->po_list, &ps->ps_failed_pool_list);
 	}
 	spin_unlock(&ps->ps_lock);
 }
diff --git a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
index fcd9db2..f769a45 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -986,8 +986,7 @@ static int kiblnd_map_tx(struct lnet_ni *ni, struct kib_tx *tx,
 	       (tx = list_first_entry_or_null(
 		       &conn->ibc_tx_queue_rsrvd,
 		       struct kib_tx, tx_list)) != NULL) {
-		list_del(&tx->tx_list);
-		list_add_tail(&tx->tx_list, &conn->ibc_tx_queue);
+		list_move_tail(&tx->tx_list, &conn->ibc_tx_queue);
 		conn->ibc_reserved_credits--;
 	}
 
@@ -2118,8 +2117,7 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 		 */
 		if (!tx->tx_sending) {
 			tx->tx_queued = 0;
-			list_del(&tx->tx_list);
-			list_add(&tx->tx_list, &zombies);
+			list_move(&tx->tx_list, &zombies);
 		}
 	}
 
diff --git a/net/lnet/klnds/socklnd/socklnd.c b/net/lnet/klnds/socklnd/socklnd.c
index 1d0bedb..593c205 100644
--- a/net/lnet/klnds/socklnd/socklnd.c
+++ b/net/lnet/klnds/socklnd/socklnd.c
@@ -1537,8 +1537,7 @@ struct ksock_peer_ni *
 
 		tx->tx_msg.ksm_zc_cookies[0] = 0;
 		tx->tx_zc_aborted = 1; /* mark it as not-acked */
-		list_del(&tx->tx_zc_list);
-		list_add(&tx->tx_zc_list, &zlist);
+		list_move(&tx->tx_zc_list, &zlist);
 	}
 
 	spin_unlock(&peer_ni->ksnp_lock);
diff --git a/net/lnet/klnds/socklnd/socklnd_cb.c b/net/lnet/klnds/socklnd/socklnd_cb.c
index 2b93331..996b231 100644
--- a/net/lnet/klnds/socklnd/socklnd_cb.c
+++ b/net/lnet/klnds/socklnd/socklnd_cb.c
@@ -2312,8 +2312,7 @@ void ksocknal_write_callback(struct ksock_conn *conn)
 
 		tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_TIMEOUT;
 
-		list_del(&tx->tx_list);
-		list_add_tail(&tx->tx_list, &stale_txs);
+		list_move_tail(&tx->tx_list, &stale_txs);
 	}
 
 	write_unlock_bh(&ksocknal_data.ksnd_global_lock);
diff --git a/net/lnet/klnds/socklnd/socklnd_proto.c b/net/lnet/klnds/socklnd/socklnd_proto.c
index c6ea302..887ed2d 100644
--- a/net/lnet/klnds/socklnd/socklnd_proto.c
+++ b/net/lnet/klnds/socklnd/socklnd_proto.c
@@ -437,8 +437,7 @@
 		if (c == cookie1 || c == cookie2 ||
 		    (cookie1 < c && c < cookie2)) {
 			tx->tx_msg.ksm_zc_cookies[0] = 0;
-			list_del(&tx->tx_zc_list);
-			list_add(&tx->tx_zc_list, &zlist);
+			list_move(&tx->tx_zc_list, &zlist);
 
 			if (!--count)
 				break;
diff --git a/net/lnet/lnet/config.c b/net/lnet/lnet/config.c
index f521b0b..8994882 100644
--- a/net/lnet/lnet/config.c
+++ b/net/lnet/lnet/config.c
@@ -1533,8 +1533,7 @@ struct lnet_ni *
 		list_for_each_safe(t, t2, &current_nets) {
 			tb = list_entry(t, struct lnet_text_buf, ltb_list);
 
-			list_del(&tb->ltb_list);
-			list_add_tail(&tb->ltb_list, &matched_nets);
+			list_move_tail(&tb->ltb_list, &matched_nets);
 
 			len += snprintf(networks + len, sizeof(networks) - len,
 					"%s%s", !len ? "" : ",",
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 6a2833c..da73009 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -195,8 +195,7 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 		if (!tp->tp_threshold ||    /* needs culling anyway */
 		    nid == LNET_NID_ANY ||       /* removing all entries */
 		    tp->tp_nid == nid) {	  /* matched this one */
-			list_del(&tp->tp_list);
-			list_add(&tp->tp_list, &cull);
+			list_move(&tp->tp_list, &cull);
 		}
 	}
 
@@ -236,8 +235,7 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 				 * since we may be at interrupt priority on
 				 * incoming messages.
 				 */
-				list_del(&tp->tp_list);
-				list_add(&tp->tp_list, &cull);
+				list_move(&tp->tp_list, &cull);
 			}
 			continue;
 		}
@@ -251,8 +249,7 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 				if (outgoing &&
 				    !tp->tp_threshold) {
 					/* see above */
-					list_del(&tp->tp_list);
-					list_add(&tp->tp_list, &cull);
+					list_move(&tp->tp_list, &cull);
 				}
 			}
 			break;
diff --git a/net/lnet/selftest/console.c b/net/lnet/selftest/console.c
index abc342c..9f32c1f 100644
--- a/net/lnet/selftest/console.c
+++ b/net/lnet/selftest/console.c
@@ -316,12 +316,10 @@ static void lstcon_group_ndlink_release(struct lstcon_group *,
 	unsigned int idx = LNET_NIDADDR(ndl->ndl_node->nd_id.nid) %
 					LST_NODE_HASHSIZE;
 
-	list_del(&ndl->ndl_hlink);
-	list_del(&ndl->ndl_link);
 	old->grp_nnode--;
 
-	list_add_tail(&ndl->ndl_hlink, &new->grp_ndl_hash[idx]);
-	list_add_tail(&ndl->ndl_link, &new->grp_ndl_list);
+	list_move_tail(&ndl->ndl_hlink, &new->grp_ndl_hash[idx]);
+	list_move_tail(&ndl->ndl_link, &new->grp_ndl_list);
 	new->grp_nnode++;
 }
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 563/622] lnet: libcfs: provide an scnprintf and start using it
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (561 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 562/622] lnet: use list_move where appropriate James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 564/622] lustre: llite: fetch default layout for a directory James Simmons
                   ` (59 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Shaun Tancheff <stancheff@cray.com>

snprintf() returns the number of chars that would be needed to hold
the complete result, which may be larger than the buffer size.

scnprintf() differs in that its return value is the number of chars
actually written (not including the terminating null).

Correct the few patterns where the return from snprintf() is used and
expected not to exceed the passed buffer size.

Cray-bug-id: LUS-7999
WC-bug-id: https://jira.whamcloud.com/browse/LU-12861
Lustre-commit: 998a494fa9a4 ("LU-12861 libcfs: provide an scnprintf and start using it")
Signed-off-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-on: https://review.whamcloud.com/36453
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Petros Koutoupis <pkoutoupis@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/lproc_osc.c   |   6 +--
 net/lnet/lnet/config.c      |   6 +--
 net/lnet/lnet/router_proc.c | 128 ++++++++++++++++++++++----------------------
 3 files changed, 70 insertions(+), 70 deletions(-)

diff --git a/fs/lustre/osc/lproc_osc.c b/fs/lustre/osc/lproc_osc.c
index 2bc7047..d545d1b 100644
--- a/fs/lustre/osc/lproc_osc.c
+++ b/fs/lustre/osc/lproc_osc.c
@@ -703,9 +703,9 @@ static ssize_t grant_shrink_show(struct kobject *kobj, struct attribute *attr,
 		return len;
 
 	imp = obd->u.cli.cl_import;
-	len = snprintf(buf, PAGE_SIZE, "%d\n",
-		       !imp->imp_grant_shrink_disabled &&
-		       OCD_HAS_FLAG(&imp->imp_connect_data, GRANT_SHRINK));
+	len = scnprintf(buf, PAGE_SIZE, "%d\n",
+			!imp->imp_grant_shrink_disabled &&
+			OCD_HAS_FLAG(&imp->imp_connect_data, GRANT_SHRINK));
 	up_read(&obd->u.cli.cl_sem);
 
 	return len;
diff --git a/net/lnet/lnet/config.c b/net/lnet/lnet/config.c
index 8994882..f50df88 100644
--- a/net/lnet/lnet/config.c
+++ b/net/lnet/lnet/config.c
@@ -1535,9 +1535,9 @@ struct lnet_ni *
 
 			list_move_tail(&tb->ltb_list, &matched_nets);
 
-			len += snprintf(networks + len, sizeof(networks) - len,
-					"%s%s", !len ? "" : ",",
-					tb->ltb_text);
+			len += scnprintf(networks + len, sizeof(networks) - len,
+					 "%s%s", !len ? "" : ",",
+					 tb->ltb_text);
 
 			if (len >= sizeof(networks)) {
 				CERROR("Too many matched networks\n");
diff --git a/net/lnet/lnet/router_proc.c b/net/lnet/lnet/router_proc.c
index 2e9342c..180bbde 100644
--- a/net/lnet/lnet/router_proc.c
+++ b/net/lnet/lnet/router_proc.c
@@ -105,16 +105,16 @@ static int proc_lnet_stats(struct ctl_table *table, int write,
 	lnet_counters_get(ctrs);
 	common = ctrs->lct_common;
 
-	len = snprintf(tmpstr, tmpsiz,
-		       "%u %u %u %u %u %u %u %llu %llu %llu %llu",
-		       common.lcc_msgs_alloc, common.lcc_msgs_max,
-		       common.lcc_errors,
-		       common.lcc_send_count, common.lcc_recv_count,
-		       common.lcc_route_count, common.lcc_drop_count,
-		       common.lcc_send_length, common.lcc_recv_length,
-		       common.lcc_route_length, common.lcc_drop_length);
-
-	if (pos >= min_t(int, len, strlen(tmpstr)))
+	len = scnprintf(tmpstr, tmpsiz,
+			"%u %u %u %u %u %u %u %llu %llu %llu %llu",
+			common.lcc_msgs_alloc, common.lcc_msgs_max,
+			common.lcc_errors,
+			common.lcc_send_count, common.lcc_recv_count,
+			common.lcc_route_count, common.lcc_drop_count,
+			common.lcc_send_length, common.lcc_recv_length,
+			common.lcc_route_length, common.lcc_drop_length);
+
+	if (pos >= len)
 		rc = 0;
 	else
 		rc = cfs_trace_copyout_string(buffer, nob,
@@ -153,12 +153,12 @@ static int proc_lnet_routes(struct ctl_table *table, int write,
 	s = tmpstr; /* points to current position in tmpstr[] */
 
 	if (!*ppos) {
-		s += snprintf(s, tmpstr + tmpsiz - s, "Routing %s\n",
-			      the_lnet.ln_routing ? "enabled" : "disabled");
+		s += scnprintf(s, tmpstr + tmpsiz - s, "Routing %s\n",
+			       the_lnet.ln_routing ? "enabled" : "disabled");
 		LASSERT(tmpstr + tmpsiz - s > 0);
 
-		s += snprintf(s, tmpstr + tmpsiz - s, "%-8s %4s %8s %7s %s\n",
-			      "net", "hops", "priority", "state", "router");
+		s += scnprintf(s, tmpstr + tmpsiz - s, "%-8s %4s %8s %7s %s\n",
+			       "net", "hops", "priority", "state", "router");
 		LASSERT(tmpstr + tmpsiz - s > 0);
 
 		lnet_net_lock(0);
@@ -217,12 +217,12 @@ static int proc_lnet_routes(struct ctl_table *table, int write,
 			unsigned int priority = route->lr_priority;
 			int alive = lnet_is_route_alive(route);
 
-			s += snprintf(s, tmpstr + tmpsiz - s,
-				      "%-8s %4d %8u %7s %s\n",
-				      libcfs_net2str(net), hops,
-				      priority,
-				      alive ? "up" : "down",
-				      libcfs_nid2str(route->lr_nid));
+			s += scnprintf(s, tmpstr + tmpsiz - s,
+				       "%-8s %4d %8u %7s %s\n",
+				       libcfs_net2str(net), hops,
+				       priority,
+				       alive ? "up" : "down",
+				       libcfs_nid2str(route->lr_nid));
 			LASSERT(tmpstr + tmpsiz - s > 0);
 		}
 
@@ -276,9 +276,9 @@ static int proc_lnet_routers(struct ctl_table *table, int write,
 	s = tmpstr; /* points to current position in tmpstr[] */
 
 	if (!*ppos) {
-		s += snprintf(s, tmpstr + tmpsiz - s,
-			      "%-4s %7s %5s %s\n",
-			      "ref", "rtr_ref", "alive", "router");
+		s += scnprintf(s, tmpstr + tmpsiz - s,
+			       "%-4s %7s %5s %s\n",
+			       "ref", "rtr_ref", "alive", "router");
 		LASSERT(tmpstr + tmpsiz - s > 0);
 
 		lnet_net_lock(0);
@@ -320,11 +320,11 @@ static int proc_lnet_routers(struct ctl_table *table, int write,
 			int nrtrrefs = peer->lp_rtr_refcount;
 			int alive = lnet_is_gateway_alive(peer);
 
-			s += snprintf(s, tmpstr + tmpsiz - s,
-				      "%-4d %7d %5s %s\n",
-				      nrefs, nrtrrefs,
-				      alive ? "up" : "down",
-				      libcfs_nid2str(nid));
+			s += scnprintf(s, tmpstr + tmpsiz - s,
+				       "%-4d %7d %5s %s\n",
+				       nrefs, nrtrrefs,
+				       alive ? "up" : "down",
+				       libcfs_nid2str(nid));
 		}
 
 		lnet_net_unlock(0);
@@ -411,10 +411,10 @@ static int proc_lnet_peers(struct ctl_table *table, int write,
 	s = tmpstr; /* points to current position in tmpstr[] */
 
 	if (!*ppos) {
-		s += snprintf(s, tmpstr + tmpsiz - s,
-			      "%-24s %4s %5s %5s %5s %5s %5s %5s %5s %s\n",
-			      "nid", "refs", "state", "last", "max",
-			      "rtr", "min", "tx", "min", "queue");
+		s += scnprintf(s, tmpstr + tmpsiz - s,
+			       "%-24s %4s %5s %5s %5s %5s %5s %5s %5s %s\n",
+			       "nid", "refs", "state", "last", "max",
+			       "rtr", "min", "tx", "min", "queue");
 		LASSERT(tmpstr + tmpsiz - s > 0);
 
 		hoff++;
@@ -498,11 +498,11 @@ static int proc_lnet_peers(struct ctl_table *table, int write,
 
 			lnet_net_unlock(cpt);
 
-			s += snprintf(s, tmpstr + tmpsiz - s,
-				      "%-24s %4d %5s %5lld %5d %5d %5d %5d %5d %d\n",
-				      libcfs_nid2str(nid), nrefs, aliveness,
-				      lastalive, maxcr, rtrcr, minrtrcr, txcr,
-				      mintxcr, txqnob);
+			s += scnprintf(s, tmpstr + tmpsiz - s,
+				       "%-24s %4d %5s %5lld %5d %5d %5d %5d %5d %d\n",
+				       libcfs_nid2str(nid), nrefs, aliveness,
+				       lastalive, maxcr, rtrcr, minrtrcr, txcr,
+				       mintxcr, txqnob);
 			LASSERT(tmpstr + tmpsiz - s > 0);
 
 		} else { /* peer is NULL */
@@ -560,9 +560,9 @@ static int proc_lnet_buffers(struct ctl_table *table, int write,
 
 	s = tmpstr; /* points to current position in tmpstr[] */
 
-	s += snprintf(s, tmpstr + tmpsiz - s,
-		      "%5s %5s %7s %7s\n",
-		      "pages", "count", "credits", "min");
+	s += scnprintf(s, tmpstr + tmpsiz - s,
+		       "%5s %5s %7s %7s\n",
+		       "pages", "count", "credits", "min");
 	LASSERT(tmpstr + tmpsiz - s > 0);
 
 	if (!the_lnet.ln_rtrpools)
@@ -573,12 +573,12 @@ static int proc_lnet_buffers(struct ctl_table *table, int write,
 
 		lnet_net_lock(LNET_LOCK_EX);
 		cfs_percpt_for_each(rbp, i, the_lnet.ln_rtrpools) {
-			s += snprintf(s, tmpstr + tmpsiz - s,
-				      "%5d %5d %7d %7d\n",
-				      rbp[idx].rbp_npages,
-				      rbp[idx].rbp_nbuffers,
-				      rbp[idx].rbp_credits,
-				      rbp[idx].rbp_mincredits);
+			s += scnprintf(s, tmpstr + tmpsiz - s,
+				       "%5d %5d %7d %7d\n",
+				       rbp[idx].rbp_npages,
+				       rbp[idx].rbp_nbuffers,
+				       rbp[idx].rbp_credits,
+				       rbp[idx].rbp_mincredits);
 			LASSERT(tmpstr + tmpsiz - s > 0);
 		}
 		lnet_net_unlock(LNET_LOCK_EX);
@@ -652,10 +652,10 @@ static int proc_lnet_nis(struct ctl_table *table, int write,
 	s = tmpstr; /* points to current position in tmpstr[] */
 
 	if (!*ppos) {
-		s += snprintf(s, tmpstr + tmpsiz - s,
-			      "%-24s %6s %5s %4s %4s %4s %5s %5s %5s\n",
-			      "nid", "status", "alive", "refs", "peer",
-			      "rtr", "max", "tx", "min");
+		s += scnprintf(s, tmpstr + tmpsiz - s,
+			       "%-24s %6s %5s %4s %4s %4s %5s %5s %5s\n",
+			       "nid", "status", "alive", "refs", "peer",
+			       "rtr", "max", "tx", "min");
 		LASSERT(tmpstr + tmpsiz - s > 0);
 	} else {
 		struct lnet_ni *ni = NULL;
@@ -705,15 +705,15 @@ static int proc_lnet_nis(struct ctl_table *table, int write,
 				if (i)
 					lnet_net_lock(i);
 
-				s += snprintf(s, tmpstr + tmpsiz - s,
-					      "%-24s %6s %5lld %4d %4d %4d %5d %5d %5d\n",
-					      libcfs_nid2str(ni->ni_nid), stat,
-					      last_alive, *ni->ni_refs[i],
-					      ni->ni_net->net_tunables.lct_peer_tx_credits,
-					      ni->ni_net->net_tunables.lct_peer_rtr_credits,
-					      tq->tq_credits_max,
-					      tq->tq_credits,
-					      tq->tq_credits_min);
+				s += scnprintf(s, tmpstr + tmpsiz - s,
+					       "%-24s %6s %5lld %4d %4d %4d %5d %5d %5d\n",
+					       libcfs_nid2str(ni->ni_nid), stat,
+					       last_alive, *ni->ni_refs[i],
+					       ni->ni_net->net_tunables.lct_peer_tx_credits,
+					       ni->ni_net->net_tunables.lct_peer_rtr_credits,
+					       tq->tq_credits_max,
+					       tq->tq_credits,
+					       tq->tq_credits_min);
 				if (i)
 					lnet_net_unlock(i);
 			}
@@ -803,11 +803,11 @@ static int proc_lnet_portal_rotor(struct ctl_table *table, int write,
 		LASSERT(portal_rotors[i].pr_value == portal_rotor);
 		lnet_res_unlock(0);
 
-		rc = snprintf(buf, buf_len,
-			      "{\n\tportals: all\n"
-			      "\trotor: %s\n\tdescription: %s\n}",
-			      portal_rotors[i].pr_name,
-			      portal_rotors[i].pr_desc);
+		rc = scnprintf(buf, buf_len,
+			       "{\n\tportals: all\n"
+			       "\trotor: %s\n\tdescription: %s\n}",
+			       portal_rotors[i].pr_name,
+			       portal_rotors[i].pr_desc);
 
 		if (pos >= min_t(int, rc, buf_len)) {
 			rc = 0;
-- 
1.8.3.1


* [lustre-devel] [PATCH 564/622] lustre: llite: fetch default layout for a directory
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (562 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 563/622] lnet: libcfs: provide an scnprintf and start using it James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 565/622] lnet: fix rspt counter James Simmons
                   ` (58 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Jian Yu <yujian@whamcloud.com>

For a directory that does not have trusted.lov xattr, the current
"lfs getstripe" will only print the stripe_count, stripe_size,
and stripe_index that are fetched from the /sys/fs/lustre/lov values.
It doesn't show the actual default layout that will be used when
new files will be created in that directory.

This patch fixes the above issue in ll_dir_getstripe_default() by
fetching the layout from root FID after ll_dir_get_default_layout()
returns -ENODATA from a directory that does not have trusted.lov xattr.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11656
Lustre-commit: 3e8fa8a7396c ("LU-11656 llite: fetch default layout for a directory")
Signed-off-by: Jian Yu <yujian@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/36609
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/dir.c                  | 102 ++++++++++++++++++++++++++++-----
 fs/lustre/llite/llite_internal.h       |   9 ++-
 fs/lustre/llite/xattr.c                |   7 ++-
 include/uapi/linux/lustre/lustre_fid.h |   7 +++
 4 files changed, 107 insertions(+), 18 deletions(-)

diff --git a/fs/lustre/llite/dir.c b/fs/lustre/llite/dir.c
index c38862e..b1ec905 100644
--- a/fs/lustre/llite/dir.c
+++ b/fs/lustre/llite/dir.c
@@ -635,16 +635,10 @@ int ll_dir_setstripe(struct inode *inode, struct lov_user_md *lump,
 	return rc;
 }
 
-/**
- * This function will be used to get default LOV/LMV/Default LMV
- * @valid will be used to indicate which stripe it will retrieve
- *	OBD_MD_MEA		LMV stripe EA
- *	OBD_MD_DEFAULT_MEA	Default LMV stripe EA
- *	otherwise		Default LOV EA.
- * Each time, it can only retrieve 1 stripe EA
- **/
-int ll_dir_getstripe(struct inode *inode, void **plmm, int *plmm_size,
-		     struct ptlrpc_request **request, u64 valid)
+static int ll_dir_get_default_layout(struct inode *inode, void **plmm,
+				     int *plmm_size,
+				     struct ptlrpc_request **request, u64 valid,
+				     enum get_default_layout_type type)
 {
 	struct ll_sb_info *sbi = ll_i2sbi(inode);
 	struct mdt_body *body;
@@ -652,6 +646,7 @@ int ll_dir_getstripe(struct inode *inode, void **plmm, int *plmm_size,
 	struct ptlrpc_request *req = NULL;
 	int rc, lmmsize;
 	struct md_op_data *op_data;
+	struct lu_fid fid;
 
 	rc = ll_get_max_mdsize(sbi, &lmmsize);
 	if (rc)
@@ -664,11 +659,19 @@ int ll_dir_getstripe(struct inode *inode, void **plmm, int *plmm_size,
 		return PTR_ERR(op_data);
 
 	op_data->op_valid = valid | OBD_MD_FLEASIZE | OBD_MD_FLDIREA;
+
+	if (type == GET_DEFAULT_LAYOUT_ROOT) {
+		lu_root_fid(&op_data->op_fid1);
+		fid = op_data->op_fid1;
+	} else {
+		fid = *ll_inode2fid(inode);
+	}
+
 	rc = md_getattr(sbi->ll_md_exp, op_data, &req);
 	ll_finish_md_op_data(op_data);
 	if (rc < 0) {
 		CDEBUG(D_INFO, "md_getattr failed on inode " DFID ": rc %d\n",
-		       PFID(ll_inode2fid(inode)), rc);
+		       PFID(&fid), rc);
 		goto out;
 	}
 
@@ -730,6 +733,70 @@ int ll_dir_getstripe(struct inode *inode, void **plmm, int *plmm_size,
 	return rc;
 }
 
+/**
+ * This function will be used to get default LOV/LMV/Default LMV
+ * @valid will be used to indicate which stripe it will retrieve.
+ * If the directory does not have its own default layout, then the
+ * function will request the default layout from root FID.
+ *	OBD_MD_MEA		LMV stripe EA
+ *	OBD_MD_DEFAULT_MEA	Default LMV stripe EA
+ *	otherwise		Default LOV EA.
+ * Each time, it can only retrieve 1 stripe EA
+ */
+int ll_dir_getstripe_default(struct inode *inode, void **plmm, int *plmm_size,
+			     struct ptlrpc_request **request,
+			     struct ptlrpc_request **root_request,
+			     u64 valid)
+{
+	struct ptlrpc_request *req = NULL;
+	struct ptlrpc_request *root_req = NULL;
+	struct lov_mds_md *lmm = NULL;
+	int lmm_size = 0;
+	int rc = 0;
+
+	rc = ll_dir_get_default_layout(inode, (void **)&lmm, &lmm_size,
+				       &req, valid, 0);
+	if (rc == -ENODATA && !fid_is_root(ll_inode2fid(inode)) &&
+	    !(valid & (OBD_MD_MEA|OBD_MD_DEFAULT_MEA)) && root_request)
+		rc = ll_dir_get_default_layout(inode, (void **)&lmm, &lmm_size,
+					       &root_req, valid,
+					       GET_DEFAULT_LAYOUT_ROOT);
+
+	*plmm = lmm;
+	*plmm_size = lmm_size;
+	*request = req;
+	if (root_request)
+		*root_request = root_req;
+
+	return rc;
+}
+
+/**
+ * This function will be used to get default LOV/LMV/Default LMV
+ * @valid will be used to indicate which stripe it will retrieve
+ *	OBD_MD_MEA		LMV stripe EA
+ *	OBD_MD_DEFAULT_MEA	Default LMV stripe EA
+ *	otherwise		Default LOV EA.
+ * Each time, it can only retrieve 1 stripe EA
+ */
+int ll_dir_getstripe(struct inode *inode, void **plmm, int *plmm_size,
+		     struct ptlrpc_request **request, u64 valid)
+{
+	struct ptlrpc_request *req = NULL;
+	struct lov_mds_md *lmm = NULL;
+	int lmm_size = 0;
+	int rc = 0;
+
+	rc = ll_dir_get_default_layout(inode, (void **)&lmm, &lmm_size,
+				       &req, valid, 0);
+
+	*plmm = lmm;
+	*plmm_size = lmm_size;
+	*request = req;
+
+	return rc;
+}
+
 int ll_get_mdt_idx_by_fid(struct ll_sb_info *sbi, const struct lu_fid *fid)
 {
 	struct md_op_data *op_data;
@@ -1465,6 +1532,7 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		struct lmv_user_md __user *ulmv;
 		struct lmv_user_md lum;
 		struct ptlrpc_request *request = NULL;
+		struct ptlrpc_request *root_request = NULL;
 		struct lmv_user_md *tmp = NULL;
 		union lmv_mds_md *lmm = NULL;
 		u64 valid = 0;
@@ -1493,8 +1561,8 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		else
 			return -EINVAL;
 
-		rc = ll_dir_getstripe(inode, (void **)&lmm, &lmmsize, &request,
-				      valid);
+		rc = ll_dir_getstripe_default(inode, (void **)&lmm, &lmmsize,
+					      &request, &root_request, valid);
 		if (rc)
 			goto finish_req;
 
@@ -1595,6 +1663,7 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		kfree(tmp);
 finish_req:
 		ptlrpc_req_finished(request);
+		ptlrpc_req_finished(root_request);
 		return rc;
 	}
 	case LL_IOC_RMFID:
@@ -1611,6 +1680,7 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	case IOC_MDC_GETFILEINFO_OLD:
 	case IOC_MDC_GETFILESTRIPE: {
 		struct ptlrpc_request *request = NULL;
+		struct ptlrpc_request *root_request = NULL;
 		struct lov_user_md __user *lump;
 		struct lov_mds_md *lmm = NULL;
 		struct mdt_body *body;
@@ -1632,8 +1702,9 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 			rc = ll_lov_getstripe_ea_info(inode, filename, &lmm,
 						      &lmmsize, &request);
 		} else {
-			rc = ll_dir_getstripe(inode, (void **)&lmm, &lmmsize,
-					      &request, 0);
+			rc = ll_dir_getstripe_default(inode, (void **)&lmm,
+						      &lmmsize, &request,
+						      &root_request, 0);
 		}
 
 		if (request) {
@@ -1786,6 +1857,7 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 
 out_req:
 		ptlrpc_req_finished(request);
+		ptlrpc_req_finished(root_request);
 		if (filename)
 			ll_putname(filename);
 		return rc;
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index fe9d568..def4df0 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -841,6 +841,10 @@ struct page *ll_get_dir_page(struct inode *dir, struct md_op_data *op_data,
 			     u64 offset);
 void ll_release_page(struct inode *inode, struct page *page, bool remove);
 
+enum get_default_layout_type {
+	GET_DEFAULT_LAYOUT_ROOT = 1,
+};
+
 /* llite/namei.c */
 extern const struct inode_operations ll_special_inode_operations;
 
@@ -911,7 +915,10 @@ int ll_lov_getstripe_ea_info(struct inode *inode, const char *filename,
 			     struct ptlrpc_request **request);
 int ll_dir_setstripe(struct inode *inode, struct lov_user_md *lump,
 		     int set_default);
-int ll_dir_getstripe(struct inode *inode, void **lmmp, int *lmm_size,
+int ll_dir_getstripe_default(struct inode *inode, void **lmmp,
+			     int *lmm_size, struct ptlrpc_request **request,
+			     struct ptlrpc_request **root_request, u64 valid);
+int ll_dir_getstripe(struct inode *inode, void **plmm, int *plmm_size,
 		     struct ptlrpc_request **request, u64 valid);
 int ll_fsync(struct file *file, loff_t start, loff_t end, int data);
 int ll_merge_attr(const struct lu_env *env, struct inode *inode);
diff --git a/fs/lustre/llite/xattr.c b/fs/lustre/llite/xattr.c
index 7134f10..e76d2c3 100644
--- a/fs/lustre/llite/xattr.c
+++ b/fs/lustre/llite/xattr.c
@@ -522,11 +522,12 @@ static ssize_t ll_getxattr_lov(struct inode *inode, void *buf, size_t buf_size)
 		return rc;
 	} else if (S_ISDIR(inode->i_mode)) {
 		struct ptlrpc_request *req = NULL;
+		struct ptlrpc_request *root_req = NULL;
 		struct lov_mds_md *lmm = NULL;
 		int lmm_size = 0;
 
-		rc = ll_dir_getstripe(inode, (void **)&lmm, &lmm_size,
-				      &req, 0);
+		rc = ll_dir_getstripe_default(inode, (void **)&lmm, &lmm_size,
+					      &req, &root_req, 0);
 		if (rc < 0)
 			goto out_req;
 
@@ -545,6 +546,8 @@ static ssize_t ll_getxattr_lov(struct inode *inode, void *buf, size_t buf_size)
 out_req:
 		if (req)
 			ptlrpc_req_finished(req);
+		if (root_req)
+			ptlrpc_req_finished(root_req);
 
 		return rc;
 	} else {
diff --git a/include/uapi/linux/lustre/lustre_fid.h b/include/uapi/linux/lustre/lustre_fid.h
index 79574c0..d6e59cc 100644
--- a/include/uapi/linux/lustre/lustre_fid.h
+++ b/include/uapi/linux/lustre/lustre_fid.h
@@ -135,6 +135,13 @@ static inline bool fid_is_mdt0(const struct lu_fid *fid)
 	return fid_seq_is_mdt0(fid_seq(fid));
 }
 
+static inline void lu_root_fid(struct lu_fid *fid)
+{
+	fid->f_seq = FID_SEQ_ROOT;
+	fid->f_oid = FID_OID_ROOT;
+	fid->f_ver = 0;
+}
+
 /**
  * Check if a fid is igif or not.
  *
-- 
1.8.3.1


* [lustre-devel] [PATCH 565/622] lnet: fix rspt counter
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (563 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 564/622] lustre: llite: fetch default layout for a directory James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 566/622] lustre: ldlm: add a counter to the per-namespace data James Simmons
                   ` (57 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Alexey Lyashkov <c17817@cray.com>

rspt entries must be freed via the lnet_rspt_free() function to avoid
a counter leak. Also handle a NULL allocation properly.

Cray-bug-id: LUS-8189
WC-bug-id: https://jira.whamcloud.com/browse/LU-12991
Lustre-commit: 027a4722b26d ("LU-12991 lnet: fix rspt counter")
Signed-off-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-on: https://review.whamcloud.com/36895
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h | 8 +++++---
 net/lnet/lnet/lib-move.c      | 6 +++---
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index 56556fd..3b597e3 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -438,9 +438,11 @@ void lnet_res_lh_initialize(struct lnet_res_container *rec,
 	struct lnet_rsp_tracker *rspt;
 
 	rspt = kzalloc(sizeof(*rspt), GFP_NOFS);
-	lnet_net_lock(cpt);
-	the_lnet.ln_counters[cpt]->lct_health.lch_rst_alloc++;
-	lnet_net_unlock(cpt);
+	if (rspt) {
+		lnet_net_lock(cpt);
+		the_lnet.ln_counters[cpt]->lct_health.lch_rst_alloc++;
+		lnet_net_unlock(cpt);
+	}
 	return rspt;
 }
 
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index da73009..73f9d20 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -4390,7 +4390,7 @@ void lnet_monitor_thr_stop(void)
 		/* we already have an rspt attached to the md, so we'll
 		 * update the deadline on that one.
 		 */
-		kfree(rspt);
+		lnet_rspt_free(rspt, cpt);
 		new_entry = false;
 	} else {
 		/* new md */
@@ -4511,7 +4511,7 @@ void lnet_monitor_thr_stop(void)
 			       md->md_me->me_portal);
 		lnet_res_unlock(cpt);
 
-		kfree(rspt);
+		lnet_rspt_free(rspt, cpt);
 		kfree(msg);
 		return -ENOENT;
 	}
@@ -4745,7 +4745,7 @@ struct lnet_msg *
 		lnet_res_unlock(cpt);
 
 		kfree(msg);
-		kfree(rspt);
+		lnet_rspt_free(rspt, cpt);
 		return -ENOENT;
 	}
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 566/622] lustre: ldlm: add a counter to the per-namespace data
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (564 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 565/622] lnet: fix rspt counter James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 567/622] lnet: Add peer level aliveness information James Simmons
                   ` (56 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: NeilBrown <neilb@suse.com>

When we change the resource hash to rhashtable we won't have
a per-bucket counter.  We could use the nelems global counter,
but ldlm_resource goes to some trouble to avoid having any
table-wide atomics, and hopefully rhashtable will grow the
ability to disable the global counter in the near future.
Having a counter we control makes it easier to manage the
back-reference to the namespace when there is anything in the
hash table.

So add a counter to the ldlm_ns_bucket.
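
The back-reference scheme described above can be sketched as follows; the types and helpers are simplified stand-ins for the ldlm structures, assuming only that the first entry in a bucket pins the namespace and removing the last entry releases it:

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative stand-ins, not the ldlm API. */
struct namespace_ref {
	atomic_long refs;
};

struct ns_bucket {
	struct namespace_ref *nsb_namespace;
	atomic_int nsb_count;
};

static void bucket_res_add(struct ns_bucket *nsb)
{
	/* 0 -> 1 transition: take the namespace back-reference
	 * (atomic_fetch_add returns the previous value, so 0 here
	 * corresponds to atomic_inc_return() == 1 in the patch). */
	if (atomic_fetch_add(&nsb->nsb_count, 1) == 0)
		atomic_fetch_add(&nsb->nsb_namespace->refs, 1);
}

static void bucket_res_del(struct ns_bucket *nsb)
{
	/* 1 -> 0 transition: drop the namespace back-reference,
	 * mirroring atomic_dec_and_test() in the patch. */
	if (atomic_fetch_sub(&nsb->nsb_count, 1) == 1)
		atomic_fetch_sub(&nsb->nsb_namespace->refs, 1);
}
```

Because the counter is per bucket rather than table-wide, concurrent inserts into different buckets never contend on one atomic.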

WC-bug-id: https://jira.whamcloud.com/browse/LU-8130
Lustre-commit: f9314d6e9259e6c7 ("LU-8130 ldlm: add a counter to the per-namespace data")
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/36219
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_dlm.h |  2 ++
 fs/lustre/ldlm/ldlm_resource.c | 10 +++++-----
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/lustre/include/lustre_dlm.h b/fs/lustre/include/lustre_dlm.h
index cc4b8b0..9ca79f4 100644
--- a/fs/lustre/include/lustre_dlm.h
+++ b/fs/lustre/include/lustre_dlm.h
@@ -306,6 +306,8 @@ struct ldlm_ns_bucket {
 	 * fact the network or overall system load is at fault
 	 */
 	struct adaptive_timeout     nsb_at_estimate;
+	/* counter of entries in this bucket */
+	atomic_t		nsb_count;
 };
 
 enum {
diff --git a/fs/lustre/ldlm/ldlm_resource.c b/fs/lustre/ldlm/ldlm_resource.c
index 65ff32c..d009d5d 100644
--- a/fs/lustre/ldlm/ldlm_resource.c
+++ b/fs/lustre/ldlm/ldlm_resource.c
@@ -133,12 +133,11 @@ static ssize_t resource_count_show(struct kobject *kobj, struct attribute *attr,
 	struct ldlm_namespace *ns = container_of(kobj, struct ldlm_namespace,
 						 ns_kobj);
 	u64 res = 0;
-	struct cfs_hash_bd bd;
 	int i;
 
 	/* result is not strictly consistent */
-	cfs_hash_for_each_bucket(ns->ns_rs_hash, &bd, i)
-		res += cfs_hash_bd_count_get(&bd);
+	for (i = 0; i < (1 << ns->ns_bucket_bits); i++)
+		res += atomic_read(&ns->ns_rs_buckets[i].nsb_count);
 	return sprintf(buf, "%lld\n", res);
 }
 LUSTRE_RO_ATTR(resource_count);
@@ -647,6 +646,7 @@ struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
 
 		at_init(&nsb->nsb_at_estimate, ldlm_enqueue_min, 0);
 		nsb->nsb_namespace = ns;
+		atomic_set(&nsb->nsb_count, 0);
 	}
 
 	ns->ns_obd = obd;
@@ -1126,7 +1126,7 @@ struct ldlm_resource *
 	}
 	/* We won! Let's add the resource. */
 	cfs_hash_bd_add_locked(ns->ns_rs_hash, &bd, &res->lr_hash);
-	if (cfs_hash_bd_count_get(&bd) == 1)
+	if (atomic_inc_return(&res->lr_ns_bucket->nsb_count) == 1)
 		ns_refcount = ldlm_namespace_get_return(ns);
 
 	cfs_hash_bd_unlock(ns->ns_rs_hash, &bd, 1);
@@ -1170,7 +1170,7 @@ static void __ldlm_resource_putref_final(struct cfs_hash_bd *bd,
 	cfs_hash_bd_unlock(ns->ns_rs_hash, bd, 1);
 	if (ns->ns_lvbo && ns->ns_lvbo->lvbo_free)
 		ns->ns_lvbo->lvbo_free(res);
-	if (cfs_hash_bd_count_get(bd) == 0)
+	if (atomic_dec_and_test(&nsb->nsb_count))
 		ldlm_namespace_put(ns);
 	if (res->lr_itree)
 		kmem_cache_free(ldlm_interval_tree_slab, res->lr_itree);
-- 
1.8.3.1

* [lustre-devel] [PATCH 567/622] lnet: Add peer level aliveness information
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (565 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 566/622] lustre: ldlm: add a counter to the per-namespace data James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 568/622] lnet: always check return of try_module_get() James Simmons
                   ` (55 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

Keep track of the aliveness of a peer so that we can optimize for
situations where an LNet router hasn't responded to a ping. In
this situation we consider all routes down, and we needn't spend time
inspecting each route, or inspecting all of the router's local and
remote interfaces in order to determine the router's aliveness.
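
The short-circuit this enables can be sketched as follows; the types are simplified stand-ins for the LNet peer structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins: a cached peer-level aliveness flag lets a
 * dead gateway skip the per-network interface scan entirely. */
struct peer_net {
	bool alive;
	struct peer_net *next;
};

struct peer {
	bool lp_alive;		/* cached peer aliveness */
	struct peer_net *nets;
};

static bool gateway_alive(const struct peer *gw)
{
	const struct peer_net *lpn;

	/* cached result: no need to walk the interfaces at all */
	if (!gw->lp_alive)
		return false;

	for (lpn = gw->nets; lpn; lpn = lpn->next)
		if (!lpn->alive)
			return false;
	return true;
}
```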

Cray-bug-id: LUS-7860
WC-bug-id: https://jira.whamcloud.com/browse/LU-12941
Lustre-commit: ebc9835a971f ("LU-12941 lnet: Add peer level aliveness information")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/36678
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-types.h |  3 +++
 net/lnet/lnet/peer.c           |  4 ++++
 net/lnet/lnet/router.c         | 52 ++++++++++++++++++++++++------------------
 3 files changed, 37 insertions(+), 22 deletions(-)

diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index e105308..02ac5df 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -672,6 +672,9 @@ struct lnet_peer {
 
 	/* tasks waiting on discovery of this peer */
 	wait_queue_head_t	lp_dc_waitq;
+
+	/* cached peer aliveness */
+	bool			lp_alive;
 };
 
 /*
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 4f0da4b..b168c97 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -216,6 +216,10 @@
 	init_waitqueue_head(&lp->lp_dc_waitq);
 	spin_lock_init(&lp->lp_lock);
 	lp->lp_primary_nid = nid;
+	if (lnet_peers_start_down())
+		lp->lp_alive = false;
+	else
+		lp->lp_alive = true;
 
 	/* all peers created on a router should have health on
 	 * if it's not already on.
diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index b8f7aba0..7ba406a 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -179,7 +179,9 @@ static int rtr_sensitivity_set(const char *val,
 	return check_routers_before_use;
 }
 
-/* A net is alive if at least one gateway NI on the network is alive. */
+/* The peer_net of a gateway is alive if at least one of the peer_ni's on
+ * that peer_net is alive.
+ */
 static bool
 lnet_is_gateway_net_alive(struct lnet_peer_net *lpn)
 {
@@ -200,6 +202,9 @@ bool lnet_is_gateway_alive(struct lnet_peer *gw)
 {
 	struct lnet_peer_net *lpn;
 
+	if (!gw->lp_alive)
+		return false;
+
 	list_for_each_entry(lpn, &gw->lp_peer_nets, lpn_peer_nets) {
 		if (!lnet_is_gateway_net_alive(lpn))
 			return false;
@@ -219,7 +224,10 @@ bool lnet_is_route_alive(struct lnet_route *route)
 	struct lnet_peer *gw = route->lr_gateway;
 	struct lnet_peer_net *llpn;
 	struct lnet_peer_net *rlpn;
-	bool route_alive;
+
+	/* If the gateway is down then all routes are considered down */
+	if (!gw->lp_alive)
+		return false;
 
 	/* if discovery is disabled then rely on the cached aliveness
 	 * information. This is handicapped information which we log when
@@ -230,36 +238,34 @@ bool lnet_is_route_alive(struct lnet_route *route)
 	if (lnet_is_discovery_disabled(gw))
 		return route->lr_alive;
 
-	/* check the gateway's interfaces on the route rnet to make sure
-	 * that the gateway is viable.
-	 */
+	/* check the gateway's interfaces on the local network */
 	llpn = lnet_peer_get_net_locked(gw, route->lr_lnet);
 	if (!llpn)
 		return false;
 
-	route_alive = lnet_is_gateway_net_alive(llpn);
+	if (!lnet_is_gateway_net_alive(llpn))
+		return false;
 
 	if (avoid_asym_router_failure) {
+		/* Check the gateway's interfaces on the remote network */
 		rlpn = lnet_peer_get_net_locked(gw, route->lr_net);
 		if (!rlpn)
 			return false;
-		route_alive = route_alive &&
-			      lnet_is_gateway_net_alive(rlpn);
+		if (!lnet_is_gateway_net_alive(rlpn))
+			return false;
 	}
 
-	if (!route_alive)
-		return route_alive;
-
 	spin_lock(&gw->lp_lock);
 	if (!(gw->lp_state & LNET_PEER_ROUTER_ENABLED)) {
+		spin_unlock(&gw->lp_lock);
 		if (gw->lp_rtr_refcount > 0)
 			CERROR("peer %s is being used as a gateway but routing feature is not turned on\n",
 			       libcfs_nid2str(gw->lp_primary_nid));
-		route_alive = false;
+		return false;
 	}
 	spin_unlock(&gw->lp_lock);
 
-	return route_alive;
+	return true;
 }
 
 void
@@ -409,21 +415,22 @@ bool lnet_is_route_alive(struct lnet_route *route)
 	spin_lock(&lp->lp_lock);
 	lp->lp_state &= ~LNET_PEER_RTR_DISCOVERY;
 	lp->lp_state |= LNET_PEER_RTR_DISCOVERED;
+	lp->lp_alive = lp->lp_dc_error == 0;
 	spin_unlock(&lp->lp_lock);
 
 	/* Router discovery successful? All peer information would've been
 	 * updated already. No need to do any more processing
 	 */
-	if (!lp->lp_dc_error)
+	if (lp->lp_alive)
 		return;
-	/* discovery failed? then we need to set the status of each lpni
-	 * to DOWN. It will be updated the next time we discover the
-	 * router. For router peer NIs not on local networks, we never send
-	 * messages directly to them, so their health will always remain
-	 * at maximum. We can only tell if they are up or down from the
-	 * status returned in the PING response. If we fail to get that
-	 * status in our scheduled router discovery, then we'll assume
-	 * it's down until we're told otherwise.
+
+	/* We do not send messages directly to the remote interfaces
+	 * of an LNet router. As such, we rely on the PING response
+	 * to determine the up/down status of these interfaces. If
+	 * a PING response is not received, or some other problem with
+	 * discovery occurs that prevents us from getting this status,
+	 * we assume all interfaces are down until we're able to
+	 * determine otherwise.
 	 */
 	CDEBUG(D_NET, "%s: Router discovery failed %d\n",
 	       libcfs_nid2str(lp->lp_primary_nid), lp->lp_dc_error);
@@ -1629,6 +1636,7 @@ bool lnet_router_checker_active(void)
 	lnet_peer_ni_decref_locked(lpni);
 	if (lpni && lpni->lpni_peer_net && lpni->lpni_peer_net->lpn_peer) {
 		lp = lpni->lpni_peer_net->lpn_peer;
+		lp->lp_alive = alive;
 		list_for_each_entry(route, &lp->lp_routes, lr_gwlist)
 			lnet_set_route_aliveness(route, alive);
 	}
-- 
1.8.3.1

* [lustre-devel] [PATCH 568/622] lnet: always check return of try_module_get()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (566 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 567/622] lnet: Add peer level aliveness information James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 569/622] lustre: obdclass: don't skip records for wrapped catalog James Simmons
                   ` (54 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

try_module_get() can fail, so the return value should be checked.
If we *know* that we already hold a reference, __module_get()
should be used instead.
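
The rule can be sketched with a simplified refcount; try_ref_get()/ref_get() are hypothetical stand-ins for try_module_get()/__module_get(), not the kernel's struct module:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative module-style refcount. */
struct mod_ref {
	long count;
	bool unloading;
};

static bool try_ref_get(struct mod_ref *m)
{
	/* mirrors try_module_get(): fails once teardown has started,
	 * so the caller must check the result */
	if (m->unloading)
		return false;
	m->count++;
	return true;
}

static void ref_get(struct mod_ref *m)
{
	/* mirrors __module_get(): only legal when the caller already
	 * holds a reference, hence it cannot fail */
	assert(m->count > 0);
	m->count++;
}
```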

WC-bug-id: https://jira.whamcloud.com/browse/LU-9679
Lustre-commit: a1282a0d8a53 ("LU-9679 lnet: always check return of try_module_get()")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/36854
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/o2iblnd/o2iblnd.c | 4 +++-
 net/lnet/klnds/socklnd/socklnd.c | 3 ++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.c b/net/lnet/klnds/o2iblnd/o2iblnd.c
index 37d8235..f6db2c7 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.c
@@ -2693,7 +2693,9 @@ static int kiblnd_base_startup(struct net *ns)
 
 	LASSERT(kiblnd_data.kib_init == IBLND_INIT_NOTHING);
 
-	try_module_get(THIS_MODULE);
+	if (!try_module_get(THIS_MODULE))
+		goto failed;
+
 	/* zero pointers, flags etc */
 	memset(&kiblnd_data, 0, sizeof(kiblnd_data));
 
diff --git a/net/lnet/klnds/socklnd/socklnd.c b/net/lnet/klnds/socklnd/socklnd.c
index 593c205..9a19a3f 100644
--- a/net/lnet/klnds/socklnd/socklnd.c
+++ b/net/lnet/klnds/socklnd/socklnd.c
@@ -2357,7 +2357,8 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 
 	/* flag lists/ptrs/locks initialised */
 	ksocknal_data.ksnd_init = SOCKNAL_INIT_DATA;
-	try_module_get(THIS_MODULE);
+	if (!try_module_get(THIS_MODULE))
+		goto failed;
 
 	/* Create a scheduler block per available CPT */
 	ksocknal_data.ksnd_schedulers = cfs_percpt_alloc(lnet_cpt_table(),
-- 
1.8.3.1

* [lustre-devel] [PATCH 569/622] lustre: obdclass: don't skip records for wrapped catalog
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (567 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 568/622] lnet: always check return of try_module_get() James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 570/622] lnet: Refactor lnet_find_best_lpni_on_net James Simmons
                   ` (53 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Alexander Boyko <c17825@cray.com>

osp_sync_thread() uses opd_sync_last_catalog_idx as the start point
for catalog processing. It is also used in llog_cat_process_cb() to
skip records from processing. When the catalog has wrapped,
processing starts from the second part of the catalog and then moves
to the first part, so the first part would be skipped in
llog_cat_process_cb() based on lpd_startcat.

osp_sync_thread() restarts the processing loop with
opd_sync_last_catalog_idx. For a wrapped catalog it increases the
last index, and llog_process_thread() increases it once more. This
leads to records in the catalog being skipped and never processed.
The patch fixes these issues.
It also adds sanity tests 135 and 136 as regression tests.

WC-bug-id: https://jira.whamcloud.com/browse/LU-13069
Lustre-commit: cc1092291932 ("LU-13069 obdclass: don't skip records for wrapped catalog")
Signed-off-by: Alexander Boyko <c17825@cray.com>
Cray-bug-id: LUS-8053,LUS-8236
Reviewed-on: https://review.whamcloud.com/36996
Reviewed-by: Andriy Skulysh <c17819@cray.com>
Reviewed-by: Alexander Zarochentsev <c17826@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_support.h | 2 ++
 fs/lustre/obdclass/llog.c       | 9 +++++++++
 fs/lustre/obdclass/llog_cat.c   | 1 +
 3 files changed, 12 insertions(+)

diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index 5969b6b..a26ac76 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -447,6 +447,8 @@
 /* was	OBD_FAIL_LLOG_CATINFO_NET			0x1309 until 2.3 */
 #define OBD_FAIL_MDS_SYNC_CAPA_SL			0x1310
 #define OBD_FAIL_SEQ_ALLOC				0x1311
+#define OBD_FAIL_PLAIN_RECORDS			    0x1319
+#define OBD_FAIL_CATALOG_FULL_CHECK		    0x131a
 
 #define OBD_FAIL_LLITE					0x1400
 #define OBD_FAIL_LLITE_FAULT_TRUNC_RACE			0x1401
diff --git a/fs/lustre/obdclass/llog.c b/fs/lustre/obdclass/llog.c
index 4e9fd17..620ebc6 100644
--- a/fs/lustre/obdclass/llog.c
+++ b/fs/lustre/obdclass/llog.c
@@ -453,6 +453,8 @@ int llog_process_or_fork(const struct lu_env *env,
 			 llog_cb_t cb, void *data, void *catdata, bool fork)
 {
 	struct llog_process_info *lpi;
+	struct llog_process_data *d = data;
+	struct llog_process_cat_data *cd = catdata;
 	int rc;
 
 	lpi = kzalloc(sizeof(*lpi), GFP_KERNEL);
@@ -463,6 +465,13 @@ int llog_process_or_fork(const struct lu_env *env,
 	lpi->lpi_cbdata = data;
 	lpi->lpi_catdata = catdata;
 
+	CDEBUG(D_OTHER,
+	       "Processing " DFID " flags 0x%03x startcat %d startidx %d first_idx %d last_idx %d\n",
+	       PFID(&loghandle->lgh_id.lgl_oi.oi_fid),
+	       loghandle->lgh_hdr->llh_flags, d ? d->lpd_startcat : -1,
+	       d ? d->lpd_startidx : -1, cd ? cd->lpcd_first_idx : -1,
+	       cd ? cd->lpcd_last_idx : -1);
+
 	if (fork) {
 		struct task_struct *task;
 
diff --git a/fs/lustre/obdclass/llog_cat.c b/fs/lustre/obdclass/llog_cat.c
index 30b0ac5..75226f4 100644
--- a/fs/lustre/obdclass/llog_cat.c
+++ b/fs/lustre/obdclass/llog_cat.c
@@ -244,6 +244,7 @@ static int llog_cat_process_or_fork(const struct lu_env *env,
 			 * catalog bottom.
 			 */
 			startcat = 0;
+			d.lpd_startcat = 0;
 			if (rc != 0)
 				return rc;
 		}
-- 
1.8.3.1

* [lustre-devel] [PATCH 570/622] lnet: Refactor lnet_find_best_lpni_on_net
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (568 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 569/622] lustre: obdclass: don't skip records for wrapped catalog James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 571/622] lnet: Avoid comparing route to itself James Simmons
                   ` (52 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

Replace the lnet_send_data argument with the local NI and destination
NID that the function actually needs, so callers such as the route
comparison code no longer have to construct a struct lnet_send_data
just to pass those two fields.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12756
Lustre-commit: 80edb2ad72ba ("LU-12756 lnet: Refactor lnet_find_best_lpni_on_net")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/36534
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-move.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 73f9d20..c8266f0 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -1247,8 +1247,8 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 
 /* Prerequisite: the best_ni should already be set in the sd */
 static inline struct lnet_peer_ni *
-lnet_find_best_lpni_on_net(struct lnet_send_data *sd, struct lnet_peer *peer,
-			   u32 net_id)
+lnet_find_best_lpni_on_net(struct lnet_ni *lni, lnet_nid_t dst_nid,
+			   struct lnet_peer *peer, u32 net_id)
 {
 	struct lnet_peer_net *peer_net;
 
@@ -1264,8 +1264,7 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 		return NULL;
 	}
 
-	return lnet_select_peer_ni(sd->sd_best_ni, sd->sd_dst_nid,
-				   peer, peer_net);
+	return lnet_select_peer_ni(lni, dst_nid, peer, peer_net);
 }
 
 static int
@@ -1278,13 +1277,12 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	struct lnet_peer *lp2 = r2->lr_gateway;
 	struct lnet_peer_ni *lpni1;
 	struct lnet_peer_ni *lpni2;
-	struct lnet_send_data sd;
 	int rc;
 
-	sd.sd_best_ni = NULL;
-	sd.sd_dst_nid = LNET_NID_ANY;
-	lpni1 = lnet_find_best_lpni_on_net(&sd, lp1, r1->lr_lnet);
-	lpni2 = lnet_find_best_lpni_on_net(&sd, lp2, r2->lr_lnet);
+	lpni1 = lnet_find_best_lpni_on_net(NULL, LNET_NID_ANY, lp1,
+					   r1->lr_lnet);
+	lpni2 = lnet_find_best_lpni_on_net(NULL, LNET_NID_ANY, lp2,
+					   r2->lr_lnet);
 	LASSERT(lpni1 && lpni2);
 
 	if (r1->lr_priority < r2->lr_priority) {
@@ -1878,7 +1876,9 @@ struct lnet_ni *
 			return -EHOSTUNREACH;
 		}
 
-		sd->sd_best_lpni = lnet_find_best_lpni_on_net(sd, lp,
+		sd->sd_best_lpni = lnet_find_best_lpni_on_net(sd->sd_best_ni,
+							      sd->sd_dst_nid,
+							      lp,
 							      best_lpn->lpn_net_id);
 		if (!sd->sd_best_lpni) {
 			CERROR("peer %s down\n",
@@ -2191,7 +2191,8 @@ struct lnet_ni *
 					lnet_msg_discovery(sd->sd_msg));
 	if (sd->sd_best_ni) {
 		sd->sd_best_lpni =
-		  lnet_find_best_lpni_on_net(sd, sd->sd_peer,
+		  lnet_find_best_lpni_on_net(sd->sd_best_ni, sd->sd_dst_nid,
+					     sd->sd_peer,
 					     sd->sd_best_ni->ni_net->net_id);
 
 		/* if we're successful in selecting a peer_ni on the local
-- 
1.8.3.1

* [lustre-devel] [PATCH 571/622] lnet: Avoid comparing route to itself
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (569 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 570/622] lnet: Refactor lnet_find_best_lpni_on_net James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 572/622] lustre: sysfs: use string helper like functions for sysfs James Simmons
                   ` (51 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

The first iteration of the route selection loop compares the first
route in the list with itself. Avoid this by initializing the
selection state (best route, last route, and gateway NI) from the
first candidate and continuing on to the next route.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12756
Lustre-commit: 2b8d9d12d182 ("LU-12756 lnet: Avoid comparing route to itself")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/36535
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-move.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index c8266f0..45975d6 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -1354,6 +1354,12 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 			best_route = route;
 			last_route = route;
 			lp_best = lp;
+			best_gw_ni = lnet_find_best_lpni_on_net(NULL,
+								LNET_NID_ANY,
+								route->lr_gateway,
+								route->lr_lnet);
+			LASSERT(best_gw_ni);
+			continue;
 		}
 
 		/* no protection on below fields, but it's harmless */
-- 
1.8.3.1

* [lustre-devel] [PATCH 572/622] lustre: sysfs: use string helper like functions for sysfs
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (570 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 571/622] lnet: Avoid comparing route to itself James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 573/622] lustre: rename ops to owner James Simmons
                   ` (50 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

For a very long time the Linux kernel has supported the function
memparse(), which allows passing in memory sizes with the suffix
set of K, M, G, T, P, E. Lustre adopted this approach
in its proc / sysfs implementation. The difference is that
Lustre expanded this functionality to allow sizes with a
fractional component, for example 1.5G. The code used to
parse the numerical value is heavily tied into the debugfs
seq_file handling and stomps on the passed-in buffer, which is
not allowed for sysfs files.

Similar functionality to what Lustre does today exists in newer
Linux kernels in the form of string helpers. Currently the
string helpers only convert a numerical value to human-readable
format. A new function, string_to_size(), was created that takes
a string and turns it into a numerical value. This enables the
use of string helper suffixes (e.g. MiB, KiB) with the Lustre
tunables, and base-10 suffixes (e.g. MB, kB) are now supported as
well. String helper suffixes are already used for debugfs files,
so I expect this to be adopted over time; using string_to_size()
should be encouraged for newer Lustre sysfs files.

At the same time we want to preserve the original behavior of
using the suffix set of K, M, G, T, P, E. To do this we create
the function sysfs_memparse(), which supports the new string helper
suffixes as well as the older set of suffixes. This new code is
also much simpler than the current code.
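
A userspace sketch of the kind of parsing sysfs_memparse()/string_to_size() perform (a whole part, up to three fractional digits, and a base-2 or base-10 suffix) might look like this; it is illustrative only and omits Lustre's overflow and default-unit handling:

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative fractional size parser, not the Lustre implementation. */
static int parse_size(const char *s, uint64_t *out)
{
	static const struct {
		const char *sfx;
		uint64_t mult;
	} units[] = {
		{ "KiB", 1024ULL }, { "MiB", 1048576ULL },
		{ "GiB", 1073741824ULL },
		{ "kB", 1000ULL }, { "MB", 1000000ULL },
		{ "GB", 1000000000ULL },
		{ "K", 1024ULL }, { "M", 1048576ULL },
		{ "G", 1073741824ULL },
		{ "B", 1ULL }, { "", 1ULL },
	};
	uint64_t frac = 0, scale = 1;
	char *end;
	uint64_t whole = strtoull(s, &end, 10);
	size_t i;

	if (end == s)
		return -EINVAL;
	if (*end == '.') {
		const char *p = ++end;

		/* three significant figures after the decimal point */
		while (isdigit((unsigned char)*end) && scale < 1000) {
			frac = frac * 10 + (uint64_t)(*end++ - '0');
			scale *= 10;
		}
		if (end == p)
			return -EINVAL;
	}
	for (i = 0; i < sizeof(units) / sizeof(units[0]); i++) {
		if (strcmp(end, units[i].sfx) == 0) {
			*out = whole * units[i].mult +
			       frac * units[i].mult / scale;
			return 0;
		}
	}
	return -EINVAL;	/* unrecognized unit */
}
```

For instance, "1.067kB" parses to 1067 bytes and "8.39MB" to 8390000, matching values exercised by the patch's own obd_init_checks() tests.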

WC-bug-id: https://jira.whamcloud.com/browse/LU-9091
Lustre-commit: d9e0c9f346d0 ("LU-9091 sysfs: use string helper like functions for sysfs")
Signed-off-by: James Simmons <jsimmons@infradead.org>
Reviewed-on: https://review.whamcloud.com/35658
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
---
 fs/lustre/include/lprocfs_status.h  |   4 +
 fs/lustre/lov/lproc_lov.c           |   4 +-
 fs/lustre/mdc/lproc_mdc.c           |  27 +++---
 fs/lustre/obdclass/class_obd.c      |  61 ++++++++++++
 fs/lustre/obdclass/lprocfs_status.c | 179 ++++++++++++++++++++++++++++++++++++
 fs/lustre/osc/lproc_osc.c           |  27 ++----
 6 files changed, 271 insertions(+), 31 deletions(-)

diff --git a/fs/lustre/include/lprocfs_status.h b/fs/lustre/include/lprocfs_status.h
index ac62560..22d7741 100644
--- a/fs/lustre/include/lprocfs_status.h
+++ b/fs/lustre/include/lprocfs_status.h
@@ -42,6 +42,7 @@
 #include <linux/debugfs.h>
 #include <linux/seq_file.h>
 #include <linux/spinlock.h>
+#include <linux/string_helpers.h>
 #include <linux/types.h>
 #include <linux/device.h>
 
@@ -484,6 +485,9 @@ int lprocfs_write_u64_helper(const char __user *buffer,
 int lprocfs_write_frac_u64_helper(const char __user *buffer,
 				  unsigned long count,
 				  u64 *val, int mult);
+int string_to_size(u64 *size, const char *buffer, size_t count);
+int sysfs_memparse(const char *buffer, size_t count, u64 *val,
+		    const char *defunit);
 char *lprocfs_find_named_value(const char *buffer, const char *name,
 			       size_t *count);
 void lprocfs_oh_tally(struct obd_histogram *oh, unsigned int value);
diff --git a/fs/lustre/lov/lproc_lov.c b/fs/lustre/lov/lproc_lov.c
index c528a8b..37ef084 100644
--- a/fs/lustre/lov/lproc_lov.c
+++ b/fs/lustre/lov/lproc_lov.c
@@ -57,8 +57,8 @@ static ssize_t stripesize_store(struct kobject *kobj, struct attribute *attr,
 	u64 val;
 	int rc;
 
-	rc = kstrtoull(buf, 10, &val);
-	if (rc)
+	rc = sysfs_memparse(buf, count, &val, "B");
+	if (rc < 0)
 		return rc;
 
 	lov_fix_desc_stripe_size(&val);
diff --git a/fs/lustre/mdc/lproc_mdc.c b/fs/lustre/mdc/lproc_mdc.c
index 454b69d..c438198 100644
--- a/fs/lustre/mdc/lproc_mdc.c
+++ b/fs/lustre/mdc/lproc_mdc.c
@@ -61,12 +61,19 @@ static ssize_t mdc_max_dirty_mb_seq_write(struct file *file,
 	struct seq_file *sfl = file->private_data;
 	struct obd_device *dev = sfl->private;
 	struct client_obd *cli = &dev->u.cli;
-	__s64 pages_number;
+	char kernbuf[22] = "";
+	u64 pages_number;
 	int rc;
 
-	rc = lprocfs_write_frac_u64_helper(buffer, count, &pages_number,
-					   1 << (20 - PAGE_SHIFT));
-	if (rc)
+	if (count >= sizeof(kernbuf))
+		return -EINVAL;
+
+	if (copy_from_user(kernbuf, buffer, count))
+		return -EFAULT;
+	kernbuf[count] = 0;
+
+	rc = sysfs_memparse(kernbuf, count, &pages_number, "MiB");
+	if (rc < 0)
 		return rc;
 
 	/* MB -> pages */
@@ -111,6 +118,7 @@ static int mdc_cached_mb_seq_show(struct seq_file *m, void *v)
 	struct obd_device *dev = sfl->private;
 	struct client_obd *cli = &dev->u.cli;
 	u64 pages_number;
+	const char *tmp;
 	long rc;
 	char kernbuf[128];
 
@@ -121,18 +129,13 @@ static int mdc_cached_mb_seq_show(struct seq_file *m, void *v)
 		return -EFAULT;
 	kernbuf[count] = 0;
 
-	buffer += lprocfs_find_named_value(kernbuf, "used_mb:", &count) -
-		  kernbuf;
-	rc = lprocfs_write_frac_u64_helper(buffer, count, &pages_number,
-					   1 << (20 - PAGE_SHIFT));
-	if (rc)
+	tmp = lprocfs_find_named_value(kernbuf, "used_mb:", &count);
+	rc = sysfs_memparse(tmp, count, &pages_number, "MiB");
+	if (rc < 0)
 		return rc;
 
 	pages_number >>= PAGE_SHIFT;
 
-	if (pages_number < 0)
-		return -ERANGE;
-
 	rc = atomic_long_read(&cli->cl_lru_in_list) - pages_number;
 	if (rc > 0) {
 		struct lu_env *env;
diff --git a/fs/lustre/obdclass/class_obd.c b/fs/lustre/obdclass/class_obd.c
index 0718fdb..d462317 100644
--- a/fs/lustre/obdclass/class_obd.c
+++ b/fs/lustre/obdclass/class_obd.c
@@ -524,6 +524,20 @@ static long obd_class_ioctl(struct file *filp, unsigned int cmd,
 	.fops		= &obd_psdev_fops,
 };
 
+#define test_string_to_size_one(value, result, def_unit)		\
+({									\
+	u64 __size;							\
+	int __ret;							\
+									\
+	BUILD_BUG_ON(strlen(value) >= 23);				\
+	__ret = sysfs_memparse((value), (result), &__size,		\
+			       (def_unit));				\
+	if (__ret == 0 && (u64)result != __size)			\
+		CERROR("string_helper: size %llu != result %llu\n",	\
+		       __size, (u64)result);				\
+	__ret;								\
+})
+
 static int obd_init_checks(void)
 {
 	u64 u64val, div64val;
@@ -590,6 +604,53 @@ static int obd_init_checks(void)
 		ret = -EINVAL;
 	}
 
+	/* invalid string */
+	ret = test_string_to_size_one("256B34", 256, "B");
+	if (ret == 0)
+		CERROR("string_helpers: format should be number then units\n");
+	ret = test_string_to_size_one("132OpQ", 132, "B");
+	if (ret == 0)
+		CERROR("string_helpers: invalid units should be rejected\n");
+	ret = 0;
+
+	/* small values */
+	test_string_to_size_one("0B", 0, "B");
+	ret = test_string_to_size_one("1.82B", 1, "B");
+	if (ret == 0)
+		CERROR("string_helpers: number string with 'B' and '.' should be invalid\n");
+	ret = 0;
+	test_string_to_size_one("512B", 512, "B");
+	test_string_to_size_one("1.067kB", 1067, "B");
+	test_string_to_size_one("1.042KiB", 1067, "B");
+
+	/* Lustre special handling */
+	test_string_to_size_one("16", 16777216, "MiB");
+	test_string_to_size_one("65536", 65536, "B");
+	test_string_to_size_one("128K", 131072, "B");
+	test_string_to_size_one("1M", 1048576, "B");
+	test_string_to_size_one("256.5G", 275414777856ULL, "GiB");
+
+	/* normal values */
+	test_string_to_size_one("8.39MB", 8390000, "MiB");
+	test_string_to_size_one("8.00MiB", 8388608, "MiB");
+	test_string_to_size_one("256GB", 256000000, "GiB");
+	test_string_to_size_one("238.731 GiB", 256335459385ULL, "GiB");
+
+	/* huge values */
+	test_string_to_size_one("0.4TB", 400000000000ULL, "TiB");
+	test_string_to_size_one("12.5TiB", 13743895347200ULL, "TiB");
+	test_string_to_size_one("2PB", 2000000000000000ULL, "PiB");
+	test_string_to_size_one("16PiB", 18014398509481984ULL, "PiB");
+
+	/* huge values should overflow */
+	ret = test_string_to_size_one("1000EiB", 0, "EiB");
+	if (ret != -EOVERFLOW)
+		CERROR("string_helpers: Failed to detect overflow\n");
+	ret = test_string_to_size_one("1000EB", 0, "EiB");
+	if (ret != -EOVERFLOW)
+		CERROR("string_helpers: Failed to detect overflow\n");
+	ret = 0;
+
 	return ret;
 }
 
diff --git a/fs/lustre/obdclass/lprocfs_status.c b/fs/lustre/obdclass/lprocfs_status.c
index 4fc35c5..325005d 100644
--- a/fs/lustre/obdclass/lprocfs_status.c
+++ b/fs/lustre/obdclass/lprocfs_status.c
@@ -217,6 +217,185 @@ static void obd_connect_data_seqprint(struct seq_file *m,
 			   ocd->ocd_maxmodrpcs);
 }
 
+/**
+ * string_to_size - convert an ASCII string representing a numerical
+ *		    value with optional units to a 64-bit binary value
+ *
+ * @size:	The numerical value extracted from @buffer
+ * @buffer:	passed in string to parse
+ * @count:	length of the @buffer
+ *
+ * This function returns a 64-bit binary value if @buffer contains a valid
+ * numerical string. The string is parsed to 3 significant figures after
+ * the decimal point. The string may contain an optional unit suffix at
+ * the end, which can be either base 2 or base 10 in value. If no units
+ * are given the string is assumed to be just a numerical value.
+ *
+ * Returns:	@count if the string is successfully parsed,
+ *		-errno on invalid input strings. Error values:
+ *
+ *  - ``-EINVAL``: @buffer is not a proper numerical string
+ *  - ``-EOVERFLOW``: result does not fit into 64 bits.
+ *  - ``-E2BIG``: @buffer is too large to parse
+ */
+int string_to_size(u64 *size, const char *buffer, size_t count)
+{
+	/* string_get_size() can support values above exabytes (ZiB, YiB)
+	 * because it breaks the return value into a size and a block-size
+	 * unit to avoid 64-bit overflow. We don't break the size up into
+	 * block size units, so we don't support ZiB or YiB.
+	 */
+	static const char *const units_10[] = {
+		"kB", "MB", "GB", "TB", "PB", "EB"
+	};
+	static const char *const units_2[] = {
+		"KiB", "MiB", "GiB", "TiB", "PiB", "EiB"
+	};
+	static const char *const *const units_str[] = {
+		[STRING_UNITS_2] = units_2,
+		[STRING_UNITS_10] = units_10,
+	};
+	static const unsigned int coeff[] = {
+		[STRING_UNITS_10] = 1000,
+		[STRING_UNITS_2] = 1024,
+	};
+	enum string_size_units unit;
+	u64 whole, blk_size = 1;
+	char kernbuf[22], *end;
+	size_t len = count;
+	int rc;
+	int i;
+
+	if (count >= sizeof(kernbuf))
+		return -E2BIG;
+
+	*size = 0;
+	/* 'iB' is used for base 2 numbers. If @buffer contains only a 'B'
+	 * or only numbers then we treat it as a direct number, in which
+	 * case it doesn't matter if it is STRING_UNITS_2 or STRING_UNITS_10.
+	 */
+	unit = strstr(buffer, "iB") ? STRING_UNITS_2 : STRING_UNITS_10;
+	i = unit == STRING_UNITS_2 ? ARRAY_SIZE(units_2) - 1 :
+				     ARRAY_SIZE(units_10) - 1;
+	do {
+		end = strstr(buffer, units_str[unit][i]);
+		if (end) {
+			for (; i >= 0; i--)
+				blk_size *= coeff[unit];
+			len -= strlen(end);
+			break;
+		}
+	} while (i--);
+
+	/* as 'B' is a substring of all units, we need to handle it
+	 * separately.
+	 */
+	if (!end) {
+		/* 'B' is the only acceptable letter at this point */
+		end = strchr(buffer, 'B');
+		if (end) {
+			len -= strlen(end);
+
+			if (count - len > 2 ||
+			    (count - len == 2 && strcmp(end, "B\n") != 0))
+				return -EINVAL;
+		}
+		/* kstrtoull will error out if it has non digits */
+		goto numbers_only;
+	}
+
+	end = strchr(buffer, '.');
+	if (end) {
+		/* need to limit 3 decimal places */
+		char rem[4] = "000";
+		u64 frac = 0;
+		size_t off;
+
+		len = end - buffer;
+		end++;
+
+		/* limit to 3 decimal points */
+		off = min_t(size_t, 3, strspn(end, "0123456789"));
+		/* need to limit frac to a u32 */
+		memcpy(rem, end, off);
+		rc = kstrtoull(rem, 10, &frac);
+		if (rc)
+			return rc;
+
+		if (fls64(frac) + fls64(blk_size) - 1 > 64)
+			return -EOVERFLOW;
+
+		frac *= blk_size;
+		do_div(frac, 1000);
+		*size += frac;
+	}
+numbers_only:
+	snprintf(kernbuf, sizeof(kernbuf), "%.*s", (int)len, buffer);
+	rc = kstrtoull(kernbuf, 10, &whole);
+	if (rc)
+		return rc;
+
+	if (whole != 0 && fls64(whole) + fls64(blk_size) - 1 > 64)
+		return -EOVERFLOW;
+
+	*size += whole * blk_size;
+
+	return count;
+}
+EXPORT_SYMBOL(string_to_size);
+
+/**
+ * sysfs_memparse - parse an ASCII string to a 64-bit binary value,
+ *		    with optional units
+ *
+ * @buffer:	kernel pointer to input string
+ * @count:	number of bytes in the input @buffer
+ * @val:	(output) binary value returned to caller
+ * @defunit:	default unit suffix to use if none is provided
+ *
+ * Parses a string into a number. The number stored at @buffer is
+ * potentially suffixed with K, M, G, T, P, E. The full set of valid
+ * suffix units is described in string_to_size(). If the string lacks
+ * a suffix then @defunit is used. The @defunit should be given as a
+ * binary unit (e.g. MiB) since that is the standard for tunables in
+ * Lustre. If a bare unit suffix is given (e.g. 'G' rather than 'GB'
+ * or 'GiB'), it is assumed to be a binary unit (GiB).
+ *
+ * Returns:	0 on success or -errno on failure.
+ */
+int sysfs_memparse(const char *buffer, size_t count, u64 *val,
+		   const char *defunit)
+{
+	char param[23];
+	int rc;
+
+	if (count >= sizeof(param))
+		return -E2BIG;
+
+	count = strlen(buffer);
+	if (count && buffer[count - 1] == '\n')
+		count--;
+
+	if (!count)
+		return -EINVAL;
+
+	if (isalpha(buffer[count - 1])) {
+		if (buffer[count - 1] != 'B') {
+			scnprintf(param, sizeof(param), "%.*siB",
+				  (int)count, buffer);
+		} else {
+			memcpy(param, buffer, count + 1);
+		}
+	} else {
+		scnprintf(param, sizeof(param), "%.*s%s", (int)count,
+			  buffer, defunit);
+	}
+
+	rc = string_to_size(val, param, strlen(param));
+	return rc < 0 ? rc : 0;
+}
+EXPORT_SYMBOL(sysfs_memparse);
+
 int lprocfs_read_frac_helper(char *buffer, unsigned long count, long val,
 			     int mult)
 {
diff --git a/fs/lustre/osc/lproc_osc.c b/fs/lustre/osc/lproc_osc.c
index d545d1b..5cf2148 100644
--- a/fs/lustre/osc/lproc_osc.c
+++ b/fs/lustre/osc/lproc_osc.c
@@ -203,10 +203,10 @@ static ssize_t osc_cached_mb_seq_write(struct file *file,
 	struct seq_file *m = file->private_data;
 	struct obd_device *dev = m->private;
 	struct client_obd *cli = &dev->u.cli;
-	long pages_number, rc;
+	u64 pages_number;
+	const char *tmp;
+	long rc;
 	char kernbuf[128];
-	int mult;
-	u64 val;
 
 	if (count >= sizeof(kernbuf))
 		return -EINVAL;
@@ -215,19 +215,12 @@ static ssize_t osc_cached_mb_seq_write(struct file *file,
 		return -EFAULT;
 	kernbuf[count] = 0;
 
-	mult = 1 << (20 - PAGE_SHIFT);
-	buffer += lprocfs_find_named_value(kernbuf, "used_mb:", &count) -
-		  kernbuf;
-	rc = lprocfs_write_frac_u64_helper(buffer, count, &val, mult);
-	if (rc)
+	tmp = lprocfs_find_named_value(kernbuf, "used_mb:", &count);
+	rc = sysfs_memparse(tmp, count, &pages_number, "MiB");
+	if (rc < 0)
 		return rc;
 
-	if (val > LONG_MAX)
-		return -ERANGE;
-	pages_number = (long)val;
-
-	if (pages_number < 0)
-		return -ERANGE;
+	pages_number >>= PAGE_SHIFT;
 
 	rc = atomic_long_read(&cli->cl_lru_in_list) - pages_number;
 	if (rc > 0) {
@@ -277,11 +270,11 @@ static ssize_t cur_grant_bytes_store(struct kobject *kobj,
 	struct obd_device *obd = container_of(kobj, struct obd_device,
 					      obd_kset.kobj);
 	struct client_obd *cli = &obd->u.cli;
+	u64 val;
 	int rc;
-	unsigned long long val;
 
-	rc = kstrtoull(buffer, 10, &val);
-	if (rc)
+	rc = sysfs_memparse(buffer, count, &val, "MiB");
+	if (rc < 0)
 		return rc;
 
 	/* this is only for shrinking grant */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 573/622] lustre: rename ops to owner
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (571 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 572/622] lustre: sysfs: use string helper like functions for sysfs James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 574/622] lustre: ldlm: simplify ldlm_ns_hash_defs[] James Simmons
                   ` (49 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: NeilBrown <neilb@suse.com>

Now that portals_handle_ops contains only a char*,
it is functioning primarily to identify the owner of each handle.
So change the name to h_owner, and the type to const char*.

Note: this h_owner is now quite different from the similar h_owner
in the server code.  When server code is merged the
"med" pointer will be stored in the "mfd" and validated separately.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12542
Lustre-commit: 1a9aafbf6317 ("LU-12542 handle: rename ops to owner")
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35798
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_handles.h  | 12 +++---------
 fs/lustre/ldlm/ldlm_lock.c          |  8 +++-----
 fs/lustre/obdclass/genops.c         | 10 +++-------
 fs/lustre/obdclass/lustre_handles.c | 15 +++++++--------
 4 files changed, 16 insertions(+), 29 deletions(-)

diff --git a/fs/lustre/include/lustre_handles.h b/fs/lustre/include/lustre_handles.h
index 8f733fd..55f9a09 100644
--- a/fs/lustre/include/lustre_handles.h
+++ b/fs/lustre/include/lustre_handles.h
@@ -45,11 +45,6 @@
 #include <linux/spinlock.h>
 #include <linux/types.h>
 
-struct portals_handle_ops {
-	/* hop_type is used for some debugging messages */
-	char *hop_type;
-};
-
 /* These handles are most easily used by having them appear at the very top of
  * whatever object that you want to make handles for.  ie:
  *
@@ -65,7 +60,7 @@ struct portals_handle_ops {
 struct portals_handle {
 	struct list_head		h_link;
 	u64				h_cookie;
-	const struct portals_handle_ops	*h_ops;
+	const char			*h_owner;
 	refcount_t			h_ref;
 
 	/* newly added fields to handle the RCU issue. -jxiong */
@@ -77,10 +72,9 @@ struct portals_handle {
 /* handles.c */
 
 /* Add a handle to the hash table */
-void class_handle_hash(struct portals_handle *,
-		       const struct portals_handle_ops *ops);
+void class_handle_hash(struct portals_handle *, const char *h_owner);
 void class_handle_unhash(struct portals_handle *);
-void *class_handle2object(u64 cookie, const struct portals_handle_ops *ops);
+void *class_handle2object(u64 cookie, const char *h_owner);
 int class_handle_init(void);
 void class_handle_cleanup(void);
 
diff --git a/fs/lustre/ldlm/ldlm_lock.c b/fs/lustre/ldlm/ldlm_lock.c
index 61bf028..2c19636 100644
--- a/fs/lustre/ldlm/ldlm_lock.c
+++ b/fs/lustre/ldlm/ldlm_lock.c
@@ -365,9 +365,7 @@ void ldlm_lock_destroy_nolock(struct ldlm_lock *lock)
 	}
 }
 
-static struct portals_handle_ops lock_handle_ops = {
-	.hop_type   = "ldlm",
-};
+static const char lock_handle_owner[] = "ldlm";
 
 /**
  *
@@ -407,7 +405,7 @@ static struct ldlm_lock *ldlm_lock_new(struct ldlm_resource *resource)
 	lprocfs_counter_incr(ldlm_res_to_ns(resource)->ns_stats,
 			     LDLM_NSS_LOCKS);
 	INIT_LIST_HEAD(&lock->l_handle.h_link);
-	class_handle_hash(&lock->l_handle, &lock_handle_ops);
+	class_handle_hash(&lock->l_handle, lock_handle_owner);
 
 	lu_ref_init(&lock->l_reference);
 	lu_ref_add(&lock->l_reference, "hash", lock);
@@ -515,7 +513,7 @@ struct ldlm_lock *__ldlm_handle2lock(const struct lustre_handle *handle,
 	if (!lustre_handle_is_used(handle))
 		return NULL;
 
-	lock = class_handle2object(handle->cookie, &lock_handle_ops);
+	lock = class_handle2object(handle->cookie, lock_handle_owner);
 	if (!lock)
 		return NULL;
 
diff --git a/fs/lustre/obdclass/genops.c b/fs/lustre/obdclass/genops.c
index 15bea0d..0fbe03e 100644
--- a/fs/lustre/obdclass/genops.c
+++ b/fs/lustre/obdclass/genops.c
@@ -662,7 +662,7 @@ int obd_init_caches(void)
 	return -ENOMEM;
 }
 
-static struct portals_handle_ops export_handle_ops;
+static const char export_handle_owner[] = "export";
 
 /* map connection to client */
 struct obd_export *class_conn2export(struct lustre_handle *conn)
@@ -680,7 +680,7 @@ struct obd_export *class_conn2export(struct lustre_handle *conn)
 	}
 
 	CDEBUG(D_INFO, "looking for export cookie %#llx\n", conn->cookie);
-	export = class_handle2object(conn->cookie, &export_handle_ops);
+	export = class_handle2object(conn->cookie, export_handle_owner);
 	return export;
 }
 EXPORT_SYMBOL(class_conn2export);
@@ -732,10 +732,6 @@ static void class_export_destroy(struct obd_export *exp)
 	kfree_rcu(exp, exp_handle.h_rcu);
 }
 
-static struct portals_handle_ops export_handle_ops = {
-	.hop_type	= "export",
-};
-
 struct obd_export *class_export_get(struct obd_export *exp)
 {
 	refcount_inc(&exp->exp_handle.h_ref);
@@ -819,7 +815,7 @@ static struct obd_export *__class_new_export(struct obd_device *obd,
 	INIT_LIST_HEAD(&export->exp_req_replay_queue);
 	INIT_LIST_HEAD_RCU(&export->exp_handle.h_link);
 	INIT_LIST_HEAD(&export->exp_hp_rpcs);
-	class_handle_hash(&export->exp_handle, &export_handle_ops);
+	class_handle_hash(&export->exp_handle, export_handle_owner);
 	spin_lock_init(&export->exp_lock);
 	spin_lock_init(&export->exp_rpc_lock);
 	spin_lock_init(&export->exp_bl_list_lock);
diff --git a/fs/lustre/obdclass/lustre_handles.c b/fs/lustre/obdclass/lustre_handles.c
index 99c68fe..6989a60 100644
--- a/fs/lustre/obdclass/lustre_handles.c
+++ b/fs/lustre/obdclass/lustre_handles.c
@@ -58,8 +58,7 @@
  * Generate a unique 64bit cookie (hash) for a handle and insert it into
  * global (per-node) hash-table.
  */
-void class_handle_hash(struct portals_handle *h,
-		       const struct portals_handle_ops *ops)
+void class_handle_hash(struct portals_handle *h, const char *owner)
 {
 	struct handle_bucket *bucket;
 
@@ -85,7 +84,7 @@ void class_handle_hash(struct portals_handle *h,
 	h->h_cookie = handle_base;
 	spin_unlock(&handle_base_lock);
 
-	h->h_ops = ops;
+	h->h_owner = owner;
 	spin_lock_init(&h->h_lock);
 
 	bucket = &handle_hash[h->h_cookie & HANDLE_HASH_MASK];
@@ -132,7 +131,7 @@ void class_handle_unhash(struct portals_handle *h)
 }
 EXPORT_SYMBOL(class_handle_unhash);
 
-void *class_handle2object(u64 cookie, const struct portals_handle_ops *ops)
+void *class_handle2object(u64 cookie, const char *owner)
 {
 	struct handle_bucket *bucket;
 	struct portals_handle *h;
@@ -147,14 +146,14 @@ void *class_handle2object(u64 cookie, const struct portals_handle_ops *ops)
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(h, &bucket->head, h_link) {
-		if (h->h_cookie != cookie || h->h_ops != ops)
+		if (h->h_cookie != cookie || h->h_owner != owner)
 			continue;
 
 		spin_lock(&h->h_lock);
 		if (likely(h->h_in != 0)) {
 			refcount_inc(&h->h_ref);
 			CDEBUG(D_INFO, "GET %s %p refcount=%d\n",
-			       h->h_ops->hop_type, h,
+			       h->h_owner, h,
 			       refcount_read(&h->h_ref));
 			retval = h;
 		}
@@ -201,8 +200,8 @@ static int cleanup_all_handles(void)
 
 		spin_lock(&handle_hash[i].lock);
 		list_for_each_entry_rcu(h, &handle_hash[i].head, h_link) {
-			CERROR("force clean handle %#llx addr %p ops %p\n",
-			       h->h_cookie, h, h->h_ops);
+			CERROR("force clean handle %#llx addr %p owner %p\n",
+			       h->h_cookie, h, h->h_owner);
 
 			class_handle_unhash_nolock(h);
 			rc++;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread
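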

* [lustre-devel] [PATCH 574/622] lustre: ldlm: simplify ldlm_ns_hash_defs[]
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (572 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 573/622] lustre: rename ops to owner James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 575/622] lnet: prepare to make lnet_lnd const James Simmons
                   ` (48 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: NeilBrown <neilb@suse.com>

As the ldlm_ns_types are dense, we can use the type as
the index to the array, rather than searching through
the array for a match.
We can also discard nsd_hops as all hash tables now
use the same hops.
This makes the table smaller and the code simpler.

WC-bug-id: https://jira.whamcloud.com/browse/LU-8130
Lustre-commit: 416142145c9d ("LU-8130 ldlm: simplify ldlm_ns_hash_defs[]")
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/36220
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_resource.c | 62 ++++++++++++++----------------------------
 1 file changed, 20 insertions(+), 42 deletions(-)

diff --git a/fs/lustre/ldlm/ldlm_resource.c b/fs/lustre/ldlm/ldlm_resource.c
index d009d5d..9b24be7 100644
--- a/fs/lustre/ldlm/ldlm_resource.c
+++ b/fs/lustre/ldlm/ldlm_resource.c
@@ -522,55 +522,35 @@ static void ldlm_res_hop_put(struct cfs_hash *hs, struct hlist_node *hnode)
 	.hs_put		= ldlm_res_hop_put
 };
 
-struct ldlm_ns_hash_def {
-	enum ldlm_ns_type	nsd_type;
+static struct {
 	/** hash bucket bits */
 	unsigned int		nsd_bkt_bits;
 	/** hash bits */
 	unsigned int		nsd_all_bits;
-	/** hash operations */
-	struct cfs_hash_ops	*nsd_hops;
-};
-
-static struct ldlm_ns_hash_def ldlm_ns_hash_defs[] = {
-	{
-		.nsd_type       = LDLM_NS_TYPE_MDC,
+} ldlm_ns_hash_defs[] = {
+	[LDLM_NS_TYPE_MDC] = {
 		.nsd_bkt_bits   = 11,
 		.nsd_all_bits   = 16,
-		.nsd_hops       = &ldlm_ns_hash_ops,
 	},
-	{
-		.nsd_type       = LDLM_NS_TYPE_MDT,
+	[LDLM_NS_TYPE_MDT] = {
 		.nsd_bkt_bits   = 14,
 		.nsd_all_bits   = 21,
-		.nsd_hops       = &ldlm_ns_hash_ops,
 	},
-	{
-		.nsd_type       = LDLM_NS_TYPE_OSC,
+	[LDLM_NS_TYPE_OSC] = {
 		.nsd_bkt_bits   = 8,
 		.nsd_all_bits   = 12,
-		.nsd_hops       = &ldlm_ns_hash_ops,
 	},
-	{
-		.nsd_type       = LDLM_NS_TYPE_OST,
+	[LDLM_NS_TYPE_OST] = {
 		.nsd_bkt_bits   = 11,
 		.nsd_all_bits   = 17,
-		.nsd_hops       = &ldlm_ns_hash_ops,
 	},
-	{
-		.nsd_type       = LDLM_NS_TYPE_MGC,
+	[LDLM_NS_TYPE_MGC] = {
 		.nsd_bkt_bits   = 3,
 		.nsd_all_bits   = 4,
-		.nsd_hops       = &ldlm_ns_hash_ops,
 	},
-	{
-		.nsd_type       = LDLM_NS_TYPE_MGT,
+	[LDLM_NS_TYPE_MGT] = {
 		.nsd_bkt_bits   = 3,
 		.nsd_all_bits   = 4,
-		.nsd_hops       = &ldlm_ns_hash_ops,
-	},
-	{
-		.nsd_type       = LDLM_NS_TYPE_UNKNOWN,
 	},
 };
 
@@ -594,7 +574,6 @@ struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
 					  enum ldlm_ns_type ns_type)
 {
 	struct ldlm_namespace *ns = NULL;
-	struct ldlm_ns_hash_def *nsd;
 	int idx;
 	int rc;
 
@@ -606,15 +585,10 @@ struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
 		return NULL;
 	}
 
-	for (idx = 0; ; idx++) {
-		nsd = &ldlm_ns_hash_defs[idx];
-		if (nsd->nsd_type == LDLM_NS_TYPE_UNKNOWN) {
-			CERROR("Unknown type %d for ns %s\n", ns_type, name);
-			goto out_ref;
-		}
-
-		if (nsd->nsd_type == ns_type)
-			break;
+	if (ns_type >= ARRAY_SIZE(ldlm_ns_hash_defs) ||
+	    ldlm_ns_hash_defs[ns_type].nsd_bkt_bits == 0) {
+		CERROR("Unknown type %d for ns %s\n", ns_type, name);
+		goto out_ref;
 	}
 
 	ns = kzalloc(sizeof(*ns), GFP_NOFS);
@@ -622,11 +596,13 @@ struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
 		goto out_ref;
 
 	ns->ns_rs_hash = cfs_hash_create(name,
-					 nsd->nsd_all_bits, nsd->nsd_all_bits,
-					 nsd->nsd_bkt_bits, 0,
+					 ldlm_ns_hash_defs[ns_type].nsd_all_bits,
+					 ldlm_ns_hash_defs[ns_type].nsd_all_bits,
+					 ldlm_ns_hash_defs[ns_type].nsd_bkt_bits,
+					 0,
 					 CFS_HASH_MIN_THETA,
 					 CFS_HASH_MAX_THETA,
-					 nsd->nsd_hops,
+					 &ldlm_ns_hash_ops,
 					 CFS_HASH_DEPTH |
 					 CFS_HASH_BIGNAME |
 					 CFS_HASH_SPIN_BKTLOCK |
@@ -634,7 +610,9 @@ struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
 	if (!ns->ns_rs_hash)
 		goto out_ns;
 
-	ns->ns_bucket_bits = nsd->nsd_all_bits - nsd->nsd_bkt_bits;
+	ns->ns_bucket_bits = ldlm_ns_hash_defs[ns_type].nsd_all_bits -
+			     ldlm_ns_hash_defs[ns_type].nsd_bkt_bits;
+
 	ns->ns_rs_buckets = kvmalloc(BIT(ns->ns_bucket_bits) *
 				     sizeof(ns->ns_rs_buckets[0]),
 				     GFP_KERNEL);
-- 
1.8.3.1


* [lustre-devel] [PATCH 575/622] lnet: prepare to make lnet_lnd const.
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (573 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 574/622] lustre: ldlm: simplify ldlm_ns_hash_defs[] James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 576/622] lnet: discard struct ksock_peer James Simmons
                   ` (47 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

Preferred practice is for structs containing function
pointers to be 'const'.  Such structs are generally tempting
attack vectors, and making them const allows linux to place
them in read-only memory, thus reducing the attack surface.

'struct lnet_lnd' is mostly function pointers, but contains
one writable field - a list_head.

Rather than keeping registered lnds in a linked-list, we can place
them in an array indexed by type - type numbers are at most 15 so
this is not a burden.

With these changes, no part of an lnet_lnd is ever modified.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12678
Lustre-commit: 87a6bd0766da ("LU-12678 lnet: prepare to make lnet_lnd const.")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/36830
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-types.h   |  6 ++----
 include/uapi/linux/lnet/nidstr.h |  2 ++
 net/lnet/lnet/api-ni.c           | 29 +++++++++++++++--------------
 net/lnet/lnet/lo.c               |  1 -
 4 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index 02ac5df..99ed87a 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -46,6 +46,7 @@
 #include <uapi/linux/lnet/lnet-types.h>
 #include <uapi/linux/lnet/lnetctl.h>
 #include <uapi/linux/lnet/lnet-dlc.h>
+#include <uapi/linux/lnet/nidstr.h>
 
 /* Max payload size */
 #define LNET_MAX_PAYLOAD	LNET_MTU
@@ -244,9 +245,6 @@ struct lnet_test_peer {
 struct lnet_ni;			/* forward ref */
 
 struct lnet_lnd {
-	/* fields managed by portals */
-	struct list_head	lnd_list;	/* stash in the LND table */
-
 	/* fields initialised by the LND */
 	u32			lnd_type;
 
@@ -1133,7 +1131,7 @@ struct lnet {
 	/* uniquely identifies this ni in this epoch */
 	u64				ln_interface_cookie;
 	/* registered LNDs */
-	struct list_head		ln_lnds;
+	struct lnet_lnd			*ln_lnds[NUM_LNDS];
 
 	/* test protocol compatibility flags */
 	int				ln_testprotocompat;
diff --git a/include/uapi/linux/lnet/nidstr.h b/include/uapi/linux/lnet/nidstr.h
index 43ec232..958ca8d 100644
--- a/include/uapi/linux/lnet/nidstr.h
+++ b/include/uapi/linux/lnet/nidstr.h
@@ -53,6 +53,8 @@ enum {
 	/*MXLND		= 12, removed v2_7_50_0-34-g8be9e41	*/
 	GNILND		= 13,
 	GNIIPLND	= 14,
+
+	NUM_LNDS
 };
 
 struct list_head;
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 0020ffd..cd95bdd 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -734,12 +734,12 @@ static void lnet_assert_wire_constants(void)
 	struct lnet_lnd *lnd;
 
 	/* holding lnd mutex */
-	list_for_each_entry(lnd, &the_lnet.ln_lnds, lnd_list) {
-		if (lnd->lnd_type == type)
-			return lnd;
-	}
+	if (type >= NUM_LNDS)
+		return NULL;
+	lnd = the_lnet.ln_lnds[type];
+	LASSERT(!lnd || lnd->lnd_type == type);
 
-	return NULL;
+	return lnd;
 }
 
 unsigned int
@@ -757,7 +757,7 @@ static void lnet_assert_wire_constants(void)
 	LASSERT(libcfs_isknown_lnd(lnd->lnd_type));
 	LASSERT(!lnet_find_lnd_by_type(lnd->lnd_type));
 
-	list_add_tail(&lnd->lnd_list, &the_lnet.ln_lnds);
+	the_lnet.ln_lnds[lnd->lnd_type] = lnd;
 
 	CDEBUG(D_NET, "%s LND registered\n", libcfs_lnd2str(lnd->lnd_type));
 
@@ -772,7 +772,7 @@ static void lnet_assert_wire_constants(void)
 
 	LASSERT(lnet_find_lnd_by_type(lnd->lnd_type) == lnd);
 
-	list_del(&lnd->lnd_list);
+	the_lnet.ln_lnds[lnd->lnd_type] = NULL;
 	CDEBUG(D_NET, "%s LND unregistered\n", libcfs_lnd2str(lnd->lnd_type));
 
 	mutex_unlock(&the_lnet.ln_lnd_mutex);
@@ -2429,7 +2429,6 @@ int lnet_lib_init(void)
 	}
 
 	the_lnet.ln_refcount = 0;
-	INIT_LIST_HEAD(&the_lnet.ln_lnds);
 	INIT_LIST_HEAD(&the_lnet.ln_net_zombie);
 	INIT_LIST_HEAD(&the_lnet.ln_msg_resend);
 
@@ -2459,16 +2458,18 @@ int lnet_lib_init(void)
  *
  * \pre lnet_lib_init() called with success.
  * \pre All LNet users called LNetNIFini() for matching LNetNIInit() calls.
+ *
+ * As this happens at module-unload, all lnds must already be unloaded,
+ * so they must already be unregistered.
  */
 void lnet_lib_exit(void)
 {
-	struct lnet_lnd *lnd;
-	LASSERT(!the_lnet.ln_refcount);
+	int i;
 
-	while ((lnd = list_first_entry_or_null(&the_lnet.ln_lnds,
-					       struct lnet_lnd,
-					       lnd_list)) != NULL)
-		lnet_unregister_lnd(lnd);
+	LASSERT(!the_lnet.ln_refcount);
+	lnet_unregister_lnd(&the_lolnd);
+	for (i = 0; i < NUM_LNDS; i++)
+		LASSERT(!the_lnet.ln_lnds[i]);
 	lnet_destroy_locks();
 }
 
diff --git a/net/lnet/lnet/lo.c b/net/lnet/lnet/lo.c
index 350495f..c19a5b5 100644
--- a/net/lnet/lnet/lo.c
+++ b/net/lnet/lnet/lo.c
@@ -93,7 +93,6 @@
 }
 
 struct lnet_lnd the_lolnd = {
-	.lnd_list	= LIST_HEAD_INIT(the_lolnd.lnd_list),
 	.lnd_type	= LOLND,
 	.lnd_startup	= lolnd_startup,
 	.lnd_shutdown	= lolnd_shutdown,
-- 
1.8.3.1


* [lustre-devel] [PATCH 576/622] lnet: discard struct ksock_peer
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (574 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 575/622] lnet: prepare to make lnet_lnd const James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 577/622] lnet: Avoid extra lnet_remotenet lookup James Simmons
                   ` (46 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

struct ksock_peer is declared in a forward-ref, but
never defined or used.  Let's remove it, and change
some spaces to TABs while we are there.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12678
Lustre-commit: 179d50565e0b ("LU-12678 lnet: discard struct ksock_peer")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/36835
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/socklnd/socklnd.h | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/lnet/klnds/socklnd/socklnd.h b/net/lnet/klnds/socklnd/socklnd.h
index 832bc08..2d4e8d59 100644
--- a/net/lnet/klnds/socklnd/socklnd.h
+++ b/net/lnet/klnds/socklnd/socklnd.h
@@ -264,10 +264,9 @@ struct ksock_nal_data {
  * received into either struct iovec or struct bio_vec fragments, depending on
  * what the header matched or whether the message needs forwarding.
  */
-struct ksock_conn;  /* forward ref */
-struct ksock_peer_ni;  /* forward ref */
-struct ksock_route; /* forward ref */
-struct ksock_proto; /* forward ref */
+struct ksock_conn;				/* forward ref */
+struct ksock_route;				/* forward ref */
+struct ksock_proto;				/* forward ref */
 
 struct ksock_tx {				/* transmit packet */
 	struct list_head	tx_list;	/* queue on conn for transmission etc
-- 
1.8.3.1


* [lustre-devel] [PATCH 577/622] lnet: Avoid extra lnet_remotenet lookup
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (575 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 576/622] lnet: discard struct ksock_peer James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 578/622] lnet: Remove unused vars in lnet_find_route_locked James Simmons
                   ` (45 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

We can keep track of the lnet_remotenet object associated with the
"best" lnet_peer_net, and pass that lnet_remotenet directly to
lnet_find_route_locked().

WC-bug-id: https://jira.whamcloud.com/browse/LU-12756
Lustre-commit: 3812c54b9ca3 ("LU-12756 lnet: Avoid extra lnet_remotenet lookup")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/36536
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-move.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 45975d6..03d629d 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -1324,23 +1324,18 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 }
 
 static struct lnet_route *
-lnet_find_route_locked(struct lnet_net *net, u32 remote_net,
+lnet_find_route_locked(struct lnet_remotenet *rnet,
 		       struct lnet_route **prev_route,
 		       struct lnet_peer_ni **gwni)
 {
 	struct lnet_peer_ni *best_gw_ni = NULL;
 	struct lnet_route *best_route;
 	struct lnet_route *last_route;
-	struct lnet_remotenet *rnet;
 	struct lnet_peer *lp_best;
 	struct lnet_route *route;
 	struct lnet_peer *lp;
 	int rc;
 
-	rnet = lnet_find_rnet_locked(remote_net);
-	if (!rnet)
-		return NULL;
-
 	lp_best = NULL;
 	best_route = NULL;
 	last_route = NULL;
@@ -1832,7 +1827,7 @@ struct lnet_ni *
 	struct lnet_peer *lp;
 	struct lnet_peer_net *lpn;
 	struct lnet_peer_net *best_lpn = NULL;
-	struct lnet_remotenet *rnet;
+	struct lnet_remotenet *rnet, *best_rnet = NULL;
 	struct lnet_route *best_route = NULL;
 	struct lnet_route *last_route = NULL;
 	struct lnet_peer_ni *lpni = NULL;
@@ -1867,13 +1862,16 @@ struct lnet_ni *
 			if (!rnet)
 				continue;
 
-			if (!best_lpn)
+			if (!best_lpn) {
 				best_lpn = lpn;
+				best_rnet = rnet;
+			}
 
 			if (best_lpn->lpn_seq <= lpn->lpn_seq)
 				continue;
 
 			best_lpn = lpn;
+			best_rnet = rnet;
 		}
 
 		if (!best_lpn) {
@@ -1892,8 +1890,8 @@ struct lnet_ni *
 			return -EHOSTUNREACH;
 		}
 
-		best_route = lnet_find_route_locked(NULL, best_lpn->lpn_net_id,
-						    &last_route, &gwni);
+		best_route = lnet_find_route_locked(best_rnet, &last_route,
+						    &gwni);
 		if (!best_route) {
 			CERROR("no route to %s from %s\n",
 			       libcfs_nid2str(dst_nid),
-- 
1.8.3.1


* [lustre-devel] [PATCH 578/622] lnet: Remove unused vars in lnet_find_route_locked
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (576 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 577/622] lnet: Avoid extra lnet_remotenet lookup James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 579/622] lnet: Refactor lnet_compare_routes James Simmons
                   ` (44 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

The lp and lp_best variables are not needed in
lnet_find_route_locked().

WC-bug-id: https://jira.whamcloud.com/browse/LU-12756
Lustre-commit: b129f7b1f76a ("LU-12756 lnet: Remove unused vars in lnet_find_route_locked")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/36620
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-move.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 03d629d..b7990c9 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -1331,24 +1331,18 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	struct lnet_peer_ni *best_gw_ni = NULL;
 	struct lnet_route *best_route;
 	struct lnet_route *last_route;
-	struct lnet_peer *lp_best;
 	struct lnet_route *route;
-	struct lnet_peer *lp;
 	int rc;
 
-	lp_best = NULL;
 	best_route = NULL;
 	last_route = NULL;
 	list_for_each_entry(route, &rnet->lrn_routes, lr_list) {
-		lp = route->lr_gateway;
-
 		if (!lnet_is_route_alive(route))
 			continue;
 
-		if (!lp_best) {
+		if (!best_route) {
 			best_route = route;
 			last_route = route;
-			lp_best = lp;
 			best_gw_ni = lnet_find_best_lpni_on_net(NULL,
 								LNET_NID_ANY,
 								route->lr_gateway,
@@ -1366,7 +1360,6 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 			continue;
 
 		best_route = route;
-		lp_best = lp;
 	}
 
 	*prev_route = last_route;
-- 
1.8.3.1


* [lustre-devel] [PATCH 579/622] lnet: Refactor lnet_compare_routes
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (577 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 578/622] lnet: Remove unused vars in lnet_find_route_locked James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 580/622] lustre: u_object: factor out extra per-bucket data James Simmons
                   ` (43 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

Restrict lnet_compare_routes() to only comparing the lnet_route
objects passed as arguments. This avoids unnecessary calls to
lnet_find_best_lpni_on_net().

Rename lnet_compare_peers to lnet_compare_gw_lpnis to better
reflect what is done by this routine.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12756
Lustre-commit: e02287b4ef6a ("LU-12756 lnet: Refactor lnet_compare_routes")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/36621
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-move.c | 77 +++++++++++++++++++-----------------------------
 1 file changed, 31 insertions(+), 46 deletions(-)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index b7990c9..269b2d5 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -1137,7 +1137,7 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 }
 
 static int
-lnet_compare_peers(struct lnet_peer_ni *p1, struct lnet_peer_ni *p2)
+lnet_compare_gw_lpnis(struct lnet_peer_ni *p1, struct lnet_peer_ni *p2)
 {
 	if (p1->lpni_txqnob < p2->lpni_txqnob)
 		return 1;
@@ -1267,60 +1267,26 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	return lnet_select_peer_ni(lni, dst_nid, peer, peer_net);
 }
 
+/* Compare route priorities and hop counts */
 static int
-lnet_compare_routes(struct lnet_route *r1, struct lnet_route *r2,
-		    struct lnet_peer_ni **best_lpni)
+lnet_compare_routes(struct lnet_route *r1, struct lnet_route *r2)
 {
 	int r1_hops = (r1->lr_hops == LNET_UNDEFINED_HOPS) ? 1 : r1->lr_hops;
 	int r2_hops = (r2->lr_hops == LNET_UNDEFINED_HOPS) ? 1 : r2->lr_hops;
-	struct lnet_peer *lp1 = r1->lr_gateway;
-	struct lnet_peer *lp2 = r2->lr_gateway;
-	struct lnet_peer_ni *lpni1;
-	struct lnet_peer_ni *lpni2;
-	int rc;
-
-	lpni1 = lnet_find_best_lpni_on_net(NULL, LNET_NID_ANY, lp1,
-					   r1->lr_lnet);
-	lpni2 = lnet_find_best_lpni_on_net(NULL, LNET_NID_ANY, lp2,
-					   r2->lr_lnet);
-	LASSERT(lpni1 && lpni2);
 
-	if (r1->lr_priority < r2->lr_priority) {
-		*best_lpni = lpni1;
+	if (r1->lr_priority < r2->lr_priority)
 		return 1;
-	}
 
-	if (r1->lr_priority > r2->lr_priority) {
-		*best_lpni = lpni2;
+	if (r1->lr_priority > r2->lr_priority)
 		return -1;
-	}
 
-	if (r1_hops < r2_hops) {
-		*best_lpni = lpni1;
+	if (r1_hops < r2_hops)
 		return 1;
-	}
 
-	if (r1_hops > r2_hops) {
-		*best_lpni = lpni2;
+	if (r1_hops > r2_hops)
 		return -1;
-	}
-
-	rc = lnet_compare_peers(lpni1, lpni2);
-	if (rc == 1) {
-		*best_lpni = lpni1;
-		return rc;
-	} else if (rc == -1) {
-		*best_lpni = lpni2;
-		return rc;
-	}
-
-	if (r1->lr_seq - r2->lr_seq <= 0) {
-		*best_lpni = lpni1;
-		return 1;
-	}
 
-	*best_lpni = lpni2;
-	return -1;
+	return 0;
 }
 
 static struct lnet_route *
@@ -1328,7 +1294,7 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 		       struct lnet_route **prev_route,
 		       struct lnet_peer_ni **gwni)
 {
-	struct lnet_peer_ni *best_gw_ni = NULL;
+	struct lnet_peer_ni *lpni, *best_gw_ni = NULL;
 	struct lnet_route *best_route;
 	struct lnet_route *last_route;
 	struct lnet_route *route;
@@ -1355,11 +1321,30 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 		if (last_route->lr_seq - route->lr_seq < 0)
 			last_route = route;
 
-		rc = lnet_compare_routes(route, best_route, &best_gw_ni);
-		if (rc < 0)
+		rc = lnet_compare_routes(route, best_route);
+		if (rc == -1)
+			continue;
+
+		lpni = lnet_find_best_lpni_on_net(NULL, LNET_NID_ANY,
+						  route->lr_gateway,
+						  route->lr_lnet);
+		LASSERT(lpni);
+
+		if (rc == 1) {
+			best_route = route;
+			best_gw_ni = lpni;
+			continue;
+		}
+
+		rc = lnet_compare_gw_lpnis(lpni, best_gw_ni);
+		if (rc == -1)
 			continue;
 
-		best_route = route;
+		if (rc == 1 || route->lr_seq <= best_route->lr_seq) {
+			best_route = route;
+			best_gw_ni = lpni;
+			continue;
+		}
 	}
 
 	*prev_route = last_route;
-- 
1.8.3.1


* [lustre-devel] [PATCH 580/622] lustre: u_object: factor out extra per-bucket data
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (578 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 579/622] lnet: Refactor lnet_compare_routes James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 581/622] lustre: llite: replace lli_trunc_sem James Simmons
                   ` (42 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: NeilBrown <neilb@suse.com>

The hash tables managed by lu_object store some extra
information in each bucket.  This prevents the use
of resizeable hash tables, so lu_site_init() goes to some trouble
to try to guess a good hash size.

There is no real need for the extra data to be closely associated with
hash buckets.  There is a small advantage as both the hash bucket and
the extra information can then be protected by the same lock, but as
these locks have low contention, that should rarely be noticed.

The extra data is updated frequently and accessed rarely; it consists
of an lru list and a wait_queue head.  There could just be a single
copy of this data for the whole array, but on a many-cpu machine that
could become a contention bottleneck.  So it makes sense to keep
multiple shards and combine them only when needed.  It does not make
sense to have many more copies than there are CPUs.

This patch takes the extra data out of the hash table buckets and
creates a separate array, which never has more entries than twice the
number of possible cpus.  As this extra data contains a
wait_queue_head, which contains a spinlock, that lock is used to
protect the other data (counter and lru list).

The code currently uses a very simple hash to choose a
hash-table bucket:

(fid_seq(fid) + fid_oid(fid)) & (CFS_HASH_NBKT(hs) - 1)

There is no documented reason for this and I cannot see any value in
not using a general hash function. We can use hash_32() and hash_64()
on the fid value with a random seed created for each lu_site. The
hash_*() functions were picked over the jhash() functions since they
perform significantly better.

The lock ordering requires that a hash-table lock cannot be taken
while an extra-data lock is held.  This means that in
lu_site_purge_objects() we must first remove objects from the lru
(with the extra information locked) and then remove each one from the
hash table.  To ensure the object is not found between these two
steps, the LU_OBJECT_PURGING flag is set.

As the extra info is now separate from the hash buckets, we cannot
report statistic from both at the same time.  I think the lru
statistics are probably more useful than the hash-table statistics, so
I have preserved the former and discarded the latter.  When the
hashtable becomes resizeable, those statistics will be irrelevant.

As the lru and the hash table are now managed by different locks
we need to be careful to prevent htable_lookup() finding an
object that lu_site_purge_objects() is purging.
To help with this we introduce a new lu_object flag to say
that an object is being purged.  Once set, the object will
be quickly removed from the hash table, and is already
removed from the lru.

WC-bug-id: https://jira.whamcloud.com/browse/LU-8130
Lustre-commit: e6f7f8a7b349 ("LU-8130 lu_object: factor out extra per-bucket data")
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/36216
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lu_object.h  |  13 +++-
 fs/lustre/obdclass/lu_object.c | 167 +++++++++++++++++++++++++----------------
 2 files changed, 113 insertions(+), 67 deletions(-)

diff --git a/fs/lustre/include/lu_object.h b/fs/lustre/include/lu_object.h
index e92f12f..4608937 100644
--- a/fs/lustre/include/lu_object.h
+++ b/fs/lustre/include/lu_object.h
@@ -463,7 +463,12 @@ enum lu_object_header_flags {
 	 * Object is initialized, when object is found in cache, it may not be
 	 * initialized yet, the object allocator will initialize it.
 	 */
-	LU_OBJECT_INITED	= 2
+	LU_OBJECT_INITED	= 2,
+	/**
+	 * Object is being purged, so mustn't be returned by
+	 * htable_lookup()
+	 */
+	LU_OBJECT_PURGING	= 3,
 };
 
 enum lu_object_header_attr {
@@ -553,6 +558,12 @@ struct lu_site {
 	 * objects hash table
 	 */
 	struct cfs_hash	       *ls_obj_hash;
+	/*
+	 * buckets for summary data
+	 */
+	struct lu_site_bkt_data	*ls_bkts;
+	int			ls_bkt_cnt;
+	u32			ls_bkt_seed;
 	/**
 	 * index of bucket on hash table while purging
 	 */
diff --git a/fs/lustre/obdclass/lu_object.c b/fs/lustre/obdclass/lu_object.c
index 38c04c7..7ea9948 100644
--- a/fs/lustre/obdclass/lu_object.c
+++ b/fs/lustre/obdclass/lu_object.c
@@ -43,6 +43,7 @@
 
 #include <linux/module.h>
 #include <linux/processor.h>
+#include <linux/random.h>
 
 /* hash_long() */
 #include <linux/libcfs/libcfs_hash.h>
@@ -58,11 +59,10 @@
 struct lu_site_bkt_data {
 	/**
 	 * LRU list, updated on each access to object. Protected by
-	 * bucket lock of lu_site::ls_obj_hash.
+	 * lsb_waitq.lock.
 	 *
 	 * "Cold" end of LRU is lu_site::ls_lru.next. Accessed object are
-	 * moved to the lu_site::ls_lru.prev (this is due to the non-existence
-	 * of list_for_each_entry_safe_reverse()).
+	 * moved to the lu_site::ls_lru.prev
 	 */
 	struct list_head		lsb_lru;
 	/**
@@ -92,9 +92,11 @@ enum {
 #define LU_SITE_BITS_MAX		24
 #define LU_SITE_BITS_MAX_CL		19
 /**
- * total 256 buckets, we don't want too many buckets because:
- * - consume too much memory
+ * Max 256 buckets, we don't want too many buckets because:
+ * - consume too much memory (currently max 16K)
  * - avoid unbalanced LRU list
+ * With few cpus there is little gain from extra buckets, so
+ * we treat this as a maximum in lu_site_init().
  */
 #define LU_SITE_BKT_BITS		8
 
@@ -109,14 +111,27 @@ enum {
 static void lu_object_free(const struct lu_env *env, struct lu_object *o);
 static u32 ls_stats_read(struct lprocfs_stats *stats, int idx);
 
+static u32 lu_fid_hash(const void *data, u32 seed)
+{
+	const struct lu_fid *fid = data;
+
+	seed = hash_32(seed ^ fid->f_oid, 32);
+	seed ^= hash_64(fid->f_seq, 32);
+	return seed;
+}
+
+static inline int lu_bkt_hash(struct lu_site *s, const struct lu_fid *fid)
+{
+	return lu_fid_hash(fid, s->ls_bkt_seed) &
+	       (s->ls_bkt_cnt - 1);
+}
+
 wait_queue_head_t *
 lu_site_wq_from_fid(struct lu_site *site, struct lu_fid *fid)
 {
-	struct cfs_hash_bd bd;
 	struct lu_site_bkt_data *bkt;
 
-	cfs_hash_bd_get(site->ls_obj_hash, fid, &bd);
-	bkt = cfs_hash_bd_extra_get(site->ls_obj_hash, &bd);
+	bkt = &site->ls_bkts[lu_bkt_hash(site, fid)];
 	return &bkt->lsb_waitq;
 }
 EXPORT_SYMBOL(lu_site_wq_from_fid);
@@ -155,7 +170,6 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 	}
 
 	cfs_hash_bd_get(site->ls_obj_hash, &top->loh_fid, &bd);
-	bkt = cfs_hash_bd_extra_get(site->ls_obj_hash, &bd);
 
 	is_dying = lu_object_is_dying(top);
 	if (!cfs_hash_bd_dec_and_lock(site->ls_obj_hash, &bd, &top->loh_ref)) {
@@ -169,6 +183,7 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 			 * somebody may be waiting for this, currently only
 			 * used for cl_object, see cl_object_put_last().
 			 */
+			bkt = &site->ls_bkts[lu_bkt_hash(site, &top->loh_fid)];
 			wake_up_all(&bkt->lsb_waitq);
 		}
 		return;
@@ -183,6 +198,9 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 			o->lo_ops->loo_object_release(env, o);
 	}
 
+	bkt = &site->ls_bkts[lu_bkt_hash(site, &top->loh_fid)];
+	spin_lock(&bkt->lsb_waitq.lock);
+
 	/* don't use local 'is_dying' here because if was taken without lock
 	 * but here we need the latest actual value of it so check lu_object
 	 * directly here.
@@ -190,6 +208,7 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 	if (!lu_object_is_dying(top)) {
 		LASSERT(list_empty(&top->loh_lru));
 		list_add_tail(&top->loh_lru, &bkt->lsb_lru);
+		spin_unlock(&bkt->lsb_waitq.lock);
 		percpu_counter_inc(&site->ls_lru_len_counter);
 		CDEBUG(D_INODE, "Add %p/%p to site lru. hash: %p, bkt: %p\n",
 		       orig, top, site->ls_obj_hash, bkt);
@@ -199,22 +218,19 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 
 	/*
 	 * If object is dying (will not be cached), then removed it
-	 * from hash table and LRU.
+	 * from hash table (it is already not on the LRU).
 	 *
-	 * This is done with hash table and LRU lists locked. As the only
+	 * This is done with hash table lists locked. As the only
 	 * way to acquire first reference to previously unreferenced
-	 * object is through hash-table lookup (lu_object_find()),
-	 * or LRU scanning (lu_site_purge()), that are done under hash-table
-	 * and LRU lock, no race with concurrent object lookup is possible
-	 * and we can safely destroy object below.
+	 * object is through hash-table lookup (lu_object_find())
+	 * which is done under hash-table, no race with concurrent
+	 * object lookup is possible and we can safely destroy object below.
 	 */
 	if (!test_and_set_bit(LU_OBJECT_UNHASHED, &top->loh_flags))
 		cfs_hash_bd_del_locked(site->ls_obj_hash, &bd, &top->loh_hash);
+	spin_unlock(&bkt->lsb_waitq.lock);
 	cfs_hash_bd_unlock(site->ls_obj_hash, &bd, 1);
-	/*
-	 * Object was already removed from hash and lru above, can
-	 * kill it.
-	 */
+	/* Object was already removed from hash above, can kill it. */
 	lu_object_free(env, orig);
 }
 EXPORT_SYMBOL(lu_object_put);
@@ -238,8 +254,10 @@ void lu_object_unhash(const struct lu_env *env, struct lu_object *o)
 		if (!list_empty(&top->loh_lru)) {
 			struct lu_site_bkt_data *bkt;
 
+			bkt = &site->ls_bkts[lu_bkt_hash(site, &top->loh_fid)];
+			spin_lock(&bkt->lsb_waitq.lock);
 			list_del_init(&top->loh_lru);
-			bkt = cfs_hash_bd_extra_get(obj_hash, &bd);
+			spin_unlock(&bkt->lsb_waitq.lock);
 			percpu_counter_dec(&site->ls_lru_len_counter);
 		}
 		cfs_hash_bd_del_locked(obj_hash, &bd, &top->loh_hash);
@@ -390,8 +408,6 @@ int lu_site_purge_objects(const struct lu_env *env, struct lu_site *s,
 	struct lu_object_header *h;
 	struct lu_object_header *temp;
 	struct lu_site_bkt_data *bkt;
-	struct cfs_hash_bd bd;
-	struct cfs_hash_bd bd2;
 	struct list_head dispose;
 	int did_sth;
 	unsigned int start = 0;
@@ -409,7 +425,7 @@ int lu_site_purge_objects(const struct lu_env *env, struct lu_site *s,
 	 */
 	if (nr != ~0)
 		start = s->ls_purge_start;
-	bnr = (nr == ~0) ? -1 : nr / (int)CFS_HASH_NBKT(s->ls_obj_hash) + 1;
+	bnr = (nr == ~0) ? -1 : nr / s->ls_bkt_cnt + 1;
 again:
 	/*
 	 * It doesn't make any sense to make purge threads parallel, that can
@@ -421,21 +437,21 @@ int lu_site_purge_objects(const struct lu_env *env, struct lu_site *s,
 		goto out;
 
 	did_sth = 0;
-	cfs_hash_for_each_bucket(s->ls_obj_hash, &bd, i) {
-		if (i < start)
-			continue;
+	for (i = start; i < s->ls_bkt_cnt ; i++) {
 		count = bnr;
-		cfs_hash_bd_lock(s->ls_obj_hash, &bd, 1);
-		bkt = cfs_hash_bd_extra_get(s->ls_obj_hash, &bd);
+		bkt = &s->ls_bkts[i];
+		spin_lock(&bkt->lsb_waitq.lock);
 
 		list_for_each_entry_safe(h, temp, &bkt->lsb_lru, loh_lru) {
 			LASSERT(atomic_read(&h->loh_ref) == 0);
 
-			cfs_hash_bd_get(s->ls_obj_hash, &h->loh_fid, &bd2);
-			LASSERT(bd.bd_bucket == bd2.bd_bucket);
+			LINVRNT(lu_bkt_hash(s, &h->loh_fid) == i);
 
-			cfs_hash_bd_del_locked(s->ls_obj_hash,
-					       &bd2, &h->loh_hash);
+			/* Cannot remove from hash under current spinlock,
+			 * so set flag to stop object from being found
+			 * by htable_lookup().
+			 */
+			set_bit(LU_OBJECT_PURGING, &h->loh_flags);
 			list_move(&h->loh_lru, &dispose);
 			percpu_counter_dec(&s->ls_lru_len_counter);
 			if (did_sth == 0)
@@ -447,14 +463,16 @@ int lu_site_purge_objects(const struct lu_env *env, struct lu_site *s,
 			if (count > 0 && --count == 0)
 				break;
 		}
-		cfs_hash_bd_unlock(s->ls_obj_hash, &bd, 1);
+		spin_unlock(&bkt->lsb_waitq.lock);
 		cond_resched();
 		/*
 		 * Free everything on the dispose list. This is safe against
 		 * races due to the reasons described in lu_object_put().
 		 */
-		while ((h = list_first_entry_or_null(
-				&dispose, struct lu_object_header, loh_lru)) != NULL) {
+		while ((h = list_first_entry_or_null(&dispose,
+						     struct lu_object_header,
+						     loh_lru)) != NULL) {
+			cfs_hash_del(s->ls_obj_hash, &h->loh_fid, &h->loh_hash);
 			list_del_init(&h->loh_lru);
 			lu_object_free(env, lu_object_top(h));
 			lprocfs_counter_incr(s->ls_stats, LU_SS_LRU_PURGED);
@@ -470,7 +488,7 @@ int lu_site_purge_objects(const struct lu_env *env, struct lu_site *s,
 		goto again;
 	}
 	/* race on s->ls_purge_start, but nobody cares */
-	s->ls_purge_start = i % CFS_HASH_NBKT(s->ls_obj_hash);
+	s->ls_purge_start = i % (s->ls_bkt_cnt - 1);
 out:
 	return nr;
 }
@@ -631,12 +649,29 @@ static struct lu_object *htable_lookup(struct lu_site *s,
 	}
 
 	h = container_of(hnode, struct lu_object_header, loh_hash);
-	cfs_hash_get(s->ls_obj_hash, hnode);
-	lprocfs_counter_incr(s->ls_stats, LU_SS_CACHE_HIT);
 	if (!list_empty(&h->loh_lru)) {
+		struct lu_site_bkt_data *bkt;
+
+		bkt = &s->ls_bkts[lu_bkt_hash(s, &h->loh_fid)];
+		spin_lock(&bkt->lsb_waitq.lock);
+		/* Might have just been moved to the dispose list, in which
+		 * case LU_OBJECT_PURGING will be set.  In that case,
+		 * delete it from the hash table immediately.
+		 * When lu_site_purge_objects() tried, it will find it
+		 * isn't there, which is harmless.
+		 */
+		if (test_bit(LU_OBJECT_PURGING, &h->loh_flags)) {
+			spin_unlock(&bkt->lsb_waitq.lock);
+			cfs_hash_bd_del_locked(s->ls_obj_hash, bd, hnode);
+			lprocfs_counter_incr(s->ls_stats, LU_SS_CACHE_MISS);
+			return ERR_PTR(-ENOENT);
+		}
 		list_del_init(&h->loh_lru);
+		spin_unlock(&bkt->lsb_waitq.lock);
 		percpu_counter_dec(&s->ls_lru_len_counter);
 	}
+	cfs_hash_get(s->ls_obj_hash, hnode);
+	lprocfs_counter_incr(s->ls_stats, LU_SS_CACHE_HIT);
 	return lu_object_top(h);
 }
 
@@ -721,8 +756,8 @@ struct lu_object *lu_object_find_at(const struct lu_env *env,
 	if (unlikely(OBD_FAIL_PRECHECK(OBD_FAIL_OBD_ZERO_NLINK_RACE)))
 		lu_site_purge(env, s, -1);
 
+	bkt = &s->ls_bkts[lu_bkt_hash(s, f)];
 	cfs_hash_bd_get(hs, f, &bd);
-	bkt = cfs_hash_bd_extra_get(s->ls_obj_hash, &bd);
 	if (!(conf && conf->loc_flags & LOC_F_NEW)) {
 		cfs_hash_bd_lock(hs, &bd, 1);
 		o = htable_lookup(s, &bd, f, &version);
@@ -1029,7 +1064,6 @@ static void lu_dev_add_linkage(struct lu_site *s, struct lu_device *d)
 int lu_site_init(struct lu_site *s, struct lu_device *top)
 {
 	struct lu_site_bkt_data *bkt;
-	struct cfs_hash_bd bd;
 	unsigned long bits;
 	unsigned long i;
 	char name[16];
@@ -1046,7 +1080,7 @@ int lu_site_init(struct lu_site *s, struct lu_device *top)
 	for (bits = lu_htable_order(top); bits >= LU_SITE_BITS_MIN; bits--) {
 		s->ls_obj_hash = cfs_hash_create(name, bits, bits,
 						 bits - LU_SITE_BKT_BITS,
-						 sizeof(*bkt), 0, 0,
+						 0, 0, 0,
 						 &lu_site_hash_ops,
 						 CFS_HASH_SPIN_BKTLOCK |
 						 CFS_HASH_NO_ITEMREF |
@@ -1062,16 +1096,31 @@ int lu_site_init(struct lu_site *s, struct lu_device *top)
 		return -ENOMEM;
 	}
 
-	cfs_hash_for_each_bucket(s->ls_obj_hash, &bd, i) {
-		bkt = cfs_hash_bd_extra_get(s->ls_obj_hash, &bd);
+	s->ls_bkt_seed = prandom_u32();
+	s->ls_bkt_cnt = max_t(long, 1 << LU_SITE_BKT_BITS,
+			      2 * num_possible_cpus());
+	s->ls_bkt_cnt = roundup_pow_of_two(s->ls_bkt_cnt);
+	s->ls_bkts = kvmalloc_array(s->ls_bkt_cnt, sizeof(*bkt),
+				    GFP_KERNEL | __GFP_ZERO);
+	if (!s->ls_bkts) {
+		cfs_hash_putref(s->ls_obj_hash);
+		s->ls_obj_hash = NULL;
+		s->ls_bkts = NULL;
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < s->ls_bkt_cnt; i++) {
+		bkt = &s->ls_bkts[i];
 		INIT_LIST_HEAD(&bkt->lsb_lru);
 		init_waitqueue_head(&bkt->lsb_waitq);
 	}
 
 	s->ls_stats = lprocfs_alloc_stats(LU_SS_LAST_STAT, 0);
 	if (!s->ls_stats) {
+		kvfree(s->ls_bkts);
 		cfs_hash_putref(s->ls_obj_hash);
 		s->ls_obj_hash = NULL;
+		s->ls_bkts = NULL;
 		return -ENOMEM;
 	}
 
@@ -1119,6 +1168,8 @@ void lu_site_fini(struct lu_site *s)
 		s->ls_obj_hash = NULL;
 	}
 
+	kvfree(s->ls_bkts);
+
 	if (s->ls_top_dev) {
 		s->ls_top_dev->ld_site = NULL;
 		lu_ref_del(&s->ls_top_dev->ld_reference, "site-top", s);
@@ -1878,37 +1929,21 @@ struct lu_site_stats {
 };
 
 static void lu_site_stats_get(const struct lu_site *s,
-			      struct lu_site_stats *stats, int populated)
+			      struct lu_site_stats *stats)
 {
-	struct cfs_hash *hs = s->ls_obj_hash;
-	struct cfs_hash_bd bd;
-	unsigned int i;
+	int cnt = cfs_hash_size_get(s->ls_obj_hash);
 	/*
 	 * percpu_counter_sum_positive() won't accept a const pointer
 	 * as it does modify the struct by taking a spinlock
 	 */
 	struct lu_site *s2 = (struct lu_site *)s;
 
-	stats->lss_busy += cfs_hash_size_get(hs) -
+	stats->lss_busy += cnt -
 		percpu_counter_sum_positive(&s2->ls_lru_len_counter);
-	cfs_hash_for_each_bucket(hs, &bd, i) {
-		struct hlist_head *hhead;
 
-		cfs_hash_bd_lock(hs, &bd, 1);
-		stats->lss_total += cfs_hash_bd_count_get(&bd);
-		stats->lss_max_search = max((int)stats->lss_max_search,
-					    cfs_hash_bd_depmax_get(&bd));
-		if (!populated) {
-			cfs_hash_bd_unlock(hs, &bd, 1);
-			continue;
-		}
-
-		cfs_hash_bd_for_each_hlist(hs, &bd, hhead) {
-			if (!hlist_empty(hhead))
-				stats->lss_populated++;
-		}
-		cfs_hash_bd_unlock(hs, &bd, 1);
-	}
+	stats->lss_total += cnt;
+	stats->lss_max_search = 0;
+	stats->lss_populated = 0;
 }
 
 /*
@@ -2201,7 +2236,7 @@ int lu_site_stats_print(const struct lu_site *s, struct seq_file *m)
 	struct lu_site_stats stats;
 
 	memset(&stats, 0, sizeof(stats));
-	lu_site_stats_get(s, &stats, 1);
+	lu_site_stats_get(s, &stats);
 
 	seq_printf(m, "%d/%d %d/%ld %d %d %d %d %d %d %d\n",
 		   stats.lss_busy,
-- 
1.8.3.1


* [lustre-devel] [PATCH 581/622] lustre: llite: replace lli_trunc_sem
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (579 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 580/622] lustre: u_object: factor out extra per-bucket data James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 582/622] lnet: Fix source specified route selection James Simmons
                   ` (41 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: NeilBrown <neilb@suse.com>

lli_trunc_sem can lead to a deadlock.

vvp_io_read_start takes lli_trunc_sem, and can take
mmap_sem in the direct i/o case, via
generic_file_read_iter->ll_direct_IO->get_user_pages_unlocked

vvp_io_fault_start is called with mmap_sem held (taken in
the kernel page fault code), and takes lli_trunc_sem.

These aren't necessarily the same mmap_sem, but can be if
you mmap a lustre file, then read into that mapped memory
from the file.

These are both 'down_read' calls on lli_trunc_sem so they
don't directly conflict, but if vvp_io_setattr_start() is
called to truncate the file between these, it does
'down_write' on lli_trunc_sem.  As semaphores are queued,
this down_write blocks subsequent reads.

This means if the page fault has taken the mmap_sem,
but not yet the lli_trunc_sem in vvp_io_fault_start,
it will wait behind the lli_trunc_sem down_write from
vvp_io_setattr_start.

At the same time, vvp_io_read_start is holding the
lli_trunc_sem and waiting for the mmap_sem, which will not
be released because vvp_io_fault_start cannot get the
lli_trunc_sem because the setattr 'down_write' operation is
queued in front of it.

Solve this by replacing lli_trunc_sem with a hand-coded semaphore using
atomic counters and wait_var_event().  This allows a
special down_read_nowait which ignores waiting down_write
operations.  This combined with waking up all waiters at
once guarantees that down_read_nowait can always 'join'
another down_read, guaranteeing our ability to take the
semaphore twice for read and avoiding the deadlock.

I'd like there to be a better way to fix this, but I
haven't found it yet.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12460
Lustre-commit: e5914a61ac77 ("LU-12460 llite: replace lli_trunc_sem")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35271
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_internal.h | 93 +++++++++++++++++++++++++++++++++++++++-
 fs/lustre/llite/llite_lib.c      |  2 +-
 fs/lustre/llite/vvp_io.c         | 14 +++---
 3 files changed, 100 insertions(+), 9 deletions(-)

diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index def4df0..b7b418f 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -105,6 +105,16 @@ enum ll_file_flags {
 	LLIF_PROJECT_INHERIT	= 3,
 };
 
+/* See comment on trunc_sem_down_read_nowait */
+struct ll_trunc_sem {
+	/* when positive, this is a count of readers, when -1, it indicates
+	 * the semaphore is held for write, and 0 is unlocked
+	 */
+	atomic_t	ll_trunc_readers;
+	/* this tracks a count of waiting writers */
+	atomic_t	ll_trunc_waiters;
+};
+
 struct ll_inode_info {
 	u32				lli_inode_magic;
 
@@ -178,7 +188,7 @@ struct ll_inode_info {
 		struct {
 			struct mutex			lli_size_mutex;
 			char			       *lli_symlink_name;
-			struct rw_semaphore		lli_trunc_sem;
+			struct ll_trunc_sem		lli_trunc_sem;
 			struct range_lock_tree		lli_write_tree;
 
 			struct rw_semaphore		lli_glimpse_sem;
@@ -253,6 +263,87 @@ struct ll_inode_info {
 	struct list_head		lli_xattrs;/* ll_xattr_entry->xe_list */
 };
 
+static inline void ll_trunc_sem_init(struct ll_trunc_sem *sem)
+{
+	atomic_set(&sem->ll_trunc_readers, 0);
+	atomic_set(&sem->ll_trunc_waiters, 0);
+}
+
+/* This version of down read ignores waiting writers, meaning if the semaphore
+ * is already held for read, this down_read will 'join' that reader and also
+ * take the semaphore.
+ *
+ * This lets us avoid an unusual deadlock.
+ *
+ * We must take lli_trunc_sem in read mode on entry in to various i/o paths
+ * in Lustre, in order to exclude truncates.  Some of these paths then need to
+ * take the mmap_sem, while still holding the trunc_sem.  The problem is that
+ * page faults hold the mmap_sem when calling in to Lustre, and then must also
+ * take the trunc_sem to exclude truncate.
+ *
+ * This means the locking order for trunc_sem and mmap_sem is sometimes AB,
+ * sometimes BA.  This is almost OK because in both cases, we take the trunc
+ * sem for read, so it doesn't block.
+ *
+ * However, if a write mode user (truncate, a setattr op) arrives in the
+ * middle of this, the second reader on the truncate_sem will wait behind that
+ * writer.
+ *
+ * So we have, on our truncate sem, in order (where 'reader' and 'writer' refer
+ * to the mode in which they take the semaphore):
+ * reader (holding mmap_sem, needs truncate_sem)
+ * writer
+ * reader (holding truncate sem, waiting for mmap_sem)
+ *
+ * And so the readers deadlock.
+ *
+ * The solution is this modified semaphore, where this down_read ignores
+ * waiting write operations, and all waiters are woken up at once, so readers
+ * using down_read_nowait cannot get stuck behind waiting writers, regardless
+ * of the order they arrived in.
+ *
+ * down_read_nowait is only used in the page fault case, where we already hold
+ * the mmap_sem.  This is because otherwise repeated read and write operations
+ * (which take the truncate sem) could prevent a truncate from ever starting.
+ * This could still happen with page faults, but without an even more complex
+ * mechanism, this is unavoidable.
+ *
+ * LU-12460
+ */
+static inline void trunc_sem_down_read_nowait(struct ll_trunc_sem *sem)
+{
+	wait_var_event(&sem->ll_trunc_readers,
+		       atomic_inc_unless_negative(&sem->ll_trunc_readers));
+}
+
+static inline void trunc_sem_down_read(struct ll_trunc_sem *sem)
+{
+	wait_var_event(&sem->ll_trunc_readers,
+		       atomic_read(&sem->ll_trunc_waiters) == 0 &&
+		       atomic_inc_unless_negative(&sem->ll_trunc_readers));
+}
+
+static inline void trunc_sem_up_read(struct ll_trunc_sem *sem)
+{
+	if (atomic_dec_return(&sem->ll_trunc_readers) == 0 &&
+	    atomic_read(&sem->ll_trunc_waiters))
+		wake_up_var(&sem->ll_trunc_readers);
+}
+
+static inline void trunc_sem_down_write(struct ll_trunc_sem *sem)
+{
+	atomic_inc(&sem->ll_trunc_waiters);
+	wait_var_event(&sem->ll_trunc_readers,
+		       atomic_cmpxchg(&sem->ll_trunc_readers, 0, -1) == 0);
+	atomic_dec(&sem->ll_trunc_waiters);
+}
+
+static inline void trunc_sem_up_write(struct ll_trunc_sem *sem)
+{
+	atomic_set(&sem->ll_trunc_readers, 0);
+	wake_up_var(&sem->ll_trunc_readers);
+}
+
 static inline u32 ll_layout_version_get(struct ll_inode_info *lli)
 {
 	u32 gen;
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 7e128f0..f083a90 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -971,7 +971,7 @@ void ll_lli_init(struct ll_inode_info *lli)
 	} else {
 		mutex_init(&lli->lli_size_mutex);
 		lli->lli_symlink_name = NULL;
-		init_rwsem(&lli->lli_trunc_sem);
+		ll_trunc_sem_init(&lli->lli_trunc_sem);
 		range_lock_tree_init(&lli->lli_write_tree);
 		init_rwsem(&lli->lli_glimpse_sem);
 		lli->lli_glimpse_time = ktime_set(0, 0);
diff --git a/fs/lustre/llite/vvp_io.c b/fs/lustre/llite/vvp_io.c
index b3f628c..259b14a 100644
--- a/fs/lustre/llite/vvp_io.c
+++ b/fs/lustre/llite/vvp_io.c
@@ -682,7 +682,7 @@ static int vvp_io_setattr_start(const struct lu_env *env,
 	struct ll_inode_info *lli = ll_i2info(inode);
 
 	if (cl_io_is_trunc(io)) {
-		down_write(&lli->lli_trunc_sem);
+		trunc_sem_down_write(&lli->lli_trunc_sem);
 		inode_lock(inode);
 		inode_dio_wait(inode);
 	} else {
@@ -708,7 +708,7 @@ static void vvp_io_setattr_end(const struct lu_env *env,
 		 */
 		vvp_do_vmtruncate(inode, io->u.ci_setattr.sa_attr.lvb_size);
 		inode_unlock(inode);
-		up_write(&lli->lli_trunc_sem);
+		trunc_sem_up_write(&lli->lli_trunc_sem);
 	} else {
 		inode_unlock(inode);
 	}
@@ -747,7 +747,7 @@ static int vvp_io_read_start(const struct lu_env *env,
 
 	CDEBUG(D_VFSTRACE, "read: -> [%lli, %lli)\n", pos, pos + cnt);
 
-	down_read(&lli->lli_trunc_sem);
+	trunc_sem_down_read(&lli->lli_trunc_sem);
 
 	if (io->ci_async_readahead) {
 		file_accessed(file);
@@ -1076,7 +1076,7 @@ static int vvp_io_write_start(const struct lu_env *env,
 	size_t written = 0;
 	ssize_t result = 0;
 
-	down_read(&lli->lli_trunc_sem);
+	trunc_sem_down_read(&lli->lli_trunc_sem);
 
 	if (!can_populate_pages(env, io, inode))
 		return 0;
@@ -1178,7 +1178,7 @@ static void vvp_io_rw_end(const struct lu_env *env,
 	struct inode *inode = vvp_object_inode(ios->cis_obj);
 	struct ll_inode_info *lli = ll_i2info(inode);
 
-	up_read(&lli->lli_trunc_sem);
+	trunc_sem_up_read(&lli->lli_trunc_sem);
 }
 
 static int vvp_io_kernel_fault(struct vvp_fault_io *cfio)
@@ -1243,7 +1243,7 @@ static int vvp_io_fault_start(const struct lu_env *env,
 	loff_t size;
 	pgoff_t last_index;
 
-	down_read(&lli->lli_trunc_sem);
+	trunc_sem_down_read_nowait(&lli->lli_trunc_sem);
 
 	/* offset of the last byte on the page */
 	offset = cl_offset(obj, fio->ft_index + 1) - 1;
@@ -1400,7 +1400,7 @@ static void vvp_io_fault_end(const struct lu_env *env,
 
 	CLOBINVRNT(env, ios->cis_io->ci_obj,
 		   vvp_object_invariant(ios->cis_io->ci_obj));
-	up_read(&lli->lli_trunc_sem);
+	trunc_sem_up_read(&lli->lli_trunc_sem);
 }
 
 static int vvp_io_fsync_start(const struct lu_env *env,
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 582/622] lnet: Fix source specified route selection
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (580 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 581/622] lustre: llite: replace lli_trunc_sem James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 583/622] lustre: uapi: turn struct lustre_nfs_fid to userland fhandle James Simmons
                   ` (40 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

If lnet_send() is called with a specific src_nid, but
rtr_nid == LNET_NID_ANY and the message needs to be routed, then we
need to ensure that the lnet_peer_ni of our next hop is on the same
network as the lnet_ni associated with the src_nid. Otherwise we
may end up choosing an lnet_peer_ni that cannot be reached from
the specified source.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12919
Lustre-commit: f0aa632d4255 ("LU-12919 lnet: Fix source specified route selection")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/36622
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-move.c | 41 +++++++++++++++++++++++++++++------------
 1 file changed, 29 insertions(+), 12 deletions(-)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 269b2d5..ca292a6 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -1290,7 +1290,7 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 }
 
 static struct lnet_route *
-lnet_find_route_locked(struct lnet_remotenet *rnet,
+lnet_find_route_locked(struct lnet_remotenet *rnet, u32 src_net,
 		       struct lnet_route **prev_route,
 		       struct lnet_peer_ni **gwni)
 {
@@ -1299,6 +1299,8 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	struct lnet_route *last_route;
 	struct lnet_route *route;
 	int rc;
+	u32 restrict_net;
+	u32 any_net = LNET_NIDNET(LNET_NID_ANY);
 
 	best_route = NULL;
 	last_route = NULL;
@@ -1306,14 +1308,23 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 		if (!lnet_is_route_alive(route))
 			continue;
 
+		/* If the src_net is specified then we need to find an lpni
+		 * on that network
+		 */
+		restrict_net = src_net == any_net ? route->lr_lnet : src_net;
 		if (!best_route) {
-			best_route = route;
-			last_route = route;
-			best_gw_ni = lnet_find_best_lpni_on_net(NULL,
-								LNET_NID_ANY,
-								route->lr_gateway,
-								route->lr_lnet);
-			LASSERT(best_gw_ni);
+			lpni = lnet_find_best_lpni_on_net(NULL, LNET_NID_ANY,
+							  route->lr_gateway,
+							  restrict_net);
+			if (lpni) {
+				best_route = route;
+				last_route = route;
+				best_gw_ni = lpni;
+			} else {
+				CERROR("Gateway %s does not have a peer NI on net %s\n",
+				       libcfs_nid2str(route->lr_gateway->lp_primary_nid),
+				       libcfs_net2str(restrict_net));
+			}
 			continue;
 		}
 
@@ -1327,8 +1338,13 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 
 		lpni = lnet_find_best_lpni_on_net(NULL, LNET_NID_ANY,
 						  route->lr_gateway,
-						  route->lr_lnet);
-		LASSERT(lpni);
+						  restrict_net);
+		if (!lpni) {
+			CERROR("Gateway %s does not have a peer NI on net %s\n",
+			       libcfs_nid2str(route->lr_gateway->lp_primary_nid),
+			       libcfs_net2str(restrict_net));
+			continue;
+		}
 
 		if (rc == 1) {
 			best_route = route;
@@ -1868,8 +1884,9 @@ struct lnet_ni *
 			return -EHOSTUNREACH;
 		}
 
-		best_route = lnet_find_route_locked(best_rnet, &last_route,
-						    &gwni);
+		best_route = lnet_find_route_locked(best_rnet,
+						    LNET_NIDNET(src_nid),
+						    &last_route, &gwni);
 		if (!best_route) {
 			CERROR("no route to %s from %s\n",
 			       libcfs_nid2str(dst_nid),
-- 
1.8.3.1


* [lustre-devel] [PATCH 583/622] lustre: uapi: turn struct lustre_nfs_fid to userland fhandle

From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Quentin Bouget <quentin.bouget@cea.fr>

Rename struct lustre_nfs_fid to struct lustre_file_handle and move
it to UAPI header lustre_user.h so we can use it with the fhandle
API such as name_to_handle_at().

WC-bug-id: https://jira.whamcloud.com/browse/LU-12806
Lustre-commit: 7ff384eee194 ("LU-12806 llapi: use name_to_handle_at in llapi_fd2fid")
Signed-off-by: Quentin Bouget <quentin.bouget@cea.fr>
Reviewed-on: https://review.whamcloud.com/36292
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_nfs.c             | 23 +++++++++--------------
 include/uapi/linux/lustre/lustre_user.h |  6 ++++++
 2 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/fs/lustre/llite/llite_nfs.c b/fs/lustre/llite/llite_nfs.c
index 2ac5ad9..a57ab51 100644
--- a/fs/lustre/llite/llite_nfs.c
+++ b/fs/lustre/llite/llite_nfs.c
@@ -110,11 +110,6 @@ struct inode *search_inode_for_lustre(struct super_block *sb,
 	return inode;
 }
 
-struct lustre_nfs_fid {
-	struct lu_fid lnf_child;
-	struct lu_fid lnf_parent;
-};
-
 static struct dentry *
 ll_iget_for_nfs(struct super_block *sb,
 		struct lu_fid *fid, struct lu_fid *parent)
@@ -177,8 +172,8 @@ struct lustre_nfs_fid {
 static int ll_encode_fh(struct inode *inode, u32 *fh, int *plen,
 			struct inode *parent)
 {
-	int fileid_len = sizeof(struct lustre_nfs_fid) / 4;
-	struct lustre_nfs_fid *nfs_fid = (void *)fh;
+	int fileid_len = sizeof(struct lustre_file_handle) / 4;
+	struct lustre_file_handle *lfh = (void *)fh;
 
 	CDEBUG(D_INFO, "%s: encoding for (" DFID ") maxlen=%d minlen=%d\n",
 	       ll_i2sbi(inode)->ll_fsname,
@@ -189,11 +184,11 @@ static int ll_encode_fh(struct inode *inode, u32 *fh, int *plen,
 		return FILEID_INVALID;
 	}
 
-	nfs_fid->lnf_child = *ll_inode2fid(inode);
+	lfh->lfh_child = *ll_inode2fid(inode);
 	if (parent)
-		nfs_fid->lnf_parent = *ll_inode2fid(parent);
+		lfh->lfh_parent = *ll_inode2fid(parent);
 	else
-		fid_zero(&nfs_fid->lnf_parent);
+		fid_zero(&lfh->lfh_parent);
 	*plen = fileid_len;
 
 	return FILEID_LUSTRE;
@@ -264,23 +259,23 @@ static int ll_get_name(struct dentry *dentry, char *name,
 static struct dentry *ll_fh_to_dentry(struct super_block *sb, struct fid *fid,
 				      int fh_len, int fh_type)
 {
-	struct lustre_nfs_fid *nfs_fid = (struct lustre_nfs_fid *)fid;
+	struct lustre_file_handle *lfh = (struct lustre_file_handle *)fid;
 
 	if (fh_type != FILEID_LUSTRE)
 		return ERR_PTR(-EPROTO);
 
-	return ll_iget_for_nfs(sb, &nfs_fid->lnf_child, &nfs_fid->lnf_parent);
+	return ll_iget_for_nfs(sb, &lfh->lfh_child, &lfh->lfh_parent);
 }
 
 static struct dentry *ll_fh_to_parent(struct super_block *sb, struct fid *fid,
 				      int fh_len, int fh_type)
 {
-	struct lustre_nfs_fid *nfs_fid = (struct lustre_nfs_fid *)fid;
+	struct lustre_file_handle *lfh = (struct lustre_file_handle *)fid;
 
 	if (fh_type != FILEID_LUSTRE)
 		return ERR_PTR(-EPROTO);
 
-	return ll_iget_for_nfs(sb, &nfs_fid->lnf_parent, NULL);
+	return ll_iget_for_nfs(sb, &lfh->lfh_parent, NULL);
 }
 
 int ll_dir_get_parent_fid(struct inode *dir, struct lu_fid *parent_fid)
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 12b1f78..1c36114 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -164,6 +164,12 @@ static inline bool fid_is_zero(const struct lu_fid *fid)
 	return !fid->f_seq && !fid->f_oid;
 }
 
+/* The data name_to_handle_at() places in a struct file_handle (at f_handle) */
+struct lustre_file_handle {
+	struct lu_fid lfh_child;
+	struct lu_fid lfh_parent;
+};
+
 struct ost_layout {
 	__u32	ol_stripe_size;
 	__u32	ol_stripe_count;
-- 
1.8.3.1


* [lustre-devel] [PATCH 584/622] lustre: uapi: LU-12521 llapi: add separate fsname and instance API
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

For Lustre, the kernel-internal cfg instance is represented by a
16-digit numeric value that resembles, but is not, a UUID. This value
is exposed to userland, since it is used to generate the sysfs
directory tree representing virtual devices. Expose this fixed length
for both kernel and userland use.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12521
Lustre-commit: 00d14521ca1c ("LU-12521 llapi: add separate fsname and instance API")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35451
Reviewed-by: Olaf Faaland-LLNL <faaland1@llnl.gov>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/obd_config.c         | 2 +-
 include/uapi/linux/lustre/lustre_user.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/obdclass/obd_config.c b/fs/lustre/obdclass/obd_config.c
index 97cb8c1..0ccdf5f 100644
--- a/fs/lustre/obdclass/obd_config.c
+++ b/fs/lustre/obdclass/obd_config.c
@@ -1374,7 +1374,7 @@ int class_config_llog_handler(const struct lu_env *env,
 		    lcfg->lcfg_command != LCFG_SPTLRPC_CONF &&
 		    LUSTRE_CFG_BUFLEN(lcfg, 0) > 0) {
 			inst_len = LUSTRE_CFG_BUFLEN(lcfg, 0) +
-				   sizeof(clli->cfg_instance) * 2 + 4;
+				   LUSTRE_MAXINSTANCE + 4;
 			inst_name = kasprintf(GFP_NOFS, "%s-%px",
 					      lustre_cfg_string(lcfg, 0),
 					      clli->cfg_instance);
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 1c36114..08589e6 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -829,6 +829,7 @@ static inline char *obd_uuid2str(const struct obd_uuid *uuid)
 }
 
 #define LUSTRE_MAXFSNAME 8
+#define LUSTRE_MAXINSTANCE 16
 
 /* Extract fsname from uuid (or target name) of a target
  * e.g. (myfs-OST0007_UUID -> myfs)
-- 
1.8.3.1


* [lustre-devel] [PATCH 585/622] lnet: socklnd: initialize the_ksocklnd at compile-time.
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

All other lnds initialize this struct at compile-time.
It is best for socklnd to do so too.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12678
Lustre-commit: b30930a242c6 ("LU-12678 socklnd: initialize the_ksocklnd at compile-time.")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/36831
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/socklnd/socklnd.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/net/lnet/klnds/socklnd/socklnd.c b/net/lnet/klnds/socklnd/socklnd.c
index 9a19a3f..016e005 100644
--- a/net/lnet/klnds/socklnd/socklnd.c
+++ b/net/lnet/klnds/socklnd/socklnd.c
@@ -2804,6 +2804,18 @@ static void __exit ksocklnd_exit(void)
 	lnet_unregister_lnd(&the_ksocklnd);
 }
 
+static struct lnet_lnd the_ksocklnd = {
+	.lnd_type		= SOCKLND,
+	.lnd_startup		= ksocknal_startup,
+	.lnd_shutdown		= ksocknal_shutdown,
+	.lnd_ctl		= ksocknal_ctl,
+	.lnd_send		= ksocknal_send,
+	.lnd_recv		= ksocknal_recv,
+	.lnd_notify_peer_down	= ksocknal_notify_gw_down,
+	.lnd_query		= ksocknal_query,
+	.lnd_accept		= ksocknal_accept,
+};
+
 static int __init ksocklnd_init(void)
 {
 	int rc;
@@ -2812,17 +2824,6 @@ static int __init ksocklnd_init(void)
 	BUILD_BUG_ON(SOCKLND_CONN_NTYPES > 4);
 	BUILD_BUG_ON(SOCKLND_CONN_ACK != SOCKLND_CONN_BULK_IN);
 
-	/* initialize the_ksocklnd */
-	the_ksocklnd.lnd_type = SOCKLND;
-	the_ksocklnd.lnd_startup = ksocknal_startup;
-	the_ksocklnd.lnd_shutdown = ksocknal_shutdown;
-	the_ksocklnd.lnd_ctl = ksocknal_ctl;
-	the_ksocklnd.lnd_send = ksocknal_send;
-	the_ksocklnd.lnd_recv = ksocknal_recv;
-	the_ksocklnd.lnd_notify_peer_down = ksocknal_notify_gw_down;
-	the_ksocklnd.lnd_query = ksocknal_query;
-	the_ksocklnd.lnd_accept = ksocknal_accept;
-
 	rc = ksocknal_tunables_init();
 	if (rc)
 		return rc;
-- 
1.8.3.1


* [lustre-devel] [PATCH 586/622] lnet: remove locking protection ln_testprotocompat
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

lnet_net_lock(LNET_LOCK_EX) is a heavy-weight lock that is not
necessary here.  The bits in this field are only set rarely - via an
ioctl - and the pattern for reading and clearing them exactly
matches test_and_clear_bit().  So change the field to "unsigned
long" (so test_and_clear_bit() can be used), and use
test_and_clear_bit(), discarding all other locking.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12678
Lustre-commit: 624364420970 ("LU-12678 lnet: remove locking protection ln_testprotocompat")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/36856
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-types.h         |  2 +-
 net/lnet/klnds/socklnd/socklnd_proto.c | 17 ++++-------------
 net/lnet/lnet/acceptor.c               | 11 +++--------
 net/lnet/lnet/api-ni.c                 |  2 --
 4 files changed, 8 insertions(+), 24 deletions(-)

diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index 99ed87a..9055da9 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -1134,7 +1134,7 @@ struct lnet {
 	struct lnet_lnd			*ln_lnds[NUM_LNDS];
 
 	/* test protocol compatibility flags */
-	int				ln_testprotocompat;
+	unsigned long			ln_testprotocompat;
 
 	/*
 	 * 0 - load the NIs from the mod params
diff --git a/net/lnet/klnds/socklnd/socklnd_proto.c b/net/lnet/klnds/socklnd/socklnd_proto.c
index 887ed2d..195c44f 100644
--- a/net/lnet/klnds/socklnd/socklnd_proto.c
+++ b/net/lnet/klnds/socklnd/socklnd_proto.c
@@ -484,16 +484,11 @@
 
 	if (the_lnet.ln_testprotocompat) {
 		/* single-shot proto check */
-		lnet_net_lock(LNET_LOCK_EX);
-		if (the_lnet.ln_testprotocompat & 1) {
+		if (test_and_clear_bit(0, &the_lnet.ln_testprotocompat))
 			hmv->version_major++;   /* just different! */
-			the_lnet.ln_testprotocompat &= ~1;
-		}
-		if (the_lnet.ln_testprotocompat & 2) {
+
+		if (test_and_clear_bit(1, &the_lnet.ln_testprotocompat))
 			hmv->magic = LNET_PROTO_MAGIC;
-			the_lnet.ln_testprotocompat &= ~2;
-		}
-		lnet_net_unlock(LNET_LOCK_EX);
 	}
 
 	hdr->src_nid = cpu_to_le64(hello->kshm_src_nid);
@@ -541,12 +536,8 @@
 
 	if (the_lnet.ln_testprotocompat) {
 		/* single-shot proto check */
-		lnet_net_lock(LNET_LOCK_EX);
-		if (the_lnet.ln_testprotocompat & 1) {
+		if (test_and_clear_bit(0, &the_lnet.ln_testprotocompat))
 			hello->kshm_version++;   /* just different! */
-			the_lnet.ln_testprotocompat &= ~1;
-		}
-		lnet_net_unlock(LNET_LOCK_EX);
 	}
 
 	rc = lnet_sock_write(sock, hello, offsetof(struct ksock_hello_msg, kshm_ips),
diff --git a/net/lnet/lnet/acceptor.c b/net/lnet/lnet/acceptor.c
index acd1d75..c6a1835 100644
--- a/net/lnet/lnet/acceptor.c
+++ b/net/lnet/lnet/acceptor.c
@@ -174,16 +174,11 @@
 
 		if (the_lnet.ln_testprotocompat) {
 			/* single-shot proto check */
-			lnet_net_lock(LNET_LOCK_EX);
-			if (the_lnet.ln_testprotocompat & 4) {
+			if (test_and_clear_bit(2, &the_lnet.ln_testprotocompat))
 				cr.acr_version++;
-				the_lnet.ln_testprotocompat &= ~4;
-			}
-			if (the_lnet.ln_testprotocompat & 8) {
+
+			if (test_and_clear_bit(3, &the_lnet.ln_testprotocompat))
 				cr.acr_magic = LNET_PROTO_MAGIC;
-				the_lnet.ln_testprotocompat &= ~8;
-			}
-			lnet_net_unlock(LNET_LOCK_EX);
 		}
 
 		rc = lnet_sock_write(sock, &cr, sizeof(cr), accept_timeout);
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index cd95bdd..0ca8bef 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -3842,9 +3842,7 @@ u32 lnet_get_dlc_seq_locked(void)
 		return 0;
 
 	case IOC_LIBCFS_TESTPROTOCOMPAT:
-		lnet_net_lock(LNET_LOCK_EX);
 		the_lnet.ln_testprotocompat = data->ioc_flags;
-		lnet_net_unlock(LNET_LOCK_EX);
 		return 0;
 
 	case IOC_LIBCFS_LNET_FAULT:
-- 
1.8.3.1


* [lustre-devel] [PATCH 587/622] lustre: ptlrpc: suppress connection restored message
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Alex Zhuravlev <bzzz@whamcloud.com>

Suppress the connection-restored console message if the reconnection
happens on an idling connection.

Fixes: 4b102da53ad ("lustre: ptlrpc: idle connections can disconnect")
WC-bug-id: https://jira.whamcloud.com/browse/LU-13098
Lustre-commit: 7aa58847b94d ("LU-13098 ptlrpc: supress connection restored message")
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/37086
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_import.h |  8 ++++++--
 fs/lustre/ptlrpc/import.c         | 25 ++++++++++++++++---------
 2 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/fs/lustre/include/lustre_import.h b/fs/lustre/include/lustre_import.h
index 501a896..5d548a6 100644
--- a/fs/lustre/include/lustre_import.h
+++ b/fs/lustre/include/lustre_import.h
@@ -304,8 +304,12 @@ struct obd_import {
 					imp_connect_tried:1,
 					/* connected but not FULL yet */
 					imp_connected:1,
-				  /* grant shrink disabled */
-				  imp_grant_shrink_disabled:1;
+					/* grant shrink disabled */
+					imp_grant_shrink_disabled:1,
+					/* to suppress LCONSOLE() at
+					 * conn.restore
+					 */
+					imp_was_idle:1;
 
 	u32				imp_connect_op;
 	u32				imp_idle_timeout;
diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index 028dd65..23dac39 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -1519,21 +1519,22 @@ int ptlrpc_import_recovery_state_machine(struct obd_import *imp)
 			import_set_state(imp, LUSTRE_IMP_RECOVER);
 
 	if (imp->imp_state == LUSTRE_IMP_RECOVER) {
-		CDEBUG(D_HA, "reconnected to %s@%s\n",
-		       obd2cli_tgt(imp->imp_obd),
-		       imp->imp_connection->c_remote_uuid.uuid);
+		struct ptlrpc_connection *conn = imp->imp_connection;
 
 		rc = ptlrpc_resend(imp);
 		if (rc)
 			goto out;
 		ptlrpc_activate_import(imp, true);
 
-		deuuidify(obd2cli_tgt(imp->imp_obd), NULL,
-			  &target_start, &target_len);
-		LCONSOLE_INFO("%s: Connection restored to %.*s (at %s)\n",
-			      imp->imp_obd->obd_name,
-			      target_len, target_start,
-			      obd_import_nid2str(imp));
+		CDEBUG_LIMIT(imp->imp_was_idle ?
+				imp->imp_idle_debug : D_CONSOLE,
+			     "%s: Connection restored to %s (at %s)\n",
+			     imp->imp_obd->obd_name,
+			     obd_uuid2str(&conn->c_remote_uuid),
+			     obd_import_nid2str(imp));
+		spin_lock(&imp->imp_lock);
+		imp->imp_was_idle = 0;
+		spin_unlock(&imp->imp_lock);
 	}
 
 	if (imp->imp_state == LUSTRE_IMP_FULL) {
@@ -1749,6 +1750,12 @@ int ptlrpc_disconnect_and_idle_import(struct obd_import *imp)
 	CDEBUG_LIMIT(imp->imp_idle_debug, "%s: disconnect after %llus idle\n",
 		     imp->imp_obd->obd_name,
 		     ktime_get_real_seconds() - imp->imp_last_reply_time);
+
+	/* don't make noise at reconnection */
+	spin_lock(&imp->imp_lock);
+	imp->imp_was_idle = 1;
+	spin_unlock(&imp->imp_lock);
+
 	req->rq_interpret_reply = ptlrpc_disconnect_idle_interpret;
 	ptlrpcd_add_req(req);
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 588/622] lustre: llite: fix deadlock in ll_update_lsm_md()
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

A deadlock may happen in the following scenario: a lookup process
calls ll_update_lsm_md(), finds lli->lli_lsm_md is NULL, and then
takes down_write(&lli->lli_lsm_sem). But another lookup process may
initialize lli->lli_lsm_md after that check and before the write lock
is taken; the first lookup process then calls
up_read(&lli->lli_lsm_sem) and returns, so the write lock is never
released, which deadlocks all subsequent lookups.

Rearrange the code to simplify the locking:
1. take read lock.
2. if lsm was initialized and unchanged, release read lock and return.
3. otherwise release read lock and take write lock.
4. free current lsm and initialize with new lsm.
5. release write lock.
6. initialize stripes with read lock.

WC-bug-id: https://jira.whamcloud.com/browse/LU-13121
Lustre-commit: 3746550282c8 ("LU-13121 llite: fix deadlock in ll_update_lsm_md()")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/37182
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_lib.c | 107 +++++++++++++++++++++-----------------------
 1 file changed, 50 insertions(+), 57 deletions(-)

diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index f083a90..1a8a5ec 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -1401,6 +1401,7 @@ static int ll_update_lsm_md(struct inode *inode, struct lustre_md *md)
 {
 	struct ll_inode_info *lli = ll_i2info(inode);
 	struct lmv_stripe_md *lsm = md->lmv;
+	struct cl_attr	*attr;
 	int rc = 0;
 
 	LASSERT(S_ISDIR(inode->i_mode));
@@ -1422,74 +1423,66 @@ static int ll_update_lsm_md(struct inode *inode, struct lustre_md *md)
 	 * normally dir layout doesn't change, only take read lock to check
 	 * that to avoid blocking other MD operations.
 	 */
-	if (lli->lli_lsm_md)
-		down_read(&lli->lli_lsm_sem);
-	else
-		down_write(&lli->lli_lsm_sem);
+	down_read(&lli->lli_lsm_sem);
 
-	/*
-	 * if dir layout mismatch, check whether version is increased, which
-	 * means layout is changed, this happens in dir migration and lfsck.
+	/* some current lookup initialized lsm, and unchanged */
+	if (lli->lli_lsm_md && lsm_md_eq(lli->lli_lsm_md, lsm))
+		goto unlock;
+
+	/* if dir layout doesn't match, check whether version is increased,
+	 * which means layout is changed, this happens in dir split/merge and
+	 * lfsck.
 	 *
 	 * foreign LMV should not change.
 	 */
-	if (lli->lli_lsm_md && !lsm_md_eq(lli->lli_lsm_md, lsm)) {
-		if (lmv_dir_striped(lli->lli_lsm_md) &&
-		    lsm->lsm_md_layout_version <=
-		    lli->lli_lsm_md->lsm_md_layout_version) {
-			CERROR("%s: " DFID " dir layout mismatch:\n",
-			       ll_i2sbi(inode)->ll_fsname,
-			       PFID(&lli->lli_fid));
-			lsm_md_dump(D_ERROR, lli->lli_lsm_md);
-			lsm_md_dump(D_ERROR, lsm);
-			rc = -EINVAL;
-			goto unlock;
-		}
-
-		/* layout changed, switch to write lock */
-		up_read(&lli->lli_lsm_sem);
-		down_write(&lli->lli_lsm_sem);
-		ll_dir_clear_lsm_md(inode);
+	if (lli->lli_lsm_md && lmv_dir_striped(lli->lli_lsm_md) &&
+	    lsm->lsm_md_layout_version <=
+	    lli->lli_lsm_md->lsm_md_layout_version) {
+		CERROR("%s: " DFID " dir layout mismatch:\n",
+		       ll_i2sbi(inode)->ll_fsname, PFID(&lli->lli_fid));
+		lsm_md_dump(D_ERROR, lli->lli_lsm_md);
+		lsm_md_dump(D_ERROR, lsm);
+		rc = -EINVAL;
+		goto unlock;
 	}
 
-	/* set directory layout */
-	if (!lli->lli_lsm_md) {
-		struct cl_attr *attr;
+	up_read(&lli->lli_lsm_sem);
+	down_write(&lli->lli_lsm_sem);
+	/* clear existing lsm */
+	if (lli->lli_lsm_md) {
+		lmv_free_memmd(lli->lli_lsm_md);
+		lli->lli_lsm_md = NULL;
+	}
 
-		rc = ll_init_lsm_md(inode, md);
-		up_write(&lli->lli_lsm_sem);
-		if (rc)
-			return rc;
+	rc = ll_init_lsm_md(inode, md);
+	up_write(&lli->lli_lsm_sem);
 
-		/*
-		 * set lsm_md to NULL, so the following free lustre_md
-		 * will not free this lsm
-		 */
-		md->lmv = NULL;
+	if (rc)
+		return rc;
 
-		/*
-		 * md_merge_attr() may take long, since lsm is already set,
-		 * switch to read lock.
-		 */
-		down_read(&lli->lli_lsm_sem);
+	/* set md->lmv to NULL, so the following free lustre_md will not free
+	 * this lsm.
+	 */
+	md->lmv = NULL;
 
-		if (!lmv_dir_striped(lli->lli_lsm_md))
-			goto unlock;
+	/* md_merge_attr() may take long, since lsm is already set, switch to
+	 * read lock.
+	 */
+	down_read(&lli->lli_lsm_sem);
 
-		attr = kzalloc(sizeof(*attr), GFP_NOFS);
-		if (!attr) {
-			rc = -ENOMEM;
-			goto unlock;
-		}
+	if (!lmv_dir_striped(lli->lli_lsm_md))
+		goto unlock;
 
-		/* validate the lsm */
-		rc = md_merge_attr(ll_i2mdexp(inode), lsm, attr,
-				   ll_md_blocking_ast);
-		if (rc) {
-			kfree(attr);
-			goto unlock;
-		}
+	attr = kzalloc(sizeof(*attr), GFP_NOFS);
+	if (!attr) {
+		rc = -ENOMEM;
+		goto unlock;
+	}
 
+	/* validate the lsm */
+	rc = md_merge_attr(ll_i2mdexp(inode), lli->lli_lsm_md, attr,
+			   ll_md_blocking_ast);
+	if (!rc) {
 		if (md->body->mbo_valid & OBD_MD_FLNLINK)
 			md->body->mbo_nlink = attr->cat_nlink;
 		if (md->body->mbo_valid & OBD_MD_FLSIZE)
@@ -1500,9 +1493,9 @@ static int ll_update_lsm_md(struct inode *inode, struct lustre_md *md)
 			md->body->mbo_ctime = attr->cat_ctime;
 		if (md->body->mbo_valid & OBD_MD_FLMTIME)
 			md->body->mbo_mtime = attr->cat_mtime;
-
-		kfree(attr);
 	}
+
+	kfree(attr);
 unlock:
 	up_read(&lli->lli_lsm_sem);
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 589/622] lustre: ldlm: fix lock convert races
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Vitaly Fertman <c17818@cray.com>

The blocking cb may be triggered in parallel, and the convert logic
of the DOM lock must be prepared for the cancel_bits having already
been zeroed by the first executor.

As there may be several parallel blocking cb executors and several
conversion callers, each requesting different inode bits, set up
the following logic:
- the lock keeps the aggregated set of bits requested for cancelling
  by different parties, where 0 means the whole lock is to be
  cancelled, and where the CBPENDING flag means there is a canceling
  job pending;
- once completed, the cancel_bits are zeroed and the CBPENDING flag
  is dropped, meaning the next request will be a part of the next job;
- once a local lock is converted, its state is changed appropriately
  and no cleanup is left for the interpret time as the lock is ready
  for the next usage;
- as the lock is unlocked in a process of conversion and more bits
  may appear, check it and repeat appropriately;
- let just one conversion executor work at a time; others wait,
  similar to ldlm_cli_cancel();
- there are others who may want to cancel unused locks (cancel_lru,
  cancel_resource_local), consider CANCELING as a request to cancel
  the full lock independently of the cancel_bits;

Some cleanups are done:
- move the cache drop logic to the CANCELING part of the blocking cb
  from the BLOCKING one;
- remove the convert RPC interpret, as the lock cleanups are already
  done in advance; the convert RPC is re-sendable and an error means
  there is a serious network problem;

WC-bug-id: https://jira.whamcloud.com/browse/LU-11276
Lustre-commit: 6c0b676e4124 ("LU-11276 ldlm: fix lock convert races")
Signed-off-by: Vitaly Fertman <c17818@cray.com>
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/36466
Reviewed-by: Andriy Skulysh <c17819@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_dlm.h  |   9 +-
 fs/lustre/ldlm/ldlm_inodebits.c | 143 +++++++++++++++--------------
 fs/lustre/ldlm/ldlm_internal.h  |  12 +++
 fs/lustre/ldlm/ldlm_lockd.c     |  73 ++++++++++-----
 fs/lustre/ldlm/ldlm_request.c   | 198 +++++++---------------------------------
 fs/lustre/llite/namei.c         |  59 +++++++-----
 6 files changed, 210 insertions(+), 284 deletions(-)

diff --git a/fs/lustre/include/lustre_dlm.h b/fs/lustre/include/lustre_dlm.h
index 9ca79f4..42c1806 100644
--- a/fs/lustre/include/lustre_dlm.h
+++ b/fs/lustre/include/lustre_dlm.h
@@ -545,7 +545,6 @@ enum ldlm_cancel_flags {
 	LCF_BL_AST     = 0x4, /* Cancel locks marked as LDLM_FL_BL_AST
 			       * in the same RPC
 			       */
-	LCF_CONVERT    = 0x8, /* Try to convert IBITS lock before cancel */
 };
 
 struct ldlm_flock {
@@ -1291,7 +1290,9 @@ int ldlm_cli_enqueue_fini(struct obd_export *exp, struct ptlrpc_request *req,
 			  enum ldlm_mode mode,
 			  u64 *flags, void *lvb, u32 lvb_len,
 			  const struct lustre_handle *lockh, int rc);
-int ldlm_cli_convert(struct ldlm_lock *lock, u32 *flags);
+int ldlm_cli_convert_req(struct ldlm_lock *lock, u32 *flags, u64 new_bits);
+int ldlm_cli_convert(struct ldlm_lock *lock,
+		     enum ldlm_cancel_flags cancel_flags);
 int ldlm_cli_update_pool(struct ptlrpc_request *req);
 int ldlm_cli_cancel(const struct lustre_handle *lockh,
 		    enum ldlm_cancel_flags cancel_flags);
@@ -1317,8 +1318,8 @@ int ldlm_cli_cancel_list(struct list_head *head, int count,
 /** @} ldlm_cli_api */
 
 int ldlm_inodebits_drop(struct ldlm_lock *lock, u64 to_drop);
-int ldlm_cli_dropbits(struct ldlm_lock *lock, u64 drop_bits);
-int ldlm_cli_dropbits_list(struct list_head *converts, u64 drop_bits);
+int ldlm_cli_inodebits_convert(struct ldlm_lock *lock,
+			       enum ldlm_cancel_flags cancel_flags);
 
 /* mds/handler.c */
 /* This has to be here because recursive inclusion sucks. */
diff --git a/fs/lustre/ldlm/ldlm_inodebits.c b/fs/lustre/ldlm/ldlm_inodebits.c
index 9cf3c5f..2288eb5 100644
--- a/fs/lustre/ldlm/ldlm_inodebits.c
+++ b/fs/lustre/ldlm/ldlm_inodebits.c
@@ -98,92 +98,101 @@ int ldlm_inodebits_drop(struct ldlm_lock *lock, u64 to_drop)
 EXPORT_SYMBOL(ldlm_inodebits_drop);
 
 /* convert single lock */
-int ldlm_cli_dropbits(struct ldlm_lock *lock, u64 drop_bits)
+int ldlm_cli_inodebits_convert(struct ldlm_lock *lock,
+			       enum ldlm_cancel_flags cancel_flags)
 {
-	struct lustre_handle lockh;
+	struct ldlm_namespace *ns = ldlm_lock_to_ns(lock);
+	struct ldlm_lock_desc ld = { { 0 } };
+	u64 drop_bits, new_bits;
 	u32 flags = 0;
 	int rc;
 
-	LASSERT(drop_bits);
-	LASSERT(!lock->l_readers && !lock->l_writers);
-
-	LDLM_DEBUG(lock, "client lock convert START");
+	check_res_locked(lock->l_resource);
 
-	ldlm_lock2handle(lock, &lockh);
-	lock_res_and_lock(lock);
-	/* check if all bits are blocked */
-	if (!(lock->l_policy_data.l_inodebits.bits & ~drop_bits)) {
-		unlock_res_and_lock(lock);
-		/* return error to continue with cancel */
-		rc = -EINVAL;
-		goto exit;
+	/* Lock is being converted already */
+	if (ldlm_is_converting(lock)) {
+		if (!(cancel_flags & LCF_ASYNC)) {
+			unlock_res_and_lock(lock);
+			wait_event_idle(lock->l_waitq,
+					is_lock_converted(lock));
+			lock_res_and_lock(lock);
+		}
+		return 0;
 	}
 
-	/* check if no common bits, consider this as successful convert */
-	if (!(lock->l_policy_data.l_inodebits.bits & drop_bits)) {
-		unlock_res_and_lock(lock);
-		rc = 0;
-		goto exit;
-	}
+	/* lru_cancel may happen in parallel and call ldlm_cli_cancel_list()
+	 * independently.
+	 */
+	if (ldlm_is_canceling(lock))
+		return -EINVAL;
 
-	/* check if there is race with cancel */
-	if (ldlm_is_canceling(lock) || ldlm_is_cancel(lock)) {
-		unlock_res_and_lock(lock);
-		rc = -EINVAL;
-		goto exit;
-	}
+	/* no need in only local convert */
+	if (lock->l_flags & (LDLM_FL_LOCAL_ONLY | LDLM_FL_CANCEL_ON_BLOCK))
+		return -EINVAL;
 
-	/* clear cbpending flag early, it is safe to match lock right after
-	 * client convert because it is downgrade always.
+	drop_bits = lock->l_policy_data.l_inodebits.cancel_bits;
+	/* no cancel bits - means that caller needs full cancel */
+	if (drop_bits == 0)
+		return -EINVAL;
+
+	new_bits = lock->l_policy_data.l_inodebits.bits & ~drop_bits;
+	/* check if all lock bits are dropped, proceed with cancel */
+	if (!new_bits)
+		return -EINVAL;
+
+	/* check if no dropped bits, consider this as successful convert
 	 */
-	ldlm_clear_cbpending(lock);
-	ldlm_clear_bl_ast(lock);
+	if (lock->l_policy_data.l_inodebits.bits == new_bits)
+		return 0;
 
-	/* If lock is being converted already, check drop bits first */
-	if (ldlm_is_converting(lock)) {
-		/* raced lock convert, lock inodebits are remaining bits
-		 * so check if they are conflicting with new convert or not.
-		 */
-		if (!(lock->l_policy_data.l_inodebits.bits & drop_bits)) {
-			unlock_res_and_lock(lock);
-			rc = 0;
-			goto exit;
-		}
-		/* Otherwise drop new conflicting bits in new convert */
-	}
 	ldlm_set_converting(lock);
-	/* from all bits of blocking lock leave only conflicting */
-	drop_bits &= lock->l_policy_data.l_inodebits.bits;
-	/* save them in cancel_bits, so l_blocking_ast will know
-	 * which bits from the current lock were dropped.
-	 */
-	lock->l_policy_data.l_inodebits.cancel_bits = drop_bits;
-	/* Finally clear these bits in lock ibits */
-	ldlm_inodebits_drop(lock, drop_bits);
-	unlock_res_and_lock(lock);
 	/* Finally call cancel callback for remaining bits only.
 	 * It is important to have converting flag during that
 	 * so blocking_ast callback can distinguish convert from
 	 * cancels.
 	 */
-	if (lock->l_blocking_ast)
-		lock->l_blocking_ast(lock, NULL, lock->l_ast_data,
-				     LDLM_CB_CANCELING);
-
+	ld.l_policy_data.l_inodebits.cancel_bits = drop_bits;
+	unlock_res_and_lock(lock);
+	lock->l_blocking_ast(lock, &ld, lock->l_ast_data, LDLM_CB_CANCELING);
 	/* now notify server about convert */
-	rc = ldlm_cli_convert(lock, &flags);
-	if (rc) {
-		lock_res_and_lock(lock);
-		if (ldlm_is_converting(lock)) {
-			ldlm_clear_converting(lock);
-			ldlm_set_cbpending(lock);
-			ldlm_set_bl_ast(lock);
-		}
-		unlock_res_and_lock(lock);
-		goto exit;
+	rc = ldlm_cli_convert_req(lock, &flags, new_bits);
+	lock_res_and_lock(lock);
+	if (rc)
+		goto full_cancel;
+
+	/* Finally clear these bits in lock ibits */
+	ldlm_inodebits_drop(lock, drop_bits);
+
+	/* Being locked again check if lock was canceled, it is important
+	 * to do and don't drop cbpending below
+	 */
+	if (ldlm_is_canceling(lock)) {
+		rc = -EINVAL;
+		goto full_cancel;
+	}
+
+	/* also check again if more bits to be cancelled appeared */
+	if (drop_bits != lock->l_policy_data.l_inodebits.cancel_bits) {
+		rc = -EAGAIN;
+		goto clear_converting;
 	}
 
-exit:
-	LDLM_DEBUG(lock, "client lock convert END");
+	/* clear cbpending flag early, it is safe to match lock right after
+	 * client convert because it is downgrade always.
+	 */
+	ldlm_clear_cbpending(lock);
+	ldlm_clear_bl_ast(lock);
+	spin_lock(&ns->ns_lock);
+	if (list_empty(&lock->l_lru))
+		ldlm_lock_add_to_lru_nolock(lock);
+	spin_unlock(&ns->ns_lock);
+
+	/* the job is done, zero the cancel_bits. If more conflicts appear,
+	 * it will result in another cycle of ldlm_cli_inodebits_convert().
+	 */
+full_cancel:
+	lock->l_policy_data.l_inodebits.cancel_bits = 0;
+clear_converting:
+	ldlm_clear_converting(lock);
 	return rc;
 }
diff --git a/fs/lustre/ldlm/ldlm_internal.h b/fs/lustre/ldlm/ldlm_internal.h
index 336d9b7..996c0fb 100644
--- a/fs/lustre/ldlm/ldlm_internal.h
+++ b/fs/lustre/ldlm/ldlm_internal.h
@@ -171,6 +171,7 @@ int ldlm_bl_to_thread_list(struct ldlm_namespace *ns,
 
 void ldlm_handle_bl_callback(struct ldlm_namespace *ns,
 			     struct ldlm_lock_desc *ld, struct ldlm_lock *lock);
+void ldlm_bl_desc2lock(const struct ldlm_lock_desc *ld, struct ldlm_lock *lock);
 
 extern struct kmem_cache *ldlm_resource_slab;
 extern struct kset *ldlm_ns_kset;
@@ -330,6 +331,17 @@ static inline bool is_bl_done(struct ldlm_lock *lock)
 	return bl_done;
 }
 
+static inline bool is_lock_converted(struct ldlm_lock *lock)
+{
+	bool ret = 0;
+
+	lock_res_and_lock(lock);
+	ret = (lock->l_policy_data.l_inodebits.cancel_bits == 0);
+	unlock_res_and_lock(lock);
+
+	return ret;
+}
+
 typedef void (*ldlm_policy_wire_to_local_t)(const union ldlm_wire_policy_data *,
 					    union ldlm_policy_data *);
 
diff --git a/fs/lustre/ldlm/ldlm_lockd.c b/fs/lustre/ldlm/ldlm_lockd.c
index 79dab6e..32b7be1 100644
--- a/fs/lustre/ldlm/ldlm_lockd.c
+++ b/fs/lustre/ldlm/ldlm_lockd.c
@@ -73,7 +73,6 @@ struct ldlm_cb_async_args {
 /* LDLM state */
 
 static struct ldlm_state *ldlm_state;
-
 struct ldlm_bl_pool {
 	spinlock_t		blp_lock;
 
@@ -111,21 +110,15 @@ struct ldlm_bl_work_item {
 };
 
 /**
- * Callback handler for receiving incoming blocking ASTs.
- *
- * This can only happen on client side.
+ * Server may pass additional information about blocking lock.
+ * For IBITS locks it is conflicting bits which can be used for
+ * lock convert instead of cancel.
  */
-void ldlm_handle_bl_callback(struct ldlm_namespace *ns,
-			     struct ldlm_lock_desc *ld, struct ldlm_lock *lock)
+void ldlm_bl_desc2lock(const struct ldlm_lock_desc *ld, struct ldlm_lock *lock)
 {
-	int do_ast;
-
-	LDLM_DEBUG(lock, "client blocking AST callback handler");
-
-	lock_res_and_lock(lock);
-
-	/* set bits to cancel for this lock for possible lock convert */
-	if (lock->l_resource->lr_type == LDLM_IBITS) {
+	check_res_locked(lock->l_resource);
+	if (ld &&
+	    (lock->l_resource->lr_type == LDLM_IBITS)) {
 		/*
 		 * Lock description contains policy of blocking lock, and its
 		 * cancel_bits is used to pass conflicting bits.  NOTE: ld can
@@ -137,18 +130,41 @@ void ldlm_handle_bl_callback(struct ldlm_namespace *ns,
 		 * cookie, never use cancel bits from different resource, full
 		 * cancel is to be used.
 		 */
-		if (ld && ld->l_policy_data.l_inodebits.bits &&
+		if (ld->l_policy_data.l_inodebits.cancel_bits &&
 		    ldlm_res_eq(&ld->l_resource.lr_name,
-				&lock->l_resource->lr_name))
-			lock->l_policy_data.l_inodebits.cancel_bits =
+				&lock->l_resource->lr_name) &&
+		    !(ldlm_is_cbpending(lock) &&
+		      lock->l_policy_data.l_inodebits.cancel_bits == 0)) {
+			/* always combine conflicting ibits */
+			lock->l_policy_data.l_inodebits.cancel_bits |=
 				ld->l_policy_data.l_inodebits.cancel_bits;
-		/*
-		 * If there is no valid ld and lock is cbpending already
-		 * then cancel_bits should be kept, otherwise it is zeroed.
-		 */
-		else if (!ldlm_is_cbpending(lock))
+		} else {
+			/* If cancel_bits are not obtained or
+			 * if the lock is already CBPENDING and
+			 * has no cancel_bits set
+			 * - the full lock is to be cancelled
+			 */
 			lock->l_policy_data.l_inodebits.cancel_bits = 0;
+		}
 	}
+}
+
+/**
+ * Callback handler for receiving incoming blocking ASTs.
+ *
+ * This can only happen on client side.
+ */
+void ldlm_handle_bl_callback(struct ldlm_namespace *ns,
+			     struct ldlm_lock_desc *ld, struct ldlm_lock *lock)
+{
+	int do_ast;
+
+	LDLM_DEBUG(lock, "client blocking AST callback handler");
+
+	lock_res_and_lock(lock);
+
+	/* get extra information from desc if any */
+	ldlm_bl_desc2lock(ld, lock);
 	ldlm_set_cbpending(lock);
 
 	do_ast = !lock->l_readers && !lock->l_writers;
@@ -269,6 +285,7 @@ static void ldlm_handle_cp_callback(struct ptlrpc_request *req,
 		 * Let ldlm_cancel_lru() be fast.
 		 */
 		ldlm_lock_remove_from_lru(lock);
+		ldlm_bl_desc2lock(&dlm_req->lock_desc, lock);
 		lock->l_flags |= LDLM_FL_CBPENDING | LDLM_FL_BL_AST;
 		LDLM_DEBUG(lock, "completion AST includes blocking AST");
 	}
@@ -318,6 +335,7 @@ static void ldlm_handle_gl_callback(struct ptlrpc_request *req,
 				    struct ldlm_request *dlm_req,
 				    struct ldlm_lock *lock)
 {
+	struct ldlm_lock_desc *ld = &dlm_req->lock_desc;
 	int rc = -ENXIO;
 
 	LDLM_DEBUG(lock, "client glimpse AST callback handler");
@@ -339,8 +357,15 @@ static void ldlm_handle_gl_callback(struct ptlrpc_request *req,
 			ktime_add(lock->l_last_used,
 				  ktime_set(ns->ns_dirty_age_limit, 0)))) {
 		unlock_res_and_lock(lock);
-		if (ldlm_bl_to_thread_lock(ns, NULL, lock))
-			ldlm_handle_bl_callback(ns, NULL, lock);
+
+		/* For MDS glimpse it is always DOM lock, set corresponding
+		 * cancel_bits to perform lock convert if needed
+		 */
+		if (lock->l_resource->lr_type == LDLM_IBITS)
+			ld->l_policy_data.l_inodebits.cancel_bits =
+							MDS_INODELOCK_DOM;
+		if (ldlm_bl_to_thread_lock(ns, ld, lock))
+			ldlm_handle_bl_callback(ns, ld, lock);
 
 		return;
 	}
diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index 6df057d..7eba8d2 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -489,6 +489,7 @@ int ldlm_cli_enqueue_fini(struct obd_export *exp, struct ptlrpc_request *req,
 
 	if ((*flags) & LDLM_FL_AST_SENT) {
 		lock_res_and_lock(lock);
+		ldlm_bl_desc2lock(&reply->lock_desc, lock);
 		lock->l_flags |= LDLM_FL_CBPENDING |  LDLM_FL_BL_AST;
 		unlock_res_and_lock(lock);
 		LDLM_DEBUG(lock, "enqueue reply includes blocking AST");
@@ -875,129 +876,6 @@ int ldlm_cli_enqueue(struct obd_export *exp, struct ptlrpc_request **reqp,
 EXPORT_SYMBOL(ldlm_cli_enqueue);
 
 /**
- * Client-side lock convert reply handling.
- *
- * Finish client lock converting, checks for concurrent converts
- * and clear 'converting' flag so lock can be placed back into LRU.
- */
-static int lock_convert_interpret(const struct lu_env *env,
-				  struct ptlrpc_request *req,
-				  void *args, int rc)
-{
-	struct ldlm_async_args *aa = args;
-	struct ldlm_lock *lock;
-	struct ldlm_reply *reply;
-
-	lock = ldlm_handle2lock(&aa->lock_handle);
-	if (!lock) {
-		LDLM_DEBUG_NOLOCK("convert ACK for unknown local cookie %#llx",
-			aa->lock_handle.cookie);
-		return -ESTALE;
-	}
-
-	LDLM_DEBUG(lock, "CONVERTED lock:");
-
-	if (rc != ELDLM_OK)
-		goto out;
-
-	reply = req_capsule_server_get(&req->rq_pill, &RMF_DLM_REP);
-	if (!reply) {
-		rc = -EPROTO;
-		goto out;
-	}
-
-	if (reply->lock_handle.cookie != aa->lock_handle.cookie) {
-		LDLM_ERROR(lock,
-			   "convert ACK with wrong lock cookie %#llx but cookie %#llx from server %s id %s\n",
-			   aa->lock_handle.cookie, reply->lock_handle.cookie,
-			   req->rq_export->exp_client_uuid.uuid,
-			   libcfs_id2str(req->rq_peer));
-		rc = ELDLM_NO_LOCK_DATA;
-		goto out;
-	}
-
-	lock_res_and_lock(lock);
-	/*
-	 * Lock convert is sent for any new bits to drop, the converting flag
-	 * is dropped when ibits on server are the same as on client. Meanwhile
-	 * that can be so that more later convert will be replied first with
-	 * and clear converting flag, so in case of such race just exit here.
-	 * if lock has no converting bits then.
-	 */
-	if (!ldlm_is_converting(lock)) {
-		LDLM_DEBUG(lock,
-			   "convert ACK for lock without converting flag, reply ibits %#llx",
-			   reply->lock_desc.l_policy_data.l_inodebits.bits);
-	} else if (reply->lock_desc.l_policy_data.l_inodebits.bits !=
-		   lock->l_policy_data.l_inodebits.bits) {
-		/*
-		 * Compare server returned lock ibits and local lock ibits
-		 * if they are the same we consider conversion is done,
-		 * otherwise we have more converts inflight and keep
-		 * converting flag.
-		 */
-		LDLM_DEBUG(lock, "convert ACK with ibits %#llx\n",
-			   reply->lock_desc.l_policy_data.l_inodebits.bits);
-	} else {
-		ldlm_clear_converting(lock);
-
-		/*
-		 * Concurrent BL AST may arrive and cause another convert
-		 * or cancel so just do nothing here if bl_ast is set,
-		 * finish with convert otherwise.
-		 */
-		if (!ldlm_is_bl_ast(lock)) {
-			struct ldlm_namespace *ns = ldlm_lock_to_ns(lock);
-
-			/*
-			 * Drop cancel_bits since there are no more converts
-			 * and put lock into LRU if it is still not used and
-			 * is not there yet.
-			 */
-			lock->l_policy_data.l_inodebits.cancel_bits = 0;
-			if (!lock->l_readers && !lock->l_writers &&
-			    !ldlm_is_canceling(lock)) {
-				spin_lock(&ns->ns_lock);
-				/* there is check for list_empty() inside */
-				ldlm_lock_remove_from_lru_nolock(lock);
-				ldlm_lock_add_to_lru_nolock(lock);
-				spin_unlock(&ns->ns_lock);
-			}
-		}
-	}
-	unlock_res_and_lock(lock);
-out:
-	if (rc) {
-		int flag;
-
-		lock_res_and_lock(lock);
-		if (ldlm_is_converting(lock)) {
-			ldlm_clear_converting(lock);
-			ldlm_set_cbpending(lock);
-			ldlm_set_bl_ast(lock);
-			lock->l_policy_data.l_inodebits.cancel_bits = 0;
-		}
-		unlock_res_and_lock(lock);
-
-		/*
-		 * fallback to normal lock cancel. If rc means there is no
-		 * valid lock on server, do only local cancel
-		 */
-		if (rc == ELDLM_NO_LOCK_DATA)
-			flag = LCF_LOCAL;
-		else
-			flag = LCF_ASYNC;
-
-		rc = ldlm_cli_cancel(&aa->lock_handle, flag);
-		if (rc < 0)
-			LDLM_DEBUG(lock, "failed to cancel lock: rc = %d\n",
-				   rc);
-	}
-	LDLM_LOCK_PUT(lock);
-	return rc;
-}
-
-/**
  * Client-side IBITS lock convert.
  *
  * Inform server that lock has been converted instead of canceling.
@@ -1009,17 +887,13 @@ static int lock_convert_interpret(const struct lu_env *env,
  * is made asynchronous.
  *
  */
-int ldlm_cli_convert(struct ldlm_lock *lock, u32 *flags)
+int ldlm_cli_convert_req(struct ldlm_lock *lock, u32 *flags, u64 new_bits)
 {
 	struct ldlm_request *body;
 	struct ptlrpc_request *req;
-	struct ldlm_async_args *aa;
 	struct obd_export *exp = lock->l_conn_export;
 
-	if (!exp) {
-		LDLM_ERROR(lock, "convert must not be called on local locks.");
-		return -EINVAL;
-	}
+	LASSERT(exp);
 
 	/*
 	 * this is better to check earlier and it is done so already,
@@ -1050,8 +924,7 @@ int ldlm_cli_convert(struct ldlm_lock *lock, u32 *flags)
 	body->lock_desc.l_req_mode = lock->l_req_mode;
 	body->lock_desc.l_granted_mode = lock->l_granted_mode;
 
-	body->lock_desc.l_policy_data.l_inodebits.bits =
-					lock->l_policy_data.l_inodebits.bits;
+	body->lock_desc.l_policy_data.l_inodebits.bits = new_bits;
 	body->lock_desc.l_policy_data.l_inodebits.cancel_bits = 0;
 
 	body->lock_flags = ldlm_flags_to_wire(*flags);
@@ -1071,10 +944,6 @@ int ldlm_cli_convert(struct ldlm_lock *lock, u32 *flags)
 		lprocfs_counter_incr(exp->exp_obd->obd_svc_stats,
 				     LDLM_CONVERT - LDLM_FIRST_OPC);
 
-	aa = ptlrpc_req_async_args(aa, req);
-	ldlm_lock2handle(lock, &aa->lock_handle);
-	req->rq_interpret_reply = lock_convert_interpret;
-
 	ptlrpcd_add_req(req);
 	return 0;
 }
@@ -1301,6 +1170,27 @@ int ldlm_cli_update_pool(struct ptlrpc_request *req)
 	return 0;
 }
 
+int ldlm_cli_convert(struct ldlm_lock *lock,
+		     enum ldlm_cancel_flags cancel_flags)
+{
+	int rc = -EINVAL;
+
+	LASSERT(!lock->l_readers && !lock->l_writers);
+	LDLM_DEBUG(lock, "client lock convert START");
+
+	if (lock->l_resource->lr_type == LDLM_IBITS) {
+		lock_res_and_lock(lock);
+		do {
+			rc = ldlm_cli_inodebits_convert(lock, cancel_flags);
+		} while (rc == -EAGAIN);
+		unlock_res_and_lock(lock);
+	}
+
+	LDLM_DEBUG(lock, "client lock convert END");
+	return rc;
+}
+EXPORT_SYMBOL(ldlm_cli_convert);
+
 /**
  * Client side lock cancel.
  *
@@ -1323,20 +1213,9 @@ int ldlm_cli_cancel(const struct lustre_handle *lockh,
 		return 0;
 	}
 
-	/* Convert lock bits instead of cancel for IBITS locks */
-	if (cancel_flags & LCF_CONVERT) {
-		LASSERT(lock->l_resource->lr_type == LDLM_IBITS);
-		LASSERT(lock->l_policy_data.l_inodebits.cancel_bits != 0);
-
-		rc = ldlm_cli_dropbits(lock,
-				lock->l_policy_data.l_inodebits.cancel_bits);
-		if (rc == 0) {
-			LDLM_LOCK_RELEASE(lock);
-			return 0;
-		}
-	}
-
 	lock_res_and_lock(lock);
+	LASSERT(!ldlm_is_converting(lock));
+
 	/* Lock is being canceled and the caller doesn't want to wait */
 	if (ldlm_is_canceling(lock)) {
 		unlock_res_and_lock(lock);
@@ -1348,16 +1227,6 @@ int ldlm_cli_cancel(const struct lustre_handle *lockh,
 		return 0;
 	}
 
-	/*
-	 * Lock is being converted, cancel it immediately.
-	 * When convert will end, it releases lock and it will be gone.
-	 */
-	if (ldlm_is_converting(lock)) {
-		/* set back flags removed by convert */
-		ldlm_set_cbpending(lock);
-		ldlm_set_bl_ast(lock);
-	}
-
 	ldlm_set_canceling(lock);
 	unlock_res_and_lock(lock);
 
@@ -1723,8 +1592,7 @@ static int ldlm_prepare_lru_list(struct ldlm_namespace *ns,
 			/* No locks which got blocking requests. */
 			LASSERT(!ldlm_is_bl_ast(lock));
 
-			if (!ldlm_is_canceling(lock) &&
-			    !ldlm_is_converting(lock))
+			if (!ldlm_is_canceling(lock))
 				break;
 
 			/*
@@ -1782,7 +1650,7 @@ static int ldlm_prepare_lru_list(struct ldlm_namespace *ns,
 
 		lock_res_and_lock(lock);
 		/* Check flags again under the lock. */
-		if (ldlm_is_canceling(lock) || ldlm_is_converting(lock) ||
+		if (ldlm_is_canceling(lock) ||
 		    (ldlm_lock_remove_from_lru_check(lock, last_use) == 0)) {
 			/*
 			 * Another thread is removing lock from LRU, or
@@ -1908,11 +1776,10 @@ int ldlm_cancel_resource_local(struct ldlm_resource *res,
 			continue;
 
 		/*
-		 * If somebody is already doing CANCEL, or blocking AST came,
-		 * skip this lock.
+		 * If somebody is already doing CANCEL, or blocking AST came
+		 * then skip this lock.
 		 */
-		if (ldlm_is_bl_ast(lock) || ldlm_is_canceling(lock) ||
-		    ldlm_is_converting(lock))
+		if (ldlm_is_bl_ast(lock) || ldlm_is_canceling(lock))
 			continue;
 
 		if (lockmode_compat(lock->l_granted_mode, mode))
@@ -1938,7 +1805,6 @@ int ldlm_cancel_resource_local(struct ldlm_resource *res,
 		/* See CBPENDING comment in ldlm_cancel_lru */
 		lock->l_flags |= LDLM_FL_CBPENDING | LDLM_FL_CANCELING |
 				 lock_flags;
-
 		LASSERT(list_empty(&lock->l_bl_ast));
 		list_add(&lock->l_bl_ast, cancels);
 		LDLM_LOCK_GET(lock);
diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index c87653d..13c1cf9 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -431,11 +431,10 @@ int ll_md_need_convert(struct ldlm_lock *lock)
 	return !!(bits);
 }
 
-int ll_md_blocking_ast(struct ldlm_lock *lock, struct ldlm_lock_desc *desc,
+int ll_md_blocking_ast(struct ldlm_lock *lock, struct ldlm_lock_desc *ld,
 		       void *data, int flag)
 {
 	struct lustre_handle lockh;
-	u64 bits = lock->l_policy_data.l_inodebits.bits;
 	int rc;
 
 	switch (flag) {
@@ -443,17 +442,21 @@ int ll_md_blocking_ast(struct ldlm_lock *lock, struct ldlm_lock_desc *desc,
 	{
 		u64 cancel_flags = LCF_ASYNC;
 
-		if (ll_md_need_convert(lock)) {
-			cancel_flags |= LCF_CONVERT;
-			/* For lock convert some cancel actions may require
-			 * this lock with non-dropped canceled bits, e.g. page
-			 * flush for DOM lock. So call ll_lock_cancel_bits()
-			 * here while canceled bits are still set.
-			 */
-			bits = lock->l_policy_data.l_inodebits.cancel_bits;
-			if (bits & MDS_INODELOCK_DOM)
-				ll_lock_cancel_bits(lock, MDS_INODELOCK_DOM);
+		/* if lock convert is not needed then still have to
+		 * pass lock via ldlm_cli_convert() to keep all states
+		 * correct, set cancel_bits to full lock bits to cause
+		 * full cancel to happen.
+		 */
+		if (!ll_md_need_convert(lock)) {
+			lock_res_and_lock(lock);
+			lock->l_policy_data.l_inodebits.cancel_bits =
+					lock->l_policy_data.l_inodebits.bits;
+			unlock_res_and_lock(lock);
 		}
+		rc = ldlm_cli_convert(lock, cancel_flags);
+		if (!rc)
+			return 0;
+		/* continue with cancel otherwise */
 		ldlm_lock2handle(lock, &lockh);
 		rc = ldlm_cli_cancel(&lockh, cancel_flags);
 		if (rc < 0) {
@@ -463,24 +466,34 @@ int ll_md_blocking_ast(struct ldlm_lock *lock, struct ldlm_lock_desc *desc,
 		break;
 	}
 	case LDLM_CB_CANCELING:
+	{
+		u64 to_cancel = lock->l_policy_data.l_inodebits.bits;
+
 		/* Nothing to do for non-granted locks */
 		if (!ldlm_is_granted(lock))
 			break;
 
-		if (ldlm_is_converting(lock)) {
-			/* this is called on already converted lock, so
-			 * ibits has remained bits only and cancel_bits
-			 * are bits that were dropped.
-			 * Note that DOM lock is handled prior lock convert
-			 * and is excluded here.
+		/* If 'ld' is supplied then bits to be cancelled are passed
+		 * implicitly by lock converting and cancel_bits from 'ld'
+		 * should be used. Otherwise full cancel is being performed
+		 * and lock inodebits are used.
+		 *
+		 * Note: we cannot rely on cancel_bits in lock itself at this
+		 * moment because they can be changed by concurrent thread,
+		 * so ldlm_cli_inodebits_convert() pass cancel bits implicitly
+		 * in 'ld' parameter.
+		 */
+		if (ld) {
+			/* partial bits cancel allowed only during convert */
+			LASSERT(ldlm_is_converting(lock));
+			/* mask cancel bits by lock bits so that no unused
+			 * bits are passed to ll_lock_cancel_bits()
 			 */
-			bits = lock->l_policy_data.l_inodebits.cancel_bits &
-				~MDS_INODELOCK_DOM;
-		} else {
-			LASSERT(ldlm_is_canceling(lock));
+			to_cancel &= ld->l_policy_data.l_inodebits.cancel_bits;
 		}
-		ll_lock_cancel_bits(lock, bits);
+		ll_lock_cancel_bits(lock, to_cancel);
 		break;
+	}
 	default:
 		LBUG();
 	}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 590/622] lustre: ldlm: signal vs CP callback race
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (588 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 589/622] lustre: ldlm: fix lock convert races James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 591/622] lustre: uapi: properly pack data structures James Simmons
                   ` (32 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Andriy Skulysh <c17819@cray.com>

In case of an interrupted wait for a CP AST,
failed_lock_cleanup() sets LDLM_FL_LOCAL_ONLY, so
the client won't cancel the lock on the CP AST.

As a result, the lock isn't canceled on the server on reception.

Cray-bug-id: LUS-2021
WC-bug-id: https://jira.whamcloud.com/browse/LU-7791
Lustre-commit: 7fff052c930d ("LU-7791 ldlm: signal vs CP callback race")
Signed-off-by: Andriy Skulysh <c17819@cray.com>
Reviewed-by: Alexander Boyko <c17825@cray.com>
Reviewed-by: Andrew Perepechko <c17827@cray.com>
Reviewed-on: https://review.whamcloud.com/19898
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_support.h |  1 +
 fs/lustre/ldlm/ldlm_lockd.c     | 51 +++++++++++++++++++++++++----------------
 fs/lustre/ldlm/ldlm_request.c   |  3 +++
 3 files changed, 35 insertions(+), 20 deletions(-)

diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index a26ac76..7dfef0f 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -302,6 +302,7 @@
 #define OBD_FAIL_LDLM_CP_CB_WAIT3			0x321
 #define OBD_FAIL_LDLM_CP_CB_WAIT4			0x322
 #define OBD_FAIL_LDLM_CP_CB_WAIT5			0x323
+#define OBD_FAIL_LDLM_PAUSE_CANCEL_LOCAL		0x329
 
 #define OBD_FAIL_LDLM_GRANT_CHECK			0x32a
 #define OBD_FAIL_LDLM_LOCAL_CANCEL_PAUSE		0x32c
diff --git a/fs/lustre/ldlm/ldlm_lockd.c b/fs/lustre/ldlm/ldlm_lockd.c
index 32b7be1..b252fef 100644
--- a/fs/lustre/ldlm/ldlm_lockd.c
+++ b/fs/lustre/ldlm/ldlm_lockd.c
@@ -187,15 +187,29 @@ void ldlm_handle_bl_callback(struct ldlm_namespace *ns,
 	LDLM_LOCK_RELEASE(lock);
 }
 
+static int ldlm_callback_reply(struct ptlrpc_request *req, int rc)
+{
+	if (req->rq_no_reply)
+		return 0;
+
+	req->rq_status = rc;
+	if (!req->rq_packed_final) {
+		rc = lustre_pack_reply(req, 1, NULL, NULL);
+		if (rc)
+			return rc;
+	}
+	return ptlrpc_reply(req);
+}
+
 /*
  * Callback handler for receiving incoming completion ASTs.
  *
  * This only can happen on client side.
  */
-static void ldlm_handle_cp_callback(struct ptlrpc_request *req,
-				    struct ldlm_namespace *ns,
-				    struct ldlm_request *dlm_req,
-				    struct ldlm_lock *lock)
+static int ldlm_handle_cp_callback(struct ptlrpc_request *req,
+				   struct ldlm_namespace *ns,
+				   struct ldlm_request *dlm_req,
+				   struct ldlm_lock *lock)
 {
 	int lvb_len;
 	LIST_HEAD(ast_list);
@@ -206,6 +220,8 @@ static void ldlm_handle_cp_callback(struct ptlrpc_request *req,
 	if (OBD_FAIL_CHECK(OBD_FAIL_LDLM_CANCEL_BL_CB_RACE)) {
 		long to = HZ;
 
+		ldlm_callback_reply(req, 0);
+
 		while (to > 0) {
 			schedule_timeout_interruptible(to);
 			if (ldlm_is_granted(lock) ||
@@ -250,6 +266,12 @@ static void ldlm_handle_cp_callback(struct ptlrpc_request *req,
 		lock_res_and_lock(lock);
 	}
 
+	if (ldlm_is_failed(lock)) {
+		unlock_res_and_lock(lock);
+		LDLM_LOCK_RELEASE(lock);
+		return -EINVAL;
+	}
+
 	if (ldlm_is_destroyed(lock) ||
 	    ldlm_is_granted(lock)) {
 		/* bug 11300: the lock has already been granted */
@@ -321,6 +343,8 @@ static void ldlm_handle_cp_callback(struct ptlrpc_request *req,
 		wake_up(&lock->l_waitq);
 	}
 	LDLM_LOCK_RELEASE(lock);
+
+	return 0;
 }
 
 /**
@@ -373,20 +397,6 @@ static void ldlm_handle_gl_callback(struct ptlrpc_request *req,
 	LDLM_LOCK_RELEASE(lock);
 }
 
-static int ldlm_callback_reply(struct ptlrpc_request *req, int rc)
-{
-	if (req->rq_no_reply)
-		return 0;
-
-	req->rq_status = rc;
-	if (!req->rq_packed_final) {
-		rc = lustre_pack_reply(req, 1, NULL, NULL);
-		if (rc)
-			return rc;
-	}
-	return ptlrpc_reply(req);
-}
-
 static int __ldlm_bl_to_thread(struct ldlm_bl_work_item *blwi,
 			       enum ldlm_cancel_flags cancel_flags)
 {
@@ -714,8 +724,9 @@ static int ldlm_callback_handler(struct ptlrpc_request *req)
 	case LDLM_CP_CALLBACK:
 		CDEBUG(D_INODE, "completion ast\n");
 		req_capsule_extend(&req->rq_pill, &RQF_LDLM_CP_CALLBACK);
-		ldlm_callback_reply(req, 0);
-		ldlm_handle_cp_callback(req, ns, dlm_req, lock);
+		rc = ldlm_handle_cp_callback(req, ns, dlm_req, lock);
+		if (!OBD_FAIL_CHECK(OBD_FAIL_LDLM_CANCEL_BL_CB_RACE))
+			ldlm_callback_reply(req, rc);
 		break;
 	case LDLM_GL_CALLBACK:
 		CDEBUG(D_INODE, "glimpse ast\n");
diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index 7eba8d2..fcb2af5 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -964,6 +964,9 @@ static u64 ldlm_cli_cancel_local(struct ldlm_lock *lock)
 		bool local_only;
 
 		LDLM_DEBUG(lock, "client-side cancel");
+		OBD_FAIL_TIMEOUT(OBD_FAIL_LDLM_PAUSE_CANCEL_LOCAL,
+				 cfs_fail_val);
+
 		/* Set this flag to prevent others from getting new references*/
 		lock_res_and_lock(lock);
 		ldlm_set_cbpending(lock);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 591/622] lustre: uapi: properly pack data structures
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (589 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 590/622] lustre: ldlm: signal vs CP callback race James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 592/622] lnet: peer lookup handle shutdown James Simmons
                   ` (31 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

Linux UAPI headers use the gcc __packed__ attribute to ensure
that the data structures are the exact same size on all platforms.
This comes at the cost of potentially misaligned accesses to these
data structures, which at best cost performance and at worst cause
a bus error on some platforms. To detect potential misaligned
accesses, gcc version 9 introduced a new compile flag, which is
now impacting Lustre builds.

Examining the build failures shows most of the problems are due to
packed data structures in the Lustre UAPI headers containing
unpacked data structure fields. Packing those missed structures
resolves many of the build issues. The second problem is that the
Lustre utilities tend to cast some of their UAPI data structures.
A good example is struct lov_user_md being cast to
struct lov_user_md_v3. To ensure this is properly handled with
packed data structures, we need to use the __may_alias__ compiler
attribute. The one exception is struct statx, which is defined
outside of Lustre and is unpacked. This requires extra special
handling in userland code due to the issues described above.

The Lustre UAPI headers currently use __packed to avoid
checkpatch errors from when Lustre was in the staging tree.
Now that the Lustre UAPI headers are in their proper place,
update them to use __attribute__((packed)) instead of
__packed.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12822
Lustre-commit: 4751e4a95197 ("LU-12822 uapi: properly pack data structures")
Signed-off-by: James Simmons <jsimmons@infradead.org>
Reviewed-on: https://review.whamcloud.com/36798
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Quentin Bouget <quentin.bouget@cea.fr>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
---
 include/uapi/linux/lustre/lustre_idl.h  | 54 ++++++++++++++++-----------------
 include/uapi/linux/lustre/lustre_user.h | 42 ++++++++++++-------------
 2 files changed, 48 insertions(+), 48 deletions(-)

diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index a69d49a..19ac0cb 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -2426,7 +2426,7 @@ enum llog_ctxt_id {
 struct llog_logid {
 	struct ost_id		lgl_oi;
 	__u32			lgl_ogen;
-} __packed;
+} __attribute__((packed));
 
 /** Records written to the CATALOGS list */
 #define CATLIST "CATALOGS"
@@ -2435,7 +2435,7 @@ struct llog_catid {
 	__u32			lci_padding1;
 	__u32			lci_padding2;
 	__u32			lci_padding3;
-} __packed;
+} __attribute__((packed));
 
 /* Log data record types - there is no specific reason that these need to
  * be related to the RPC opcodes, but no reason not to (may be handy later?)
@@ -2477,12 +2477,12 @@ struct llog_rec_hdr {
 	__u32	lrh_index;
 	__u32	lrh_type;
 	__u32	lrh_id;
-};
+} __attribute__((packed));
 
 struct llog_rec_tail {
 	__u32	lrt_len;
 	__u32	lrt_index;
-};
+} __attribute__((packed));
 
 /* Where data follow just after header */
 #define REC_DATA(ptr)						\
@@ -2499,7 +2499,7 @@ struct llog_logid_rec {
 	__u64			lid_padding2;
 	__u64			lid_padding3;
 	struct llog_rec_tail	lid_tail;
-} __packed;
+} __attribute__((packed));
 
 struct llog_unlink_rec {
 	struct llog_rec_hdr	lur_hdr;
@@ -2507,7 +2507,7 @@ struct llog_unlink_rec {
 	__u32			lur_oseq;
 	__u32			lur_count;
 	struct llog_rec_tail	lur_tail;
-} __packed;
+} __attribute__((packed));
 
 struct llog_unlink64_rec {
 	struct llog_rec_hdr	lur_hdr;
@@ -2517,7 +2517,7 @@ struct llog_unlink64_rec {
 	__u64			lur_padding2;
 	__u64			lur_padding3;
 	struct llog_rec_tail	lur_tail;
-} __packed;
+} __attribute__((packed));
 
 struct llog_setattr64_rec {
 	struct llog_rec_hdr	lsr_hdr;
@@ -2528,7 +2528,7 @@ struct llog_setattr64_rec {
 	__u32			lsr_gid_h;
 	__u64			lsr_valid;
 	struct llog_rec_tail	lsr_tail;
-} __packed;
+} __attribute__((packed));
 
 struct llog_size_change_rec {
 	struct llog_rec_hdr	lsc_hdr;
@@ -2538,7 +2538,7 @@ struct llog_size_change_rec {
 	__u64			lsc_padding2;
 	__u64			lsc_padding3;
 	struct llog_rec_tail	lsc_tail;
-} __packed;
+} __attribute__((packed));
 
 /* changelog llog name, needed by client replicators */
 #define CHANGELOG_CATALOG "changelog_catalog"
@@ -2546,14 +2546,14 @@ struct llog_size_change_rec {
 struct changelog_setinfo {
 	__u64 cs_recno;
 	__u32 cs_id;
-} __packed;
+} __attribute__((packed));
 
 /** changelog record */
 struct llog_changelog_rec {
 	struct llog_rec_hdr	cr_hdr;
 	struct changelog_rec	cr;		/**< Variable length field */
 	struct llog_rec_tail	cr_do_not_use;	/**< for_sizezof_only */
-} __packed;
+} __attribute__((packed));
 
 struct llog_changelog_user_rec {
 	struct llog_rec_hdr	cur_hdr;
@@ -2561,7 +2561,7 @@ struct llog_changelog_user_rec {
 	__u32			cur_padding;
 	__u64			cur_endrec;
 	struct llog_rec_tail	cur_tail;
-} __packed;
+} __attribute__((packed));
 
 enum agent_req_status {
 	ARS_WAITING,
@@ -2602,13 +2602,13 @@ struct llog_agent_req_rec {
 	__u64			arr_req_change;	/**< req. status change time */
 	struct hsm_action_item	arr_hai;	/**< req. to the agent */
 	struct llog_rec_tail	arr_tail;   /**< record tail for_sizezof_only */
-} __packed;
+} __attribute__((packed));
 
 /* Old llog gen for compatibility */
 struct llog_gen {
 	__u64 mnt_cnt;
 	__u64 conn_cnt;
-} __packed;
+} __attribute__((packed));
 
 struct llog_gen_rec {
 	struct llog_rec_hdr	lgr_hdr;
@@ -2679,7 +2679,7 @@ struct llog_log_hdr {
 	 */
 	__u32			llh_bitmap[LLOG_BITMAP_BYTES / sizeof(__u32)];
 	struct llog_rec_tail	llh_tail;
-} __packed;
+} __attribute__((packed));
 
 #undef LLOG_HEADER_SIZE
 #undef LLOG_BITMAP_BYTES
@@ -2701,7 +2701,7 @@ struct llog_cookie {
 	__u32			lgc_subsys;
 	__u32			lgc_index;
 	__u32			lgc_padding;
-} __packed;
+} __attribute__((packed));
 
 /** llog protocol */
 enum llogd_rpc_ops {
@@ -2726,13 +2726,13 @@ struct llogd_body {
 	__u32 lgd_saved_index;
 	__u32 lgd_len;
 	__u64 lgd_cur_offset;
-} __packed;
+} __attribute__((packed));
 
 struct llogd_conn_body {
 	struct llog_gen		lgdc_gen;
 	struct llog_logid	lgdc_logid;
 	__u32			lgdc_ctxt_idx;
-} __packed;
+} __attribute__((packed));
 
 /* Note: 64-bit types are 64-bit aligned in structure */
 struct obdo {
@@ -2832,7 +2832,7 @@ struct lustre_capa {
 /* FIXME: y2038 time_t overflow: */
 	__u32		lc_expiry;	/** expiry time (sec) */
 	__u8		lc_hmac[CAPA_HMAC_MAX_LEN];   /** HMAC */
-} __packed;
+} __attribute__((packed));
 
 /** lustre_capa::lc_opc */
 enum {
@@ -2864,7 +2864,7 @@ struct lustre_capa_key {
 	__u32	lk_keyid;			/**< key# */
 	__u32	lk_padding;
 	__u8	lk_key[CAPA_HMAC_KEY_MAX_LEN];	/**< key */
-} __packed;
+} __attribute__((packed));
 
 /** The link ea holds 1 @link_ea_entry for each hardlink */
 #define LINK_EA_MAGIC 0x11EAF1DFUL
@@ -2884,7 +2884,7 @@ struct link_ea_entry {
 	unsigned char	lee_reclen[2];
 	unsigned char	lee_parent_fid[sizeof(struct lu_fid)];
 	char		lee_name[0];
-} __packed;
+} __attribute__((packed));
 
 /** fid2path request/reply structure */
 struct getinfo_fid2path {
@@ -2896,7 +2896,7 @@ struct getinfo_fid2path {
 		char		gf_path[0];
 		struct lu_fid	gf_root_fid[0];
 	};
-} __packed;
+} __attribute__((packed));
 
 /** path2parent request/reply structures */
 struct getparent {
@@ -2904,7 +2904,7 @@ struct getparent {
 	__u32		gp_linkno;	/**< hardlink number */
 	__u32		gp_name_size;	/**< size of the name field */
 	char		gp_name[0];	/**< zero-terminated link name */
-} __packed;
+} __attribute__((packed));
 
 enum layout_intent_opc {
 	LAYOUT_INTENT_ACCESS	= 0,	/** generic access */
@@ -2921,7 +2921,7 @@ struct layout_intent {
 	__u32 li_opc;	/* intent operation for enqueue, read, write etc */
 	__u32 li_flags;
 	struct lu_extent li_extent;
-} __packed;
+} __attribute__((packed));
 
 /**
  * On the wire version of hsm_progress structure.
@@ -2939,20 +2939,20 @@ struct hsm_progress_kernel {
 	/* Additional fields */
 	__u64			hpk_data_version;
 	__u64			hpk_padding2;
-} __packed;
+} __attribute__((packed));
 
 /** layout swap request structure
  * fid1 and fid2 are in mdt_body
  */
 struct mdc_swap_layouts {
 	__u64			msl_flags;
-} __packed;
+} __attribute__((packed));
 
 #define INLINE_RESYNC_ARRAY_SIZE	15
 struct close_data_resync_done {
 	__u32	resync_count;
 	__u32	resync_ids_inline[INLINE_RESYNC_ARRAY_SIZE];
-};
+} __attribute__((packed));
 
 struct close_data {
 	struct lustre_handle	cd_handle;
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 08589e6..5c21f34 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -157,7 +157,7 @@ struct lu_fid {
 	 * used.
 	 **/
 	__u32 f_ver;
-};
+} __attribute__((packed));
 
 static inline bool fid_is_zero(const struct lu_fid *fid)
 {
@@ -176,7 +176,7 @@ struct ost_layout {
 	__u64	ol_comp_start;
 	__u64	ol_comp_end;
 	__u32	ol_comp_id;
-} __packed;
+} __attribute__((packed));
 
 /* Userspace should treat lu_fid as opaque, and only use the following methods
  * to print or parse them.  Other functions (e.g. compare, swab) could be moved
@@ -245,7 +245,7 @@ struct ost_id {
 		} oi;
 		struct lu_fid oi_fid;
 	};
-};
+} __attribute__((packed));
 
 #define DOSTID "%#llx:%llu"
 #define POSTID(oi) ostid_seq(oi), ostid_id(oi)
@@ -462,7 +462,7 @@ struct lov_user_ost_data_v1 {	/* per-stripe data structure */
 	struct ost_id l_ost_oi;	/* OST object ID */
 	__u32 l_ost_gen;	/* generation of this OST index */
 	__u32 l_ost_idx;	/* OST index in LOV */
-} __packed;
+} __attribute__((packed));
 
 #define lov_user_md lov_user_md_v1
 struct lov_user_md_v1 {		/* LOV EA user data (host-endian) */
@@ -480,7 +480,7 @@ struct lov_user_md_v1 {		/* LOV EA user data (host-endian) */
 						 */
 	};
 	struct lov_user_ost_data_v1 lmm_objects[0]; /* per-stripe data */
-} __attribute__((packed,  __may_alias__));
+} __attribute__((packed, __may_alias__));
 
 struct lov_user_md_v3 {		/* LOV EA user data (host-endian) */
 	__u32 lmm_magic;	/* magic number = LOV_USER_MAGIC_V3 */
@@ -498,7 +498,7 @@ struct lov_user_md_v3 {		/* LOV EA user data (host-endian) */
 	};
 	char  lmm_pool_name[LOV_MAXPOOLNAME + 1];   /* pool name */
 	struct lov_user_ost_data_v1 lmm_objects[0]; /* per-stripe data */
-} __packed;
+} __attribute__((packed, __may_alias__));
 
 struct lov_foreign_md {
 	__u32 lfm_magic;	/* magic number = LOV_MAGIC_FOREIGN */
@@ -506,7 +506,7 @@ struct lov_foreign_md {
 	__u32 lfm_type;		/* type, see LU_FOREIGN_TYPE_ */
 	__u32 lfm_flags;	/* flags, type specific */
 	char lfm_value[];
-};
+} __attribute__((packed));
 
 #define foreign_size(lfm) (((struct lov_foreign_md *)lfm)->lfm_length + \
 			   offsetof(struct lov_foreign_md, lfm_value))
@@ -518,7 +518,7 @@ struct lov_foreign_md {
 struct lu_extent {
 	__u64	e_start;
 	__u64	e_end;
-};
+} __attribute__((packed));
 
 #define DEXT "[%#llx, %#llx)"
 #define PEXT(ext) (unsigned long long)(ext)->e_start, (unsigned long long)(ext)->e_end
@@ -583,7 +583,7 @@ struct lov_comp_md_entry_v1 {
 	__u32			lcme_layout_gen;
 	__u64			lcme_timestamp;	/* snapshot time if applicable*/
 	__u32			lcme_padding_1;
-} __packed;
+} __attribute__((packed));
 
 #define SEQ_ID_MAX		0x0000FFFF
 #define SEQ_ID_MASK		SEQ_ID_MAX
@@ -626,7 +626,7 @@ struct lov_comp_md_v1 {
 	__u16	lcm_padding1[3];
 	__u64	lcm_padding2;
 	struct lov_comp_md_entry_v1 lcm_entries[0];
-} __packed;
+} __attribute__((packed));
 
 static inline __u32 lov_user_md_size(__u16 stripes, __u32 lmm_magic)
 {
@@ -649,7 +649,7 @@ static inline __u32 lov_user_md_size(__u16 stripes, __u32 lmm_magic)
 struct lov_user_mds_data_v1 {
 	lstat_t lmd_st;			/* MDS stat struct */
 	struct lov_user_md_v1 lmd_lmm;	/* LOV EA V1 user data */
-} __packed;
+} __attribute__((packed));
 
 struct lov_user_mds_data_v2 {
 	struct lu_fid lmd_fid;		/* Lustre FID */
@@ -663,14 +663,14 @@ struct lov_user_mds_data_v2 {
 struct lov_user_mds_data_v3 {
 	lstat_t lmd_st;			/* MDS stat struct */
 	struct lov_user_md_v3 lmd_lmm;	/* LOV EA V3 user data */
-} __packed;
+} __attribute__((packed));
 #endif
 
 struct lmv_user_mds_data {
 	struct lu_fid	lum_fid;
 	__u32		lum_padding;
 	__u32		lum_mds;
-};
+} __attribute__((packed, __may_alias__));
 
 enum lmv_hash_type {
 	LMV_HASH_TYPE_UNKNOWN	= 0,	/* 0 is reserved for testing purpose */
@@ -743,7 +743,7 @@ struct lmv_user_md_v1 {
 	__u32	lum_padding3;
 	char	lum_pool_name[LOV_MAXPOOLNAME + 1];
 	struct lmv_user_mds_data  lum_objects[0];
-} __packed;
+} __attribute__((packed));
 
 static inline __u32 lmv_foreign_to_md_stripes(__u32 size)
 {
@@ -1315,8 +1315,8 @@ struct changelog_rec {
 		struct lu_fid    cr_tfid;	/**< target fid */
 		__u32	 cr_markerflags; /**< CL_MARK flags */
 	};
-	struct lu_fid	    cr_pfid;	/**< parent fid */
-} __packed;
+	struct lu_fid	 cr_pfid;		/**< parent fid */
+} __attribute__((packed));
 
 /* Changelog extension for RENAME. */
 struct changelog_ext_rename {
@@ -1758,7 +1758,7 @@ enum hsm_states {
 struct hsm_extent {
 	__u64 offset;
 	__u64 length;
-} __packed;
+} __attribute__((packed));
 
 /**
  * Current HSM states of a Lustre file.
@@ -1842,7 +1842,7 @@ struct hsm_request {
 struct hsm_user_item {
 	struct lu_fid	hui_fid;
 	struct hsm_extent hui_extent;
-} __packed;
+} __attribute__((packed));
 
 struct hsm_user_request {
 	struct hsm_request	hur_request;
@@ -1850,7 +1850,7 @@ struct hsm_user_request {
 	/* extra data blob@end of struct (after all
 	 * hur_user_items), only use helpers to access it
 	 */
-} __packed;
+} __attribute__((packed));
 
 /** Return pointer to data field in a hsm user request */
 static inline void *hur_data(struct hsm_user_request *hur)
@@ -1916,7 +1916,7 @@ struct hsm_action_item {
 	__u64		hai_cookie;  /* action cookie from coordinator */
 	__u64		hai_gid;     /* grouplock id */
 	char		hai_data[0]; /* variable length */
-} __packed;
+} __attribute__((packed));
 
 /*
  * helper function which print in hexa the first bytes of
@@ -1960,7 +1960,7 @@ struct hsm_action_list {
 	/* struct hsm_action_item[hal_count] follows, aligned on 8-byte
 	 * boundaries. See hai_first
 	 */
-} __packed;
+} __attribute__((packed));
 
 /* Return pointer to first hai in action list */
 static inline struct hsm_action_item *hai_first(struct hsm_action_list *hal)
-- 
1.8.3.1


* [lustre-devel] [PATCH 592/622] lnet: peer lookup handle shutdown
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (590 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 591/622] lustre: uapi: properly pack data structures James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 593/622] lnet: lnet response entries leak James Simmons
                   ` (30 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Amir Shehata <ashehata@whamcloud.com>

When LNet is shutting down, looking up peer_nis should not assert
but instead return NULL. Callers already handle a NULL return.

WC-bug-id: https://jira.whamcloud.com/browse/LU-13049
Lustre-commit: f46b22aa6a28 ("LU-13049 lnet: peer lookup handle shutdown")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/36925
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/peer.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index b168c97..f987fff 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -647,7 +647,8 @@ void lnet_peer_uninit(void)
 	struct list_head *peers;
 	struct lnet_peer_ni *lp;
 
-	LASSERT(the_lnet.ln_state == LNET_STATE_RUNNING);
+	if (the_lnet.ln_state != LNET_STATE_RUNNING)
+		return NULL;
 
 	peers = &ptable->pt_hash[lnet_nid2peerhash(nid)];
 	list_for_each_entry(lp, peers, lpni_hashlist) {
-- 
1.8.3.1


* [lustre-devel] [PATCH 593/622] lnet: lnet response entries leak
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (591 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 592/622] lnet: peer lookup handle shutdown James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 594/622] lustre: lmv: disable statahead for remote objects James Simmons
                   ` (29 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Alexey Lyashkov <c17817@cray.com>

LNetPut is called with the ACK flag, but LNetMDUnlink is issued
before the ACK arrives. This can happen due to a timeout or an
application call (ldiskfs commit for difficult replies on the MDT).
The MD is freed but the response tracker is not detached, since the
ACK does not hold a reference to the MD between the request send
and the ACK arrival. The monitor thread detects this situation and
moves the RSP entry onto the zombie list, where it is never freed
because no message is processed once the MD is absent.

Remove the response tracking when nobody wants the reply any more,
i.e. when LNetMDUnlink has been called.

Cray-bug-id: LUS-8188
WC-bug-id: https://jira.whamcloud.com/browse/LU-12991
Lustre-commit: b7035222bd64 ("LU-12991 lnet: lnet response entries leak")
Signed-off-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-on: https://review.whamcloud.com/36896
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h | 2 ++
 net/lnet/lnet/lib-md.c        | 3 +++
 2 files changed, 5 insertions(+)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index 3b597e3..bf357b0 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -157,6 +157,8 @@ static inline int lnet_md_unlinkable(struct lnet_libmd *md)
 {
 	unsigned int size;
 
+	LASSERTF(md->md_rspt_ptr == NULL, "md %p rsp %p\n", md, md->md_rspt_ptr);
+
 	if ((md->md_options & LNET_MD_KIOV) != 0)
 		size = offsetof(struct lnet_libmd, md_iov.kiov[md->md_niov]);
 	else
diff --git a/net/lnet/lnet/lib-md.c b/net/lnet/lnet/lib-md.c
index 4a70c76..5ee43c2 100644
--- a/net/lnet/lnet/lib-md.c
+++ b/net/lnet/lnet/lib-md.c
@@ -548,6 +548,9 @@ int lnet_cpt_of_md(struct lnet_libmd *md, unsigned int offset)
 		lnet_eq_enqueue_event(md->md_eq, &ev);
 	}
 
+	if (md->md_rspt_ptr)
+		lnet_detach_rsp_tracker(md, cpt);
+
 	lnet_md_unlink(md);
 
 	lnet_res_unlock(cpt);
-- 
1.8.3.1


* [lustre-devel] [PATCH 594/622] lustre: lmv: disable statahead for remote objects
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (592 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 593/622] lnet: lnet response entries leak James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 595/622] lustre: llite: eviction during ll_open_cleanup() James Simmons
                   ` (28 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Vladimir Saveliev <c17830@cray.com>

Statahead for remote objects is supposed to be disabled by
"LU-11681 lmv: disable remote file statahead".

However, due to a typo, it is not, and statahead for remote objects
is accompanied by warnings like:
  ll_set_inode()) Can not initialize inode .. without object type..
  ll_prep_inode()) new_inode -fatal: rc -12

Fix the typo.

A test to illustrate the issue is added.

Fixes: 6dd8b9909e79 ("lustre: lmv: disable remote file statahead")

WC-bug-id: https://jira.whamcloud.com/browse/LU-13099
Lustre-commit: 68330379b01c ("LU-13099 lmv: disable statahead for remote objects")
Signed-off-by: Vladimir Saveliev <c17830@cray.com>
Cray-bug-id: LUS-8262
Reviewed-on: https://review.whamcloud.com/37089
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Olaf Faaland-LLNL <faaland1@llnl.gov>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/lmv/lmv_obd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index ee52bba..cead3a1 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -3369,7 +3369,7 @@ static int lmv_intent_getattr_async(struct obd_export *exp,
 	if (IS_ERR(ptgt))
 		return PTR_ERR(ptgt);
 
-	ctgt = lmv_fid2tgt(lmv, &op_data->op_fid1);
+	ctgt = lmv_fid2tgt(lmv, &op_data->op_fid2);
 	if (IS_ERR(ctgt))
 		return PTR_ERR(ctgt);
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 595/622] lustre: llite: eviction during ll_open_cleanup()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (593 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 594/622] lustre: lmv: disable statahead for remote objects James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 596/622] lustre: ptlrpc: show target name in req_history James Simmons
                   ` (27 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Andriy Skulysh <c17819@cray.com>

On error, ll_open_cleanup() is called while the intent lock
remains pinned, so an eviction can happen while the close
request waits for a mod RPC slot.

Release the intent lock before calling ll_open_cleanup().

Cray-bug-id: LUS-8055
WC-bug-id: https://jira.whamcloud.com/browse/LU-13101
Lustre-commit: 6d5d7c6bdb4f ("LU-13101 llite: eviction during ll_open_cleanup()")
Signed-off-by: Andriy Skulysh <c17819@cray.com>
Reviewed-by: Alexander Boyko <c17825@cray.com>
Reviewed-by: Andrew Perepechko <c17827@cray.com>
Reviewed-by: Vitaly Fertman <c17818@cray.com>
Reviewed-on: https://review.whamcloud.com/37096
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_lib.c | 4 +++-
 fs/lustre/llite/namei.c     | 4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 1a8a5ec..33ab3f7 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -2507,8 +2507,10 @@ int ll_prep_inode(struct inode **inode, struct ptlrpc_request *req,
 	/* cleanup will be done if necessary */
 	md_free_lustre_md(sbi->ll_md_exp, &md);
 
-	if (rc != 0 && it && it->it_op & IT_OPEN)
+	if (rc != 0 && it && it->it_op & IT_OPEN) {
+		ll_intent_drop_lock(it);
 		ll_open_cleanup(sb ? sb : (*inode)->i_sb, req);
+	}
 
 	return rc;
 }
diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index 13c1cf9..89317db 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -709,8 +709,10 @@ static int ll_lookup_it_finish(struct ptlrpc_request *request,
 	}
 
 out:
-	if (rc != 0 && it->it_op & IT_OPEN)
+	if (rc != 0 && it->it_op & IT_OPEN) {
+		ll_intent_drop_lock(it);
 		ll_open_cleanup((*de)->d_sb, request);
+	}
 
 	return rc;
 }
-- 
1.8.3.1


* [lustre-devel] [PATCH 596/622] lustre: ptlrpc: show target name in req_history
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (594 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 595/622] lustre: llite: eviction during ll_open_cleanup() James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 597/622] lustre: dom: check read-on-open buffer presents in reply James Simmons
                   ` (26 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Currently the req_history tracing shows the "self" NID as the second
field.  However, this is not very useful since there may be a number
of different targets on the same server, and since the logs are all
collected directly on the server we already know the local NID.

Instead of printing the "self" NID, store the target name as the
second field, if that is available, so that we can determine which
target the RPC was intended for.  This makes it easier to debug
problems with bad clients and isolate traffic for a specific target.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11644
Lustre-commit: 83b6c6608e94 ("LU-11644 ptlrpc: show target name in req_history")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/37193
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Nathaniel Clark <nclark@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/lproc_ptlrpc.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/ptlrpc/lproc_ptlrpc.c b/fs/lustre/ptlrpc/lproc_ptlrpc.c
index d52a08a..f34aec3 100644
--- a/fs/lustre/ptlrpc/lproc_ptlrpc.c
+++ b/fs/lustre/ptlrpc/lproc_ptlrpc.c
@@ -956,7 +956,6 @@ static int ptlrpc_lprocfs_svc_req_history_show(struct seq_file *s, void *iter)
 
 		req = srhi->srhi_req;
 
-		libcfs_nid2str_r(req->rq_self, nidstr, sizeof(nidstr));
 		arrival.tv_sec = req->rq_arrival_time.tv_sec;
 		arrival.tv_nsec = req->rq_arrival_time.tv_nsec;
 		sent.tv_sec = req->rq_sent;
@@ -970,8 +969,13 @@ static int ptlrpc_lprocfs_svc_req_history_show(struct seq_file *s, void *iter)
 		 * parser. Currently I only print stuff here I know is OK
 		 * to look at coz it was set up in request_in_callback()!!!
 		 */
-		seq_printf(s, "%lld:%s:%s:x%llu:%d:%s:%lld.%06lld:%lld.%06llds(%+lld.0s) ",
-			   req->rq_history_seq, nidstr,
+		seq_printf(s,
+			   "%lld:%s:%s:x%llu:%d:%s:%lld.%06lld:%lld.%06llds(%+lld.0s) ",
+			   req->rq_history_seq,
+			   req->rq_export && req->rq_export->exp_obd ?
+				req->rq_export->exp_obd->obd_name :
+				libcfs_nid2str_r(req->rq_self, nidstr,
+						 sizeof(nidstr)),
 			   libcfs_id2str(req->rq_peer), req->rq_xid,
 			   req->rq_reqlen, ptlrpc_rqphase2str(req),
 			   (s64)req->rq_arrival_time.tv_sec,
-- 
1.8.3.1


* [lustre-devel] [PATCH 597/622] lustre: dom: check read-on-open buffer presents in reply
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (595 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 596/622] lustre: ptlrpc: show target name in req_history James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 598/622] lustre: llite: proper names/types for offset/pages James Simmons
                   ` (25 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

ll_dom_finish_open() uses req_capsule_has_field() wrongly:
it checks only the format, not buffer presence in the reply,
which causes unneeded console errors later in
req_capsule_server_get() about the missing buffer.

Replace that call with req_capsule_field_present() to check
whether the server actually packed that field in the reply,
and properly skip responses from an old server.

WC-bug-id: https://jira.whamcloud.com/browse/LU-13136
Lustre-commit: 58bea527100b ("LU-13136 dom: check read-on-open buffer presents in reply")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/37249
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: Stephane Thiell <sthiell@stanford.edu>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_req_layout.h | 3 +++
 fs/lustre/llite/file.c                | 4 ++--
 fs/lustre/ptlrpc/layout.c             | 7 ++++---
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/lustre/include/lustre_req_layout.h b/fs/lustre/include/lustre_req_layout.h
index feb5e77..ea6baef 100644
--- a/fs/lustre/include/lustre_req_layout.h
+++ b/fs/lustre/include/lustre_req_layout.h
@@ -112,6 +112,9 @@ u32 req_capsule_fmt_size(u32 magic, const struct req_format *fmt,
 int req_capsule_has_field(const struct req_capsule *pill,
 			  const struct req_msg_field *field,
 			  enum req_location loc);
+int req_capsule_field_present(const struct req_capsule *pill,
+			      const struct req_msg_field *field,
+			      enum req_location loc);
 void req_capsule_shrink(struct req_capsule *pill,
 			const struct req_msg_field *field,
 			u32 newlen, enum req_location loc);
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index a3c36a7..c7233bf 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -459,8 +459,8 @@ void ll_dom_finish_open(struct inode *inode, struct ptlrpc_request *req,
 	if (!obj)
 		return;
 
-	if (!req_capsule_has_field(&req->rq_pill, &RMF_NIOBUF_INLINE,
-				   RCL_SERVER))
+	if (!req_capsule_field_present(&req->rq_pill, &RMF_NIOBUF_INLINE,
+				       RCL_SERVER))
 		return;
 
 	rnb = req_capsule_server_get(&req->rq_pill, &RMF_NIOBUF_INLINE);
diff --git a/fs/lustre/ptlrpc/layout.c b/fs/lustre/ptlrpc/layout.c
index 06db86d..4213fb2 100644
--- a/fs/lustre/ptlrpc/layout.c
+++ b/fs/lustre/ptlrpc/layout.c
@@ -2268,9 +2268,9 @@ int req_capsule_has_field(const struct req_capsule *pill,
  * Returns a non-zero value if the given @field is present in the given
  * @pill's PTLRPC request or reply (@loc), else it returns 0.
  */
-static int req_capsule_field_present(const struct req_capsule *pill,
-				     const struct req_msg_field *field,
-				     enum req_location loc)
+int req_capsule_field_present(const struct req_capsule *pill,
+			      const struct req_msg_field *field,
+			      enum req_location loc)
 {
 	u32 offset;
 
@@ -2280,6 +2280,7 @@ static int req_capsule_field_present(const struct req_capsule *pill,
 	offset = __req_capsule_offset(pill, field, loc);
 	return lustre_msg_bufcount(__req_msg(pill, loc)) > offset;
 }
+EXPORT_SYMBOL(req_capsule_field_present);
 
 /**
  * This function shrinks the size of the _buffer_ of the @pill's PTLRPC
-- 
1.8.3.1


* [lustre-devel] [PATCH 598/622] lustre: llite: proper names/types for offset/pages
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (596 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 597/622] lustre: dom: check read-on-open buffer presents in reply James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 599/622] lustre: llite: Accept EBUSY for page unaligned read James Simmons
                   ` (24 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Use loff_t for file offsets and pgoff_t for page index values
instead of unsigned long, so that it is possible to distinguish
what type of value is being used in the byte-granular readahead
code.  Otherwise, it is difficult to determine what units a given
function's "start" or "end" values are in.

Rename variables that reference page index values with an "_idx"
suffix to make this clear when reading the code.  Similarly, use
"bytes" or "pages" for variable names instead of "count" or "len".

Fix stride_page_count() to properly use loff_t for the byte_count,
which might otherwise overflow for large strides.

Cast pgoff_t vars to loff_t before PAGE_SIZE shift to avoid overflow.
Use shift and mask with PAGE_SIZE and PAGE_MASK instead of mod/div.

Use proper 64-bit division functions for the loff_t types when
calculating stride, since they are not guaranteed to be within 4GB.

Remove unused "remainder" argument from ras_align() function.

Fixes: 91d264551508 ("LU-12518 llite: support page unaligned stride readahead")
WC-bug-id: https://jira.whamcloud.com/browse/LU-12518
Lustre-commit: 83d8dd1d7c30 ("LU-12518 llite: proper names/types for offset/pages")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/37248
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Gu Zheng <gzheng@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/cl_object.h    |  10 +-
 fs/lustre/llite/file.c           |   6 +-
 fs/lustre/llite/llite_internal.h |  49 +++--
 fs/lustre/llite/rw.c             | 455 ++++++++++++++++++++-------------------
 fs/lustre/llite/vvp_internal.h   |   4 +-
 fs/lustre/llite/vvp_io.c         |  18 +-
 fs/lustre/lov/lov_io.c           |  21 +-
 fs/lustre/mdc/mdc_dev.c          |   4 +-
 fs/lustre/obdclass/integrity.c   |   2 +-
 fs/lustre/osc/osc_cache.c        |   2 +-
 fs/lustre/osc/osc_io.c           |   8 +-
 11 files changed, 294 insertions(+), 285 deletions(-)

diff --git a/fs/lustre/include/cl_object.h b/fs/lustre/include/cl_object.h
index 67731b0..aa54537 100644
--- a/fs/lustre/include/cl_object.h
+++ b/fs/lustre/include/cl_object.h
@@ -1464,14 +1464,14 @@ struct cl_read_ahead {
 	 * This is determined DLM lock coverage, RPC and stripe boundary.
 	 * cra_end is included.
 	 */
-	pgoff_t				cra_end;
+	pgoff_t				cra_end_idx;
 	/* optimal RPC size for this read, by pages */
-	unsigned long			cra_rpc_size;
-	/*
-	 * Release callback. If readahead holds resources underneath, this
+	unsigned long			cra_rpc_pages;
+	/* Release callback. If readahead holds resources underneath, this
 	 * function should be called to release it.
 	 */
-	void (*cra_release)(const struct lu_env *env, void *cbdata);
+	void				(*cra_release)(const struct lu_env *env,
+						       void *cbdata);
 	/* Callback data for cra_release routine */
 	void				*cra_cbdata;
 	/* whether lock is in contention */
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index c7233bf..097dbeb 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -472,7 +472,7 @@ void ll_dom_finish_open(struct inode *inode, struct ptlrpc_request *req,
 	 * client PAGE_SIZE to be used on that client, if server's PAGE_SIZE is
 	 * smaller then offset may be not aligned and that data is just ignored.
 	 */
-	if (rnb->rnb_offset % PAGE_SIZE)
+	if (rnb->rnb_offset & ~PAGE_MASK)
 		return;
 
 	/* Server returns whole file or just file tail if it fills in reply
@@ -492,9 +492,9 @@ void ll_dom_finish_open(struct inode *inode, struct ptlrpc_request *req,
 	data = (char *)rnb + sizeof(*rnb);
 
 	lnb.lnb_file_offset = rnb->rnb_offset;
-	start = lnb.lnb_file_offset / PAGE_SIZE;
+	start = lnb.lnb_file_offset >> PAGE_SHIFT;
 	index = 0;
-	LASSERT(lnb.lnb_file_offset % PAGE_SIZE == 0);
+	LASSERT((lnb.lnb_file_offset & ~PAGE_MASK) == 0);
 	lnb.lnb_page_offset = 0;
 	do {
 		lnb.lnb_data = data + (index << PAGE_SHIFT);
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index b7b418f..55d451fe 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -464,22 +464,22 @@ struct ll_ra_info {
  * counted by page index.
  */
 struct ra_io_arg {
-	pgoff_t		ria_start;	/* start offset of read-ahead*/
-	pgoff_t		ria_end;	/* end offset of read-ahead*/
+	pgoff_t		ria_start_idx;	/* start offset of read-ahead*/
+	pgoff_t		ria_end_idx;	/* end offset of read-ahead*/
 	unsigned long	ria_reserved;	/* reserved pages for read-ahead */
-	pgoff_t		ria_end_min;	/* minimum end to cover current read */
+	pgoff_t		ria_end_idx_min;/* minimum end to cover current read */
 	bool		ria_eof;	/* reach end of file */
-	/* If stride read pattern is detected, ria_stoff means where
-	 * stride read is started. Note: for normal read-ahead, the
+	/* If stride read pattern is detected, ria_stoff is the byte offset
+	 * where stride read is started. Note: for normal read-ahead, the
 	 * value here is meaningless, and also it will not be accessed
 	 */
-	unsigned long	ria_stoff;
+	loff_t		ria_stoff;
 	/* ria_length and ria_bytes are the length and pages length in the
 	 * stride I/O mode. And they will also be used to check whether
 	 * it is stride I/O read-ahead in the read-ahead pages
 	 */
-	unsigned long	ria_length;
-	unsigned long	ria_bytes;
+	loff_t		ria_length;
+	loff_t		ria_bytes;
 };
 
 /* LL_HIST_MAX=32 causes an overflow */
@@ -697,9 +697,9 @@ struct ll_sb_info {
  * per file-descriptor read-ahead data.
  */
 struct ll_readahead_state {
-	spinlock_t  ras_lock;
+	spinlock_t	ras_lock;
 	/* End byte that read(2) try to read.  */
-	unsigned long	ras_last_read_end;
+	loff_t		ras_last_read_end_bytes;
 	/*
 	 * number of bytes read after last read-ahead window reset. As window
 	 * is reset on each seek, this is effectively a number of consecutive
@@ -710,7 +710,7 @@ struct ll_readahead_state {
 	 * case, it probably doesn't make sense to expand window to
 	 * PTLRPC_MAX_BRW_PAGES on the third access.
 	 */
-	unsigned long	ras_consecutive_bytes;
+	loff_t		ras_consecutive_bytes;
 	/*
 	 * number of read requests after the last read-ahead window reset
 	 * As window is reset on each seek, this is effectively the number
@@ -724,12 +724,13 @@ struct ll_readahead_state {
 	 * expanded to PTLRPC_MAX_BRW_PAGES. Afterwards, window is enlarged by
 	 * PTLRPC_MAX_BRW_PAGES chunks up to ->ra_max_pages.
 	 */
-	pgoff_t		ras_window_start, ras_window_len;
+	pgoff_t		ras_window_start_idx;
+	pgoff_t		ras_window_pages;
 	/*
-	 * Optimal RPC size. It decides how many pages will be sent
-	 * for each read-ahead.
+	 * Optimal RPC size in pages.
+	 * It decides how many pages will be sent for each read-ahead.
 	 */
-	unsigned long	ras_rpc_size;
+	unsigned long	ras_rpc_pages;
 	/*
 	 * Where next read-ahead should start at. This lies within read-ahead
 	 * window. Read-ahead window is read in pieces rather than at once
@@ -737,7 +738,7 @@ struct ll_readahead_state {
 	 * ->ra_max_pages (see ll_ra_count_get()), 2. client cannot read pages
 	 * not covered by DLM lock.
 	 */
-	pgoff_t		ras_next_readahead;
+	pgoff_t		ras_next_readahead_idx;
 	/*
 	 * Total number of ll_file_read requests issued, reads originating
 	 * due to mmap are not counted in this total.  This value is used to
@@ -755,9 +756,9 @@ struct ll_readahead_state {
 	 * ras_stride_bytes = stride_bytes;
 	 * Note: all these three items are counted by bytes.
 	 */
-	unsigned long	ras_stride_length;
-	unsigned long	ras_stride_bytes;
-	unsigned long	ras_stride_offset;
+	loff_t		ras_stride_length;
+	loff_t		ras_stride_bytes;
+	loff_t		ras_stride_offset;
 	/*
 	 * number of consecutive stride request count, and it is similar as
 	 * ras_consecutive_requests, but used for stride I/O mode.
@@ -766,7 +767,7 @@ struct ll_readahead_state {
 	 */
 	unsigned long	ras_consecutive_stride_requests;
 	/* index of the last page that async readahead starts */
-	pgoff_t		ras_async_last_readpage;
+	pgoff_t		ras_async_last_readpage_idx;
 	/* whether we should increase readahead window */
 	bool		ras_need_increase_window;
 	/* whether ra miss check should be skipped */
@@ -776,10 +777,8 @@ struct ll_readahead_state {
 struct ll_readahead_work {
 	/** File to readahead */
 	struct file			*lrw_file;
-	/** Start bytes */
-	unsigned long			 lrw_start;
-	/** End bytes */
-	unsigned long			 lrw_end;
+	pgoff_t				 lrw_start_idx;
+	pgoff_t				 lrw_end_idx;
 
 	/* async worker to handler read */
 	struct work_struct		 lrw_readahead_work;
@@ -868,7 +867,7 @@ static inline bool ll_sbi_has_file_heat(struct ll_sb_info *sbi)
 	return !!(sbi->ll_flags & LL_SBI_FILE_HEAT);
 }
 
-void ll_ras_enter(struct file *f, unsigned long pos, unsigned long count);
+void ll_ras_enter(struct file *f, loff_t pos, size_t count);
 
 /* llite/lcommon_misc.c */
 int cl_ocd_update(struct obd_device *host, struct obd_device *watched,
diff --git a/fs/lustre/llite/rw.c b/fs/lustre/llite/rw.c
index bf91ae1..9509023 100644
--- a/fs/lustre/llite/rw.c
+++ b/fs/lustre/llite/rw.c
@@ -80,7 +80,8 @@
  */
 static unsigned long ll_ra_count_get(struct ll_sb_info *sbi,
 				     struct ra_io_arg *ria,
-				     unsigned long pages, unsigned long min)
+				     unsigned long pages,
+				     unsigned long pages_min)
 {
 	struct ll_ra_info *ra = &sbi->ll_ra_info;
 	long ret;
@@ -101,19 +102,19 @@ static unsigned long ll_ra_count_get(struct ll_sb_info *sbi,
 	}
 
 out:
-	if (ret < min) {
+	if (ret < pages_min) {
 		/* override ra limit for maximum performance */
-		atomic_add(min - ret, &ra->ra_cur_pages);
-		ret = min;
+		atomic_add(pages_min - ret, &ra->ra_cur_pages);
+		ret = pages_min;
 	}
 	return ret;
 }
 
-void ll_ra_count_put(struct ll_sb_info *sbi, unsigned long len)
+void ll_ra_count_put(struct ll_sb_info *sbi, unsigned long pages)
 {
 	struct ll_ra_info *ra = &sbi->ll_ra_info;
 
-	atomic_sub(len, &ra->ra_cur_pages);
+	atomic_sub(pages, &ra->ra_cur_pages);
 }
 
 static void ll_ra_stats_inc_sbi(struct ll_sb_info *sbi, enum ra_stat which)
@@ -131,19 +132,20 @@ void ll_ra_stats_inc(struct inode *inode, enum ra_stat which)
 
 #define RAS_CDEBUG(ras) \
 	CDEBUG(D_READA,							     \
-	       "lre %lu cr %lu cb %lu ws %lu wl %lu nra %lu rpc %lu r %lu csr %lu sf %lu sb %lu sl %lu lr %lu\n", \
-	       ras->ras_last_read_end, ras->ras_consecutive_requests,	     \
-	       ras->ras_consecutive_bytes, ras->ras_window_start,	     \
-	       ras->ras_window_len, ras->ras_next_readahead,		     \
-	       ras->ras_rpc_size, ras->ras_requests,			     \
+	       "lre %llu cr %lu cb %llu wsi %lu wp %lu nra %lu rpc %lu r %lu csr %lu so %llu sb %llu sl %llu lr %lu\n", \
+	       ras->ras_last_read_end_bytes, ras->ras_consecutive_requests,  \
+	       ras->ras_consecutive_bytes, ras->ras_window_start_idx,	     \
+	       ras->ras_window_pages, ras->ras_next_readahead_idx,	     \
+	       ras->ras_rpc_pages, ras->ras_requests,			     \
 	       ras->ras_consecutive_stride_requests, ras->ras_stride_offset, \
 	       ras->ras_stride_bytes, ras->ras_stride_length,		     \
-	       ras->ras_async_last_readpage)
+	       ras->ras_async_last_readpage_idx)
 
-static int pos_in_window(unsigned long pos, unsigned long point,
-			 unsigned long before, unsigned long after)
+static bool pos_in_window(loff_t pos, loff_t point,
+			  unsigned long before, unsigned long after)
 {
-	unsigned long start = point - before, end = point + after;
+	loff_t start = point - before;
+	loff_t end = point + after;
 
 	if (start > point)
 		start = 0;
@@ -228,9 +230,9 @@ static int ll_read_ahead_page(const struct lu_env *env, struct cl_io *io,
 	return rc;
 }
 
-#define RIA_DEBUG(ria)						\
-	CDEBUG(D_READA, "rs %lu re %lu ro %lu rl %lu rb %lu\n",	\
-	       ria->ria_start, ria->ria_end, ria->ria_stoff,	\
+#define RIA_DEBUG(ria)							\
+	CDEBUG(D_READA, "rs %lu re %lu ro %llu rl %llu rb %llu\n",	\
+	       ria->ria_start_idx, ria->ria_end_idx, ria->ria_stoff,	\
 	       ria->ria_length, ria->ria_bytes)
 
 static inline int stride_io_mode(struct ll_readahead_state *ras)
@@ -238,7 +240,7 @@ static inline int stride_io_mode(struct ll_readahead_state *ras)
 	return ras->ras_consecutive_stride_requests > 1;
 }
 
-/* The function calculates how much pages will be read in
+/* The function calculates how many bytes will be read in
  * [off, off + length], in such stride IO area,
  * stride_offset = st_off, stride_length = st_len,
  * stride_bytes = st_bytes
@@ -256,31 +258,29 @@ static inline int stride_io_mode(struct ll_readahead_state *ras)
  *	  =   |<----->|  +  |-------------------------------------| +   |---|
  *	       start_left                 st_bytes * i                 end_left
  */
-static unsigned long
-stride_byte_count(unsigned long st_off, unsigned long st_len,
-		  unsigned long st_bytes, unsigned long off,
-		  unsigned long length)
+static loff_t stride_byte_count(loff_t st_off, loff_t st_len, loff_t st_bytes,
+				loff_t off, loff_t length)
 {
 	u64 start = off > st_off ? off - st_off : 0;
 	u64 end = off + length > st_off ? off + length - st_off : 0;
-	unsigned long start_left = 0;
-	unsigned long end_left = 0;
-	unsigned long bytes_count;
+	u64 start_left;
+	u64 end_left;
+	u64 bytes_count;
 
 	if (st_len == 0 || length == 0 || end == 0)
 		return length;
 
-	start_left = do_div(start, st_len);
+	start = div64_u64_rem(start, st_len, &start_left);
 	if (start_left < st_bytes)
 		start_left = st_bytes - start_left;
 	else
 		start_left = 0;
 
-	end_left = do_div(end, st_len);
+	end = div64_u64_rem(end, st_len, &end_left);
 	if (end_left > st_bytes)
 		end_left = st_bytes;
 
-	CDEBUG(D_READA, "start %llu, end %llu start_left %lu end_left %lu\n",
+	CDEBUG(D_READA, "start %llu, end %llu start_left %llu end_left %llu\n",
 	       start, end, start_left, end_left);
 
 	if (start == end)
@@ -290,48 +290,45 @@ static inline int stride_io_mode(struct ll_readahead_state *ras)
 			st_bytes * (end - start - 1) + end_left;
 
 	CDEBUG(D_READA,
-	       "st_off %lu, st_len %lu st_bytes %lu off %lu length %lu bytescount %lu\n",
+	       "st_off %llu, st_len %llu st_bytes %llu off %llu length %llu bytescount %llu\n",
 	       st_off, st_len, st_bytes, off, length, bytes_count);
 
 	return bytes_count;
 }
 
-static int ria_page_count(struct ra_io_arg *ria)
+static unsigned long ria_page_count(struct ra_io_arg *ria)
 {
-	u64 length_bytes = ria->ria_end >= ria->ria_start ?
-			   (ria->ria_end - ria->ria_start + 1) << PAGE_SHIFT : 0;
-	unsigned int bytes_count, pg_count;
+	loff_t length_bytes = ria->ria_end_idx >= ria->ria_start_idx ?
+			      (loff_t)(ria->ria_end_idx -
+				       ria->ria_start_idx + 1) << PAGE_SHIFT : 0;
+	loff_t bytes_count;
 
 	if (ria->ria_length > ria->ria_bytes && ria->ria_bytes &&
-	    (ria->ria_length % PAGE_SIZE || ria->ria_bytes % PAGE_SIZE ||
-	     ria->ria_stoff % PAGE_SIZE)) {
+	    (ria->ria_length & ~PAGE_MASK || ria->ria_bytes & ~PAGE_MASK ||
+	     ria->ria_stoff & ~PAGE_MASK)) {
 		/* Over-estimate un-aligned page stride read */
-		pg_count = ((ria->ria_bytes + PAGE_SIZE - 1) >> PAGE_SHIFT) + 1;
-		pg_count *= length_bytes / ria->ria_length + 1;
+		unsigned long pg_count = ((ria->ria_bytes +
+					   PAGE_SIZE - 1) >> PAGE_SHIFT) + 1;
 
+		pg_count *= length_bytes / ria->ria_length + 1;
 		return pg_count;
 	}
 	bytes_count = stride_byte_count(ria->ria_stoff, ria->ria_length,
-					 ria->ria_bytes, ria->ria_start,
-					 length_bytes);
+					ria->ria_bytes,
+					(loff_t)ria->ria_start_idx << PAGE_SHIFT,
+					length_bytes);
 	return (bytes_count + PAGE_SIZE - 1) >> PAGE_SHIFT;
 }
 
-static unsigned long ras_align(struct ll_readahead_state *ras,
-			       pgoff_t index, unsigned long *remainder)
+static pgoff_t ras_align(struct ll_readahead_state *ras, pgoff_t index)
 {
-	unsigned long rem = index % ras->ras_rpc_size;
-
-	if (remainder)
-		*remainder = rem;
-	return index - rem;
+	return index - (index % ras->ras_rpc_pages);
 }
 
-/*Check whether the index is in the defined ra-window */
-static bool ras_inside_ra_window(unsigned long idx, struct ra_io_arg *ria)
+/* Check whether the index is in the defined ra-window */
+static bool ras_inside_ra_window(pgoff_t idx, struct ra_io_arg *ria)
 {
-	unsigned long pos = idx << PAGE_SHIFT;
-	unsigned long offset;
+	loff_t pos = (loff_t)idx << PAGE_SHIFT;
 
 	/* If ria_length == ria_pages, it means non-stride I/O mode,
 	 * idx should always inside read-ahead window in this case
@@ -342,12 +339,16 @@ static bool ras_inside_ra_window(unsigned long idx, struct ra_io_arg *ria)
 		return true;
 
 	if (pos >= ria->ria_stoff) {
-		offset = (pos - ria->ria_stoff) % ria->ria_length;
+		u64 offset;
+
+		div64_u64_rem(pos - ria->ria_stoff, ria->ria_length, &offset);
+
 		if (offset < ria->ria_bytes ||
 		    (ria->ria_length - offset) < PAGE_SIZE)
 			return true;
-	} else if (pos + PAGE_SIZE > ria->ria_stoff)
+	} else if (pos + PAGE_SIZE > ria->ria_stoff) {
 		return true;
+	}
 
 	return false;
 }
@@ -365,11 +366,12 @@ static bool ras_inside_ra_window(unsigned long idx, struct ra_io_arg *ria)
 	LASSERT(ria);
 	RIA_DEBUG(ria);
 
-	for (page_idx = ria->ria_start;
-	     page_idx <= ria->ria_end && ria->ria_reserved > 0; page_idx++) {
+	for (page_idx = ria->ria_start_idx;
+	     page_idx <= ria->ria_end_idx && ria->ria_reserved > 0;
+	     page_idx++) {
 		if (ras_inside_ra_window(page_idx, ria)) {
-			if (!ra.cra_end || ra.cra_end < page_idx) {
-				unsigned long end;
+			if (!ra.cra_end_idx || ra.cra_end_idx < page_idx) {
+				pgoff_t end_idx;
 
 				cl_read_ahead_release(env, &ra);
 
@@ -377,37 +379,40 @@ static bool ras_inside_ra_window(unsigned long idx, struct ra_io_arg *ria)
 				if (rc < 0)
 					break;
 
-				/* Do not shrink the ria_end at any case until
+				/* Do not shrink ria_end_idx in any case until
 				 * the minimum end of current read is covered.
-				 * And only shrink the ria_end if the matched
+				 * And only shrink ria_end_idx if the matched
 				 * LDLM lock doesn't cover more.
 				 */
-				if (page_idx > ra.cra_end ||
+				if (page_idx > ra.cra_end_idx ||
 				    (ra.cra_contention &&
-				     page_idx > ria->ria_end_min)) {
-					ria->ria_end = ra.cra_end;
+				     page_idx > ria->ria_end_idx_min)) {
+					ria->ria_end_idx = ra.cra_end_idx;
 					break;
 				}
 
 				CDEBUG(D_READA, "idx: %lu, ra: %lu, rpc: %lu\n",
-				       page_idx, ra.cra_end, ra.cra_rpc_size);
-				LASSERTF(ra.cra_end >= page_idx,
+				       page_idx, ra.cra_end_idx,
+				       ra.cra_rpc_pages);
+				LASSERTF(ra.cra_end_idx >= page_idx,
 					 "object: %p, indcies %lu / %lu\n",
-					 io->ci_obj, ra.cra_end, page_idx);
+					 io->ci_obj, ra.cra_end_idx, page_idx);
 				/*
 				 * update read ahead RPC size.
 				 * NB: it's racy but doesn't matter
 				 */
-				if (ras->ras_rpc_size != ra.cra_rpc_size &&
-				    ra.cra_rpc_size > 0)
-					ras->ras_rpc_size = ra.cra_rpc_size;
+				if (ras->ras_rpc_pages != ra.cra_rpc_pages &&
+				    ra.cra_rpc_pages > 0)
+					ras->ras_rpc_pages = ra.cra_rpc_pages;
 				/* trim it to align with optimal RPC size */
-				end = ras_align(ras, ria->ria_end + 1, NULL);
-				if (end > 0 && !ria->ria_eof)
-					ria->ria_end = end - 1;
-				if (ria->ria_end < ria->ria_end_min)
-					ria->ria_end = ria->ria_end_min;
+				end_idx = ras_align(ras, ria->ria_end_idx + 1);
+				if (end_idx > 0 && !ria->ria_eof)
+					ria->ria_end_idx = end_idx - 1;
+				if (ria->ria_end_idx < ria->ria_end_idx_min)
+					ria->ria_end_idx = ria->ria_end_idx_min;
 			}
+			if (page_idx > ria->ria_end_idx)
+				break;
 
 			/* If the page is inside the read-ahead window */
 			rc = ll_read_ahead_page(env, io, queue, page_idx);
@@ -427,16 +432,17 @@ static bool ras_inside_ra_window(unsigned long idx, struct ra_io_arg *ria)
 			 * read-ahead mode, then check whether it should skip
 			 * the stride gap.
 			 */
-			unsigned long offset;
-			unsigned long pos = page_idx << PAGE_SHIFT;
+			loff_t pos = (loff_t)page_idx << PAGE_SHIFT;
+			u64 offset;
 
-			offset = (pos - ria->ria_stoff) % ria->ria_length;
+			div64_u64_rem(pos - ria->ria_stoff, ria->ria_length,
+				      &offset);
 			if (offset >= ria->ria_bytes) {
 				pos += (ria->ria_length - offset);
 				if ((pos >> PAGE_SHIFT) >= page_idx + 1)
 					page_idx = (pos >> PAGE_SHIFT) - 1;
 				CDEBUG(D_READA,
-				       "Stride: jump %lu pages to %lu\n",
+				       "Stride: jump %llu pages to %lu\n",
 				       ria->ria_length - offset, page_idx);
 				continue;
 			}
@@ -495,12 +501,12 @@ static void ll_readahead_handle_work(struct work_struct *wq)
 	struct ll_readahead_state *ras;
 	struct cl_io *io;
 	struct cl_2queue *queue;
-	pgoff_t ra_end = 0;
-	unsigned long len, mlen = 0;
+	pgoff_t ra_end_idx = 0;
+	unsigned long pages, pages_min = 0;
 	struct file *file;
 	u64 kms;
 	int rc;
-	unsigned long end_index;
+	pgoff_t eof_index;
 
 	work = container_of(wq, struct ll_readahead_work,
 			    lrw_readahead_work);
@@ -531,30 +537,30 @@ static void ll_readahead_handle_work(struct work_struct *wq)
 	ria = &ll_env_info(env)->lti_ria;
 	memset(ria, 0, sizeof(*ria));
 
-	ria->ria_start = work->lrw_start;
+	ria->ria_start_idx = work->lrw_start_idx;
 	/* Truncate RA window to end of file */
-	end_index = (unsigned long)((kms - 1) >> PAGE_SHIFT);
-	if (end_index <= work->lrw_end) {
-		work->lrw_end = end_index;
+	eof_index = (pgoff_t)((kms - 1) >> PAGE_SHIFT);
+	if (eof_index <= work->lrw_end_idx) {
+		work->lrw_end_idx = eof_index;
 		ria->ria_eof = true;
 	}
-	if (work->lrw_end <= work->lrw_start) {
+	if (work->lrw_end_idx <= work->lrw_start_idx) {
 		rc = 0;
 		goto out_put_env;
 	}
 
-	ria->ria_end = work->lrw_end;
-	len = ria->ria_end - ria->ria_start + 1;
+	ria->ria_end_idx = work->lrw_end_idx;
+	pages = ria->ria_end_idx - ria->ria_start_idx + 1;
 	ria->ria_reserved = ll_ra_count_get(ll_i2sbi(inode), ria,
-					    ria_page_count(ria), mlen);
+					    ria_page_count(ria), pages_min);
 
 	CDEBUG(D_READA,
 	       "async reserved pages: %lu/%lu/%lu, ra_cur %d, ra_max %lu\n",
-	       ria->ria_reserved, len, mlen,
+	       ria->ria_reserved, pages, pages_min,
 	       atomic_read(&ll_i2sbi(inode)->ll_ra_info.ra_cur_pages),
 	       ll_i2sbi(inode)->ll_ra_info.ra_max_pages);
 
-	if (ria->ria_reserved < len) {
+	if (ria->ria_reserved < pages) {
 		ll_ra_stats_inc(inode, RA_STAT_MAX_IN_FLIGHT);
 		if (PAGES_TO_MiB(ria->ria_reserved) < 1) {
 			ll_ra_count_put(ll_i2sbi(inode), ria->ria_reserved);
@@ -563,7 +569,7 @@ static void ll_readahead_handle_work(struct work_struct *wq)
 		}
 	}
 
-	rc = cl_io_rw_init(env, io, CIT_READ, ria->ria_start, len);
+	rc = cl_io_rw_init(env, io, CIT_READ, ria->ria_start_idx, pages);
 	if (rc)
 		goto out_put_env;
 
@@ -577,7 +583,8 @@ static void ll_readahead_handle_work(struct work_struct *wq)
 	queue = &io->ci_queue;
 	cl_2queue_init(queue);
 
-	rc = ll_read_ahead_pages(env, io, &queue->c2_qin, ras, ria, &ra_end);
+	rc = ll_read_ahead_pages(env, io, &queue->c2_qin, ras, ria,
+				 &ra_end_idx);
 	if (ria->ria_reserved != 0)
 		ll_ra_count_put(ll_i2sbi(inode), ria->ria_reserved);
 	if (queue->c2_qin.pl_nr > 0) {
@@ -587,10 +594,10 @@ static void ll_readahead_handle_work(struct work_struct *wq)
 		if (rc == 0)
 			task_io_account_read(PAGE_SIZE * count);
 	}
-	if (ria->ria_end == ra_end && ra_end == (kms >> PAGE_SHIFT))
+	if (ria->ria_end_idx == ra_end_idx && ra_end_idx == (kms >> PAGE_SHIFT))
 		ll_ra_stats_inc(inode, RA_STAT_EOF);
 
-	if (ra_end != ria->ria_end)
+	if (ra_end_idx != ria->ria_end_idx)
 		ll_ra_stats_inc(inode, RA_STAT_FAILED_REACH_END);
 
 	/* TODO: discard all pages until page reinit route is implemented */
@@ -606,7 +613,7 @@ static void ll_readahead_handle_work(struct work_struct *wq)
 out_put_env:
 	cl_env_put(env, &refcheck);
 out_free_work:
-	if (ra_end > 0)
+	if (ra_end_idx > 0)
 		ll_ra_stats_inc_sbi(ll_i2sbi(inode), RA_STAT_ASYNC);
 	ll_readahead_work_free(work);
 }
@@ -618,8 +625,8 @@ static int ll_readahead(const struct lu_env *env, struct cl_io *io,
 {
 	struct vvp_io *vio = vvp_env_io(env);
 	struct ll_thread_info *lti = ll_env_info(env);
-	unsigned long len, mlen = 0;
-	pgoff_t ra_end = 0, start = 0, end = 0;
+	unsigned long pages, pages_min = 0;
+	pgoff_t ra_end_idx = 0, start_idx = 0, end_idx = 0;
 	struct inode *inode;
 	struct ra_io_arg *ria = &lti->lti_ria;
 	struct cl_object *clob;
@@ -642,39 +649,38 @@ static int ll_readahead(const struct lu_env *env, struct cl_io *io,
 	spin_lock(&ras->ras_lock);
 
 	/**
-	 * Note: other thread might rollback the ras_next_readahead,
+	 * Note: other thread might rollback the ras_next_readahead_idx,
 	 * if it can not get the full size of prepared pages, see the
 	 * end of this function. For stride read ahead, it needs to
 	 * make sure the offset is no less than ras_stride_offset,
 	 * so that stride read ahead can work correctly.
 	 */
 	if (stride_io_mode(ras))
-		start = max(ras->ras_next_readahead,
-			    ras->ras_stride_offset >> PAGE_SHIFT);
+		start_idx = max_t(pgoff_t, ras->ras_next_readahead_idx,
+				  ras->ras_stride_offset >> PAGE_SHIFT);
 	else
-		start = ras->ras_next_readahead;
+		start_idx = ras->ras_next_readahead_idx;
 
-	if (ras->ras_window_len > 0)
-		end = ras->ras_window_start + ras->ras_window_len - 1;
+	if (ras->ras_window_pages > 0)
+		end_idx = ras->ras_window_start_idx + ras->ras_window_pages - 1;
 
 	/* Enlarge the RA window to encompass the full read */
 	if (vio->vui_ra_valid &&
-	    end < vio->vui_ra_start + vio->vui_ra_count - 1)
-		end = vio->vui_ra_start + vio->vui_ra_count - 1;
+	    end_idx < vio->vui_ra_start_idx + vio->vui_ra_pages - 1)
+		end_idx = vio->vui_ra_start_idx + vio->vui_ra_pages - 1;
 
-	if (end) {
-		unsigned long end_index;
+	if (end_idx) {
+		pgoff_t eof_index;
 
 		/* Truncate RA window to end of file */
-		end_index = (unsigned long)((kms - 1) >> PAGE_SHIFT);
-		if (end_index <= end) {
-			end = end_index;
+		eof_index = (pgoff_t)((kms - 1) >> PAGE_SHIFT);
+		if (eof_index <= end_idx) {
+			end_idx = eof_index;
 			ria->ria_eof = true;
 		}
 	}
-
-	ria->ria_start = start;
-	ria->ria_end = end;
+	ria->ria_start_idx = start_idx;
+	ria->ria_end_idx = end_idx;
 	/* If stride I/O mode is detected, get stride window*/
 	if (stride_io_mode(ras)) {
 		ria->ria_stoff = ras->ras_stride_offset;
@@ -683,12 +689,12 @@ static int ll_readahead(const struct lu_env *env, struct cl_io *io,
 	}
 	spin_unlock(&ras->ras_lock);
 
-	if (end == 0) {
+	if (end_idx == 0) {
 		ll_ra_stats_inc(inode, RA_STAT_ZERO_WINDOW);
 		return 0;
 	}
-	len = ria_page_count(ria);
-	if (len == 0) {
+	pages = ria_page_count(ria);
+	if (pages == 0) {
 		ll_ra_stats_inc(inode, RA_STAT_ZERO_WINDOW);
 		return 0;
 	}
@@ -696,45 +702,48 @@ static int ll_readahead(const struct lu_env *env, struct cl_io *io,
 	RAS_CDEBUG(ras);
 	CDEBUG(D_READA, DFID ": ria: %lu/%lu, bead: %lu/%lu, hit: %d\n",
 	       PFID(lu_object_fid(&clob->co_lu)),
-	       ria->ria_start, ria->ria_end,
-	       vio->vui_ra_valid ? vio->vui_ra_start : 0,
-	       vio->vui_ra_valid ? vio->vui_ra_count : 0,
+	       ria->ria_start_idx, ria->ria_end_idx,
+	       vio->vui_ra_valid ? vio->vui_ra_start_idx : 0,
+	       vio->vui_ra_valid ? vio->vui_ra_pages : 0,
 	       hit);
 
 	/* at least to extend the readahead window to cover current read */
 	if (!hit && vio->vui_ra_valid &&
-	    vio->vui_ra_start + vio->vui_ra_count > ria->ria_start)
-		ria->ria_end_min = vio->vui_ra_start + vio->vui_ra_count - 1;
+	    vio->vui_ra_start_idx + vio->vui_ra_pages > ria->ria_start_idx)
+		ria->ria_end_idx_min =
+			vio->vui_ra_start_idx + vio->vui_ra_pages - 1;
 
-	ria->ria_reserved = ll_ra_count_get(ll_i2sbi(inode), ria, len, mlen);
-	if (ria->ria_reserved < len)
+	ria->ria_reserved = ll_ra_count_get(ll_i2sbi(inode), ria, pages,
+					    pages_min);
+	if (ria->ria_reserved < pages)
 		ll_ra_stats_inc(inode, RA_STAT_MAX_IN_FLIGHT);
 
-	CDEBUG(D_READA, "reserved pages %lu/%lu/%lu, ra_cur %d, ra_max %lu\n",
-	       ria->ria_reserved, len, mlen,
+	CDEBUG(D_READA, "reserved pages: %lu/%lu/%lu, ra_cur %d, ra_max %lu\n",
+	       ria->ria_reserved, pages, pages_min,
 	       atomic_read(&ll_i2sbi(inode)->ll_ra_info.ra_cur_pages),
 	       ll_i2sbi(inode)->ll_ra_info.ra_max_pages);
 
-	ret = ll_read_ahead_pages(env, io, queue, ras, ria, &ra_end);
+	ret = ll_read_ahead_pages(env, io, queue, ras, ria, &ra_end_idx);
 
 	if (ria->ria_reserved)
 		ll_ra_count_put(ll_i2sbi(inode), ria->ria_reserved);
 
-	if (ra_end == end && ra_end == (kms >> PAGE_SHIFT))
+	if (ra_end_idx == end_idx && ra_end_idx == (kms >> PAGE_SHIFT))
 		ll_ra_stats_inc(inode, RA_STAT_EOF);
 
-	CDEBUG(D_READA, "ra_end = %lu end = %lu stride end = %lu pages = %d\n",
-	       ra_end, end, ria->ria_end, ret);
+	CDEBUG(D_READA,
+	       "ra_end_idx = %lu end_idx = %lu stride end = %lu pages = %d\n",
+	       ra_end_idx, end_idx, ria->ria_end_idx, ret);
 
-	if (ra_end != end)
+	if (ra_end_idx != end_idx)
 		ll_ra_stats_inc(inode, RA_STAT_FAILED_REACH_END);
 
-	if (ra_end > 0) {
+	if (ra_end_idx > 0) {
 		/* update the ras so that the next read-ahead tries from
 		 * where we left off.
 		 */
 		spin_lock(&ras->ras_lock);
-		ras->ras_next_readahead = ra_end + 1;
+		ras->ras_next_readahead_idx = ra_end_idx + 1;
 		spin_unlock(&ras->ras_lock);
 		RAS_CDEBUG(ras);
 	}
@@ -744,7 +753,7 @@ static int ll_readahead(const struct lu_env *env, struct cl_io *io,
 
 static void ras_set_start(struct ll_readahead_state *ras, pgoff_t index)
 {
-	ras->ras_window_start = ras_align(ras, index, NULL);
+	ras->ras_window_start_idx = ras_align(ras, index);
 }
 
 /* called with the ras_lock held or from places where it doesn't matter */
@@ -752,9 +761,9 @@ static void ras_reset(struct ll_readahead_state *ras, pgoff_t index)
 {
 	ras->ras_consecutive_requests = 0;
 	ras->ras_consecutive_bytes = 0;
-	ras->ras_window_len = 0;
+	ras->ras_window_pages = 0;
 	ras_set_start(ras, index);
-	ras->ras_next_readahead = max(ras->ras_window_start, index + 1);
+	ras->ras_next_readahead_idx = max(ras->ras_window_start_idx, index + 1);
 
 	RAS_CDEBUG(ras);
 }
@@ -771,9 +780,9 @@ static void ras_stride_reset(struct ll_readahead_state *ras)
 void ll_readahead_init(struct inode *inode, struct ll_readahead_state *ras)
 {
 	spin_lock_init(&ras->ras_lock);
-	ras->ras_rpc_size = PTLRPC_MAX_BRW_PAGES;
+	ras->ras_rpc_pages = PTLRPC_MAX_BRW_PAGES;
 	ras_reset(ras, 0);
-	ras->ras_last_read_end = 0;
+	ras->ras_last_read_end_bytes = 0;
 	ras->ras_requests = 0;
 }
 
@@ -782,15 +791,15 @@ void ll_readahead_init(struct inode *inode, struct ll_readahead_state *ras)
  * If it is in the stride window, return true, otherwise return false.
  */
 static bool read_in_stride_window(struct ll_readahead_state *ras,
-				  unsigned long pos, unsigned long count)
+				  loff_t pos, loff_t count)
 {
-	unsigned long stride_gap;
+	loff_t stride_gap;
 
 	if (ras->ras_stride_length == 0 || ras->ras_stride_bytes == 0 ||
 	    ras->ras_stride_bytes == ras->ras_stride_length)
 		return false;
 
-	stride_gap = pos - ras->ras_last_read_end - 1;
+	stride_gap = pos - ras->ras_last_read_end_bytes - 1;
 
 	/* If it is contiguous read */
 	if (stride_gap == 0)
@@ -804,13 +813,13 @@ static bool read_in_stride_window(struct ll_readahead_state *ras,
 }
 
 static void ras_init_stride_detector(struct ll_readahead_state *ras,
-				     unsigned long pos, unsigned long count)
+				     loff_t pos, loff_t count)
 {
-	unsigned long stride_gap = pos - ras->ras_last_read_end - 1;
+	loff_t stride_gap = pos - ras->ras_last_read_end_bytes - 1;
 
 	LASSERT(ras->ras_consecutive_stride_requests == 0);
 
-	if (pos <= ras->ras_last_read_end) {
+	if (pos <= ras->ras_last_read_end_bytes) {
 		/*Reset stride window for forward read*/
 		ras_stride_reset(ras);
 		return;
@@ -828,47 +837,50 @@ static void ras_init_stride_detector(struct ll_readahead_state *ras,
  * stride I/O pattern
  */
 static void ras_stride_increase_window(struct ll_readahead_state *ras,
-				       struct ll_ra_info *ra,
-				       unsigned long inc_len)
+				       struct ll_ra_info *ra, loff_t inc_bytes)
 {
-	unsigned long left, step, window_len;
-	unsigned long stride_len;
-	unsigned long end = ras->ras_window_start + ras->ras_window_len;
+	loff_t window_bytes, stride_bytes;
+	u64 left_bytes;
+	u64 step;
+	loff_t end;
+
+	/* temporarily store in page units to reduce LASSERT() cost below */
+	end = ras->ras_window_start_idx + ras->ras_window_pages;
 
 	LASSERT(ras->ras_stride_length > 0);
 	LASSERTF(end >= (ras->ras_stride_offset >> PAGE_SHIFT),
-		 "window_start %lu, window_len %lu stride_offset %lu\n",
-		 ras->ras_window_start, ras->ras_window_len,
+		 "window_start_idx %lu, window_pages %lu stride_offset %llu\n",
+		 ras->ras_window_start_idx, ras->ras_window_pages,
 		 ras->ras_stride_offset);
 
 	end <<= PAGE_SHIFT;
-	if (end < ras->ras_stride_offset)
-		stride_len = 0;
+	if (end <= ras->ras_stride_offset)
+		stride_bytes = 0;
 	else
-		stride_len = end - ras->ras_stride_offset;
+		stride_bytes = end - ras->ras_stride_offset;
 
-	left = stride_len % ras->ras_stride_length;
-	window_len = (ras->ras_window_len << PAGE_SHIFT) - left;
+	div64_u64_rem(stride_bytes, ras->ras_stride_length, &left_bytes);
+	window_bytes = ((loff_t)ras->ras_window_pages << PAGE_SHIFT) -
+		       left_bytes;
 
-	if (left < ras->ras_stride_bytes)
-		left += inc_len;
+	if (left_bytes < ras->ras_stride_bytes)
+		left_bytes += inc_bytes;
 	else
-		left = ras->ras_stride_bytes + inc_len;
+		left_bytes = ras->ras_stride_bytes + inc_bytes;
 
 	LASSERT(ras->ras_stride_bytes != 0);
 
-	step = left / ras->ras_stride_bytes;
-	left %= ras->ras_stride_bytes;
+	step = div64_u64_rem(left_bytes, ras->ras_stride_bytes, &left_bytes);
 
-	window_len += step * ras->ras_stride_length + left;
+	window_bytes += step * ras->ras_stride_length + left_bytes;
 
 	if (DIV_ROUND_UP(stride_byte_count(ras->ras_stride_offset,
 					   ras->ras_stride_length,
 					   ras->ras_stride_bytes,
 					   ras->ras_stride_offset,
-					   window_len), PAGE_SIZE)
+					   window_bytes), PAGE_SIZE)
 	    <= ra->ra_max_pages_per_file)
-		ras->ras_window_len = (window_len >> PAGE_SHIFT);
+		ras->ras_window_pages = (window_bytes >> PAGE_SHIFT);
 
 	RAS_CDEBUG(ras);
 }
@@ -883,36 +895,34 @@ static void ras_increase_window(struct inode *inode,
 	 */
 	if (stride_io_mode(ras)) {
 		ras_stride_increase_window(ras, ra,
-				ras->ras_rpc_size << PAGE_SHIFT);
+					   (loff_t)ras->ras_rpc_pages << PAGE_SHIFT);
 	} else {
-		unsigned long wlen;
+		pgoff_t window_pages;
 
-		wlen = min(ras->ras_window_len + ras->ras_rpc_size,
-			   ra->ra_max_pages_per_file);
-		if (wlen < ras->ras_rpc_size)
-			ras->ras_window_len = wlen;
+		window_pages = min(ras->ras_window_pages + ras->ras_rpc_pages,
+				   ra->ra_max_pages_per_file);
+		if (window_pages < ras->ras_rpc_pages)
+			ras->ras_window_pages = window_pages;
 		else
-			ras->ras_window_len = ras_align(ras, wlen, NULL);
+			ras->ras_window_pages = ras_align(ras, window_pages);
 	}
 }
 
 /**
  * Seek within 8 pages are considered as sequential read for now.
  */
-static inline bool is_loose_seq_read(struct ll_readahead_state *ras,
-				     unsigned long pos)
+static inline bool is_loose_seq_read(struct ll_readahead_state *ras, loff_t pos)
 {
-	return pos_in_window(pos, ras->ras_last_read_end,
-			     8 << PAGE_SHIFT, 8 << PAGE_SHIFT);
+	return pos_in_window(pos, ras->ras_last_read_end_bytes,
+			     8UL << PAGE_SHIFT, 8UL << PAGE_SHIFT);
 }
 
 static void ras_detect_read_pattern(struct ll_readahead_state *ras,
 				    struct ll_sb_info *sbi,
-				    unsigned long pos, unsigned long count,
-				    bool mmap)
+				    loff_t pos, size_t count, bool mmap)
 {
 	bool stride_detect = false;
-	unsigned long index = pos >> PAGE_SHIFT;
+	pgoff_t index = pos >> PAGE_SHIFT;
 
 	/*
 	 * Reset the read-ahead window in two cases. First when the app seeks
@@ -947,25 +957,25 @@ static void ras_detect_read_pattern(struct ll_readahead_state *ras,
 		 */
 		if (!read_in_stride_window(ras, pos, count)) {
 			ras_stride_reset(ras);
-			ras->ras_window_len = 0;
-			ras->ras_next_readahead = index;
+			ras->ras_window_pages = 0;
+			ras->ras_next_readahead_idx = index;
 		}
 	}
 
 	ras->ras_consecutive_bytes += count;
 	if (mmap) {
-		unsigned int idx = (ras->ras_consecutive_bytes >> PAGE_SHIFT);
+		pgoff_t idx = ras->ras_consecutive_bytes >> PAGE_SHIFT;
 
-		if ((idx >= 4 && idx % 4 == 0) || stride_detect)
+		if ((idx >= 4 && (idx & 3UL) == 0) || stride_detect)
 			ras->ras_need_increase_window = true;
 	} else if ((ras->ras_consecutive_requests > 1 || stride_detect)) {
 		ras->ras_need_increase_window = true;
 	}
 
-	ras->ras_last_read_end = pos + count - 1;
+	ras->ras_last_read_end_bytes = pos + count - 1;
 }
 
-void ll_ras_enter(struct file *f, unsigned long pos, unsigned long count)
+void ll_ras_enter(struct file *f, loff_t pos, size_t count)
 {
 	struct ll_file_data *fd = LUSTRE_FPRIVATE(f);
 	struct ll_readahead_state *ras = &fd->fd_ras;
@@ -998,10 +1008,10 @@ void ll_ras_enter(struct file *f, unsigned long pos, unsigned long count)
 
 		if (kms_pages &&
 		    kms_pages <= ra->ra_max_read_ahead_whole_pages) {
-			ras->ras_window_start = 0;
-			ras->ras_next_readahead = index + 1;
-			ras->ras_window_len = min(ra->ra_max_pages_per_file,
-						  ra->ra_max_read_ahead_whole_pages);
+			ras->ras_window_start_idx = 0;
+			ras->ras_next_readahead_idx = index + 1;
+			ras->ras_window_pages = min(ra->ra_max_pages_per_file,
+						    ra->ra_max_read_ahead_whole_pages);
 			ras->ras_no_miss_check = true;
 			goto out_unlock;
 		}
@@ -1012,18 +1022,19 @@ void ll_ras_enter(struct file *f, unsigned long pos, unsigned long count)
 }
 
 static bool index_in_stride_window(struct ll_readahead_state *ras,
-				   unsigned int index)
+				   pgoff_t index)
 {
-	unsigned long pos = index << PAGE_SHIFT;
-	unsigned long offset;
+	loff_t pos = (loff_t)index << PAGE_SHIFT;
 
 	if (ras->ras_stride_length == 0 || ras->ras_stride_bytes == 0 ||
 	    ras->ras_stride_bytes == ras->ras_stride_length)
 		return false;
 
 	if (pos >= ras->ras_stride_offset) {
-		offset = (pos - ras->ras_stride_offset) %
-			 ras->ras_stride_length;
+		u64 offset;
+
+		div64_u64_rem(pos - ras->ras_stride_offset,
+			      ras->ras_stride_length, &offset);
 		if (offset < ras->ras_stride_bytes ||
 		    ras->ras_stride_length - offset < PAGE_SIZE)
 			return true;
@@ -1035,14 +1046,13 @@ static bool index_in_stride_window(struct ll_readahead_state *ras,
 }
 
 /*
- * ll_ras_enter() is used to detect read pattern according to
- * pos and count.
+ * ll_ras_enter() is used to detect read pattern according to pos and count.
  *
  * ras_update() is used to detect cache miss and
  * reset window or increase window accordingly
  */
 static void ras_update(struct ll_sb_info *sbi, struct inode *inode,
-		       struct ll_readahead_state *ras, unsigned long index,
+		       struct ll_readahead_state *ras, pgoff_t index,
 		       enum ras_update_flags flags)
 {
 	struct ll_ra_info *ra = &sbi->ll_ra_info;
@@ -1065,13 +1075,13 @@ static void ras_update(struct ll_sb_info *sbi, struct inode *inode,
 		goto out_unlock;
 
 	if (flags & LL_RAS_MMAP)
-		ras_detect_read_pattern(ras, sbi, index << PAGE_SHIFT,
+		ras_detect_read_pattern(ras, sbi, (loff_t)index << PAGE_SHIFT,
 					PAGE_SIZE, true);
 
-	if (!hit && ras->ras_window_len &&
-	    index < ras->ras_next_readahead &&
-	    pos_in_window(index, ras->ras_window_start, 0,
-			  ras->ras_window_len)) {
+	if (!hit && ras->ras_window_pages &&
+	    index < ras->ras_next_readahead_idx &&
+	    pos_in_window(index, ras->ras_window_start_idx, 0,
+			  ras->ras_window_pages)) {
 		ll_ra_stats_inc_sbi(sbi, RA_STAT_MISS_IN_WINDOW);
 		ras->ras_need_increase_window = false;
 
@@ -1090,8 +1100,7 @@ static void ras_update(struct ll_sb_info *sbi, struct inode *inode,
 			 * is still intersect with normal sequential
 			 * read-ahead window.
 			 */
-			if (ras->ras_window_start <
-			    ras->ras_stride_offset)
+			if (ras->ras_window_start_idx < ras->ras_stride_offset)
 				ras_stride_reset(ras);
 			RAS_CDEBUG(ras);
 		} else {
@@ -1111,18 +1120,18 @@ static void ras_update(struct ll_sb_info *sbi, struct inode *inode,
 	if (stride_io_mode(ras)) {
 		/* Since stride readahead is sensitive to the offset
 		 * of read-ahead, so we use original offset here,
-		 * instead of ras_window_start, which is RPC aligned
+		 * instead of ras_window_start_idx, which is RPC aligned.
 		 */
-		ras->ras_next_readahead = max(index + 1,
-					      ras->ras_next_readahead);
-		ras->ras_window_start =
-				max(ras->ras_stride_offset >> PAGE_SHIFT,
-				    ras->ras_window_start);
+		ras->ras_next_readahead_idx = max(index + 1,
+						  ras->ras_next_readahead_idx);
+		ras->ras_window_start_idx =
+				max_t(pgoff_t, ras->ras_window_start_idx,
+				      ras->ras_stride_offset >> PAGE_SHIFT);
 	} else {
-		if (ras->ras_next_readahead < ras->ras_window_start)
-			ras->ras_next_readahead = ras->ras_window_start;
+		if (ras->ras_next_readahead_idx < ras->ras_window_start_idx)
+			ras->ras_next_readahead_idx = ras->ras_window_start_idx;
 		if (!hit)
-			ras->ras_next_readahead = index + 1;
+			ras->ras_next_readahead_idx = index + 1;
 	}
 
 	if (ras->ras_need_increase_window) {
@@ -1241,7 +1250,7 @@ int ll_writepages(struct address_space *mapping, struct writeback_control *wbc)
 	int result;
 
 	if (wbc->range_cyclic) {
-		start = mapping->writeback_index << PAGE_SHIFT;
+		start = (loff_t)mapping->writeback_index << PAGE_SHIFT;
 		end = OBD_OBJECT_EOF;
 	} else {
 		start = wbc->range_start;
@@ -1429,8 +1438,8 @@ static int kickoff_async_readahead(struct file *file, unsigned long pages)
 	struct ll_readahead_state *ras = &fd->fd_ras;
 	struct ll_ra_info *ra = &sbi->ll_ra_info;
 	unsigned long throttle;
-	unsigned long start = ras_align(ras, ras->ras_next_readahead, NULL);
-	unsigned long end = start + pages - 1;
+	pgoff_t start_idx = ras_align(ras, ras->ras_next_readahead_idx);
+	pgoff_t end_idx = start_idx + pages - 1;
 
 	throttle = min(ra->ra_async_pages_per_file_threshold,
 		       ra->ra_max_pages_per_file);
@@ -1440,24 +1449,24 @@ static int kickoff_async_readahead(struct file *file, unsigned long pages)
 	 * we do async readahead, allowing the user thread to do fast i/o.
 	 */
 	if (stride_io_mode(ras) || !throttle ||
-	    ras->ras_window_len < throttle)
+	    ras->ras_window_pages < throttle)
 		return 0;
 
 	if ((atomic_read(&ra->ra_cur_pages) + pages) > ra->ra_max_pages)
 		return 0;
 
-	if (ras->ras_async_last_readpage == start)
+	if (ras->ras_async_last_readpage_idx == start_idx)
 		return 1;
 
 	/* ll_readahead_work_free() free it */
 	lrw = kzalloc(sizeof(*lrw), GFP_NOFS);
 	if (lrw) {
 		lrw->lrw_file = get_file(file);
-		lrw->lrw_start = start;
-		lrw->lrw_end = end;
+		lrw->lrw_start_idx = start_idx;
+		lrw->lrw_end_idx = end_idx;
 		spin_lock(&ras->ras_lock);
-		ras->ras_next_readahead = end + 1;
-		ras->ras_async_last_readpage = start;
+		ras->ras_next_readahead_idx = end_idx + 1;
+		ras->ras_async_last_readpage_idx = start_idx;
 		spin_unlock(&ras->ras_lock);
 		ll_readahead_work_add(inode, lrw);
 	} else {
@@ -1489,7 +1498,7 @@ int ll_readpage(struct file *file, struct page *vmpage)
 		struct lu_env *local_env = NULL;
 		struct inode *inode = file_inode(file);
 		unsigned long fast_read_pages =
-			max(RA_REMAIN_WINDOW_MIN, ras->ras_rpc_size);
+			max(RA_REMAIN_WINDOW_MIN, ras->ras_rpc_pages);
 		struct vvp_page *vpg;
 
 		result = -ENODATA;
@@ -1526,8 +1535,8 @@ int ll_readpage(struct file *file, struct page *vmpage)
 			 * the case, we can't do fast IO because we will need
 			 * a cl_io to issue the RPC.
 			 */
-			if (ras->ras_window_start + ras->ras_window_len <
-			    ras->ras_next_readahead + fast_read_pages ||
+			if (ras->ras_window_start_idx + ras->ras_window_pages <
+			    ras->ras_next_readahead_idx + fast_read_pages ||
 			    kickoff_async_readahead(file, fast_read_pages) > 0)
 				result = 0;
 		}
diff --git a/fs/lustre/llite/vvp_internal.h b/fs/lustre/llite/vvp_internal.h
index 1cc152f..0382b79 100644
--- a/fs/lustre/llite/vvp_internal.h
+++ b/fs/lustre/llite/vvp_internal.h
@@ -103,8 +103,8 @@ struct vvp_io {
 	struct kiocb		*vui_iocb;
 
 	/* Readahead state. */
-	pgoff_t			vui_ra_start;
-	pgoff_t			vui_ra_count;
+	pgoff_t			vui_ra_start_idx;
+	pgoff_t			vui_ra_pages;
 	/* Set when vui_ra_{start,count} have been initialized. */
 	bool			vui_ra_valid;
 };
diff --git a/fs/lustre/llite/vvp_io.c b/fs/lustre/llite/vvp_io.c
index 259b14a..cf116be 100644
--- a/fs/lustre/llite/vvp_io.c
+++ b/fs/lustre/llite/vvp_io.c
@@ -739,8 +739,8 @@ static int vvp_io_read_start(const struct lu_env *env,
 	struct file *file = vio->vui_fd->fd_file;
 	int result;
 	loff_t pos = io->u.ci_rd.rd.crw_pos;
-	long cnt = io->u.ci_rd.rd.crw_count;
-	long tot = vio->vui_tot_count;
+	size_t cnt = io->u.ci_rd.rd.crw_count;
+	size_t tot = vio->vui_tot_count;
 	int exceed = 0;
 
 	CLOBINVRNT(env, obj, vvp_object_invariant(obj));
@@ -776,16 +776,16 @@ static int vvp_io_read_start(const struct lu_env *env,
 	/* initialize read-ahead window once per syscall */
 	if (!vio->vui_ra_valid) {
 		vio->vui_ra_valid = true;
-		vio->vui_ra_start = cl_index(obj, pos);
-		vio->vui_ra_count = cl_index(obj, tot + PAGE_SIZE - 1);
+		vio->vui_ra_start_idx = cl_index(obj, pos);
+		vio->vui_ra_pages = cl_index(obj, tot + PAGE_SIZE - 1);
 		/* If both start and end are unaligned, we read one more page
 		 * than the index math suggests.
 		 */
-		if (pos % PAGE_SIZE != 0 && (pos + tot) % PAGE_SIZE != 0)
-			vio->vui_ra_count++;
+		if ((pos & ~PAGE_MASK) != 0 && ((pos + tot) & ~PAGE_MASK) != 0)
+			vio->vui_ra_pages++;
 
-		CDEBUG(D_READA, "tot %ld, ra_start %lu, ra_count %lu\n", tot,
-		       vio->vui_ra_start, vio->vui_ra_count);
+		CDEBUG(D_READA, "tot %zu, ra_start %lu, ra_count %lu\n",
+		       tot, vio->vui_ra_start_idx, vio->vui_ra_pages);
 	}
 
 	/* BUG: 5972 */
@@ -1424,7 +1424,7 @@ static int vvp_io_read_ahead(const struct lu_env *env,
 		struct vvp_io *vio = cl2vvp_io(env, ios);
 
 		if (unlikely(vio->vui_fd->fd_flags & LL_FILE_GROUP_LOCKED)) {
-			ra->cra_end = CL_PAGE_EOF;
+			ra->cra_end_idx = CL_PAGE_EOF;
 			result = 1; /* no need to call down */
 		}
 	}
diff --git a/fs/lustre/lov/lov_io.c b/fs/lustre/lov/lov_io.c
index 971f9ba..019e986 100644
--- a/fs/lustre/lov/lov_io.c
+++ b/fs/lustre/lov/lov_io.c
@@ -1014,7 +1014,8 @@ static int lov_io_read_ahead(const struct lu_env *env,
 			      ra);
 
 	CDEBUG(D_READA, DFID " cra_end = %lu, stripes = %d, rc = %d\n",
-	       PFID(lu_object_fid(lov2lu(loo))), ra->cra_end, r0->lo_nr, rc);
+	       PFID(lu_object_fid(lov2lu(loo))), ra->cra_end_idx,
+		    r0->lo_nr, rc);
 	if (rc)
 		return rc;
 
@@ -1027,15 +1028,15 @@ static int lov_io_read_ahead(const struct lu_env *env,
 	 */
 
 	/* cra_end is stripe level, convert it into file level */
-	ra_end = ra->cra_end;
+	ra_end = ra->cra_end_idx;
 	if (ra_end != CL_PAGE_EOF)
-		ra->cra_end = lov_stripe_pgoff(loo->lo_lsm, index,
-					       ra_end, stripe);
+		ra->cra_end_idx = lov_stripe_pgoff(loo->lo_lsm, index,
+						   ra_end, stripe);
 
 	/* boundary of current component */
 	ra_end = cl_index(obj, (loff_t)lov_io_extent(lio, index)->e_end);
-	if (ra_end != CL_PAGE_EOF && ra->cra_end >= ra_end)
-		ra->cra_end = ra_end - 1;
+	if (ra_end != CL_PAGE_EOF && ra->cra_end_idx >= ra_end)
+		ra->cra_end_idx = ra_end - 1;
 
 	if (r0->lo_nr == 1) /* single stripe file */
 		return 0;
@@ -1043,13 +1044,13 @@ static int lov_io_read_ahead(const struct lu_env *env,
 	pps = lov_lse(loo, index)->lsme_stripe_size >> PAGE_SHIFT;
 
 	CDEBUG(D_READA,
-	       DFID " max_index = %lu, pps = %u, index = %u, stripe_size = %u, stripe no = %u, start index = %lu\n",
-	       PFID(lu_object_fid(lov2lu(loo))), ra->cra_end, pps, index,
+	       DFID " max_index = %lu, pps = %u, index = %d, stripe_size = %u, stripe no = %u, start index = %lu\n",
+	       PFID(lu_object_fid(lov2lu(loo))), ra->cra_end_idx, pps, index,
 	       lov_lse(loo, index)->lsme_stripe_size, stripe, start);
 
 	/* never exceed the end of the stripe */
-	ra->cra_end = min_t(pgoff_t,
-			    ra->cra_end, start + pps - start % pps - 1);
+	ra->cra_end_idx = min_t(pgoff_t, ra->cra_end_idx,
+				start + pps - start % pps - 1);
 	return 0;
 }
 
diff --git a/fs/lustre/mdc/mdc_dev.c b/fs/lustre/mdc/mdc_dev.c
index 312e527..496491f 100644
--- a/fs/lustre/mdc/mdc_dev.c
+++ b/fs/lustre/mdc/mdc_dev.c
@@ -1099,8 +1099,8 @@ static int mdc_io_read_ahead(const struct lu_env *env,
 		ldlm_lock_decref(&lockh, dlmlock->l_req_mode);
 	}
 
-	ra->cra_rpc_size = osc_cli(osc)->cl_max_pages_per_rpc;
-	ra->cra_end = CL_PAGE_EOF;
+	ra->cra_rpc_pages = osc_cli(osc)->cl_max_pages_per_rpc;
+	ra->cra_end_idx = CL_PAGE_EOF;
 	ra->cra_release = osc_read_ahead_release;
 	ra->cra_cbdata = dlmlock;
 
diff --git a/fs/lustre/obdclass/integrity.c b/fs/lustre/obdclass/integrity.c
index 230e1a5..cbb91ed 100644
--- a/fs/lustre/obdclass/integrity.c
+++ b/fs/lustre/obdclass/integrity.c
@@ -229,7 +229,7 @@ static void obd_t10_performance_test(const char *obd_name,
 	for (start = jiffies, end = start + HZ / 4,
 	     bcount = 0; time_before(jiffies, end) && rc == 0; bcount++) {
 		rc = __obd_t10_performance_test(obd_name, cksum_type, page,
-						buf_len / PAGE_SIZE);
+						buf_len >> PAGE_SHIFT);
 		if (rc)
 			break;
 	}
diff --git a/fs/lustre/osc/osc_cache.c b/fs/lustre/osc/osc_cache.c
index dde03bd..7a8dbfc 100644
--- a/fs/lustre/osc/osc_cache.c
+++ b/fs/lustre/osc/osc_cache.c
@@ -1349,7 +1349,7 @@ static int osc_refresh_count(const struct lu_env *env,
 		return 0;
 	else if (cl_offset(obj, index + 1) > kms)
 		/* catch sub-page write at end of file */
-		return kms % PAGE_SIZE;
+		return kms & ~PAGE_MASK;
 	else
 		return PAGE_SIZE;
 }
diff --git a/fs/lustre/osc/osc_io.c b/fs/lustre/osc/osc_io.c
index 1ff2df2..f26c95d 100644
--- a/fs/lustre/osc/osc_io.c
+++ b/fs/lustre/osc/osc_io.c
@@ -88,12 +88,12 @@ static int osc_io_read_ahead(const struct lu_env *env,
 			ldlm_lock_decref(&lockh, dlmlock->l_req_mode);
 		}
 
-		ra->cra_rpc_size = osc_cli(osc)->cl_max_pages_per_rpc;
-		ra->cra_end = cl_index(osc2cl(osc),
-				       dlmlock->l_policy_data.l_extent.end);
+		ra->cra_rpc_pages = osc_cli(osc)->cl_max_pages_per_rpc;
+		ra->cra_end_idx = cl_index(osc2cl(osc),
+					   dlmlock->l_policy_data.l_extent.end);
 		ra->cra_release = osc_read_ahead_release;
 		ra->cra_cbdata = dlmlock;
-		if (ra->cra_end != CL_PAGE_EOF)
+		if (ra->cra_end_idx != CL_PAGE_EOF)
 			ra->cra_contention = true;
 		result = 0;
 	}
-- 
1.8.3.1
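Several hunks in the patch above replace a bare `index << PAGE_SHIFT` with `(loff_t)index << PAGE_SHIFT` (and likewise for `writeback_index`). A minimal userspace sketch, modeling a 32-bit `pgoff_t` with `uint32_t`, shows why the widening cast must happen before the shift; the function names here are illustrative only:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Shifting first wraps in 32 bits: page indexes at or past 4 GiB worth
 * of data produce a truncated byte offset. */
static int64_t byte_offset_buggy(uint32_t index)
{
	return (int64_t)(index << PAGE_SHIFT);
}

/* Widening first preserves the full 64-bit byte offset, matching the
 * (loff_t)index << PAGE_SHIFT idiom used throughout the patch. */
static int64_t byte_offset_fixed(uint32_t index)
{
	return (int64_t)index << PAGE_SHIFT;
}
```

For `index = 1 << 20` (the first page at the 4 GiB boundary with 4 KiB pages), the buggy form wraps to 0 while the fixed form yields the correct 4 GiB offset.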

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 599/622] lustre: llite: Accept EBUSY for page unaligned read
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (597 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 598/622] lustre: llite: proper names/types for offset/pages James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 600/622] lustre: handle: remove locking from class_handle2object() James Simmons
                   ` (23 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <pfarrell@whamcloud.com>

When doing unaligned strided reads, it's possible for the
first and last page of a stride to be read by another
thread on the same node, resulting in EBUSY.

This could also happen for sequential reads, for example when
several MPI processes split one large file at page-unaligned
boundaries and each program reads its portion sequentially.

We shouldn't stop readahead in these cases.
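The tolerance logic in the hunk below can be sketched in isolation as follows; `read_page_fn` is a hypothetical stand-in for `ll_read_ahead_page()`, and the sample callbacks are illustrative only:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical per-page result: 0 = page issued for read-ahead,
 * -EBUSY = page held by another reader, other negative = hard error. */
typedef int (*read_page_fn)(unsigned long page_idx);

/* Walk [start, end], tolerating up to two -EBUSY pages (the first and
 * last page of a region may be in use by another reader on the same
 * node); a third busy page or any other error stops read-ahead.
 * Returns the number of pages actually issued. */
static int readahead_span(unsigned long start, unsigned long end,
			  read_page_fn read_page)
{
	int busy_page_count = 0, count = 0;
	unsigned long idx;

	for (idx = start; idx <= end; idx++) {
		int rc = read_page(idx);

		if (rc < 0 && rc != -EBUSY)
			break;
		if (rc == -EBUSY && ++busy_page_count > 2)
			break;
		if (rc == 0)
			count++;
	}
	return count;
}

/* sample access patterns for the checks below (hypothetical) */
static int busy_edges(unsigned long idx)
{
	return (idx == 0 || idx == 9) ? -EBUSY : 0;
}

static int busy_run(unsigned long idx)
{
	return idx < 3 ? -EBUSY : 0;
}
```

With `busy_edges` only the two boundary pages are skipped and read-ahead continues; with `busy_run` the third consecutive busy page aborts the walk, as described above.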

WC-bug-id: https://jira.whamcloud.com/browse/LU-12518
Lustre-commit: b9c155065d2c ("LU-12518 llite: Accept EBUSY for page unaligned read")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35457
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/rw.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/llite/rw.c b/fs/lustre/llite/rw.c
index 9509023..1b5260d 100644
--- a/fs/lustre/llite/rw.c
+++ b/fs/lustre/llite/rw.c
@@ -360,7 +360,8 @@ static bool ras_inside_ra_window(pgoff_t idx, struct ra_io_arg *ria)
 {
 	struct cl_read_ahead ra = { 0 };
 	pgoff_t page_idx;
-	int count = 0;
+	/* busy page count is per stride */
+	int count = 0, busy_page_count = 0;
 	int rc;
 
 	LASSERT(ria);
@@ -416,8 +417,21 @@ static bool ras_inside_ra_window(pgoff_t idx, struct ra_io_arg *ria)
 
 			/* If the page is inside the read-ahead window */
 			rc = ll_read_ahead_page(env, io, queue, page_idx);
-			if (rc < 0)
+			if (rc < 0 && rc != -EBUSY)
 				break;
+			if (rc == -EBUSY) {
+				busy_page_count++;
+				CDEBUG(D_READA,
+				       "skip busy page: %lu\n", page_idx);
+				/* For page unaligned readahead the first
+				 * and last pages of each region can be read by
+				 * another reader on the same node, and so
+				 * may be busy. So only stop for > 2 busy
+				 * pages.
+				 */
+				if (busy_page_count > 2)
+					break;
+			}
 
 			*ra_end = page_idx;
 			/* Only subtract from reserve & count the page if we
@@ -441,6 +455,7 @@ static bool ras_inside_ra_window(pgoff_t idx, struct ra_io_arg *ria)
 				pos += (ria->ria_length - offset);
 				if ((pos >> PAGE_SHIFT) >= page_idx + 1)
 					page_idx = (pos >> PAGE_SHIFT) - 1;
+				busy_page_count = 0;
 				CDEBUG(D_READA,
 				       "Stride: jump %llu pages to %lu\n",
 				       ria->ria_length - offset, page_idx);
-- 
1.8.3.1


* [lustre-devel] [PATCH 600/622] lustre: handle: remove locking from class_handle2object()
@ 2020-02-27 21:17 ` James Simmons
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

There is limited value in this locking and test on h_in.

If the lookup could have run in parallel with
class_handle_unhash_nolock() and seen "h_in == 0", then it could
equally well have run moments earlier and not seen it - no locking
would prevent that, so the caller much be prepared to have
an object returned which has already been unhashed by the time it
sees the object.

In other words, any interlock between unhash and lookup must be
provided at a higher level than where this code is trying
to handle it.

The locking *does* prevent the refcount from being incremented if the
object has already been removed from the list.  As the final reference
is always dropped after that removal, it indirectly stops the refcount
from being incremented after the final reference is dropped.
This can be more directly achieved by using refcount_inc_not_zero().

So remove the locking, and replace it with refcount_inc_not_zero().
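The lookup-side pattern can be sketched in plain userspace C; `my_refcount_inc_not_zero()` below is a hypothetical stand-in for the kernel's `refcount_inc_not_zero()`, implemented with C11 atomics:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Increment *r unless it is already zero; return true on success.
 * This mirrors the refcount_inc_not_zero() contract: once the count
 * has dropped to zero (the final put), no new reference can ever be
 * taken, so the lookup path needs no lock around the increment. */
static bool my_refcount_inc_not_zero(atomic_int *r)
{
	int old = atomic_load(r);

	while (old != 0) {
		if (atomic_compare_exchange_weak(r, &old, old + 1))
			return true;
		/* old was reloaded by the failed CAS; retry */
	}
	return false;
}
```

A live object (count >= 1) gains a reference; an object whose final reference has been dropped (count == 0) is simply not returned, which is exactly the guarantee the removed `h_lock`/`h_in` pair was providing.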

WC-bug-id: https://jira.whamcloud.com/browse/LU-12542
Lustre-commit: e2458a94a6a2 ("LU-12542 handle: remove locking from class_handle2object()")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/35861
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Petros Koutoupis <pkoutoupis@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/lustre_handles.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/lustre/obdclass/lustre_handles.c b/fs/lustre/obdclass/lustre_handles.c
index 6989a60..acee2db 100644
--- a/fs/lustre/obdclass/lustre_handles.c
+++ b/fs/lustre/obdclass/lustre_handles.c
@@ -149,15 +149,12 @@ void *class_handle2object(u64 cookie, const char *owner)
 		if (h->h_cookie != cookie || h->h_owner != owner)
 			continue;
 
-		spin_lock(&h->h_lock);
-		if (likely(h->h_in != 0)) {
-			refcount_inc(&h->h_ref);
+		if (refcount_inc_not_zero(&h->h_ref)) {
 			CDEBUG(D_INFO, "GET %s %p refcount=%d\n",
 			       h->h_owner, h,
 			       refcount_read(&h->h_ref));
 			retval = h;
 		}
-		spin_unlock(&h->h_lock);
 		break;
 	}
 	rcu_read_unlock();
-- 
1.8.3.1


* [lustre-devel] [PATCH 601/622] lustre: handle: use hlist for hash lists.
@ 2020-02-27 21:17 ` James Simmons
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: NeilBrown <neilb@suse.com>

hlist_head/hlist_node is the preferred data structure
for hash tables. Not only does it make the 'head' smaller,
but is also provides hlist_unhashed() which can be used to
check if an object is in the list.  This means that
we don't need h_in any more.
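A reduced userspace model of the hlist API (names borrowed from <linux/list.h>, reimplemented here only for illustration) shows how hlist_unhashed() makes the separate h_in flag redundant:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct hlist_node {
	struct hlist_node *next, **pprev;
};

struct hlist_head {
	struct hlist_node *first;
};

static void init_hlist_node(struct hlist_node *n)
{
	n->next = NULL;
	n->pprev = NULL;
}

/* A node is "unhashed" when pprev is NULL - exactly the state that
 * the removed h_in:1 bit used to track with a separate flag. */
static bool hlist_unhashed(const struct hlist_node *n)
{
	return !n->pprev;
}

static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
	n->next = h->first;
	if (h->first)
		h->first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

static void hlist_del_init(struct hlist_node *n)
{
	if (hlist_unhashed(n))
		return;
	*n->pprev = n->next;
	if (n->next)
		n->next->pprev = n->pprev;
	init_hlist_node(n);
}
```

Because deletion reinitializes the node, "is this handle still hashed?" is answered by the node itself, with no extra state to keep in sync.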

WC-bug-id: https://jira.whamcloud.com/browse/LU-12542
Lustre-commit: 9c9ea6584cfb ("LU-12542 handle: use hlist for hash lists.")
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35862
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_handles.h  |  3 +--
 fs/lustre/ldlm/ldlm_lock.c          |  2 +-
 fs/lustre/obdclass/genops.c         |  2 +-
 fs/lustre/obdclass/lustre_handles.c | 20 +++++++++-----------
 4 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/fs/lustre/include/lustre_handles.h b/fs/lustre/include/lustre_handles.h
index 55f9a09..afdade7 100644
--- a/fs/lustre/include/lustre_handles.h
+++ b/fs/lustre/include/lustre_handles.h
@@ -58,7 +58,7 @@
  * to compute the start of the structure based on the handle field.
  */
 struct portals_handle {
-	struct list_head		h_link;
+	struct hlist_node		h_link;
 	u64				h_cookie;
 	const char			*h_owner;
 	refcount_t			h_ref;
@@ -66,7 +66,6 @@ struct portals_handle {
 	/* newly added fields to handle the RCU issue. -jxiong */
 	struct rcu_head			h_rcu;
 	spinlock_t			h_lock;
-	unsigned int			h_in:1;
 };
 
 /* handles.c */
diff --git a/fs/lustre/ldlm/ldlm_lock.c b/fs/lustre/ldlm/ldlm_lock.c
index 2c19636..396bf53 100644
--- a/fs/lustre/ldlm/ldlm_lock.c
+++ b/fs/lustre/ldlm/ldlm_lock.c
@@ -404,7 +404,7 @@ static struct ldlm_lock *ldlm_lock_new(struct ldlm_resource *resource)
 
 	lprocfs_counter_incr(ldlm_res_to_ns(resource)->ns_stats,
 			     LDLM_NSS_LOCKS);
-	INIT_LIST_HEAD(&lock->l_handle.h_link);
+	INIT_HLIST_NODE(&lock->l_handle.h_link);
 	class_handle_hash(&lock->l_handle, lock_handle_owner);
 
 	lu_ref_init(&lock->l_reference);
diff --git a/fs/lustre/obdclass/genops.c b/fs/lustre/obdclass/genops.c
index 0fbe03e..146e735 100644
--- a/fs/lustre/obdclass/genops.c
+++ b/fs/lustre/obdclass/genops.c
@@ -813,7 +813,7 @@ static struct obd_export *__class_new_export(struct obd_device *obd,
 	spin_lock_init(&export->exp_uncommitted_replies_lock);
 	INIT_LIST_HEAD(&export->exp_uncommitted_replies);
 	INIT_LIST_HEAD(&export->exp_req_replay_queue);
-	INIT_LIST_HEAD_RCU(&export->exp_handle.h_link);
+	INIT_HLIST_NODE(&export->exp_handle.h_link);
 	INIT_LIST_HEAD(&export->exp_hp_rpcs);
 	class_handle_hash(&export->exp_handle, export_handle_owner);
 	spin_lock_init(&export->exp_lock);
diff --git a/fs/lustre/obdclass/lustre_handles.c b/fs/lustre/obdclass/lustre_handles.c
index acee2db..0048036 100644
--- a/fs/lustre/obdclass/lustre_handles.c
+++ b/fs/lustre/obdclass/lustre_handles.c
@@ -48,7 +48,7 @@
 
 static struct handle_bucket {
 	spinlock_t		lock;
-	struct list_head	head;
+	struct hlist_head	head;
 } *handle_hash;
 
 #define HANDLE_HASH_SIZE (1 << 16)
@@ -63,7 +63,7 @@ void class_handle_hash(struct portals_handle *h, const char *owner)
 	struct handle_bucket *bucket;
 
 	LASSERT(h);
-	LASSERT(list_empty(&h->h_link));
+	LASSERT(hlist_unhashed(&h->h_link));
 
 	/*
 	 * This is fast, but simplistic cookie generation algorithm, it will
@@ -89,8 +89,7 @@ void class_handle_hash(struct portals_handle *h, const char *owner)
 
 	bucket = &handle_hash[h->h_cookie & HANDLE_HASH_MASK];
 	spin_lock(&bucket->lock);
-	list_add_rcu(&h->h_link, &bucket->head);
-	h->h_in = 1;
+	hlist_add_head_rcu(&h->h_link, &bucket->head);
 	spin_unlock(&bucket->lock);
 
 	CDEBUG(D_INFO, "added object %p with handle %#llx to hash\n",
@@ -100,7 +99,7 @@ void class_handle_hash(struct portals_handle *h, const char *owner)
 
 static void class_handle_unhash_nolock(struct portals_handle *h)
 {
-	if (list_empty(&h->h_link)) {
+	if (hlist_unhashed(&h->h_link)) {
 		CERROR("removing an already-removed handle (%#llx)\n",
 		       h->h_cookie);
 		return;
@@ -110,13 +109,12 @@ static void class_handle_unhash_nolock(struct portals_handle *h)
 	       h, h->h_cookie);
 
 	spin_lock(&h->h_lock);
-	if (h->h_in == 0) {
+	if (hlist_unhashed(&h->h_link)) {
 		spin_unlock(&h->h_lock);
 		return;
 	}
-	h->h_in = 0;
+	hlist_del_init_rcu(&h->h_link);
 	spin_unlock(&h->h_lock);
-	list_del_rcu(&h->h_link);
 }
 
 void class_handle_unhash(struct portals_handle *h)
@@ -145,7 +143,7 @@ void *class_handle2object(u64 cookie, const char *owner)
 	bucket = handle_hash + (cookie & HANDLE_HASH_MASK);
 
 	rcu_read_lock();
-	list_for_each_entry_rcu(h, &bucket->head, h_link) {
+	hlist_for_each_entry_rcu(h, &bucket->head, h_link) {
 		if (h->h_cookie != cookie || h->h_owner != owner)
 			continue;
 
@@ -177,7 +175,7 @@ int class_handle_init(void)
 	spin_lock_init(&handle_base_lock);
 	for (bucket = handle_hash + HANDLE_HASH_SIZE - 1; bucket >= handle_hash;
 	     bucket--) {
-		INIT_LIST_HEAD(&bucket->head);
+		INIT_HLIST_HEAD(&bucket->head);
 		spin_lock_init(&bucket->lock);
 	}
 
@@ -196,7 +194,7 @@ static int cleanup_all_handles(void)
 		struct portals_handle *h;
 
 		spin_lock(&handle_hash[i].lock);
-		list_for_each_entry_rcu(h, &handle_hash[i].head, h_link) {
+		hlist_for_each_entry_rcu(h, &handle_hash[i].head, h_link) {
 			CERROR("force clean handle %#llx addr %p owner %p\n",
 			       h->h_cookie, h, h->h_owner);
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 602/622] lustre: obdclass: convert waiting in cl_sync_io_wait().
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (600 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 601/622] lustre: handle: use hlist for hash lists James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 603/622] lnet: modules: use list_move were appropriate James Simmons
                   ` (20 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

This function will *always* wait until ->csi_sync_nr reaches zero.
The effect of the timeout is:
  1/ to report an error if the count doesn't reach zero in the given
     time
  2/ to return -ETIMEDOUT instead of csi_sync_rc if the timeout was
     exceeded.

So we rearrange the code to make that more obvious.
A small extra change is that we now call wait_event_idle() again
even if there was a timeout and the first wait succeeded.
This will simply test csi_sync_nr again and not actually wait.
We could protect it with 'rc != 0 || timeout == 0' but there seems
no point.
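For readers following along, the restructured control flow boils down to:
rc starts at 0, only an expired timed wait sets -ETIMEDOUT, and the untimed
wait always runs afterwards, so the function never returns before the count
reaches zero. A deterministic userspace sketch (the wait helpers here are
illustrative stand-ins for wait_event_idle{,_timeout}(), not kernel API):

```c
#include <assert.h>

/* Stand-in for wait_event_idle_timeout(): nonzero means the condition
 * became true within the timeout, zero means the timeout expired. */
static int fake_wait_timeout(int cond_met_in_time)
{
	return cond_met_in_time;
}

/* Sketch of the new cl_sync_io_wait() shape: a timed wait may only
 * record -ETIMEDOUT; the subsequent untimed wait guarantees the count
 * is zero; csi_sync_rc is reported only when no timeout occurred. */
static int sync_wait(long timeout, int cond_met_in_time, int csi_sync_rc)
{
	int rc = 0;

	if (timeout > 0 && fake_wait_timeout(cond_met_in_time) == 0)
		rc = -110;	/* -ETIMEDOUT */

	/* untimed wait_event_idle() would run here; if the condition
	 * already holds it just re-tests it and returns immediately */

	if (!rc)
		rc = csi_sync_rc;	/* report the I/O result */
	return rc;
}
```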

WC-bug-id: https://jira.whamcloud.com/browse/LU-10467
Lustre-commit: d6ce546eb7e2 ("LU-10467 obdclass: convert waiting in cl_sync_io_wait().")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/36102
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/cl_io.c | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/fs/lustre/obdclass/cl_io.c b/fs/lustre/obdclass/cl_io.c
index 3bc9097..e11f9fe 100644
--- a/fs/lustre/obdclass/cl_io.c
+++ b/fs/lustre/obdclass/cl_io.c
@@ -1054,27 +1054,24 @@ void cl_sync_io_init_notify(struct cl_sync_io *anchor, int nr,
 int cl_sync_io_wait(const struct lu_env *env, struct cl_sync_io *anchor,
 		    long timeout)
 {
-	int rc = 1;
+	int rc = 0;
 
 	LASSERT(timeout >= 0);
 
-	if (timeout == 0)
-		wait_event_idle(anchor->csi_waitq,
-				atomic_read(&anchor->csi_sync_nr) == 0);
-	else
-		rc = wait_event_idle_timeout(anchor->csi_waitq,
-					     atomic_read(&anchor->csi_sync_nr) == 0,
-					     timeout * HZ);
-	if (rc == 0) {
+	if (timeout > 0 &&
+	    wait_event_idle_timeout(anchor->csi_waitq,
+				    atomic_read(&anchor->csi_sync_nr) == 0,
+				    timeout * HZ) == 0) {
 		rc = -ETIMEDOUT;
 		CERROR("IO failed: %d, still wait for %d remaining entries\n",
 		       rc, atomic_read(&anchor->csi_sync_nr));
+	}
 
-		wait_event_idle(anchor->csi_waitq,
-				atomic_read(&anchor->csi_sync_nr) == 0);
-	} else {
+	wait_event_idle(anchor->csi_waitq,
+			atomic_read(&anchor->csi_sync_nr) == 0);
+	if (!rc)
 		rc = anchor->csi_sync_rc;
-	}
+
 	/* We take the lock to ensure that cl_sync_io_note() has finished */
 	spin_lock(&anchor->csi_waitq.lock);
 	LASSERT(atomic_read(&anchor->csi_sync_nr) == 0);
-- 
1.8.3.1


* [lustre-devel] [PATCH 603/622] lnet: modules: use list_move were appropriate.
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (601 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 602/622] lustre: obdclass: convert waiting in cl_sync_io_wait() James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 604/622] lnet: fix small race in unloading klnd modules James Simmons
                   ` (19 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

Rather than
  list_del(&foo);
  list_add(&foo, &bar);
use
  list_move(&foo, &bar);

Similarly for list_add_tail and list_move_tail.

In lnet_attach_rsp_tracker, local_rspt already has a suitably
initialised ->rspt_on_list, so the new_entry variable can
be discarded.
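For reference, list_move() really is just the del/add pair fused; a minimal
userspace re-implementation of the list_head idiom (simplified from the
kernel's, for illustration only) makes the equivalence easy to check:

```c
#include <assert.h>

/* Minimal userspace version of the kernel's circular doubly-linked
 * list, just to show list_move(&e, &dst) == list_del() + list_add(). */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void list_add(struct list_head *e, struct list_head *h)
{
	e->next = h->next;
	e->prev = h;
	h->next->prev = e;
	h->next = e;
}

static void list_del(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

/* The fused helper the patch switches to. */
static void list_move(struct list_head *e, struct list_head *h)
{
	list_del(e);
	list_add(e, h);
}
```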

WC-bug-id: https://jira.whamcloud.com/browse/LU-9679
Lustre-commit: 7525fd36a266 ("LU-9679 modules: use list_move were appropriate.")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/36670
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-move.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index ca292a6..47d5389 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -4374,7 +4374,6 @@ void lnet_monitor_thr_stop(void)
 			struct lnet_libmd *md, struct lnet_handle_md mdh)
 {
 	s64 timeout_ns;
-	bool new_entry = true;
 	struct lnet_rsp_tracker *local_rspt;
 
 	/* MD has a refcount taken by message so it's not going away.
@@ -4391,7 +4390,6 @@ void lnet_monitor_thr_stop(void)
 		 * update the deadline on that one.
 		 */
 		lnet_rspt_free(rspt, cpt);
-		new_entry = false;
 	} else {
 		/* new md */
 		rspt->rspt_mdh = mdh;
@@ -4406,9 +4404,7 @@ void lnet_monitor_thr_stop(void)
 	 * list in order to expire all the older entries first.
 	 */
 	lnet_net_lock(cpt);
-	if (!new_entry && !list_empty(&local_rspt->rspt_on_list))
-		list_del_init(&local_rspt->rspt_on_list);
-	list_add_tail(&local_rspt->rspt_on_list, the_lnet.ln_mt_rstq[cpt]);
+	list_move_tail(&local_rspt->rspt_on_list, the_lnet.ln_mt_rstq[cpt]);
 	lnet_net_unlock(cpt);
 	lnet_res_unlock(cpt);
 }
-- 
1.8.3.1


* [lustre-devel] [PATCH 604/622] lnet: fix small race in unloading klnd modules.
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (602 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 603/622] lnet: modules: use list_move were appropriate James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 605/622] lnet: me: discard struct lnet_handle_me James Simmons
                   ` (18 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

Reference counting of klnd modules is handled by the module itself.
Currently, it is possible for a module to be completely unloaded
between the time when the module called module_put(), and when
it subsequently returns from the function that makes that call.
During this time there may be one or two instructions to execute,
and if the module is unmapped before they are executed, an
exception will result.

The module unload will call lnet_unregister_lnd() which takes
the_lnet.ln_lnd_mutex, so module unload cannot complete while
that is held.  lnd_startup is called with this mutex held to
avoid any races, but lnd_shutdown is not.  Adding that
protection will close the race.
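A deterministic sketch of why holding the mutex closes the race: unload
cannot complete while shutdown holds the lock the unload path must take.
The one-int "mutex" below is only a stand-in for ln_lnd_mutex, and the
function names are illustrative:

```c
#include <assert.h>

static int lnd_mutex;		/* 1 = held; toy stand-in for ln_lnd_mutex */
static int module_loaded = 1;

static int mutex_trylock(int *m)
{
	if (*m)
		return 0;	/* already held */
	*m = 1;
	return 1;
}

static void mutex_unlock(int *m)
{
	*m = 0;
}

/* Module unload path: may only finish once it gets the mutex. */
static void try_unregister_lnd(void)
{
	if (mutex_trylock(&lnd_mutex)) {
		module_loaded = 0;
		mutex_unlock(&lnd_mutex);
	}
}

/* Shutdown path with the fix: hold the mutex across the LND callback,
 * so the module text cannot vanish mid-return from lnd_shutdown(). */
static void shutdown_ni(void)
{
	int ok = mutex_trylock(&lnd_mutex);

	assert(ok);
	try_unregister_lnd();		/* an unload attempt here... */
	assert(module_loaded == 1);	/* ...is held off while we run */
	mutex_unlock(&lnd_mutex);
}
```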

WC-bug-id: https://jira.whamcloud.com/browse/LU-12678
Lustre-commit: c087091cd901 ("LU-12678 lnet: fix small race in unloading klnd modules.")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/36853
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/api-ni.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 0ca8bef..5df39aa 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -1983,7 +1983,14 @@ static void lnet_push_target_fini(void)
 		islo = ni->ni_net->net_lnd->lnd_type == LOLND;
 
 		LASSERT(!in_interrupt());
+		/* Holding the mutex makes it safe for lnd_shutdown
+		 * to call module_put(). Module unload cannot finish
+		 * until lnet_unregister_lnd() completes, and that
+		 * requires the mutex.
+		 */
+		mutex_lock(&the_lnet.ln_lnd_mutex);
 		net->net_lnd->lnd_shutdown(ni);
+		mutex_unlock(&the_lnet.ln_lnd_mutex);
 
 		if (!islo)
 			CDEBUG(D_LNI, "Removed LNI %s\n",
-- 
1.8.3.1


* [lustre-devel] [PATCH 605/622] lnet: me: discard struct lnet_handle_me
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (603 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 604/622] lnet: fix small race in unloading klnd modules James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 606/622] lnet: avoid extra memory consumption James Simmons
                   ` (17 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

The Portals API uses a cookie 'handle' to identify an ME.  This is
appropriate for a user-space API for objects maintained by the
kernel, but it brings no value when the API client and
implementation are both in the kernel, as is the case with Lustre
and LNet.

Instead of using a 'handle', a pointer to the 'struct lnet_me' can
be used.  This object is not reference counted and is always freed
correctly, so there can be no case where the cookie becomes invalid
while it is still held - as can be seen by the fact that the return
value from LNetMEUnlink() is never used except to assert that it is
zero.

So use 'struct lnet_me *' directly instead of having indirection
through a 'struct lnet_handle_me'.

Also:
  - change LNetMEUnlink() to return void as it cannot fail now.
  - have LNetMEAttach() return the pointer, using ERR_PTR() to return
    errors.
  - discard ln_me_containers and don't store the me there-in.
  - store an explicit 'cpt' in each me, we no longer store one
    implicitly via the cookie.
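The new calling convention leans on the kernel's ERR_PTR idiom: one return
slot carries either a valid pointer or an encoded -errno. A userspace
re-implementation of the idiom (the attach_me() helper is hypothetical,
standing in for the reworked LNetMEAttach()):

```c
#include <assert.h>

/* Userspace copy of the kernel's linux/err.h convention: errors live in
 * the top MAX_ERRNO values of the pointer range, which no valid object
 * address occupies. */
#define MAX_ERRNO 4095

static void *ERR_PTR(long err)
{
	return (void *)err;
}

static long PTR_ERR(const void *p)
{
	return (long)p;
}

static int IS_ERR(const void *p)
{
	return (unsigned long)p >= (unsigned long)-MAX_ERRNO;
}

/* Hypothetical allocator in the LNetMEAttach() style: a pointer on
 * success, ERR_PTR(-errno) on failure, never NULL-with-separate-rc. */
static void *attach_me(int fail)
{
	static int me_object;	/* stands in for a real struct lnet_me */

	if (fail)
		return ERR_PTR(-12);	/* -ENOMEM */
	return &me_object;
}
```

Callers then use the IS_ERR()/PTR_ERR() pair exactly as the converted ptlrpc
and selftest code in this patch does.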

WC-bug-id: https://jira.whamcloud.com/browse/LU-12678
Lustre-commit: ceeeae4271fd ("LU-12678 lnet: me: discard struct lnet_handle_me")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/36859
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/niobuf.c            | 45 ++++++++++++++++-----------------
 include/linux/lnet/api.h             | 20 +++++++--------
 include/linux/lnet/lib-lnet.h        | 22 -----------------
 include/linux/lnet/lib-types.h       |  4 +--
 include/uapi/linux/lnet/lnet-types.h |  4 ---
 net/lnet/lnet/api-ni.c               | 46 ++++++++++++----------------------
 net/lnet/lnet/lib-md.c               | 16 +++++-------
 net/lnet/lnet/lib-me.c               | 48 +++++++++++-------------------------
 net/lnet/selftest/rpc.c              | 14 +++++------
 9 files changed, 76 insertions(+), 143 deletions(-)

diff --git a/fs/lustre/ptlrpc/niobuf.c b/fs/lustre/ptlrpc/niobuf.c
index fcf7bfa..26a1f97 100644
--- a/fs/lustre/ptlrpc/niobuf.c
+++ b/fs/lustre/ptlrpc/niobuf.c
@@ -118,11 +118,10 @@ static int ptlrpc_register_bulk(struct ptlrpc_request *req)
 	struct ptlrpc_bulk_desc *desc = req->rq_bulk;
 	struct lnet_process_id peer;
 	int rc = 0;
-	int rc2;
 	int posted_md;
 	int total_md;
 	u64 mbits;
-	struct lnet_handle_me me_h;
+	struct lnet_me *me;
 	struct lnet_md md;
 
 	if (OBD_FAIL_CHECK(OBD_FAIL_PTLRPC_BULK_GET_NET))
@@ -183,8 +182,9 @@ static int ptlrpc_register_bulk(struct ptlrpc_request *req)
 		    OBD_FAIL_CHECK(OBD_FAIL_PTLRPC_BULK_ATTACH)) {
 			rc = -ENOMEM;
 		} else {
-			rc = LNetMEAttach(desc->bd_portal, peer, mbits, 0,
-					  LNET_UNLINK, LNET_INS_AFTER, &me_h);
+			me = LNetMEAttach(desc->bd_portal, peer, mbits, 0,
+					  LNET_UNLINK, LNET_INS_AFTER);
+			rc = PTR_ERR_OR_ZERO(me);
 		}
 		if (rc != 0) {
 			CERROR("%s: LNetMEAttach failed x%llu/%d: rc = %d\n",
@@ -194,14 +194,13 @@ static int ptlrpc_register_bulk(struct ptlrpc_request *req)
 		}
 
 		/* About to let the network at it... */
-		rc = LNetMDAttach(me_h, md, LNET_UNLINK,
+		rc = LNetMDAttach(me, md, LNET_UNLINK,
 				  &desc->bd_mds[posted_md]);
 		if (rc != 0) {
 			CERROR("%s: LNetMDAttach failed x%llu/%d: rc = %d\n",
 			       desc->bd_import->imp_obd->obd_name, mbits,
 			       posted_md, rc);
-			rc2 = LNetMEUnlink(me_h);
-			LASSERT(rc2 == 0);
+			LNetMEUnlink(me);
 			break;
 		}
 	}
@@ -479,11 +478,10 @@ int ptlrpc_error(struct ptlrpc_request *req)
 int ptl_send_rpc(struct ptlrpc_request *request, int noreply)
 {
 	int rc;
-	int rc2;
 	unsigned int mpflag = 0;
 	struct lnet_handle_md bulk_cookie;
 	struct ptlrpc_connection *connection;
-	struct lnet_handle_me reply_me_h;
+	struct lnet_me *reply_me;
 	struct lnet_md reply_md;
 	struct obd_import *imp = request->rq_import;
 	struct obd_device *obd = imp->imp_obd;
@@ -611,10 +609,11 @@ int ptl_send_rpc(struct ptlrpc_request *request, int noreply)
 			request->rq_repmsg = NULL;
 		}
 
-		rc = LNetMEAttach(request->rq_reply_portal,/*XXX FIXME bug 249*/
-				  connection->c_peer, request->rq_xid, 0,
-				  LNET_UNLINK, LNET_INS_AFTER, &reply_me_h);
-		if (rc != 0) {
+		reply_me = LNetMEAttach(request->rq_reply_portal,
+					connection->c_peer, request->rq_xid, 0,
+					LNET_UNLINK, LNET_INS_AFTER);
+		if (IS_ERR(reply_me)) {
+			rc = PTR_ERR(reply_me);
 			CERROR("LNetMEAttach failed: %d\n", rc);
 			LASSERT(rc == -ENOMEM);
 			rc = -ENOMEM;
@@ -652,7 +651,7 @@ int ptl_send_rpc(struct ptlrpc_request *request, int noreply)
 		/* We must see the unlink callback to set rq_reply_unlinked,
 		 * so we can't auto-unlink
 		 */
-		rc = LNetMDAttach(reply_me_h, reply_md, LNET_RETAIN,
+		rc = LNetMDAttach(reply_me, reply_md, LNET_RETAIN,
 				  &request->rq_reply_md_h);
 		if (rc != 0) {
 			CERROR("LNetMDAttach failed: %d\n", rc);
@@ -710,8 +709,7 @@ int ptl_send_rpc(struct ptlrpc_request *request, int noreply)
 	 * nobody apart from the PUT's target has the right nid+XID to
 	 * access the reply buffer.
 	 */
-	rc2 = LNetMEUnlink(reply_me_h);
-	LASSERT(rc2 == 0);
+	LNetMEUnlink(reply_me);
 	/* UNLINKED callback called synchronously */
 	LASSERT(!request->rq_receiving_reply);
 
@@ -750,7 +748,7 @@ int ptlrpc_register_rqbd(struct ptlrpc_request_buffer_desc *rqbd)
 	};
 	int rc;
 	struct lnet_md md;
-	struct lnet_handle_me me_h;
+	struct lnet_me *me;
 
 	CDEBUG(D_NET, "LNetMEAttach: portal %d\n",
 	       service->srv_req_portal);
@@ -762,12 +760,12 @@ int ptlrpc_register_rqbd(struct ptlrpc_request_buffer_desc *rqbd)
 	 * which means buffer can only be attached on local CPT, and LND
 	 * threads can find it by grabbing a local lock
 	 */
-	rc = LNetMEAttach(service->srv_req_portal,
+	me = LNetMEAttach(service->srv_req_portal,
 			  match_id, 0, ~0, LNET_UNLINK,
 			  rqbd->rqbd_svcpt->scp_cpt >= 0 ?
-			  LNET_INS_LOCAL : LNET_INS_AFTER, &me_h);
-	if (rc != 0) {
-		CERROR("LNetMEAttach failed: %d\n", rc);
+			  LNET_INS_LOCAL : LNET_INS_AFTER);
+	if (IS_ERR(me)) {
+		CERROR("LNetMEAttach failed: %ld\n", PTR_ERR(me));
 		return -ENOMEM;
 	}
 
@@ -782,14 +780,13 @@ int ptlrpc_register_rqbd(struct ptlrpc_request_buffer_desc *rqbd)
 	md.user_ptr = &rqbd->rqbd_cbid;
 	md.eq_handle = ptlrpc_eq_h;
 
-	rc = LNetMDAttach(me_h, md, LNET_UNLINK, &rqbd->rqbd_md_h);
+	rc = LNetMDAttach(me, md, LNET_UNLINK, &rqbd->rqbd_md_h);
 	if (rc == 0)
 		return 0;
 
 	CERROR("LNetMDAttach failed: %d;\n", rc);
 	LASSERT(rc == -ENOMEM);
-	rc = LNetMEUnlink(me_h);
-	LASSERT(rc == 0);
+	LNetMEUnlink(me);
 	rqbd->rqbd_refcount = 0;
 
 	return -ENOMEM;
diff --git a/include/linux/lnet/api.h b/include/linux/lnet/api.h
index ac602fc..f9f6860 100644
--- a/include/linux/lnet/api.h
+++ b/include/linux/lnet/api.h
@@ -94,15 +94,15 @@
  * and removed from its list by LNetMEUnlink().
  * @{
  */
-int LNetMEAttach(unsigned int portal,
-		 struct lnet_process_id match_id_in,
-		 u64 match_bits_in,
-		 u64 ignore_bits_in,
-		 enum lnet_unlink unlink_in,
-		 enum lnet_ins_pos pos_in,
-		 struct lnet_handle_me *handle_out);
-
-int LNetMEUnlink(struct lnet_handle_me current_in);
+struct lnet_me *
+LNetMEAttach(unsigned int portal,
+	     struct lnet_process_id match_id_in,
+	     u64 match_bits_in,
+	     u64 ignore_bits_in,
+	     enum lnet_unlink unlink_in,
+	     enum lnet_ins_pos pos_in);
+
+void LNetMEUnlink(struct lnet_me *current_in);
 /** @} lnet_me */
 
 /** \defgroup lnet_md Memory descriptors
@@ -118,7 +118,7 @@ int LNetMEAttach(unsigned int portal,
  * associated with a MD: LNetMDUnlink().
  * @{
  */
-int LNetMDAttach(struct lnet_handle_me current_in,
+int LNetMDAttach(struct lnet_me *current_in,
 		 struct lnet_md md_in,
 		 enum lnet_unlink unlink_in,
 		 struct lnet_handle_md *md_handle_out);
diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index bf357b0..a8051fe 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -326,28 +326,6 @@ void lnet_res_lh_initialize(struct lnet_res_container *rec,
 }
 
 static inline void
-lnet_me2handle(struct lnet_handle_me *handle, struct lnet_me *me)
-{
-	handle->cookie = me->me_lh.lh_cookie;
-}
-
-static inline struct lnet_me *
-lnet_handle2me(struct lnet_handle_me *handle)
-{
-	/* ALWAYS called with resource lock held */
-	struct lnet_libhandle *lh;
-	int cpt;
-
-	cpt = lnet_cpt_of_cookie(handle->cookie);
-	lh = lnet_res_lh_lookup(the_lnet.ln_me_containers[cpt],
-				handle->cookie);
-	if (!lh)
-		return NULL;
-
-	return lh_entry(lh, struct lnet_me, me_lh);
-}
-
-static inline void
 lnet_peer_net_addref_locked(struct lnet_peer_net *lpn)
 {
 	atomic_inc(&lpn->lpn_refcount);
diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index 9055da9..3345940 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -192,7 +192,7 @@ struct lnet_eq {
 
 struct lnet_me {
 	struct list_head	 me_list;
-	struct lnet_libhandle	 me_lh;
+	int			 me_cpt;
 	struct lnet_process_id	 me_match_id;
 	unsigned int		 me_portal;
 	unsigned int		 me_pos;	/* hash offset in mt_hash */
@@ -1027,8 +1027,6 @@ struct lnet {
 	int				ln_nportals;
 	/* the vector of portals */
 	struct lnet_portal	      **ln_portals;
-	/* percpt ME containers */
-	struct lnet_res_container     **ln_me_containers;
 	/* percpt MD container */
 	struct lnet_res_container     **ln_md_containers;
 
diff --git a/include/uapi/linux/lnet/lnet-types.h b/include/uapi/linux/lnet/lnet-types.h
index cf263b9..118340f 100644
--- a/include/uapi/linux/lnet/lnet-types.h
+++ b/include/uapi/linux/lnet/lnet-types.h
@@ -374,10 +374,6 @@ static inline int LNetMDHandleIsInvalid(struct lnet_handle_md h)
 	return (LNET_WIRE_HANDLE_COOKIE_NONE == h.cookie);
 }
 
-struct lnet_handle_me {
-	u64	cookie;
-};
-
 /**
  * Global process ID.
  */
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 5df39aa..852bb0c 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -1115,14 +1115,6 @@ struct list_head **
 	if (rc)
 		goto failed;
 
-	recs = lnet_res_containers_create(LNET_COOKIE_TYPE_ME);
-	if (!recs) {
-		rc = -ENOMEM;
-		goto failed;
-	}
-
-	the_lnet.ln_me_containers = recs;
-
 	recs = lnet_res_containers_create(LNET_COOKIE_TYPE_MD);
 	if (!recs) {
 		rc = -ENOMEM;
@@ -1185,11 +1177,6 @@ struct list_head **
 		the_lnet.ln_md_containers = NULL;
 	}
 
-	if (the_lnet.ln_me_containers) {
-		lnet_res_containers_destroy(the_lnet.ln_me_containers);
-		the_lnet.ln_me_containers = NULL;
-	}
-
 	lnet_res_container_cleanup(&the_lnet.ln_eq_container);
 
 	lnet_msg_containers_destroy();
@@ -1594,7 +1581,7 @@ struct lnet_ping_buffer *
 		.nid = LNET_NID_ANY,
 		.pid = LNET_PID_ANY
 	};
-	struct lnet_handle_me me_handle;
+	struct lnet_me *me;
 	struct lnet_md md = { NULL };
 	int rc, rc2;
 
@@ -1614,11 +1601,11 @@ struct lnet_ping_buffer *
 	}
 
 	/* Ping target ME/MD */
-	rc = LNetMEAttach(LNET_RESERVED_PORTAL, id,
+	me = LNetMEAttach(LNET_RESERVED_PORTAL, id,
 			  LNET_PROTO_PING_MATCHBITS, 0,
-			  LNET_UNLINK, LNET_INS_AFTER,
-			  &me_handle);
-	if (rc) {
+			  LNET_UNLINK, LNET_INS_AFTER);
+	if (IS_ERR(me)) {
+		rc = PTR_ERR(me);
 		CERROR("Can't create ping target ME: %d\n", rc);
 		goto fail_decref_ping_buffer;
 	}
@@ -1633,7 +1620,7 @@ struct lnet_ping_buffer *
 	md.eq_handle = the_lnet.ln_ping_target_eq;
 	md.user_ptr = *ppbuf;
 
-	rc = LNetMDAttach(me_handle, md, LNET_RETAIN, ping_mdh);
+	rc = LNetMDAttach(me, md, LNET_RETAIN, ping_mdh);
 	if (rc) {
 		CERROR("Can't attach ping target MD: %d\n", rc);
 		goto fail_unlink_ping_me;
@@ -1643,8 +1630,7 @@ struct lnet_ping_buffer *
 	return 0;
 
 fail_unlink_ping_me:
-	rc2 = LNetMEUnlink(me_handle);
-	LASSERT(!rc2);
+	LNetMEUnlink(me);
 fail_decref_ping_buffer:
 	LASSERT(lnet_ping_buffer_numref(*ppbuf) == 1);
 	lnet_ping_buffer_decref(*ppbuf);
@@ -1773,7 +1759,7 @@ int lnet_push_target_resize(void)
 		.pid	= LNET_PID_ANY
 	};
 	struct lnet_md md = { NULL };
-	struct lnet_handle_me meh;
+	struct lnet_me *me;
 	struct lnet_handle_md mdh;
 	struct lnet_handle_md old_mdh;
 	struct lnet_ping_buffer *pbuf;
@@ -1792,11 +1778,11 @@ int lnet_push_target_resize(void)
 		goto fail_return;
 	}
 
-	rc = LNetMEAttach(LNET_RESERVED_PORTAL, id,
+	me = LNetMEAttach(LNET_RESERVED_PORTAL, id,
 			  LNET_PROTO_PING_MATCHBITS, 0,
-			  LNET_UNLINK, LNET_INS_AFTER,
-			  &meh);
-	if (rc) {
+			  LNET_UNLINK, LNET_INS_AFTER);
+	if (IS_ERR(me)) {
+		rc = PTR_ERR(me);
 		CERROR("Can't create push target ME: %d\n", rc);
 		goto fail_decref_pbuf;
 	}
@@ -1811,10 +1797,10 @@ int lnet_push_target_resize(void)
 	md.user_ptr = pbuf;
 	md.eq_handle = the_lnet.ln_push_target_eq;
 
-	rc = LNetMDAttach(meh, md, LNET_RETAIN, &mdh);
+	rc = LNetMDAttach(me, md, LNET_RETAIN, &mdh);
 	if (rc) {
 		CERROR("Can't attach push MD: %d\n", rc);
-		goto fail_unlink_meh;
+		goto fail_unlink_me;
 	}
 	lnet_ping_buffer_addref(pbuf);
 
@@ -1837,8 +1823,8 @@ int lnet_push_target_resize(void)
 
 	return 0;
 
-fail_unlink_meh:
-	LNetMEUnlink(meh);
+fail_unlink_me:
+	LNetMEUnlink(me);
 fail_decref_pbuf:
 	lnet_ping_buffer_decref(pbuf);
 fail_return:
diff --git a/net/lnet/lnet/lib-md.c b/net/lnet/lnet/lib-md.c
index 5ee43c2..4dae58f 100644
--- a/net/lnet/lnet/lib-md.c
+++ b/net/lnet/lnet/lib-md.c
@@ -337,7 +337,7 @@ int lnet_cpt_of_md(struct lnet_libmd *md, unsigned int offset)
 /**
  * Create a memory descriptor and attach it to a ME
  *
- * @meh		A handle for a ME to associate the new MD with.
+ * @me		An ME to associate the new MD with.
  * @umd		Provides initial values for the user-visible parts of a MD.
  *		Other than its use for initialization, there is no linkage
  *		between this structure and the MD maintained by the LNet.
@@ -354,19 +354,18 @@ int lnet_cpt_of_md(struct lnet_libmd *md, unsigned int offset)
  * Return:	0 on success.
  *		-EINVAL If @umd is not valid.
  *		-ENOMEM If new MD cannot be allocated.
- *		-ENOENT Either @meh or @umd.eq_handle does not point to a
+ *		-ENOENT Either @me or @umd.eq_handle does not point to a
  *		valid object. Note that it's OK to supply a NULL @umd.eq_handle
  *		by calling LNetInvalidateHandle() on it.
- *		-EBUSY if the ME pointed to by @meh is already associated with
+ *		-EBUSY if the ME pointed to by @me is already associated with
  *		a MD.
  */
 int
-LNetMDAttach(struct lnet_handle_me meh, struct lnet_md umd,
+LNetMDAttach(struct lnet_me *me, struct lnet_md umd,
 	     enum lnet_unlink unlink, struct lnet_handle_md *handle)
 {
 	LIST_HEAD(matches);
 	LIST_HEAD(drops);
-	struct lnet_me *me;
 	struct lnet_libmd *md;
 	int cpt;
 	int rc;
@@ -389,14 +388,11 @@ int lnet_cpt_of_md(struct lnet_libmd *md, unsigned int offset)
 	if (rc)
 		goto out_free;
 
-	cpt = lnet_cpt_of_cookie(meh.cookie);
+	cpt = me->me_cpt;
 
 	lnet_res_lock(cpt);
 
-	me = lnet_handle2me(&meh);
-	if (!me)
-		rc = -ENOENT;
-	else if (me->me_md)
+	if (me->me_md)
 		rc = -EBUSY;
 	else
 		rc = lnet_md_link(md, umd.eq_handle, cpt);
diff --git a/net/lnet/lnet/lib-me.c b/net/lnet/lnet/lib-me.c
index 47cf498..d17f41d 100644
--- a/net/lnet/lnet/lib-me.c
+++ b/net/lnet/lnet/lib-me.c
@@ -62,20 +62,16 @@
  * @pos		Indicates whether the new ME should be prepended or
  *		appended to the match list. Allowed constants: LNET_INS_BEFORE,
  *		LNET_INS_AFTER.
- * @handle	On successful returns, a handle to the newly created ME object
- *		is saved here. This handle can be used later in LNetMEUnlink(),
- *		or LNetMDAttach() functions.
  *
- * Return:	0 On success.
- *		-EINVAL If @portal is invalid.
- *		-ENOMEM If new ME object cannot be allocated.
+ * Return:	Pointer to the newly created ME on success.
+ *		ERR_PTR(-EINVAL) If @portal is invalid.
+ *		ERR_PTR(-ENOMEM) If new ME object cannot be allocated.
  */
-int
+struct lnet_me *
 LNetMEAttach(unsigned int portal,
 	     struct lnet_process_id match_id,
 	     u64 match_bits, u64 ignore_bits,
-	     enum lnet_unlink unlink, enum lnet_ins_pos pos,
-	     struct lnet_handle_me *handle)
+	     enum lnet_unlink unlink, enum lnet_ins_pos pos)
 {
 	struct lnet_match_table *mtable;
 	struct lnet_me *me;
@@ -84,16 +80,16 @@
 	LASSERT(the_lnet.ln_refcount > 0);
 
 	if ((int)portal >= the_lnet.ln_nportals)
-		return -EINVAL;
+		return ERR_PTR(-EINVAL);
 
 	mtable = lnet_mt_of_attach(portal, match_id,
 				   match_bits, ignore_bits, pos);
 	if (!mtable) /* can't match portal type */
-		return -EPERM;
+		return ERR_PTR(-EPERM);
 
 	me = kmem_cache_alloc(lnet_mes_cachep, GFP_NOFS | __GFP_ZERO);
 	if (!me)
-		return -ENOMEM;
+		return ERR_PTR(-ENOMEM);
 
 	lnet_res_lock(mtable->mt_cpt);
 
@@ -104,8 +100,8 @@
 	me->me_unlink = unlink;
 	me->me_md = NULL;
 
-	lnet_res_lh_initialize(the_lnet.ln_me_containers[mtable->mt_cpt],
-			       &me->me_lh);
+	me->me_cpt = mtable->mt_cpt;
+
 	if (ignore_bits)
 		head = &mtable->mt_mhash[LNET_MT_HASH_IGNORE];
 	else
@@ -117,10 +113,8 @@
 	else
 		list_add(&me->me_list, head);
 
-	lnet_me2handle(handle, me);
-
 	lnet_res_unlock(mtable->mt_cpt);
-	return 0;
+	return me;
 }
 EXPORT_SYMBOL(LNetMEAttach);
 
@@ -132,32 +126,22 @@
  * and an unlink event will be generated. It is an error to use the ME handle
  * after calling LNetMEUnlink().
  *
- * @meh		A handle for the ME to be unlinked.
- *
- * Return	0 On success.
- *		-ENOENT If @meh does not point to a valid ME.
+ * @me		The ME to be unlinked.
  *
  * \see LNetMDUnlink() for the discussion on delivering unlink event.
  */
-int
-LNetMEUnlink(struct lnet_handle_me meh)
+void
+LNetMEUnlink(struct lnet_me *me)
 {
-	struct lnet_me *me;
 	struct lnet_libmd *md;
 	struct lnet_event ev;
 	int cpt;
 
 	LASSERT(the_lnet.ln_refcount > 0);
 
-	cpt = lnet_cpt_of_cookie(meh.cookie);
+	cpt = me->me_cpt;
 	lnet_res_lock(cpt);
 
-	me = lnet_handle2me(&meh);
-	if (!me) {
-		lnet_res_unlock(cpt);
-		return -ENOENT;
-	}
-
 	md = me->me_md;
 	if (md) {
 		md->md_flags |= LNET_MD_FLAG_ABORTED;
@@ -170,7 +154,6 @@
 	lnet_me_unlink(me);
 
 	lnet_res_unlock(cpt);
-	return 0;
 }
 EXPORT_SYMBOL(LNetMEUnlink);
 
@@ -188,6 +171,5 @@
 		lnet_md_unlink(md);
 	}
 
-	lnet_res_lh_invalidate(&me->me_lh);
 	kfree(me);
 }
diff --git a/net/lnet/selftest/rpc.c b/net/lnet/selftest/rpc.c
index 7a8226c..531377d 100644
--- a/net/lnet/selftest/rpc.c
+++ b/net/lnet/selftest/rpc.c
@@ -360,11 +360,12 @@ struct srpc_bulk *
 {
 	int rc;
 	struct lnet_md md;
-	struct lnet_handle_me meh;
+	struct lnet_me *me;
 
-	rc = LNetMEAttach(portal, peer, matchbits, 0, LNET_UNLINK,
-			  local ? LNET_INS_LOCAL : LNET_INS_AFTER, &meh);
-	if (rc) {
+	me = LNetMEAttach(portal, peer, matchbits, 0, LNET_UNLINK,
+			  local ? LNET_INS_LOCAL : LNET_INS_AFTER);
+	if (IS_ERR(me)) {
+		rc = PTR_ERR(me);
 		CERROR("LNetMEAttach failed: %d\n", rc);
 		LASSERT(rc == -ENOMEM);
 		return -ENOMEM;
@@ -377,13 +378,12 @@ struct srpc_bulk *
 	md.options = options;
 	md.eq_handle = srpc_data.rpc_lnet_eq;
 
-	rc = LNetMDAttach(meh, md, LNET_UNLINK, mdh);
+	rc = LNetMDAttach(me, md, LNET_UNLINK, mdh);
 	if (rc) {
 		CERROR("LNetMDAttach failed: %d\n", rc);
 		LASSERT(rc == -ENOMEM);
 
-		rc = LNetMEUnlink(meh);
-		LASSERT(!rc);
+		LNetMEUnlink(me);
 		return -ENOMEM;
 	}
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 606/622] lnet: avoid extra memory consumption
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (604 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 605/622] lnet: me: discard struct lnet_handle_me James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 607/622] lustre: uapi: remove unused LUSTRE_DIRECTIO_FL James Simmons
                   ` (16 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Alexey Lyashkov <c17817@cray.com>

Use slab allocation for the rsp_tracker and lnet_msg
structs to avoid memory fragmentation.
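Loosely speaking, a dedicated kmem_cache gives same-sized objects a private
free list, so freed slots are recycled instead of fragmenting a
general-purpose pool. A toy userspace analogue of that reuse property (all
names here are illustrative, not the kernel slab API):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy fixed-size object cache: freed slots go on a free list keyed by
 * the cache, so the next allocation of that size reuses them. */
struct toy_cache {
	size_t obj_size;	/* must be >= sizeof(void *) */
	void *free_list;	/* first word of a free slot links onward */
};

static void *toy_cache_zalloc(struct toy_cache *c)
{
	void *obj = c->free_list;

	if (obj)
		c->free_list = *(void **)obj;	/* pop a recycled slot */
	else
		obj = malloc(c->obj_size);
	if (obj)
		memset(obj, 0, c->obj_size);	/* zalloc semantics */
	return obj;
}

static void toy_cache_free(struct toy_cache *c, void *obj)
{
	*(void **)obj = c->free_list;		/* push slot for reuse */
	c->free_list = obj;
}
```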

Cray-bug-id: LUS-8190
WC-bug-id: https://jira.whamcloud.com/browse/LU-13036
Lustre-commit: a3ce59ae2c62 ("LU-13036 lnet: avoid extra memory consumption")
Signed-off-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-on: https://review.whamcloud.com/36897
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Alexander Zarochentsev <c17826@cray.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h | 13 ++++++++++---
 net/lnet/lnet/api-ni.c        | 28 ++++++++++++++++++++++++----
 net/lnet/lnet/lib-move.c      | 11 ++++++-----
 3 files changed, 40 insertions(+), 12 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index a8051fe..de0cef0 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -83,13 +83,17 @@
 
 /* default timeout */
 #define DEFAULT_PEER_TIMEOUT    180
+#define LNET_LND_DEFAULT_TIMEOUT 5
+
+bool lnet_is_route_alive(struct lnet_route *route);
 
 #define LNET_SMALL_MD_SIZE	offsetof(struct lnet_libmd, md_iov.iov[1])
 extern struct kmem_cache *lnet_mes_cachep;	 /* MEs kmem_cache */
 extern struct kmem_cache *lnet_small_mds_cachep; /* <= LNET_SMALL_MD_SIZE bytes
 						  * MDs kmem_cache
 						  */
-#define LNET_LND_DEFAULT_TIMEOUT 5
+extern struct kmem_cache *lnet_rspt_cachep;
+extern struct kmem_cache *lnet_msg_cachep;
 
 bool lnet_is_route_alive(struct lnet_route *route);
 bool lnet_is_gateway_alive(struct lnet_peer *gw);
@@ -417,19 +421,22 @@ void lnet_res_lh_initialize(struct lnet_res_container *rec,
 {
 	struct lnet_rsp_tracker *rspt;
 
-	rspt = kzalloc(sizeof(*rspt), GFP_NOFS);
+	rspt = kmem_cache_zalloc(lnet_rspt_cachep, GFP_NOFS);
 	if (rspt) {
 		lnet_net_lock(cpt);
 		the_lnet.ln_counters[cpt]->lct_health.lch_rst_alloc++;
 		lnet_net_unlock(cpt);
 	}
+	CDEBUG(D_MALLOC, "rspt alloc %p\n", rspt);
 	return rspt;
 }
 
 static inline void
 lnet_rspt_free(struct lnet_rsp_tracker *rspt, int cpt)
 {
-	kfree(rspt);
+	CDEBUG(D_MALLOC, "rspt free %p\n", rspt);
+
+	kmem_cache_free(lnet_rspt_cachep, rspt);
 	lnet_net_lock(cpt);
 	the_lnet.ln_counters[cpt]->lct_health.lch_rst_alloc--;
 	lnet_net_unlock(cpt);
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 852bb0c..b9c38f3 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -494,9 +494,11 @@ static int lnet_discover(struct lnet_process_id id, u32 force,
 struct kmem_cache *lnet_small_mds_cachep;  /* <= LNET_SMALL_MD_SIZE bytes
 					    *  MDs kmem_cache
 					    */
+struct kmem_cache *lnet_rspt_cachep;	   /* response tracker cache */
+struct kmem_cache *lnet_msg_cachep;
 
 static int
-lnet_descriptor_setup(void)
+lnet_slab_setup(void)
 {
 	/* create specific kmem_cache for MEs and small MDs (i.e., originally
 	 * allocated in <size-xxx> kmem_cache).
@@ -512,12 +514,30 @@ static int lnet_discover(struct lnet_process_id id, u32 force,
 	if (!lnet_small_mds_cachep)
 		return -ENOMEM;
 
+	lnet_rspt_cachep = kmem_cache_create("lnet_rspt",
+					     sizeof(struct lnet_rsp_tracker),
+					     0, 0, NULL);
+	if (!lnet_rspt_cachep)
+		return -ENOMEM;
+
+	lnet_msg_cachep = kmem_cache_create("lnet_msg",
+					    sizeof(struct lnet_msg),
+					    0, 0, NULL);
+	if (!lnet_msg_cachep)
+		return -ENOMEM;
+
 	return 0;
 }
 
 static void
-lnet_descriptor_cleanup(void)
+lnet_slab_cleanup(void)
 {
+	kmem_cache_destroy(lnet_msg_cachep);
+	lnet_msg_cachep = NULL;
+
+	kmem_cache_destroy(lnet_rspt_cachep);
+	lnet_rspt_cachep = NULL;
+
 	kmem_cache_destroy(lnet_small_mds_cachep);
 	lnet_small_mds_cachep = NULL;
 
@@ -1081,7 +1101,7 @@ struct list_head **
 	LNetInvalidateEQHandle(&the_lnet.ln_mt_eqh);
 	init_completion(&the_lnet.ln_started);
 
-	rc = lnet_descriptor_setup();
+	rc = lnet_slab_setup();
 	if (rc != 0)
 		goto failed;
 
@@ -1188,7 +1208,7 @@ struct list_head **
 		the_lnet.ln_counters = NULL;
 	}
 	lnet_destroy_remote_nets_table();
-	lnet_descriptor_cleanup();
+	lnet_slab_cleanup();
 
 	return 0;
 }
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 47d5389..cd36d52 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -4186,7 +4186,7 @@ void lnet_monitor_thr_stop(void)
 		}
 	}
 
-	msg = kzalloc(sizeof(*msg), GFP_NOFS);
+	msg = kmem_cache_zalloc(lnet_msg_cachep, GFP_NOFS);
 	if (!msg) {
 		CERROR("%s, src %s: Dropping %s (out of memory)\n",
 		       libcfs_nid2str(from_nid), libcfs_nid2str(src_nid),
@@ -4194,7 +4194,7 @@ void lnet_monitor_thr_stop(void)
 		goto drop;
 	}
 
-	/* msg zeroed by kzalloc()
+	/* msg zeroed by kmem_cache_zalloc().
 	 * i.e. flags all clear, pointers NULL etc
 	 */
 	msg->msg_type = type;
@@ -4475,7 +4475,7 @@ void lnet_monitor_thr_stop(void)
 		return -EIO;
 	}
 
-	msg = kzalloc(sizeof(*msg), GFP_NOFS);
+	msg = kmem_cache_zalloc(lnet_msg_cachep, GFP_NOFS);
 	if (!msg) {
 		CERROR("Dropping PUT to %s: ENOMEM on struct lnet_msg\n",
 		       libcfs_id2str(target));
@@ -4571,7 +4571,7 @@ struct lnet_msg *
 	 * CAVEAT EMPTOR: 'getmsg' is the original GET, which is freed when
 	 * lnet_finalize() is called on it, so the LND must call this first
 	 */
-	struct lnet_msg *msg = kzalloc(sizeof(*msg), GFP_NOFS);
+	struct lnet_msg *msg;
 	struct lnet_libmd *getmd = getmsg->msg_md;
 	struct lnet_process_id peer_id = getmsg->msg_target;
 	int cpt;
@@ -4579,6 +4579,7 @@ struct lnet_msg *
 	LASSERT(!getmsg->msg_target_is_router);
 	LASSERT(!getmsg->msg_routing);
 
+	msg = kmem_cache_zalloc(lnet_msg_cachep, GFP_NOFS);
 	if (!msg) {
 		CERROR("%s: Dropping REPLY from %s: can't allocate msg\n",
 		       libcfs_nid2str(ni->ni_nid), libcfs_id2str(peer_id));
@@ -4708,7 +4709,7 @@ struct lnet_msg *
 		return -EIO;
 	}
 
-	msg = kzalloc(sizeof(*msg), GFP_NOFS);
+	msg = kmem_cache_zalloc(lnet_msg_cachep, GFP_NOFS);
 	if (!msg) {
 		CERROR("Dropping GET to %s: ENOMEM on struct lnet_msg\n",
 		       libcfs_id2str(target));
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 607/622] lustre: uapi: remove unused LUSTRE_DIRECTIO_FL
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (605 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 606/622] lnet: avoid extra memory consumption James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 608/622] lustre: lustre: Reserve OST_FALLOCATE(fallocate) opcode James Simmons
                   ` (15 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

The LUSTRE_DIRECTIO_FL flag was added based on the upstream
FS_DIRECTIO_FL flag in the hope that it might be useful, but it has
since been removed from the upstream kernel in commit
v4.4-rc4-22-g68ce7bfcd995 and replaced by FS_VERITY_FL, which uses
the same value (kernel commit v5.3-rc2-4-gfe9918d3b228) and which we
are much more likely to use.

Since LUSTRE_DIRECTIO_FL was unused, there is no risk to remove it.

WC-bug-id: https://jira.whamcloud.com/browse/LU-13164
Lustre-commit: ff168481a1b2 ("LU-13164 uapi: remove unused LUSTRE_DIRECTIO_FL")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/37295
Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/wiretest.c            | 2 --
 include/uapi/linux/lustre/lustre_idl.h | 1 -
 2 files changed, 3 deletions(-)

diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index 6c66815..96f327f 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -2176,8 +2176,6 @@ void lustre_assert_wire_constants(void)
 		 LUSTRE_DIRSYNC_FL);
 	LASSERTF(LUSTRE_TOPDIR_FL == 0x00020000, "found 0x%.8x\n",
 		 LUSTRE_TOPDIR_FL);
-	LASSERTF(LUSTRE_DIRECTIO_FL == 0x00100000, "found 0x%.8x\n",
-		 LUSTRE_DIRECTIO_FL);
 	LASSERTF(LUSTRE_INLINE_DATA_FL == 0x10000000, "found 0x%.8x\n",
 		 LUSTRE_INLINE_DATA_FL);
 	LASSERTF(MDS_INODELOCK_LOOKUP == 0x00000001UL, "found 0x%.8x\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 19ac0cb..df2e34b 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -1543,7 +1543,6 @@ enum {
 #define LUSTRE_INDEX_FL		0x00001000 /* hash-indexed directory */
 #define LUSTRE_DIRSYNC_FL	0x00010000 /* dirsync behaviour (dir only) */
 #define LUSTRE_TOPDIR_FL	0x00020000 /* Top of directory hierarchies*/
-#define LUSTRE_DIRECTIO_FL	0x00100000 /* Use direct i/o */
 #define LUSTRE_INLINE_DATA_FL	0x10000000 /* Inode has inline data. */
 #define LUSTRE_PROJINHERIT_FL	0x20000000 /* Create with parents projid */
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 608/622] lustre: lustre: Reserve OST_FALLOCATE(fallocate) opcode
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (606 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 607/622] lustre: uapi: remove unused LUSTRE_DIRECTIO_FL James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 609/622] lnet: libcfs: Cleanup use of bare printk James Simmons
                   ` (14 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Swapnil Pimpale <spimpale@ddn.com>

A new RPC, OST_FALLOCATE, has been added for space preallocation.
This patch reserves the OST_FALLOCATE opcode for the fallocate()
syscall. Reserving the opcode upfront ensures consistency and avoids
protocol interoperability issues in the future.

WC-bug-id: https://jira.whamcloud.com/browse/LU-3606
Lustre-commit: 46a11df089c9 ("LU-3606 lustre: Reserve OST_FALLOCATE(fallocate) opcode")
Signed-off-by: Swapnil Pimpale <spimpale@ddn.com>
Signed-off-by: Li Xi <lixi@ddn.com>
Signed-off-by: Abrarahmed Momin <abrar.momin@gmail.com>
Signed-off-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-on: https://review.whamcloud.com/37277
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/lproc_ptlrpc.c        | 3 ++-
 fs/lustre/ptlrpc/wiretest.c            | 4 +++-
 include/uapi/linux/lustre/lustre_idl.h | 2 ++
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/ptlrpc/lproc_ptlrpc.c b/fs/lustre/ptlrpc/lproc_ptlrpc.c
index f34aec3..fc7aa3e 100644
--- a/fs/lustre/ptlrpc/lproc_ptlrpc.c
+++ b/fs/lustre/ptlrpc/lproc_ptlrpc.c
@@ -67,6 +67,7 @@
 	{ OST_QUOTACTL,				"ost_quotactl" },
 	{ OST_QUOTA_ADJUST_QUNIT,		"ost_quota_adjust_qunit" },
 	{ OST_LADVISE,				"ost_ladvise" },
+	{ OST_FALLOCATE,			"ost_fallocate"},
 	{ MDS_GETATTR,				"mds_getattr" },
 	{ MDS_GETATTR_NAME,			"mds_getattr_lock" },
 	{ MDS_CLOSE,				"mds_close" },
@@ -115,7 +116,7 @@
 	{ 401, /* was OBD_LOG_CANCEL */		"llog_cancel" },
 	{ 402, /* was OBD_QC_CALLBACK */	"obd_quota_callback" },
 	{ OBD_IDX_READ,				"dt_index_read" },
-	{ LLOG_ORIGIN_HANDLE_CREATE,		 "llog_origin_handle_open" },
+	{ LLOG_ORIGIN_HANDLE_CREATE,		"llog_origin_handle_open" },
 	{ LLOG_ORIGIN_HANDLE_NEXT_BLOCK,	"llog_origin_handle_next_block" },
 	{ LLOG_ORIGIN_HANDLE_READ_HEADER,	"llog_origin_handle_read_header" },
 	{ 504, /*LLOG_ORIGIN_HANDLE_WRITE_REC*/	"llog_origin_handle_write_rec" },
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index 96f327f..d94d2d9 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -106,7 +106,9 @@ void lustre_assert_wire_constants(void)
 		 (long long)OST_QUOTA_ADJUST_QUNIT);
 	LASSERTF(OST_LADVISE == 21, "found %lld\n",
 		 (long long)OST_LADVISE);
-	LASSERTF(OST_LAST_OPC == 22, "found %lld\n",
+	LASSERTF(OST_FALLOCATE == 22, "found %lld\n",
+		 (long long)OST_FALLOCATE);
+	LASSERTF(OST_LAST_OPC == 23, "found %lld\n",
 		 (long long)OST_LAST_OPC);
 	LASSERTF(OBD_OBJECT_EOF == 0xffffffffffffffffULL, "found 0x%.16llxULL\n",
 		 OBD_OBJECT_EOF);
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index df2e34b..12ab369 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -956,6 +956,7 @@ enum ost_cmd {
 	OST_QUOTACTL	= 19,
 	OST_QUOTA_ADJUST_QUNIT = 20, /* not used since 2.4 */
 	OST_LADVISE	= 21,
+	OST_FALLOCATE	= 22,
 	OST_LAST_OPC /* must be < 33 to avoid MDS_GETATTR */
 };
 #define OST_FIRST_OPC  OST_REPLY
@@ -2789,6 +2790,7 @@ struct obdo {
 #define o_dropped o_misc
 #define o_cksum   o_nlink
 #define o_grant_used o_data_version
+#define o_falloc_mode o_nlink
 
 /* request structure for OST's */
 struct ost_body {
-- 
1.8.3.1


* [lustre-devel] [PATCH 609/622] lnet: libcfs: Cleanup use of bare printk
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (607 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 608/622] lustre: lustre: Reserve OST_FALLOCATE(fallocate) opcode James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 610/622] lnet: Do not assume peers are MR capable James Simmons
                   ` (13 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Shaun Tancheff <stancheff@cray.com>

Some users of printk(<LEVEL> "fmt", ...) can be converted to their
pr_<level>("fmt", ...) equivalents.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12861
Lustre-commit: b4c8a5180dec ("LU-12861 libcfs: Cleanup use of bare printk")
Signed-off-by: Shaun Tancheff <stancheff@cray.com>
Reviewed-on: https://review.whamcloud.com/37046
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: Petros Koutoupis <pkoutoupis@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/libcfs/debug.c     |  2 +-
 net/lnet/libcfs/module.c    |  4 ++--
 net/lnet/libcfs/tracefile.c | 43 +++++++++++++++++++++++++------------------
 3 files changed, 28 insertions(+), 21 deletions(-)

diff --git a/net/lnet/libcfs/debug.c b/net/lnet/libcfs/debug.c
index c6b92df..d7747e7 100644
--- a/net/lnet/libcfs/debug.c
+++ b/net/lnet/libcfs/debug.c
@@ -418,7 +418,7 @@ void libcfs_debug_dumplog(void)
 			     "libcfs_debug_dumper");
 	set_current_state(TASK_INTERRUPTIBLE);
 	if (IS_ERR(dumper))
-		pr_err("LustreError: cannot start log dump thread: %ld\n",
+		pr_err("LustreError: cannot start log dump thread: rc = %ld\n",
 		       PTR_ERR(dumper));
 	else
 		schedule();
diff --git a/net/lnet/libcfs/module.c b/net/lnet/libcfs/module.c
index 20d4302..a53efcc 100644
--- a/net/lnet/libcfs/module.c
+++ b/net/lnet/libcfs/module.c
@@ -720,7 +720,7 @@ int libcfs_setup(void)
 
 	rc = libcfs_debug_init(5 * 1024 * 1024);
 	if (rc < 0) {
-		pr_err("LustreError: libcfs_debug_init: %d\n", rc);
+		pr_err("LustreError: libcfs_debug_init: rc = %d\n", rc);
 		goto err;
 	}
 
@@ -794,7 +794,7 @@ static void libcfs_exit(void)
 	/* the below message is checked in test-framework.sh check_mem_leak() */
 	rc = libcfs_debug_cleanup();
 	if (rc)
-		pr_err("LustreError: libcfs_debug_cleanup: %d\n", rc);
+		pr_err("LustreError: libcfs_debug_cleanup: rc = %d\n", rc);
 }
 
 MODULE_AUTHOR("OpenSFS, Inc. <http://www.lustre.org/>");
diff --git a/net/lnet/libcfs/tracefile.c b/net/lnet/libcfs/tracefile.c
index bda3523..1eb5397 100644
--- a/net/lnet/libcfs/tracefile.c
+++ b/net/lnet/libcfs/tracefile.c
@@ -332,7 +332,8 @@ static struct cfs_trace_page *cfs_trace_get_tage(struct cfs_trace_cpu_data *tcd,
 	 * from here: this will lead to infinite recursion.
 	 */
 	if (len > PAGE_SIZE) {
-		pr_err("cowardly refusing to write %lu bytes in a page\n", len);
+		pr_err("LustreError: cowardly refusing to write %lu bytes in a page\n",
+		       len);
 		return NULL;
 	}
 
@@ -477,7 +478,8 @@ int libcfs_debug_msg(struct libcfs_debug_msg_data *msgdata,
 
 		max_nob = PAGE_SIZE - tage->used - known_size;
 		if (max_nob <= 0) {
-			pr_emerg("negative max_nob: %d\n", max_nob);
+			pr_emerg("LustreError: negative max_nob: %d\n",
+				 max_nob);
 			mask |= D_ERROR;
 			cfs_trace_put_tcd(tcd);
 			tcd = NULL;
@@ -499,10 +501,15 @@ int libcfs_debug_msg(struct libcfs_debug_msg_data *msgdata,
 			break;
 	}
 
-	if (*(string_buf + needed - 1) != '\n')
-		pr_info("format@%s:%d:%s doesn't end in newline\n", file,
-			msgdata->msg_line, msgdata->msg_fn);
-
+	if (*(string_buf + needed - 1) != '\n') {
+		pr_info("Lustre: format at %s:%d:%s doesn't end in newline\n",
+			file, msgdata->msg_line, msgdata->msg_fn);
+	} else if (mask & D_TTY) {
+		/* TTY needs '\r\n' to move carriage to leftmost position */
+		if (needed < 2 || *(string_buf + needed - 2) != '\r')
+			pr_info("Lustre: format@%s:%d:%s doesn't end in '\\r\\n'\n",
+				file, msgdata->msg_line, msgdata->msg_fn);
+	}
 	header.ph_len = known_size + needed;
 	debug_buf = (char *)page_address(tage->page) + tage->used;
 
@@ -816,7 +823,7 @@ int cfs_tracefile_dump_all_pages(char *filename)
 	if (IS_ERR(filp)) {
 		rc = PTR_ERR(filp);
 		filp = NULL;
-		pr_err("LustreError: can't open %s for dump: rc %d\n",
+		pr_err("LustreError: can't open %s for dump: rc = %d\n",
 		       filename, rc);
 		goto out;
 	}
@@ -839,8 +846,8 @@ int cfs_tracefile_dump_all_pages(char *filename)
 		kunmap(tage->page);
 
 		if (rc != (int)tage->used) {
-			pr_warn("wanted to write %u but wrote %d\n", tage->used,
-				rc);
+			pr_warn("Lustre: wanted to write %u but wrote %d\n",
+				tage->used, rc);
 			put_pages_back(&pc);
 			__LASSERT(list_empty(&pc.pc_pages));
 			break;
@@ -851,7 +858,7 @@ int cfs_tracefile_dump_all_pages(char *filename)
 
 	rc = vfs_fsync(filp, 1);
 	if (rc)
-		pr_err("sync returns %d\n", rc);
+		pr_err("LustreError: sync returns: rc = %d\n", rc);
 close:
 	filp_close(filp, NULL);
 out:
@@ -985,7 +992,7 @@ int cfs_trace_daemon_command(char *str)
 	} else {
 		strcpy(cfs_tracefile, str);
 
-		pr_info("debug daemon will attempt to start writing to %s (%lukB max)\n",
+		pr_info("Lustre: debug daemon will attempt to start writing to %s (%lukB max)\n",
 			cfs_tracefile,
 			(long)(cfs_tracefile_size >> 10));
 
@@ -1100,8 +1107,8 @@ static int tracefiled(void *arg)
 			if (IS_ERR(filp)) {
 				rc = PTR_ERR(filp);
 				filp = NULL;
-				pr_warn("couldn't open %s: %d\n", cfs_tracefile,
-					rc);
+				pr_warn("Lustre: couldn't open %s: rc = %d\n",
+					cfs_tracefile, rc);
 			}
 		}
 		up_read(&cfs_tracefile_sem);
@@ -1126,7 +1133,7 @@ static int tracefiled(void *arg)
 			kunmap(tage->page);
 
 			if (rc != (int)tage->used) {
-				pr_warn("wanted to write %u but wrote %d\n",
+				pr_warn("Lustre: wanted to write %u but wrote %d\n",
 					tage->used, rc);
 				put_pages_back(&pc);
 				__LASSERT(list_empty(&pc.pc_pages));
@@ -1139,8 +1146,8 @@ static int tracefiled(void *arg)
 		if (!list_empty(&pc.pc_pages)) {
 			int i;
 
-			pr_alert("trace pages aren't empty\n");
-			pr_err("total cpus(%d): ", num_possible_cpus());
+			pr_alert("Lustre: trace pages aren't empty\n");
+			pr_err("Lustre: total cpus(%d): ", num_possible_cpus());
 			for (i = 0; i < num_possible_cpus(); i++)
 				if (cpu_online(i))
 					pr_cont("%d(on) ", i);
@@ -1151,9 +1158,9 @@ static int tracefiled(void *arg)
 			i = 0;
 			list_for_each_entry_safe(tage, tmp, &pc.pc_pages,
 						 linkage)
-				pr_err("page %d belongs to cpu %d\n",
+				pr_err("Lustre: page %d belongs to cpu %d\n",
 				       ++i, tage->cpu);
-			pr_err("There are %d pages unwritten\n", i);
+			pr_err("Lustre: There are %d pages unwritten\n", i);
 		}
 		__LASSERT(list_empty(&pc.pc_pages));
 end_loop:
-- 
1.8.3.1


* [lustre-devel] [PATCH 610/622] lnet: Do not assume peers are MR capable
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (608 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 609/622] lnet: libcfs: Cleanup use of bare printk James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:17 ` [lustre-devel] [PATCH 611/622] lnet: socklnd: convert peers hash table to hashtable.h James Simmons
                   ` (12 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Chris Horn <hornc@cray.com>

If a peer has discovery disabled then it will not consolidate peer
NI information. This means we need to use a consistent source NI
when sending to it just like we do for non-MR peers.

A comment in lnet_discovery_event_reply() indicates that this was a
known issue, but the situation is not handled properly.

Do not assume peers are multi-rail capable when peer objects are
allocated and initialized.

Do not mark a peer as multi-rail capable unless all of the following
conditions are satisfied:
1. The peer has the MR feature flag set
2. The peer has discovery enabled.
3. We have discovery enabled locally

Note: 1, 2, and 3 above are implemented in the code for
lnet_discovery_event_reply(), but code earlier in the function breaks
this behavior. Remove the offending code.

Update sanity-lnet tests 100 and 101 to reflect the fact that peers
added via the traffic path no longer have multi-rail by default.

Cray-bug-id: LUS-7918
WC-bug-id: https://jira.whamcloud.com/browse/LU-12889
Lustre-commit: 3c580c93b8d3 ("LU-12889 lnet: Do not assume peers are MR capable")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/36512
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/peer.c | 45 ++++++++++++++++-----------------------------
 1 file changed, 16 insertions(+), 29 deletions(-)

diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index f987fff..0d7fbd4 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -1520,10 +1520,7 @@ struct lnet_peer_net *
 	struct lnet_peer *lp;
 	struct lnet_peer_net *lpn;
 	struct lnet_peer_ni *lpni;
-	/* Assume peer is Multi-Rail capable and let discovery find out
-	 * otherwise.
-	 */
-	unsigned int flags = LNET_PEER_MULTI_RAIL;
+	unsigned int flags = 0;
 	int rc = 0;
 
 	if (nid == LNET_NID_ANY) {
@@ -2298,20 +2295,7 @@ static void lnet_peer_clear_discovery_error(struct lnet_peer *lp)
 	}
 
 	/*
-	 * Only enable the multi-rail feature on the peer if both sides of
-	 * the connection have discovery on
-	 */
-	if (pbuf->pb_info.pi_features & LNET_PING_FEAT_MULTI_RAIL) {
-		CDEBUG(D_NET, "Peer %s has Multi-Rail feature enabled\n",
-		       libcfs_nid2str(lp->lp_primary_nid));
-		lp->lp_state |= LNET_PEER_MULTI_RAIL;
-	} else {
-		CDEBUG(D_NET, "Peer %s has Multi-Rail feature disabled\n",
-		       libcfs_nid2str(lp->lp_primary_nid));
-		lp->lp_state &= ~LNET_PEER_MULTI_RAIL;
-	}
-
-	/* The peer may have discovery disabled at its end. Set
+	 * The peer may have discovery disabled at its end. Set
 	 * NO_DISCOVERY as appropriate.
 	 */
 	if ((pbuf->pb_info.pi_features & LNET_PING_FEAT_DISCOVERY) &&
@@ -2332,21 +2316,24 @@ static void lnet_peer_clear_discovery_error(struct lnet_peer *lp)
 	 */
 	if (pbuf->pb_info.pi_features & LNET_PING_FEAT_MULTI_RAIL) {
 		if (lp->lp_state & LNET_PEER_MULTI_RAIL) {
-			/* Everything's fine */
+			CDEBUG(D_NET, "peer %s(%p) is MR\n",
+			       libcfs_nid2str(lp->lp_primary_nid), lp);
 		} else if (lp->lp_state & LNET_PEER_CONFIGURED) {
 			CWARN("Reply says %s is Multi-Rail, DLC says not\n",
 			      libcfs_nid2str(lp->lp_primary_nid));
+		} else if (lnet_peer_discovery_disabled) {
+			CDEBUG(D_NET,
+			       "peer %s(%p) not MR: DD disabled locally\n",
+			       libcfs_nid2str(lp->lp_primary_nid), lp);
+		} else if (lp->lp_state & LNET_PEER_NO_DISCOVERY) {
+			CDEBUG(D_NET,
+			       "peer %s(%p) not MR: DD disabled remotely\n",
+			       libcfs_nid2str(lp->lp_primary_nid), lp);
 		} else {
-			/* if discovery is disabled then we don't want to
-			 * update the state of the peer. All we'll do is
-			 * update the peer_nis which were reported back in
-			 * the initial ping
-			 */
-
-			if (!lnet_is_discovery_disabled_locked(lp)) {
-				lp->lp_state |= LNET_PEER_MULTI_RAIL;
-				lnet_peer_clr_non_mr_pref_nids(lp);
-			}
+			CDEBUG(D_NET, "peer %s(%p) is MR capable\n",
+			       libcfs_nid2str(lp->lp_primary_nid), lp);
+			lp->lp_state |= LNET_PEER_MULTI_RAIL;
+			lnet_peer_clr_non_mr_pref_nids(lp);
 		}
 	} else if (lp->lp_state & LNET_PEER_MULTI_RAIL) {
 		if (lp->lp_state & LNET_PEER_CONFIGURED) {
-- 
1.8.3.1


* [lustre-devel] [PATCH 611/622] lnet: socklnd: convert peers hash table to hashtable.h
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (609 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 610/622] lnet: Do not assume peers are MR capable James Simmons
@ 2020-02-27 21:17 ` James Simmons
  2020-02-27 21:18 ` [lustre-devel] [PATCH 612/622] lustre: llite: Update mdc and lite stats on open|creat James Simmons
                   ` (11 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:17 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

Using a hashtable.h hashtable, rather than bespoke code, has several
advantages:

- the table is comprised of hlist_head, rather than list_head, so
  it consumes less memory (though we need to make it a little bigger
  as it must be a power-of-2)
- there are existing macros for easily walking the whole table
- it uses a "real" hash function rather than "mod a prime number".

In some ways, rhashtable might be even better, but it can change the
ordering of objects in the table at arbitrary moments, and that could
hurt the user-space API.  It also does not support the partitioned
walking that ksocknal_check_peer_timeouts() depends on.

Note that new peers are inserted at the top of a hash chain, rather
than appended at the end.  I don't think that should be a problem.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12678
Lustre-commit: dbbcf61d2bdc ("LU-12678 socklnd: convert peers hash table to hashtable.h")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/36837
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/socklnd/socklnd.c    | 299 ++++++++++++++++--------------------
 net/lnet/klnds/socklnd/socklnd.h    |  18 +--
 net/lnet/klnds/socklnd/socklnd_cb.c |   8 +-
 3 files changed, 140 insertions(+), 185 deletions(-)

diff --git a/net/lnet/klnds/socklnd/socklnd.c b/net/lnet/klnds/socklnd/socklnd.c
index 016e005..7abb75a 100644
--- a/net/lnet/klnds/socklnd/socklnd.c
+++ b/net/lnet/klnds/socklnd/socklnd.c
@@ -167,10 +167,10 @@
 struct ksock_peer_ni *
 ksocknal_find_peer_locked(struct lnet_ni *ni, struct lnet_process_id id)
 {
-	struct list_head *peer_list = ksocknal_nid2peerlist(id.nid);
 	struct ksock_peer_ni *peer_ni;
 
-	list_for_each_entry(peer_ni, peer_list, ksnp_list) {
+	hash_for_each_possible(ksocknal_data.ksnd_peers, peer_ni,
+			       ksnp_list, id.nid) {
 		LASSERT(!peer_ni->ksnp_closing);
 
 		if (peer_ni->ksnp_ni != ni)
@@ -229,7 +229,7 @@ struct ksock_peer_ni *
 	LASSERT(list_empty(&peer_ni->ksnp_routes));
 	LASSERT(!peer_ni->ksnp_closing);
 	peer_ni->ksnp_closing = 1;
-	list_del(&peer_ni->ksnp_list);
+	hlist_del(&peer_ni->ksnp_list);
 	/* lose peerlist's ref */
 	ksocknal_peer_decref(peer_ni);
 }
@@ -247,55 +247,52 @@ struct ksock_peer_ni *
 
 	read_lock(&ksocknal_data.ksnd_global_lock);
 
-	for (i = 0; i < ksocknal_data.ksnd_peer_hash_size; i++) {
-		list_for_each_entry(peer_ni, &ksocknal_data.ksnd_peers[i],
-				    ksnp_list) {
-			if (peer_ni->ksnp_ni != ni)
-				continue;
+	hash_for_each(ksocknal_data.ksnd_peers, i, peer_ni, ksnp_list) {
+		if (peer_ni->ksnp_ni != ni)
+			continue;
 
-			if (!peer_ni->ksnp_n_passive_ips &&
-			    list_empty(&peer_ni->ksnp_routes)) {
-				if (index-- > 0)
-					continue;
+		if (!peer_ni->ksnp_n_passive_ips &&
+		    list_empty(&peer_ni->ksnp_routes)) {
+			if (index-- > 0)
+				continue;
 
-				*id = peer_ni->ksnp_id;
-				*myip = 0;
-				*peer_ip = 0;
-				*port = 0;
-				*conn_count = 0;
-				*share_count = 0;
-				rc = 0;
-				goto out;
-			}
+			*id = peer_ni->ksnp_id;
+			*myip = 0;
+			*peer_ip = 0;
+			*port = 0;
+			*conn_count = 0;
+			*share_count = 0;
+			rc = 0;
+			goto out;
+		}
 
-			for (j = 0; j < peer_ni->ksnp_n_passive_ips; j++) {
-				if (index-- > 0)
-					continue;
+		for (j = 0; j < peer_ni->ksnp_n_passive_ips; j++) {
+			if (index-- > 0)
+				continue;
 
-				*id = peer_ni->ksnp_id;
-				*myip = peer_ni->ksnp_passive_ips[j];
-				*peer_ip = 0;
-				*port = 0;
-				*conn_count = 0;
-				*share_count = 0;
-				rc = 0;
-				goto out;
-			}
+			*id = peer_ni->ksnp_id;
+			*myip = peer_ni->ksnp_passive_ips[j];
+			*peer_ip = 0;
+			*port = 0;
+			*conn_count = 0;
+			*share_count = 0;
+			rc = 0;
+			goto out;
+		}
 
-			list_for_each_entry(route, &peer_ni->ksnp_routes,
-					    ksnr_list) {
-				if (index-- > 0)
-					continue;
+		list_for_each_entry(route, &peer_ni->ksnp_routes,
+				    ksnr_list) {
+			if (index-- > 0)
+				continue;
 
-				*id = peer_ni->ksnp_id;
-				*myip = route->ksnr_myipaddr;
-				*peer_ip = route->ksnr_ipaddr;
-				*port = route->ksnr_port;
-				*conn_count = route->ksnr_conn_count;
-				*share_count = route->ksnr_share_count;
-				rc = 0;
-				goto out;
-			}
+			*id = peer_ni->ksnp_id;
+			*myip = route->ksnr_myipaddr;
+			*peer_ip = route->ksnr_ipaddr;
+			*port = route->ksnr_port;
+			*conn_count = route->ksnr_conn_count;
+			*share_count = route->ksnr_share_count;
+			rc = 0;
+			goto out;
 		}
 	}
 out:
@@ -463,8 +460,7 @@ struct ksock_peer_ni *
 		peer_ni = peer2;
 	} else {
 		/* peer_ni table takes my ref on peer_ni */
-		list_add_tail(&peer_ni->ksnp_list,
-			      ksocknal_nid2peerlist(id.nid));
+		hash_add(ksocknal_data.ksnd_peers, &peer_ni->ksnp_list, id.nid);
 	}
 
 	list_for_each_entry(route2, &peer_ni->ksnp_routes, ksnr_list) {
@@ -544,7 +540,7 @@ struct ksock_peer_ni *
 ksocknal_del_peer(struct lnet_ni *ni, struct lnet_process_id id, u32 ip)
 {
 	LIST_HEAD(zombies);
-	struct ksock_peer_ni *pnxt;
+	struct hlist_node *pnxt;
 	struct ksock_peer_ni *peer_ni;
 	int lo;
 	int hi;
@@ -554,17 +550,17 @@ struct ksock_peer_ni *
 	write_lock_bh(&ksocknal_data.ksnd_global_lock);
 
 	if (id.nid != LNET_NID_ANY) {
-		lo = (int)(ksocknal_nid2peerlist(id.nid) - ksocknal_data.ksnd_peers);
-		hi = (int)(ksocknal_nid2peerlist(id.nid) - ksocknal_data.ksnd_peers);
+		lo = hash_min(id.nid, HASH_BITS(ksocknal_data.ksnd_peers));
+		hi = lo;
 	} else {
 		lo = 0;
-		hi = ksocknal_data.ksnd_peer_hash_size - 1;
+		hi = HASH_SIZE(ksocknal_data.ksnd_peers) - 1;
 	}
 
 	for (i = lo; i <= hi; i++) {
-		list_for_each_entry_safe(peer_ni, pnxt,
-					 &ksocknal_data.ksnd_peers[i],
-					 ksnp_list) {
+		hlist_for_each_entry_safe(peer_ni, pnxt,
+					  &ksocknal_data.ksnd_peers[i],
+					  ksnp_list) {
 			if (peer_ni->ksnp_ni != ni)
 				continue;
 
@@ -609,23 +605,20 @@ struct ksock_peer_ni *
 
 	read_lock(&ksocknal_data.ksnd_global_lock);
 
-	for (i = 0; i < ksocknal_data.ksnd_peer_hash_size; i++) {
-		list_for_each_entry(peer_ni, &ksocknal_data.ksnd_peers[i],
-				    ksnp_list) {
-			LASSERT(!peer_ni->ksnp_closing);
+	hash_for_each(ksocknal_data.ksnd_peers, i, peer_ni, ksnp_list) {
+		LASSERT(!peer_ni->ksnp_closing);
+
+		if (peer_ni->ksnp_ni != ni)
+			continue;
 
-			if (peer_ni->ksnp_ni != ni)
+		list_for_each_entry(conn, &peer_ni->ksnp_conns,
+				    ksnc_list) {
+			if (index-- > 0)
 				continue;
 
-			list_for_each_entry(conn, &peer_ni->ksnp_conns,
-					    ksnc_list) {
-				if (index-- > 0)
-					continue;
-
-				ksocknal_conn_addref(conn);
-				read_unlock(&ksocknal_data.ksnd_global_lock);
-				return conn;
-			}
+			ksocknal_conn_addref(conn);
+			read_unlock(&ksocknal_data.ksnd_global_lock);
+			return conn;
 		}
 	}
 
@@ -1119,8 +1112,8 @@ struct ksock_peer_ni *
 			 * NB this puts an "empty" peer_ni in the peer
 			 * table (which takes my ref)
 			 */
-			list_add_tail(&peer_ni->ksnp_list,
-				      ksocknal_nid2peerlist(peerid.nid));
+			hash_add(ksocknal_data.ksnd_peers,
+				 &peer_ni->ksnp_list, peerid.nid);
 		} else {
 			ksocknal_peer_decref(peer_ni);
 			peer_ni = peer2;
@@ -1732,7 +1725,7 @@ struct ksock_peer_ni *
 ksocknal_close_matching_conns(struct lnet_process_id id, u32 ipaddr)
 {
 	struct ksock_peer_ni *peer_ni;
-	struct ksock_peer_ni *pnxt;
+	struct hlist_node *pnxt;
 	int lo;
 	int hi;
 	int i;
@@ -1741,17 +1734,17 @@ struct ksock_peer_ni *
 	write_lock_bh(&ksocknal_data.ksnd_global_lock);
 
 	if (id.nid != LNET_NID_ANY) {
-		lo = (int)(ksocknal_nid2peerlist(id.nid) - ksocknal_data.ksnd_peers);
-		hi = (int)(ksocknal_nid2peerlist(id.nid) - ksocknal_data.ksnd_peers);
+		lo = hash_min(id.nid, HASH_BITS(ksocknal_data.ksnd_peers));
+		hi = lo;
 	} else {
 		lo = 0;
-		hi = ksocknal_data.ksnd_peer_hash_size - 1;
+		hi = HASH_SIZE(ksocknal_data.ksnd_peers) - 1;
 	}
 
 	for (i = lo; i <= hi; i++) {
-		list_for_each_entry_safe(peer_ni, pnxt,
-					 &ksocknal_data.ksnd_peers[i],
-					 ksnp_list) {
+		hlist_for_each_entry_safe(peer_ni, pnxt,
+					  &ksocknal_data.ksnd_peers[i],
+					  ksnp_list) {
 			if (!((id.nid == LNET_NID_ANY ||
 			       id.nid == peer_ni->ksnp_id.nid) &&
 			      (id.pid == LNET_PID_ANY ||
@@ -1769,10 +1762,7 @@ struct ksock_peer_ni *
 	if (id.nid == LNET_NID_ANY || id.pid == LNET_PID_ANY || !ipaddr)
 		return 0;
 
-	if (!count)
-		return -ENOENT;
-	else
-		return 0;
+	return count ? 0 : -ENOENT;
 }
 
 void
@@ -1892,21 +1882,20 @@ struct ksock_peer_ni *
 
 static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 {
-	struct list_head *start;
-	struct list_head *end;
-	struct list_head *tmp;
+	int lo;
+	int hi;
+	int bkt;
 	int rc = -ENOENT;
-	unsigned int hsize = ksocknal_data.ksnd_peer_hash_size;
 
-	if (id.nid == LNET_NID_ANY) {
-		start = &ksocknal_data.ksnd_peers[0];
-		end = &ksocknal_data.ksnd_peers[hsize - 1];
+	if (id.nid != LNET_NID_ANY) {
+		lo = hash_min(id.nid, HASH_BITS(ksocknal_data.ksnd_peers));
+		hi = lo;
 	} else {
-		start = ksocknal_nid2peerlist(id.nid);
-		end = ksocknal_nid2peerlist(id.nid);
+		lo = 0;
+		hi = HASH_SIZE(ksocknal_data.ksnd_peers) - 1;
 	}
 
-	for (tmp = start; tmp <= end; tmp++) {
+	for (bkt = lo; bkt <= hi; bkt++) {
 		int peer_off; /* searching offset in peer_ni hash table */
 
 		for (peer_off = 0; ; peer_off++) {
@@ -1914,7 +1903,9 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 			int i = 0;
 
 			read_lock(&ksocknal_data.ksnd_global_lock);
-			list_for_each_entry(peer_ni, tmp, ksnp_list) {
+			hlist_for_each_entry(peer_ni,
+					     &ksocknal_data.ksnd_peers[bkt],
+					     ksnp_list) {
 				if (!((id.nid == LNET_NID_ANY ||
 				       id.nid == peer_ni->ksnp_id.nid) &&
 				      (id.pid == LNET_PID_ANY ||
@@ -1969,24 +1960,15 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 		iface->ksni_nroutes = 0;
 		iface->ksni_npeers = 0;
 
-		for (i = 0; i < ksocknal_data.ksnd_peer_hash_size; i++) {
-			list_for_each_entry(peer_ni,
-					    &ksocknal_data.ksnd_peers[i],
-					    ksnp_list) {
-
-				for (j = 0;
-				     j < peer_ni->ksnp_n_passive_ips;
-				     j++)
-					if (peer_ni->ksnp_passive_ips[j] ==
-					    ipaddress)
-						iface->ksni_npeers++;
-
-				list_for_each_entry(route,
-						    &peer_ni->ksnp_routes,
-						    ksnr_list) {
-					if (route->ksnr_myipaddr == ipaddress)
-						iface->ksni_nroutes++;
-				}
+		hash_for_each(ksocknal_data.ksnd_peers, i, peer_ni, ksnp_list) {
+			for (j = 0; j < peer_ni->ksnp_n_passive_ips; j++)
+				if (peer_ni->ksnp_passive_ips[j] == ipaddress)
+					iface->ksni_npeers++;
+
+			list_for_each_entry(route, &peer_ni->ksnp_routes,
+					    ksnr_list) {
+				if (route->ksnr_myipaddr == ipaddress)
+					iface->ksni_nroutes++;
 			}
 		}
 
@@ -2048,7 +2030,7 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 {
 	struct ksock_net *net = ni->ni_data;
 	int rc = -ENOENT;
-	struct ksock_peer_ni *nxt;
+	struct hlist_node *nxt;
 	struct ksock_peer_ni *peer_ni;
 	u32 this_ip;
 	int i;
@@ -2070,16 +2052,12 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 
 		net->ksnn_ninterfaces--;
 
-		for (j = 0; j < ksocknal_data.ksnd_peer_hash_size; j++) {
-			list_for_each_entry_safe(peer_ni, nxt,
-						 &ksocknal_data.ksnd_peers[j],
-						 ksnp_list) {
-				if (peer_ni->ksnp_ni != ni)
-					continue;
+		hash_for_each_safe(ksocknal_data.ksnd_peers, j,
+				   nxt, peer_ni, ksnp_list) {
+			if (peer_ni->ksnp_ni != ni)
+				continue;
 
-				ksocknal_peer_del_interface_locked(peer_ni,
-								   this_ip);
-			}
+			ksocknal_peer_del_interface_locked(peer_ni, this_ip);
 		}
 	}
 
@@ -2224,8 +2202,6 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 	if (ksocknal_data.ksnd_schedulers)
 		cfs_percpt_free(ksocknal_data.ksnd_schedulers);
 
-	kvfree(ksocknal_data.ksnd_peers);
-
 	spin_lock(&ksocknal_data.ksnd_tx_lock);
 
 	if (!list_empty(&ksocknal_data.ksnd_idle_noop_txs)) {
@@ -2250,6 +2226,7 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 ksocknal_base_shutdown(void)
 {
 	struct ksock_sched *sched;
+	struct ksock_peer_ni *peer_ni;
 	int i;
 
 	LASSERT(!ksocknal_data.ksnd_nnets);
@@ -2260,9 +2237,8 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 		/* fall through */
 	case SOCKNAL_INIT_ALL:
 	case SOCKNAL_INIT_DATA:
-		LASSERT(ksocknal_data.ksnd_peers);
-		for (i = 0; i < ksocknal_data.ksnd_peer_hash_size; i++)
-			LASSERT(list_empty(&ksocknal_data.ksnd_peers[i]));
+		hash_for_each(ksocknal_data.ksnd_peers, i, peer_ni, ksnp_list)
+			LASSERT(0);
 
 		LASSERT(list_empty(&ksocknal_data.ksnd_nets));
 		LASSERT(list_empty(&ksocknal_data.ksnd_enomem_conns));
@@ -2326,15 +2302,7 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 
 	memset(&ksocknal_data, 0, sizeof(ksocknal_data)); /* zero pointers */
 
-	ksocknal_data.ksnd_peer_hash_size = SOCKNAL_PEER_HASH_SIZE;
-	ksocknal_data.ksnd_peers = kvmalloc_array(ksocknal_data.ksnd_peer_hash_size,
-						  sizeof(struct list_head),
-						  GFP_KERNEL);
-	if (!ksocknal_data.ksnd_peers)
-		return -ENOMEM;
-
-	for (i = 0; i < ksocknal_data.ksnd_peer_hash_size; i++)
-		INIT_LIST_HEAD(&ksocknal_data.ksnd_peers[i]);
+	hash_init(ksocknal_data.ksnd_peers);
 
 	rwlock_init(&ksocknal_data.ksnd_global_lock);
 	INIT_LIST_HEAD(&ksocknal_data.ksnd_nets);
@@ -2452,43 +2420,38 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 
 	read_lock(&ksocknal_data.ksnd_global_lock);
 
-	for (i = 0; i < ksocknal_data.ksnd_peer_hash_size; i++) {
-		list_for_each_entry(peer_ni, &ksocknal_data.ksnd_peers[i],
-				    ksnp_list) {
-			struct ksock_route *route;
-			struct ksock_conn *conn;
-
-			if (peer_ni->ksnp_ni != ni)
-				continue;
+	hash_for_each(ksocknal_data.ksnd_peers, i, peer_ni, ksnp_list) {
+		struct ksock_route *route;
+		struct ksock_conn *conn;
 
-			CWARN("Active peer_ni on shutdown: %s, ref %d, closing %d, accepting %d, err %d, zcookie %llu, txq %d, zc_req %d\n",
-			      libcfs_id2str(peer_ni->ksnp_id),
-			      atomic_read(&peer_ni->ksnp_refcount),
-			      peer_ni->ksnp_closing,
-			      peer_ni->ksnp_accepting, peer_ni->ksnp_error,
-			      peer_ni->ksnp_zc_next_cookie,
-			      !list_empty(&peer_ni->ksnp_tx_queue),
-			      !list_empty(&peer_ni->ksnp_zc_req_list));
+		if (peer_ni->ksnp_ni != ni)
+			continue;
 
-			list_for_each_entry(route, &peer_ni->ksnp_routes,
-					    ksnr_list) {
-				CWARN("Route: ref %d, schd %d, conn %d, cnted %d, del %d\n",
-				      atomic_read(&route->ksnr_refcount),
-				      route->ksnr_scheduled,
-				      route->ksnr_connecting,
-				      route->ksnr_connected,
-				      route->ksnr_deleted);
-			}
+		CWARN("Active peer_ni on shutdown: %s, ref %d, closing %d, accepting %d, err %d, zcookie %llu, txq %d, zc_req %d\n",
+		      libcfs_id2str(peer_ni->ksnp_id),
+		      atomic_read(&peer_ni->ksnp_refcount),
+		      peer_ni->ksnp_closing,
+		      peer_ni->ksnp_accepting, peer_ni->ksnp_error,
+		      peer_ni->ksnp_zc_next_cookie,
+		      !list_empty(&peer_ni->ksnp_tx_queue),
+		      !list_empty(&peer_ni->ksnp_zc_req_list));
+
+		list_for_each_entry(route, &peer_ni->ksnp_routes, ksnr_list) {
+			CWARN("Route: ref %d, schd %d, conn %d, cnted %d, del %d\n",
+			      atomic_read(&route->ksnr_refcount),
+			      route->ksnr_scheduled,
+			      route->ksnr_connecting,
+			      route->ksnr_connected,
+			      route->ksnr_deleted);
+		}
 
-			list_for_each_entry(conn, &peer_ni->ksnp_conns,
-					    ksnc_list) {
-				CWARN("Conn: ref %d, sref %d, t %d, c %d\n",
-				      atomic_read(&conn->ksnc_conn_refcount),
-				      atomic_read(&conn->ksnc_sock_refcount),
-				      conn->ksnc_type, conn->ksnc_closing);
-			}
-			goto done;
+		list_for_each_entry(conn, &peer_ni->ksnp_conns, ksnc_list) {
+			CWARN("Conn: ref %d, sref %d, t %d, c %d\n",
+			      atomic_read(&conn->ksnc_conn_refcount),
+			      atomic_read(&conn->ksnc_sock_refcount),
+			      conn->ksnc_type, conn->ksnc_closing);
 		}
+		goto done;
 	}
 done:
 	read_unlock(&ksocknal_data.ksnd_global_lock);
diff --git a/net/lnet/klnds/socklnd/socklnd.h b/net/lnet/klnds/socklnd/socklnd.h
index 2d4e8d59..9ebb959 100644
--- a/net/lnet/klnds/socklnd/socklnd.h
+++ b/net/lnet/klnds/socklnd/socklnd.h
@@ -43,7 +43,7 @@
 #include <linux/sysctl.h>
 #include <linux/uio.h>
 #include <linux/unistd.h>
-#include <asm/irq.h>
+#include <linux/hashtable.h>
 #include <net/sock.h>
 #include <net/tcp.h>
 
@@ -54,7 +54,7 @@
 #define SOCKNAL_NSCHEDS		3
 #define SOCKNAL_NSCHEDS_HIGH	(SOCKNAL_NSCHEDS << 1)
 
-#define SOCKNAL_PEER_HASH_SIZE	101   /* # peer_ni lists */
+#define SOCKNAL_PEER_HASH_BITS	7     /* # log2 of # of peer_ni lists */
 #define SOCKNAL_RESCHED		100   /* # scheduler loops before reschedule */
 #define SOCKNAL_INSANITY_RECONN	5000  /* connd is trying on reconn infinitely */
 #define SOCKNAL_ENOMEM_RETRY	1     /* seconds between retries */
@@ -190,10 +190,10 @@ struct ksock_nal_data {
 	rwlock_t		ksnd_global_lock;	/* stabilize
 							 * peer_ni/conn ops
 							 */
-	struct list_head	*ksnd_peers;		/* hash table of all my
+	DECLARE_HASHTABLE(ksnd_peers, SOCKNAL_PEER_HASH_BITS);
+							/* hash table of all my
 							 * known peers
 							 */
-	int			ksnd_peer_hash_size;	/* size of ksnd_peers */
 
 	int			ksnd_nthreads;		/* # live threads */
 	int			ksnd_shuttingdown;	/* tell threads to exit
@@ -411,7 +411,7 @@ struct ksock_route {
 #define SOCKNAL_KEEPALIVE_PING	1	/* cookie for keepalive ping */
 
 struct ksock_peer_ni {
-	struct list_head	ksnp_list;		/* stash on global peer_ni list */
+	struct hlist_node	ksnp_list;		/* on global peer_nis hash table */
 	time64_t		ksnp_last_alive;	/* when (in seconds) I was last
 							 * alive
 							 */
@@ -519,14 +519,6 @@ struct ksock_proto {
 		(1 << SOCKLND_CONN_BULK_OUT));
 }
 
-static inline struct list_head *
-ksocknal_nid2peerlist(lnet_nid_t nid)
-{
-	unsigned int hash = ((unsigned int)nid) % ksocknal_data.ksnd_peer_hash_size;
-
-	return &ksocknal_data.ksnd_peers[hash];
-}
-
 static inline void
 ksocknal_conn_addref(struct ksock_conn *conn)
 {
diff --git a/net/lnet/klnds/socklnd/socklnd_cb.c b/net/lnet/klnds/socklnd/socklnd_cb.c
index 996b231..fb933e3 100644
--- a/net/lnet/klnds/socklnd/socklnd_cb.c
+++ b/net/lnet/klnds/socklnd/socklnd_cb.c
@@ -2386,7 +2386,7 @@ void ksocknal_write_callback(struct ksock_conn *conn)
 static void
 ksocknal_check_peer_timeouts(int idx)
 {
-	struct list_head *peers = &ksocknal_data.ksnd_peers[idx];
+	struct hlist_head *peers = &ksocknal_data.ksnd_peers[idx];
 	struct ksock_peer_ni *peer_ni;
 	struct ksock_conn *conn;
 	struct ksock_tx *tx;
@@ -2399,7 +2399,7 @@ void ksocknal_write_callback(struct ksock_conn *conn)
 	 */
 	read_lock(&ksocknal_data.ksnd_global_lock);
 
-	list_for_each_entry(peer_ni, peers, ksnp_list) {
+	hlist_for_each_entry(peer_ni, peers, ksnp_list) {
 		struct ksock_tx *tx_stale;
 		time64_t deadline = 0;
 		int resid = 0;
@@ -2564,7 +2564,7 @@ void ksocknal_write_callback(struct ksock_conn *conn)
 		while ((timeout = deadline - ktime_get_seconds()) <= 0) {
 			const int n = 4;
 			const int p = 1;
-			int chunk = ksocknal_data.ksnd_peer_hash_size;
+			int chunk = HASH_SIZE(ksocknal_data.ksnd_peers);
 			unsigned int lnd_timeout;
 
 			/*
@@ -2585,7 +2585,7 @@ void ksocknal_write_callback(struct ksock_conn *conn)
 			for (i = 0; i < chunk; i++) {
 				ksocknal_check_peer_timeouts(peer_index);
 				peer_index = (peer_index + 1) %
-					     ksocknal_data.ksnd_peer_hash_size;
+					     HASH_SIZE(ksocknal_data.ksnd_peers);
 			}
 
 			deadline += p;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 612/622] lustre: llite: Update mdc and lite stats on open|creat
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (610 preceding siblings ...)
  2020-02-27 21:17 ` [lustre-devel] [PATCH 611/622] lnet: socklnd: convert peers hash table to hashtable.h James Simmons
@ 2020-02-27 21:18 ` James Simmons
  2020-02-27 21:18 ` [lustre-devel] [PATCH 613/622] lustre: osc: glimpse and lock cancel race James Simmons
                   ` (10 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:18 UTC (permalink / raw)
  To: lustre-devel

From: Olaf Faaland <faaland1@llnl.gov>

Increment the "create" counter in mdc/<instance>/md_stats, and
the "mknod" counter in llite/<instance>/stats, when an open with
the O_CREAT flag results in a newly created file.

The mknod counter is chosen for consistency with
patch http://review.whamcloud.com/20246
 "LU-8150 mdt: Track open+create as mknod"
but the mdc counter set does not include mknod.

WC-bug-id: https://jira.whamcloud.com/browse/LU-11114
Lustre-commit: 4b8518ee4fa5 ("LU-11114 llite: Update mdc and lite stats on open|creat")
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-on: https://review.whamcloud.com/36948
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Emoly Liu <emoly@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/namei.c   | 12 ++++++++++--
 fs/lustre/mdc/mdc_locks.c |  6 ++++++
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index 89317db..cf2a77f 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -605,7 +605,8 @@ struct dentry *ll_splice_alias(struct inode *inode, struct dentry *de)
 static int ll_lookup_it_finish(struct ptlrpc_request *request,
 			       struct lookup_intent *it,
 			       struct inode *parent, struct dentry **de,
-			       void *secctx, u32 secctxlen)
+			       void *secctx, u32 secctxlen,
+			       ktime_t kstart)
 {
 	struct inode *inode = NULL;
 	u64 bits = 0;
@@ -708,6 +709,11 @@ static int ll_lookup_it_finish(struct ptlrpc_request *request,
 		}
 	}
 
+	if (it_disposition(it, DISP_OPEN_CREATE)) {
+		ll_stats_ops_tally(ll_i2sbi(parent), LPROC_LL_MKNOD,
+				   ktime_us_delta(ktime_get(), kstart));
+	}
+
 out:
 	if (rc != 0 && it->it_op & IT_OPEN) {
 		ll_intent_drop_lock(it);
@@ -722,6 +728,7 @@ static struct dentry *ll_lookup_it(struct inode *parent, struct dentry *dentry,
 				   u32 *secctxlen,
 				   struct pcc_create_attach *pca)
 {
+	ktime_t kstart = ktime_get();
 	struct lookup_intent lookup_it = { .it_op = IT_LOOKUP };
 	struct dentry *save = dentry, *retval;
 	struct ptlrpc_request *req = NULL;
@@ -887,7 +894,8 @@ static struct dentry *ll_lookup_it(struct inode *parent, struct dentry *dentry,
 	ll_unlock_md_op_lsm(op_data);
 	rc = ll_lookup_it_finish(req, it, parent, &dentry,
 				 secctx ? *secctx : NULL,
-				 secctxlen ? *secctxlen : 0);
+				 secctxlen ? *secctxlen : 0,
+				kstart);
 	if (rc != 0) {
 		ll_intent_release(it);
 		retval = ERR_PTR(rc);
diff --git a/fs/lustre/mdc/mdc_locks.c b/fs/lustre/mdc/mdc_locks.c
index 60bbae1..b252605 100644
--- a/fs/lustre/mdc/mdc_locks.c
+++ b/fs/lustre/mdc/mdc_locks.c
@@ -733,6 +733,12 @@ static int mdc_finish_enqueue(struct obd_export *exp,
 			mdc_set_open_replay_data(NULL, NULL, it);
 		}
 
+		if (it_disposition(it, DISP_OPEN_CREATE) &&
+		    !it_open_error(DISP_OPEN_CREATE, it)) {
+			lprocfs_counter_incr(exp->exp_obd->obd_md_stats,
+					     LPROC_MD_CREATE);
+		}
+
 		if (body->mbo_valid & (OBD_MD_FLDIREA | OBD_MD_FLEASIZE)) {
 			void *eadata;
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 613/622] lustre: osc: glimpse and lock cancel race
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (611 preceding siblings ...)
  2020-02-27 21:18 ` [lustre-devel] [PATCH 612/622] lustre: llite: Update mdc and lite stats on open|creat James Simmons
@ 2020-02-27 21:18 ` James Simmons
  2020-02-27 21:18 ` [lustre-devel] [PATCH 614/622] lustre: llog: keep llog handle alive until last reference James Simmons
                   ` (9 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:18 UTC (permalink / raw)
  To: lustre-devel

From: Alexander Zarochentsev <c17826@cray.com>

osc_dlm_blocking_ast0 clears l_ast_data before writing
file data to OST and opens a race window. Neither a glimpse
AST nor ldlm_cb_interpret can find correct file attributes at
that moment.

Cray-bug-id: LUS-8344
WC-bug-id: https://jira.whamcloud.com/browse/LU-13128
Lustre-commit: 7c99f67d9d39 ("LU-13128 osc: glimpse and lock cancel race")
Signed-off-by: Alexander Zarochentsev <c17826@cray.com>
Reviewed-on: https://review.whamcloud.com/37215
Reviewed-by: Andrew Perepechko <c17827@cray.com>
Reviewed-by: Andriy Skulysh <c17819@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_support.h | 1 +
 fs/lustre/mdc/mdc_dev.c         | 2 +-
 fs/lustre/osc/osc_lock.c        | 8 ++++++--
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index 7dfef0f..f7fed0e 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -332,6 +332,7 @@
 #define OBD_FAIL_OSC_CONNECT_GRANT_PARAM		0x413
 #define OBD_FAIL_OSC_DELAY_IO				0x414
 #define OBD_FAIL_OSC_NO_SIZE_DATA			0x415
+#define OBD_FAIL_OSC_DELAY_CANCEL			0x416
 
 #define OBD_FAIL_PTLRPC					0x500
 #define OBD_FAIL_PTLRPC_ACK				0x501
diff --git a/fs/lustre/mdc/mdc_dev.c b/fs/lustre/mdc/mdc_dev.c
index 496491f..5a6be44 100644
--- a/fs/lustre/mdc/mdc_dev.c
+++ b/fs/lustre/mdc/mdc_dev.c
@@ -313,7 +313,6 @@ static int mdc_dlm_blocking_ast0(const struct lu_env *env,
 
 	if (dlmlock->l_ast_data) {
 		obj = osc2cl(dlmlock->l_ast_data);
-		dlmlock->l_ast_data = NULL;
 		cl_object_get(obj);
 	}
 	unlock_res_and_lock(dlmlock);
@@ -332,6 +331,7 @@ static int mdc_dlm_blocking_ast0(const struct lu_env *env,
 		 */
 		/* losing a lock, update kms */
 		lock_res_and_lock(dlmlock);
+		dlmlock->l_ast_data = NULL;
 		cl_object_attr_lock(obj);
 		attr->cat_kms = 0;
 		cl_object_attr_update(env, obj, attr, CAT_KMS);
diff --git a/fs/lustre/osc/osc_lock.c b/fs/lustre/osc/osc_lock.c
index ce592d7..3bb5bbd 100644
--- a/fs/lustre/osc/osc_lock.c
+++ b/fs/lustre/osc/osc_lock.c
@@ -419,13 +419,13 @@ static int __osc_dlm_blocking_ast(const struct lu_env *env,
 
 	if (dlmlock->l_ast_data) {
 		obj = osc2cl(dlmlock->l_ast_data);
-		dlmlock->l_ast_data = NULL;
-
 		cl_object_get(obj);
 	}
 
 	unlock_res_and_lock(dlmlock);
 
+	OBD_FAIL_TIMEOUT(OBD_FAIL_OSC_DELAY_CANCEL, 5);
+
 	/* if l_ast_data is NULL, the dlmlock was enqueued by AGL or
 	 * the object has been destroyed.
 	 */
@@ -442,6 +442,10 @@ static int __osc_dlm_blocking_ast(const struct lu_env *env,
 
 		/* losing a lock, update kms */
 		lock_res_and_lock(dlmlock);
+		/* clearing l_ast_data after flushing data,
+		 * to let glimpse ast find the lock and the object
+		 */
+		dlmlock->l_ast_data = NULL;
 		cl_object_attr_lock(obj);
 		/* Must get the value under the lock to avoid race. */
 		old_kms = cl2osc(obj)->oo_oinfo->loi_kms;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 614/622] lustre: llog: keep llog handle alive until last reference
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (612 preceding siblings ...)
  2020-02-27 21:18 ` [lustre-devel] [PATCH 613/622] lustre: osc: glimpse and lock cancel race James Simmons
@ 2020-02-27 21:18 ` James Simmons
  2020-02-27 21:18 ` [lustre-devel] [PATCH 615/622] lnet: handling device failure by IB event handler James Simmons
                   ` (8 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:18 UTC (permalink / raw)
  To: lustre-devel

From: Mikhail Pershin <mpershin@whamcloud.com>

An llog handle keeps its related dt_object pinned until the
llog_close() call; meanwhile the llog handle can still have other
users which took it via llog_cat_id2handle().

Patch changes llog_handle_put() to call lop_close() upon last
reference drop. So llog_osd_close() will put dt_object only
when llog_handle has no more references.
The llog_handle_get() checks and reports if llog_handle has
zero reference.
Also patch modifies checks for destroyed llogs, llog handle
has new lgh_destroyed flag which is set when llog is destroyed,
llog_osd_exist() checks dt_object_exist() and lgh_destroyed
flag, so destroyed llogs are considered as non-existent too.
Previously it used the lu_object_is_dying() check, which is not
reliable because it means only that the object is not to be kept
in cache.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10198
Lustre-commit: d6bd5e9cc49b ("LU-10198 llog: keep llog handle alive until last reference")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/37367
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_log.h     |  3 ++-
 fs/lustre/obdclass/llog.c          | 49 +++++++++++++++++++-------------------
 fs/lustre/obdclass/llog_cat.c      | 19 +++++++++------
 fs/lustre/obdclass/llog_internal.h |  4 ++--
 4 files changed, 40 insertions(+), 35 deletions(-)

diff --git a/fs/lustre/include/lustre_log.h b/fs/lustre/include/lustre_log.h
index 9c784ac..6995414 100644
--- a/fs/lustre/include/lustre_log.h
+++ b/fs/lustre/include/lustre_log.h
@@ -226,7 +226,8 @@ struct llog_handle {
 	char			*lgh_name;
 	void			*private_data;
 	struct llog_operations	*lgh_logops;
-	struct kref		 lgh_refcount;
+	refcount_t		 lgh_refcount;
+	bool			 lgh_destroyed;
 };
 
 #define LLOG_CTXT_FLAG_UNINITIALIZED     0x00000001
diff --git a/fs/lustre/obdclass/llog.c b/fs/lustre/obdclass/llog.c
index 620ebc6..5d828bd 100644
--- a/fs/lustre/obdclass/llog.c
+++ b/fs/lustre/obdclass/llog.c
@@ -65,7 +65,7 @@ static struct llog_handle *llog_alloc_handle(void)
 
 	init_rwsem(&loghandle->lgh_lock);
 	INIT_LIST_HEAD(&loghandle->u.phd.phd_entry);
-	kref_init(&loghandle->lgh_refcount);
+	refcount_set(&loghandle->lgh_refcount, 1);
 
 	return loghandle;
 }
@@ -73,11 +73,8 @@ static struct llog_handle *llog_alloc_handle(void)
 /*
  * Free llog handle and header data if exists. Used in llog_close() only
  */
-static void llog_free_handle(struct kref *kref)
+static void llog_free_handle(struct llog_handle *loghandle)
 {
-	struct llog_handle *loghandle = container_of(kref, struct llog_handle,
-						     lgh_refcount);
-
 	/* failed llog_init_handle */
 	if (!loghandle->lgh_hdr)
 		goto out;
@@ -91,15 +88,30 @@ static void llog_free_handle(struct kref *kref)
 	kfree(loghandle);
 }
 
-void llog_handle_get(struct llog_handle *loghandle)
+struct llog_handle *llog_handle_get(struct llog_handle *loghandle)
 {
-	kref_get(&loghandle->lgh_refcount);
+	if (refcount_inc_not_zero(&loghandle->lgh_refcount))
+		return loghandle;
+	return NULL;
 }
 
-void llog_handle_put(struct llog_handle *loghandle)
+int llog_handle_put(const struct lu_env *env, struct llog_handle *loghandle)
 {
-	LASSERT(kref_read(&loghandle->lgh_refcount) > 0);
-	kref_put(&loghandle->lgh_refcount, llog_free_handle);
+	int rc = 0;
+
+	if (refcount_dec_and_test(&loghandle->lgh_refcount)) {
+		struct llog_operations *lop;
+
+		rc = llog_handle2ops(loghandle, &lop);
+		if (!rc) {
+			if (lop->lop_close)
+				rc = lop->lop_close(env, loghandle);
+			else
+				rc = -EOPNOTSUPP;
+		}
+		llog_free_handle(loghandle);
+	}
+	return rc;
 }
 
 static int llog_read_header(const struct lu_env *env,
@@ -541,7 +553,7 @@ int llog_open(const struct lu_env *env, struct llog_ctxt *ctxt,
 		revert_creds(old_cred);
 
 	if (rc) {
-		llog_free_handle(&(*lgh)->lgh_refcount);
+		llog_free_handle(*lgh);
 		*lgh = NULL;
 	}
 	return rc;
@@ -550,19 +562,6 @@ int llog_open(const struct lu_env *env, struct llog_ctxt *ctxt,
 
 int llog_close(const struct lu_env *env, struct llog_handle *loghandle)
 {
-	struct llog_operations *lop;
-	int rc;
-
-	rc = llog_handle2ops(loghandle, &lop);
-	if (rc)
-		goto out;
-	if (!lop->lop_close) {
-		rc = -EOPNOTSUPP;
-		goto out;
-	}
-	rc = lop->lop_close(env, loghandle);
-out:
-	llog_handle_put(loghandle);
-	return rc;
+	return llog_handle_put(env, loghandle);
 }
 EXPORT_SYMBOL(llog_close);
diff --git a/fs/lustre/obdclass/llog_cat.c b/fs/lustre/obdclass/llog_cat.c
index 75226f4..46636f8 100644
--- a/fs/lustre/obdclass/llog_cat.c
+++ b/fs/lustre/obdclass/llog_cat.c
@@ -85,10 +85,16 @@ static int llog_cat_id2handle(const struct lu_env *env,
 				      cgl->lgl_ogen, logid->lgl_ogen);
 				continue;
 			}
+			*res = llog_handle_get(loghandle);
+			if (!*res) {
+				CERROR("%s: log "DFID" refcount is zero!\n",
+				       loghandle->lgh_ctxt->loc_obd->obd_name,
+				       PFID(&logid->lgl_oi.oi_fid));
+				continue;
+			}
 			loghandle->u.phd.phd_cat_handle = cathandle;
 			up_write(&cathandle->lgh_lock);
-			rc = 0;
-			goto out;
+			return rc;
 		}
 	}
 	up_write(&cathandle->lgh_lock);
@@ -105,10 +111,12 @@ static int llog_cat_id2handle(const struct lu_env *env,
 	rc = llog_init_handle(env, loghandle, fmt | LLOG_F_IS_PLAIN, NULL);
 	if (rc < 0) {
 		llog_close(env, loghandle);
-		loghandle = NULL;
+		*res = NULL;
 		return rc;
 	}
 
+	*res = llog_handle_get(loghandle);
+	LASSERT(*res);
 	down_write(&cathandle->lgh_lock);
 	list_add_tail(&loghandle->u.phd.phd_entry, &cathandle->u.chd.chd_head);
 	up_write(&cathandle->lgh_lock);
@@ -117,9 +125,6 @@ static int llog_cat_id2handle(const struct lu_env *env,
 	loghandle->u.phd.phd_cookie.lgc_lgl = cathandle->lgh_id;
 	loghandle->u.phd.phd_cookie.lgc_index =
 				loghandle->lgh_hdr->llh_cat_idx;
-out:
-	llog_handle_get(loghandle);
-	*res = loghandle;
 	return 0;
 }
 
@@ -204,7 +209,7 @@ static int llog_cat_process_cb(const struct lu_env *env,
 	}
 
 out:
-	llog_handle_put(llh);
+	llog_handle_put(env, llh);
 
 	return rc;
 }
diff --git a/fs/lustre/obdclass/llog_internal.h b/fs/lustre/obdclass/llog_internal.h
index 365bac9..0376656 100644
--- a/fs/lustre/obdclass/llog_internal.h
+++ b/fs/lustre/obdclass/llog_internal.h
@@ -61,8 +61,8 @@ struct llog_thread_info {
 int llog_info_init(void);
 void llog_info_fini(void);
 
-void llog_handle_get(struct llog_handle *loghandle);
-void llog_handle_put(struct llog_handle *loghandle);
+struct llog_handle *llog_handle_get(struct llog_handle *loghandle);
+int llog_handle_put(const struct lu_env *env, struct llog_handle *loghandle);
 int class_config_dump_handler(const struct lu_env *env,
 			      struct llog_handle *handle,
 			      struct llog_rec_hdr *rec, void *data);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 615/622] lnet: handling device failure by IB event handler
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (613 preceding siblings ...)
  2020-02-27 21:18 ` [lustre-devel] [PATCH 614/622] lustre: llog: keep llog handle alive until last reference James Simmons
@ 2020-02-27 21:18 ` James Simmons
  2020-02-27 21:18 ` [lustre-devel] [PATCH 616/622] lustre: ptlrpc: simplify wait_event handling in unregister functions James Simmons
                   ` (7 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:18 UTC (permalink / raw)
  To: lustre-devel

From: Tatsushi Takamura <takamr.tatsushi@jp.fujitsu.com>

The following IB events cannot be handled by QP event handler
- IB_EVENT_DEVICE_FATAL
- IB_EVENT_PORT_ERR
- IB_EVENT_PORT_ACTIVE

IB event handler handles device errors such as hardware errors
and link down.

WC-bug-id: https://jira.whamcloud.com/browse/LU-12287
Lustre-commit: c6e4c21c4f8b ("LU-12287 lnet: handling device failure by IB event handler")
Signed-off-by: Tatsushi Takamura <takamr.tatsushi@jp.fujitsu.com>
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35037
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/o2iblnd/o2iblnd.c | 100 +++++++++++++++++++++++++++++++++++++++
 net/lnet/klnds/o2iblnd/o2iblnd.h |   8 ++++
 2 files changed, 108 insertions(+)

diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.c b/net/lnet/klnds/o2iblnd/o2iblnd.c
index f6db2c7..7bf2883 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.c
@@ -2306,9 +2306,93 @@ static int kiblnd_net_init_pools(struct kib_net *net, struct lnet_ni *ni,
 	return rc;
 }
 
+static int kiblnd_port_get_attr(struct kib_hca_dev *hdev)
+{
+	struct ib_port_attr *port_attr;
+	int rc;
+	unsigned long flags;
+	rwlock_t *g_lock = &kiblnd_data.kib_global_lock;
+
+	port_attr = kzalloc(sizeof(*port_attr), GFP_NOFS);
+	if (!port_attr) {
+		CDEBUG(D_NETERROR, "Out of memory\n");
+		return -ENOMEM;
+	}
+
+	rc = ib_query_port(hdev->ibh_ibdev, hdev->ibh_port, port_attr);
+
+	write_lock_irqsave(g_lock, flags);
+
+	if (rc == 0)
+		hdev->ibh_state = port_attr->state == IB_PORT_ACTIVE
+				 ? IBLND_DEV_PORT_ACTIVE
+				 : IBLND_DEV_PORT_DOWN;
+
+	write_unlock_irqrestore(g_lock, flags);
+	kfree(port_attr);
+
+	if (rc != 0) {
+		CDEBUG(D_NETERROR, "Failed to query IB port: %d\n", rc);
+		return rc;
+	}
+	return 0;
+}
+
+static inline void
+kiblnd_set_ni_fatal_on(struct kib_hca_dev *hdev, int val)
+{
+	struct kib_net  *net;
+
+	/* for health check */
+	list_for_each_entry(net, &hdev->ibh_dev->ibd_nets, ibn_list) {
+		if (val)
+			CDEBUG(D_NETERROR, "Fatal device error for NI %s\n",
+			       libcfs_nid2str(net->ibn_ni->ni_nid));
+		atomic_set(&net->ibn_ni->ni_fatal_error_on, val);
+	}
+}
+
+void
+kiblnd_event_handler(struct ib_event_handler *handler, struct ib_event *event)
+{
+	rwlock_t *g_lock = &kiblnd_data.kib_global_lock;
+	struct kib_hca_dev  *hdev;
+	unsigned long flags;
+
+	hdev = container_of(handler, struct kib_hca_dev, ibh_event_handler);
+
+	write_lock_irqsave(g_lock, flags);
+
+	switch (event->event) {
+	case IB_EVENT_DEVICE_FATAL:
+		CDEBUG(D_NET, "IB device fatal\n");
+		hdev->ibh_state = IBLND_DEV_FATAL;
+		kiblnd_set_ni_fatal_on(hdev, 1);
+		break;
+	case IB_EVENT_PORT_ACTIVE:
+		CDEBUG(D_NET, "IB port active\n");
+		if (event->element.port_num == hdev->ibh_port) {
+			hdev->ibh_state = IBLND_DEV_PORT_ACTIVE;
+			kiblnd_set_ni_fatal_on(hdev, 0);
+		}
+		break;
+	case IB_EVENT_PORT_ERR:
+		CDEBUG(D_NET, "IB port err\n");
+		if (event->element.port_num == hdev->ibh_port) {
+			hdev->ibh_state = IBLND_DEV_PORT_DOWN;
+			kiblnd_set_ni_fatal_on(hdev, 1);
+		}
+		break;
+	default:
+		break;
+	}
+	write_unlock_irqrestore(g_lock, flags);
+}
+
 static int kiblnd_hdev_get_attr(struct kib_hca_dev *hdev)
 {
 	struct ib_device_attr *dev_attr = &hdev->ibh_ibdev->attrs;
+	int rc2 = 0;
 
 	/*
 	 * It's safe to assume a HCA can handle a page size
@@ -2338,12 +2422,19 @@ static int kiblnd_hdev_get_attr(struct kib_hca_dev *hdev)
 	hdev->ibh_mr_size = dev_attr->max_mr_size;
 	hdev->ibh_max_qp_wr = dev_attr->max_qp_wr;
 
+	rc2 = kiblnd_port_get_attr(hdev);
+	if (rc2 != 0)
+		return rc2;
+
 	CERROR("Invalid mr size: %#llx\n", hdev->ibh_mr_size);
 	return -EINVAL;
 }
 
 void kiblnd_hdev_destroy(struct kib_hca_dev *hdev)
 {
+	if (hdev->ibh_event_handler.device)
+		ib_unregister_event_handler(&hdev->ibh_event_handler);
+
 	if (hdev->ibh_pd)
 		ib_dealloc_pd(hdev->ibh_pd);
 
@@ -2491,6 +2582,7 @@ int kiblnd_dev_failover(struct kib_dev *dev, struct net *ns)
 	hdev->ibh_dev = dev;
 	hdev->ibh_cmid = cmid;
 	hdev->ibh_ibdev = cmid->device;
+	hdev->ibh_port  = cmid->port_num;
 
 	pd = ib_alloc_pd(cmid->device, 0);
 	if (IS_ERR(pd)) {
@@ -2513,6 +2605,10 @@ int kiblnd_dev_failover(struct kib_dev *dev, struct net *ns)
 		goto out;
 	}
 
+	INIT_IB_EVENT_HANDLER(&hdev->ibh_event_handler,
+			      hdev->ibh_ibdev, kiblnd_event_handler);
+	ib_register_event_handler(&hdev->ibh_event_handler);
+
 	write_lock_irqsave(&kiblnd_data.kib_global_lock, flags);
 
 	swap(dev->ibd_hdev, hdev); /* take over the refcount */
@@ -2907,6 +3003,7 @@ static int kiblnd_startup(struct lnet_ni *ni)
 		goto net_failed;
 	}
 
+	net->ibn_ni = ni;
 	net->ibn_incarnation = ktime_get_real_ns() / NSEC_PER_USEC;
 
 	rc = kiblnd_tunables_setup(ni);
@@ -3000,6 +3097,9 @@ static int kiblnd_startup(struct lnet_ni *ni)
 	write_lock_irqsave(&kiblnd_data.kib_global_lock, flags);
 	ibdev->ibd_nnets++;
 	list_add_tail(&net->ibn_list, &ibdev->ibd_nets);
+	/* for health check */
+	if (ibdev->ibd_hdev->ibh_state == IBLND_DEV_PORT_DOWN)
+		kiblnd_set_ni_fatal_on(ibdev->ibd_hdev, 1);
 	write_unlock_irqrestore(&kiblnd_data.kib_global_lock, flags);
 
 	net->ibn_init = IBLND_INIT_ALL;
diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.h b/net/lnet/klnds/o2iblnd/o2iblnd.h
index 2169fdd..8aa79d5 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.h
@@ -180,6 +180,13 @@ struct kib_hca_dev {
 	u64			ibh_mr_size;	/* size of MR */
 	int			ibh_max_qp_wr;	/* maximum work requests size */
 	struct ib_pd		*ibh_pd;	/* PD */
+	u8			ibh_port;	/* port number */
+	struct ib_event_handler
+				ibh_event_handler; /* IB event handler */
+	int			ibh_state;	/* device status */
+#define IBLND_DEV_PORT_DOWN	0
+#define IBLND_DEV_PORT_ACTIVE	1
+#define IBLND_DEV_FATAL		2
 	struct kib_dev		*ibh_dev;	/* owner */
 	atomic_t		ibh_ref;	/* refcount */
 };
@@ -309,6 +316,7 @@ struct kib_net {
 	struct kib_fmr_poolset	**ibn_fmr_ps;	/* fmr pool-set */
 
 	struct kib_dev		*ibn_dev;	/* underlying IB device */
+	struct lnet_ni		*ibn_ni;	/* LNet interface */
 };
 
 #define KIB_THREAD_SHIFT		16
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 616/622] lustre: ptlrpc: simplify wait_event handling in unregister functions
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (614 preceding siblings ...)
  2020-02-27 21:18 ` [lustre-devel] [PATCH 615/622] lnet: handling device failure by IB event handler James Simmons
@ 2020-02-27 21:18 ` James Simmons
  2020-02-27 21:18 ` [lustre-devel] [PATCH 617/622] lustre: ptlrpc: use l_wait_event_abortable in ptlrpcd_add_req() James Simmons
                   ` (6 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:18 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

We can simplify the wait_event_idle_timeout() handling in both
ptlrpc_unregister_bulk() and ptlrpc_unregister_reply() by changing the
timeout to a countdown. Fewer variables are needed on the stack.
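
For readers skimming the diff, the countdown idiom can be sketched in
plain userspace C. Everything named below (wait_one_tick(),
ticks_until_event, the value used for LONG_UNLINK) is invented for
illustration; only the loop shape mirrors the kernel change:

```c
#include <assert.h>

/* "Huge" timeout, in seconds; the value is illustrative only. */
#define LONG_UNLINK 300

static int ticks_until_event;

/* Stands in for wait_event_idle_timeout(wq, cond, HZ): returns nonzero
 * once the awaited condition holds, zero if one second elapsed first. */
static int wait_one_tick(void)
{
	if (ticks_until_event > 0)
		ticks_until_event--;
	return ticks_until_event == 0;
}

/* Returns 1 if the event arrived inside the huge timeout, 0 otherwise.
 * There is no separate rc or cnt on the stack: the leftover seconds
 * double as the success indicator. */
int countdown_wait(int event_after)
{
	int seconds = LONG_UNLINK;

	ticks_until_event = event_after;
	while (seconds > 0 && wait_one_tick() == 0)
		seconds -= 1;
	return seconds > 0;
}
```

The leftover seconds value doubling as the success indicator is what
lets the patch drop the separate rc and cnt variables.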

WC-bug-id: https://jira.whamcloud.com/browse/LU-10467
Lustre-commit: 5e30a2c06176f50f ("LU-10467 lustre: convert most users of LWI_TIMEOUT_INTERVAL()")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/35973
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Petros Koutoupis <pkoutoupis@cray.com>
Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/client.c | 31 ++++++++++---------------------
 fs/lustre/ptlrpc/niobuf.c | 30 ++++++++++++++----------------
 2 files changed, 24 insertions(+), 37 deletions(-)

diff --git a/fs/lustre/ptlrpc/client.c b/fs/lustre/ptlrpc/client.c
index 632ddf1..1714e66 100644
--- a/fs/lustre/ptlrpc/client.c
+++ b/fs/lustre/ptlrpc/client.c
@@ -2570,9 +2570,6 @@ u64 ptlrpc_req_xid(struct ptlrpc_request *request)
  */
 static int ptlrpc_unregister_reply(struct ptlrpc_request *request, int async)
 {
-	int rc;
-	wait_queue_head_t *wq;
-
 	/* Might sleep. */
 	LASSERT(!in_interrupt());
 
@@ -2599,29 +2596,21 @@ static int ptlrpc_unregister_reply(struct ptlrpc_request *request, int async)
 	if (async)
 		return 0;
 
-	/*
-	 * We have to wait_event_idle_timeout() whatever the result, to get
-	 * a chance to run reply_in_callback(), and to make sure we've
-	 * unlinked before returning a req to the pool.
-	 */
-	if (request->rq_set)
-		wq = &request->rq_set->set_waitq;
-	else
-		wq = &request->rq_reply_waitq;
-
 	for (;;) {
+		wait_queue_head_t *wq = (request->rq_set) ?
+					&request->rq_set->set_waitq :
+					&request->rq_reply_waitq;
+		int seconds = LONG_UNLINK;
 		/*
 		 * Network access will complete in finite time but the HUGE
 		 * timeout lets us CWARN for visibility of sluggish NALs
 		 */
-		int cnt = 0;
-
-		while (cnt < LONG_UNLINK &&
-		       (rc = wait_event_idle_timeout(*wq,
-						     !ptlrpc_client_recv_or_unlink(request),
-						     HZ)) == 0)
-			cnt += 1;
-		if (rc > 0) {
+		while (seconds > 0 &&
+		       (wait_event_idle_timeout(*wq,
+						!ptlrpc_client_recv_or_unlink(request),
+						HZ)) == 0)
+			seconds -= 1;
+		if (seconds > 0) {
 			ptlrpc_rqphase_move(request, request->rq_next_phase);
 			return 1;
 		}
diff --git a/fs/lustre/ptlrpc/niobuf.c b/fs/lustre/ptlrpc/niobuf.c
index 26a1f97..ab2753a 100644
--- a/fs/lustre/ptlrpc/niobuf.c
+++ b/fs/lustre/ptlrpc/niobuf.c
@@ -244,8 +244,6 @@ static int ptlrpc_register_bulk(struct ptlrpc_request *req)
 int ptlrpc_unregister_bulk(struct ptlrpc_request *req, int async)
 {
 	struct ptlrpc_bulk_desc *desc = req->rq_bulk;
-	wait_queue_head_t *wq;
-	int rc;
 
 	LASSERT(!in_interrupt());     /* might sleep */
 
@@ -276,23 +274,23 @@ int ptlrpc_unregister_bulk(struct ptlrpc_request *req, int async)
 	if (async)
 		return 0;
 
-	if (req->rq_set)
-		wq = &req->rq_set->set_waitq;
-	else
-		wq = &req->rq_reply_waitq;
-
 	for (;;) {
-		/* Network access will complete in finite time but the HUGE
+		/* The wq argument is ignored by user-space wait_event macros */
+		wait_queue_head_t *wq = (req->rq_set != NULL) ?
+					&req->rq_set->set_waitq :
+					&req->rq_reply_waitq;
+		/*
+		 * Network access will complete in finite time but the HUGE
 		 * timeout lets us CWARN for visibility of sluggish LNDs
 		 */
-		int cnt = 0;
-
-		while (cnt < LONG_UNLINK &&
-		       (rc = wait_event_idle_timeout(*wq,
-						     !ptlrpc_client_bulk_active(req),
-						     HZ)) == 0)
-			cnt += 1;
-		if (rc > 0) {
+		int seconds = LONG_UNLINK;
+
+		while (seconds > 0 &&
+		       wait_event_idle_timeout(*wq,
+					       !ptlrpc_client_bulk_active(req),
+					       HZ) == 0)
+			seconds -= 1;
+		if (seconds > 0) {
 			ptlrpc_rqphase_move(req, req->rq_next_phase);
 			return 1;
 		}
-- 
1.8.3.1


* [lustre-devel] [PATCH 617/622] lustre: ptlrpc: use l_wait_event_abortable in ptlrpcd_add_req()
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (615 preceding siblings ...)
  2020-02-27 21:18 ` [lustre-devel] [PATCH 616/622] lustre: ptlrpc: simplify wait_event handling in unregister functions James Simmons
@ 2020-02-27 21:18 ` James Simmons
  2020-02-27 21:18 ` [lustre-devel] [PATCH 618/622] lnet: use LIST_HEAD() for local lists James Simmons
                   ` (5 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:18 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.com>

Using wait_event_idle() will ignore signals, which is not what we
want in ptlrpcd_add_req(). Change it to l_wait_event_abortable().

WC-bug-id: https://jira.whamcloud.com/browse/LU-10467
Lustre-commit: ca6c35cab141 ("LU-10467 lustre: convert users of back_to_sleep()")
Signed-off-by: Mr NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35980
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Petros Koutoupis <pkoutoupis@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/ptlrpcd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/ptlrpc/ptlrpcd.c b/fs/lustre/ptlrpc/ptlrpcd.c
index 1a1fa05..533f592 100644
--- a/fs/lustre/ptlrpc/ptlrpcd.c
+++ b/fs/lustre/ptlrpc/ptlrpcd.c
@@ -235,8 +235,8 @@ void ptlrpcd_add_req(struct ptlrpc_request *req)
 		if (wait_event_idle_timeout(req->rq_set_waitq,
 					    !req->rq_set,
 					    5 * HZ) == 0)
-			wait_event_idle(req->rq_set_waitq,
-					!req->rq_set);
+			l_wait_event_abortable(req->rq_set_waitq,
+					       !req->rq_set);
 	} else if (req->rq_set) {
 		/*
 		 * If we have a valid "rq_set", just reuse it to avoid double
-- 
1.8.3.1


* [lustre-devel] [PATCH 618/622] lnet: use LIST_HEAD() for local lists.
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (616 preceding siblings ...)
  2020-02-27 21:18 ` [lustre-devel] [PATCH 617/622] lustre: ptlrpc: use l_wait_event_abortable in ptlrpcd_add_req() James Simmons
@ 2020-02-27 21:18 ` James Simmons
  2020-02-27 21:18 ` [lustre-devel] [PATCH 619/622] lustre: lustre: " James Simmons
                   ` (4 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:18 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

When declaring a local list head, instead of

   struct list_head list;
   INIT_LIST_HEAD(&list);

use
   LIST_HEAD(list);

which does both steps.
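
To see why the two forms are equivalent, here is a minimal userspace
re-implementation of the list primitives. This is a sketch only; the
real definitions live in include/linux/list.h:

```c
#include <assert.h>

/* Minimal stand-in for the kernel's circular doubly-linked list head. */
struct list_head {
	struct list_head *next, *prev;
};

/* An empty list is a head that points at itself in both directions. */
#define LIST_HEAD_INIT(name)	{ &(name), &(name) }
#define LIST_HEAD(name)		struct list_head name = LIST_HEAD_INIT(name)

static inline void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

static inline int list_empty(const struct list_head *head)
{
	return head->next == head;
}

int lists_equivalent(void)
{
	/* Two-step form the patch replaces ... */
	struct list_head a;

	INIT_LIST_HEAD(&a);

	/* ... and the one-step declaration it prefers. */
	LIST_HEAD(b);

	return list_empty(&a) && list_empty(&b);
}
```

Both a and b start out as valid empty lists; LIST_HEAD() simply folds
the declaration and the self-pointing initialization into one line.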

WC-bug-id: https://jira.whamcloud.com/browse/LU-9679
Lustre-commit: 135b5c0009e5 ("LU-9679 lnet: use LIST_HEAD() for local lists.")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/36954
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/o2iblnd/o2iblnd_cb.c |  7 ++-----
 net/lnet/klnds/socklnd/socklnd_cb.c |  3 +--
 net/lnet/lnet/api-ni.c              | 20 +++++---------------
 net/lnet/lnet/config.c              | 28 ++++++++--------------------
 net/lnet/lnet/lib-move.c            | 30 ++++++++----------------------
 net/lnet/lnet/net_fault.c           | 14 ++++----------
 net/lnet/lnet/peer.c                |  8 ++------
 net/lnet/lnet/router.c              | 15 ++++-----------
 net/lnet/selftest/console.c         |  4 +---
 9 files changed, 35 insertions(+), 94 deletions(-)

diff --git a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
index f769a45..67780d0 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -1384,11 +1384,9 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 {
 	rwlock_t *glock = &kiblnd_data.kib_global_lock;
 	char *reason = NULL;
-	struct list_head txs;
+	LIST_HEAD(txs);
 	unsigned long flags;
 
-	INIT_LIST_HEAD(&txs);
-
 	write_lock_irqsave(glock, flags);
 	if (!peer_ni->ibp_reconnecting) {
 		if (peer_ni->ibp_accepting)
@@ -2218,7 +2216,7 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 {
 	struct kib_peer_ni *peer_ni = conn->ibc_peer;
 	struct kib_tx *tx;
-	struct list_head txs;
+	LIST_HEAD(txs);
 	unsigned long flags;
 	int active;
 
@@ -2277,7 +2275,6 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 	}
 
 	/* grab pending txs while I have the lock */
-	INIT_LIST_HEAD(&txs);
 	list_splice_init(&peer_ni->ibp_tx_queue, &txs);
 
 	if (!kiblnd_peer_active(peer_ni) ||	/* peer_ni has been deleted */
diff --git a/net/lnet/klnds/socklnd/socklnd_cb.c b/net/lnet/klnds/socklnd/socklnd_cb.c
index fb933e3..66b0ac7 100644
--- a/net/lnet/klnds/socklnd/socklnd_cb.c
+++ b/net/lnet/klnds/socklnd/socklnd_cb.c
@@ -2491,14 +2491,13 @@ void ksocknal_write_callback(struct ksock_conn *conn)
 	wait_queue_entry_t wait;
 	struct ksock_conn *conn;
 	struct ksock_sched *sched;
-	struct list_head enomem_conns;
+	LIST_HEAD(enomem_conns);
 	int nenomem_conns;
 	time64_t timeout;
 	int i;
 	int peer_index = 0;
 	time64_t deadline = ktime_get_seconds();
 
-	INIT_LIST_HEAD(&enomem_conns);
 	init_waitqueue_entry(&wait, current);
 
 	spin_lock_bh(&ksocknal_data.ksnd_reaper_lock);
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index b9c38f3..8f59266 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -2062,11 +2062,9 @@ static void lnet_push_target_fini(void)
 lnet_shutdown_lndnets(void)
 {
 	struct lnet_net *net;
-	struct list_head resend;
+	LIST_HEAD(resend);
 	struct lnet_msg *msg, *tmp;
 
-	INIT_LIST_HEAD(&resend);
-
 	/* NB called holding the global mutex */
 
 	/* All quiet on the API front */
@@ -2202,7 +2200,7 @@ static void lnet_push_target_fini(void)
 {
 	struct lnet_ni *ni;
 	struct lnet_net *net_l = NULL;
-	struct list_head local_ni_list;
+	LIST_HEAD(local_ni_list);
 	int ni_count = 0;
 	u32 lnd_type;
 	struct lnet_lnd *lnd;
@@ -2214,8 +2212,6 @@ static void lnet_push_target_fini(void)
 	int peerrtrcredits =
 		net->net_tunables.lct_peer_rtr_credits;
 
-	INIT_LIST_HEAD(&local_ni_list);
-
 	/*
 	 * make sure that this net is unique. If it isn't then
 	 * we are adding interfaces to an already existing network, and
@@ -2509,11 +2505,9 @@ void lnet_lib_exit(void)
 	int ni_count;
 	struct lnet_ping_buffer *pbuf;
 	struct lnet_handle_md ping_mdh;
-	struct list_head net_head;
+	LIST_HEAD(net_head);
 	struct lnet_net *net;
 
-	INIT_LIST_HEAD(&net_head);
-
 	mutex_lock(&the_lnet.ln_api_mutex);
 
 	CDEBUG(D_OTHER, "refs %d\n", the_lnet.ln_refcount);
@@ -3098,9 +3092,7 @@ static int lnet_handle_legacy_ip2nets(char *ip2nets,
 	struct lnet_net *net;
 	char *nets;
 	int rc;
-	struct list_head net_head;
-
-	INIT_LIST_HEAD(&net_head);
+	LIST_HEAD(net_head);
 
 	rc = lnet_parse_ip2nets(&nets, ip2nets);
 	if (rc < 0)
@@ -3282,13 +3274,11 @@ int lnet_dyn_del_ni(struct lnet_ioctl_config_ni *conf)
 lnet_dyn_add_net(struct lnet_ioctl_config_data *conf)
 {
 	struct lnet_net *net;
-	struct list_head net_head;
+	LIST_HEAD(net_head);
 	int rc;
 	struct lnet_ioctl_config_lnd_tunables tun;
 	char *nets = conf->cfg_config_u.cfg_net.net_intf;
 
-	INIT_LIST_HEAD(&net_head);
-
 	/* Create a net/ni structures for the network string */
 	rc = lnet_parse_networks(&net_head, nets, use_tcp_bonding);
 	if (rc <= 0)
diff --git a/net/lnet/lnet/config.c b/net/lnet/lnet/config.c
index f50df88..9d3813c 100644
--- a/net/lnet/lnet/config.c
+++ b/net/lnet/lnet/config.c
@@ -889,14 +889,12 @@ struct lnet_ni *
 static int
 lnet_str2tbs_sep(struct list_head *tbs, char *str)
 {
-	struct list_head pending;
+	LIST_HEAD(pending);
 	char *sep;
 	int nob;
 	int i;
 	struct lnet_text_buf *ltb;
 
-	INIT_LIST_HEAD(&pending);
-
 	/* Split 'str' into separate commands */
 	for (;;) {
 		/* skip leading whitespace */
@@ -973,7 +971,7 @@ struct lnet_ni *
 lnet_str2tbs_expand(struct list_head *tbs, char *str)
 {
 	char num[16];
-	struct list_head pending;
+	LIST_HEAD(pending);
 	char *sep;
 	char *sep2;
 	char *parsed;
@@ -985,8 +983,6 @@ struct lnet_ni *
 	int nob;
 	int scanned;
 
-	INIT_LIST_HEAD(&pending);
-
 	sep = strchr(str, '[');
 	if (!sep)			/* nothing to expand */
 		return 0;
@@ -1097,8 +1093,8 @@ struct lnet_ni *
 {
 	/* static scratch buffer OK (single threaded) */
 	static char cmd[LNET_SINGLE_TEXTBUF_NOB];
-	struct list_head nets;
-	struct list_head gateways;
+	LIST_HEAD(nets);
+	LIST_HEAD(gateways);
 	struct list_head *tmp1;
 	struct list_head *tmp2;
 	u32 net;
@@ -1114,9 +1110,6 @@ struct lnet_ni *
 	int got_hops = 0;
 	unsigned int priority = 0;
 
-	INIT_LIST_HEAD(&gateways);
-	INIT_LIST_HEAD(&nets);
-
 	/* save a copy of the string for error messages */
 	strncpy(cmd, str, sizeof(cmd));
 	cmd[sizeof(cmd) - 1] = '\0';
@@ -1260,13 +1253,11 @@ struct lnet_ni *
 int
 lnet_parse_routes(char *routes, int *im_a_router)
 {
-	struct list_head tbs;
+	LIST_HEAD(tbs);
 	int rc = 0;
 
 	*im_a_router = 0;
 
-	INIT_LIST_HEAD(&tbs);
-
 	if (lnet_str2tbs_sep(&tbs, routes) < 0) {
 		CERROR("Error parsing routes\n");
 		rc = -EINVAL;
@@ -1453,9 +1444,9 @@ struct lnet_ni *
 {
 	static char networks[LNET_SINGLE_TEXTBUF_NOB];
 	static char source[LNET_SINGLE_TEXTBUF_NOB];
-	struct list_head raw_entries;
-	struct list_head matched_nets;
-	struct list_head current_nets;
+	LIST_HEAD(raw_entries);
+	LIST_HEAD(matched_nets);
+	LIST_HEAD(current_nets);
 	struct list_head *t;
 	struct list_head *t2;
 	struct lnet_text_buf *tb;
@@ -1467,15 +1458,12 @@ struct lnet_ni *
 	int dup;
 	int rc;
 
-	INIT_LIST_HEAD(&raw_entries);
 	if (lnet_str2tbs_sep(&raw_entries, ip2nets) < 0) {
 		CERROR("Error parsing ip2nets\n");
 		LASSERT(!lnet_tbnob);
 		return -EINVAL;
 	}
 
-	INIT_LIST_HEAD(&matched_nets);
-	INIT_LIST_HEAD(&current_nets);
 	networks[0] = 0;
 	count = 0;
 	len = 0;
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index cd36d52..cd7ac7f 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -166,7 +166,7 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	struct lnet_test_peer *tp;
 	struct list_head *el;
 	struct list_head *next;
-	struct list_head cull;
+	LIST_HEAD(cull);
 
 	/* NB: use lnet_net_lock(0) to serialize operations on test peers */
 	if (threshold) {
@@ -184,9 +184,6 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 		return 0;
 	}
 
-	/* removing entries */
-	INIT_LIST_HEAD(&cull);
-
 	lnet_net_lock(0);
 
 	list_for_each_safe(el, next, &the_lnet.ln_test_peers) {
@@ -216,11 +213,9 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	struct lnet_test_peer *tp;
 	struct list_head *el;
 	struct list_head *next;
-	struct list_head cull;
+	LIST_HEAD(cull);
 	int fail = 0;
 
-	INIT_LIST_HEAD(&cull);
-
 	/* NB: use lnet_net_lock(0) to serialize operations on test peers */
 	lnet_net_lock(0);
 
@@ -2620,7 +2615,6 @@ struct lnet_mt_event_info {
 lnet_finalize_expired_responses(void)
 {
 	struct lnet_libmd *md;
-	struct list_head local_queue;
 	struct lnet_rsp_tracker *rspt, *tmp;
 	ktime_t now;
 	int i;
@@ -2629,7 +2623,7 @@ struct lnet_mt_event_info {
 		return;
 
 	cfs_cpt_for_each(i, lnet_cpt_table()) {
-		INIT_LIST_HEAD(&local_queue);
+		LIST_HEAD(local_queue);
 
 		lnet_net_lock(i);
 		if (!the_lnet.ln_mt_rstq[i]) {
@@ -2856,8 +2850,8 @@ struct lnet_mt_event_info {
 lnet_recover_local_nis(void)
 {
 	struct lnet_mt_event_info *ev_info;
-	struct list_head processed_list;
-	struct list_head local_queue;
+	LIST_HEAD(processed_list);
+	LIST_HEAD(local_queue);
 	struct lnet_handle_md mdh;
 	struct lnet_ni *tmp;
 	struct lnet_ni *ni;
@@ -2865,9 +2859,6 @@ struct lnet_mt_event_info {
 	int healthv;
 	int rc;
 
-	INIT_LIST_HEAD(&local_queue);
-	INIT_LIST_HEAD(&processed_list);
-
 	/* splice the recovery queue on a local queue. We will iterate
 	 * through the local queue and update it as needed. Once we're
 	 * done with the traversal, we'll splice the local queue back on
@@ -3091,11 +3082,9 @@ struct lnet_mt_event_info {
 lnet_clean_resendqs(void)
 {
 	struct lnet_msg *msg, *tmp;
-	struct list_head msgs;
+	LIST_HEAD(msgs);
 	int i;
 
-	INIT_LIST_HEAD(&msgs);
-
 	cfs_cpt_for_each(i, lnet_cpt_table()) {
 		lnet_net_lock(i);
 		list_splice_init(the_lnet.ln_mt_resendqs[i], &msgs);
@@ -3114,8 +3103,8 @@ struct lnet_mt_event_info {
 lnet_recover_peer_nis(void)
 {
 	struct lnet_mt_event_info *ev_info;
-	struct list_head processed_list;
-	struct list_head local_queue;
+	LIST_HEAD(processed_list);
+	LIST_HEAD(local_queue);
 	struct lnet_handle_md mdh;
 	struct lnet_peer_ni *lpni;
 	struct lnet_peer_ni *tmp;
@@ -3123,9 +3112,6 @@ struct lnet_mt_event_info {
 	int healthv;
 	int rc;
 
-	INIT_LIST_HEAD(&local_queue);
-	INIT_LIST_HEAD(&processed_list);
-
 	/* Always use cpt 0 for locking across all interactions with
 	 * ln_mt_peerNIRecovq
 	 */
diff --git a/net/lnet/lnet/net_fault.c b/net/lnet/lnet/net_fault.c
index 8408e93..515aa05 100644
--- a/net/lnet/lnet/net_fault.c
+++ b/net/lnet/lnet/net_fault.c
@@ -201,11 +201,9 @@ struct lnet_drop_rule {
 {
 	struct lnet_drop_rule *rule;
 	struct lnet_drop_rule *tmp;
-	struct list_head zombies;
+	LIST_HEAD(zombies);
 	int n = 0;
 
-	INIT_LIST_HEAD(&zombies);
-
 	lnet_net_lock(LNET_LOCK_EX);
 	list_for_each_entry_safe(rule, tmp, &the_lnet.ln_drop_rules, dr_link) {
 		if (rule->dr_attr.fa_src != src && src)
@@ -725,9 +723,8 @@ struct delay_daemon_data {
 lnet_delay_rule_check(void)
 {
 	struct lnet_delay_rule *rule;
-	struct list_head msgs;
+	LIST_HEAD(msgs);
 
-	INIT_LIST_HEAD(&msgs);
 	while (1) {
 		if (list_empty(&delay_dd.dd_sched_rules))
 			break;
@@ -886,14 +883,11 @@ struct delay_daemon_data {
 {
 	struct lnet_delay_rule *rule;
 	struct lnet_delay_rule *tmp;
-	struct list_head rule_list;
-	struct list_head msg_list;
+	LIST_HEAD(rule_list);
+	LIST_HEAD(msg_list);
 	int n = 0;
 	bool cleanup;
 
-	INIT_LIST_HEAD(&rule_list);
-	INIT_LIST_HEAD(&msg_list);
-
 	if (shutdown) {
 		src = 0;
 		dst = 0;
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 0d7fbd4..b76ff94 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -1912,9 +1912,7 @@ static void lnet_peer_discovery_complete(struct lnet_peer *lp)
 {
 	struct lnet_msg *msg, *tmp;
 	int rc = 0;
-	struct list_head pending_msgs;
-
-	INIT_LIST_HEAD(&pending_msgs);
+	LIST_HEAD(pending_msgs);
 
 	CDEBUG(D_NET, "Discovery complete. Dequeue peer %s\n",
 	       libcfs_nid2str(lp->lp_primary_nid));
@@ -3238,11 +3236,9 @@ static int lnet_peer_discovery_wait_for_work(void)
 static void lnet_resend_msgs(void)
 {
 	struct lnet_msg *msg, *tmp;
-	struct list_head resend;
+	LIST_HEAD(resend);
 	int rc;
 
-	INIT_LIST_HEAD(&resend);
-
 	spin_lock(&the_lnet.ln_msg_resend_lock);
 	list_splice(&the_lnet.ln_msg_resend, &resend);
 	spin_unlock(&the_lnet.ln_msg_resend_lock);
diff --git a/net/lnet/lnet/router.c b/net/lnet/lnet/router.c
index 7ba406a..69df212 100644
--- a/net/lnet/lnet/router.c
+++ b/net/lnet/lnet/router.c
@@ -717,19 +717,16 @@ static void lnet_shuffle_seed(void)
 int
 lnet_del_route(u32 net, lnet_nid_t gw_nid)
 {
-	struct list_head rnet_zombies;
+	LIST_HEAD(rnet_zombies);
 	struct lnet_remotenet *rnet;
 	struct lnet_remotenet *tmp;
 	struct list_head *rn_list;
 	struct lnet_peer_ni *lpni;
 	struct lnet_route *route;
-	struct list_head zombies;
+	LIST_HEAD(zombies);
 	struct lnet_peer *lp = NULL;
 	int i = 0;
 
-	INIT_LIST_HEAD(&rnet_zombies);
-	INIT_LIST_HEAD(&zombies);
-
 	CDEBUG(D_NET, "Del route: net %s : gw %s\n",
 	       libcfs_net2str(net), libcfs_nid2str(gw_nid));
 
@@ -1152,14 +1149,12 @@ bool lnet_router_checker_active(void)
 lnet_rtrpool_free_bufs(struct lnet_rtrbufpool *rbp, int cpt)
 {
 	int npages = rbp->rbp_npages;
-	struct list_head tmp;
+	LIST_HEAD(tmp);
 	struct lnet_rtrbuf *rb;
 
 	if (!rbp->rbp_nbuffers) /* not initialized or already freed */
 		return;
 
-	INIT_LIST_HEAD(&tmp);
-
 	lnet_net_lock(cpt);
 	list_splice_init(&rbp->rbp_msgs, &tmp);
 	lnet_drop_routed_msgs_locked(&tmp, cpt);
@@ -1181,7 +1176,7 @@ bool lnet_router_checker_active(void)
 static int
 lnet_rtrpool_adjust_bufs(struct lnet_rtrbufpool *rbp, int nbufs, int cpt)
 {
-	struct list_head rb_list;
+	LIST_HEAD(rb_list);
 	struct lnet_rtrbuf *rb;
 	int num_rb;
 	int num_buffers = 0;
@@ -1213,8 +1208,6 @@ bool lnet_router_checker_active(void)
 	rbp->rbp_req_nbuffers = nbufs;
 	lnet_net_unlock(cpt);
 
-	INIT_LIST_HEAD(&rb_list);
-
 	/*
 	 * allocate the buffers on a local list first.  If all buffers are
 	 * allocated successfully then join this list to the rbp buffer
diff --git a/net/lnet/selftest/console.c b/net/lnet/selftest/console.c
index 9f32c1f..cc2c61d 100644
--- a/net/lnet/selftest/console.c
+++ b/net/lnet/selftest/console.c
@@ -1484,12 +1484,10 @@ static void lstcon_group_ndlink_release(struct lstcon_group *,
 lstcon_ndlist_stat(struct list_head *ndlist,
 		   int timeout, struct list_head __user *result_up)
 {
-	struct list_head head;
+	LIST_HEAD(head);
 	struct lstcon_rpc_trans *trans;
 	int rc;
 
-	INIT_LIST_HEAD(&head);
-
 	rc = lstcon_rpc_trans_ndlist(ndlist, &head,
 				     LST_TRANS_STATQRY, NULL, NULL, &trans);
 	if (rc) {
-- 
1.8.3.1


* [lustre-devel] [PATCH 619/622] lustre: lustre: use LIST_HEAD() for local lists.
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (617 preceding siblings ...)
  2020-02-27 21:18 ` [lustre-devel] [PATCH 618/622] lnet: use LIST_HEAD() for local lists James Simmons
@ 2020-02-27 21:18 ` James Simmons
  2020-02-27 21:18 ` [lustre-devel] [PATCH 620/622] lustre: handle: discard h_lock James Simmons
                   ` (3 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:18 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

When declaring a local list head, instead of

   struct list_head list;
   INIT_LIST_HEAD(&list);

use
   LIST_HEAD(list);

which does both steps.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9679
Lustre-commit: 0098396983e1 ("LU-9679 lustre: use LIST_HEAD() for local lists.")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/36955
Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/lprocfs_status.c | 3 +--
 fs/lustre/obdclass/lu_object.c      | 6 ++----
 fs/lustre/obdclass/obd_mount.c      | 3 +--
 fs/lustre/ptlrpc/client.c           | 3 +--
 fs/lustre/ptlrpc/service.c          | 3 +--
 5 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/fs/lustre/obdclass/lprocfs_status.c b/fs/lustre/obdclass/lprocfs_status.c
index 325005d..b19a1bd 100644
--- a/fs/lustre/obdclass/lprocfs_status.c
+++ b/fs/lustre/obdclass/lprocfs_status.c
@@ -1885,7 +1885,7 @@ int lprocfs_wr_nosquash_nids(const char __user *buffer, unsigned long count,
 			     struct root_squash_info *squash, char *name)
 {
 	char *kernbuf = NULL, *errmsg;
-	struct list_head tmp;
+	LIST_HEAD(tmp);
 	int len = count;
 	int rc;
 
@@ -1924,7 +1924,6 @@ int lprocfs_wr_nosquash_nids(const char __user *buffer, unsigned long count,
 		return count;
 	}
 
-	INIT_LIST_HEAD(&tmp);
 	if (cfs_parse_nidlist(kernbuf, count, &tmp) <= 0) {
 		errmsg = "can't parse";
 		rc = -EINVAL;
diff --git a/fs/lustre/obdclass/lu_object.c b/fs/lustre/obdclass/lu_object.c
index 7ea9948..e328f89 100644
--- a/fs/lustre/obdclass/lu_object.c
+++ b/fs/lustre/obdclass/lu_object.c
@@ -361,7 +361,7 @@ static void lu_object_free(const struct lu_env *env, struct lu_object *o)
 	struct lu_site *site;
 	struct lu_object *scan;
 	struct list_head *layers;
-	struct list_head splice;
+	LIST_HEAD(splice);
 
 	site = o->lo_dev->ld_site;
 	layers = &o->lo_header->loh_layers;
@@ -380,7 +380,6 @@ static void lu_object_free(const struct lu_env *env, struct lu_object *o)
 	 * necessary, because lu_object_header is freed together with the
 	 * top-level slice.
 	 */
-	INIT_LIST_HEAD(&splice);
 	list_splice_init(layers, &splice);
 	while (!list_empty(&splice)) {
 		/*
@@ -408,7 +407,7 @@ int lu_site_purge_objects(const struct lu_env *env, struct lu_site *s,
 	struct lu_object_header *h;
 	struct lu_object_header *temp;
 	struct lu_site_bkt_data *bkt;
-	struct list_head dispose;
+	LIST_HEAD(dispose);
 	int did_sth;
 	unsigned int start = 0;
 	int count;
@@ -418,7 +417,6 @@ int lu_site_purge_objects(const struct lu_env *env, struct lu_site *s,
 	if (OBD_FAIL_CHECK(OBD_FAIL_OBD_NO_LRU))
 		return 0;
 
-	INIT_LIST_HEAD(&dispose);
 	/*
 	 * Under LRU list lock, scan LRU list and move unreferenced objects to
 	 * the dispose list, removing them from LRU and hash table.
diff --git a/fs/lustre/obdclass/obd_mount.c b/fs/lustre/obdclass/obd_mount.c
index 31f2f5b..206edde 100644
--- a/fs/lustre/obdclass/obd_mount.c
+++ b/fs/lustre/obdclass/obd_mount.c
@@ -982,7 +982,7 @@ static bool lmd_find_delimiter(char *buf, char **endh)
  */
 static int lmd_parse_nidlist(char *buf, char **endh)
 {
-	struct list_head nidlist;
+	LIST_HEAD(nidlist);
 	char *endp = buf;
 	int rc = 0;
 	char tmp;
@@ -1000,7 +1000,6 @@ static int lmd_parse_nidlist(char *buf, char **endh)
 	tmp = *endp;
 	*endp = '\0';
 
-	INIT_LIST_HEAD(&nidlist);
 	if (cfs_parse_nidlist(buf, strlen(buf), &nidlist) <= 0)
 		rc = 1;
 	cfs_free_nidlist(&nidlist);
diff --git a/fs/lustre/ptlrpc/client.c b/fs/lustre/ptlrpc/client.c
index 1714e66..424819e 100644
--- a/fs/lustre/ptlrpc/client.c
+++ b/fs/lustre/ptlrpc/client.c
@@ -1715,13 +1715,12 @@ static inline int ptlrpc_set_producer(struct ptlrpc_request_set *set)
 int ptlrpc_check_set(const struct lu_env *env, struct ptlrpc_request_set *set)
 {
 	struct ptlrpc_request *req, *next;
-	struct list_head comp_reqs;
+	LIST_HEAD(comp_reqs);
 	int force_timer_recalc = 0;
 
 	if (atomic_read(&set->set_remaining) == 0)
 		return 1;
 
-	INIT_LIST_HEAD(&comp_reqs);
 	list_for_each_entry_safe(req, next, &set->set_requests, rq_set_chain) {
 		struct obd_import *imp = req->rq_import;
 		int unregistered = 0;
diff --git a/fs/lustre/ptlrpc/service.c b/fs/lustre/ptlrpc/service.c
index f65d5c5..b10c61b 100644
--- a/fs/lustre/ptlrpc/service.c
+++ b/fs/lustre/ptlrpc/service.c
@@ -1211,7 +1211,7 @@ static void ptlrpc_at_check_timed(struct ptlrpc_service_part *svcpt)
 {
 	struct ptlrpc_at_array *array = &svcpt->scp_at_array;
 	struct ptlrpc_request *rq, *n;
-	struct list_head work_list;
+	LIST_HEAD(work_list);
 	u32 index, count;
 	time64_t deadline;
 	time64_t now = ktime_get_real_seconds();
@@ -1244,7 +1244,6 @@ static void ptlrpc_at_check_timed(struct ptlrpc_service_part *svcpt)
 	 * We're close to a timeout, and we don't know how much longer the
 	 * server will take. Send early replies to everyone expiring soon.
 	 */
-	INIT_LIST_HEAD(&work_list);
 	deadline = -1;
 	div_u64_rem(array->paa_deadline, array->paa_size, &index);
 	count = array->paa_count;
-- 
1.8.3.1


* [lustre-devel] [PATCH 620/622] lustre: handle: discard h_lock.
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (618 preceding siblings ...)
  2020-02-27 21:18 ` [lustre-devel] [PATCH 619/622] lustre: lustre: " James Simmons
@ 2020-02-27 21:18 ` James Simmons
  2020-02-27 21:18 ` [lustre-devel] [PATCH 621/622] lnet: remove lnd_query interface James Simmons
                   ` (2 subsequent siblings)
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:18 UTC (permalink / raw)
  To: lustre-devel

From: NeilBrown <neilb@suse.com>

The h_lock spinlock is now only taken while bucket->lock
is held.  As a handle is associated with precisely one bucket,
this means that h_lock can never be contended, so it isn't needed.

So discard h_lock.

Also discard an increasingly irrelevant comment in the declaration
of struct portals_handle.
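
The lock-nesting argument can be checked with a small userspace sketch,
using pthread mutexes as stand-ins for the kernel spinlocks (all names
below are illustrative): because every acquirer of the inner lock
already holds the outer one, the inner trylock can never observe
contention.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t bucket_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t h_lock = PTHREAD_MUTEX_INITIALIZER;
static int failures;

/* Every path that touches h_lock already holds bucket_lock, mirroring
 * the commit message, so the inner trylock can never fail. */
static void *touch_handle(void *arg)
{
	int i;

	(void)arg;
	for (i = 0; i < 10000; i++) {
		pthread_mutex_lock(&bucket_lock);
		if (pthread_mutex_trylock(&h_lock) != 0)
			failures++;	/* would mean contention */
		else
			pthread_mutex_unlock(&h_lock);
		pthread_mutex_unlock(&bucket_lock);
	}
	return NULL;
}

/* Returns how often the inner lock was ever found contended. */
int run_contention_check(void)
{
	pthread_t t[4];
	int i;

	failures = 0;
	for (i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, touch_handle, NULL);
	for (i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	return failures;
}
```

If any code path took h_lock without bucket_lock first, the trylock
could fail and the counter would show it; since none does, the inner
lock serializes nothing and can be removed.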

WC-bug-id: https://jira.whamcloud.com/browse/LU-12542
Lustre-commit: 6acafe7ac4ef ("LU-12542 handle: discard h_lock.")
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-on: https://review.whamcloud.com/35863
Reviewed-by: Neil Brown <neilb@suse.de>
Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_handles.h  | 3 ---
 fs/lustre/obdclass/lustre_handles.c | 7 -------
 2 files changed, 10 deletions(-)

diff --git a/fs/lustre/include/lustre_handles.h b/fs/lustre/include/lustre_handles.h
index afdade7..9dbe7c9 100644
--- a/fs/lustre/include/lustre_handles.h
+++ b/fs/lustre/include/lustre_handles.h
@@ -62,10 +62,7 @@ struct portals_handle {
 	u64				h_cookie;
 	const char			*h_owner;
 	refcount_t			h_ref;
-
-	/* newly added fields to handle the RCU issue. -jxiong */
 	struct rcu_head			h_rcu;
-	spinlock_t			h_lock;
 };
 
 /* handles.c */
diff --git a/fs/lustre/obdclass/lustre_handles.c b/fs/lustre/obdclass/lustre_handles.c
index 0048036..7ecd15ad3 100644
--- a/fs/lustre/obdclass/lustre_handles.c
+++ b/fs/lustre/obdclass/lustre_handles.c
@@ -85,7 +85,6 @@ void class_handle_hash(struct portals_handle *h, const char *owner)
 	spin_unlock(&handle_base_lock);
 
 	h->h_owner = owner;
-	spin_lock_init(&h->h_lock);
 
 	bucket = &handle_hash[h->h_cookie & HANDLE_HASH_MASK];
 	spin_lock(&bucket->lock);
@@ -108,13 +107,7 @@ static void class_handle_unhash_nolock(struct portals_handle *h)
 	CDEBUG(D_INFO, "removing object %p with handle %#llx from hash\n",
 	       h, h->h_cookie);
 
-	spin_lock(&h->h_lock);
-	if (hlist_unhashed(&h->h_link)) {
-		spin_unlock(&h->h_lock);
-		return;
-	}
 	hlist_del_init_rcu(&h->h_link);
-	spin_unlock(&h->h_lock);
 }
 
 void class_handle_unhash(struct portals_handle *h)
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 621/622] lnet: remove lnd_query interface.
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (619 preceding siblings ...)
  2020-02-27 21:18 ` [lustre-devel] [PATCH 620/622] lustre: handle: discard h_lock James Simmons
@ 2020-02-27 21:18 ` James Simmons
  2020-02-27 21:18 ` [lustre-devel] [PATCH 622/622] lnet: use conservative health timeouts James Simmons
  2020-04-24  6:01 ` [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 NeilBrown
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:18 UTC (permalink / raw)
  To: lustre-devel

From: Mr NeilBrown <neilb@suse.de>

The ->lnd_query interface is completely unused, and has been since
commit 8e498d3f23ea ("LU-11300 lnet: peer aliveness")

So remove all mention of it.

Fixes: 5cdf0e31a7a9 ("lnet: peer aliveness")
WC-bug-id: https://jira.whamcloud.com/browse/LU-11300
Lustre-commit: 0d816af574b7 ("LU-11300 lnet: remove lnd_query interface.")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/37337
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-types.h      |  3 --
 net/lnet/klnds/o2iblnd/o2iblnd.c    | 32 -------------------
 net/lnet/klnds/o2iblnd/o2iblnd_cb.c |  2 +-
 net/lnet/klnds/socklnd/socklnd.c    | 62 -------------------------------------
 net/lnet/lnet/api-ni.c              |  3 --
 5 files changed, 1 insertion(+), 101 deletions(-)

diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index 3345940..e885131 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -298,9 +298,6 @@ struct lnet_lnd {
 	/* notification of peer down */
 	void (*lnd_notify_peer_down)(lnet_nid_t peer);
 
-	/* query of peer aliveness */
-	void (*lnd_query)(struct lnet_ni *ni, lnet_nid_t peer, time64_t *when);
-
 	/* accept a new connection */
 	int (*lnd_accept)(struct lnet_ni *ni, struct socket *sock);
 };
diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.c b/net/lnet/klnds/o2iblnd/o2iblnd.c
index 7bf2883..196ea4d 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.c
@@ -1128,37 +1128,6 @@ static int kiblnd_ctl(struct lnet_ni *ni, unsigned int cmd, void *arg)
 	return rc;
 }
 
-static void kiblnd_query(struct lnet_ni *ni, lnet_nid_t nid, time64_t *when)
-{
-	time64_t last_alive = 0;
-	time64_t now = ktime_get_seconds();
-	rwlock_t *glock = &kiblnd_data.kib_global_lock;
-	struct kib_peer_ni *peer_ni;
-	unsigned long flags;
-
-	read_lock_irqsave(glock, flags);
-
-	peer_ni = kiblnd_find_peer_locked(ni, nid);
-	if (peer_ni)
-		last_alive = peer_ni->ibp_last_alive;
-
-	read_unlock_irqrestore(glock, flags);
-
-	if (last_alive)
-		*when = last_alive;
-
-	/*
-	 * peer_ni is not persistent in hash, trigger peer_ni creation
-	 * and connection establishment with a NULL tx
-	 */
-	if (!peer_ni)
-		kiblnd_launch_tx(ni, NULL, nid);
-
-	CDEBUG(D_NET, "peer_ni %s %p, alive %lld secs ago\n",
-	       libcfs_nid2str(nid), peer_ni,
-	       last_alive ? now - last_alive : -1);
-}
-
 static void kiblnd_free_pages(struct kib_pages *p)
 {
 	int npages = p->ibp_npages;
@@ -3125,7 +3094,6 @@ static int kiblnd_startup(struct lnet_ni *ni)
 	.lnd_startup	= kiblnd_startup,
 	.lnd_shutdown	= kiblnd_shutdown,
 	.lnd_ctl	= kiblnd_ctl,
-	.lnd_query	= kiblnd_query,
 	.lnd_send	= kiblnd_send,
 	.lnd_recv	= kiblnd_recv,
 };
diff --git a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
index 67780d0..087657c 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -2684,7 +2684,7 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 	 * attempts (active or passive) are in progress
 	 * NB: reconnect is still needed even when ibp_tx_queue is
 	 * empty if ibp_version != version because reconnect may be
-	 * initiated by kiblnd_query()
+	 * initiated.
 	 */
 	reconnect = (!list_empty(&peer_ni->ibp_tx_queue) ||
 		     peer_ni->ibp_version != version) &&
diff --git a/net/lnet/klnds/socklnd/socklnd.c b/net/lnet/klnds/socklnd/socklnd.c
index 7abb75a..d967958 100644
--- a/net/lnet/klnds/socklnd/socklnd.c
+++ b/net/lnet/klnds/socklnd/socklnd.c
@@ -1789,67 +1789,6 @@ struct ksock_peer_ni *
 	 */
 }
 
-void
-ksocknal_query(struct lnet_ni *ni, lnet_nid_t nid, time64_t *when)
-{
-	int connect = 1;
-	time64_t last_alive = 0;
-	time64_t now = ktime_get_seconds();
-	struct ksock_peer_ni *peer_ni = NULL;
-	rwlock_t *glock = &ksocknal_data.ksnd_global_lock;
-	struct lnet_process_id id = {
-		.nid = nid,
-		.pid = LNET_PID_LUSTRE,
-	};
-
-	read_lock(glock);
-
-	peer_ni = ksocknal_find_peer_locked(ni, id);
-	if (peer_ni) {
-		struct ksock_conn *conn;
-		int bufnob;
-
-		list_for_each_entry(conn, &peer_ni->ksnp_conns, ksnc_list) {
-			bufnob = conn->ksnc_sock->sk->sk_wmem_queued;
-
-			if (bufnob < conn->ksnc_tx_bufnob) {
-				/* something got ACKed */
-				conn->ksnc_tx_deadline = ktime_get_seconds() +
-							 lnet_get_lnd_timeout();
-				peer_ni->ksnp_last_alive = now;
-				conn->ksnc_tx_bufnob = bufnob;
-			}
-		}
-
-		last_alive = peer_ni->ksnp_last_alive;
-		if (!ksocknal_find_connectable_route_locked(peer_ni))
-			connect = 0;
-	}
-
-	read_unlock(glock);
-
-	if (last_alive)
-		*when = last_alive * HZ;
-
-	CDEBUG(D_NET, "peer_ni %s %p, alive %lld secs ago, connect %d\n",
-	       libcfs_nid2str(nid), peer_ni,
-	       last_alive ? now - last_alive : -1,
-	       connect);
-
-	if (!connect)
-		return;
-
-	ksocknal_add_peer(ni, id, LNET_NIDADDR(nid), lnet_acceptor_port());
-
-	write_lock_bh(glock);
-
-	peer_ni = ksocknal_find_peer_locked(ni, id);
-	if (peer_ni)
-		ksocknal_launch_all_connections_locked(peer_ni);
-
-	write_unlock_bh(glock);
-}
-
 static void
 ksocknal_push_peer(struct ksock_peer_ni *peer_ni)
 {
@@ -2775,7 +2714,6 @@ static void __exit ksocklnd_exit(void)
 	.lnd_send		= ksocknal_send,
 	.lnd_recv		= ksocknal_recv,
 	.lnd_notify_peer_down	= ksocknal_notify_gw_down,
-	.lnd_query		= ksocknal_query,
 	.lnd_accept		= ksocknal_accept,
 };
 
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 8f59266..ea23471 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -2304,9 +2304,6 @@ static void lnet_push_target_fini(void)
 		if (rc < 0)
 			goto failed1;
 
-		LASSERT(ni->ni_net->net_tunables.lct_peer_timeout <= 0 ||
-			ni->ni_net->net_lnd->lnd_query);
-
 		lnet_ni_addref(ni);
 		list_add_tail(&ni->ni_netlist, &local_ni_list);
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 622/622] lnet: use conservative health timeouts
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (620 preceding siblings ...)
  2020-02-27 21:18 ` [lustre-devel] [PATCH 621/622] lnet: remove lnd_query interface James Simmons
@ 2020-02-27 21:18 ` James Simmons
  2020-04-24  6:01 ` [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 NeilBrown
  622 siblings, 0 replies; 626+ messages in thread
From: James Simmons @ 2020-02-27 21:18 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Use more conservative lnet_transaction_timeout and lnet_retry_count
values by default.  Currently with timeout=10 and retry=3 there is
only a 3s window for the RPC to be sent before it is timed out.
This has caused fault injection rather than fault tolerance.
Increase the default timeout to 50s with retry=2, which is hopefully
long enough to cover virtually all uses, but still allows LNet Health
to be enabled by default and resend before Lustre times out itself.

Fixes: d24c948e4467 ("lnet: setup health timeout defaults")

WC-bug-id: https://jira.whamcloud.com/browse/LU-13145
Lustre-commit: 361e9eaef13c ("LU-13145 lnet: use conservative health timeouts")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/37430
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/api-ni.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index ea23471..10ade73 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -141,7 +141,7 @@ static int recovery_interval_set(const char *val,
 		 "Set to 1 to drop asymmetrical route messages.");
 
 #define LNET_TRANSACTION_TIMEOUT_NO_HEALTH_DEFAULT 50
-#define LNET_TRANSACTION_TIMEOUT_HEALTH_DEFAULT 10
+#define LNET_TRANSACTION_TIMEOUT_HEALTH_DEFAULT 50
 
 unsigned int lnet_transaction_timeout = LNET_TRANSACTION_TIMEOUT_HEALTH_DEFAULT;
 static int transaction_to_set(const char *val, const struct kernel_param *kp);
@@ -156,7 +156,7 @@ static int recovery_interval_set(const char *val,
 MODULE_PARM_DESC(lnet_transaction_timeout,
 		 "Maximum number of seconds to wait for a peer response.");
 
-#define LNET_RETRY_COUNT_HEALTH_DEFAULT 3
+#define LNET_RETRY_COUNT_HEALTH_DEFAULT 2
 unsigned int lnet_retry_count = LNET_RETRY_COUNT_HEALTH_DEFAULT;
 static int retry_count_set(const char *val, const struct kernel_param *kp);
 static struct kernel_param_ops param_ops_retry_count = {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52
  2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
                   ` (621 preceding siblings ...)
  2020-02-27 21:18 ` [lustre-devel] [PATCH 622/622] lnet: use conservative health timeouts James Simmons
@ 2020-04-24  6:01 ` NeilBrown
  2020-04-28  1:04   ` James Simmons
  622 siblings, 1 reply; 626+ messages in thread
From: NeilBrown @ 2020-04-24  6:01 UTC (permalink / raw)
  To: lustre-devel

On Thu, Feb 27 2020, James Simmons wrote:

> These patches need to be applied to the lustre-backport branch
> starting at commit a436653f641e4b3e2841f38113620535e918dd3f.

Hi James et al,
 I applied these patches on top of a43.... then added the other patches I
 had, and looked for differences from my previous tree.
 I found a bunch of improvements you had made, plus some errors that
 came through your patches, plus some other problems that already
 existed...

 I've then added patches from OpenSFS master to get up to date.
 The result is now on my github tree in the lustre/lustre branch.  It
 has 226 patches on top of the set you posted.

 git log 823691f8d49c~3..823691f8d49c

 will show you some errors I fixed, in case you are interested.

NeilBrown
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 832 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20200424/d1f49362/attachment.sig>

^ permalink raw reply	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52
  2020-04-24  6:01 ` [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 NeilBrown
@ 2020-04-28  1:04   ` James Simmons
  2020-04-29  3:32     ` NeilBrown
  0 siblings, 1 reply; 626+ messages in thread
From: James Simmons @ 2020-04-28  1:04 UTC (permalink / raw)
  To: lustre-devel


> > These patches need to be applied to the lustre-backport branch
> > starting at commit a436653f641e4b3e2841f38113620535e918dd3f.
> 
> Hi James et al,
>  I applied these patches on top of a43.... then added the other patches I
>  had, and looked for differences from my previous tree.
>  I found a bunch of improvements you had made, plus some errors that
>  came through your patches, plus some other problems that already
>  existed...
> 
>  I've then added patches from OpenSFS master to get up to date.
>  The result is now on my github tree in the lustre/lustre branch.  It
>  has 226 patches on top of the set you posted.
> 
>  git log 823691f8d49c~3..823691f8d49c
> 
>  will show you some errors I fixed, in case you are interested.

I grabbed those fixes and applied them to my tree. It's nice to see my
tree, your tree, and the OpenSFS branch starting to merge. I updated
my tree to the same OpenSFS commit as your tree and have started to look
at the differences. I noticed the sync with my tree didn't exactly match up.
You lost a few changes. Some of the major changes missing are

1) LU-9679 llite: Discard LUSTRE_FPRIVATE()
   OpenSFS hash : 9e5cb57addbb5d7bc1596096821ad8dcac7a939b

2) LU-13274 uapi: make lnet UAPI headers C99 compliant 
   OpenSFS hash: 742897a967cff5be53c447d14b17ae405c2b31f2

Also there are some kmem_cache bugs that cause several of the sanity
tests to crash the client. I need to do more comparing of our trees to
sort out the changes that haven't landed or been pushed to the OpenSFS
branch to figure out these regressions. I can easily see all the patches
in your tree landing in OpenSFS as well as in my tree in the next 4 to 8
weeks, so we could sort everything out once everything has landed. I also
need to pull a few changes you have into my tree, for cleanups like the
fid layer cleanups that seem to have gotten lost, most likely from people
doing cleanups while in staging.

Now for some important info. I noticed sanity tests failing recently,
which I tracked down to not having updated my lustre utilities. This is
going to be a problem, so to make it painless I created a few patches
to enable building the lustre utilities alone against the linux
client. If you apply the following patches:

https://review.whamcloud.com/#/c/38369
https://review.whamcloud.com/#/c/38370
https://review.whamcloud.com/#/c/38105
https://review.whamcloud.com/#/c/36603
https://review.whamcloud.com/#/c/34954

Now for the 33603 patch: it exposes a regression with LSOM and
open_by_handle_at(). This patch removes the need for dot_lustre, which is
broken anyway for submounts (filesets). The 'dot_lustre' in the UAPI
header is used by general utilities as well as the kernel server code.
Since it had no use in the linux client it was nuked :-) Instead of
restoring it we can resolve the LSOM issues for the 33603 patch.

Once you apply the above patches just run:

sh autogen.sh
./configure --disable-modules --disable-server
make rpms

and you will have proper rpms to use with the linux client. I recommend
updating your utilities as well every time you update the linux client.

^ permalink raw reply	[flat|nested] 626+ messages in thread

* [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52
  2020-04-28  1:04   ` James Simmons
@ 2020-04-29  3:32     ` NeilBrown
  0 siblings, 0 replies; 626+ messages in thread
From: NeilBrown @ 2020-04-29  3:32 UTC (permalink / raw)
  To: lustre-devel

On Tue, Apr 28 2020, James Simmons wrote:

>> > These patches need to be applied to the lustre-backport branch
>> > starting at commit a436653f641e4b3e2841f38113620535e918dd3f.
>> 
>> Hi James et al,
>>  I applied these patches on top of a43.... then added the other patches I
>>  had, and looked for differences from my previous tree.
>>  I found a bunch of improvements you had made, plus some errors that
>>  came through your patches, plus some other problems that already
>>  existed...
>> 
>>  I've then added patches from OpenSFS master to get up to date.
>>  The result is now on my github tree in the lustre/lustre branch.  It
>>  has 226 patches on top of the set you posted.
>> 
>>  git log 823691f8d49c~3..823691f8d49c
>> 
>>  will show you some errors I fixed, in case you are interested.
>
> I grabbed those fixes and applied them to my tree. It's nice to see my
> tree, your tree, and the OpenSFS branch starting to merge. I updated
> my tree to the same OpenSFS commit as your tree and have started to look
> at the differences. I noticed the sync with my tree didn't exactly match up.
> You lost a few changes. Some of the major changes missing are
>
> 1) LU-9679 llite: Discard LUSTRE_FPRIVATE()
>    OpenSFS hash : 9e5cb57addbb5d7bc1596096821ad8dcac7a939b

Thanks.  I've also added

2bea4a7a3706 LU-5432 fld: don't loop forever on bogus FID sequences
2f66b4903516 LU-7768 fld: Do not retry fld request
8d4ef45e0780 LU-7524 fld: fld_clientlookup retries next target

which I had found by comparison myself but not applied yet.

> 2) LU-13274 uapi: make lnet UAPI headers C99 compliant 
>    OpenSFS hash: 742897a967cff5be53c447d14b17ae405c2b31f2

I have that...
Commit 619f523e5c36 ("lustre: uapi: make lnet UAPI headers C99 compliant")

What did I miss?

>
> Also there are some kmem_cache bugs that cause several of the sanity
> tests to crash the client. I need to do more comparing of our tree and
> sort out the changes that haven't landed or been pushed to the OpenSFS 
> branch to figure out these regressions. I can easily see all the patches
> in your tree landing in OpenSFS as well as my tree in the next 4 to 8 weeks.
> So we could sort everything out once everything has landed. I need to 
> pull a few changes you have as well to my tree for cleanups like for the 
> fid layer that seem to have gotten lost. Most likely from people doing
> cleanups while in staging.
>
> Now for some important info. I noticed sanity tests failing recently
> which I tracked down to me not updating my lustre utilities. This is
> going to be a problem so to make it painless I created a few patches
> to enable building the lustre utilities only against the linux
> client. If you apply the following patches:
>
> https://review.whamcloud.com/#/c/38369
> https://review.whamcloud.com/#/c/38370

This one:
 LU-12511 build: don't use OpenSFS UAPI headers with --disable-modules

doesn't seem like a good thing.  It implies that the UAPI headers in
OpenSFS give different results than the ones in Linux.
If this is true - we need to fix it. The UAPI needs to be stable.

Do you know what the important differences are?

> https://review.whamcloud.com/#/c/38105
> https://review.whamcloud.com/#/c/36603
> https://review.whamcloud.com/#/c/34954
>
> Now for the 33603 patch: it exposes a regression with LSOM and
> open_by_handle_at(). This patch removes the need for dot_lustre which is 
> broken anyway for submounts (filesets). The 'dot_lustre' in the UAPI
> header is used for general utilities as well as the kernel server code.
> Since it had no use in the linux client it was nuked :-) Instead of 
> restoring it we can resolve the LSOM issues for the 33603 patch.
>
> Once you apply the above patches just run:
>
> sh autogen.sh
> ./configure --disable-modules --disable-server
> make rpms
>
> and you will have proper rpms to use with the linux client. I recommend
> updating your utilities as well every time you update the linux client.

Thanks for the thorough response.  I'll keep this in mind when I next
refresh my linux test setup.

Thanks,
NeilBrown
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 832 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20200429/cb1f1b9a/attachment.sig>

^ permalink raw reply	[flat|nested] 626+ messages in thread

end of thread, other threads:[~2020-04-29  3:32 UTC | newest]

Thread overview: 626+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-27 21:07 [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 James Simmons
2020-02-27 21:07 ` [lustre-devel] [PATCH 001/622] lustre: always enable special debugging, fhandles, and quota support James Simmons
2020-02-27 21:07 ` [lustre-devel] [PATCH 002/622] lustre: osc_cache: remove __might_sleep() James Simmons
2020-02-27 21:07 ` [lustre-devel] [PATCH 003/622] lustre: uapi: remove enum hsm_progress_states James Simmons
2020-02-27 21:07 ` [lustre-devel] [PATCH 004/622] lustre: uapi: sync enum obd_statfs_state James Simmons
2020-02-27 21:07 ` [lustre-devel] [PATCH 005/622] lustre: llite: return compatible fsid for statfs James Simmons
2020-02-27 21:07 ` [lustre-devel] [PATCH 006/622] lustre: ldlm: Make kvzalloc | kvfree use consistent James Simmons
2020-02-27 21:07 ` [lustre-devel] [PATCH 007/622] lustre: llite: limit smallest max_cached_mb value James Simmons
2020-02-27 21:07 ` [lustre-devel] [PATCH 008/622] lustre: obdecho: turn on async flag only for mode 3 James Simmons
2020-02-27 21:07 ` [lustre-devel] [PATCH 009/622] lustre: llite: reorganize variable and data structures James Simmons
2020-02-27 21:07 ` [lustre-devel] [PATCH 010/622] lustre: llite: increase whole-file readahead to RPC size James Simmons
2020-02-27 21:07 ` [lustre-devel] [PATCH 011/622] lustre: llite: handle ORPHAN/DEAD directories James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 012/622] lustre: lov: protected ost pool count updation James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 013/622] lustre: obdclass: fix llog_cat_cleanup() usage on Client James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 014/622] lustre: mdc: fix possible NULL pointer dereference James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 015/622] lustre: obdclass: allow specifying complex jobids James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 016/622] lustre: ldlm: don't disable softirq for exp_rpc_lock James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 017/622] lustre: obdclass: new wrapper to convert NID to string James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 018/622] lustre: ptlrpc: Add QoS for uid and gid in NRS-TBF James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 019/622] lustre: hsm: ignore compound_id James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 020/622] lnet: libcfs: remove unnecessary set_fs(KERNEL_DS) James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 021/622] lustre: ptlrpc: ptlrpc_register_bulk() LBUG on ENOMEM James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 022/622] lustre: llite: yield cpu after call to ll_agl_trigger James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 023/622] lustre: osc: Do not request more than 2GiB grant James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 024/622] lustre: llite: rename FSFILT_IOC_* to system flags James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 025/622] lnet: fix nid range format '*@<net>' support James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 026/622] lustre: ptlrpc: fix test_req_buffer_pressure behavior James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 027/622] lustre: lu_object: improve debug message for lu_object_put() James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 028/622] lustre: idl: remove obsolete directory split flags James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 029/622] lustre: mdc: resend quotactl if needed James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 030/622] lustre: obd: create ping sysfs file James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 031/622] lustre: ldlm: change LDLM_POOL_ADD_VAR macro to inline function James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 032/622] lustre: obdecho: use vmalloc for lnb James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 033/622] lustre: mdc: deny layout swap for DoM file James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 034/622] lustre: mgc: remove obsolete IR swabbing workaround James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 035/622] lustre: ptlrpc: add dir migration connect flag James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 036/622] lustre: mds: remove obsolete MDS_VTX_BYPASS flag James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 037/622] lustre: ldlm: expose dirty age limit for flush-on-glimpse James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 038/622] lustre: ldlm: IBITS lock convert instead of cancel James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 039/622] lustre: ptlrpc: fix return type of boolean functions James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 040/622] lustre: llite: decrease sa_running if fail to start statahead James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 041/622] lustre: lmv: dir page is released while in use James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 042/622] lustre: ldlm: speed up preparation for list of lock cancel James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 043/622] lustre: checksum: enable/disable checksum correctly James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 044/622] lustre: build: armv7 client build fixes James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 045/622] lustre: ldlm: fix l_last_activity usage James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 046/622] lustre: ptlrpc: Add WBC connect flag James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 047/622] lustre: llog: remove obsolete llog handlers James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 048/622] lustre: ldlm: fix for l_lru usage James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 049/622] lustre: lov: Move lov_tgts_kobj init to lov_setup James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 050/622] lustre: osc: add T10PI support for RPC checksum James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 051/622] lustre: ldlm: Reduce debug to console during eviction James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 052/622] lustre: ptlrpc: idle connections can disconnect James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 053/622] lustre: osc: truncate does not update blocks count on client James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 054/622] lustre: ptlrpc: add LOCK_CONVERT connection flag James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 055/622] lustre: ldlm: handle lock converts in cancel handler James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 056/622] lustre: ptlrpc: Serialize procfs access to scp_hist_reqs using mutex James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 057/622] lustre: ldlm: don't add canceling lock back to LRU James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 058/622] lustre: quota: add default quota setting support James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 059/622] lustre: ptlrpc: don't zero request handle James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 060/622] lnet: ko2iblnd: determine gaps correctly James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 061/622] lustre: osc: increase default max_dirty_mb to 2G James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 062/622] lustre: ptlrpc: remove obsolete OBD RPC opcodes James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 063/622] lustre: ptlrpc: assign specific values to MGS opcodes James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 064/622] lustre: ptlrpc: remove obsolete LLOG_ORIGIN_* RPCs James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 065/622] lustre: osc: fix idle_timeout handling James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 066/622] lustre: ptlrpc: ASSERTION(!list_empty(imp->imp_replay_cursor)) James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 067/622] lustre: obd: keep dirty_max_pages a round number of MB James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 068/622] lustre: osc: depart grant shrinking from pinger James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 069/622] lustre: mdt: Lazy size on MDT James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 070/622] lustre: lfsck: layout LFSCK for mirrored file James Simmons
2020-02-27 21:08 ` [lustre-devel] [PATCH 071/622] lustre: mdt: read on open for DoM files James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 072/622] lustre: migrate: pack lmv ea in migrate rpc James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 073/622] lustre: hsm: add OBD_CONNECT2_ARCHIVE_ID_ARRAY to pass archive_id lists in array James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 074/622] lustre: llite: handle zero length xattr values correctly James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 075/622] lnet: refactor lnet_select_pathway() James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 076/622] lnet: add health value per ni James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 077/622] lnet: add lnet_health_sensitivity James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 078/622] lnet: add monitor thread James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 079/622] lnet: handle local ni failure James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 080/622] lnet: handle o2iblnd tx failure James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 081/622] lnet: handle socklnd " James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 082/622] lnet: handle remote errors in LNet James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 083/622] lnet: add retry count James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 084/622] lnet: calculate the lnd timeout James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 085/622] lnet: sysfs functions for module params James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 086/622] lnet: timeout delayed REPLYs and ACKs James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 087/622] lnet: remove duplicate timeout mechanism James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 088/622] lnet: handle fatal device error James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 089/622] lnet: reset health value James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 090/622] lnet: add health statistics James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 091/622] lnet: Add ioctl to get health stats James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 092/622] lnet: remove obsolete health functions James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 093/622] lnet: set health value from user space James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 094/622] lnet: add global health statistics James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 095/622] lnet: print recovery queues content James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 096/622] lnet: health error simulation James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 097/622] lustre: ptlrpc: replace simple_strtol with kstrtol James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 098/622] lustre: obd: use correct ip_compute_csum() version James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 099/622] lustre: osc: serialize access to idle_timeout vs cleanup James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 100/622] lustre: mdc: remove obsolete intent opcodes James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 101/622] lustre: llite: fix setstripe for specific osts upon dir James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 102/622] lustre: osc: enable/disable OSC grant shrink James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 103/622] lustre: protocol: MDT as a statfs proxy James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 104/622] lustre: ldlm: correct logic in ldlm_prepare_lru_list() James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 105/622] lustre: llite: check truncate race for DOM pages James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 106/622] lnet: lnd: conditionally set health status James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 107/622] lnet: router handling James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 108/622] lustre: obd: check '-o network' and peer discovery conflict James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 109/622] lnet: update logging James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 110/622] lustre: ldlm: don't cancel DoM locks before replay James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 111/622] lnet: lnd: Clean up logging James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 112/622] lustre: mdt: revoke lease lock for truncate James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 113/622] lustre: ptlrpc: race in AT early reply James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 114/622] lustre: migrate: migrate striped directory James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 115/622] lustre: obdclass: remove unused ll_import_cachep James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 116/622] lustre: ptlrpc: add debugging for idle connections James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 117/622] lustre: obdclass: Add lbug_on_eviction option James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 118/622] lustre: lmv: support accessing migrating directory James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 119/622] lustre: mdc: move RPC semaphore code to lustre/osp James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 120/622] lnet: libcfs: fix wrong check in libcfs_debug_vmsg2() James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 121/622] lustre: ptlrpc: new request vs disconnect race James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 122/622] lustre: misc: name open file handles as such James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 123/622] lustre: ldlm: cleanup LVB handling James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 124/622] lustre: ldlm: pass preallocated env to methods James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 125/622] lustre: osc: move obdo_cache to OSC code James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 126/622] lustre: llite: zero lum for stripeless files James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 127/622] lustre: idl: remove obsolete RPC flags James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 128/622] lustre: flr: add 'nosync' flag for FLR mirrors James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 129/622] lustre: llite: create checksums to replace checksum_pages James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 130/622] lustre: ptlrpc: don't change buffer when signature is ready James Simmons
2020-02-27 21:09 ` [lustre-devel] [PATCH 131/622] lustre: ldlm: update l_blocking_lock under lock James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 132/622] lustre: mgc: don't process cld during stopping James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 133/622] lustre: obdclass: make mod rpc slot wait queue FIFO James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 134/622] lustre: mdc: use old statfs format James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 135/622] lnet: Fix selftest backward compatibility post health James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 136/622] lustre: osc: clarify short_io_bytes is maximum value James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 137/622] lustre: ptlrpc: Make CPU binding switchable James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 138/622] lustre: misc: quiet console messages at startup James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 139/622] lustre: ldlm: don't apply ELC to converting and DOM locks James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 140/622] lustre: class: use INIT_LIST_HEAD_RCU instead of INIT_LIST_HEAD James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 141/622] lustre: uapi: add new changerec_type James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 142/622] lustre: ldlm: check double grant race after resource change James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 143/622] lustre: mdc: grow lvb buffer to hold layout James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 144/622] lustre: osc: re-check target versus available grant James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 145/622] lnet: unlink md if fail to send recovery James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 146/622] lustre: obd: use correct names for conn_uuid James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 147/622] lustre: idl: use proper ATTR/MDS_ATTR/MDS_OPEN flags James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 148/622] lustre: llite: optimize read on open pages James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 149/622] lnet: set the health status correctly James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 150/622] lustre: lov: add debugging info for statfs James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 151/622] lnet: Decrement health on timeout James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 152/622] lustre: quota: fix setattr project check James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 153/622] lnet: socklnd: dynamically set LND parameters James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 154/622] lustre: flr: add mirror write command James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 155/622] lnet: properly error check sensitivity James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 156/622] lustre: llite: add lock for dir layout data James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 157/622] lnet: configure recovery interval James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 158/622] lustre: osc: Do not walk full extent list James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 159/622] lnet: separate ni state from recovery James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 160/622] lustre: mdc: move empty xattr handling to mdc layer James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 161/622] lustre: obd: remove portals handle from OBD import James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 162/622] lustre: mgc: restore mgc binding for sptlrpc James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 163/622] lnet: peer deletion code may hide error James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 164/622] lustre: hsm: make changelog flag argument an enum James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 165/622] lustre: ldlm: don't skip bl_ast for local lock James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 166/622] lustre: clio: use pagevec_release for many pages James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 167/622] lustre: lmv: allocate fid on parent MDT in migrate James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 168/622] lustre: ptlrpc: Do not map unrecognized ELDLM errnos to EIO James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 169/622] lustre: llite: protect reading inode->i_data.nrpages James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 170/622] lustre: mdt: fix read-on-open for big PAGE_SIZE James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 171/622] lustre: llite: handle -ENODATA in ll_layout_fetch() James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 172/622] lustre: hsm: increase upper limit of maximum HSM backends registered with MDT James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 173/622] lustre: osc: wrong page offset for T10PI checksum James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 174/622] lnet: increase lnet transaction timeout James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 175/622] lnet: handle multi-md usage James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 176/622] lustre: uapi: fix warnings when lustre_user.h included James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 177/622] lustre: obdclass: lu_dirent record length missing '0' James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 178/622] lustre: update version to 2.11.99 James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 179/622] lustre: osc: limit chunk number of write submit James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 180/622] lustre: osc: speed up page cache cleanup during blocking ASTs James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 181/622] lustre: lmv: Fix style issues for lmv_fld.c James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 182/622] lustre: llite: Fix style issues for llite_nfs.c James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 183/622] lustre: llite: Fix style issues for lcommon_misc.c James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 184/622] lustre: llite: Fix style issues for symlink.c James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 185/622] lustre: headers: define pct(a, b) once James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 186/622] lustre: obdclass: report all obd states for OBD_IOC_GETDEVICE James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 187/622] lustre: ldlm: remove trace from ldlm_pool_count() James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 188/622] lustre: ptlrpc: clean up rq_interpret_reply callbacks James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 189/622] lustre: lov: quiet lov_dump_lmm_ console messages James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 190/622] lustre: lov: cl_cache could miss initialize James Simmons
2020-02-27 21:10 ` [lustre-devel] [PATCH 191/622] lnet: socklnd: improve scheduling algorithm James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 192/622] lustre: ldlm: Adjust search_* functions James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 193/622] lustre: sysfs: make ping sysfs file read and writable James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 194/622] lustre: ptlrpc: connect vs import invalidate race James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 195/622] lustre: ptlrpc: always unregister bulk James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 196/622] lustre: sptlrpc: split sptlrpc_process_config() James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 197/622] lustre: cfg: reserve flags for SELinux status checking James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 198/622] lustre: llite: remove cl_file_inode_init() LASSERT James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 199/622] lnet: add fault injection for bulk transfers James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 200/622] lnet: remove .nf_min_max handling James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 201/622] lustre: sec: create new function sptlrpc_get_sepol() James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 202/622] lustre: clio: fix incorrect invariant in cl_io_iter_fini() James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 203/622] lustre: mdc: Improve xattr buffer allocations James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 204/622] lnet: libcfs: allow file/func/line passed to CDEBUG() James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 205/622] lustre: llog: add startcat for wrapped catalog James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 206/622] lustre: llog: add synchronization for the last record James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 207/622] lustre: ptlrpc: improve memory allocation for service RPCs James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 208/622] lustre: llite: enable flock mount option by default James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 209/622] lustre: lmv: avoid gratuitous 64-bit modulus James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 210/622] lustre: Ensure crc-t10pi is enabled James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 211/622] lustre: lov: fix lov_iocontrol for inactive OST case James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 212/622] lustre: llite: Initialize cl_dirty_max_pages James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 213/622] lustre: mdc: don't use ACL at setattr James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 214/622] lnet: o2iblnd: ibc_rxs is created and freed with different size James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 215/622] lustre: osc: reduce atomic ops in osc_enter_cache_try James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 216/622] lustre: llite: ll_fault should fail for insane file offsets James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 217/622] lustre: ptlrpc: reset generation for old requests James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 218/622] lustre: osc: check if opg is in lru list without locking James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 219/622] lnet: use right rtr address James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 220/622] lnet: use right address for routing message James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 221/622] lustre: lov: avoid signed vs. unsigned comparison James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 222/622] lustre: obd: use ldo_process_config for mdc and osc layer James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 223/622] lnet: check for asymmetrical route messages James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 224/622] lustre: llite: Lock inode on tiny write if setuid/setgid set James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 225/622] lustre: llite: make sure name pack atomic James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 226/622] lustre: ptlrpc: handle proper import states for recovery James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 227/622] lustre: ldlm: don't convert wrong resource James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 228/622] lustre: llite: limit statfs ffree if less than OST ffree James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 229/622] lustre: mdc: prevent glimpse lock count grow James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 230/622] lustre: dne: performance improvement for file creation James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 231/622] lustre: mdc: return DOM size on open resend James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 232/622] lustre: llite: optimizations for not granted lock processing James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 233/622] lustre: osc: propagate grant shrink interval immediately James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 234/622] lustre: osc: grant shrink shouldn't account skipped OSC James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 235/622] lustre: quota: protect quota flags at OSC James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 236/622] lustre: osc: pass client page size during reconnect too James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 237/622] lustre: ptlrpc: Change static defines to use macro for sec_gc.c James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 238/622] lnet: libcfs: do not calculate debug_mb if it is set James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 239/622] lustre: ldlm: Lost lease lock on migrate error James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 240/622] lnet: lnd: increase CQ entries James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 241/622] lustre: security: return security context for metadata ops James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 242/622] lustre: grant: prevent overflow of o_undirty James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 243/622] lustre: ptlrpc: manage SELinux policy info at connect time James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 244/622] lustre: ptlrpc: manage SELinux policy info for metadata ops James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 245/622] lustre: obd: make health_check sysfs compliant James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 246/622] lustre: misc: delete OBD_IOC_PING_TARGET ioctl James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 247/622] lustre: misc: remove LIBCFS_IOC_DEBUG_MASK ioctl James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 248/622] lustre: llite: add file heat support James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 249/622] lustre: obdclass: improve llog config record message James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 250/622] lustre: lov: remove KEY_CACHE_SET to simplify the code James Simmons
2020-02-27 21:11 ` [lustre-devel] [PATCH 251/622] lustre: ldlm: Fix style issues for ldlm_lockd.c James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 252/622] lustre: ldlm: Fix style issues for ldlm_request.c James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 253/622] lustre: ptlrpc: Fix style issues for sec_bulk.c James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 254/622] lustre: ldlm: Fix style issues for ptlrpcd.c James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 255/622] lustre: ptlrpc: IR doesn't reconnect after EAGAIN James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 256/622] lustre: llite: ll_fault fixes James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 257/622] lustre: lsom: Add an OBD_CONNECT2_LSOM connect flag James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 258/622] lustre: pcc: Reserve a new connection flag for PCC James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 259/622] lustre: uapi: reserve connect flag for plain layout James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 260/622] lustre: ptlrpc: allow stopping threads above threads_max James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 261/622] lnet: Avoid lnet debugfs read/write if ctl_table does not exist James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 262/622] lnet: lnd: bring back concurrent_sends James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 263/622] lnet: properly cleanup lnet debugfs files James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 264/622] lustre: mdc: reset lmm->lmm_stripe_offset in mdc_save_lovea James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 265/622] lnet: Cleanup lnet_get_rtr_pool_cfg James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 266/622] lustre: quota: make overquota flag for old req James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 267/622] lustre: osd: Set max ea size to XATTR_SIZE_MAX James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 268/622] lustre: lov: Remove unnecessary assert James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 269/622] lnet: o2iblnd: kib_conn leak James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 270/622] lustre: llite: switch to use ll_fsname directly James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 271/622] lustre: llite: improve max_readahead console messages James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 272/622] lustre: llite: fill copied dentry name's ending char properly James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 273/622] lustre: obd: update udev event handling James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 274/622] lustre: ptlrpc: Bulk assertion fails on -ENOMEM James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 275/622] lustre: obd: Add overstriping CONNECT flag James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 276/622] lustre: llite, readahead: fix to call ll_ras_enter() properly James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 277/622] lustre: ptlrpc: ASSERTION (req_transno < next_transno) failed James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 278/622] lustre: lov: new foreign LOV format James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 279/622] lustre: lmv: new foreign LMV format James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 280/622] lustre: obd: replace class_uuid with linux kernel version James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 281/622] lustre: ptlrpc: Fix style issues for sec_null.c James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 282/622] lustre: ptlrpc: Fix style issues for service.c James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 283/622] lustre: uapi: fix file heat support James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 284/622] lnet: libcfs: poll fail_loc in cfs_fail_timeout_set() James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 285/622] lustre: obd: round values to nearest MiB for *_mb sysfs files James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 286/622] lustre: osc: don't check capability for every page James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 287/622] lustre: statahead: sa_handle_callback get lli_sa_lock earlier James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 288/622] lnet: use number of wrs to calculate CQEs James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 289/622] lustre: ldlm: Fix style issues for ldlm_resource.c James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 290/622] lustre: ptlrpc: Fix style issues for sec_gc.c James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 291/622] lustre: ptlrpc: Fix style issues for llog_client.c James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 292/622] lustre: dne: allow access to striped dir with broken layout James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 293/622] lustre: ptlrpc: ocd_connect_flags are wrong during reconnect James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 294/622] lnet: libcfs: fix panic for too large cpu partitions James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 295/622] lustre: obdclass: put all service's env on the list James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 296/622] lustre: mdt: fix mdt_dom_discard_data() timeouts James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 297/622] lustre: lov: Add overstriping support James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 298/622] lustre: rpc: support maximum 64MB I/O RPC James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 299/622] lustre: dom: per-resource ELC for WRITE lock enqueue James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 300/622] lustre: dom: mdc_lock_flush() improvement James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 301/622] lnet: Fix NI status in debugfs for loopback ni James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 302/622] lustre: ptlrpc: Add more flags to DEBUG_REQ_FLAGS macro James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 303/622] lustre: llite: Revalidate dentries in ll_intent_file_open James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 304/622] lustre: llite: hash just created files if lock allows James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 305/622] lnet: adds checking msg len James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 306/622] lustre: dne: add new dir hash type "space" James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 307/622] lustre: uapi: Add nonrotational flag to statfs James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 308/622] lnet: libcfs: crashes with certain cpu part numbers James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 309/622] lustre: lov: fix wrong calculated length for fiemap James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 310/622] lustre: obdclass: remove unprotected access to lu_object James Simmons
2020-02-27 21:12 ` [lustre-devel] [PATCH 311/622] lustre: push rcu_barrier() before destroying slab James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 312/622] lustre: ptlrpc: intent_getattr fetches default LMV James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 313/622] lustre: mdc: add async statfs James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 314/622] lustre: lmv: mkdir with balanced space usage James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 315/622] lustre: llite: check correct size in ll_dom_finish_open() James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 316/622] lnet: recovery event handling broken James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 317/622] lnet: clean mt_eqh properly James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 318/622] lnet: handle remote health error James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 319/622] lnet: setup health timeout defaults James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 320/622] lnet: fix cpt locking James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 321/622] lnet: detach response tracker James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 322/622] lnet: invalidate recovery ping mdh James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 323/622] lnet: fix list corruption James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 324/622] lnet: correct discovery LNetEQFree() James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 325/622] lnet: Protect lp_dc_pendq manipulation with lp_lock James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 326/622] lnet: Ensure md is detached when msg is not committed James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 327/622] lnet: verify msg is committed for send/recv James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 328/622] lnet: select LO interface for sending James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 329/622] lnet: remove route add restriction James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 330/622] lnet: Discover routers on first use James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 331/622] lnet: use peer for gateway James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 332/622] lnet: lnet_add/del_route() James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 333/622] lnet: Do not allow deleting of router nis James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 334/622] lnet: router sensitivity James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 335/622] lnet: cache ni status James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 336/622] lnet: Cache the routing feature James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 337/622] lnet: peer aliveness James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 338/622] lnet: router aliveness James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 339/622] lnet: simplify lnet_handle_local_failure() James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 340/622] lnet: Cleanup rcd James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 341/622] lnet: modify lnd notification mechanism James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 342/622] lnet: use discovery for routing James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 343/622] lnet: MR aware gateway selection James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 344/622] lnet: consider alive_router_check_interval James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 345/622] lnet: allow deleting router primary_nid James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 346/622] lnet: transfer routers James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 347/622] lnet: handle health for incoming messages James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 348/622] lnet: misleading discovery seqno James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 349/622] lnet: drop all rule James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 350/622] lnet: handle discovery off James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 351/622] lnet: handle router health off James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 352/622] lnet: push router interface updates James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 353/622] lnet: net aliveness James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 354/622] lnet: discover each gateway Net James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 355/622] lnet: look up MR peers routes James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 356/622] lnet: check peer timeout on a router James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 357/622] lustre: lmv: reuse object alloc QoS code from LOD James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 358/622] lustre: llite: Add persistent cache on client James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 359/622] lustre: pcc: Non-blocking PCC caching James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 360/622] lustre: pcc: security and permission for non-root user access James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 361/622] lustre: llite: Rule based auto PCC caching when create files James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 362/622] lustre: pcc: auto attach during open for valid cache James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 363/622] lustre: pcc: change detach behavior and add keep option James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 364/622] lustre: lov: return error if cl_env_get fails James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 365/622] lustre: ptlrpc: Add more flags to DEBUG_REQ_FLAGS macro James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 366/622] lustre: ldlm: layout lock fixes James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 367/622] lnet: Do not allow gateways on remote nets James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 368/622] lustre: osc: reduce lock contention in osc_unreserve_grant James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 369/622] lnet: Change static defines to use macro for module.c James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 370/622] lustre: llite, readahead: don't always use max RPC size James Simmons
2020-02-27 21:13 ` [lustre-devel] [PATCH 371/622] lustre: llite: improve single-thread read performance James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 372/622] lustre: obdclass: allow per-session jobids James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 373/622] lustre: llite: fix deadloop with tiny write James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 374/622] lnet: prevent loop in LNetPrimaryNID() James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 375/622] lustre: ldlm: Fix style issues for ldlm_lib.c James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 376/622] lustre: obdclass: protect imp_sec using rwlock_t James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 377/622] lustre: llite: console message for disabled flock call James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 378/622] lustre: ptlrpc: Add increasing XIDs CONNECT2 flag James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 379/622] lustre: ptlrpc: don't reset lru_resize on idle reconnect James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 380/622] lnet: use after free in lnet_discover_peer_locked() James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 381/622] lustre: obdclass: generate random u64 max correctly James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 382/622] lnet: fix peer ref counting James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 383/622] lustre: llite: collect debug info for ll_fsync James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 384/622] lustre: obdclass: use RCU to release lu_env_item James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 385/622] lustre: mdt: improve IBITS lock definitions James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 386/622] lustre: uapi: change "space" hash type to hash flag James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 387/622] lustre: osc: cancel osc_lock list traversal once found the lock is being used James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 388/622] lustre: obdclass: add comment for rcu handling in lu_env_remove James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 389/622] lnet: honor discovery setting James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 390/622] lustre: obdclass: don't send multiple statfs RPCs James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 391/622] lustre: lov: Correct bounds checking James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 392/622] lustre: lu_object: Add missed qos_rr_init James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 393/622] lustre: fld: let caller retry FLD_QUERY James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 394/622] lustre: llite: make sure readahead cover current read James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 395/622] lustre: ptlrpc: Add jobid to rpctrace debug messages James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 396/622] lnet: libcfs: Reduce memory frag due to HA debug msg James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 397/622] lustre: ptlrpc: change IMPORT_SET_* macros into real functions James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 398/622] lustre: uapi: add unused enum obd_statfs_state James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 399/622] lustre: llite: create obd_device with usercopy whitelist James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 400/622] lnet: warn if discovery is off James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 401/622] lustre: ldlm: always cancel aged locks regardless of lru resize being enabled or disabled James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 402/622] lustre: llite: cleanup stats of LPROC_LL_* James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 403/622] lustre: osc: Do not assert for first extent James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 404/622] lustre: llite: MS_* flags and SB_* flags split James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 405/622] lustre: llite: improve ll_dom_lock_cancel James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 406/622] lustre: llite: swab LOV EA user data James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 407/622] lustre: clio: support custom csi_end_io handler James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 408/622] lustre: llite: release active extent on sync write commit James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 409/622] lustre: obd: harden debugfs handling James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 410/622] lustre: obd: add rmfid support James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 411/622] lnet: Convert noisy timeout error to cdebug James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 412/622] lnet: Misleading error from lnet_is_health_check James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 413/622] lustre: llite: do not cache write open lock for exec file James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 414/622] lustre: mdc: polling mode for changelog reader James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 415/622] lnet: Sync the start of discovery and monitor threads James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 416/622] lustre: llite: don't check vmpage refcount in ll_releasepage() James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 417/622] lnet: Deprecate live and dead router check params James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 418/622] lnet: Detach rspt when md_threshold is infinite James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 419/622] lnet: Return EHOSTUNREACH for unreachable gateway James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 420/622] lustre: ptlrpc: Don't get jobid in body_v2 James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 421/622] lnet: Defer rspt cleanup when MD queued for unlink James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 422/622] lustre: lov: Correct write_intent end for trunc James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 423/622] lustre: mdc: hold lock while walking changelog dev list James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 424/622] lustre: import: fix race between imp_state & imp_invalid James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 425/622] lnet: support non-default network namespace James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 426/622] lustre: obdclass: 0-nlink race in lu_object_find_at() James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 427/622] lustre: osc: reserve lru pages for read in batch James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 428/622] lustre: uapi: Make lustre_user.h c++-legal James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 429/622] lnet: create existing net returns EEXIST James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 430/622] lustre: obdecho: reuse a cl env cache for obdecho survey James Simmons
2020-02-27 21:14 ` [lustre-devel] [PATCH 431/622] lustre: mdc: dir page ldp_hash_end mistakenly adjusted James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 432/622] lnet: handle unlink before send completes James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 433/622] lustre: osc: layout and chunkbits alignment mismatch James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 434/622] lnet: handle recursion in resend James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 435/622] lustre: llite: forget cached ACLs properly James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 436/622] lustre: osc: Fix dom handling in weight_ast James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 437/622] lustre: llite: Fix extents_stats James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 438/622] lustre: llite: don't miss every first stride page James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 439/622] lustre: llite: swab LOV EA data in ll_getxattr_lov() James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 440/622] lustre: llite: Mark lustre_inode_cache as reclaimable James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 441/622] lustre: osc: add preferred checksum type support James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 442/622] lustre: ptlrpc: Stop sending ptlrpc_body_v2 James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 443/622] lnet: Fix style issues for selftest/rpc.c James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 444/622] lnet: Fix style issues for module.c conctl.c James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 445/622] lustre: ptlrpc: check lm_bufcount and lm_buflen James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 446/622] lustre: uapi: Remove unused CONNECT flag James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 447/622] lustre: lmv: disable remote file statahead James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 448/622] lustre: llite: Fix page count for unaligned reads James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 449/622] lnet: discovery off route state update James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 450/622] lustre: llite: prevent multiple group locks James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 451/622] lustre: ptlrpc: make DEBUG_REQ messages consistent James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 452/622] lustre: ptlrpc: check buffer length in lustre_msg_string() James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 453/622] lustre: uapi: fix building fail against Power9 little endian James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 454/622] lustre: ptlrpc: fix reply buffers shrinking and growing James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 455/622] lustre: dom: manual OST-to-DOM migration via mirroring James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 456/622] lustre: fld: remove fci_no_shrink field James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 457/622] lustre: lustre: remove ldt_obd_type field of lu_device_type James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 458/622] lustre: lustre: remove imp_no_timeout field James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 459/622] lustre: llog: remove olg_cat_processing field James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 460/622] lustre: ptlrpc: remove struct ptlrpc_bulk_page James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 461/622] lustre: ptlrpc: remove bd_import_generation field James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 462/622] lustre: ptlrpc: remove srv_threads from struct ptlrpc_service James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 463/622] lustre: ptlrpc: remove scp_nthrs_stopping field James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 464/622] lustre: ldlm: remove unused ldlm_server_conn James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 465/622] lustre: llite: remove lli_readdir_mutex James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 466/622] lustre: llite: remove ll_umounting field James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 467/622] lustre: llite: align field names in ll_sb_info James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 468/622] lustre: llite: remove lti_iter field James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 469/622] lustre: llite: remove ft_mtime field James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 470/622] lustre: llite: remove sub_reenter field James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 471/622] lustre: osc: remove oti_descr oti_handle oti_plist James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 472/622] lustre: osc: remove oe_next_page James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 473/622] lnet: o2iblnd: remove some unused fields James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 474/622] lnet: socklnd: remove ksnp_sharecount James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 475/622] lustre: llite: extend readahead locks for striped file James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 476/622] lustre: llite: Improve readahead RPC issuance James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 477/622] lustre: lov: Move page index to top level James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 478/622] lustre: readahead: convert stride page index to byte James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 479/622] lustre: osc: prevent use after free James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 480/622] lustre: mdc: hold obd while processing changelog James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 481/622] lnet: change ln_mt_waitq to a completion James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 482/622] lustre: obdclass: align to T10 sector size when generating guard James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 483/622] lustre: ptlrpc: Hold imp lock for idle reconnect James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 484/622] lustre: osc: glimpse - search for active lock James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 485/622] lustre: lmv: use lu_tgt_descs to manage tgts James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 486/622] lustre: lmv: share object alloc QoS code with LMV James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 487/622] lustre: import: Fix missing spin_unlock() James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 488/622] lnet: o2iblnd: Make credits hiw connection aware James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 489/622] lustre: obdecho: avoid panic with partially initialized object James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 490/622] lnet: o2iblnd: cache max_qp_wr James Simmons
2020-02-27 21:15 ` [lustre-devel] [PATCH 491/622] lustre: som: integrate LSOM with lfs find James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 492/622] lustre: llite: error handling of ll_och_fill() James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 493/622] lnet: Don't queue msg when discovery has completed James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 494/622] lnet: Use alternate ping processing for non-mr peers James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 495/622] lustre: obdclass: qos penalties miscalculated James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 496/622] lustre: osc: wrong cache of LVB attrs James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 497/622] lustre: osc: wrong cache of LVB attrs, part2 James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 498/622] lustre: vvp: dirty pages with pagevec James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 499/622] lustre: ptlrpc: resend may corrupt the data James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 500/622] lnet: eliminate uninitialized warning James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 501/622] lnet: o2ib: Record rc in debug log on startup failure James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 502/622] lnet: o2ib: Reintroduce kiblnd_dev_search James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 503/622] lustre: ptlrpc: fix watchdog ratelimit logic James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 504/622] lustre: flr: avoid reading unhealthy mirror James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 505/622] lustre: obdclass: lu_tgt_descs cleanup James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 506/622] lustre: ptlrpc: Properly swab ll_fiemap_info_key James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 507/622] lustre: llite: clear flock when using localflock James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 508/622] lustre: sec: reserve flags for client side encryption James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 509/622] lustre: llite: limit max xattr size by kernel value James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 510/622] lustre: ptlrpc: return proper error code James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 511/622] lnet: fix peer_ni selection James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 512/622] lustre: pcc: Auto attach for PCC during IO James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 513/622] lustre: lmv: alloc dir stripes by QoS James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 514/622] lustre: llite: Don't clear d_fsdata in ll_release() James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 515/622] lustre: llite: move agl_thread cleanup out of thread James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 516/622] lustre/lnet: remove unnecessary use of msecs_to_jiffies() James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 517/622] lnet: net_fault: don't pass struct member to do_div() James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 518/622] lustre: obd: discard unused enum James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 519/622] lustre: update version to 2.13.50 James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 520/622] lustre: llite: report latency for filesystem ops James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 521/622] lustre: osc: don't re-enable grant shrink on reconnect James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 522/622] lustre: llite: statfs to use NODELAY with MDS James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 523/622] lustre: ptlrpc: grammar fix James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 524/622] lustre: lov: check all entries in lov_flush_composite James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 525/622] lustre: pcc: Incorrect size after re-attach James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 526/622] lustre: pcc: auto attach not work after client cache clear James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 527/622] lustre: pcc: Init saved dataset flags properly James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 528/622] lustre: use simple sleep in some cases James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 529/622] lustre: lov: use wait_event() in lov_subobject_kill() James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 530/622] lustre: llite: use wait_event in cl_object_put_last() James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 531/622] lustre: modules: Use LIST_HEAD for declaring list_heads James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 532/622] lustre: handle: move refcount into the lustre_handle James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 533/622] lustre: llite: support page unaligned stride readahead James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 534/622] lustre: ptlrpc: ptlrpc_register_bulk LBUG on ENOMEM James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 535/622] lustre: osc: allow increasing osc.*.short_io_bytes James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 536/622] lnet: remove pt_number from lnet_peer_table James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 537/622] lnet: Optimize check for routing feature flag James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 538/622] lustre: llite: file write pos mismatch James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 539/622] lustre: ldlm: FLOCK request can be processed twice James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 540/622] lnet: timers: correctly offset mod_timer James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 541/622] lustre: ptlrpc: update wiretest for new values James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 542/622] lustre: ptlrpc: do lu_env_refill for any new request James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 543/622] lustre: obd: perform proper division James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 544/622] lustre: uapi: introduce OBD_CONNECT2_CRUSH James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 545/622] lnet: Wait for single discovery attempt of routers James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 546/622] lustre: mgc: config lock leak James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 547/622] lnet: check if current->nsproxy is NULL before using James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 548/622] lustre: ptlrpc: always reset generation for idle reconnect James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 549/622] lustre: obdclass: Allow read-ahead for write requests James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 550/622] lustre: ldlm: separate buckets from ldlm hash table James Simmons
2020-02-27 21:16 ` [lustre-devel] [PATCH 551/622] lustre: llite: don't cache MDS_OPEN_LOCK for volatile files James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 552/622] lnet: discard lnd_refcount James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 553/622] lnet: socklnd: rename struct ksock_peer to struct ksock_peer_ni James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 554/622] lnet: change ksocknal_create_peer() to return pointer James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 555/622] lnet: discard ksnn_lock James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 556/622] lnet: discard LNetMEInsert James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 557/622] lustre: lmv: fix to return correct MDT count James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 558/622] lustre: obdclass: remove assertion for imp_refcount James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 559/622] lnet: Prefer route specified by rtr_nid James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 560/622] lustre: all: prefer sizeof(*var) for alloc James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 561/622] lustre: handle: discard OBD_FREE_RCU James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 562/622] lnet: use list_move where appropriate James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 563/622] lnet: libcfs: provide an scnprintf and start using it James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 564/622] lustre: llite: fetch default layout for a directory James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 565/622] lnet: fix rspt counter James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 566/622] lustre: ldlm: add a counter to the per-namespace data James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 567/622] lnet: Add peer level aliveness information James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 568/622] lnet: always check return of try_module_get() James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 569/622] lustre: obdclass: don't skip records for wrapped catalog James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 570/622] lnet: Refactor lnet_find_best_lpni_on_net James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 571/622] lnet: Avoid comparing route to itself James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 572/622] lustre: sysfs: use string helper like functions for sysfs James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 573/622] lustre: rename ops to owner James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 574/622] lustre: ldlm: simplify ldlm_ns_hash_defs[] James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 575/622] lnet: prepare to make lnet_lnd const James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 576/622] lnet: discard struct ksock_peer James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 577/622] lnet: Avoid extra lnet_remotenet lookup James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 578/622] lnet: Remove unused vars in lnet_find_route_locked James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 579/622] lnet: Refactor lnet_compare_routes James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 580/622] lustre: u_object: factor out extra per-bucket data James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 581/622] lustre: llite: replace lli_trunc_sem James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 582/622] lnet: Fix source specified route selection James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 583/622] lustre: uapi: turn struct lustre_nfs_fid to userland fhandle James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 584/622] lustre: uapi: LU-12521 llapi: add separate fsname and instance API James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 585/622] lnet: socklnd: initialize the_ksocklnd at compile-time James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 586/622] lnet: remove locking protection ln_testprotocompat James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 587/622] lustre: ptlrpc: suppress connection restored message James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 588/622] lustre: llite: fix deadlock in ll_update_lsm_md() James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 589/622] lustre: ldlm: fix lock convert races James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 590/622] lustre: ldlm: signal vs CP callback race James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 591/622] lustre: uapi: properly pack data structures James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 592/622] lnet: peer lookup handle shutdown James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 593/622] lnet: lnet response entries leak James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 594/622] lustre: lmv: disable statahead for remote objects James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 595/622] lustre: llite: eviction during ll_open_cleanup() James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 596/622] lustre: ptlrpc: show target name in req_history James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 597/622] lustre: dom: check read-on-open buffer presents in reply James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 598/622] lustre: llite: proper names/types for offset/pages James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 599/622] lustre: llite: Accept EBUSY for page unaligned read James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 600/622] lustre: handle: remove locking from class_handle2object() James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 601/622] lustre: handle: use hlist for hash lists James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 602/622] lustre: obdclass: convert waiting in cl_sync_io_wait() James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 603/622] lnet: modules: use list_move where appropriate James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 604/622] lnet: fix small race in unloading klnd modules James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 605/622] lnet: me: discard struct lnet_handle_me James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 606/622] lnet: avoid extra memory consumption James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 607/622] lustre: uapi: remove unused LUSTRE_DIRECTIO_FL James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 608/622] lustre: lustre: Reserve OST_FALLOCATE(fallocate) opcode James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 609/622] lnet: libcfs: Cleanup use of bare printk James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 610/622] lnet: Do not assume peers are MR capable James Simmons
2020-02-27 21:17 ` [lustre-devel] [PATCH 611/622] lnet: socklnd: convert peers hash table to hashtable.h James Simmons
2020-02-27 21:18 ` [lustre-devel] [PATCH 612/622] lustre: llite: Update mdc and lite stats on open|creat James Simmons
2020-02-27 21:18 ` [lustre-devel] [PATCH 613/622] lustre: osc: glimpse and lock cancel race James Simmons
2020-02-27 21:18 ` [lustre-devel] [PATCH 614/622] lustre: llog: keep llog handle alive until last reference James Simmons
2020-02-27 21:18 ` [lustre-devel] [PATCH 615/622] lnet: handling device failure by IB event handler James Simmons
2020-02-27 21:18 ` [lustre-devel] [PATCH 616/622] lustre: ptlrpc: simplify wait_event handling in unregister functions James Simmons
2020-02-27 21:18 ` [lustre-devel] [PATCH 617/622] lustre: ptlrpc: use l_wait_event_abortable in ptlrpcd_add_reg() James Simmons
2020-02-27 21:18 ` [lustre-devel] [PATCH 618/622] lnet: use LIST_HEAD() for local lists James Simmons
2020-02-27 21:18 ` [lustre-devel] [PATCH 619/622] lustre: lustre: " James Simmons
2020-02-27 21:18 ` [lustre-devel] [PATCH 620/622] lustre: handle: discard h_lock James Simmons
2020-02-27 21:18 ` [lustre-devel] [PATCH 621/622] lnet: remove lnd_query interface James Simmons
2020-02-27 21:18 ` [lustre-devel] [PATCH 622/622] lnet: use conservative health timeouts James Simmons
2020-04-24  6:01 ` [lustre-devel] [PATCH 000/622] lustre: sync closely to 2.13.52 NeilBrown
2020-04-28  1:04   ` James Simmons
2020-04-29  3:32     ` NeilBrown