Trace data which starts _before_ the cache dropdown begins and continues
while it is decreasing should be the first step, ideally along with
/proc/vmstat gathered at the same time. I am pretty sure you have some
high-order memory consumer which forces the reclaim, and we over-reclaim.
The last data was not really conclusive as it didn't really capture the
dropdown, IIRC.

By "before", do you mean in a totally healthy state?
As I cannot tell when the decrease starts, this would perhaps mean
collecting data over days. However, I have no issue with that.
As I do not want to miss anything that might help you, could you please
provide the commands for all the data you require?
One host is in a healthy state right now; I'd run them over there immediately.
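
For the /proc/vmstat part, a minimal sketch of a periodic collector that
could run for days might look like the following. The function names, the
10 second interval and the log file name are assumptions made for
illustration and are not taken from this thread; it does not cover the
trace data, since the events to record are not specified here.

```python
"""Append timestamped /proc/vmstat snapshots to a log file (sketch)."""
import time


def parse_vmstat(text):
    """Parse 'name value' lines, as found in /proc/vmstat, into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def snapshot(path="/proc/vmstat"):
    """Read one snapshot of the counters from the given file."""
    with open(path) as f:
        return parse_vmstat(f.read())


def collect(out_path, interval_s=10, count=None, src="/proc/vmstat"):
    """Append snapshots to out_path; count=None means run until killed."""
    taken = 0
    with open(out_path, "a") as out:
        while count is None or taken < count:
            counters = snapshot(src)
            # Header line with the epoch time so snapshots can be
            # lined up with trace data recorded in parallel.
            out.write(f"=== {time.time():.0f}\n")
            for name, value in counters.items():
                out.write(f"{name} {value}\n")
            out.flush()  # keep the log usable if the host is rebooted
            taken += 1
            if count is None or taken < count:
                time.sleep(interval_s)
```

Calling collect("vmstat.log", interval_s=10) on the currently healthy host
would then keep appending snapshots until the process is stopped.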