>
> trace data which starts _before_ the cache dropdown starts and while it
> is decreasing should be the first step. Ideally along with /proc/vmstat
> gathered at the same time. I am pretty sure you have some high order
> memory consumer which forces the reclaim and we over reclaim. Last data
> was not really conclusive as it didn't really capture the dropdown
> IIRC.
>

By "before", do you mean while the host is still in a totally healthy
state? As I cannot tell when the decrease starts, this would mean
collecting data over days, perhaps. However, I have no issue with that.
As I do not want to miss anything that might help you, could you please
provide the commands for all the data you require?
One host is in a healthy state right now, so I would start the
collection there immediately.
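
For the /proc/vmstat part, here is a minimal sketch of what I could
start in the meantime, assuming periodic timestamped snapshots are what
you need. The output directory, the interval and all function names are
my own placeholders, not something you asked for, and this does not
cover the trace data at all:

```python
#!/usr/bin/env python3
"""Sketch: keep timestamped copies of /proc/vmstat for later comparison."""
import time
from pathlib import Path

VMSTAT = Path("/proc/vmstat")
OUT_DIR = Path("vmstat-snapshots")   # placeholder output directory
INTERVAL_S = 10                      # placeholder sampling interval


def snapshot(out_dir=OUT_DIR, source=VMSTAT):
    """Copy the current counters to out_dir/vmstat.<epoch seconds>."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"vmstat.{int(time.time())}"
    target.write_text(source.read_text())
    return target


def parse_vmstat(text):
    """Turn the 'name value' lines of a snapshot into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def delta(before, after):
    """Per-counter difference between two parsed snapshots."""
    return {name: after[name] - before[name]
            for name in after if name in before}


def collect(count, interval_s=INTERVAL_S):
    """Take `count` snapshots, sleeping `interval_s` seconds in between."""
    for _ in range(count):
        snapshot()
        time.sleep(interval_s)

# To sample for roughly one day at the placeholder interval:
#     collect(count=24 * 60 * 60 // INTERVAL_S)
```

If a different interval or file format suits you better, I will adapt
it.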