From: mgorman <mgorman@suse.com>
Date: Mon, 16 Nov 2020 20:00:42 +0000
Subject: [PATCH] sched: Fix loadavg accounting race on arm64 kabi
References: bnc#1178227
Patch-mainline: Never, upstream will fix this by moving the sched_remote_wakeup field to a clobber-free zone
An internal bug report was filed against 5.8 and 5.9 kernels stating that
loadavg was "exploding" on arm64 on machines acting as build servers. It
happened on at least two different arm64 variants. That setup is complex to
replicate but the problem can be reproduced by running hackbench-process-pipes
while heavily overcommitting a machine with 96 logical CPUs and then checking
whether loadavg drops afterwards. With an MMTests clone, reproduce it as follows:
./run-mmtests.sh --config configs/config-workload-hackbench-process-pipes --no-monitor testrun; \
for i in `seq 1 60`; do cat /proc/loadavg; sleep 60; done
The reproduction case simply hammers the window where a task can be
descheduled while also being woken by another task at the same time.
After the test completes, loadavg should drop back to 0 within a few minutes.
Upstream, the fix is likely to be that sched_remote_wakeup moves to a
clobber-free zone, but that would break KABI on SLE 15 SP2, and there is
no guarantee that a sufficiently large hole exists in task_struct on
all architectures. Hence, for arm64, the rq lock is acquired if
sched_remote_wakeup is changing, to avoid store tearing against
the adjacent bit fields.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
kernel/sched/core.c | 40 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 39 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a8d623c2f012..34340298b98f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2349,8 +2349,46 @@ void scheduler_ipi(void)
static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
{
struct rq *rq = cpu_rq(cpu);
+ int remote = !!(wake_flags & WF_MIGRATED);
- p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
+ if (remote != p->sched_remote_wakeup) {
+#ifdef CONFIG_X86
+ /*
+ * On x86 at least, a plain store to a bitfield is still safe
+ * even though the bitfield is not protected by the rq
+ * lock and will not be clobbered by a parallel update to
+ * sched_reset_on_fork, sched_contributes_to_load or
+ * sched_migrated. s390 should also be safe given that
+ * smp_wmb is a simple compiler barrier.
+ */
+ p->sched_remote_wakeup = remote;
+#else
+ /*
+ * On ARM64, ppc64 and potentially s390, a store to a
+ * bitfield can clobber adjacent bits, so the rq lock
+ * is needed to avoid lost updates. Ideally there would be
+ * a KABI-safe way of moving the field but a large enough
+ * hole is not guaranteed to exist in task_struct on
+ * all supported architectures.
+ *
+ * Note, in theory it would be KABI safe to use an
+ * anonymous union of an unsigned long and a struct
+ * containing the bitfields. Within a cmpxchg loop,
+ * take a READ_ONCE copy of the unsigned long on
+ * the stack, cast it to the bitfield, update the bit
+ * and cmpxchg the new value with the old value to
+ * give an atomic update on a single bit. It's a
+ * complex dance that may not be any cheaper than
+ * the lock but worth bearing in mind if there
+ * is a regression related to slowdowns on task
+ * wakeup.
+ */
+ struct rq_flags rf;
+ rq_lock_irqsave(rq, &rf);
+ p->sched_remote_wakeup = remote;
+ rq_unlock_irqrestore(rq, &rf);
+#endif
+ }
if (llist_add(&p->wake_entry, &rq->wake_list)) {
if (!set_nr_if_polling(rq->idle))