From: Mel Gorman <mgorman@suse.com>
Date: Mon, 16 Nov 2020 20:00:42 +0000
Subject: [PATCH] sched: Fix loadavg accounting race on arm64 kabi

References: bnc#1178227
Patch-mainline: Never, upstream will fix this by moving the sched_remote_wakeup field to a clobber-free zone

An internal bug report filed against 5.8 and 5.9 kernels stated that loadavg
was "exploding" on arm64 on machines acting as build servers. It happened
on at least two different arm64 variants. The setup is complex to replicate
but the problem can be reproduced by running hackbench-process-pipes while
heavily overcommitting a machine with 96 logical CPUs and then checking
whether loadavg drops afterwards. With an MMTests clone, reproduce it as
follows:

./run-mmtests.sh --config configs/config-workload-hackbench-process-pipes --no-monitor testrun; \
    for i in `seq 1 60`; do cat /proc/loadavg; sleep 60; done

The reproduction case simply hammers the window where a task is
descheduling while also being woken by another task at the same time.
After the test completes, loadavg should reach 0 within a few minutes.

Upstream, the fix is likely to be that sched_remote_wakeup moves to a
clobber-free zone, but that would break KABI on SLE 15 SP2 and there is
no guarantee that a sufficiently large hole exists in task_struct on all
architectures. Hence, for arm64, the rq lock is acquired when
sched_remote_wakeup is changing, to avoid a store tearing problem with
the adjacent bit fields.
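
For background (illustrative only, not part of the change): the flags are
adjacent bitfields sharing one storage unit, so a plain store to one of
them is compiled as a read-modify-write of the whole unit. The user-space
sketch below mimics that layout with a hypothetical struct whose field
names merely mirror task_struct; two unlocked updaters race so that either
thread's whole-word store can overwrite the other's in-flight update, and
a lost update of that kind is the loadavg accounting race this patch works
around. Build with e.g. cc -pthread.

  #include <pthread.h>
  #include <stdio.h>

  /*
   * Stand-in for the relevant task_struct bitfields; names mirrored for
   * illustration only.
   */
  struct flags {
  	unsigned sched_reset_on_fork:1;
  	unsigned sched_contributes_to_load:1;
  	unsigned sched_migrated:1;
  	unsigned sched_remote_wakeup:1;
  } f;

  static void *flip_load(void *arg)
  {
  	(void)arg;
  	/* Compiles to: load word, modify one bit, store the whole word. */
  	for (long i = 0; i < 10000000; i++)
  		f.sched_contributes_to_load = i & 1;
  	return NULL;
  }

  static void *flip_remote(void *arg)
  {
  	(void)arg;
  	for (long i = 0; i < 10000000; i++)
  		f.sched_remote_wakeup = i & 1;
  	return NULL;
  }

  int main(void)
  {
  	pthread_t a, b;

  	pthread_create(&a, NULL, flip_load, NULL);
  	pthread_create(&b, NULL, flip_remote, NULL);
  	pthread_join(a, NULL);
  	pthread_join(b, NULL);

  	/* With no lock, an update to either bit can be lost. */
  	printf("contributes_to_load=%d remote_wakeup=%d\n",
  	       f.sched_contributes_to_load, f.sched_remote_wakeup);
  	return 0;
  }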

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/core.c | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a8d623c2f012..34340298b98f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2349,8 +2349,46 @@ void scheduler_ipi(void)
 static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
 {
 	struct rq *rq = cpu_rq(cpu);
+	int remote = !!(wake_flags & WF_MIGRATED);
 
-	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
+	if (remote != p->sched_remote_wakeup) {
+#ifdef CONFIG_X86
+		/*
+		 * On x86 at least a plain store to a bitfield is still safe
+		 * even though the bitfield is not protected by a rq
+		 * lock and will not be clobbered by a parallel update to
+		 * sched_reset_on_fork, sched_contributes_to_load or
+		 * sched_migrated. s390 should also be safe given that
+		 * smp_wmb is a simple compiler barrier.
+		 */
+		p->sched_remote_wakeup = remote;
+#else
+		/*
+		 * On ARM64, ppc64 and potentially s390, a store to a
+		 * bitfield can clobber adjacent bits so the rq lock
+		 * is needed to avoid lost updates. Ideally there would be
+		 * a KABI-safe way of moving the field but a large enough
+		 * hole is not guaranteed to exist on all supported
+		 * architectures in task_struct.
+		 *
+		 * Note, in theory it would be KABI safe to use an
+		 * anonymous union of an unsigned long and a struct
+		 * containing the bitfields. Within a cmpxchg loop,
+		 * take a READ_ONCE copy of the unsigned long on
+		 * the stack, cast it to the bitfield, update the bit
+		 * and cmpxchg the new value with the old value to
+		 * give an atomic update on a single bit. It's a
+		 * complex dance that may not be any cheaper than
+		 * the lock but worth bearing in mind if there
+		 * is a regression related to slowdowns on task
+		 * wakeup.
+		 */
+		struct rq_flags rf;
+		rq_lock_irqsave(rq, &rf);
+		p->sched_remote_wakeup = remote;
+		rq_unlock_irqrestore(rq, &rf);
+#endif
+	}
 
 	if (llist_add(&p->wake_entry, &rq->wake_list)) {
 		if (!set_nr_if_polling(rq->idle))