From: mgorman <mgorman@suse.com>
Date: Wed, 23 Sep 2020 18:10:35 +0100
Subject: [PATCH] sched/numa: Avoid creating large imbalances at task creation
 time

References: bnc#1176588
Patch-mainline: Not yet, needs to be posted but likely needs modification

A bug was reported against a distribution kernel about a regression related
to an application that has very large numbers of threads operating on
large amounts of memory with a mix of page faults and address space
modifications. System CPU usage was higher and elapsed time was much
increased.

Part of the problem is that the application relies on threads and
their placement. As cloned threads remain local to the node if there
is an idle CPU, they contend heavily on the LRU spinlock when
faulting memory. This also creates a large imbalance of CPU
utilisation until the load balancer intervenes, which does not
happen quickly enough. As NUMA balancing is disabled because the
application is partially NUMA-aware, the situation never recovers.

This is not a representative test case, but similar symptoms can
be seen with a benchmark that faults pages in parallel:

			    baseline		   patch
Amean     system-1          4.36 (   0.00%)        4.33 *   0.88%*
Amean     system-4          4.49 (   0.00%)        4.44 *   1.00%*
Amean     system-7          4.80 (   0.00%)        4.67 *   2.80%*
Amean     system-12         5.05 (   0.00%)        4.94 *   2.14%*
Amean     system-21         7.98 (   0.00%)        5.10 *  36.04%*
Amean     system-30         8.45 (   0.00%)        6.44 *  23.79%*
Amean     system-48         9.40 (   0.00%)        9.38 (   0.24%)
Amean     elapsed-1         5.70 (   0.00%)        5.68 *   0.45%*
Amean     elapsed-4         1.48 (   0.00%)        1.47 *   0.70%*
Amean     elapsed-7         0.92 (   0.00%)        0.89 *   3.10%*
Amean     elapsed-12        0.57 (   0.00%)        0.55 *   2.94%*
Amean     elapsed-21        0.51 (   0.00%)        0.33 *  34.38%*
Amean     elapsed-30        0.39 (   0.00%)        0.35 *  10.27%*
Amean     elapsed-48        0.24 (   0.00%)        0.25 (  -2.08%)

The system has 48 cores in total; note the decrease in system CPU
usage and the large decrease in elapsed time when one node is
almost full.
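
The benchmark itself is not included here, but a minimal sketch of
this kind of parallel page fault microbenchmark could look roughly
like the following. This is illustrative only; the thread count,
mapping size and build options are arbitrary assumptions rather than
the configuration measured above.

/*
 * Illustrative sketch only: NR_THREADS threads each fault in their own
 * anonymous mapping in parallel. The thread count and mapping size are
 * arbitrary; build with "gcc -O2 -pthread".
 */
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define NR_THREADS      21
#define MAP_SIZE        (1UL << 30)     /* 1G of anonymous memory per thread */

static void *fault_worker(void *arg)
{
        long page_size = sysconf(_SC_PAGESIZE);
        size_t off;
        char *mem;

        (void)arg;

        mem = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (mem == MAP_FAILED) {
                perror("mmap");
                return NULL;
        }

        /* Touch every page so each thread takes a stream of minor faults */
        for (off = 0; off < MAP_SIZE; off += page_size)
                mem[off] = 1;

        munmap(mem, MAP_SIZE);
        return NULL;
}

int main(void)
{
        pthread_t threads[NR_THREADS];
        int i;

        for (i = 0; i < NR_THREADS; i++)
                pthread_create(&threads[i], NULL, fault_worker, NULL);
        for (i = 0; i < NR_THREADS; i++)
                pthread_join(threads[i], NULL);

        return 0;
}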

This is not the best possible solution to the imbalance. Ideally
it would be reconciled with adjust_numa_imbalance(), but that
function was a regression magnet when it first tried to allow
imbalances. This is the minimal fix to act as a baseline before
trying to reconcile all of the imbalance handling across nodes.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index db02212b2fba..1a3984801cc9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8679,9 +8679,6 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 			.group_type = group_overloaded,
 	};
 
-	imbalance = scale_load_down(NICE_0_LOAD) *
-				(sd->imbalance_pct-100) / 100;
-
 	do {
 		int local_group;
 
@@ -8735,6 +8732,11 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 	switch (local_sgs.group_type) {
 	case group_overloaded:
 	case group_fully_busy:
+
+		/* Calculate allowed imbalance based on load */
+		imbalance = scale_load_down(NICE_0_LOAD) *
+				(sd->imbalance_pct-100) / 100;
+
 		/*
 		 * When comparing groups across NUMA domains, it's possible for
 		 * the local domain to be very lightly loaded relative to the
@@ -8787,13 +8789,24 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 					return idlest;
 			}
 #endif
+
 			/*
 			 * Otherwise, keep the task on this node to stay close
-			 * its wakeup source and improve locality. If there is
-			 * a real need of migration, periodic load balance will
-			 * take care of it.
+			 * to its wakeup source if it would not cause a large
+			 * imbalance. If there is a real need of migration,
+			 * periodic load balance will take care of it.
+			 */
+
+			/* See adjust_numa_imbalance */
+			imbalance = 2;
+
+			/*
+			 * Allow an imbalance if the node is not nearly full
+			 * and the imbalance between local and idlest is not
+			 * excessive.
 			 */
-			if (local_sgs.idle_cpus)
+			if (local_sgs.idle_cpus >= imbalance &&
+			    idlest_sgs.idle_cpus - local_sgs.idle_cpus <= imbalance)
 				return NULL;
 		}