From 8b2060de23c565150bb68fce8d8cf98cf3670cf7 Mon Sep 17 00:00:00 2001
From: mgorman <mgorman@suse.com>
Date: Fri, 25 Sep 2020 09:32:32 +0100
Subject: [PATCH] cpuidle: Poll for a minimum of 30us and poll for a tick if
 lower c-states are disabled

References: bnc#1176588
Patch-mainline: Not yet, needs to be posted but will likely be rejected for trading power efficiency for performance

A bug was reported against a distribution kernel about a regression
related to an application that has very large numbers of threads operating
on large amounts of memory with a mix of page faults and address space
modifications. The threads enter/exit idle states extremely rapidly and
perf indicated that a large amount of time was spent on native_safe_halt.
The application requires that cpuidle states be limited to C1 to reduce
latencies on wakeup.

The problem is that the application indirectly relied on behaviour similar
to commit 36fcb4292473 ("cpuidle: use first valid target residency as
poll time"), where CPUs would poll up to the lowest C-state exit latency
before exiting. With lower C-states disabled, the application more directly
relies on the behaviour before commit a37b969a61c1 ("cpuidle: poll_state:
Add time limit to poll_idle()"), when a CPU polled until a rescheduling
event occurred.

Reverting this "works" but is extreme. Instead, this patch sets a
baseline polling time that is close to the C2 exit latency and is,
anecdotally, a common wakeup latency target. It guesses if lower C-states have
been disabled and if so, it polls until the rescheduling event or a tick
has passed. It's unlikely a tick will pass but it avoids the corner case
commit a37b969a61c1 ("cpuidle: poll_state: Add time limit to poll_idle()")
intended to avoid.
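
As an illustration only, the limit selection described above can be
sketched in userspace C. This is not the kernel implementation:
sketch_poll_limit_ns(), its parameters and the HZ value of 250 are
assumptions made for the example, and the kernel constants are
redefined locally so the snippet compiles on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for kernel constants; HZ=250 is an assumption. */
#define NSEC_PER_USEC	1000ULL
#define NSEC_PER_SEC	1000000000ULL
#define HZ		250ULL
#define TICK_NSEC	(NSEC_PER_SEC / HZ)

/* Baseline poll time from this patch: 30 usec expressed in nsec. */
#define POLL_MIN_NS	(30 * NSEC_PER_USEC)

/* Unit check: 30 usec is 30000 nsec, not 30 nsec. */
static_assert(POLL_MIN_NS == 30000, "baseline must be 30 usec in nsec");

/*
 * Hypothetical helper: pick how long an idle CPU polls before giving up.
 *
 * first_residency_ns:     target residency of the first usable idle state
 * deeper_states_disabled: the guess that lower C-states were disabled
 */
static uint64_t sketch_poll_limit_ns(uint64_t first_residency_ns,
				     bool deeper_states_disabled)
{
	/* Poll until a rescheduling event or one tick, whichever is first. */
	if (deeper_states_disabled)
		return TICK_NSEC;

	/* Otherwise never poll for less than the baseline. */
	if (first_residency_ns < POLL_MIN_NS)
		return POLL_MIN_NS;

	return first_residency_ns;
}
```

With these assumptions, a first state with a 2 usec target residency
yields a 30 usec poll limit, while the disabled-states case yields one
tick (4 msec at HZ=250).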

SLE15-SP4: With few exceptions, this was found to be mostly neutral, and
	at least one major gain suggests that the baseline figure is
	implausible, potentially due to a measurement error or an
	unknown source of external interference. As the original value was
	a guess, leave this patch disabled by default unless it is shown that
	SAP requires it. If SAP does require it, an alternative fix would
	be to make this tunable via the kernel command line or possibly
	a sysctl upstream before backporting. See results at
        http://laplace.suse.de/pt-master/SLE15-SP4/0002-cpuidle-minpoll

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 drivers/cpuidle/cpuidle.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index ef2ea1b12cd8..b1b0008f629e 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -369,6 +369,7 @@ void cpuidle_reflect(struct cpuidle_device *dev, int index)
 }
 
 /*
+ * Upstream comment:
  * Min polling interval of 10usec is a guess. It is assuming that
  * for most users, the time for a single ping-pong workload like
  * perf bench pipe would generally complete within 10usec but
@@ -377,8 +378,12 @@ void cpuidle_reflect(struct cpuidle_device *dev, int index)
  * perf bench sched pipe -l 10000
  *
  * Run multiple times to avoid cpufreq effects.
+ *
+ * SLE note: The min interval is changed to 30 usec because
+ * the min polling was originally based on a bug report (bsc#1176588)
+ * and 30 usec is what was used.
  */
-#define CPUIDLE_POLL_MIN 10000
+#define CPUIDLE_POLL_MIN (30 * NSEC_PER_USEC)
 #define CPUIDLE_POLL_MAX (TICK_NSEC / 16)
 
 /**