From: Mel Gorman <mgorman@suse.com>
Date: Fri, 25 Sep 2020 09:32:32 +0100
Subject: [PATCH] cpuidle: Poll for a minimum of 30us and poll for a tick if
 lower c-states are disabled

References: bnc#1176588
Patch-mainline: Not yet, needs to be posted but will likely be rejected for favoring performance over power

A bug was reported against a distribution kernel about a regression
affecting an application that runs a very large number of threads
operating on large amounts of memory with a mix of page faults and
address space modifications. The threads enter and exit idle states
extremely rapidly, and perf indicated that a large amount of time was
spent in native_safe_halt. The application requires that cpuidle states
be limited to C1 to reduce wakeup latencies.

The problem is that the application indirectly relied on behaviour
similar to commit 36fcb4292473 ("cpuidle: use first valid target
residency as poll time"), where CPUs poll for the target residency of
the first enabled C-state before exiting the poll loop. With the lower
C-states disabled, the application more directly relies on commit
a37b969a61c1 ("cpuidle: poll_state: Add time limit to poll_idle()") to
poll a CPU until a rescheduling event occurs.
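
For reference, this is roughly how poll_idle() behaves after those two
commits (a simplified paraphrase of drivers/cpuidle/poll_state.c, not
the exact upstream code; the batching of local_clock() reads across
cpu_relax() iterations is omitted):

	static int __cpuidle poll_idle(struct cpuidle_device *dev,
				       struct cpuidle_driver *drv, int index)
	{
		u64 time_start = local_clock();

		dev->poll_time_limit = false;

		local_irq_enable();
		if (!current_set_polling_and_test()) {
			/* Poll budget chosen by cpuidle_poll_time() below */
			u64 limit = cpuidle_poll_time(drv, dev);

			/* Spin until a wakeup is pending or the budget expires */
			while (!need_resched()) {
				cpu_relax();
				if (local_clock() - time_start > limit) {
					dev->poll_time_limit = true;
					break;
				}
			}
		}
		current_clr_polling();

		return index;
	}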

Reverting this "works" but is extreme. Instead, this patch sets a
baseline polling time that is close to a typical C2 exit latency and
anecdotally is a common wakeup latency target. It guesses whether the
lower C-states have been disabled and, if so, polls until a
rescheduling event occurs or a tick has passed. It is unlikely that a
full tick will pass, but the limit still avoids the corner case that
commit a37b969a61c1 ("cpuidle: poll_state: Add time limit to
poll_idle()") was intended to address.
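
As an illustration of the resulting selection logic (the residency
values are made up and HZ=250 is assumed, so TICK_NSEC is 4ms):

	All states enabled, residencies C1=2us, C2=20us, C6=600us:
		limit_ns  = 2us    (first enabled state)
		max_limit = 600us  (deepest enabled state)
		poll_limit_ns = max(30us, 2us) = 30us

	Limited to C1 only, residency 2us:
		limit_ns = max_limit = 2us
		max_limit < 30us, so poll_limit_ns = TICK_NSEC = 4ms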

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 drivers/cpuidle/cpuidle.c | 28 ++++++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 29d2d7a21bd7..b903016e653b 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -361,6 +361,8 @@ void cpuidle_reflect(struct cpuidle_device *dev, int index)
 		cpuidle_curr_governor->reflect(dev, index);
 }
 
+#define MIN_POLL_TIME (30 * NSEC_PER_USEC)
+
 /**
  * cpuidle_poll_time - return amount of time to poll for,
  * governors can override dev->poll_limit_ns if necessary
@@ -373,21 +375,39 @@ u64 cpuidle_poll_time(struct cpuidle_driver *drv,
 		      struct cpuidle_device *dev)
 {
 	int i;
-	u64 limit_ns;
+	u64 limit_ns, max_limit;
 
 	if (dev->poll_limit_ns)
 		return dev->poll_limit_ns;
 
 	limit_ns = TICK_NSEC;
+	max_limit = 0;
 	for (i = 1; i < drv->state_count; i++) {
+		u64 state_limit;
+
 		if (drv->states[i].disabled || dev->states_usage[i].disable)
 			continue;
 
-		limit_ns = (u64)drv->states[i].target_residency * NSEC_PER_USEC;
-		break;
+		state_limit = (u64)drv->states[i].target_residency * NSEC_PER_USEC;
+		if (limit_ns == TICK_NSEC)
+			limit_ns = state_limit;
+		max_limit = state_limit;
+	}
+
+	dev->poll_limit_ns = max_t(u64, MIN_POLL_TIME, limit_ns);
+
+	/*
+	 * If the deepest state is below the minimum, assume that c-states
+	 * are limited by the driver or kernel command line and that latency
+	 * is a concern. In this case, poll for longer periods;
+	 */
+	if (max_limit < MIN_POLL_TIME) {
+		pr_info("cpuidle deepest latency of %llu below min %llu, idling based on tick\n",
+			max_limit, (u64)MIN_POLL_TIME);
+		dev->poll_limit_ns = TICK_NSEC;
 	}
 
-	dev->poll_limit_ns = limit_ns;
+	pr_info("cpuidle polling time = %llu ns\n", dev->poll_limit_ns);
 
 	return dev->poll_limit_ns;
 }