From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Date: Thu, 18 Aug 2022 18:40:33 +0530
Subject: mm/demotion: add support for explicit memory tiers
Git-commit: 992bf77591cb7e696fcc59aa7e64d1200b673513
Patch-mainline: v6.1-rc1
References: jsc#PED-1248

Patch series "mm/demotion: Memory tiers and demotion", v15.

The current kernel has the basic memory tiering support: Inactive pages on
a higher tier NUMA node can be migrated (demoted) to a lower tier NUMA
node to make room for new allocations on the higher tier NUMA node.
Frequently accessed pages on a lower tier NUMA node can be migrated
(promoted) to a higher tier NUMA node to improve the performance.

In the current kernel, memory tiers are defined implicitly via a demotion
path relationship between NUMA nodes, which is created during the kernel
initialization and updated when a NUMA node is hot-added or hot-removed.
The current implementation puts all nodes with CPU into the highest tier,
and builds the tier hierarchy tier-by-tier by establishing the per-node
demotion targets based on the distances between nodes.

This current memory tier kernel implementation needs to be improved for
several important use cases:

* The current tier initialization code always initializes each
  memory-only NUMA node into a lower tier. But a memory-only NUMA node
  may have a high performance memory device (e.g. a DRAM-backed
  memory-only node on a virtual machine) and that should be put into a
  higher tier.

* The current tier hierarchy always puts CPU nodes into the top tier.
  But on a system with HBM (e.g. GPU memory) devices, these memory-only
  HBM NUMA nodes should be in the top tier, and DRAM nodes with CPUs are
  better placed into the next lower tier.

* Also because the current tier hierarchy always puts CPU nodes into the
  top tier, when a CPU is hot-added (or hot-removed) and changes a memory
  node from CPU-less into a CPU node (or vice versa), the memory tier
  hierarchy gets changed, even though no memory node is added or removed.
  This can make the tier hierarchy unstable and make it difficult to
  support tier-based memory accounting.

* Pages on a higher tier node can only be demoted to the nodes with the
  shortest distance on the next lower tier as defined by the demotion
  path, not to any other node from any lower tier. This strict demotion
  order does not work in all use cases (e.g. some use cases may want to
  allow cross-socket demotion to another node in the same demotion tier
  as a fallback when the preferred demotion node is out of space), and
  has resulted in the feature request for an interface to override the
  system-wide, per-node demotion order from userspace. This demotion
  order is also inconsistent with the page allocation fallback order
  when all the nodes in a higher tier are out of space: The page
  allocation can fall back to any node from any lower tier, whereas the
  demotion order doesn't allow that.

This patch series makes the creation of memory tiers explicit under the
control of device drivers.

Memory Tier Initialization
==========================

The Linux kernel presents memory devices as NUMA nodes and each memory
device is of a specific type. The memory type of a device is represented
by its abstract distance. A memory tier corresponds to a range of
abstract distances. This allows for classifying memory devices with a
specific performance range into a memory tier.

By default, all memory nodes are assigned to the default tier with
abstract distance 512.

A device driver can move its memory nodes from the default tier. For
example, PMEM can move its memory nodes below the default tier, whereas
GPU can move its memory nodes above the default tier.
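
As a rough illustration of that placement rule, the sketch below compares
the tier a memory type would land in against the default DRAM tier. The
three macros mirror the values this patch adds in
include/linux/memory-tiers.h; tier_start() and tier_vs_dram() are
hypothetical helpers written for this note and are not part of the series.

```c
#include <assert.h>

/* Mirrors the constants this patch adds in include/linux/memory-tiers.h. */
#define MEMTIER_CHUNK_BITS	7
#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
#define MEMTIER_ADISTANCE_DRAM	((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1))

/*
 * Hypothetical helper: first abstract distance of the tier that holds
 * adistance. Assumes adistance is non-negative.
 */
static int tier_start(int adistance)
{
	return adistance & ~(MEMTIER_CHUNK_SIZE - 1);
}

/*
 * Hypothetical helper: returns 1 if the type lands in a faster (higher)
 * tier than default DRAM, 0 if it shares the DRAM tier, and -1 if it
 * lands in a slower (lower) tier.
 */
static int tier_vs_dram(int adistance)
{
	int start = tier_start(adistance);
	int dram_start = tier_start(MEMTIER_ADISTANCE_DRAM);

	if (start < dram_start)
		return 1;
	if (start > dram_start)
		return -1;
	return 0;
}
```

With made-up abstract distances, tier_vs_dram(300) returns 1 (a faster
tier, as a GPU driver might request), tier_vs_dram(2880) returns -1 (a
slower tier, as a PMEM driver might request), and tier_vs_dram(520)
returns 0 because 520 falls into the same chunk as default DRAM.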

The kernel initialization code makes the decision on which exact tier a
memory node should be assigned to based on the requests from the device
drivers as well as the memory device hardware information provided by the
firmware.

Hot-adding/removing CPUs doesn't affect the memory tier hierarchy.


This patch (of 10):

In the current kernel, memory tiers are defined implicitly via a demotion
path relationship between NUMA nodes, which is created during the kernel
initialization and updated when a NUMA node is hot-added or hot-removed.
The current implementation puts all nodes with CPU into the highest tier,
and builds the tier hierarchy by establishing the per-node demotion
targets based on the distances between nodes.

This current memory tier kernel implementation needs to be improved for
several important use cases:

The current tier initialization code always initializes each memory-only
NUMA node into a lower tier. But a memory-only NUMA node may have a high
performance memory device (e.g. a DRAM-backed memory-only node on a
virtual machine) that should be put into a higher tier.

The current tier hierarchy always puts CPU nodes into the top tier. But
on a system with HBM or GPU devices, the memory-only NUMA nodes mapping
these devices should be in the top tier, and DRAM nodes with CPUs are
better placed into the next lower tier.

With the current kernel, pages on a higher tier node can only be demoted
to the nodes with the shortest distance on the next lower tier as defined
by the demotion path, not to any other node from any lower tier. This
strict demotion order does not work in all use cases (e.g. some use cases
may want to allow cross-socket demotion to another node in the same
demotion tier as a fallback when the preferred demotion node is out of
space). This demotion order is also inconsistent with the page allocation
fallback order when all the nodes in a higher tier are out of space: The
page allocation can fall back to any node from any lower tier, whereas
the demotion order doesn't allow that.

This patch series addresses the above by defining memory tiers explicitly.

The Linux kernel presents memory devices as NUMA nodes and each memory
device is of a specific type. The memory type of a device is represented
by its abstract distance. A memory tier corresponds to a range of
abstract distances. This allows for classifying memory devices with a
specific performance range into a memory tier.

This patch configures the range/chunk size to be 128. The default DRAM
tier starts at abstract distance 512 (MEMTIER_ADISTANCE_DRAM itself is
576, half a chunk into that tier). We can have 4 memory tiers below the
default DRAM tier, with abstract distance ranges 0 - 127, 128 - 255,
256 - 383 and 384 - 511. Faster memory devices can be placed in these
faster (higher) memory tiers. Slower memory devices like persistent
memory will have an abstract distance higher than the default DRAM level.
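
The range arithmetic above can be checked with a small user-space sketch.
The macros repeat the definitions from include/linux/memory-tiers.h in
this patch; struct adistance_range, tier_range() and faster_tier_count()
are hypothetical names used only for this illustration, and the shift
stands in for the kernel's round_down() on a power-of-two chunk size.

```c
#include <assert.h>

/* Mirrors the constants this patch adds in include/linux/memory-tiers.h. */
#define MEMTIER_CHUNK_BITS	7
#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
#define MEMTIER_ADISTANCE_DRAM	((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1))

/* Hypothetical type: inclusive abstract distance range covered by a tier. */
struct adistance_range {
	int first;
	int last;
};

/*
 * Hypothetical helper: range of the tier that adistance falls into.
 * Assumes adistance is non-negative.
 */
static struct adistance_range tier_range(int adistance)
{
	struct adistance_range r;

	/* Same effect as round_down(adistance, MEMTIER_CHUNK_SIZE). */
	r.first = (adistance >> MEMTIER_CHUNK_BITS) << MEMTIER_CHUNK_BITS;
	r.last = r.first + MEMTIER_CHUNK_SIZE - 1;
	return r;
}

/* Hypothetical helper: number of whole tiers faster than the DRAM tier. */
static int faster_tier_count(void)
{
	return tier_range(MEMTIER_ADISTANCE_DRAM).first / MEMTIER_CHUNK_SIZE;
}
```

Under these definitions MEMTIER_ADISTANCE_DRAM evaluates to 576 and falls
into the tier covering 512 - 639, tier_range(128) covers 128 - 255, and
faster_tier_count() returns 4.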

[akpm@linux-foundation.org: fix comment, per Aneesh]
Link: https://lkml.kernel.org/r/20220818131042.113280-1-aneesh.kumar@linux.ibm.com
Link: https://lkml.kernel.org/r/20220818131042.113280-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Wei Xu <weixugc@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hesham Almatary <hesham.almatary@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/memory-tiers.h |   18 ++++++
 mm/Makefile                  |    1
 mm/memory-tiers.c            |  129 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 148 insertions(+)

--- /dev/null
+++ b/include/linux/memory-tiers.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_TIERS_H
+#define _LINUX_MEMORY_TIERS_H
+
+/*
+ * Each tier cover a abstrace distance chunk size of 128
+ */
+#define MEMTIER_CHUNK_BITS 7
+#define MEMTIER_CHUNK_SIZE (1 << MEMTIER_CHUNK_BITS)
+/*
+ * Smaller abstract distance values imply faster (higher) memory tiers. Offset
+ * the DRAM adistance so that we can accommodate devices with a slightly lower
+ * adistance value (slightly faster) than default DRAM adistance to be part of
+ * the same memory tier.
+ */
+#define MEMTIER_ADISTANCE_DRAM ((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1))
+
+#endif /* _LINUX_MEMORY_TIERS_H */
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -90,6 +90,7 @@ obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMTEST) += memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
--- /dev/null
+++ b/mm/memory-tiers.c
@@ -0,0 +1,129 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/types.h>
+#include <linux/nodemask.h>
+#include <linux/slab.h>
+#include <linux/lockdep.h>
+#include <linux/memory-tiers.h>
+
+struct memory_tier {
+	/* hierarchy of memory tiers */
+	struct list_head list;
+	/* list of all memory types part of this tier */
+	struct list_head memory_types;
+	/*
+	 * start value of abstract distance. memory tier maps
+	 * an abstract distance range,
+	 * adistance_start .. adistance_start + MEMTIER_CHUNK_SIZE
+	 */
+	int adistance_start;
+};
+
+struct memory_dev_type {
+	/* list of memory types that are part of same tier as this type */
+	struct list_head tier_sibiling;
+	/* abstract distance for this specific memory type */
+	int adistance;
+	/* Nodes of same abstract distance */
+	nodemask_t nodes;
+	struct memory_tier *memtier;
+};
+
+static DEFINE_MUTEX(memory_tier_lock);
+static LIST_HEAD(memory_tiers);
+static struct memory_dev_type *node_memory_types[MAX_NUMNODES];
+/*
+ * For now we can have 4 faster memory tiers with smaller adistance
+ * than default DRAM tier.
+ */
+static struct memory_dev_type default_dram_type = {
+	.adistance = MEMTIER_ADISTANCE_DRAM,
+	.tier_sibiling = LIST_HEAD_INIT(default_dram_type.tier_sibiling),
+};
+
+static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype)
+{
+	bool found_slot = false;
+	struct memory_tier *memtier, *new_memtier;
+	int adistance = memtype->adistance;
+	unsigned int memtier_adistance_chunk_size = MEMTIER_CHUNK_SIZE;
+
+	lockdep_assert_held_once(&memory_tier_lock);
+
+	/*
+	 * If the memtype is already part of a memory tier,
+	 * just return that.
+	 */
+	if (memtype->memtier)
+		return memtype->memtier;
+
+	adistance = round_down(adistance, memtier_adistance_chunk_size);
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		if (adistance == memtier->adistance_start) {
+			memtype->memtier = memtier;
+			list_add(&memtype->tier_sibiling, &memtier->memory_types);
+			return memtier;
+		} else if (adistance < memtier->adistance_start) {
+			found_slot = true;
+			break;
+		}
+	}
+
+	new_memtier = kmalloc(sizeof(struct memory_tier), GFP_KERNEL);
+	if (!new_memtier)
+		return ERR_PTR(-ENOMEM);
+
+	new_memtier->adistance_start = adistance;
+	INIT_LIST_HEAD(&new_memtier->list);
+	INIT_LIST_HEAD(&new_memtier->memory_types);
+	if (found_slot)
+		list_add_tail(&new_memtier->list, &memtier->list);
+	else
+		list_add_tail(&new_memtier->list, &memory_tiers);
+	memtype->memtier = new_memtier;
+	list_add(&memtype->tier_sibiling, &new_memtier->memory_types);
+	return new_memtier;
+}
+
+static struct memory_tier *set_node_memory_tier(int node)
+{
+	struct memory_tier *memtier;
+	struct memory_dev_type *memtype;
+
+	lockdep_assert_held_once(&memory_tier_lock);
+
+	if (!node_state(node, N_MEMORY))
+		return ERR_PTR(-EINVAL);
+
+	if (!node_memory_types[node])
+		node_memory_types[node] = &default_dram_type;
+
+	memtype = node_memory_types[node];
+	node_set(node, memtype->nodes);
+	memtier = find_create_memory_tier(memtype);
+	return memtier;
+}
+
+static int __init memory_tier_init(void)
+{
+	int node;
+	struct memory_tier *memtier;
+
+	mutex_lock(&memory_tier_lock);
+	/*
+	 * Look at all the existing N_MEMORY nodes and add them to
+	 * default memory tier or to a tier if we already have memory
+	 * types assigned.
+	 */
+	for_each_node_state(node, N_MEMORY) {
+		memtier = set_node_memory_tier(node);
+		if (IS_ERR(memtier))
+			/*
+			 * Continue with memtiers we are able to setup
+			 */
+			break;
+	}
+	mutex_unlock(&memory_tier_lock);
+
+	return 0;
+}
+subsys_initcall(memory_tier_init);