aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/admin-guide/mm
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/admin-guide/mm')
-rw-r--r--Documentation/admin-guide/mm/cma_debugfs.rst10
-rw-r--r--Documentation/admin-guide/mm/concepts.rst17
-rw-r--r--Documentation/admin-guide/mm/damon/index.rst17
-rw-r--r--Documentation/admin-guide/mm/damon/lru_sort.rst294
-rw-r--r--Documentation/admin-guide/mm/damon/reclaim.rst274
-rw-r--r--Documentation/admin-guide/mm/damon/start.rst127
-rw-r--r--Documentation/admin-guide/mm/damon/usage.rst911
-rw-r--r--Documentation/admin-guide/mm/hugetlbpage.rst109
-rw-r--r--Documentation/admin-guide/mm/idle_page_tracking.rst9
-rw-r--r--Documentation/admin-guide/mm/index.rst13
-rw-r--r--Documentation/admin-guide/mm/ksm.rst144
-rw-r--r--Documentation/admin-guide/mm/memory-hotplug.rst909
-rw-r--r--Documentation/admin-guide/mm/multigen_lru.rst162
-rw-r--r--Documentation/admin-guide/mm/nommu-mmap.rst283
-rw-r--r--Documentation/admin-guide/mm/numa_memory_policy.rst47
-rw-r--r--Documentation/admin-guide/mm/numaperf.rst22
-rw-r--r--Documentation/admin-guide/mm/pagemap.rst186
-rw-r--r--Documentation/admin-guide/mm/shrinker_debugfs.rst133
-rw-r--r--Documentation/admin-guide/mm/soft-dirty.rst2
-rw-r--r--Documentation/admin-guide/mm/swap_numa.rst78
-rw-r--r--Documentation/admin-guide/mm/transhuge.rst142
-rw-r--r--Documentation/admin-guide/mm/userfaultfd.rst382
-rw-r--r--Documentation/admin-guide/mm/zswap.rst178
23 files changed, 3854 insertions, 595 deletions
diff --git a/Documentation/admin-guide/mm/cma_debugfs.rst b/Documentation/admin-guide/mm/cma_debugfs.rst
index 4e06ffabd78a..7367e6294ef6 100644
--- a/Documentation/admin-guide/mm/cma_debugfs.rst
+++ b/Documentation/admin-guide/mm/cma_debugfs.rst
@@ -5,10 +5,10 @@ CMA Debugfs Interface
The CMA debugfs interface is useful to retrieve basic information out of the
different CMA areas and to test allocation/release in each of the areas.
-Each CMA zone represents a directory under <debugfs>/cma/, indexed by the
-kernel's CMA index. So the first CMA zone would be:
+Each CMA area represents a directory under <debugfs>/cma/, represented by
+its CMA name like below:
- <debugfs>/cma/cma-0
+ <debugfs>/cma/<cma_name>
The structure of the files created under that directory is as follows:
@@ -18,8 +18,8 @@ The structure of the files created under that directory is as follows:
- [RO] bitmap: The bitmap of page states in the zone.
- [WO] alloc: Allocate N pages from that CMA area. For example::
- echo 5 > <debugfs>/cma/cma-2/alloc
+ echo 5 > <debugfs>/cma/<cma_name>/alloc
-would try to allocate 5 pages from the cma-2 area.
+would try to allocate 5 pages from the 'cma_name' area.
- [WO] free: Free N pages from that CMA area, similar to the above.
diff --git a/Documentation/admin-guide/mm/concepts.rst b/Documentation/admin-guide/mm/concepts.rst
index c2531b14bf46..e796b0a7e4a5 100644
--- a/Documentation/admin-guide/mm/concepts.rst
+++ b/Documentation/admin-guide/mm/concepts.rst
@@ -1,5 +1,3 @@
-.. _mm_concepts:
-
=================
Concepts overview
=================
@@ -35,7 +33,7 @@ physical memory (demand paging) and provides a mechanism for the
protection and controlled sharing of data between processes.
With virtual memory, each and every memory access uses a virtual
-address. When the CPU decodes the an instruction that reads (or
+address. When the CPU decodes an instruction that reads (or
writes) from (or to) the system memory, it translates the `virtual`
address encoded in that instruction to a `physical` address that the
memory controller can understand.
@@ -86,16 +84,15 @@ memory with the huge pages. The first one is `HugeTLB filesystem`, or
hugetlbfs. It is a pseudo filesystem that uses RAM as its backing
store. For the files created in this filesystem the data resides in
the memory and mapped using huge pages. The hugetlbfs is described at
-:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`.
+Documentation/admin-guide/mm/hugetlbpage.rst.
Another, more recent, mechanism that enables use of the huge pages is
called `Transparent HugePages`, or THP. Unlike the hugetlbfs that
requires users and/or system administrators to configure what parts of
the system memory should and can be mapped by the huge pages, THP
manages such mappings transparently to the user and hence the
-name. See
-:ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>`
-for more details about THP.
+name. See Documentation/admin-guide/mm/transhuge.rst for more details
+about THP.
Zones
=====
@@ -125,8 +122,8 @@ processor. Each bank is referred to as a `node` and for each node Linux
constructs an independent memory management subsystem. A node has its
own set of zones, lists of free and used pages and various statistics
counters. You can find more details about NUMA in
-:ref:`Documentation/vm/numa.rst <numa>` and in
-:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`.
+Documentation/mm/numa.rst` and in
+Documentation/admin-guide/mm/numa_memory_policy.rst.
Page cache
==========
@@ -184,7 +181,7 @@ pages either asynchronously or synchronously, depending on the state
of the system. When the system is not loaded, most of the memory is free
and allocation requests will be satisfied immediately from the free
pages supply. As the load increases, the amount of the free pages goes
-down and when it reaches a certain threshold (high watermark), an
+down and when it reaches a certain threshold (low watermark), an
allocation request will awaken the ``kswapd`` daemon. It will
asynchronously scan memory pages and either just free them if the data
they contain is available elsewhere, or evict to the backing storage
diff --git a/Documentation/admin-guide/mm/damon/index.rst b/Documentation/admin-guide/mm/damon/index.rst
new file mode 100644
index 000000000000..33d37bb2fb4e
--- /dev/null
+++ b/Documentation/admin-guide/mm/damon/index.rst
@@ -0,0 +1,17 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+DAMON: Data Access MONitor
+==========================
+
+:doc:`DAMON </mm/damon/index>` allows light-weight data access monitoring.
+Using DAMON, users can analyze the memory access patterns of their systems and
+optimize those.
+
+.. toctree::
+ :maxdepth: 2
+
+ start
+ usage
+ reclaim
+ lru_sort
diff --git a/Documentation/admin-guide/mm/damon/lru_sort.rst b/Documentation/admin-guide/mm/damon/lru_sort.rst
new file mode 100644
index 000000000000..7b0775d281b4
--- /dev/null
+++ b/Documentation/admin-guide/mm/damon/lru_sort.rst
@@ -0,0 +1,294 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================
+DAMON-based LRU-lists Sorting
+=============================
+
+DAMON-based LRU-lists Sorting (DAMON_LRU_SORT) is a static kernel module that
+aimed to be used for proactive and lightweight data access pattern based
+(de)prioritization of pages on their LRU-lists for making LRU-lists a more
+trusworthy data access pattern source.
+
+Where Proactive LRU-lists Sorting is Required?
+==============================================
+
+As page-granularity access checking overhead could be significant on huge
+systems, LRU lists are normally not proactively sorted but partially and
+reactively sorted for special events including specific user requests, system
+calls and memory pressure. As a result, LRU lists are sometimes not so
+perfectly prepared to be used as a trustworthy access pattern source for some
+situations including reclamation target pages selection under sudden memory
+pressure.
+
+Because DAMON can identify access patterns of best-effort accuracy while
+inducing only user-specified range of overhead, proactively running
+DAMON_LRU_SORT could be helpful for making LRU lists more trustworthy access
+pattern source with low and controlled overhead.
+
+How It Works?
+=============
+
+DAMON_LRU_SORT finds hot pages (pages of memory regions that showing access
+rates that higher than a user-specified threshold) and cold pages (pages of
+memory regions that showing no access for a time that longer than a
+user-specified threshold) using DAMON, and prioritizes hot pages while
+deprioritizing cold pages on their LRU-lists. To avoid it consuming too much
+CPU for the prioritizations, a CPU time usage limit can be configured. Under
+the limit, it prioritizes and deprioritizes more hot and cold pages first,
+respectively. System administrators can also configure under what situation
+this scheme should automatically activated and deactivated with three memory
+pressure watermarks.
+
+Its default parameters for hotness/coldness thresholds and CPU quota limit are
+conservatively chosen. That is, the module under its default parameters could
+be widely used without harm for common situations while providing a level of
+benefits for systems having clear hot/cold access patterns under memory
+pressure while consuming only a limited small portion of CPU time.
+
+Interface: Module Parameters
+============================
+
+To use this feature, you should first ensure your system is running on a kernel
+that is built with ``CONFIG_DAMON_LRU_SORT=y``.
+
+To let sysadmins enable or disable it and tune for the given system,
+DAMON_LRU_SORT utilizes module parameters. That is, you can put
+``damon_lru_sort.<parameter>=<value>`` on the kernel boot command line or write
+proper values to ``/sys/module/damon_lru_sort/parameters/<parameter>`` files.
+
+Below are the description of each parameter.
+
+enabled
+-------
+
+Enable or disable DAMON_LRU_SORT.
+
+You can enable DAMON_LRU_SORT by setting the value of this parameter as ``Y``.
+Setting it as ``N`` disables DAMON_LRU_SORT. Note that DAMON_LRU_SORT could do
+no real monitoring and LRU-lists sorting due to the watermarks-based activation
+condition. Refer to below descriptions for the watermarks parameter for this.
+
+commit_inputs
+-------------
+
+Make DAMON_LRU_SORT reads the input parameters again, except ``enabled``.
+
+Input parameters that updated while DAMON_LRU_SORT is running are not applied
+by default. Once this parameter is set as ``Y``, DAMON_LRU_SORT reads values
+of parametrs except ``enabled`` again. Once the re-reading is done, this
+parameter is set as ``N``. If invalid parameters are found while the
+re-reading, DAMON_LRU_SORT will be disabled.
+
+hot_thres_access_freq
+---------------------
+
+Access frequency threshold for hot memory regions identification in permil.
+
+If a memory region is accessed in frequency of this or higher, DAMON_LRU_SORT
+identifies the region as hot, and mark it as accessed on the LRU list, so that
+it could not be reclaimed under memory pressure. 50% by default.
+
+cold_min_age
+------------
+
+Time threshold for cold memory regions identification in microseconds.
+
+If a memory region is not accessed for this or longer time, DAMON_LRU_SORT
+identifies the region as cold, and mark it as unaccessed on the LRU list, so
+that it could be reclaimed first under memory pressure. 120 seconds by
+default.
+
+quota_ms
+--------
+
+Limit of time for trying the LRU lists sorting in milliseconds.
+
+DAMON_LRU_SORT tries to use only up to this time within a time window
+(quota_reset_interval_ms) for trying LRU lists sorting. This can be used
+for limiting CPU consumption of DAMON_LRU_SORT. If the value is zero, the
+limit is disabled.
+
+10 ms by default.
+
+quota_reset_interval_ms
+-----------------------
+
+The time quota charge reset interval in milliseconds.
+
+The charge reset interval for the quota of time (quota_ms). That is,
+DAMON_LRU_SORT does not try LRU-lists sorting for more than quota_ms
+milliseconds or quota_sz bytes within quota_reset_interval_ms milliseconds.
+
+1 second by default.
+
+wmarks_interval
+---------------
+
+The watermarks check time interval in microseconds.
+
+Minimal time to wait before checking the watermarks, when DAMON_LRU_SORT is
+enabled but inactive due to its watermarks rule. 5 seconds by default.
+
+wmarks_high
+-----------
+
+Free memory rate (per thousand) for the high watermark.
+
+If free memory of the system in bytes per thousand bytes is higher than this,
+DAMON_LRU_SORT becomes inactive, so it does nothing but periodically checks the
+watermarks. 200 (20%) by default.
+
+wmarks_mid
+----------
+
+Free memory rate (per thousand) for the middle watermark.
+
+If free memory of the system in bytes per thousand bytes is between this and
+the low watermark, DAMON_LRU_SORT becomes active, so starts the monitoring and
+the LRU-lists sorting. 150 (15%) by default.
+
+wmarks_low
+----------
+
+Free memory rate (per thousand) for the low watermark.
+
+If free memory of the system in bytes per thousand bytes is lower than this,
+DAMON_LRU_SORT becomes inactive, so it does nothing but periodically checks the
+watermarks. 50 (5%) by default.
+
+sample_interval
+---------------
+
+Sampling interval for the monitoring in microseconds.
+
+The sampling interval of DAMON for the cold memory monitoring. Please refer to
+the DAMON documentation (:doc:`usage`) for more detail. 5ms by default.
+
+aggr_interval
+-------------
+
+Aggregation interval for the monitoring in microseconds.
+
+The aggregation interval of DAMON for the cold memory monitoring. Please
+refer to the DAMON documentation (:doc:`usage`) for more detail. 100ms by
+default.
+
+min_nr_regions
+--------------
+
+Minimum number of monitoring regions.
+
+The minimal number of monitoring regions of DAMON for the cold memory
+monitoring. This can be used to set lower-bound of the monitoring quality.
+But, setting this too high could result in increased monitoring overhead.
+Please refer to the DAMON documentation (:doc:`usage`) for more detail. 10 by
+default.
+
+max_nr_regions
+--------------
+
+Maximum number of monitoring regions.
+
+The maximum number of monitoring regions of DAMON for the cold memory
+monitoring. This can be used to set upper-bound of the monitoring overhead.
+However, setting this too low could result in bad monitoring quality. Please
+refer to the DAMON documentation (:doc:`usage`) for more detail. 1000 by
+defaults.
+
+monitor_region_start
+--------------------
+
+Start of target memory region in physical address.
+
+The start physical address of memory region that DAMON_LRU_SORT will do work
+against. By default, biggest System RAM is used as the region.
+
+monitor_region_end
+------------------
+
+End of target memory region in physical address.
+
+The end physical address of memory region that DAMON_LRU_SORT will do work
+against. By default, biggest System RAM is used as the region.
+
+kdamond_pid
+-----------
+
+PID of the DAMON thread.
+
+If DAMON_LRU_SORT is enabled, this becomes the PID of the worker thread. Else,
+-1.
+
+nr_lru_sort_tried_hot_regions
+-----------------------------
+
+Number of hot memory regions that tried to be LRU-sorted.
+
+bytes_lru_sort_tried_hot_regions
+--------------------------------
+
+Total bytes of hot memory regions that tried to be LRU-sorted.
+
+nr_lru_sorted_hot_regions
+-------------------------
+
+Number of hot memory regions that successfully be LRU-sorted.
+
+bytes_lru_sorted_hot_regions
+----------------------------
+
+Total bytes of hot memory regions that successfully be LRU-sorted.
+
+nr_hot_quota_exceeds
+--------------------
+
+Number of times that the time quota limit for hot regions have exceeded.
+
+nr_lru_sort_tried_cold_regions
+------------------------------
+
+Number of cold memory regions that tried to be LRU-sorted.
+
+bytes_lru_sort_tried_cold_regions
+---------------------------------
+
+Total bytes of cold memory regions that tried to be LRU-sorted.
+
+nr_lru_sorted_cold_regions
+--------------------------
+
+Number of cold memory regions that successfully be LRU-sorted.
+
+bytes_lru_sorted_cold_regions
+-----------------------------
+
+Total bytes of cold memory regions that successfully be LRU-sorted.
+
+nr_cold_quota_exceeds
+---------------------
+
+Number of times that the time quota limit for cold regions have exceeded.
+
+Example
+=======
+
+Below runtime example commands make DAMON_LRU_SORT to find memory regions
+having >=50% access frequency and LRU-prioritize while LRU-deprioritizing
+memory regions that not accessed for 120 seconds. The prioritization and
+deprioritization is limited to be done using only up to 1% CPU time to avoid
+DAMON_LRU_SORT consuming too much CPU time for the (de)prioritization. It also
+asks DAMON_LRU_SORT to do nothing if the system's free memory rate is more than
+50%, but start the real works if it becomes lower than 40%. If DAMON_RECLAIM
+doesn't make progress and therefore the free memory rate becomes lower than
+20%, it asks DAMON_LRU_SORT to do nothing again, so that we can fall back to
+the LRU-list based page granularity reclamation. ::
+
+ # cd /sys/module/damon_lru_sort/parameters
+ # echo 500 > hot_thres_access_freq
+ # echo 120000000 > cold_min_age
+ # echo 10 > quota_ms
+ # echo 1000 > quota_reset_interval_ms
+ # echo 500 > wmarks_high
+ # echo 400 > wmarks_mid
+ # echo 200 > wmarks_low
+ # echo Y > enabled
diff --git a/Documentation/admin-guide/mm/damon/reclaim.rst b/Documentation/admin-guide/mm/damon/reclaim.rst
new file mode 100644
index 000000000000..343e25b252f4
--- /dev/null
+++ b/Documentation/admin-guide/mm/damon/reclaim.rst
@@ -0,0 +1,274 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+DAMON-based Reclamation
+=======================
+
+DAMON-based Reclamation (DAMON_RECLAIM) is a static kernel module that aimed to
+be used for proactive and lightweight reclamation under light memory pressure.
+It doesn't aim to replace the LRU-list based page_granularity reclamation, but
+to be selectively used for different level of memory pressure and requirements.
+
+Where Proactive Reclamation is Required?
+========================================
+
+On general memory over-committed systems, proactively reclaiming cold pages
+helps saving memory and reducing latency spikes that incurred by the direct
+reclaim of the process or CPU consumption of kswapd, while incurring only
+minimal performance degradation [1]_ [2]_ .
+
+Free Pages Reporting [3]_ based memory over-commit virtualization systems are
+good example of the cases. In such systems, the guest VMs reports their free
+memory to host, and the host reallocates the reported memory to other guests.
+As a result, the memory of the systems are fully utilized. However, the
+guests could be not so memory-frugal, mainly because some kernel subsystems and
+user-space applications are designed to use as much memory as available. Then,
+guests could report only small amount of memory as free to host, results in
+memory utilization drop of the systems. Running the proactive reclamation in
+guests could mitigate this problem.
+
+How It Works?
+=============
+
+DAMON_RECLAIM finds memory regions that didn't accessed for specific time
+duration and page out. To avoid it consuming too much CPU for the paging out
+operation, a speed limit can be configured. Under the speed limit, it pages
+out memory regions that didn't accessed longer time first. System
+administrators can also configure under what situation this scheme should
+automatically activated and deactivated with three memory pressure watermarks.
+
+Interface: Module Parameters
+============================
+
+To use this feature, you should first ensure your system is running on a kernel
+that is built with ``CONFIG_DAMON_RECLAIM=y``.
+
+To let sysadmins enable or disable it and tune for the given system,
+DAMON_RECLAIM utilizes module parameters. That is, you can put
+``damon_reclaim.<parameter>=<value>`` on the kernel boot command line or write
+proper values to ``/sys/module/damon_reclaim/parameters/<parameter>`` files.
+
+Below are the description of each parameter.
+
+enabled
+-------
+
+Enable or disable DAMON_RECLAIM.
+
+You can enable DAMON_RCLAIM by setting the value of this parameter as ``Y``.
+Setting it as ``N`` disables DAMON_RECLAIM. Note that DAMON_RECLAIM could do
+no real monitoring and reclamation due to the watermarks-based activation
+condition. Refer to below descriptions for the watermarks parameter for this.
+
+commit_inputs
+-------------
+
+Make DAMON_RECLAIM reads the input parameters again, except ``enabled``.
+
+Input parameters that updated while DAMON_RECLAIM is running are not applied
+by default. Once this parameter is set as ``Y``, DAMON_RECLAIM reads values
+of parametrs except ``enabled`` again. Once the re-reading is done, this
+parameter is set as ``N``. If invalid parameters are found while the
+re-reading, DAMON_RECLAIM will be disabled.
+
+min_age
+-------
+
+Time threshold for cold memory regions identification in microseconds.
+
+If a memory region is not accessed for this or longer time, DAMON_RECLAIM
+identifies the region as cold, and reclaims it.
+
+120 seconds by default.
+
+quota_ms
+--------
+
+Limit of time for the reclamation in milliseconds.
+
+DAMON_RECLAIM tries to use only up to this time within a time window
+(quota_reset_interval_ms) for trying reclamation of cold pages. This can be
+used for limiting CPU consumption of DAMON_RECLAIM. If the value is zero, the
+limit is disabled.
+
+10 ms by default.
+
+quota_sz
+--------
+
+Limit of size of memory for the reclamation in bytes.
+
+DAMON_RECLAIM charges amount of memory which it tried to reclaim within a time
+window (quota_reset_interval_ms) and makes no more than this limit is tried.
+This can be used for limiting consumption of CPU and IO. If this value is
+zero, the limit is disabled.
+
+128 MiB by default.
+
+quota_reset_interval_ms
+-----------------------
+
+The time/size quota charge reset interval in milliseconds.
+
+The charget reset interval for the quota of time (quota_ms) and size
+(quota_sz). That is, DAMON_RECLAIM does not try reclamation for more than
+quota_ms milliseconds or quota_sz bytes within quota_reset_interval_ms
+milliseconds.
+
+1 second by default.
+
+wmarks_interval
+---------------
+
+Minimal time to wait before checking the watermarks, when DAMON_RECLAIM is
+enabled but inactive due to its watermarks rule.
+
+wmarks_high
+-----------
+
+Free memory rate (per thousand) for the high watermark.
+
+If free memory of the system in bytes per thousand bytes is higher than this,
+DAMON_RECLAIM becomes inactive, so it does nothing but only periodically checks
+the watermarks.
+
+wmarks_mid
+----------
+
+Free memory rate (per thousand) for the middle watermark.
+
+If free memory of the system in bytes per thousand bytes is between this and
+the low watermark, DAMON_RECLAIM becomes active, so starts the monitoring and
+the reclaiming.
+
+wmarks_low
+----------
+
+Free memory rate (per thousand) for the low watermark.
+
+If free memory of the system in bytes per thousand bytes is lower than this,
+DAMON_RECLAIM becomes inactive, so it does nothing but periodically checks the
+watermarks. In the case, the system falls back to the LRU-list based page
+granularity reclamation logic.
+
+sample_interval
+---------------
+
+Sampling interval for the monitoring in microseconds.
+
+The sampling interval of DAMON for the cold memory monitoring. Please refer to
+the DAMON documentation (:doc:`usage`) for more detail.
+
+aggr_interval
+-------------
+
+Aggregation interval for the monitoring in microseconds.
+
+The aggregation interval of DAMON for the cold memory monitoring. Please
+refer to the DAMON documentation (:doc:`usage`) for more detail.
+
+min_nr_regions
+--------------
+
+Minimum number of monitoring regions.
+
+The minimal number of monitoring regions of DAMON for the cold memory
+monitoring. This can be used to set lower-bound of the monitoring quality.
+But, setting this too high could result in increased monitoring overhead.
+Please refer to the DAMON documentation (:doc:`usage`) for more detail.
+
+max_nr_regions
+--------------
+
+Maximum number of monitoring regions.
+
+The maximum number of monitoring regions of DAMON for the cold memory
+monitoring. This can be used to set upper-bound of the monitoring overhead.
+However, setting this too low could result in bad monitoring quality. Please
+refer to the DAMON documentation (:doc:`usage`) for more detail.
+
+monitor_region_start
+--------------------
+
+Start of target memory region in physical address.
+
+The start physical address of memory region that DAMON_RECLAIM will do work
+against. That is, DAMON_RECLAIM will find cold memory regions in this region
+and reclaims. By default, biggest System RAM is used as the region.
+
+monitor_region_end
+------------------
+
+End of target memory region in physical address.
+
+The end physical address of memory region that DAMON_RECLAIM will do work
+against. That is, DAMON_RECLAIM will find cold memory regions in this region
+and reclaims. By default, biggest System RAM is used as the region.
+
+skip_anon
+---------
+
+Skip anonymous pages reclamation.
+
+If this parameter is set as ``Y``, DAMON_RECLAIM does not reclaim anonymous
+pages. By default, ``N``.
+
+
+kdamond_pid
+-----------
+
+PID of the DAMON thread.
+
+If DAMON_RECLAIM is enabled, this becomes the PID of the worker thread. Else,
+-1.
+
+nr_reclaim_tried_regions
+------------------------
+
+Number of memory regions that tried to be reclaimed by DAMON_RECLAIM.
+
+bytes_reclaim_tried_regions
+---------------------------
+
+Total bytes of memory regions that tried to be reclaimed by DAMON_RECLAIM.
+
+nr_reclaimed_regions
+--------------------
+
+Number of memory regions that successfully be reclaimed by DAMON_RECLAIM.
+
+bytes_reclaimed_regions
+-----------------------
+
+Total bytes of memory regions that successfully be reclaimed by DAMON_RECLAIM.
+
+nr_quota_exceeds
+----------------
+
+Number of times that the time/space quota limits have exceeded.
+
+Example
+=======
+
+Below runtime example commands make DAMON_RECLAIM to find memory regions that
+not accessed for 30 seconds or more and pages out. The reclamation is limited
+to be done only up to 1 GiB per second to avoid DAMON_RECLAIM consuming too
+much CPU time for the paging out operation. It also asks DAMON_RECLAIM to do
+nothing if the system's free memory rate is more than 50%, but start the real
+works if it becomes lower than 40%. If DAMON_RECLAIM doesn't make progress and
+therefore the free memory rate becomes lower than 20%, it asks DAMON_RECLAIM to
+do nothing again, so that we can fall back to the LRU-list based page
+granularity reclamation. ::
+
+ # cd /sys/module/damon_reclaim/parameters
+ # echo 30000000 > min_age
+ # echo $((1 * 1024 * 1024 * 1024)) > quota_sz
+ # echo 1000 > quota_reset_interval_ms
+ # echo 500 > wmarks_high
+ # echo 400 > wmarks_mid
+ # echo 200 > wmarks_low
+ # echo Y > enabled
+
+.. [1] https://research.google/pubs/pub48551/
+.. [2] https://lwn.net/Articles/787611/
+.. [3] https://www.kernel.org/doc/html/latest/mm/free_page_reporting.html
diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst
new file mode 100644
index 000000000000..7aa0071ff1c3
--- /dev/null
+++ b/Documentation/admin-guide/mm/damon/start.rst
@@ -0,0 +1,127 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+Getting Started
+===============
+
+This document briefly describes how you can use DAMON by demonstrating its
+default user space tool. Please note that this document describes only a part
+of its features for brevity. Please refer to the usage `doc
+<https://github.com/awslabs/damo/blob/next/USAGE.md>`_ of the tool for more
+details.
+
+
+Prerequisites
+=============
+
+Kernel
+------
+
+You should first ensure your system is running on a kernel built with
+``CONFIG_DAMON_*=y``.
+
+
+User Space Tool
+---------------
+
+For the demonstration, we will use the default user space tool for DAMON,
+called DAMON Operator (DAMO). It is available at
+https://github.com/awslabs/damo. The examples below assume that ``damo`` is on
+your ``$PATH``. It's not mandatory, though.
+
+Because DAMO is using the sysfs interface (refer to :doc:`usage` for the
+detail) of DAMON, you should ensure :doc:`sysfs </filesystems/sysfs>` is
+mounted.
+
+
+Recording Data Access Patterns
+==============================
+
+The commands below record the memory access patterns of a program and save the
+monitoring results to a file. ::
+
+ $ git clone https://github.com/sjp38/masim
+ $ cd masim; make; ./masim ./configs/zigzag.cfg &
+ $ sudo damo record -o damon.data $(pidof masim)
+
+The first two lines of the commands download an artificial memory access
+generator program and run it in the background. The generator will repeatedly
+access two 100 MiB sized memory regions one by one. You can substitute this
+with your real workload. The last line asks ``damo`` to record the access
+pattern in the ``damon.data`` file.
+
+
+Visualizing Recorded Patterns
+=============================
+
+You can visualize the pattern in a heatmap, showing which memory region
+(x-axis) got accessed when (y-axis) and how frequently (number).::
+
+ $ sudo damo report heats --heatmap stdout
+ 22222222222222222222222222222222222222211111111111111111111111111111111111111100
+ 44444444444444444444444444444444444444434444444444444444444444444444444444443200
+ 44444444444444444444444444444444444444433444444444444444444444444444444444444200
+ 33333333333333333333333333333333333333344555555555555555555555555555555555555200
+ 33333333333333333333333333333333333344444444444444444444444444444444444444444200
+ 22222222222222222222222222222222222223355555555555555555555555555555555555555200
+ 00000000000000000000000000000000000000288888888888888888888888888888888888888400
+ 00000000000000000000000000000000000000288888888888888888888888888888888888888400
+ 33333333333333333333333333333333333333355555555555555555555555555555555555555200
+ 88888888888888888888888888888888888888600000000000000000000000000000000000000000
+ 88888888888888888888888888888888888888600000000000000000000000000000000000000000
+ 33333333333333333333333333333333333333444444444444444444444444444444444444443200
+ 00000000000000000000000000000000000000288888888888888888888888888888888888888400
+ [...]
+ # access_frequency: 0 1 2 3 4 5 6 7 8 9
+ # x-axis: space (139728247021568-139728453431248: 196.848 MiB)
+ # y-axis: time (15256597248362-15326899978162: 1 m 10.303 s)
+ # resolution: 80x40 (2.461 MiB and 1.758 s for each character)
+
+You can also visualize the distribution of the working set size, sorted by the
+size.::
+
+ $ sudo damo report wss --range 0 101 10
+ # <percentile> <wss>
+ # target_id 18446632103789443072
+ # avr: 107.708 MiB
+ 0 0 B | |
+ 10 95.328 MiB |**************************** |
+ 20 95.332 MiB |**************************** |
+ 30 95.340 MiB |**************************** |
+ 40 95.387 MiB |**************************** |
+ 50 95.387 MiB |**************************** |
+ 60 95.398 MiB |**************************** |
+ 70 95.398 MiB |**************************** |
+ 80 95.504 MiB |**************************** |
+ 90 190.703 MiB |********************************************************* |
+ 100 196.875 MiB |***********************************************************|
+
+Using ``--sortby`` option with the above command, you can show how the working
+set size has chronologically changed.::
+
+ $ sudo damo report wss --range 0 101 10 --sortby time
+ # <percentile> <wss>
+ # target_id 18446632103789443072
+ # avr: 107.708 MiB
+ 0 3.051 MiB | |
+ 10 190.703 MiB |***********************************************************|
+ 20 95.336 MiB |***************************** |
+ 30 95.328 MiB |***************************** |
+ 40 95.387 MiB |***************************** |
+ 50 95.332 MiB |***************************** |
+ 60 95.320 MiB |***************************** |
+ 70 95.398 MiB |***************************** |
+ 80 95.398 MiB |***************************** |
+ 90 95.340 MiB |***************************** |
+ 100 95.398 MiB |***************************** |
+
+
+Data Access Pattern Aware Memory Management
+===========================================
+
+Below command makes every memory region of size >=4K that has not accessed for
+>=60 seconds in your workload to be swapped out. ::
+
+ $ sudo damo schemes --damos_access_rate 0 0 --damos_sz_region 4K max \
+ --damos_age 60s max --damos_action pageout \
+ <pid of your workload>
diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
new file mode 100644
index 000000000000..9d23144bf985
--- /dev/null
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -0,0 +1,911 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+Detailed Usages
+===============
+
+DAMON provides below interfaces for different users.
+
+- *DAMON user space tool.*
+ `This <https://github.com/awslabs/damo>`_ is for privileged people such as
+ system administrators who want a just-working human-friendly interface.
+ Using this, users can use the DAMON’s major features in a human-friendly way.
+ It may not be highly tuned for special cases, though. For more detail,
+ please refer to its `usage document
+ <https://github.com/awslabs/damo/blob/next/USAGE.md>`_.
+- *sysfs interface.*
+ :ref:`This <sysfs_interface>` is for privileged user space programmers who
+ want more optimized use of DAMON. Using this, users can use DAMON’s major
+ features by reading from and writing to special sysfs files. Therefore,
+ you can write and use your personalized DAMON sysfs wrapper programs that
+ reads/writes the sysfs files instead of you. The `DAMON user space tool
+ <https://github.com/awslabs/damo>`_ is one example of such programs.
+- *Kernel Space Programming Interface.*
+ :doc:`This </mm/damon/api>` is for kernel space programmers. Using this,
+ users can utilize every feature of DAMON most flexibly and efficiently by
+ writing kernel space DAMON application programs for you. You can even extend
+ DAMON for various address spaces. For detail, please refer to the interface
+ :doc:`document </mm/damon/api>`.
+- *debugfs interface. (DEPRECATED!)*
+ :ref:`This <debugfs_interface>` is almost identical to :ref:`sysfs interface
+ <sysfs_interface>`. This is deprecated, so users should move to the
+ :ref:`sysfs interface <sysfs_interface>`. If you depend on this and cannot
+ move, please report your usecase to damon@lists.linux.dev and
+ linux-mm@kvack.org.
+
+.. _sysfs_interface:
+
+sysfs Interface
+===============
+
+DAMON sysfs interface is built when ``CONFIG_DAMON_SYSFS`` is defined. It
+creates multiple directories and files under its sysfs directory,
+``<sysfs>/kernel/mm/damon/``. You can control DAMON by writing to and reading
+from the files under the directory.
+
+For a short example, users can monitor the virtual address space of a given
+workload as below. ::
+
+ # cd /sys/kernel/mm/damon/admin/
+ # echo 1 > kdamonds/nr_kdamonds && echo 1 > kdamonds/0/contexts/nr_contexts
+ # echo vaddr > kdamonds/0/contexts/0/operations
+ # echo 1 > kdamonds/0/contexts/0/targets/nr_targets
+ # echo $(pidof <workload>) > kdamonds/0/contexts/0/targets/0/pid_target
+ # echo on > kdamonds/0/state
+
+Files Hierarchy
+---------------
+
+The files hierarchy of DAMON sysfs interface is shown below. In the below
+figure, parents-children relations are represented with indentations, each
+directory is having ``/`` suffix, and files in each directory are separated by
+comma (",").
+
+.. parsed-literal::
+
+ :ref:`/sys/kernel/mm/damon <sysfs_root>`/admin
+ │ :ref:`kdamonds <sysfs_kdamonds>`/nr_kdamonds
+ │ │ :ref:`0 <sysfs_kdamond>`/state,pid
+ │ │ │ :ref:`contexts <sysfs_contexts>`/nr_contexts
+ │ │ │ │ :ref:`0 <sysfs_context>`/avail_operations,operations
+ │ │ │ │ │ :ref:`monitoring_attrs <sysfs_monitoring_attrs>`/
+ │ │ │ │ │ │ intervals/sample_us,aggr_us,update_us
+ │ │ │ │ │ │ nr_regions/min,max
+ │ │ │ │ │ :ref:`targets <sysfs_targets>`/nr_targets
+ │ │ │ │ │ │ :ref:`0 <sysfs_target>`/pid_target
+ │ │ │ │ │ │ │ :ref:`regions <sysfs_regions>`/nr_regions
+ │ │ │ │ │ │ │ │ :ref:`0 <sysfs_region>`/start,end
+ │ │ │ │ │ │ │ │ ...
+ │ │ │ │ │ │ ...
+ │ │ │ │ │ :ref:`schemes <sysfs_schemes>`/nr_schemes
+ │ │ │ │ │ │ :ref:`0 <sysfs_scheme>`/action,apply_interval_us
+ │ │ │ │ │ │ │ :ref:`access_pattern <sysfs_access_pattern>`/
+ │ │ │ │ │ │ │ │ sz/min,max
+ │ │ │ │ │ │ │ │ nr_accesses/min,max
+ │ │ │ │ │ │ │ │ age/min,max
+ │ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms
+ │ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil
+ │ │ │ │ │ │ │ │ :ref:`goals <sysfs_schemes_quota_goals>`/nr_goals
+ │ │ │ │ │ │ │ │ │ 0/target_value,current_value
+ │ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low
+ │ │ │ │ │ │ │ :ref:`filters <sysfs_filters>`/nr_filters
+ │ │ │ │ │ │ │ │ 0/type,matching,memcg_id
+ │ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds
+ │ │ │ │ │ │ │ :ref:`tried_regions <sysfs_schemes_tried_regions>`/total_bytes
+ │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age
+ │ │ │ │ │ │ │ │ ...
+ │ │ │ │ │ │ ...
+ │ │ │ │ ...
+ │ │ ...
+
+.. _sysfs_root:
+
+Root
+----
+
+The root of the DAMON sysfs interface is ``<sysfs>/kernel/mm/damon/``, and it
+has one directory named ``admin``. The directory contains the files for
+privileged user space programs' control of DAMON. User space tools or daemons
+having the root permission could use this directory.
+
+.. _sysfs_kdamonds:
+
+kdamonds/
+---------
+
+Under the ``admin`` directory, one directory, ``kdamonds``, which has files for
+controlling the kdamonds (refer to
+:ref:`design <damon_design_execution_model_and_data_structures>` for more
+details) exists. In the beginning, this directory has only one file,
+``nr_kdamonds``. Writing a number (``N``) to the file creates the number of
+child directories named ``0`` to ``N-1``. Each directory represents each
+kdamond.
+
+.. _sysfs_kdamond:
+
+kdamonds/<N>/
+-------------
+
+In each kdamond directory, two files (``state`` and ``pid``) and one directory
+(``contexts``) exist.
+
+Reading ``state`` returns ``on`` if the kdamond is currently running, or
+``off`` if it is not running.
+
+Users can write below commands for the kdamond to the ``state`` file.
+
+- ``on``: Start running.
+- ``off``: Stop running.
+- ``commit``: Read the user inputs in the sysfs files except ``state`` file
+ again.
+- ``commit_schemes_quota_goals``: Read the DAMON-based operation schemes'
+ :ref:`quota goals <sysfs_schemes_quota_goals>`.
+- ``update_schemes_stats``: Update the contents of stats files for each
+ DAMON-based operation scheme of the kdamond. For details of the stats,
+ please refer to :ref:`stats section <sysfs_schemes_stats>`.
+- ``update_schemes_tried_regions``: Update the DAMON-based operation scheme
+ action tried regions directory for each DAMON-based operation scheme of the
+ kdamond. For details of the DAMON-based operation scheme action tried
+ regions directory, please refer to
+ :ref:`tried_regions section <sysfs_schemes_tried_regions>`.
+- ``update_schemes_tried_bytes``: Update only ``.../tried_regions/total_bytes``
+ files.
+- ``clear_schemes_tried_regions``: Clear the DAMON-based operating scheme
+ action tried regions directory for each DAMON-based operation scheme of the
+ kdamond.
+
+If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread.
+
+``contexts`` directory contains files for controlling the monitoring contexts
+that this kdamond will execute.
+
+.. _sysfs_contexts:
+
+kdamonds/<N>/contexts/
+----------------------
+
+In the beginning, this directory has only one file, ``nr_contexts``. Writing a
+number (``N``) to the file creates the number of child directories named as
+``0`` to ``N-1``. Each directory represents each monitoring context (refer to
+:ref:`design <damon_design_execution_model_and_data_structures>` for more
+details). At the moment, only one context per kdamond is supported, so only
+``0`` or ``1`` can be written to the file.
+
+.. _sysfs_context:
+
+contexts/<N>/
+-------------
+
+In each context directory, two files (``avail_operations`` and ``operations``)
+and three directories (``monitoring_attrs``, ``targets``, and ``schemes``)
+exist.
+
+DAMON supports multiple types of monitoring operations, including those for
+virtual address space and the physical address space. You can get the list of
+available monitoring operations set on the currently running kernel by reading
+``avail_operations`` file. Based on the kernel configuration, the file will
+list some or all of below keywords.
+
+ - vaddr: Monitor virtual address spaces of specific processes
+ - fvaddr: Monitor fixed virtual address ranges
+ - paddr: Monitor the physical address space of the system
+
+Please refer to :ref:`regions sysfs directory <sysfs_regions>` for detailed
+differences between the operations sets in terms of the monitoring target
+regions.
+
+You can set and get what type of monitoring operations DAMON will use for the
+context by writing one of the keywords listed in ``avail_operations`` file and
+reading from the ``operations`` file.
+
+.. _sysfs_monitoring_attrs:
+
+contexts/<N>/monitoring_attrs/
+------------------------------
+
+Files for specifying attributes of the monitoring including required quality
+and efficiency of the monitoring are in ``monitoring_attrs`` directory.
+Specifically, two directories, ``intervals`` and ``nr_regions`` exist in this
+directory.
+
+Under ``intervals`` directory, three files for DAMON's sampling interval
+(``sample_us``), aggregation interval (``aggr_us``), and update interval
+(``update_us``) exist. You can set and get the values in micro-seconds by
+writing to and reading from the files.
+
+Under ``nr_regions`` directory, two files for the lower-bound and upper-bound
+of DAMON's monitoring regions (``min`` and ``max``, respectively), which
+controls the monitoring overhead, exist. You can set and get the values by
+writing to and rading from the files.
+
+For more details about the intervals and monitoring regions range, please refer
+to the Design document (:doc:`/mm/damon/design`).
+
+.. _sysfs_targets:
+
+contexts/<N>/targets/
+---------------------
+
+In the beginning, this directory has only one file, ``nr_targets``. Writing a
+number (``N``) to the file creates the number of child directories named ``0``
+to ``N-1``. Each directory represents each monitoring target.
+
+.. _sysfs_target:
+
+targets/<N>/
+------------
+
+In each target directory, one file (``pid_target``) and one directory
+(``regions``) exist.
+
+If you wrote ``vaddr`` to the ``contexts/<N>/operations``, each target should
+be a process. You can specify the process to DAMON by writing the pid of the
+process to the ``pid_target`` file.
+
+.. _sysfs_regions:
+
+targets/<N>/regions
+-------------------
+
+When ``vaddr`` monitoring operations set is being used (``vaddr`` is written to
+the ``contexts/<N>/operations`` file), DAMON automatically sets and updates the
+monitoring target regions so that entire memory mappings of target processes
+can be covered. However, users could want to set the initial monitoring region
+to specific address ranges.
+
+In contrast, DAMON do not automatically sets and updates the monitoring target
+regions when ``fvaddr`` or ``paddr`` monitoring operations sets are being used
+(``fvaddr`` or ``paddr`` have written to the ``contexts/<N>/operations``).
+Therefore, users should set the monitoring target regions by themselves in the
+cases.
+
+For such cases, users can explicitly set the initial monitoring target regions
+as they want, by writing proper values to the files under this directory.
+
+In the beginning, this directory has only one file, ``nr_regions``. Writing a
+number (``N``) to the file creates the number of child directories named ``0``
+to ``N-1``. Each directory represents each initial monitoring target region.
+
+.. _sysfs_region:
+
+regions/<N>/
+------------
+
+In each region directory, you will find two files (``start`` and ``end``). You
+can set and get the start and end addresses of the initial monitoring target
+region by writing to and reading from the files, respectively.
+
+Each region should not overlap with others. ``end`` of directory ``N`` should
+be equal or smaller than ``start`` of directory ``N+1``.
+
+.. _sysfs_schemes:
+
+contexts/<N>/schemes/
+---------------------
+
+The directory for DAMON-based Operation Schemes (:ref:`DAMOS
+<damon_design_damos>`). Users can get and set the schemes by reading from and
+writing to files under this directory.
+
+In the beginning, this directory has only one file, ``nr_schemes``. Writing a
+number (``N``) to the file creates the number of child directories named ``0``
+to ``N-1``. Each directory represents each DAMON-based operation scheme.
+
+.. _sysfs_scheme:
+
+schemes/<N>/
+------------
+
+In each scheme directory, five directories (``access_pattern``, ``quotas``,
+``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and two files
+(``action`` and ``apply_interval``) exist.
+
+The ``action`` file is for setting and getting the scheme's :ref:`action
+<damon_design_damos_action>`. The keywords that can be written to and read
+from the file and their meaning are as below.
+
+Note that support of each action depends on the running DAMON operations set
+:ref:`implementation <sysfs_context>`.
+
+ - ``willneed``: Call ``madvise()`` for the region with ``MADV_WILLNEED``.
+ Supported by ``vaddr`` and ``fvaddr`` operations set.
+ - ``cold``: Call ``madvise()`` for the region with ``MADV_COLD``.
+ Supported by ``vaddr`` and ``fvaddr`` operations set.
+ - ``pageout``: Call ``madvise()`` for the region with ``MADV_PAGEOUT``.
+ Supported by ``vaddr``, ``fvaddr`` and ``paddr`` operations set.
+ - ``hugepage``: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``.
+ Supported by ``vaddr`` and ``fvaddr`` operations set.
+ - ``nohugepage``: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``.
+ Supported by ``vaddr`` and ``fvaddr`` operations set.
+ - ``lru_prio``: Prioritize the region on its LRU lists.
+ Supported by ``paddr`` operations set.
+ - ``lru_deprio``: Deprioritize the region on its LRU lists.
+ Supported by ``paddr`` operations set.
+ - ``stat``: Do nothing but count the statistics.
+ Supported by all operations sets.
+
+The ``apply_interval_us`` file is for setting and getting the scheme's
+:ref:`apply_interval <damon_design_damos>` in microseconds.
+
+.. _sysfs_access_pattern:
+
+schemes/<N>/access_pattern/
+---------------------------
+
+The directory for the target access :ref:`pattern
+<damon_design_damos_access_pattern>` of the given DAMON-based operation scheme.
+
+Under the ``access_pattern`` directory, three directories (``sz``,
+``nr_accesses``, and ``age``) each having two files (``min`` and ``max``)
+exist. You can set and get the access pattern for the given scheme by writing
+to and reading from the ``min`` and ``max`` files under ``sz``,
+``nr_accesses``, and ``age`` directories, respectively. Note that the ``min``
+and the ``max`` form a closed interval.
+
+.. _sysfs_quotas:
+
+schemes/<N>/quotas/
+-------------------
+
+The directory for the :ref:`quotas <damon_design_damos_quotas>` of the given
+DAMON-based operation scheme.
+
+Under ``quotas`` directory, three files (``ms``, ``bytes``,
+``reset_interval_ms``) and two directores (``weights`` and ``goals``) exist.
+
+You can set the ``time quota`` in milliseconds, ``size quota`` in bytes, and
+``reset interval`` in milliseconds by writing the values to the three files,
+respectively. Then, DAMON tries to use only up to ``time quota`` milliseconds
+for applying the ``action`` to memory regions of the ``access_pattern``, and to
+apply the action to only up to ``bytes`` bytes of memory regions within the
+``reset_interval_ms``. Setting both ``ms`` and ``bytes`` zero disables the
+quota limits.
+
+Under ``weights`` directory, three files (``sz_permil``,
+``nr_accesses_permil``, and ``age_permil``) exist.
+You can set the :ref:`prioritization weights
+<damon_design_damos_quotas_prioritization>` for size, access frequency, and age
+in per-thousand unit by writing the values to the three files under the
+``weights`` directory.
+
+.. _sysfs_schemes_quota_goals:
+
+schemes/<N>/quotas/goals/
+-------------------------
+
+The directory for the :ref:`automatic quota tuning goals
+<damon_design_damos_quotas_auto_tuning>` of the given DAMON-based operation
+scheme.
+
+In the beginning, this directory has only one file, ``nr_goals``. Writing a
+number (``N``) to the file creates the number of child directories named ``0``
+to ``N-1``. Each directory represents each goal and current achievement.
+Among the multiple feedback, the best one is used.
+
+Each goal directory contains two files, namely ``target_value`` and
+``current_value``. Users can set and get any number to those files to set the
+feedback. User space main workload's latency or throughput, system metrics
+like free memory ratio or memory pressure stall time (PSI) could be example
+metrics for the values. Note that users should write
+``commit_schemes_quota_goals`` to the ``state`` file of the :ref:`kdamond
+directory <sysfs_kdamond>` to pass the feedback to DAMON.
+
+.. _sysfs_watermarks:
+
+schemes/<N>/watermarks/
+-----------------------
+
+The directory for the :ref:`watermarks <damon_design_damos_watermarks>` of the
+given DAMON-based operation scheme.
+
+Under the watermarks directory, five files (``metric``, ``interval_us``,
+``high``, ``mid``, and ``low``) for setting the metric, the time interval
+between check of the metric, and the three watermarks exist. You can set and
+get the five values by writing to the files, respectively.
+
+Keywords and meanings of those that can be written to the ``metric`` file are
+as below.
+
+ - none: Ignore the watermarks
+ - free_mem_rate: System's free memory rate (per thousand)
+
+The ``interval`` should written in microseconds unit.
+
+.. _sysfs_filters:
+
+schemes/<N>/filters/
+--------------------
+
+The directory for the :ref:`filters <damon_design_damos_filters>` of the given
+DAMON-based operation scheme.
+
+In the beginning, this directory has only one file, ``nr_filters``. Writing a
+number (``N``) to the file creates the number of child directories named ``0``
+to ``N-1``. Each directory represents each filter. The filters are evaluated
+in the numeric order.
+
+Each filter directory contains six files, namely ``type``, ``matcing``,
+``memcg_path``, ``addr_start``, ``addr_end``, and ``target_idx``. To ``type``
+file, you can write one of four special keywords: ``anon`` for anonymous pages,
+``memcg`` for specific memory cgroup, ``addr`` for specific address range (an
+open-ended interval), or ``target`` for specific DAMON monitoring target
+filtering. In case of the memory cgroup filtering, you can specify the memory
+cgroup of the interest by writing the path of the memory cgroup from the
+cgroups mount point to ``memcg_path`` file. In case of the address range
+filtering, you can specify the start and end address of the range to
+``addr_start`` and ``addr_end`` files, respectively. For the DAMON monitoring
+target filtering, you can specify the index of the target between the list of
+the DAMON context's monitoring targets list to ``target_idx`` file. You can
+write ``Y`` or ``N`` to ``matching`` file to filter out pages that does or does
+not match to the type, respectively. Then, the scheme's action will not be
+applied to the pages that specified to be filtered out.
+
+For example, below restricts a DAMOS action to be applied to only non-anonymous
+pages of all memory cgroups except ``/having_care_already``.::
+
+ # echo 2 > nr_filters
+ # # filter out anonymous pages
+ echo anon > 0/type
+ echo Y > 0/matching
+ # # further filter out all cgroups except one at '/having_care_already'
+ echo memcg > 1/type
+ echo /having_care_already > 1/memcg_path
+ echo N > 1/matching
+
+Note that ``anon`` and ``memcg`` filters are currently supported only when
+``paddr`` :ref:`implementation <sysfs_context>` is being used.
+
+Also, memory regions that are filtered out by ``addr`` or ``target`` filters
+are not counted as the scheme has tried to those, while regions that filtered
+out by other type filters are counted as the scheme has tried to. The
+difference is applied to :ref:`stats <damos_stats>` and
+:ref:`tried regions <sysfs_schemes_tried_regions>`.
+
+.. _sysfs_schemes_stats:
+
+schemes/<N>/stats/
+------------------
+
+DAMON counts the total number and bytes of regions that each scheme is tried to
+be applied, the two numbers for the regions that each scheme is successfully
+applied, and the total number of the quota limit exceeds. This statistics can
+be used for online analysis or tuning of the schemes.
+
+The statistics can be retrieved by reading the files under ``stats`` directory
+(``nr_tried``, ``sz_tried``, ``nr_applied``, ``sz_applied``, and
+``qt_exceeds``), respectively. The files are not updated in real time, so you
+should ask DAMON sysfs interface to update the content of the files for the
+stats by writing a special keyword, ``update_schemes_stats`` to the relevant
+``kdamonds/<N>/state`` file.
+
+.. _sysfs_schemes_tried_regions:
+
+schemes/<N>/tried_regions/
+--------------------------
+
+This directory initially has one file, ``total_bytes``.
+
+When a special keyword, ``update_schemes_tried_regions``, is written to the
+relevant ``kdamonds/<N>/state`` file, DAMON updates the ``total_bytes`` file so
+that reading it returns the total size of the scheme tried regions, and creates
+directories named integer starting from ``0`` under this directory. Each
+directory contains files exposing detailed information about each of the memory
+region that the corresponding scheme's ``action`` has tried to be applied under
+this directory, during next :ref:`apply interval <damon_design_damos>` of the
+corresponding scheme. The information includes address range, ``nr_accesses``,
+and ``age`` of the region.
+
+Writing ``update_schemes_tried_bytes`` to the relevant ``kdamonds/<N>/state``
+file will only update the ``total_bytes`` file, and will not create the
+subdirectories.
+
+The directories will be removed when another special keyword,
+``clear_schemes_tried_regions``, is written to the relevant
+``kdamonds/<N>/state`` file.
+
+The expected usage of this directory is investigations of schemes' behaviors,
+and query-like efficient data access monitoring results retrievals. For the
+latter use case, in particular, users can set the ``action`` as ``stat`` and
+set the ``access pattern`` as their interested pattern that they want to query.
+
+.. _sysfs_schemes_tried_region:
+
+tried_regions/<N>/
+------------------
+
+In each region directory, you will find four files (``start``, ``end``,
+``nr_accesses``, and ``age``). Reading the files will show the start and end
+addresses, ``nr_accesses``, and ``age`` of the region that corresponding
+DAMON-based operation scheme ``action`` has tried to be applied.
+
+Example
+~~~~~~~
+
+Below commands applies a scheme saying "If a memory region of size in [4KiB,
+8KiB] is showing accesses per aggregate interval in [0, 5] for aggregate
+interval in [10, 20], page out the region. For the paging out, use only up to
+10ms per second, and also don't page out more than 1GiB per second. Under the
+limitation, page out memory regions having longer age first. Also, check the
+free memory rate of the system every 5 seconds, start the monitoring and paging
+out when the free memory rate becomes lower than 50%, but stop it if the free
+memory rate becomes larger than 60%, or lower than 30%". ::
+
+ # cd <sysfs>/kernel/mm/damon/admin
+ # # populate directories
+ # echo 1 > kdamonds/nr_kdamonds; echo 1 > kdamonds/0/contexts/nr_contexts;
+ # echo 1 > kdamonds/0/contexts/0/schemes/nr_schemes
+ # cd kdamonds/0/contexts/0/schemes/0
+ # # set the basic access pattern and the action
+ # echo 4096 > access_pattern/sz/min
+ # echo 8192 > access_pattern/sz/max
+ # echo 0 > access_pattern/nr_accesses/min
+ # echo 5 > access_pattern/nr_accesses/max
+ # echo 10 > access_pattern/age/min
+ # echo 20 > access_pattern/age/max
+ # echo pageout > action
+ # # set quotas
+ # echo 10 > quotas/ms
+ # echo $((1024*1024*1024)) > quotas/bytes
+ # echo 1000 > quotas/reset_interval_ms
+ # # set watermark
+ # echo free_mem_rate > watermarks/metric
+ # echo 5000000 > watermarks/interval_us
+ # echo 600 > watermarks/high
+ # echo 500 > watermarks/mid
+ # echo 300 > watermarks/low
+
+Please note that it's highly recommended to use user space tools like `damo
+<https://github.com/awslabs/damo>`_ rather than manually reading and writing
+the files as above. Above is only for an example.
+
+.. _tracepoint:
+
+Tracepoints for Monitoring Results
+==================================
+
+Users can get the monitoring results via the :ref:`tried_regions
+<sysfs_schemes_tried_regions>`. The interface is useful for getting a
+snapshot, but it could be inefficient for fully recording all the monitoring
+results. For the purpose, two trace points, namely ``damon:damon_aggregated``
+and ``damon:damos_before_apply``, are provided. ``damon:damon_aggregated``
+provides the whole monitoring results, while ``damon:damos_before_apply``
+provides the monitoring results for regions that each DAMON-based Operation
+Scheme (:ref:`DAMOS <damon_design_damos>`) is gonna be applied. Hence,
+``damon:damos_before_apply`` is more useful for recording internal behavior of
+DAMOS, or DAMOS target access
+:ref:`pattern <damon_design_damos_access_pattern>` based query-like efficient
+monitoring results recording.
+
+While the monitoring is turned on, you could record the tracepoint events and
+show results using tracepoint supporting tools like ``perf``. For example::
+
+ # echo on > monitor_on
+ # perf record -e damon:damon_aggregated &
+ # sleep 5
+ # kill 9 $(pidof perf)
+ # echo off > monitor_on
+ # perf script
+ kdamond.0 46568 [027] 79357.842179: damon:damon_aggregated: target_id=0 nr_regions=11 122509119488-135708762112: 0 864
+ [...]
+
+Each line of the perf script output represents each monitoring region. The
+first five fields are as usual other tracepoint outputs. The sixth field
+(``target_id=X``) shows the ide of the monitoring target of the region. The
+seventh field (``nr_regions=X``) shows the total number of monitoring regions
+for the target. The eighth field (``X-Y:``) shows the start (``X``) and end
+(``Y``) addresses of the region in bytes. The ninth field (``X``) shows the
+``nr_accesses`` of the region (refer to
+:ref:`design <damon_design_region_based_sampling>` for more details of the
+counter). Finally the tenth field (``X``) shows the ``age`` of the region
+(refer to :ref:`design <damon_design_age_tracking>` for more details of the
+counter).
+
+If the event was ``damon:damos_beofre_apply``, the ``perf script`` output would
+be somewhat like below::
+
+ kdamond.0 47293 [000] 80801.060214: damon:damos_before_apply: ctx_idx=0 scheme_idx=0 target_idx=0 nr_regions=11 121932607488-135128711168: 0 136
+ [...]
+
+Each line of the output represents each monitoring region that each DAMON-based
+Operation Scheme was about to be applied at the traced time. The first five
+fields are as usual. It shows the index of the DAMON context (``ctx_idx=X``)
+of the scheme in the list of the contexts of the context's kdamond, the index
+of the scheme (``scheme_idx=X``) in the list of the schemes of the context, in
+addition to the output of ``damon_aggregated`` tracepoint.
+
+
+.. _debugfs_interface:
+
+debugfs Interface (DEPRECATED!)
+===============================
+
+.. note::
+
+ THIS IS DEPRECATED!
+
+ DAMON debugfs interface is deprecated, so users should move to the
+ :ref:`sysfs interface <sysfs_interface>`. If you depend on this and cannot
+ move, please report your usecase to damon@lists.linux.dev and
+ linux-mm@kvack.org.
+
+DAMON exports eight files, ``attrs``, ``target_ids``, ``init_regions``,
+``schemes``, ``monitor_on``, ``kdamond_pid``, ``mk_contexts`` and
+``rm_contexts`` under its debugfs directory, ``<debugfs>/damon/``.
+
+
+Attributes
+----------
+
+Users can get and set the ``sampling interval``, ``aggregation interval``,
+``update interval``, and min/max number of monitoring target regions by
+reading from and writing to the ``attrs`` file. To know about the monitoring
+attributes in detail, please refer to the :doc:`/mm/damon/design`. For
+example, below commands set those values to 5 ms, 100 ms, 1,000 ms, 10 and
+1000, and then check it again::
+
+ # cd <debugfs>/damon
+ # echo 5000 100000 1000000 10 1000 > attrs
+ # cat attrs
+ 5000 100000 1000000 10 1000
+
+
+Target IDs
+----------
+
+Some types of address spaces supports multiple monitoring target. For example,
+the virtual memory address spaces monitoring can have multiple processes as the
+monitoring targets. Users can set the targets by writing relevant id values of
+the targets to, and get the ids of the current targets by reading from the
+``target_ids`` file. In case of the virtual address spaces monitoring, the
+values should be pids of the monitoring target processes. For example, below
+commands set processes having pids 42 and 4242 as the monitoring targets and
+check it again::
+
+ # cd <debugfs>/damon
+ # echo 42 4242 > target_ids
+ # cat target_ids
+ 42 4242
+
+Users can also monitor the physical memory address space of the system by
+writing a special keyword, "``paddr\n``" to the file. Because physical address
+space monitoring doesn't support multiple targets, reading the file will show a
+fake value, ``42``, as below::
+
+ # cd <debugfs>/damon
+ # echo paddr > target_ids
+ # cat target_ids
+ 42
+
+Note that setting the target ids doesn't start the monitoring.
+
+
+Initial Monitoring Target Regions
+---------------------------------
+
+In case of the virtual address space monitoring, DAMON automatically sets and
+updates the monitoring target regions so that entire memory mappings of target
+processes can be covered. However, users can want to limit the monitoring
+region to specific address ranges, such as the heap, the stack, or specific
+file-mapped area. Or, some users can know the initial access pattern of their
+workloads and therefore want to set optimal initial regions for the 'adaptive
+regions adjustment'.
+
+In contrast, DAMON do not automatically sets and updates the monitoring target
+regions in case of physical memory monitoring. Therefore, users should set the
+monitoring target regions by themselves.
+
+In such cases, users can explicitly set the initial monitoring target regions
+as they want, by writing proper values to the ``init_regions`` file. The input
+should be a sequence of three integers separated by white spaces that represent
+one region in below form.::
+
+ <target idx> <start address> <end address>
+
+The ``target idx`` should be the index of the target in ``target_ids`` file,
+starting from ``0``, and the regions should be passed in address order. For
+example, below commands will set a couple of address ranges, ``1-100`` and
+``100-200`` as the initial monitoring target region of pid 42, which is the
+first one (index ``0``) in ``target_ids``, and another couple of address
+ranges, ``20-40`` and ``50-100`` as that of pid 4242, which is the second one
+(index ``1``) in ``target_ids``.::
+
+ # cd <debugfs>/damon
+ # cat target_ids
+ 42 4242
+ # echo "0 1 100 \
+ 0 100 200 \
+ 1 20 40 \
+ 1 50 100" > init_regions
+
+Note that this sets the initial monitoring target regions only. In case of
+virtual memory monitoring, DAMON will automatically updates the boundary of the
+regions after one ``update interval``. Therefore, users should set the
+``update interval`` large enough in this case, if they don't want the
+update.
+
+
+Schemes
+-------
+
+Users can get and set the DAMON-based operation :ref:`schemes
+<damon_design_damos>` by reading from and writing to ``schemes`` debugfs file.
+Reading the file also shows the statistics of each scheme. To the file, each
+of the schemes should be represented in each line in below form::
+
+ <target access pattern> <action> <quota> <watermarks>
+
+You can disable schemes by simply writing an empty string to the file.
+
+Target Access Pattern
+~~~~~~~~~~~~~~~~~~~~~
+
+The target access :ref:`pattern <damon_design_damos_access_pattern>` of the
+scheme. The ``<target access pattern>`` is constructed with three ranges in
+below form::
+
+ min-size max-size min-acc max-acc min-age max-age
+
+Specifically, bytes for the size of regions (``min-size`` and ``max-size``),
+number of monitored accesses per aggregate interval for access frequency
+(``min-acc`` and ``max-acc``), number of aggregate intervals for the age of
+regions (``min-age`` and ``max-age``) are specified. Note that the ranges are
+closed interval.
+
+Action
+~~~~~~
+
+The ``<action>`` is a predefined integer for memory management :ref:`actions
+<damon_design_damos_action>`. The supported numbers and their meanings are as
+below.
+
+ - 0: Call ``madvise()`` for the region with ``MADV_WILLNEED``. Ignored if
+ ``target`` is ``paddr``.
+ - 1: Call ``madvise()`` for the region with ``MADV_COLD``. Ignored if
+ ``target`` is ``paddr``.
+ - 2: Call ``madvise()`` for the region with ``MADV_PAGEOUT``.
+ - 3: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``. Ignored if
+ ``target`` is ``paddr``.
+ - 4: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``. Ignored if
+ ``target`` is ``paddr``.
+ - 5: Do nothing but count the statistics
+
+Quota
+~~~~~
+
+Users can set the :ref:`quotas <damon_design_damos_quotas>` of the given scheme
+via the ``<quota>`` in below form::
+
+ <ms> <sz> <reset interval> <priority weights>
+
+This makes DAMON to try to use only up to ``<ms>`` milliseconds for applying
+the action to memory regions of the ``target access pattern`` within the
+``<reset interval>`` milliseconds, and to apply the action to only up to
+``<sz>`` bytes of memory regions within the ``<reset interval>``. Setting both
+``<ms>`` and ``<sz>`` zero disables the quota limits.
+
+For the :ref:`prioritization <damon_design_damos_quotas_prioritization>`, users
+can set the weights for the three properties in ``<priority weights>`` in below
+form::
+
+ <size weight> <access frequency weight> <age weight>
+
+Watermarks
+~~~~~~~~~~
+
+Users can specify :ref:`watermarks <damon_design_damos_watermarks>` of the
+given scheme via ``<watermarks>`` in below form::
+
+ <metric> <check interval> <high mark> <middle mark> <low mark>
+
+``<metric>`` is a predefined integer for the metric to be checked. The
+supported numbers and their meanings are as below.
+
+ - 0: Ignore the watermarks
+ - 1: System's free memory rate (per thousand)
+
+The value of the metric is checked every ``<check interval>`` microseconds.
+
+If the value is higher than ``<high mark>`` or lower than ``<low mark>``, the
+scheme is deactivated. If the value is lower than ``<mid mark>``, the scheme
+is activated.
+
+.. _damos_stats:
+
+Statistics
+~~~~~~~~~~
+
+It also counts the total number and bytes of regions that each scheme is tried
+to be applied, the two numbers for the regions that each scheme is successfully
+applied, and the total number of the quota limit exceeds. This statistics can
+be used for online analysis or tuning of the schemes.
+
+The statistics can be shown by reading the ``schemes`` file. Reading the file
+will show each scheme you entered in each line, and the five numbers for the
+statistics will be added at the end of each line.
+
+Example
+~~~~~~~
+
+Below commands applies a scheme saying "If a memory region of size in [4KiB,
+8KiB] is showing accesses per aggregate interval in [0, 5] for aggregate
+interval in [10, 20], page out the region. For the paging out, use only up to
+10ms per second, and also don't page out more than 1GiB per second. Under the
+limitation, page out memory regions having longer age first. Also, check the
+free memory rate of the system every 5 seconds, start the monitoring and paging
+out when the free memory rate becomes lower than 50%, but stop it if the free
+memory rate becomes larger than 60%, or lower than 30%".::
+
+ # cd <debugfs>/damon
+ # scheme="4096 8192 0 5 10 20 2" # target access pattern and action
+ # scheme+=" 10 $((1024*1024*1024)) 1000" # quotas
+ # scheme+=" 0 0 100" # prioritization weights
+ # scheme+=" 1 5000000 600 500 300" # watermarks
+ # echo "$scheme" > schemes
+
+
+Turning On/Off
+--------------
+
+Setting the files as described above doesn't incur effect unless you explicitly
+start the monitoring. You can start, stop, and check the current status of the
+monitoring by writing to and reading from the ``monitor_on`` file. Writing
+``on`` to the file starts the monitoring of the targets with the attributes.
+Writing ``off`` to the file stops those. DAMON also stops if every target
+process is terminated. Below example commands turn on, off, and check the
+status of DAMON::
+
+ # cd <debugfs>/damon
+ # echo on > monitor_on
+ # echo off > monitor_on
+ # cat monitor_on
+ off
+
+Please note that you cannot write to the above-mentioned debugfs files while
+the monitoring is turned on. If you write to the files while DAMON is running,
+an error code such as ``-EBUSY`` will be returned.
+
+
+Monitoring Thread PID
+---------------------
+
+DAMON does requested monitoring with a kernel thread called ``kdamond``. You
+can get the pid of the thread by reading the ``kdamond_pid`` file. When the
+monitoring is turned off, reading the file returns ``none``. ::
+
+ # cd <debugfs>/damon
+ # cat monitor_on
+ off
+ # cat kdamond_pid
+ none
+ # echo on > monitor_on
+ # cat kdamond_pid
+ 18594
+
+
+Using Multiple Monitoring Threads
+---------------------------------
+
+One ``kdamond`` thread is created for each monitoring context. You can create
+and remove monitoring contexts for multiple ``kdamond`` required use case using
+the ``mk_contexts`` and ``rm_contexts`` files.
+
+Writing the name of the new context to the ``mk_contexts`` file creates a
+directory of the name on the DAMON debugfs directory. The directory will have
+DAMON debugfs files for the context. ::
+
+ # cd <debugfs>/damon
+ # ls foo
+ # ls: cannot access 'foo': No such file or directory
+ # echo foo > mk_contexts
+ # ls foo
+ # attrs init_regions kdamond_pid schemes target_ids
+
+If the context is not needed anymore, you can remove it and the corresponding
+directory by putting the name of the context to the ``rm_contexts`` file. ::
+
+ # echo foo > rm_contexts
+ # ls foo
+ # ls: cannot access 'foo': No such file or directory
+
+Note that ``mk_contexts``, ``rm_contexts``, and ``monitor_on`` files are in the
+root directory only.
diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index 1cc0bc78d10e..e4d4b4a8dc97 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -1,5 +1,3 @@
-.. _hugetlbpage:
-
=============
HugeTLB Pages
=============
@@ -60,8 +58,12 @@ HugePages_Surp
the pool above the value in ``/proc/sys/vm/nr_hugepages``. The
maximum number of surplus huge pages is controlled by
``/proc/sys/vm/nr_overcommit_hugepages``.
+ Note: When the feature of freeing unused vmemmap pages associated
+ with each hugetlb page is enabled, the number of surplus huge pages
+ may be temporarily larger than the maximum number of surplus huge
+ pages when the system is under memory pressure.
Hugepagesize
- is the default hugepage size (in Kb).
+ is the default hugepage size (in kB).
Hugetlb
is the total amount of memory (in kB), consumed by huge
pages of all sizes.
@@ -80,6 +82,10 @@ returned to the huge page pool when freed by a task. A user with root
privileges can dynamically allocate more or free some persistent huge pages
by increasing or decreasing the value of ``nr_hugepages``.
+Note: When the feature of freeing unused vmemmap pages associated with each
+hugetlb page is enabled, we can fail to free the huge pages triggered by
+the user when the system is under memory pressure. Please try again later.
+
Pages that are used as huge pages are reserved inside the kernel and cannot
be used for other purposes. Huge pages cannot be swapped out under
memory pressure.
@@ -100,6 +106,65 @@ with a huge page size selection parameter "hugepagesz=<size>". <size> must
be specified in bytes with optional scale suffix [kKmMgG]. The default huge
page size may be selected with the "default_hugepagesz=<size>" boot parameter.
+Hugetlb boot command line parameter semantics
+
+hugepagesz
+ Specify a huge page size. Used in conjunction with hugepages
+ parameter to preallocate a number of huge pages of the specified
+ size. Hence, hugepagesz and hugepages are typically specified in
+ pairs such as::
+
+ hugepagesz=2M hugepages=512
+
+ hugepagesz can only be specified once on the command line for a
+ specific huge page size. Valid huge page sizes are architecture
+ dependent.
+hugepages
+ Specify the number of huge pages to preallocate. This typically
+ follows a valid hugepagesz or default_hugepagesz parameter. However,
+ if hugepages is the first or only hugetlb command line parameter it
+ implicitly specifies the number of huge pages of default size to
+ allocate. If the number of huge pages of default size is implicitly
+ specified, it can not be overwritten by a hugepagesz,hugepages
+ parameter pair for the default size. This parameter also has a
+ node format. The node format specifies the number of huge pages
+ to allocate on specific nodes.
+
+ For example, on an architecture with 2M default huge page size::
+
+ hugepages=256 hugepagesz=2M hugepages=512
+
+ will result in 256 2M huge pages being allocated and a warning message
+ indicating that the hugepages=512 parameter is ignored. If a hugepages
+ parameter is preceded by an invalid hugepagesz parameter, it will
+ be ignored.
+
+ Node format example::
+
+ hugepagesz=2M hugepages=0:1,1:2
+
+ It will allocate 1 2M hugepage on node0 and 2 2M hugepages on node1.
+ If the node number is invalid, the parameter will be ignored.
+
+default_hugepagesz
+ Specify the default huge page size. This parameter can
+ only be specified once on the command line. default_hugepagesz can
+ optionally be followed by the hugepages parameter to preallocate a
+ specific number of huge pages of default size. The number of default
+ sized huge pages to preallocate can also be implicitly specified as
+ mentioned in the hugepages section above. Therefore, on an
+ architecture with 2M default huge page size::
+
+ hugepages=256
+ default_hugepagesz=2M hugepages=256
+ hugepages=256 default_hugepagesz=2M
+
+ will all result in 256 2M huge pages being allocated. Valid default
+ huge page size is architecture dependent.
+hugetlb_free_vmemmap
+ When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set, this enables HugeTLB
+ Vmemmap Optimization (HVO).
+
When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
indicates the current number of pre-allocated huge pages of the default size.
Thus, one can use the following command to dynamically allocate/deallocate
@@ -177,8 +242,12 @@ will exist, of the form::
hugepages-${size}kB
-Inside each of these directories, the same set of files will exist::
+Inside each of these directories, the set of files contained in ``/proc``
+will exist. In addition, two additional interfaces for demoting huge
+pages may exist::
+ demote
+ demote_size
nr_hugepages
nr_hugepages_mempolicy
nr_overcommit_hugepages
@@ -186,7 +255,29 @@ Inside each of these directories, the same set of files will exist::
resv_hugepages
surplus_hugepages
-which function as described above for the default huge page-sized case.
+The demote interfaces provide the ability to split a huge page into
+smaller huge pages. For example, the x86 architecture supports both
+1GB and 2MB huge pages sizes. A 1GB huge page can be split into 512
+2MB huge pages. Demote interfaces are not available for the smallest
+huge page size. The demote interfaces are:
+
+demote_size
+ is the size of demoted pages. When a page is demoted a corresponding
+ number of huge pages of demote_size will be created. By default,
+ demote_size is set to the next smaller huge page size. If there are
+ multiple smaller huge page sizes, demote_size can be set to any of
+ these smaller sizes. Only huge page sizes less than the current huge
+ pages size are allowed.
+
+demote
+ is used to demote a number of huge pages. A user with root privileges
+ can write to this file. It may not be possible to demote the
+ requested number of huge pages. To determine how many pages were
+ actually demoted, compare the value of nr_hugepages before and after
+ writing to the demote interface. demote is a write only interface.
+
+The interfaces which are the same as in ``/proc`` (all except demote and
+demote_size) function as described above for the default huge page-sized case.
.. _mem_policy_and_hp_alloc:
@@ -220,7 +311,7 @@ memory policy mode--bind, preferred, local or interleave--may be used. The
resulting effect on persistent huge page allocation is as follows:
#. Regardless of mempolicy mode [see
- :ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`],
+ Documentation/admin-guide/mm/numa_memory_policy.rst],
persistent huge pages will be distributed across the node or nodes
specified in the mempolicy as if "interleave" had been specified.
However, if a node in the policy does not contain sufficient contiguous
@@ -368,13 +459,13 @@ Examples
.. _map_hugetlb:
``map_hugetlb``
- see tools/testing/selftests/vm/map_hugetlb.c
+ see tools/testing/selftests/mm/map_hugetlb.c
``hugepage-shm``
- see tools/testing/selftests/vm/hugepage-shm.c
+ see tools/testing/selftests/mm/hugepage-shm.c
``hugepage-mmap``
- see tools/testing/selftests/vm/hugepage-mmap.c
+ see tools/testing/selftests/mm/hugepage-mmap.c
The `libhugetlbfs`_ library provides a wide range of userspace tools
to help with huge page usability, environment setup, and control.
diff --git a/Documentation/admin-guide/mm/idle_page_tracking.rst b/Documentation/admin-guide/mm/idle_page_tracking.rst
index df9394fb39c2..16fcf38dac56 100644
--- a/Documentation/admin-guide/mm/idle_page_tracking.rst
+++ b/Documentation/admin-guide/mm/idle_page_tracking.rst
@@ -1,5 +1,3 @@
-.. _idle_page_tracking:
-
==================
Idle Page Tracking
==================
@@ -65,14 +63,13 @@ workload one should:
are not reclaimable, he or she can filter them out using
``/proc/kpageflags``.
-The page-types tool in the tools/vm directory can be used to assist in this.
+The page-types tool in the tools/mm directory can be used to assist in this.
If the tool is run initially with the appropriate option, it will mark all the
queried pages as idle. Subsequent runs of the tool can then show which pages have
their idle flag cleared in the interim.
-See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more
-information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and
-``/proc/kpagecgroup``.
+See Documentation/admin-guide/mm/pagemap.rst for more information about
+``/proc/pid/pagemap``, ``/proc/kpageflags``, and ``/proc/kpagecgroup``.
.. _impl_details:
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index 11db46448354..1f883abf3f00 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -3,9 +3,9 @@ Memory Management
=================
Linux memory management subsystem is responsible, as the name implies,
-for managing the memory in the system. This includes implemnetation of
+for managing the memory in the system. This includes implementation of
virtual memory and demand paging, memory allocation both for kernel
-internal structures and user space programms, mapping of files into
+internal structures and user space programs, mapping of files into
processes address space and many other cool things.
Linux memory management is a complex system with many configurable
@@ -16,8 +16,7 @@ are described in Documentation/admin-guide/sysctl/vm.rst and in `man 5 proc`_.
.. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html
Linux memory management has its own jargon and if you are not yet
-familiar with it, consider reading
-:ref:`Documentation/admin-guide/mm/concepts.rst <mm_concepts>`.
+familiar with it, consider reading Documentation/admin-guide/mm/concepts.rst.
Here we document in detail how to interact with various mechanisms in
the Linux memory management.
@@ -27,13 +26,19 @@ the Linux memory management.
concepts
cma_debugfs
+ damon/index
hugetlbpage
idle_page_tracking
ksm
memory-hotplug
+ multigen_lru
+ nommu-mmap
numa_memory_policy
numaperf
pagemap
+ shrinker_debugfs
soft-dirty
+ swap_numa
transhuge
userfaultfd
+ zswap
diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst
index 874eb0c77d34..a639cac12477 100644
--- a/Documentation/admin-guide/mm/ksm.rst
+++ b/Documentation/admin-guide/mm/ksm.rst
@@ -1,5 +1,3 @@
-.. _admin_guide_ksm:
-
=======================
Kernel Samepage Merging
=======================
@@ -9,7 +7,7 @@ Overview
KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y,
added to the Linux kernel in 2.6.32. See ``mm/ksm.c`` for its implementation,
-and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/
+and http://lwn.net/Articles/306704/ and https://lwn.net/Articles/330589/
KSM was originally developed for use with KVM (where it was known as
Kernel Shared Memory), to fit more virtual machines into physical memory,
@@ -22,7 +20,7 @@ content which can be replaced by a single write-protected page (which
is automatically copied if a process later wants to update its
content). The amount of pages that KSM daemon scans in a single pass
and the time between the passes are configured using :ref:`sysfs
-intraface <ksm_sysfs>`
+interface <ksm_sysfs>`
KSM only merges anonymous (private) pages, never pagecache (file) pages.
KSM's merged pages were originally locked into kernel memory, but can now
@@ -52,7 +50,7 @@ with EAGAIN, but more probably arousing the Out-Of-Memory killer.
If KSM is not configured into the running kernel, madvise MADV_MERGEABLE
and MADV_UNMERGEABLE simply fail with EINVAL. If the running kernel was
built with CONFIG_KSM=y, those calls will normally succeed: even if the
-the KSM daemon is not currently running, MADV_MERGEABLE still registers
+KSM daemon is not currently running, MADV_MERGEABLE still registers
the range for whenever the KSM daemon is started; even if the range
cannot contain any pages which KSM could actually merge; even if
MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.
@@ -82,6 +80,9 @@ pages_to_scan
how many pages to scan before ksmd goes to sleep
e.g. ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan``.
+ The pages_to_scan value cannot be changed if ``advisor_mode`` has
+ been set to scan-time.
+
Default: 100 (chosen for demonstration purposes)
sleep_millisecs
@@ -157,8 +158,44 @@ stable_node_chains_prune_millisecs
scan. It's a noop if not a single KSM page hit the
``max_page_sharing`` yet.
+smart_scan
+ Historically KSM checked every candidate page for each scan. It did
+ not take into account historic information. When smart scan is
+ enabled, pages that have previously not been de-duplicated get
+ skipped. How often these pages are skipped depends on how often
+ de-duplication has already been tried and failed. By default this
+ optimization is enabled. The ``pages_skipped`` metric shows how
+ effective the setting is.
+
+advisor_mode
+ The ``advisor_mode`` selects the current advisor. Two modes are
+ supported: none and scan-time. The default is none. By setting
+ ``advisor_mode`` to scan-time, the scan time advisor is enabled.
+ The section about ``advisor`` explains in detail how the scan time
+ advisor works.
+
+adivsor_max_cpu
+ specifies the upper limit of the cpu percent usage of the ksmd
+ background thread. The default is 70.
+
+advisor_target_scan_time
+ specifies the target scan time in seconds to scan all the candidate
+ pages. The default value is 200 seconds.
+
+advisor_min_pages_to_scan
+ specifies the lower limit of the ``pages_to_scan`` parameter of the
+ scan time advisor. The default is 500.
+
+adivsor_max_pages_to_scan
+ specifies the upper limit of the ``pages_to_scan`` parameter of the
+ scan time advisor. The default is 30000.
+
The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:
+general_profit
+ how effective is KSM. The calculation is explained below.
+pages_scanned
+ how many pages are being scanned for ksm
pages_shared
how many shared pages are being used
pages_sharing
@@ -167,12 +204,21 @@ pages_unshared
how many pages unique but repeatedly checked for merging
pages_volatile
how many pages changing too fast to be placed in a tree
+pages_skipped
+ how many pages did the "smart" page scanning algorithm skip
full_scans
how many times all mergeable areas have been scanned
stable_node_chains
the number of KSM pages that hit the ``max_page_sharing`` limit
stable_node_dups
number of duplicated KSM pages
+ksm_zero_pages
+ how many zero pages that are still mapped into processes were mapped by
+ KSM when deduplicating.
+
+When ``use_zero_pages`` is/was enabled, the sum of ``pages_sharing`` +
+``ksm_zero_pages`` represents the actual number of pages saved by KSM.
+if ``use_zero_pages`` has never been enabled, ``ksm_zero_pages`` is 0.
A high ratio of ``pages_sharing`` to ``pages_shared`` indicates good
sharing, but a high ratio of ``pages_unshared`` to ``pages_sharing``
@@ -184,6 +230,94 @@ The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the
``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must
be increased accordingly.
+Monitoring KSM profit
+=====================
+
+KSM can save memory by merging identical pages, but also can consume
+additional memory, because it needs to generate a number of rmap_items to
+save each scanned page's brief rmap information. Some of these pages may
+be merged, but some may not be abled to be merged after being checked
+several times, which are unprofitable memory consumed.
+
+1) How to determine whether KSM save memory or consume memory in system-wide
+ range? Here is a simple approximate calculation for reference::
+
+ general_profit =~ ksm_saved_pages * sizeof(page) - (all_rmap_items) *
+ sizeof(rmap_item);
+
+ where ksm_saved_pages equals to the sum of ``pages_sharing`` +
+ ``ksm_zero_pages`` of the system, and all_rmap_items can be easily
+ obtained by summing ``pages_sharing``, ``pages_shared``, ``pages_unshared``
+ and ``pages_volatile``.
+
+2) The KSM profit inner a single process can be similarly obtained by the
+ following approximate calculation::
+
+ process_profit =~ ksm_saved_pages * sizeof(page) -
+ ksm_rmap_items * sizeof(rmap_item).
+
+ where ksm_saved_pages equals to the sum of ``ksm_merging_pages`` and
+ ``ksm_zero_pages``, both of which are shown under the directory
+ ``/proc/<pid>/ksm_stat``, and ksm_rmap_items is also shown in
+ ``/proc/<pid>/ksm_stat``. The process profit is also shown in
+ ``/proc/<pid>/ksm_stat`` as ksm_process_profit.
+
+From the perspective of application, a high ratio of ``ksm_rmap_items`` to
+``ksm_merging_pages`` means a bad madvise-applied policy, so developers or
+administrators have to rethink how to change madvise policy. Giving an example
+for reference, a page's size is usually 4K, and the rmap_item's size is
+separately 32B on 32-bit CPU architecture and 64B on 64-bit CPU architecture.
+so if the ``ksm_rmap_items/ksm_merging_pages`` ratio exceeds 64 on 64-bit CPU
+or exceeds 128 on 32-bit CPU, then the app's madvise policy should be dropped,
+because the ksm profit is approximately zero or negative.
+
+Monitoring KSM events
+=====================
+
+There are some counters in /proc/vmstat that may be used to monitor KSM events.
+KSM might help save memory, it's a tradeoff by may suffering delay on KSM COW
+or on swapping in copy. Those events could help users evaluate whether or how
+to use KSM. For example, if cow_ksm increases too fast, user may decrease the
+range of madvise(, , MADV_MERGEABLE).
+
+cow_ksm
+ is incremented every time a KSM page triggers copy on write (COW)
+ when users try to write to a KSM page, we have to make a copy.
+
+ksm_swpin_copy
+ is incremented every time a KSM page is copied when swapping in
+ note that KSM page might be copied when swapping in because do_swap_page()
+ cannot do all the locking needed to reconstitute a cross-anon_vma KSM page.
+
+Advisor
+=======
+
+The number of candidate pages for KSM is dynamic. It can be often observed
+that during the startup of an application more candidate pages need to be
+processed. Without an advisor the ``pages_to_scan`` parameter needs to be
+sized for the maximum number of candidate pages. The scan time advisor can
+changes the ``pages_to_scan`` parameter based on demand.
+
+The advisor can be enabled, so KSM can automatically adapt to changes in the
+number of candidate pages to scan. Two advisors are implemented: none and
+scan-time. With none, no advisor is enabled. The default is none.
+
+The scan time advisor changes the ``pages_to_scan`` parameter based on the
+observed scan times. The possible values for the ``pages_to_scan`` parameter is
+limited by the ``advisor_max_cpu`` parameter. In addition there is also the
+``advisor_target_scan_time`` parameter. This parameter sets the target time to
+scan all the KSM candidate pages. The parameter ``advisor_target_scan_time``
+decides how aggressive the scan time advisor scans candidate pages. Lower
+values make the scan time advisor to scan more aggresively. This is the most
+important parameter for the configuration of the scan time advisor.
+
+The initial value and the maximum value can be changed with
+``advisor_min_pages_to_scan`` and ``advisor_max_pages_to_scan``. The default
+values are sufficient for most workloads and use cases.
+
+The ``pages_to_scan`` parameter is re-calculated after a scan has been completed.
+
+
--
Izik Eidus,
Hugh Dickins, 17 Nov 2009
diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst
index 5c4432c96c4b..098f14d83e99 100644
--- a/Documentation/admin-guide/mm/memory-hotplug.rst
+++ b/Documentation/admin-guide/mm/memory-hotplug.rst
@@ -1,444 +1,695 @@
-.. _admin_guide_memory_hotplug:
+==================
+Memory Hot(Un)Plug
+==================
-==============
-Memory Hotplug
-==============
+This document describes generic Linux support for memory hot(un)plug with
+a focus on System RAM, including ZONE_MOVABLE support.
-:Created: Jul 28 2007
-:Updated: Add some details about locking internals: Aug 20 2018
+.. contents:: :local:
-This document is about memory hotplug including how-to-use and current status.
-Because Memory Hotplug is still under development, contents of this text will
-be changed often.
+Introduction
+============
-.. contents:: :local:
+Memory hot(un)plug allows for increasing and decreasing the size of physical
+memory available to a machine at runtime. In the simplest case, it consists of
+physically plugging or unplugging a DIMM at runtime, coordinated with the
+operating system.
-.. note::
+Memory hot(un)plug is used for various purposes:
- (1) x86_64's has special implementation for memory hotplug.
- This text does not describe it.
- (2) This text assumes that sysfs is mounted at ``/sys``.
+- The physical memory available to a machine can be adjusted at runtime, up- or
+ downgrading the memory capacity. This dynamic memory resizing, sometimes
+ referred to as "capacity on demand", is frequently used with virtual machines
+ and logical partitions.
+- Replacing hardware, such as DIMMs or whole NUMA nodes, without downtime. One
+ example is replacing failing memory modules.
-Introduction
-============
+- Reducing energy consumption either by physically unplugging memory modules or
+ by logically unplugging (parts of) memory modules from Linux.
+
+Further, the basic memory hot(un)plug infrastructure in Linux is nowadays also
+used to expose persistent memory, other performance-differentiated memory and
+reserved memory regions as ordinary system RAM to Linux.
-Purpose of memory hotplug
--------------------------
+Linux only supports memory hot(un)plug on selected 64 bit architectures, such as
+x86_64, arm64, ppc64 and s390x.
-Memory Hotplug allows users to increase/decrease the amount of memory.
-Generally, there are two purposes.
+Memory Hot(Un)Plug Granularity
+------------------------------
-(A) For changing the amount of memory.
- This is to allow a feature like capacity on demand.
-(B) For installing/removing DIMMs or NUMA-nodes physically.
- This is to exchange DIMMs/NUMA-nodes, reduce power consumption, etc.
+Memory hot(un)plug in Linux uses the SPARSEMEM memory model, which divides the
+physical memory address space into chunks of the same size: memory sections. The
+size of a memory section is architecture dependent. For example, x86_64 uses
+128 MiB and ppc64 uses 16 MiB.
-(A) is required by highly virtualized environments and (B) is required by
-hardware which supports memory power management.
+Memory sections are combined into chunks referred to as "memory blocks". The
+size of a memory block is architecture dependent and corresponds to the smallest
+granularity that can be hot(un)plugged. The default size of a memory block is
+the same as memory section size, unless an architecture specifies otherwise.
-Linux memory hotplug is designed for both purpose.
+All memory blocks have the same size.
-Phases of memory hotplug
+Phases of Memory Hotplug
------------------------
-There are 2 phases in Memory Hotplug:
+Memory hotplug consists of two phases:
- 1) Physical Memory Hotplug phase
- 2) Logical Memory Hotplug phase.
+(1) Adding the memory to Linux
+(2) Onlining memory blocks
-The First phase is to communicate hardware/firmware and make/erase
-environment for hotplugged memory. Basically, this phase is necessary
-for the purpose (B), but this is good phase for communication between
-highly virtualized environments too.
+In the first phase, metadata, such as the memory map ("memmap") and page tables
+for the direct mapping, is allocated and initialized, and memory blocks are
+created; the latter also creates sysfs files for managing newly created memory
+blocks.
-When memory is hotplugged, the kernel recognizes new memory, makes new memory
-management tables, and makes sysfs files for new memory's operation.
+In the second phase, added memory is exposed to the page allocator. After this
+phase, the memory is visible in memory statistics, such as free and total
+memory, of the system.
-If firmware supports notification of connection of new memory to OS,
-this phase is triggered automatically. ACPI can notify this event. If not,
-"probe" operation by system administration is used instead.
-(see :ref:`memory_hotplug_physical_mem`).
+Phases of Memory Hotunplug
+--------------------------
-Logical Memory Hotplug phase is to change memory state into
-available/unavailable for users. Amount of memory from user's view is
-changed by this phase. The kernel makes all memory in it as free pages
-when a memory range is available.
+Memory hotunplug consists of two phases:
-In this document, this phase is described as online/offline.
+(1) Offlining memory blocks
+(2) Removing the memory from Linux
-Logical Memory Hotplug phase is triggered by write of sysfs file by system
-administrator. For the hot-add case, it must be executed after Physical Hotplug
-phase by hand.
-(However, if you writes udev's hotplug scripts for memory hotplug, these
-phases can be execute in seamless way.)
+In the first phase, memory is "hidden" from the page allocator again, for
+example, by migrating busy memory to other memory locations and removing all
+relevant free pages from the page allocator After this phase, the memory is no
+longer visible in memory statistics of the system.
-Unit of Memory online/offline operation
----------------------------------------
+In the second phase, the memory blocks are removed and metadata is freed.
-Memory hotplug uses SPARSEMEM memory model which allows memory to be divided
-into chunks of the same size. These chunks are called "sections". The size of
-a memory section is architecture dependent. For example, power uses 16MiB, ia64
-uses 1GiB.
+Memory Hotplug Notifications
+============================
-Memory sections are combined into chunks referred to as "memory blocks". The
-size of a memory block is architecture dependent and represents the logical
-unit upon which memory online/offline operations are to be performed. The
-default size of a memory block is the same as memory section size unless an
-architecture specifies otherwise. (see :ref:`memory_hotplug_sysfs_files`.)
+There are various ways how Linux is notified about memory hotplug events such
+that it can start adding hotplugged memory. This description is limited to
+systems that support ACPI; mechanisms specific to other firmware interfaces or
+virtual machines are not described.
-To determine the size (in bytes) of a memory block please read this file::
+ACPI Notifications
+------------------
- /sys/devices/system/memory/block_size_bytes
+Platforms that support ACPI, such as x86_64, can support memory hotplug
+notifications via ACPI.
-Kernel Configuration
-====================
+In general, a firmware supporting memory hotplug defines a memory class object
+HID "PNP0C80". When notified about hotplug of a new memory device, the ACPI
+driver will hotplug the memory to Linux.
-To use memory hotplug feature, kernel must be compiled with following
-config options.
+If the firmware supports hotplug of NUMA nodes, it defines an object _HID
+"ACPI0004", "PNP0A05", or "PNP0A06". When notified about an hotplug event, all
+assigned memory devices are added to Linux by the ACPI driver.
-- For all memory hotplug:
- - Memory model -> Sparse Memory (``CONFIG_SPARSEMEM``)
- - Allow for memory hot-add (``CONFIG_MEMORY_HOTPLUG``)
+Similarly, Linux can be notified about requests to hotunplug a memory device or
+a NUMA node via ACPI. The ACPI driver will try offlining all relevant memory
+blocks, and, if successful, hotunplug the memory from Linux.
-- To enable memory removal, the following are also necessary:
- - Allow for memory hot remove (``CONFIG_MEMORY_HOTREMOVE``)
- - Page Migration (``CONFIG_MIGRATION``)
+Manual Probing
+--------------
-- For ACPI memory hotplug, the following are also necessary:
- - Memory hotplug (under ACPI Support menu) (``CONFIG_ACPI_HOTPLUG_MEMORY``)
- - This option can be kernel module.
+On some architectures, the firmware may not be able to notify the operating
+system about a memory hotplug event. Instead, the memory has to be manually
+probed from user space.
-- As a related configuration, if your box has a feature of NUMA-node hotplug
- via ACPI, then this option is necessary too.
+The probe interface is located at::
- - ACPI0004,PNP0A05 and PNP0A06 Container Driver (under ACPI Support menu)
- (``CONFIG_ACPI_CONTAINER``).
+ /sys/devices/system/memory/probe
- This option can be kernel module too.
+Only complete memory blocks can be probed. Individual memory blocks are probed
+by providing the physical start address of the memory block::
+ % echo addr > /sys/devices/system/memory/probe
-.. _memory_hotplug_sysfs_files:
+Which results in a memory block for the range [addr, addr + memory_block_size)
+being created.
-sysfs files for memory hotplug
-==============================
+.. note::
-All memory blocks have their device information in sysfs. Each memory block
-is described under ``/sys/devices/system/memory`` as::
+ Using the probe interface is discouraged as it is easy to crash the kernel,
+ because Linux cannot validate user input; this interface might be removed in
+ the future.
- /sys/devices/system/memory/memoryXXX
+Onlining and Offlining Memory Blocks
+====================================
-where XXX is the memory block id.
+After a memory block has been created, Linux has to be instructed to actually
+make use of that memory: the memory block has to be "online".
-For the memory block covered by the sysfs directory. It is expected that all
-memory sections in this range are present and no memory holes exist in the
-range. Currently there is no way to determine if there is a memory hole, but
-the existence of one should not affect the hotplug capabilities of the memory
-block.
+Before a memory block can be removed, Linux has to stop using any memory part of
+the memory block: the memory block has to be "offlined".
-For example, assume 1GiB memory block size. A device for a memory starting at
-0x100000000 is ``/sys/device/system/memory/memory4``::
+The Linux kernel can be configured to automatically online added memory blocks
+and drivers automatically trigger offlining of memory blocks when trying
+hotunplug of memory. Memory blocks can only be removed once offlining succeeded
+and drivers may trigger offlining of memory blocks when attempting hotunplug of
+memory.
- (0x100000000 / 1Gib = 4)
+Onlining Memory Blocks Manually
+-------------------------------
-This device covers address range [0x100000000 ... 0x140000000)
+If auto-onlining of memory blocks isn't enabled, user-space has to manually
+trigger onlining of memory blocks. Often, udev rules are used to automate this
+task in user space.
-Under each memory block, you can see 5 files:
+Onlining of a memory block can be triggered via::
-- ``/sys/devices/system/memory/memoryXXX/phys_index``
-- ``/sys/devices/system/memory/memoryXXX/phys_device``
-- ``/sys/devices/system/memory/memoryXXX/state``
-- ``/sys/devices/system/memory/memoryXXX/removable``
-- ``/sys/devices/system/memory/memoryXXX/valid_zones``
+ % echo online > /sys/devices/system/memory/memoryXXX/state
-=================== ============================================================
-``phys_index`` read-only and contains memory block id, same as XXX.
-``state`` read-write
-
- - at read: contains online/offline state of memory.
- - at write: user can specify "online_kernel",
-
- "online_movable", "online", "offline" command
- which will be performed on all sections in the block.
-``phys_device`` read-only: designed to show the name of physical memory
- device. This is not well implemented now.
-``removable`` read-only: contains an integer value indicating
- whether the memory block is removable or not
- removable. A value of 1 indicates that the memory
- block is removable and a value of 0 indicates that
- it is not removable. A memory block is removable only if
- every section in the block is removable.
-``valid_zones`` read-only: designed to show which zones this memory block
- can be onlined to.
-
- The first column shows it`s default zone.
-
- "memory6/valid_zones: Normal Movable" shows this memoryblock
- can be onlined to ZONE_NORMAL by default and to ZONE_MOVABLE
- by online_movable.
-
- "memory7/valid_zones: Movable Normal" shows this memoryblock
- can be onlined to ZONE_MOVABLE by default and to ZONE_NORMAL
- by online_kernel.
-=================== ============================================================
+Or alternatively::
+
+ % echo 1 > /sys/devices/system/memory/memoryXXX/online
+
+The kernel will select the target zone automatically, depending on the
+configured ``online_policy``.
+
+One can explicitly request to associate an offline memory block with
+ZONE_MOVABLE by::
+
+ % echo online_movable > /sys/devices/system/memory/memoryXXX/state
+
+Or one can explicitly request a kernel zone (usually ZONE_NORMAL) by::
+
+ % echo online_kernel > /sys/devices/system/memory/memoryXXX/state
+
+In any case, if onlining succeeds, the state of the memory block is changed to
+be "online". If it fails, the state of the memory block will remain unchanged
+and the above commands will fail.
+
+Onlining Memory Blocks Automatically
+------------------------------------
+
+The kernel can be configured to try auto-onlining of newly added memory blocks.
+If this feature is disabled, the memory blocks will stay offline until
+explicitly onlined from user space.
+
+The configured auto-online behavior can be observed via::
+
+ % cat /sys/devices/system/memory/auto_online_blocks
+
+Auto-onlining can be enabled by writing ``online``, ``online_kernel`` or
+``online_movable`` to that file, like::
+
+ % echo online > /sys/devices/system/memory/auto_online_blocks
+
+Similarly to manual onlining, with ``online`` the kernel will select the
+target zone automatically, depending on the configured ``online_policy``.
+
+Modifying the auto-online behavior will only affect all subsequently added
+memory blocks only.
.. note::
- These directories/files appear after physical memory hotplug phase.
+ In corner cases, auto-onlining can fail. The kernel won't retry. Note that
+ auto-onlining is not expected to fail in default configurations.
-If CONFIG_NUMA is enabled the memoryXXX/ directories can also be accessed
-via symbolic links located in the ``/sys/devices/system/node/node*`` directories.
+.. note::
-For example::
+ DLPAR on ppc64 ignores the ``offline`` setting and will still online added
+ memory blocks; if onlining fails, memory blocks are removed again.
- /sys/devices/system/node/node0/memory9 -> ../../memory/memory9
+Offlining Memory Blocks
+-----------------------
-A backlink will also be created::
+In the current implementation, Linux's memory offlining will try migrating all
+movable pages off the affected memory block. As most kernel allocations, such as
+page tables, are unmovable, page migration can fail and, therefore, inhibit
+memory offlining from succeeding.
- /sys/devices/system/memory/memory9/node0 -> ../../node/node0
+Having the memory provided by memory block managed by ZONE_MOVABLE significantly
+increases memory offlining reliability; still, memory offlining can fail in
+some corner cases.
-.. _memory_hotplug_physical_mem:
+Further, memory offlining might retry for a long time (or even forever), until
+aborted by the user.
-Physical memory hot-add phase
-=============================
+Offlining of a memory block can be triggered via::
-Hardware(Firmware) Support
---------------------------
+ % echo offline > /sys/devices/system/memory/memoryXXX/state
-On x86_64/ia64 platform, memory hotplug by ACPI is supported.
+Or alternatively::
-In general, the firmware (ACPI) which supports memory hotplug defines
-memory class object of _HID "PNP0C80". When a notify is asserted to PNP0C80,
-Linux's ACPI handler does hot-add memory to the system and calls a hotplug udev
-script. This will be done automatically.
+ % echo 0 > /sys/devices/system/memory/memoryXXX/online
-But scripts for memory hotplug are not contained in generic udev package(now).
-You may have to write it by yourself or online/offline memory by hand.
-Please see :ref:`memory_hotplug_how_to_online_memory` and
-:ref:`memory_hotplug_how_to_offline_memory`.
+If offlining succeeds, the state of the memory block is changed to be "offline".
+If it fails, the state of the memory block will remain unchanged and the above
+commands will fail, for example, via::
-If firmware supports NUMA-node hotplug, and defines an object _HID "ACPI0004",
-"PNP0A05", or "PNP0A06", notification is asserted to it, and ACPI handler
-calls hotplug code for all of objects which are defined in it.
-If memory device is found, memory hotplug code will be called.
+ bash: echo: write error: Device or resource busy
-Notify memory hot-add event by hand
------------------------------------
+or via::
-On some architectures, the firmware may not notify the kernel of a memory
-hotplug event. Therefore, the memory "probe" interface is supported to
-explicitly notify the kernel. This interface depends on
-CONFIG_ARCH_MEMORY_PROBE and can be configured on powerpc, sh, and x86
-if hotplug is supported, although for x86 this should be handled by ACPI
-notification.
+ bash: echo: write error: Invalid argument
-Probe interface is located at::
+Observing the State of Memory Blocks
+------------------------------------
- /sys/devices/system/memory/probe
+The state (online/offline/going-offline) of a memory block can be observed
+either via::
-You can tell the physical address of new memory to the kernel by::
+ % cat /sys/devices/system/memory/memoryXXX/state
- % echo start_address_of_new_memory > /sys/devices/system/memory/probe
+Or alternatively (1/0) via::
-Then, [start_address_of_new_memory, start_address_of_new_memory +
-memory_block_size] memory range is hot-added. In this case, hotplug script is
-not called (in current implementation). You'll have to online memory by
-yourself. Please see :ref:`memory_hotplug_how_to_online_memory`.
+ % cat /sys/devices/system/memory/memoryXXX/online
-Logical Memory hot-add phase
-============================
+For an online memory block, the managing zone can be observed via::
-State of memory
----------------
+ % cat /sys/devices/system/memory/memoryXXX/valid_zones
-To see (online/offline) state of a memory block, read 'state' file::
+Configuring Memory Hot(Un)Plug
+==============================
- % cat /sys/device/system/memory/memoryXXX/state
+There are various ways how system administrators can configure memory
+hot(un)plug and interact with memory blocks, especially, to online them.
+Memory Hot(Un)Plug Configuration via Sysfs
+------------------------------------------
-- If the memory block is online, you'll read "online".
-- If the memory block is offline, you'll read "offline".
+Some memory hot(un)plug properties can be configured or inspected via sysfs in::
+ /sys/devices/system/memory/
-.. _memory_hotplug_how_to_online_memory:
+The following files are currently defined:
-How to online memory
---------------------
+====================== =========================================================
+``auto_online_blocks`` read-write: set or get the default state of new memory
+ blocks; configure auto-onlining.
-When the memory is hot-added, the kernel decides whether or not to "online"
-it according to the policy which can be read from "auto_online_blocks" file::
+ The default value depends on the
+ CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel configuration
+ option.
- % cat /sys/devices/system/memory/auto_online_blocks
+ See the ``state`` property of memory blocks for details.
+``block_size_bytes`` read-only: the size in bytes of a memory block.
+``probe`` write-only: add (probe) selected memory blocks manually
+ from user space by supplying the physical start address.
-The default depends on the CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config
-option. If it is disabled the default is "offline" which means the newly added
-memory is not in a ready-to-use state and you have to "online" the newly added
-memory blocks manually. Automatic onlining can be requested by writing "online"
-to "auto_online_blocks" file::
+ Availability depends on the CONFIG_ARCH_MEMORY_PROBE
+ kernel configuration option.
+``uevent`` read-write: generic udev file for device subsystems.
+``crash_hotplug`` read-only: when changes to the system memory map
+ occur due to hot un/plug of memory, this file contains
+ '1' if the kernel updates the kdump capture kernel memory
+ map itself (via elfcorehdr), or '0' if userspace must update
+ the kdump capture kernel memory map.
- % echo online > /sys/devices/system/memory/auto_online_blocks
+ Availability depends on the CONFIG_MEMORY_HOTPLUG kernel
+ configuration option.
+====================== =========================================================
-This sets a global policy and impacts all memory blocks that will subsequently
-be hotplugged. Currently offline blocks keep their state. It is possible, under
-certain circumstances, that some memory blocks will be added but will fail to
-online. User space tools can check their "state" files
-(``/sys/devices/system/memory/memoryXXX/state``) and try to online them manually.
+.. note::
-If the automatic onlining wasn't requested, failed, or some memory block was
-offlined it is possible to change the individual block's state by writing to the
-"state" file::
+ When the CONFIG_MEMORY_FAILURE kernel configuration option is enabled, two
+ additional files ``hard_offline_page`` and ``soft_offline_page`` are available
+ to trigger hwpoisoning of pages, for example, for testing purposes. Note that
+ this functionality is not really related to memory hot(un)plug or actual
+ offlining of memory blocks.
- % echo online > /sys/devices/system/memory/memoryXXX/state
+Memory Block Configuration via Sysfs
+------------------------------------
-This onlining will not change the ZONE type of the target memory block,
-If the memory block doesn't belong to any zone an appropriate kernel zone
-(usually ZONE_NORMAL) will be used unless movable_node kernel command line
-option is specified when ZONE_MOVABLE will be used.
+Each memory block is represented as a memory block device that can be
+onlined or offlined. All memory blocks have their device information located in
+sysfs. Each present memory block is listed under
+``/sys/devices/system/memory`` as::
-You can explicitly request to associate it with ZONE_MOVABLE by::
+ /sys/devices/system/memory/memoryXXX
- % echo online_movable > /sys/devices/system/memory/memoryXXX/state
+where XXX is the memory block id; the number of digits is variable.
-.. note:: current limit: this memory block must be adjacent to ZONE_MOVABLE
+A present memory block indicates that some memory in the range is present;
+however, a memory block might span memory holes. A memory block spanning memory
+holes cannot be offlined.
-Or you can explicitly request a kernel zone (usually ZONE_NORMAL) by::
+For example, assume 1 GiB memory block size. A device for a memory starting at
+0x100000000 is ``/sys/devices/system/memory/memory4``::
- % echo online_kernel > /sys/devices/system/memory/memoryXXX/state
+ (0x100000000 / 1Gib = 4)
+
+This device covers address range [0x100000000 ... 0x140000000)
-.. note:: current limit: this memory block must be adjacent to ZONE_NORMAL
+The following files are currently defined:
-An explicit zone onlining can fail (e.g. when the range is already within
-and existing and incompatible zone already).
+=================== ============================================================
+``online`` read-write: simplified interface to trigger onlining /
+ offlining and to observe the state of a memory block.
+ When onlining, the zone is selected automatically.
+``phys_device`` read-only: legacy interface only ever used on s390x to
+ expose the covered storage increment.
+``phys_index`` read-only: the memory block id (XXX).
+``removable`` read-only: legacy interface that indicated whether a memory
+ block was likely to be offlineable or not. Nowadays, the
+ kernel return ``1`` if and only if it supports memory
+ offlining.
+``state`` read-write: advanced interface to trigger onlining /
+ offlining and to observe the state of a memory block.
+
+ When writing, ``online``, ``offline``, ``online_kernel`` and
+ ``online_movable`` are supported.
+
+ ``online_movable`` specifies onlining to ZONE_MOVABLE.
+ ``online_kernel`` specifies onlining to the default kernel
+ zone for the memory block, such as ZONE_NORMAL.
+ ``online`` let's the kernel select the zone automatically.
+
+ When reading, ``online``, ``offline`` and ``going-offline``
+ may be returned.
+``uevent`` read-write: generic uevent file for devices.
+``valid_zones`` read-only: when a block is online, shows the zone it
+ belongs to; when a block is offline, shows what zone will
+ manage it when the block will be onlined.
+
+ For online memory blocks, ``DMA``, ``DMA32``, ``Normal``,
+ ``Movable`` and ``none`` may be returned. ``none`` indicates
+ that memory provided by a memory block is managed by
+ multiple zones or spans multiple nodes; such memory blocks
+ cannot be offlined. ``Movable`` indicates ZONE_MOVABLE.
+ Other values indicate a kernel zone.
+
+ For offline memory blocks, the first column shows the
+ zone the kernel would select when onlining the memory block
+ right now without further specifying a zone.
+
+ Availability depends on the CONFIG_MEMORY_HOTREMOVE
+ kernel configuration option.
+=================== ============================================================
+
+.. note::
-After this, memory block XXX's state will be 'online' and the amount of
-available memory will be increased.
+ If the CONFIG_NUMA kernel configuration option is enabled, the memoryXXX/
+ directories can also be accessed via symbolic links located in the
+ ``/sys/devices/system/node/node*`` directories.
-This may be changed in future.
+ For example::
-Logical memory remove
-=====================
+ /sys/devices/system/node/node0/memory9 -> ../../memory/memory9
-Memory offline and ZONE_MOVABLE
--------------------------------
+ A backlink will also be created::
-Memory offlining is more complicated than memory online. Because memory offline
-has to make the whole memory block be unused, memory offline can fail if
-the memory block includes memory which cannot be freed.
+ /sys/devices/system/memory/memory9/node0 -> ../../node/node0
+
+Command Line Parameters
+-----------------------
+
+Some command line parameters affect memory hot(un)plug handling. The following
+command line parameters are relevant:
+
+======================== =======================================================
+``memhp_default_state`` configure auto-onlining by essentially setting
+ ``/sys/devices/system/memory/auto_online_blocks``.
+``movable_node`` configure automatic zone selection in the kernel when
+ using the ``contig-zones`` online policy. When
+ set, the kernel will default to ZONE_MOVABLE when
+ onlining a memory block, unless other zones can be kept
+ contiguous.
+======================== =======================================================
+
+See Documentation/admin-guide/kernel-parameters.txt for a more generic
+description of these command line parameters.
+
+Module Parameters
+------------------
+
+Instead of additional command line parameters or sysfs files, the
+``memory_hotplug`` subsystem now provides a dedicated namespace for module
+parameters. Module parameters can be set via the command line by predicating
+them with ``memory_hotplug.`` such as::
+
+ memory_hotplug.memmap_on_memory=1
+
+and they can be observed (and some even modified at runtime) via::
+
+ /sys/module/memory_hotplug/parameters/
+
+The following module parameters are currently defined:
+
+================================ ===============================================
+``memmap_on_memory`` read-write: Allocate memory for the memmap from
+ the added memory block itself. Even if enabled,
+ actual support depends on various other system
+ properties and should only be regarded as a
+ hint whether the behavior would be desired.
+
+ While allocating the memmap from the memory
+ block itself makes memory hotplug less likely
+ to fail and keeps the memmap on the same NUMA
+ node in any case, it can fragment physical
+ memory in a way that huge pages in bigger
+ granularity cannot be formed on hotplugged
+ memory.
+
+ With value "force" it could result in memory
+ wastage due to memmap size limitations. For
+ example, if the memmap for a memory block
+ requires 1 MiB, but the pageblock size is 2
+ MiB, 1 MiB of hotplugged memory will be wasted.
+ Note that there are still cases where the
+ feature cannot be enforced: for example, if the
+ memmap is smaller than a single page, or if the
+ architecture does not support the forced mode
+ in all configurations.
+
+``online_policy`` read-write: Set the basic policy used for
+ automatic zone selection when onlining memory
+ blocks without specifying a target zone.
+ ``contig-zones`` has been the kernel default
+ before this parameter was added. After an
+ online policy was configured and memory was
+ online, the policy should not be changed
+ anymore.
+
+ When set to ``contig-zones``, the kernel will
+ try keeping zones contiguous. If a memory block
+ intersects multiple zones or no zone, the
+ behavior depends on the ``movable_node`` kernel
+ command line parameter: default to ZONE_MOVABLE
+ if set, default to the applicable kernel zone
+ (usually ZONE_NORMAL) if not set.
+
+ When set to ``auto-movable``, the kernel will
+ try onlining memory blocks to ZONE_MOVABLE if
+ possible according to the configuration and
+ memory device details. With this policy, one
+ can avoid zone imbalances when eventually
+ hotplugging a lot of memory later and still
+ wanting to be able to hotunplug as much as
+ possible reliably, very desirable in
+ virtualized environments. This policy ignores
+ the ``movable_node`` kernel command line
+ parameter and isn't really applicable in
+ environments that require it (e.g., bare metal
+ with hotunpluggable nodes) where hotplugged
+ memory might be exposed via the
+ firmware-provided memory map early during boot
+ to the system instead of getting detected,
+ added and onlined later during boot (such as
+ done by virtio-mem or by some hypervisors
+ implementing emulated DIMMs). As one example, a
+ hotplugged DIMM will be onlined either
+ completely to ZONE_MOVABLE or completely to
+ ZONE_NORMAL, not a mixture.
+ As another example, as many memory blocks
+ belonging to a virtio-mem device will be
+ onlined to ZONE_MOVABLE as possible,
+ special-casing units of memory blocks that can
+ only get hotunplugged together. *This policy
+ does not protect from setups that are
+ problematic with ZONE_MOVABLE and does not
+ change the zone of memory blocks dynamically
+ after they were onlined.*
+``auto_movable_ratio`` read-write: Set the maximum MOVABLE:KERNEL
+ memory ratio in % for the ``auto-movable``
+ online policy. Whether the ratio applies only
+ for the system across all NUMA nodes or also
+ per NUMA nodes depends on the
+ ``auto_movable_numa_aware`` configuration.
+
+ All accounting is based on present memory pages
+ in the zones combined with accounting per
+ memory device. Memory dedicated to the CMA
+ allocator is accounted as MOVABLE, although
+ residing on one of the kernel zones. The
+ possible ratio depends on the actual workload.
+ The kernel default is "301" %, for example,
+ allowing for hotplugging 24 GiB to a 8 GiB VM
+ and automatically onlining all hotplugged
+ memory to ZONE_MOVABLE in many setups. The
+ additional 1% deals with some pages being not
+ present, for example, because of some firmware
+ allocations.
+
+ Note that ZONE_NORMAL memory provided by one
+ memory device does not allow for more
+ ZONE_MOVABLE memory for a different memory
+ device. As one example, onlining memory of a
+ hotplugged DIMM to ZONE_NORMAL will not allow
+ for another hotplugged DIMM to get onlined to
+ ZONE_MOVABLE automatically. In contrast, memory
+ hotplugged by a virtio-mem device that got
+ onlined to ZONE_NORMAL will allow for more
+ ZONE_MOVABLE memory within *the same*
+ virtio-mem device.
+``auto_movable_numa_aware`` read-write: Configure whether the
+ ``auto_movable_ratio`` in the ``auto-movable``
+ online policy also applies per NUMA
+ node in addition to the whole system across all
+ NUMA nodes. The kernel default is "Y".
+
+ Disabling NUMA awareness can be helpful when
+ dealing with NUMA nodes that should be
+ completely hotunpluggable, onlining the memory
+ completely to ZONE_MOVABLE automatically if
+ possible.
+
+ Parameter availability depends on CONFIG_NUMA.
+================================ ===============================================
+
+ZONE_MOVABLE
+============
-In general, memory offline can use 2 techniques.
+ZONE_MOVABLE is an important mechanism for more reliable memory offlining.
+Further, having system RAM managed by ZONE_MOVABLE instead of one of the
+kernel zones can increase the number of possible transparent huge pages and
+dynamically allocated huge pages.
-(1) reclaim and free all memory in the memory block.
-(2) migrate all pages in the memory block.
+Most kernel allocations are unmovable. Important examples include the memory
+map (usually 1/64ths of memory), page tables, and kmalloc(). Such allocations
+can only be served from the kernel zones.
-In the current implementation, Linux's memory offline uses method (2), freeing
-all pages in the memory block by page migration. But not all pages are
-migratable. Under current Linux, migratable pages are anonymous pages and
-page caches. For offlining a memory block by migration, the kernel has to
-guarantee that the memory block contains only migratable pages.
+Most user space pages, such as anonymous memory, and page cache pages are
+movable. Such allocations can be served from ZONE_MOVABLE and the kernel zones.
-Now, a boot option for making a memory block which consists of migratable pages
-is supported. By specifying "kernelcore=" or "movablecore=" boot option, you can
-create ZONE_MOVABLE...a zone which is just used for movable pages.
-(See also Documentation/admin-guide/kernel-parameters.rst)
+Only movable allocations are served from ZONE_MOVABLE, resulting in unmovable
+allocations being limited to the kernel zones. Without ZONE_MOVABLE, there is
+absolutely no guarantee whether a memory block can be offlined successfully.
-Assume the system has "TOTAL" amount of memory at boot time, this boot option
-creates ZONE_MOVABLE as following.
+Zone Imbalances
+---------------
+
+Having too much system RAM managed by ZONE_MOVABLE is called a zone imbalance,
+which can harm the system or degrade performance. As one example, the kernel
+might crash because it runs out of free memory for unmovable allocations,
+although there is still plenty of free memory left in ZONE_MOVABLE.
-1) When kernelcore=YYYY boot option is used,
- Size of memory not for movable pages (not for offline) is YYYY.
- Size of memory for movable pages (for offline) is TOTAL-YYYY.
+Usually, MOVABLE:KERNEL ratios of up to 3:1 or even 4:1 are fine. Ratios of 63:1
+are definitely impossible due to the overhead for the memory map.
-2) When movablecore=ZZZZ boot option is used,
- Size of memory not for movable pages (not for offline) is TOTAL - ZZZZ.
- Size of memory for movable pages (for offline) is ZZZZ.
+Actual safe zone ratios depend on the workload. Extreme cases, like excessive
+long-term pinning of pages, might not be able to deal with ZONE_MOVABLE at all.
.. note::
- Unfortunately, there is no information to show which memory block belongs
- to ZONE_MOVABLE. This is TBD.
+ CMA memory part of a kernel zone essentially behaves like memory in
+ ZONE_MOVABLE and similar considerations apply, especially when combining
+ CMA with ZONE_MOVABLE.
-.. _memory_hotplug_how_to_offline_memory:
+ZONE_MOVABLE Sizing Considerations
+----------------------------------
-How to offline memory
----------------------
+We usually expect that a large portion of available system RAM will actually
+be consumed by user space, either directly or indirectly via the page cache. In
+the normal case, ZONE_MOVABLE can be used when allocating such pages just fine.
-You can offline a memory block by using the same sysfs interface that was used
-in memory onlining::
+With that in mind, it makes sense that we can have a big portion of system RAM
+managed by ZONE_MOVABLE. However, there are some things to consider when using
+ZONE_MOVABLE, especially when fine-tuning zone ratios:
- % echo offline > /sys/devices/system/memory/memoryXXX/state
+- Having a lot of offline memory blocks. Even offline memory blocks consume
+ memory for metadata and page tables in the direct map; having a lot of offline
+ memory blocks is not a typical case, though.
+
+- Memory ballooning without balloon compaction is incompatible with
+ ZONE_MOVABLE. Only some implementations, such as virtio-balloon and
+ pseries CMM, fully support balloon compaction.
+
+ Further, the CONFIG_BALLOON_COMPACTION kernel configuration option might be
+ disabled. In that case, balloon inflation will only perform unmovable
+ allocations and silently create a zone imbalance, usually triggered by
+ inflation requests from the hypervisor.
+
+- Gigantic pages are unmovable, resulting in user space consuming a
+ lot of unmovable memory.
+
+- Huge pages are unmovable when an architectures does not support huge
+ page migration, resulting in a similar issue as with gigantic pages.
+
+- Page tables are unmovable. Excessive swapping, mapping extremely large
+ files or ZONE_DEVICE memory can be problematic, although only really relevant
+ in corner cases. When we manage a lot of user space memory that has been
+ swapped out or is served from a file/persistent memory/... we still need a lot
+ of page tables to manage that memory once user space accessed that memory.
+
+- In certain DAX configurations the memory map for the device memory will be
+ allocated from the kernel zones.
+
+- KASAN can have a significant memory overhead, for example, consuming 1/8th of
+ the total system memory size as (unmovable) tracking metadata.
+
+- Long-term pinning of pages. Techniques that rely on long-term pinnings
+ (especially, RDMA and vfio/mdev) are fundamentally problematic with
+ ZONE_MOVABLE, and therefore, memory offlining. Pinned pages cannot reside
+ on ZONE_MOVABLE as that would turn these pages unmovable. Therefore, they
+ have to be migrated off that zone while pinning. Pinning a page can fail
+ even if there is plenty of free memory in ZONE_MOVABLE.
+
+ In addition, using ZONE_MOVABLE might make page pinning more expensive,
+ because of the page migration overhead.
+
+By default, all the memory configured at boot time is managed by the kernel
+zones and ZONE_MOVABLE is not used.
+
+To enable ZONE_MOVABLE to include the memory present at boot and to control the
+ratio between movable and kernel zones there are two command line options:
+``kernelcore=`` and ``movablecore=``. See
+Documentation/admin-guide/kernel-parameters.rst for their description.
+
+Memory Offlining and ZONE_MOVABLE
+---------------------------------
+
+Even with ZONE_MOVABLE, there are some corner cases where offlining a memory
+block might fail:
+
+- Memory blocks with memory holes; this applies to memory blocks present during
+ boot and can apply to memory blocks hotplugged via the XEN balloon and the
+ Hyper-V balloon.
+
+- Mixed NUMA nodes and mixed zones within a single memory block prevent memory
+ offlining; this applies to memory blocks present during boot only.
+
+- Special memory blocks prevented by the system from getting offlined. Examples
+ include any memory available during boot on arm64 or memory blocks spanning
+ the crashkernel area on s390x; this usually applies to memory blocks present
+ during boot only.
+
+- Memory blocks overlapping with CMA areas cannot be offlined, this applies to
+ memory blocks present during boot only.
+
+- Concurrent activity that operates on the same physical memory area, such as
+ allocating gigantic pages, can result in temporary offlining failures.
+
+- Out of memory when dissolving huge pages, especially when HugeTLB Vmemmap
+ Optimization (HVO) is enabled.
+
+ Offlining code may be able to migrate huge page contents, but may not be able
+ to dissolve the source huge page because it fails allocating (unmovable) pages
+ for the vmemmap, because the system might not have free memory in the kernel
+ zones left.
+
+ Users that depend on memory offlining to succeed for movable zones should
+ carefully consider whether the memory savings gained from this feature are
+ worth the risk of possibly not being able to offline memory in certain
+ situations.
+
+Further, when running into out of memory situations while migrating pages, or
+when still encountering permanently unmovable pages within ZONE_MOVABLE
+(-> BUG), memory offlining will keep retrying until it eventually succeeds.
+
+When offlining is triggered from user space, the offlining context can be
+terminated by sending a signal. A timeout based offlining can easily be
+implemented via::
-If offline succeeds, the state of the memory block is changed to be "offline".
-If it fails, some error core (like -EBUSY) will be returned by the kernel.
-Even if a memory block does not belong to ZONE_MOVABLE, you can try to offline
-it. If it doesn't contain 'unmovable' memory, you'll get success.
-
-A memory block under ZONE_MOVABLE is considered to be able to be offlined
-easily. But under some busy state, it may return -EBUSY. Even if a memory
-block cannot be offlined due to -EBUSY, you can retry offlining it and may be
-able to offline it (or not). (For example, a page is referred to by some kernel
-internal call and released soon.)
-
-Consideration:
- Memory hotplug's design direction is to make the possibility of memory
- offlining higher and to guarantee unplugging memory under any situation. But
- it needs more work. Returning -EBUSY under some situation may be good because
- the user can decide to retry more or not by himself. Currently, memory
- offlining code does some amount of retry with 120 seconds timeout.
-
-Physical memory remove
-======================
-
-Need more implementation yet....
- - Notification completion of remove works by OS to firmware.
- - Guard from remove if not yet.
-
-
-Locking Internals
-=================
-
-When adding/removing memory that uses memory block devices (i.e. ordinary RAM),
-the device_hotplug_lock should be held to:
-
-- synchronize against online/offline requests (e.g. via sysfs). This way, memory
- block devices can only be accessed (.online/.state attributes) by user
- space once memory has been fully added. And when removing memory, we
- know nobody is in critical sections.
-- synchronize against CPU hotplug and similar (e.g. relevant for ACPI and PPC)
-
-Especially, there is a possible lock inversion that is avoided using
-device_hotplug_lock when adding memory and user space tries to online that
-memory faster than expected:
-
-- device_online() will first take the device_lock(), followed by
- mem_hotplug_lock
-- add_memory_resource() will first take the mem_hotplug_lock, followed by
- the device_lock() (while creating the devices, during bus_add_device()).
-
-As the device is visible to user space before taking the device_lock(), this
-can result in a lock inversion.
-
-onlining/offlining of memory should be done via device_online()/
-device_offline() - to make sure it is properly synchronized to actions
-via sysfs. Holding device_hotplug_lock is advised (to e.g. protect online_type)
-
-When adding/removing/onlining/offlining memory or adding/removing
-heterogeneous/device memory, we should always hold the mem_hotplug_lock in
-write mode to serialise memory hotplug (e.g. access to global/zone
-variables).
-
-In addition, mem_hotplug_lock (in contrast to device_hotplug_lock) in read
-mode allows for a quite efficient get_online_mems/put_online_mems
-implementation, so code accessing memory can protect from that memory
-vanishing.
-
-
-Future Work
-===========
-
- - allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like
- sysctl or new control file.
- - showing memory block and physical device relationship.
- - test and make it better memory offlining.
- - support HugeTLB page migration and offlining.
- - memmap removing at memory offline.
- - physical remove memory.
+ % timeout $TIMEOUT offline_block | failure_handling
diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
new file mode 100644
index 000000000000..33e068830497
--- /dev/null
+++ b/Documentation/admin-guide/mm/multigen_lru.rst
@@ -0,0 +1,162 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Multi-Gen LRU
+=============
+The multi-gen LRU is an alternative LRU implementation that optimizes
+page reclaim and improves performance under memory pressure. Page
+reclaim decides the kernel's caching policy and ability to overcommit
+memory. It directly impacts the kswapd CPU usage and RAM efficiency.
+
+Quick start
+===========
+Build the kernel with the following configurations.
+
+* ``CONFIG_LRU_GEN=y``
+* ``CONFIG_LRU_GEN_ENABLED=y``
+
+All set!
+
+Runtime options
+===============
+``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
+following subsections.
+
+Kill switch
+-----------
+``enabled`` accepts different values to enable or disable the
+following components. Its default value depends on
+``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
+unless some of them have unforeseen side effects. Writing to
+``enabled`` has no effect when a component is not supported by the
+hardware, and valid values will be accepted even when the main switch
+is off.
+
+====== ===============================================================
+Values Components
+====== ===============================================================
+0x0001 The main switch for the multi-gen LRU.
+0x0002 Clearing the accessed bit in leaf page table entries in large
+ batches, when MMU sets it (e.g., on x86). This behavior can
+ theoretically worsen lock contention (mmap_lock). If it is
+ disabled, the multi-gen LRU will suffer a minor performance
+ degradation for workloads that contiguously map hot pages,
+ whose accessed bits can be otherwise cleared by fewer larger
+ batches.
+0x0004 Clearing the accessed bit in non-leaf page table entries as
+ well, when MMU sets it (e.g., on x86). This behavior was not
+ verified on x86 varieties other than Intel and AMD. If it is
+ disabled, the multi-gen LRU will suffer a negligible
+ performance degradation.
+[yYnN] Apply to all the components above.
+====== ===============================================================
+
+E.g.,
+::
+
+ echo y >/sys/kernel/mm/lru_gen/enabled
+ cat /sys/kernel/mm/lru_gen/enabled
+ 0x0007
+ echo 5 >/sys/kernel/mm/lru_gen/enabled
+ cat /sys/kernel/mm/lru_gen/enabled
+ 0x0005
+
+Thrashing prevention
+--------------------
+Personal computers are more sensitive to thrashing because it can
+cause janks (lags when rendering UI) and negatively impact user
+experience. The multi-gen LRU offers thrashing prevention to the
+majority of laptop and desktop users who do not have ``oomd``.
+
+Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
+``N`` milliseconds from getting evicted. The OOM killer is triggered
+if this working set cannot be kept in memory. In other words, this
+option works as an adjustable pressure relief valve, and when open, it
+terminates applications that are hopefully not being used.
+
+Based on the average human detectable lag (~100ms), ``N=1000`` usually
+eliminates intolerable janks due to thrashing. Larger values like
+``N=3000`` make janks less noticeable at the risk of premature OOM
+kills.
+
+The default value ``0`` means disabled.
+
+Experimental features
+=====================
+``/sys/kernel/debug/lru_gen`` accepts commands described in the
+following subsections. Multiple command lines are supported, so does
+concatenation with delimiters ``,`` and ``;``.
+
+``/sys/kernel/debug/lru_gen_full`` provides additional stats for
+debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
+evicted generations in this file.
+
+Working set estimation
+----------------------
+Working set estimation measures how much memory an application needs
+in a given time interval, and it is usually done with little impact on
+the performance of the application. E.g., data centers want to
+optimize job scheduling (bin packing) to improve memory utilizations.
+When a new job comes in, the job scheduler needs to find out whether
+each server it manages can allocate a certain amount of memory for
+this new job before it can pick a candidate. To do so, the job
+scheduler needs to estimate the working sets of the existing jobs.
+
+When it is read, ``lru_gen`` returns a histogram of numbers of pages
+accessed over different time intervals for each memcg and node.
+``MAX_NR_GENS`` decides the number of bins for each histogram. The
+histograms are noncumulative.
+::
+
+ memcg memcg_id memcg_path
+ node node_id
+ min_gen_nr age_in_ms nr_anon_pages nr_file_pages
+ ...
+ max_gen_nr age_in_ms nr_anon_pages nr_file_pages
+
+Each bin contains an estimated number of pages that have been accessed
+within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages
+and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of
+the former is the largest and that of the latter is the smallest.
+
+Users can write the following command to ``lru_gen`` to create a new
+generation ``max_gen_nr+1``:
+
+ ``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]``
+
+``can_swap`` defaults to the swap setting and, if it is set to ``1``,
+it forces the scan of anon pages when swap is off, and vice versa.
+``force_scan`` defaults to ``1`` and, if it is set to ``0``, it
+employs heuristics to reduce the overhead, which is likely to reduce
+the coverage as well.
+
+A typical use case is that a job scheduler runs this command at a
+certain time interval to create new generations, and it ranks the
+servers it manages based on the sizes of their cold pages defined by
+this time interval.
+
+Proactive reclaim
+-----------------
+Proactive reclaim induces page reclaim when there is no memory
+pressure. It usually targets cold pages only. E.g., when a new job
+comes in, the job scheduler wants to proactively reclaim cold pages on
+the server it selected, to improve the chance of successfully landing
+this new job.
+
+Users can write the following command to ``lru_gen`` to evict
+generations less than or equal to ``min_gen_nr``.
+
+ ``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]``
+
+``min_gen_nr`` should be less than ``max_gen_nr-1``, since
+``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to
+the active list) and therefore cannot be evicted. ``swappiness``
+overrides the default value in ``/proc/sys/vm/swappiness``.
+``nr_to_reclaim`` limits the number of pages to evict.
+
+A typical use case is that a job scheduler runs this command before it
+tries to land a new job on a server. If it fails to materialize enough
+cold pages because of the overestimation, it retries on the next
+server according to the ranking result obtained from the working set
+estimation step. This less forceful approach limits the impacts on the
+existing jobs.
diff --git a/Documentation/admin-guide/mm/nommu-mmap.rst b/Documentation/admin-guide/mm/nommu-mmap.rst
new file mode 100644
index 000000000000..530fed08de2c
--- /dev/null
+++ b/Documentation/admin-guide/mm/nommu-mmap.rst
@@ -0,0 +1,283 @@
+=============================
+No-MMU memory mapping support
+=============================
+
+The kernel has limited support for memory mapping under no-MMU conditions, such
+as are used in uClinux environments. From the userspace point of view, memory
+mapping is made use of in conjunction with the mmap() system call, the shmat()
+call and the execve() system call. From the kernel's point of view, execve()
+mapping is actually performed by the binfmt drivers, which call back into the
+mmap() routines to do the actual work.
+
+Memory mapping behaviour also involves the way fork(), vfork(), clone() and
+ptrace() work. Under uClinux there is no fork(), and clone() must be supplied
+the CLONE_VM flag.
+
+The behaviour is similar between the MMU and no-MMU cases, but not identical;
+and it's also much more restricted in the latter case:
+
+ (#) Anonymous mapping, MAP_PRIVATE
+
+ In the MMU case: VM regions backed by arbitrary pages; copy-on-write
+ across fork.
+
+ In the no-MMU case: VM regions backed by arbitrary contiguous runs of
+ pages.
+
+ (#) Anonymous mapping, MAP_SHARED
+
+ These behave very much like private mappings, except that they're
+ shared across fork() or clone() without CLONE_VM in the MMU case. Since
+ the no-MMU case doesn't support these, behaviour is identical to
+ MAP_PRIVATE there.
+
+ (#) File, MAP_PRIVATE, PROT_READ / PROT_EXEC, !PROT_WRITE
+
+ In the MMU case: VM regions backed by pages read from file; changes to
+ the underlying file are reflected in the mapping; copied across fork.
+
+ In the no-MMU case:
+
+ - If one exists, the kernel will re-use an existing mapping to the
+ same segment of the same file if that has compatible permissions,
+ even if this was created by another process.
+
+ - If possible, the file mapping will be directly on the backing device
+ if the backing device has the NOMMU_MAP_DIRECT capability and
+ appropriate mapping protection capabilities. Ramfs, romfs, cramfs
+ and mtd might all permit this.
+
+ - If the backing device can't or won't permit direct sharing,
+ but does have the NOMMU_MAP_COPY capability, then a copy of the
+ appropriate bit of the file will be read into a contiguous bit of
+ memory and any extraneous space beyond the EOF will be cleared
+
+ - Writes to the file do not affect the mapping; writes to the mapping
+ are visible in other processes (no MMU protection), but should not
+ happen.
+
+ (#) File, MAP_PRIVATE, PROT_READ / PROT_EXEC, PROT_WRITE
+
+ In the MMU case: like the non-PROT_WRITE case, except that the pages in
+ question get copied before the write actually happens. From that point
+ on writes to the file underneath that page no longer get reflected into
+ the mapping's backing pages. The page is then backed by swap instead.
+
+ In the no-MMU case: works much like the non-PROT_WRITE case, except
+ that a copy is always taken and never shared.
+
+ (#) Regular file / blockdev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE
+
+ In the MMU case: VM regions backed by pages read from file; changes to
+ pages written back to file; writes to file reflected into pages backing
+ mapping; shared across fork.
+
+ In the no-MMU case: not supported.
+
+ (#) Memory backed regular file, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE
+
+ In the MMU case: As for ordinary regular files.
+
+ In the no-MMU case: The filesystem providing the memory-backed file
+ (such as ramfs or tmpfs) may choose to honour an open, truncate, mmap
+ sequence by providing a contiguous sequence of pages to map. In that
+ case, a shared-writable memory mapping will be possible. It will work
+ as for the MMU case. If the filesystem does not provide any such
+ support, then the mapping request will be denied.
+
+ (#) Memory backed blockdev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE
+
+ In the MMU case: As for ordinary regular files.
+
+ In the no-MMU case: As for memory backed regular files, but the
+ blockdev must be able to provide a contiguous run of pages without
+ truncate being called. The ramdisk driver could do this if it allocated
+ all its memory as a contiguous array upfront.
+
+ (#) Memory backed chardev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE
+
+ In the MMU case: As for ordinary regular files.
+
+ In the no-MMU case: The character device driver may choose to honour
+ the mmap() by providing direct access to the underlying device if it
+ provides memory or quasi-memory that can be accessed directly. Examples
+ of such are frame buffers and flash devices. If the driver does not
+ provide any such support, then the mapping request will be denied.
+
+
+Further notes on no-MMU MMAP
+============================
+
+ (#) A request for a private mapping of a file may return a buffer that is not
+ page-aligned. This is because XIP may take place, and the data may not be
+ paged aligned in the backing store.
+
+ (#) A request for an anonymous mapping will always be page aligned. If
+ possible the size of the request should be a power of two otherwise some
+ of the space may be wasted as the kernel must allocate a power-of-2
+ granule but will only discard the excess if appropriately configured as
+ this has an effect on fragmentation.
+
+ (#) The memory allocated by a request for an anonymous mapping will normally
+ be cleared by the kernel before being returned in accordance with the
+ Linux man pages (ver 2.22 or later).
+
+ In the MMU case this can be achieved with reasonable performance as
+ regions are backed by virtual pages, with the contents only being mapped
+ to cleared physical pages when a write happens on that specific page
+ (prior to which, the pages are effectively mapped to the global zero page
+ from which reads can take place). This spreads out the time it takes to
+ initialize the contents of a page - depending on the write-usage of the
+ mapping.
+
+ In the no-MMU case, however, anonymous mappings are backed by physical
+ pages, and the entire map is cleared at allocation time. This can cause
+ significant delays during a userspace malloc() as the C library does an
+ anonymous mapping and the kernel then does a memset for the entire map.
+
+ However, for memory that isn't required to be precleared - such as that
+ returned by malloc() - mmap() can take a MAP_UNINITIALIZED flag to
+ indicate to the kernel that it shouldn't bother clearing the memory before
+ returning it. Note that CONFIG_MMAP_ALLOW_UNINITIALIZED must be enabled
+ to permit this, otherwise the flag will be ignored.
+
+ uClibc uses this to speed up malloc(), and the ELF-FDPIC binfmt uses this
+ to allocate the brk and stack region.
+
+ (#) A list of all the private copy and anonymous mappings on the system is
+ visible through /proc/maps in no-MMU mode.
+
+ (#) A list of all the mappings in use by a process is visible through
+ /proc/<pid>/maps in no-MMU mode.
+
+ (#) Supplying MAP_FIXED or a requesting a particular mapping address will
+ result in an error.
+
+ (#) Files mapped privately usually have to have a read method provided by the
+ driver or filesystem so that the contents can be read into the memory
+ allocated if mmap() chooses not to map the backing device directly. An
+ error will result if they don't. This is most likely to be encountered
+ with character device files, pipes, fifos and sockets.
+
+
+Interprocess shared memory
+==========================
+
+Both SYSV IPC SHM shared memory and POSIX shared memory is supported in NOMMU
+mode. The former through the usual mechanism, the latter through files created
+on ramfs or tmpfs mounts.
+
+
+Futexes
+=======
+
+Futexes are supported in NOMMU mode if the arch supports them. An error will
+be given if an address passed to the futex system call lies outside the
+mappings made by a process or if the mapping in which the address lies does not
+support futexes (such as an I/O chardev mapping).
+
+
+No-MMU mremap
+=============
+
+The mremap() function is partially supported. It may change the size of a
+mapping, and may move it [#]_ if MREMAP_MAYMOVE is specified and if the new size
+of the mapping exceeds the size of the slab object currently occupied by the
+memory to which the mapping refers, or if a smaller slab object could be used.
+
+MREMAP_FIXED is not supported, though it is ignored if there's no change of
+address and the object does not need to be moved.
+
+Shared mappings may not be moved. Shareable mappings may not be moved either,
+even if they are not currently shared.
+
+The mremap() function must be given an exact match for base address and size of
+a previously mapped object. It may not be used to create holes in existing
+mappings, move parts of existing mappings or resize parts of mappings. It must
+act on a complete mapping.
+
+.. [#] Not currently supported.
+
+
+Providing shareable character device support
+============================================
+
+To provide shareable character device support, a driver must provide a
+file->f_op->get_unmapped_area() operation. The mmap() routines will call this
+to get a proposed address for the mapping. This may return an error if it
+doesn't wish to honour the mapping because it's too long, at a weird offset,
+under some unsupported combination of flags or whatever.
+
+The driver should also provide backing device information with capabilities set
+to indicate the permitted types of mapping on such devices. The default is
+assumed to be readable and writable, not executable, and only shareable
+directly (can't be copied).
+
+The file->f_op->mmap() operation will be called to actually inaugurate the
+mapping. It can be rejected at that point. Returning the ENOSYS error will
+cause the mapping to be copied instead if NOMMU_MAP_COPY is specified.
+
+The vm_ops->close() routine will be invoked when the last mapping on a chardev
+is removed. An existing mapping will be shared, partially or not, if possible
+without notifying the driver.
+
+It is permitted also for the file->f_op->get_unmapped_area() operation to
+return -ENOSYS. This will be taken to mean that this operation just doesn't
+want to handle it, despite the fact it's got an operation. For instance, it
+might try directing the call to a secondary driver which turns out not to
+implement it. Such is the case for the framebuffer driver which attempts to
+direct the call to the device-specific driver. Under such circumstances, the
+mapping request will be rejected if NOMMU_MAP_COPY is not specified, and a
+copy mapped otherwise.
+
+.. important::
+
+ Some types of device may present a different appearance to anyone
+ looking at them in certain modes. Flash chips can be like this; for
+ instance if they're in programming or erase mode, you might see the
+ status reflected in the mapping, instead of the data.
+
+ In such a case, care must be taken lest userspace see a shared or a
+ private mapping showing such information when the driver is busy
+ controlling the device. Remember especially: private executable
+ mappings may still be mapped directly off the device under some
+ circumstances!
+
+
+Providing shareable memory-backed file support
+==============================================
+
+Provision of shared mappings on memory backed files is similar to the provision
+of support for shared mapped character devices. The main difference is that the
+filesystem providing the service will probably allocate a contiguous collection
+of pages and permit mappings to be made on that.
+
+It is recommended that a truncate operation applied to such a file that
+increases the file size, if that file is empty, be taken as a request to gather
+enough pages to honour a mapping. This is required to support POSIX shared
+memory.
+
+Memory backed devices are indicated by the mapping's backing device info having
+the memory_backed flag set.
+
+
+Providing shareable block device support
+========================================
+
+Provision of shared mappings on block device files is exactly the same as for
+character devices. If there isn't a real device underneath, then the driver
+should allocate sufficient contiguous memory to honour any supported mapping.
+
+
+Adjusting page trimming behaviour
+=================================
+
+NOMMU mmap automatically rounds up to the nearest power-of-2 number of pages
+when performing an allocation. This can have adverse effects on memory
+fragmentation, and as such, is left configurable. The default behaviour is to
+aggressively trim allocations and discard any excess pages back in to the page
+allocator. In order to retain finer-grained control over fragmentation, this
+behaviour can either be disabled completely, or bumped up to a higher page
+watermark where trimming begins.
+
+Page trimming behaviour is configurable via the sysctl ``vm.nr_trim_pages``.
diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 8463f5538fda..eca38fa81e0f 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -1,5 +1,3 @@
-.. _numa_memory_policy:
-
==================
NUMA Memory Policy
==================
@@ -111,7 +109,7 @@ VMA Policy
* A task may install a new VMA policy on a sub-range of a
previously mmap()ed region. When this happens, Linux splits
the existing virtual memory area into 2 or 3 VMAs, each with
- it's own policy.
+ its own policy.
* By default, VMA policy applies only to pages allocated after
the policy is installed. Any pages already faulted into the
@@ -245,6 +243,13 @@ MPOL_INTERLEAVED
address range or file. During system boot up, the temporary
interleaved system default policy works in this mode.
+MPOL_PREFERRED_MANY
+ This mode specifies that the allocation should be preferably
+ satisfied from the nodemask specified in the policy. If there is
+ a memory pressure on all nodes in the nodemask, the allocation
+ can fall back to all existing numa nodes. This is effectively
+ MPOL_PREFERRED allowed for a mask rather than a single node.
+
NUMA memory policy supports the following optional mode flags:
MPOL_F_STATIC_NODES
@@ -253,10 +258,10 @@ MPOL_F_STATIC_NODES
nodes changes after the memory policy has been defined.
Without this flag, any time a mempolicy is rebound because of a
- change in the set of allowed nodes, the node (Preferred) or
- nodemask (Bind, Interleave) is remapped to the new set of
- allowed nodes. This may result in nodes being used that were
- previously undesired.
+ change in the set of allowed nodes, the preferred nodemask (Preferred
+ Many), preferred node (Preferred) or nodemask (Bind, Interleave) is
+ remapped to the new set of allowed nodes. This may result in nodes
+ being used that were previously undesired.
With this flag, if the user-specified nodes overlap with the
nodes allowed by the task's cpuset, then the memory policy is
@@ -353,7 +358,7 @@ and NUMA nodes. "Usage" here means one of the following:
2) examination of the policy to determine the policy mode and associated node
or node lists, if any, for page allocation. This is considered a "hot
path". Note that for MPOL_BIND, the "usage" extends across the entire
- allocation process, which may sleep during page reclaimation, because the
+ allocation process, which may sleep during page reclamation, because the
BIND policy nodemask is used, by reference, to filter ineligible nodes.
We can avoid taking an extra reference during the usages listed above as
@@ -364,19 +369,19 @@ follows:
2) for querying the policy, we do not need to take an extra reference on the
target task's task policy nor vma policies because we always acquire the
- task's mm's mmap_sem for read during the query. The set_mempolicy() and
- mbind() APIs [see below] always acquire the mmap_sem for write when
+ task's mm's mmap_lock for read during the query. The set_mempolicy() and
+ mbind() APIs [see below] always acquire the mmap_lock for write when
installing or replacing task or vma policies. Thus, there is no possibility
of a task or thread freeing a policy while another task or thread is
querying it.
3) Page allocation usage of task or vma policy occurs in the fault path where
- we hold them mmap_sem for read. Again, because replacing the task or vma
- policy requires that the mmap_sem be held for write, the policy can't be
+ we hold them mmap_lock for read. Again, because replacing the task or vma
+ policy requires that the mmap_lock be held for write, the policy can't be
freed out from under us while we're using it for page allocation.
4) Shared policies require special consideration. One task can replace a
- shared memory policy while another task, with a distinct mmap_sem, is
+ shared memory policy while another task, with a distinct mmap_lock, is
querying or allocating a page based on the policy. To resolve this
potential race, the shared policy infrastructure adds an extra reference
to the shared policy during lookup while holding a spin lock on the shared
@@ -401,7 +406,7 @@ follows:
Memory Policy APIs
==================
-Linux supports 3 system calls for controlling memory policy. These APIS
+Linux supports 4 system calls for controlling memory policy. These APIS
always affect only the calling task, the calling task's address space, or
some shared object mapped into the calling task's address space.
@@ -453,6 +458,20 @@ requested via the 'flags' argument.
See the mbind(2) man page for more details.
+Set home node for a Range of Task's Address Spacec::
+
+ long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
+ unsigned long home_node,
+ unsigned long flags);
+
+sys_set_mempolicy_home_node set the home node for a VMA policy present in the
+task's address range. The system call updates the home node only for the existing
+mempolicy range. Other address ranges are ignored. A home node is the NUMA node
+closest to which page allocation will come from. Specifying the home node override
+the default allocation policy to allocate memory close to the local node for an
+executing CPU.
+
+
Memory Policy Command Line Interface
====================================
diff --git a/Documentation/admin-guide/mm/numaperf.rst b/Documentation/admin-guide/mm/numaperf.rst
index a80c3c37226e..90a12b6a8bfc 100644
--- a/Documentation/admin-guide/mm/numaperf.rst
+++ b/Documentation/admin-guide/mm/numaperf.rst
@@ -1,6 +1,7 @@
-.. _numaperf:
+=======================
+NUMA Memory Performance
+=======================
-=============
NUMA Locality
=============
@@ -56,7 +57,11 @@ nodes' access characteristics share the same performance relative to other
linked initiator nodes. Each target within an initiator's access class,
though, do not necessarily perform the same as each other.
-================
+The access class "1" is used to allow differentiation between initiators
+that are CPUs and hence suitable for generic task scheduling, and
+IO initiators such as GPUs and NICs. Unlike access class 0, only
+nodes containing CPUs are considered.
+
NUMA Performance
================
@@ -69,7 +74,7 @@ memory node's access class 0 initiators as follows::
/sys/devices/system/node/nodeY/access0/initiators/
These attributes apply only when accessed from nodes that have the
-are linked under the this access's inititiators.
+are linked under the this access's initiators.
The performance characteristics the kernel provides for the local initiators
are exported are as follows::
@@ -88,7 +93,9 @@ The latency attributes are provided in nanoseconds.
The values reported here correspond to the rated latency and bandwidth
for the platform.
-==========
+Access class 1 takes the same form but only includes values for CPU to
+memory activity.
+
NUMA Cache
==========
@@ -129,7 +136,7 @@ will create the following directory::
/sys/devices/system/node/nodeX/memory_side_cache/
-If that directory is not present, the system either does not not provide
+If that directory is not present, the system either does not provide
a memory-side cache, or that information is not accessible to the kernel.
The attributes for each level of cache is provided under its cache
@@ -143,7 +150,7 @@ Each cache level's directory provides its attributes. For example, the
following shows a single cache level and the attributes available for
software to query::
- # tree sys/devices/system/node/node0/memory_side_cache/
+ # tree /sys/devices/system/node/node0/memory_side_cache/
/sys/devices/system/node/node0/memory_side_cache/
|-- index1
| |-- indexing
@@ -162,7 +169,6 @@ The "size" is the number of bytes provided by this cache level.
The "write_policy" will be 0 for write-back, and non-zero for
write-through caching.
-========
See Also
========
diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index 340a5aee9b80..f5f065c67615 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -1,5 +1,3 @@
-.. _pagemap:
-
=============================
Examining Process Page Tables
=============================
@@ -19,9 +17,11 @@ There are four components to pagemap:
* Bits 0-4 swap type if swapped
* Bits 5-54 swap offset if swapped
* Bit 55 pte is soft-dirty (see
- :ref:`Documentation/admin-guide/mm/soft-dirty.rst <soft_dirty>`)
+ Documentation/admin-guide/mm/soft-dirty.rst)
* Bit 56 page exclusively mapped (since 4.2)
- * Bits 57-60 zero
+ * Bit 57 pte is uffd-wp write-protected (since 5.13) (see
+ Documentation/admin-guide/mm/userfaultfd.rst)
+ * Bits 58-60 zero
* Bit 61 page is file-page or shared-anon (since 3.5)
* Bit 62 page swapped
* Bit 63 page present
@@ -44,7 +44,7 @@ There are four components to pagemap:
* ``/proc/kpagecount``. This file contains a 64-bit count of the number of
times each page is mapped, indexed by PFN.
-The page-types tool in the tools/vm directory can be used to query the
+The page-types tool in the tools/mm directory can be used to query the
number of times a page is mapped.
* ``/proc/kpageflags``. This file contains a 64-bit set of flags for each
@@ -88,13 +88,14 @@ Short descriptions to the page flags
====================================
0 - LOCKED
- page is being locked for exclusive access, e.g. by undergoing read/write IO
+ The page is being locked for exclusive access, e.g. by undergoing read/write
+ IO.
7 - SLAB
- page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
- When compound page is used, SLUB/SLQB will only set this flag on the head
- page; SLOB will not flag it at all.
+ The page is managed by the SLAB/SLUB kernel memory allocator.
+ When compound page is used, either will only set this flag on the head
+ page.
10 - BUDDY
- a free memory block managed by the buddy system allocator
+ A free memory block managed by the buddy system allocator.
The buddy system organizes free memory in blocks of various orders.
An order N block has 2^N physically contiguous pages, with the BUDDY flag
set for and _only_ for the first page.
@@ -102,75 +103,74 @@ Short descriptions to the page flags
A compound page with order N consists of 2^N physically contiguous pages.
A compound page with order 2 takes the form of "HTTT", where H donates its
head page and T donates its tail page(s). The major consumers of compound
- pages are hugeTLB pages
- (:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`),
+ pages are hugeTLB pages (Documentation/admin-guide/mm/hugetlbpage.rst),
the SLUB etc. memory allocators and various device drivers.
However in this interface, only huge/giga pages are made visible
to end users.
16 - COMPOUND_TAIL
A compound page tail (see description above).
17 - HUGE
- this is an integral part of a HugeTLB page
+ This is an integral part of a HugeTLB page.
19 - HWPOISON
- hardware detected memory corruption on this page: don't touch the data!
+ Hardware detected memory corruption on this page: don't touch the data!
20 - NOPAGE
- no page frame exists at the requested address
+ No page frame exists at the requested address.
21 - KSM
- identical memory pages dynamically shared between one or more processes
+ Identical memory pages dynamically shared between one or more processes.
22 - THP
- contiguous pages which construct transparent hugepages
+ Contiguous pages which construct transparent hugepages.
23 - OFFLINE
- page is logically offline
+ The page is logically offline.
24 - ZERO_PAGE
- zero page for pfn_zero or huge_zero page
+ Zero page for pfn_zero or huge_zero page.
25 - IDLE
- page has not been accessed since it was marked idle (see
- :ref:`Documentation/admin-guide/mm/idle_page_tracking.rst <idle_page_tracking>`).
+ The page has not been accessed since it was marked idle (see
+ Documentation/admin-guide/mm/idle_page_tracking.rst).
Note that this flag may be stale in case the page was accessed via
a PTE. To make sure the flag is up-to-date one has to read
``/sys/kernel/mm/page_idle/bitmap`` first.
26 - PGTABLE
- page is in use as a page table
+ The page is in use as a page table.
IO related page flags
---------------------
1 - ERROR
- IO error occurred
+ IO error occurred.
3 - UPTODATE
- page has up-to-date data
+ The page has up-to-date data.
ie. for file backed page: (in-memory data revision >= on-disk one)
4 - DIRTY
- page has been written to, hence contains new data
+ The page has been written to, hence contains new data.
i.e. for file backed page: (in-memory data revision > on-disk one)
8 - WRITEBACK
- page is being synced to disk
+ The page is being synced to disk.
LRU related page flags
----------------------
5 - LRU
- page is in one of the LRU lists
+ The page is in one of the LRU lists.
6 - ACTIVE
- page is in the active LRU list
+ The page is in the active LRU list.
18 - UNEVICTABLE
- page is in the unevictable (non-)LRU list It is somehow pinned and
+ The page is in the unevictable (non-)LRU list It is somehow pinned and
not a candidate for LRU page reclaims, e.g. ramfs pages,
- shmctl(SHM_LOCK) and mlock() memory segments
+ shmctl(SHM_LOCK) and mlock() memory segments.
2 - REFERENCED
- page has been referenced since last LRU list enqueue/requeue
+ The page has been referenced since last LRU list enqueue/requeue.
9 - RECLAIM
- page will be reclaimed soon after its pageout IO completed
+ The page will be reclaimed soon after its pageout IO completed.
11 - MMAP
- a memory mapped page
+ A memory mapped page.
12 - ANON
- a memory mapped page that is not part of a file
+ A memory mapped page that is not part of a file.
13 - SWAPCACHE
- page is mapped to swap space, i.e. has an associated swap entry
+ The page is mapped to swap space, i.e. has an associated swap entry.
14 - SWAPBACKED
- page is backed by swap/RAM
+ The page is backed by swap/RAM.
-The page-types tool in the tools/vm directory can be used to query the
+The page-types tool in the tools/mm directory can be used to query the
above flags.
Using pagemap to do something useful
@@ -194,6 +194,28 @@ you can go through every map in the process, find the PFNs, look those up
in kpagecount, and tally up the number of pages that are only referenced
once.
+Exceptions for Shared Memory
+============================
+
+Page table entries for shared pages are cleared when the pages are zapped or
+swapped out. This makes swapped out pages indistinguishable from never-allocated
+ones.
+
+In kernel space, the swap location can still be retrieved from the page cache.
+However, values stored only on the normal PTE get lost irretrievably when the
+page is swapped out (i.e. SOFT_DIRTY).
+
+In user space, whether the page is present, swapped or none can be deduced with
+the help of lseek and/or mincore system calls.
+
+lseek() can differentiate between accessed pages (present or swapped out) and
+holes (none/non-allocated) by specifying the SEEK_DATA flag on the file where
+the pages are backed. For anonymous shared pages, the file can be found in
+``/proc/pid/map_files/``.
+
+mincore() can differentiate between pages in memory (present, including swap
+cache) and out of memory (swapped out or none/non-allocated).
+
Other notes
===========
@@ -205,3 +227,93 @@ Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
always 12 at most architectures). Since Linux 3.11 their meaning changes
after first clear of soft-dirty bits. Since Linux 4.2 they are used for
flags unconditionally.
+
+Pagemap Scan IOCTL
+==================
+
+The ``PAGEMAP_SCAN`` IOCTL on the pagemap file can be used to get or optionally
+clear the info about page table entries. The following operations are supported
+in this IOCTL:
+
+- Scan the address range and get the memory ranges matching the provided criteria.
+ This is performed when the output buffer is specified.
+- Write-protect the pages. The ``PM_SCAN_WP_MATCHING`` is used to write-protect
+ the pages of interest. The ``PM_SCAN_CHECK_WPASYNC`` aborts the operation if
+ non-Async Write Protected pages are found. The ``PM_SCAN_WP_MATCHING`` can be
+ used with or without ``PM_SCAN_CHECK_WPASYNC``.
+- Both of those operations can be combined into one atomic operation where we can
+ get and write protect the pages as well.
+
+Following flags about pages are currently supported:
+
+- ``PAGE_IS_WPALLOWED`` - Page has async-write-protection enabled
+- ``PAGE_IS_WRITTEN`` - Page has been written to from the time it was write protected
+- ``PAGE_IS_FILE`` - Page is file backed
+- ``PAGE_IS_PRESENT`` - Page is present in the memory
+- ``PAGE_IS_SWAPPED`` - Page is in swapped
+- ``PAGE_IS_PFNZERO`` - Page has zero PFN
+- ``PAGE_IS_HUGE`` - Page is THP or Hugetlb backed
+- ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty
+
+The ``struct pm_scan_arg`` is used as the argument of the IOCTL.
+
+ 1. The size of the ``struct pm_scan_arg`` must be specified in the ``size``
+ field. This field will be helpful in recognizing the structure if extensions
+ are done later.
+ 2. The flags can be specified in the ``flags`` field. The ``PM_SCAN_WP_MATCHING``
+ and ``PM_SCAN_CHECK_WPASYNC`` are the only added flags at this time. The get
+ operation is optionally performed depending upon if the output buffer is
+ provided or not.
+ 3. The range is specified through ``start`` and ``end``.
+ 4. The walk can abort before visiting the complete range such as the user buffer
+ can get full etc. The walk ending address is specified in``end_walk``.
+ 5. The output buffer of ``struct page_region`` array and size is specified in
+ ``vec`` and ``vec_len``.
+ 6. The optional maximum requested pages are specified in the ``max_pages``.
+ 7. The masks are specified in ``category_mask``, ``category_anyof_mask``,
+ ``category_inverted`` and ``return_mask``.
+
+Find pages which have been written and WP them as well::
+
+ struct pm_scan_arg arg = {
+ .size = sizeof(arg),
+ .flags = PM_SCAN_CHECK_WPASYNC | PM_SCAN_CHECK_WPASYNC,
+ ..
+ .category_mask = PAGE_IS_WRITTEN,
+ .return_mask = PAGE_IS_WRITTEN,
+ };
+
+Find pages which have been written, are file backed, not swapped and either
+present or huge::
+
+ struct pm_scan_arg arg = {
+ .size = sizeof(arg),
+ .flags = 0,
+ ..
+ .category_mask = PAGE_IS_WRITTEN | PAGE_IS_SWAPPED,
+ .category_inverted = PAGE_IS_SWAPPED,
+ .category_anyof_mask = PAGE_IS_PRESENT | PAGE_IS_HUGE,
+ .return_mask = PAGE_IS_WRITTEN | PAGE_IS_SWAPPED |
+ PAGE_IS_PRESENT | PAGE_IS_HUGE,
+ };
+
+The ``PAGE_IS_WRITTEN`` flag can be considered as a better-performing alternative
+of soft-dirty flag. It doesn't get affected by VMA merging of the kernel and hence
+the user can find the true soft-dirty pages in case of normal pages. (There may
+still be extra dirty pages reported for THP or Hugetlb pages.)
+
+"PAGE_IS_WRITTEN" category is used with uffd write protect-enabled ranges to
+implement memory dirty tracking in userspace:
+
+ 1. The userfaultfd file descriptor is created with ``userfaultfd`` syscall.
+ 2. The ``UFFD_FEATURE_WP_UNPOPULATED`` and ``UFFD_FEATURE_WP_ASYNC`` features
+ are set by ``UFFDIO_API`` IOCTL.
+ 3. The memory range is registered with ``UFFDIO_REGISTER_MODE_WP`` mode
+ through ``UFFDIO_REGISTER`` IOCTL.
+ 4. Then any part of the registered memory or the whole memory region must
+ be write protected using ``PAGEMAP_SCAN`` IOCTL with flag ``PM_SCAN_WP_MATCHING``
+ or the ``UFFDIO_WRITEPROTECT`` IOCTL can be used. Both of these perform the
+ same operation. The former is better in terms of performance.
+ 5. Now the ``PAGEMAP_SCAN`` IOCTL can be used to either just find pages which
+ have been written to since they were last marked and/or optionally write protect
+ the pages as well.
diff --git a/Documentation/admin-guide/mm/shrinker_debugfs.rst b/Documentation/admin-guide/mm/shrinker_debugfs.rst
new file mode 100644
index 000000000000..c582033bd113
--- /dev/null
+++ b/Documentation/admin-guide/mm/shrinker_debugfs.rst
@@ -0,0 +1,133 @@
+==========================
+Shrinker Debugfs Interface
+==========================
+
+Shrinker debugfs interface provides a visibility into the kernel memory
+shrinkers subsystem and allows to get information about individual shrinkers
+and interact with them.
+
+For each shrinker registered in the system a directory in **<debugfs>/shrinker/**
+is created. The directory's name is composed from the shrinker's name and an
+unique id: e.g. *kfree_rcu-0* or *sb-xfs:vda1-36*.
+
+Each shrinker directory contains **count** and **scan** files, which allow to
+trigger *count_objects()* and *scan_objects()* callbacks for each memcg and
+numa node (if applicable).
+
+Usage:
+------
+
+1. *List registered shrinkers*
+
+ ::
+
+ $ cd /sys/kernel/debug/shrinker/
+ $ ls
+ dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42
+ mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43
+ mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44
+ rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49
+ sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13
+ sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36
+ sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19
+ sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10
+ sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9
+ sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37
+ sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38
+ sb-dax-11 sb-proc-45 sb-tmpfs-35
+ sb-debugfs-7 sb-proc-46 sb-tmpfs-40
+
+2. *Get information about a specific shrinker*
+
+ ::
+
+ $ cd sb-btrfs\:vda2-24/
+ $ ls
+ count scan
+
+3. *Count objects*
+
+ Each line in the output has the following format::
+
+ <cgroup inode id> <nr of objects on node 0> <nr of objects on node 1> ...
+ <cgroup inode id> <nr of objects on node 0> <nr of objects on node 1> ...
+ ...
+
+ If there are no objects on all numa nodes, a line is omitted. If there
+ are no objects at all, the output might be empty.
+
+ If the shrinker is not memcg-aware or CONFIG_MEMCG is off, 0 is printed
+ as cgroup inode id. If the shrinker is not numa-aware, 0's are printed
+ for all nodes except the first one.
+ ::
+
+ $ cat count
+ 1 224 2
+ 21 98 0
+ 55 818 10
+ 2367 2 0
+ 2401 30 0
+ 225 13 0
+ 599 35 0
+ 939 124 0
+ 1041 3 0
+ 1075 1 0
+ 1109 1 0
+ 1279 60 0
+ 1313 7 0
+ 1347 39 0
+ 1381 3 0
+ 1449 14 0
+ 1483 63 0
+ 1517 53 0
+ 1551 6 0
+ 1585 1 0
+ 1619 6 0
+ 1653 40 0
+ 1687 11 0
+ 1721 8 0
+ 1755 4 0
+ 1789 52 0
+ 1823 888 0
+ 1857 1 0
+ 1925 2 0
+ 1959 32 0
+ 2027 22 0
+ 2061 9 0
+ 2469 799 0
+ 2537 861 0
+ 2639 1 0
+ 2707 70 0
+ 2775 4 0
+ 2877 84 0
+ 293 1 0
+ 735 8 0
+
+4. *Scan objects*
+
+ The expected input format::
+
+ <cgroup inode id> <numa id> <number of objects to scan>
+
+ For a non-memcg-aware shrinker or on a system with no memory
+ cgrups **0** should be passed as cgroup id.
+ ::
+
+ $ cd /sys/kernel/debug/shrinker/
+ $ cd sb-btrfs\:vda2-24/
+
+ $ cat count | head -n 5
+ 1 212 0
+ 21 97 0
+ 55 802 5
+ 2367 2 0
+ 225 13 0
+
+ $ echo "55 0 200" > scan
+
+ $ cat count | head -n 5
+ 1 212 0
+ 21 96 0
+ 55 752 5
+ 2367 2 0
+ 225 13 0
diff --git a/Documentation/admin-guide/mm/soft-dirty.rst b/Documentation/admin-guide/mm/soft-dirty.rst
index cb0cfd6672fa..aeea936caa44 100644
--- a/Documentation/admin-guide/mm/soft-dirty.rst
+++ b/Documentation/admin-guide/mm/soft-dirty.rst
@@ -1,5 +1,3 @@
-.. _soft_dirty:
-
===============
Soft-Dirty PTEs
===============
diff --git a/Documentation/admin-guide/mm/swap_numa.rst b/Documentation/admin-guide/mm/swap_numa.rst
new file mode 100644
index 000000000000..2e630627bcee
--- /dev/null
+++ b/Documentation/admin-guide/mm/swap_numa.rst
@@ -0,0 +1,78 @@
+===========================================
+Automatically bind swap device to numa node
+===========================================
+
+If the system has more than one swap device and swap device has the node
+information, we can make use of this information to decide which swap
+device to use in get_swap_pages() to get better performance.
+
+
+How to use this feature
+=======================
+
+Swap device has priority and that decides the order of it to be used. To make
+use of automatically binding, there is no need to manipulate priority settings
+for swap devices. e.g. on a 2 node machine, assume 2 swap devices swapA and
+swapB, with swapA attached to node 0 and swapB attached to node 1, are going
+to be swapped on. Simply swapping them on by doing::
+
+ # swapon /dev/swapA
+ # swapon /dev/swapB
+
+Then node 0 will use the two swap devices in the order of swapA then swapB and
+node 1 will use the two swap devices in the order of swapB then swapA. Note
+that the order of them being swapped on doesn't matter.
+
+A more complex example on a 4 node machine. Assume 6 swap devices are going to
+be swapped on: swapA and swapB are attached to node 0, swapC is attached to
+node 1, swapD and swapE are attached to node 2 and swapF is attached to node3.
+The way to swap them on is the same as above::
+
+ # swapon /dev/swapA
+ # swapon /dev/swapB
+ # swapon /dev/swapC
+ # swapon /dev/swapD
+ # swapon /dev/swapE
+ # swapon /dev/swapF
+
+Then node 0 will use them in the order of::
+
+ swapA/swapB -> swapC -> swapD -> swapE -> swapF
+
+swapA and swapB will be used in a round robin mode before any other swap device.
+
+node 1 will use them in the order of::
+
+ swapC -> swapA -> swapB -> swapD -> swapE -> swapF
+
+node 2 will use them in the order of::
+
+ swapD/swapE -> swapA -> swapB -> swapC -> swapF
+
+Similaly, swapD and swapE will be used in a round robin mode before any
+other swap devices.
+
+node 3 will use them in the order of::
+
+ swapF -> swapA -> swapB -> swapC -> swapD -> swapE
+
+
+Implementation details
+======================
+
+The current code uses a priority based list, swap_avail_list, to decide
+which swap device to use and if multiple swap devices share the same
+priority, they are used round robin. This change here replaces the single
+global swap_avail_list with a per-numa-node list, i.e. for each numa node,
+it sees its own priority based list of available swap devices. Swap
+device's priority can be promoted on its matching node's swap_avail_list.
+
+The current swap device's priority is set as: user can set a >=0 value,
+or the system will pick one starting from -1 then downwards. The priority
+value in the swap_avail_list is the negated value of the swap device's
+due to plist being sorted from low to high. The new policy doesn't change
+the semantics for priority >=0 cases, the previous starting from -1 then
+downwards now becomes starting from -2 then downwards and -1 is reserved
+as the promoted value. So if multiple swap devices are attached to the same
+node, they will all be promoted to priority -1 on that node's plist and will
+be used round robin before any other swap devices.
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 2f31de8f7c74..04eb45a2f940 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -1,5 +1,3 @@
-.. _admin_guide_transhuge:
-
============================
Transparent Hugepage Support
============================
@@ -47,10 +45,25 @@ components:
the two is using hugepages just because of the fact the TLB miss is
going to run faster.
+Modern kernels support "multi-size THP" (mTHP), which introduces the
+ability to allocate memory in blocks that are bigger than a base page
+but smaller than traditional PMD-size (as described above), in
+increments of a power-of-2 number of pages. mTHP can back anonymous
+memory (for example 16K, 32K, 64K, etc). These THPs continue to be
+PTE-mapped, but in many cases can still provide similar benefits to
+those outlined above: Page faults are significantly reduced (by a
+factor of e.g. 4, 8, 16, etc), but latency spikes are much less
+prominent because the size of each page isn't as huge as the PMD-sized
+variant and there is less memory to clear in each page fault. Some
+architectures also employ TLB compression mechanisms to squeeze more
+entries in when a set of PTEs are virtually and physically contiguous
+and approporiately aligned. In this case, TLB misses will occur less
+often.
+
THP can be enabled system wide or restricted to certain tasks or even
memory ranges inside task's address space. Unless THP is completely
disabled, there is ``khugepaged`` daemon that scans memory and
-collapses sequences of basic pages into huge pages.
+collapses sequences of basic pages into PMD-sized huge pages.
The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
interface and using madvise(2) and prctl(2) system calls.
@@ -97,12 +110,40 @@ Global THP controls
Transparent Hugepage Support for anonymous memory can be entirely disabled
(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
regions (to avoid the risk of consuming more memory resources) or enabled
-system wide. This can be achieved with one of::
+system wide. This can be achieved per-supported-THP-size with one of::
+
+ echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
+ echo madvise >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
+ echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
+
+where <size> is the hugepage size being addressed, the available sizes
+for which vary by system.
+
+For example::
+
+ echo always >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
+
+Alternatively it is possible to specify that a given hugepage size
+will inherit the top-level "enabled" value::
+
+ echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
+
+For example::
+
+ echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
+
+The top-level setting (for use with "inherit") can be set by issuing
+one of the following commands::
echo always >/sys/kernel/mm/transparent_hugepage/enabled
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
echo never >/sys/kernel/mm/transparent_hugepage/enabled
+By default, PMD-sized hugepages have enabled="inherit" and all other
+hugepage sizes have enabled="never". If enabling multiple hugepage
+sizes, the kernel will select the most appropriate enabled size for a
+given allocation.
+
It's also possible to limit defrag efforts in the VM to generate
anonymous hugepages in case they're not immediately free to madvise
regions or to never try to defrag memory and simply fallback to regular
@@ -148,25 +189,34 @@ madvise
never
should be self-explanatory.
-By default kernel tries to use huge zero page on read page fault to
-anonymous mapping. It's possible to disable huge zero page by writing 0
-or enable it back by writing 1::
+By default kernel tries to use huge, PMD-mappable zero page on read
+page fault to anonymous mapping. It's possible to disable huge zero
+page by writing 0 or enable it back by writing 1::
echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
-Some userspace (such as a test program, or an optimized memory allocation
-library) may want to know the size (in bytes) of a transparent hugepage::
+Some userspace (such as a test program, or an optimized memory
+allocation library) may want to know the size (in bytes) of a
+PMD-mappable transparent hugepage::
cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
-khugepaged will be automatically started when
-transparent_hugepage/enabled is set to "always" or "madvise, and it'll
-be automatically shutdown if it's set to "never".
+khugepaged will be automatically started when one or more hugepage
+sizes are enabled (either by directly setting "always" or "madvise",
+or by setting "inherit" while the top-level enabled is set to "always"
+or "madvise"), and it'll be automatically shutdown when the last
+hugepage size is disabled (either by directly setting "never", or by
+setting "inherit" while the top-level enabled is set to "never").
Khugepaged controls
-------------------
+.. note::
+ khugepaged currently only searches for opportunities to collapse to
+ PMD-sized THP and no attempt is made to collapse to other THP
+ sizes.
+
khugepaged runs usually at low frequency so while one may not want to
invoke defrag algorithms synchronously during the page faults, it
should be worth invoking defrag at least in khugepaged. However it's
@@ -191,7 +241,14 @@ allocation failure to throttle the next allocation attempt::
/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
-The khugepaged progress can be seen in the number of pages collapsed::
+The khugepaged progress can be seen in the number of pages collapsed (note
+that this counter may not be an exact count of the number of pages
+collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping
+being replaced by a PMD mapping, or (2) All 4K physical pages replaced by
+one 2M hugepage. Each may happen independently, or together, depending on
+the type of memory and the failures that occur. As such, this value should
+be interpreted roughly as a sign of progress, and counters in /proc/vmstat
+consulted for more accurate accounting)::
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
@@ -220,6 +277,13 @@ memory. A lower value can prevent THPs from being
collapsed, resulting fewer pages being collapsed into
THPs, and lower memory access performance.
+``max_ptes_shared`` specifies how many pages can be shared across multiple
+processes. Exceeding the number would block the collapse::
+
+ /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared
+
+A higher value may increase memory footprint for some workloads.
+
Boot parameter
==============
@@ -270,19 +334,26 @@ force
Need of application restart
===========================
-The transparent_hugepage/enabled values and tmpfs mount option only affect
-future behavior. So to make them effective you need to restart any
-application that could have been using hugepages. This also applies to the
-regions registered in khugepaged.
+The transparent_hugepage/enabled and
+transparent_hugepage/hugepages-<size>kB/enabled values and tmpfs mount
+option only affect future behavior. So to make them effective you need
+to restart any application that could have been using hugepages. This
+also applies to the regions registered in khugepaged.
Monitoring usage
================
-The number of anonymous transparent huge pages currently used by the
+.. note::
+ Currently the below counters only record events relating to
+ PMD-sized THP. Events relating to other THP sizes are not included.
+
+The number of PMD-sized anonymous transparent huge pages currently used by the
system is available by reading the AnonHugePages field in ``/proc/meminfo``.
-To identify what applications are using anonymous transparent huge pages,
-it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields
-for each mapping.
+To identify what applications are using PMD-sized anonymous transparent huge
+pages, it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages
+fields for each mapping. (Note that AnonHugePages only applies to traditional
+PMD-sized THP for historical reasons and should have been called
+AnonHugePmdMapped).
The number of file transparent huge pages mapped to userspace is available
by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
@@ -298,8 +369,7 @@ monitor how successfully the system is providing huge pages for use.
thp_fault_alloc
is incremented every time a huge page is successfully
- allocated to handle a page fault. This applies to both the
- first time a page is faulted and for COW faults.
+ allocated to handle a page fault.
thp_collapse_alloc
is incremented by khugepaged when it has found
@@ -360,10 +430,9 @@ thp_split_pmd
page table entry.
thp_zero_page_alloc
- is incremented every time a huge zero page is
- successfully allocated. It includes allocations which where
- dropped due race with other allocation. Note, it doesn't count
- every map of the huge zero page, only its allocation.
+ is incremented every time a huge zero page used for thp is
+ successfully allocated. Note, it doesn't count every map of
+ the huge zero page, only its allocation.
thp_zero_page_alloc_failed
is incremented if kernel fails to allocate
@@ -395,30 +464,15 @@ compact_fail
is incremented if the system tries to compact memory
but failed.
-compact_pages_moved
- is incremented each time a page is moved. If
- this value is increasing rapidly, it implies that the system
- is copying a lot of data to satisfy the huge page allocation.
- It is possible that the cost of copying exceeds any savings
- from reduced TLB misses.
-
-compact_pagemigrate_failed
- is incremented when the underlying mechanism
- for moving a page failed.
-
-compact_blocks_moved
- is incremented each time memory compaction examines
- a huge page aligned range of pages.
-
It is possible to establish how long the stalls were using the function
-tracer to record how long was spent in __alloc_pages_nodemask and
+tracer to record how long was spent in __alloc_pages() and
using the mm_page_alloc tracepoint to identify which allocations were
for huge pages.
Optimizing the applications
===========================
-To be guaranteed that the kernel will map a 2M page immediately in any
+To be guaranteed that the kernel will map a THP immediately in any
memory region, the mmap region has to be hugepage naturally
aligned. posix_memalign() can provide that guarantee.
diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index c30176e67900..e5cc8848dcb3 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -1,5 +1,3 @@
-.. _userfaultfd:
-
===========
Userfaultfd
===========
@@ -12,109 +10,173 @@ and more generally they allow userland to take control of various
memory page faults, something otherwise only the kernel code could do.
For example userfaults allows a proper and more optimal implementation
-of the PROT_NONE+SIGSEGV trick.
+of the ``PROT_NONE+SIGSEGV`` trick.
Design
======
-Userfaults are delivered and resolved through the userfaultfd syscall.
+Userspace creates a new userfaultfd, initializes it, and registers one or more
+regions of virtual memory with it. Then, any page faults which occur within the
+region(s) result in a message being delivered to the userfaultfd, notifying
+userspace of the fault.
-The userfaultfd (aside from registering and unregistering virtual
+The ``userfaultfd`` (aside from registering and unregistering virtual
memory ranges) provides two primary functionalities:
-1) read/POLLIN protocol to notify a userland thread of the faults
+1) ``read/POLLIN`` protocol to notify a userland thread of the faults
happening
-2) various UFFDIO_* ioctls that can manage the virtual memory regions
- registered in the userfaultfd that allows userland to efficiently
+2) various ``UFFDIO_*`` ioctls that can manage the virtual memory regions
+ registered in the ``userfaultfd`` that allows userland to efficiently
resolve the userfaults it receives via 1) or to manage the virtual
memory in the background
The real advantage of userfaults if compared to regular virtual memory
management of mremap/mprotect is that the userfaults in all their
operations never involve heavyweight structures like vmas (in fact the
-userfaultfd runtime load never takes the mmap_sem for writing).
-
+``userfaultfd`` runtime load never takes the mmap_lock for writing).
Vmas are not suitable for page- (or hugepage) granular fault tracking
when dealing with virtual address spaces that could span
Terabytes. Too many vmas would be needed for that.
-The userfaultfd once opened by invoking the syscall, can also be
+The ``userfaultfd``, once created, can also be
passed using unix domain sockets to a manager process, so the same
manager process could handle the userfaults of a multitude of
different processes without them being aware about what is going on
-(well of course unless they later try to use the userfaultfd
+(well of course unless they later try to use the ``userfaultfd``
themselves on the same region the manager is already tracking, which
-is a corner case that would currently return -EBUSY).
+is a corner case that would currently return ``-EBUSY``).
API
===
-When first opened the userfaultfd must be enabled invoking the
-UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or
-a later API version) which will specify the read/POLLIN protocol
-userland intends to speak on the UFFD and the uffdio_api.features
-userland requires. The UFFDIO_API ioctl if successful (i.e. if the
-requested uffdio_api.api is spoken also by the running kernel and the
+Creating a userfaultfd
+----------------------
+
+There are two ways to create a new userfaultfd, each of which provide ways to
+restrict access to this functionality (since historically userfaultfds which
+handle kernel page faults have been a useful tool for exploiting the kernel).
+
+The first way, supported since userfaultfd was introduced, is the
+userfaultfd(2) syscall. Access to this is controlled in several ways:
+
+- Any user can always create a userfaultfd which traps userspace page faults
+ only. Such a userfaultfd can be created using the userfaultfd(2) syscall
+ with the flag UFFD_USER_MODE_ONLY.
+
+- In order to also trap kernel page faults for the address space, either the
+ process needs the CAP_SYS_PTRACE capability, or the system must have
+ vm.unprivileged_userfaultfd set to 1. By default, vm.unprivileged_userfaultfd
+ is set to 0.
+
+The second way, added to the kernel more recently, is by opening
+/dev/userfaultfd and issuing a USERFAULTFD_IOC_NEW ioctl to it. This method
+yields equivalent userfaultfds to the userfaultfd(2) syscall.
+
+Unlike userfaultfd(2), access to /dev/userfaultfd is controlled via normal
+filesystem permissions (user/group/mode), which gives fine grained access to
+userfaultfd specifically, without also granting other unrelated privileges at
+the same time (as e.g. granting CAP_SYS_PTRACE would do). Users who have access
+to /dev/userfaultfd can always create userfaultfds that trap kernel page faults;
+vm.unprivileged_userfaultfd is not considered.
+
+Initializing a userfaultfd
+--------------------------
+
+When first opened the ``userfaultfd`` must be enabled invoking the
+``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or
+a later API version) which will specify the ``read/POLLIN`` protocol
+userland intends to speak on the ``UFFD`` and the ``uffdio_api.features``
+userland requires. The ``UFFDIO_API`` ioctl if successful (i.e. if the
+requested ``uffdio_api.api`` is spoken also by the running kernel and the
requested features are going to be enabled) will return into
-uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of
+``uffdio_api.features`` and ``uffdio_api.ioctls`` two 64bit bitmasks of
respectively all the available features of the read(2) protocol and
the generic ioctl available.
-The uffdio_api.features bitmask returned by the UFFDIO_API ioctl
-defines what memory types are supported by the userfaultfd and what
-events, except page fault notifications, may be generated.
-
-If the kernel supports registering userfaultfd ranges on hugetlbfs
-virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in
-uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be
-set if the kernel supports registering userfaultfd ranges on shared
-memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero
-MAP_SHARED, memfd_create, etc).
-
-The userland application that wants to use userfaultfd with hugetlbfs
-or shared memory need to set the corresponding flag in
-uffdio_api.features to enable those features.
-
-If the userland desires to receive notifications for events other than
-page faults, it has to verify that uffdio_api.features has appropriate
-UFFD_FEATURE_EVENT_* bits set. These events are described in more
-detail below in "Non-cooperative userfaultfd" section.
-
-Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
-be invoked (if present in the returned uffdio_api.ioctls bitmask) to
-register a memory range in the userfaultfd by setting the
-uffdio_register structure accordingly. The uffdio_register.mode
+The ``uffdio_api.features`` bitmask returned by the ``UFFDIO_API`` ioctl
+defines what memory types are supported by the ``userfaultfd`` and what
+events, except page fault notifications, may be generated:
+
+- The ``UFFD_FEATURE_EVENT_*`` flags indicate that various other events
+ other than page faults are supported. These events are described in more
+ detail below in the `Non-cooperative userfaultfd`_ section.
+
+- ``UFFD_FEATURE_MISSING_HUGETLBFS`` and ``UFFD_FEATURE_MISSING_SHMEM``
+ indicate that the kernel supports ``UFFDIO_REGISTER_MODE_MISSING``
+ registrations for hugetlbfs and shared memory (covering all shmem APIs,
+ i.e. tmpfs, ``IPCSHM``, ``/dev/zero``, ``MAP_SHARED``, ``memfd_create``,
+ etc) virtual memory areas, respectively.
+
+- ``UFFD_FEATURE_MINOR_HUGETLBFS`` indicates that the kernel supports
+ ``UFFDIO_REGISTER_MODE_MINOR`` registration for hugetlbfs virtual memory
+ areas. ``UFFD_FEATURE_MINOR_SHMEM`` is the analogous feature indicating
+ support for shmem virtual memory areas.
+
+- ``UFFD_FEATURE_MOVE`` indicates that the kernel supports moving an
+ existing page contents from userspace.
+
+The userland application should set the feature flags it intends to use
+when invoking the ``UFFDIO_API`` ioctl, to request that those features be
+enabled if supported.
+
+Once the ``userfaultfd`` API has been enabled the ``UFFDIO_REGISTER``
+ioctl should be invoked (if present in the returned ``uffdio_api.ioctls``
+bitmask) to register a memory range in the ``userfaultfd`` by setting the
+uffdio_register structure accordingly. The ``uffdio_register.mode``
bitmask will specify to the kernel which kind of faults to track for
-the range (UFFDIO_REGISTER_MODE_MISSING would track missing
-pages). The UFFDIO_REGISTER ioctl will return the
-uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
+the range. The ``UFFDIO_REGISTER`` ioctl will return the
+``uffdio_register.ioctls`` bitmask of ioctls that are suitable to resolve
userfaults on the range registered. Not all ioctls will necessarily be
-supported for all memory types depending on the underlying virtual
-memory backend (anonymous memory vs tmpfs vs real filebacked
-mappings).
+supported for all memory types (e.g. anonymous memory vs. shmem vs.
+hugetlbfs), or all types of intercepted faults.
-Userland can use the uffdio_register.ioctls to manage the virtual
+Userland can use the ``uffdio_register.ioctls`` to manage the virtual
address space in the background (to add or potentially also remove
-memory from the userfaultfd registered range). This means a userfault
+memory from the ``userfaultfd`` registered range). This means a userfault
could be triggering just before userland maps in the background the
user-faulted page.
-The primary ioctl to resolve userfaults is UFFDIO_COPY. That
-atomically copies a page into the userfault registered range and wakes
-up the blocked userfaults (unless uffdio_copy.mode &
-UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to
-UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an
-half copied page since it'll keep userfaulting until the copy has
-finished.
+Resolving Userfaults
+--------------------
+
+There are three basic ways to resolve userfaults:
+
+- ``UFFDIO_COPY`` atomically copies some existing page contents from
+ userspace.
+
+- ``UFFDIO_ZEROPAGE`` atomically zeros the new page.
+
+- ``UFFDIO_CONTINUE`` maps an existing, previously-populated page.
+
+These operations are atomic in the sense that they guarantee nothing can
+see a half-populated page, since readers will keep userfaulting until the
+operation has finished.
+
+By default, these wake up userfaults blocked on the range in question.
+They support a ``UFFDIO_*_MODE_DONTWAKE`` ``mode`` flag, which indicates
+that waking will be done separately at some later time.
+
+Which ioctl to choose depends on the kind of page fault, and what we'd
+like to do to resolve it:
+
+- For ``UFFDIO_REGISTER_MODE_MISSING`` faults, the fault needs to be
+ resolved by either providing a new page (``UFFDIO_COPY``), or mapping
+ the zero page (``UFFDIO_ZEROPAGE``). By default, the kernel would map
+ the zero page for a missing fault. With userfaultfd, userspace can
+ decide what content to provide before the faulting thread continues.
+
+- For ``UFFDIO_REGISTER_MODE_MINOR`` faults, there is an existing page (in
+ the page cache). Userspace has the option of modifying the page's
+ contents before resolving the fault. Once the contents are correct
+ (modified or not), userspace asks the kernel to map the page and let the
+ faulting thread continue with ``UFFDIO_CONTINUE``.
Notes:
-- If you requested UFFDIO_REGISTER_MODE_MISSING when registering then
- you must provide some kind of page in your thread after reading from
- the uffd. You must provide either UFFDIO_COPY or UFFDIO_ZEROPAGE.
- The normal behavior of the OS automatically providing a zero page on
- an annonymous mmaping is not in place.
+- You can tell which kind of fault occurred by examining
+ ``pagefault.flags`` within the ``uffd_msg``, checking for the
+ ``UFFD_PAGEFAULT_FLAG_*`` flags.
- None of the page-delivering ioctls default to the range that you
registered with. You must fill in all fields for the appropriate
@@ -122,13 +184,13 @@ Notes:
- You get the address of the access that triggered the missing page
event out of a struct uffd_msg that you read in the thread from the
- uffd. You can supply as many pages as you want with UFFDIO_COPY or
- UFFDIO_ZEROPAGE. Keep in mind that unless you used DONTWAKE then
- the first of any of those IOCTLs wakes up the faulting thread.
+ uffd. You can supply as many pages as you want with these IOCTLs.
+ Keep in mind that unless you used DONTWAKE then the first of any of
+ those IOCTLs wakes up the faulting thread.
-- Be sure to test for all errors including (pollfd[0].revents &
- POLLERR). This can happen, e.g. when ranges supplied were
- incorrect.
+- Be sure to test for all errors including
+ (``pollfd[0].revents & POLLERR``). This can happen, e.g. when ranges
+ supplied were incorrect.
Write Protect Notifications
---------------------------
@@ -136,41 +198,117 @@ Write Protect Notifications
This is equivalent to (but faster than) using mprotect and a SIGSEGV
signal handler.
-Firstly you need to register a range with UFFDIO_REGISTER_MODE_WP.
-Instead of using mprotect(2) you use ioctl(uffd, UFFDIO_WRITEPROTECT,
-struct *uffdio_writeprotect) while mode = UFFDIO_WRITEPROTECT_MODE_WP
+Firstly you need to register a range with ``UFFDIO_REGISTER_MODE_WP``.
+Instead of using mprotect(2) you use
+``ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect)``
+while ``mode = UFFDIO_WRITEPROTECT_MODE_WP``
in the struct passed in. The range does not default to and does not
have to be identical to the range you registered with. You can write
protect as many ranges as you like (inside the registered range).
Then, in the thread reading from uffd the struct will have
-msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP set. Now you send
-ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect) again
-while pagefault.mode does not have UFFDIO_WRITEPROTECT_MODE_WP set.
-This wakes up the thread which will continue to run with writes. This
+``msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP`` set. Now you send
+``ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect)``
+again while ``pagefault.mode`` does not have ``UFFDIO_WRITEPROTECT_MODE_WP``
+set. This wakes up the thread which will continue to run with writes. This
allows you to do the bookkeeping about the write in the uffd reading
thread before the ioctl.
-If you registered with both UFFDIO_REGISTER_MODE_MISSING and
-UFFDIO_REGISTER_MODE_WP then you need to think about the sequence in
+If you registered with both ``UFFDIO_REGISTER_MODE_MISSING`` and
+``UFFDIO_REGISTER_MODE_WP`` then you need to think about the sequence in
which you supply a page and undo write protect. Note that there is a
difference between writes into a WP area and into a !WP area. The
-former will have UFFD_PAGEFAULT_FLAG_WP set, the latter
-UFFD_PAGEFAULT_FLAG_WRITE. The latter did not fail on protection but
-you still need to supply a page when UFFDIO_REGISTER_MODE_MISSING was
+former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter
+``UFFD_PAGEFAULT_FLAG_WRITE``. The latter did not fail on protection but
+you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was
used.
+Userfaultfd write-protect mode currently behave differently on none ptes
+(when e.g. page is missing) over different types of memories.
+
+For anonymous memory, ``ioctl(UFFDIO_WRITEPROTECT)`` will ignore none ptes
+(e.g. when pages are missing and not populated). For file-backed memories
+like shmem and hugetlbfs, none ptes will be write protected just like a
+present pte. In other words, there will be a userfaultfd write fault
+message generated when writing to a missing page on file typed memories,
+as long as the page range was write-protected before. Such a message will
+not be generated on anonymous memories by default.
+
+If the application wants to be able to write protect none ptes on anonymous
+memory, one can pre-populate the memory with e.g. MADV_POPULATE_READ. On
+newer kernels, one can also detect the feature UFFD_FEATURE_WP_UNPOPULATED
+and set the feature bit in advance to make sure none ptes will also be
+write protected even upon anonymous memory.
+
+When using ``UFFDIO_REGISTER_MODE_WP`` in combination with either
+``UFFDIO_REGISTER_MODE_MISSING`` or ``UFFDIO_REGISTER_MODE_MINOR``, when
+resolving missing / minor faults with ``UFFDIO_COPY`` or ``UFFDIO_CONTINUE``
+respectively, it may be desirable for the new page / mapping to be
+write-protected (so future writes will also result in a WP fault). These ioctls
+support a mode flag (``UFFDIO_COPY_MODE_WP`` or ``UFFDIO_CONTINUE_MODE_WP``
+respectively) to configure the mapping this way.
+
+If the userfaultfd context has ``UFFD_FEATURE_WP_ASYNC`` feature bit set,
+any vma registered with write-protection will work in async mode rather
+than the default sync mode.
+
+In async mode, there will be no message generated when a write operation
+happens, meanwhile the write-protection will be resolved automatically by
+the kernel. It can be seen as a more accurate version of soft-dirty
+tracking and it can be different in a few ways:
+
+ - The dirty result will not be affected by vma changes (e.g. vma
+ merging) because the dirty is only tracked by the pte.
+
+ - It supports range operations by default, so one can enable tracking on
+ any range of memory as long as page aligned.
+
+ - Dirty information will not get lost if the pte was zapped due to
+ various reasons (e.g. during split of a shmem transparent huge page).
+
+ - Due to a reverted meaning of soft-dirty (page clean when uffd-wp bit
+ set; dirty when uffd-wp bit cleared), it has different semantics on
+ some of the memory operations. For example: ``MADV_DONTNEED`` on
+ anonymous (or ``MADV_REMOVE`` on a file mapping) will be treated as
+ dirtying of memory by dropping uffd-wp bit during the procedure.
+
+The user app can collect the "written/dirty" status by looking up the
+uffd-wp bit for the pages being interested in /proc/pagemap.
+
+The page will not be under track of uffd-wp async mode until the page is
+explicitly write-protected by ``ioctl(UFFDIO_WRITEPROTECT)`` with the mode
+flag ``UFFDIO_WRITEPROTECT_MODE_WP`` set. Trying to resolve a page fault
+that was tracked by async mode userfaultfd-wp is invalid.
+
+When userfaultfd-wp async mode is used alone, it can be applied to all
+kinds of memory.
+
+Memory Poisioning Emulation
+---------------------------
+
+In response to a fault (either missing or minor), an action userspace can
+take to "resolve" it is to issue a ``UFFDIO_POISON``. This will cause any
+future faulters to either get a SIGBUS, or in KVM's case the guest will
+receive an MCE as if there were hardware memory poisoning.
+
+This is used to emulate hardware memory poisoning. Imagine a VM running on a
+machine which experiences a real hardware memory error. Later, we live migrate
+the VM to another physical machine. Since we want the migration to be
+transparent to the guest, we want that same address range to act as if it was
+still poisoned, even though it's on a new physical host which ostensibly
+doesn't have a memory error in the exact same spot.
+
QEMU/KVM
========
-QEMU/KVM is using the userfaultfd syscall to implement postcopy live
+QEMU/KVM is using the ``userfaultfd`` syscall to implement postcopy live
migration. Postcopy live migration is one form of memory
externalization consisting of a virtual machine running with part or
all of its memory residing on a different node in the cloud. The
-userfaultfd abstraction is generic enough that not a single line of
+``userfaultfd`` abstraction is generic enough that not a single line of
KVM kernel code had to be modified in order to add postcopy live
migration to QEMU.
-Guest async page faults, FOLL_NOWAIT and all other GUP features work
+Guest async page faults, ``FOLL_NOWAIT`` and all other ``GUP*`` features work
just fine in combination with userfaults. Userfaults trigger async
page faults in the guest scheduler so those guest processes that
aren't waiting for userfaults (i.e. network bound) can keep running in
@@ -183,19 +321,19 @@ generating userfaults for readonly guest regions.
The implementation of postcopy live migration currently uses one
single bidirectional socket but in the future two different sockets
will be used (to reduce the latency of the userfaults to the minimum
-possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).
+possible without having to decrease ``/proc/sys/net/ipv4/tcp_wmem``).
The QEMU in the source node writes all pages that it knows are missing
in the destination node, into the socket, and the migration thread of
-the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
-ioctls on the userfaultfd in order to map the received pages into the
-guest (UFFDIO_ZEROCOPY is used if the source page was a zero page).
+the QEMU running in the destination node runs ``UFFDIO_COPY|ZEROPAGE``
+ioctls on the ``userfaultfd`` in order to map the received pages into the
+guest (``UFFDIO_ZEROCOPY`` is used if the source page was a zero page).
A different postcopy thread in the destination node listens with
-poll() to the userfaultfd in parallel. When a POLLIN event is
+poll() to the ``userfaultfd`` in parallel. When a ``POLLIN`` event is
generated after a userfault triggers, the postcopy thread read() from
-the userfaultfd and receives the fault address (or -EAGAIN in case the
-userfault was already resolved and waken by a UFFDIO_COPY|ZEROPAGE run
+the ``userfaultfd`` and receives the fault address (or ``-EAGAIN`` in case the
+userfault was already resolved and waken by a ``UFFDIO_COPY|ZEROPAGE`` run
by the parallel QEMU migration thread).
After the QEMU postcopy thread (running in the destination node) gets
@@ -206,7 +344,7 @@ remaining missing pages from that new page offset. Soon after that
(just the time to flush the tcp_wmem queue through the network) the
migration thread in the QEMU running in the destination node will
receive the page that triggered the userfault and it'll map it as
-usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it
+usual with the ``UFFDIO_COPY|ZEROPAGE`` (without actually knowing if it
was spontaneously sent by the source or if it was an urgent page
requested through a userfault).
@@ -219,74 +357,74 @@ checked to find which missing pages to send in round robin and we seek
over it when receiving incoming userfaults. After sending each page of
course the bitmap is updated accordingly. It's also useful to avoid
sending the same page twice (in case the userfault is read by the
-postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
+postcopy thread just before ``UFFDIO_COPY|ZEROPAGE`` runs in the migration
thread).
Non-cooperative userfaultfd
===========================
-When the userfaultfd is monitored by an external manager, the manager
+When the ``userfaultfd`` is monitored by an external manager, the manager
must be able to track changes in the process virtual memory
layout. Userfaultfd can notify the manager about such changes using
the same read(2) protocol as for the page fault notifications. The
manager has to explicitly enable these events by setting appropriate
-bits in uffdio_api.features passed to UFFDIO_API ioctl:
+bits in ``uffdio_api.features`` passed to ``UFFDIO_API`` ioctl:
-UFFD_FEATURE_EVENT_FORK
- enable userfaultfd hooks for fork(). When this feature is
- enabled, the userfaultfd context of the parent process is
+``UFFD_FEATURE_EVENT_FORK``
+ enable ``userfaultfd`` hooks for fork(). When this feature is
+ enabled, the ``userfaultfd`` context of the parent process is
duplicated into the newly created process. The manager
- receives UFFD_EVENT_FORK with file descriptor of the new
- userfaultfd context in the uffd_msg.fork.
+ receives ``UFFD_EVENT_FORK`` with file descriptor of the new
+ ``userfaultfd`` context in the ``uffd_msg.fork``.
-UFFD_FEATURE_EVENT_REMAP
+``UFFD_FEATURE_EVENT_REMAP``
enable notifications about mremap() calls. When the
non-cooperative process moves a virtual memory area to a
different location, the manager will receive
- UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and
+ ``UFFD_EVENT_REMAP``. The ``uffd_msg.remap`` will contain the old and
new addresses of the area and its original length.
-UFFD_FEATURE_EVENT_REMOVE
+``UFFD_FEATURE_EVENT_REMOVE``
enable notifications about madvise(MADV_REMOVE) and
- madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will
- be generated upon these calls to madvise. The uffd_msg.remove
+ madvise(MADV_DONTNEED) calls. The event ``UFFD_EVENT_REMOVE`` will
+ be generated upon these calls to madvise(). The ``uffd_msg.remove``
will contain start and end addresses of the removed area.
-UFFD_FEATURE_EVENT_UNMAP
+``UFFD_FEATURE_EVENT_UNMAP``
enable notifications about memory unmapping. The manager will
- get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and
+ get ``UFFD_EVENT_UNMAP`` with ``uffd_msg.remove`` containing start and
end addresses of the unmapped area.
-Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP
+Although the ``UFFD_FEATURE_EVENT_REMOVE`` and ``UFFD_FEATURE_EVENT_UNMAP``
are pretty similar, they quite differ in the action expected from the
-userfaultfd manager. In the former case, the virtual memory is
+``userfaultfd`` manager. In the former case, the virtual memory is
removed, but the area is not, the area remains monitored by the
-userfaultfd, and if a page fault occurs in that area it will be
+``userfaultfd``, and if a page fault occurs in that area it will be
delivered to the manager. The proper resolution for such page fault is
to zeromap the faulting address. However, in the latter case, when an
area is unmapped, either explicitly (with munmap() system call), or
implicitly (e.g. during mremap()), the area is removed and in turn the
-userfaultfd context for such area disappears too and the manager will
+``userfaultfd`` context for such area disappears too and the manager will
not get further userland page faults from the removed area. Still, the
notification is required in order to prevent manager from using
-UFFDIO_COPY on the unmapped area.
+``UFFDIO_COPY`` on the unmapped area.
Unlike userland page faults which have to be synchronous and require
explicit or implicit wakeup, all the events are delivered
asynchronously and the non-cooperative process resumes execution as
-soon as manager executes read(). The userfaultfd manager should
-carefully synchronize calls to UFFDIO_COPY with the events
-processing. To aid the synchronization, the UFFDIO_COPY ioctl will
-return -ENOSPC when the monitored process exits at the time of
-UFFDIO_COPY, and -ENOENT, when the non-cooperative process has changed
-its virtual memory layout simultaneously with outstanding UFFDIO_COPY
+soon as manager executes read(). The ``userfaultfd`` manager should
+carefully synchronize calls to ``UFFDIO_COPY`` with the events
+processing. To aid the synchronization, the ``UFFDIO_COPY`` ioctl will
+return ``-ENOSPC`` when the monitored process exits at the time of
+``UFFDIO_COPY``, and ``-ENOENT``, when the non-cooperative process has changed
+its virtual memory layout simultaneously with outstanding ``UFFDIO_COPY``
operation.
The current asynchronous model of the event delivery is optimal for
-single threaded non-cooperative userfaultfd manager implementations. A
+single threaded non-cooperative ``userfaultfd`` manager implementations. A
synchronous event delivery model can be added later as a new
-userfaultfd feature to facilitate multithreading enhancements of the
-non cooperative manager, for example to allow UFFDIO_COPY ioctls to
+``userfaultfd`` feature to facilitate multithreading enhancements of the
+non cooperative manager, for example to allow ``UFFDIO_COPY`` ioctls to
run in parallel to the event reception. Single threaded
implementations should continue to use the current async event
delivery model instead.
diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
new file mode 100644
index 000000000000..b42132969e31
--- /dev/null
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -0,0 +1,178 @@
+=====
+zswap
+=====
+
+Overview
+========
+
+Zswap is a lightweight compressed cache for swap pages. It takes pages that are
+in the process of being swapped out and attempts to compress them into a
+dynamically allocated RAM-based memory pool. zswap basically trades CPU cycles
+for potentially reduced swap I/O. This trade-off can also result in a
+significant performance improvement if reads from the compressed cache are
+faster than reads from a swap device.
+
+Some potential benefits:
+
+* Desktop/laptop users with limited RAM capacities can mitigate the
+ performance impact of swapping.
+* Overcommitted guests that share a common I/O resource can
+ dramatically reduce their swap I/O pressure, avoiding heavy handed I/O
+ throttling by the hypervisor. This allows more work to get done with less
+ impact to the guest workload and guests sharing the I/O subsystem
+* Users with SSDs as swap devices can extend the life of the device by
+ drastically reducing life-shortening writes.
+
+Zswap evicts pages from compressed cache on an LRU basis to the backing swap
+device when the compressed pool reaches its size limit. This requirement had
+been identified in prior community discussions.
+
+Whether Zswap is enabled at the boot time depends on whether
+the ``CONFIG_ZSWAP_DEFAULT_ON`` Kconfig option is enabled or not.
+This setting can then be overridden by providing the kernel command line
+``zswap.enabled=`` option, for example ``zswap.enabled=0``.
+Zswap can also be enabled and disabled at runtime using the sysfs interface.
+An example command to enable zswap at runtime, assuming sysfs is mounted
+at ``/sys``, is::
+
+ echo 1 > /sys/module/zswap/parameters/enabled
+
+When zswap is disabled at runtime it will stop storing pages that are
+being swapped out. However, it will _not_ immediately write out or fault
+back into memory all of the pages stored in the compressed pool. The
+pages stored in zswap will remain in the compressed pool until they are
+either invalidated or faulted back into memory. In order to force all
+pages out of the compressed pool, a swapoff on the swap device(s) will
+fault back into memory all swapped out pages, including those in the
+compressed pool.
+
+Design
+======
+
+Zswap receives pages for compression from the swap subsystem and is able to
+evict pages from its own compressed pool on an LRU basis and write them back to
+the backing swap device in the case that the compressed pool is full.
+
+Zswap makes use of zpool for the managing the compressed memory pool. Each
+allocation in zpool is not directly accessible by address. Rather, a handle is
+returned by the allocation routine and that handle must be mapped before being
+accessed. The compressed memory pool grows on demand and shrinks as compressed
+pages are freed. The pool is not preallocated. By default, a zpool
+of type selected in ``CONFIG_ZSWAP_ZPOOL_DEFAULT`` Kconfig option is created,
+but it can be overridden at boot time by setting the ``zpool`` attribute,
+e.g. ``zswap.zpool=zbud``. It can also be changed at runtime using the sysfs
+``zpool`` attribute, e.g.::
+
+ echo zbud > /sys/module/zswap/parameters/zpool
+
+The zbud type zpool allocates exactly 1 page to store 2 compressed pages, which
+means the compression ratio will always be 2:1 or worse (because of half-full
+zbud pages). The zsmalloc type zpool has a more complex compressed page
+storage method, and it can achieve greater storage densities.
+
+When a swap page is passed from swapout to zswap, zswap maintains a mapping
+of the swap entry, a combination of the swap type and swap offset, to the zpool
+handle that references that compressed swap page. This mapping is achieved
+with a red-black tree per swap type. The swap offset is the search key for the
+tree nodes.
+
+During a page fault on a PTE that is a swap entry, the swapin code calls the
+zswap load function to decompress the page into the page allocated by the page
+fault handler.
+
+Once there are no PTEs referencing a swap page stored in zswap (i.e. the count
+in the swap_map goes to 0) the swap code calls the zswap invalidate function
+to free the compressed entry.
+
+Zswap seeks to be simple in its policies. Sysfs attributes allow for one user
+controlled policy:
+
+* max_pool_percent - The maximum percentage of memory that the compressed
+ pool can occupy.
+
+The default compressor is selected in ``CONFIG_ZSWAP_COMPRESSOR_DEFAULT``
+Kconfig option, but it can be overridden at boot time by setting the
+``compressor`` attribute, e.g. ``zswap.compressor=lzo``.
+It can also be changed at runtime using the sysfs "compressor"
+attribute, e.g.::
+
+ echo lzo > /sys/module/zswap/parameters/compressor
+
+When the zpool and/or compressor parameter is changed at runtime, any existing
+compressed pages are not modified; they are left in their own zpool. When a
+request is made for a page in an old zpool, it is uncompressed using its
+original compressor. Once all pages are removed from an old zpool, the zpool
+and its compressor are freed.
+
+Some of the pages in zswap are same-value filled pages (i.e. contents of the
+page have same value or repetitive pattern). These pages include zero-filled
+pages and they are handled differently. During store operation, a page is
+checked if it is a same-value filled page before compressing it. If true, the
+compressed length of the page is set to zero and the pattern or same-filled
+value is stored.
+
+Same-value filled pages identification feature is enabled by default and can be
+disabled at boot time by setting the ``same_filled_pages_enabled`` attribute
+to 0, e.g. ``zswap.same_filled_pages_enabled=0``. It can also be enabled and
+disabled at runtime using the sysfs ``same_filled_pages_enabled``
+attribute, e.g.::
+
+ echo 1 > /sys/module/zswap/parameters/same_filled_pages_enabled
+
+When zswap same-filled page identification is disabled at runtime, it will stop
+checking for the same-value filled pages during store operation.
+In other words, every page will be then considered non-same-value filled.
+However, the existing pages which are marked as same-value filled pages remain
+stored unchanged in zswap until they are either loaded or invalidated.
+
+In some circumstances it might be advantageous to make use of just the zswap
+ability to efficiently store same-filled pages without enabling the whole
+compressed page storage.
+In this case the handling of non-same-value pages by zswap (enabled by default)
+can be disabled by setting the ``non_same_filled_pages_enabled`` attribute
+to 0, e.g. ``zswap.non_same_filled_pages_enabled=0``.
+It can also be enabled and disabled at runtime using the sysfs
+``non_same_filled_pages_enabled`` attribute, e.g.::
+
+ echo 1 > /sys/module/zswap/parameters/non_same_filled_pages_enabled
+
+Disabling both ``zswap.same_filled_pages_enabled`` and
+``zswap.non_same_filled_pages_enabled`` effectively disables accepting any new
+pages by zswap.
+
+To prevent zswap from shrinking pool when zswap is full and there's a high
+pressure on swap (this will result in flipping pages in and out zswap pool
+without any real benefit but with a performance drop for the system), a
+special parameter has been introduced to implement a sort of hysteresis to
+refuse taking pages into zswap pool until it has sufficient space if the limit
+has been hit. To set the threshold at which zswap would start accepting pages
+again after it became full, use the sysfs ``accept_threshold_percent``
+attribute, e. g.::
+
+ echo 80 > /sys/module/zswap/parameters/accept_threshold_percent
+
+Setting this parameter to 100 will disable the hysteresis.
+
+Some users cannot tolerate the swapping that comes with zswap store failures
+and zswap writebacks. Swapping can be disabled entirely (without disabling
+zswap itself) on a cgroup-basis as follows:
+
+ echo 0 > /sys/fs/cgroup/<cgroup-name>/memory.zswap.writeback
+
+Note that if the store failures are recurring (for e.g if the pages are
+incompressible), users can observe reclaim inefficiency after disabling
+writeback (because the same pages might be rejected again and again).
+
+When there is a sizable amount of cold memory residing in the zswap pool, it
+can be advantageous to proactively write these cold pages to swap and reclaim
+the memory for other use cases. By default, the zswap shrinker is disabled.
+User can enable it as follows:
+
+ echo Y > /sys/module/zswap/parameters/shrinker_enabled
+
+This can be enabled at the boot time if ``CONFIG_ZSWAP_SHRINKER_DEFAULT_ON`` is
+selected.
+
+A debugfs interface is provided for various statistic about pool size, number
+of pages stored, same-value filled pages and various counters for the reasons
+pages are rejected.