
========
zsmalloc
========

This allocator is designed for use with zram. Thus, the allocator is
supposed to work well under low memory conditions. In particular, it
never attempts higher order page allocation which is very likely to
fail under memory pressure. On the other hand, if we just use single
(0-order) pages, it would suffer from very high fragmentation --
any object of size PAGE_SIZE/2 or larger would occupy an entire page.
This was one of the major issues with its predecessor (xvmalloc).

To overcome these issues, zsmalloc allocates a bunch of 0-order pages
and links them together using various 'struct page' fields. These linked
pages act as a single higher-order page i.e. an object can span 0-order
page boundaries. The code refers to these linked pages as a single entity
called zspage.

For simplicity, zsmalloc can only allocate objects of size up to PAGE_SIZE
since this satisfies the requirements of all its current users (in the
worst case, page is incompressible and is thus stored "as-is" i.e. in
uncompressed form). For allocation requests larger than this size, failure
is returned (see zs_malloc()).

Additionally, zs_malloc() does not return a dereferenceable pointer.
Instead, it returns an opaque handle (unsigned long) which encodes actual
location of the allocated object. The reason for this indirection is that
zsmalloc does not keep zspages permanently mapped since that would cause
issues on 32-bit systems where the VA region for kernel space mappings
is very small. So, before using the allocated memory, the object has to
be mapped using zs_map_object() to get a usable pointer and subsequently
unmapped using zs_unmap_object().

stat
====

With CONFIG_ZSMALLOC_STAT, we can see zsmalloc internal information via
``/sys/kernel/debug/zsmalloc/<user name>``. Here is a sample of stat output::

 # cat /sys/kernel/debug/zsmalloc/zram0/classes

 class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage
    ...
    ...
     9   176           0            1           186        129          8                4
    10   192           1            0          2880       2872        135                3
    11   208           0            1           819        795         42                2
    12   224           0            1           219        159         12                4
    ...
    ...


class
	index
size
	size of the objects the zspage stores
almost_full
	the number of ZS_ALMOST_FULL zspages (see below)
almost_empty
	the number of ZS_ALMOST_EMPTY zspages (see below)
obj_allocated
	the number of objects allocated
obj_used
	the number of objects allocated to the user
pages_used
	the number of pages allocated for the class
pages_per_zspage
	the number of 0-order pages to make a zspage

We assign a zspage to the ZS_ALMOST_EMPTY fullness group when n <= N / f, where

* n = number of allocated objects
* N = total number of objects the zspage can store
* f = fullness_threshold_frac (i.e., 4 at the moment)

Similarly, we assign a zspage to:

* ZS_ALMOST_FULL  when n > N / f
* ZS_EMPTY        when n == 0
* ZS_FULL         when n == N
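
The rules above overlap (for example, ``n == 0`` also satisfies
``n <= N / f``), so a classifier has to test the exact cases first. The
following is a minimal sketch of that ordering, not the kernel's
implementation: the enum layout, the function name ``classify_zspage()`` and
the use of integer division are assumptions made for illustration.

.. code-block:: c

    /* Illustrative only: the enum order and the function name are assumptions. */
    enum fullness_group {
        ZS_EMPTY,
        ZS_ALMOST_EMPTY,
        ZS_ALMOST_FULL,
        ZS_FULL,
    };

    /*
     * n: number of allocated objects
     * N: total number of objects the zspage can store
     * f: fullness threshold fraction (4 in the text above)
     */
    static enum fullness_group classify_zspage(unsigned int n, unsigned int N,
                                               unsigned int f)
    {
        /* The exact cases are tested first because the rules overlap. */
        if (n == 0)
            return ZS_EMPTY;
        if (n == N)
            return ZS_FULL;
        if (n <= N / f)     /* integer division, an assumption */
            return ZS_ALMOST_EMPTY;
        return ZS_ALMOST_FULL;
    }

With ``N = 20`` and ``f = 4``, this sketch classifies a zspage holding 5
objects as ZS_ALMOST_EMPTY and one holding 6 objects as ZS_ALMOST_FULL.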


Internals
=========

zsmalloc has 255 size classes, each of which can hold a number of zspages.
Each zspage can contain up to ZSMALLOC_CHAIN_SIZE physical (0-order) pages.
The optimal zspage chain size for each size class is calculated during the
creation of the zsmalloc pool (see calculate_zspage_chain_size()).

As an optimization, zsmalloc merges size classes that have similar
characteristics in terms of the number of pages per zspage and the number
of objects that each zspage can store.

For instance, consider the following size classes::

  class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
  ...
     94  1536           0            0             0          0          0                3        0
    100  1632           0            0             0          0          0                2        0
  ...


Size classes #95-99 are merged with size class #100. This means that when we
need to store an object of size, say, 1568 bytes, we end up using size class
#100 instead of size class #96. Size class #100 is meant for objects of size
1632 bytes, so each object of size 1568 bytes wastes 1632-1568=64 bytes.

Size class #100 consists of zspages with 2 physical pages each, which can
hold a total of 5 objects. If we need to store 13 objects of size 1568, we
end up allocating three zspages, or 6 physical pages.
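
The arithmetic above can be reproduced with a small helper. This is a sketch
under stated assumptions: a 4096-byte page (``EXAMPLE_PAGE_SIZE``) and the
hypothetical helper names ``objs_per_zspage()`` and ``pages_needed()``, none
of which come from the kernel sources.

.. code-block:: c

    /* Assumption: 4 KiB pages; the real PAGE_SIZE depends on the architecture. */
    #define EXAMPLE_PAGE_SIZE 4096u

    /* Number of obj_size-byte objects that fit in one zspage. */
    static unsigned int objs_per_zspage(unsigned int obj_size,
                                        unsigned int pages_per_zspage)
    {
        return pages_per_zspage * EXAMPLE_PAGE_SIZE / obj_size;
    }

    /*
     * Physical pages needed to store nr_objs objects, counted in whole
     * zspages. The caller must make sure at least one object fits in a
     * zspage, otherwise the division below is by zero.
     */
    static unsigned int pages_needed(unsigned int nr_objs, unsigned int obj_size,
                                     unsigned int pages_per_zspage)
    {
        unsigned int objs = objs_per_zspage(obj_size, pages_per_zspage);
        unsigned int zspages = (nr_objs + objs - 1) / objs; /* round up */

        return zspages * pages_per_zspage;
    }

For size class #100, ``objs_per_zspage(1632, 2)`` is 5 and
``pages_needed(13, 1632, 2)`` is 6.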

However, if we take a closer look at size class #96 (which is meant for
objects of size 1568 bytes) and trace ``calculate_zspage_chain_size()``, we
find that the optimal zspage configuration for this class is a chain
of 5 physical pages::

    pages per zspage      wasted bytes     used%
           1                  960           76
           2                  352           95
           3                 1312           89
           4                  704           95
           5                   96           99

This means that a class #96 configuration with 5 physical pages can store 13
objects of size 1568 in a single zspage, using a total of 5 physical pages.
This is more efficient than the class #100 configuration, which would use 6
physical pages to store the same number of objects.
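
The table above can be recomputed as follows. This is a sketch only: the
4096-byte page size, the helper names and the "fewest wasted bytes" selection
rule are assumptions made for illustration, and
``calculate_zspage_chain_size()`` may differ in detail.

.. code-block:: c

    /* Assumption: 4 KiB pages; the real PAGE_SIZE depends on the architecture. */
    #define EXAMPLE_PAGE_SIZE 4096u

    /* Bytes left over when a chain of nr_pages pages is filled with objects. */
    static unsigned int chain_wasted_bytes(unsigned int obj_size,
                                           unsigned int nr_pages)
    {
        return (nr_pages * EXAMPLE_PAGE_SIZE) % obj_size;
    }

    /* Share of the chain occupied by objects, in percent, rounded down. */
    static unsigned int chain_used_percent(unsigned int obj_size,
                                           unsigned int nr_pages)
    {
        unsigned int total = nr_pages * EXAMPLE_PAGE_SIZE;

        return (total - chain_wasted_bytes(obj_size, nr_pages)) * 100 / total;
    }

    /* Shortest chain of at most max_pages pages with the fewest wasted bytes. */
    static unsigned int best_chain_size(unsigned int obj_size,
                                        unsigned int max_pages)
    {
        unsigned int best = 1, i;

        for (i = 2; i <= max_pages; i++) {
            if (chain_wasted_bytes(obj_size, i) <
                chain_wasted_bytes(obj_size, best))
                best = i;
        }
        return best;
    }

For 1568-byte objects, ``best_chain_size(1568, 4)`` selects a 2-page chain,
while ``best_chain_size(1568, 5)`` selects the 5-page chain with 96 wasted
bytes.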

As the zspage chain size for class #96 increases, its key characteristics
such as pages per zspage and objects per zspage also change. This leads to
fewer class mergers, resulting in a more compact grouping of classes, which
reduces memory wastage.

Let's take a closer look at the bottom of ``/sys/kernel/debug/zsmalloc/zramX/classes``::

  class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
  ...
    202  3264           0            0             0          0          0                4        0
    254  4096           0            0             0          0          0                1        0
  ...

Size class #202 stores objects of size 3264 bytes and has a maximum of 4 pages
per zspage. Any object larger than 3264 bytes is considered huge and belongs
to size class #254, which stores each object in its own physical page (objects
in huge classes do not share pages).

Increasing the zspage chain size also results in a higher watermark
for the huge size class and fewer huge classes overall. This allows for more
efficient storage of large objects.

For a zspage chain size of 8, the huge class watermark becomes 3632 bytes::

  class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
  ...
    202  3264           0            0             0          0          0                4        0
    211  3408           0            0             0          0          0                5        0
    217  3504           0            0             0          0          0                6        0
    222  3584           0            0             0          0          0                7        0
    225  3632           0            0             0          0          0                8        0
    254  4096           0            0             0          0          0                1        0
  ...

For a zspage chain size of 16, the huge class watermark becomes 3840 bytes::

  class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
  ...
    202  3264           0            0             0          0          0                4        0
    206  3328           0            0             0          0          0               13        0
    207  3344           0            0             0          0          0                9        0
    208  3360           0            0             0          0          0               14        0
    211  3408           0            0             0          0          0                5        0
    212  3424           0            0             0          0          0               16        0
    214  3456           0            0             0          0          0               11        0
    217  3504           0            0             0          0          0                6        0
    219  3536           0            0             0          0          0               13        0
    222  3584           0            0             0          0          0                7        0
    223  3600           0            0             0          0          0               15        0
    225  3632           0            0             0          0          0                8        0
    228  3680           0            0             0          0          0                9        0
    230  3712           0            0             0          0          0               10        0
    232  3744           0            0             0          0          0               11        0
    234  3776           0            0             0          0          0               12        0
    235  3792           0            0             0          0          0               13        0
    236  3808           0            0             0          0          0               14        0
    238  3840           0            0             0          0          0               15        0
    254  4096           0            0             0          0          0                1        0
  ...

Overall, the combined effect of the zspage chain size on zsmalloc pool configuration::

  pages per zspage   number of size classes (clusters)   huge size class watermark
         4                        69                               3264
         5                        86                               3408
         6                        93                               3504
         7                       112                               3584
         8                       123                               3632
         9                       140                               3680
        10                       143                               3712
        11                       159                               3744
        12                       164                               3776
        13                       180                               3792
        14                       183                               3808
        15                       188                               3840
        16                       191                               3840


A synthetic test
----------------

zram used as storage for build artifacts (Linux kernel compilation).

* ``CONFIG_ZSMALLOC_CHAIN_SIZE=4``

  zsmalloc classes stats::

    class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
    ...
    Total                13           51        413836     412973     159955                         3

  zram mm_stat::

    1691783168 628083717 655175680        0 655175680       60        0    34048    34049


* ``CONFIG_ZSMALLOC_CHAIN_SIZE=8``

  zsmalloc classes stats::

    class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
    ...
    Total                18           87        414852     412978     156666                         0

  zram mm_stat::

    1691803648 627793930 641703936        0 641703936       60        0    33591    33591

Using larger zspage chains may result in using fewer physical pages. In the
example above, the number of physical pages used decreased from 159955 to
156666, and the maximum zsmalloc pool memory usage went down from 655175680
to 641703936 bytes.

However, this advantage may be offset by the potential for increased system
memory pressure (as some zspages have larger chain sizes) in cases where there
is heavy internal fragmentation and zspool compaction is unable to relocate
objects and release zspages. In these cases, it is recommended to decrease
the limit on the size of the zspage chains (as specified by the
CONFIG_ZSMALLOC_CHAIN_SIZE option).
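
The page counts and byte counts reported for the two configurations in the
synthetic test can be cross-checked against each other. The following is a
minimal sketch, assuming 4096-byte pages (the page size is not stated in the
test description); ``EXAMPLE_PAGE_BYTES`` and ``pages_to_bytes()`` are
hypothetical names used only for this illustration.

.. code-block:: c

    /* Assumption: the test machine used 4 KiB pages. */
    #define EXAMPLE_PAGE_BYTES 4096ull

    /* Bytes occupied by nr_pages physical pages. */
    static unsigned long long pages_to_bytes(unsigned long long nr_pages)
    {
        return nr_pages * EXAMPLE_PAGE_BYTES;
    }

Under that assumption, ``pages_to_bytes(159955)`` is 655175680 and
``pages_to_bytes(156666)`` is 641703936, matching the reported pool memory
usage. The difference of 3289 pages corresponds to 13471744 bytes.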