linux/kernel
Martin KaFai Lau 3a08c2fd76 bpf: LRU List
Introduce bpf_lru_list which will provide LRU capability to
the bpf_htab in the later patch.

* General Thoughts:
1. Target use case.  Read is more often than update.
   (i.e. bpf_lookup_elem() is more often than bpf_update_elem()).
   If bpf_prog does a bpf_lookup_elem() first and then an in-place
   update, it still counts as a read operation to the LRU list concern.
2. It may be useful to think of it as a LRU cache
3. Optimize the read case
   3.1 No lock in read case
   3.2 The LRU maintenance is only done during bpf_update_elem()
4. If there is a percpu LRU list, it will lose the system-wise LRU
   property.  A completely isolated percpu LRU list has the best
   performance but the memory utilization is not ideal considering
   the work load may be imbalance.
5. Hence, this patch starts the LRU implementation with a global LRU
   list with batched operations before accessing the global LRU list.
   As a LRU cache, #read >> #update/#insert operations, it will work well.
6. There is a local list (for each cpu) which is named
   'struct bpf_lru_locallist'.  This local list is not used to sort
   the LRU property.  Instead, the local list is to batch enough
   operations before acquiring the lock of the global LRU list.  More
   details on this later.
7. In the later patch, it allows a percpu LRU list by specifying a
   map-attribute for scalability reason and for use cases that need to
   prepare for the worst (and pathological) case like DoS attack.
   The percpu LRU list is completely isolated from each other and the
   LRU nodes (including free nodes) cannot be moved across the list.  The
   following description is for the global LRU list but mostly applicable
   to the percpu LRU list also.

* Global LRU List:
1. It has three sub-lists: active-list, inactive-list and free-list.
2. The two list idea, active and inactive, is borrowed from the
   page cache.
3. All nodes are pre-allocated and all sit at the free-list (of the
   global LRU list) at the beginning.  The pre-allocation reasoning
   is similar to the existing BPF_MAP_TYPE_HASH.  However,
   opting-out prealloc (BPF_F_NO_PREALLOC) is not supported in
   the LRU map.

* Active/Inactive List (of the global LRU list):
1. The active list, as its name says it, maintains the active set of
   the nodes.  We can think of it as the working set or more frequently
   accessed nodes.  The access frequency is approximated by a ref-bit.
   The ref-bit is set during the bpf_lookup_elem().
2. The inactive list, as its name also says it, maintains a less
   active set of nodes.  They are the candidates to be removed
   from the bpf_htab when we are running out of free nodes.
3. The ordering of these two lists is acting as a rough clock.
   The tail of the inactive list is the older nodes and
   should be released first if the bpf_htab needs free element.

* Rotating the Active/Inactive List (of the global LRU list):
1. It is the basic operation to maintain the LRU property of
   the global list.
2. The active list is only rotated when the inactive list is running
   low.  This idea is similar to the current page cache.
   Inactive running low is currently defined as
   "# of inactive < # of active".
3. The active list rotation always starts from the tail.  It moves
   node without ref-bit set to the head of the inactive list.
   It moves node with ref-bit set back to the head of the active
   list and then clears its ref-bit.
4. The inactive rotation is pretty simply.
   It walks the inactive list and moves the nodes back to the head of
   active list if its ref-bit is set. The ref-bit is cleared after moving
   to the active list.
   If the node does not have ref-bit set, it just leave it as it is
   because it is already in the inactive list.

* Shrinking the Inactive List (of the global LRU list):
1. Shrinking is the operation to get free nodes when the bpf_htab is
   full.
2. It usually only shrinks the inactive list to get free nodes.
3. During shrinking, it will walk the inactive list from the tail,
   delete the nodes without ref-bit set from bpf_htab.
4. If no free node found after step (3), it will forcefully get
   one node from the tail of inactive or active list.  Forcefully is
   in the sense that it ignores the ref-bit.

* Local List:
1. Each CPU has a 'struct bpf_lru_locallist'.  The purpose is to
   batch enough operations before acquiring the lock of the
   global LRU.
2. A local list has two sub-lists, free-list and pending-list.
3. During bpf_update_elem(), it will try to get from the free-list
   of (the current CPU local list).
4. If the local free-list is empty, it will acquire from the
   global LRU list.  The global LRU list can either satisfy it
   by its global free-list or by shrinking the global inactive
   list.  Since we have acquired the global LRU list lock,
   it will try to get at most LOCAL_FREE_TARGET elements
   to the local free list.
5. When a new element is added to the bpf_htab, it will
   first sit at the pending-list (of the local list) first.
   The pending-list will be flushed to the global LRU list
   when it needs to acquire free nodes from the global list
   next time.

* Lock Consideration:
The LRU list has a lock (lru_lock).  Each bucket of htab has a
lock (buck_lock).  If both locks need to be acquired together,
the lock order is always lru_lock -> buck_lock and this only
happens in the bpf_lru_list.c logic.

In hashtab.c, both locks are not acquired together (i.e. one
lock is always released first before acquiring another lock).

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 11:50:20 -05:00
..
bpf bpf: LRU List 2016-11-15 11:50:20 -05:00
configs config: android: enable CONFIG_SECCOMP 2016-10-11 15:06:32 -07:00
debug
events Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2016-10-28 16:27:16 -07:00
gcov gcov: add support for gcc version >= 6 2016-07-15 14:54:27 +09:00
irq genirq: Use irq type from irqdata instead of irqdesc 2016-11-08 15:15:19 +01:00
livepatch livepatch/module: make TAINT_LIVEPATCH module-specific 2016-08-26 14:42:08 +02:00
locking locking/lglock: Remove lglock implementation 2016-09-22 15:25:56 +02:00
power PM / sleep: fix device reference leak in test_suspend 2016-11-02 05:10:04 +01:00
printk Revert "printk: make reading the kernel log flush pending lines" 2016-11-14 09:31:52 -08:00
rcu This adds a new gcc plugin named "latent_entropy". It is designed to 2016-10-15 10:03:15 -07:00
sched sched/core: Remove pointless printout in sched_show_task() 2016-11-03 07:31:34 +01:00
time timers: Prevent base clock corruption when forwarding 2016-10-25 16:32:50 +02:00
trace bpf: add helper for retrieving current numa node id 2016-10-22 17:05:52 -04:00
.gitignore
acct.c
async.c
audit_fsnotify.c
audit_tree.c audit: cleanup prune_tree_thread 2016-04-04 09:46:47 -04:00
audit_watch.c Merge branch 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit 2016-09-01 15:55:56 -07:00
audit.c Merge branch 'stable-4.9' of git://git.infradead.org/users/pcmoore/audit 2016-10-04 14:21:41 -07:00
audit.h Merge branch 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit 2016-07-29 17:54:17 -07:00
auditfilter.c audit: add fields to exclude filter by reusing user filter 2016-06-27 11:01:00 -04:00
auditsc.c Merge branch 'stable-4.9' of git://git.infradead.org/users/pcmoore/audit 2016-10-04 14:21:41 -07:00
backtracetest.c
bounds.c
capability.c kernel: Add noaudit variant of ns_capable() 2016-06-06 20:16:18 +10:00
cgroup_freezer.c
cgroup_pids.c cgroup: Use lld instead of ld when printing pids controller events_limit 2016-06-21 15:03:36 -04:00
cgroup.c Merge branch 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2016-10-14 12:18:50 -07:00
compat.c
configs.c
context_tracking.c
cpu_pm.c
cpu.c cpu/hotplug: Use distinct name for cpu_hotplug.dep_map 2016-10-16 11:09:32 +02:00
cpuset.c Merge branch 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2016-10-14 12:18:50 -07:00
crash_dump.c
cred.c cred: Reject inodes with invalid ids in set_create_file_as() 2016-06-30 18:05:09 -05:00
delayacct.c
dma.c
elfcore.c
exec_domain.c
exit.c mm, oom: enforce exit_oom_victim on current task 2016-10-07 18:46:28 -07:00
extable.c
fork.c fork: Add task stack refcounting sanity check and prevent premature task stack freeing 2016-11-01 07:39:17 +01:00
freezer.c freezer, oom: check TIF_MEMDIE on the correct task 2016-07-28 16:07:41 -07:00
futex_compat.c
futex.c futex: Add some more function commentry 2016-09-05 17:20:18 +02:00
groups.c cred: simpler, 1D supplementary groups 2016-10-07 18:46:30 -07:00
hung_task.c hung_task: allow hung_task_panic when hung_task_warnings is 0 2016-10-11 15:06:33 -07:00
irq_work.c
jump_label.c powerpc updates for 4.8 #2 2016-08-05 09:00:54 -04:00
kallsyms.c kallsyms: add support for relative offsets in kallsyms address table 2016-03-15 16:55:16 -07:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt
kcov.c kcov: properly check if we are in an interrupt 2016-10-27 18:43:42 -07:00
kexec_core.c kexec: add restriction on kexec_load() segment sizes 2016-08-02 19:35:31 -04:00
kexec_file.c kexec: fix double-free when failing to relocate the purgatory 2016-09-01 17:52:01 -07:00
kexec_internal.h
kexec.c kexec: allow architectures to override boot mapping 2016-08-02 19:35:27 -04:00
kmod.c
kprobes.c kprobes: include <asm/sections.h> instead of <asm-generic/sections.h> 2016-10-11 15:06:31 -07:00
ksysfs.c kexec: add a kexec_crash_loaded() function 2016-08-02 19:35:30 -04:00
kthread.c kthread: better support freezable kthread workers 2016-10-11 15:06:33 -07:00
latencytop.c
Makefile userns: Add per user namespace sysctls. 2016-08-08 13:18:58 -05:00
membarrier.c
memremap.c mm: fix cache mode of dax pmd mappings 2016-09-09 17:34:46 -07:00
module_signing.c KEYS: Move the point of trust determination to __key_link() 2016-04-11 22:43:43 +01:00
module-internal.h
module.c livepatch/module: make TAINT_LIVEPATCH module-specific 2016-08-26 14:42:08 +02:00
notifier.c
nsproxy.c
padata.c padata: Convert to hotplug state machine 2016-09-19 21:44:30 +02:00
panic.c x86/panic: replace smp_send_stop() with kdump friendly version in panic path 2016-10-11 15:06:32 -07:00
params.c
pid_namespace.c Merge branch 'nsfs-ioctls' into HEAD 2016-09-22 20:00:36 -05:00
pid.c remove lots of IS_ERR_VALUE abuses 2016-05-27 15:26:11 -07:00
profile.c profile: Convert to hotplug state machine 2016-07-15 10:41:42 +02:00
ptrace.c mm: replace access_process_vm() write parameter with gup_flags 2016-10-19 08:31:25 -07:00
range.c
reboot.c
relay.c relay: Use irq_work instead of plain timer for deferred wakeup 2016-10-11 15:06:32 -07:00
resource.c /proc/iomem: only expose physical resource addresses to privileged users 2016-04-14 12:56:09 -07:00
seccomp.c seccomp: Fix tracer exit notifications during fatal signals 2016-08-30 16:12:46 -07:00
signal.c x86/signal: Add SA_{X32,IA32}_ABI sa_flags 2016-09-14 21:28:11 +02:00
smp.c smp: Allocate smp_call_on_cpu() workqueue on stack too 2016-09-22 14:49:10 +02:00
smpboot.c kthread/smpboot: do not park in kthread_create_on_cpu() 2016-10-11 15:06:33 -07:00
smpboot.h
softirq.c softirq: Display IRQ_POLL for irq-poll statistics 2016-10-21 15:45:47 -06:00
stacktrace.c
stop_machine.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2016-10-03 13:39:00 -07:00
sys_ni.c x86/pkeys: Fix pkeys build breakage for some non-x86 arches 2016-09-13 14:41:36 +02:00
sys.c prctl: make PR_SET_THP_DISABLE wait for mmap_sem killable 2016-05-23 17:04:14 -07:00
sysctl_binary.c kernel/sysctl_binary.c: use generic UUID library 2016-05-20 17:58:30 -07:00
sysctl.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2016-10-10 13:04:49 -07:00
task_work.c task_work: use READ_ONCE/lockless_dereference, avoid pi_lock if !task_works 2016-08-02 19:35:02 -04:00
taskstats.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-11-15 10:54:36 -05:00
test_kprobes.c
torture.c torture: Convert torture_shutdown() to hrtimer 2016-08-22 10:01:49 -07:00
tracepoint.c kernel/...: convert pr_warning to pr_warn 2016-03-22 15:36:02 -07:00
tsacct.c
ucount.c mntns: Add a limit on the number of mount namespaces. 2016-08-31 07:28:35 -05:00
uid16.c cred: simpler, 1D supplementary groups 2016-10-07 18:46:30 -07:00
up.c smp: Add function to execute a function synchronously on a CPU 2016-09-05 13:52:39 +02:00
user_namespace.c Merge branch 'nsfs-ioctls' into HEAD 2016-09-22 20:00:36 -05:00
user-return-notifier.c
user.c
utsname_sysctl.c
utsname.c Merge branch 'nsfs-ioctls' into HEAD 2016-09-22 20:00:36 -05:00
watchdog.c Revert "perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86" 2016-07-10 20:58:36 +02:00
workqueue_internal.h
workqueue.c kthread: rename probe_kthread_data() to kthread_probe_data() 2016-10-11 15:06:33 -07:00