QUIC Thread Assisted Mode: Add design document

Reviewed-by: Tomas Mraz <tomas@openssl.org>
Reviewed-by: Matt Caswell <matt@openssl.org>
(Merged from https://github.com/openssl/openssl/pull/20348)
2024-12-14 12:34:02 +08:00 · 2023-02-22 05:55:23 +00:00
commit 27c49c06f1
parent 3b1ab5a3a0
1 changed file with 104 additions and 0 deletions
--- a/doc/designs/quic-design/quic-thread-assist.md
+++ b/doc/designs/quic-design/quic-thread-assist.md
@@ -0,0 +1,104 @@
+QUIC Thread Assisted Mode Synchronisation Requirements
+======================================================
+
+In thread assisted mode, we spin up a background thread to ensure that periodic
+QUIC processing is handled in a timely fashion regardless of whether an
+application is frequently calling (or blocked in) SSL API I/O functions.
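The following is a minimal sketch of such an assist thread using POSIX threads. It assumes a hypothetical `struct conn` and a hypothetical `conn_tick()` standing in for the periodic QUIC processing; the 10 ms period is also an arbitrary assumption. The thread sleeps on a condition variable with a deadline, so it wakes either when a tick is due or when it is asked to stop.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-connection state; not an OpenSSL structure. */
struct conn {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            stop;
    unsigned long   ticks;  /* number of periodic ticks performed so far */
};

/* Hypothetical periodic processing (timers, retransmission, ACKs, ...). */
static void conn_tick(struct conn *c)
{
    c->ticks++;
}

/* Absolute CLOCK_REALTIME deadline `ms` milliseconds from now. */
static struct timespec deadline_in_ms(long ms)
{
    struct timespec ts;

    clock_gettime(CLOCK_REALTIME, &ts);
    ts.tv_sec  += ms / 1000;
    ts.tv_nsec += (ms % 1000) * 1000000L;
    if (ts.tv_nsec >= 1000000000L) {
        ts.tv_sec++;
        ts.tv_nsec -= 1000000000L;
    }
    return ts;
}

/* Assist thread body: tick each time the deadline passes, until told to stop. */
static void *assist_main(void *arg)
{
    struct conn *c = arg;

    pthread_mutex_lock(&c->mutex);
    while (!c->stop) {
        struct timespec dl = deadline_in_ms(10);
        /* Releases the mutex while waiting; ETIMEDOUT means a tick is due. */
        int rc = pthread_cond_timedwait(&c->cond, &c->mutex, &dl);
        if (rc == ETIMEDOUT && !c->stop)
            conn_tick(c);
    }
    pthread_mutex_unlock(&c->mutex);
    return NULL;
}

static int assist_start(struct conn *c, pthread_t *t)
{
    pthread_mutex_init(&c->mutex, NULL);
    pthread_cond_init(&c->cond, NULL);
    c->stop  = false;
    c->ticks = 0;
    return pthread_create(t, NULL, assist_main, c);
}

/* Ask the thread to stop, wake it, and wait for it to exit. */
static void assist_stop(struct conn *c, pthread_t t)
{
    pthread_mutex_lock(&c->mutex);
    c->stop = true;
    pthread_cond_signal(&c->cond);
    pthread_mutex_unlock(&c->mutex);
    pthread_join(t, NULL);
    pthread_cond_destroy(&c->cond);
    pthread_mutex_destroy(&c->mutex);
}
```

Any application-side I/O call would need to take the same `mutex` before touching connection state; what that lock may safely cover is the question the rest of this document addresses.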
+
+Part of the QUIC state comprises the TLS handshake layer (HL). However,
+synchronising access to the HL is extremely difficult.
+
+At first glance, one could synchronise the handshake layer's public APIs by
+locking a per-connection mutex for the duration of any public API call which we
+forward to the handshake layer. Since we forward a very large number of APIs to
+the handshake layer, this would require adding locking code to every single
+public HL-related API call.
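To illustrate the per-API cost, this sketch shows the lock/forward/unlock wrapper that every forwarded call would need. All names here (`struct hl`, `struct qconn`, `hl_get_foo()`, `qconn_get_foo()` and so on) are hypothetical; `foo` follows the placeholder used elsewhere in this document.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical handshake-layer object holding one piece of state, "foo". */
struct hl {
    int foo;
};

/* Hypothetical QUIC connection wrapping the HL plus a per-connection mutex. */
struct qconn {
    pthread_mutex_t mutex;
    struct hl       hl;
};

/* Unlocked HL accessors, as the HL APIs behave without thread assistance. */
static int  hl_get_foo(const struct hl *hl)  { return hl->foo; }
static void hl_set_foo(struct hl *hl, int v) { hl->foo = v; }

/* Public wrappers: lock, forward to the HL, unlock. One pair per forwarded API. */
static int qconn_get_foo(struct qconn *c)
{
    pthread_mutex_lock(&c->mutex);
    int v = hl_get_foo(&c->hl);
    pthread_mutex_unlock(&c->mutex);
    return v;
}

static void qconn_set_foo(struct qconn *c, int v)
{
    pthread_mutex_lock(&c->mutex);
    hl_set_foo(&c->hl, v);
    pthread_mutex_unlock(&c->mutex);
}
```

Multiplied across every forwarded HL API, this boilerplate is the "very large number of code changes" in question.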
+
+However, on closer inspection, this does not even solve the problem, as
+applications' existing usage of the HL APIs assumes exclusive access, and thus
+consistency across multiple API calls. For example:
+
+    x = SSL_get_foo(s);
+    /* application mutates x */
+    SSL_set_foo(s, x);
+
+With per-call locking, the lock would be held only for the duration of each
+individual get or set call. The combination of the two would not be atomic:
+if the assist thread processed an event which mutated `foo` between the two
+calls, the application's subsequent set would silently overwrite that change.
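The lost update can be reproduced deterministically by replaying the problematic interleaving on a single thread. In this sketch, `struct qconn`, `get_foo()`, `set_foo()` and `assist_event()` are hypothetical; `assist_event()` stands in for whatever event the assist thread processes between the application's get and set calls. Every access is individually locked, yet the assist thread's update is still lost.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical connection: `foo` is HL state guarded by a per-call mutex. */
struct qconn {
    pthread_mutex_t mutex;
    int             foo;
};

static int get_foo(struct qconn *c)
{
    pthread_mutex_lock(&c->mutex);
    int v = c->foo;
    pthread_mutex_unlock(&c->mutex);
    return v;
}

static void set_foo(struct qconn *c, int v)
{
    pthread_mutex_lock(&c->mutex);
    c->foo = v;
    pthread_mutex_unlock(&c->mutex);
}

/* Hypothetical assist-thread event that mutates foo (also correctly locked). */
static void assist_event(struct qconn *c)
{
    pthread_mutex_lock(&c->mutex);
    c->foo += 100;
    pthread_mutex_unlock(&c->mutex);
}

/*
 * Replays the interleaving: the application reads foo, the assist thread
 * runs, then the application writes back its modified (now stale) copy.
 * Returns the final value of foo.
 */
static int replay_get_mutate_set(struct qconn *c)
{
    int x = get_foo(c);  /* application: x = SSL_get_foo(s); */
    assist_event(c);     /* assist thread runs between the two calls */
    x += 1;              /* application mutates x */
    set_foo(c, x);       /* application: SSL_set_foo(s, x); */
    return get_foo(c);
}
```

Starting from `foo == 0`, `replay_get_mutate_set()` returns 1 rather than 101: the assist thread's `+100` is overwritten by the application's stale copy.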
+
+As such, there are really only three possible solutions:
+
+- **1. Application-controlled explicit locking.**
+
+  We would offer something like `SSL_lock()` and `SSL_unlock()`.
+  An application performing a single HL API call, or a sequence of related HL
+  calls, would be required to take the lock. As a special exception, an
+  application is not required to take the lock prior to connection
+  (specifically, prior to the instantiation of a QUIC channel and consequent
+  assist thread creation).
+
+  The key disadvantage here is that it requires API changes which applications
+  must adopt, although, since most HL API calls made by an application probably
+  happen prior to initiating a connection, the burden may not be that great. It
+  would also be required only for applications which want to use thread
+  assisted mode.
+
+  Pro: Most “robust” solution in terms of HL evolution.
+
+  Con: API changes.
+
+- **2. Handshake layer always belongs to the application thread.**
+
+  In this model, the handshake layer “belongs” to the application thread
+  and the assist thread is never allowed to touch it:
+
+  - `SSL_tick()` (or another I/O function) called by the application fully
+    services the connection.
+
+  - The assist thread performs a reduced tick operation which does everything
+    except servicing the crypto stream, or any other events we may define in
+    future which would be processed by the handshake layer.
+
+  - This is rather hacky but should work adequately. When using TLS 1.3
+    as the handshake layer, the only thing we actually need to worry about
+    servicing after handshake completion is the New Session Ticket message,
+    which doesn't need to be acknowledged and isn't “urgent”. The other
+    post-handshake messages used by TLS 1.3 aren't relevant to QUIC TLS:
+
+    - Post-handshake authentication is not allowed;
+
+    - Key update uses a separate, QUIC-specific method;
+
+    - TLS alerts are signalled via `CONNECTION_CLOSE` frames rather than the TLS
+      1.3 Alert message; thus if a peer's HL does raise an alert after
+      handshake completion (which would in itself be highly unusual), we simply
+      receive a `CONNECTION_CLOSE` frame and process it normally.
+
+  Thus, so long as we don't expect our own TLS implementation to spontaneously
+  generate alerts or New Session Ticket messages after handshake completion,
+  this should work.
+
+  Pro: No API changes.
+
+  Con: Somewhat hacky solution.
+
+- **3. Handshake layer belongs to the assist thread after connection begins.**
+
+  In this model, the application may make handshake layer calls freely prior to
+  connecting, but after that, ownership of the HL is transferred to the assist
+  thread and may not be touched further. We would need to block all API calls
+  which would forward to the HL after connection commences (specifically, after
+  the QUIC channel is instantiated).
+
+  Con: Many applications probably expect to be able to query the HL after
+  connection. We could selectively enable some important post-handshake HL calls
+  by specially implementing synchronised forwarders, but doing this in the
+  general case runs into the same issues as option 1 above. We could enable
+  only APIs we consider to have safe semantics here, e.g. implementing only
+  getters and not setters, and focusing on APIs which return data that doesn't
+  change after connection. The work required is proportional to the number of
+  APIs to be enabled. Some APIs have no way to indicate failure; for any such
+  API which we don't implement for thread assisted post-handshake QUIC, we would
+  essentially have to return incorrect data.
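Option 1 would look roughly like this from the application's side. `SSL_lock()` and `SSL_unlock()` are only names proposed in this document, not existing OpenSSL functions, so the sketch uses hypothetical `conn_lock()`/`conn_unlock()` on a hypothetical `struct qconn`. The application holds the lock across the whole get/mutate/set sequence, and the assist thread takes the same lock before touching HL state.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical connection object standing in for an SSL object. */
struct qconn {
    pthread_mutex_t mutex;
    int             foo;  /* HL state, as in the SSL_get_foo() example */
};

/* Hypothetical equivalents of the proposed SSL_lock()/SSL_unlock(). */
static void conn_lock(struct qconn *c)   { pthread_mutex_lock(&c->mutex); }
static void conn_unlock(struct qconn *c) { pthread_mutex_unlock(&c->mutex); }

/* HL accessors: the caller must hold the connection lock. */
static int  get_foo(const struct qconn *c)  { return c->foo; }
static void set_foo(struct qconn *c, int v) { c->foo = v; }

#define ROUNDS 100000

/* Assist thread: each event touching HL state runs under the connection lock. */
static void *assist_main(void *arg)
{
    struct qconn *c = arg;

    for (int i = 0; i < ROUNDS; i++) {
        conn_lock(c);
        set_foo(c, get_foo(c) + 1);
        conn_unlock(c);
    }
    return NULL;
}

/* Application: a multi-call HL sequence made atomic by explicit locking. */
static int run_app_and_assist(struct qconn *c)
{
    pthread_t t;

    if (pthread_create(&t, NULL, assist_main, c) != 0)
        return -1;
    for (int i = 0; i < ROUNDS; i++) {
        conn_lock(c);
        int x = get_foo(c);  /* x = SSL_get_foo(s); */
        x += 1;              /* application mutates x */
        set_foo(c, x);       /* SSL_set_foo(s, x); */
        conn_unlock(c);
    }
    pthread_join(t, NULL);
    return get_foo(c);       /* safe without the lock: assist thread has exited */
}
```

Because both sides hold the lock for the entire read-modify-write, no update is lost and the final value is exactly `2 * ROUNDS`.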
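Option 2's argument rests on which TLS 1.3 post-handshake messages can arrive over QUIC. The sketch below encodes that reasoning as a classifier. The message type codes are the TLS 1.3 HandshakeType values from RFC 8446; the enum and function names are hypothetical. NewSessionTicket can safely wait for the application thread, while post-handshake CertificateRequest and KeyUpdate are not permitted when TLS is used with QUIC (RFC 9001).

```c
#include <assert.h>

/* TLS 1.3 HandshakeType values for post-handshake messages (RFC 8446). */
enum {
    TLS13_MT_NEW_SESSION_TICKET  = 4,
    TLS13_MT_CERTIFICATE_REQUEST = 13,
    TLS13_MT_KEY_UPDATE          = 24
};

/* Hypothetical classification of a post-handshake message received over QUIC. */
enum post_hs_action {
    POST_HS_DEFER_TO_APP_THREAD,  /* harmless to process late, on an application tick */
    POST_HS_PROTOCOL_ERROR        /* not permitted when TLS is used with QUIC */
};

static enum post_hs_action classify_post_handshake(int msg_type)
{
    switch (msg_type) {
    case TLS13_MT_NEW_SESSION_TICKET:
        /* Not acknowledged and not urgent: can wait for the application thread. */
        return POST_HS_DEFER_TO_APP_THREAD;
    case TLS13_MT_CERTIFICATE_REQUEST: /* post-handshake authentication not allowed */
    case TLS13_MT_KEY_UPDATE:          /* QUIC uses its own key update method */
    default:                           /* no other handshake message is valid here */
        return POST_HS_PROTOCOL_ERROR;
    }
}
```

Alerts do not appear here at all, since under QUIC they are carried by `CONNECTION_CLOSE` frames rather than TLS records.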
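Option 3 amounts to a gate in front of every HL-forwarding API. In this sketch all names are hypothetical: `struct qconn` records whether the QUIC channel has been instantiated, after which the setter refuses to forward. One getter for data assumed not to change after connection (a negotiated ALPN string, purely as an example) remains available as a synchronised forwarder that copies the value under the connection mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical connection state. */
struct qconn {
    pthread_mutex_t mutex;           /* shared with the assist thread */
    bool            channel_created; /* set once the QUIC channel exists */
    int             foo;             /* mutable HL state */
    char            alpn[16];        /* assumed not to change after connection */
};

/* Setter: forwards only while the application owns the HL. 1 = ok, 0 = refused. */
static int qconn_set_foo(struct qconn *c, int v)
{
    if (c->channel_created)
        return 0;                    /* HL now belongs to the assist thread */
    c->foo = v;
    return 1;
}

/*
 * Selectively enabled getter, implemented as a synchronised forwarder:
 * copies the value out under the connection mutex.
 * Returns 1 on success, 0 if the caller's buffer is too small.
 */
static int qconn_get_alpn(struct qconn *c, char *out, size_t out_len)
{
    int ok = 0;

    pthread_mutex_lock(&c->mutex);
    size_t n = strlen(c->alpn);
    if (n < out_len) {
        memcpy(out, c->alpn, n + 1);
        ok = 1;
    }
    pthread_mutex_unlock(&c->mutex);
    return ok;
}

/* Begin connecting; a real implementation would also start the assist thread here. */
static void qconn_connect(struct qconn *c)
{
    c->channel_created = true;
}
```

`qconn_set_foo()` can refuse only because it has a return value; an API with no failure path would have nothing sensible to return, which is the limitation described above.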
+
+Option 2 has been chosen as the basis for implementation.
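Since option 2 was chosen, here is a minimal sketch of its split between a full tick on the application thread and a reduced tick on the assist thread. `struct qconn`, `qconn_tick()`, `service_crypto_stream()` and the counters are hypothetical, and the locking still needed for the non-HL QUIC state is omitted; the point is only that handshake layer work is gated on the caller being the application thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-connection state tracked by the QUIC engine. */
struct qconn {
    int timers_due;          /* timer/loss-detection work, safe on either thread */
    int crypto_bytes_queued; /* received CRYPTO frame data awaiting the HL */
    int hl_bytes_consumed;   /* bytes actually handed to the handshake layer */
};

/* Hypothetical stand-in for feeding queued crypto stream data to the HL. */
static void service_crypto_stream(struct qconn *c)
{
    c->hl_bytes_consumed += c->crypto_bytes_queued;
    c->crypto_bytes_queued = 0;
}

/*
 * Full tick when called from the application thread (e.g. SSL_tick() or an
 * I/O call); reduced tick when called from the assist thread, which never
 * touches the handshake layer.
 */
static void qconn_tick(struct qconn *c, bool from_app_thread)
{
    c->timers_due = 0;              /* generic QUIC processing */
    if (from_app_thread)
        service_crypto_stream(c);   /* HL work: application thread only */
}
```

Crypto stream data received while only the assist thread is running simply stays queued until the application next calls into the connection.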