Merge: #23088: Clarify null termination of bytes and strings in C API.

2024-11-24 10:24:35 +08:00 · 2015-05-13 20:32:19 -04:00 · 812bc1b86b
commit 812bc1b86b
parent b01a1fdb94 0a560a11af
3 changed files with 44 additions and 31 deletions
--- a/Doc/c-api/bytearray.rst
+++ b/Doc/c-api/bytearray.rst
@@ -64,7 +64,8 @@ Direct API functions
 .. c:function:: char* PyByteArray_AsString(PyObject *bytearray)

   Return the contents of *bytearray* as a char array after checking for a
-   *NULL* pointer.
+   *NULL* pointer.  The returned array always has an extra
+   null byte appended.


 .. c:function:: int PyByteArray_Resize(PyObject *bytearray, Py_ssize_t len)
--- a/Doc/c-api/bytes.rst
+++ b/Doc/c-api/bytes.rst
@@ -69,8 +69,8 @@ called with a non-bytes parameter.
   +===================+===============+================================+
   | :attr:`%%`        | *n/a*         | The literal % character.       |
   +-------------------+---------------+--------------------------------+
-   | :attr:`%c`        | int           | A single character,            |
-   |                   |               | represented as an C int.       |
+   | :attr:`%c`        | int           | A single byte,                 |
+   |                   |               | represented as a C int.        |
   +-------------------+---------------+--------------------------------+
   | :attr:`%d`        | int           | Exactly equivalent to          |
   |                   |               | ``printf("%d")``.              |
@@ -109,7 +109,7 @@ called with a non-bytes parameter.
   +-------------------+---------------+--------------------------------+

   An unrecognized format character causes all the rest of the format string to be
-   copied as-is to the result string, and any extra arguments discarded.
+   copied as-is to the result object, and any extra arguments discarded.


 .. c:function:: PyObject* PyBytes_FromFormatV(const char *format, va_list vargs)
@@ -136,11 +136,13 @@ called with a non-bytes parameter.

 .. c:function:: char* PyBytes_AsString(PyObject *o)

-   Return a NUL-terminated representation of the contents of *o*.  The pointer
-   refers to the internal buffer of *o*, not a copy.  The data must not be
-   modified in any way, unless the string was just created using
+   Return a pointer to the contents of *o*.  The pointer
+   refers to the internal buffer of *o*, which consists of ``len(o) + 1``
+   bytes.  The last byte in the buffer is always null, regardless of
+   whether there are any other null bytes.  The data must not be
+   modified in any way, unless the object was just created using
   ``PyBytes_FromStringAndSize(NULL, size)``. It must not be deallocated.  If
-   *o* is not a string object at all, :c:func:`PyBytes_AsString` returns *NULL*
+   *o* is not a bytes object at all, :c:func:`PyBytes_AsString` returns *NULL*
   and raises :exc:`TypeError`.


@@ -151,16 +153,18 @@ called with a non-bytes parameter.

 .. c:function:: int PyBytes_AsStringAndSize(PyObject *obj, char **buffer, Py_ssize_t *length)

-   Return a NUL-terminated representation of the contents of the object *obj*
+   Return the null-terminated contents of the object *obj*
   through the output variables *buffer* and *length*.

-   If *length* is *NULL*, the resulting buffer may not contain NUL characters;
+   If *length* is *NULL*, the bytes object
+   may not contain embedded null bytes;
   if it does, the function returns ``-1`` and a :exc:`TypeError` is raised.

-   The buffer refers to an internal string buffer of *obj*, not a copy. The data
-   must not be modified in any way, unless the string was just created using
+   The buffer refers to an internal buffer of *obj*, which includes an
+   additional null byte at the end (not counted in *length*).  The data
+   must not be modified in any way, unless the object was just created using
   ``PyBytes_FromStringAndSize(NULL, size)``.  It must not be deallocated.  If
-   *string* is not a string object at all, :c:func:`PyBytes_AsStringAndSize`
+   *obj* is not a bytes object at all, :c:func:`PyBytes_AsStringAndSize`
   returns ``-1`` and raises :exc:`TypeError`.


@@ -168,14 +172,14 @@ called with a non-bytes parameter.

   Create a new bytes object in *\*bytes* containing the contents of *newpart*
   appended to *bytes*; the caller will own the new reference.  The reference to
-   the old value of *bytes* will be stolen.  If the new string cannot be
+   the old value of *bytes* will be stolen.  If the new object cannot be
   created, the old reference to *bytes* will still be discarded and the value
   of *\*bytes* will be set to *NULL*; the appropriate exception will be set.


 .. c:function:: void PyBytes_ConcatAndDel(PyObject **bytes, PyObject *newpart)

-   Create a new string object in *\*bytes* containing the contents of *newpart*
+   Create a new bytes object in *\*bytes* containing the contents of *newpart*
   appended to *bytes*.  This version decrements the reference count of
   *newpart*.

--- a/Doc/c-api/unicode.rst
+++ b/Doc/c-api/unicode.rst
@@ -227,7 +227,10 @@ access internal read-only data of Unicode objects:
                const char* PyUnicode_AS_DATA(PyObject *o)

   Return a pointer to a :c:type:`Py_UNICODE` representation of the object.  The
-   ``AS_DATA`` form casts the pointer to :c:type:`const char *`.  *o* has to be
+   returned buffer is always terminated with an extra null code point.  It
+   may also contain embedded null code points, which would cause the string
+   to be truncated when used in most C functions.  The ``AS_DATA`` form
+   casts the pointer to :c:type:`const char *`.  The *o* argument has to be
   a Unicode object (not checked).

   .. versionchanged:: 3.3
@@ -650,7 +653,8 @@ APIs:

   Copy the string *u* into a new UCS4 buffer that is allocated using
   :c:func:`PyMem_Malloc`.  If this fails, *NULL* is returned with a
-   :exc:`MemoryError` set.
+   :exc:`MemoryError` set.  The returned buffer always has an extra
+   null code point appended.

   .. versionadded:: 3.3

@@ -689,8 +693,9 @@ Extension modules can continue using them, as they will not be removed in Python
   Return a read-only pointer to the Unicode object's internal
   :c:type:`Py_UNICODE` buffer, or *NULL* on error. This will create the
   :c:type:`Py_UNICODE*` representation of the object if it is not yet
-   available. Note that the resulting :c:type:`Py_UNICODE` string may contain
-   embedded null characters, which would cause the string to be truncated when
+   available. The buffer is always terminated with an extra null code point.
+   Note that the resulting :c:type:`Py_UNICODE` string may also contain
+   embedded null code points, which would cause the string to be truncated when
   used in most C functions.

   Please migrate to using :c:func:`PyUnicode_AsUCS4`,
@@ -708,8 +713,9 @@ Extension modules can continue using them, as they will not be removed in Python
 .. c:function:: Py_UNICODE* PyUnicode_AsUnicodeAndSize(PyObject *unicode, Py_ssize_t *size)

   Like :c:func:`PyUnicode_AsUnicode`, but also saves the :c:func:`Py_UNICODE`
-   array length in *size*. Note that the resulting :c:type:`Py_UNICODE*` string
-   may contain embedded null characters, which would cause the string to be
+   array length (excluding the extra null terminator) in *size*.
+   Note that the resulting :c:type:`Py_UNICODE*` string
+   may contain embedded null code points, which would cause the string to be
   truncated when used in most C functions.

   .. versionadded:: 3.3
@@ -717,11 +723,11 @@ Extension modules can continue using them, as they will not be removed in Python

 .. c:function:: Py_UNICODE* PyUnicode_AsUnicodeCopy(PyObject *unicode)

-   Create a copy of a Unicode string ending with a nul character. Return *NULL*
+   Create a copy of a Unicode string ending with a null code point. Return *NULL*
   and raise a :exc:`MemoryError` exception on memory allocation failure,
   otherwise return a new allocated buffer (use :c:func:`PyMem_Free` to free
   the buffer). Note that the resulting :c:type:`Py_UNICODE*` string may
-   contain embedded null characters, which would cause the string to be
+   contain embedded null code points, which would cause the string to be
   truncated when used in most C functions.

   .. versionadded:: 3.2
@@ -902,10 +908,10 @@ wchar_t Support

   Copy the Unicode object contents into the :c:type:`wchar_t` buffer *w*.  At most
   *size* :c:type:`wchar_t` characters are copied (excluding a possibly trailing
-   0-termination character).  Return the number of :c:type:`wchar_t` characters
+   null termination character).  Return the number of :c:type:`wchar_t` characters
   copied or -1 in case of an error.  Note that the resulting :c:type:`wchar_t*`
-   string may or may not be 0-terminated.  It is the responsibility of the caller
-   to make sure that the :c:type:`wchar_t*` string is 0-terminated in case this is
+   string may or may not be null-terminated.  It is the responsibility of the caller
+   to make sure that the :c:type:`wchar_t*` string is null-terminated in case this is
   required by the application. Also, note that the :c:type:`wchar_t*` string
   might contain null characters, which would cause the string to be truncated
   when used with most C functions.
@@ -914,8 +920,8 @@ wchar_t Support
 .. c:function:: wchar_t* PyUnicode_AsWideCharString(PyObject *unicode, Py_ssize_t *size)

   Convert the Unicode object to a wide character string. The output string
-   always ends with a nul character. If *size* is not *NULL*, write the number
-   of wide characters (excluding the trailing 0-termination character) into
+   always ends with a null character. If *size* is not *NULL*, write the number
+   of wide characters (excluding the trailing null termination character) into
   *\*size*.

   Returns a buffer allocated by :c:func:`PyMem_Alloc` (use
@@ -1045,9 +1051,11 @@ These are the UTF-8 codec APIs:

 .. c:function:: char* PyUnicode_AsUTF8AndSize(PyObject *unicode, Py_ssize_t *size)

-   Return a pointer to the default encoding (UTF-8) of the Unicode object, and
-   store the size of the encoded representation (in bytes) in *size*.  *size*
-   can be *NULL*, in this case no size will be stored.
+   Return a pointer to the UTF-8 encoding of the Unicode object, and
+   store the size of the encoded representation (in bytes) in *size*.  The
+   *size* argument can be *NULL*; in this case no size will be stored.  The
+   returned buffer always has an extra null byte appended (not included in
+   *size*), regardless of whether there are any other null code points.

   In the case of an error, *NULL* is returned with an exception set and no
   *size* is stored.