Mirror of https://github.com/python/cpython.git (synced 2025-01-22 16:35:16 +08:00)
bpo-36324: Apply review comments from Allen Downey (GH-15693)
Parent: 8f9cc8771f
Commit: e4810b2a6c
@@ -26,10 +26,10 @@ numeric (:class:`Real`-valued) data.
Unless explicitly noted otherwise, these functions support :class:`int`,
:class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`.
Behaviour with other types (whether in the numeric tower or not) is
currently unsupported. Mixed types are also undefined and
implementation-dependent. If your input data consists of mixed types,
you may be able to use :func:`map` to ensure a consistent result, e.g.
``map(float, input_data)``.
currently unsupported. Collections with a mix of types are also undefined
and implementation-dependent. If your input data consists of mixed types,
you may be able to use :func:`map` to ensure a consistent result, for
example: ``map(float, input_data)``.

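The ``map(float, input_data)`` advice above can be sketched as follows. This is only an illustration: the sample values in ``input_data`` are hypothetical and not part of the commit, and converting to ``float`` gives up the exactness of ``Fraction`` and ``Decimal``.

```python
from decimal import Decimal
from fractions import Fraction
from statistics import mean

# Hypothetical input that mixes int, Fraction and Decimal values.
input_data = [1, Fraction(1, 2), Decimal("2.5")]

# Coerce every element to float first so the data has one consistent type.
result = mean(map(float, input_data))
print(result)
```
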
Averages and measures of central location
-----------------------------------------
@@ -102,11 +102,9 @@ However, for reading convenience, most of the examples show sorted sequences.
.. note::

The mean is strongly affected by outliers and is not a robust estimator
for central location: the mean is not necessarily a typical example of the
data points. For more robust, although less efficient, measures of
central location, see :func:`median` and :func:`mode`. (In this case,
"efficient" refers to statistical efficiency rather than computational
efficiency.)
for central location: the mean is not necessarily a typical example of
the data points. For more robust measures of central location, see
:func:`median` and :func:`mode`.

The sample mean gives an unbiased estimate of the true population mean,
which means that, taken on average over all the possible samples,
@@ -120,9 +118,8 @@ However, for reading convenience, most of the examples show sorted sequences.
Convert *data* to floats and compute the arithmetic mean.

This runs faster than the :func:`mean` function and it always returns a
:class:`float`. The result is highly accurate but not as perfect as
:func:`mean`. If the input dataset is empty, raises a
:exc:`StatisticsError`.
:class:`float`. The *data* may be a sequence or iterator. If the input
dataset is empty, raises a :exc:`StatisticsError`.

.. doctest::

@@ -136,15 +133,20 @@ However, for reading convenience, most of the examples show sorted sequences.

Convert *data* to floats and compute the geometric mean.

The geometric mean indicates the central tendency or typical value of the
*data* using the product of the values (as opposed to the arithmetic mean
which uses their sum).

Raises a :exc:`StatisticsError` if the input dataset is empty,
if it contains a zero, or if it contains a negative value.
The *data* may be a sequence or iterator.

No special efforts are made to achieve exact results.
(However, this may change in the future.)

.. doctest::

>>> round(geometric_mean([54, 24, 36]), 9)
>>> round(geometric_mean([54, 24, 36]), 1)
36.0

.. versionadded:: 3.8
@@ -174,7 +176,7 @@ However, for reading convenience, most of the examples show sorted sequences.
3.6

Using the arithmetic mean would give an average of about 5.167, which
is too high.
is well over the aggregate P/E ratio.

:exc:`StatisticsError` is raised if *data* is empty, or any element
is less than zero.
@@ -312,10 +314,10 @@ However, for reading convenience, most of the examples show sorted sequences.
The mode (when it exists) is the most typical value and serves as a
measure of central location.

If there are multiple modes, returns the first one encountered in the *data*.
If the smallest or largest of multiple modes is desired instead, use
``min(multimode(data))`` or ``max(multimode(data))``. If the input *data* is
empty, :exc:`StatisticsError` is raised.
If there are multiple modes with the same frequency, returns the first one
encountered in the *data*. If the smallest or largest of those is
desired instead, use ``min(multimode(data))`` or ``max(multimode(data))``.
If the input *data* is empty, :exc:`StatisticsError` is raised.

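A small sketch of the tie-breaking behaviour described above, assuming Python 3.8 or later (where ``multimode`` exists and ``mode`` returns the first mode encountered). The colour list mirrors the one in the ``mode`` docstring later in this diff.

```python
from statistics import mode, multimode

data = ['red', 'red', 'green', 'blue', 'blue']

# 'red' and 'blue' both occur twice; mode() returns the first one encountered.
first = mode(data)

# multimode() returns every mode, so min()/max() can pick among the ties.
smallest = min(multimode(data))
largest = max(multimode(data))

print(first, smallest, largest)
```
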
``mode`` assumes discrete data, and returns a single value. This is the
standard treatment of the mode as commonly taught in schools:
@@ -325,8 +327,8 @@ However, for reading convenience, most of the examples show sorted sequences.
>>> mode([1, 1, 2, 3, 3, 3, 3, 4])
3

The mode is unique in that it is the only statistic which also applies
to nominal (non-numeric) data:
The mode is unique in that it is the only statistic in this package that
also applies to nominal (non-numeric) data:

.. doctest::

@@ -368,15 +370,16 @@ However, for reading convenience, most of the examples show sorted sequences.

.. function:: pvariance(data, mu=None)

Return the population variance of *data*, a non-empty iterable of real-valued
numbers. Variance, or second moment about the mean, is a measure of the
variability (spread or dispersion) of data. A large variance indicates that
the data is spread out; a small variance indicates it is clustered closely
around the mean.
Return the population variance of *data*, a non-empty sequence or iterator
of real-valued numbers. Variance, or second moment about the mean, is a
measure of the variability (spread or dispersion) of data. A large
variance indicates that the data is spread out; a small variance indicates
it is clustered closely around the mean.

If the optional second argument *mu* is given, it should be the mean of
*data*. If it is missing or ``None`` (the default), the mean is
automatically calculated.
If the optional second argument *mu* is given, it is typically the mean of
the *data*. It can also be used to compute the second moment around a
point that is not the mean. If it is missing or ``None`` (the default),
the arithmetic mean is automatically calculated.

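The two uses of *mu* described above can be sketched like this. The data set is the one commonly used in the ``pvariance`` doctests; choosing ``0.0`` as the alternative centre is an assumption made purely for illustration.

```python
from statistics import fmean, pvariance

data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]

# Typical use: pass the already-computed mean to avoid recalculating it.
mu = fmean(data)
var_about_mean = pvariance(data, mu)

# Alternative use: second moment about a point that is not the mean
# (here the origin, chosen only as an example).
moment_about_zero = pvariance(data, 0.0)

print(var_about_mean, moment_about_zero)
```

For any centre ``c``, the second moment about ``c`` equals the variance plus ``(mean - c) ** 2``, which gives a quick sanity check on the two results.
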
Use this function to calculate the variance from the entire population. To
estimate the variance from a sample, the :func:`variance` function is usually
@@ -401,10 +404,6 @@ However, for reading convenience, most of the examples show sorted sequences.
>>> pvariance(data, mu)
1.25

This function does not attempt to verify that you have passed the actual mean
as *mu*. Using arbitrary values for *mu* may lead to invalid or impossible
results.

Decimals and Fractions are supported:

.. doctest::
@@ -423,11 +422,11 @@ However, for reading convenience, most of the examples show sorted sequences.
σ². When called on a sample instead, this is the biased sample variance
s², also known as variance with N degrees of freedom.

If you somehow know the true population mean μ, you may use this function
to calculate the variance of a sample, giving the known population mean as
the second argument. Provided the data points are representative
(e.g. independent and identically distributed), the result will be an
unbiased estimate of the population variance.
If you somehow know the true population mean μ, you may use this
function to calculate the variance of a sample, giving the known
population mean as the second argument. Provided the data points are a
random sample of the population, the result will be an unbiased estimate
of the population variance.


.. function:: stdev(data, xbar=None)
@@ -502,19 +501,19 @@ However, for reading convenience, most of the examples show sorted sequences.
:func:`pvariance` function as the *mu* parameter to get the variance of a
sample.

.. function:: quantiles(dist, *, n=4, method='exclusive')
.. function:: quantiles(data, *, n=4, method='exclusive')

Divide *dist* into *n* continuous intervals with equal probability.
Divide *data* into *n* continuous intervals with equal probability.
Returns a list of ``n - 1`` cut points separating the intervals.

Set *n* to 4 for quartiles (the default). Set *n* to 10 for deciles. Set
*n* to 100 for percentiles which gives the 99 cuts points that separate
*dist* in to 100 equal sized groups. Raises :exc:`StatisticsError` if *n*
*data* in to 100 equal sized groups. Raises :exc:`StatisticsError` if *n*
is not least 1.

The *dist* can be any iterable containing sample data or it can be an
The *data* can be any iterable containing sample data or it can be an
instance of a class that defines an :meth:`~inv_cdf` method. For meaningful
results, the number of data points in *dist* should be larger than *n*.
results, the number of data points in *data* should be larger than *n*.
Raises :exc:`StatisticsError` if there are not at least two data points.

For sample data, the cut points are linearly interpolated from the
@@ -523,7 +522,7 @@ However, for reading convenience, most of the examples show sorted sequences.
cut-point will evaluate to ``104``.

The *method* for computing quantiles can be varied depending on
whether the data in *dist* includes or excludes the lowest and
whether the data in *data* includes or excludes the lowest and
highest possible values from the population.

The default *method* is "exclusive" and is used for data sampled from
@@ -535,14 +534,14 @@ However, for reading convenience, most of the examples show sorted sequences.

Setting the *method* to "inclusive" is used for describing population
data or for samples that are known to include the most extreme values
from the population. The minimum value in *dist* is treated as the 0th
from the population. The minimum value in *data* is treated as the 0th
percentile and the maximum value is treated as the 100th percentile.
The portion of the population falling below the *i-th* of *m* sorted
data points is computed as ``(i - 1) / (m - 1)``. Given 11 sample
values, the method sorts them and assigns the following percentiles:
0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.

If *dist* is an instance of a class that defines an
If *data* is an instance of a class that defines an
:meth:`~inv_cdf` method, setting *method* has no effect.

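The ``(i - 1) / (m - 1)`` rule for the "inclusive" method can be checked with 11 evenly spaced values. This is a minimal sketch using made-up data (``0, 10, ..., 100``) and only sample data as input; it does not exercise the ``inv_cdf`` path mentioned above.

```python
from statistics import quantiles

# Hypothetical data: 11 sorted values, so m = 11.
data = list(range(0, 101, 10))
m = len(data)

# Percentile assigned to the i-th sorted point under the inclusive rule.
assigned = [100 * (i - 1) / (m - 1) for i in range(1, m + 1)]

# With n=10 the inclusive cut points land exactly on the interior data points.
inclusive_deciles = quantiles(data, n=10, method='inclusive')

# The default exclusive method treats the sample as not containing the extremes.
exclusive_deciles = quantiles(data, n=10)

print(assigned)
print(inclusive_deciles)
print(exclusive_deciles)
```
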
.. doctest::
@@ -580,7 +579,7 @@ A single exception is defined:
:class:`NormalDist` is a tool for creating and manipulating normal
distributions of a `random variable
<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_. It is a
composite class that treats the mean and standard deviation of data
class that treats the mean and standard deviation of data
measurements as a single entity.

Normal distributions arise from the `Central Limit Theorem
@@ -616,13 +615,14 @@ of applications in statistics.

.. classmethod:: NormalDist.from_samples(data)

Makes a normal distribution instance computed from sample data. The
*data* can be any :term:`iterable` and should consist of values that
can be converted to type :class:`float`.
Makes a normal distribution instance with *mu* and *sigma* parameters
estimated from the *data* using :func:`fmean` and :func:`stdev`.

If *data* does not contain at least two elements, raises
:exc:`StatisticsError` because it takes at least one point to estimate
a central value and at least two points to estimate dispersion.
The *data* can be any :term:`iterable` and should consist of values
that can be converted to type :class:`float`. If *data* does not
contain at least two elements, raises :exc:`StatisticsError` because it
takes at least one point to estimate a central value and at least two
points to estimate dispersion.

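As a rough check of the description above, the parameters produced by ``from_samples`` can be compared with ``fmean`` and ``stdev`` computed directly. The sample values are hypothetical, and the comparison uses ``math.isclose`` because the internal computation may differ slightly between Python versions.

```python
from math import isclose
from statistics import NormalDist, StatisticsError, fmean, stdev

# Hypothetical measurements used only for illustration.
data = [2.5, 3.1, 2.1, 2.4, 2.7, 3.5]

dist = NormalDist.from_samples(data)
same_mu = isclose(dist.mean, fmean(data))
same_sigma = isclose(dist.stdev, stdev(data))

# A single point is enough for a central value but not for dispersion.
try:
    NormalDist.from_samples([2.5])
    raised = False
except StatisticsError:
    raised = True

print(same_mu, same_sigma, raised)
```
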
.. method:: NormalDist.samples(n, *, seed=None)

@@ -636,10 +636,10 @@ of applications in statistics.
.. method:: NormalDist.pdf(x)

Using a `probability density function (pdf)
<https://en.wikipedia.org/wiki/Probability_density_function>`_,
compute the relative likelihood that a random variable *X* will be near
the given value *x*. Mathematically, it is the ratio ``P(x <= X <
x+dx) / dx``.
<https://en.wikipedia.org/wiki/Probability_density_function>`_, compute
the relative likelihood that a random variable *X* will be near the
given value *x*. Mathematically, it is the limit of the ratio ``P(x <=
X < x+dx) / dx`` as *dx* approaches zero.

The relative likelihood is computed as the probability of a sample
occurring in a narrow range divided by the width of the range (hence
@@ -667,8 +667,10 @@ of applications in statistics.

.. method:: NormalDist.overlap(other)

Returns a value between 0.0 and 1.0 giving the overlapping area for
the two probability density functions.
Measures the agreement between two normal probability distributions.
Returns a value between 0.0 and 1.0 giving `the overlapping area for
the two probability density functions
<https://www.rasch.org/rmt/rmt101r.htm>`_.

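A brief sketch of how ``overlap`` behaves as a measure of agreement. The two distributions below are arbitrary illustrative choices, not values taken from this commit.

```python
from math import isclose
from statistics import NormalDist

# Two hypothetical distributions with different means and spreads.
a = NormalDist(mu=2.4, sigma=1.6)
b = NormalDist(mu=3.2, sigma=2.0)

# The shared area under the two density curves, between 0.0 and 1.0.
ab = a.overlap(b)

# The measure is symmetric, and a distribution fully agrees with itself.
ba = b.overlap(a)
aa = a.overlap(a)

print(ab, ba, aa)
```
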
Instances of :class:`NormalDist` support addition, subtraction,
multiplication and division by a constant. These operations
@@ -740,12 +742,11 @@ Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:
... return (3*x + 7*x*y - 5*y) / (11 * z)
...
>>> n = 100_000
>>> seed = 86753099035768
>>> X = NormalDist(10, 2.5).samples(n, seed=seed)
>>> Y = NormalDist(15, 1.75).samples(n, seed=seed)
>>> Z = NormalDist(50, 1.25).samples(n, seed=seed)
>>> NormalDist.from_samples(map(model, X, Y, Z)) # doctest: +SKIP
NormalDist(mu=1.8661894803304777, sigma=0.65238717376862)
>>> X = NormalDist(10, 2.5).samples(n, seed=3652260728)
>>> Y = NormalDist(15, 1.75).samples(n, seed=4582495471)
>>> Z = NormalDist(50, 1.25).samples(n, seed=6582483453)
>>> quantiles(map(model, X, Y, Z)) # doctest: +SKIP
[1.4591308524824727, 1.8035946855390597, 2.175091447274739]

Normal distributions commonly arise in machine learning problems.

@@ -322,7 +322,6 @@ def fmean(data):
"""Convert data to floats and compute the arithmetic mean.

This runs faster than the mean() function and it always returns a float.
The result is highly accurate but not as perfect as mean().
If the input dataset is empty, it raises a StatisticsError.

>>> fmean([3.5, 4.0, 5.25])
@@ -538,15 +537,16 @@ def mode(data):
``mode`` assumes discrete data, and returns a single value. This is the
standard treatment of the mode as commonly taught in schools:

>>> mode([1, 1, 2, 3, 3, 3, 3, 4])
3
>>> mode([1, 1, 2, 3, 3, 3, 3, 4])
3

This also works with nominal (non-numeric) data:

>>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
'red'
>>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
'red'

If there are multiple modes, return the first one encountered.
If there are multiple modes with same frequency, return the first one
encountered:

>>> mode(['red', 'red', 'green', 'blue', 'blue'])
'red'
@@ -615,28 +615,28 @@ def multimode(data):
# position is that fewer options make for easier choices and that
# external packages can be used for anything more advanced.

def quantiles(dist, /, *, n=4, method='exclusive'):
"""Divide *dist* into *n* continuous intervals with equal probability.
def quantiles(data, /, *, n=4, method='exclusive'):
"""Divide *data* into *n* continuous intervals with equal probability.

Returns a list of (n - 1) cut points separating the intervals.

Set *n* to 4 for quartiles (the default). Set *n* to 10 for deciles.
Set *n* to 100 for percentiles which gives the 99 cuts points that
separate *dist* in to 100 equal sized groups.
separate *data* in to 100 equal sized groups.

The *dist* can be any iterable containing sample data or it can be
The *data* can be any iterable containing sample data or it can be
an instance of a class that defines an inv_cdf() method. For sample
data, the cut points are linearly interpolated between data points.

If *method* is set to *inclusive*, *dist* is treated as population
If *method* is set to *inclusive*, *data* is treated as population
data. The minimum value is treated as the 0th percentile and the
maximum value is treated as the 100th percentile.
"""
if n < 1:
raise StatisticsError('n must be at least 1')
if hasattr(dist, 'inv_cdf'):
return [dist.inv_cdf(i / n) for i in range(1, n)]
data = sorted(dist)
if hasattr(data, 'inv_cdf'):
return [data.inv_cdf(i / n) for i in range(1, n)]
data = sorted(data)
ld = len(data)
if ld < 2:
raise StatisticsError('must have at least two data points')
@@ -745,7 +745,7 @@ def variance(data, xbar=None):
def pvariance(data, mu=None):
"""Return the population variance of ``data``.

data should be an iterable of Real-valued numbers, with at least one
data should be a sequence or iterator of Real-valued numbers, with at least one
value. The optional argument mu, if given, should be the mean of
the data. If it is missing or None, the mean is automatically calculated.

@@ -766,10 +766,6 @@ def pvariance(data, mu=None):
>>> pvariance(data, mu)
1.25

This function does not check that ``mu`` is actually the mean of ``data``.
Giving arbitrary values for ``mu`` may lead to invalid or impossible
results.

Decimals and Fractions are supported:

>>> from decimal import Decimal as D
@@ -913,8 +909,8 @@ class NormalDist:
"NormalDist where mu is the mean and sigma is the standard deviation."
if sigma < 0.0:
raise StatisticsError('sigma must be non-negative')
self._mu = mu
self._sigma = sigma
self._mu = float(mu)
self._sigma = float(sigma)

@classmethod
def from_samples(cls, data):