How the Randomness Tests Work
Burp Sequencer employs standard statistical tests for randomness. These are
based on the principle of testing a hypothesis against a sample of evidence,
and calculating the probability of the observed data occurring, assuming that
the hypothesis is true:
- The hypothesis to be tested is that the tokens are randomly
generated.
- Each test observes specific properties of the sample that are
likely to have certain characteristics if the tokens are randomly generated.
- The probability of the observed characteristics occurring is
calculated, working on the assumption that the hypothesis is true.
- If this probability falls below a certain level (the "significance
level") then the hypothesis is rejected and the tokens are deemed to
be non-random.
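As a rough illustration of this decision rule, the following sketch flags a
test result as non-random when its computed probability falls below the chosen
significance level. The function name and the p-values are hypothetical, not
Burp Sequencer's internal API:

    def is_non_random(p_value, significance_level):
        # Reject the randomness hypothesis when the probability of the
        # observed data (computed assuming the tokens are random) falls
        # below the chosen significance level.
        return p_value < significance_level

    # Hypothetical probabilities reported by two individual tests.
    test_p_values = {"character count": 0.40, "character transition": 0.0001}
    for name, p in test_p_values.items():
        print(name, "deemed non-random at the 1% level:", is_non_random(p, 0.01))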
The significance level is a key parameter in this methodology. Using a lower
significance level means that stronger evidence is required to reject the hypothesis
that the tokens are randomly generated, and so increases the chance that non-random
data will be treated as random. There is no universally "right" significance
level to use for any particular purpose: scientific experiments often use significance
levels in the region of 1% to 5%; the standard FIPS tests for randomness (which
are implemented within Burp Sequencer) use significance levels in the region
of 0.002% to 0.03%. Burp Sequencer lets you choose what significance level you
wish to use to interpret its findings:
- Each individual test reports the computed probability of the observed
data occurring, assuming that the hypothesis is true. This probability represents
the boundary significance level at which the hypothesis would be rejected,
based solely upon this test.
- Aggregated results from multiple tests are presented in terms of the
number of bits of effective entropy within the token at various significance
levels, ranging from 0.001% to 10%. This summary enables you to see how
your choice of significance level affects the "quantity" of randomness deemed
to exist within the sample. In typical cases, this summary demonstrates
that the choice of significance level is a moot point because the tokens
possess either a clearly satisfactory or clearly unsatisfactory amount of
randomness for any of the significance levels that you may reasonably choose.
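The sketch below shows one way such a summary could be assembled from
per-position results. The scores and bit counts are hypothetical, and the
structure is an assumption rather than Burp Sequencer's actual implementation:

    # Each position contributes (score, bits): the score is the lowest
    # probability reported by the tests at that position, and the bits are
    # the entropy credited to the position if it is not rejected.
    positions = [(0.62, 3), (0.0004, 3), (0.08, 2), (0.31, 4)]   # hypothetical

    significance_levels = [0.00001, 0.0001, 0.001, 0.01, 0.05, 0.1]  # 0.001% to 10%
    for level in significance_levels:
        effective_bits = sum(bits for score, bits in positions if score >= level)
        print("significance {:.3%}: {} bits of effective entropy".format(
            level, effective_bits))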
Some important caveats apply to any statistics-based test for randomness. The
results may contain false positives and false negatives for the following reasons:
- Data that is generated in a completely deterministic way may be deemed to be
random by statistical tests. For example, a well-designed linear
congruential pseudo-random number generator, or an algorithm which computes
the hash of a sequential number, may produce seemingly random output even
though an attacker who knows the internal state of the generator can extrapolate
its output with complete reliability in both the forward and reverse directions.
- Data that is deemed to be non-random by statistics-based tests may
not actually be predictable in a practical situation, because the patterns
that are discernible within the data do not narrow down the range of possible
future outputs sufficiently for an attacker to test them viably.
Because of these caveats, the results of using Burp Sequencer should be interpreted
only as an indicative guide to the randomness of the sampled data.
The tests performed by Burp Sequencer divide into two levels of analysis:
character-level and bit-level.
Character-Level Analysis
The character-level tests operate on each character position of the token
in its raw form. First, the size of the character set at each position is counted
- this is the number of different characters that appear at each position within
the sample data. Then, the following tests are performed using this information:
- Character count analysis. This test analyzes the distribution
of characters used at each position within the token. If the sample is randomly
generated, the distribution of characters employed is likely to be approximately
uniform. At each position, the test computes the probability of the observed
distribution arising if the tokens are random.
- Character transition analysis. This test analyzes the transitions
between successive tokens in the sample. If the sample is randomly generated,
a character appearing at a given position is equally likely to be followed
in the next token by any one of the characters that is used at that position.
At each position, the test computes the probability of the observed transitions
arising if the tokens are random.
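The following sketch illustrates the general style of calculation behind these
two tests, using chi-square goodness-of-fit tests against a uniform
distribution. The sample, the function names, and the exact statistical
formulation are illustrative assumptions rather than Burp Sequencer's own code:

    from collections import Counter
    from scipy.stats import chisquare

    def character_count_p_value(tokens, position):
        # Probability of a character distribution at least this skewed
        # arising at this position if every character in the position's
        # character set were equally likely.
        counts = Counter(token[position] for token in tokens)
        return chisquare(list(counts.values())).pvalue

    def character_transition_p_value(tokens, position):
        # Probability of the observed transitions between successive tokens
        # arising if every (previous, next) pair over the position's
        # character set were equally likely.
        charset = sorted({token[position] for token in tokens})
        pairs = Counter(
            (tokens[i][position], tokens[i + 1][position])
            for i in range(len(tokens) - 1)
        )
        observed = [pairs.get((a, b), 0) for a in charset for b in charset]
        return chisquare(observed).pvalue

    # Hypothetical sample of tokens.
    sample = ["a7", "b3", "c9", "a1", "d7", "b3", "c9", "a1"]
    print(character_count_p_value(sample, 0))
    print(character_transition_p_value(sample, 0))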
Based on the above tests, the character-level analysis computes an overall
score for each character position - this is the lowest probability calculated
at that position by any of the character-level tests. The analysis then counts
the number of bits of effective entropy at various significance levels. Based
on the size of its character set, each position is assigned a number of bits
of entropy (2 bits if there are 4 characters, 3 bits if there are 8 characters,
etc.), and the total number of bits in positions whose score is at or above
each significance level is calculated.
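A minimal sketch of this aggregation step for a single position follows. The
use of floor(log2(character set size)) for set sizes that are not exact powers
of two is an assumption, as the text above only gives power-of-two examples:

    import math

    def position_score_and_bits(test_p_values, charset_size):
        # The position's overall score is the lowest probability produced
        # by any of the character-level tests; its entropy contribution is
        # derived from the size of its character set.
        score = min(test_p_values.values())
        bits = math.floor(math.log2(charset_size)) if charset_size > 1 else 0
        return score, bits

    # Hypothetical results for one position that uses a 16-character set.
    print(position_score_and_bits({"count": 0.42, "transition": 0.007}, 16))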
Bit-Level Analysis
The bit-level tests are more powerful than the character-level tests. To
enable bit-level analysis, each token is converted into a sequence of bits, with
the total number of bits determined by the size of the character set at each
character position. If a position employs a character set whose size is not
a round power of two, the sample data at that position is translated into a
character set whose size is the nearest smaller round power of two. The partial
bit of data at the position is effectively merged into the whole bits derived
from that position. This translation is done in a way that is designed to
preserve the randomness characteristics of the original sample, without
introducing or removing any bias. However, no process of this type can be
perfect, and it is likely that the process of analyzing samples with non-round
character set sizes will introduce some inaccuracies into the analysis results.
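For positions whose character set size is already an exact power of two, the
conversion is straightforward, as in the sketch below. The bias-preserving
translation that Burp Sequencer applies to other set sizes is not reproduced
here, and the function is illustrative only:

    import math

    def position_bits(tokens, position):
        # Map each character at this position to an index into the
        # position's (sorted) character set, and emit that index as
        # log2(set size) bits. Only exact powers of two are handled.
        charset = sorted({token[position] for token in tokens})
        width = int(math.log2(len(charset)))
        if 2 ** width != len(charset):
            raise ValueError("set size is not an exact power of two; the "
                             "translation described above is required")
        index = {c: i for i, c in enumerate(charset)}
        return [[(index[token[position]] >> b) & 1
                 for b in reversed(range(width))]
                for token in tokens]

    # Hypothetical tokens whose first position uses a 4-character set.
    print(position_bits(["a1", "b2", "c3", "d4"], 0))   # two bits per token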
When each token has been converted into a sequence of bits, the following
tests are performed at each bit position (illustrative sketches of several of
these tests appear after the list):
- FIPS monobit test. This test analyzes the distribution of ones
and zeroes at each bit position. If the sample is randomly generated, the
number of ones and zeroes is likely to be approximately equal. At each position,
the test computes the probability of the observed distribution arising if
the tokens are random. For each of the FIPS tests carried out, in addition
to reporting the probability of the observed data occurring, Burp Sequencer
also records whether each bit passed or failed the FIPS test. Note that
the FIPS pass criteria are recalibrated within Burp Sequencer to work with
arbitrary sample sizes, while the formal specification for the FIPS tests
assumes a sample of exactly 20,000 tokens. Hence, if you wish to obtain
results that are strictly compliant with the FIPS specification, you should
ensure that you use a sample of 20,000 tokens.
- FIPS poker test. This test divides the bit sequence at each position
into consecutive, non-overlapping groups of four, and derives a four-bit
number from each group. It then counts the number of occurrences of each
of the 16 possible numbers, and performs a chi-square calculation to evaluate
this distribution. If the sample is randomly generated, the distribution
of four-bit numbers is likely to be approximately uniform. At each position,
the test computes the probability of the observed distribution arising if
the tokens are random.
- FIPS runs test. This test divides the bit sequence at each position
into runs of consecutive bits that have the same value. It then counts
the number of runs with lengths of 1, 2, 3, 4, 5, and 6 or more. If the
sample is randomly generated, the number of runs with each of these lengths
is likely to be within a range determined by the size of the sample set.
At each position, the test computes the probability of the observed runs
occurring if the tokens are random.
- FIPS long runs test. This test measures the longest run of bits
with the same value at each bit position. If the sample is randomly generated,
the longest run is likely to be within a range determined by the size of
the sample set. At each position, the test computes the probability of the
observed longest run arising if the tokens are random. Note that the FIPS
specification for this test only records a fail if the longest run of bits
is overly long. However, an overly short longest run of bits also indicates
that the sample is not random. Therefore some bits may record a significance
level that is below the FIPS pass level even though they do not strictly
fail the FIPS test.
- Spectral tests. These tests perform a sophisticated analysis of
the bit sequence at each position, and are capable of identifying evidence
of non-randomness in some samples that pass the other statistical tests.
Each test works through the bit sequence and treats each series of consecutive
numbers as coordinates in a multi-dimensional space, plotting a point in
this space at each location determined by those coordinates. If the sample
is randomly generated, the distribution of points within this space is likely
to be approximately uniform; the appearance of clusters within the space
indicates that the data is likely to be non-random. At each position, the
test computes the probability of the observed distribution occurring if
the tokens are random. The test is repeated for multiple sizes of number
(between 1 and 8 bits) and for multiple numbers of dimensions (between 2
and 6).
- Correlation test. Each of the other bit-level tests operates
on individual bit positions within the sampled tokens, and so the amount
of randomness at each bit position is calculated in isolation. Performing
only this type of test would prevent any meaningful assessment of the amount
of randomness in the token as a whole: a sample of long tokens in which every
bit position carries the same value as every other position may appear to
contain more entropy than a sample of shorter tokens whose bit positions vary
independently. Hence, it is necessary to test for any statistically significant
relationships between the values at different bit positions within the tokens.
If the sample is randomly generated, a value at a given bit position is equally
likely to be accompanied by a one or a zero at any other bit position. At
each position, this test computes the probability of the relationships observed
with bits at other positions arising if the tokens are random. To prevent
arbitrary results, when a degree of correlation is observed between two
bits, the test adjusts the significance level of whichever of the two bits
has the lower significance level based on all of the other bit-level tests.
- Compression test. This test does not use the statistical approach
employed by the other tests, but rather provides a simple intuitive indication
of the amount of entropy at each bit position. The test attempts to compress
the bit sequence at each position using standard ZLIB compression. The results
indicate the proportional reduction in the size of the bit sequence when
it is compressed. A higher degree of compression indicates that the data
is less likely to be randomly generated.
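The sketches below illustrate, in simplified form, the kinds of calculation
the bit-level tests perform. They are not Burp Sequencer's implementation, and
in particular they do not reproduce the recalibrated FIPS pass criteria. First,
a monobit-style check at one bit position, using the normal approximation from
the NIST SP 800-22 formulation of the test:

    import math

    def monobit_p_value(bits):
        # Probability of an imbalance between ones and zeroes at least as
        # large as the one observed, if each bit is independently equally
        # likely to be 0 or 1 (normal approximation).
        n = len(bits)
        s = sum(1 if b else -1 for b in bits)
        return math.erfc(abs(s) / math.sqrt(2 * n))

    # Hypothetical bit sequence for one bit position across the sample.
    print(monobit_p_value([1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1]))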
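Next, a poker-style check on the same kind of bit sequence. The statistic
follows the FIPS poker formula, but the conversion to a probability via a
chi-square distribution with 15 degrees of freedom is a simplification:

    import random
    from collections import Counter
    from scipy.stats import chi2

    def poker_p_value(bits):
        # Split the sequence into non-overlapping 4-bit groups, count the
        # 16 possible values, and evaluate the counts with the FIPS poker
        # statistic, which is approximately chi-square distributed with
        # 15 degrees of freedom if the bits are random.
        k = len(bits) // 4
        groups = [tuple(bits[4 * i:4 * i + 4]) for i in range(k)]
        counts = Counter(groups)
        statistic = (16.0 / k) * sum(c * c for c in counts.values()) - k
        return chi2.sf(statistic, df=15)

    random.seed(1)
    print(poker_p_value([random.getrandbits(1) for _ in range(2000)]))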
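A sketch of the counting behind the runs and long runs tests; the pass ranges
themselves depend on the sample size, so none are reproduced here:

    from collections import Counter
    from itertools import groupby

    def run_statistics(bits):
        # Lengths of runs of consecutive identical bits, bucketing lengths
        # of 6 or more together as the FIPS runs test does, plus the
        # longest run observed for the long runs test.
        run_lengths = [len(list(group)) for _, group in groupby(bits)]
        buckets = Counter(min(length, 6) for length in run_lengths)
        counts_by_length = {length: buckets.get(length, 0)
                            for length in range(1, 7)}
        return counts_by_length, max(run_lengths)

    print(run_statistics([1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1]))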
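A crude approximation of the point-distribution idea behind the spectral tests,
for one number size and one number of dimensions; Burp Sequencer's actual
analysis is considerably more sophisticated:

    import random
    from collections import Counter
    from scipy.stats import chisquare

    def spectral_style_p_value(bits, number_size=2, dimensions=2):
        # Read consecutive groups of number_size * dimensions bits, treat
        # each group as the coordinates of one point, and compare the
        # distribution of points over the available cells against a
        # uniform distribution.
        width = number_size * dimensions
        points = Counter(
            tuple(bits[i:i + width])
            for i in range(0, len(bits) - width + 1, width)
        )
        cells = 2 ** width
        observed = list(points.values()) + [0] * (cells - len(points))
        return chisquare(observed).pvalue

    random.seed(2)
    print(spectral_style_p_value([random.getrandbits(1) for _ in range(4000)]))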
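A sketch of the kind of independence check underlying the correlation test,
applied to a pair of bit positions. The adjustment Burp Sequencer makes to
avoid double-counting correlated bits is not reproduced:

    from scipy.stats import chi2

    def correlation_p_value(bits_a, bits_b):
        # Chi-square test of independence on the 2x2 contingency table of
        # the values seen at two bit positions across the sampled tokens
        # (1 degree of freedom).
        n = len(bits_a)
        counts = [[0, 0], [0, 0]]
        for a, b in zip(bits_a, bits_b):
            counts[a][b] += 1
        row_totals = [sum(counts[0]), sum(counts[1])]
        col_totals = [counts[0][0] + counts[1][0], counts[0][1] + counts[1][1]]
        statistic = 0.0
        for i in range(2):
            for j in range(2):
                expected = row_totals[i] * col_totals[j] / n
                if expected:
                    statistic += (counts[i][j] - expected) ** 2 / expected
        return chi2.sf(statistic, df=1)

    # Two hypothetical bit positions that are perfectly correlated.
    print(correlation_p_value([1, 0, 1, 1, 0, 0, 1, 0],
                              [1, 0, 1, 1, 0, 0, 1, 0]))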
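Finally, a sketch of the compression check, packing the bit sequence at one
position into bytes and measuring how much zlib can shrink it:

    import random
    import zlib

    def compression_reduction(bits):
        # Pack the bits into bytes and report the proportional reduction
        # in size achieved by zlib; a large reduction suggests that the
        # position carries little entropy.
        packed = bytes(
            sum(bit << (7 - offset) for offset, bit in enumerate(bits[i:i + 8]))
            for i in range(0, len(bits), 8)
        )
        return 1 - len(zlib.compress(packed)) / len(packed)

    print(compression_reduction([0] * 800))   # compresses heavily
    random.seed(3)
    print(compression_reduction(
        [random.getrandbits(1) for _ in range(800)]))  # barely compresses, may even expand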
Based on the above tests, the bit-level analysis computes an overall score
for each bit position - this is the lowest probability calculated at that position
by any of the bit-level tests. The analysis then counts the number of bits
of effective entropy at various significance levels.