Continuing my irrational(?) quest of compressing and then comparing irrational numbers, I have found a clear bias in the compressed sequences of pi and e.
The simple test:
black box compression algorithm (well not so black box as I posted the C# code for it in a previous thread)
two sequences:
a. 160 million digits of e (the base of the natural logarithm), as an ASCII text base-10 digit sequence
b. 160 million digits of pi, as an ASCII text base-10 digit sequence
a script to process the 160 million digit sequences as follows:
a. break each 160 million digit sequence into 1600 consecutive chunks of 100,000 digits each
b. compress each 100,000 digit chunk
c. record the "compressed lines" output of the compression algorithm for each of the 1600 compressed sequences
d. sort the list of 1600 compressed sequences in ascending order, from fewest compressed lines to most
e. compare the sorted 1600 compressed sequences of pi to the sorted 1600 compressed sequences of e by subtracting, giving the difference in compressed lines at each of the 1600 sorted positions
f. graph the difference
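For anyone who wants to reproduce the pipeline, here is a minimal Python sketch of steps a through f. My C# compressor is not shown here, so `compressed_size` is a hypothetical stand-in that uses zlib's output length as a rough proxy for "compressed lines":

```python
import zlib

def compressed_size(chunk: str) -> int:
    # Hypothetical stand-in for the real compressor: use the
    # zlib-compressed byte length as a proxy for "compressed lines".
    return len(zlib.compress(chunk.encode("ascii")))

def sorted_sizes(digits: str, chunk_len: int = 100_000) -> list:
    # Steps a-d: break into consecutive chunks, compress each, sort ascending.
    chunks = [digits[i:i + chunk_len]
              for i in range(0, len(digits) - chunk_len + 1, chunk_len)]
    return sorted(compressed_size(c) for c in chunks)

def bias(digits_a: str, digits_b: str) -> list:
    # Steps e-f: subtract the two sorted lists position by position
    # (the resulting differences are what gets graphed).
    return [a - b for a, b in zip(sorted_sizes(digits_a), sorted_sizes(digits_b))]
```
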
Success! There is an obvious bias in the vast majority of the 1600 compressed sequences, showing that e has slightly more compressed lines per compressed sequence than pi does.
The bias is a small one, but it exists, and the bias between e and pi is not the biggest one either: out of the four sequences I tested (e, golden ratio, pi, and sqrt2), the golden ratio had a larger bias relative to pi than e did:
100,000 digit sequence length, compressed averages (1600 samples each, up to 160 million digits):
compressed lines:
e             1202.123
golden ratio  1203.872
pi            1200.819
sqrt2         1202.009
formatting link
This bias is seen in other ways too, but I think this comparison really is hard to refute statistically.
More comparisons of sequences are in here if interested:
formatting link
I have reached the limit of my available sequence digits (160 million) but am trying to get sequences of up to 10 billion digits!
I'm not sure exactly what Jamie is doing, but whatever it is, I don't think it's a viable method for comparing the correlation between sequences of "random" numbers...
I think one of them, pi or e, has a very slight periodicity, so I am not comparing correlations between random numbers but instead trying to find the non-random part. Most likely e has a very small bias in non-randomness, at least compared to pi, in that it has slightly more equal-interval repeating pairs of digits, i.e. 3,4,x,y,z,3,4,a,b,c, within the vast majority of any 100,000 digit block of the sequence.
Not sure exactly, but I'm trying to figure out what the exact difference in periodicity is, and then it should be easy to just look at any random 100,000 digit block of pi or e and check whether the vast majority of them have the same small bias. I don't know if it qualifies as random or non-random if there is a detectable bias like this, but I think it should be *theoretically* possible to "reverse engineer" a transcendental sequence back to the formula(s) that were used to create it, maybe eventually.
But your code is always looking at a finite sample of a countably infinite set. How do you propose to extrapolate any bias you find in a finite sequence to the whole?
I am looking for a pattern of digits to explain the bias I found, something like a difference in the number of periodically repeating pairs of digits, i.e. 3,4,(x digits),3,4, which my compression algorithm identifies for any repeating digit pair a,b (i.e. 3,4) and any number of equal gaps x.
If some simple pattern like that consistently occurs slightly more in e than in pi over any range of digits checked, then the question becomes why, which is beyond me, but it probably has something to do with pi's digits being more random than e's.. I am just looking for the initial reason why there is a bias :D
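As an illustration only (not my actual C# code), a direct count of such equal-interval repeating pairs can be sketched like this, where `gap` is the distance from one occurrence of the pair to the next:

```python
def count_repeating_pairs(digits: str, gap: int) -> int:
    # Count positions where a two-digit pair reappears exactly `gap`
    # positions later, e.g. 3,4,x,y,z,3,4 has the pair "34" at gap 5.
    return sum(1 for i in range(len(digits) - gap - 1)
               if digits[i:i + 2] == digits[i + gap:i + gap + 2])
```
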
I think I've narrowed the identified bias down to blocks of 1000 digits making it easier to identify any patterns.
I compressed 1000 digit blocks up to 160million digits of pi and e using the sequences available here:
formatting link
This created 160,000 compressed sequences for e and pi each.
I compared the output "compressed lines" from the compression algorithm for e and pi across all of these 160,000 compressed sequences, and sorted them from fewest compressed lines to most compressed lines.
For the sorted compressed sequences, there are 99 occurrences out of the 160,000 where the compressed lines differ between pi and e: for 97 of them e has one more compressed line than pi, and for 2 of them pi has one more compressed line than e. So the compression is almost identical for the vast majority of the sequences; however, among the small number, 99, that aren't equal, there is a large bias towards e having more compressed lines than pi (97 vs 2). I have observed this pattern of pi tending to have fewer compressed lines all over the place, so it isn't surprising to me, but I'm trying to figure out why pi sometimes compresses to fewer lines.
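As a hedged sanity check of my own (not part of the original analysis): if you treat each of the 99 differing blocks as a fair coin flip, a simple two-sided sign test shows that a 97-vs-2 split is astronomically unlikely under that null:

```python
from math import comb

def sign_test_p(k: int, n: int) -> float:
    # Two-sided sign test: probability, under a fair coin, of an outcome
    # at least as lopsided as k successes out of n differing blocks.
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For k = 97, n = 99 the p-value is far below any conventional threshold, so the direction of the bias is very unlikely to be a sorting-and-subtracting fluke, whatever its cause.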
Here is a small spreadsheet showing the 97 occurrences where e compresses to one more line than pi and the 2 occurrences where pi compresses to more lines than e:
formatting link
e data is in columns A to K and pi data is in columns P to AA.
Column B (for e) and column Q (for pi) give the starting digit positions for the 1000 digit blocks that show a compression difference.
e.g. for row 3: the e digits from 0 to 1000 compress to one more line than the pi digits from 158292000 to 158293000.
This test has shown how uneven the distribution of compression differences is, i.e. there are many compression differences near the end of the sequence of 160 million digits of pi, so I think the pi digits from
Here is some more data showing the actual output of the compression algorithm to show the details of how the sequence e has more compressed lines for the same sequence length than pi sometimes:
sequence input: e
1000 compressed digits, starting digit (in the e digit sequence): 76522000
That is 95 lines, more than any other of the 160,000 compressed sequences of e digits from the first 160million digits of e, the next closest had 94 lines.
The extra lines are coming from the indented (tabbed) lines, there are four of them above:
	C9(2,67)[2]
	C10(67,2)[2]
	C6(2,67)[3]
	C4(33,3)[2]
The bias I've detected in pi is that fewer of these indented lines are created than for other sequences, e.g. e.
These tabbed lines are "second order chirplets" and are created when the original sequence has a pattern like this: a,b, (gap1), a,b, (gap2), a,b, (gap1), a,b, (gap2), a,b
so for example: a=3, b=4, gap1 = a fixed number of random digits (here 3 digits: 5,7,1 and 6,2,8), gap2 = a fixed number of random digits (here 2 digits: 8,8 and 7,2)
So this example sequence could look like:
3,4, 5,7,1, 3,4, 8,8, 3,4, 6,2,8, 3,4, 7,2, 3,4
or
3,4,5,7,1,3,4,8,8,3,4,6,2,8,3,4,7,2,3,4
The above sequence has only one consecutive pair of digits that occurs at least twice: 3,4. The gaps between occurrences of 3,4 are counted from each occurrence to the next:
for the first 3,4, set the gap to 0
for the second 3,4, count 5 digits from the first 3,4
for the third 3,4, count 4 digits from the second 3,4
for the fourth 3,4, count 5 digits from the third 3,4
for the fifth 3,4, count 4 digits from the fourth 3,4
This gives a new sequence:
5,4,5,4
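The gap construction above can be sketched in Python (an illustrative reimplementation, not my C# code); applied to the example sequence it reproduces the 5,4,5,4 gap list:

```python
def gap_sequence(digits: str, pair: str) -> list:
    # Find every occurrence of `pair`, then record the distance between
    # consecutive occurrences (the "gap", counted in single-digit shifts).
    positions = [i for i in range(len(digits) - 1) if digits[i:i + 2] == pair]
    return [b - a for a, b in zip(positions, positions[1:])]
```
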
My compression algorithm recursively looks for the same patterns in this second order sequence, identifying consecutive pairs that occur at least twice; since this sequence contains 5,4 twice, it creates a "second order chirplet" in the compressed output file (the indented kind shown above, which occurs more often in the e sequence than in pi!)
So from this example, that is the small difference I identified in e compared to pi: pi has a very small extra chance that a second order chirplet won't get created, i.e. instead of the 5,4,5,4 gap sequence it may be 5,4,5,3 etc., which would not create a second order chirplet.
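The second-order check, finding pairs in the gap sequence that occur at least twice, can likewise be sketched (again only an illustration of the stated rule, not my actual code):

```python
from collections import Counter

def repeated_pairs(seq: list) -> list:
    # A pair of consecutive values counts as a chirplet only if it
    # occurs at least twice in the (gap) sequence.
    counts = Counter(zip(seq, seq[1:]))
    return [pair for pair, n in counts.items() if n >= 2]
```

So `repeated_pairs([5, 4, 5, 4])` yields a chirplet while `repeated_pairs([5, 4, 5, 3])` yields none, which is exactly the difference described above.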
So therefore I would say that pi is slightly less periodic, and slightly more random, than e! At least over the 160 million digits I tested.
There are fewer of these types of second order structures in pi than there are in e, just a very small amount fewer though.
My compression algorithm works up to n-order periodic structuring, but I have only compressed up to 10 million digits at once in a single sequence and didn't see any chirplets past second order in these very random sequences; in more periodic sequences there are lots of high order chirplets (e.g. 7th level, as in a quantized sine wave).
The details of what you are doing are way over my head and/or beyond my available time, but isn't it possible (and maybe you've mentioned this before) that the bias you find is caused by taking a finite number of digits from an infinite sequence? Would this imply that, if true, the bias would shrink in sequences of progressively greater length?
I've been working on that most of the time to try to figure it out :D
One interesting thing though is that the "local" bias seems to be maintained no matter where in the sequence the local sample of digits comes from. At least up to my 160million available digits sampled, there is still some bias when the 1000 digit or 10,000 or 100,000 etc individual compressed samples are averaged together.
Here is a graph of the bias I detected versus sequence length for various constants averaged up to 160million digits at several different input digit lengths:
1000, 10000, 20000, 30000, 80000, 100000, 200000
1000 digits is average of 160000 sequences (160million total)
10000 digits is average of 16000 sequences (160million total)
20000 digits is average of 8000 sequences (160million total)
30000 digits is average of 5333 sequences (160million total)
80000 digits is average of 2000 sequences (160million total)
100000 digits is average of 1600 sequences (160million total)
200000 digits is average of 800 sequences (160million total)
formatting link
more detailed info:
formatting link
I need to add 100 digit compressed samples to it still. I compressed the 100 digit input sequences, which was 1.6 million different compressed sequences for each of pi, e, golden ratio, and sqrt2 (it took over 24 hours on my PC, I think, to run) and generated 6.4 million total compressed sequences, with just the basic statistics file being over 850MB. I haven't sorted it all out yet to see what kind of bias there is at this compressed digit length, but here are some preliminary results that are interesting and that help explain how the compression algorithm works, using the simpler example of compressing just 100 digits, to give an idea of what the "bias" I'm finding actually is.
Out of the above 6.4million compressed sequences, 6 stand out, the only ones that have more than one "level2 chirplet" (a second order pattern of digits). These 6 100digit sequences occurred at widely spread out locations in the 160million digit sequence and here they all are: (extracted from the 850MB+ stats file)
Using pi as an example, I took the sequence row starting point above for pi in the 160 million digit pi sequence, and re-ran the compression algorithm for 100 digits to regenerate the compressed sequence that has 2 second level chirplets, so I could see how it looks:
Here it is (the 100digit compressed pi sequence from start digit
..actually 101 digits I'll look into that maybe, but doesn't really change how it works for this example.. :D
So the reason that is a very "unique" sequence out of all the 1.6 million 100 digit sequences I compressed in the first 160 million digits of pi is that it is the only one with more than one level2 chirplet. The level2 chirplets are 6,23 and 23,6, and if you look at that sequence you can see a pattern in the occurrences of the level1 chirplet C5(0,7), or just 0,7 (the parent of the indented level2 chirplets), that is statistically low probability: there are equally spaced gaps of digits between occurrences of 0,7. Here is how the algorithm detects those:
list of subsequences from above sequence showing gaps between 0,7 chirplet:
Those are just cut and pasted right from that sequence of 100 pi digits.
07845407 has a gap of 6 from the first 07 to the second 07, i.e. the number of single digit shifts that would be required to move the first 07 to the place of the second 07 (a gap of 6 corresponds to 4 random digits between the two 07's)
The algorithm takes these gaps and creates a new sequence and if any chirplets are found it recursively recompresses it too, looking for n-order periodicity.
So this new gap sequence is: 6,23,6,23,6. According to my own definition of a chirplet, which requires that two consecutive values occur in the sequence twice, there are two consecutive pairs that meet this definition, 6,23 and 23,6, and they are exactly the second order chirplets above that I was talking about, so that is basically a big chunk of how the compression algorithm works..
First thing to note:
pi sequences seem to create fewer level2 chirplets than other sequences I tested, implying that these types of patterns:
exist a tiny bit less than in other irrational number sequences of digits, and instead there would be unmatched gaps more often in pi, like this perhaps:
which would explain why slightly fewer level2 chirplets are created in compressed pi sequences, since no level2 chirplet would be found in the gap sequence 6,18,6,23,6: there need to be at least two 6,23's etc. to meet the chirplet definition..
(from way above, there are two 100 digit sequences of the golden ratio and of sqrt2 that have two level2 chirplets, and one 100 digit sequence of e that has a level2 chirplet, in the first 160 million digits of the sequences)
I hope that can kind of explain what I'm doing. Anyway, the spreadsheet shows the statistical differences I am trying to find, and I think this is an explanation of what the difference is: pi has slightly lower second order periodicity, I think.
Here's one last one to show the bias:
formatting link
That one was done with 100,000 digit compressed sequences (1600 sequences covering the 160 million digits) for pi and e, and it quite consistently shows the bias that e has an excess of level2 chirplets.
ElectronDepot website is not affiliated with any of the manufacturers or service providers discussed here.