Information Theory and the Length Distribution of all Discrete Systems

September 06, 2017 · Declared Dead · 🏛 arXiv.org

👻 CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Les Hatton, Gregory Warr
arXiv ID: 1709.01712
Category: q-bio.OT
Cross-listed: cs.IT, physics.bio-ph, physics.soc-ph, q-bio.PE
Citations: 9
Venue: arXiv.org
Last Checked: 1 month ago

Abstract
We begin with the extraordinary observation that the length distribution of 80 million proteins in UniProt, the Universal Protein Resource, measured in amino acids, is qualitatively identical to the length distribution of large collections of computer functions measured in programming language tokens, at all scales. That two such disparate discrete systems share important structural properties suggests that yet other apparently unrelated discrete systems might share the same properties, and certainly invites an explanation. We demonstrate that this is inevitable for all discrete systems of components built from tokens or symbols. Departing from existing work by embedding the Conservation of Hartley-Shannon information (CoHSI) in a classical statistical mechanics framework, we identify two kinds of discrete system, heterogeneous and homogeneous. Heterogeneous systems contain components built from a unique alphabet of tokens and yield an implicit CoHSI distribution with a sharp unimodal peak asymptoting to a power-law. Homogeneous systems contain components each built from just one kind of token unique to that component and yield a CoHSI distribution corresponding to Zipf's law. This theory is applied to heterogeneous systems (proteome, computer software, music); to homogeneous systems (language texts, abundance of the elements); and to systems in which both heterogeneous and homogeneous behaviour co-exist (word frequencies and word length frequencies in language texts). In each case, the predictions of the theory are tested and supported to high levels of statistical significance. We also show that in the same heterogeneous system, different but consistent alphabets must be related by a power-law. We demonstrate this on a large body of music by excluding and including note duration in the definition of the unique alphabet of notes.
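
The abstract associates homogeneous systems with Zipf's law, under which the frequency of the item at rank r falls off roughly as 1/r. The Python sketch below is one simple way to check that behaviour on a list of tokens: rank the token counts and estimate the slope of log(frequency) against log(rank), which should be close to -1 for an ideal Zipf distribution. It is not the authors' code (none is linked for this paper); the function names `rank_frequency` and `loglog_slope` are hypothetical, and an ordinary least-squares fit in log-log space is a deliberately crude stand-in for the statistical tests the abstract refers to.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return token counts sorted from most to least frequent."""
    return sorted(Counter(tokens).values(), reverse=True)


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    Assumes `freqs` holds at least two positive values ordered by rank
    (rank 1 first). A slope near -1 is what Zipf's law would suggest.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Synthetic, exactly Zipfian frequencies (illustrative data only).
    ideal = [1000.0 / rank for rank in range(1, 51)]
    print(loglog_slope(ideal))

    # Counts taken from a small token list.
    print(rank_frequency("a b a c a b".split()))
```

On real data the fitted slope is sensitive to the low-frequency tail, so a fit like this is best read as a quick sanity check rather than evidence for or against a power law.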
