The 8 Fallacies of Assembly Theory

Dr. Hector Zenil
75 min read · May 1, 2023

Regarding the alleged misunderstandings reflected in our critique of ‘Assembly Theory,’ wherein we drew attention to the many serious issues undermining its foundations and methods.

Disclaimer: These criticisms of Assembly Theory are not motivated by any wish to defend the status quo against the authors' challenge to it. The motivation is to counterbalance the engine behind the authors' full-time effort to advertise flawed ideas and inundate journals and media with misinformation, as if they had invented the field of complexity science or as if the practice of science were a marketing exercise. All of this is backed up by the many papers we have published in the field.

In this post, you will learn how we replicated, and outperformed, Assembly Theory (AT) using traditional statistical measures, and how, almost five years before Assembly Theory, we demonstrated how to separate organic from non-organic compounds without making the bold, unjustified claims that the authors of Assembly Theory make. Also five years before AT, we showed how the type of processes considered by AT are connected to selection and evolution, but unlike the trivial LZ-compression-based algorithm they use, we used a measure called the Block Decomposition Method (any similarities are no coincidence), which counts not only identical copies but also the small causal structures that, in sequence, explain how a process may have unfolded or how an object may have been assembled. The marketing and self-promotional skills of the authors, deployed in the service of a flawed idea and an even more flawed methodology, are unfortunate and reprehensible.

One of the main thrusts of our critique and our main take-home message may be gleaned from this figure (MA is the Assembly Theory measure, all others are traditional statistical measures that the authors of AT never tested against MA):

Taken from our paper, available online, this figure shows how simple statistical measures applied to the same experimental data do exactly what the authors of Assembly Theory claim to have done for the first time. It shows the lack of basic control experiments, the absence of an original contribution, and the lack of proper attribution. They have advanced a theory that sounds sophisticated only because it draws heavily on definitions of certain fundamental ideas from complexity theory, which they make sound like their own while in fact appropriating them almost word for word. Moreover, their theory is disconnected from their method, which is completely trivial and consists of counting the number of exact repetitions in a piece of data. According to the authors, this defines life. Just as we have done, others have argued that simple things like a pile of coal or certain minerals would satisfy Assembly Theory's definition of life. We have known since the 19th century that organic chemistry does not equate to life and that organic compounds can be synthesized both naturally and artificially. The Y axis is on a log scale and normalised so that the different indexes can be compared. Any of the separating cut-off values on the Y axis can serve as a threshold for life of the kind AT suggests.

UPDATE (Thursday, 11 January 2024): Disassembling Assembly Theory

The authors of Assembly Theory, led by Lee Cronin, are changing the narrative, moving away from the 'life meter' idea that Assembly Theory can detect life, after being caught using the most popular family of compression algorithms, which has long been used for clustering and classification purposes, including separating organic from non-organic molecules. As of January 7, Cronin continues to claim that the assembly index is not a compression algorithm even though it is: their own favourite examples are exactly LZ77/LZ78, and at the core of AT is an algorithm of the LZ family, a very basic compression scheme introduced in 1977 and most popular in implementations such as ZIP and GIF.

The authors of AT seem not to realise that what defines an algorithm is not the use to which it is put: the fact that they are not using it to compress data and save memory on a computer does not mean it is not a compression algorithm; it is one. Nor do they seem to understand that there is nothing wrong with accepting that it is a compression algorithm, and an existing one. In their effort to appear novel and revolutionary, they want to deny this for as long and as strongly as possible, because otherwise their Assembly Theory (AT) is exposed as a very weak approximation to algorithmic (Kolmogorov) complexity, one that delivered the same results as theirs years before AT existed (see below for details).

Striking similarities exist between Assembly Theory, which purports to explain selection and evolution, and our Block Decomposition Method, introduced in the early 2010s, which was proposed and tested with actual data and preliminarily validated on a cancer pathway.

The new narrative is more focused on the claim that 'Assembly Theory explains selection and evolution' (biological and beyond). Unfortunately, just as with their chemical claims about separating organic from non-organic compounds, we did this already in 2018, in our paper published by Royal Society Open Science under the title "Algorithmically probable mutations reproduce aspects of evolution, such as convergence rate, genetic memory, and modularity". Their definition of evolution in terms of Assembly Theory is strikingly close to ours (but much weaker). Furthermore, their Nature paper contains no experiments or empirical evidence. So, on both counts where the authors of Assembly Theory believe they have innovated, we reported it before, on better grounds, more formally, and with experiments and data to support it. In contrast, they have replicated and deployed theory and algorithms from the 1960s and 1970s and, when faced with our papers, deliberately decided not to cite them and, in fact, to take a stance against Kolmogorov complexity in an attempt to appear to be doing something completely different and disconnected from it (which they are not). Their claims of novelty rest on two ideas that my groups introduced before them:

  1. The application to separate organic from non-organic chemical compounds, and
  2. An application to explain selection and evolution as an emergent phenomenon even leading to modularity (the ‘copies’ in Assembly Theory)

As shown in these papers:

H. Zenil, N.A. Kiani, M-M. Shang, J. Tegnér, Algorithmic Complexity and Reprogrammability of Chemical Structure Networks, Parallel Processing Letters, vol. 28, 2018. (A complete version is also available on arXiv. Note that our complexity calculator, where you can replicate the experiments, is now at https://complexity-calculator.com/, as we lost the domain without the hyphen to a squatter.)

In this paper we demonstrated how organic compounds can be separated from non-organic compounds with high accuracy using LZW and other algorithms, on an exhaustive database of over 15,000 compounds, compared to the five that the authors of Assembly Theory say were given to them by NASA (or roughly a hundred compounds if their 'calibration' procedure is counted). The Assembly Theory group presented their results as new and revolutionary in their Nature Communications paper and all over the media, claiming to have found a theory of everything and to have unified biology and physics. We did not mount a media stunt because we found the result neither very surprising nor hard, and because it was possible with any representation of the chemical data, including mass spectra but also InChI and molecular distance matrices, as shown in our 2018 paper.

And then,

S. Hernández-Orozco, N.A. Kiani, H. Zenil, Algorithmically Probable Mutations Reproduce Aspects of Evolution, such as Convergence Rate, Genetic Memory, and Modularity, Royal Society Open Science, 5:180399, 2018.

In this paper we investigated the application of algorithmic complexity to explain and quantify selection and evolution as emergent phenomena, including modularity (what Assembly Theory calls identical copies), with proper control experiments, such as comparing evolutionary convergence rates and testing several indexes and methods against each other. Results on both synthetic and biological examples indicate that our theory can explain an accelerated rate of mutations that are not statistically uniform but algorithmically uniform, which may explain how evolution finds shortcuts through selection as an emergent property of algorithmic information theory.

In the paper, we show that algorithmic distributions can evolve modularity and genetic memory through the preservation of structures when they first occur, starting from very basic computational processes and leading to an accelerated production of diversity but also to population extinctions, possibly explaining naturally occurring phenomena such as diversity explosions (e.g. the Cambrian) and mass extinctions (e.g. the End Triassic), whose causes are still debated. The approach introduced in this paper appears to be a better approximation to biological evolution than models based exclusively upon random uniform mutations, and it also approaches a formal version of open-ended evolution predating biological evolution as an example of how selection emerges. These results lend support to the suggestion that computation may be an equally important driver of evolution.

We validated our approach on a cancer pathway, finding the most likely oncogenes. The measure we introduced is called the Block Decomposition Method (notice the similarities, again); it counts blocks of causal content, not only identical copies but also patches of simple transformations that could explain how a process unfolded. The similarities with Assembly Theory stop there, because our method is not just LZW: it can do what a compression algorithm does (and we compare it against compression), but it also finds the causal structures that, in sequence, assemble the original object.

In this other paper, published in iScience (a Cell Press journal) in 2019, we were able to reconstruct an epigenetic Waddington landscape, validated against three genetic databases and the cell-biology literature on how stem cells differentiate. We also showed how our algorithms based on algorithmic complexity can reconstruct the causal mechanics of dynamical systems, proving the authors of Assembly Theory wrong regarding their many false claims about algorithmic (Kolmogorov) complexity and its numerical approximations, and their claim that it is not, or cannot be, related to causality:

H. Zenil, N.A. Kiani, F. Marabita, Y. Deng, S. Elias, A. Schmidt, G. Ball, J. Tegnér, An Algorithmic Information Calculus for Causal Discovery and Reprogramming Systems, iScience, S2589–0042(19)30270–6, 2019.

Other researchers have also used LZ algorithms to characterise all sorts of objects, including biological and chemical ones, finding a strong simplicity bias of the type captured by LZ, such as the repetitions that AT claims to have found for the first time. But we were the first to report these phenomena, in a paper titled "On the Algorithmic Nature of the World", published in 2010 in a scholarly volume titled Information and Computation (World Scientific Publishing Company), and later in a PLoS ONE publication under the title "Calculating Kolmogorov Complexity from the Output Frequency Distributions of Small Turing Machines" in 2014, both in the context of my Ph.D. theses (Lille and Sorbonne), for which Greg Chaitin, one of the founders of the theory of algorithmic complexity, sat on the evaluation committee. Unlike Cronin's, the papers above are supported by empirical data and by systematic, exhaustive experiments with proper controls.

The papers above from my groups are years ahead of what the Cronin and Walker groups have offered, which comes with no evidence, no experiments, no solid (or original) theory, and no novelty in their methods or index. The two papers above are only a sample of the dozens of papers we have published in the field.

The same applies to causality, another area in which Cronin and Walker wrongly suggest that algorithmic complexity is not suitable, and to which they wrongly suggest they are contributing independently of algorithmic complexity. For example, we published this paper in Nature Machine Intelligence in 2019:

Unlike Assembly Theory, our framework takes into account and combines what has been learned about causality over the last 40–50 years, including perturbation analysis, counterfactuals, multivariate simulation, etc.

Nature produced this video to explain our approach:

Nature described our research and methods in 2019 as “One group of scientists are trying to fix this problem with a completely new kind of machine learning. This new approach aims to find the underlying algorithmic models that interact and generate data, to help scientists uncover the dynamics of cause and effect. This could aid researchers across a huge range of scientific fields, such as cell biology and genetics, answering the kind of questions that typical machine learning is not designed for.”

One can see how Assembly Theory mirrors all of our work, taking ideas from algorithmic complexity (without acknowledging it) but in a rather pedestrian fashion, with an incredibly simplistic and weak method that has no scientific basis yet comes with wildly exaggerated claims.

Based on this body of scholarly literature that we have produced over the last 15 years, we founded, in the mid-to-late 2010s, the field of Algorithmic Information Dynamics, on which Springer Nature and Cambridge University Press have published these two books in application to causality and living systems:

Notice the subtitle: In application to Causality and Living Systems.

If you missed the glossary of terms that Assembly Theory has misappropriated (or plagiarised), shown in one of my videos, here it is:

Cronin also says he wants to apply his assembly index (equivalent to LZ77/LZ78) to natural language text to classify languages and other objects, such as species, using exactly algorithms in the LZ family, based on the same principles. That has already been done by Cilibrasi and Vitányi in 2005 in their landmark paper on clustering by compression, published in IEEE Transactions on Information Theory. Ever since, text compression has been used widely for things like fraud, authorship, and, paradoxically, plagiarism detection. Cronin will say his assembly index is not a compression algorithm, but it is.
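For readers who want to see how close this is to what AT proposes, here is a minimal sketch of clustering by compression in the spirit of Cilibrasi and Vitányi's Normalized Compression Distance (NCD), using a generic LZ-family compressor from the Python standard library; the strings are toy placeholders, not data from any of the papers discussed.

```python
import zlib

def csize(x: bytes) -> int:
    # Compressed size: a rough, computable upper bound on algorithmic complexity.
    return len(zlib.compress(x, 9))

def ncd(x: bytes, y: bytes) -> float:
    # Normalized Compression Distance (Cilibrasi & Vitanyi, 2005):
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Hypothetical toy strings (placeholders only):
related_a = b"ABRACADABRA" * 20
related_b = b"ABRACADABRA" * 19 + b"ABRACADABRU"
unrelated = bytes(range(256)) * 2

print(ncd(related_a, related_b))  # small: shared regularities compress jointly
print(ncd(related_a, unrelated))  # larger: little shared structure
```

Swapping in mass spectra, InChI strings, or texts in different languages is simply a matter of changing the byte strings fed to the distance, which is exactly what compression-based classification has done for two decades.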

On my ‘association’ to Prof. James Tour:

I have also been asked why I decided to participate in Prof. James Tour's podcast when he holds strong religious views opposite to mine. I opened my intervention by saying that I am the opposite of a creationist and not a religious person, and I closed my interview by calling out Intelligent Design on my last slide. So, throughout it, I was very clear about my convictions, and I am happy his audience had access to them and could be exposed to what I think is a healthy check and balance.

My duty is not to judge the beliefs of the person who wants to discuss science with me; my duty is to give the most objective answer I can come up with as a scientist to whoever asks or reaches out to me. Not reaching across the aisle is what promotes greater division and, sometimes, radicalisation into small, isolated spheres of information, where James Tour's audience would perhaps never have heard someone like me, someone opposed to their views. So the fact that the video gathered almost 100K views in its first week alone is, for me, a great result and a double mission accomplished.

Prof. James Tour is one of the most reputable synthetic chemists. He has been awarded prizes by organisations such as the Royal Society of Chemistry for his work (the same society that, in contrast, apparently suspended Cronin for misconduct). Doesn't it say something about how things work that Tour, a professor who happens to have a podcast for his own reasons, reached out to me, rather than science communicators like Lex Fridman and others? I shared the same information with Philip Ball, a popular science communicator, who ignored much of what I said and misrepresented me, adding only a single watered-down sentence to his Quanta and Wired articles on Assembly Theory, to the effect that I thought 'Assembly Theory was yet another measure of complexity', because he was so excited about Cronin and Walker's empty but colourful rhetoric. They do not want to undo what they, the science media, did, because that would mean accepting that they were grossly fooled. I wrote about this in another blog post.

UPDATE (Saturday, 6 January 2024): Videos and Interview Disputing Assembly Theory

To help my readers and others who may have been misled by Assembly Theory, I have decided to support my arguments with a video recording posted on YouTube and an interview I gave to one of the most reputable synthetic chemists, Prof. James Tour, who very kindly invited me onto his podcast, having cited me at last year’s Origin of Life debate with Lee Cronin at Harvard University.

Tour is a Professor of Chemistry, Materials Science, and Nanoengineering at Rice University in Houston, Texas. He is a world authority on nanotechnology and has pioneered the field of nanomedicine and the application of nanorobots in drug delivery. For example, Tour's lab's research into graphene scaffolding gel has been shown to repair the spinal cords of paralyzed mice.

As per his Wikipedia page, Prof. Tour has about 650 research publications and over 200 patents, with an H-index > 170 with total citations of over 130,000 (Google Scholar). Prof. Tour was awarded the Royal Society of Chemistry’s Centenary Prize for innovations in materials chemistry with applications in medicine and nanotechnology. Tour was inducted into the National Academy of Inventors in 2015. He was named among “The 50 Most Influential Scientists in the World Today” by TheBestSchools.org in 2014.

Unlike me, Prof. Tour is a deeply religious man, which is not uncommon among scientists. Prof. Tour’s beliefs are his own business and I respect him even if I don’t share them.

These videos show how close Lee Cronin and Sara Walker are to plagiarism, perhaps only saved by — or so I’d like to believe — their profound ignorance of the topics they are pretending to contribute to. If it is not ignorance, then it is dishonesty, and they have already crossed the line into deception and plagiarism:

Assembly Theory is identical to LZ77/LZ78 and algorithmic (Kolmogorov) complexity. Every aspect of Assembly Theory is disputed by Dr. Zenil.
Dr. Hector Zenil interviewed by Prof. James Tour after the Harvard debate on the Origin of Life with Lee Cronin. It debunks Assembly Theory as introduced by Lee Cronin and Sara Walker.

Offer to Cronin’s and Walker’s group members: I know how difficult it may be for postdoctoral researchers in Lee Cronin and Sara Walker’s groups to learn and face the reality that they have been grossly misled. Please, do not get discouraged. You are most likely brilliant and young and do not have to bear the responsibility for the mistakes of your principal investigators. They are incompetent in the areas of computer science, causality, complexity, systems biology, and more. You should not be forced to waste your valuable time or career with their personal agendas.

To Cronin and Walker: I repeat an offer I have made before, to advise you and, if possible, get you on the right path, as you need an expert in computer science, complexity science, information theory, causality, and systems biology, and there is nothing wrong with accepting this fact. The sooner you realise and accept it, the better. I would also be happy to assign your groups real research problems to work on, for example, the development of tools to deal with inverse problems. In the video above I have even suggested one possible route to some version of Assembly: looking for nested copies. This is something compression algorithms already do; for example, ZIP can take a parameter controlling how exhaustively it searches the string for deeper, nested repetitions, and fractal compression does something similar, although in that case it is a lossy algorithm (a small sketch of how nested copies can be captured follows below). Anyway, I do not expect you to credit me for any of the ideas I am giving you, as you never credit anyone who is not you or your close supporters, but I believe in doing a service, and the sooner you stop inundating journals and the media with plagiarised Assembly Theory, the better for science.
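For illustration only, here is a tiny grammar-based sketch in the spirit of Re-Pair-style compression (my own toy example; it is not Cronin's algorithm and not literally what ZIP does) showing how repeatedly replacing the most frequent adjacent pair with a new symbol naturally captures nested copies:

```python
from collections import Counter

def repair_sketch(seq):
    """Repeatedly replace the most frequent adjacent pair with a fresh symbol.
    Returns the final sequence and the grammar rules (new_symbol -> pair)."""
    seq = list(seq)
    rules = {}
    next_id = 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no repeated pair left
        sym = f"R{next_id}"
        next_id += 1
        rules[sym] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

# 'ab' repeats, and so do the larger blocks built from it: nested copies.
final, grammar = repair_sketch("abababababababab")
print(final)    # a short sequence of rule symbols
print(grammar)  # rules reference other rules, i.e. nested copies
```

The resulting grammar rules reference one another (a rule built from rules), which is precisely the nested-copy structure in question.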

A third video will follow, with more arguments on the wild speculations of Lee Cronin and Sara Walker on how instantaneous ‘pictures’ of objects are constructed sequentially according to their simplistic algorithms. I will take the opportunity to explain how the theory of Algorithmic Information Dynamics (AID) makes evident how Algorithmic (Kolmogorov, Kolmogorov-Chaitin) complexity is deeply connected to causality and is the ultimate theory of causality under the most basic and common assumptions of science. AID provides the theory and methods to perform causal discovery and causal analysis and provides the best mechanistic view of every cause-and-effect chain of a process and how it comes to be non-trivially assembled (beyond identical copies).

UPDATE (Wednesday, 29 November 2023): Assembly Theory's loaded claims with no substance inundate journals and media

James Tour and Lee Cronin had a public debate hosted at Harvard University, where in refuting Cronin’s unfounded claims that he had basically ‘solved life’ (and almost everything else in the universe, according to the authors’ claims to the media, see below), our work, and some excerpts from this blog pointing out the many problems with Assembly Theory, were prominently cited. Dr. James Tour is a Professor of Chemistry, Materials Science, and Nanoengineering at Rice University in Houston, Texas.

These are the titles and press releases, approved by the authors of Assembly Theory, that fed the media.

While I may not share all of Prof. Tour’s beliefs regarding religion (he did not use any religious arguments to refute Cronin’s claims), I think he did a service to science and scientific practice by pointing out the many false claims made by Cronin and his group. While I may also not have dismissed a paper by Cronin as ‘garbage’ without elaborating, I understand the huge discrepancy between their many false claims and their actual contribution. This bears out the Bullshit Asymmetry Principle, also known as Brandolini’s Law: the amount of energy needed to refute bullshit is an order of magnitude bigger than that needed to produce it.

A noteworthy feature of the debate, and of what followed, is how Cronin's attitude is diametrically opposed to his public profile and bold claims. Before an audience of Harvard and MIT scholars, he claims to know almost nothing; he says he is an amateur in almost every field of knowledge and that he may be completely wrong (we have shown he is). This humility is merely a strategy to avoid further grilling when facing an audience that is anything but lay.

Contrast this with the claims he spreads in public, where no one can refute him, including claims that he has 'unified physics and biology', redefined 'time' as a physical process (he says so in the video), and so on, none of which is remotely true. I was expecting a technical defence, but Cronin unfortunately presented to an audience of Harvard scholars a general talk appropriate for a lay audience.

According to Cronin (and this was one among several false statements aired in this debate), "no one is disputing the science" (of AT). We have, and only a few minutes earlier, James Tour had, but Cronin seems to specialise in sophisticated academic deflection. One can see immediately how he exploits Prof. Tour's beliefs to try to discredit his position, without attempting any logical or scientific argument in response to Prof. Tour's specific questions.

In the end, Cronin disguises his theory as a 'discovery' of the highly hierarchical structure of living systems, but we have known this since the introduction of cell biology and crystallography over a hundred years ago. This is what systems biology has studied for decades, whether in relation to the compartments found in the cell, or the way evolution takes old material and self-organises it into heavily nested DNA and genetic activation switches, leading to highly nested protein structures, the Lego blocks that build living systems (which the AT authors want us to believe they discovered!). Because they say things that are generally known and accepted, many people are taken in and think they have something new to offer, but the only part of AT that makes sense is what we already know, and know better, from the work of others. That is also the strategy: say something that is known, or that sounds scientifically sound, and then mix it with fake rigour to embellish the results.

Cronin also compared himself and the situation with AT to Galileo Galilei and the heliocentric model (while insisting that he was no Galileo), as if Assembly Theory were so disruptive that he had annoyed the establishment. This is not the case; there is nothing concrete about AT for us to oppose, as it lacks substance. This is our main criticism: there is no science to criticise; AT is not saying or proving anything that existing theories and methods could not have proved (and have). We reproduced the results that AT claims only it is capable of producing using very simplistic statistical measures, separating the same compounds that AT claimed to be the first to separate, and we did so almost five years before AT. We also demonstrated that all sorts of old and new indexes separate organic from non-organic compounds (which Cronin and Walker conveniently conflate with 'life') using exactly (and only) the same mass spectral data that the authors of AT used, showing that they perform similarly to or better than AT. So, what is left?

More papers disputing the science of AT are coming out, showing how inert (natural, experimental, and synthetic) minerals can have a high assembly index (hence being ‘alive’ according to Cronin’s and Walker’s Assembly Theory) — as we showed theoretically in our paper and predicted would be the case.

Unlike Tour, who thinks there is no ill intention or bad faith at play here, I have asked whether that is really so, because I find deeply dishonest the divergence between the attitudes with which they face different publics: knowingly misleading when they can get away with it, but not when, outside the ecosystem of followers they have created, they are unlikely to succeed. In reality, Cronin, Walker, and their groups have not proven or demonstrated anything close to what they claim in public forums, in their paper titles, and in their university press releases.

Their modus operandi does not belong in science (Cronin bragged about how many times his paper had been mentioned online and downloaded, mostly for the wrong reasons, but he seems happy regardless). Their approach does a disservice to science and the general public, and I think that was the main point Prof. Tour made, just as I have been doing for years now (as have many others who had not found the courage to come forward before).

Bad as this self-promotion is, it is equally dishonest to do science that is not based on basic control experiments, to ignore or dismiss the work of others, to take ideas without proper attribution, to conceal conflicts of interest, and to deform reality to garner agreement and appear more credible, creating a terrible ecosystem of false science driven by the wrong incentives (academic promotion, taxpayer grant money, attention seeking).

Of course, this is not a problem exclusive to Cronin and Walker, but they are prime examples of all that is currently wrong in the practice of self-promoting ‘science’. Their work is an instance of the kind of science boosted by dubious incentives, albeit an extreme one. The fact that they have managed to mislead colleagues and the public, even their own students, and are rewarded in some contexts for doing so, with promotions, high-impact articles, media attention, and government grants, with little to no accountability, is of course, only a reflection of a greater problem in the scientific community and in society in general.

UPDATE (Saturday, 25 November 2023): The ever-moving goalposts of the authors of Assembly Theory

The current narrative of the authors of Assembly Theory contra our criticisms seems to boil down to the following:

1. They claim they are not implementing Huffman or anything related to compression because they take a 'physical signal' and process it directly under a hypothesis about how a molecular compound may assemble; moreover, their algorithm is (slightly) different. This does not make sense. Huffman coding and lossless statistical compression like LZW are totally mechanistic, in the sense that they offer a step-by-step, plausible physical unfolding process, and they are even better grounded because they are proven to be optimal in a universal sense (they converge asymptotically), while the AT hypothesis about how a molecule may have assembled is pure speculation, not grounded in any physical evidence (evolutionary or otherwise). Hence there is no state-to-state correspondence between their theory and reality, and the theory has already been shown to face many elementary problems and challenges even in justifying its chemical layer, let alone other layers. In other words, there is no indication that their molecular compounds are assembled in the way they think they are (which amounts to a suboptimal reverse Huffman process), or indeed in any other particular way. That is, their hypothesis is as good as Huffman's when applied to the same (physical) data. Actually, the evidence favours Huffman's, because it separates the classes in the AT experiments with greater statistical significance than AT itself does, implying that it may correspond better to the ground truth AT is trying to capture. Their speculations are therefore not related to any evidence of physical assembly and are thus, at best, as good as Huffman's assembly and disassembly processes. Huffman and statistical compression have been applied to many other physical and biological signals, and the AT measure (marked as MA on the plots) is nothing but a restricted version of a compression algorithm that takes a piece of data represented in a computable fashion, just as it would be for any other measure, including lossless statistical compression.

2. They also claim that other indexes ‘require a Turing machine’ in order to be applied, which does not make sense either, as I have explained extensively below. This is like saying that their measure requires a Dell computer because they carried out their calculations using a Dell computer. A Turing machine is just a computer program.

3. They also claim that we have not applied our indexes to the same spectral data. This is false; we applied our measures to exactly the same mass spectral signal they used (our Fig. 4, extending Fig. 4 of their Nature Communications paper). We did use other data in some of our other figures, but only to show that there is nothing special about mass spectral data and that all indexes (out)performed MA equally when applied to molecular distance matrices or even InChI descriptions (which we reported almost five years before AT was introduced; continue reading) to separate the same classes as they did.

The problem is that the authors of AT throw out so many false statements per claim that it is impossible to have a rational scientific conversation with them or about their theory. This is a quintessential instance of the so-called Bullshit Asymmetry Principle, also known as Brandolini's Law, which states that the amount of energy needed to refute bullshit is an order of magnitude bigger than that needed to produce it. I have done my best so far to demonstrate scientifically the falsehood of every statement the AT authors have made to capture people's attention, probably more than any other individual. Fortunately, more papers showing how AT is flawed are coming out in the wake of ours.

UPDATE (Thursday, 19 October 2023): On the unscientific and unethical behaviour of the senior authors of Assembly Theory

For years, I’ve been reporting what I think is the unscientific and unethical behaviour of the senior authors of AT. I was the first to report how the authors, the journals in which they have published, and, to some extent — also accountable — the science writers and science magazines have fallen into the trap of amplifying the sensationalistic claims that the authors and their university PR teams make, giving little to no space to neutrality or critical views that would make for balanced reportage.

This time around, the situation has reached new, unprecedented levels of deception, with the collaboration of Nature (the journal), the senior authors' university PR departments, and the media, reaching a point where eyebrows are being raised, with most people seeing through the hype, including some of the authors' closest collaborators.

Its engaging prose notwithstanding, their paper verges on full-blown plagiarism, appropriating core elements of algorithmic information theory and of my group's own work, making 'Assembly Theory' (AT) indistinguishable from the principles of algorithmic complexity (including elements of algorithmic probability and Bennett's logical depth, which are part of the same body of knowledge). This paper is even more shallow than the last one, though it makes even bolder claims, with the authors and the media now widely reporting AT as the theory that unifies biology and physics. All this about a 'theory' whose method is a weaker form of Shannon entropy applied to counting exact copies of molecular sequences, a method which has been used (and abused) in every popular lossless compression algorithm since the 1960s (like GZIP) to approximate algorithmic complexity, despite the authors going to great lengths to distance themselves from compression and algorithmic complexity. In fact, what they implement is none other than a compression algorithm (a very simple and weak resource-bounded version of Kolmogorov complexity, and a trivial version of our own Block Decomposition Method), and they even openly adopt the language of compression, using terms like 'shortest paths', 'smallest number of steps' and so on, as indeed they have done before. Yet the authors did not compare their resource-bounded measure with any other, on the false grounds that the others are, presumably according to the authors, 'uncomputable'.

These authors, who’d already made unfounded claims about ‘Assembly Theory’ (AT) (claiming it was able to detect life on other planets), only a few months ago announced that they had also achieved the breakthrough of making ‘time physical’ using AT, a claim that nobody could understand, especially because any assembly time index is an absolute quantity trivially defined with no evidence to have any physical state correspondence or causal connection to ground truth. Their breakthrough does not correspond in any way to anything we know about time in science generally or via the laws of physics, such as general relativity, and exemplifies the wild claims these authors are inclined to make. Now they claim that the same theory can explain selection and evolution, unify biology and physics, and explain all life, while experts in selection and evolution don’t think they are even engaging in a serious discussion of the basics of biology. In other words, they are not falling far short of falsely declaring AT the Grand Unified Theory of Everything, when behind the scenes all there is is a weaker version of one of the simplest algorithms known in computer science, Huffman’s coding scheme that counts for repetitions in strings of data, a form of a trivial version of Shannon Entropy, though, unlike AT, Huffman actually counts correctly.

Instead of addressing the many criticisms levelled at their work, including the fact that my group reproduced their results with a 50-year-old algorithm and improved upon them with our own, and that we reported, almost five years before Assembly Theory, that organic compounds could be separated from inorganic compounds using molecular distance matrices and simple algorithms, they have now switched to full plagiarism mode and appropriated, without attribution, our ideas published in a Royal Society paper (linking resource-bounded, i.e. computable, algorithmic complexity to ideas from evolutionary theory and genetic modularity as emergent phenomena of algorithmic probability), making claims that are unfounded and ten orders of magnitude bolder than any we would ever have dared to make.

Make no mistake, the authors have the talent to appear to be describing a profound and important theory while actually presenting something that is mostly empty and shallow, a theory that’s far from commensurate with their methods, let alone the grandiose claims made on its behalf.

Unfortunately, they have made a marketing exercise of science, and they command a giant, well-oiled machine operating at full throttle that scores publicity stunt after publicity stunt, with the latest one calling AT the theory that has unified physics and biology. They operate unchecked by either their universities or the media, with the latter neglecting to feature critical views that would make for fair and balanced reportage. The University of Glasgow titled their press release “Assembly Theory Unifies Physics and Biology to Explain Evolution and Complexity” while ASU titled theirs “New ‘assembly theory’ unifies physics and biology to explain evolution, complexity”. Together they triggered everything that followed. I think this business has the potential to explode in everyone’s hands and end badly for those involved, as recently happened with Integrated Information Theory, only worse, because these claims are 99% marketing and only 1% original thinking.

Scientists and researchers are taken aback by the way such shallow ideas have been promoted. This work has already received a huge wave of negative reviews from experts in all relevant areas, including complexity, computer science, evolution, chemistry, physics, and biology, testifying to just how rotten the state of scientific publishing and scientific practice is these days. I am glad I have been calling out this kind of unethical practice for years. I feel vindicated for bringing to light the vicious ways in which the system, in collusion with authors intent on self-promotion, enables such scientific scams.

In the wake of the scandal surrounding Integrated Information Theory and now Assembly Theory, I am more skeptical than ever of the quality and integrity of some of the highest-impact journals, especially Science and Nature.

MINOR UPDATE (Monday, 4 September 2023): How media communicators fail us

If you want to learn how I think the Quanta article on Assembly Theory (later parroted by Wired), and even more so the one in the New Scientist, failed to follow the basic principles of objective journalism or scientific journalism, please go here, as I wish not to distract the readers from the main issues at stake with Assembly Theory.

UPDATE (Saturday, 8 July 2023): Assembly Theory spreads misinformation across science journals and science media outlets

Unfortunately, the authors of Assembly Theory (AT) continue spreading misinformation in a podcast online despite having been publicly corrected about algorithmic information theory and complexity science. They have said, once again, that Kolmogorov complexity ‘requires a Turing machine’ and is all about ‘Turing machines and Oracles’ (Oracles being a type of formal Turing machine that helps mathematicians prove theorems). This makes no sense. The founding fathers of complexity and algorithmic information theory, who were mathematicians, may have unduly emphasised Turing machines and Oracles, thereby giving the impression that their main features were negative results preventing their application, but in point of fact, they are not. In this paper, for example, we defined algorithmic complexity as a tool for causal discovery and causal analysis based on a regular computer language; no Turing machines involved or shortest programs required.

Abstractions of a computer, such as Turing machines, were merely useful in proving the theorems (it is the theorems that are fundamental). Those theorems, which the authors of AT disregard or ignore, endow the indexes based upon them with much more solid foundations than the ad hoc indexes based on an author's personal choice of what counts as a fundamental property of an object, such as life. Their status as estimations is no different from that of any other tool used to approximate a scientific explanation of an observation or a natural phenomenon. The belief that the Turing machine, or its mode of operation, is fundamental indicates a lack of knowledge of algorithmic complexity, a fundamental theory that should be mastered, given the many elements Assembly Theory takes from it and the criticisms its authors level at it.

The proponents of AT also allege that the field of complexity science has ‘not settled’ on any complexity measure and is incapable of resolving anything. Not only is this false, but even if it were true, AT does not help matters. Let’s explore some of the fundamental indexes in complexity theory:

(Variable window length) block Shannon entropy: This measure counts the number of repetitions of variable length in a piece of data according to a probability mass distribution. Does it sound familiar? Indeed, the assembly index is a version of this, one that drops the probability distribution. This means AT always assumes the simplest, uniform-distribution case and is therefore a weaker version of Shannon entropy (as demonstrated in our paper).
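As a minimal illustration (my own sketch, not the exact estimator used in our paper), block Shannon entropy over sliding blocks can be computed from the empirical block distribution as follows; dropping that distribution, as AT effectively does, reduces it to counting distinct blocks:

```python
import math
from collections import Counter

def block_entropy(s: str, block_len: int) -> float:
    """Shannon entropy (in bits) of the empirical distribution of blocks
    of length `block_len`, taken from the string with a sliding window."""
    blocks = [s[i:i + block_len] for i in range(len(s) - block_len + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

s = "ABRACADABRAABRACADABRA"
for k in (1, 2, 4):
    print(k, round(block_entropy(s, k), 3))

# Dropping the empirical distribution (assuming uniform block frequencies)
# reduces the measure to log2 of the number of distinct blocks:
print(math.log2(len(set(s[i:i + 4] for i in range(len(s) - 3)))))
```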

Huffman coding: It optimally looks for nested repetitions in a piece of data, sorting the most frequent ones to minimise the number of steps necessary to reconstruct its original form. The result is a step-by-step tree with the instructions to assemble the original object. Does this sound familiar? The authors of AT echoed this word for word in defining AT and their complexity index, apparently unaware of this extensively used computer-science algorithm introduced in the 1960s. Indeed, they created an algorithm that turns out to be a suboptimal version of Huffman's coding scheme, because their assembly index does not count copies correctly. In this sense, their algorithm is a variation of Huffman or of RLE, another, even simpler, algorithm introduced in the 1960s.
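For reference, here is a minimal sketch of the standard Huffman construction (the textbook frequency-based code tree, shown for illustration; it is not AT's code):

```python
import heapq
from collections import Counter

def huffman_codes(data: str) -> dict:
    """Build a Huffman code: most frequent symbols get the shortest codewords."""
    freq = Counter(data)
    # Each heap entry: (frequency, tie-breaker, {symbol: partial codeword}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    if count == 1:  # degenerate single-symbol case
        (_, _, codes), = heap
        return {sym: "0" for sym in codes}
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        count += 1
        heapq.heappush(heap, (f1 + f2, count, merged))
    return heap[0][2]

codes = huffman_codes("ABRACADABRA")
print(codes)  # e.g. 'A', the most frequent symbol, gets a shortest codeword
print(sum(len(codes[ch]) for ch in "ABRACADABRA"))  # encoded length in bits
```

The merge order of the tree is itself a step-by-step reconstruction recipe for the data, which is the sense in which the scheme is mechanistic.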

LZW: LZ77/LZ78 and LZW (and other statistical lossless compression/encoder algorithms) implement a dictionary-based approach. It is widely used in compression formats, from GIF to ZIP, and has been used in many applications, ranging from genetics to spam and plagiarism detection. A landmark paper using physical (or experimental) data directly was published in 2005 and has over 1,500 citations. It is indistinguishable from Assembly Theory. This is why the authors of Assembly Theory refuse simply to accept that they are using a compression algorithm: they would have to accept that they have been doing all along what they categorically claim not to be doing, namely producing a carbon copy of algorithmic complexity approximated with a resource-bounded measure they call the assembly index, one that is very limited compared to other, more sophisticated measures that are also resource-bounded, computable approximations to algorithmic complexity.
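Here is a minimal LZW encoder sketch (the classic dictionary-building scheme, shown for illustration only; it is not the specific implementation used in any of the papers discussed):

```python
def lzw_encode(data: str) -> list:
    """Classic LZW: grow a dictionary of previously seen substrings and
    emit one code per longest match, so repeated content costs one code."""
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    w = ""
    output = []
    for ch in data:
        wc = w + ch
        if wc in dictionary:
            w = wc
        else:
            output.append(dictionary[w])
            dictionary[wc] = next_code
            next_code += 1
            w = ch
    if w:
        output.append(dictionary[w])
    return output

print(lzw_encode("ABABABABABABABAB"))
# As longer repeated substrings enter the dictionary, each emitted code
# covers a longer stretch of input: copy reuse, the core of the assembly index.
```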

Algorithmic complexity (Kolmogorov, Chaitin): This is the universally accepted measure of randomness, settled on in the late 1960s and early 1970s. It is a generalisation of all the above indexes and any similar ones, including Assembly Theory (AT). It is what makes AT possible. Indeed, AT is a loose upper bound of algorithmic complexity, because simplistic statistical measures like the assembly index cannot deal with simple transformations of the data, for example reverse copies. They fail the most basic non-trivial tests and cannot scale up to any real-world scenario (AT's assembly index picks 'beer' as a living organism, or the 'most alive' compound, the object of jokes for those who see no grounds for taking the theory seriously) beyond trivial cases with perfect representations, which (block) Shannon entropy could have resolved anyway. The authors of AT claim that algorithmic complexity is uncomputable. This is technically inaccurate: it is semi-computable, which means estimations are possible. Unless the AT authors think their assembly index is not an estimation of life but the ultimate measure of life, it is as much an estimation as any estimation of algorithmic complexity. In fact, we claim it is an inferior one, because AT is a loose upper bound of algorithmic complexity and, as we argue below, cannot capture very basic cases (like simple reverse copies); as soon as it tries to, it gets closer to actual implementations of algorithmic complexity.
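To make the reverse-copy point concrete, here is a toy sketch (my own illustration, not the assembly-index code nor our BDM implementation): counting only exact repeated blocks misses the structure of a string followed by its mirror image, whereas allowing even one simple transformation, reversal, captures it immediately:

```python
def exact_copy_count(s: str, k: int) -> int:
    """Count non-overlapping length-k blocks that exactly repeat an earlier block."""
    seen, repeats = set(), 0
    for i in range(0, len(s) - k + 1, k):
        block = s[i:i + k]
        repeats += block in seen
        seen.add(block)
    return repeats

def copy_or_reverse_count(s: str, k: int) -> int:
    """Same count, but a block also counts if its reversal was seen before."""
    seen, repeats = set(), 0
    for i in range(0, len(s) - k + 1, k):
        block = s[i:i + k]
        repeats += (block in seen) or (block[::-1] in seen)
        seen.add(block)
    return repeats

half = "ABCDEFGHIJKLMNOP"
s = half + half[::-1]                # a string followed by its mirror image
print(exact_copy_count(s, 4))        # 0: no block is an exact repeat
print(copy_or_reverse_count(s, 4))   # 4: every block in the second half is a reversed copy
```

A genuine approximation of algorithmic complexity would assign the mirrored string roughly the complexity of its first half plus a small constant; a pure exact-copy counter sees almost no structure at all.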

Algorithmic probability (AP) (Solomonoff): Deeply related to algorithmic (Kolmogorov) complexity (inversely related to it), this measure is the accepted mathematical definition of inductive inference. Indeed, algorithmic probability inaugurated the field of Artificial Intelligence when Solomonoff presented it at what is today considered a landmark event, a workshop at Dartmouth College, where AI researchers accepted it as the final solution to the problem of inference. At an event a few years ago in NYC, Marvin Minsky, one of the fathers of AI, urged everybody to study algorithmic probability and algorithmic complexity, adding that he wished he had been able to devote his life to them. AP addresses why some object configurations are more probable than others. Does this sound familiar? It was fully adopted by the authors of AT as their own.

Logical Depth (Bennett): This measure separates complexity from randomness and simplicity (contradicting Walker’s claim that this had not been dealt with before and thus implying that it was AT that solved the problem). It measures the time, in the number of steps, needed to unfold an object from its set of (approximated shortest) possible origins. Sounds familiar again? AT has also been introduced in exactly this fashion, word for word. Bennett also talks about memory size and time in the causal evolution of an object based upon the fundamental measure of none other than Kolmogorov (algorithmic) complexity. All these concepts that make AT sound robust have been taken from these other measures, but it implements a weak version of Shannon entropy and is indistinguishable from LZ77/LZ78 (up to at most a small constant in the number of steps), so the distance between the claims made for the theory (appropriated from the work of others) and what the simplistic assembly index actually does, is staggering.
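One crude, computable proxy in this spirit (a sketch only, not Bennett's formal definition, which is stated over near-shortest programs on a universal machine) is to time how long a decompressor takes to unfold an object from a short compressed description:

```python
import time
import zlib

def unfolding_time(data: bytes, repeats: int = 200) -> float:
    """Average time (seconds) to reconstruct `data` from its zlib-compressed
    form; a crude, computable stand-in for the 'number of steps needed to
    unfold an object from a short description'."""
    compressed = zlib.compress(data, 9)
    start = time.perf_counter()
    for _ in range(repeats):
        zlib.decompress(compressed)
    return (time.perf_counter() - start) / repeats

structured = b"ABRACADABRA" * 5000   # highly regular toy content
print(len(zlib.compress(structured, 9)), unfolding_time(structured))
```

The point is only that 'unfolding time' is a measurable, computable quantity; estimating logical depth properly requires approximating the short descriptions themselves, which is what the measures discussed here aim at.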

Resource-bounded Kolmogorov complexity: This is a computable version of Kolmogorov complexity that the authors of AT pretend does not exist or suggest is impossible. They pretend AT is not an approximation of the real world. There are several resource-bounded approximations of Kolmogorov complexity, such as Minimum Description Length methods, limiting time or space to make computable approximations. All the computable measures above are resource-bounded versions of Kolmogorov complexity or upper bounds (of which AT too is one). We introduced our resource-bounded versions in 2010 with great success, garnering over 2000 citations from all sorts of groups despite not having the marketing engine of AT. The first of our algorithms is called the Coding Theorem Method (the coding theorem formally relates algorithmic complexity and algorithmic probability) or CTM. CTM, like any other measure, is an approximation or estimation of what the thesis in the theory suggests and conforms well with the expectations of Kolmogorov complexity. It is also the basis of the field of Algorithmic Information Dynamics, a book about which has just been published by Cambridge University Press.
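For reference, the coding theorem after which CTM is named relates the two quantities as follows (standard notation: U is a prefix universal machine, m the universal a priori probability, and K the prefix algorithmic complexity):

```latex
m(x) = \sum_{p \,:\, U(p) = x} 2^{-|p|},
\qquad
K(x) = -\log_2 m(x) + O(1)
```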

Block Decomposition Method (BDM): A measure that we introduced ourselves, which essentially combines all the above measures and has been applied to 'physical' (or 'experimental') data of the same type that the authors of AT deal with while claiming to have done so for the first time: physical data ranging from DNA folded in nucleosomes, to find regions of high genetic encoding (research published in the top journal of nucleic acids research, using real 'physical' genomic data), to the same mass spectral data that AT used in their original paper, producing similar or better results and clearly separating organic from inorganic molecules/compounds (almost five years before AT; a preprint version of our results is available here). BDM builds on the last 60–70 years of complexity science and incorporates, in one way or another, the knowledge from all the measures above, properly attributing each insight to its rightful source. It counts the number of repetitions of different lengths for long-range correlations using Shannon entropy, but it also looks for small patches of algorithmic complexity at short, local ranges, thus combining the best of both worlds: a statistical measure for scalability and quick computation, and an algorithmic, symbolic measure at short range that provides more solid approximations of local randomness and complexity.
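As a rough illustration of the shape of BDM (a simplified sketch; the placeholder local-complexity function below stands in for the precomputed CTM values the real method uses), the object is decomposed into small blocks, each distinct block contributes its local algorithmic-complexity estimate, and repeated blocks contribute only the logarithm of their multiplicity:

```python
import math
from collections import Counter

def local_complexity(block: str) -> float:
    # Placeholder only: the actual method looks up CTM estimates computed
    # from exhaustive enumerations of small Turing machines.
    return 2.0 + len(set(block))

def bdm(s: str, block_len: int = 4) -> float:
    """BDM shape: sum over distinct blocks of (local complexity estimate
    + log2 of the block's multiplicity)."""
    blocks = [s[i:i + block_len] for i in range(0, len(s), block_len)]
    counts = Counter(blocks)
    return sum(local_complexity(b) + math.log2(n) for b, n in counts.items())

print(bdm("ABABABABABABABAB"))   # one repeated block: multiplicity is cheap (log term)
print(bdm("ABRACADABRAXYZQW"))   # distinct blocks: each pays its own local cost
```

The logarithmic cost of repeated blocks is what makes exact copies cheap, while the local term is what lets the measure see structure beyond mere copying.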

None of the above algorithms is forced to work on only one-dimensional bit strings. They can operate on any digital object. AT also ultimately ingests digital objects only. In fact, just recently, we applied Entropy and our own indexes on multi-dimensional objects, even sound, to prove that non-random information encodes its own geometry and topology.

Other indexes measuring features that some authors find interesting may involve concepts like 'self-organisation', 'emergence', 'synergy', etc. These are indeed not settled, because their authors advance such indexes based on the beliefs or assumptions of their theories. This is no different from AT. The AT authors propose to focus on how many copies of the same type are reused in an object to drive their index. So it is not agnostic, as they claim, and it does nothing to address the purportedly unsettled status of the field; in fact, it enlarges the number of unsettled and controversial indexes. However, these indexes are not what the authors of AT attack; rather, out of ignorance, they attack the foundations, which in fact are settled. This is the main divergence from algorithmic complexity. Algorithmic complexity does not carry any bias or author baggage. It does not focus on any particular feature but considers all of them together to find the unfolding causal and mechanistic explanation of an observation. This is why it is ultimately more difficult to estimate: because nature and biology are complex, and life cannot be defined simplistically. We demonstrate how AT fails in theory and in practice.

Why are the above measures fundamental in complexity theory? They have all been proven to be optimal or fundamental in various ways. For example, LZW has been proven to converge to Shannon entropy, and other lossless compression algorithms can converge faster, though they are not fundamentally better in the limit. Algorithmic complexity and algorithmic probability, in particular, have been proven to be universal or invariant: in the 1960s, when mathematicians were trying to define randomness in various ways, using all possible statistical patterns, unpredictability, and compressibility, it turned out that each of these characterisations is equivalent to the others, thus stabilising the field and resolving once and for all the questions related to randomness, simplicity, and optimal induction. Every other weak definition of randomness turns out to be contained in algorithmic complexity or to actually be algorithmic complexity. AT is indeed contained in algorithmic complexity and does not separate randomness from complexity.

The authors of AT also suggest that their theory contributes to finding the minimum number of steps needed to define or create life. This, too, is false. In fact, one can create an object that fools AT both in theory and in practice, as demonstrated in our paper, and nothing prevents such objects from existing in nature. In fact, they do exist: complex crystals and piles of coal are examples provided in another blog post critiquing AT on the basis of elementary organic chemistry. AT has, after all, already predicted that beer is the most complex living organism on Earth, and the assembly index would characterise complex crystal-like structures as living systems. In fact, all the other measures can do the same. Shannon entropy on a uniform probability distribution can find the number of bits needed to encode the number of steps that AT claims is 'the magic number' for life (which we do not think it is and therefore, responsibly, never announced it as such); LZ77/LZ78 and Huffman can find the number of tree vertices in the statistical 'causal' graph, just as AT does, but optimally (because AT does not count correctly, as we proved on their own ABRACADABRA example, see the figure below; you will find plenty of examples online applying LZ77/LZ78 to the word ABRACADABRA, and a worked parse follows below). Lossless compression can provide a compression ratio, a threshold equivalent to the 15 steps, and it has in fact always done this in its applications. See, for example, a characterisation of complexity that formalises Wolfram's classes by compression, published in the Journal of Bifurcation and Chaos.
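For readers who want to check the ABRACADABRA example themselves, here is a minimal LZ78 parse (the standard textbook factorisation, not AT's code):

```python
def lz78_parse(s: str):
    """LZ78 factorisation: each phrase is the longest previously seen phrase
    plus one new character, emitted as (phrase_index, new_char)."""
    phrases = {"": 0}
    output, w = [], ""
    for ch in s:
        if w + ch in phrases:
            w += ch
        else:
            output.append((phrases[w], ch))
            phrases[w + ch] = len(phrases)
            w = ""
    if w:  # trailing phrase already present in the dictionary
        output.append((phrases[w[:-1]], w[-1]))
    return output

print(lz78_parse("ABRACADABRA"))
# [(0,'A'), (0,'B'), (0,'R'), (1,'C'), (1,'D'), (1,'B'), (3,'A')]
# Seven phrases; the parse reuses the earlier phrase 'A' three times and 'R' once,
# exactly the copy reuse that the assembly index is supposed to count.
```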

Measures such as Shannon entropy are used everywhere in the world. For example, every time someone takes a night picture with their phone and is asked to hold steady, it is because the phone wants to maximise the mutual information of the overlapping pictures using Shannon entropy. Can things get more physical, experimental, or applied than this? Hardly. AT is clearly one of many complexity indexes that take data from the real world and process it like any other. LZW is also used everywhere; logical depth has been used before to characterise ecosystems (hence life, and biosignatures on Earth) and to classify human-made objects into simple and random ones, in a paper published by our group under the title 'Physical Complexity', using real-world physical data. Algorithmic probability has also been used in DNA and protein research, and CTM and BDM have been used in areas ranging from psychometrics to visual cognition, animal behaviour, and medicine, all involving real-world, physical data.

Doesn’t the fact that optimal mechanistic inductive inference (AI and causality) is the other side of the algorithmic complexity coin sound totally fundamental, more so than anything else advanced by AT? Indeed, an upper bound of Kolmogorov complexity is a lower bound of algorithmic probability, two fundamental concepts intertwined. This may be news to some people, thanks to researchers like the authors of AT and their widely touted misconceptions. It is one of the most important scientific results, even called a miracle and highly praised by Marvin Minsky, and the cornerstone of complexity theory, which the authors of AT ignore or pretend to have come up with themselves, grossly misrepresenting it while advancing a simplistic measure in practice.

The authors also claim that their assembly index is not a compression algorithm and has nothing to do with minimal programs (see the fallacies below). They claim that it is a flaw of Kolmogorov complexity (or algorithmic information theory) that it seeks minimal programs. This is not true. The definition of Kolmogorov complexity uses the concept of the shortest computer program, but computable approximations look only for upper bounds and can find all sorts of programs, including causal graphs explaining the data. Does this sound familiar? Yes, again: it makes an appearance in the mix of concepts that the authors of AT throw at the reader. The purpose of science is to find short explanations or models for observations. If they think Kolmogorov complexity is flawed for pursuing this goal, then so is science. Kolmogorov complexity took the idea of the simplest model to the extreme in order to generalise over all cases; AT aims to do the same by finding short explanations (or 'shortest paths', as they now call them), but does so in a trivial manner, using a badly written algorithm that was introduced in the 1960s and is known to be so simplistic as a characterisation of life that the field abandoned it and moved on long ago.

So what do we prove in our paper criticising AT? Among other things, including theoretical and fundamental issues, that basically any trivial statistical algorithm can do the job AT does, as well or better, despite the claim of its authors, some true believers, and their friends that it is unique and, moreover, new and radical.

By a 'complete' molecular description we mean a description that allows, in principle, the reconstruction of the molecule/compound with little information of fixed size (i.e., information that does not depend on the molecule/compound itself). Examples include InChI codes, molecular distance matrices such as mol files, or the mass spectral data AT used for its assembly index. In all cases, the alternative indexes were able to ingest 'physical' data, just like their assembly index, or any other processed data such as InChI codes or distance matrices, which also derive from the physical and chemical properties of the molecules/compounds.

Now, the authors of AT will try to set their measure apart by saying it is the only one that ingests 'physical' data (if Cronin is a materialist, as he claims, shouldn't all data be physical?). This is false too. Not only have the above measures been widely applied to data that comes directly from physical observations, but their own index takes a computable representation of physical data, not physical data itself (they do not feed atoms or molecules through their algorithm; they parse a matrix of numerical values using a computer).

Resource-bounded measures of Kolmogorov complexity (of which Shannon entropy, for example, is one) are the object of investigation and widely successful application. These measures make all sorts of applications possible, from video compression to fraud and spam detection. Resource-bounded computable versions, such as CTM and BDM developed by our group, have been proven to be effective when applied to physical data (a long list of relevant publications may be found here).

Assembly Theory itself is only possible because of algorithmic information (it is a weak copy of algorithmic information, with the authors failing to cite those whose fundamental ideas they have drawn upon). Their assembly index is another computable resource-bounded weak version of statistical encoders (such as compression algorithms, which count copies as AT does as their most basic step).

Their misconception that algorithmic complexity ‘[requires] a Turing machine’ is equivalent to saying that the assembly index requires a finite automaton, because that is what they would need to run their algorithm (yes, they run it on a weak version of a Turing machine, on a computer that they mock and attack).

To justify their theory, they say that nothing is settled in complexity theory. This is also false. In all its versions, Kolmogorov complexity is the absolute accepted definition of mathematical randomness. If there are other measures of complexity, of which Assembly Theory itself is one, it is because different authors believe their measure captures a specific fundamental property that is of particular interest to them — just as Assembly Theory does.

Again, the greatest weakness of AT lies in its lack of control experiments — which we performed for them, albeit without recognition. These control experiments would have told them that almost any old statistical index (e.g., Huffman, RLE, lossless compression like LZ77/LZ78/LZW, etc.) + almost any description of a molecule/compound that can reconstruct the molecule/compound (invariance theorem) can produce the same results as AT.
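As a purely illustrative sketch of such a control experiment (the strings below are hypothetical placeholders for molecule descriptions such as InChI codes or flattened distance matrices, not real data, and the compressors are off-the-shelf stand-ins for the coders named above), one simply compares coded lengths:

```python
import bz2
import lzma
import zlib

def coded_lengths(description: str) -> dict:
    """Length in bytes of the description under several standard statistical
    coders; any of them can serve as the copy-counting control experiment."""
    raw = description.encode()
    return {
        "raw": len(raw),
        "zlib (LZ77 + Huffman)": len(zlib.compress(raw, 9)),
        "bz2": len(bz2.compress(raw, 9)),
        "lzma": len(lzma.compress(raw)),
    }

# Hypothetical placeholder descriptions, used only to show the mechanics.
print(coded_lengths("C1(C2(C3(C4)))" * 8))          # repetitive, 'modular' description
print(coded_lengths("XQZWKVJYPBGMHTDNRFLSCAEIOU"))  # no repeated blocks
```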

Therefore, AT has little or no value to science, despite the huge marketing campaign the authors are conducting, making it look credible by associating it with various indisputable notions (such as the recursive nature of nature, and the nestedness of living systems).

In the podcast referenced above, Cronin revealed what I take as an acknowledgment of the mistake they had made — thinking that they could count ‘physical copies’ and scale that up. According to him, it is a feature of their theory (every weakness is a feature for them) that not every variation from a perfect copy of a molecular configuration would be picked up by their measure. They say this is good because bad copies of molecules create bad entities. This is true, but the argument therefore works only at the very microscopic scale, perhaps a few nanometers, because at any larger scale, these imperfections will start to appear, making their index irrelevant (this is why complexity theory moved on from simplistic measures about 40 years ago). This means that anything larger than a few molecules — not even nucleosomes, let alone cellular or multi-cellular life — will be beyond the scope of their assembly index.

It is a pity that the authors of Assembly Theory feel they have to trash everybody else's work (on false premises) in order to feel they have created something new from scratch, advancing their work through self-promotion that exploits the international media and social networks at the expense of so much time and taxpayer money. The authors seem to seek a general public audience and mislead potential researchers in other areas, and do not care much about scholars in the fields they pretend to be contributing to. They may win the media battle, but our intention is to reach the scholars who might otherwise be deceived by the hype.

And all this is just the tip of the iceberg as regards the misleading claims of Assembly Theory (see the rest of this blog post). For a well-deserved criticism of their simplistic assumptions from the perspective of organic chemistry, read this Harvard researcher’s blog post.

I will do my best to keep up with and provide some balance to the uninformed views of the authors of AT, thankless task though it is, as it is more difficult to debunk false theories for the sake of justice and good science than to advance them in the first place, for personal gain. It is all rather like dealing with fake news, especially when you’re up against two experts in marketing like the leading authors of AT, who already have about 10 podcasts and online interviews to their name, and many more pieces in the written public media.

UPDATE (26 June 2023 + update 7 July): Assembly Theory is indistinguishable from Algorithmic Complexity and their index is a compression algorithm

The ever-changing definition of Assembly Theory keeps evolving in circles, with the authors in full marketing mode instead of addressing their theory’s many challengers and critics.

In their recent marketing material, Assembly Theory looks more than ever like algorithmic information and Bennett's logical depth. Cronin and Walker now say Assembly Theory is a measure of 'memory' size. But this is the definition of algorithmic information, and of unfolding a computing history (or time) from a causal origin; Bennett's logical depth is itself based on algorithmic complexity, so this is not an original idea either. Moreover, their compression and decompression steps are the same in number, and therefore their concept of a shortest assembly path is identical to a (weak) approximation of algorithmic (K) complexity.

They also claim that their motivation is to explain how some configurations are more likely than others (which is known as ‘algorithmic probability,’ also not their idea). In what amounts to a masterly publicity stunt staged with the help of their friends and colleagues, they have appropriated all the seminal ideas from complexity science. But not only that: Their measure does not do what they think it does (or does it very poorly); they do not need ‘physical’ data as they claim they do, and existing algorithms introduced in the 1960s perform as well or better than their basic assembly index. For more on this, see our detailed technical work below, reproducing their results but without all the claims they make.

The most recent media coverage of AT appeared in a bulletin from the Santa Fe Institute, in Aeon, and in New Scientist, introducing new terms into an intrinsically flawed and simplistic theory, such as 'memory,' which they adapt to identify the by-products of life. This surprising new addition (which does not match what their indexes actually do) makes AT identical in spirit to algorithmic complexity and Bennett's logical depth (introduced in the 80s), but ill-defined and incomplete, as they are unable to instantiate it even with the equivalent of a simple statistical algorithm of 1960s vintage, which actually outperforms AT (see below).

In their new efforts to make their theory and measure look different despite not making any substantial contribution, the authors now claim that the measure is not just about identifying life but also the by-products of life, clearly an attempt to accommodate the fact that their measure predicted beer to be the most alive product on Earth. This new development makes Assembly Theory indistinguishable from algorithmic information and Bennett's logical depth, introduced in the 1980s. Moreover, their actual measure falls far short of performing even the most basic tasks.

In the never-ending exercise of rehashing a simplistic idea of life to make it more credible and immune to criticism, the authors of Assembly Theory have decided to change the narrative around their measure and introduce yet more terms, this time 'memory' and 'computational history' (or similar), thereby making AT a carbon copy of algorithmic complexity, since it is now about the length of the model that generates the object's history (including time). In focusing on 'time,' which they introduce in grandiose fashion as if it had never been considered before, they succeed in making AT a carbon copy of these decades-old and powerful theories of life, while failing to make the appropriate attributions. And since the simplistic measure of Assembly Theory does not match their grandiose description, AT turns out to be not only a carbon copy of an existing theory of life, but a suboptimal one.

Instead of seizing the opportunity to justify their claims, or explaining how other measures can reproduce and even outperform what they claim to be outstanding results (see the figures below in our original post), they have continued spreading unfounded claims.

Assembly Theory has gained traction because the authors highlight a property of life nobody can disagree with: life is highly hierarchical, reuses resources, and is heavily nested. We have known this for decades; these notions are integral to our understanding of biological evolution, genetics, self-assembly, etc. The problem is that pointing out such an obvious feature, common knowledge to most experts, has earned them undue credit, despite only appropriating other people’s ideas without attribution and introducing ill-defined concepts and measures.

In a previous version of Assembly Theory (AT), the authors suggested that the number of 'physical' copies of an element used to assemble an object measured how 'alive' the object was. Builders use the same bricks, made of the same materials, in all possible configurations to construct walls and rooms on similar floors in multiple buildings, in a highly modular fashion. Are these buildings alive according to Assembly Theory? It seems to suggest that such objects, just like Lego constructions and natural fractals, are alive because they are highly modular and have a long assembly history, just as it predicted that beer was alive. The authors of Assembly Theory have now realised they were wrong, because their measure designated beer as the 'most alive' element on Earth, even more so than yeast, so they have adapted their theory to include the products of living systems, beer included (though it still does not work, because all they have done is modify their theory yet again in an attempt to accommodate their unexpected results without backing down from previous unfounded claims). These new claims actually make the theory more vacuous, because they take all the theoretical arguments from algorithmic probability and information theory as their own. Update: the authors now exclude all sorts of objects, like 3D crystals or anything that does not fulfil their definition of applicability, in order to avoid contradictions, making their measure highly ad hoc when it had been advertised as universal.

Our paper shows that we don’t require any of the ‘physical data’ the authors of AT refer to in order to separate organic from non-organic compounds.

We are convinced that what characterises life is the agency evinced in its interaction with the environment and not any single intrinsic (and rather simplistic) property.

Let us also address what its authors claim is another exclusive property of Assembly Theory, namely that their theory captures 'physical' copies and is the only one to do so, AIT not excluded. They also claim that their measure is the only experimentally validated one.

The leading authors of Assembly Theory seem to suggest that their measure has some mystical power that enables it to capture the concept of a 'physical copy,' even though what their measure is fed is a computable description (data), just like the measures we used to reproduce their results step by step, whether one calls that data 'physical' or not: any algorithm has to read a computably represented/encoded input.

The input data to their assembly index is an MS2 spectral file (a type of distance matrix). In other words, it all comes down to a mathematical, computable representation derived from observations and fed to a complexity measure, no different from what has been done for decades. This is exactly how every other complexity measure is deployed (except when used on simulated data). Consider, for example, how LZW was used by Li and Vitányi to define their normalised information/compression distance, applied to genomic data. Is 'genomic data' not physical for the authors of Assembly Theory? Moreover, we have shown that InChI ID strings and distance matrices, taken separately, are enough to classify compounds as organic or inorganic. InChI codes may be extracted from distance matrices but are as 'physical' as other descriptions, and distance matrices are also directly physically derived. If almost any other statistical measure of complexity plus almost any compound description (such as InChI codes or distance matrices) produces the same or better results than AT on spectral data, why would one need AT in the first place, and what does it enable that was not possible before (over and above what was shown to be possible in the paper we published almost five years before the first paper on AT)?
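That normalised compression distance is a one-line formula; here is a minimal sketch in Python using zlib as a stand-in for whichever real-world compressor one prefers (the sequences are hypothetical placeholders, not genomic data):

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    with C(.) the compressed length under some real-world compressor."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Hypothetical sequences: near-identical strings score close to 0, unrelated ones closer to 1.
a = b"ACGTACGTACGT" * 20
b = b"ACGTACGTACGA" * 20
print(ncd(a, b), ncd(a, bytes(range(240))))
```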

The input data to the assembly index are MS2 spectral files (also a type of distance matrix). The authors claim that Assembly Theory represents the first time in the history of science that an index takes 'physical' data from observations, effectively claiming that they have invented science. But mol files are matrices, not the actual compounds, just as InChI codes are strings. Their prime example is how their measure counts (wrongly, see the figure below) the letter repetitions that can reconstruct the words BANANA or ABRACADABRA. Do they think letters are physical while the entities all other researchers have worked with are not? The main question to answer is: if you get the same or better results by applying almost any other complexity measure to any molecular representation, why would you need AT and its 'physical' data? Our results show that distance matrices, which should count as physical for the AT authors, or InChI codes, can trivially separate organic from inorganic compounds (see the figure labelled 2 below, and the main figures reproducing the AT results further below). (July 7: a previous version of this figure wrongly suggested that the assembly index was reading InChI codes; this is irrelevant to the arguments made before, and nothing changes.)
In a paper predating Assembly Theory, published in the journal Parallel Processing Letters, we demonstrated that by only taking InChi nomenclature IDs, sometimes enriched with distance matrices, classification into different categories was possible, including organic and inorganic, something the authors of Assembly Theory have rediscovered five years later, claiming to require spectral data to do so. They would have discovered that their special data was not needed if they had not neglected to include a basic control experiment in their work. We have applied complexity measures to observable physical data in the field for decades. My group and I have been doing so since at least 2010, including the application of algorithmic probability on physical sources (section 1.3.3), as in this paper published in G. Dodig-Crnkovic and M. Burgin (eds), Information and Computation, World Scientific Publishing Company; and, more recently, in this other paper published in the journal Nucleic Acids Research (in 2019), showing how an application of complexity indexes on nucleosome data (quite physical) can contribute to solving the second most important challenge in cell and molecular biology (after protein folding), which is the problem of nucleosome positioning, where we, as was right and proper, compared our indexes to the gold standard in the field as well as to several other complexity indexes.
In a landmark paper in the field of complexity science published in 2005 that uses compression algorithms, the authors took individual whole-mitochondrial genomic sequences from different species and correctly reconstructed an evolutionary mammalian phylogenetic tree corresponding to current biological knowledge. According to the authors of Assembly Theory, genomic sequences would not qualify as ‘physical,’ as they claim to be the first and only authors to define and validate a measure of complexity with real ‘physical’ data. This is pretty much what every complexity theorist has done in the last 50 years (including ourselves) — taken observable data with a representation and fed it to a measure to classify or extract information from it. For example, in 2018, based on our work, a group from Oxford University published an application of measures of algorithmic complexity and the coding theorem to RNA secondary structures (SS). These phenotypes specify the bonding pattern of nucleotides, which the authors of Assembly Theory would not regard as ‘physical data’.

What seems reprehensible is the authors' open 'fake it until you make it' Silicon Valley approach to science, which, unfortunately, often pays off with some science journalists and science enthusiasts, who seem to be the main audience of the Cronin and Walker groups. Enthusiasts and science writers are sometimes drawn to such grandiose stories because the motivation of their employers is to sell more, which negative results in science hardly do (turning them into something like the tabloids of science), just as grant agencies seek (media) impact by rewarding senior researchers with cheap labour in the form of so-called 'postdocs': underpaid researchers who execute most of the research and are often misled by employers seeking personal and professional gain. Every time a public space is used to peddle bad science, attention is deflected from sound science, and a disservice is done to young researchers who do not have the deep pockets and marketing minds of these kinds of researchers and groups.

It is wrong to misappropriate ideas from others without attribution and to ignore results that predate one’s own. The community should repudiate these practices, as they turn the practice of science into a marketing exercise. Deft prose that makes one’s work look rigorous and deep when it is so fundamentally wrong and shallow is in no sense commendable.

In summary, this is what is wrong with Assembly Theory, why it needs fixing, and why the authors should stop promoting it:

  • The authors take criticism in the wrong spirit, doubling down on their false claims instead of course-correcting.
  • The theory and papers lack appropriate control experiments and introduce other people’s work as de novo ideas without attribution. Had they performed any control experiments, they would have found that they didn’t need any extra information to classify their compounds, or any new measure for that matter, as they could have used almost any other measure of complexity that already counts copies (from Huffman to RLE, LZW, you name it).
  • They have blatantly misappropriated concepts and ideas from others, knowingly and without crediting anyone.
  • They have created a pseudo-problem for which they have manufactured a pseudo-solution.
  • Their theory is inconsistent with their method.
  • They have chosen to take a marketing approach to science.
  • All of the eight fallacies below, from raising a straw-man argument against algorithmic information to reinventing, misappropriating, and embracing an algorithm that does not do what they say it does.

UPDATE (Friday 5 May 2023): In a recent popular article on Quanta by the science writer Philip Ball (which we won’t link here because we don’t wish to draw what we think is undeserved attention to this theory), the authors of Assembly Theory seem to suggest that the idea of considering the entire history of how entities come to be is original to Assembly Theory (AT). This is, again, incorrect; this idea was explored in the 1980s and was Charles Bennett’s, one of the most outstanding computer scientists and complexity theorists. Roughly, Bennett’s logical depth measures the computational time (number of steps) required to compute an observed structure. This is “the number of steps in the deductive or causal path connecting a thing with its plausible origin”.

Bennett's motivation was exactly that of Cronin, Walker, and their groups, but his work predated theirs by about 35 years. He was interested in how complex structures evolve and emerge from the large pool of possible (random) combinations, which is also the main subject of interest of Algorithmic Information Theory, whose resource-bounded measures include AT as, properly speaking, a weak version, not only because it is computable but because it is trivial, as proven in our paper and this blog post. The authors of Assembly Theory seem to keep jumping from one unsubstantiated claim to another. Unfortunately, most of the people interviewed about Assembly Theory in the Quanta article who were positive about it are too close to Sara Walker (co-author of AT) to be considered entirely objective (one of them being one of Walker's most prolific co-authors), and should not have been chosen to comment as if they were neutral. Either the journalist was misled, or the article failed basic journalistic principles.

Also, we found it unfortunate that the mistaken idea that Kolmogorov (algorithmic) complexity is too abstract or theoretical to be applied was put forward again, and that the claim that Kolmogorov complexity 'requires a device' (which is no different from AT requiring a computer algorithm to be instantiated) was repeated; it relates to the fallacy below concerning the apparently 'mystical' properties that the authors attribute to AT.

Algorithmic Information Theory (AIT), or Kolmogorov complexity (which is only one index of AIT), has been applied for around six decades and underpins the compression algorithms used daily for digital communication. It has also found applications in biology, genetics, and molecular biology. Yet one does not even need Kolmogorov complexity to prove Assembly Theory incorrect, because it does not do what it says it does, and what it does, it does no better than almost any other control measure of complexity. It only takes one of the simplest algorithms known in computer science to prove Assembly Theory redundant, since the Huffman coding scheme, and better compression algorithms such as the full LZW and many others, can count copies better than Assembly Theory. Counting copies has been the basis of all the old statistical lossless compression algorithms since the 1960s. It has been used (and sometimes abused) widely in the life sciences and in complexity science to characterise aspects of life. Nothing theoretical or abstract makes such applications impossible, though this is yet another common fallacy parroted with high frequency.
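To show how little machinery 'counting copies' requires, here is a minimal Huffman construction in Python (an illustrative sketch, not the implementation used in our paper): it builds code lengths directly from symbol counts, i.e., from the copies present in the data, giving the most-copied symbols the shortest codes.

```python
import heapq
from collections import Counter

def huffman_code_lengths(s: str) -> dict:
    """Build a Huffman code from symbol counts (the 'copies' in the data)
    and return the code length assigned to each symbol."""
    counts = Counter(s)
    if len(counts) == 1:                      # degenerate single-symbol input
        return {next(iter(counts)): 1}
    lengths = {sym: 0 for sym in counts}
    heap = [(n, i, [sym]) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)                           # tie-breaker so tuples never compare lists
    while len(heap) > 1:
        n1, _, group1 = heapq.heappop(heap)
        n2, _, group2 = heapq.heappop(heap)
        for sym in group1 + group2:           # every merge adds one bit to these codes
            lengths[sym] += 1
        heapq.heappush(heap, (n1 + n2, tie, group1 + group2))
        tie += 1
    return lengths

print(huffman_code_lengths("ABRACADABRA"))
# {'A': 1, 'B': 3, 'R': 3, 'C': 3, 'D': 3}: 'A', the most-copied symbol, gets the shortest code
```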

Original Post:

We have identified at least eight significant fallacies in the rebuttal by the proponents of Assembly Theory to our paper criticising the theory, available at https://arxiv.org/abs/2210.00901:

In a recent blog post (https://colemathis.github.io/blog/2022-10-25-SalientMisunderstandings), one of the leading authors of a paper on Assembly Theory suggested that our criticism of Assembly Theory was based on a misunderstanding. At the end of this response, we have included screenshots of this rebuttal to our critique for the record.

By the time you reach the end of this reply, you will have learned how the main results of Assembly Theory can be reproduced, and even outperformed, by some of the simplest algorithms known to computer science, algorithms that were (correctly) designed to do exactly what the proponents of Assembly Theory set out to do. We were able to reproduce all of the results and figures from their original paper, thus demonstrating that Assembly Theory does not add anything new to the decades-old discussion about life. You can go directly to the MAIN RESULT section below should you want to skip the long list of fallacies and cut to the chase (thus focusing exclusively on the empirical demonstration and skipping most of the foundational issues).

Fallacy 1: Assembly Theory vs AIT

According to the authors’ rebuttal, “[We] contrasted Assembly Theory with measures common in Algorithmic Information Theory (AIT)” and ‘AIT has not considered number of copies’.

This is among the most troubling statements in their reply as it shows the degree of misunderstanding. The number of copies is among the most basic aspects AIT would cover and is the first feature that any simple statistical compression algorithm would look for, so the statement is false and makes no sense.

Furthermore, in our critique, we specifically covered measures of classical information and coding theory unrelated to AIT, which they managed to disregard or distract the reader’s attention from. We showed that their measure was fundamentally and methodologically suboptimal under AIT, under classical Shannon information theory, and under basic traditional statistics and common sense. As discussed in this reply, Assembly Theory and its proponents’ rebuttal of our critique of it mislead the reader into believing that the core of our criticism is predicated upon AIT or Turing machines — an example of a fallacy of origins.

AIT plays little to no role in comparing Assembly Theory with other coding algorithms. As discussed under Fallacies 2 and 4, Assembly Theory proposes a measure that performs poorly in comparison to certain simple coding algorithms introduced in the 1950s and 1960s. These simple coding algorithms are based on entropy principles and traditional statistics. Yet, the authors make unsubstantiated and disproportionate claims about their work in papers and on social media.

This type of fallacious argument continues to appear in the text of the rebuttal to our critique, which suggests a lack of formal knowledge of the mathematics underpinning statistics, information theory, and the theory of computation; or else represents a vicious cycle in which the authors have been unwilling to recognise that they have seriously overstated their case.

To try to distinguish AIT from Assembly Theory in hopes of explaining why our paper’s theoretical results also do not serve as a critique, their text keeps mischaracterising the advantages and challenges of AIT as well as attributing false mathematical properties to AIT, for example, those to be discussed under Fallacies 2, 5, and 6 below.

One of the many issues we pointed out was that their molecular assembly index would fail at identifying as a copy any variant, no matter how negligible the degree of variation, e.g., resulting from DNA methylation. This means the index would need to be tweaked to identify small variations in a copy, meaning it would no longer be agnostic. For example, even linear transformations (e.g., change of scale, reflection, or rotation) would not be picked up by Assembly Theory’s simplistic method for counting identical copies, from which complexity theory largely moved on decades ago. Given that one cannot anticipate all possible transformations, perturbations, noise, or interactions with other systems to which an object may be susceptible, it is necessary to have recourse to more robust measures. These measures will typically be ultimately uncomputable or semi-computable, because they will look for all these non-predictable changes, large or small. So indeed, there is a compromise to be made, yet that they are uncomputable or semi-computable does not mean we have to abandon them in favour of trivial methods or that such measures and approaches cannot be estimated or partially implemented.

But when it comes to testing trivial algorithms of the kind Assembly Theory proposes, algorithms such as RLE, LZ, and Huffman, introduced in the 1950s and 1960s, are special-purpose coding methods designed precisely to count copies in data, and they have been proven optimal at counting copies and at minimising the number of steps needed to reproduce an object, unlike the ill-defined Assembly Theory indexes.
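Run-length encoding, for instance, is literally nothing more than copy counting; a minimal sketch for illustration:

```python
from itertools import groupby

def run_length_encode(s: str):
    """Collapse each run of identical symbols into (symbol, copy count),
    the most literal possible way of counting copies in data."""
    return [(sym, len(list(run))) for sym, run in groupby(s)]

print(run_length_encode("AAAABBBCCD"))  # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
```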

Below, we compared Assembly Theory (AT) and its molecular assembly (MA) index against these and other more sophisticated algorithms, showing that neither AT nor MA offer any particular advantage and are in fact suboptimal, both theoretically and in practice, at separating living from non-living systems, using their own data and taking at face value their own results.

Figure taken from our paper (v2), showing how algorithms like Huffman coding or LZ77 do what the authors meant to do (count copies) but failed to do (later you can also see how Huffman coding performs comparably to or better than AT at classifying their own molecules, without recourse to any structural data). This is a classical word problem in mathematics, often a first-year problem in a computer science degree, solvable with a very simple algorithm like LZ77 or Huffman coding as implemented by a finite automaton (the authors mock Turing machines because they find the model too simple, yet their measure runs on a strictly weaker version of a Turing machine, a finite automaton, and assumes life can be defined by processes of life that behave that way). Sub-figure C was taken from https://codeconfessions.substack.com/p/lz77-is-all-you-need; a simple Google search retrieves dozens of examples showing how LZ77 parses the word abracadabra, and from such a parse a dictionary-based graph like the one in B (optimal) or A (suboptimal, from AT) can be derived. In other words, neither the measure, nor the data, nor the application can be considered new or as making any contribution.

When we say that AT and MA are suboptimal compared to Huffman or LZW, we do not mean that we expect Assembly Theory to be an optimal compression algorithm (as the authors pretended we were suggesting, in a straw-man attack). LZW and Huffman coding are not optimal statistical compressors in general, but they are optimal at doing what Assembly Theory claimed it was doing. This is another point the authors seem to get wrong, naively or viciously repeating ad infinitum that AT and MA are not compression algorithms, hoping that such a claim will make them immune to this criticism.

In no way do we expect Assembly Theory to be like AIT in attempting to implement optimal lossless compression. In other words, the above (and the rest of the paper and post) compares Assembly Theory to one of the most basic coding algorithms for counting copies, a method every compression algorithm has taken for granted since the 1960s but nothing else. The bar is thus quite low, to begin with.

Given that the authors seem to take 'compression' as a synonym of AIT, we have decided to replace the term 'compression' with 'coding' where appropriate (in most cases) in the new version of our paper (https://arxiv.org/abs/2210.00901), so that the authors, and readers, know that we are talking about the properties attributed to AT and MA in other algorithms, regardless of whether those algorithms are seen as, or have been used as, compression algorithms.

Ultimately, their (molecular) assembly index (MA) is an upper bound on the algorithmic complexity of the objects it measures, including molecules, notwithstanding the scepticism of the proponents of Assembly Theory. Hence MA is, properly speaking, an estimation of algorithmic complexity, even if a basic or suboptimal one compared to other available algorithms.
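The reasoning is standard (assuming, as the AT authors themselves do, that an assembly pathway losslessly reconstructs the object): for any lossless, computably decodable encoding $E$,

$$K(x) \le |E(x)| + c_E,$$

where the constant $c_E$ depends only on the decoder and not on $x$; an assembly-pathway description is one such encoding, so its length upper-bounds $K(x)$ up to an additive constant.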

Fallacy 2: ‘We are not a compression algorithm,’ so Assembly Theory is immune from any criticism that may be levelled at the use of compression algorithms

Interestingly, from the point of view of computer science, Assembly Theory's molecular assembly index falls into the category of a compression algorithm for all intents and purposes, to the possible consternation of its proponents. This is because their algorithm looks for statistical repetitions (or copies, as they call them), which is what characterises any basic statistical compression algorithm.

Compression is a form of coding. Even if the authors fail to recognise or name it as such, their algorithm is, for all technical purposes, a suboptimal version of a coding algorithm that has been used in compression for decades. Even if they only wanted to capture (physical) 'copies' in data or a process (which is exactly what algorithms like RLE and Huffman do, and confine themselves to doing, albeit optimally), their rebuttal of our critique fails to recognise that what they have proposed is a limited special case of an RLE/Huffman coding algorithm, which means their paper introduces a simpler version of what was already considered one of the simplest coding algorithms in computer science.

By 'simple' we mean less optimal and weaker at what it is designed to do: the assembly index may, for some reason, disregard certain very basic types of 'copies' as nested in a pathway, and hence fail even to count (physical) 'copies' properly, something the Huffman coding algorithm does effectively, outperforming AT in practice too (see the figure below).

Fallacy 3: Assembly theory is the first (experimental) application of biosignatures to distinguish life from non-life

The claim to be the first to have done so is misleading. The entire literature from the complexity theory community is about trying to find the properties of living systems. The community has been working on identifying properties of living systems that can be captured with and by complexity indexes for decades, perhaps since the concept of entropy itself was formulated. Here is one from us published a decade before Assembly Theory: https://onlinelibrary.wiley.com/doi/10.1002/cplx.20388. The problem has even inspired models of computation, such as membrane computing and P systems, as introduced in the 1990s, that exploit nested modularity.

We could also not find any evidence in favour of the claim regarding the allegedly experimental nature of their index, given that all the other measures could separate the molecules as AT did, without any special experimental data and mostly on the basis of molecular nomenclature. Thus, the defence that the claims about their measure are experimentally validated does not make sense. The agnostic algorithms we tested, which should have served as control experiments in their original study, take the same input from their own data source and produce the same output (the same separation of classes) or better.

We have updated our paper online (https://arxiv.org/abs/2210.00901) to cover all their results reproduced by using measures introduced in the 1960s, showing how all other measures produce the same results or even outperform the assembly index.

Actually, we had already reported that nomenclature could drive most of the complexity measures (especially simple ones like AT and MA) into separating living from non-living molecules, which seems to be what the authors of Assembly Theory replicated years later. For example, in this paper published in 2018, we showed how organic and inorganic molecules could be classified using existing complexity indexes of different flavors, based on basic coding, compression, and AIT (a preprint is available here), predating the Assembly Theory indexes by four years.

In 2018, before Assembly Theory was introduced, we showed, in this paper, that complexity indexes could separate organic from inorganic molecules/compounds.
In 2018, in the same paper, we showed that repetitions in nomenclature would drive some complexity indexes and that some measures would pick up structural properties of these molecules/compounds. However, the authors of Assembly Theory disregarded the literature. They rehashed work in complexity science and information theory done over the last 60 years, which they failed to cite correctly (even when told to do so before their first publication). They have attracted much attention with an extensive marketing campaign and PR engine that most labs and less social-media-oriented researchers have no access to or cannot spare resources for.

We are convinced that it is impossible to define life by looking solely at the structure of an agent’s individual or intrinsic material process in such a trivial manner and without considering its relationship and exchange with its environment, as we explored here, for example, or here, where, by the way, we explained how evolution might create modularity (something the author of the Quanta article on Assembly Theory says, wishfully, that perhaps Assembly Theory could do).

What differentiates a living organism from a crystal is not the nested structure (which can be very similar) but the way the living system interacts with its environment and how it extracts or mirrors its environment’s complexity to its own advantage. So the fact that Assembly Theory pretends to characterise life by counting the number of identical constituents it is made of does not make scientific sense and may also be why Assembly Theory suggests beer is the most alive of the living systems that the authors considered, including yeast.

Fallacy 4: Surprise at the correlation

The authors say they were surprised and found it interesting that the tested compression algorithms, introduced in the 1950s, produce similar or better results than Assembly Theory, as we reported in https://arxiv.org/abs/2210.00901. This should not be a surprise, as those algorithms implement an optimal version of what the authors meant to implement in the first place, with the results conforming with the expectation implicit in the formal and informal specifications.

In this case, an effective counter to our argument would have to show that the correlation, although theoretically and empirically relevant from a statistical perspective, is not due to structural similarities within the measure itself (i.e., to the fact that the assembly index is a particular type of coding algorithm). However, the latter is obviously false. Furthermore, if a statistically significant classification task is performed with equal or greater capacity by another measure and method, then this fact per se requires further explanation as to why an equal or superior performance should be disregarded. Providing such an explanation would entail exposing the measures' foundational structural characteristics, bringing the reader face to face with the other fallacies in the text.

The authors must now explain how we could reproduce their main result with every other measure. They would have toned down their claims if they had performed basic control experiments. Neither the foundational theory nor the methods of Assembly Theory offer anything not explored decades ago with these other indexes that could separate organic and inorganic compounds just as MA and AT did.

Fallacy 5: Assembly Theory vs Turing machines (or computable processes)

The authors wrote:

“They have not demonstrated how those algorithms would manifest in the absence of a Turing machine, how those algorithms could result in chemical synthesis, or the implications of their claims for life detection. Their calculations do not have any bearing on the life detection claims made in Marshall et. al. 2021, or the other peer-reviewed claims of Assembly Theory. Despite the alternative complexity measures discussed, there are no other published agnostic methods for distinguishing the molecular artifacts of living and non-living systems which have been tested experimentally.”

Coming back again to the same straw-man fallacy, they seem to conflate a Turing machine with simplicity and proceed to disparage the model; they do not realise that their index is a basic coding scheme widely used in compression since the 1960s even if they claim that they do not wish to compress (which they effectively do), and thus is also related to algorithmic complexity as an upper bound. They proceeded to ask “how those algorithms would manifest in the absence of a Turing machine,” misconstruing our arguments and dismissing the results, saying they were ‘surprised’ by them. We could not make sense of this statement.

Later on, they say: “MA is grounded in the physics of how molecules are built in reality, not in the abstracted concept of Turing machines.” We could not make any sense of this either. We can only assume that they think we are suggesting that AIT implies that nature operates as a Turing machine. This is an incorrect implication. What AIT does imply is that a measure of algorithmic complexity captures computable (and statistical, as these are a subset) features of a process. There is nothing controversial about this: science operates on the same assumption, seeking mechanistic causes for natural phenomena. And if their objection is that AIT is typically defined in terms of Turing machines (an oversimplification of the concept of an algorithm), then by the same token they are implying that their assembly index assumes nature to operate as an even simpler and much sillier automaton, given that the assembly index can be defined in terms of, and executed by, a bounded automaton (an automaton more basic, equally mechanistic, and 'more simplistic' than a Turing machine). However, we are not even invoking any Turing-machine argument. The use of AIT only supports the logical arguments and a small part of the demonstration of the many issues undermining Assembly Theory.

Algorithms such as LZW, RLE, and Huffman do not make any particular ontological commitments. They can be implemented as finite automata, just as the assembly index can; they do not require Turing machines. This, again, shows a lack of understanding on the part of the authors of basic concepts in computer science and complexity theory. In other words, if we were to construct a hierarchy of simplicity, with the simpler machines being inferior, their assembly index would occupy a very lowly position in the food chain of automata theory, counting as one of the simplest algorithms possible, one that does not even require the power of sophisticated automata like Turing machines or a general-purpose algorithm.

The claim that a Turing machine is an abstract model unable to capture the subtleties of their measure defies logic and comprehension because their measures can run on an automaton significantly simpler than a Turing machine; it does not even require the power of a universal Turing machine. And ultimately, we could not find their measure to be grounded in physics or chemistry, despite their suggestion that this is what makes their measure special. For example, when they say that “the goal of assembly theory is to develop a new understanding of evolved matter that accounts for causal history in terms of what physical operations are possible and have a clear interpretation when measured in the laboratory” and “assembly spaces for molecules must be constructed from bonds because the bond formation is what the physical environment must cause to assemble molecules,” they disclose the limitations of Assembly Theory, because it cannot handle other more intricate environmental (or biological) catalytic processes that can increase the odds of a particular molecule being produced as a by-product, while other more capable compression methods can. What they do accept is that their algorithm counts copies, and as such, their algorithm can run on a very simple automaton of strictly less power than a Turing machine.

This misunderstanding is blatantly evinced in the rebuttal’s passage in which they call the process of generating a stochastically random string an “algorithm.” The authors fail to distinguish between the class of computable processes and the class of the algorithms run with access to an oracle (in this case, access to a stochastic source). Then, to construct a counterexample against our methods, they implicitly assume that the generative process of the string belongs to the latter class while the coding/compression processes for the string belong to the former class. Despite these basic mistakes, the authors later argue that this is one of the reasons that AIT fails to capture their notion of “complete randomness”, with Assembly Theory being designed to do just that. These mistakes suggest an en passant reading of the theoretical computer science and complex systems science literature (see also Fallacy 6). Similarly, their rebuttal claims that our results cannot handle the assembly process as their method can. However, their oversimplified method that generates the assembly index is feasibly computable (and advertised as such by the authors) and can easily be reproduced or simulated by decades-old compression algorithms, let alone by other more powerful computable processes.

In most of our criticism, the use of a Turing machine is irrelevant, regardless of what the authors may think of Turing machines. Their assembly index, RLE, and the Huffman coding do not require any description or involvement of a Turing machine other than the fact that they can all be executed on a Turing machine. This holds because our empirical results do not require AIT (see Fallacies 2 and 4), and our theoretical results do not require Turing machines.

Note that even if completely different from an abstract or physically implemented Turing machine, a physical process can be either computable, capable of universal computation, or both. They are trying to depict the proofs against their methods as if our position was that physical or chemical processes are Turing machines, which makes no sense (here, they seem to be employing a type of clichéd-thinking fallacy). Moreover, they ignore the state-of-the-art ongoing advances in complexity science on hybrid physical processes, that is, processes that are partially computable and partially stochastic.

The authors also fail to see that any recursive algorithm (like theirs) is equivalent to a computer program running on a Turing machine or to a Turing machine itself, including the methods of Assembly Theory, and thus the use of algorithms or Turing machines to make a point is irrelevant. The only option for making sense of their argument is to assume they believe that their Turing-computable algorithm can capture non-Turing computable processes by some mystical or magical power.

Fallacy 6: Assembly index has certain “magical” properties validated experimentally, including solving a problem that it was demonstrated decades ago could not be solved by the likes of it.

The authors claim that unless we create a molecule that the assembly index fails to characterise, we cannot disprove their methods. In addition to being an instance of the appeal-to-ignorance fallacy, this misrepresents the core of our arguments by ignoring what our results imply. As also discussed under Fallacies 2 and 4, we showed that the assembly measure can be replaced by simple statistical compression measures that do a similar or better job, both at capturing their intended features and at classifying the data (the alleged biosignatures), while also offering better (optimal) foundations. We have also proposed and tested computable approximations to semi-computable measures and proven them to be better than their index. Even these computable versions are better than Assembly Theory, both in principle and in practice.

Furthermore, they presented a compound as a counter-example that they modestly called Marshallarane (after the lead author of their paper), defined as “[producing] a molecule which contains one of each of the elements in the periodic table.” This is an instance of a combined straw-man and false analogy fallacy, in which an oversimplified explanatory or illustrative example that is supposed to elucidate the target argument is employed to concentrate the discussion around that oversimplified example.

If the authors find it an advantage that nothing can be said about a compound like 'Marshallarane' (which I'd rather call Arrogantium or Gaslightium), then RLE, Huffman, and indeed any basic Shannon entropy-based algorithm qualify equally, since they would not be able to find any copy or repetition either, as we explain in depth in this paper. In fact, the RLE and Huffman coding schemes are limited and among the first and simplest forms of coding schemes that 'count copies'. This would be different from algorithmic complexity, but it means that we do not need algorithmic complexity to perform like MA, so there is no need to invoke algorithmic complexity if we can replace Assembly Theory and its molecular assembly index with simplistic algorithms such as RLE or Huffman (as introduced in the 1960s) that appear to outperform Assembly Theory itself.
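To make the point concrete, here is a minimal sketch (illustrative only, with a string of distinct letters standing in for a 'one of each element' compound): when nothing repeats, the empirical Shannon entropy is maximal and any copy-counting coder, the assembly index included, has nothing to work with.

```python
import math
from collections import Counter

def entropy_bits_per_symbol(s: str) -> float:
    """Empirical Shannon entropy; it is maximal when every symbol occurs
    exactly once, i.e. when there are no copies to count at all."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

no_copies = "".join(chr(65 + i) for i in range(26))  # one of each 'element'
repetitive = "AB" * 13                               # the same pair copied over and over
print(entropy_bits_per_symbol(no_copies), math.log2(26))  # both about 4.70 bits/symbol
print(entropy_bits_per_symbol(repetitive))                # 1.0 bit/symbol
```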

This strategy fails to address our criticism because such a counterexample of building a molecule first poses a contradiction internal to their own proposition. They have provided a Turing-computable process to build their new compound since, according to the authors, “describing a molecule is not the same as causing the physical production of that molecule. Easily describing a graph that represents a molecule is not the same as easily synthesising the real molecule in chemistry or biology”. Yet, their algorithm requires a computable representation that is no more special than any other computable representation of a process. In addition to possibly being an instance of a clichéd-thinking fallacy, here again, given the use in the literature of the term ‘descriptive complexity’ to refer to algorithmic complexity, this is in contradiction to their own claim that Turing machines are not an appropriate abstract model — because it is exactly the mechanistic nature of processes that a Turing machine would be able to emulate or simulate, a process which lies at the heart of AIT. This passage shows a total misunderstanding of AIT, Turing machines, and internal logical coherence (see also Fallacy 5). Our own research on Algorithmic Information Dynamics is concerned with causal model discovery and analysis based on the principles of AIT. Still, the authors present AIT as oblivious to causality and advance their oversimplified, weak, and suboptimal algorithm as able to capture the subtleties of the physical world.

Such an argument also fails because it distorts or omits, in a straw-man manner, the crucial part of our theoretical results that there are objects or events (e.g., a molecule) that satisfy their own statistical criteria for calculating statistical significance and distinguish pure stochastic randomness from constrained (or biased) assembling processes. In other words, there is a computable process that results from a fair-coin-toss stochastically random outcome that satisfies their own statistical criteria that one may employ to check whether or not the resulting molecule sample frequency is statistically significant; and that satisfies their own mathematical criteria for distinguishing random events (in their own words, those with a “copy number of 1”) from non-random events (in their own words, “those with the repeated formation of identical copies with sufficiently high MA”).

The authors indeed appear to suggest that their index has some “magical” properties and is the only measure that can capture and tell apart physical or chemical processes from living ones. For example, their rebuttal of our critique employs the argument that it can tell apart physical or chemical processes from living ones because it “handles” the problem of randomness differently from AIT. Moreover, when they say that “using compression alone, we cannot distinguish between complete randomness and high algorithmic information,” such a claim already contradicts, even in an at-first-glance reading, the fact that the assembly index can be employed as an oversimplified compression process. In any event, the difference between Assembly Theory and AIT certainly and trivially cannot lie in how randomness is handled because Assembly Theory does not handle randomness at all, as it is in sheer contradiction to what is mathematically defined as randomness.

The formal concept of randomness was only established in the mathematical literature after a long series of inadequate definitions and open problems that exercised the ingenuity of mathematicians, especially in the last century; these decades-old problems (inadvertently or not) already include the statistical and compression method of Assembly Theory among the proven cases in which randomness fails to be mathematically characterised. An object is random when it passes every formal-theoretic statistical test one may possibly devise to check whether or not it is somehow 'biased' (that is, more formally, whether it has some distinctive property or pattern preserved at the measure-theoretic asymptotic limit). However, there are statistical tests (as shown in our paper) for which an event satisfies Assembly Theory's criteria for 'randomness' but does not pass those tests. In other words, Assembly Theory fails even by the most intuitive notion of randomness, in that there are objects whose subsequent constituents are less predictable (more random) according to Assembly Theory than they actually are.

Contrary to their claims, randomness is synonymous with maximum algorithmic information. The real scientific debate that the authors fail to grasp — and is very unlike what they propose as research — has to do with complexity measures for complex systems with intertwinements of both computable and stochastic dynamics, as discussed under Fallacy 5, systems which the results in our critique already showed Assembly Theory failed to measure even as well as other well-known compression methods.

Fallacy 7: “Many groups around the world are using our [assembly] index”

An instance of the bandwagon fallacy. The authors claim that several groups around the world are working on Assembly Theory. If this is the case, it does not invalidate our criticism but makes it more relevant. The fact that the authors have published their ideas in high-impact journals also corroborates (if only anecdotally) the ongoing and urgent concern in scientific circles about current scientific practice: how biased the peer-review process is in its tendency to value social and symbolic capital; how behaviour once considered inappropriate is now rewarded by the social-network dynamics shaping scientific dissemination; and how the fancy university titles of corresponding authors may play a role in dissemination, to the detriment of science.

In point of fact, today, there are definitely more groups working in information theory and AIT (which Cronin calls ‘a scam’, see his Twitter post below) than there are groups working on Assembly Theory. And as of today, despite the high media profile of Leroy Cronin, methods based on AIT (a field in which one of his collaborators, Sara Walker, has been productive), including our own, have way more citations and are used by many more groups than Assembly Theory.

That a small army of misled and underpaid postdocs in an over-funded lab led by an academic highly active on social media has managed to make other researchers follow their lead is not that surprising. But it is our duty to inform less well-informed researchers that they may have been misled by severely unsubstantiated claims, naively or viciously. If their work had been ignored, we would not have invested so much time and effort debunking it.

Fallacy 8: “100 molecules only”

Let us turn now to the claim that we used only 100 molecules and reached conclusions revealing a misunderstanding of what they did. In their supplemental information, they state that the 100 molecules in their Fig. 2 constitute the main test by which MA establishes the chemical space on the basis of which they can distinguish biological samples when attempting to detect life. These molecules are, therefore, the very ones they used, and defended the use of, when confronted with reviewers who pointed out the weaknesses of their paper. The authors mislead the reader by claiming that the 100-molecule experiment has little or nothing to do with their main claims and that we have therefore misunderstood their methods and results. This is false; their own reviewers were concerned that their claims were entirely based on tuning their measure over these 100 molecules. We simply replicated their experiment with proper controls and found that other computable statistical measures (not only AIT, therefore, but classical information-coding measures) were equivalent to or better than Assembly Theory.

It does not follow, as the authors argue, that our empirical findings (shown in Fig. 2 of the first version of our paper) fail to undermine their Fig. 4 results simply because, according to them, Fig. 2 is only how they 'calibrated' their index and set the chemical space. In any case, we have updated our paper online to cover Fig. 4 too, showing that all the other measures reproduce the same results or even outperform the assembly index, so that every one of their results and figures has now been covered.

Their paper was not reproducible, which is one of the reasons we had not originally covered Fig. 4. We have therefore taken the molecular assembly values in their own plot as given. Even granting their results, they are inferior to those obtained in all control experiments: every other measure tested produced similar or better results, with some effortlessly outperforming their index.

Notice that they had claimed that the central hypothesis and framework for computing MA were built upon the data and validation whose results are shown in Figs. 2 and 3, i.e. the 100 molecules we originally used in our critique. Had that step failed, their hypothesis would have failed on its own terms. The reference to Fig. 4 therefore served only to distract readers from the reported issues, because the original 100-molecule experiment was the basis for their final results.

They used the 100 molecules to set the chemical space and validated it against the larger database of molecules. They claimed multiple times that complex real biological samples, such as E. coli extract or yeast extract, are just complex mixtures of some of the molecules from which they built the chemical space (i.e., the 100 molecules).

MAIN EMPIRICAL RESULT

Finally, we provide the main figure showing that other measures, both simple and more sophisticated, reproduce the molecular assembly results at separating biological molecules, using the same data their index uses, while performing comparably to or outperforming Assembly Theory.

It reproduces their results in full using traditional measures, some of which the main author of Assembly Theory has called 'a scam', showing that their assembly index does not add anything new, that their algorithm does not display 'special properties' for counting 'physical' copies, and that, in fact, some measures even outperform it when their results are taken at face value (we were unable to reproduce them). It is thus fatal to the theory.

The new version of our paper, available online, includes this figure, which incorporates all the molecules/compounds in the original Assembly Theory paper. The authors previously complained that we had not replicated all their results with decades-old simplistic algorithms (or outperformed them with simple and more sophisticated ones). This figure now does so, and shows that what the main author has called a scam ('algorithmic complexity'; see the image below) reproduces the same results as their allegedly novel and ground-breaking concept, built on the methodologically ill-defined and fundamentally weak framework of Assembly Theory.
Table taken from our paper showing how other measures trivially outperform AT and MA. The authors of Assembly Theory have never offered any control experiments comparing their indices to other measures. The table shows that other complexity measures perform comparably or better, and that no claim to 'physical' processes is justified, because with and without structural data their results are reproducible with trivial algorithms that have been known for decades.

In summary

Readers should judge for themselves, by reading the papers and the rebuttals, whether or not the authors' fallacious arguments can be gotten around. Their response to our critique does not address our main conclusions. It tries, we hope unintentionally, to distract the reader from the results with statements about their work's empirical assumptions or putative theoretical foundations. Contrary to their claim that "[our] theoretical and empirical claims do not undermine previously published work on Assembly Theory or its application to life detection," our arguments seriously undermine Assembly Theory, and indeed seem fatal to it.

The assembly index can be replaced by simple statistical coding schemes that are truly agnostic, having been designed with no particular purpose in mind. They do a similar or better job at capturing the features Assembly Theory is meant to capture and, in practice, at classifying the data (the alleged biosignatures), while also offering better foundations (optimality at counting copies), even without recourse to more advanced methods such as AIT (which also reproduce their results, e.g. 1D-BDM).
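To make concrete what 'optimality at counting copies' means here, below is a minimal, unoptimised LZ77-style parser (a sketch in the spirit of the LZ family, not the exact scheme analysed in our paper). It greedily replaces the longest previously seen block with a back-reference, so the number of tokens it emits is literally a count of the copies, plus new symbols, needed to rebuild the object; schemes of this family are provably asymptotically optimal for stationary ergodic sources, which, as argued in our paper, a hand-rolled assembly index is not.

```python
def lz77_parse(s: str, window: int = 4096):
    """Greedy LZ77-style parse: emit ('copy', offset, length) for the
    longest match against already-seen text, or a literal character."""
    tokens, i = [], 0
    while i < len(s):
        best_len, best_off = 0, 0
        start = max(0, i - window)
        # Search the window for the longest match starting at position i.
        for j in range(start, i):
            k = 0
            while i + k < len(s) and s[j + k] == s[i + k]:
                k += 1
                if j + k >= i:      # keep the source inside seen text
                    break
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len >= 2:
            tokens.append(("copy", best_off, best_len))
            i += best_len
        else:
            tokens.append(("lit", s[i]))
            i += 1
    return tokens

# A repetitive string needs very few tokens (copies dominate), which is
# the sense in which LZ-style schemes already 'count copies'.
print(len(lz77_parse("ABCABCABCABCABCABC")))   # few tokens, mostly copies
print(len(lz77_parse("AQZPXMWKRTBYUNEOCIV")))  # one token per symbol
```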

The animus of the senior author toward one of the core areas of computer science suggests a lack of understanding of some of the basics of computer science and complexity theory (some replies to his Tweet even pointed out that his own index was a weak version of AIT, to which the author never replied). Coming from an established university professor, dismissing algorithmic complexity measures as 'a scam' (see the image from Twitter below), with no reasons given even when asked to elaborate, is dismaying. All the more so since some of the finest work of one of the original paper's main co-authors, Sara Walker, is serious work in the area of algorithmic probability. The authors fail to realise that, for all foundational and practical purposes, Assembly Theory and its methods are a special, weak case of AIT and only work at all because of the principles of algorithmic complexity. AIT is the theory that underpins their Assembly Theory, whose index is an (unintended and suboptimal) upper bound on algorithmic complexity.
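To spell out the sense in which any such scheme is a special case of AIT, here is the standard relation, stated informally rather than as it appears in any particular paper: any computable lossless encoding, including one read off an assembly-style construction pathway, can only ever upper-bound algorithmic (Kolmogorov) complexity.

```latex
% For any computable lossless compressor C (LZ, Huffman, RLE, or an
% encoding derived from an assembly pathway), a fixed decompressor plus
% the code C(x) is a program that prints x, hence, for every string x,
K(x) \;\le\; |C(x)| + c_C ,
% where the constant c_C depends only on the choice of C, not on x.
```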

On behalf of the authors,

Dr. Hector Zenil
Machine Learning Group
Department of Chemical Engineering and Biotechnology
University of Cambridge, U.K.

The following are our honest answers to the FAQs that the authors have published on their website (http://molecular-assembly.com/Q%26A/):

Second Q&A from the original authors led by L. Cronin

Honest answer: It is a simplistic index of pseudo-statistical randomness that counts 'copies' of elements in an object. Surprisingly, the authors claim that it captures key physical processes underlying living systems. Indeed, according to the authors, counting the copies an object is made of reveals whether it is alive or not. The idea that such a simplistic 'theory of life' can characterise a phenomenon as complex as life is naive at best.

In our opinion, no definition of life can disregard the agent's environment, as life is about interaction with the environment. A simplistic intrinsic measure like the assembly index would astonish scientists past and future if it were actually capable of the feats its authors believe it capable of (the senior author even claims to be able to detect extraterrestrial life). However, we have shown that it performs no better than other, better-defined simple coding measures introduced decades ago, and that their measure is ill-defined and suboptimal at counting copies (see https://arxiv.org/abs/2210.00901).

We think that the key to characterising life lies in the way living systems interact with their environments, and that any measure that does not consider this state-dependent variable, which is the basis of, for example, Darwinian evolution, will fail. We have published several papers in this area featuring content that Lee Cronin considers 'a scam' (see below), one of them co-authored with Sara Walker, one of the senior authors of the Assembly Theory paper with Cronin (https://www.nature.com/articles/s41598-017-00810-8).

Third Q&A from the original authors led by L. Cronin

Honest answer: Despite the veneer of rigour in the answer, the assembly index is a weak (and unattributed) version of a simplistic coding scheme widely used in compression, introduced decades ago and mostly regarded today as a toy algorithm. As such, and given that all compression algorithms are measures of algorithmic complexity, it is an algorithmic complexity measure. To the authors' misfortune, their assembly index amounts to the most simplistic statistical compression algorithm known to computer scientists today (see https://en.wikipedia.org/wiki/Huffman_coding).

Their original paper is full of, perhaps unintended, fake rigour. The authors even included a proof of computability of their 'counting-copies algorithm', which nobody would doubt is trivially computable. Nobody bothers to prove that algorithms like RLE or Huffman coding are computable, because their computability is trivial: they can be implemented with elementary procedures that always terminate. Even assuming their proof is right, it was unnecessary. What they have shown, rather, is that they are on some sort of straw-man crusade against AIT, in an apparently desperate attempt to show how much better their approach is than AIT (without even comparing themselves to other, trivially computable measures).
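As a reminder of just how elementary this machinery is, here is Huffman coding in a few lines (a textbook sketch, not anyone's research code); its total code length is exactly the kind of frequency-based score at issue, and its computability is evident from the fact that the code terminates on any finite input.

```python
import heapq
from collections import Counter

def huffman_code_lengths(s: str) -> dict:
    """Return a {symbol: code length in bits} map via Huffman's algorithm."""
    freq = Counter(s)
    if len(freq) == 1:                      # degenerate single-symbol case
        return {next(iter(freq)): 1}
    # Heap items: (frequency, tie-breaker, {symbol: depth so far}).
    heap = [(f, i, {sym: 0}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

def huffman_total_bits(s: str) -> int:
    """Total compressed length in bits: a purely frequency-based score."""
    freq, lengths = Counter(s), huffman_code_lengths(s)
    return sum(freq[sym] * lengths[sym] for sym in freq)

print(huffman_total_bits("AAAAABBBC"))   # skewed frequencies: shorter code
print(huffman_total_bits("ABCABCABC"))   # uniform frequencies: longer code
```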

Fourth Q&A from the original authors led by L. Cronin

Honest answer: Yes. Their second sentence indicates that their algorithm tries to minimise the number of steps after compression by looking at how many copies an object is made of (just as Huffman coding does; for an example, see https://www.bbc.co.uk/bitesize/guides/zd88jty/revision/10), but AT and MA do this suboptimally. Not only is it therefore a frequency-based compression scheme; it is, unfortunately for them, a bad one (see https://arxiv.org/abs/2210.00901).

To their surprise, perhaps, almost everything is related to compression. In AI, for example, the currently most successful models, based on what are called Transformers (e.g. ChatGPT), are increasingly understood to perform well because they reduce, or compress, the high dimensionality of large training data sets. Science itself is about compressing observed data into succinct mechanistic explanatory models.

Fifth Q&A from the original authors led by L. Cronin

Honest answer: The authors say this has nothing to do with algorithmic complexity, yet their answer is almost the definition of algorithmic complexity and of algorithmic probability, which measures the likelihood that a process produces other simple processes (including copies). Unfortunately, the assembly index is such a simplistic measure that complexity science left it behind in the 1960s, or rather absorbed it as one of the most basic ingredients that literally any compression and complexity measure takes into account. These days it is taken for granted and used in exercises for first-year Computer Science students.
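For readers unfamiliar with the term, this is the standard notion being alluded to, stated informally (the precise version requires a prefix-free universal machine U): the algorithmic probability of an object sums the weight of every program that produces it, and Levin's coding theorem ties it back to algorithmic complexity.

```latex
% Algorithmic probability of a string x under a universal machine U:
m(x) \;=\; \sum_{p \,:\, U(p) = x} 2^{-|p|}
% Levin's coding theorem: the most probable outputs are, up to an
% additive constant, exactly the simplest ones.
K(x) \;=\; -\log_2 m(x) + O(1)
```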

Update (14 May 2023): How to fix Assembly Theory?

One can anticipate and brace for the next move of the group behind AT, the next publicity stunt backed by a well-oiled marketing engine: the announcement that they have created artificial life in their lab, as measured by this simplistic assembly index (which we believe is their original motivation) that classifies even a crystal as an instance of life, and beer as an exemplary instance (according to their own experimental results).

AT is unfixable in some fundamental ways, but the authors have invested so much effort and their own credibility that they are unlikely to back down. In the spirit of constructiveness, and having discussed this with a group of colleagues (extending beyond the authors of our paper and blog post), we can get creative and see how to make AT more relevant. Here we suggest a way to somehow ‘patch up’ AT:

  1. Adopt an optimal method for what Assembly Theory originally set out to do, namely counting nested copies. It could start with Huffman encoding or any other established coding measure, up to measures from AIT. They can explore weaker versions by relaxing assumptions if they want to generate different interpretations (and duly credit the right people when exploring borrowed ideas, such as the ideas of the 'number of steps' and 'causally connected processes' taken from Logical Depth). Things like counting the frequency of molecules would then already be taken into account, as shown in our own papers and others'.
  2. Drop the claim that 'physical' bonds, reactions or processes can only be captured by AT; it makes no sense. 'Physical' or not, everything is physical, or nothing is. Whatever material they feed their assembly index with must be symbolic and computable, as their measure is an algorithm that takes a piece of data containing a representation of 'physical' copies, as in chemical nomenclature (e.g. InChI) or structural distance matrices.
  3. Drop the assumption that bonds or chemical reactions happen with equal probability. To begin with, this is not the case: depending on the environment, each reaction has a different probability of occurring. In any case, algorithmic probability already induces a simplicity bias, encoded in the universal distribution (see this paper based on and motivated by our own extensive work, and the sketch after this list).
  4. Factor in the influence of the environment, which changes the probability distribution over which physical or chemical steps are likely to occur. This is the state-dependency we defend above, which no measure can ignore, and which is the chemical basis of biological evolution.
  5. In an ever-changing environment, any agent (physical/chemical process) would need to adapt, which is the precondition for particular chemical reactions occurring. It is the internal dynamics of this relationship that we know to be the hallmark of life. After all, amino acids can readily be found on dead asteroids; not many people would call that, or indeed beer (as AT does), life.
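As a toy illustration of points 3 to 5 (the sketch promised above; the step names and weights are entirely hypothetical and not a chemical model), the snippet below replaces the equal-probability assumption with environment-dependent weights, so that the probability of any given assembly step changes with the state of the environment.

```python
def step_probabilities(reactions: dict, environment: dict) -> dict:
    """Toy state-dependent distribution over possible next steps.

    `reactions` maps a step name to a baseline propensity; `environment`
    maps the same names to a context-dependent weight (temperature,
    catalysts, available substrates). All values here are hypothetical.
    """
    weights = {name: base * environment.get(name, 1.0)
               for name, base in reactions.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Hypothetical steps and two hypothetical environments.
reactions = {"bond_AB": 1.0, "bond_AC": 1.0, "copy_block": 1.0}
cold_env = {"bond_AB": 0.2, "bond_AC": 1.0, "copy_block": 5.0}
hot_env  = {"bond_AB": 3.0, "bond_AC": 1.0, "copy_block": 0.5}

for env_name, env in [("cold", cold_env), ("hot", hot_env)]:
    probs = step_probabilities(reactions, env)
    # The same step has very different probabilities in different
    # contexts, which is the state-dependency an intrinsic copy count ignores.
    print(env_name, {k: round(v, 2) for k, v in probs.items()})
```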

For AT to work, therefore, the consensus among colleagues seems to be that these steps would need to be addressed before anyone suggests any 'validation' using spectrometry data and before putting a complexity number on a molecule and calling it a good measure for detecting life. This is not Assembly Theory, nor what the authors of AT did; it is what others in the field have long been doing, with both breakthroughs and incremental progress.

Cited screenshots

The original rebuttal to our paper (https://arxiv.org/abs/2210.00901) from the authors of Assembly Theory (for reference in case it changes or is taken down in the future)

Appendix

Let us address the entirely unwarranted reservations that the authors of Assembly Theory (and many more uninformed researchers) seem to have about classical models of computation. While researchers from Assembly Theory seem to disparage anything related to the foundations of computer science, such as the model of Turing machines and algorithmic complexity, other labs, such as the Sinclair Lab at Harvard, have recently reported surprising experimental epigenetic aging results deeply connected to information and computer sciences. Says Prof. David Sinclair: “We believe it’s a loss of information — a loss in the cell’s ability to read its original DNA so it forgets how to function — in much the same way an old computer may develop corrupted software. I call it the information theory of aging.” https://edition.cnn.com/2023/01/12/health/reversing-aging-scn-wellness/index.html

While we are not embracing a supremacist view of AIT or Turing machines, we have no reason to disparage the Turing machine model. Very eminent scientists, such as Sydney Brenner (Nobel Prize in Physiology or Medicine), believe not only that nature can be expressed and described by computational systems but that the Turing machine is a fundamental and powerful analogue for biological processes, as Brenner argues in his Nature piece entitled “Life’s code script” (https://www.nature.com/articles/482461a), ideas that Cronin and his co-authors seem to mock. According to Brenner, DNA is a quintessential example of a Turing machine-like system in nature, at the core of terrestrial biology, underpinning all living systems.

As for the importance of AIT (algorithmic complexity and algorithmic probability) in science: in this video of a panel discussion led by Sir Paul Nurse (Nobel Prize in Physiology or Medicine, awarded by the Karolinska Institute) at the World Science Festival in New York, Marvin Minsky, considered a founding father of AI and one of the most brilliant scientists of his time, expressed his belief that algorithmic probability was probably the most important human scientific achievement, urging scientists to devote their lives to it. He said this sitting next to Gregory Chaitin, fellow panellist and one of the founders of AIT and algorithmic probability (and thesis advisor to Dr. Zenil, the senior author of the paper critiquing Assembly Theory).
