Scientific discovery is one of the engines of human and societal progress. Scientists spend years learning, devising hypotheses, running experiments, drawing conclusions, and publishing results, all in pursuit of advancing the state of the art. What if a machine were able to do science, only much faster?
This is exactly what a recent study has demonstrated in the field of materials science: researchers developed an AI system that was able to identify previously unknown compounds.
How do they know that their system actually works? The ingenious solution was to metaphorically go back in time and test whether the AI system could predict discoveries that actually happened. For instance, if a new compound was discovered in 2018, they fed the system only data published before that year. A virtual time machine of sorts.
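The idea of this "virtual time machine" can be sketched in a few lines of code. The snippet below is a toy illustration, not the study's actual pipeline: the corpus, compound names, and cutoff year are all made up.

```python
# Toy sketch of the backtesting idea: train only on publications
# that appeared strictly before a cutoff year, then check whether
# the model would have flagged a later discovery.
# All data here is hypothetical, for illustration only.

def backtest_corpus(corpus, cutoff_year):
    """Keep only publications dated strictly before the cutoff year."""
    return [text for year, text in corpus if year < cutoff_year]

corpus = [
    (2015, "compound A shows promising thermoelectric behavior"),
    (2016, "compound B studied for photovoltaic applications"),
    (2018, "compound C discovered as a new thermoelectric material"),
]

# Train only on the pre-2018 literature; the 2018 discovery is held out
# and used to verify the model's prediction after the fact.
training_texts = backtest_corpus(corpus, 2018)
print(len(training_texts))  # 2
```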
In more detail, the system uses the text of scientific publications to create embeddings: high-dimensional representations of words that make it possible to compute how closely their meanings relate. If two molecules are described as having similar characteristics, the embedding places them close to each other in this high-dimensional space, even when the descriptions appear in different publications (see the examples in the figure below, where the high-dimensional space is flattened to 2D).
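"Close to each other" is typically measured with cosine similarity between embedding vectors. The sketch below uses invented three-dimensional vectors rather than embeddings from the study; in practice the vectors would come from a model trained on the publication text.

```python
import numpy as np

# Toy embeddings: the vectors below are made up for illustration,
# not taken from the study. Real embeddings have hundreds of dimensions
# and are learned from the text of the abstracts.
embeddings = {
    "material_A": np.array([0.9, 0.1, 0.3]),   # described as similar to B
    "material_B": np.array([0.8, 0.2, 0.35]),
    "material_C": np.array([0.1, 0.9, 0.2]),   # described very differently
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Materials described with similar words end up close in the space.
sim_ab = cosine_similarity(embeddings["material_A"], embeddings["material_B"])
sim_ac = cosine_similarity(embeddings["material_A"], embeddings["material_C"])
print(sim_ab, sim_ac)  # sim_ab is much larger than sim_ac
```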
By representing molecules through their context words, it is possible to create a map in which the materials cluster by application, such as photovoltaics and organics.
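Producing such a map means projecting the high-dimensional vectors down to 2D, for example with PCA. Here is a minimal sketch using synthetic random vectors (the real study's figures use its learned embeddings and, typically, a dedicated projection method).

```python
import numpy as np

# Flatten high-dimensional embeddings to 2D with PCA via SVD.
# The 50-dimensional vectors below are synthetic placeholders.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))           # 10 materials, 50-dim embeddings

X_centered = X - X.mean(axis=0)          # center before PCA
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
coords_2d = X_centered @ Vt[:2].T        # project onto top 2 components

# Each row is now an (x, y) point that can be plotted on the map.
print(coords_2d.shape)  # (10, 2)
```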
How did the scientists go from this representation to finding yet-to-be-discovered compounds? By using the power of abstract representation: both the text describing the molecules and the text describing the applications can be represented in the same high-dimensional space, which means that their similarity can be measured simply by computing their distance in that space. The image below shows the result as shades of purple: the more purple the point, the more suitable the material is for the given application.
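The discovery step above amounts to ranking candidate materials by how close their vectors sit to an application's vector in the shared space. The sketch below shows the idea with invented vectors; the application vector and compound names are placeholders, not data from the study.

```python
import numpy as np

# Rank candidate materials by cosine similarity to an application
# keyword's vector in the shared embedding space.
# All vectors and names here are invented for illustration.
application = np.array([1.0, 0.0, 0.5])   # e.g. an application keyword

candidates = {
    "compound_A": np.array([0.9, 0.1, 0.4]),
    "compound_B": np.array([0.0, 1.0, 0.1]),
    "compound_C": np.array([0.7, 0.3, 0.6]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Highest similarity first: these would be the most "purple" points.
ranked = sorted(candidates,
                key=lambda m: cosine(candidates[m], application),
                reverse=True)
print(ranked)  # ['compound_A', 'compound_C', 'compound_B']
```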