November 8, 2005 | Open Access

Architecture and Evolution of Organic Chemistry

Marcin Fiałkowski (Institute of Physical Chemistry), Kyle J. M. Bishop (Columbia University), Victor A. Chubukov (Northwestern University)

Abstract

Organic syntheses reported in the literature between 1850 and 2000 are analyzed at the simplified level of a connected network (see the picture of the network for 1850). Fundamental statistical laws that govern organic syntheses are established. These laws allow the estimation of the synthetic and industrial usefulness of organic molecules.

For almost two centuries, chemists all over the world have applied their expertise and creativity [1–5] to the synthesis of new molecules. Since each individual chemist, or collaborating group of chemists, tries to select unique synthetic targets [6–9] and to devise a maximally original and/or efficient method of making them, it might appear that the activities of such independent “agents” should be largely uncorrelated, and that no generalizations about the evolution of chemistry at large could be made. As we show here, however, there exist several statistical laws that describe how molecules are made and interconverted. We analyze organic synthesis at the level of an abstract network representation in which molecules correspond to nodes characterized by molecular masses, and reactions to directed edges connecting these nodes (Figure 1 a). We show, among other things, that the connections between the nodes form a time-evolving scale-free network [10–14] with a structure similar to that of the World Wide Web (WWW) [12, 15, 16], and that the masses of molecules in this network are governed by a single stochastic process [17, 18]. Aside from their fundamental interest, the trends we identify allow predictions of potential economic impact for the chemical industry: for example, how many molecules will be synthesized in the future, which molecular masses are more likely to be made or used as substrates, how the network connectivities can be used to assess a molecule's industrial importance, and how the equation describing the evolution of masses could help in designing fragment libraries for combinatorial chemistry.
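The network representation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: it follows the rules given in the paper (every reactant is linked to every product, duplicate reactions are counted once with their earliest date, and "half reactions" are dropped), while the record layout, the helper names `build_network` and `degrees`, and the toy reactions are all hypothetical.

```python
from collections import Counter
from itertools import product

def build_network(reactions):
    """Turn reaction records into directed edges (reactant -> product).

    `reactions` is an iterable of (reactants, products, year) tuples; this
    record layout is a hypothetical stand-in for real database entries.
    """
    earliest = {}  # (reactant set, product set) -> earliest publication year
    for reactants, products, year in reactions:
        if not reactants or not products:
            continue  # skip "half reactions" that lack reactants or products
        key = (frozenset(reactants), frozenset(products))
        earliest[key] = min(year, earliest.get(key, year))

    edges = []  # one (source, target, year) triple per reactant-product pair
    for (reactants, products), year in earliest.items():
        for source, target in product(sorted(reactants), sorted(products)):
            edges.append((source, target, year))
    return edges

def degrees(edges):
    """Count incoming (k_in) and outgoing (k_out) edges for every node."""
    k_out = Counter(source for source, _, _ in edges)
    k_in = Counter(target for _, target, _ in edges)
    return k_in, k_out

# Toy data invented for illustration; not taken from the Beilstein database.
toy_reactions = [
    (("A", "B"), ("C",), 1840),
    (("A", "B"), ("C",), 1835),  # duplicate: only the 1835 date is kept
    (("C",), ("D", "E"), 1848),
    (("E",), ("C",), 1850),      # second C-E edge, in the opposite direction
    (("D",), (), 1849),          # half reaction: ignored
]
edges = build_network(toy_reactions)
k_in, k_out = degrees(edges)
```

Keeping the edges in a plain list (rather than a set of node pairs) preserves multiple edges between the same two nodes, which the paper explicitly allows when two compounds take part in more than one reaction.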
Figure 1. a) Illustration of the conversion of a collection of chemical reactions into a directed-graph (network) representation. Chemical compounds A–E correspond to the nodes of the network characterized by molecular masses, m. Directed edges are assigned for a given reaction by connecting all reactants to all products. The connectivity of each node is described by the number of incoming arrows, k_in (e.g., k_in(C) = 2), and the number of outgoing arrows, k_out (e.g., k_out(C) = 4). The edges are characterized by the earliest publication date of the corresponding reaction. Note that multiple edges, of the same or different directions, can connect two nodes if these nodes are involved in more than one chemical reaction (e.g., C and E above). b) Visual representation of the giant connected component (GCC), that is, the largest subset of nodes that are connected to each other, of the organic-chemistry network in 1835. This network contains only 176 compounds; the size of a node corresponds to its total connectivity. Note the scale-free architecture in which most nodes connect to others through highly connected “hub” molecules. c) The GCC in 1850 contains 867 compounds. While the scale-free architecture remains unchanged, the complexity of the network has increased substantially within the elapsed 15 years. One can only imagine the network of organic chemistry in 2004, which contains approximately six million compounds! (The graphs were created using Pajek, a freely available software package for the visualization of large networks.)

Analysis was performed on data stored in the Beilstein database (BD) [19], which is the largest repository of organic reactions, containing (up to April 2004) 9 550 398 chemical substances and 9 293 250 reactions in which these substances participate.
In choosing BD, we adopted its well-established criterion for the classification of chemical substances as “organic” and its comprehensive coverage of the chemical literature dating back to 1779 (see the Supporting Information and reference 19 for details of BD). Although we stress that BD is not without omissions, it provides the single most complete description of organic chemistry and its evolution. Therefore, it seems reasonable to assume that statistical laws derived from the data it contains are indeed representative of organic chemistry in general, and we present them as such.

In the translation of organic synthesis into a network of chemical connectivity, each node represented a chemical compound characterized by its molecular mass (99.7 % of the compounds had mass data). Substances that participated in no reactions or acted only as catalysts or solvents were excluded from the network, thereby decreasing the number of “active” nodes to 5 957 807. Reactions between these chemicals were used to assign directed edges, each characterized by the year in which the reaction was published (99.7 % of the reactions had date information). Duplicate reactions (that is, reactions with identical reactants and products) were considered only once and characterized by the date of the earliest reaction. Reactions that lacked either reactants or products (that is, “half reactions”) were not included in the network. Stoichiometry and reaction yields were not considered, as they were reported for only a few percent of database entries. After parsing, the total number of reactions was decreased to 6 539 158. In the conversion of the set of chemical reactions into a directed network, all reactants were connected to all products by directed edges, as illustrated in Figure 1 a (see Supporting Information).

The starting point of our analysis was 1850, and since then both the numbers of molecules and the numbers of chemical reactions have increased exponentially.
The corresponding growth rates were constant within each of two periods: up to the beginning of the twentieth century (r_m = 0.083 year⁻¹ for molecules and r_r = 0.087 year⁻¹ for reactions) and afterwards (r_m = 0.044 year⁻¹ and r_r = 0.038 year⁻¹; Figure 2 a). At the same time, the average connectivity between molecules (Figure 2 b), defined as the number of edges divided by the number of nodes, initially increased, reached a maximum by about 1885, and then steadily decreased to a value of approximately 2 in 2004. It appears that the early days of chemistry were dominated by “wiring” existing molecules (presumably, to perfect/optimize known synthetic methodologies); when these methodologies matured, exploration of unknown structural space became the dominant activity.

Figure 2. a) The number of chemical compounds, reactions, and graph edges as a function of time. Although all three grow exponentially with time, the rate of growth changes near the turn of the twentieth century. b) Average connectivity (degree), defined as the number of edges divided by the number of nodes, as a function of time. In the nineteenth century, chemistry focused on “rewiring” existing compounds, whereas in the twentieth century, it has been dominated by the synthesis of new compounds. c) In- and out-degree distributions of molecules that comprise the chemistry network obey a power law p(k) = k^(−γ) characteristic of scale-free networks. From 1850 to 2004, the power-law exponents γ_in and γ_out (inserts) have been steadily growing and approach constant values of approximately 2.7 and 2.1, respectively. d) Evolution of the chemistry network is governed by a “two-way” (i.e., both “in” and “out”) mechanism of preferential attachment. The plots show the average increase in connectivities of one million random nodes from 1990 to 2000. The “strength” of preferential attachment (measured by the slope of the linear fit, d〈Δk〉/dk) is approximately three times greater for out than for in connectivities.
This helps to explain why the power-law exponent γ_out is smaller than γ_in and indicates that highly connected substrates are more likely to be found than highly connected products. e) The corresponding cumulative distribution for preferential attachment, defined as κ(k) = Σ_{k_i ≤ k} 〈Δk_i〉. The distribution follows a power law κ(k) = k^σ, where σ ≈ 2.1 for both the in and out distributions, which implies that the preferential attachment distribution plotted in (e) is indeed linear (〈Δk〉 ∝ k^(σ−1) ≈ k).

To establish the topological characteristics of the expanding network of chemical reactions, we analyzed the distributions of the number of connections outgoing from a given node, k_out (i.e., the number of times a given molecule was used as a reaction substrate), and the number of connections incoming to a given node, k_in (i.e., the number of times a molecule was obtained as a reaction product). As shown in Figure 2 c, both p(k_out) and p(k_in) decay algebraically with the number of connections, p(k) ∼ k^(−γ), and the γ exponents for both in and out distributions increase with time t (measured in years) approximately as γ_in(t) = 2.67[1 − exp(−(t − 1780)/52.4)] and γ_out(t) = 2.14[1 − exp(−(t − 1780)/36.2)].
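To get a feel for how quickly the fitted exponents saturate, the two expressions can be evaluated for a few years. The sketch below assumes that the fits have the saturating form γ(t) = γ_∞[1 − exp(−(t − 1780)/τ)], with γ_∞ equal to the asymptotic values 2.67 and 2.14 quoted in the text; that bracketed reading, the function name `gamma`, and the `FITS` table are assumptions made for this illustration.

```python
import math

# Asymptotes and time constants (in years) quoted in the text. The bracketed
# form gamma_inf * (1 - exp(-(t - 1780) / tau)) is an assumed reading of the fit.
FITS = {"in": (2.67, 52.4), "out": (2.14, 36.2)}

def gamma(kind, year, t_start=1780.0):
    """Power-law exponent in `year` under the assumed saturating fit."""
    gamma_inf, tau = FITS[kind]
    return gamma_inf * (1.0 - math.exp(-(year - t_start) / tau))

# Print the in and out exponents for a few years between 1850 and 2004.
for year in (1850, 1900, 1950, 2004):
    print(year, round(gamma("in", year), 2), round(gamma("out", year), 2))
```

Under this reading both exponents start from zero in 1780 and lie close to their asymptotes by 2004, with γ_out saturating faster because of its shorter time constant.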
Interestingly, the values these exponents asymptotically approach (γ_in = 2.67 and γ_out = 2.14) are similar to those that characterize the directed network of the WWW (2.71 and 2.1, respectively), thus suggesting that the network of chemistry and the WWW have similar topologies (although the latter is more highly connected; 〈k〉 = 7.5 for the WWW in 2000 vs 〈k〉 = 2.1 for chemistry in the same year) [16]. The observed power-law dependencies indicate that organic reactions, like the WWW [12, 15, 16], the internet [20], metabolic networks [21], and even societies [22, 23], form a scale-free network [10], whose architecture is distinguished by the presence of highly connected “hubs.” These hub molecules of organic chemistry are directly analogous to the hubs of the scale-free airline network, which facilitate transportation from one poorly connected airport to another. Likewise, the synthesis of one molecule in organic chemistry from another by a series of chemical transformations will likely utilize one of these versatile hub compounds as an intermediate (as we shall see later, this also has an added economic advantage, since hub compounds are significantly less expensive than poorly connected ones).

The presence of a scale-free topology also provides evidence as to the mechanism by which the network evolves over time. Specifically, the connectivities of organic molecules evolve according to a two-way mechanism of preferential attachment (i.e., for both in and out connections), which stipulates that well-connected substances are more likely to participate in new reactions than poorly connected compounds. We verified this mechanism directly by analyzing how the connectivities of one million randomly chosen nodes changed over time. We found that the average increase in both in and out connectivities over a given time period was proportional to their values at the beginning of this period (〈Δk_in,out〉 ∝ k_in,out; see Figure 2 d, e).
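The check described above can be sketched as follows: group nodes by their connectivity at the start of a period, average the gain in connectivity within each group, and fit a straight line to 〈Δk〉 versus k. This is a minimal illustration, not the authors' procedure; the ordinary least-squares fit, the function names, and the six-node toy snapshot are assumptions, and the fit needs at least two distinct starting degrees.

```python
from collections import defaultdict

def mean_increase_by_degree(k_start, k_end):
    """Average degree gain <Δk> for nodes grouped by their starting degree k."""
    gains = defaultdict(list)
    for node, k0 in k_start.items():
        gains[k0].append(k_end.get(node, k0) - k0)
    return {k: sum(v) / len(v) for k, v in sorted(gains.items())}

def attachment_strength(mean_gain):
    """Least-squares slope d<Δk>/dk of a straight-line fit to <Δk> versus k."""
    ks = list(mean_gain)
    n = len(ks)
    k_bar = sum(ks) / n
    g_bar = sum(mean_gain.values()) / n
    num = sum((k - k_bar) * (mean_gain[k] - g_bar) for k in ks)
    den = sum((k - k_bar) ** 2 for k in ks)  # zero if all degrees are equal
    return num / den

# Invented toy snapshot: degrees of six nodes at the start and end of a period.
k_1990 = {"a": 1, "b": 1, "c": 2, "d": 4, "e": 4, "f": 8}
k_2000 = {"a": 1, "b": 2, "c": 3, "d": 6, "e": 6, "f": 12}
mean_gain = mean_increase_by_degree(k_1990, k_2000)
slope = attachment_strength(mean_gain)
```

Running the same two functions separately on in and out connectivities would give the two slopes whose ratio the Figure 2 d caption describes as the relative “strength” of preferential attachment.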
This result means that 1) the more times a molecule has been used as a synthetic substrate (i.e., the larger its k_out value), the higher the chances that it will be used again in the future; 2) conversely, the higher its k_in value, the more likely that chemists will try to make it by a new reaction. Interestingly, this mechanism also implies that a molecule's usefulness, measured by its participation in new reactions, increases exponentially over time. Therefore, highly connected compounds create exponential “explosions,” thus leading to the formation of preferred substrates and target molecules (the “hubs”) in organic chemistry. The fact that the exponent characterizing outgoing connections, γ_out, is smaller than that characterizing incoming ones, γ_in, indicates (see Figure 2 c) that the most connected hubs of the chemistry network have on average more outgoing than incoming connections; that is, the most-connected molecules are usually used as synthetic substrates. At the same time, as chemistry develops, the degree of correlation between the in and out connectivities of the molecules increases from R(k_out, k_in) = 0.327 in 1850 to 0.571 in 2004 (see Supporting Information). The rationale for this trend is that the more useful a compound is as a synthetic substrate, the more ways are designed to prepare it (presumably, in an effort to maximize yields and minimize reaction costs); conversely, the more synthetic recipes exist for making a compound, the more available it becomes and the more readily it can be used in further syntheses.

Information about the molecular masses (henceforth, simply “masses” or m) of the network's nodes (i.e., molecules) provides an additional source of information about its evolution. While mass is the simplest of the possible molecular descriptors, it was chosen because it was readily available from the database and because it correlated with several other scalar descriptors (e.g., molecular volume [24, 25] or structural complexity factor [26, 27] of organic molecules).
Figure 3 a, b shows the frequency distributions of masses that were used as substrates (g_out(m,t); left) and products (g_in(m,t); right) in reactions reported between 1850 and 2004. The most commonly used substrates are those near m = 150 g mol⁻¹ and the most common products are those near m = 250 g mol⁻¹. Importantly, the shapes of both distributions and the locations of their maxima do not change with time; the distributions only shift upwards, each g_in,out(m,t) being the product of a time-dependent factor θ_in,out(t) and a time-independent “master” distribution of m (here, that of 1850). The propagating/scaling functions are exponential, θ_in,out(t) ∝ exp((t − t0)/τ_in,out), with t0 = 1850, prefactors 5910.1 and 4869.0, τ_in = 19.34, and τ_out = 19.07. These results show that the masses of the most popular substrates and products have not significantly changed for 150 years; moreover, these masses will remain the most popular in the future, as they correspond to the most rapidly connecting nodes of the network.

Figure 3. Frequency distributions of masses that were used as a) substrates (g_out(m,t)) and b) products (g_in(m,t)) in reactions reported in 25-year intervals between 1850 and 2004. Specifically, the substrate- and product-mass distributions were found by calculating the statistics of all masses weighted by k_out and k_in, respectively. c) Distribution of masses, M(m,t), of all molecules in the chemistry network as of 2004. The zoom-in illustrates periodic variation in the abundances of molecules with even and odd masses. The insert is a Fourier spectrum of the M(m,t) distribution. d) Evolution of mass distributions with time. All M(m,t) curves are plotted against masses binned into intervals of two mass units; this binning removes high-frequency oscillations and simplifies calculations but does not, as we verified, affect the generality of the results.
e) The lines in this log–log plot correspond to the theoretical mass distributions for 1875 (green), 1950 (red), and 2004 (blue) propagated from the initial distribution in 1850 according to the master equation; these simulated curves agree with the Beilstein data (points). The characteristic shape of the distributions indicates that molecules are created by a stochastic Kesten process. Insert: Mass distributions overlap when rescaled [20] with respect to the location of the peak and the total number of molecules. f) Correlation between the masses of substrates, m_S, and products, m_p, based on all reactions stored in the database. g) Log–log plot of the normalized mass distributions of all compounds stored in Beilstein (blue) and of the 1382 most important drugs (red). The two distributions are nearly identical, with goodness of fit R² = 0.779 (for which the Beilstein data are taken as the reference distribution).

Abundances of molecules depend on their masses. Figure 3 c shows the M(m, t = 2004) mass distribution of all nodes in the network. Aside from an obvious overall trend (few very small and very large molecules), the curve exhibits pronounced “high-frequency” oscillations. The Fourier spectrum of the distribution (Figure 3 c, insert) shows a dominant, sharp peak at 2 and a broader peak centered at 14–15. The former indicates that there are significantly more molecules (indeed, by ≈48 %) of even than of odd masses; the latter corresponds to the masses of the most common chemical “building blocks” (e.g., CH₂, CH₃, OH, NH₂) from which molecules are made. More detailed, time-dependent analysis of the M(m,t) distributions (Figure 3 d) reveals that chemists create molecules according to a single stochastic process. When plotted on a log–log scale (Figure 3 e), the M(m,t) curves for t = 1850–2004 1) are log-normal for the majority (>90 %) of masses around the peak; 2) decay algebraically, M(m,t) ∼ m^(−β) with the exponent β ≈ 4, for m > 800; and 3) exhibit a characteristic “shoulder” for m < 50.
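The link between an even/odd imbalance and a spectral peak at 2 can be illustrated with a toy histogram. The sketch below is not the authors' analysis: the paper does not say how its Fourier spectrum was computed, so the plain discrete Fourier transform, the `max_period` cutoff used to ignore the slow overall trend, and the synthetic histogram (a smooth bump with even masses 48 % more abundant and a weaker ripple of period 14) are all assumptions.

```python
import cmath
import math

def amplitude_spectrum(counts):
    """Magnitudes of the discrete Fourier transform for frequencies 1..N/2."""
    n = len(counts)
    spectrum = {}
    for f in range(1, n // 2 + 1):
        coeff = sum(c * cmath.exp(-2j * math.pi * f * m / n)
                    for m, c in enumerate(counts))
        spectrum[f] = abs(coeff)
    return spectrum

def dominant_period(counts, max_period=30):
    """Period (in mass units) of the strongest oscillation.

    Periods longer than `max_period` are ignored so that the slow overall
    shape of the histogram does not mask the fast oscillations.
    """
    n = len(counts)
    spectrum = amplitude_spectrum(counts)
    fast = {f: a for f, a in spectrum.items() if n / f <= max_period}
    best = max(fast, key=fast.get)
    return n / best

# Invented toy histogram over masses 0..279 (not Beilstein data).
N = 280
counts = []
for m in range(N):
    envelope = math.exp(-((m - 140) / 60.0) ** 2)        # smooth overall trend
    parity = 1.48 if m % 2 == 0 else 1.0                  # even masses favored
    ripple = 1.0 + 0.1 * math.cos(2 * math.pi * m / 14)   # building-block ripple
    counts.append(1000.0 * envelope * parity * ripple)

period = dominant_period(counts)
```

In this toy case the strongest fast oscillation has period 2, and the weaker ripple appears as a secondary maximum at frequency index N/14 = 20, mirroring the two features the text reads off the real spectrum.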
Since 1850, the average mass increased from to according to where g mol−1 and as the total number of masses (i.e., of molecules) increased exponentially with time. Importantly, when the M(m,t) distributions for different times were they a single master curve (Figure 3 The shapes and of the mass distributions indicate that they evolve according to a in which new masses are created according to the master equation where has a of time. at each (i.e., time molecule from a existing distribution M(m,t) is used with constant to a new molecule whose mass is by the master The value of is by the growth rate of the total number of where is the number of distribution simulated between 1850 and the between and the time. The number of that correspond to one year is After the both the and the created masses are in a distribution thus for the increase in the number of molecules with When stochastic (i.e., and were from distributions and by the mass distribution in year the master equation propagated mass distributions at of time, these distributions were given by where is the initial mass distribution and the distribution of masses derived from in the Although the stochastic the fundamental statistical between the masses of the substrates and the products mp in the network (Figure 3 its this molecular masses evolve according to the master equation from the very of of of the and constant in synthetic equation can be into the In other it can be used to masses with will be obtained in reaction or set of reactions to be important generalizations the master equation indicates that very large and very small molecules are to When large substrates are the to a of a smaller conversely, for small substrates, the increases the this the master equation provides information in the form of which an of how it is to make a molecule of a given We stress that these should be only in a statistical and by no means the of the of molecules. 
the of the stochastic that the of masses can be used to the of syntheses. a reasonable that such syntheses are based on chemical transformations (i.e., likely described in it is that the masses of the initial of should evolve according to the master This could be used in designing fragment that evolve into a distribution of masses a number of synthetic of such an analysis for three potential libraries is illustrated in Figure a and further in the Supporting One of the most in which such mass distributions are is in it is that masses of substances from those of random however, reveals (Figure 3 g) that this is and that most important have a mass distribution identical to that of all chemicals stored in this is a result of drugs by random from the of existing substances or of synthetic by the master equation (see leading to a log-normal distribution is an molecular the network connectivity of a chemical is a of its or economic As we have it is reasonable to that for the most useful compounds, the number of ways in which they can be used should with the number of ways in which they can be for the 1382 most important and the most important industrial the between kin and kout values are not only higher than for randomly chosen compounds, but are also as illustrated in Figure a, which shows time of these important substances in the are highly and sharp to and by randomly chosen compounds. In other by at a evolution of the connectivity of a one can make about its industrial In this is in and the of chemicals should be by their (here, by kin, the known ways of a and by the existing (measured by the number of that As the of the chemicals (Figure b) with these economic and obey a power law of the form the of a chemical rapidly with both kin and a) Evolution of three initial in the of simulated by the master 1 and 2 distributions of masses centered at g and characterized by and g respectively. 
The is a of two distributions of masses centered at and g mol−1 and characterized by the same g mol−1. obtained from the three initial 6 The target distribution of masses of 1382 most important drugs is also The average which between the distributions if they are obtained from and 3 and the target distribution of drugs plotted as a function of synthetic (i.e., the number of The 1 most b) of important drugs and the chemicals and 2000 randomly chosen molecules plotted in the kout The of the important chemicals are and linear (i.e., where whereas those of are and of between approximately and The insert shows distributions of the exponents the curves are to the of is higher for randomly chosen compounds than for chemicals and drugs as a function of in and out connectivities of the compounds. The power-law indicates that more highly connected compounds are significantly less this and other trends we are only the to the of information in the of chemistry. Although additional theoretical can be with the Beilstein repository (e.g., network's of molecular and other structural descriptors, the of more trends will likely the of that reaction and it be to see trends could be for of chemistry (e.g., the rapidly growing and etc.) and these trends from the trends a fundamental our results another of how activities of individual into a and as of statistical changes have been made to this since its publication in The Supporting information for this is available on the WWW or from the The is not for the or of information by the than should be directed to the corresponding for the
