Preview: Useful Chemistry
Useful Chemistry
This blog chronicles the research of the UsefulChem project in the Bradley lab at Drexel University. The main project currently involves the synthesis of novel anti-malarial compounds. The work is done under Open Notebook Science conditions with the actua[...]
Updated: 2012-02-06T01:01:30.219-05:00
MiniSymposium Bradley Lab 2011 2011-10-05T12:47:44.785-04:00 I recently presented a 15-minute summary of the current research in my lab on September 29, 2011 at the Drexel University Department of Chemistry Faculty Mini-Symposium. The main project discussed was the Open Melting Point Collection done in collaboration with Andrew Lang and Antony Williams. Work by Evan Curtin is also shown, demonstrating the application of melting point and solubility in reaction design. I'll discuss this imine synthesis project in more detail later but Evan's experiments are listed in the notebook.
[Slides: MiniSymp2011 Bradley]
Patrick Ndungu talk at Drexel on Nanotechnology 2011-08-18T15:04:39.588-04:00 One of my former Ph.D. students, Patrick Ndungu (now at the University of KwaZulu-Natal, South Africa), will be speaking at Drexel University on Friday August 19, 2011 at 12:30 in Disque 109.
Google Apps Scripts Workshop at Drexel University 2011-08-17T16:01:09.854-04:00 Andrew Lang will be in Philadelphia next week and we will be running a workshop on Leveraging Google Spreadsheets with Scripts for Research and Teaching. Now that our institution is no longer providing Microsoft Office for students in the fall term, it seems like a good time to explore converting some assignments and projects relying on Excel to freely available Google Spreadsheets. (Resources available here)
Andrew Lang (Department of Mathematics at Oral Roberts University) and Jean-Claude Bradley (Department of Chemistry at Drexel University) will host a workshop on Google Apps Scripts from 10:30 to 12:00 on Tuesday August 23, 2011 at the Hagerty Library in room L13C. They will demonstrate how users with no programming experience can easily add functions and drop-down menus to a Google Spreadsheet. Some chemistry examples will be detailed, such as inter-converting compound identifiers (common name, SMILES, CAS number, etc.) and reporting properties (melting points, solubility, density, etc.) with a single click. Participants are encouraged to suggest applications in other fields to explore during the workshop.
Open Melting Point Collection Book Edition 1 2011-08-11T16:03:04.939-04:00 Several months of work through a collaboration between myself, Andrew Lang, Antony Williams and Evan Curtin have culminated in the publication of an Open Melting Point Collection Book. Like our other books on solubility and Reaction Attempts, the conversion from a database format to a PDF has several advantages.
Now that the book has been accepted by Nature Precedings, it provides a convenient mechanism for citation via DOI, a formal author list, version control, etc. The book is also now available from LuLu.com either as a free PDF download or a physical copy. Because the book runs 699 pages (it covers 2706 unique compounds) the lowest price we could get is $30.96, which just covers printing and shipping.
Even though we have melting points for about 20,000 unique compounds, most of these are from single sources. Unless we can get another major donation of melting points (not using any of the sources we already have), progress in curating single values manually will take time.
As described in the abstract: This book represents a PDF version of Dataset ONSMP029 (2706 unique compounds, 7413 measurements) from a project to collect and curate melting points made available as Open Data. This particular collection was selected from the application of a threshold to favor the likelihood of reliability. Specifically, the entire range of averaged values for a data point was set to 0.01 C to 5 C, with at least two different measurements within this range. Measurements were pooled and processed from the following sources: Alfa Aesar, MDPI, Bergstrom, PhysProp, DrugBank, Bell, Oxford MSDS, Hughes, Griffiths and the Chemical Information Validation Spreadsheet.
Links to all the information sources and web services are available from the Open Melting Point Resource page: http://onswebservices.wikispaces.com/meltingpoint
This filtering of double validated melting point measurements within a range of 5 C is an attempt to provide a "reasonably" good source. It is imperative to understand that this is not a "trusted source" - as I've mentioned several times, there is no such thing. However, since absolute trusted sources do not exist, this double validated dataset of 2706 compounds is probably the best we can do for now. In fact, use of this double validated dataset to build melting point models has led to some excellent models, which are far superior to models constructed from the entire database of 20,000 compounds.
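The threshold described in the abstract can be sketched in a few lines of Python. This is only one reading of the rule (at least two different measurements whose total spread lies between 0.01 C and 5 C), not the code actually used to build ONSMP029; the function name `double_validated` and its parameters are hypothetical.

```python
from statistics import mean


def double_validated(values_c, min_range=0.01, max_range=5.0):
    """Return the average melting point (in C) if the values pass the
    assumed threshold, otherwise None.

    Assumption: "range" means max - min over the distinct reported values.
    """
    distinct = sorted(set(values_c))
    if len(distinct) < 2:
        # A single value (or the same value repeated) is not double validated.
        return None
    spread = distinct[-1] - distinct[0]
    if min_range <= spread <= max_range:
        return mean(values_c)
    return None


if __name__ == "__main__":
    # Three close but different values pass the threshold.
    print(double_validated([-98.0, -97.6, -97.53]))
    # Identical values from two sources do not count as independent.
    print(double_validated([5.0, 5.0]))
```

Identical values are deliberately rejected here because, as the posts in this feed note, matching numbers from different sources are likely to share a single origin.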
Rapid analysis of melting point trends and models using Google Apps Scripts 2011-07-19T21:54:07.918-04:00 I recently reported on how Google Apps Scripts can be used to facilitate the recording and calculations associated with a chemistry laboratory notebook (also see resource page).
I will demonstrate here how these scripts can be used to rapidly discover trends in the melting points of analogs for the curation of data and the evaluation of models. The two melting point services that Andrew Lang created under the gONS menu were used to keep track of the measured and predicted melting points for all reactants and products as part of a "dashboard view" of the reaction being performed.
For looking at melting point trends, the following template sheet can be used.
For reasons explained previously, the template sheet has no active scripts in the page (except for the images). These are just the values generated from running the scripts corresponding to the column headings on the common names. In order to use it for another series of compounds, just make a copy of the entire Google Spreadsheet (File->Make a Copy), then enter the new list and pick the desired script to run from the menus. Once the values are computed, remember to copy and paste as values.
It is important to understand that our melting point service is not a "trusted source" - it simply reports the average of all recorded data sources, ignoring values marked as DONOTUSE. That means that not all data points are equal and it is up to the user to determine a threshold of some type to decide how to use a particular data point.
In this investigation, I have marked in green averaged experimental values where at least 3 different values are clustered within a few degrees. A link in column H is automatically generated from the CSID to provide a very convenient way to evaluate the data sources. For example the link for methanol has 3 very close but different melting point values: -98 C, -97.6 C and -97.53 C.
The -98 C value is repeated 7 times because this resulted from the automatic merging of several Open Collections.
In general we don't manually add values that are identical from different sources because it is likely that these all originate from the same source. We have to make that assumption because proper data provenance is usually lacking in chemical information sources today. A Google search will often return the same one or two melting points from dozens of sites, which may turn out to be outliers when compared with other independent sources. (CAS numbers are generated in the template sheet because they are useful for searching Google for melting points - for example see here for methanol)
In another scenario where there are 3 or more different but close values and a few clearly marked outliers, I considered these averages as having passed my threshold and colored these green as well. A good example is ethanol, which I have previously used to illustrate our curation method.
It turns out that for the series of n-alcohols from methanol to 1-decanol, I was able to mark in green every experimental melting point average, making the confidence level of the following plot about as high as it can get from current chemical information sources.
It is particularly gratifying to note that the predicted melting points based on Andrew Lang's random forest Model002 perform very well here, even predicting a melting point minimum at 3 carbons. Note that this model is Open Source and uses Open Descriptors derived from the CDK. It does not yet include the results of our most recent curation efforts.
Any new models incorporating improved datasets will be listed here.
Extending the analysis to n-alkyl carboxylic acids from formic acid to decanoic acid provides the following plot, with the same confidence for the experimental averages.
For this series, the random forest model not only predicts that the lowest melting point is for the 5 carbon analog but it also appears to take the shape of a zig-zag pattern, especially for the first 6 acids. Since this alternating pattern has been a[...]
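The "green" threshold used in this post (at least 3 different values clustered within a few degrees, after ignoring entries marked DONOTUSE) could be expressed roughly as follows. This is a sketch in Python rather than the actual gONS service code; the name `average_and_flag`, the 3 degree window, and the way repeated values are counted in the average are all assumptions.

```python
def average_and_flag(records, window_c=3.0, min_distinct=3):
    """records: list of (value_in_C, note) pairs.

    Returns (average, is_green). Entries whose note is "DONOTUSE" are kept
    in the input but left out of both the average and the cluster check.
    Assumption: "a few degrees" is taken as window_c = 3.0.
    """
    used = [value for value, note in records if note != "DONOTUSE"]
    if not used:
        return None, False
    average = sum(used) / len(used)
    distinct = sorted(set(used))
    is_green = (
        len(distinct) >= min_distinct
        and distinct[-1] - distinct[0] <= window_c
    )
    return average, is_green


if __name__ == "__main__":
    # Methanol as described above: -98 C appears 7 times after merging.
    methanol = [(-98.0, "")] * 7 + [(-97.6, ""), (-97.53, "")]
    print(average_and_flag(methanol))
```

Note that the seven repeats of -98 C count as one distinct value for the cluster check, which mirrors the reasoning that identical values probably come from a single original source.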
Practical Tips on using Google Apps Scripts for Chemistry Applications 2011-07-14T14:55:28.508-04:00 A few weeks ago I described our use of Google Apps Scripts, developed by Rich Apodaca and Andrew Lang, as an intuitive interface to information related to a chemistry laboratory notebook. Since then we have been using these tools to actively plan and record experiments (e.g. UC-EXP269) and we have learned their strengths and weaknesses.
The most problematic aspect of Google Apps Scripts running within Google Spreadsheets turns out to be the way caching and refreshing operate. There does not appear to be an obvious way to refresh a single cell. So if a script times out or fails, Google stores that failed output on their servers and will not run it again until some time has elapsed (which seems to be on the order of about an hour). Typing in a new input for that cell will cause the script to run again but entering a previously entered input will only retrieve the cached output, even a failed output. For example, if you have a cell calculating the MW from "benzene" entered in another cell and the script fails for any reason, typing in "ethanol" will get it to run again for the new input, but going back to "benzene" will just pull up the cached output of "Failed".
Nevertheless, I did come across some tricks to force a refresh indirectly. If you insert a row or column then re-enter the desired scripts in the new cells, they will run again. You simply need to then delete the old column with failed outputs. This is fine for simple sheets but it can be a headache for sheets that have several calculation dependencies between cells.
To avoid these complications, simply refresh the entire sheet by duplicating it, deleting the old sheet and then renaming the new one to the original name. The problem now is that it will refresh all the cells, not just those that had failed outputs.
And if there are a large number of scripts on that sheet the odds are good that at least one will fail on that particular attempt, especially if several are hitting the same web server.
As a result of all these problems, I would not recommend using these services as I had initially hoped, where a researcher would enter data into a template sheet loaded with scripts to automatically generate a series of calculated outputs. There is a way to achieve this end but it requires thinking about the scripts in a slightly different way.
As I mentioned above, there are tricks for refreshing an entire sheet or a column or row. In order to avoid re-running the scripts that already returned desired outputs, we need to lock them in. This can be done by highlighting the completed cells, copying them (either control-c or Edit->Copy) then pasting them as values (from the Edit menu). Now refreshing will only be done on the cells with failed outputs and these can be locked in as well as soon as they complete.
The downside of this approach is that you lose the information about which script was run to generate the output values. And to change an input requires re-selecting the desired script. But in practice it is so convenient to hit a dropdown menu and hit getMW (for example) that this downside is quite minimal, especially when contrasted with the upside of knowing that others will see your information reliably, independent of how the services are running at a particular time.
Over the past few weeks we have found that some services fail more often than others and it would be advantageous to have some redundancies. This has been particularly problematic for the cactus services recently, which we often use for resolving common names. By using ChemSpiderIDs (CSIDs), the cactus services can be bypassed for several of the gONS services. So a good practice for any application is to generate and lock in SMILES and CSIDs right away from the common name.
CAS numbers can be used too but the gChem service that Rich has created sometimes yields multiple CAS numbers and these will fail as input for a subsequent script.
We now have a chem[...]
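The "lock in" strategy described in this post (keep outputs that already succeeded as plain values and re-run only the failed ones) can be illustrated with a small Python simulation. This is not Google Apps Script and does not use any spreadsheet API; `refresh_failed` and `make_flaky_service` are hypothetical names, and the flaky service is only a stand-in for a web service that sometimes times out.

```python
FAILED = "Failed"


def refresh_failed(inputs, locked, service):
    """Call service(name) only for inputs without a locked-in value.

    locked maps input name -> stored output; successful outputs are kept
    as plain values, like cells pasted "as values" in the spreadsheet.
    """
    for name in inputs:
        if locked.get(name, FAILED) != FAILED:
            continue  # already locked in, do not hit the service again
        try:
            locked[name] = service(name)
        except Exception:
            locked[name] = FAILED
    return locked


def make_flaky_service(table, fail_once):
    """Build a fake lookup service that fails the first time for some names."""
    pending = set(fail_once)
    calls = []

    def service(name):
        calls.append(name)
        if name in pending:
            pending.discard(name)
            raise TimeoutError(name)
        return table[name]

    return service, calls


if __name__ == "__main__":
    mw = {"benzene": 78.11, "ethanol": 46.07}  # molecular weights in g/mol
    service, calls = make_flaky_service(mw, {"benzene"})
    cells = refresh_failed(["benzene", "ethanol"], {}, service)
    print(cells)   # benzene failed on the first pass
    cells = refresh_failed(["benzene", "ethanol"], cells, service)
    print(cells)   # only benzene was re-run
    print(calls)
```

The second pass calls the service once, for the failed entry only, which is the point of locking in values before refreshing a sheet that hits the same web server many times.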
Open Notebook Science Talk at HUBbub 2011 2011-07-01T09:09:14.082-04:00 On April 6, 2011 I presented at the HUBzero Conference in Indianapolis on "Open Notebook Science: Does Transparency Work?".
This presentation will first describe Open Notebook Science, the practice of making the laboratory notebook and all associated raw data available to the public in real time. Examples of current applications in organic chemistry - solubility and chemical reactions - will be detailed. Key details of the current technical implementation will be described and possible applicability to nanotechnology projects will be explored. Finally, the implications for Intellectual Property protection, claims of priority, subsequent publication in peer reviewed journals and the eventual automation of the scientific process will be explored.
The organizers did a great job in making the recording available as either a video or audio podcast. I learned a great deal at the conference about how researchers from various fields use the HUBzero software to manage and share their data. As described on their website: HUBzero® is a platform used to create dynamic web sites for scientific research and educational activities. With HUBzero, you can easily publish your research software and related educational materials on the web.
Although the system is not primarily designed for completely Open sharing, I did get the impression that for some applications there was significant interest in making data and processes more Open. There is certainly an enthusiastic user community around HUBzero - check out the recordings for some of the other talks here.
[Slides: Open Notebook Science HUBzero 2011]
The 4-benzyltoluene melting point twist 2011-06-22T19:56:40.911-04:00 Evan Curtin and I were in the lab this morning to follow up on our effort to curate the melting point of 4-benzyltoluene. I identified the next step to confirm an upper limit of -15 C:
With the information available thus far from our experiments (UC-EXP266), we think it is unlikely that the +4.6 C value can be correct because we observed no solidification after 2 days at -15 C. The patent reports that solidification of some viscous mixtures took up to a full week but we did not observe an appreciable increase in viscosity for 4-benzyltoluene at -15 C. But in order to be sure we will first freeze the sample again below -40 C and let it warm up to -15 C in the freezer and confirm that it melts completely.
But when we took the sample out of the freezer after 16 days it was completely frozen!
This now effectively ruled out the -30 C value and re-opened the possibility that the +4.6 C value could be the best estimate. Learning from our previous failed attempt to observe a temperature plateau when heating the sample, this time we let it warm as slowly as possible by leaving it in an ice water bath inside of a Styrofoam container. This worked much better as the sample warmed a few degrees over several hours. This time Evan observed a clear transition from the solid to the liquid phase in the 4-6 C range. (UC-EXP266)
The curation record for the melting point of 4-benzyltoluene now looks like this:
When I introduce the concept of Open Notebook Science in my talks I usually make the point that there are no facts - just measurements embedded within assumptions.
The 4-benzyltoluene melting point story is a really good example of this principle. When I stated that I thought that "it is unlikely that the +4.6 C value can be correct because we observed no solidification after 2 days at -15 C", it was not the measurement that was in error - it was the interpretation.
And when new information came to light, an experiment was proposed to either challenge or further support that interpretation. There were never any "facts" in this story (nor is the +4.6 C value a "fact" from these results).
I think that this is how science functions best and most efficiently. Unfortunately we don't usually have access to all pertinent raw measurements, assumptions and interpretations. I would be extremely interested in seeing how the -30 C value was determined. This is actually the value provided by the company that sold us this batch of material (as well as the PhysProp entry in the image above). Because of slow crystallization, I can see how this could happen if the temperature was dropped until solidification was observed. In our observations, the -30 C to -35 C range is roughly where we observed rapid solidification upon cooling. (UC-EXP266)[...]
Google Apps Scripts for an intuitive interface to organic chemistry Open Notebooks 2011-06-18T12:48:03.720-04:00 Rich Apodaca recently demonstrated how Google Apps Scripts can be added to Google Spreadsheets to enable simple calling of web services for chemistry applications (gChem). Although we have been using web service calls from within a Google spreadsheet for some time (solubility calculation by NMR link #3 and misc chem conversions link #1), the process wasn't as intuitive as it could be because one had to find then paste lengthy URLs.
Rich's approach enables simply clicking the desired web service from a menu on Google Spreadsheets and these functions have simple names like getSMILES. Andrew Lang has now added several web services from our ONS projects and the CDK. There are now 3 menus to choose from: gChem, gCDK and gONS.
To demonstrate the power of these tools consider the rapid construction of a customized interface to an experiment in a lab notebook (in this example UC-EXP263).
1) Because Andy has added a gONS service to render images of molecules from ChemSpider, consistent reaction schemes can now be constructed from this template by simply typing the name of the reactants and products then embedding in the wiki.
2) Planning of the reaction to calculate reactant amounts and product yield can then be processed by simply typing the name of the chemicals. Services calling molecular weight and density are automatic based on the chemical name as input.
3) Typing the name of the solvent then allows easy access to the solubility properties of the reaction components. The calculated concentrations of the reactants and product can be directly compared with their measured maximum solubility. In this experiment the observed separation of the product from the solution is consistent with these measurements.
4) Both experimental and predicted melting points (using Model002) can then be lined up for comparison. A large discrepancy between the two would flag a possible error - in this case good agreement is found. Noting that the product's melting point is near room temperature (53 C) explains why two layers were observed to form during the course of the reaction and cooling to 0 C induced the product to precipitate. Links to the melting measurements are also provided in column N for easy exploration.
5) Column O provides a quick link to the ChemSpider entries for all compounds and column P provides links to the Reaction Attempts Explorer where, for example, one can explore other reactions where the product was involved. Finally columns Q and R provide one click access to an interactive NMR spectrum of the product, powered by ChemDoodle.
The last few columns still use our older code to call web services but over time these should be added to the gONS collection for convenience.
The easiest way to experiment with this interface is probably to just make a copy (File -> Make a Copy from the Google Spreadsheet menu). The sheet can then be customized for other applications.[...]
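The planning arithmetic in step 2 (reactant amounts from molecular weight and density, then product yield) is simple enough to sketch in Python. The function names are hypothetical and the benzaldehyde numbers are only an illustrative input, not values taken from UC-EXP263; in the spreadsheet these quantities come from the gONS and gChem services.

```python
def plan_reactant(mmol, mw_g_per_mol, density_g_per_ml=None):
    """Return (mass in mg, volume in microliters or None) for a reactant.

    mmol * g/mol gives mg; mg / (g/mL) gives microliters.
    """
    mass_mg = mmol * mw_g_per_mol
    if density_g_per_ml is None:
        return mass_mg, None  # solids: weigh out, no volume needed
    return mass_mg, mass_mg / density_g_per_ml


def percent_yield(product_mg, product_mw_g_per_mol, limiting_mmol):
    """Percent yield relative to the limiting reactant (1:1 stoichiometry assumed)."""
    product_mmol = product_mg / product_mw_g_per_mol
    return 100.0 * product_mmol / limiting_mmol


if __name__ == "__main__":
    # Illustrative liquid reactant: benzaldehyde, MW 106.12 g/mol, density 1.044 g/mL.
    mass_mg, volume_ul = plan_reactant(2.0, 106.12, 1.044)
    print(round(mass_mg, 2), "mg,", round(volume_ul, 1), "uL")
```

The 1:1 stoichiometry in `percent_yield` is an assumption made to keep the sketch short; a real planning sheet would carry the equivalents for each component.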
Live Tweeting Haumea: the Open Science Ratchet at work? 2011-06-16T14:43:22.156-04:00 Eugenie Samuel Reich just announced on the Nature NewsBlog that astronomer Mike Brown live-tweeted his observations of a transit of dwarf planet Haumea by its moon, Namaka.
About a year ago, I wrote about Mike Brown and the controversy about the discovery of Haumea stemming from a competitor's more aggressive data dissemination practice. In that post I speculated that we could expect accelerated data sharing over time due to the Open Science Ratchet, where the actions of scientists that are most open set the pace for everyone else working on that particular project, regardless of their views on how secretive science should be. I don't know if Mike Brown has changed his views on data sharing - or if he has always felt this way but thought it was too risky until now. Either way, he certainly is taking the lead at this point to demonstrate how radical openness can be done in astronomy!
My talk at SLA on Trust in Science and Open Melting Point Collections 2011-06-16T11:33:24.106-04:00 On June 14 and 15, 2011 I attended the Special Libraries Association conference and made presentations on two panels on the role of trust in science with a case-study of the Open Melting Point collections that Andrew Lang, Antony Williams and I have been assembling and curating.
The first panel was on the "International Year of Chemistry: Perils and Promises of Modern Communication in the Sciences". My colleague Lawrence Souder from the Department of Culture and Communications at Drexel presented on "Trust in Science and Science by Blogging", using as an example the NASA press release on arsenic replacing phosphorus in bacteria and the subsequent controversy taking place in the blogosphere. (see post in Scientific American blog today) Watch Lawrence Souder's presentation screencast and slides.
The second panel was on "New Forms of Scholarly Communications in the Sciences". Don Hagen from the National Technical Information Service presented on "NTIS Focus on Science and Data: Open and Sustainable Models for Science Information Discovery" and Dorothea Salo discussed the evolving role of libraries and institutional repositories on scholarly communication and archiving. Watch Don Hagen's presentation screencast and slides. My own slides and screencast from the second panel are available below:
[Slides: Bradley SLA Talk on Open Melting Point Collections]
More on 4-benzyltoluene and the impact of melting point data curation and transparency 2011-06-12T08:53:58.769-04:00 There are many motivations for performing scientific research. One of these is the desire to advance public scientific knowledge.
This is a difficult concept to quantify or even qualitatively assess. One can try to use literature citations and impact factors but that captures only a small fraction of the true scientific impact. For example, one formal citation of our solubility dataset doesn't represent the 100,000 anonymous solubility queries made directly to our database. And of these the actual impact will depend on exactly how the information was used. Egon Willighagen has identified this as a problem for the Chemistry Development Kit (CDK) as well: many more people use the CDK than reflected simply by the number of citations to the original paper.
There are a few of us who believe that curating chemistry data is a high impact activity. Antony Williams spends a considerable amount of time on this activity and frequently uncovers very serious errors from a number of data sources. Andrew Lang and I have put in a similar effort in collecting and curating solubility measurements openly - and recently (with Antony) we have been doing the same for melting points.
Although attempting to estimate the total impact of the curation activity isn't really practical, we can look at a specific and representative example to capture the scope.
I recently exposed the situation with the melting point measurements of 4-benzyltoluene. In brief, the literature provided contradictory information that could not be resolved without performing an experiment. Although an exact measurement was not found, a limit was determined that ruled out all measurements except for one.
Ironically it turns out that the melting point of this compound is its most important property for industrial use!
Derivatives of diphenylmethane were sought out to replace PCBs as electrical insulating oils for capacitors because of toxicity concerns. As described in this patent (US5134761), for this application one requires the oil to remain liquid down to -50 C. Another key requirement is the ability to absorb hydrogen gas liberated at the electrode surface (a solubility property). Since this is optimal for smaller alkyl groups on the rings, it places benzyltoluene isomers at the focal point of research for this application.
The patent states: "According to references, the melting points of the position isomers of benzyltoluenes are as follows..." but does not make a specific reference. However, by comparing the numbers with other sources we can presume that the reference is the Lamneck1954 paper I discussed previously.
The patent then uses these melting points to calculate the melting behavior of mixtures of these isomers, as they are obtained without further purification from a Friedel-Crafts reaction.
If our results are correct and the melting point of 4-benzyltoluene is not +4.6 C but well below -15 C, then the calculated properties in the patent may be significantly in error as well. With the information available thus far from our experiments (UC-EXP266), we think it is unlikely that the +4.6 C value can be correct because we observed no solidification after 2 days at -15 C. The patent reports that solidification of some viscous mixtures took up to a full week but we did not observe an appreciable increase in viscosity for 4-benzyltoluene at -15 C. But in order to be sure we will first freeze the sample again below -40 C and let it warm up to -15 C in the freezer and confirm that it melts completely.
It is in light of this analysis that I make the case that open curation of melting point data is likely to be a high impact activity relative to the amount of time required to perform it. The problem is that errors such as these cascade through the scientific record and lik[...]
Open Melting Points on iPhone via MMDS 2011-06-10T14:32:29.316-04:00 As Alex Clark explained on his blog Cheminformatics 2.0, both predicted and experimental melting points from our Open Data collection are now available on iPhones via his MMDS webservices protocol. Although the app is not free, the web service (#7 from our collection) that Andrew Lang and Alex created for this purpose is Open and available for anyone to use. It reads an XML formatted molfile and returns the average measured melting point, predicted melting point, SMILES, CSID and a link to the ChemSpider entry.
The quest to determine the melting point of 4-benzyltoluene 2011-06-09T19:57:49.107-04:00 I recently reported that we are attempting to curate the open melting point measurements collected from multiple sources such as Alfa Aesar, PhysProp (EPIsuite) and several smaller collections. I mentioned that some values - like benzylamine - simply don't converge and the only way to resolve the issue is to actually get a high purity sample and do a measurement.
Since that report, we found another non-converging situation with 4-benzyltoluene. As shown below, reported measurements range from -30 C to 125 C.
The values in red have been removed from the calculation of the average based on evidence we obtained from ordering the compound from TransWorld Chemicals and observing its behavior when exposed to various temperatures. The details can be found in UC-EXP266 (which I performed with Evan Curtin).
Immediately after opening the package it was clear that the compound was a liquid and thus the 125 C and 98.5 C values became improbable enough to remove.
First Evan Curtin and I dropped the still sealed bottle into an ice bath (0 C) and after 10 minutes there was no trace of solidification.
At this point, this does not necessarily rule out the values near 5 C because of the short time in the bath.
We then used an acetone/dry ice bath and did see a rapid and clear solidification after reaching -30 C to -35 C.
Letting the bath temperature rise it was difficult to tell what was happening but there seemed to be some liquefaction around -12 C.
In order to get a more precise measurement, we transferred about 2 mL of the sample into a test tube and introduced the thermometer directly in contact with the substance.
After quickly freezing the contents in a dry ice/acetone bath, the sample was removed and its behavior was observed over time, as shown below.
I was expecting to see the internal temperature rise then plateau at the melting point until all the solid disappeared and then finally observe a second temperature rise. This comes from experience in making 0 C baths within minutes by simply throwing ice into pure water.
As shown above that is not at all what happened. The liquid formed gradually starting at about -9 C and never reached a plateau even up to +7 C, where there was still much solid left.
If we look at the method used to generate the 4.58 C value (Lamneck1954) we find that a similar method was cited - but not actually described there. The actual curves are not available either. However, this paper provides melting points for several compounds within a series, which is often useful for spotting possible errors - unless of course these are systematic errors. In this particular case it doesn't help much because the 2-methyl derivative is similar but the 3-methyl analogue is very close to the -30 C value listed in our sources.
Notice that one of the "melting points" (3-methyldicyclohexylmethane) is not even measurable because it forms a glass. It is easy to see how melting points below room temperature can generate very different values - and very difficult to assess if the full experimental details of the measurements are not reported.
To get at more details, let's look at the referenced paper (Goodman1950). Indeed the researchers determine the melting point by plotting the temperature over time as the sample is heated and looking for a plateau. The obvious difference is that the heating rate is about an order of magnitude slower than in our experiment.
This paper also highlights the fact that there are more twists and turns in the melting point story.
One compound (2-butylbiphenyl) was found to have 2 melting points that can be observed by seeding with different polymorphic crystals.
At this point, our objective of obtaining an actual melting point was replaced with trying to a[...]
More Open Melting Points from EPI and other sources: on the path to ultimate curation 2011-05-25T21:12:05.491-04:00 As recently as 2008, Hughes et al. published a paper asking: Why Are Some Properties More Difficult To Predict than Others? A Study of QSPR of Solubility, Melting Point, and Log P
The question then is: why do QSPR models consistently perform significantly worse with regard to melting point? In the Introduction, we proposed three reasons for the failure of QSPR models: problems with the data, the descriptors, or the modeling methods. We find issues with the data unlikely to be the only source of error in Log S, Tm, and Log P predictions. Although the accuracy of the data provides a fundamental limit on the quality of a QSPR model, we attempted to minimize its influence by selecting consistent, high quality data... With regards to the accuracy of Tm and Log P data, both properties are associated with smaller errors than Log S measurement. Moreover, the melting point model performed the worst, yet it is by far the most straightforward property to measure... We suggest that the failure of existing chemoinformatics descriptors adequately to describe interactions in the crystalline solid phase may be a significant cause of error in melting point prediction.
Indeed, I have often heard that melting point prediction is notoriously difficult. This paper attempted to discover why and suggested that it is more likely that the problem is related to a deficiency in available descriptors rather than data quality. The authors seem to argue that taking a melting point is so straightforward that the resulting dataset is almost self-evidently high quality.
I might have thought the same before we started collecting melting point datasets.
It turns out that validating melting points can be very challenging and we have found enormous errors - even cases where the same compound in the same dataset is assigned very different melting points.
Under such conditions it is mathematically impossible to obtain high correlations between predicted and "measured" values. Since we have no additional information to go on (no spectral proof of purity, reports of heating rate, observations of melting behavior, etc.) the only way we can validate data points is to look for strong convergence from multiple sources. For example, consider the -130 C value for the melting point of ethanol (as discussed previously in detail). It is clearly an outlier from the very closely clustered values near -114 C. This outlier value is now highlighted in red to indicate that it was explicitly identified as not to be used in calculating the average. Andrew Lang has now updated the melting point explorer to allow a convenient way to select or deselect outliers and indicate a reason (service #3). For large separate datasets - such as the Alfa Aesar collection - this can be done right on the melting point explorer interface with a click. For values recorded in the Chemical Information Validation sheet, one has to update the spreadsheet directly. This is the same strategy that we used for our solubility data - in that case by marking outliers with "DONOTUSE". This way, we never delete data so that anyone can question our decision to exclude data points. Also by not deleting data, meaningful statistical analyses of the quality of currently available chemical information can be performed for a variety of applications. The donation of the Alfa Aesar dataset to the public domain was instrumental in allowing us to start systematically validating or excluding data points for practical or modeling applications. We have also just received confirmation that the entire EPI (PhysProp) melting point dataset can be used as Open Data. Many thanks to Antony Williams for coordinating this agreement and for approval and ad[...]
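The flag-rather-than-delete strategy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout and the field names `value_c` and `exclude_reason` are hypothetical, the source labels are placeholders, and apart from the -130 C outlier and the cluster near -114 C the individual values are illustrative rather than taken from the actual spreadsheet.

```python
from statistics import mean


def average_melting_point(records):
    """Average every measurement that has not been flagged for exclusion.

    Flagged records stay in the list (nothing is deleted), so anyone can
    review the reason given for leaving a value out of the average.
    """
    kept = [r["value_c"] for r in records if not r.get("exclude_reason")]
    if not kept:
        raise ValueError("no usable measurements")
    return mean(kept)


# Illustrative ethanol records; sources A-D and field names are hypothetical.
ethanol = [
    {"source": "A", "value_c": -114.0, "exclude_reason": None},
    {"source": "B", "value_c": -114.0, "exclude_reason": None},
    {"source": "C", "value_c": -115.0, "exclude_reason": None},
    {"source": "D", "value_c": -130.0,
     "exclude_reason": "outlier relative to cluster near -114 C"},
]

print(round(average_melting_point(ethanol), 1))
```

Keeping the reason next to the value, instead of removing the row, is what lets the excluded points remain available for later statistical analysis of data quality.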
La Science par Cahier de Laboratoire Ouvert à l'Acfas 2011-05-10T10:33:55.953-04:00 On May 9, 2011 I presented remotely for the French-Canadian Association for the Advancement of Science (ACFAS). This was the first time I gave a talk about Open Notebook Science in French. In fact the last time I gave a scientific talk in French was probably in 1995, when I was doing a postdoc at the Collège de France in Paris. I remember being teased for my French Canadian accent back then so happily that wasn't an issue this time. Even though I was a bit rusty I think I managed to communicate the key points well enough. (At least I hope I did.) My presentation was a good fit for the theme of the conference: Une autre science est possible : science collaborative, science ouverte, science engagée, contre la marchandisation du savoir. (Another science is possible: collaborative science, open science, engaged science, against the commercialization of knowledge). I would like to thank the organizers (Mélissa Lieutenant-Gosselin and Florence Piron) for inviting me to participate. I was able to record most of the talk (see below) but very near the end Skype decided to install an update and shut down so the recording ends somewhat abruptly. Given what people use Skype for, that default setting for updates really doesn't make much sense. (Slides: La Science a Cahier de Laboratoire Ouvert)
Breast Cancer Coalition talk on ONS and Taxol solubility 2011-05-08T16:29:20.930-04:00 On May 1, 2011 I presented "Accelerating Discovery by Sharing: a case for Open Notebook Science" at the National Breast Cancer Coalition Annual Advocacy Conference in Arlington, VA. This was the first year where they had a session on an Open Science related theme and the organizers invited me to highlight some of the tools and practices in chemistry which might be applicable to cancer research. I was really touched by the passion from those in the audience as well as the other speakers and conference participants I met afterward. For many, their deep connection with the cause was strongly rooted in a personal experience as breast cancer survivors themselves or their loved ones. Several expressed a frustration with the current system of sharing results from scientific studies. They felt that knowledge sharing is much slower than it needs to be and that potentially useful "negative" results are generally not disclosed at all. The NBCC has ambitiously set 2020 as the deadline to end breast cancer (including a countdown clock). It seems reasonable to me that encouraging transparency in research is a good strategy to accelerate progress. Of course, great care must be exercised wherever patient confidentiality is a factor. But health care researchers are already experienced with following protocols to anonymize datasets for publication. Opting to work more openly would not change that but it might affect when and how results are shared. Also there is a great deal of science related to breast cancer that does not directly involve human subjects. One initiative that particularly impressed me was The Susan G. Komen for the Cure Tissue Bank, presented by Susan Clare from Indiana University and moderated by Virginia Mason from the Inflammatory Breast Cancer Research Foundation.
As a result of this effort, thousands of women have donated healthy breast tissue to create a comprehensive database richly annotated with donor genetics and medical history. The idea of trying to tackle a disease state by first understanding normal functioning in great detail was apparently somewhat of a paradigm shift for the cancer research community and it was challenging to implement. According to Dr. Clare, data from the Tissue Bank have shown that the common practice of using apparently unaffected tissue adjacent to a tumor as a control may not be valid. This example highlights one of the key principles of Open Science: there is value in everyone knowing more - even if it isn't immediately clear how that knowledge will prove to be useful. In my experience, this is a fundamental point that distinguishes those who are likely to favor Open Science from those who reject its value. If two researchers are discussing Open Science and only one of them views this philosophy as being self-evident the conversation will likely be about why someone would want (or not want) to share more and the focus will fall on extrinsic motivators such as academic credit, intellectual property, etc. If both researchers view this philosophy as self-evident the conversation will probably gravitate towards how and what to share. I refer to this philosophy as being self-evident because I don't think people can become convinced through argumentation (I've never seen that happen). Within the realm of Open Notebook Science I have been involved in countless discussions about the value of sharing all experimental details - even when errors are discovered. I can think of a few ways in which this is useful - for example telegraphing a research direction to those in the field or providing data for researchers who study how science is actually done (suc[...]
Collaboration using Open Notebook Science in Academia book chapter 2011-05-08T11:37:29.206-04:00 I am very pleased to report that the book chapter that I co-wrote with Andrew Lang, Steve Koch and Cameron Neylon is now available online: Collaboration using Open Notebook Science in Academia. This is the 25th chapter of Collaborative Computational Technologies for Biomedical Research, edited by Sean Ekins, Maggie Hupcey, Antony Williams and Alpheus Bingham. Our chapter provides some fairly detailed examples of how Open Notebook Science can be used to enhance collaboration between researchers from similar or distant fields. It also suggests certain paths towards machine/human collaboration in science. Hopefully it will encourage researchers who have an interest in Open Science to experiment with some of the tools and strategies mentioned. I am also grateful to Wiley for choosing our chapter as the free online sample for the book! This book discusses the state-of-the-art collaborative and computing techniques for the pharmaceutical industry, the present and future implications and opportunities to advance healthcare research. The book tackles problems thoroughly, from both the human collaborative and the data and informatics side, and is very relevant to the day-to-day activities running a laboratory or a collaborative R&D project. It can be applied to help organizations make critical decisions about managing drug discovery and development partnership. The book follows a “man-methods-machine” format with sections on how to get people to collaborate, collaborative methods, and computational tools for collaboration. This book offers the reader a “getting started guide” or instruction on “how to collaborate” for new laboratories, new companies, and new partnerships, as well as a user manual for how to troubleshoot existing collaborations.
Evan Curtin is the May 2011 RSC ONS Challenge Winner 2011-05-07T10:24:19.385-04:00 Evan Curtin, a freshman chemistry student working under the supervision of Jean-Claude Bradley at Drexel University, is the May 2011 Royal Society of Chemistry Open Notebook Science Challenge Award winner. He wins a cash prize from the RSC. Evan's primary focus has centered on synthesizing aromatic imines and measuring their solubility in a number of organic solvents. This will allow us to generate Abraham descriptors for this class of compounds in order to predict their solubility in 70+ solvents. Coupled with our new model to include temperature dependent solubility, this should greatly facilitate optimal solvent prediction for this and related reactions. Imine formation is of particular interest to the UsefulChem group because it is the first step of the Ugi reaction, which we have used to synthesize compounds with anti-malarial activity. But it is also a simple, convenient reaction in itself to test our Solvent Selector's ability to predict optimal conditions (solvent and temperature) for isolation of products by precipitation. Evan's synthesis experiments are available here: http://usefulchem.wikispaces.com/Exp263 http://usefulchem.wikispaces.com/Exp262 http://usefulchem.wikispaces.com/Exp261 and his solubility experiments are listed here: http://onschallenge.wikispaces.com/Exp207 http://onschallenge.wikispaces.com/Exp206 http://onschallenge.wikispaces.com/Exp205 http://onschallenge.wikispaces.com/Exp204 http://onschallenge.wikispaces.com/Exp201 http://onschallenge.wikispaces.com/Exp198 http://onschallenge.wikispaces.com/Exp197 Three more RSC ONS Awards will be made during 2011. Submissions from students in the US and the UK are still welcome. For more information see: http://onschallenge.wikispaces.com http://onschallenge.wikispaces.com/RSCAwards2010
ACS and ACRL presentations on web services and trust in science 2011-05-24T13:47:33.043-04:00 Update: the recording of my ACS talk on Rapid Dissemination of Chemical Information for people and machines using Open Notebook Science is now available here. On March 30 and 31, 2011 I presented two related talks - the first remotely for the American Chemical Society (ACS) Meeting and the second in Philadelphia at the meeting for the Association of College and Research Libraries (ACRL). In the ACS talk "Rapid Dissemination of Chemical Information for people and machines using Open Notebook Science", I spoke for the first time in detail about the results of the open modeling Andrew Lang and I carried out on the open dataset of melting points we collected starting with the Alfa Aesar dataset recently made public. We used Skype and Google Presenter with the help of Peter Murray-Rust on site at the conference and it went fairly well I think. Henry Rzepa had a good question about polymorphism possibly being responsible for different melting points from various sources. I don't think that is the problem in most of these cases but we can certainly spend some time investigating the reports of polymorphism for cases where the information is available. One of the big problems is that we don't know the history of the sample used for a melting point from most sources like chemical vendor sites. At least in journal articles we might be told which solvent was used to crystallize the sample. If multiple sources agree on a certain melting point and there is one outlier, I think it is reasonable to assume that the common melting point is likely to correspond to the thermodynamically favored polymorph. This might not be correct in all cases but - without the means to discover more information about the sample histories - I think it makes sense to proceed in this way.
Since we don't consider polymorphism in our modeling, there is an implicit assumption that - in the case of polymorphism - we are dealing with the thermodynamically most stable form. My ACRL talk "Is there a role for Trust in Science?" focused more on the Chemical Information Validation study and outcomes. There were several good questions at the end. One particularly good comment addressed my speculation that within a few years, the open models in most of the useful chemical spaces will be sufficiently good that it will be as easy to Google a melting point or a solubility as it is now to get driving directions. The question was: weren't we just transferring trust from one information source to another, namely these models? I don't think the concept of trust applies in these cases because the training sets, the descriptors and the performance of the models are (and will be) open. This is in sharp contrast with most commercial software generating predictions for solubility and melting points - these are generally black boxes because either the training set, the model or the descriptors are not open. (Slides: Open Notebook Science Web Services - ACS Spring 2011; ACRL Trust in Science Talk) [...]
Towards the automated discovery of useful solubility applications 2011-03-27T21:57:35.161-04:00 Last week, I came across (via David Bradley) a paper by an MIT group regarding the desalination of water using a very clever application of solubility behavior: Anurag Bajpayee, Tengfei Luo, Andrew Muto and Gang Chen, Energy Environ. Sci., 2011, Very low temperature membrane-free desalination by directional solvent extraction (article, summary). The technique simply involves the heating of saltwater with molten decanoic acid to 40-80 C. Some water dissolves into the decanoic acid, leaving the salt behind. The layers are then separated and, upon cooling to 34 C, sufficiently pure water separates out. Any traces of decanoic acid are inconsequential since this compound is already present in many foods at higher levels. From a technological standpoint, I can't think of a reason why this solution could not have been discovered and implemented 100 years ago. It makes you wonder how many other elegant solutions to real problems could be uncovered by connecting the right pieces together. To me, this is where the efforts of Open Science and the automation of the scientific process will pay off first. For this to happen on a global level, two key requirements must be met: 1) Information must be freely available, optimally as a web service (measurements if possible - otherwise a predicted value, preferably from an Open Model). 2) There has to be a significantly automated way of identifying what is important enough to be solved. Since we have been working on fulfilling the first requirement for solubility data, I first looked at our available services to see if there was anything there that could have pointed towards this solution. Although we have a measured (0.0004 M) and predicted (0.001 M) room temperature solubility of decanoic acid in water, our best prediction service can't do the opposite: the solubility of water in decanoic acid.
For that we would need the Abraham descriptors for decanoic acid as a solvent and those are not yet available as far as I'm aware. Also, we use a model to predict solubility at different temperatures - but it assumes that the solute is miscible with the solvent at its melting point. This is probably a reasonable assumption for the most part but it fails when the solute and the solvent are too radically dissimilar (e.g. water/hydrophobic organic compounds). In this particular application, decanoic acid melts at 31 C and the process occurs in the 34-80 C range. But even if we had the necessary models (and corresponding web services) for the decanoic acid/water/NaCl system, could it have been flagged in an automated way as being potentially "useful" or even "interesting"? For utility assessment, humans are still the best source. Luckily, they often record this information tagged with common phrases in the introductory paragraphs of scientific documents. (In fact, this is the origin of the UsefulChem project). For example, if we search for "there is a pressing need for" AND solubility in a Google search, most of the results provide reasonable answers to the question of what a useful application of solubility might be. I have summarized the initial results in this sheet. The first result is: "there is a pressing need for new materials for efficient CO2 separation" from a Macromolecules article in 2005. The general problem needing solving would correspond to "global warming/CO2 sequestration" and the modeling challenge would be "gas solubility". Analyzing the first 9 results in this way gives us the following problem types: global warming/CO2 sequestration, fire control, global w[...]
Open modeling of melting point data 2011-03-22T16:31:49.592-04:00 The contribution of Alfa Aesar melting point data to our open collection has facilitated the validation of a significant amount of the entire dataset. However, this process of curation is never-ending. A good example is the discovery of an error in one of the sources for the melting point of warfarin. Following David Weinberger's post about our melting point explorer, his brother Andy noticed a problem and this enabled us to fix it. In a way, creating an open environment to make it easy to find and report errors - as well as add new data - complicates scientific evaluation. In order to report a reproducible process and outcome, it is necessary to take a snapshot of the dataset. Choosing the exact composition of a dataset for a particular application is somewhat arbitrary. Aside from selecting a threshold for excluding measurements that deviate too much, compounds may be excluded based on their type. For the sake of clarity, we archived the various datasets we created from multiple sources with brief descriptions of the filtering and merging at each step. From the perspective of an organic chemist, ONSMP013 is probably the most useful at this time. It contains averaged measurements for 12634 organic compounds and excludes salts, inorganics or organometallics. The original file provided by Alfa Aesar contained several of these excluded compounds and can be obtained from ONSMP000. It might be interesting at some point to create a collection of melting points for inorganics or salts. We would welcome contributions of collections of melting points with different filters. One of the advantages of ONSMP013 is that it is possible to generate CDK descriptors for each entry (and these are included in the spreadsheet).
By not using commercial software to generate descriptors, it enables fully transparent modeling - and extension of that modeling by anyone. With this in mind, Andrew Lang has used ONSMP013 to generate a Random forest melting point model (MPM002). The most important descriptors turned out to be the number of hydrogen bond donors and the Topological Polar Surface Area (TPSA). The scatter plot below shows the correlation (R² = 0.79) between the predicted and experimental values (color represents TPSA and size relates to H-bond donors). Andy has described in much more detail the rationale for selecting the Random forest approach over a linear model in MPM001. He has also compared the performance of CDK descriptors versus those from a commercial program for a small set of drug melting points in MPM003. The Random forest model (MPM002) is also now available as a web service by entering the ChemSpiderID (CSID) of a compound in a URL. See this example for benzoic acid. If experimental results exist they will appear on top and a link to obtain the predicted melting point will appear underneath. Note that the current web service for predicting melting points can be slow - it may take a minute to process. Additional web services for melting point data will be listed on the ONS web services wiki.[...]
Validating Melting Point Data from Alfa Aesar, EPI and MDPI 2011-03-04T20:03:19.786-05:00 I recently reported that Alfa Aesar publicly released their melting point dataset for us to use to take into account temperature in solubility measurements. Since then, Andrew Lang, Antony Williams and I have had the opportunity to look into the details of this and other open melting point datasets. (See here for links and definitions of each dataset.) An initial evaluation by Andy found that the Alfa Aesar collection yielded better correlations with selected molecular descriptors compared to the Karthikeyan dataset (originally from MDPI), an open collection of melting points used by several researchers to provide predictive melting point models. This suggested that the quality of the Alfa Aesar dataset might be higher. Inspection of the Karthikeyan dataset did reveal some anomalies that may account for the poor correlations. First there were several duplicates - identical compounds with different melting points, sometimes radically different (up to 176 C). A total of 33 duplicates (66 measurements) were found with a difference in melting points greater than 10 C (see ONSMP008 dataset). Here are some examples. A second problem we ran into involved difficulty processing the SMILES in the Karthikeyan collection. Most of these involved SO2 groups. An attempt to view this SMILES string in ChemSketch ends up with two extra hydrogens on the sulfur: [S+2]([O-])([O-])(OCC#N)c1ccc(C)cc1 Other SMILES strings render with 5 bonds on a carbon and ChemSketch draws these with a red X on the problematic atom. See for example this SMILES string: O=C(OC=1=C2C=CC=CC2=NC=1c1ccccc1)C Note that the sulfur compounds appear to render correctly on Daylight's Depict site. In total 311 problematic SMILES from the Karthikeyan collection were removed (see ONSMP009). With the accumulation of melting point sources, overlapping coverage is revealing likely incorrect values.
For example, 5 measurements are reported for phenylacetic acid. Four of the values cluster very close to 77 C and the other - from the Karthikeyan dataset - is clearly an outlier at 150 C. In order to predict the temperature dependence for the solutes in our database, Andy collected the EPI experimental melting points, which are listed under the predicted properties tab in ChemSpider (ultimately from the EPA). (There are predicted EPI values there but we only used the ones marked exp). This collection of 150 compounds was then listed in a spreadsheet (ONSMP010) and each entry was marked as having only an EPI value (44 compounds) or having at least one other measurement from another source (106 compounds). Out of those having at least one more value, 10 reported significant differences (> 5 C) between the measurements. Upon investigation, many of these point strongly to the error lying with the EPI dataset. For example, the EPI melting point for phenyl salicylate is over 85 C higher than that reported by both Sigma-Aldrich and Alfa Aesar. These preliminary results suggest that as much as 10% of the EPI experimental melting point dataset is significantly in error. Only a systematic analysis over time will reveal the full extent of the deficiencies. So far the Alfa Aesar dataset has not produced many outliers, when other sources are available for comparison. However, even here, there are some surprising results. One of the most well studied organic compounds - ethanol - is listed with a melting point of -130 C by Alfa Aesar, clearly an outlier from the other values clustered around -114 C. When downl[...]
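The cross-source comparison described in this post (flagging compounds whose reported melting points differ by more than a chosen threshold) might look something like the following Python sketch. The data layout, the function name `flag_discrepancies` and the source labels other than Karthikeyan are hypothetical; the phenylacetic acid values follow the post only approximately (four values near 77 C and one at 150 C), and the second compound is a made-up single-source entry.

```python
def flag_discrepancies(measurements, threshold_c=5.0):
    """Return {compound: spread} for compounds whose sources disagree.

    `measurements` maps a compound name to a list of (source, value_c)
    pairs. Compounds with a single measurement cannot be cross-checked
    and are skipped.
    """
    flagged = {}
    for compound, values in measurements.items():
        temps = [value for _source, value in values]
        if len(temps) < 2:
            continue
        spread = max(temps) - min(temps)
        if spread > threshold_c:
            flagged[compound] = spread
    return flagged


# Illustrative input; "source 1" to "source 4" are placeholder labels.
sample = {
    "phenylacetic acid": [
        ("source 1", 77.0), ("source 2", 77.0),
        ("source 3", 77.0), ("source 4", 77.0),
        ("Karthikeyan", 150.0),
    ],
    "hypothetical compound X": [("EPI", 42.0)],
}

print(flag_discrepancies(sample))
```

For the EPI numbers quoted above, 10 discrepant entries out of the 106 compounds with a second source works out to 10/106, or about 9.4%, which is where the "as much as 10%" estimate comes from. Note that a large spread only says the sources disagree; deciding which source is wrong still requires looking at how the remaining values cluster.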
ONS Solubility Challenge Book cited in a Langmuir nanotechnology paper 2011-02-26T15:33:01.702-05:00 An interesting application of the data from the Open Notebook Science Solubility Challenge has recently been reported in Langmuir: "Enhanced Ordering in Gold Nanoparticles Self-Assembly through Excess Free Ligands" by Cindy Y. Lau, Huigao Duan, Fuke Wang, Chao Bin He, Hong Yee Low and Joel K. W. Yang (Feb 24, 2011). The context is as follows, and the reference is to Edition 3 of the ONS Solubility Challenge Book. "Although to our best knowledge there lacks literature value of OA solubility in the two solvents, the 10-fold better solubility of 1-otadecylamine (sic), the saturated version of oleylamine, in toluene than hexane is in line with our hypothesis.(33) This increased solubility caused the OA molecules that were originally attached to the AuNPs to gradually detach from the AuNPs, which is supported by our observations in poor AuNP stability and surface-pressure isotherms." This is a nice application of solubility to understand and control the behavior of gold nanoparticles. It is in line with some of the applications I discussed at a recent Nanoinformatics conference, where I think there is a place for the interlinking of information between solubility and nanotechnology databases. I have to admit that it is somewhat ironic to see this citation in Langmuir, given the controversy about a year ago (post and FF discussion) regarding the citation of non-traditional literature.
Alfa Aesar melting point data now openly available 2011-02-21T21:02:08.947-05:00 A few weeks ago, John Shirley - Global Marketing Manager at Alfa Aesar - contacted me to discuss the Chemical Information Validation results I posted from my 2010 Chemical Information Retrieval class. Our research showed that Alfa Aesar was the second most common source of chemical property information from the class assignment. We explored some possible ways that we could collaborate. With our recent report of the use of melting point measurements to predict temperature solubility curves, the Alfa Aesar melting point data collection could prove immensely useful for our Open Notebook Science solubility project. However, since we are committed to working transparently, the only way we could accept the dataset is if it were shared as Open Data. I am extremely pleased to report that Alfa Aesar has agreed to this requirement and we hope that this gesture will encourage other chemical companies to follow suit. The initial file provided by Alfa Aesar did not store melting points in a database-ready format - it included ranges, non-numeric characters and entries reporting decomposition or sublimation. One of the benefits we could provide back to the company was cleaning up the melting point field to pure numerical values ready for sorting and other database processing. This processed collection contains 12986 entries. Note that these entries are not necessarily different chemical compositions since they refer to specific catalog entries with different purities or packaging. For our purposes of prioritizing organic chemicals for solubility modeling and applications we curated this initial dataset by collapsing redundant chemical compositions and excluding inorganics (including organometallics) and salts. We did retain organosilicon, organophosphorus and organoboron compounds.
Because the primary key for all of our projects depends on ChemSpiderIDs, all compounds were assigned CSIDs by deposition in the ChemSpider database if necessary. SMILES were also provided for each entry, as well as a corresponding link to the Alfa Aesar catalog page. This curated collection contains 8739 entries. For completeness, we thought it would be useful to merge the Alfa Aesar curated dataset with other collections for convenient federated searches. We thus added the Karthikeyan melting point dataset, which has been used in several cases to model melting point predictions. This dataset was downloaded from Cheminformatics.org. Although we were able to use most of the structures in that collection, a few hundred were left out because of some difficulty in resolving some of the SMILES, perhaps related to the differences in algorithms used by OpenBabel and OpenEye. Hopefully this issue will be resolved in a simple way and the whole dataset can be incorporated in the near future. This final curated collection contains 4084 entries. Similarly the smaller Bergstrom dataset was included after processing the original file to a curated collection of 277 drug molecules. Finally, the melting point entries from the ChemInfo Validation sheet itself, generated by student contributions, were added, bringing the collection to a current total of 13,436 Open Data melting point values. We believe that this is currently the largest such collection and that it should facilitate the development of completely transparent and free models for the prediction of melting points. As we have argued recently, improved access to measured or predi[...]
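As a rough illustration of the clean-up step mentioned in this post (turning ranges, stray characters and decomposition or sublimation notes into plain numbers), here is a minimal Python sketch. It is not the procedure actually applied to the Alfa Aesar file: the input formats, the choice to take the midpoint of a range, and the choice to return `None` for decomposition or sublimation entries are all assumptions made for the example, and `parse_melting_point` is a hypothetical name.

```python
import re

# A number with optional sign and decimals, e.g. "-114" or "77.5".
_NUM = r"-?\d+(?:\.\d+)?"
# Two numbers joined by a hyphen, en dash or "to", e.g. "120-122".
_RANGE = re.compile(rf"({_NUM})\s*(?:-|\u2013|to)\s*({_NUM})")
_SINGLE = re.compile(_NUM)


def parse_melting_point(raw):
    """Convert a free-text melting point field to a float in degrees C.

    Returns None when no usable number is found or when the entry
    reports decomposition or sublimation (an assumption of this sketch).
    """
    text = raw.lower()
    if "dec" in text or "subl" in text:
        return None
    range_match = _RANGE.search(text)
    if range_match:
        low, high = (float(g) for g in range_match.groups())
        return (low + high) / 2  # midpoint of the reported range
    single_match = _SINGLE.search(text)
    return float(single_match.group()) if single_match else None


for field in ["120-122", "-114", "ca. 77 C", "250 (dec.)", "n/a"]:
    print(repr(field), "->", parse_melting_point(field))
```

Whatever rule is chosen for ranges and decomposition entries, recording it alongside the dataset matters, since it changes the numbers that later models are trained on.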