33 Matching Annotations
  1. Apr 2017
    1. One can perform partial synonym matching, by setting this option to “word”. 

      When using the word option, does PubChem limit the number of results to a specific number, such as 10,000? I have seen other systems limit results unless the request comes from a verified user with a token or key proving that they are not a robot.
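For reference, a minimal sketch of how the "word" option could be passed to PUG-REST, assuming the `name_type=word` URL argument on a compound name lookup. The helper `name_search_url` is hypothetical and only builds the URL; it does not send a request and says nothing about any result limit.

```python
from urllib.parse import quote, urlencode

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def name_search_url(name, partial=False):
    """Build a PUG-REST URL that asks for the CIDs matching a compound name.

    With partial=True the name_type=word argument is added, which requests
    partial (word-level) synonym matching instead of an exact match.
    """
    url = f"{PUG_REST}/compound/name/{quote(name, safe='')}/cids/JSON"
    if partial:
        url += "?" + urlencode({"name_type": "word"})
    return url

print(name_search_url("myxalamid", partial=True))
```

Fetching the printed URL (for example with `urllib.request.urlopen`) and counting the returned CIDs would be one way to check empirically whether a cap is applied.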

    2. When a list of compounds are specified as the input, the image of only the first compound on the list will be returned.

      Is there a way to get images for a whole list of submitted chemicals, or would they need to be submitted as separate individual searches? I notice a quick way to do this is shown below in the spreadsheet, but that looks like it is submitted as separate searches.
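Since the quoted passage says only the first compound's image is returned, one workaround is a separate request per compound. The sketch below assumes the PUG-REST pattern `compound/cid/<CID>/PNG`; the function name `image_urls` and the example CIDs are only illustrative, and the code builds URLs without downloading anything.

```python
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def image_urls(cids):
    """Return one PNG image URL per CID, keyed by CID.

    Each URL is a separate request, because a multi-CID image request
    returns the image of only the first compound.
    """
    return {cid: f"{PUG_REST}/compound/cid/{int(cid)}/PNG" for cid in cids}

for cid, url in image_urls([2244, 3672, 1983]).items():
    print(cid, url)
```

This is effectively what a spreadsheet with one formula per row does: many small searches rather than one batched search.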

    1. Currently PubChem contains more than 180 million depositor-provided substance descriptions, 60 million unique chemical structures and 225 million biological test results from 1 million assays, covering more than 9000 unique protein target sequences (as of January 2015). Programmatic access to this vast amount of data by the chemical biology and biomedical communities presents new opportunities for data-driven research in a ‘big data’ era.

      I'm curious what type of database runs all of this, with so many different queries and data sets available. Since the data is relational, I would think a relational database such as MySQL, Oracle, or Microsoft SQL Server would be needed. My second question: does this all run on cloud servers or on traditional servers? The deeper we get into these modules, the more amazed I am at what PubChem can do.


    2. However, it should be noted that recent re-engineering of many of PubChem's search services has made them fast enough to work synchronously. These operations have a ‘fast’ prefix on them in PUG-REST, in order to preserve backwards compatibility of the asynchronous variants as described above. For example, the ‘fastsubstructure’ input can be used to retrieve a list of CIDs from a chemical substructure search; this input can then be used with any other PUG-REST operation on CIDs, in a single request.

      I'm curious which individual pieces of the search happen synchronously, as implied by the "fast" prefix. Does this refer to something like multiple results being processed simultaneously, instead of one at a time, based on a query result?
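One way to read "synchronous" here: the answer comes back in the response to the same request, whereas the asynchronous variant first returns a list key that has to be polled later. The sketch below only builds the URLs for both styles, assuming the PUG-REST patterns `substructure`, `fastsubstructure`, and `listkey`; the function names and the example SMILES are hypothetical.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def async_search_url(smiles):
    """Step 1 of the asynchronous route: the response carries a list key."""
    return f"{PUG_REST}/compound/substructure/smiles/{quote(smiles, safe='')}/JSON"

def async_result_url(list_key):
    """Step 2 of the asynchronous route: poll with the list key for the CIDs."""
    return f"{PUG_REST}/compound/listkey/{list_key}/cids/JSON"

def fast_search_url(smiles, operation="cids/JSON"):
    """Synchronous route: search and follow-up operation in a single request."""
    return f"{PUG_REST}/compound/fastsubstructure/smiles/{quote(smiles, safe='')}/{operation}"

smiles = "c1ccccc1C(=O)O"  # example query structure only
print(async_search_url(smiles))
print(fast_search_url(smiles))
print(fast_search_url(smiles, "property/MolecularFormula/CSV"))
```

So the "fast" prefix is about one round trip instead of two or more, not about how the individual hits are processed on the server.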


    1. The CGI interprets your incoming request, initiates the appropriate action, then returns results (also) in XML format.

      Is this service mostly limited to XML output? I notice some services offer output in ASN.1, SDF, CSV, etc., but can this be extended to all services so that a particular format can be requested, or is XML the limit?

      The reason I ask is that a lot of data can quickly produce excessively large files in XML compared with something like CSV, if the file only needs to contain one type of result.
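To put a rough number on the size concern, the sketch below serializes the same made-up table as CSV and as XML using only the Python standard library. The element names (`Properties`, `Property`, `CID`, `MolecularWeight`) and the values are invented for illustration and are not PubChem's actual schema or data.

```python
import csv
import io
import xml.etree.ElementTree as ET

# 1,000 made-up records with a single numeric result each.
rows = [{"cid": i, "mw": 100.0 + i} for i in range(1, 1001)]

def to_csv(records):
    """Serialize records as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["cid", "mw"])
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

def to_xml(records):
    """Serialize the same records as XML, one element per field."""
    root = ET.Element("Properties")
    for record in records:
        item = ET.SubElement(root, "Property")
        ET.SubElement(item, "CID").text = str(record["cid"])
        ET.SubElement(item, "MolecularWeight").text = str(record["mw"])
    return ET.tostring(root, encoding="unicode")

csv_size = len(to_csv(rows).encode("utf-8"))
xml_size = len(to_xml(rows).encode("utf-8"))
print(f"CSV: {csv_size} bytes, XML: {xml_size} bytes, ratio: {xml_size / csv_size:.1f}x")
```

The repeated opening and closing tags account for most of the difference, which is why a flat, single-type result tends to be much smaller as CSV.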


    1. Learn to code interactively, for free.

      This is a really good site and I am glad it was mentioned. I use this site a lot because I am not a programmer but I often have research meetings with people who do lots of programming and it quickly lets me learn the basics of what they do.


  2. Mar 2017
    1. ELISA

      The PubChem BioAssay Classification Tree is really an amazing feature. Within a few minutes I was able to go through the classification of bacteria to a species I used to work with in microbiology, combine the sources with a few keywords, and find the exact AID needed to run confirmation tests. Having this coupled to the publication on the PubMed side is a great feature for scientists to have access to.

    1. The method is also used to quantify the degree of chirality of asymmetric molecules and to investigate the chirality of biphenyl and the amino acids.

      This article was cited in the OLCC Cheminformatics Class Module 6 for the atom-centered Gaussian-shape comparison method. I will see if I can access the full article from campus tomorrow, but it's very difficult for me to understand how applying a Gaussian representation to a molecule would help provide information about the degree of chirality of certain molecules. For example, it would be difficult to understand the shape of a miniature model sailboat inside a glass bottle if the only shape you see is the outer bottle and not what's contained inside.


    1. ROCS is a fast shape comparison application, based on the idea that molecules have similar shape if their volumes overlay well and any volume mismatch is a measure of dissimilarity. It uses a smooth Gaussian function to represent the molecular volume [5], so it is possible to routinely minimize to the best global match.

      I assume that the ROCS description here of volume overlays is what the OLCC Module 6 page refers to when it discusses finding the best alignment for a 3D structure overlay. It seems that, because of the Gaussian blur around the chemical being matched against others, many of the individual bonds would not be compared in the matching process.


    1. The method provides good accuracy across a range of organic and drug-like molecules. The core parameterization was provided by high-quality quantum calculations, rather than experimental data, across ~500 test molecular systems. The method includes parameters for a wide range of atom types including the following common organic elements: C, H, N, O, F, Si, P, S, Cl, Br, and I. It also supports the following common ions: Fe+2, Fe+3, F-, Cl-, Br-, Li+, Na+, K+, Zn+2, Ca+2, Cu+1, Cu+2, and Mg+2. The Open Babel implementation should automatically perform atom typing and recognize these elements.
    1. Cameo Chemicals: Research and Development, Governmental Organizations, Curation Efforts

      I am pretty familiar with the software that CAMEO Chemicals puts out that we use in emergency response but had no idea they also provided data to services like PubChem.


    1. The listed chemicals are safer alternatives, grouped by their functional-use class.

      There are quite a few safer alternatives in this list, but none are listed on the PubChem page under the EPA Safer Choice Label for Use and Manufacturing in the PubChem Classification Browser, which links to this page.


    1. The input list can be provided by text, a file, or Entrez history. Registry IDs, SIDs, CIDs, InChIKey, and SMILES can be separated by white space, comma, tab, or carriage return, however InChI and Synonyms should be separated by tab or carriage return only. If Registry IDs are provided, the Registry Source Name must also be provided.

      This says that an input list can be provided by a file. What types of files would work: CSV, XML, etc.? Also, is there a limit to how many identifiers can be in a single exchange?
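The separator rules in the quoted passage can be sketched as two small parsers: one for identifiers that may be split on white space or commas, and one for InChI strings and synonyms, which may contain spaces or commas and so are split only on tabs and line breaks. The function names are hypothetical, and this is only a local illustration of the rules, not the service's own parser.

```python
import re

def split_ids(text):
    """Split SIDs, CIDs, InChIKeys, or SMILES on white space, commas, tabs, or line breaks."""
    return [token for token in re.split(r"[\s,]+", text) if token]

def split_names(text):
    """Split InChI strings or synonyms on tabs or line breaks only,
    so that spaces and commas inside a name are preserved."""
    return [part.strip() for part in re.split(r"[\t\r\n]+", text) if part.strip()]

print(split_ids("2244, 3672\t5793\n1983"))
print(split_names("acetic acid\t1,2-dichloroethane\nsodium chloride"))
```

Under these rules a plain text file with one entry per line would be safe for every identifier type, while a comma-separated file would only be safe for the first group.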

    1. To confound the issue, some energy products (eg, 5-Hour Energy, Monster Energy, Rockstar) are marketed as dietary supplements, while others (eg, Red Bull) are marketed as conventional foods.4,5 Although the FDA regulates both dietary supplements and conventional foods under the Federal Food, Drug, and Cosmetic Act (FFDCA), the requirements for these products are different, including the process for marketing and reporting of adverse events post marketing.

      It is very interesting that, depending on how a company markets or classifies its product, different rules apply, even though the product is intended to be consumed either way.

    1. PubChem is an archive.  When a record is updated, it is versioned and the original record is retained and accessible.  These dates are more prominent and now includes a modification date table for easy access.  For example, the Substance Record page for SID 12345 has multiple versions:

      This is a really great feature. With all data, I believe things should be preserved even as new things replace the old.


    1. (22)      ToxNet (http://toxnet.nlm.nih.gov/) (Accessed on 2/19/2017).

      I am really glad that this source was used in the write-up. I'm not sure why I was so unfamiliar with some of what this resource provides, given all of the articles I have read from TOXNET, but I will be using it more in the future.


    2. TOXNET (http://toxnet.nlm.nih.gov/)22-25, maintained by the National Library of Medicine (NLM) at NIH, is a group of databases covering toxicology, hazardous chemicals, toxic releases, environmental and occupational health, risk assessment.  Currently, 16 databases are integrated into the TOXNET system, and users can search all these databases either at once or individually.  While all the 16 databases provide valuable information, three of them may be worth mentioning in the context of this course.

      If I were a chemical supply vendor and wanted to update SDS sheets with important new information, would this be the recommended first place to search? It seems like a perfect source, but many SDS entries state, for example, that something is a possible carcinogen. It's not really clear to me whether TOXNET includes possible hazards or only those that have been confirmed through scientific means.


    3. The error propagation issue is a serious, but very common, problem.39,40  Therefore, when using information in these databases, one should keep in mind various data accuracy and quality issues prevalent in these databases.

      Over the years, we have seen some of the errors mentioned when pulling data sets and using APIs, especially when we used database entries as the basis for conversions. The most difficult cases involved either identifiers that had multiple possibilities or a chemical's common name, which varied a lot between databases.


    4. As of February 2017, PubChem contains more than 235 million depositor-provided substances, 94 million unique chemical structures, and one million biological assays, which cover about 10 thousand protein target sequences.

      I know that PubChem houses a lot of data and also pulls data from many other sources. Given this, would the 235 million depositor-provided substances mean that it holds that many in its own database, or is that number a sum of entries held in many separate databases?


    5. 2.5. NIST Webbook: thermodynamic and spectroscopic data of chemicals

      I find it really fascinating that the NIST Webbook has versions that go back to 1996 for their database. Do they offer direct access to this database or would bulk data only be acquired through their product services or web scraping?


    6. It is very common that a primary database curates its data with information drawn from secondary databases. 

      I was unaware that this was very common. It's easy to wrap my head around secondary databases pulling from primary or even from other secondary sources, but I wonder whether pulling secondary data into a primary database would then make it primary data, based on the way it is used in the new database. Maybe the data exchange and integration mentioned in the following sentence creates a new way to use the data directly as primary data.


    1. Data is provided by hundreds of contributors (http://pubchem.ncbi.nlm.nih.gov/sources/), including publishers, researchers, chemical vendors, pharmaceutical companies, and a number of important chemical biology resources.  Each of these data sources contributes a description of chemical substance samples for which they have information.

      How do you become a data contributor if you have things you would like to put into PubChem?

      Surely, this would be a fairly simple process with millions of data sets already put into the system.


  3. Feb 2017
    1. NIST Atomic Spectra Database - http://www.nist.gov/pml/data/asd.cfm; NIST Molecular Spectra Databases - http://www.nist.gov/pml/data/molspec.cfm; NMR Shift DB - http://nmrshiftdb.nmr.uni-koeln.de/; Human Metabolome Database - http://www.hmdb.ca; EPA Emissions Measurement Center Spectral Database - http://www3.epa.gov/ttn/emc/ftir/index.html; MassBank - http://www.massbank.jp/; Romanian Database of Raman Spectroscopy - http://rdrs.uaic.ro/index.html

      It seems just a matter of time before all of the online sources for spectra offer APIs, possibly including predicted spectra. I wonder how much load these kinds of requests put on a server.

    2. This capability was added in order that spectral files were not to large for the storage media available (see above).

      Although compression is nice for saving space, would it limit the ability to search spectral databases if reading compressed files makes things more complicated? When a lot of files need to be accessed in a single search, such as through an API or a user-interface search, I would think this may take longer.
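As a rough illustration of the trade-off, the sketch below uses gzip from the Python standard library as a stand-in (the quoted passage refers to compression built into the spectral file format itself, which is not reproduced here). It shows the space saved on a made-up spectrum and that a header field can still be read by streaming through the compressed data. The header labels, values, and the helper `read_title` are all invented for the example.

```python
import gzip
import io

# A made-up spectrum: three header lines followed by 3,000 x/y data points.
header = "##TITLE=example spectrum\n##XUNITS=1/CM\n##YUNITS=ABSORBANCE\n"
points = "".join(f"{400 + i} {(i * 37) % 1000 / 1000:.3f}\n" for i in range(3000))
raw = (header + points).encode("ascii")
packed = gzip.compress(raw)

def read_title(blob):
    """Stream through a gzip-compressed spectrum and stop at the title line,
    so the whole file never has to be decompressed into memory at once."""
    with gzip.open(io.BytesIO(blob), "rt", encoding="ascii") as handle:
        for line in handle:
            if line.startswith("##TITLE="):
                return line[len("##TITLE="):].strip()
    return None

print(f"raw: {len(raw)} bytes, compressed: {len(packed)} bytes")
print(read_title(packed))
```

A search system could also keep the searchable fields (title, units, peak list) in an uncompressed index and decompress only the files that match, which would avoid most of the cost raised above.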

    1. It is fun to point out though that Binary Large Objects (BLOBs) can be used to store binary files like images, audio, etc.

      Would storing images in a database as a BLOB cause poor performance? This seems like a lot for a single entry. I would think storing a link to the image in the database would reduce the entry size.
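The two options can be compared with a small in-memory SQLite sketch. The table names, the file path, and the fake image bytes are all assumptions made for the example; the point is only the difference in how much each row has to hold.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE image_blob (cid INTEGER PRIMARY KEY, png BLOB)")
conn.execute("CREATE TABLE image_link (cid INTEGER PRIMARY KEY, path TEXT)")

# Fake image: the 8-byte PNG signature followed by 50,000 zero bytes.
fake_png = b"\x89PNG\r\n\x1a\n" + bytes(50_000)

# Option 1: the image bytes live inside the row.
conn.execute("INSERT INTO image_blob VALUES (?, ?)", (2244, fake_png))
# Option 2: the row only stores where the image file can be found.
conn.execute("INSERT INTO image_link VALUES (?, ?)", (2244, "images/2244.png"))

blob_size = conn.execute(
    "SELECT length(png) FROM image_blob WHERE cid = 2244").fetchone()[0]
link_size = conn.execute(
    "SELECT length(path) FROM image_link WHERE cid = 2244").fetchone()[0]
print(f"BLOB row payload: {blob_size} bytes, link row payload: {link_size} characters")
```

Storing the link keeps rows small, but the database can then no longer guarantee that the file still exists, so the choice is a trade-off between row size and consistency rather than a clear win either way.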

    1. Line notations are not the only way of communicating structure: also popular are file-based formats such as MDL's MOL File19 (and its variant, the SD File), and Chemical Markup Language20 (CML, a variant of XML

      One of the big drawbacks that I often hear when using XML with databasing is that it quickly starts making very large files in terms of storage. Would the same hold true for using CML with large chemical databases? If so, what size reduction could be expected for the same size collection of chemicals stored using connection tables?

    1. ● Main layer
         ○ Chemical formula, no prefix
         ○ atom connections, prefix ‘c’
         ○ hydrogen atoms, ‘h’
       ● Charge layer
         ○ proton sublayer, ‘p’
         ○ Charge sublayer, ‘q’
       ● Stereochemical layer
         ○ Double bonds and cumulenes, ‘b’
         ○ tetrahedral stereochemistry of atoms and allenes, ‘t’ or ‘m’
         ○ Stereochemistry information type, ‘s’
       ● Isotope layer, ‘I’, ‘h’, and ‘b’, ‘t’ and ‘m’ for stereochemistry of isotopes
       ● Fixed-h layer, ‘f’

      Will a future release version of InChI include a layer to include inorganic molecules that can be standardized? I have come across things in the past saying that it really only works efficiently for organic molecules. I have the understanding that some inorganic things have been included, but not the more complex structures.

    1. Hashing is a one-way mathematical transformation typically used to calculate a compact fixed length digital representation of a much longer string of arbitrary length.

      This is very similar to how I used to protect passwords on my server before SSH keys became the standard. We used similar protocols based on MD5, so it's very interesting to see the same technique that kept something from being found as a security measure now used to make something easier to find with search engines.
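The "compact fixed length" property in the quoted definition is easy to see with the standard library's `hashlib`. The sketch uses SHA-256 and hashes an InChI string and a much longer string; it does not reproduce the additional truncation and letter encoding that turn such a hash into an InChIKey.

```python
import hashlib

def digest(text):
    """Return the SHA-256 digest of a string as 64 hexadecimal characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

short_inchi = "InChI=1S/CH4/h1H4"        # methane
long_input = "InChI=1S/" + "C" * 5000    # arbitrary long string, not a valid InChI

for value in (short_inchi, long_input):
    print(len(value), "characters in ->", len(digest(value)), "hex characters out")
```

Both inputs give a 64-character digest, and the original string cannot be recovered from it, which is the same one-way property that made hashing useful for password storage.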

    2. As a result, different aromaticity detection algorithms often disagree with each other on whether a given molecule is aromatic or not, making it difficult to interchange information between databases that use different aromaticity detection algorithms for SMILES generation.

      What process or services would be needed for the databases to perform this interchange?

    1. aromatic bond type

      How would an aromatic bond be represented if it were needed to bypass the multiple possible configurations?

    1. 3D (x,y,z) coordinates can also be stored for each atom and used to display the conformation of a molecule. These coordinates may be determined experimentally (typically via x-ray crystallography), or calculated (using force-fields, quantum chemistry, molecular dynamics or composite models such as docking). Understanding a molecule's actual shape, whether it be in solution, in a vacuum, or in the binding site of a protein, opens up a whole new domain of computational chemistry. Most molecules have some flexibility, and even if a given conformation is the most stable, there are often a number of competing shapes to consider. Knowing how a particular set of coordinates was determined is crucial to making intelligent use of it for cheminformatics purposes.

      This would be a very good project if the 3D data for the different conformations were all indexed and compared based on the type of influence and how it changed a molecule's shape.

    1. A substructure search query can be matched against a connection table atom-by-atom but the so-called subgraph isomorphism algorithm that is used in substructure search to compare one graph against another is slow and complex and it is likely that there may be many mismatches before a hit is found. A substructure search can be carried out faster if an initial screening stage is carried out to filter out quickly structures that could not possibly be matches. A common method is to use substructure fragments as the filter.

      It's very interesting that breaking a structure apart into fragments, performing a search, and then rebuilding the structure as part of a screening process with a complex search would be more accurate than just using a connection table to compare each atom.
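The quoted passage describes the fragment screen as a way to make the search faster, not more accurate: it cheaply discards structures that cannot match, and the atom-by-atom comparison still decides the final hits. A minimal sketch of that screening stage follows, assuming each molecule already has a precomputed set of fragments. The fragment dictionary, the molecule entries, and the function names are all hypothetical, and no real substructure matching is performed.

```python
# Hypothetical fragment dictionary: each fragment gets one bit position.
FRAGMENTS = ["benzene ring", "carboxylic acid", "amine", "chlorine"]

def fingerprint(present):
    """Encode a set of fragments as an integer bitmask."""
    bits = 0
    for position, fragment in enumerate(FRAGMENTS):
        if fragment in present:
            bits |= 1 << position
    return bits

def passes_screen(query_bits, molecule_bits):
    """A molecule can only contain the query if it has every query fragment."""
    return (query_bits & molecule_bits) == query_bits

# Made-up database with precomputed fragment sets.
database = {
    "aspirin": {"benzene ring", "carboxylic acid"},
    "benzene": {"benzene ring"},
    "acetic acid": {"carboxylic acid"},
    "4-chlorobenzoic acid": {"benzene ring", "carboxylic acid", "chlorine"},
}

query = fingerprint({"benzene ring", "carboxylic acid"})
candidates = [name for name, fragments in database.items()
              if passes_screen(query, fingerprint(fragments))]
print(candidates)
```

Only the surviving candidates would then go through the slow subgraph isomorphism check; the screen can let through false positives, but it never rejects a true match.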