235 Matching Annotations
  1. Apr 2019
    1. Because both the ST and CT scores range from 0 (for no similarity) to 1 (for identical molecules), the ComboT score may have a value from 0 to 2 (without normalization to unity)

      the Combo T has a range of 0 to 2
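The range noted above follows directly from the definition in the quoted text: ComboT is the sum of ST and CT. A minimal sketch in Python, where the function name `combo_t` is hypothetical and chosen only for illustration:

```python
def combo_t(st: float, ct: float) -> float:
    """Return ComboT as the plain sum of ST and CT (no normalization to unity)."""
    for name, value in (("ST", st), ("CT", ct)):
        # Each component score is defined on the interval 0 to 1.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return st + ct


print(combo_t(1.0, 1.0))  # identical molecules give the upper bound
print(combo_t(0.0, 0.0))  # no similarity gives the lower bound
```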

    2. The clustering threshold may be adjusted by clicking an appropriate position on the similarity score axis (the horizontal line above/below the dendrogram).

      This section and the next have dendrograms that range from 1 (similar) to 0 (dissimilar).

  2. Feb 2019
    1. Symp. 8.2: Digital Chemistry and the Lab of the Future

      This is an obvious place for an OER talk, but it would not really be outreach, as these would be people who know about it anyways, and it would be more of a "report", which is not our goal.

      This may also be a place for a talk on the Cheminformatics OLCC, especially if we wish to run another this Fall.

  3. May 2017
    1. Two-dimensional (2-D) similarity methods

      In the 2-D structure similarity below, I wonder how the fingerprint indicates the number of bonds. For instance, with a C-C bond, how do you know which molecule it is, since both of them have a C-C bond? - Phuc

  4. Apr 2017
    1. *fulleren*

      Note, you can't use stars, but need to use the "contains" option

    2. EMTREE drug terms listed in the Index Terms

      EMTREE is like MeSH, but how do you get to the index terms? Unfortunately, h. will not link inside of Reaxys, so I can't give a link to where I am.

    1. identity search.
    2. Getting molecular properties of a set of compounds

      It seems most of the things we can get are from the advanced search (entrez?) options. Can we get other items, like boiling points?

    3. https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchikey/CSCPPACGZOOCGX-UHFFFAOYSA-N/record/XML?record_type=3d
    4. One can perform partial synonym matching, by setting this option to “word”. 

      When using the word option, does PubChem limit the number of results to a specific number, such as 10,000? I have seen some other systems limit results unless they are accessed by a verified user with a token or key that shows the user is not a robot.

    5. When a list of compounds are specified as the input, the image of only the first compound on the list will be returned.

      Is there a way to get images for a list of chemicals submitted or would they need to be submitted as separate individual searches? I notice a quick way to do this would be as shown below in the spreadsheet, but that looks like it is submitted as separate searches.

    6. Getting a list of CIDs for compounds with a given substructure

      Could SMILES be used here instead of the CIDs? Emily

    7. The input identifiers can also be specified by SMILES or InChI strings, although special care needs to be taken because these identifiers contain special characters (such as “/”) that cause conflicts with the URL syntax.4 

      Why use these identifiers if they can cause conflicts? Emily
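One way to sidestep the conflict described in the quoted text is to percent-encode the SMILES string and pass it as a URL argument rather than as a path segment. The sketch below only builds the URL and does not send a request. It assumes that PUG-REST accepts the SMILES through a `smiles` URL argument; the helper name `smiles_query_url` is hypothetical.

```python
from urllib.parse import urlencode

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def smiles_query_url(smiles: str) -> str:
    """Build a CID-lookup URL with the SMILES passed as an encoded URL argument."""
    # urlencode percent-encodes characters such as "/" and "=" that would
    # otherwise be read as part of the URL syntax.
    return f"{BASE}/compound/smiles/cids/TXT?" + urlencode({"smiles": smiles})


# "C/C=C/C" is a SMILES string whose "/" marks double-bond geometry.
url = smiles_query_url("C/C=C/C")
print(url)
```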

    8. Getting a list of CIDs for compounds identical to a query compound

      This only shows the structures identical to the CID provided. How would one find those that are merely similar? Emily


      What is the difference between these two, and in what situations would one use PUG-SOAP instead of PUG-REST? Emily

    10. In PUG-REST, these three pieces of information are encoded into an URL in the following format:

      Is there a way to make sure the queries work, other than just trying them and getting nothing back? Emily

    11. In PUG-REST, these three pieces of information are encoded into an URL in the following format:

      I'm guessing these basic pieces of information would be similar to those in the CACTUS chemical resolver URL. -Phuc
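The three pieces of information in the quoted sentence (input, operation, output) can be joined mechanically into a request URL. The sketch below rebuilds the CID 180 image URL that appears elsewhere in these annotations. The helper name `pug_rest_url` is hypothetical, and the code only constructs the string without contacting PubChem.

```python
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pug_rest_url(input_spec: str, operation: str, output: str) -> str:
    """Join the input, operation, and output pieces into one PUG-REST URL."""
    parts = [BASE, input_spec.strip("/"), operation.strip("/"), output.strip("/")]
    return "/".join(parts)


# input = compound with CID 180, operation = full record, output = PNG image
url = pug_rest_url("compound/cid/180", "record", "PNG")
print(url)
```

Pasting the printed URL into a browser is a quick way to check whether a given combination of the three pieces is accepted.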

    12. https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/180/record/PNG

      I know it comes through PUG-REST, but how are these generated? The same question applies to the links that follow. -Phuc

    13. Entrez Utilities (also called E-Utilities or E-Utils) Power User Gateway (PUG) PUG-SOAP PUG-REST

      Are these compatible with any kind of programming platform? -Phuc

    1. PubChem subgraph fingerprints,

      Why are these called "subgraph" fingerprints? I see this term on Wikipedia. Is it that each bit of the fingerprint represents a subgraph of the molecule, which is itself the graph? That is, does each bit record the existence or nonexistence of that subgraph within the whole?

  5. Mar 2017
    1. In 2-D similarity methods, structural similarity between two molecules is estimated by comparing their molecular fingerprints. 

      So, will each structure on PubChem have a fingerprint and a molfile? lyndsie

    2. targets

      what is a target? lyndsie

    3. In substructure search, one provides an input substructure as a query to find molecules that contain the query substructure

      My question is: if I were trying to find a large superstructure with more than one substructure, can I input more than one substructure to search? For example, what if I want to find compounds that contain an aldehyde and an alcohol? Lyndsie

    4. Although each compound has up to 500 conformers (depending on the molecular size and flexibility),

      is this true? (nwume)

    5. PubChem homepage (http://pubchem.ncbi.nlm.nih.gov) PubChem Chemical Structure Search (https://pubchem.ncbi.nlm.nih.gov/search/search.cgi) PubChem Search (https://pubchem.ncbi.nlm.nih.gov/search/).

      I noticed that anything you search using these three different search interfaces gives you the same result. (nwume)

    6. Shape-Tanimoto (ST): quantifies steric shape similarity between two conformers. Color-Tanimoto (CT): quantifies the overlap of functional groups between two conformers, such as hydrogen bond donors and acceptors, cations, anions, rings, and hydrophobes. Combo-Tanimoto (ComboT): the sum of ST and CT scores between two conformers.  It takes into account the shape similarity (ST) and functional group similarity (CT) simultaneously. 

      Which one is the best among these three metrics? (nwume)

    7. The Structure Clustering tool

      Please, I don't understand how to make use of this Structure Clustering tool. (nwume)

    8. without minor isotopes by combining the retrieved search results [from (a)] with the following Entrez indices:

      Once you have done this, click on the number of hits, go to the DocSum page, go to Advanced, and use the advanced query builder to apply the filters.

    9. ? [Note that a compound may have multiple EC50 values determined from different experiments (likely under different experimental conditions)

      There are two ways to get rid of duplicates. You can simply paste the CIDs into the PubChem search box and run the search; I think this requires Chrome.

      Alternatively, in Excel you can go to Data/Data Tools and choose the "Remove Duplicates" tool.

    10. To perform an identity search for Cymbalta (CID 60835), go to the Chemical Structure Search pag

      So if many CIDs relate to specific isotopes, do the normal molar masses use average values, like the periodic table?

    11. heat map-style layout,

      Is there an article that explains how to read the data shown below? Emily

    12. PubChem Search

      the beta search link of the homepage

    13. PubChem Chemical Structure Searc

      Direct Link is Under Services on the homepage

    14. the Structure Clustering tool and Structure-Activity Relationship (SAR) Analysis tool.

      Where would these tools be used the most? That is, in what kind of research would they be the most helpful? Emily

    15. 1.3. Substructure and superstructure search

      Could this search be used to find reactions to build superstructures from substructures? Emily

    16. Tanimoto coefficient6-8

      I'm trying to understand how to use the Tanimoto coefficient. I don't see any example to refer to. (Daniel)
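A small worked example may help here. For binary fingerprints, the Tanimoto coefficient is the number of bits set in both fingerprints divided by the number of bits set in either one, that is, NAB / (NA + NB - NAB). The sketch below represents each fingerprint as the set of its "on" bit positions. The bit positions are made up for illustration, and returning 0.0 for two empty fingerprints is a convention chosen here, not something taken from the module.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bit positions."""
    n_a = len(bits_a)             # bits set in fingerprint A
    n_b = len(bits_b)             # bits set in fingerprint B
    n_ab = len(bits_a & bits_b)   # bits set in both
    denominator = n_a + n_b - n_ab
    # Assumption: two empty fingerprints are scored as 0.0 to avoid dividing by zero.
    return n_ab / denominator if denominator else 0.0


mol_a = {1, 2, 3, 4}   # hypothetical on-bits for molecule A
mol_b = {3, 4, 5}      # hypothetical on-bits for molecule B
print(tanimoto(mol_a, mol_b))  # NA=4, NB=3, NAB=2, so 2 / (4 + 3 - 2) = 0.4
```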

    17. As an alternative to 2-D similarity search, 3-D similarity search can also be performed using the “3D conformer” tab in PubChem Chemical Structure Search.  3-D similarity methods use the 3-D structures (that is, conformations) of molecules.  PubChem’s 3-D similarity method is based on the atom-centered Gaussian-shape comparison method by Grant and coworkers,9-12 implemented in the Rapid Overlay of Chemical Structures (ROCS).13,14  While the underlying mathematics of this approach is beyond the scope of this module, what this method essentially does is to find the “best” alignment of the 3-D structures of two molecules, which gives the maximized overlap between them.  The 3-D similarity method quantifies the 3-D molecular similarity using three metrics

      What are the differences between 2-D and 3-D structures?


    18. wo-dimensional (2-D) similarity methods

      I don't really understand this 2D similarity method.


    19. However, molecular formula search implemented in some databases, including PubChem Chemical Structure Search, has an option to allow other elements in returned hits (e.g., C6H6O or C6H6N2O for the “C6H6” query).

      Why is there an option to allow other elements in returned hits when we type C6H6 or any other molecular formula? Amita

    20. The most common types of molecular fingerprints are structural keys, which encode structural information of a molecule into a binary string (that is, a string of 0’s and 1’s).  The position of each number in this string corresponds to a particular fragment.  If the molecule has a particular fragment, the corresponding bit position is set to 1, and otherwise to 0.  Note that there are many different ways to design molecular fingerprints, depending on what fragments are included in the fingerprint definition.  PubChem uses its own fingerprint called PubChem subgraph fingerprints.

      I am confused about binary strings and fingerprints. How do they work to recognize molecules? Amita
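The structural-key idea in the quoted passage can be sketched with a toy example: fix an ordered list of fragments, then write 1 at a position if the molecule contains that fragment and 0 otherwise. The five-fragment list below is invented for illustration and is not the PubChem subgraph fingerprint definition. The molecules are described by hand as sets of fragment labels rather than parsed from real structures.

```python
# Toy fragment dictionary: position i in the bit string corresponds to FRAGMENT_KEYS[i].
FRAGMENT_KEYS = ["C-C", "C=O", "O-H", "C-N", "aromatic ring"]


def structural_key(fragments_present: set) -> str:
    """Encode which dictionary fragments a molecule contains as a string of 0s and 1s."""
    return "".join("1" if key in fragments_present else "0" for key in FRAGMENT_KEYS)


ethanol = {"C-C", "O-H"}                # hand-assigned fragments
acetic_acid = {"C-C", "C=O", "O-H"}     # hand-assigned fragments

print(structural_key(ethanol))       # 10100
print(structural_key(acetic_acid))   # 11100
```

Two molecules are then compared by comparing their bit strings, which is what similarity coefficients such as Tanimoto operate on. Note that a bit only records presence or absence of a fragment, not how many times it occurs.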

    21. PubChem provides two web-based tools that allow users to perform a cluster analysis of PubChem data:  the Structure Clustering tool and Structure-Activity Relationship (SAR) Analysis tool.

      I am confused about using the Structure Clustering tool. How can we use it? I practiced using this tool but did not get any results. Amita

    22. The Structure-Activity Relationship (SAR) Analysis tool

      Confusion on using this tool. Amita

    23. On the contrary, superstructure search returns molecules that comprise or make up the provided chemical structure query (that is, substructures that is contained in the query superstructure).  It should be noted that substructure search does not give you substructures of the query and that superstructure search does not return superstructures of the query.

      How do we know which one is the substructure and which is the superstructure? Amita

    24. MMFF94s

      In case you are curious about MMFF94s. This might also answer my previous question about the limitation to certain elements. -Phuc

    25. Consist of only supported elements (H, C, N, O, F, Si, P, S, Cl, Br, and I).

      So organometallic compounds such as cisplatin don't have 3-D structures? And why do we have to set this limit? -Phuc

    1. EC50”

      Half maximal effective concentration (EC50) refers to the concentration of a drug, antibody or toxicant which induces a response halfway between the baseline and maximum after a specified exposure time

    2. s. Click the “Pharmacological Actions” link under “BioMedical Annotation” to choose the c

      Note you get 2 hits, the search 5090 and 5472495. This tells you that there is a molecule that is 80% similar and has been tested as active.

    3. n, use the smallest EC50 value (for simplicity) if there are multiple values

      Is there a quick way to recognize duplicates in Excel? In the case of CID5152, the compound CID9892481 was repeated 3 times
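Outside of Excel, the same cleanup (collapse repeated CIDs and keep the smallest EC50 for each) takes a few lines of code. The sketch below is one possible approach under stated assumptions: the EC50 numbers are made-up placeholders, the second CID is a placeholder as well, and the function name `smallest_ec50_per_cid` is hypothetical.

```python
def smallest_ec50_per_cid(rows):
    """Collapse duplicate CIDs, keeping the smallest EC50 value seen for each."""
    best = {}
    for cid, ec50 in rows:
        if cid not in best or ec50 < best[cid]:
            best[cid] = ec50
    return best


# (CID, EC50) pairs; the EC50 values and CID 123456 are illustrative only.
rows = [
    (9892481, 0.50),
    (9892481, 0.12),
    (9892481, 1.30),
    (123456, 2.00),
]
deduplicated = smallest_ec50_per_cid(rows)
print(deduplicated)
```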

    4. mentioned at the top of t

      444036 is now getting 5 targets

    5. What is the difference between the non-common components of the compounds in (a)?

      Is there a quick way to compare SMILES to do this?

    6. Provide the query CID in the search box and run the search. Repeat the search with the “same isoto

      note in the query, this gives #24 and #23, which in the advanced search,

      23 1:1 CovalentUnitCount

      24 56 documents

      A quick scan showed only 1 covalent unit in -24, so was -23 done just because it is good practice?

      Also, the entrez history shows here too.

      Also, in the first search with 56 results, the first three did not have isotopes specified, while in the second search, those were the only ones to come out. Does the search always give priority to records without isotopic labels?

      This leads to a bigger question: are the molar masses from the periodic table, or are they related to specific isotopes?

    7. 2.1. The Structure Clustering tool

      Private note - useful for GHS?

    8. this URL

      Available through the services menu of PubChem homepage

    1. The method is also used to quantify the degree of chirality of asymmetric molecules and to investigate the chirality of biphenyl and the amino acids.

      This article was cited in the OLCC Cheminformatics Class Module 6 for the atom-centered Gaussian-shape comparison method. I will see if I can access the full article from campus tomorrow, but it's very difficult for me to understand how applying a Gaussian adaptation to a molecule would help to provide information about the degree of chirality of certain molecules. For example, it would be difficult to understand the shape of a miniature model sailboat inside a glass bottle if the only shape you see is the outer bottle and not what's contained inside.


    1. ROCS is a fast shape comparison application, based on the idea that molecules have similar shape if their volumes overlay well and any volume mismatch is a measure of dissimilarity. It uses a smooth Gaussian function to represent the molecular volume [5], so it is possible to routinely minimize to the best global match.

      I assume that the ROCS description here, which talks about volume overlays, is what is mentioned on the OLCC Module 6 page about finding the best alignment for a 3-D structure overlay. It seems that, because of the Gaussian blur around the chemical being matched against others, many of the individual bonds would not be compared in the matching process.


    1. PDB and CSD

      PDB covers biological compounds, while CSD is oriented toward materials science.

    2. How many compounds

      Use the OR Boolean in advanced search

    3. How many compounds in

      Use "AND" of advanced search, the above classification browser results will be in search history

    4. For example, the protein-bound 3-D structure of penicillin V can be accessed via the following UR

      This can be done from the PubChem TOC in the Classification Browser, under Biomolecular Interactions and Pathways, then Protein Bound 3-D Structures. The next link is the same, but under Chemical and Physical Properties/Crystal Structures.

    5. ummarize what Lipinski’s rule of 5 i

      H-bond donors ≤ 5; H-bond acceptors ≤ 10; molecular weight ≤ 500 daltons; LogP ≤ 5
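The four criteria listed in this note translate directly into a small check. The sketch below reports which criteria a compound fails. The function name `lipinski_violations` is hypothetical and the property values in the usage lines are illustrative placeholders, not data retrieved from PubChem.

```python
def lipinski_violations(h_donors: int, h_acceptors: int, mol_weight: float, logp: float):
    """Return the list of rule-of-5 criteria that the given property values violate."""
    checks = [
        ("more than 5 H-bond donors", h_donors > 5),
        ("more than 10 H-bond acceptors", h_acceptors > 10),
        ("molecular weight above 500 daltons", mol_weight > 500),
        ("LogP above 5", logp > 5),
    ]
    return [label for label, violated in checks if violated]


# Illustrative values for a small drug-like molecule and a large, greasy one.
print(lipinski_violations(1, 4, 180.2, 1.2))    # no violations
print(lipinski_violations(6, 12, 650.0, 5.5))   # all four criteria violated
```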

    6. the “MeSH Synonyms” se

      This chemical has the depositor-supplied synonym P071 from AK Scientific.

      Meanwhile, cetirizine has the MeSH synonym P071.

    7. zyrtec[completesynonym

      exact name match (zyrtec-D will not work, it is a partial match)

    8. t is the CID of that covalent-bonded unit? [A covalently-bound unit (or simply called covalent unit) consists of a group of

      note, two of these are mixtures, which could be filtered out by setting covalent units to 1:1

    9. The use of covalently bonded units (instead of components) removes this ambiguity because it is well accepted that NaCl is bonded through an ionic bond.)

      sodium chloride comes up with 1:2 not 1:1 covalent units.

    10. zyrtec

      gets MeSH and depositor-supplied synonyms

    1. grouping


    2. These conformer models aim to generate bioactive conformers, which would be found in protein-ligand complexes. For this reason, these conformers are often very different from their experimental structures determined in the gas phase

      How do you choose which are best, both from the perspective of creating them for the databank and from that of using them as a client (in your research)?

    3. following condition
    4. n the ST-optimization approach, the shape overlap between the molecules (that is, the ST score) are maximized and the single-point CT score is evaluated at that superposition. On the contrary, the CT-optimization considers both ST and CT scores to find the best superposition between molecules, and the single-point ST score is computed at that superposition

      Can there be some sort of superposition using an algorithm that is sort of in the middle, maximizes both ST and CT?

    5. here NA and NB are the number of bits set in the fingerprints for molecules A and B, respectively, and NAB is the number of bits set in both fingerprints.

      Is there a service that generates fingerprints from smiles strings?

    6. ubChem uses its own fingerprint called PubChem subgraph fingerprints.

      If you follow the link you see the fingerprint key, and this is confusing. How would you represent a molecule with one carbon, one oxygen, and two hydrogens (formaldehyde) when the lowest C count is 2 and the lowest H count is 4?

    7. choose a desired degree of “sameness” from several predefined options. To see these options, one need to expand the options section by clicking the “plus” button next to the “option” section headin

      I get confused between indices and filters. If you open the filters, you see what I thought were Indices, (HBD,HBA,...).

      Also, what are the criteria for similar structures? I don't see a 90%, 80%,....

    8. PubChem Searc

      "Try PubChem Search Beta" from homepage

    9. PubChem Chemical

      Under Services of homepage

    1. f the query is a phrase or a name with non-alphanumeric characters, double quotes should be used around the query. 

      Text searches with numbers should use double quotes.

    2. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pccompound (for Compound) http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pcsubstance (for Substance) http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pcassay (for BioAssay).  

      These links are the same as the above section on indices

    3. retrieved in the XML format, using the eInfo functionality in E-Utilities (which will be covered in Module 7):

      To open these XML files in Google Sheets, use =importxml(A1,"//Field/FullName"), where cell A1 holds the link below.
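The same `//Field/FullName` extraction can be done in Python with the standard library. To stay self-contained, the sketch below parses a short inline sample rather than downloading anything. The sample's element layout and field names are assumptions made for illustration, so the path may need adjusting against the actual eInfo output.

```python
import xml.etree.ElementTree as ET

# Hand-written sample standing in for an eInfo response (layout is an assumption).
SAMPLE = """<eInfoResult>
  <DbInfo>
    <FieldList>
      <Field><Name>ALL</Name><FullName>All Fields</FullName></Field>
      <Field><Name>MW</Name><FullName>MolecularWeight</FullName></Field>
    </FieldList>
  </DbInfo>
</eInfoResult>"""


def field_full_names(xml_text: str):
    """Return the text of every FullName element that sits directly under a Field."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.findall(".//Field/FullName")]


names = field_full_names(SAMPLE)
print(names)
```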

    4. For numeric indices, a search for a range of values can be done by using minimum and maximum values separated by a colon and followed by the bracketed index name (e.g., “100:105[MolecularWeight]”).

      use of numeric indices X:Y
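The X:Y[index] pattern is easy to generate programmatically. The sketch below builds the range term from the quoted example and then wraps it in an E-Utilities ESearch URL. It only constructs strings and sends no request. The helper names are hypothetical, and using `esearch.fcgi` with `db` and `term` arguments is an assumption based on the general E-Utilities URL pattern.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def range_term(minimum, maximum, index_name: str) -> str:
    """Format a numeric range query as minimum:maximum[IndexName]."""
    return f"{minimum}:{maximum}[{index_name}]"


def esearch_url(db: str, term: str) -> str:
    """Build an ESearch URL; urlencode escapes the colon and square brackets."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


term = range_term(100, 105, "MolecularWeight")
url = esearch_url("pccompound", term)
print(term)
print(url)
```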

    5. FLink tool

      The flink tool is also useful for 3D structure, macromolecular structure search. https://goo.gl/tgS4QB Amita

    6. However, some special filters, such as the "lipinski rule of 5" filter, or the “all” filter, are not link-based.

      What is the "Lipinski rule of 5" filter, and why is it not link-based? Amita

    7. The Entrez Search and Retrieval System

      The link provided here gives details about Entrez and how it works. http://www.ncbi.nlm.nih.gov/books/NBK184582/ Amita

    8. In practice, the identifier exchange service may be used as a quick approach to search the PubChem Compound database using multiple queries, although this type of task may be performed programmatically (for example, using PUG-REST,10 which will be discussed in Module 7).

      Since a PUG-REST query can be used as a quick approach to search PubChem compounds, can I also use it to get the list of all PubChem IDs? If yes, how? (nwume)

    9. 3.2. Identifier Exchange Service

      Can this service be accessed by other methods instead of using the form?


    10. “[synonym]” index returns additional 97 compounds. 

      I noticed that when I search Aspirin[synonym] I get 97 hits, but if I search Aspirin[synonyms] I get 102 hits, which is the same number that "aspirin" gave me. Can someone explain why? I thought it would give me the same number of hits as Aspirin[synonym]. (Nwume)

    11. MeSH Synonyms

      Why are some of the MeSH synonyms not the same as the depositor-provided synonyms? Emily

    12. 1.2.6. Entrez History

      Is there a way for Entrez to save my search history permanently? -Phuc

    13. Entrez links are cross links or associations between records in different Entrez databases, or within the same database.  These links may be applied to an entire search result list (via the “find related data” section at the right column of a DocSum page) or to an individual record (via links at the bottom of each record presented on the DocSum page).  The Entrez links provide a way to discover relevant information in other Entrez databases based on a user’s specific interests.  Equivalently, one may think of this as a way to transform an identifier list from one database to another based on a particular criterion. Note that there are limits to how many records may be used as input in a link operation.  To process a large amount of input records and/or to expect a large amount of output records associated with the input records, one should use the FLink tool (https://www.ncbi.nlm.nih.gov/Structure/flink/flink.cgi).         A complete list of the Entrez links available for the three PubChem databases can be retrieved in the XML format through these links http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pccompound (for Compound) http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pcsubstance (for Substance) http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pcassay (for BioAssay).  

      what is XML format, and how can it be applied? Esther

    14. aspirin[completesynonym] (1 hit, as of Feb. 26, 2017)https://www.ncbi.nlm.nih.gov/pccompound/?term=aspirin%5Bcompletesynonym%5D aspirin[synonym] (98 hits)https://www.ncbi.nlm.nih.gov/pccompound/?term=aspirin%5Bsynonym%5D aspirin (103 hits)

      I wonder why we get different numbers of compounds when we run different types of queries. Amita

    15. MeSH (Medical Subject Headings)

      Is there a difference between the National Library of Medicine MeSH and this MeSH? I was confused when reading and found out that there are two sections where it is discussed. Under Synonym and under classification. Daniel

    1. why the “legacy” designation was introduced in PubChem

      PubChem does not allow anyone other than the data contributor to modify the provided information, which makes some of the records in PubChem persist with outdated or incorrect data. To help correct this, the “legacy” designation was introduced. (nwume)

    2. (22)      ToxNet (http://toxnet.nlm.nih.gov/) (Accessed on 2/19/2017).

      I am really glad that this source was used in the write-up. I'm not sure why I was so unfamiliar with some of the things that this resource provides, given all of the articles I have read from TOXNET, but I will be using it more in the future.


    3. Therefore, it is very common that database groups exchange their information with each other. 

      Does this mean PubChem is sharing information between its BioAssay and Compound databases, or that PubChem is sharing information with ChemSpider, etc.? Lyndsie

    4. Although the data provenance information is critical in the reliability of a data source (and its data), this information is not easy to manage. 

      How do we determine the reliability of a source? Say I've gone through and looked at the data provenance: how do I decide, based on data provenance, that this piece of information is reliable but conflicting information in a different database is not?


    5. In turn, users should always pay attention to the data provenance issue when using a database.

      Since we are focusing on PubChem and this sentence tells us, as users, to pay attention to the sources, how do we find the source on PubChem? Furthermore, say that source leads to another source; how far back do we go?


    6. TOXNET (http://toxnet.nlm.nih.gov/)22-25, maintained by the National Library of Medicine (NLM) at NIH, is a group of databases covering toxicology, hazardous chemicals, toxic releases, environmental and occupational health, risk assessment.  Currently, 16 databases are integrated into the TOXNET system, and users can search all these databases either at once or individually.  While all the 16 databases provide valuable information, three of them may be worth mentioning in the context of this course.

      If I were a chemical supply vendor and wanted to update the SDS sheets with important new information, would this be recommended as the first place to perform a search? It seems like a perfect source, but SDS sheets have a lot of entries, and they may state that something is a possible carcinogen, for example. It's not really clear to me whether TOXNET includes possible hazards or only the ones that have been confirmed through scientific means.


    7. The error propagation issue is a serious, but very common, problem.39,40  Therefore, when using information in these databases, one should keep in mind various data accuracy and quality issues prevalent in these databases.

      Over the years, we have seen some of the errors mentioned when pulling data sets and using APIs, especially when we used database entries as the basis for conversions. The most difficult cases involved either identifiers that had multiple possibilities or a chemical's common name, which varied a lot across databases.


    8. As of February 2017, PubChem contains more than 235 million depositor-provided substances, 94 million unique chemical structures, and one million biological assays, which cover about 10 thousand protein target sequences.

      I know that PubChem houses a lot of data and also pulls data from many other sources. Due to this, would the 235 million deposited chemical structures mean that it holds that many in its own database or is that number a sum of entries held in many separate databases?


    9. 2.5. NIST Webbook: thermodynamic and spectroscopic data of chemicals

      I find it really fascinating that the NIST Webbook has versions that go back to 1996 for their database. Do they offer direct access to this database or would bulk data only be acquired through their product services or web scraping?


    10. It is very common that a primary database curates its data with information drawn from secondary databases. 

      I was unaware that this was very common. It's easy to wrap my head around secondary databases pulling from primary or even from other secondary sources, but I wonder if pulling secondary data into a primary database would then make it primary data, based on the way it may be used in the new database. Maybe the data exchange and integration mentioned in the following sentence creates a new way to use data directly in a primary way.


    11. TOXNET (http://toxnet.nlm.nih.gov/)22-25, maintained by the National Library of Medicine (NLM) at NIH, is a group of databases covering toxicology, hazardous chemicals, toxic releases, environmental and occupational health, risk assessment. 

      Does TOXNET also deal with nanomaterials and environmental pollution? Amita

    12. 2.6. DrugBank: comprehensive information on drug molecules

      More details about DrugBank are available from this link (http://www.drugbank.ca/about). Amita

    13. As of February 2017, PubChem’s data are from more than 500 organizations, including government agencies, university labs, pharmaceutical companies, substance vendors, and other databases

      How does PubChem curate and check the validity of data coming from different sources? Phuc

    14. Getting the most out of PubChem for virtual screening

      What other tools for virtual screening of drugs are there in PubChem besides structure similarity? Phuc

    15. The term “data provenance” refers to a record trail that describes the origin or source of a piece of data and the process by which it entered in a database.1 

      Is this the same as a substance record on PubChem? Emily

    16. (d) Explain the reason why the “legacy” designation was introduced in PubChem in two or three sentences.

      The best explanation for this is in 2.4 of the article at the bottom of the page. Emily

    17. Some records in PubChem are “non-live”, meaning that they are “not searchable”, although they do exist in the database.  This exercise is designed to help students better understand what non-live records are.

      Here is a great explanation on what "non-live" records are in PubChem. Emily

    18. Therefore, databases need to document the provenance of the data and devise a way to notify users of that information.  In turn, users should always pay attention to the data provenance issue when using a database.  

      Since documenting the provenance of data is still a work in progress, can we say public databases are not completely reliable? Daniel

    19. ChEMBL:

      How does ChEMBL create the chemical structure for each compound? (NWUME)

    20. PubChem organizes its data into three inter-linked databases: Substance, Compound, and BioAssay (See Table 1), which can be searched from either the PubChem home page (https://pubchem.ncbi.nlm.nih.gov) or the web page of one of the three PubChem databases.   Table 1. Three inter-linked databases in PubChem. Database URL Identifier Substance https://www.ncbi.nlm.nih.gov/pcsubstance SID Compound https://www.ncbi.nlm.nih.gov/pccompound CID BioAssay https://www.ncbi.nlm.nih.gov/pcassay AID   Individual data contributors deposit information on chemical substances to the Substance database (https://www.ncbi.nlm.nih.gov/pcsubstance).  Different data contributors may provide information on the same molecule, hence the same chemical structure may appear multiple times in the Substance database.  To provide a non-redundant view, chemical structures in the Substance database are normalized through a process called “standardization” and the unique chemical structures are identified and stored in the Compound database (https://www.ncbi.nlm.nih.gov/pccompound).  The difference between the Substance and Compound databases is explained in more detail in this blog post.

      How is the process called standardization used to normalize the Substance database?


    21. 1.2. Primary databases vs. secondary databases

      Please, can someone classify the public databases mentioned in Module 4 as primary or secondary databases?

    22. Table 1. Three inter-linked databases in PubChem. Database URL Identifier Substance https://www.ncbi.nlm.nih.gov/pcsubstance SID Compound https://www.ncbi.nlm.nih.gov/pccompound CID BioAssay https://www.ncbi.nlm.nih.gov/pcassay AID  

      Is it possible for someone to update substance data in PubChem? I hope that updating it will not affect the SID. (NWUME)

    23. ChemIDPlus26,27 is a dictionary of over 400,000 chemical records (names, synonyms, and structures) and provides access to the structure and nomenclature files used for the identification of chemical substances in the TOXNET system and other NLM databases.  The Hazardous Substances Data Bank (HSDB)28,29 focuses on the toxicology of potentially hazardous chemicals, providing information on human exposure, industrial hygiene, emergency handling procedures, environmental fate, regulatory requirements, nanomaterials, and related areas. All HSDB data are referenced and derived from a core set of books, government documents, technical reports and selected primary journal literature. Importantly, HSDB is peer-reviewed by the Scientific Review Panel (SRP), a committee of experts in the major subject areas within the data bank's scope. The Comparative Toxicogenomics Database (CTD)30,31  contains manually curated data describing interactions of chemicals with genes/proteins and diseases.  This database provides insight into the molecular mechanisms underlying variable susceptibility for environmentally influenced diseases

      TOXNET currently integrates 16 databases, but only 3 are shown here. How can I find the remaining 13 databases? (NWUME)

    24. All information in the Substance database is submitted by individual data depositors.  However, the Compound database does contain information that are not submitted by data depositors,

      Since records in the Substance database are submitted by individual data depositors, are there sample files for submitting substances to PubChem? (NWUME)

    25. 1.2. Primary databases vs. secondary databases     Databases are often categorized into primary and secondary databases.  Primary databases contain experimentally-derived data that are directly submitted by researchers (also called “primary data”).  In essence, these databases serve as archives that keep original data.  Therefore, they are also known as archival databases. Secondary databases contain secondary data, which are derived from analyzing and interpreting primary data.  These databases often provide value-added information related to the primary data, by using information from other databases and scientific literature.  Essentially, secondary databases serve as reference libraries for the scientific community, providing highly curated reviews about primary data.  For this reason, they are also known as curated databases, or knowledgebase.

      Can we classify PubChem as a secondary database, since it collects data from other sources?


    26.   PubChem organizes its data into three inter-linked databases: Substance, Compound, and BioAssay (See Table 1), which can be searched from either the PubChem home page (https://pubchem.ncbi.nlm.nih.gov) or the web page of one of the three PubChem databases.   Table 1. Three inter-linked databases in PubChem. Database URL Identifier Substance https://www.ncbi.nlm.nih.gov/pcsubstance SID Compound https://www.ncbi.nlm.nih.gov/pccompound CID BioAssay https://www.ncbi.nlm.nih.gov/pcassay AID   Individual data contributors deposit information on chemical substances to the Substance database (https://www.ncbi.nlm.nih.gov/pcsubstance).  Different data contributors may provide information on the same molecule, hence the same chemical structure may appear multiple times in the Substance database.  To provide a non-redundant view, chemical structures in the Substance database are normalized through a process called “standardization” and the unique chemical structures are identified and stored in the Compound database (https://www.ncbi.nlm.nih.gov/pccompound).  The difference between the Substance and Compound databases is explained in more detail in this blog post.

      What is the difference between Upload ID, Registry ID, SID, CID and AID in PubChem? Esther

    27. 1.2. Primary databases vs. secondary databases     Databases are often categorized into primary and secondary databases.  Primary databases contain experimentally-derived data that are directly submitted by researchers (also called “primary data”).  In essence, these databases serve as archives that keep original data.  Therefore, they are also known as archival databases. Secondary databases contain secondary data, which are derived from analyzing and interpreting primary data.  These databases often provide value-added information related to the primary data, by using information from other databases and scientific literature.  Essentially, secondary databases serve as reference libraries for the scientific community, providing highly curated reviews about primary data.  For this reason, they are also known as curated databases, or knowledgebase.

      Can I allow my associates or reviewers to access my on-hold assay data? Esther

    28. ChEMBL: literature-extracted biological activity information         ChEMBL (https://www.ebi.ac.uk/chembl/)8,9 is a large bioactivity database, developed and maintained by the European Bioinformatics Institute (EBI), which is part of the European Molecular Biology Laboratory (EMBL).  The core activity data in ChEMBL are “manually” extracted from the full text of peer-reviewed scientific publications in select chemistry journals, such as Journal of Medicinal Chemistry, Bioorganic Medicinal Chemistry Letters, and Journal of Natural products.  From each publication, details of the compounds tested, the assays performed and any target information for these assays are abstracted.  ChEMBL also integrates screening results and bioactivity data from other public databases (such as PubChem BioAssay) and information on approved drugs from the U.S. FDA Orange Book10 and the NLM’s DailyMed

      Can ChEMBL be used to search for chemical structures? Esther

    29. Public Chemical Databases         These days many public online databases provide chemical information free of charge and the databases mentioned in this module are only a few examples of them.  Note that these databases vary in size and scope.   2.1. PubChem: chemical information repository at the U.S. NIH         PubChem (https://pubchem.ncbi.nlm.nih.gov)2-4 is a public repository of information on small molecules and their biological activities, developed and maintained by the National Library of Medicine (NLM), an institute within the U.S. National Institutes of Health (NIH).  Since its launch in 2004 as a component of the NIH’s Molecular Libraries Roadmap Initiatives, it has been rapidly growing, and now serves as a key chemical information resource for researchers in many biomedical science areas, including cheminformatics, chemical biology, and medicinal chemistry.  Detailed information on PubChem can be found in these three papers:

      If I want to submit a manuscript to a journal, which PubChem identifier can I use? Esther

    30. The Protein Data Bank (PDB) is an archive of the experimentally determined 3-D structures of large biological molecules such as proteins and nucleic acids.

      Apart from proteins and nucleic acids, can the Protein Data Bank not be used to determine the 3-D structure of other biological molecules? (NWUME)

    1. It has so long been a custom to preface a course of lectures with the history of the science which is their subject, that it may be necessary to state, briefly, the reasons that have induced me to depart from this established usage.

      This is a test of a Gutenberg Archive book uploaded to the OLCC

    1. Lipinski’s rule of 5 [50]. Among them, 10.3 million (12% of the total) are fragment-like ones, which satisfy Congreve’s rule of 3 [51].

      What are the Lipinski and Congreve rules? Emily

    2. Therefore, 3-D neighboring may offer complementary views on structural similarity between molecules with similar biological activities.

      In researching 2-D and 3-D neighbors, I couldn't tell which is better, because both have advantages and disadvantages. Since they rely on different methods, which one is used more often? Daniel

    3. For example, ChEMBL [49] manually extracts bioactivity data from peer-reviewed papers published in journals in the medicinal chemistry and natural product domains.

      Based on this passage in Section 2.2 on bioactivity, it would be preferable to use ChEMBL rather than PubChem for high-quality data extraction. Daniel

    1. t the approach above is easily implemented and can run in a wide range of environments. It has the benefit of synergy with the c

      test for public file on cheminformatics OLCC

    1. es. PubChem archives the molecular structure and bioassay data from the MLSCN and other contributors. PubChem provides search, retrieval, and data analysis tools to optimize the utility of these results. PubChem further enhances the research utility of the

      Restricted Access Article - I am testing to see if this link works when someone is logged in, and is not logged in, to the OLCC course.

  6. Feb 2017
    1. NIST Atomic Spectra Database - http://www.nist.gov/pml/data/asd.cfm NIST Molecular Spectra Databases - http://www.nist.gov/pml/data/molspec.cfm NMR Shift DB - http://nmrshiftdb.nmr.uni-koeln.de/ Human Metabolome Database - http://www.hmdb.ca EPA Emissions Measurement Center Spectral Database -http://www3.epa.gov/ttn/emc/ftir/index.html MassBank - http://www.massbank.jp/ Romanian Database of Raman Spectroscopy - http://rdrs.uaic.ro/index.html

      It seems just a matter of time before all of the online sources for spectra offer APIs, possibly including predicted spectra. I wonder how much load these kinds of requests put on a server.

    2. This capability was added in order that spectral files were not too large for the storage media available (see above).

      Although compression is nice for saving space, would it limit the ability to search spectral databases, since reading compressed files makes things more complicated? When many files need to be accessed in a single search, such as through an API or a user-interface search, I would think this may take longer.

    3. Although JCAMP-DX has not formally been standardized, it is currently the de facto standard for sharing spectral data and all the major databases store their data in the format.

      How and why did the JCAMP-DX format become the de facto standard even though it was not formally standardized?

    4. websites where you can obtain reliable spectral data, and software for viewing/simulating spectra.

      Is there any free software or website in which we can analyze SEM, TEM images and TGA spectra?

    5. #YFACTOR=  9.5367E-7 … ##XYDATA= (X++(Y..Y))  4400   68068800 68092800 68145600 68100800 68140800 68232000  4394   68304000 68316800 68195200 68152000 68182400 68176000  4388   68240000 68252800 68156800 68156800 68236800 68292800  4382   68302400 68265600 68233600 68214400 68224000 68284800  4376   68353600 68334400 68219200 68230400 68315200 68276800  4370   68259200 68264000 68257600 68316800 68292800 68339200  

      I don't understand the actual meaning of this data table and its format.
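
      One way to read the quoted table, sketched in Python under stated assumptions: in the `(X++(Y..Y))` form each line starts with a single X value, followed by several Y ordinates at evenly spaced X positions, and each stored integer is multiplied by the header's YFACTOR to recover the real Y value. The X spacing below is inferred from the first two quoted lines rather than read from the file header, and only the first three data lines are reproduced.

```python
# Minimal sketch: decode a few JCAMP-DX "XYDATA=(X++(Y..Y))" lines.
# Assumption: X spacing is inferred from the data itself; a real reader
# would normally take it from header fields of the file.

YFACTOR = 9.5367e-7  # scaling factor quoted in the header above

XYDATA = """\
4400 68068800 68092800 68145600 68100800 68140800 68232000
4394 68304000 68316800 68195200 68152000 68182400 68176000
4388 68240000 68252800 68156800 68156800 68236800 68292800
"""


def decode_xydata(block, yfactor):
    """Return a list of (x, y) pairs from X++(Y..Y) formatted text."""
    rows = [
        [float(token) for token in line.split()]
        for line in block.splitlines()
        if line.strip()
    ]
    # Each row is: first X, then the Y ordinates for that line.
    ys_per_line = len(rows[0]) - 1
    # Inferred spacing between neighbouring points (here (4394 - 4400) / 6).
    delta_x = (rows[1][0] - rows[0][0]) / ys_per_line
    points = []
    for row in rows:
        first_x, raw_ys = row[0], row[1:]
        for index, raw_y in enumerate(raw_ys):
            points.append((first_x + index * delta_x, raw_y * yfactor))
    return points


points = decode_xydata(XYDATA, YFACTOR)
for x, y in points[:3]:
    print(x, y)
```

      The leading X on every line mostly serves as a checkpoint: a reader can confirm that the X it has counted up to agrees with the X printed at the start of the next line.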

    1. It is fun to point out though that Binary Large Objects (BLOBs) can be used to store binary files like images, audio, etc.

      Would storing images in a database as a BLOB cause poor performance? This seems like a lot for a single entry. I would think storing a link to the image in the database would reduce the entry size.

    2. all data on a computer must be represented in binary notation.

      I am quite confused with binary notation. How is it related with chemistry?

    3. While there are a number of styles of database the most common currently are relational.

      Graph databases and other NoSQL (non-relational) information stores seem to be gaining traction in filling gaps where traditional relational databases struggle. MongoDB may be a good tool to include as part of future material for this course. (Edit: just noticed you mention NoSQL later in the text... apologies!)

    4. SPARQL which is the interestingly recursive acronym that stands for ‘SPARQL Protocol and RDF Query Language’.

      What is the significance of this?

    5. In the context of discussing the common ‘data types’ we are going to reference those that are used in the relational database software ‘MySQL’.

      How important are the details of MySQL datatypes? Are these datatypes consistent across other relational databases?

    6. STIX fonts

      How far back in history does the STIX Fonts project go in mapping languages? For example, if one wanted Unicode characters for scientific texts from the ancient Egyptian era, would STIX provide a glyph?

    7. Then there is the much more sophisticated Visual Basic for Applications (VBA), which sits behind Excel and is used to record the macros run in Excel.  VBA is much more a true programming language allowing for declaration of variables, loop structures, if-then-else conditionals and user defi

      Please, I do not understand the practical way of using Excel to check strings and to tell whether a string is a valid InChI string or not.
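
      The same idea can be sketched outside Excel. The Python below is only a loose syntactic check, not the course's Excel method and not a chemical validation: it tests for the `InChI=` prefix, the version number `1`, and an optional `S` marking a standard InChI. The function names are made up for this illustration, and a string that passes may still be chemically meaningless.

```python
import re

# Loose shape check only: "InChI=", version 1, optional "S", "/" and
# then at least one non-space character for the layers that follow.
INCHI_SHAPE = re.compile(r"^InChI=1S?/\S+$")


def looks_like_inchi(text):
    """True if the string has the outward shape of an InChI."""
    return bool(INCHI_SHAPE.match(text.strip()))


def is_standard_inchi(text):
    """True if the string carries the 'S' flag of a standard InChI."""
    return text.strip().startswith("InChI=1S/")


for candidate in ["InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "CCO", "InChI=1/CH4/h1H4"]:
    print(candidate, looks_like_inchi(candidate), is_standard_inchi(candidate))
```

      A full validity check would need the InChI software itself to parse the layers; a pattern test like this only catches strings that are obviously not InChIs.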

    8. Amino Acid Properties Thermodynamic Properties of Pure Substances

      bad links

    1. ANalytical Data Interchange (ANDI)

      Was this type of file developed recently? I have never seen it as a save option in our UALR Agilent GC-MS.

    2. <request string="arsinic acid" representation="names">   <data id="1" resolver="name_by_opsin" notation="arsinic acid">     <item id="1" classification="pubchem_iupac_name">arsinic acid</item>     <item id="2" classification="pubchem_substance_synonym">CHEBI:29840</item>     <item id="3" classification="pubchem_substance_synonym">HAsH2O2</item>     <item id="4" classification="pubchem_substance_synonym">[AsH2O(OH)]</item>     <item id="5" classification="pubchem_substance_synonym">arsinic acid</item>     <item id="6" classification="pubchem_substance_synonym">dihydridohydroxidooxidoarsenic</item>   </data> </request>

      How can you pull this string of data? Is it from the inspect option in the browser? What type of file is this?
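
      The quoted text is XML, so it can be read with any XML parser rather than through the browser's inspect tool. The sketch below parses the quoted response from a string using Python's standard library. How the response was fetched is not shown in the quote; it resembles what a chemical name resolver web service returns when asked for the names of "arsinic acid" in XML, but that origin is an assumption here.

```python
import xml.etree.ElementTree as ET

# The response quoted above, pasted in as a string.
RESPONSE = """<request string="arsinic acid" representation="names">
  <data id="1" resolver="name_by_opsin" notation="arsinic acid">
    <item id="1" classification="pubchem_iupac_name">arsinic acid</item>
    <item id="2" classification="pubchem_substance_synonym">CHEBI:29840</item>
    <item id="3" classification="pubchem_substance_synonym">HAsH2O2</item>
    <item id="4" classification="pubchem_substance_synonym">[AsH2O(OH)]</item>
    <item id="5" classification="pubchem_substance_synonym">arsinic acid</item>
    <item id="6" classification="pubchem_substance_synonym">dihydridohydroxidooxidoarsenic</item>
  </data>
</request>"""

root = ET.fromstring(RESPONSE)

# Attributes on the root element describe what was asked for.
query = root.get("string")
representation = root.get("representation")

# Each <item> carries one name plus a label saying what kind of name it is.
names = [(item.get("classification"), item.text) for item in root.iter("item")]

print(query, representation)
for classification, name in names:
    print(classification, name)
```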

    3. chemical metadata (information about chemicals – not chemical data like melting points etc.)

      I need more clarification about chemical metadata. What does it represent, if not something like a melting point?

    4. instrument data and metadata

      Please, can someone tell me the difference between instrument data and metadata?

    5. Although we still ‘use’ ASCII today, in reality we use something called UTF-8.  This is easier to say than how it is derived - Universal Coded Character Set + Transformation Format - 8-bit.  Unicode (see http://unicode.org) started in 1987 as an effort to create a universal character set that would encompass characters from all languages and defined 16-bits, two bytes -> 2^16 -> 256 x 256 = 65536 possible characters – or code points. Today, the first 65536 characters are considered the “Basic Multilingual Plane”, and in addition there are sixteen other planes for representing characters giving a total of 1,114,112 code points.  Thankfully, we don’t need to worry because if something is UTF-8 encoded it is backward compatible with the first 128 ASCII characters.

      Please, I want to know whether all computers can make use of UTF-8 and Unicode.
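
      The arithmetic and the backward-compatibility claim in the quoted passage can be checked with a few lines of Python. This is a small sketch: the sample characters are chosen only for illustration, and the helper name is made up.

```python
# Code-point arithmetic from the passage: 17 planes of 2**16 code points each.
BMP_CODE_POINTS = 2 ** 16              # 256 x 256
PLANES = 1 + 16                        # the BMP plus sixteen other planes
TOTAL_CODE_POINTS = PLANES * BMP_CODE_POINTS


def utf8_length(character):
    """Number of bytes UTF-8 uses to encode a single character."""
    return len(character.encode("utf-8"))


# Plain ASCII text gives identical bytes under ASCII and UTF-8.
ascii_text = "InChI=1S"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Characters outside ASCII take two to four bytes in UTF-8.
samples = ["A", "\u00c5", "\u2192", "\U0001F600"]  # A, angstrom sign letter, arrow, emoji
lengths = [utf8_length(character) for character in samples]

print(BMP_CODE_POINTS, TOTAL_CODE_POINTS, same_bytes, lengths)
```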

    6. It was recognized in 2004 that there needed to be a successor to JCAMP-DX because of i) advances in technology, ii) a recognized need to represent data from all analytical techniques, and iii) issues with variants of JCAMP-DX that made interoperability of the files difficult.  AnIML files consist of up to four data sections; SampleSet, ExperimentStepSet, AuditTrailEntrySet, and SignatureSet.  By design very little data/metadata is required so that legacy data, which may not have much or any metadata to describe it, can be stored in the AnIML format. An example ‘minimum’ AnIML file is shown below:

      What are the major differences between JCAMP-DX and its successor AnIML, and how have the problems with JCAMP-DX been corrected with AnIML?

    7. widgets

      Why haven’t the major chemical databases created widgets, ones that can be downloaded and used as plug-ins in browsers like Google Chrome?

    8. Although it grew out of the relational database model

      Does this imply that the older relational software is no longer in use?

    9. This is for mass spectrometry and chromatography data

      For representing and managing digital spectra, is ANDI only used for mass spectrometry and chromatography data?

    1. GENSAL (GENeric Structure LAnguage

      Why GENSAL and not GENSLA?

    2. Note that aromaticity is not a measurable physical quantity, but a concept without a unanimous mathematical definition.  As a result, different aromaticity detection algorithms often disagree with each other on whether a given molecule is aromatic or not, making it difficult to interchange information between databases that use different aromaticity detection algorithms for SMILES generation

      Do any of the aromaticity detection algorithms employ Hückel's rule (the 4n+2 π-electron rule) for predicting aromaticity?

    3. ASCII

      What does ASCII stand for?

    4. Hashing is a one-way mathematical transformation typically used to calculate a compact fixed length digital representation of a much longer string of arbitrary length.

      This is very similar to how I used to protect passwords on my server before SSH keys became the standard. We used similar protocols under the MD5 standard, so it's very interesting to see the same thing used to make something easier to find with search engines as was used to keep something from being found as a security measure.
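
      The "compact fixed length" property in the quoted sentence is easy to see with Python's `hashlib`. The sketch uses SHA-256, the hash family on which the InChIKey is based. The hex digest printed here is not an InChIKey, since the real key truncates the hash and re-encodes it as capital letters. The function name is illustrative only.

```python
import hashlib


def sha256_hex(text):
    """Hex digest of a string; always 64 hex characters for SHA-256."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


short_inchi = "InChI=1S/CH4/h1H4"                   # methane
longer_inchi = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"  # ethanol

short_digest = sha256_hex(short_inchi)
longer_digest = sha256_hex(longer_inchi)

# Inputs of different length give outputs of the same length.
print(len(short_inchi), len(short_digest))
print(len(longer_inchi), len(longer_digest))
```

      Because the transformation is one-way, the digest can be used to look a structure up but not to rebuild it, which is why databases store the full InChI next to the key.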

    5. Another extension of SMILES is SMIRKS28,29, which is a line notation for generic reactions. 

      Can you provide more details of SMIRKS and SMARTS with examples? How can we generate any reaction using this extension?

    6. Actually, it is very common that there are a lot of SMILES strings that represent the same structure, whether it has a ring or not, because one can start with any atom in a molecule to derive a SMILES string.  Therefore, it is necessary to select a “unique SMILES” for a molecule among many possibilities.  Because this is done through a process called “canonicalization”, this unique SMILES string is also called the “canonical SMILES”.

      How can we do canonicalization to get unique SMILES?

    7. In SMILES, atoms are represented by their atomic symbols.  The second letter of two-character atomic symbols must be entered in lower case.  Each non-hydrogen atom is specified independently by its atomic symbol enclosed in square brackets, [ ] (for example, [Au] or [Fe]).  Square brackets may be omitted for elements in the “organic subset” (B, C, N, O, P, S, F, Cl, Br, and I) if the proper number of “implicit” hydrogen atoms is assumed.  “Explicitly” attached hydrogens and formal charges are always specified inside brackets. A formal charge is represented by one of the symbols + or -.  Single, double, triple, and aromatic bonds are represented by the symbols, -, =, #, and :, respectively.  Single and aromatic bonds may be, and usually are, omitted.  Here are some examples of SMILES strings

      According to the SMILES specification rules, atoms with two-character symbols are enclosed in square brackets. Why are Cl and Br not included?
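
      A small tokenizer makes the rule in the quoted passage concrete: Cl and Br belong to the organic subset, so they may appear without brackets, while other two-letter symbols such as Au or Fe must be bracketed. The Python sketch below only pulls out atom tokens and ignores bonds, branches and ring-closure digits. The pattern is a simplification written for this note, not a full SMILES parser.

```python
import re

# Order matters: bracket atoms first, then the two-letter organic-subset
# symbols (Br, Cl), then one-letter symbols, then aromatic lowercase atoms.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def smiles_atoms(smiles):
    """Return the atom tokens of a SMILES string, in order of appearance."""
    return ATOM_PATTERN.findall(smiles)


for example in ["ClCCBr", "[Au]", "c1ccccc1", "[Na+].[Cl-]"]:
    print(example, smiles_atoms(example))
```

      Trying `Cl` before `C` is the design choice that matters: otherwise `ClCCBr` would be misread as carbon followed by an unknown `l`.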

    8. Line notations represent structures as a linear string of characters.  They are widely used in Cheminformatics because computers can more easily process linear strings of data. Examples of line notations include the Wiswesser Line-Formula Notation (WLN)1, Sybyl Line Notation (SLN)2,3 and Representation of structure diagram arranged linearly (ROSDAL)4,5.  Currently, the most widely used linear notations are the Simplified Molecular-Input Line-Entry System (SMILES)6-9 and the IUPAC Chemical Identifier (InChI)10-13, which are described below.

      In this context, does it mean that the WLN, SLN and ROSDAL line notations are no longer in use, since SMILES and InChI are now widely used?

    9. The Simplified Molecular-Input Line-Entry System (SMILES)6-9 is a line notation for describing chemical structures using short ASCII strings. 

      What is the meaning of ASCII strings?

    10. Many databases such as PubChem17, ChemSpider18, ChEBI19, and NIST Chemistry Webbook20 accept InChI and InChIKey strings as queries to search for chemical structures.  InChIs and InChIKeys can also be used as queries in UniChem21 to produce cross-references between chemical structure identifiers from different databases.

      When all these databases (PubChem, ChemSpider, ChEBI and NIST) are used to search for the InChI or InChIKey of a particular structure, are they going to give the same result?

    11. the standard InChI

      Before OpenSMILES and the standard InChI existed, what did scientists do to avoid getting different results in their projects when they used different software?

    12. They are widely used in Cheminformatics because computers can more easily process linear strings of data. Examples of line notations include the Wiswesser Line-Formula Notation (WLN)1, Sybyl Line Notation (SLN)2,3 and Representation of structure diagram arranged linearly (ROSDAL)4,5.  Currently, the most widely used linear notations are the Simplified Molecular-Input Line-Entry System (SMILES)6-9 and the IUPAC Chemical Identifier (InChI)10-13, which are described below.

      Since SMILES and InChI are currently the most used linear notations, does it mean that WLN, SLN and ROSDAL can yield wrong results if used now, since they are no longer in use?

    13. Generic structures are commonly used in chemistry texts as well as in chemical patents in which the inventor claims a whole class of related compounds.  Generic structures are more often called “Markush” structures after Dr. Eugene A. Markush, who was involved in a legal case which set a precedent in the USA for generic chemical structure patent filing.

      This might sound a bit dumb! Can an InChI be generated from a Markush structure? I know generic structures and InChI are quite at odds with each other: Markush structures can be quite proprietary, and InChI is open science.

    14. c1ccccc1           Benzene (C6H6)

      If there are different substituent groups on a benzene ring, will SMILES indicate their positions, such as ortho, meta and para?

    15. As a result, different aromaticity detection algorithms often disagree with each other on whether a given molecule is aromatic or not, making it difficult to interchange information between databases that use different aromaticity detection algorithms for SMILES generation.

      What process or services would be needed for the databases to perform this interchange?

    16. There are currently six InChI layer types, each different class of structural information: the main layer, a charge layer, a stereochemical layer, an isotopic layer, a fixed-H layer and a reconnected layer. 

      I understand the first four layers of InChI, but what is meant by the last two, the fixed-H layer and the reconnected layer?

    17. SMARTS is useful for substructure searching, which finds a particular pattern (subgraph) in a molecule.

      Can you give some examples of the substructures that are searched for?

    1. The first block of 14 (out of total 27) characters for an InChIKey encodes core molecular constitution, as described by formula, connectivity, hydrogen positions and charge sublayers of the InChI main laye

      Can you search the web with the first block of the InChIKey and find all isomers of the compound?
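
      Splitting a key into its blocks shows what such a search would compare. The Python sketch below checks the 14-10-1 letter layout and compares only the first block, which per the quoted passage encodes the core constitution. Records sharing a first block therefore share connectivity (stereoisomers, for instance) and are not every compound with the same formula. The first key is a real one (acetone); the second is made up for illustration and does not correspond to any known compound. The function names are hypothetical too.

```python
import re

# 27 characters in total: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def split_inchikey(key):
    """Return the three hyphen-separated blocks of an InChIKey."""
    if not INCHIKEY_PATTERN.match(key):
        raise ValueError("not shaped like an InChIKey: " + key)
    first_block, second_block, third_block = key.split("-")
    return first_block, second_block, third_block


def same_first_block(key_a, key_b):
    """True when two keys agree on the 14-character first block."""
    return split_inchikey(key_a)[0] == split_inchikey(key_b)[0]


acetone_key = "CSCPPACGZOOCGX-UHFFFAOYSA-N"
made_up_key = "CSCPPACGZOOCGX-ABCDEFGHSA-N"  # hypothetical second block

print(split_inchikey(acetone_key))
print(same_first_block(acetone_key, made_up_key))
```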

    1. Line notations are not the only way of communicating structure: also popular are file-based formats such as MDL's MOL File19 (and its variant, the SD File), and Chemical Markup Language20 (CML, a variant of XML

      One of the big drawbacks that I often hear when using XML with databasing is that it quickly starts making very large files in terms of storage. Would the same hold true for using CML with large chemical databases? If so, what size reduction could be expected for the same size collection of chemicals stored using connection tables?

    2. All InChIs currently are prefixed with “INCHI=”. Following this, a designator of “1/” or “1S/” indicates whether the InChI is non-standard or standard (i.e. with fixed standardized options in the software)

      Can one compound or molecule have both standard and non-standard InChIs?

    3. . Ring aromaticity is handled in SMILES at the atomic level, not at the bond level (i.e. an atom is considered aromatic rather than a

      In SMILES, why is aromaticity not considered at the bond level, given that the bonds in the ring structure account for its aromaticity?

    1. Main layer: chemical formula (no prefix); atom connections (prefix ‘c’); hydrogen atoms (‘h’). Charge layer: proton sublayer (‘p’); charge sublayer (‘q’). Stereochemical layer: double bonds and cumulenes (‘b’); tetrahedral stereochemistry of atoms and allenes (‘t’ or ‘m’); stereochemistry information type (‘s’). Isotope layer: ‘I’, ‘h’, and ‘b’, ‘t’ and ‘m’ for stereochemistry of isotopes. Fixed-H layer: ‘f’.

      Will a future release version of InChI include a layer to include inorganic molecules that can be standardized? I have come across things in the past saying that it really only works efficiently for organic molecules. I have the understanding that some inorganic things have been included, but not the more complex structures.
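
      The layer prefixes listed in the quoted outline can be seen by splitting an InChI on its `/` separators. This is a minimal Python sketch using the standard InChI of ethanol: it treats the first segment after the version as the formula and files every later segment under its one-letter prefix. Prefix letters can recur in the isotopic and fixed-H parts of longer InChIs, so the sketch keeps only the first occurrence and is not a general parser.

```python
def split_inchi_layers(inchi):
    """Split a simple InChI into a dict of layers keyed by prefix letter."""
    header, _, remainder = inchi.partition("/")
    segments = remainder.split("/")
    layers = {
        "version": header.split("=")[1],  # "1S" for a standard InChI
        "formula": segments[0],           # the formula has no prefix letter
    }
    for segment in segments[1:]:
        # First character is the prefix ('c', 'h', 'q', 'b', 't', ...).
        layers.setdefault(segment[0], segment[1:])
    return layers


ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
ethanol_layers = split_inchi_layers(ethanol)
for prefix, content in ethanol_layers.items():
    print(prefix, content)
```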

    2. While the InChI representation is normally too complex for a human to decode, it is impossible for even a computer to extract the chemical structure from the InChIKey. Therefore, it is important that the InChI representation is also included in any database.

      In instances where chemical structures can't be determined from the provided InChI or InChIKey, are there any tips for searching with an InChI?

    3. WLN: Wiswesser Line Notatio

      I haven't encountered this line notation at all. I wonder whether any database systems still use it, perhaps as a historical showcase? (I know that it is already out of favor.)

    4. aromatic bonds are implied between aromatic atoms, but may be explicitly defined using the ‘:’ symbol.

      When would you use the colon instead of lowercase letters for aromatic bonds?

    5. Another language, based on conventions in SMILES, has also been developed for rapid substructure searching, called SMiles ARbitrary Target Specification (SMARTS). Similarly, SMIRKS has also been defined as a subset of SMILES that encodes reaction transforms. SMIRKS does not have a definition, but plays on the SMILES acronym. SMARTS and SMIRKS will be considered in more detail in later chapters.

      SMARTS used for rapid substructure searching is noted as another language based on SMILES conventions. What is the meaning of SMIRKS and is it used in a similar form as SMARTS?

    6. One of the most important applications of graph theory to chemoinformatics is that of graph-matching problems. It is often desirable in chemoinformatics to determine differing types of structural similarity between two molecules, or a larger set of molecules.

      I have been looking into chematics, which looks to be graph theory (networks) applied to the goal of automating chemical synthesis. How ready for prime time is this technology? Do we envision it is likely to be militarized/weaponized, or will it remain in the corporate domain? What are your thoughts on how these developments will impact chemists in the near term (5 to 10 years)? Interesting article from last year on the topic:


    7. The Molfile contains the atoms and the bonding patterns between those atoms, but also includes xyz co-ordinate information so the 3D structure can be explicitly encoded and stored for subsequent use. The file format was originally developed by MDL Information Systems, which through a number of acquisitions and mergers, Symyx Technologies and Accelrys, respectively, is now subsumed with Biovia, a subsidiary of Dassault Systems. The Molfile is split into distinct lines of information, referred to as blocks. The first three lines of any Molfile contain header information: molecule name or identifier; information regarding its generation, such as software, user, etc.; and the comments line for additional information, but in practice this is often blank. The next line always encodes the metadata regarding the connection table and must be parsed to identify the numbers of atoms and bonds, respectively. The first two digits of this line encode the numbers of atoms and bonds, respectively

      I need a clearer explanation of the use of the MDL format.
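
      Reading a tiny Molfile by hand may make the quoted description clearer. The Python sketch below follows the layout described above: three header lines, then a counts line whose first two fixed-width fields give the numbers of atoms and bonds, then one line per atom and one line per bond. The two-atom record (the heavy atoms of methanol), its coordinates and the program name in the header are all made up for illustration, and the column positions assume the common V2000 layout.

```python
# A hand-written, hypothetical V2000-style record: C and O joined by a single bond.
MOLFILE = "\n".join([
    "methanol",                                   # header line 1: name
    "  hypothetical-program",                     # header line 2: how it was made
    "",                                           # header line 3: comment (blank)
    "  2  1  0  0  0  0  0  0  0  0999 V2000",    # counts line
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.4300    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0  0  0  0",                      # bond: atom 1, atom 2, order 1
    "M  END",
])


def parse_molfile(text):
    """Parse header, atoms and bonds from a small V2000-style Molfile."""
    lines = text.splitlines()
    counts_line = lines[3]
    atom_count = int(counts_line[0:3])   # first 3-character field
    bond_count = int(counts_line[3:6])   # second 3-character field
    atoms = []
    for line in lines[4:4 + atom_count]:
        atoms.append({
            "x": float(line[0:10]),
            "y": float(line[10:20]),
            "z": float(line[20:30]),
            "symbol": line[31:34].strip(),
        })
    bonds = []
    for line in lines[4 + atom_count:4 + atom_count + bond_count]:
        # (first atom number, second atom number, bond order)
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return {"name": lines[0], "atoms": atoms, "bonds": bonds}


record = parse_molfile(MOLFILE)
print(record["name"], len(record["atoms"]), record["bonds"])
```

      Each bond line simply names two atoms by their position in the atom block and gives the bond order, so a line reading `1  2  1` means "atom 1 is joined to atom 2 by a single bond".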

    8. these additional data, coupled with the additional metadata recorded in the SDF format, makes this file format ideal

      In what instances is MDL ideal and/or when would you prefer MDL over SDF?

    9. A key advantage of the Molfile and SDF formats is the inclusion of geometric information regarding the spatial arrangement of atoms in three-dimensional space.

      You talk of advantages of using both MDL and SDF formats, are there any disadvantages that would make you utilize other formats?

    10. MDL Molfile (extension *.mol) or structure-data file (SDF, extension *.sdf

      What is the main difference between MDL and SDF?

    11. 3.3

      Is the adjacency matrix still in use, or is it outdated? Since it has two advantages over the connection table, why is the connection table preferred?

    12. ChEMBL and SureChEMB

      Please, I want to understand whether ChEMBL and SureChEMBL do the same work or work differently.

    13. systems can encode and decode systematic names such that they follow the naming conventions, but this representation is not commonly used in practice for computational work. Chemical structures are the explicit representation and the chemist’s lingua franca. Chemical structure drawings appear in many scientific publications,

      What is the difference between SMILES and aromatic SMILES?

    1. aromatic bond type

      How would an aromatic bond be represented if needed to be used to bypass the multiple configuration possibilities?

    1. 3D (x,y,z) coordinates can also be stored for each atom and used to display the conformation of a molecule. These coordinates may be determined experimentally (typically via x-ray crystallography), or calculated (using force-fields, quantum chemistry, molecular dynamics or composite models such as docking). Understanding a molecule's actual shape, whether it be in solution, in a vacuum, or in the binding site of a protein, opens up a whole new domain of computational chemistry. Most molecules have some flexibility, and even if a given conformation is the most stable, there are often a number of competing shapes to consider. Knowing how a particular set of coordinates was determined is crucial to making intelligent use of it for cheminformatics purposes.

      This would be a very good project if all of the 3D data for the different conformations were indexed and compared based on the influence type and how it changed a molecule's shape.

    2. CAS RN

      Please, I want to understand more about the CAS RN. What is it really used for?

    1. A substructure search query can be matched against a connection table atom-by-atom but the so-called subgraph isomorphism algorithm that is used in substructure search to compare one graph against another is slow and complex and it is likely that there may be many mismatches before a hit is found. A substructure search can be carried out faster if an initial screening stage is carried out to filter out quickly structures that could not possibly be matches. A common method is to use substructure fragments as the filter.

      It's very interesting that breaking apart a structure into fragments, performing a search and then rebuilding the structure as part of a screening process with a complex search would be more accurate than just using a connection table to compare each atom.
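
      The screening stage in the quoted passage is about speed: it discards structures that cannot possibly match, and the survivors still go through the slow atom-by-atom comparison. A toy version in Python is sketched below. The fragment dictionary, the molecule names and the hand-assigned fragment lists are all hypothetical; a real system would compute fragments from the connection tables.

```python
# Hypothetical fragment dictionary: each fragment owns one bit position.
FRAGMENT_BITS = {"C=O": 0, "O-H": 1, "c:c": 2, "C-N": 3}


def fingerprint(fragments):
    """Pack a list of fragment names into an integer bit mask."""
    bits = 0
    for fragment in fragments:
        bits |= 1 << FRAGMENT_BITS[fragment]
    return bits


def passes_screen(query_bits, candidate_bits):
    """A candidate can only contain the query if it has every query bit."""
    return (query_bits & candidate_bits) == query_bits


# Hand-assigned fragments for four example molecules (illustrative only).
DATABASE = {
    "acetic acid": fingerprint(["C=O", "O-H"]),
    "benzene": fingerprint(["c:c"]),
    "acetamide": fingerprint(["C=O", "C-N"]),
    "benzoic acid": fingerprint(["C=O", "O-H", "c:c"]),
}

query = fingerprint(["C=O", "O-H"])  # fragments of a carboxylic acid query
survivors = [name for name, bits in DATABASE.items() if passes_screen(query, bits)]

# Only the survivors would be passed on to subgraph isomorphism matching.
print(survivors)
```

      The screen can let through false positives (a molecule with C=O and O-H in unrelated places), but it should never reject a true match, which is what makes it safe to run first.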

    2. (InChI) and InChIKey

      Please, I want to know if there is any difference between InChI and InChIKey.

    3. Hydrogen atoms are not necessarily included explicitly in a connection table: they may be implicit.

      What will happen if hydrogen atoms are included explicitly in a connection table? Please, I need a clear explanation.

    4. NOTATIONS Line notations represent structures as a linear string of alphanumeric symbols. Their compactness was an advantage in the early days of cheminformatics when storage space was at a premium, and even nowadays, it can be faster to enter a structure as a notation instead of using a chemical structure drawing program. Several notations20–22 were proposed in the 1950s and 1960s, but only one, the Wiswesser Line-Formula Notation (WLN; see Figure 1)23–28 became widely used,29–47 despite the fact that the Dyson notation was formally adopted by IUPAC.20,21 WLN started to fall out of use in the early 1980s.

      What are some examples of line notations, and can I get a clearer explanation of them?

    5. To the practicing chemist, the language of chemistry is the two-dimensional (2D) structure diagram and most chemical information systems feature graphical input and output of chemical structures; the machine-held representation need not be meaningful to the synthetic chemist. In the ideal (unique) representation there is only one ‘code’ for a given structure and any one code can be interpreted to give only one structure. A unique representation is essential for chemical registration systems in which the novelty of a structure is determined before it is recorded in a database. Some representations, for example, molecular formulas, are not unique; one molecular formula will generate more than one full structure. Some nonunique representations (e.g., molecular formulas and fragment codes) do, however, play a part in certain chemical information systems, even though they do not represent the full topology of a structure.

      In this context, I don't understand what they mean when they say "any one code can be interpreted to give only one structure". What does it really mean?
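A small worked example may make the uniqueness point concrete. The sketch below assumes a toy connection table made of heavy atoms, implicit hydrogen counts, and bonds; the data layout and function names are hypothetical. Ethanol and dimethyl ether collapse to the same molecular formula, so the formula alone cannot be interpreted back to one structure, while their bond lists stay distinguishable.

```python
from collections import Counter

# Toy connection tables: heavy atoms, implicit H count per atom, bonds (1-based atom indices).
ETHANOL = {"atoms": ["C", "C", "O"], "h_counts": [3, 2, 1], "bonds": [(1, 2), (2, 3)]}
DIMETHYL_ETHER = {"atoms": ["C", "O", "C"], "h_counts": [3, 0, 3], "bonds": [(1, 2), (2, 3)]}


def hill_formula(table):
    """Molecular formula in Hill order: C, then H, then the other elements alphabetically."""
    counts = Counter(table["atoms"])
    counts["H"] += sum(table["h_counts"])
    if counts["H"] == 0:
        del counts["H"]
    order = []
    if "C" in counts:
        order = ["C"] + (["H"] if "H" in counts else [])
    order += sorted(el for el in counts if el not in order)
    return "".join(el + (str(counts[el]) if counts[el] > 1 else "") for el in order)


def bonded_pairs(table):
    """Element pairs joined by each bond, sorted so atom numbering does not matter."""
    atoms = table["atoms"]
    return sorted(tuple(sorted((atoms[a - 1], atoms[b - 1]))) for a, b in table["bonds"])


print(hill_formula(ETHANOL), hill_formula(DIMETHYL_ETHER))  # same formula for both
print(bonded_pairs(ETHANOL))         # contains a C-C bond
print(bonded_pairs(DIMETHYL_ETHER))  # two C-O bonds, no C-C bond
```

A registration system that keyed on the formula would treat these two compounds as the same entry, which is why a unique, full-topology representation is needed for novelty checks.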

    1. ol-water partition coefficient (ClogP). The descriptors tend to convolute any different properties into these simple scalar descriptors, but can be highly effective in certain circumstances and are widely appreciated for their interpretability in interactive systems. One such set of property descriptors that has gained wide acceptance is the Lipinski rule-of-five, which has been suggested as an heuristic for indicating the oral absorption of a potential drug based on marketed orally-dosed drugs. The rule-of-five applies four calculated properties and defined cut-offs

      test of tags
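As a rough illustration of how the four properties and cut-offs are applied, the sketch below checks precomputed descriptor values against the commonly cited rule-of-five limits (molecular weight 500, logP 5, 5 hydrogen-bond donors, 10 hydrogen-bond acceptors). The function name and the descriptor values are assumptions made for illustration; in practice the descriptors would come from a cheminformatics toolkit.

```python
def rule_of_five_violations(mol_weight, logp, h_donors, h_acceptors):
    """Return the list of rule-of-five cut-offs that a compound exceeds."""
    checks = [
        ("MW > 500", mol_weight > 500),
        ("logP > 5", logp > 5),
        ("donors > 5", h_donors > 5),
        ("acceptors > 10", h_acceptors > 10),
    ]
    return [label for label, violated in checks if violated]


# Approximate descriptor values for aspirin (assumed, not computed here).
aspirin = rule_of_five_violations(mol_weight=180.2, logp=1.2, h_donors=1, h_acceptors=4)

# A made-up large, lipophilic compound for contrast.
hypothetical = rule_of_five_violations(mol_weight=800.0, logp=6.5, h_donors=2, h_acceptors=12)

print(aspirin)
print(hypothetical)
```

The function reports which cut-offs are exceeded rather than a pass/fail verdict, since how many violations to tolerate is a choice left to the user.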

    1. Chirality

      How can we write the MOL file to show chirality between two different atoms, such as C-CH3 or C-OH?

    2. Bonds Block

      It is confusing to me how they build up this bonds block. I am confused by the 4-5, 6-1, and 5-7 entries in the bonds block.

    3. Resonance: Run-of-the-mill delocalization presents some of the same problems as aromaticity, but there is no conventional label for (non-aromatic) delocalized electrons, such as the delocalized negative charge and pi system in benzoate (VII and VIII). The connection tables will simply represent one resonance structure or another.

      In the figures for MOL VII and MOL VIII, a 5 is inserted in the V2000 file. What is the significance of this value, and how can we know which value should be inserted for other resonance structures?

    4. MOL files do indicate chirality. However, they can do so in two ways. A “1” or “6” in the fourth field of the bonds table indicates wedged and dashed bonds, respectively. A “1” or “2” in the stereochemistry field of the atom table represents the chirality of a stereocenter.

      Is there any clue in the MOL file that the structure has more than one chirality center?
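One way to look for such a clue is to scan both tables for the flags described in the quoted passage. The sketch below assumes a small V2000 file whose fields can be separated by whitespace (large files need fixed-column parsing); the MOL lines are hand-written and illustrative, the parity value is not checked against the coordinates, and the function name is hypothetical.

```python
# Hand-written, illustrative V2000 lines for a carbon bearing F, Cl, Br and H.
MOL_LINES = [
    "illustrative CHFClBr",
    "  hand-written",
    "",
    "  5  4  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0",
    "    1.0000    0.0000    0.0000 F   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -1.0000    0.0000    0.0000 Cl  0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.0000    1.0000    0.0000 Br  0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.0000   -1.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  1",
    "  1  3  1  6",
    "  1  4  1  0",
    "  1  5  1  0",
    "M  END",
]

BOND_STEREO = {"1": "wedge", "6": "dash"}


def stereo_summary(mol_lines):
    """Return (atoms flagged 1 or 2 in the stereo field, wedge/dash bonds) from V2000 lines."""
    counts = mol_lines[3].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atom_lines = mol_lines[4:4 + n_atoms]
    bond_lines = mol_lines[4 + n_atoms:4 + n_atoms + n_bonds]

    # Atom line fields: x, y, z, symbol, mass difference, charge, stereo parity, ...
    centers = [
        index for index, line in enumerate(atom_lines, start=1)
        if line.split()[6] in ("1", "2")
    ]

    # Bond line fields: first atom, second atom, bond type, bond stereo
    marked_bonds = []
    for line in bond_lines:
        first, second, _bond_type, stereo = line.split()[:4]
        if stereo in BOND_STEREO:
            marked_bonds.append((int(first), int(second), BOND_STEREO[stereo]))
    return centers, marked_bonds


centers, marked_bonds = stereo_summary(MOL_LINES)
print(len(centers), "flagged stereocenter(s):", centers)
print(marked_bonds)
```

Each stereocenter carries its own flag, so a structure with several chirality centers shows several flagged atoms, several wedged or dashed bonds, or both. Some software leaves the atom field at 0 and records stereochemistry only through the bonds, so an empty list of flagged atoms is not conclusive on its own.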

    5. To make things even more complicated, software may account for the chirality of a stereocenter atom when generating a MOL file but ignore it when rendering a MOL file!

      Is there a way to catch this if it happens?

    6. Each of the two Kekulé structures for the benzene ring shows up as a different set of single and double bonds (MOL I, MOL IV). The Bond Tables are different: The MOL file format uses the number 4 to indicate bonds that are explicitly labeled as aromatic (MOL V). This has the advantage of differentiating aromatic bonds from single and double bonds without requiring the chemist to write a script to identify and label the alternating single and double bonds of a Kekulé structure.

      When would you use the aromatic structure instead of the Kekulé structures?
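A short comparison may show why the aromatic label is convenient. The sketch below uses hand-written benzene bond tables with an assumed atom numbering (they stand in for the two Kekulé forms and the aromatic form, not for the exact figures in the source), and it assumes the ring is already known to be aromatic; no aromaticity perception is attempted. The function names are hypothetical.

```python
# Bonds as (first atom, second atom, bond type); 1 = single, 2 = double, 4 = aromatic.
KEKULE_A = [(1, 2, 2), (2, 3, 1), (3, 4, 2), (4, 5, 1), (5, 6, 2), (6, 1, 1)]
KEKULE_B = [(1, 2, 1), (2, 3, 2), (3, 4, 1), (4, 5, 2), (5, 6, 1), (6, 1, 2)]
RING = range(1, 7)


def bond_table(bonds):
    """Map each unordered atom pair to its bond type so tables can be compared directly."""
    return {frozenset((a, b)): bond_type for a, b, bond_type in bonds}


def mark_aromatic(bonds, ring_atoms):
    """Relabel bonds between atoms of one known aromatic ring as type 4."""
    ring = set(ring_atoms)
    return [(a, b, 4 if a in ring and b in ring else t) for a, b, t in bonds]


same_as_drawn = bond_table(KEKULE_A) == bond_table(KEKULE_B)
same_when_aromatic = (
    bond_table(mark_aromatic(KEKULE_A, RING)) == bond_table(mark_aromatic(KEKULE_B, RING))
)
print(same_as_drawn)       # the two Kekulé tables differ
print(same_when_aromatic)  # both reduce to one table once labeled aromatic
```

When structures need to be compared or searched, the aromatic form avoids treating two drawings of the same ring as different compounds. The Kekulé form remains useful when a program needs explicit single and double bonds, for example for electron counting or drawing.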

    1. A connection table can represent multiple distinct compounds.