The SureChem Chemical Skill Cartridge® (SCSC) is a specialised tool for detecting and indexing chemistry in text.
The chemical entities it detects include a variety of compound, substance and common names, such as:
• Diacetylmorphine hydrochloride
The technology at the core of the SCSC also powers the SureChem patent chemistry series of products. It is designed to be accurate, scalable and robust in the face of poor quality data.
The SCSC identifies chemical names in text by first tokenizing on whitespace and other significant separators, then using a machine learning model to calculate the probability that each token is a chemical. Dictionary look-up and a series of natural language processing heuristics are also applied to improve the precision and recall of chemical name identification.
The SCSC accepts XML, HTML and plain text input. There are two basic outputs: annotations and names. Annotations comprise the start and end positions of entities in the document; they can be stored in a database and used to render the document with its chemistry marked. Names are the actual text strings of the chemical entities that have been identified.
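The tokenize-then-score flow described above can be illustrated with a minimal Python sketch. This is not the SCSC's implementation: the separator set, the `KNOWN_NAMES` dictionary, the suffix heuristic standing in for the machine learning model, and the names `tokenize`, `score`, `annotate` and `Annotation` are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical stand-in for dictionary look-up; not the SCSC's real dictionary.
KNOWN_NAMES = {"aspirin", "ethanol", "diacetylmorphine"}

# Hypothetical suffix heuristic standing in for the trained model.
CHEMICAL_SUFFIXES = ("ine", "ane", "ene", "ate", "ide")


@dataclass
class Annotation:
    start: int  # offset of the first character of the entity
    end: int    # offset one past the last character of the entity
    name: str   # the entity text itself


def tokenize(text):
    """Yield (start, end, token) for runs of text between separators."""
    for match in re.finditer(r"[^\s,;()]+", text):
        token = match.group().rstrip(".")  # drop sentence-final periods
        if token:
            yield match.start(), match.start() + len(token), token


def score(token):
    """Return an illustrative probability that the token is a chemical."""
    word = token.lower()
    if word in KNOWN_NAMES:
        return 1.0
    if len(word) > 5 and word.endswith(CHEMICAL_SUFFIXES):
        return 0.7
    return 0.1


def annotate(text, threshold=0.5):
    """Keep tokens whose score reaches the (assumed) threshold."""
    return [
        Annotation(start, end, token)
        for start, end, token in tokenize(text)
        if score(token) >= threshold
    ]


text = "Patients received aspirin and diacetylmorphine, dissolved in ethanol."
annotations = annotate(text)           # the "annotations" output
names = [a.name for a in annotations]  # the "names" output
```

Each annotation's offsets index back into the original text, so `text[a.start:a.end]` recovers the name. That is what makes stored annotations usable later for rendering.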
Customization and Extension
The SCSC has an optional module for repairing chemical names that have been fragmented due to poor-quality source text. If this option is enabled, the output also indicates whether an annotation/name pair is the result of joining two or more chemical name fragments into a complete name.
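As a rough illustration of name repair, the sketch below greedily merges adjacent fragments whose concatenation matches a known name and flags the result as joined. The dictionary-based merge rule, the `max_fragments` limit and the names `repair` and `RepairedName` are hypothetical; the actual repair method used by the SCSC is not described here.

```python
import re
from dataclasses import dataclass

# Hypothetical dictionary used to decide whether fragments form a name.
KNOWN_NAMES = {"diacetylmorphine", "hydrochloride"}


@dataclass
class RepairedName:
    name: str     # the complete name after any joining
    start: int    # offset of the first fragment's first character
    end: int      # offset one past the last fragment's last character
    joined: bool  # True if two or more fragments were merged


def repair(text, max_fragments=3):
    """Merge runs of adjacent tokens that concatenate to a known name."""
    tokens = [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]
    results = []
    i = 0
    while i < len(tokens):
        # Try the longest run first so complete names win over partial ones.
        for n in range(min(max_fragments, len(tokens) - i), 0, -1):
            group = tokens[i:i + n]
            candidate = "".join(tok for _, _, tok in group).lower()
            if candidate in KNOWN_NAMES:
                results.append(
                    RepairedName(candidate, group[0][0], group[-1][1], n > 1)
                )
                i += n
                break
        else:
            i += 1  # no known name starts at this token
    return results


repaired = repair("diacet yl morphine hydrochloride")
```

In this example the first three fragments are merged into one name with `joined` set, while the intact final token is reported with `joined` unset. The annotation span of a joined name covers all of its fragments in the source text.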
Common use cases for the SCSC include:
• the enrichment of chemical literature with deep, domain-specific metadata
• the extraction of structured chemical information to feed knowledge bases on key research areas
• the analysis of that information to derive scientific insights