NEW: Version 1.1 of the cartridge is "trained" on a corpus of 40,000 documents in order to reduce noise.
The STF DBPedia computerscience Skill Cartridge® extracts topics around computer science and related fields from text. It is built on a subset of DBPedia as of January 2013, comprising more than 60,000 concepts, over 200,000 synonyms, and their taxonomic information.
Powered by Smart Taxonomy Facilitator (STF) technology, the cartridge retains the taxonomic embedding and the DBPedia URI of each concept. Each extracted term receives a confidence score indicating how appropriate it is as a topic or index term for the document.
The Skill Cartridge is compiled for the English language.
The topics modelled in this Skill Cartridge® include but are not limited to:
• Computer Science
• Digital Technology
• Theoretical Computer Science
• Computer Scientists
• Computer Programming
Topics such as Haskell, Scrum, SaaS, object-oriented programming, and source code escrow are modelled, as are tens of thousands of other terms.
STF technology adds two distinctive features to thesaurus-based extraction:
- Fuzzy Term Matching, which produces and recognizes variants of thesaurus terms, reducing missed matches and increasing recall
- Relevance Scoring, which evaluates the contextual relevance of each recognized term and discards the less relevant ones, improving precision. Relevance Scoring exploits a range of heuristics, including statistics and part-of-speech tagging.
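The interplay of the two features can be illustrated with a minimal Python sketch. It is not the STF implementation: the mini-thesaurus, the function names, the 0.85 similarity threshold, and the frequency-based score are all assumptions chosen for illustration, and the real Relevance Scoring uses further heuristics such as part-of-speech tagging.

```python
from difflib import SequenceMatcher

# Hypothetical mini-thesaurus: preferred label -> known synonyms.
THESAURUS = {
    "Object-oriented programming": ["OOP", "object oriented programming"],
    "Haskell": ["Haskell language"],
}


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_match(candidate: str, min_similarity: float = 0.85):
    """Return (preferred label, similarity) for the closest thesaurus
    form at or above the threshold, or None if nothing is close enough."""
    best = None
    for label, synonyms in THESAURUS.items():
        for form in [label, *synonyms]:
            score = similarity(candidate, form)
            if score >= min_similarity and (best is None or score > best[1]):
                best = (label, score)
    return best


def relevance(label_counts: dict) -> dict:
    """Toy relevance score: occurrence count divided by the highest count."""
    top = max(label_counts.values(), default=0)
    if top == 0:
        return {}
    return {label: count / top for label, count in label_counts.items()}
```

In this sketch a misspelled variant such as "object-oriented programing" still resolves to the preferred label, while an unrelated word returns None; lowering `min_similarity` trades precision for recall.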
Customization and Extension
The Skill Cartridge® can be easily customized by tuning the STF parameters. These include the scoring method, the maximum number of terms to return, the minimum confidence score, the string-distance threshold for matching known thesaurus terms, and many others.
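The effect of two of these parameters can be sketched as follows. The class and field names (`ExtractionSettings`, `max_terms`, `min_confidence`) are hypothetical and do not correspond to actual STF parameter names; the sketch only shows how a confidence floor and a term limit might combine.

```python
from dataclasses import dataclass


@dataclass
class ExtractionSettings:
    """Hypothetical settings; not the real STF parameter names."""
    max_terms: int = 10          # upper bound on returned terms
    min_confidence: float = 0.5  # terms scoring below this are dropped


def select_terms(scored_terms, settings):
    """Keep terms at or above the confidence floor, best first,
    truncated to the configured maximum."""
    kept = [(term, score) for term, score in scored_terms
            if score >= settings.min_confidence]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:settings.max_terms]
```

Raising the confidence floor favours precision, while raising the term limit favours broader coverage per document.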
Typical use cases for the STF DBPedia computerscience Skill Cartridge include:
- Automated indexing of documents in computer science and related fields
- Creation of hierarchical facets for enhancing search and browsing
- Document recommendations based on indexed terms
- Design of topical collections or Topic Pages focusing on specific thematic hierarchies in computer science
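As a rough illustration of the hierarchical-facet use case, the Python sketch below rolls indexed terms up to their broader concepts and counts documents per facet. The `BROADER` mapping is a made-up fragment, not actual DBPedia or cartridge data, and the function names are assumptions.

```python
from collections import Counter

# Hypothetical taxonomy fragment: concept -> broader concept.
BROADER = {
    "Haskell": "Functional programming",
    "OCaml": "Functional programming",
    "Functional programming": "Computer Programming",
}


def with_ancestors(term: str) -> set:
    """Return the term together with all of its broader concepts."""
    seen = set()
    current = term
    while current is not None and current not in seen:  # guards against cycles
        seen.add(current)
        current = BROADER.get(current)
    return seen


def facet_counts(indexed_docs) -> Counter:
    """Count, per facet, how many documents fall under it.
    Each document is given as a set of its indexed terms."""
    counts = Counter()
    for terms in indexed_docs:
        facets = set()
        for term in terms:
            facets |= with_ancestors(term)
        counts.update(facets)  # each facet counted once per document
    return counts
```

With such counts, a search interface can show a broad facet like "Computer Programming" and let users drill down into narrower ones.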