Originally developed to anonymize case law for online publication, the Anonymization SC® identifies personal information within documents to facilitate its removal.
This is implemented as a two-step process. First, the SC® recognizes and annotates all family names, company names, postal addresses, e-mail addresses, phone numbers and fax numbers in the document. Second, these items are tagged as either 'to anonymize' (for example, where a name can be replaced by a single letter) or 'to exclude' from anonymization, according to the following rules:
- For personal names, the SC® uses titles as triggers to exclude the names of attorneys, magistrates and experts from anonymization. In the current version of the SC®, first names are also excluded from anonymization.
- Addresses are tagged as 'to anonymize' only when they are associated with a party to the case.
- Company names are automatically excluded, unless they contain a family name cited in the document as a party.
- All phone numbers, fax numbers and e-mail addresses are tagged as 'to anonymize'.
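The tagging rules above can be illustrated with a minimal Python sketch. It is not the SC®'s actual implementation: the `Entity` class, the `tag_entity` function, the entity kind labels and the list of trigger titles are all hypothetical names introduced here for illustration, and the sketch assumes the recognition step has already separated first names from family names and determined whether an address belongs to a party.

```python
from dataclasses import dataclass
from typing import Optional, Set

ANONYMIZE = "to anonymize"
EXCLUDE = "to exclude"

# Hypothetical trigger titles; the SC®'s real trigger list is not documented here.
EXCLUDING_TITLES = {"attorney", "judge", "magistrate", "expert"}


@dataclass
class Entity:
    """One recognized item (hypothetical structure for this sketch)."""
    kind: str                       # "family_name", "first_name", "company",
                                    # "address", "phone", "fax" or "email"
    text: str
    title: Optional[str] = None     # title found next to a personal name, if any
    belongs_to_party: bool = False  # for addresses: linked to a party to the case


def tag_entity(entity: Entity, party_family_names: Set[str]) -> str:
    """Apply the four tagging rules described in the list above."""
    # Phone numbers, fax numbers and e-mail addresses are always anonymized.
    if entity.kind in {"phone", "fax", "email"}:
        return ANONYMIZE
    # First names are excluded in the current version.
    if entity.kind == "first_name":
        return EXCLUDE
    # Family names are excluded when a professional title triggers the exclusion.
    if entity.kind == "family_name":
        if entity.title is not None and entity.title.lower() in EXCLUDING_TITLES:
            return EXCLUDE
        return ANONYMIZE
    # Addresses are anonymized only when associated with a party.
    if entity.kind == "address":
        return ANONYMIZE if entity.belongs_to_party else EXCLUDE
    # Company names are excluded unless they contain a party's family name.
    if entity.kind == "company":
        words = {w.strip(".,").lower() for w in entity.text.split()}
        parties = {name.lower() for name in party_family_names}
        return ANONYMIZE if words & parties else EXCLUDE
    raise ValueError(f"unknown entity kind: {entity.kind}")
```

For instance, with `{"Dupont"}` as the set of party family names, `tag_entity(Entity("company", "Dupont & Fils"), {"Dupont"})` yields 'to anonymize', whereas a family name carrying the title "Attorney" yields 'to exclude'.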
The SC® provides two annotation procedures. The first (Anonymization) extracts all the entities and tags them as 'to anonymize' or 'to exclude'; the second (AnoSansExclu) only extracts the 'to anonymize' entities.
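The difference between the two procedures can be shown with a short Python sketch. The procedure names come from the description above; the `run_procedure` function and the representation of an annotation as a `(text, tag)` pair are assumptions made for illustration and do not reflect the SC®'s actual interface.

```python
from typing import List, Tuple

# Hypothetical representation: (entity text, tag) pairs.
Annotation = Tuple[str, str]


def run_procedure(annotations: List[Annotation], procedure: str) -> List[Annotation]:
    """Return the annotations that the named procedure would extract."""
    if procedure == "Anonymization":
        # Every entity is extracted, whatever its tag.
        return list(annotations)
    if procedure == "AnoSansExclu":
        # Only the entities tagged 'to anonymize' are extracted.
        return [a for a in annotations if a[1] == "to anonymize"]
    raise ValueError(f"unknown procedure: {procedure}")


sample = [
    ("Dupont", "to anonymize"),
    ("Martin", "to exclude"),
    ("dupont@example.org", "to anonymize"),
]
full = run_procedure(sample, "Anonymization")
filtered = run_procedure(sample, "AnoSansExclu")
```

Here `full` keeps all three annotations, while `filtered` keeps only the two tagged 'to anonymize'.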
The original use case for which this SC® was developed is the anonymization of legal decisions for online publication in conformity with national regulations. It may also be adapted to anonymize any large-scale corpus containing personal information (for example, in healthcare-related applications).