6.5. Run the anonymiser

Now that you have created and edited your config file and data dictionary, you can run the anonymiser in one of the following ways:

crate_anonymise --full
crate_anonymise --incremental
crate_anonymise_multiprocess --full
crate_anonymise_multiprocess --incremental

The ‘multiprocess’ versions are faster if your computer has multiple cores or CPUs. The ‘full’ option destroys the destination database and starts again. The ‘incremental’ option brings the destination database up to date, creating it if necessary. The default is ‘incremental’, for safety reasons.

For more help, run:

crate_anonymise --help
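
As the help text below notes, the config file can be supplied either with --config or through the CRATE_ANON_CONFIG environment variable. The following is a minimal sketch of a routine incremental run. The config path is a hypothetical placeholder that you would replace with your own, and the command is only executed if crate_anonymise is found on the PATH:

```shell
# Hypothetical path: substitute the location of your own config file.
export CRATE_ANON_CONFIG="${CRATE_ANON_CONFIG:-/path/to/crate_anon_config.ini}"

# Incremental is already the default mode; it is spelled out here for clarity.
anon_cmd="crate_anonymise --incremental --verbose"

if command -v crate_anonymise >/dev/null 2>&1; then
    # Unquoted on purpose so that the string splits into separate arguments.
    $anon_cmd
else
    echo "crate_anonymise not found on PATH; would run: $anon_cmd"
fi
```

Setting the environment variable once per shell session avoids repeating --config on every call, including calls to the ancillary tools described later in this section.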

6.5.1. crate_anonymise

This runs a single-process anonymiser.

Options:

USAGE: crate_anonymise [-h] [--config CONFIG] [--version] [--verbose]
                       [-i | -f] [--skipdelete] [--dropremake] [--drop_all]
                       [--optout] [--nonpatienttables] [--patienttables]
                       [--index] [--restrict RESTRICT]
                       [--limits LIMITS LIMITS] [--file FILE]
                       [--list LIST [LIST ...]]
                       [--free_text_limit FREE_TEXT_LIMIT] [--excludescrubbed]
                       [--process [PROCESS]] [--nprocesses [NPROCESSES]]
                       [--processcluster PROCESSCLUSTER] [--skip_dd_check]
                       [--seed SEED] [--chunksize [CHUNKSIZE]]
                       [--reportevery [REPORTEVERY]] [--debugscrubbers]
                       [--savescrubbers] [--echo]

Database anonymiser. (CRATE version 0.20.0, 2023-02-14. Created by Rudolf
Cardinal.)

OPTIONAL ARGUMENTS:
  -h, --help            show this help message and exit
  --config CONFIG       Config file (overriding environment variable
                        CRATE_ANON_CONFIG). (default: None)
  --version             show program's version number and exit
  --verbose, -v         Be verbose (default: False)

MODE OPTIONS:
  -i, --incremental     Process only new/changed information, where possible.
                        (default: True)
  -f, --full            Drop and remake everything. (default: False)
  --skipdelete          For incremental updates, skip deletion of rows present
                        in the destination but not the source. (default:
                        False)

ACTION OPTIONS (DEFAULT IS TO DO ALL, BUT IF ANY ARE SPECIFIED, ONLY THOSE ARE DONE):
  --dropremake          Drop/remake destination tables, and admin tables
                        except opt-out tables. (default: False)
  --drop_all            Drop all destination tables known to the data
                        dictionary, and all admin tables, then stop. (May also
                        be helpful in revealing leftover tables in the
                        destination database, e.g. if the data dictionary has
                        changed.) (default: False)
  --optout              Update opt-out list in administrative database.
                        (default: False)
  --nonpatienttables    Process non-patient tables only. (default: False)
  --patienttables       Process patient tables only. (default: False)
  --index               Create indexes only. (default: False)

RESTRICTION OPTIONS:
  --restrict RESTRICT   Restrict which patients are processed. Specify which
                        field to base the restriction on or 'pid' for patient
                        ids. (default: None)
  --limits LIMITS LIMITS
                        Specify lower and upper limits of the field specified
                        in '--restrict'. (default: None)
  --file FILE           Specify a file with a list of values for the field
                        specified in '--restrict'. (default: None)
  --list LIST [LIST ...]
                        Specify a list of values for the field specified in
                        '--restrict'. (default: None)
  --free_text_limit FREE_TEXT_LIMIT
                        Filter out all free text fields over the specified
                        length. For example, if you specify 200, then
                        VARCHAR(200) fields will be permitted, but
                        VARCHAR(201), or VARCHAR(MAX), or TEXT (etc., etc.)
                        fields will be excluded. (default: None)
  --excludescrubbed     Exclude all text fields which are being scrubbed.
                        (default: False)

PROCESSING OPTIONS:
  --process [PROCESS]   For multiprocess mode: specify process number.
                        (default: 0)
  --nprocesses [NPROCESSES]
                        For multiprocess mode: specify total number of
                        processes (launched somehow, of which this is to be
                        one). (default: 1)
  --processcluster PROCESSCLUSTER
                        Process cluster name (used as part of log name).
                        (default: )
  --skip_dd_check       Skip data dictionary validity check. (default: False)
  --seed SEED           String to use as the basis of the seed for the random
                        number generator used for the transient integer RID
                        (TRID). Leave blank to use the default seed (system
                        time). (default: None)
  --chunksize [CHUNKSIZE]
                        Number of records copied in a chunk when copying PKs
                        from one database to another. (default: 100000)

REPORTING AND DEBUGGING:
  --reportevery [REPORTEVERY]
                        Report insert progress every n rows in verbose mode.
                        (default: 100000)
  --debugscrubbers      Report sensitive scrubbing information, for debugging.
                        (default: False)
  --savescrubbers       Saves sensitive scrubbing information in admin
                        database, for debugging. (default: False)
  --echo                Echo SQL. (default: False)
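
The restriction options can be combined with a mode option to process only a subset of patients, which may be useful for a trial run. The sketch below only builds and prints two such command lines. The helper names are illustrative and not part of CRATE, and the PID range 1 to 1000 and the PID values 101, 102 and 103 are made-up examples:

```shell
# Illustrative helpers (not part of CRATE): each prints a command line
# that restricts an incremental run to a subset of patients.

# Restrict by a PID range, using --restrict together with --limits.
anon_pid_range_cmd() {
    echo "crate_anonymise --incremental --restrict pid --limits $1 $2"
}

# Restrict to an explicit list of PIDs, using --restrict together with --list.
anon_pid_list_cmd() {
    echo "crate_anonymise --incremental --restrict pid --list $*"
}

anon_pid_range_cmd 1 1000
anon_pid_list_cmd 101 102 103
```

Printing the command first makes it easy to review which patients will be processed before anything touches the destination database.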

6.5.2. crate_anonymise_multiprocess

This runs multiple copies of crate_anonymise in parallel.

Options:

USAGE: crate_anonymise_multiprocess [-h] [--nproc [NPROC]] [--verbose]

Runs the CRATE anonymiser in parallel. Version 0.20.0 (2023-02-14). Note that
all arguments not specified here are passed to the underlying script (see
crate_anonymise --help).

OPTIONAL ARGUMENTS:
  -h, --help            show this help message and exit
  --nproc, -n [NPROC]   Number of processes (default is the number of CPUs on
                        this machine) (default: 8)
  --verbose, -v         Be verbose (default: False)
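
Because arguments not listed here are passed through to crate_anonymise, mode options such as --incremental can be given directly to the multiprocess launcher. The sketch below leaves one CPU free for other work (an arbitrary choice, not a CRATE recommendation) and prints the resulting command. It assumes that getconf _NPROCESSORS_ONLN is available, as it is on most Linux and macOS systems:

```shell
# Number of online CPUs; fall back to 1 if getconf cannot report it.
ncpu="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"

# Leave one CPU free, but always use at least one process.
if [ "$ncpu" -gt 1 ]; then
    nproc_anon=$((ncpu - 1))
else
    nproc_anon=1
fi

multi_cmd="crate_anonymise_multiprocess --nproc $nproc_anon --incremental"
echo "$multi_cmd"
```

If --nproc is omitted, the launcher uses the number of CPUs on the machine, as stated in the help text above.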

6.5.3. crate_anon_show_counts

This ancillary tool prints record counts from your source and destination databases.

USAGE: crate_anon_show_counts [-h] [--config CONFIG] [--verbose]

Print record counts from source/destination databases. (CRATE version 0.20.0,
2023-02-14. Created by Rudolf Cardinal.)

OPTIONAL ARGUMENTS:
  -h, --help       show this help message and exit
  --config CONFIG  Config file (overriding environment variable
                   CRATE_ANON_CONFIG). (default: None)
  --verbose, -v    Be verbose (default: False)

6.5.4. crate_anon_check_text_extractor

This ancillary tool checks that you have the text extraction software that you might want. See third-party text extractors.

USAGE: crate_anon_check_text_extractor [-h]
                                       [checkextractor [checkextractor ...]]

Check availability of tools to extract text from different document formats.
(CRATE version 0.20.0, 2023-02-14. Created by Rudolf Cardinal.)

POSITIONAL ARGUMENTS:
  checkextractor  File extensions to check for availability of a text
                  extractor. Try, for example, '.doc .docx .odt .pdf .rtf .txt
                  None' (use a '.' prefix for all extensions, and use the
                  special extension 'None' to check the fallback processor).
                  (default: None)

OPTIONAL ARGUMENTS:
  -h, --help      show this help message and exit
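
Following the suggestion in the help text, this sketch checks the common document extensions plus the fallback processor. The tool is run only if it is installed; otherwise the command is printed:

```shell
# Extensions suggested by the tool's own help text; 'None' checks the
# fallback processor.
extensions=".doc .docx .odt .pdf .rtf .txt None"

if command -v crate_anon_check_text_extractor >/dev/null 2>&1; then
    # Unquoted on purpose so that each extension becomes its own argument.
    crate_anon_check_text_extractor $extensions
else
    echo "Would run: crate_anon_check_text_extractor $extensions"
fi
```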

6.5.5. crate_anon_summarize_dd

This ancillary tool reads your data dictionary and summarizes facts about each table. This can help you find problems in large data dictionaries.

USAGE: crate_anon_summarize_dd [-h] [--config CONFIG] [--verbose]
                               [--output OUTPUT]

Draft a data dictionary for the anonymiser. (CRATE version 0.20.0, 2023-02-14.
Created by Rudolf Cardinal.)

OPTIONAL ARGUMENTS:
  -h, --help       show this help message and exit
  --config CONFIG  Config file (overriding environment variable
                   CRATE_ANON_CONFIG). (default: None)
  --verbose, -v    Be verbose (default: False)
  --output OUTPUT  File for output; use '-' for stdout. (default: -)