The CO-CONNECT project requires all data partners to convert their data to the Observational Medical Outcomes Partnership (OMOP) format.
For security reasons, the CO-CONNECT team will not have access to any of the Data Partner’s raw data; instead, each Data Partner will share a standardised set of metadata files with the CO-CONNECT team who will use these files to define bespoke transformation and mapping rules.
The CO-CONNECT team will advise each Data Partner about which datasets are most relevant to include, but the final decision lies with the Data Partner. It is important that the data is prepared in the right way and the CO-CONNECT team are ready to provide support with this. The set of CSV files should have:
Data Partners will then upload these CSV files into the virtual machine that hosts the BC|Link software that will have the mapping rules applied and the OMOP data created. These files must not contain any identifiable information and should have all data privacy steps applied to them before proceeding.
Metadata allows the CO-CONNECT team to understand the structural format of the CSV files and the range of values present. Ideally, each Data Partner should use a data profiling tool called WhiteRabbit to generate the metadata on the extracted files. This is so that the metadata is extracted in a standard format. This tool can be connected directly to most databases but if this causes security concerns then it can also be pointed towards CSV files. It can be ran on a completely locked-down machine with no internet access – this should alleviate security concerns. For Data Partners who are not comfortable using this tool even with such precautions, the CO-CONNECT team can provide a template to manually complete.
The WhiteRabbit Scan Report contains information about the fields within each CSV file and frequency distributions of the values within these fields, but it will not be able to complete the field description. Therefore, for each field (on the Field Overview tab) please complete the description.
WhiteRabbit contains an option to automatically remove values with frequency values under a certain threshold, for instance if a dataset contains a field called “condition”, and there are only 2 patients with a very rare condition, this can cause security concerns. Data Partners should set the “Minimum Cell Count” threshold in WhiteRabbit to 5 to conform to CO-CONNECT standards.
Please remove any data values which could be deemed as confidential or sensitive from the scan report, such as the anonymised ID of the individual, date of birth, and their frequencies. When removing these values and frequencies from the Scan Report, please refrain from removing the entire column from the spreadsheet. For instance, please maintain the headers and remove the values, as in the example below:
Whilst the WhiteRabbit Scan Report provides a good understanding of a Data Partner’s data, a description of the codes used in each dataset is still required. Examples of common codes are M and F, or 0 and 1 for a sex or gender question. Please provide this information in one CSV file, with the following columns: csv_file_name, field_name, code, and value.
Below is an example of the expected content:
csv_file_name | field_name | code | value |
questionnaire | gender | 0 | Male |
questionnaire | gender | 1 | Female |
tab10 | q1 | 0 | Fever |
tab10 | q1 | 1 | Cough |
tab11 | q2 | 0 | Yes |
tab11 | q2 | 1 | No |
tab11 | q3 | SNOMED |
If a recognised standard code system is used in the dataset, please enter the name of the coding system in the code column. In the example above, the responses to question q3 are coded using SNOMED. The CO-CONNECT team can assist Data Partners who are unsure which coding systems present in their datasets.