WikiPrint - from Polar Technologies

This is a printer-friendly version of the HICDEP 1.30 specification available under http://www.hicdep.org/

HICDEP 1.30

  1. HICDEP 1.30
    1. General data format
      1. Overview of data tables
    2. Diagram
    3. Structure of data
      1. From flat files towards a normalized structure
      2. Technical considerations
    4. Coding Conventions
      1. Date codes
        1. Day unknown
        2. Month and day unknown
        3. Year unknown
        4. Specification of precision
      2. ?ICD-10 codes
      3. ?ATC codes
      4. Other codes

This article describes the HICDEP version 1.30 which was released on the 28th of March 2008. It was superseded by version 1.50 on the 13th of September 2011. For a more detailed version history, please refer to the ChangeLog.

General data format

The table pages referenced in the overview describe the specific tables' structure in detail and present a list of suggested codes, both standard and human readable.

All codes apart from trivial no, yes or unknown codes are presented as lookup tables, the usage of these are described in the the article Considerations for using the format to create a database.

Along with the basic structure described in each ?Core fields? section, additional fields containing additional or more specific data are described in the ?Additional fields? sections. These fields were taken from several cohort collaborations but with the required changes that were needed for the specific data structures. This is presented to the reader to show that the core structure is not a fixed proposal but rather a basic structure, which can be altered by adding fields.

Issues regarding duplicates are discussed in Considerations For Data Management.

Overview of data tables

Table Content
tblAE holds type and date of adverse events including serious non-AIDS conditions
tblART holds type of antiretroviral drug, start and stop dates and reason for stopping
tblBAS holds basic information such as demographics, basic clinical information, date of AIDS diagnosis, death and drop-out information
tblDIS holds type and date of CDC-C diseases.
tblLAB holds type, date, value and unit of laboratory tests.
tblLAB_BP holds date, diastolic and systolic values and unit of blood pressure measurements.
tblLAB_CD4 holds date and value of CD4 measurements.
tblLAB_RNA holds date, value, detection limit and type of viral assay.
tblLAB_RES holds background information on the resistance test, laboratory, library, kit, software and type of test
tblLAB_RES_LVL_1 holds nucleoside sequence for the PRO and RT sequences
tblLAB_RES_LVL_2 holds mutations and positions of these.
tblLAB_RES_LVL_3 holds resistance result in relation to antiretroviral drug.
tblLAB_VIRO holds test results for viro-/serological tests (hepatitis etc.)
tblLTFU holds data in death and drop-out
tblMED holds type, start and stop dates for other HIV related medicines.
tblOVERLAP holds information on the patient's participation in other cohorts
tblVIS holds visit related information, weight, wasting.

Diagram

diagram of HICDEP tables with relations

Structure of data

From flat files towards a normalized structure

The data collected in HIV collaborations is presented on the following pages in a set of data files/tables. Typically data would be put into one data file that would hold one line/record per patient where each field is represented as a separate column in that dataset. Often a dataset could contain more than 3000 columns of data.

The implication of going from thousands of fields to fewer fields means that data is in fact transposed from the flat format into the normalised format.

Example of a flat file structure:

PATIENT ALAT_D ALAT_V ALAT_U ASAT_D ASAT_V ASAT_U
999999 01-01-2000 15 U/l 01-01-2000 12 U/l

The normalised structure would then be like this:

PATIENT TYPE_ID LAB_DATE LAB_VAL LAB_UNIT
999999 1 01-01-2000 15 U/l
999999 2 01-01-2000 12 U/l

The type of measurement is identified through the TYPE_ID field. Here 1 codes for ALAT and 2 codes for ASAT:

Code Description
1 ALAT - Alanin-Aminotransferase
2 ASAT - Aspartat aminotransferase

Technical considerations

To enable a normalised structure that minimises the number of columns dramatically, the one file solution must be broken into several minor tables. These breakdowns are driven by the different data characteristics.

Each table has a basic structure that includes the patient identifier, a code that represents e.g. drug, adverse event or laboratory test performed. Along with this combination values like date, result, unit etc are present for each record.

A record for a laboratory measurement would include:

A record for usage of an antiretroviral drug would include:

These issues imply that a set of distinct tables must be generated based on the ?nature? of the data. Since laboratory, medication and event data both cannot and should not be mixed at least 3 tables must be designed. Additionally there are other types of information that need their own domains: background information on the patient (height, birth date etc.), visit related data (weight, blood pressure, wasting etc.), and resistance testing (the latter requires more consideration due to the diversity of data present).

In this protocol further separation of data into different tables are presented. These separations are not only based on the rules for the relational model and normalisation, but they are ?culturally? related.

For example: antiretroviral treatment medication is kept in one table and other medication in another table; CD4 cell measurements and HIV-RNA measurements are put into separate tables, that are also different from the general laboratory table. These separations are done simply because data in these tables are of distinct importance in analysis and often are gathered more frequently and with more attention than other variables.

Coding Conventions

Date codes

Although it is best to have precise dates in the format of YEAR-MONTH-DAY ?ISO standard, it might be that some cohorts are limited to representing date data at the level of the month only, or information kept on the patient in the charts only defines dates to the month and in some cases only to the year. To solve this a set of date codes are presented here.

Day unknown

In this case the date should be coded as the 15th of the month ? so that 1999-12-?? becomes 1999-12-15. This enables the date to be no more than 15 days away from the actual date.

Month and day unknown

Best approach to this is to apply something similar, as with unknown dates, this would then mean that 1999-??-?? becomes 1999-07-01.

Year unknown

If the year is unknown but the presence of the date value is needed as in case of opportunistic infections or adverse events (see later in this document) a fictive date should be used that couldn?t be mistaken with an actual date. An unknown year should be coded as 1911-11-11.

Specification of precision

An alternative to the above is to apply an additional field to each date field for which it is known that there might be issues regarding the precision of the dates. The field is then used to specify at which degree of the day, month or year the date is precise:

Code Precision of date
< Before this date
D Exact to the date
M Exact to the month
Y Exact to the year
> After this date
U Unknown

?ICD-10 codes

The coding system is the official standard for coding of diseases, however there is a wide set of ?homebrew? codes used within the HIV field in data coding in general, often it?s a 3 or 4 letter codes that is an abbreviation for the AIDS defining disease. ICD-10 doesn?t have single codes that represent all single CDC-C events and as a consequence of this a list of 3 to 4 letter codes is the recommended way of coding for all CDC stage C events

ICD-10 codes are however the recommended for codes AE?s since it would become impossible for this protocol to maintain a complete list of all possible AE?s. ICD-10 is also recommended for causes of death.

?ATC codes

ATC is a hierarchical structure for coding medication. The structure and hierarchy are best explained with an example of how a drug code is defined. Here it is on Indinavir:

J
ANTIINFECTIVES FOR SYSTEMIC USE (1st level, anatomical main group)
J05
ANTIVIRALS FOR SYSTEMIC USE (2nd level, therapeutic subgroup)
J05A
DIRECT ACTING ANTIVIRALS (3rd level, pharmacological subgroup)
J05AE
Protease inhibitors (4th level, chemical subgroup)
J05AE02
Indinavir (5th level, chemical substance)

This hierarchy has some benefits as will be explained later, but one of its limitations is that it?s impossible to ?read? the code compared to the widely used 3 letter mnemonic codes for antiretroviral drugs.

Example:

Drug Code ATC code
Indinavir IDV J05AE02

The difference is that the IDV code is easily readable, where the ATC code is not; going from a flat file structure to a normalised structure the human readable aspect becomes increasingly important. In the flat file format the column names and the possibility of labels makes data more or less readable; in the normalised format only the coding can help. Because of this the 3 letter codes are being presented in this document. However it must be stressed that usage of the ATC coding should be used to diminish the risk of several homebrew and non-compatible coding schemes.

Currently however, the ATC scheme does not provide sufficient detail on the specific drugs, there is e.g. no official way to code Saquinavir as hard or soft gel. Thus a slight alteration to the set of codes will be presented in the sections of the ART and MED tables. The alterations are designed to extend the existing structure of ATC.

One of the benefits is that the structure of ATC allows easier statistics on e.g. drug classes

Although the codes might be harder to read they provide grouping mechanisms in the way they are coded. Interested readers should go to the ?ATC Website to learn about the structure of ATC. A fully updated database of ATC codes and DDD (Defined Daily Dosage) is available for querying.

Other codes

It is often necessary to code for values like ?Yes?, ?No? and ?Unknown?, this document suggests that the following codes should be used:

Code Description
0 No
1 Yes
9 Unknown

Unknown should be used to identify the difference between a value that has not yet been collected (Empty) and a value that cannot be collected (Unknown). Empty values should be required where Unknown values make little sense to keep querying for a value.

Example ? weight:

Depending on the unit in which weight is measured, a different value for Unknown should be applied. In the case of kg the ?Unknown? code should be 999 and not just 9 or 99, the last two could be actual values.

Blank values, for SAS users also known as "." and for database programmers known as NULL, should be used wherever specified in this protocol. However, sometimes it might be more correct just to omit the record if no value has been recorded, test has not been performed etc.

tblBAS - Basic clinical, background and demographic information

  1. tblBAS - Basic clinical, background and demographic information
    1. Core fields
    2. Additional fields

holds basic information such as demographics, basic clinical information, date of AIDS diagnosis, death and drop-out information

Core fields

Note: Fields marked bold form the unique identifier for a record of the table.

Additional fields

TODO
why is DEATH_OT in tblBAS and not tblLTFU?

For mode of infection, origin and death a set of other fields are often used to capture what cannot be coded. These fields are represented here as optional fields as it is the intention that the suggested codes applied to the MODE, ORIGIN, DEATH_R1-3 and ICD10_1-3 should be able to cover all possible values.

tblLTFU - Death and drop-out

  1. tblLTFU - Death and drop-out
    1. Core fields
    2. Additional fields

holds data in death and drop-out

Core fields

Note: Fields marked bold form the unique identifier for a record of the table.

List of DEATH_R# and DEATH_RC# should be continued for as many reasons that are recorded.

The DEATH_RC# fields should enable cohorts to transfer data in accordance with the ?Coding of Death project (CoDe). You are welcome to contact the CoDe group for electronic sample forms for detailed collection of data used for the CoDe review process.

CoDe defines 1 immediate, 2 contributing and 1 underlying cause of death.

Additional fields

List of ICD10_# inplace of or together with DEATH_R# and together DEATH_RC# and should be continued for as many reasons that are recorded.

CoDe defines 1 immediate, 2 contributing and 1 underlying cause of death.

tblOVERLAP - Cross-cohort identification

holds information on the patient's participation in other cohorts

Core fields

Note: Fields marked bold form the unique identifier for a record of the table.

Patients of an ?original?-cohort who also participate in a ?super?-cohort should be analysed within the ?original?-cohort only. To suppress these patients from the datasets of the ?super?-cohorts the identifier used in the ?super?-cohort is needed. It is suggested that ?original?-cohorts report id?s from the ?super?-cohorts, since the ?super?-cohorts might not even know the other ID?s. Often this information is only available at centre level.

A record should be present for each cohort that the patient is participating in (apart from it?s own ?original?-cohort).

tblVIS - Basic follow-up/visit related data

  1. tblVIS - Basic follow-up/visit related data
    1. Core fields

holds visit related information, weight, wasting.

Core fields

Note: Fields marked bold form the unique identifier for a record of the table.

Depending on the collaboration this data might be collected in intervals of a year, e.g. from July last to July this year. In that case all visit dates or a fixed number of visit dates for that period should be gathered, if the patient did not have a visit in the defined period, a record with the PATIENT id and empty fields for VIS_D etc. should be included.

tblART - Antiretroviral treatment

  1. tblART - Antiretroviral treatment
    1. Core fields
    2. Additional fields

holds type of antiretroviral drug, start and stop dates and reason for stopping

Core fields

Note: Fields marked bold form the unique identifier for a record of the table.

Additional fields

Depending on the aim of the study it might be needed to gather both the dosage and the frequency of the dosage taken. However many cohorts do not collect this date and thus these fields are optional.

PATIENT field

  1. PATIENT field

containing table
tblART
explanation of variable
Code to identify patient (Cohort Patient ID)
format of data
character (or numeric if possible)
exists since HICDEP version
1.30

ART_ID field

  1. ART_ID field
    1. Coding Table

containing table
tblART
explanation of variable
Code representing the antiretroviral treatment
format of data
character. see coding table for valid codings.
exists since HICDEP version
1.30

Coding Table

A set of extended ATC codes are being presented here in order to code both more specific on subtypes of the drugs, e.g. saquinavir hard and soft gel, but also to enable coding of drugs that are at their trial stage and have not yet been assigned an ATC code. To do this the drug will be assigned the code elements as far down the levels as possible. Given two examples to illustrate this:

Saquinavir - Hard Gel
J05AE01-SQH
Saquinavir - Soft Gel
J05AE01-SQS
Saquinavir - not specified
J05AE01

This will ensure the fidelity needed to distinguish between hard and soft gel and not specified, but also for analysis easily include all records which coding starts with J05AE01, regardless if the drug is hard or soft gel.

See the ?ATC Index for the individual codes.

save Download this table as CSV filesave Download this table as XML file
Code (Extended ATC Codes)Anti-Retroviral Drugs
J05AART unspecified
J05A-BEVBeviramat
J05A-PBTParticipant in Blinded Trial
J05AEPI unspecified
J05AE-MOZMozenavir (DMP-450)
J05AE01Saquinavir (gel, not specified)
J05AE01-SQHSaquinavir hard gel (INVIRASE)
J05AE01-SQSSaquinavir soft gel (FORTOVASE)
J05AE02Indinavir (CRIXIVAN)
J05AE03Ritonavir (NORVIR)
J05AE03-HRitonavir high dose (NORVIR)
J05AE03-LRitonavir low dose (NORVIR)
J05AE04Nelfinavir (VIRACEPT)
J05AE05Amprenavir (AGENERASE)
J05AE06Lopinavir/Ritonavir (Kaletra)
J05AE07Fos-amprenavir (Telzir, Lexiva)
J05AE08Atazanavir (Reyataz)
J05AE09Tipranavir (Aptivus)
J05AE10Darunavir (TMC-114, Prezista)
J05AFNRTI unspecified
J05AF-ALOAlovudine
J05AF-AMDAmdoxovir (DADP)
J05AF-FOZFozivudine tidoxi
J05AF-LDNLodenosine (trialdrug)
J05AF-RVTReverset
J05AF01Zidovudine (AZT, RETROVIR)
J05AF02Didanosine (ddI) (VIDEX)
J05AF03Zalcitabine (ddC) (HIVID)
J05AF04Stavudine (d4T) (ZERIT)
J05AF05Lamivudine (3TC, EPIVIR)
J05AF06Abacavir (1592U89) (ZIAGEN)
J05AF07Tenofovir (VilREAD)
J05AF08Adefovir (PREVEON)
J05AF09Emtricitabine
J05AF10Entecavir
J05AF11Telbivudine
J05AGNNRTI unspecified
J05AG-CPVCapravirine
J05AG-DPC083DPC 083
J05AG-DPC961DPC 961
J05AG-EMVEmivirine (MKC442)
J05AG-ETVEtravirine (TMC 125)
J05AG-LOVLoviride
J05AG-RPVRilpivirine (TMC-278)
J05AG01Nevirapine (VIRAMUN)
J05AG02Delavirdine (U-90152) (RESCRIPTOR)
J05AG03Efavirenz (DMP-266) (STOCRIN, SUSTIVA)
J05AR01Combivir (Zidovudine/Lamivudine)
J05AR02Kivexa (Lamivudine/Abacavir)
J05AR03Truvada (Tenofovir/Emtricabine)
J05AR04Trizivir (Zidovudine/Lamivudine/Abacavir)
J05AR05Douvir-N (Zidovudine/Lamivudine/Nevirapine)
J05AR06Atripla (Emtricitabine/Tenofovir/Efavirenz)
J05AX-EVGElvitegravir (Gilead)
J05AX-VICVicriviroc (Schering)
J05AX07Enfurvirtide (Fuzeon, T-20)
J05AX08Raltegravir (Merck)
J05AX09Maraviroc (Pfizer)
L01XX05Hydroxyurea/Hydroxycarbamid (Litalir)

ART_SD field

  1. ART_SD field

containing table
tblART
explanation of variable
Date of Initiation of treatment
format of data
yyyy-mm-dd
exists since HICDEP version
1.30

ART_ED field

  1. ART_ED field

containing table
tblART
explanation of variable
Date of stopping treatment
format of data
yyyy-mm-dd
exists since HICDEP version
1.30

ART_RS field

  1. ART_RS field
    1. Coding Table

containing table
tblART
explanation of variable
Reason for stopping treatment
format of data
character. see coding table for valid codings.
exists since HICDEP version
1.30

Coding Table

save Download this table as CSV filesave Download this table as XML file
CodeReason for Stopping Treatment
1Treatment failure (i.e. virological, immunological, and/or clinical failure
1.1Virological failure
1.2Partial virological failure
1.3Immunological failure - CD4 drop
1.4Clinical progression
2Abnormal fat redistribution
3Concern of cardiovascular disease
3.1Dyslipidaemia
3.2Cardiovascular disease
4Hypersensitivity reaction
5Toxicity, predominantly from abdomen/G-I tract
5.1Toxicity - GI tract
5.2Toxicity - Liver
5.3Toxicity - Pancreas
6Toxicity, predominantly from nervous system
7Toxicity, predominantly from kidneys
8Toxicity, predominantly from endocrine system
8.1Diabetes
9Haematological toxicity (anemia etc.)
10Hyperlactataemie/lactic acidosis
88Death
90Side effects - any of the above but unspecified
90.1Comorbidity
91Toxicity, not mentioned above
92Availability of more effective treatment (not specifically failure or side effect related)
92.1Simplified treatment available
92.2Treatment too complex
92.3Drug interaction
93Structured Treatment Interruption (STI)
93.1Structured Treatment Interruption (STI) - at high CD4
94Patient's wish/decision, not specified above
94.1Non-compliance
95Physician's decision, not specified above
96Pregnancy
97Study treatment
98Other causes, not specified above
99Unknown

Although most reasons mentioned above could be coded in ICD-10 codes, it is still certain that many reasons will never be defined in terms of ICD-10 (STI, physicians and patients decision), so to avoid a mixture of two different coding systems it is advised to go with the above list and build upon that for now.

tblMED - Other medication

  1. tblMED - Other medication
    1. Core fields
    2. Additional fields

holds type, start and stop dates for other HIV related medicines.

Core fields

Note: Fields marked bold form the unique identifier for a record of the table.

Additional fields

Please see tblART - Antiretroviral treatment.

tblDIS - Opportunistic infections

  1. tblDIS - Opportunistic infections
    1. Core fields
    2. Additional fields

holds type and date of CDC-C diseases.

Core fields

Note: Fields marked bold form the unique identifier for a record of the table.

1 DIS_OTH might be part of the record's unique identification

Additional fields

Please see tblAE - Adverse Events for specification on optional fields.

tblLAB - Laboratory values

  1. tblLAB - Laboratory values
    1. Core fields
    2. Additional fields

holds type, date, value and unit of laboratory tests.

Core fields

Note: Fields marked bold form the unique identifier for a record of the table.

Additional fields

Other detailed information regarding the patient and the measurement would be relevant, like the proposed fasting information shown below.

Depending on the set of measurements collected and the mandatory fields applicable to these individual measurements, it might be useful to separate the LAB table into several sub tables. This is already shown for the CD4 and RNA measurements: the level of detail needed for CD4 is less than for the LAB variables in general (no unit since it?s always the same), while for RNA the data required is more detailed (assay and detection limit).

tblLAB_CD4 - Laboratory values

  1. tblLAB_CD4 - Laboratory values
    1. Core fields
    2. Additional fields

holds date and value of CD4 measurements.

Note: If needed, a CD8 table (tblLAB_CD8) could be formed from the same structure.

Core fields

Note: Fields marked bold form the unique identifier for a record of the table.

Additional fields

CD4_V is assumed to contain absolute CD4 cell counts per mL as standard. In case CD4 % should be collected as well, please append the following field to the table:

tblLAB_RNA - Laboratory values

  1. tblLAB_RNA - Laboratory values
    1. Core fields
    2. Additional fields

holds date, value, detection limit and type of viral assay.

Core fields

Note: Fields marked bold form the unique identifier for a record of the table.

Additional fields

tblLAB_BP - Laboratory values - Blood pressure

  1. tblLAB_BP - Laboratory values - Blood pressure
    1. Core fields

holds date, diastolic and systolic values and unit of blood pressure measurements.

Core fields

Note: Fields marked bold form the unique identifier for a record of the table.

tblLAB_VIRO - Laboratory values - viro-/serology

  1. tblLAB_VIRO - Laboratory values - viro-/serology
    1. Core fields
    2. Additional fields

holds test results for viro-/serological tests (hepatitis etc.)

Core fields

Note: Fields marked bold form the unique identifier for a record of the table.

Additional fields

tblLAB_RES - Resistance testing

  1. tblLAB_RES - Resistance testing
    1. Core fields
    2. Additional fields

holds background information on the resistance test, laboratory, library, kit, software and type of test

Note: This table is tightly linked to tblLAB_RES_LVL_1, tblLAB_RES_LVL_2 and tblLAB_RES_LVL_3.

Resistance should be reported at lowest level of interpretation possible ? so if the nucleotide sequence is available this should be reported rather than the list of mutations or resistance scores. However, the resistance test results should be captured if they have been part of the physician?s treatment decisions for the patient.

These four tables are designed to capture several possible formats the clinics and cohorts might have recorded resistance test data in. Once this data is gathered it should like all other tables be quality assessed. For the full nucleotide sequences a short guide on ?Sequence Quality Control? can be found here:

TODO
fix the following link

?http://hiv-web.lanl.gov/content/hiv-db/CONTAM/contam_main.html

Core fields

Note: Fields marked bold form the unique identifier for a record of the table.

Additional fields

As shown with the core fields, the SAMP_ID is the link between the 3 levels of data and the test background information table. The sample identifier, however, must be unique for the format to work. This might not always be the case. If needed SAMPLE_D could be used as an additional part of the key, or just SAMPLE_D along with the PATIENT key1.

Some prior assessment of the assigned sample identifiers has to be done in order to avoid duplicates.

In a running database the duplicate issues are easily resolved by adding a unique auto-generated key as the identifier between 3 levels of data and the test background information table SAMP_ID.

Along with the SAMP_ID it might be necessary to store the ID assigned to the sample at both the testing laboratory but also the centres laboratory in order to track the sample. Each of these could also be used as the SAMP_ID value.

1: However this raises the issue about several aliquots from the same day will look like duplicates in the tables.

tblLAB_RES_LVL_1 - Nucleotide sequences (PRO, RT, GP41, GP120)

  1. tblLAB_RES_LVL_1 - Nucleotide sequences (PRO, RT, GP41, GP120)
    1. Core fields
    2. Additional fields

holds nucleoside sequence for the PRO and RT sequences. No entry is made if the test was a phenotype test.

Note: This table is tightly linked to tblLAB_RES.

Core fields

Note: Fields marked bold form the unique identifier for a record of the table.

Additional fields

In cases where the amino acid sequence is collected rather than the nucleotide sequence, the field SEQ_NUC might be replaced with SEQ_AA, which is the nucleotide sequence, expressed in an amino acid sequence:

However using the amino acid sequence does not give the same detail of data as the nucleoside sequence: wobbles in the nucleoside sequence can either complicate the reading and alignment of the amino acid sequence or the wobbles can be lost and silent mutations are lost.

tblLAB_RES_LVL_2 - Mutations

  1. tblLAB_RES_LVL_2 - Mutations
    1. Core fields

holds mutations and positions of PRO and RT sequences.

Note: This table is tightly linked to tblLAB_RES.

Core fields

Note: Fields marked bold form the unique identifier for a record of the table.

AA_FOUND_# could be extended if mixtures with more than 4 amino acids are found.

tblLAB_RES_LVL_3 - Resistance test result

  1. tblLAB_RES_LVL_3 - Resistance test result
    1. Core fields
    2. Additional fields

holds resistance result in relation to antiretroviral drug.

Note: This table is tightly linked to tblLAB_RES.

Core fields

Note: Fields marked bold form the unique identifier for a record of the table.

Additional fields

For phenotype test results it will be necessary to extend the table with a field to store the cut-off value:

However using the amino acid sequence does not give the same detail of data as the nucleoside sequence: wobbles in the nucleoside sequence can either complicate the reading and alignment of the amino acid sequence or the wobbles can be lost and silent mutations are lost.

tblAE - Adverse Events

  1. tblAE - Adverse Events
    1. Core fields
    2. Additional fields

holds type and date of adverse events including serious non-AIDS conditions

Core fields

Note: Fields marked bold form the unique identifier for a record of the table.

TODO
where to put the following text?

Please see the HICDEP website for examples of detailed AE tables for the events listed in 3.16.2 below. Data format is available in the HICDEP DAD event forms document.

Additional fields

Considerations for data management

Uniqueness of PATIENT ID

Each table in this document is shown with the PATIENT field as part of the unique identifier for each record, in many cases it might be necessary to specify both PATIENT and COHORT as identifier since the PATIENT id might not be unique across cohorts, this should however be implemented only if needed.

Another solution, which might be worth considering when building the final dataset for the analysis, is to concatenate the patient id and the cohort id into a single unique new patient id.

Duplicate records

Transposition of the flat format data, where there is one record per patient, into a normalised structure that has multiple rows per patient requires that the combination of PATIENT, TYPE_ID and LAB_DATE is unique for each row in the table.

The normalised structure and the relational model does not allow for duplicate records like in the following table to exist:

PATIENT TYPE_ID LAB_DATE LAB_VAL LAB_UNIT
999999 1 2000-01-01 15 U/l
999999 1 2000-01-01 15 U/l

The typical solution to this is to generate an auto-incremented value ? RECORD_ID - for each record in the table:

RECORD_ID PATIENT TYPE_ID LAB_DATE LAB_VAL LAB_UNIT
1 999999 1 2000-01-01 15 U/l
2 999999 1 2000-01-01 15 U/l

But it can, and should be dealt with in the design of the study. The presence of duplicate records like in the example shown above might not make any sense; if they did occur it then would be on account of a recording error. If, however, it was important to keep records that show measurements taken the same day but at different times, the format of the LAB_DATE shouldn?t be date (YYYY-MM-DD) but instead should be date-time (YYYY-MM-DD hh:mm:ss):

PATIENT TYPE_ID LAB_DATE LAB_VAL LAB_UNIT
999999 1 2000-01-01 10:00:00 15 U/l
999999 1 2000-01-01 14:00:00 15 U/l

Different and more sophisticated methods to make records unique will be presented later in this document under the definition of the LAB table.

Considerations for using the format to create a database

  1. Considerations for using the format to create a database
    1. Administrative fields
    2. Further normalisation
    3. Lookup tables
    4. Performance

Administrative fields

Sometimes it might be needed to have a fixed value that shows from which visit or merger a value originates, this does not only apply to the VIS table but could be applied to all tables. This however does depend on the nature of the database and needs for data management, the field below should be considered an administrative support field for data management.

VISIT
Visit number
Numeric:
0 = Baseline Visit
1 = First follow up visit
2 = Second follow up visit
etc.

Often the above field is used for clinical trials databases where there is a need to associate the data directly with a given week?s follow-up. Codes could then be the week number e.g. 4, 12, 24 etc or ?1 for screening/randomisation and 0 for baseline visits.

In some cases it might be useful to have a separate field that defines the correct order of the periods. This becomes important where several dates are incomplete (unknown days, unknown months and possibly unknown years). The ordering by date would then not be correct.

One solution to this is use a PERI_ID field that numbers the periods from the 1st until Nth usage:

PERI_ID
Period of usage (1st, 2nd, 3rd etc.)
Numeric

However this is an optional field that for most cohorts may not be needed. It also requires additional maintenance to keep it updated.

For databases that work with double data entry, such as most clinical databases, it becomes necessary to make each data entry unique and backwards traceable. For this to work a field like the above would have to be part of the primary key of each table that requires double data entry.

ENTRY_ID
Number of data entry
Numeric:
1 = first data entry
2 = second data entry
3 = comparison of 1st and 2nd data entry
4 = final approved record including corrections

With respect to performance, it might also be a good design to have 3 copies of each table, one to hold the data while being entered and compared, one for the two data entries to be archived into once a final record has been approved and a table holding the final and approved values. This way it is avoided that queries will have to work on checking for ENTRY_ID = 4 and to select amongst a table holding 3 times the almost same data.

As part of an audit trail in a database a time stamp field could be added for each record to fix the exact time when the record last was inserted or updated. Along with the time stamp name of the user who entered or altered data can be recorded.

T_STAMP USER_LOG
Date and time of data entry Username of user that last inserted or updated data
yyyy-mm-dd hh:mm:ss character

Often it's necessary to keep a log of user action in a separate table. The above suggestion will only be valid for inserts and updates, and only be valid for the most recent action performed.

To record a complete audit trail a logging facility must be implemented. In most database management systems this is done using triggers on the tables. For any insert, update or delete actions performed on the data, the user, time stamp, old value and new value are recorded in the logging table.

The T_STAMP field could also include information about which time zone is relevant for data entry. Depending on database requirements this might in fact be mandatory if the ?FDA?s 21 CRF part 11 on electronic records and signatures applies.

Further normalisation

Depending on performance considerations it might be worth looking at how data are queried for data entry and data analysis. A smaller tblBAS table might increase performance: Since processing a smaller table is always faster than processing a larger table, one could put drop-out, death, birthday, date of aids diagnosis, etc. into separate tables and keep the core patient list in a separate master table

But if the database is used e.g. for BMI calculations directly on the running database, performance might be enhanced by keeping the patient list and the height together in the same table so that a query involves 2 tables (tblBAS and tblVIS) rather than perhaps 3 or more.

Another consideration is space. Although it may not be much of an issue, it will be possible to minimise the actual size of the database by putting fields that may be empty for most patients, like death information, into a separate table in a 0-1 to 1 relation to the master table.

Lookup tables

In a running database the #_ID fields could be implemented as a foreign key to a linked lookup table containing all possible codes and their corresponding definitions in a text string.

ART lookup table

This setup not only enables integrity of the data, but also defines the domain1 for the #_ID values and enables data to both become human readable and easily recoded2.

An important note on lookup tables is that the performance on a large database can be slowed significantly by using character based keys to link lookup tables with the primary table as it is presented in this document. A work around is to use numeric value for the codes.

1: Domain is a term in the definition of the relational database model that defines a set of allowed values for a given set of fields (attributes), the mixing of different domains is not allowed in order to preserve the integrity of a relational and normalised model.

2: Easily recoded permanently if the relation is specified as cascade on update or recoded dynamic by selecting a different column from the lookup table when querying the data through SQL

Performance

As already outlined in the above section, there are also performance issues that may have to be considered.

When using the suggested data types presented in this document for a database implementation, it may be worth looking at the actual data at hand when defining the final data types. The aim of this document is to present a format that will work between cohorts with rather different setups.

If it is at all possible in many cases there may be a large performance gain by using numeric instead of character fields. Character fields have been suggested here for, amongst others, the PATIENT field. If the PATIENT id is purely numeric it?s worth using a numeric data type since it always faster for querying than a character field.

Whenever the field has to be character, make sure that only the needed amount of space is assigned for the field length; there is no need to assign 50 characters of memory if the field in fact only stores a 3-letter code.

Wiki page "CaseDefinitions" does not exist

Change log

  1. Change log
    1. Draft Version 1.100
    2. Version 1.90
    3. Version 1.80
    4. Version 1.70
    5. Version 1.60
    6. Version 1.50
    7. Version 1.30
    8. Version 1.25
    9. Version 1.21
    10. Version 1.2
    11. Version 1.1
    12. Version 1.00
    13. Version 0.50 and 0.90
    14. Version 0.38

Draft Version 1.100

Version 1.90

Version 1.80

Version 1.70

Version 1.60

Version 1.50

Version 1.30

1:Detailed table definitions for the D:A:D events are available at ?http://www.cphiv.dk/HICDEP/Documents/tabid/159/Default.aspx

Version 1.25

Version 1.21

Code Coding for Reason of Stopping Treatment
1.1 Virological failure
1.2 Partial virological failure
1.3 Immunological failure ? CD4 drop
1.4 Clinical progression
90 Side effects ? any of the above but unspecified
90.1 Comorbidity
92.1 Simplified treatment available
92.2 Treatment to complex
92.3 Drug interaction
93.1 Structured Treatment Interruption (STI) ? at high CD4
94.1 Non-compliance
96 Pregnancy
97 Study treatment

Version 1.2

Version 1.1

Version 1.00

Version 0.50 and 0.90

First public versions that incorporated comments and corrections received from attendees at the 7th International Workshop on HIV Observational Databases, March 29th-30th 2003, Fiuggi, Italy and Stephen Hart.

Version 0.38

Version presented at 7th International Workshop on HIV Observational Databases, March 29th-30th 2003, Fiuggi, Italy