This is a printer-friendly version of the HICDEP 1.30 specification available under http://www.hicdep.org/
This article describes the HICDEP version 1.30 which was released on the 28th of March 2008. It was superseded by version 1.50 on the 13th of September 2011. For a more detailed version history, please refer to the ChangeLog.
The table pages referenced in the overview describe the specific tables' structure in detail and present a list of suggested codes, both standard and human readable.
All codes apart from trivial no, yes or unknown codes are presented as lookup tables, the usage of these are described in the the article Considerations for using the format to create a database.
Along with the basic structure described in each ?Core fields? section, additional fields containing additional or more specific data are described in the ?Additional fields? sections. These fields were taken from several cohort collaborations but with the required changes that were needed for the specific data structures. This is presented to the reader to show that the core structure is not a fixed proposal but rather a basic structure, which can be altered by adding fields.
Issues regarding duplicates are discussed in Considerations For Data Management.
| Table | Content |
|---|---|
| tblAE | holds type and date of adverse events including serious non-AIDS conditions |
| tblART | holds type of antiretroviral drug, start and stop dates and reason for stopping |
| tblBAS | holds basic information such as demographics, basic clinical information, date of AIDS diagnosis, death and drop-out information |
| tblDIS | holds type and date of CDC-C diseases. |
| tblLAB | holds type, date, value and unit of laboratory tests. |
| tblLAB_BP | holds date, diastolic and systolic values and unit of blood pressure measurements. |
| tblLAB_CD4 | holds date and value of CD4 measurements. |
| tblLAB_RNA | holds date, value, detection limit and type of viral assay. |
| tblLAB_RES | holds background information on the resistance test, laboratory, library, kit, software and type of test |
| tblLAB_RES_LVL_1 | holds nucleoside sequence for the PRO and RT sequences |
| tblLAB_RES_LVL_2 | holds mutations and positions of these. |
| tblLAB_RES_LVL_3 | holds resistance result in relation to antiretroviral drug. |
| tblLAB_VIRO | holds test results for viro-/serological tests (hepatitis etc.) |
| tblLTFU | holds data in death and drop-out |
| tblMED | holds type, start and stop dates for other HIV related medicines. |
| tblOVERLAP | holds information on the patient's participation in other cohorts |
| tblVIS | holds visit related information, weight, wasting. |
The data collected in HIV collaborations is presented on the following pages in a set of data files/tables. Typically data would be put into one data file that would hold one line/record per patient where each field is represented as a separate column in that dataset. Often a dataset could contain more than 3000 columns of data.
The implication of going from thousands of fields to fewer fields means that data is in fact transposed from the flat format into the normalised format.
Example of a flat file structure:
| PATIENT | ALAT_D | ALAT_V | ALAT_U | ASAT_D | ASAT_V | ASAT_U |
|---|---|---|---|---|---|---|
| 999999 | 01-01-2000 | 15 | U/l | 01-01-2000 | 12 | U/l |
The normalised structure would then be like this:
| PATIENT | TYPE_ID | LAB_DATE | LAB_VAL | LAB_UNIT |
|---|---|---|---|---|
| 999999 | 1 | 01-01-2000 | 15 | U/l |
| 999999 | 2 | 01-01-2000 | 12 | U/l |
The type of measurement is identified through the TYPE_ID field. Here 1 codes for ALAT and 2 codes for ASAT:
| Code | Description |
|---|---|
| 1 | ALAT - Alanin-Aminotransferase |
| 2 | ASAT - Aspartat aminotransferase |
To enable a normalised structure that minimises the number of columns dramatically, the one file solution must be broken into several minor tables. These breakdowns are driven by the different data characteristics.
Each table has a basic structure that includes the patient identifier, a code that represents e.g. drug, adverse event or laboratory test performed. Along with this combination values like date, result, unit etc are present for each record.
A record for a laboratory measurement would include:
A record for usage of an antiretroviral drug would include:
These issues imply that a set of distinct tables must be generated based on the ?nature? of the data. Since laboratory, medication and event data both cannot and should not be mixed at least 3 tables must be designed. Additionally there are other types of information that need their own domains: background information on the patient (height, birth date etc.), visit related data (weight, blood pressure, wasting etc.), and resistance testing (the latter requires more consideration due to the diversity of data present).
In this protocol further separation of data into different tables are presented. These separations are not only based on the rules for the relational model and normalisation, but they are ?culturally? related.
For example: antiretroviral treatment medication is kept in one table and other medication in another table; CD4 cell measurements and HIV-RNA measurements are put into separate tables, that are also different from the general laboratory table. These separations are done simply because data in these tables are of distinct importance in analysis and often are gathered more frequently and with more attention than other variables.
Although it is best to have precise dates in the format of YEAR-MONTH-DAY ?ISO standard, it might be that some cohorts are limited to representing date data at the level of the month only, or information kept on the patient in the charts only defines dates to the month and in some cases only to the year. To solve this a set of date codes are presented here.
In this case the date should be coded as the 15th of the month ? so that 1999-12-?? becomes 1999-12-15. This enables the date to be no more than 15 days away from the actual date.
Best approach to this is to apply something similar, as with unknown dates, this would then mean that 1999-??-?? becomes 1999-07-01.
If the year is unknown but the presence of the date value is needed as in case of opportunistic infections or adverse events (see later in this document) a fictive date should be used that couldn?t be mistaken with an actual date. An unknown year should be coded as 1911-11-11.
An alternative to the above is to apply an additional field to each date field for which it is known that there might be issues regarding the precision of the dates. The field is then used to specify at which degree of the day, month or year the date is precise:
| Code | Precision of date |
|---|---|
| < | Before this date |
| D | Exact to the date |
| M | Exact to the month |
| Y | Exact to the year |
| > | After this date |
| U | Unknown |
The coding system is the official standard for coding of diseases, however there is a wide set of ?homebrew? codes used within the HIV field in data coding in general, often it?s a 3 or 4 letter codes that is an abbreviation for the AIDS defining disease. ICD-10 doesn?t have single codes that represent all single CDC-C events and as a consequence of this a list of 3 to 4 letter codes is the recommended way of coding for all CDC stage C events
ICD-10 codes are however the recommended for codes AE?s since it would become impossible for this protocol to maintain a complete list of all possible AE?s. ICD-10 is also recommended for causes of death.
ATC is a hierarchical structure for coding medication. The structure and hierarchy are best explained with an example of how a drug code is defined. Here it is on Indinavir:
This hierarchy has some benefits as will be explained later, but one of its limitations is that it?s impossible to ?read? the code compared to the widely used 3 letter mnemonic codes for antiretroviral drugs.
Example:
| Drug | Code | ATC code |
|---|---|---|
| Indinavir | IDV | J05AE02 |
The difference is that the IDV code is easily readable, where the ATC code is not; going from a flat file structure to a normalised structure the human readable aspect becomes increasingly important. In the flat file format the column names and the possibility of labels makes data more or less readable; in the normalised format only the coding can help. Because of this the 3 letter codes are being presented in this document. However it must be stressed that usage of the ATC coding should be used to diminish the risk of several homebrew and non-compatible coding schemes.
Currently however, the ATC scheme does not provide sufficient detail on the specific drugs, there is e.g. no official way to code Saquinavir as hard or soft gel. Thus a slight alteration to the set of codes will be presented in the sections of the ART and MED tables. The alterations are designed to extend the existing structure of ATC.
One of the benefits is that the structure of ATC allows easier statistics on e.g. drug classes
Although the codes might be harder to read they provide grouping mechanisms in the way they are coded. Interested readers should go to the ?ATC Website to learn about the structure of ATC. A fully updated database of ATC codes and DDD (Defined Daily Dosage) is available for querying.
It is often necessary to code for values like ?Yes?, ?No? and ?Unknown?, this document suggests that the following codes should be used:
| Code | Description |
|---|---|
| 0 | No |
| 1 | Yes |
| 9 | Unknown |
Unknown should be used to identify the difference between a value that has not yet been collected (Empty) and a value that cannot be collected (Unknown). Empty values should be required where Unknown values make little sense to keep querying for a value.
Example ? weight:
Depending on the unit in which weight is measured, a different value for Unknown should be applied. In the case of kg the ?Unknown? code should be 999 and not just 9 or 99, the last two could be actual values.
Blank values, for SAS users also known as "." and for database programmers known as NULL, should be used wherever specified in this protocol. However, sometimes it might be more correct just to omit the record if no value has been recorded, test has not been performed etc.
holds basic information such as demographics, basic clinical information, date of AIDS diagnosis, death and drop-out information
Note: Fields marked bold form the unique identifier for a record of the table.
For mode of infection, origin and death a set of other fields are often used to capture what cannot be coded. These fields are represented here as optional fields as it is the intention that the suggested codes applied to the MODE, ORIGIN, DEATH_R1-3 and ICD10_1-3 should be able to cover all possible values.
holds data in death and drop-out
Note: Fields marked bold form the unique identifier for a record of the table.
List of DEATH_R# and DEATH_RC# should be continued for as many reasons that are recorded.
The DEATH_RC# fields should enable cohorts to transfer data in accordance with the ?Coding of Death project (CoDe). You are welcome to contact the CoDe group for electronic sample forms for detailed collection of data used for the CoDe review process.
CoDe defines 1 immediate, 2 contributing and 1 underlying cause of death.
List of ICD10_# inplace of or together with DEATH_R# and together DEATH_RC# and should be continued for as many reasons that are recorded.
CoDe defines 1 immediate, 2 contributing and 1 underlying cause of death.
holds information on the patient's participation in other cohorts
Note: Fields marked bold form the unique identifier for a record of the table.
Patients of an ?original?-cohort who also participate in a ?super?-cohort should be analysed within the ?original?-cohort only. To suppress these patients from the datasets of the ?super?-cohorts the identifier used in the ?super?-cohort is needed. It is suggested that ?original?-cohorts report id?s from the ?super?-cohorts, since the ?super?-cohorts might not even know the other ID?s. Often this information is only available at centre level.
A record should be present for each cohort that the patient is participating in (apart from it?s own ?original?-cohort).
holds visit related information, weight, wasting.
Note: Fields marked bold form the unique identifier for a record of the table.
Depending on the collaboration this data might be collected in intervals of a year, e.g. from July last to July this year. In that case all visit dates or a fixed number of visit dates for that period should be gathered, if the patient did not have a visit in the defined period, a record with the PATIENT id and empty fields for VIS_D etc. should be included.
holds type of antiretroviral drug, start and stop dates and reason for stopping
Note: Fields marked bold form the unique identifier for a record of the table.
Depending on the aim of the study it might be needed to gather both the dosage and the frequency of the dosage taken. However many cohorts do not collect this date and thus these fields are optional.
A set of extended ATC codes are being presented here in order to code both more specific on subtypes of the drugs, e.g. saquinavir hard and soft gel, but also to enable coding of drugs that are at their trial stage and have not yet been assigned an ATC code. To do this the drug will be assigned the code elements as far down the levels as possible. Given two examples to illustrate this:
This will ensure the fidelity needed to distinguish between hard and soft gel and not specified, but also for analysis easily include all records which coding starts with J05AE01, regardless if the drug is hard or soft gel.
See the ?ATC Index for the individual codes.
Download this table as CSV file
Download this table as XML file| Code (Extended ATC Codes) | Anti-Retroviral Drugs |
|---|---|
| J05A | ART unspecified |
| J05A-BEV | Beviramat |
| J05A-PBT | Participant in Blinded Trial |
| J05AE | PI unspecified |
| J05AE-MOZ | Mozenavir (DMP-450) |
| J05AE01 | Saquinavir (gel, not specified) |
| J05AE01-SQH | Saquinavir hard gel (INVIRASE) |
| J05AE01-SQS | Saquinavir soft gel (FORTOVASE) |
| J05AE02 | Indinavir (CRIXIVAN) |
| J05AE03 | Ritonavir (NORVIR) |
| J05AE03-H | Ritonavir high dose (NORVIR) |
| J05AE03-L | Ritonavir low dose (NORVIR) |
| J05AE04 | Nelfinavir (VIRACEPT) |
| J05AE05 | Amprenavir (AGENERASE) |
| J05AE06 | Lopinavir/Ritonavir (Kaletra) |
| J05AE07 | Fos-amprenavir (Telzir, Lexiva) |
| J05AE08 | Atazanavir (Reyataz) |
| J05AE09 | Tipranavir (Aptivus) |
| J05AE10 | Darunavir (TMC-114, Prezista) |
| J05AF | NRTI unspecified |
| J05AF-ALO | Alovudine |
| J05AF-AMD | Amdoxovir (DADP) |
| J05AF-FOZ | Fozivudine tidoxi |
| J05AF-LDN | Lodenosine (trialdrug) |
| J05AF-RVT | Reverset |
| J05AF01 | Zidovudine (AZT, RETROVIR) |
| J05AF02 | Didanosine (ddI) (VIDEX) |
| J05AF03 | Zalcitabine (ddC) (HIVID) |
| J05AF04 | Stavudine (d4T) (ZERIT) |
| J05AF05 | Lamivudine (3TC, EPIVIR) |
| J05AF06 | Abacavir (1592U89) (ZIAGEN) |
| J05AF07 | Tenofovir (VilREAD) |
| J05AF08 | Adefovir (PREVEON) |
| J05AF09 | Emtricitabine |
| J05AF10 | Entecavir |
| J05AF11 | Telbivudine |
| J05AG | NNRTI unspecified |
| J05AG-CPV | Capravirine |
| J05AG-DPC083 | DPC 083 |
| J05AG-DPC961 | DPC 961 |
| J05AG-EMV | Emivirine (MKC442) |
| J05AG-ETV | Etravirine (TMC 125) |
| J05AG-LOV | Loviride |
| J05AG-RPV | Rilpivirine (TMC-278) |
| J05AG01 | Nevirapine (VIRAMUN) |
| J05AG02 | Delavirdine (U-90152) (RESCRIPTOR) |
| J05AG03 | Efavirenz (DMP-266) (STOCRIN, SUSTIVA) |
| J05AR01 | Combivir (Zidovudine/Lamivudine) |
| J05AR02 | Kivexa (Lamivudine/Abacavir) |
| J05AR03 | Truvada (Tenofovir/Emtricabine) |
| J05AR04 | Trizivir (Zidovudine/Lamivudine/Abacavir) |
| J05AR05 | Douvir-N (Zidovudine/Lamivudine/Nevirapine) |
| J05AR06 | Atripla (Emtricitabine/Tenofovir/Efavirenz) |
| J05AX-EVG | Elvitegravir (Gilead) |
| J05AX-VIC | Vicriviroc (Schering) |
| J05AX07 | Enfurvirtide (Fuzeon, T-20) |
| J05AX08 | Raltegravir (Merck) |
| J05AX09 | Maraviroc (Pfizer) |
| L01XX05 | Hydroxyurea/Hydroxycarbamid (Litalir) |
Download this table as CSV file
Download this table as XML file| Code | Reason for Stopping Treatment |
|---|---|
| 1 | Treatment failure (i.e. virological, immunological, and/or clinical failure |
| 1.1 | Virological failure |
| 1.2 | Partial virological failure |
| 1.3 | Immunological failure - CD4 drop |
| 1.4 | Clinical progression |
| 2 | Abnormal fat redistribution |
| 3 | Concern of cardiovascular disease |
| 3.1 | Dyslipidaemia |
| 3.2 | Cardiovascular disease |
| 4 | Hypersensitivity reaction |
| 5 | Toxicity, predominantly from abdomen/G-I tract |
| 5.1 | Toxicity - GI tract |
| 5.2 | Toxicity - Liver |
| 5.3 | Toxicity - Pancreas |
| 6 | Toxicity, predominantly from nervous system |
| 7 | Toxicity, predominantly from kidneys |
| 8 | Toxicity, predominantly from endocrine system |
| 8.1 | Diabetes |
| 9 | Haematological toxicity (anemia etc.) |
| 10 | Hyperlactataemie/lactic acidosis |
| 88 | Death |
| 90 | Side effects - any of the above but unspecified |
| 90.1 | Comorbidity |
| 91 | Toxicity, not mentioned above |
| 92 | Availability of more effective treatment (not specifically failure or side effect related) |
| 92.1 | Simplified treatment available |
| 92.2 | Treatment too complex |
| 92.3 | Drug interaction |
| 93 | Structured Treatment Interruption (STI) |
| 93.1 | Structured Treatment Interruption (STI) - at high CD4 |
| 94 | Patient's wish/decision, not specified above |
| 94.1 | Non-compliance |
| 95 | Physician's decision, not specified above |
| 96 | Pregnancy |
| 97 | Study treatment |
| 98 | Other causes, not specified above |
| 99 | Unknown |
Although most reasons mentioned above could be coded in ICD-10 codes, it is still certain that many reasons will never be defined in terms of ICD-10 (STI, physicians and patients decision), so to avoid a mixture of two different coding systems it is advised to go with the above list and build upon that for now.
holds type, start and stop dates for other HIV related medicines.
Note: Fields marked bold form the unique identifier for a record of the table.
Please see tblART - Antiretroviral treatment.
holds type and date of CDC-C diseases.
Note: Fields marked bold form the unique identifier for a record of the table.
1 DIS_OTH might be part of the record's unique identification
Please see tblAE - Adverse Events for specification on optional fields.
holds type, date, value and unit of laboratory tests.
Note: Fields marked bold form the unique identifier for a record of the table.
Other detailed information regarding the patient and the measurement would be relevant, like the proposed fasting information shown below.
Depending on the set of measurements collected and the mandatory fields applicable to these individual measurements, it might be useful to separate the LAB table into several sub tables. This is already shown for the CD4 and RNA measurements: the level of detail needed for CD4 is less than for the LAB variables in general (no unit since it?s always the same), while for RNA the data required is more detailed (assay and detection limit).
holds date and value of CD4 measurements.
Note: If needed, a CD8 table (tblLAB_CD8) could be formed from the same structure.
Note: Fields marked bold form the unique identifier for a record of the table.
CD4_V is assumed to contain absolute CD4 cell counts per mL as standard. In case CD4 % should be collected as well, please append the following field to the table:
holds date, value, detection limit and type of viral assay.
Note: Fields marked bold form the unique identifier for a record of the table.
holds date, diastolic and systolic values and unit of blood pressure measurements.
Note: Fields marked bold form the unique identifier for a record of the table.
holds test results for viro-/serological tests (hepatitis etc.)
Note: Fields marked bold form the unique identifier for a record of the table.
holds background information on the resistance test, laboratory, library, kit, software and type of test
Note: This table is tightly linked to tblLAB_RES_LVL_1, tblLAB_RES_LVL_2 and tblLAB_RES_LVL_3.
Resistance should be reported at lowest level of interpretation possible ? so if the nucleotide sequence is available this should be reported rather than the list of mutations or resistance scores. However, the resistance test results should be captured if they have been part of the physician?s treatment decisions for the patient.
These four tables are designed to capture several possible formats the clinics and cohorts might have recorded resistance test data in. Once this data is gathered it should like all other tables be quality assessed. For the full nucleotide sequences a short guide on ?Sequence Quality Control? can be found here:
?http://hiv-web.lanl.gov/content/hiv-db/CONTAM/contam_main.html
Note: Fields marked bold form the unique identifier for a record of the table.
As shown with the core fields, the SAMP_ID is the link between the 3 levels of data and the test background information table. The sample identifier, however, must be unique for the format to work. This might not always be the case. If needed SAMPLE_D could be used as an additional part of the key, or just SAMPLE_D along with the PATIENT key1.
Some prior assessment of the assigned sample identifiers has to be done in order to avoid duplicates.
In a running database the duplicate issues are easily resolved by adding a unique auto-generated key as the identifier between 3 levels of data and the test background information table SAMP_ID.
Along with the SAMP_ID it might be necessary to store the ID assigned to the sample at both the testing laboratory but also the centres laboratory in order to track the sample. Each of these could also be used as the SAMP_ID value.
1: However this raises the issue about several aliquots from the same day will look like duplicates in the tables.
holds nucleoside sequence for the PRO and RT sequences. No entry is made if the test was a phenotype test.
Note: This table is tightly linked to tblLAB_RES.
Note: Fields marked bold form the unique identifier for a record of the table.
In cases where the amino acid sequence is collected rather than the nucleotide sequence, the field SEQ_NUC might be replaced with SEQ_AA, which is the nucleotide sequence, expressed in an amino acid sequence:
However using the amino acid sequence does not give the same detail of data as the nucleoside sequence: wobbles in the nucleoside sequence can either complicate the reading and alignment of the amino acid sequence or the wobbles can be lost and silent mutations are lost.
holds mutations and positions of PRO and RT sequences.
Note: This table is tightly linked to tblLAB_RES.
Note: Fields marked bold form the unique identifier for a record of the table.
AA_FOUND_# could be extended if mixtures with more than 4 amino acids are found.
holds resistance result in relation to antiretroviral drug.
Note: This table is tightly linked to tblLAB_RES.
Note: Fields marked bold form the unique identifier for a record of the table.
For phenotype test results it will be necessary to extend the table with a field to store the cut-off value:
However using the amino acid sequence does not give the same detail of data as the nucleoside sequence: wobbles in the nucleoside sequence can either complicate the reading and alignment of the amino acid sequence or the wobbles can be lost and silent mutations are lost.
holds type and date of adverse events including serious non-AIDS conditions
Note: Fields marked bold form the unique identifier for a record of the table.
Please see the HICDEP website for examples of detailed AE tables for the events listed in 3.16.2 below. Data format is available in the HICDEP DAD event forms document.
Each table in this document is shown with the PATIENT field as part of the unique identifier for each record, in many cases it might be necessary to specify both PATIENT and COHORT as identifier since the PATIENT id might not be unique across cohorts, this should however be implemented only if needed.
Another solution, which might be worth considering when building the final dataset for the analysis, is to concatenate the patient id and the cohort id into a single unique new patient id.
Transposition of the flat format data, where there is one record per patient, into a normalised structure that has multiple rows per patient requires that the combination of PATIENT, TYPE_ID and LAB_DATE is unique for each row in the table.
The normalised structure and the relational model does not allow for duplicate records like in the following table to exist:
| PATIENT | TYPE_ID | LAB_DATE | LAB_VAL | LAB_UNIT |
|---|---|---|---|---|
| 999999 | 1 | 2000-01-01 | 15 | U/l |
| 999999 | 1 | 2000-01-01 | 15 | U/l |
The typical solution to this is to generate an auto-incremented value ? RECORD_ID - for each record in the table:
| RECORD_ID | PATIENT | TYPE_ID | LAB_DATE | LAB_VAL | LAB_UNIT |
|---|---|---|---|---|---|
| 1 | 999999 | 1 | 2000-01-01 | 15 | U/l |
| 2 | 999999 | 1 | 2000-01-01 | 15 | U/l |
But it can, and should be dealt with in the design of the study. The presence of duplicate records like in the example shown above might not make any sense; if they did occur it then would be on account of a recording error. If, however, it was important to keep records that show measurements taken the same day but at different times, the format of the LAB_DATE shouldn?t be date (YYYY-MM-DD) but instead should be date-time (YYYY-MM-DD hh:mm:ss):
| PATIENT | TYPE_ID | LAB_DATE | LAB_VAL | LAB_UNIT |
|---|---|---|---|---|
| 999999 | 1 | 2000-01-01 10:00:00 | 15 | U/l |
| 999999 | 1 | 2000-01-01 14:00:00 | 15 | U/l |
Different and more sophisticated methods to make records unique will be presented later in this document under the definition of the LAB table.
Sometimes it might be needed to have a fixed value that shows from which visit or merger a value originates, this does not only apply to the VIS table but could be applied to all tables. This however does depend on the nature of the database and needs for data management, the field below should be considered an administrative support field for data management.
| VISIT |
|---|
| Visit number |
| Numeric: 0 = Baseline Visit 1 = First follow up visit 2 = Second follow up visit etc. |
Often the above field is used for clinical trials databases where there is a need to associate the data directly with a given week?s follow-up. Codes could then be the week number e.g. 4, 12, 24 etc or ?1 for screening/randomisation and 0 for baseline visits.
In some cases it might be useful to have a separate field that defines the correct order of the periods. This becomes important where several dates are incomplete (unknown days, unknown months and possibly unknown years). The ordering by date would then not be correct.
One solution to this is use a PERI_ID field that numbers the periods from the 1st until Nth usage:
| PERI_ID |
|---|
| Period of usage (1st, 2nd, 3rd etc.) |
| Numeric |
However this is an optional field that for most cohorts may not be needed. It also requires additional maintenance to keep it updated.
For databases that work with double data entry, such as most clinical databases, it becomes necessary to make each data entry unique and backwards traceable. For this to work a field like the above would have to be part of the primary key of each table that requires double data entry.
| ENTRY_ID |
|---|
| Number of data entry |
| Numeric: 1 = first data entry 2 = second data entry 3 = comparison of 1st and 2nd data entry 4 = final approved record including corrections |
With respect to performance, it might also be a good design to have 3 copies of each table, one to hold the data while being entered and compared, one for the two data entries to be archived into once a final record has been approved and a table holding the final and approved values. This way it is avoided that queries will have to work on checking for ENTRY_ID = 4 and to select amongst a table holding 3 times the almost same data.
As part of an audit trail in a database a time stamp field could be added for each record to fix the exact time when the record last was inserted or updated. Along with the time stamp name of the user who entered or altered data can be recorded.
| T_STAMP | USER_LOG |
|---|---|
| Date and time of data entry | Username of user that last inserted or updated data |
| yyyy-mm-dd hh:mm:ss | character |
Often it's necessary to keep a log of user action in a separate table. The above suggestion will only be valid for inserts and updates, and only be valid for the most recent action performed.
To record a complete audit trail a logging facility must be implemented. In most database management systems this is done using triggers on the tables. For any insert, update or delete actions performed on the data, the user, time stamp, old value and new value are recorded in the logging table.
The T_STAMP field could also include information about which time zone is relevant for data entry. Depending on database requirements this might in fact be mandatory if the ?FDA?s 21 CRF part 11 on electronic records and signatures applies.
Depending on performance considerations it might be worth looking at how data are queried for data entry and data analysis. A smaller tblBAS table might increase performance: Since processing a smaller table is always faster than processing a larger table, one could put drop-out, death, birthday, date of aids diagnosis, etc. into separate tables and keep the core patient list in a separate master table
But if the database is used e.g. for BMI calculations directly on the running database, performance might be enhanced by keeping the patient list and the height together in the same table so that a query involves 2 tables (tblBAS and tblVIS) rather than perhaps 3 or more.
Another consideration is space. Although it may not be much of an issue, it will be possible to minimise the actual size of the database by putting fields that may be empty for most patients, like death information, into a separate table in a 0-1 to 1 relation to the master table.
In a running database the #_ID fields could be implemented as a foreign key to a linked lookup table containing all possible codes and their corresponding definitions in a text string.
This setup not only enables integrity of the data, but also defines the domain1 for the #_ID values and enables data to both become human readable and easily recoded2.
An important note on lookup tables is that the performance on a large database can be slowed significantly by using character based keys to link lookup tables with the primary table as it is presented in this document. A work around is to use numeric value for the codes.
1: Domain is a term in the definition of the relational database model that defines a set of allowed values for a given set of fields (attributes), the mixing of different domains is not allowed in order to preserve the integrity of a relational and normalised model.
2: Easily recoded permanently if the relation is specified as cascade on update or recoded dynamic by selecting a different column from the lookup table when querying the data through SQL
As already outlined in the above section, there are also performance issues that may have to be considered.
When using the suggested data types presented in this document for a database implementation, it may be worth looking at the actual data at hand when defining the final data types. The aim of this document is to present a format that will work between cohorts with rather different setups.
If it is at all possible in many cases there may be a large performance gain by using numeric instead of character fields. Character fields have been suggested here for, amongst others, the PATIENT field. If the PATIENT id is purely numeric it?s worth using a numeric data type since it always faster for querying than a character field.
Whenever the field has to be character, make sure that only the needed amount of space is assigned for the field length; there is no need to assign 50 characters of memory if the field in fact only stores a 3-letter code.
1:Detailed table definitions for the D:A:D events are available at ?http://www.cphiv.dk/HICDEP/Documents/tabid/159/Default.aspx
| Code | Coding for Reason of Stopping Treatment |
|---|---|
| 1.1 | Virological failure |
| 1.2 | Partial virological failure |
| 1.3 | Immunological failure ? CD4 drop |
| 1.4 | Clinical progression |
| 90 | Side effects ? any of the above but unspecified |
| 90.1 | Comorbidity |
| 92.1 | Simplified treatment available |
| 92.2 | Treatment to complex |
| 92.3 | Drug interaction |
| 93.1 | Structured Treatment Interruption (STI) ? at high CD4 |
| 94.1 | Non-compliance |
| 96 | Pregnancy |
| 97 | Study treatment |
First public versions that incorporated comments and corrections received from attendees at the 7th International Workshop on HIV Observational Databases, March 29th-30th 2003, Fiuggi, Italy and Stephen Hart.
Version presented at 7th International Workshop on HIV Observational Databases, March 29th-30th 2003, Fiuggi, Italy