The Research Condition and Disease Classification (RCDC) project is a cross-institute effort to create a knowledge management infrastructure that automates and standardizes the classification of grant applications into specific disease categories. The goal of this project is to give Congress and the public a better accounting of how much money NIH spends on each disease. The RCDC software classifies each document by a technique known as fingerprinting, which uses a domain-specific taxonomy to extract a set of concepts from the text. Comparing a grant’s fingerprint to all the disease fingerprints identifies the most appropriate disease code for the document. In collaboration with the Acting Director of CIT, Jack Jones, HPCIO is investigating ways to improve the current process without drastic disruptions to the existing workflow.
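As an illustration of the fingerprint comparison described above, the following is a minimal sketch that assumes a fingerprint can be represented as a mapping from concept to weight and that "closeness" is measured by cosine similarity. Both are assumptions made for illustration; the source does not specify how the Collexis software represents or compares fingerprints. The function names, concepts, and weights are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse concept-weight vectors (dicts)."""
    dot = sum(w * b.get(concept, 0.0) for concept, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty fingerprint matches nothing
    return dot / (norm_a * norm_b)


def best_category(grant_fp, disease_fps):
    """Return (category, score) for the disease fingerprint closest to the grant."""
    scored = ((name, cosine(grant_fp, fp)) for name, fp in disease_fps.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical fingerprints: concept -> weight (illustrative values only).
disease_fps = {
    "diabetes": {"insulin": 0.9, "glucose": 0.8, "pancreas": 0.5},
    "asthma": {"airway": 0.9, "inflammation": 0.6, "bronchodilator": 0.7},
}
grant_fp = {"insulin": 0.7, "glucose": 0.4, "inflammation": 0.2}

category, score = best_category(grant_fp, disease_fps)
print(category, round(score, 3))
```

In this toy example the grant shares two heavily weighted concepts with the hypothetical diabetes fingerprint and only one lightly weighted concept with the asthma fingerprint, so the diabetes category scores highest.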
We have evaluated the current RCDC disease coding process and its underlying technology, developed by Collexis®. We identified several shortcomings in the system, along with solutions to them.
Current and Future Work
In FY 2007, we will propose improvements to the RCDC system in four areas:
- assessment of the relationships between disease fingerprints from the semantic standpoint and in the vector space model;
- automatic assignment of the concept weights in disease fingerprints;
- statistical validation of the data generated in RCDC; and
- extraction of additional knowledge from document fingerprints.
The first three enhancements aim to improve the accuracy and efficiency of the disease coding process, while the last broadens the utility of the RCDC effort.
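To make the second item more concrete, the sketch below shows one way concept weights might be derived automatically from documents already assigned to a disease category. It uses a TF-IDF-style heuristic: a concept scores highly when it appears in most of the category's documents but in few documents overall. This heuristic is an assumption chosen for illustration and is not the method proposed in the source; the function name and the sample data are hypothetical.

```python
import math
from collections import Counter


def disease_weights(category_docs, all_docs):
    """Derive concept weights for one disease category.

    Each document is a set of concepts. `all_docs` is assumed to include
    every document in `category_docs`. Weights are scaled so the largest is 1.0.
    """
    n_all = len(all_docs)
    df_all = Counter(c for doc in all_docs for c in doc)
    df_cat = Counter(c for doc in category_docs for c in doc)

    raw = {}
    for concept, n_cat in df_cat.items():
        coverage = n_cat / len(category_docs)        # share of category docs with the concept
        rarity = math.log(n_all / df_all[concept])   # 0 when the concept is in every document
        raw[concept] = coverage * rarity

    top = max(raw.values(), default=0.0)
    if top == 0.0:
        return {concept: 0.0 for concept in raw}
    return {concept: w / top for concept, w in raw.items()}


# Hypothetical corpus: each document is the set of concepts extracted from it.
all_docs = [
    {"insulin", "glucose", "patient"},
    {"insulin", "patient"},
    {"airway", "patient"},
    {"airway", "inflammation", "patient"},
    {"airway", "bronchodilator"},
]
diabetes_docs = all_docs[:2]

weights = disease_weights(diabetes_docs, all_docs)
for concept, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{concept}: {w:.3f}")
```

Under this heuristic a generic concept such as "patient", which appears in most documents regardless of disease, receives a low weight, while concepts concentrated in the category's documents receive high weights.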
Jack Jones, Ph.D., Director, CIT, Office of the Director