TO: John Callahan
Assistant Secretary for Management and Budget
FROM: Director, NIH
Subject: Request for Comments on Clarifying Changes to Proposed Revision on Access to Research Data.
The OMB has solicited comments on their clarifying changes to the proposed revision of A110 regarding access to research data. Specifically, they asked for comments on the following:
Definition of "Data": NIH concurs with the clarification of the definition of data. Data are now referred to as “research data” and are defined as “the recorded factual material commonly accepted in the scientific community as necessary to validate research findings, but not any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues.” The exclusion of information that would constitute a clearly unwarranted invasion of personal privacy, including information that could identify a specific participant in a research study, is absolutely essential. We recommend that in section III. A., the term "files" be replaced by "information" in the sentence "Moreover, under the proposed definition, 'research data' would exclude (A)… and (B) personnel and medical files and similar files the disclosure of which would constitute a clearly unwarranted invasion of personal privacy…." This would ensure that the privacy of such information is protected equally well if it resides in a medical file or record or in some other research format. The changes proposed by OMB mitigate many of our concerns about the agency obtaining such data, redacting it, and holding both the redacted and unredacted data sets.
Scope: The earlier language applied the amendment to Federal policies and rules, which caused concern to NIH since it opened a vast arena of Federal activities. Such concerns have been addressed in the revised language, which focuses specifically on regulations. The proposed further limitation to regulations of significant economic impact is also desirable.
It would have been especially useful if the application of this amendment also focused on significant scientific findings. When a regulatory agency cites research in the regulatory process, that research may be critically or marginally applicable to that regulation. A brief review of regulations revealed that some cite hundreds of research studies, all of which would be subject to FOIA under this amendment. It would greatly reduce the burden of this legislation if access were afforded to data from only those studies that were critical in the formulation of the regulation.
There are several existing mechanisms for making data available to other researchers, none of which is as burdensome as FOIA. These mechanisms include archives, such as the Inter-University Consortium for Political and Social Research (ICPSR) at the University of Michigan, where data are made available at a modest cost and come with complete documentation and often technical support for the user. Some investigators make data available on the web, building in protections for privacy through the software while allowing analysis of the data. Yet another mechanism involves data repositories. Data repositories maintain control of the data but receive and fulfill requests for analysis. The National Center for Health Statistics serves as a data repository in cases where the risk to privacy is too great to allow the data to leave its site. Because the Center nevertheless wants others to be able to use the data for their own research, it conducts the analyses “on demand” as specified by outside investigators. In each of these existing models, the goals of the legislation are already being met. We urge that the scope of this amendment be restricted to "data not otherwise already available for reanalysis".
Definition of "Published": The OMB proposed clarification of the definition of published research findings is very valuable. Published findings are now defined as those published in a peer reviewed scientific or technical journal or those publicly and officially cited by a Federal agency in support of an action. NIH concurs with the clarified definition and finds that it will eliminate many concerns about premature release of data while fulfilling the spirit of the original language.
Cost reimbursement: Determination of costs associated with providing data and mechanisms for reimbursement presents significant challenges. At this time it is not clear how those challenges will be met. The costs associated with providing data under this amendment are likely to be substantially greater than costs incurred to fulfill current FOIA requests. Unlike data that are currently provided through FOIA, the data covered by the proposed amendment are not in the possession of the agency. Thus, in order to administer the request for data, the agency must request the data from the investigator or the grantee organization, import the data set, review and redact the data set, and release it to the requestor. The process of importing and exporting data sets can be difficult and expensive, especially if the investigator used software that was custom made for the project. The process of reviewing and redacting data will require the time and skills of individuals with a range of specialized training, including training in the substantive area covered by the research data as well as epidemiology and biostatistics to ensure that redaction adequately protects the identity of research subjects. This is a broad set of skills, not typically found in a FOIA office. It is unclear how agencies will identify and make available individuals with the broad array of backgrounds and training needed to process requests for data. Similarly, the grantee organizations will need to establish a structure and procedures to handle requests for data, as discussed below. Thus, the “administrative” demands that the proposed amendment places on both agencies and grantee organizations go well beyond current FOIA capabilities. Current FOIA processes are already costly for agencies, and the proposed amendment would expand those costs significantly.
It is difficult, if not impossible, to estimate with any accuracy the actual costs of providing the data. What we can safely say is that the range of costs will be huge. The cost of providing data from a small study that collected information on 20 variables from 50 rats at one point in time would be minimal. However, the cost of providing data from a face-to-face survey of 4,000 adults, with 300 variables and repeated measures over time, would be very much greater. Providing such a data set could include the costs of redacting a large and complex data set and producing a code book for the redacted data set to make the data set usable.
Similarly, there are uncertainties about the mechanism to be used to recoup these costs. Costs are incurred in three basic ways. First, universities and other nonprofit organizations conducting research will need to put in place a structure and procedures for dealing with FOIA requests for data. Both the administrative and accounting structure and the procedures will need to be established before receiving an actual request in order to be in compliance with A110. This aspect of the costs to universities would likely provoke a request from the grantees for relief from the 26% cap on the administrative component of indirect costs.
Second, in addition to establishing an infrastructure to respond to requests for data, institutions will face costs associated with providing the data for specific requests. These costs would be appropriately paid by the requestor, as noted in the legislation. There are many difficulties associated with agencies being the conduit for such funds and we seek to avoid building any new accounting or budget procedure. Therefore, we recommend that the costs of filling a specific request be paid by the requestor directly to the research institution following agency confirmation that the agency has the data ready to send to the requestor. These funds would not be considered program income. This plan would ensure that institutions received compensation and that the administrative burdens were minimized.
Finally, the amendment acknowledges the costs incurred by the agency but proposes the same compensation practices currently used under FOIA. This fails to recognize that the burdens on the agencies are likely to be far greater under A110 than in the current FOIA system. As the costs associated with A110 requests for data rise, it will be increasingly important that the fees paid be retained by the agency, not the Treasury Department. In the earlier draft of the amendment, it was observed that legislation would be required to solve the problem of agency retention of funds, but this was not discussed further. We are concerned that this amendment will be put into effect before a strategy for reimbursing agency costs has been specified. Thus, we recommend that a trans-agency solution be sought immediately.
Remaining issues: The amendment states that the agency will need to provide data in a "reasonable time period," but there was little discussion of how that would happen. If data are not prepared for release until after they are cited in a regulation and a request is made, it is unlikely that they would be available and reanalyzed during the comment period associated with the development of a regulation. The FOIA process requires a response in ten days, a goal that would be unlikely to be met for data requests. If the agency does not meet that deadline, a requestor can bring legal action. Our concern is that unreasonable requests would be made of federal agencies and grantees as requestors attempt to obtain and reanalyze data within the time period for comment on a proposed regulation.
By basing the access to data on FOIA, the privacy protections apply only to individuals, not other entities that participate in research. Research at the NIH includes projects that use clinics, hospitals, schools, and other entities as the unit of study. It is not uncommon for such entities to want their privacy protected. Even when there is no potential for commercial harm (e.g., a public health sexually transmitted disease clinic), there are other legitimate reasons why entities wish to remain anonymous, including embarrassment or other reputational factors. Participation in research may be of great research value but little value to the individual organization; inability to provide protection to organizations will undoubtedly lower their participation in research. We recommend that the definition of research data be amended to exclude "unwarranted invasion of personal or organizational privacy".
The present "clarifying changes" do not address the problem created by the fact that many investigators are supported by funds from multiple sources. We view this as a very difficult issue since some projects involve funding from both Federal and non-Federal sources. In some cases, funding from non-Federal sources is important, providing access to data from pharmaceutical companies, state governments, private sources or foreign governments. In some cases, these funders would provide their own data to be merged with data collected with Federal support. We are concerned that by forcing uncontrolled access to data funded through other sources we would reduce the willingness of such groups to participate in NIH-funded research. It would be helpful to have clarification stating that the amendment would not apply to data that were not produced under the Federally supported grant, even if those data were used by the grantee. Such an exemption should also apply when NIH-funded analyses use non-Federally supported data to create new variables.
Despite our support for the constructive effort by the OMB in developing this regulation, serious concerns remain. These are generally rooted in the fact that the strategy for data sharing is based on the FOIA. FOIA was developed to provide public access to government records. FOIA does not provide the kinds of procedures or protections that are required for safe and effective access to research data. For example, FOIA places no restriction on who gets data, how they intend to use it, or to whom they may give the data. Access to research data typically requires that the recipient provides an assurance that they will use the data for research purposes, they will not try to identify or contact individual subjects, and they will not share or otherwise release the data to others. In the case of the Health and Retirement Study, there is a requirement that the user not merge the components of the HRS files with other files, such as driver's license or Equifax files. This requirement is in place to protect confidentiality of individuals. Informed consent documents need to be able to tell potential subjects what will be done with the information they provide. These boundaries are important, and yet they cannot be protected when data are shared through the FOIA.
In conclusion, I view the steps taken by OMB as constructive and have proposed several further modifications that we believe would greatly strengthen this rule. However, we remain convinced that basing access to research data on the Freedom of Information Act process is fundamentally flawed.
The following case study was written to illustrate some potential problems inherent in the current version of the NPRM to modify Circular A-110.
Case Study: A Population-Based Survey of Perimenopausal Women and HRT
Dr. Smith of Estron University was awarded an NIH grant to conduct a large, population-based survey of perimenopausal women residing in Rapid City to assess their knowledge, attitudes, beliefs, and use of hormone replacement therapy (HRT). Due to the substantive focus of this study, only women between 45 and 60 years old were eligible to participate.
The sampling frame consisted of all housing units on the tax rolls for Rapid City. Dr. Smith’s sampling statistician produced a final sample that consisted of housing units randomly selected from census tracts. The study design called for a sample size of 2500 women between 45 and 60 years of age. Using data from the 1990 census, the sampling statistician estimated that only 40% of households in Rapid City contain a female resident in the eligible age range. Thus, 6250 households would have to be screened to achieve the targeted sample size.
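The screening figure follows from dividing the target sample by the eligibility rate: 2500 / 0.40 = 6250. The short sketch below reproduces that arithmetic. The function name `households_to_screen` is hypothetical, and the calculation assumes, as the figure above implicitly does, that every eligible household yields exactly one completed interview (no refusals or non-contacts).

```python
def households_to_screen(target_sample: int, eligible_percent: int) -> int:
    """Households to screen so the expected number of eligible
    households reaches the target sample size.

    Assumption (not stated in the case study): each eligible household
    produces exactly one completed interview.
    """
    if not 0 < eligible_percent <= 100:
        raise ValueError("eligible_percent must be in the range 1..100")
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-target_sample * 100 // eligible_percent)


# Figures from the case study: 2500 women, 40% of households eligible.
needed = households_to_screen(2500, 40)
print(needed)  # 6250
```

Any allowance for nonresponse would raise the number of households to be screened above this figure.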
The interviewers were sent into the field to:
(1) Screen each household in the sample for eligible subjects;
(2) Select and recruit one subject from the eligibles in the household;
(3) Administer informed consent; and
(4) Administer the interview.
To screen households, interviewers were given paper and pencil forms that included the address from the sampling frame on the cover sheet along with specific questions to elicit information on the household composition. In rostering the household, the interviewer asked how many people lived in the household and then arrayed the residents by first name, age, and gender. The interviewer then arrayed the eligible women. If there was more than one, the interviewer used a selection algorithm provided by the sampling statistician to randomly select one eligible subject per household. Since eligible subjects might not be at home at the time of screening, interviewers also asked for full names, telephone numbers, and best times to contact the potential subject.
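The case study does not say which selection algorithm the sampling statistician supplied. As an illustration only, the sketch below assumes the simplest option, an equal-probability draw among the eligible women on the roster. The names `Resident`, `eligible_women`, and `select_subject` are hypothetical, the roster is invented, and the age range is treated as inclusive.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Resident:
    first_name: str
    age: int
    gender: str  # "F" or "M", as recorded on the screening form


def eligible_women(roster, min_age=45, max_age=60):
    """Return the rostered women inside the study's age range (inclusive)."""
    return [r for r in roster if r.gender == "F" and min_age <= r.age <= max_age]


def select_subject(roster, rng):
    """Pick one eligible woman with equal probability, or None if there is none.

    Assumption: a simple random draw stands in for the unspecified
    selection algorithm.
    """
    eligibles = eligible_women(roster)
    if not eligibles:
        return None
    return rng.choice(eligibles)


# Invented household roster for illustration.
roster = [
    Resident("Ann", 52, "F"),
    Resident("Bea", 47, "F"),
    Resident("Carl", 55, "M"),
    Resident("Dot", 71, "F"),
]
rng = random.Random(1999)  # fixed seed so the draw is reproducible
chosen = select_subject(roster, rng)
print(chosen.first_name)
```

Note that the roster itself, with names, ages, and genders for every resident, is exactly the kind of screening record that becomes sensitive later in the case study.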
Once an eligible subject was selected and recruited, the interviewer administered the informed consent form. There are ethical and legal guidelines that the investigator must follow in creating the informed consent document. The consent document must inform potential subjects about the procedures associated with participation, risks and benefits, confidentiality of records, and the voluntary nature of participation.
After obtaining consent, the interviewer administered the interview. Dr. Smith’s study used a new computer-based interview system to collect the interview data. Computer scientists at Estron University developed the software specifically for that project.
Approximately one year after the project was completed, Dr. Smith’s program officer at NIH received a FOIA request from Dr. Jones of Progon University for Dr. Smith’s data. The FDA was developing a regulation related to HRT, and the Smith study was cited.
When the FOIA office contacted Dr. Smith, he expressed concern about protecting the privacy of the participants. His informed consent form told women that the data they provided would be confidential and that the data would only be used for research purposes. The FOIA office told Dr. Smith that they would strip the data set of information that could potentially identify women. This process of redacting the data would protect individuals’ privacy.
Dr. Smith provided a data tape to the FOIA officer at NIH as requested. The FOIA officer was responsible for reviewing the data set and redacting it to ensure that individual participants could not be identified. However, the FOIA office couldn’t examine Dr. Smith’s data set because NIH did not have the software used in that study.
PROBLEM 1: Estron University considered the software to be proprietary and did not wish to share it.
After extensive efforts on the part of IT staff at Estron University and NIH, the data set was converted for use on commercially available software.
PROBLEM 2: The software translation and the redacted data set required a new code book with user instructions.
IT staff at Estron University and Dr. Smith wrote the new user manual, but it took them 4 weeks.
PROBLEM 3: Because the grant that funded Dr. Smith’s research had ended, Estron University wanted to know who would pay for this labor intensive, expensive effort. Estron University could include these costs in their bill to the requestor, Dr. Jones. However, they would not receive the funds until the data were delivered.
The data were delivered to Dr. Jones, and he conducted his analysis. Because the data set included information on age, number and age of children, and occupation, Dr. Jones was able to identify his sister, Frieda, the first and only female rabbi in Rapid City. Frieda was alarmed and concerned that her brother knew so much about her menopausal symptoms and their impact on her marital relationship.
PROBLEM 4: Frieda contacted Dr. Smith and complained about the breach in confidentiality of the data she provided for his study. The FOIA office at NIH was responsible for redacting the data to prevent inadvertent identification of research subjects. The redactor stripped names, addresses, and telephone numbers from the data set. Unfortunately, that person did not know that removing these direct identifiers alone did not afford adequate protection for subjects.
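One way a trained reviewer might have caught this is to check whether any combination of the remaining fields singles out one person. The sketch below is a minimal illustration using a k-anonymity style count, a technique swapped in here for explanation and not one named in this memo. The function `risky_records`, the field names, and the three records are all hypothetical.

```python
from collections import Counter

# Fields that remain after names, addresses, and telephone numbers are removed.
QUASI_IDENTIFIERS = ("age", "num_children", "occupation")


def risky_records(records, quasi_identifiers=QUASI_IDENTIFIERS, k=2):
    """Return records whose quasi-identifier combination is shared by
    fewer than k records, i.e. those most exposed to re-identification."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] < k]


# Invented "redacted" records: direct identifiers are already gone.
records = [
    {"age": 52, "num_children": 2, "occupation": "teacher"},
    {"age": 52, "num_children": 2, "occupation": "teacher"},
    {"age": 49, "num_children": 3, "occupation": "rabbi"},
]
flagged = risky_records(records)
print(flagged)  # only the record with the unique combination is flagged
```

A flagged record would call for further protection, such as coarsening occupation into broad categories, before release. Choosing appropriate fields and thresholds is the kind of judgment that requires the statistical training discussed under cost reimbursement.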
After completing preliminary analyses, Dr. Jones recontacted the FOIA office at NIH. He had questions about the representativeness of the sample and asked for the screening data. The FOIA office forwarded that request to Dr. Smith. Dr. Smith’s screening data were on paper and pencil forms. He did not have a data tape.
PROBLEM 5: Coding, keying, and cleaning the data from 6250 screening forms would take time and resources that neither Dr. Smith nor Estron University had. The FOIA office instructed Dr. Smith to copy the paper forms, blacking out the names of the women who participated in the study. Dr. Smith provided copies of the paper forms.
Dr. Jones shared the paper screening forms with his brother, Edgar, who lived with his sister Frieda and owned a funeral home. Edgar didn’t want to contact women who participated in the study. Rather, he used the screening forms to identify households with residents 70 years of age or older and used the information to market his funeral services.
PROBLEM 6: There was no data sharing agreement under FOIA. There was no prohibition to prevent Dr. Jones from sharing the screening data with his brother. However, if residents of Rapid City found out that participating in an NIH study could result in unwanted and unanticipated marketing activities, subsequent studies might have a very difficult time recruiting subjects.