Data Policy Excerpts

Data Management

1. Data sets should be inventoried

While labs likely have good knowledge of their current data sets, the features and locations of historical or seldom-used data sets may not be easily available. Written documentation on many of these data sets, current or not, may be lacking or hard to find.

A vital first step will be a thorough inventory of at least the current and recently used data sets, including data stored on desktop machines and servers as well as recently archived data sets. Ideally, some basic metadata should be recorded, chiefly that required for a data table of contents. It would also be most helpful if the data manager could consult with lab personnel about the types of data gathered and how they are handled. Since programs at SERC are dynamic, the inventory process will need to be an ongoing effort to keep the information current.
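
For illustration only, such an inventory could be kept in the central DBMS itself. The sketch below assumes hypothetical table and column names rather than any adopted standard:

    -- Hypothetical sketch of a minimal data set inventory table.
    -- Table and column names are illustrative, not a prescribed standard.
    CREATE TABLE dataset_inventory (
        dataset_id    INTEGER      NOT NULL PRIMARY KEY,
        title         VARCHAR(120) NOT NULL,  -- short descriptive name
        lab           VARCHAR(60)  NOT NULL,  -- originating lab
        investigator  VARCHAR(60),            -- principal contact
        location      VARCHAR(120),           -- desktop, server path, or archive tape
        date_range    VARCHAR(40),            -- period of collection
        file_format   VARCHAR(40),            -- e.g. SAS, spreadsheet, ASCII
        last_checked  DATE                    -- when this entry was last verified
    );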

2. Data should be available for centralized cross-platform access

The current mixture of Macintosh and Intel platforms (DOS, Win 3.1x, Win 95, Win NT) presents a complex geography for data management. When the many data applications that SERC investigators use are also taken into account, the landscape can be tortuous indeed. External requests for data are answered by time-consuming searches and file consolidation, and labs often must make one or more file conversions in order to share data.

A centralized database management system could alleviate some of these problems. Desktop clients running on any of the operating systems at SERC would access the same files residing on the central server. Common files and file types would allow seamless use of these files within and between labs. Security permissions would restrict viewing and editing access to appropriate users. A modern database management system could also import, and even search, some of the database formats indigenous to a particular lab, through tools such as ODBC (open database connectivity) and support of SQL (structured query language). Efficient handling of large databases, enhanced storage on the NT server, and CD-ROM-based data archiving will allow more databases to be accessible.
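
As a rough sketch (the table, column, and group names are hypothetical), the same SQL statements could be issued from any desktop client, since the central server, not the desktop, interprets them:

    -- Hypothetical example: the same query works from any client platform
    -- because the SQL is interpreted by the central server.
    SELECT station, sample_date, salinity
    FROM   water_quality
    WHERE  sample_date >= '1997-01-01';

    -- Security permissions restrict viewing and editing to appropriate users;
    -- here a lab group receives read-only access.
    GRANT SELECT ON water_quality TO nutrient_lab;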

3. Data should be available for inter-lab access

Data sharing between labs is hampered by the technical differences outlined in section 2 and by a lack of information about available data sets. A better view of other labs' data-gathering efforts may help refine data collection and interpretation at SERC. An example of this situation can be found in the intern projects. These short-term projects often need background data that another lab may hold, but the newly arrived intern typically does not know to whom to turn. Even when a request for data is made, locating, summarizing, and reformatting the data may take so long that it arrives too late to be helpful.

Future data management should address this challenge on two fronts. The first hurdle, that of actual access to data, could be addressed by a modern database management system; equipped with a graphical user interface, such a system can be mastered more quickly than current tools. Common file formats between labs will make access more transparent. The second hurdle, that of knowing what types of data are being collected by other labs, must be addressed by an enhanced effort by the labs and SERC's data manager to publicize these holdings. This could be done by means of a data catalog on SERC's intraweb, organized by researcher, lab, or subject keyword; in the future it could also be searchable. Such a catalog would also smooth staff transitions by providing common documentation for newly arrived staff or for personnel moving from one lab to another.
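
For illustration, and reusing the hypothetical inventory table sketched under point 1 (with an equally hypothetical keyword table alongside it), a catalog organized this way could be queried by subject keyword along these lines:

    -- Hypothetical sketch: find all catalog entries matching
    -- a subject keyword.
    SELECT d.title, d.lab, d.investigator
    FROM   dataset_inventory d, dataset_keywords k
    WHERE  k.dataset_id = d.dataset_id
    AND    k.keyword    = 'nutrients';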

4. SERC's data should be catalogued

An easily accessed catalog of SERC's data holdings would help increase awareness of the variety and depth of research done here.

An extensive data catalog for internal use, as well as an edited version for external use, could be posted on the intra- and external webs respectively. The Long Term Ecological Research (LTER) sites have proposed a standard for their "Data Table of Contents" to which SERC could adhere; details are available at http://www.vcrlter.virginia.edu/nis/dtoc_form.html, and the format of this proposed standard appears in the appendix. For free-text searches of their catalog (http://lternet.lternet.edu/DTOC/), the LTER sites use WebGlimpse (http://donkey.CS.Arizona.EDU/webglimpse/), a free search engine. We may be able to take advantage of WebGlimpse in the future when it is ported from UNIX to the NT platform. John Porter, the Virginia Coast Reserve LTER data manager, has informed me that they chose this route for simplicity for users and to keep the data descriptions local for ease of editing.

Global and national access to SERC's metadata could be available through federally sponsored data clearinghouses (details in the metadata section) if we adopt one of their standards. After metadata forms were checked for accuracy and completeness, they could be submitted electronically to these clearinghouses. The advantages of this approach would be the wide exposure gained and freedom from the network and storage burdens that this kind of public access would otherwise require. A disadvantage would be the difficulty of editing metadata on an external server; on-line applications will assist in this job, and perhaps it could be done by the data manager on a weekly or monthly basis. The LTER sites use the Global Change Master Directory; however, Mr. Porter states that "the technological interface for GCMD etc. was a bit formidable for our user community" but that future needs still make this a promising avenue.

5. Data should be summarized on a regular basis

Automated instrument recording in the field has led to great increases in analytical power, as more parameters can be studied in greater detail. Increasing numbers of bench-top analytical systems have resulted in a similar leap in detailed chemical analysis. The result of these trends has been increased amounts of data, gathered at ever faster tempos. It can be difficult to digest this information without summarizing it in tabular or graphical form. Further, some uses of the data call for summaries at a coarser granularity (hourly rather than per-second values, for example). There are several hurdles to summarizing data. Data is scattered across different machines and in different formats. Current tools such as SAS can be too cumbersome to use for ad hoc summaries. Finally, the need to conserve hard disk space means that large data sets are archived onto tape and can be cumbersome to access.

Future data management will ease the process of summarizing data. Centralized data access can allow different users to summarize the same data in their preferred ways. Modern database managers include data summary (or report) templates which can be easily modified through a graphical interface that allows easy selection of the displayed parameters and levels of summary; a variety of tabular and graphical choices are available. Improved large data set handling, coupled with the increased hard disk capacity of the future NT server, will allow larger data sets to remain on-line, and the expected storage of other data sets on CD-ROM will further aid access. Finally, central data access will allow the data manager to run scheduled data summaries, freeing labs from some of the burden.
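
As a minimal sketch, assuming a hypothetical table of per-second instrument readings with the date and hour stored as columns, such a scheduled summary might collapse the readings to hourly means:

    -- Hypothetical example: collapse per-second instrument readings
    -- to hourly averages; table and column names are illustrative only.
    SELECT   station, reading_date, reading_hour,
             AVG(water_temp) AS mean_temp,
             COUNT(*)        AS n_readings
    FROM     sensor_readings
    GROUP BY station, reading_date, reading_hour
    ORDER BY station, reading_date, reading_hour;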

6. Data management activities should be coordinated

Data management frequently is not coordinated among labs or between the labs and the computer office, resulting in some duplication of effort. Further, it is increasingly important that knowledge of data assets and operations be distributed among and within labs.

Recently there have been exchanges in this area, and good ideas have been shared. As data management matures at SERC, continued two-way sharing of expertise will yield rewards. While many labs will continue to use their own data management methods and applications, the computer office will help those who wish to migrate wholly or in part to a centralized DBMS. Labs can control file creation and user security as much as they wish; all management activities can be done centrally by the computer office or selectively learned by a lab's local administrator.

Resources can be shared among labs. In addition to databases, queries (pre-compiled for speed, if desired) and reports can also be placed in central repositories. Various levels of security access can be applied to these objects to protect data integrity.
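
As one illustration (the object and group names are hypothetical), a shared query could be published as a pre-compiled view, with read-only access granted to another lab while the underlying table remains protected:

    -- Hypothetical sketch: publish a summary as a view and grant
    -- another lab read-only access, protecting the base table.
    CREATE VIEW monthly_nutrient_means AS
        SELECT   station, sample_month, AVG(nitrate) AS mean_nitrate
        FROM     nutrient_samples
        GROUP BY station, sample_month;

    GRANT SELECT ON monthly_nutrient_means TO plant_ecology_lab;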