The MCDU Project Team’s first challenge was to prepare the 56 million MARC records for analysis. This page has links to documents that provide information about the decomposition of the MARC records, the design of a MySQL database to hold the decomposed records, the database loading, and the validation of the parsing software that decomposed the records. It also includes a document that details the procedures for creating different subsets of the records for analysis by format. The final document presents the analytical questions that guided our analysis.
Decomposing the Records, Database Structure, Database Loading, Validation Procedures
- MCDU Project MARC Records Dataset: Decomposition Specification, Database Design, and Parser Software
This document provides information about the MARC dataset, the specifications for decomposing the MARC records, the design of the database that holds the decomposed records, and the parsing software designed to decompose the records and load the data into the database. Date Posted: July 15, 2005.
- Validation Procedures for MARC Record Parsing Software
This document describes the procedures for testing the parsing scripts used to decompose the MARC records. A sample of raw MARC records from the dataset and the resulting parsed records are subjected to the validation procedures detailed in the document to verify the integrity of the software and to confirm that the data from the MARC records are correctly represented before loading into the database. Date Posted: July 15, 2005.
The MCDU MARC Parser and Database Loader
- The MCDU Project Team developed a custom application in the PHP scripting language to decompose the MARC records and load the data into the MySQL database. The parsing software reads MARC records from an ASCII text file, parses the records in memory according to the specifications described in the first document listed above, and inserts the records into the MySQL database. Briefly, the software works as follows: a user selects the source file containing MARC records, the number of records to skip at the beginning of the file, the number of records to process in a group, and the total number of records to process from the file. The user then selects the output destination, either Browser or Database. Choosing Browser displays the parsing results in the browser user interface; browser output is used to validate the parsing function. Choosing Database launches the database loader, which moves the parsed data into the MySQL database. When the user clicks the Start Processing button, the parser runs, and the results are output to the browser or inserted into the database. For this demonstration, writing data to the database has been disabled. Click the link below to see a demonstration of the parser tool. Date Posted: July 15, 2005.
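The decomposition step the parser performs can be illustrated with a minimal sketch. The project's actual parser was written in PHP against its own specifications; the Python function below is only an assumption-laden illustration of how a raw MARC (ISO 2709) record breaks down into a leader, a directory, and tagged fields with indicators and subfields.

```python
# Minimal sketch of MARC (ISO 2709) record decomposition, analogous to the
# parsing step described above. The project's parser was written in PHP;
# this Python version illustrates the record structure only.

FIELD_TERM = "\x1e"      # terminates each variable field
SUBFIELD_DELIM = "\x1f"  # precedes each subfield code

def parse_marc_record(raw):
    """Decompose one raw MARC record into (leader, list of fields)."""
    leader = raw[:24]
    base_address = int(leader[12:17])        # where the data portion begins
    directory = raw[24:base_address - 1]     # 12-byte entries, field terminator dropped
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[:3]
        length = int(entry[3:7])
        start = int(entry[7:12])
        data = raw[base_address + start : base_address + start + length]
        data = data.rstrip(FIELD_TERM)
        if tag < "010":                      # control fields: no indicators or subfields
            fields.append((tag, data))
        else:
            indicators, _, rest = data.partition(SUBFIELD_DELIM)
            subfields = [(s[0], s[1:]) for s in rest.split(SUBFIELD_DELIM) if s]
            fields.append((tag, indicators, subfields))
    return leader, fields
```

In a full loader, each decomposed field would then be written to the database rather than returned in memory.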
Preparation of the Databases Containing MARC Records for Analyses
- Format Content Designation Analysis: Set Definition and Extraction Queries
This document contains the procedures, in the form of structured query language (SQL) queries, designed to create the 20 format-specific MCDU project databases containing approximately 56 million MARC records from the OCLC WorldCat database. Natural language queries were developed and translated into SQL. Test queries were initially run against samples of the data and analyzed to ensure that the queries were properly formed and produced the expected results.
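As a rough illustration of what a format-set extraction query might look like, the sketch below builds a "Books" subset from record-level format codes. This is a hypothetical example using SQLite for portability; the project used MySQL, and the table and column names here are invented, not taken from the project's schema.

```python
import sqlite3

# Hypothetical illustration of a format-set extraction query. The project
# used MySQL with its own schema; the table/column names here are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (oclc_number INTEGER, rec_type TEXT, bib_level TEXT)"
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        (1, "a", "m"),  # language material, monograph -> Books set
        (2, "a", "s"),  # language material, serial    -> a different set
        (3, "e", "m"),  # cartographic material        -> a different set
    ],
)
# Define the Books subset by format codes from the MARC Leader:
# Leader/06 (type of record) = 'a' and Leader/07 (bibliographic level) = 'm'.
conn.execute(
    "CREATE TABLE books AS "
    "SELECT * FROM records WHERE rec_type = 'a' AND bib_level = 'm'"
)
book_count = conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]
```

Each of the 20 format-specific sets would be defined by an analogous `WHERE` clause over the relevant format codes.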
- Format Content Designation Analysis: Set Profiling and Analysis Queries
This document describes the questions we are asking of the data to address the project research questions. Questions are transformed into SQL queries that result in reports produced by the MySQL database management system. The basic questions asked of the sets are similar, and the organization of the basic analysis queries for the MCDU record sets is detailed in this document. The queries are organized into three categories: “General Profile Queries”, “Frequency Counts Queries”, and “Second Level Analysis Queries”.
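A query in the "Frequency Counts" spirit might tally how often each MARC tag occurs across a record set, and in how many records. The sketch below is a hypothetical example using SQLite; the schema and names are invented and do not come from the project's documents.

```python
import sqlite3

# Hypothetical sketch of a "Frequency Counts" style query: tally how often
# each MARC tag occurs across a record set. Schema and names are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fields (record_id INTEGER, tag TEXT, subfield TEXT)")
conn.executemany(
    "INSERT INTO fields VALUES (?, ?, ?)",
    [(1, "245", "a"), (1, "650", "a"), (2, "245", "a"), (2, "245", "b")],
)
# For each tag: total occurrences and number of distinct records using it.
rows = conn.execute(
    "SELECT tag, COUNT(*) AS occurrences, COUNT(DISTINCT record_id) AS records "
    "FROM fields GROUP BY tag ORDER BY occurrences DESC"
).fetchall()
# rows -> [("245", 3, 2), ("650", 1, 1)]
```

"Second Level Analysis" queries would drill further, for example grouping by tag and subfield together.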