The BIMA Image Pipeline and Implications
for a Common Pipeline Architecture

Raymond Plante, Doug Roberts, and Dave Mehringer
National Center for Supercomputing Applications

January 13, 2000


Abstract:

In this document, we describe the design of the BIMA Image Pipeline, a prototype for a general architecture that could be applied to other radio interferometer telescopes. We outline our goals and design principles for the pipeline, and we summarize the major components that need to be developed in order to work with the existing BIMA telescope and archive systems. In particular, we highlight the architectural feature that separates information management tasks that drive the pipeline from the actual processing tasks. Not only does this minimize the impact on the development of AIPS++, it allows us to freely take advantage of other technologies, particularly XML. Finally, we examine the prospects for a common pipeline architecture that might be applied to other radio interferometer telescopes and opportunities for collaboration with other observatories.

1. Introduction

A major goal of the Radio Astronomy Imaging Group at NCSA is to develop high-performance computing solutions to the problem of increasing data production rates from radio interferometers. To this end, we developed the BIMA Data Archive, which allows astronomers to access data from the BIMA telescope via the Web. An important feature of the archive is real-time transfer of data from the telescope in Northern California to the archive at NCSA in Illinois. This has enabled near-real-time processing of data using NCSA supercomputers. The development of a flexible imaging pipeline working with the telescope and archive will provide a foundation that can make real-time radio astronomy more practical for the typical radio astronomer.

2. Design Principles

2.1 Goals

The BIP design outlined here aims to meet the following goals.

1.
To automatically produce calibrated uv-data and deconvolved images that can be used directly for research by BIMA astronomers.

Phase 1: For 80% of the BIMA observing projects,

Phase 2:

2.
To support a high degree of interaction with the astronomer.

Phase 2: Real-time monitoring; observer-provided processing parameters

Phase 3: Sophisticated Portal Interface

3.
To support ``big-processing'' archival research by allowing users to initiate processing jobs on NCSA supercomputers using data from the archive (Phase 3).

4.
To provide a framework for enabling real-time observing.

5.
To prototype a generalized pipeline architecture that can be applied to other radio interferometer telescopes.


2.2 Constraints

The BIP is being built as a prototype pipeline that augments a currently operating telescope. This imposes a major constraint on our design in that it must adapt to the existing BIMA telescope system. It must place minimal requirements on the system so that it can continue to operate normally as the pipeline is developed. In particular, the pipeline cannot require changes to the format and overall model for writing raw data to disk (e.g. the pipeline must accept raw data in MIRIAD format). This constraint is easy to meet, since the existing BIMA archive system already supports most of the necessary communication with the telescope system.

Related to this constraint is one that requires that the archive continue to support its current services to BIMA users. In particular, the archive must continue to provide raw data to users in MIRIAD format so that they can use MIRIAD to reduce the data themselves from scratch. However, processed data from the pipeline need not meet this constraint. Given that processing will be done using AIPS++, it is clear that the pipeline system must be able to handle multiple formats.

Finally, the pipeline system must be able to operate within the facilities of NCSA. This imposes constraints of both hardware and policy, although we have some ability to affect the latter if necessary.


2.3 Architectural Principle

A guiding principle that we have adopted for our design is to separate the tasks of the pipeline into two groups:

1.
radio astronomy processing and analysis. These tasks fall perfectly within the ability of AIPS++ and for the most part have the same requirements as for interactive processing. Glish will allow us to create the scripts necessary for automated processing from templates or recipes.

2.
information management. These tasks move information from users and the telescope into Glish scripts. Most of this information takes the form of metadata.

In particular, we have chosen not to use AIPS++ in the information management component apart from creating Glish scripts. This separation has some compelling advantages. First, this principle can minimize (or perhaps entirely eliminate) the demands on AIPS++ that are different from those of interactive users. Second, we can freely apply other technologies (e.g. XML, DBMS, Globus) that are better optimized for information management. Furthermore, we can keep the information needed only by the pipeline out of the scientific data files. Not only does this eliminate the problems associated with polluted metadata schemas (e.g. having to create application-specific FITS keywords), the pipeline metadata and the datasets they describe can evolve separately without changes in one affecting the other.

3. BIP Design

3.1 Overview

The Pipeline will be developed in four phases, with each phase expanding its capabilities and encompassing more of the goals described above. Figure 1 shows a schematic of the Pipeline in its Phase 2 form. (For a more detailed description of the Pipeline in its various phases, see Roberts et al. 1999.)

Figure 1: Pipeline Schematic. The arrows indicate the flow of data and metadata.

Prior to observations, the astronomer can specify the desired processing parameters via the Processing Director's Web interface. When the data are gathered by the telescope, they are transferred in real time to the Archive at NCSA, where they are made available in their raw form. Metadata are also forwarded to the Metaqueue's Configuration Generator, which creates the job control and Glish processing scripts needed to calibrate the data and create images. A processing request is then sent from the Configuration Generator to the Metaqueue Manager, which retrieves the necessary input data from the archive and sends the data and the processing scripts to one or more compute engines for processing. As processing completes, the resulting products are sent back to the archive for access by the astronomer.

3.2 Major Components

Below, we highlight the major components of the Pipeline, focusing on the elements now under development.

3.2.1 Archive System

The Pipeline for the most part will build on the existing BIMA Archive System; nevertheless, the Archive requires some improvements. In its present state, the Archive is set up only for raw telescope data. In order to handle a wider variety of datasets (calibration solutions, calibrated data, images, etc.), the archive is being upgraded to employ a more flexible data model.

The mapping of data objects into the data model is controlled by metadata. A major advance for the archive will be the use of XML to describe datasets in the new data model. One advantage of XML is the ability to present the metadata in different ways depending on the context; this is accomplished through the XSLT (Extensible Stylesheet Language Transformations) standard, which allows the HTML presentation to be configured in an ASCII document rather than hard-coded into CGI programs. More importantly, XML provides a structured, machine-readable format that can be used to pass information through the Image Pipeline. An extensive body of third-party software already exists for creating, parsing, and displaying XML data; thus, it is easy to develop XML applications rapidly, to maintain them, and to adapt them to changes in the system. Furthermore, raw XML can be made very human-readable; thus, XML applications can be easier for other developers to understand.

The goal of the new data model is to make it easier for users to understand the relationships between the datasets. We have created an XML DTD (Document Type Definition) that organizes datasets into three hierarchical groupings: project, experiment (mapping to an observing script), and trial (mapping to either an observed track or processing request). Users see this organization when searching and browsing via a project-oriented view of the data. This will allow them to see clearly what tracks have been observed and what processed data are available for each project in the archive. The new data model also provides for new metadata that will be needed by the pipeline. This includes such items as data quality information, the roles of the datasets within the project, and processing parameters.
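As a rough illustration of the three-level grouping, the following Python sketch builds a small XML description and reads it back. The element and attribute names (`project`, `experiment`, `trial`, `dataset`, `script`, `kind`, `role`) and all sample values are assumptions made for this example; they are not taken from the actual BIMA DTD.

```python
import xml.etree.ElementTree as ET


def build_project(name, experiments):
    """Build a <project> element.

    experiments maps an observing-script name to a list of trials, where
    each trial is (kind, [(dataset_name, role), ...]). All names here are
    hypothetical and chosen only for illustration.
    """
    project = ET.Element("project", {"name": name})
    for script, trials in experiments.items():
        exp = ET.SubElement(project, "experiment", {"script": script})
        for kind, datasets in trials:
            trial = ET.SubElement(exp, "trial", {"kind": kind})
            for ds_name, role in datasets:
                ET.SubElement(trial, "dataset", {"name": ds_name, "role": role})
    return project


def list_trials(project):
    """Return (script, trial kind, number of datasets) for every trial."""
    return [
        (exp.get("script"), trial.get("kind"), len(trial.findall("dataset")))
        for exp in project.findall("experiment")
        for trial in exp.findall("trial")
    ]


example = build_project(
    "demo-project",
    {
        "demo.obs": [
            # a trial can be an observed track ...
            ("track", [("target.uv", "target"),
                       ("phasecal.uv", "phase-calibration")]),
            # ... or a processing request
            ("processing", [("target.image", "image")]),
        ]
    },
)

# Serialize to text and parse it again, as a consumer of the archive would.
xml_text = ET.tostring(example, encoding="unicode")
reparsed = ET.fromstring(xml_text)
```

A project-oriented browser could walk the same tree with `list_trials(reparsed)` to show which tracks and processed products exist for a project.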

3.2.2 Metaqueue

The general role of the Metaqueue is to create and manage processing jobs from incoming requests. Such requests can be sent to the Metaqueue either automatically from the archive as the data arrive from the telescope or interactively by a remote user.

3.2.2.1 Configuration Generator

The purpose of this component is to analyze the incoming processing request and create the necessary Glish scripts to carry it out.

The request will come in the form of an XML document containing parameters that describe input datasets, what and how processing tasks should be carried out, and what to do with the output datasets. (This document could potentially contain embedded Glish scripts; however in initial versions of the pipeline, the request will only contain parameters.) Once the Processing Director is made part of the pipeline, many of these parameters will come from the astronomer prior to observation. Other parameters, generated automatically by the archive, provide a profile of the observations as a whole; in particular, they describe the different input datasets and the roles they play in the observations (e.g. target sources, phase calibration, passband calibration, etc.).
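To make the shape of such a request more concrete, the sketch below parses a small, made-up XML request in Python and groups the input datasets by role. The document layout (`request`, `input`, `param`, `output` and their attributes) is an assumption for illustration only; the real request format is defined by the pipeline, not by this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical request document; element and attribute names are assumed.
REQUEST_XML = """
<request project="demo-project">
  <input name="target.uv" role="target"/>
  <input name="phasecal.uv" role="phase-calibration"/>
  <input name="passband.uv" role="passband-calibration"/>
  <param name="cellsize" value="1.0"/>
  <output destination="archive"/>
</request>
"""


def parse_request(xml_text):
    """Extract input datasets (grouped by role), parameters, and destination."""
    root = ET.fromstring(xml_text.strip())

    inputs = defaultdict(list)
    for inp in root.findall("input"):
        inputs[inp.get("role")].append(inp.get("name"))

    # Parameter values are kept as strings; interpretation is left to the
    # component that consumes them.
    params = {p.get("name"): p.get("value") for p in root.findall("param")}

    output = root.find("output")
    destination = output.get("destination") if output is not None else None

    return {
        "project": root.get("project"),
        "inputs": dict(inputs),
        "params": params,
        "destination": destination,
    }


request = parse_request(REQUEST_XML)
```

The resulting dictionary separates the observation profile (which datasets play which roles) from the processing parameters, mirroring the two sources of information described above.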

The Configuration Generator will compare the profile against profiles it recognizes. Based on this matching, recipes will be extracted from a recipe database and assembled into Glish processing scripts. During the first phase of development, we expect our ``database'' to contain essentially one recipe that will be used to create images for the most common type of BIMA observing project: a single-field spectral line observation. When the Glish scripts are complete, they are sent on to the Metaqueue Manager for execution.
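The matching step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the profile keys (`mode`, `fields`), the recipe structure, and the template text are all hypothetical, and the template body is placeholder comment text rather than real Glish or AIPS++ calls.

```python
from string import Template

# A one-entry "recipe database", as expected for the first phase.
# Keys and values are assumptions made for this sketch.
RECIPES = [
    {
        "name": "single-field-spectral-line",
        "matches": {"mode": "spectral-line", "fields": 1},
        "template": Template(
            "# placeholder script text, not real Glish\n"
            "# calibrate $target with $phasecal\n"
            "# image $target\n"
        ),
    },
]


def find_recipe(profile, recipes=RECIPES):
    """Return the first recipe whose required features all match the profile."""
    for recipe in recipes:
        if all(profile.get(key) == value
               for key, value in recipe["matches"].items()):
            return recipe
    return None


def generate_script(profile, datasets):
    """Fill the matching recipe's template with dataset names."""
    recipe = find_recipe(profile)
    if recipe is None:
        raise LookupError("no recipe matches this profile")
    # substitute() raises KeyError if a placeholder has no value, so an
    # incomplete request is caught before a script is submitted.
    return recipe["template"].substitute(datasets)


script = generate_script(
    {"mode": "spectral-line", "fields": 1},
    {"target": "target.uv", "phasecal": "phasecal.uv"},
)
```

Keeping recipes as data rather than code means that supporting a new observing mode amounts to adding an entry to the list, which is the property the recipe database is meant to provide.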

3.2.2.2 Metaqueue Manager

This component submits Glish scripts for execution and monitors their progress. This component is non-trivial in part because the processing may not occur on a single machine. Parts of the processing are expected to be largely serial in nature and others, highly parallel; it may be important or necessary to run the two portions on different machines or in different batch queues on the same machine. Furthermore, the actual execution will be highly dependent on the machine used. For example, the SGI Origin supercomputers at NCSA feature a number of different queues for different job sizes, with each having its own restrictions on use. The early version of the manager will be fairly specialized for the NCSA queues.

Our initial manager will eventually be replaced with a more general one that uses XML documents to encode the state of the queue. This will make it very straightforward to support Web-based monitoring. It will also make it easy to recover from system failures: the XML queue state would just be reloaded and jobs resubmitted.
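The recovery idea can be illustrated with a short Python sketch that writes the queue state to XML, reloads it, and selects the jobs that would need to be resubmitted. The `queue` and `job` element names, the attribute names, and the status values are assumptions for this example, not a description of the planned manager.

```python
import xml.etree.ElementTree as ET


def save_state(jobs):
    """Serialize a list of job dictionaries to an XML string."""
    root = ET.Element("queue")
    for job in jobs:
        ET.SubElement(root, "job", {
            "id": job["id"],
            "script": job["script"],
            "status": job["status"],
        })
    return ET.tostring(root, encoding="unicode")


def load_state(xml_text):
    """Rebuild the list of job dictionaries from the XML string."""
    return [dict(job.attrib) for job in ET.fromstring(xml_text).findall("job")]


def jobs_to_resubmit(jobs):
    """After a failure, anything not finished is a candidate for resubmission."""
    return [job for job in jobs if job["status"] in ("queued", "running")]


# Hypothetical queue contents.
jobs = [
    {"id": "1", "script": "calibrate.g", "status": "done"},
    {"id": "2", "script": "image.g", "status": "running"},
    {"id": "3", "script": "image2.g", "status": "queued"},
]

restored = load_state(save_state(jobs))
pending = jobs_to_resubmit(restored)
```

The same XML text could also be handed to a stylesheet for Web-based monitoring, so one representation would serve both purposes.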

3.2.3 Compute Engine

The compute engine is essentially the AIPS++ system running on parallel (and possibly serial) computers. Since input to the engine will be data and Glish scripts, little or no pipeline-specific software will need to be developed.

Development of the compute engine will focus primarily on two goals: (1) developing generalized recipes for processing BIMA data, and (2) developing algorithms for measuring the quality of the data. For the latter item, we will need to assess not only the quality of the raw data but also the processed data. This information may be used as feedback into the processing. For example, dirty images could be deconvolved simultaneously using different techniques; which result is ultimately used could be decided based on measurements of the resulting fidelity of each technique. We plan to carry out both types of development using Glish. If necessary, portions of the analysis could be converted into C++ objects.
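The selection step in that example might look like the following Python sketch. The quality measure used here, the image peak divided by the RMS of the residual, is a simple stand-in chosen for illustration; it is an assumption of this sketch, not the fidelity measure the pipeline will adopt. The technique names and pixel values are likewise made up.

```python
import math


def rms(values):
    """Root-mean-square of a sequence of pixel values."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def dynamic_range(image, residual):
    """Stand-in quality score: image peak over residual RMS."""
    return max(image) / rms(residual)


def pick_best(results):
    """Choose the technique with the highest score.

    results maps a technique name to (image_pixels, residual_pixels).
    """
    return max(results, key=lambda name: dynamic_range(*results[name]))


# Two hypothetical deconvolutions of the same dirty image.
results = {
    "method_a": ([0.0, 1.0, 0.2], [0.1, -0.1, 0.1]),
    "method_b": ([0.0, 1.0, 0.2], [0.01, -0.01, 0.01]),
}

best = pick_best(results)
```

Because the score is an ordinary function, a different quality measure can be swapped in without changing the selection logic.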

3.3 User Interaction with the Pipeline

Our pipeline development plan outlines the development of a user interface that will evolve into a centralized, integrated, and network-based environment or portal. This development will occur in four phases:

Phase 1. In the first incarnation of the pipeline, users will interact only through the archive. The updates being made to the archive in this phase will provide the user with a project-oriented view of the data that will make it easy to see what raw and processed data are available for each project.

Phase 2. The Processing Director will be added in this phase. This will be implemented as a Web-based interface that allows users to upload their observing scripts to the telescope (which they do currently via FTP) and then augment them with metadata that will control the processing.

Also added in this phase will be a Metaqueue Monitor. This web-based interface will allow users to monitor the state of processing jobs in the pipeline.

Phase 3. In this phase, archival-based research will be supported. Through the archive, users will be able to initiate their own processing jobs using data from the archive. This will be enabled primarily through an interface to the Configuration Generator. Supporting some interaction directly with the Compute Engine may be useful as well.

Phase 4. In this phase, the pipeline access points will be gathered together in the portal interface that will most likely be Web-based. In this scenario, users will be able to access a document that, through multiple frames, allows them to monitor the operation of the telescope in real-time, search and browse the archive, and interact with processing within the pipeline.

New features that will be explored in this phase include web-based visualization and real-time interaction with processing as observations are being made.


4. Implications for AIPS++ and the Telescope System

Our current design is expected to have little impact on the way the BIMA Telescope System and AIPS++ evolve and operate. Consider a more generalized view of the software systems within our architecture, as shown in Figure 2. As described in §2.3, we have purposely separated the information management and data processing into different components. The telescope system was also separated out because we were constrained to do so, as described in §2.2. These separations can reduce the components' interdependencies.

Figure 2: A Generalized Architecture. Each square represents a major software component.

Nevertheless, there are some implications for the lower two systems that are worth examining.

4.1 AIPS++

We see this project affecting AIPS++ only inasmuch as it may require accelerating the development of AIPS++ features that are already within its current scope and that would also be needed by interactive users. These fall into three categories:

1.
updating bimafiller to properly support the structure of a raw BIMA dataset as an AIPS++ measurement set; in particular, this includes:

2.
implementing missing features required for calibration and analysis of BIMA data. Based on a study of AIPS++ in late 1999, these include:

3.
optimizing compute-intensive AIPS++ tasks for parallel machines.

Our development could feed back into AIPS++ in the form of generalized, high-level Glish scripts. In particular, algorithms for assessing data quality may be useful for interactive users.

Finally, we note that in a previous pipeline design, we suggested that Glish's distributed computing capabilities be extended to support standards for user authentication and authorization. That design had a more aggressive focus on real-time observing within a high-performance grid environment. Glish's support for distributed computing looked useful but underdeveloped for use within a wide-area grid. The high cost of developing Glish for this purpose was a major disadvantage of that design.

4.2 The Telescope System

Under our ``prototyping'' approach, the pipeline has been designed to take advantage of the specific framework of the BIMA telescope system and the existing archive. Thus, most of the requirements that our design imposes on the telescope system are already satisfied. Nevertheless, it is useful to enumerate some of these requirements when considering how the design might be applied to a general telescope.

1.
The telescope system must write data in such a way that it can be uniquely identified as to its membership in a project.

With BIMA, the system writes its data into a directory structure that identifies the project, the observing script used, and the date.

2.
The telescope system must support a mechanism for signaling the archive system when observing for a particular project is finished. For support of real-time archiving, the telescope must also signal when the observing starts.

With BIMA, the observing system executes a program at the beginning and end of each observing track, providing the directory in which the data are written. (The directory path tells which project the data belong to.)

3.
The telescope system must provide a way to pass metadata associated with a project from the astronomer, through the observing system, and to the archive.

With BIMA, we will support the insertion of metadata into comments in the observing scripts, which are currently written by the astronomer and archived automatically.

4.
Our data model requires that the archive have access to information normally associated with the proposal. Currently, this includes project title, author list, and contact email addresses.

5.
There must be a way to determine the role of each of the datasets created for an observing track. That is, the pipeline needs to know which datasets contain target data, phase calibration data, passband calibration data, etc.

6.
For real-time observing, the bulk of the data must be written in a simple sequential format so that data files can be transferred to the archive as they are being written. It must then be possible to calibrate and image the transferred portion.

The BIMA Telescope System writes data in MIRIAD format, which stores the random-access information (such as the dataset header) and the sequential-access data (the uv measurements along with other system data that vary with time) in separate files. The header portion is small and so can be copied in its entirety every time it is updated. The visibility data, which grow to be very large, can be transferred one record at a time as they are written to disk by the telescope.
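The two transfer strategies in the last requirement can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it copies whatever bytes have been appended since the last call rather than tracking record boundaries, it works on local files rather than over a network, and the file names in the demonstration are placeholders.

```python
import os
import shutil
import tempfile


def sync_header(src, dst):
    """The header is small, so copy it whole each time it changes."""
    shutil.copyfile(src, dst)


def sync_appended(src, dst):
    """Append to dst the bytes of src that lie beyond dst's current length.

    Returns the number of bytes copied. Assumes src only ever grows.
    """
    offset = os.path.getsize(dst) if os.path.exists(dst) else 0
    with open(src, "rb") as fin, open(dst, "ab") as fout:
        fin.seek(offset)
        chunk = fin.read()
        fout.write(chunk)
    return len(chunk)


def demo():
    """Simulate a growing data file being mirrored in two passes."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "data")
        dst = os.path.join(tmp, "data.copy")

        with open(src, "wb") as f:
            f.write(b"rec1")
        first = sync_appended(src, dst)      # copies the first record

        with open(src, "ab") as f:
            f.write(b"rec2rec3")
        second = sync_appended(src, dst)     # copies only the new bytes

        with open(dst, "rb") as f:
            copied = f.read()
    return first, second, copied
```

A production version would also need to avoid forwarding a record that is only partly written; this sketch does not address that.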

5. A Common Pipeline Architecture

A major motivation for developing the BIMA Image Pipeline is to prototype an architecture that can be applied to future radio interferometer telescopes. As part of a collaboration with the Millimeter Array Development Consortium (MDC), we plan to generalize our design after development of our prototype. With other current radio observatories developing pipeline systems, it is worthwhile to consider developing a common architecture now through some collective effort.

5.1 Why Develop a Common Architecture?

The most obvious reason for developing a common architecture for radio astronomy image pipelines is to reduce duplicated efforts. Common APIs allow components to be exchanged as new technologies become available.

Another possible motivation concerns the effects that pipeline development at different observatories might have on the AIPS++ project. That is, different observatories could develop AIPS++ tools based on different pipeline architectures. This might introduce some fragmentation into the AIPS++ system, which would require an effort from the AIPS++ project itself to reconcile the various tools. However, as described in §4, we advocate an architecture that separates the pipeline-specific components from the pure-processing ones, which minimizes the impact to the AIPS++ system.

In the absence of other reasons, a common architecture would be useful only inasmuch as the pipeline components can be made general purpose and still support the specific needs of a telescope.

5.2 Opportunities for Collaboration on Common Components



rai@ncsa.uiuc.edu
Last modified: 2000-01-13