OC Internals Documentation

Draft: 01/12/2009
Last Revised: 03/14/2012
OC Version 1.1

Introduction

This document is an ongoing effort to describe the internal operation of oc.

Parsers

DDS/DAS Parser: dap.y

The dap.y parser parses DDSs, DATADDSs, and DASs. The supported syntax is essentially the same as the Ocapi parser, but the actions are different. Take a production of this form, for example.
nonterm: nonterm1 nonterm2 nonterm3 ;
The corresponding action calls an external procedure named for the left hand side and taking the values of the right side non-terminals as arguments.
{$$=nonterm(parsestate,$1,$2,$3);}
Note that this form of parsing action was requested by John Caron so that the same .y file could be used for C and Java parsers. In line with this, all non-terminals are defined to return a type of "Object", which is "void*" for C parsers and "Object" for Java parsers. The cost is the use of a lot of casting in the action procedures.

Note the extra "parsestate" argument. The parsers are constructed as reentrant and this argument contains the per-parser state information.

The bodies of the action procedures are defined in a separate file called "dapparselex.c". That file also contains the lexer required by the parser. Note that lex was not used because of the simplicity of the lexemes.

One of the issues that must be addressed by any bottom-up parser is handling the accumulation of sets of items (nodes, etc.).

The canonical way that this is handled in the oc parsers is to use the following form of production.

1  declarations:
2            /* empty */ {$$=declarations(parsestate,NULL,NULL);}
3          | declarations declaration {$$=declarations(parsestate,$1,$2);}
4          ;
The base case (line 2) action is called with NULL arguments to indicate the base case. The recursive case (line 3) is called with the values of the two right side non-terminals.

The corresponding action code is defined as follows.

1  Object
2  declarations(DAPparsestate* state, Object decls, Object decl)
3  {
4      OClist* alist = (OClist*)decls;
5      if(alist == NULL) alist = oclistnew();
6      else oclistpush(alist,(ocelem)decl);
7      return alist;
8  }
The base case is handled in line 5. It creates and returns an OClist instance (called a Sequence in this document); an OClist is a dynamically extendible array of arbitrary items (see Miscellaneous). The recursive case is in line 6, where it is assumed that the list argument is defined and there is a decl object that should be appended to it.

This pattern, in various forms, is ubiquitous in the parsers.
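
The accumulation pattern can be tried outside the parser. The following is a minimal sketch, not the oc code: MiniList, minilist_new, minilist_push, and accumulate are hypothetical stand-ins for OClist, oclistnew, oclistpush, and the declarations action, and the parse state argument is omitted.

```c
#include <assert.h>
#include <stdlib.h>

typedef void* Object;

/* Hypothetical stand-in for OClist: a growable array of pointers. */
typedef struct MiniList {
    size_t length;   /* number of items in use */
    size_t alloc;    /* number of slots allocated */
    Object* items;
} MiniList;

MiniList* minilist_new(void)
{
    return (MiniList*)calloc(1, sizeof(MiniList));
}

/* Append an item, doubling the storage when it is full; returns 0 on failure. */
int minilist_push(MiniList* list, Object item)
{
    assert(list != NULL);
    if(list->length == list->alloc) {
        size_t newalloc = (list->alloc == 0 ? 4 : 2 * list->alloc);
        Object* newitems = (Object*)realloc(list->items, newalloc * sizeof(Object));
        if(newitems == NULL) return 0;
        list->items = newitems;
        list->alloc = newalloc;
    }
    list->items[list->length++] = item;
    return 1;
}

/* Same shape as the declarations() action: a NULL list signals the base case. */
Object accumulate(Object decls, Object decl)
{
    MiniList* alist = (MiniList*)decls;
    if(alist == NULL) alist = minilist_new();   /* base case: empty production */
    else minilist_push(alist, decl);            /* recursive case: append */
    return alist;
}
```

Calling accumulate(NULL, NULL) once and then accumulate(list, item) for each item mirrors the order in which a bottom-up parser reduces the two productions.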

Constraint Parser: ce.y

The ce.y parser parses DAP url projections (see OCURI). There is code to also parse selections, but since that is not needed, it is commented out. This does not mean that selections are not used, only that the selection string is passed unmodified to the server.

Currently, there is no need for this parser, so it is included in the source tree, but is not used.

OC Node Tree

As with Ocapi, the dap parser produces a node tree defining the DDS (or DAS) structure. The node structure (struct OCnode) is defined in ocnode.h and has the following fields, some of which are of subsidiary struct types that are defined below. This particular structure is relatively similar to that of the Ocapi node, but with all the extra data storage information elided.

unsigned int magic – A magic number to identify this structure.
OCtype octype – Defines the general kind of node.
OCtype etype – Used for attribute nodes and primitive nodes to define the primitive type.
char* name – From the DDS.
char* fullname – Fully qualified name such as a.b.c.
OCnode* container – Parent node of this node.
OCnode* root – root node of the tree containing this node.
OCnode* datadds – The correlated DATA DDS node, if any.
OCdiminfo dim – Extra information about dimension nodes.
OCarrayinfo array – Extra information about nodes that have rank > 0.
OClist* subnodes – (Sequence) The subnodes of this node.
OClist* attributes – (Sequence) Any attributes associated with this node.
struct OCSKIP skip –  Extra information about the node vis-a-vis the datadds data to improve access times.
OCtypeinfo –  Extra information about type definitions if netcdf-4 is being supported.
OCtypeinfo –  Extra information about group definitions if netcdf-4 is being supported.

The auxiliary structs are as follows.

struct OCdiminfo
  OCnode* array – The defining array node, if known.
  unsigned int arrayindex – The rank position in the defining array node, if known.
  ocindex_t declsize – Dimension size as specified in the (data)DDS.

struct OCarrayinfo
  OClist* dimensions – (Sequence) The complete set of dimensions.
  unsigned int rank – |dimensions|.

struct OCattribute
  char* name – Name of the attribute.
  OCtype etype – Primitive type of the attribute.
  size_t nvalues – Length of the values field.
  char** values – List of values associated with the attribute.

struct OCSKIP

The OCSKIP structure requires some detailed explanation because it is important in optimizing access to data. The idea is that for many DATADDS OCnode objects, it is possible to precompute information about that object vis-a-vis the xdr formatted data packet. In particular it is possible to calculate, sometimes, the following information (kept as OCSKIP fields).
  1. ocindex_t count - defined only for nodes with rank > 0 as the product of the dimension sizes.

  2. ocoffset_t instancesize - For some nodes, it is possible to pre-compute the exact size of an instance of the node. For example, any primitive array that is not of type String or URL can have its exact size computed. Any Grid, or Structure, or Sequence whose fields all have defined totalsizes, can have its instance size pre-computed.

  3. ocoffset_t totalsize - If the instance size and count are known, then the total size of the object is ((instancesize * count) + overhead) where the overhead is any preceding counts. For scalars, the overhead is zero, for non-string/url primitives, the overhead is 8 bytes of xdr data representing the count and its repetition (an artifact of the DAP2 protocol xdr encoding).

    Note that the totalsize of sequences is not possible to compute because the number of records is unknown before any data is fetched.

  4. ocoffset_t offset - If all objects (preceding a node) in the preorder listing of the DATADDS nodes have a defined totalsize, then it is possible to define the exact offset in the xdr data packet of that node by adding up the total sizes of the preceding objects. Of course, as soon as a String/URL or Sequence typed object is encountered, the offset becomes indeterminate.

The skip values are pre-computed recursively in the procedure occomputeskipdata in ocnode.c and should be consulted to see how the computation is carried out.
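
The totalsize and offset rules above can be illustrated with a small sketch. It is not the occomputeskipdata code: DemoSkip, DEMO_UNKNOWN, demo_totalsize, and demo_assignoffsets are hypothetical names, sizes are plain long values with -1 meaning "not computable", and the nodes are assumed to be a flat list of sibling fields so that container sizes are not counted twice.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_UNKNOWN (-1L)   /* hypothetical marker for "cannot be pre-computed" */

/* Hypothetical, simplified stand-in for the OCSKIP fields. */
typedef struct DemoSkip {
    long totalsize;   /* bytes for all instances plus overhead, or DEMO_UNKNOWN */
    long offset;      /* byte offset within the packet, or DEMO_UNKNOWN */
} DemoSkip;

/* totalsize = (instancesize * count) + overhead, when both inputs are known. */
long demo_totalsize(long instancesize, long count, long overhead)
{
    if(instancesize == DEMO_UNKNOWN || count == DEMO_UNKNOWN) return DEMO_UNKNOWN;
    return instancesize * count + overhead;
}

/* Assign offsets in listing order. A node whose totalsize is unknown still
   gets an offset, but every node after it becomes indeterminate. */
void demo_assignoffsets(DemoSkip* nodes, size_t n, long start)
{
    long offset = start;
    size_t i;
    assert(nodes != NULL || n == 0);
    for(i = 0; i < n; i++) {
        nodes[i].offset = offset;
        if(offset != DEMO_UNKNOWN) {
            offset = (nodes[i].totalsize == DEMO_UNKNOWN)
                         ? DEMO_UNKNOWN
                         : offset + nodes[i].totalsize;
        }
    }
}
```

For example, a 4-byte scalar followed by a 10-element array of 4-byte values (8 bytes of overhead) and then a String gives offsets 0, 4, and 52; anything after the String has an unknown offset.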

OCstate Management

The overarching concept in the API is that of an OCstate, which is an opaque identifier representing a DAP state; it is used to maintain persistent state about the connection to a specific DAP server, as well as the various requests and responses between the client and the server.

A good analog is to the FILE object used by C standard IO. Like a FILE, an OCstate provides the context for some operation or object.

The state is used for a variety of purposes and is as a rule the first argument of any of the API procedures.

OCstate Structure

The OCstate structure contains the following fields.
unsigned int magic – A magic number to identify this structure.
CURL* curl – The handle to a CURL connection. Its lifetime is that of the OCstate structure.
OClist* trees – The set of root objects for previously fetched DAP requests. See OC Trees.
OCURI* uri – URI for fetching data.
OCbytes* packet – buffer for temporary storage of fetched data.
OCcontent* contentlist – linked list of all created OCcontent objects.
struct OCerrdata error – A struct to hold error return info from server (see below).
struct OCcurlflags curlflags – The curl flags to set before fetch. (see below).
struct OCSSL ssl – SSL related Authorization and authentication information. (see below).
struct OCproxy proxy – Proxy information. (see below).
struct OCcredentials creds – Credentials for BASIC (i.e. password-based) authentication (see below).

The auxiliary structs are as follows. For the curl flags, the curl documentation should be consulted.

struct OCerrdata
  char* code – A numeric error code (in ascii) from the dap server.
  long httpcode – Any HTTP error code returned (e.g. 404).

struct OCcurlflags
  int compress – CURLOPT_ENCODING
  int verbose – CURLOPT_VERBOSE
  int timeout – CURLOPT_TIMEOUT
  int followlocation – CURLOPT_FOLLOWLOCATION
  int maxredirs – CURLOPT_MAXREDIRS (=10)
  char* useragent – CURLOPT_USERAGENT
  char* cookiejar – CURLOPT_COOKIESESSION, CURLOPT_COOKIEJAR
  char* cookiefile – CURLOPT_COOKIEFILE

struct OCSSL
  int validate – CURLOPT_SSL_VERIFYPEER
  char* certificate – CURLOPT_SSLCERT
  char* key – CURLOPT_SSLKEY
  char* keypasswd – CURLOPT_KEYPASSWD
  char* cainfo – CURLOPT_CAINFO
  char* capath – CURLOPT_CAPATH
  int verifypeer – CURLOPT_SSL_VERIFYPEER

struct OCproxy
  char* host – The proxy host name.
  int port – The proxy port number.

struct OCcredentials
  char* username – The username for logging into the proxy.
  char* password – The password for logging into the proxy.

OCstate Trees

Every time that a DAP DDS, DAS, or DATADDS is fetched from a server, it is parsed and a tree of OCnode instances is constructed. The roots of these trees are kept in the OCstate and may be created by fetching and destroyed by appropriate interface procedures.

Associated with the root node of every tree is an instance of OCtree, which is used to store information about the fetch and the tree.

The OCtree structure contains the following fields.
OCdxd dxdclass – Enumeration instance: one of OCDAS, OCDDS or OCDATADDS.
char* constraint – The constraint string used when fetching the DAP object.
char* text – The text of the DAP object as received from the server.
OCnode* root – Cross link to the root node to which this OCtree instance is attached.
OCstate* state – Cross link to the state containing the root.
OClist* nodes – A list of all nodes in the tree rooted at root.

When the dxdclass is OCDATADDS, the following additional fields are defined and used.
unsigned long bod – offset in the datadds packet to the beginning of the binary XDR data.
char* filename – name of the temporary file for holding datadds data.
FILE* file – FILE object for the temporary file.
unsigned long filesize – size of the temporary file.
XDR* xdrs – XDR handle for walking the temporary file.
OCmemdata* memdata – root of the compiled datadds packet.

API

The API is best understood by reading the user's manual and following the code for procedures of interest. The API is defined in oc.[ch].

One important thing to understand is that the externally visible API hides the actual definitions of the OCstate, OCnode, and OCcontent types. This is accomplished by defining alternate, externally visible, types that are internally mapped to the appropriate actual type and are the values passed into and out of the API procedures.

The types and mapping are as follows.

  1. External type OClink maps to internal type OCstate
  2. External type OCobject maps to internal type OCnode
  3. External type OCdata maps to internal type OCcontent
The three external types are all defined as being either ... The assumption is that in all cases the external type is the same size as void*.

It is important to be able to verify for each API procedure that its arguments are semantically correct. This is handled by the macro OCVERIFY.

If OC_FASTCONSISTENCY is defined, then OCVERIFY will check, by casting, for an expected magic number at the beginning of the external object. If OC_FASTCONSISTENCY is not defined, then a table of all created objects is searched. Since the fast consistency check is preferable, the option of using the object map is only useful in certain debugging situations when it might be desirable to track all of the created objects.

Once an API argument is verified, it needs to be cast to the appropriate internal type. This is accomplished using the OCDEREF macro, which casts the argument to the proper type and stores it in a specified local variable of internal type.
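
The verify-then-dereference idea can be sketched as follows. This is an illustration only: it uses functions rather than the actual OCVERIFY and OCDEREF macros, and DEMO_MAGIC, DemoState, DemoLink, demo_verify, demo_deref, and demo_getpayload are all hypothetical names. It shows only the fast (magic number) form of the check.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAGIC 0x0c0c0c0cU   /* hypothetical magic value */

/* Internal type: the magic number is the first field. */
typedef struct DemoState {
    unsigned int magic;
    int payload;
} DemoState;

/* External, opaque handle that has the size of void*. */
typedef void* DemoLink;

/* Fast consistency check: read the leading magic number through a cast. */
int demo_verify(DemoLink link)
{
    if(link == NULL) return 0;
    return ((DemoState*)link)->magic == DEMO_MAGIC;
}

/* Cast an already verified handle to the internal type. */
DemoState* demo_deref(DemoLink link)
{
    assert(demo_verify(link));
    return (DemoState*)link;
}

/* Shape of an API procedure: verify, dereference, then operate.
   Returns 0 on success and -1 for an invalid handle. */
int demo_getpayload(DemoLink link, int* valuep)
{
    DemoState* state;
    if(!demo_verify(link)) return -1;
    state = demo_deref(link);
    if(valuep != NULL) *valuep = state->payload;
    return 0;
}
```

The check is cheap, but it can only detect handles that point at readable memory with the wrong leading word; that limitation is presumably why a table-based check is kept as a debugging alternative.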

OC Data Access API

Accessing the actual data associated with a DATADDS fetch is perhaps the most complex and confusing part of the oc API. This is, unfortunately, a direct reflection of the complexity of the DAP2 protocol data model, and especially the consequences of Sequences.

A navigational interface has been defined that allows for simplified walking of the data dds packet data. The navigational interface has been modified multiple times, and the one described here is a variation on the one designed by Patrick West for the IDL client for OPeNDAP.

The oc user's manual (ocuserman.html) should be read to obtain a working understanding of the navigational interface (the oc_data_XXX procedures). This section discusses the complexities underlying that interface.

In addition to the OCstate structure and the OCnode structure, the navigational interface defines an OCcontent structure.
unsigned int magic  –  A magic number to identify this structure.
OCmode mode  –  The access mode (see below).
OCstate* state  –  the state object to which this content is associated.
OCnode* node  –  the OCnode that serves as template for the data pointed to by this content object.
OCtree tree  –  The specific tree of nodes, typically refers to the DDS tree associated with a DATADDS fetch.
int packed  –  True if this content points to packed data, which means that the node octype is OC_PRIMITIVE, its etype is OC_BYTE or OC_CHAR, and it is not a scalar object.
struct OCCACHE  –  Cache to track last index and xdr positions (see below).
struct OCcontent* next  –  link to next OCcontent object; allows reclamation and reuse.

The OCcontent object represents a subset of the data (aka an instance) within the data part of a DATADDS response. The node field serves as a template for accessing the data (in xdr format) pointed to by the OCcontent object.

The mapping between nodes and contents is one-to-many. That is, there often will be multiple data instances of a given node type in a DATADDS response. Consider the following example.

Dataset {
  Structure {
    int16 f11[2];
    float32 f12;
  } S1;
  Structure {
    int16 f21;
    float32 f22[2];
  } S2[3];
} D1;
If we have a data response with this DDS, then the following instances will exist.

Class  Count  Instances
D1     1      D1
S1     1      D1.S1
f11    2      D1.S1.f11[0]
              D1.S1.f11[1]
f12    1      D1.S1.f12
S2     3      D1.S2[0]
              D1.S2[1]
              D1.S2[2]
f21    3      D1.S2[0].f21
              D1.S2[1].f21
              D1.S2[2].f21
f22    6      D1.S2[0].f22[0]
              D1.S2[0].f22[1]
              D1.S2[1].f22[0]
              D1.S2[1].f22[1]
              D1.S2[2].f22[0]
              D1.S2[2].f22[1]
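
The counts listed above follow from multiplying the element counts along the path from the dataset to the node, treating scalars as size 1. A minimal sketch of that arithmetic, where demo_instancecount is a hypothetical helper and not part of oc:

```c
#include <assert.h>
#include <stddef.h>

/* counts[i] is the number of elements declared at nesting level i
   (1 for scalars), listed from the dataset down to the node. */
size_t demo_instancecount(const size_t* counts, size_t depth)
{
    size_t total = 1;
    size_t i;
    assert(counts != NULL || depth == 0);
    for(i = 0; i < depth; i++) total *= counts[i];
    return total;
}
```

For f22 the path is D1 (1), S2[3] (3), f22[2] (2), giving 1 * 3 * 2 = 6 instances; for f11 the path is D1 (1), S1 (1), f11[2] (2), giving 2.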

The goal is to allow the user to navigate to all of the instances contained in a given DATADDS data packet and, when desired, extract the instance as usable data. Note, however, that only primitive typed arrays (or scalars) can have their data extracted. It is not possible in the current interface to, for example, extract a whole Structure object; rather it must be done by extracting each field in turn. This may require recursion if one of the fields is itself, for example, a Grid, Structure, or Sequence.

The most important internal procedures are as follows.

Procedure – Abbreviated Semantics
OCcontent* ocnewcontent(OCstate* state)  –  Obtain an unused OCcontent object; either off the free list or using malloc().
void ocfreecontent(OCstate* state, OCcontent* content)  –  Release a content object onto the free list for later reuse
int ocrootdata(struct OCstate*, struct OCnode*, struct OCcontent*)  –  Obtain an OCcontent object that points to the data dds as a whole
int ocdataith(struct OCstate*, OCcontent*, size_t, OCcontent*)  –  Move to the i'th "position" of this object as controlled by the object's type and a mode.
int ocgetcontent(struct OCstate*, struct OCcontent*, void* memory, size_t memsize, size_t start, size_t count)  –  Extract the data associated with the current content. As mentioned above, this can only be done for primitive array or scalar data.
int ocxdrread(struct OCcontent*, XXDR*, char* memory, size_t, ocindex_t index, ocindex_t count)  –  This is the workhorse internal procedure to actually extract the xdr formatted data and convert it to the proper form in memory.
int ocskipinstance(OCnode* node, XXDR* xdrs, int state, int* tagp)  –  In order to get to some point in the data, it is often necessary to skip over preceding data. This can be a complex activity when sequences and strings are involved. This procedure handles the skipping over of arbitrary data.
OCmode modetransition(OCnode* node, OCmode srcmode)  –  This procedure determines the mode of the new content returned by the ocdataith procedure.

One note about OCcontent objects. The reason that there are explicit create and destroy operations is to allow/force the user to control the number of created OCcontent objects and to reuse previously created OCcontent objects. If the API created a new object for every call to, say, ocdimcontent, then there would be an explosion of OCcontent objects equal to the product of the dimensions. There would be no way to reclaim them either because it would be impossible to know which are still actively in use.
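
The create/release discipline can be sketched with a simple free list. This is an assumption-laden illustration rather than the oc implementation: DemoContent, DemoPool, demo_newcontent, and demo_freecontent are hypothetical, and the real OCcontent objects are tracked through the OCstate.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for OCcontent. */
typedef struct DemoContent {
    int index;                  /* placeholder payload */
    struct DemoContent* next;   /* free-list link */
} DemoContent;

/* Hypothetical holder of the free list (the role played by the state). */
typedef struct DemoPool {
    DemoContent* freelist;
} DemoPool;

/* Obtain an unused object: reuse one from the free list if possible,
   otherwise allocate. The object is returned zeroed. */
DemoContent* demo_newcontent(DemoPool* pool)
{
    DemoContent* content;
    assert(pool != NULL);
    if(pool->freelist != NULL) {
        content = pool->freelist;
        pool->freelist = content->next;
    } else {
        content = (DemoContent*)malloc(sizeof(DemoContent));
        if(content == NULL) return NULL;
    }
    memset(content, 0, sizeof(DemoContent));
    return content;
}

/* Release an object onto the free list for later reuse. */
void demo_freecontent(DemoPool* pool, DemoContent* content)
{
    assert(pool != NULL);
    if(content == NULL) return;
    content->next = pool->freelist;
    pool->freelist = content;
}
```

With this arrangement a loop that visits every element of an array can release each object before requesting the next, so the number of live objects stays constant instead of growing with the array size.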

It is important to understand the modetransition procedure in order to understand how the navigation works. Its behavior is described by a transition table with the following three columns.

  1. Mode of the current OCcontent Object
  2. OCtype of the node associated with the current OCcontent object
  3. Mode to be assigned to the new content object representing the i'th element of the current content object.

Case  Current Mode    Current OCtype  New Mode
1     OCARRAYMODE     OC_Grid         OCFIELDMODE
2     OCARRAYMODE     OC_Structure    OCFIELDMODE
3     OCARRAYMODE     OC_Sequence     OCSEQUENCEMODE
4     OCSEQUENCEMODE  any mode        OCFIELDMODE
5     OCFIELDMODE     OC_Sequence     OCARRAYMODE
6     OCFIELDMODE     OC_Grid         OCARRAYMODE
7     OCFIELDMODE     OC_Structure    OCARRAYMODE
8     OCFIELDMODE     OC_Primitive    OCPRIMITIVEMODE

Any combinations not listed are illegal.

The general idea is that given a set of objects (i.e. an array of them or a sequence of them), asking for the i'th element should cause transition to pointing to the actual i'th data item in the sequence. This is seen in cases 1, 2, and 3, where we are transitioning from referencing an array of Grids or Structures or Sequences to referencing a specific Grid/Structure/Sequence in the array. Note that, for purposes of the transitions, scalars are considered arrays of size 1. Also note that arrays of sequences are supported here, but are illegal according to the DAP 2 specification.

Case 4 also shows the same kind of transition, but here the transition is from a pointer to a whole Sequence to the fields of a specific (i'th) record in the Sequence.

Cases 5, 6, 7, and 8 occur when we are moving to a specific i'th field of a Grid object, Structure Object, or Sequence record. If the field octype is OC_Structure or OC_Grid, we assume that we are moving to an array of those objects, hence the new mode is OCARRAYMODE. If the field type is OC_Sequence, then we are moving to the Sequence object, hence the mode becomes OC_Sequence. If the field type is OC_Primitive, then we have reached the point where actual data extraction is possible, so the mode becomes OCPRIMITIVEMODE.
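
The transition table can be written down as a switch. The sketch below follows the eight table rows exactly as listed and is not the modetransition source: the DEMO_ enumerations and DemoNode are hypothetical stand-ins for OCmode, OCtype, and OCnode, and DEMO_NOMODE is an assumed marker for the combinations the table calls illegal.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for OCmode and OCtype. */
typedef enum DemoMode {
    DEMO_NOMODE, DEMO_ARRAYMODE, DEMO_SEQUENCEMODE,
    DEMO_FIELDMODE, DEMO_PRIMITIVEMODE
} DemoMode;

typedef enum DemoType {
    DEMO_Primitive, DEMO_Grid, DEMO_Structure, DEMO_Sequence
} DemoType;

/* Hypothetical stand-in for OCnode; only the octype matters here. */
typedef struct DemoNode {
    DemoType octype;
} DemoNode;

/* Return the mode of the new content object, or DEMO_NOMODE when the
   (mode, octype) combination is not listed in the table. */
DemoMode demo_modetransition(const DemoNode* node, DemoMode srcmode)
{
    assert(node != NULL);
    switch (srcmode) {
    case DEMO_ARRAYMODE:
        if(node->octype == DEMO_Sequence) return DEMO_SEQUENCEMODE;    /* case 3 */
        if(node->octype == DEMO_Grid || node->octype == DEMO_Structure)
            return DEMO_FIELDMODE;                                    /* cases 1, 2 */
        return DEMO_NOMODE;
    case DEMO_SEQUENCEMODE:
        return DEMO_FIELDMODE;                                        /* case 4 */
    case DEMO_FIELDMODE:
        if(node->octype == DEMO_Primitive) return DEMO_PRIMITIVEMODE; /* case 8 */
        return DEMO_ARRAYMODE;                                        /* cases 5, 6, 7 */
    default:
        return DEMO_NOMODE;                                           /* not listed */
    }
}
```

Starting from an array of Structures, repeated calls alternate between the array and field modes until a primitive field is reached, at which point data extraction becomes possible.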

Caching and Skipping

The OCCACHE structure is used to track information that enables the ocskipinstance procedure to more quickly find a point in the xdr data packet. This structure contains the following fields.
  1. int valid - 1 if this cache is valid, 0 otherwise
  2. ocindex_t index - the last index argument reached by the OCcontent object
  3. ocindex_t maxindex - max allowable index, if known, 0 => max is unknown; used to check for index out of bounds errors in oc_data_ith calls.
  4. ocoffset_t offset - offset (from 0) of the index'th object encountered.
The idea is that as oc_data_ith is applied to an OCcontent object, with different index arguments, the cache tracks the last index used and the associated xdr offset of that index'th object.

The ocdataith and ocskipinstance procedures use the OCSKIP and OCCACHE information to efficiently point to, or skip over, objects in the xdr data cache. For example, if the user is trying to reach the i'th element in a primitive typed array field inside a structure, and the offset of the field is known in the OCSKIP information, then a simple calculation will immediately produce a pointer into the xdr data packet to the beginning of that primitive typed field. At that point, oc_data_get can quickly extract the data directly from the xdr data packet.

Even if the offset is not known, other information such as the total object size, or even the instance size, can speed up access by changing what would otherwise be a series of data reads (looking for counts or record tags, for example) into a mix of data reads and repositionings that is faster than the reads alone.

Further, by caching the last referenced index and its corresponding xdr data packet offset, the OCCACHE information can speed up a call to oc_data_ith to access the (index+1)'th object because the search can start with the cached information rather than having to begin at position zero.
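
The benefit of the cache can be seen in a small model. The sketch below is not the ocskipinstance logic: it assumes a hypothetical array sizes[] giving the encoded size of each item (standing in for sizes that can only be discovered by reading the packet), and DemoCache and demo_seek are illustrative names. The steps output counts how many items had to be skipped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for OCCACHE. */
typedef struct DemoCache {
    int valid;       /* 1 if index and offset are meaningful */
    size_t index;    /* last index reached */
    size_t offset;   /* offset of the index'th item */
} DemoCache;

/* Return the offset of the index'th item. The walk starts at the cached
   position when the cache is valid and not beyond the requested index,
   otherwise at item zero. *stepsp receives the number of items skipped. */
size_t demo_seek(const size_t* sizes, size_t nitems, size_t index,
                 DemoCache* cache, size_t* stepsp)
{
    size_t i = 0, offset = 0, steps = 0;
    assert(sizes != NULL && cache != NULL && index < nitems);
    if(cache->valid && cache->index <= index) {
        i = cache->index;
        offset = cache->offset;
    }
    for(; i < index; i++) {
        offset += sizes[i];   /* stands in for skipping one instance */
        steps++;
    }
    cache->valid = 1;
    cache->index = index;
    cache->offset = offset;
    if(stepsp != NULL) *stepsp = steps;
    return offset;
}
```

Sequential access (index, index+1, ...) then costs one skip per call, while a request for an earlier index falls back to walking from the start.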

Error Handling

Error handling in oc is somewhat different than in Ocapi, and follows mostly the netCDF model. That is, procedures return simple numeric error codes to indicate success (OC_NOERR) or failure (OC_EXXX). The current error codes are defined in oc.h, but they need reorganization and extension.

Logging

One good thing about Ocapi was that it provided a mechanism for returning detailed error information strings. In order to keep something like that, oc has a log mechanism (oclog.[ch]) that can be used to dump extra error or warning info and it can be used to dump debug info (see the DEBUG macros in ocdebug.h).

The procedures that define the logging interface are just the internal versions of the ones described in ocuserman.html.

OCURI

Surprisingly, it appears that libcurl does not export any kind of URL parsing capability. Therefore, the ocuri type was created to support this. It is defined in ocuri.[ch]. In the following the terms "url" and "uri" will be used interchangeably, even though there are subtle semantic differences.

The uri is assumed to be (most generally) of the form

[param=...,param=...,...]protocol://username:password@host:port/file?constraint
The constraint, in turn is composed of projections and selections.
?projection,projection,...&selection&selection...
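
A minimal sketch of separating a constraint into its two parts follows. It is not the ocuri parser: demo_duprange and demo_splitconstraint are hypothetical, the split is made at the first '&', and keeping the leading '&' on the selection part is a choice made for this example only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copy len characters of s into a new null-terminated string. */
char* demo_duprange(const char* s, size_t len)
{
    char* p = (char*)malloc(len + 1);
    if(p != NULL) {
        memcpy(p, s, len);
        p[len] = '\0';
    }
    return p;
}

/* Split "?proj,proj&sel&sel" into malloc'd projection and selection
   strings; a missing part is returned as NULL. A leading '?' is optional.
   Returns 1 on success and 0 on allocation failure. */
int demo_splitconstraint(const char* constraint, char** projectionp, char** selectionp)
{
    const char* amp;
    size_t projlen;
    assert(constraint != NULL && projectionp != NULL && selectionp != NULL);
    *projectionp = NULL;
    *selectionp = NULL;
    if(*constraint == '?') constraint++;
    amp = strchr(constraint, '&');
    projlen = (amp == NULL ? strlen(constraint) : (size_t)(amp - constraint));
    if(projlen > 0) {
        *projectionp = demo_duprange(constraint, projlen);
        if(*projectionp == NULL) return 0;
    }
    if(amp != NULL && amp[1] != '\0') {
        *selectionp = demo_duprange(amp, strlen(amp));
        if(*selectionp == NULL) {
            free(*projectionp);
            *projectionp = NULL;
            return 0;
        }
    }
    return 1;
}
```

For the constraint "?a,b.c&x>1&y<2" this yields the projection "a,b.c" and the selection "&x>1&y<2"; the caller owns both strings and frees them.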

The OCURI structure contains the following fields.
char* uri  –  The uri as originally passed in to the parser
char* protocol  –  Protocol field (e.g. "https") of the uri
char* user  –  User name field; NULL if not present
char* password  –  Password field; NULL if not present
char* host  –  Host field
char* port  –  Port number; 0 if not present
char* file  –  File part of the uri, with the leading '/'
char* constraint  –  Constraint (not including leading '?'); NULL if not present
char* projection  –  The projections in the constraint; NULL if not present.
char* selection  –  The selections in the constraint; NULL if not present.
char* params  –  The parameters (the bracketed prefix of the uri); NULL if not present.
char** paramlist  –  A "compiled" version of the params in envv format, where paramlist[i] is the param name and paramlist[i+1] is the param value. The whole list is NULL terminated. It is assumed that the name part and the value part are never NULL. Rather, the empty string ("") is used to indicate no value.

The most important parts of the ocuri API are as follows.

Operation – Semantics
int ocuriparse(const char* uri, OCURI** ocurip)  –  Creates an instance of OCURI, stores the pointer to it in ocurip, and fills the created instance with data from parsing the uri string into its component parts. It returns 0 if it fails, 1 otherwise.
void ocurifree(OCURI* ocuri)  –  Free all the memory associated with the argument, including the argument instance.
int ocuridecodeparams(OCURI* ocuri)  –  Parses ocuri->params into ocuri->paramlist.
const char* ocurilookup(OCURI* ocuri, const char* param)  –  Searches ocuri->paramlist for a match to param. If not found, then return NULL, otherwise return the value associated with the param; an empty value is represented by the zero-length string "", not by NULL.
char* ocuriencode(char* s, char* allowable)  –  Applies URL character encoding and returns a new encoded instance of s. The set of characters to not encode is specified by the allowable argument.
char* ocuribuild(OCURI* ocuri, const char* prefix, const char* suffix, int flags)  –  Construct a url string from the fields in ocuri; the new url is prefixed (before any parameters are added) with the prefix argument and suffixed (before any constraints are added) with the suffix argument; the protocol, host, port, and file parts are always included, and the flags argument (possibly an or of multiple flags) determines what other parts are included as follows
  • OCURICONSTRAINTS - include the constraints
  • OCURIUSERPWD - include user name and password
  • OCURIPARAMS - include the parameters in the parameter list
  • OCURIENCODE - url encode the output
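
The lookup semantics described for ocurilookup can be sketched over a plain name/value list. This is an illustration, not the ocuri.c source: demo_paramlookup is a hypothetical function that takes the paramlist directly instead of an OCURI.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* paramlist holds name, value, name, value, ..., NULL. Returns the value
   for param, "" when the parameter is present without a value, and NULL
   when it is absent. */
const char* demo_paramlookup(char** paramlist, const char* param)
{
    char** p;
    assert(param != NULL);
    if(paramlist == NULL) return NULL;
    for(p = paramlist; p[0] != NULL && p[1] != NULL; p += 2) {
        if(strcmp(p[0], param) == 0) return p[1];
    }
    return NULL;
}
```

Because an empty value is "" rather than NULL, a caller can distinguish "parameter given without a value" from "parameter not given" with a single NULL test.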

Miscellaneous

The two datatypes OClist and OCbytes are used throughout the code. They correspond closely in semantics to the Java ArrayList and StringBuffer types, respectively. They are used to help encapsulate dynamically growing lists of objects or bytes to reduce certain kinds of errors.

The canonical code for non-destructive walking of an OClist (Sequence) is as follows.

for(i=0;i<oclistlength(list);i++) {
    T* element = (T*)oclistget(list,i);
    ...
}

OCbytes provides two ways to access its internal buffer of characters. One is "ocbytescontents()", which returns a direct pointer to the buffer, and the other is "ocbytesdup()", which returns a malloc'd string containing the contents and null terminated.

Multi-Dimensional Array Handling

Within a data packet, the DAP protocol "linearizes" multi-dimensional arrays into a single dimension. The rule for converting a multi-dimensional array to a single dimension is as follows.

Suppose we have the DDS field Int F[2][5][3];. There are obviously a total of 2 X 5 X 3 = 30 integers in F. Thus, these three dimensions will be reduced to a single dimension of size 30.

A particular point in the three dimensions, say [x][y][z], is reduced to a number in the range 0..29 by computing ((x*5)+y)*3+z. The corresponding general C code is as follows.

size_t
dimmap(int rank, size_t* indices, size_t* sizes)
{
    int i;
    size_t count = 0;
    /* Accumulate in row-major order: the last index varies fastest. */
    for(i=0;i<rank;i++) {
        count *= sizes[i];
        count += indices[i];
    }
    return count;
}
In this code, the indices variable corresponds to the x,y, and z. The sizes variable corresponds to the 2,5, and 3.
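
As a worked example, the point [1][2][1] in F maps to ((1*5)+2)*3+1 = 22. The reverse direction, recovering the indices from a linear position, is sketched below; demo_dimunmap is a hypothetical helper that is not part of oc and simply inverts the arithmetic of the rule above.

```c
#include <assert.h>
#include <stddef.h>

/* Recover the per-dimension indices from a row-major linear offset by
   peeling off the fastest varying (last) dimension first. */
void demo_dimunmap(int rank, size_t offset, const size_t* sizes, size_t* indices)
{
    int i;
    assert(rank >= 0);
    assert(rank == 0 || (sizes != NULL && indices != NULL));
    for(i = rank - 1; i >= 0; i--) {
        assert(sizes[i] > 0);
        indices[i] = offset % sizes[i];
        offset /= sizes[i];
    }
}
```

With sizes 2, 5, and 3, the offset 22 gives z = 22 mod 3 = 1, then y = 7 mod 5 = 2, then x = 1 mod 2 = 1, which is the original point.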

Change Log

Copyright

Copyright 2009, UCAR/Unidata and OPeNDAP, Inc.