The aim of this tutorial is to introduce you to SDMX-ML using a practical example – the publication of the euro foreign exchange reference rates.
The tutorial will look at the technology from both sides of the fence: the side of the data provider (“How can we use SDMX-ML to publish statistical data on our website?”); and the side of the data consumer (“What kind of useful things can we do with an SDMX-ML data file?”). However, first we will present, in a nutshell, the SDMX information model and some of the SDMX-ML formats.
To make the most of this tutorial, a basic knowledge of XML and XML-related technologies, such as XML schemas, XSLT and SAX, is required. Some of the tasks described in the tutorial will also require the use of an XML-validating parser, for example Apache Xerces or xmllint from libxml2, and an XSLT processor, such as Apache Xalan, Saxon or xsltproc from libxml2.
For those who use Ant as their build system, most of the exercises can be run using the supplied build file.
The Statistical Data and Metadata Exchange initiative is sponsored by seven institutions (the BIS, the ECB, Eurostat, the IMF, the OECD, the UN and the World Bank) to foster standards for the exchange of statistical information. The first version of the standard is an ISO standard (ISO/Technical Specification 17369:2005). It offers an information model for the representation of statistical data and metadata, as well as several formats to represent this model (SDMX-EDI and several SDMX-ML formats). It also proposes a standard way of implementing web services, including the use of registries.
The list below tells you everything you need to know about the SDMX information model in order for us to start developing an application based on the SDMX standard:
1. Descriptor concepts: In order to make sense of some statistical data, we need to know the concepts associated with them. For example, on its own the figure 1.2953 is pretty meaningless, but if we know that this is an exchange rate for the US dollar against the euro on 23 November 2006, it starts to make more sense.
2. Packaging structure: Statistical data can be grouped together at the following levels: the observation level (the measurement of some phenomenon); the series level (the measurement of some phenomenon over time, usually at regular intervals); the group level (a group of series – a well-known example being the sibling group, a set of series which are identical, except for the fact that they are measured with different frequencies); and the dataset level (made up of several groups, to cover a specific statistical domain for instance). The descriptor concepts mentioned in point 1 can be attached at various levels in this hierarchy.
3. Dimensions and attributes: There are two types of descriptor concept: dimensions, which both identify and describe the data, and attributes, which are purely descriptive.
4. Keys: Dimensions are grouped into keys, which allow the identification of a particular set of data (a series, for example). The key values are attached at the series level and given in a fixed sequence. Conventionally, frequency is the first descriptor concept and the other concepts are assigned an order for that particular dataset. Partial keys can be attached to groups.
5. Code lists: Every possible value for a dimension is defined in a code list. Each value on that list is given a language-independent abbreviation (code) and a language-specific description. Attributes are represented sometimes by codes, and sometimes by free-text values. Since the purpose of an attribute is solely to describe and not to identify the data, this is not a problem.
6. Data Structure Definitions: A Data Structure Definition (key family) specifies a set of concepts, which describe and identify a set of data. It tells us which concepts are dimensions (identification and description) and which are attributes (just description), and it gives the attachment level for each of these concepts on the basis of the packaging structure (dataset, group, series or observation), as well as their status (mandatory or conditional). It also specifies which code lists provide possible values for the dimensions and gives possible values for the attributes, either as code lists or free text fields.
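To make these notions concrete, the following Python sketch models a series with its dimensions, attributes and observations. It is only an illustration: the class names, the dot-separated key notation and the use of Python are assumptions made here, and the dimension names are those of the exchange rate example developed later in this tutorial.

```python
from dataclasses import dataclass, field

# Fixed order of the dimensions in the key; frequency comes first by convention.
KEY_ORDER = ("FREQ", "CURRENCY", "CURRENCY_DENOM", "EXR_TYPE", "EXR_SUFFIX")


@dataclass
class Observation:
    time_period: str
    value: float
    attributes: dict = field(default_factory=dict)  # e.g. OBS_STATUS


@dataclass
class Series:
    dimensions: dict                                   # identify and describe
    attributes: dict = field(default_factory=dict)     # purely descriptive
    observations: list = field(default_factory=list)

    def key(self) -> str:
        """Join the dimension values in their fixed order (notation assumed)."""
        return ".".join(self.dimensions[name] for name in KEY_ORDER)

    def sibling_key(self) -> str:
        """Partial key of the sibling group: every dimension except frequency."""
        return ".".join(self.dimensions[name] for name in KEY_ORDER[1:])


series = Series(
    dimensions={"FREQ": "D", "CURRENCY": "USD", "CURRENCY_DENOM": "EUR",
                "EXR_TYPE": "SP00", "EXR_SUFFIX": "A"},
    attributes={"TIME_FORMAT": "P1D"},
)
series.observations.append(Observation("2006-11-23", 1.2953, {"OBS_STATUS": "A"}))

print(series.key())          # D.USD.EUR.SP00.A
print(series.sibling_key())  # USD.EUR.SP00.A
```

The figure 1.2953 only becomes meaningful once it sits inside an observation that belongs to a series identified by its key.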
SDMX-ML supports various use cases and, therefore, defines several XML formats. For the purpose of this tutorial, the following two formats will be used:
The Structure Definition format: This format will be used to define the structure (concepts, code lists, dimensions, attributes, etc.) of the key families.
The Compact format: This format will be used to define the data file. It is not a generic format (it is specific to a Data Structure Definition), but it is designed to support validation and is much more compact so as to support the exchange of large datasets.
Now that we know the basics, we can start developing our application.
We want to publish the euro foreign exchange reference rate data on our website. The first step is to analyse the kind of data we are dealing with; then we have to create the Structure Definition file to represent these data. We will then generate a schema out of the Structure Definition file, which we will use to validate the data file. Finally, we will create the XML data file to be published on the website.
For the purpose of this exercise, we will use the Data Structure Definition defined in the table below.
Table 1. The Data Structure Definition: dimensions, measures and attributes

Dimensions
|Position in key|Concept|Code list / format|Description|
|Dimension 1 (role: frequency)|FREQ|CL_FREQ|Frequency of observations (daily, in this case).|
|Dimension 2|CURRENCY|CL_CURRENCY|The currency whose value is being measured against the base currency, e.g. the US dollar.|
|Dimension 3|CURRENCY_DENOM|CL_CURRENCY|The base currency (the euro, in this case).|
|Dimension 4|EXR_TYPE|CL_EXR_TYPE|The exchange rate type (spot, in this case).|
|Dimension 5|EXR_SUFFIX|CL_EXR_SUFFIX|Exchange rate series variation (average or standardised measure for a given frequency, in this case).|
|Dimension (role: time)|TIME_PERIOD|Time Point Set|The date on which an observation was made. It is not part of the series key, but is attached to the observation level.|

Measure
|Concept|Description|
|OBS_VALUE|The measured value.|

Attributes
|Concept (status)|Attachment level|Code list / format|Description|
|OBS_STATUS (mandatory)|Observation|CL_OBS_STATUS|The observation status, e.g. normal, estimated or forecast (normal, in this case).|
|OBS_CONF (optional)|Observation|CL_OBS_CONF|The observation confidentiality. All published data are unrestricted, but it is good practice to mention it anyway.|
|TIME_FORMAT (mandatory)|Series|Time Duration Set|ISO 8601-compliant way to describe duration (in this case, P1D).|
|COLLECTION (mandatory)|Series|CL_COLLECTION|When the information was collected, e.g. end of period (in this case, A – average of observations during period).|
|UNIT (mandatory)|Group|CL_UNIT|The unit used, e.g. RUB for Russian rouble.|
|UNIT_MULT (mandatory)|Group|CL_UNIT_MULT|Whether the data are in millions, billions, etc. (in this case, the value is “0” (unit)).|
|DECIMALS (mandatory)|Group|CL_DECIMALS|The number of decimal places.|
|TITLE_COMPL (mandatory)|Group|Up to 1050 characters|A human-readable title describing a certain group of data, e.g. ECB reference exchange rate, Australian dollar/euro, 2.15 p.m. CET.|
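As a rough illustration of how the attachment levels in Table 1 can be used, the Python sketch below distributes a flat record of attribute values over the group, series and observation levels and reports any mandatory attribute that is missing. The function names and the flat-record layout are hypothetical; only the attribute names, levels and statuses come from the table.

```python
# Attachment level of each attribute, transcribed from Table 1.
ATTACHMENT_LEVEL = {
    "OBS_STATUS": "Observation", "OBS_CONF": "Observation",
    "TIME_FORMAT": "Series", "COLLECTION": "Series",
    "UNIT": "Group", "UNIT_MULT": "Group",
    "DECIMALS": "Group", "TITLE_COMPL": "Group",
}
# OBS_CONF is the only optional attribute in Table 1.
OPTIONAL = {"OBS_CONF"}


def split_by_level(record):
    """Distribute a flat attribute record over the packaging levels."""
    levels = {"Group": {}, "Series": {}, "Observation": {}}
    for name, value in record.items():
        levels[ATTACHMENT_LEVEL[name]][name] = value
    return levels


def missing_mandatory(record):
    """Return the mandatory attributes absent from the record, sorted by name."""
    return sorted(set(ATTACHMENT_LEVEL) - OPTIONAL - set(record))


record = {
    "OBS_STATUS": "A", "TIME_FORMAT": "P1D", "COLLECTION": "A",
    "UNIT": "AUD", "UNIT_MULT": "0", "DECIMALS": "4",
}
levels = split_by_level(record)
print(levels["Series"])           # {'TIME_FORMAT': 'P1D', 'COLLECTION': 'A'}
print(missing_mandatory(record))  # ['TITLE_COMPL']
```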
Now that we have a good overview of the structure of the data that we want to publish on the website, we can formally define this structure using SDMX-ML. The Structure Definition format is used for this kind of task, as it contains the description of the structural metadata, such as key families, concepts and code lists.
Not including the initial XML declaration, the XML file (ecb_exr1_structure.xml) starts with the Structure element and the standard SDMX Header element:
<?xml version="1.0" encoding="UTF-8"?>
<Structure
    xmlns="http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message"
    xmlns:message="http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message SDMXMessage.xsd
                        http://www.SDMX.org/resources/SDMXML/schemas/v2_0/structure SDMXStructure.xsd">
  <Header>
    <ID>IREF000506</ID>
    <Test>false</Test>
    <Name>ECB structural definitions</Name>
    <Prepared>2006-10-25T14:26:00</Prepared>
    <Sender id="4F0"/>
  </Header>
The root element is defined (Structure), with the namespaces and schema definitions attached as attributes (SDMXMessage.xsd and SDMXStructure.xsd). The SDMX Message namespace is used by all other SDMX-ML namespace modules. It contains the common message constructs, including the common Header information (message ID, Prepared date, etc.). The id attribute of the Sender element specifies the sender of the data, which is the European Central Bank in this case (code 4F0 within the context of this data sender, taken from the code list CL_ORGANISATION).
Apart from the Header element, the XML file also contains the following three main elements, all belonging to the SDMX Structure namespace:
The Concepts element contains a list of concepts used to identify and describe the data. All the concepts used in the Data Structure Definition are included in this list. Each Concept element contains two attributes: the agencyID of the agency responsible for the concept (“ECB”) and the concept id (for example, “UNIT_MULT”). Both are identifiers and, as such, language-independent. The Name element contains a language-dependent description of the concept (as specified in the xml:lang attribute).
<Concept agencyID="ECB" id="COLLECTION">
  <Name xml:lang="en">Collection indicator</Name>
</Concept>
The CodeLists element contains a list of CodeList elements, each with two attributes: the agencyID of the agency responsible for the code list (“ECB”) and the code list id (for example, “CL_EXR_SUFFIX”). The Name element contains a description of the code list in a specific language. Each code list also contains a list of codes, with an attribute for the code value and a language-dependent description of the code. The code lists define the possible values taken by the dimensions and the coded attributes.
<CodeList agencyID="ECB" id="CL_EXR_SUFFIX">
  <Name xml:lang="en">Exch. rate series variation code list</Name>
  <Code value="A">
    <Description xml:lang="en">Average or standardised measure for given frequency</Description>
  </Code>
  <Code value="E">
    <Description xml:lang="en">End-of-period</Description>
  </Code>
</CodeList>
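A consumer of the Structure Definition file will typically want to turn such a code list into a lookup table. The following Python sketch does this for the CL_EXR_SUFFIX fragment shown above, using the standard library's ElementTree module. It assumes the fragment is parsed on its own, without namespace declarations; in the complete file the elements are namespace-qualified and the tag names would need the corresponding prefix. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

CODE_LIST = """<CodeList agencyID="ECB" id="CL_EXR_SUFFIX">
  <Name xml:lang="en">Exch. rate series variation code list</Name>
  <Code value="A">
    <Description xml:lang="en">Average or standardised measure for given frequency</Description>
  </Code>
  <Code value="E">
    <Description xml:lang="en">End-of-period</Description>
  </Code>
</CodeList>"""


def read_code_list(xml_text, lang="en"):
    """Return the code list id and a {code: description} mapping for one language."""
    root = ET.fromstring(xml_text)
    codes = {}
    for code in root.findall("Code"):
        for description in code.findall("Description"):
            if description.get(XML_LANG) == lang:
                codes[code.get("value")] = description.text
    return root.get("id"), codes


code_list_id, codes = read_code_list(CODE_LIST)
print(code_list_id)  # CL_EXR_SUFFIX
print(codes["E"])    # End-of-period
```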
Now that we have our list of concepts and code lists, we can start compiling our Data Structure Definition.
The KeyFamily element contains the agencyID of the agency responsible for defining the Data Structure Definition (“ECB”), the id for the Data Structure Definition (“ECB_EXR1”), a uri (we use the Data Structure Definition namespace for this) and the name of the Data Structure Definition in a specific language.
<KeyFamily agencyID="ECB" id="ECB_EXR1"
           uri="http://www.ecb.int/vocabulary/stats/exr/1">
  <Name xml:lang="en">Exchange Rates</Name>
  <Components>
Then the components of the Data Structure Definition are defined, starting with the dimensions.
    <Dimension conceptRef="FREQ" codelist="CL_FREQ" isFrequencyDimension="true"/>
    <Dimension conceptRef="CURRENCY" codelist="CL_CURRENCY"/>
    <Dimension conceptRef="CURRENCY_DENOM" codelist="CL_CURRENCY"/>
    <Dimension conceptRef="EXR_TYPE" codelist="CL_EXR_TYPE"/>
    <Dimension conceptRef="EXR_SUFFIX" codelist="CL_EXR_SUFFIX"/>
    <TimeDimension conceptRef="TIME_PERIOD"/>
Each Dimension element contains references to a descriptor concept and the code list from which the dimension value has to be taken. For example, the dimension which represents the concept of frequency takes its values from the CL_FREQ code list and, as such, can only take one of the following values: A (annual), B (business), D (daily), E (event), H (half-yearly), M (monthly), Q (quarterly) and W (weekly). The isFrequencyDimension attribute is attached to the dimension which represents the frequency (FREQ in this case). There can be only one such dimension per Data Structure Definition.
The TimeDimension is a special dimension that must be included in any Data Structure Definition which will be used for time series formats, such as GenericData, CompactData and UtilityData.
The order of declaration of the dimensions is important as it describes the order in which the dimensions will appear in the keys (with the exception of the time dimension, which is not part of the key).
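Because the key order is given by the order of declaration, it can be derived mechanically from the Structure Definition file. The Python sketch below reads the dimension declarations shown above and uses them to put the dimension values of a series in key order. It is an illustration only: the Components wrapper is parsed without namespaces, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

COMPONENTS = """<Components>
  <Dimension conceptRef="FREQ" codelist="CL_FREQ" isFrequencyDimension="true"/>
  <Dimension conceptRef="CURRENCY" codelist="CL_CURRENCY"/>
  <Dimension conceptRef="CURRENCY_DENOM" codelist="CL_CURRENCY"/>
  <Dimension conceptRef="EXR_TYPE" codelist="CL_EXR_TYPE"/>
  <Dimension conceptRef="EXR_SUFFIX" codelist="CL_EXR_SUFFIX"/>
  <TimeDimension conceptRef="TIME_PERIOD"/>
</Components>"""


def key_order(components_xml):
    """Return the concept ids of the key dimensions in declaration order."""
    root = ET.fromstring(components_xml)
    # findall() keeps document order; TimeDimension has a different tag name,
    # so the time dimension is excluded from the key automatically.
    dimensions = root.findall("Dimension")
    frequency = [d for d in dimensions if d.get("isFrequencyDimension") == "true"]
    if len(frequency) > 1:
        raise ValueError("more than one frequency dimension declared")
    return [d.get("conceptRef") for d in dimensions]


def series_key(series_attributes, order):
    """Pick the dimension values of a series in key order, ignoring attributes."""
    return tuple(series_attributes[name] for name in order)


order = key_order(COMPONENTS)
series = {"FREQ": "D", "CURRENCY": "AUD", "CURRENCY_DENOM": "EUR",
          "EXR_TYPE": "SP00", "EXR_SUFFIX": "A",
          "TIME_FORMAT": "P1D", "COLLECTION": "A"}
print(order)                      # ['FREQ', 'CURRENCY', 'CURRENCY_DENOM', 'EXR_TYPE', 'EXR_SUFFIX']
print(series_key(series, order))  # ('D', 'AUD', 'EUR', 'SP00', 'A')
```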
The Group element declares any useful groupings of data, such as sibling groups.
    <Group id="Group">
      <DimensionRef>CURRENCY</DimensionRef>
      <DimensionRef>CURRENCY_DENOM</DimensionRef>
      <DimensionRef>EXR_TYPE</DimensionRef>
      <DimensionRef>EXR_SUFFIX</DimensionRef>
    </Group>
Next we indicate which attribute will contain the measured value. Conventionally, it is associated with the OBS_VALUE concept.
Finally, we list the attributes. An Attribute element will contain such information as the concept used for the attribute, the attachment level, i.e. observation, group, series or dataset, and whether it is mandatory or conditional. Coded attributes will indicate from which code list the values should be taken, while, for uncoded attributes, a specific format may be specified using the TextFormat element. For attributes attached to the Group level, we specify the ID of the group to which the attributes are attached with an AttachmentGroup element. The concept of time format is identified by the isTimeFormat attribute having a value of true and is typically a mandatory series-level attribute, the value of which is taken from ISO 8601.
    <Attribute conceptRef="TIME_FORMAT" attachmentLevel="Series"
               assignmentStatus="Mandatory" isTimeFormat="true">
      <TextFormat textType="String" maxLength="3"/>
    </Attribute>
    <Attribute conceptRef="OBS_STATUS" attachmentLevel="Observation"
               codelist="CL_OBS_STATUS" assignmentStatus="Mandatory"/>
    <Attribute conceptRef="DECIMALS" attachmentLevel="Group"
               codelist="CL_DECIMALS" assignmentStatus="Mandatory">
      <AttachmentGroup>Group</AttachmentGroup>
    </Attribute>
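To show how these declarations might be consumed, the Python sketch below reads the three Attribute elements shown above into a small dictionary of specifications. The Components wrapper, the dictionary layout and the function name are assumptions made for this illustration, and the fragment is parsed without namespaces.

```python
import xml.etree.ElementTree as ET

ATTRIBUTES = """<Components>
  <Attribute conceptRef="TIME_FORMAT" attachmentLevel="Series"
             assignmentStatus="Mandatory" isTimeFormat="true">
    <TextFormat textType="String" maxLength="3"/>
  </Attribute>
  <Attribute conceptRef="OBS_STATUS" attachmentLevel="Observation"
             codelist="CL_OBS_STATUS" assignmentStatus="Mandatory"/>
  <Attribute conceptRef="DECIMALS" attachmentLevel="Group"
             codelist="CL_DECIMALS" assignmentStatus="Mandatory">
    <AttachmentGroup>Group</AttachmentGroup>
  </Attribute>
</Components>"""


def read_attributes(components_xml):
    """Map each attribute concept to its level, status and value constraints."""
    specs = {}
    for attribute in ET.fromstring(components_xml).findall("Attribute"):
        text_format = attribute.find("TextFormat")
        max_length = None
        if text_format is not None and text_format.get("maxLength"):
            max_length = int(text_format.get("maxLength"))
        specs[attribute.get("conceptRef")] = {
            "level": attribute.get("attachmentLevel"),
            "mandatory": attribute.get("assignmentStatus") == "Mandatory",
            "codelist": attribute.get("codelist"),          # None if uncoded
            "max_length": max_length,                       # None if coded
            "group": attribute.findtext("AttachmentGroup"), # None unless Group level
        }
    return specs


specs = read_attributes(ATTRIBUTES)
print(specs["TIME_FORMAT"]["max_length"])  # 3
print(specs["DECIMALS"]["group"])          # Group
# An uncoded value such as P1D can be checked against the declared length.
print(len("P1D") <= specs["TIME_FORMAT"]["max_length"])  # True
```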
As a last step, we use an XML-validating parser to validate our Structure Definition file and make sure that it is compliant with the SDMX-ML standard. You can use the validateStructure task supplied in the Ant build file to perform this step.
bash$ ant validateStructure
Buildfile: build.xml
validateStructure:
[xmlvalidate] 1 file(s) have been successfully validated.
BUILD SUCCESSFUL
Total time: 2 seconds
Now that we have a valid Structure Definition file, we can generate an XML schema file for our Data Structure Definition. The SDMX initiative offers some free tools to developers, one of which is an XSL file (StructureToCompact.xsl) that creates an XML schema out of a Structure Definition file for a selected Data Structure Definition.
To create our schema, we need to:
1. run an XSLT processor with our Structure Definition file (ecb_exr1_structure.xml) as the XML input file, the XSL file (StructureToCompact.xsl) and the desired filename for our XML schema file (e.g. ecb_exr1_compact.xsd) as parameters;
2. open the generated file in an editor and add the default namespace declaration to the xs:schema element.
You can use the generateCompactSchema task supplied in the Ant build file to perform this step.
bash$ ant generateCompactSchema
Buildfile: build.xml
generateCompactSchema:
[xslt] Processing ecb_exr1_structure.xml to ecb_exr1_compact.xsd
[xslt] Loading stylesheet StructureToCompact.xsl
BUILD SUCCESSFUL
Total time: 2 seconds
We now have an XML schema file (ecb_exr1_compact.xsd), which we will use to validate the XML data file before publishing it on our website.
It is now time to create our XML data file (ecb_exr1_compact.xml). This is normally done by extracting the data from a database and creating the XML data file.
When we open the file, we should recognise some similarities with the Structure Definition file, as all SDMX messages share some common constructs, namely the elements from the SDMXMessage namespace, such as the Header element.
<CompactData
    xmlns="http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message SDMXMessage.xsd">
  <Header>
    <ID>EXR-HIST_2006-11-29</ID>
    <Test>false</Test>
    <Name xml:lang="en">Euro foreign exchange reference rates</Name>
    <Prepared>2006-11-23T08:26:29</Prepared>
    <Sender id="4F0">
      <Name xml:lang="en">European Central Bank</Name>
      <Contact>
        <Department xml:lang="en">DG Statistics</Department>
        <URI>mailto:firstname.lastname@example.org</URI>
      </Contact>
    </Sender>
  </Header>
The message continues with the DataSet element, the highest possible level of grouping. A namespace, an XML schema and the dataset ID are added to the element.
  <DataSet xmlns="http://www.ecb.int/vocabulary/stats/exr/1"
           xsi:schemaLocation="http://www.ecb.int/vocabulary/stats/exr/1 ecb_exr1_compact.xsd"
           datasetID="ECB_EXR1">
The next level of grouping is the Group element. As it is a sibling group, it contains all dimensions (CURRENCY, CURRENCY_DENOM, EXR_TYPE and EXR_SUFFIX), except the frequency (FREQ), which is wildcarded, and the date/time information (TIME_PERIOD), which is attached to the observation level. In addition to the dimensions, it also includes the attributes which are attached to the Group level (DECIMALS, UNIT, UNIT_MULT and TITLE_COMPL).
    <Group CURRENCY="AUD" CURRENCY_DENOM="EUR" EXR_TYPE="SP00" EXR_SUFFIX="A"
           DECIMALS="4" UNIT="AUD" UNIT_MULT="0"
           TITLE_COMPL="ECB reference exchange rate, Australian dollar/Euro"/>
Lower down in the package grouping, we reach the Series level, which contains the same dimensions as the Group level, plus the frequency. It also includes the attributes which are attached to the Series level (TIME_FORMAT and COLLECTION).
<Series FREQ="D" CURRENCY="AUD" CURRENCY_DENOM="EUR" EXR_TYPE="SP00" EXR_SUFFIX="A" TIME_FORMAT="P1D" COLLECTION="A">
The Series element also contains the list of observations, at the lowest possible level in the package grouping. The observations contain the time dimension (TIME_PERIOD), the measured value (OBS_VALUE) and the attributes attached to the observation level (OBS_STATUS and OBS_CONF).
<Obs TIME_PERIOD="1999-01-04" OBS_VALUE="1.9100" OBS_STATUS="A" OBS_CONF="F" />
These last three elements (Group, Series and Obs) will be repeated for all groups of data to be published.
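On the data provider side, the Group, Series and Obs elements are usually produced by a program rather than by hand. The Python sketch below builds the DataSet fragment for the Australian dollar example with the standard library's ElementTree module. It is a minimal illustration under stated assumptions: the function and helper names are hypothetical, the message Header and the schemaLocation attribute are left out, and a real publisher would read the values from a database.

```python
import xml.etree.ElementTree as ET

NS = "http://www.ecb.int/vocabulary/stats/exr/1"
# Serialise the Data Structure Definition namespace as the default namespace.
ET.register_namespace("", NS)


def qualified(tag):
    """Return the namespace-qualified tag name expected by ElementTree."""
    return "{%s}%s" % (NS, tag)


def build_dataset(group, series, observations):
    """Build a DataSet with one Group, one Series and its observations."""
    dataset = ET.Element(qualified("DataSet"), {"datasetID": "ECB_EXR1"})
    ET.SubElement(dataset, qualified("Group"), group)
    series_element = ET.SubElement(dataset, qualified("Series"), series)
    for observation in observations:
        ET.SubElement(series_element, qualified("Obs"), observation)
    return ET.tostring(dataset, encoding="unicode")


key = {"CURRENCY": "AUD", "CURRENCY_DENOM": "EUR",
       "EXR_TYPE": "SP00", "EXR_SUFFIX": "A"}
xml_text = build_dataset(
    group=dict(key, DECIMALS="4", UNIT="AUD", UNIT_MULT="0",
               TITLE_COMPL="ECB reference exchange rate, Australian dollar/Euro"),
    series=dict(key, FREQ="D", TIME_FORMAT="P1D", COLLECTION="A"),
    observations=[{"TIME_PERIOD": "1999-01-04", "OBS_VALUE": "1.9100",
                   "OBS_STATUS": "A", "OBS_CONF": "F"}],
)
print(xml_text)
```

All attribute values are passed as strings, since that is how they appear in the data file; formatting OBS_VALUE to the number of decimals declared at group level is left to the caller.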
Now that we have our data file ready, before publishing it on our website, we should validate it using an XML-validating parser. You can use the validateData task supplied in the Ant build file to perform this step.
bash$ ant validateData
Buildfile: build.xml
validateData:
[xmlvalidate] 1 file(s) have been successfully validated.
BUILD SUCCESSFUL
Total time: 6 seconds
Once this has been done, the job of the data publisher is complete and we have successfully managed to publish an SDMX-ML data file on our website. We can now use it to retrieve and display exchange rate data.
Several XML technologies are available for processing an XML data file.
The XSL Transformation (XSLT) language: XSLT can be used to transform parts or all of the XML data file into other formats, such as (X)HTML, CSV, PDF, WML, etc. For instance, we could use this technology to create an (X)HTML table displaying the rates for all currencies at a specific time.
The Document Object Model (DOM): The DOM is an object-oriented model of an XML document, which represents it as a tree structure. You can use the DOM to read from and write to an XML data file. However, the DOM stores the entire document tree in its memory, which means that it is resource-intensive, especially for large XML data files. If all we need is a sequential read or a one-time selective read, the DOM might not be necessary.
The Simple API for XML (SAX): SAX uses an event-driven model and handles an XML file as a unidirectional stream of data. SAX parsing is usually faster and the memory footprint is much smaller than a DOM construct. SAX is limiting, however, since you cannot write to an XML data file, reading is unidirectional (so you cannot go back into the XML file if needed, and you have to start from the beginning again) and there is no representation of the structure of the XML data file.
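To give an idea of the DOM approach, the Python sketch below loads a small data file into memory with the standard library's minidom module and then navigates the tree to find one observation. The sample document is a cut-down version of the data file described in this tutorial (no Header, two series with one observation each), and the function name is hypothetical.

```python
from xml.dom import minidom

SAMPLE = """<CompactData xmlns="http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message">
  <DataSet xmlns="http://www.ecb.int/vocabulary/stats/exr/1" datasetID="ECB_EXR1">
    <Series FREQ="D" CURRENCY="AUD" CURRENCY_DENOM="EUR" EXR_TYPE="SP00"
            EXR_SUFFIX="A" TIME_FORMAT="P1D" COLLECTION="A">
      <Obs TIME_PERIOD="1999-01-04" OBS_VALUE="1.9100" OBS_STATUS="A" OBS_CONF="F"/>
    </Series>
    <Series FREQ="D" CURRENCY="USD" CURRENCY_DENOM="EUR" EXR_TYPE="SP00"
            EXR_SUFFIX="A" TIME_FORMAT="P1D" COLLECTION="A">
      <Obs TIME_PERIOD="2006-11-21" OBS_VALUE="1.2814" OBS_STATUS="A" OBS_CONF="F"/>
    </Series>
  </DataSet>
</CompactData>"""


def dom_get_rate(xml_text, currency, period):
    """Return OBS_VALUE for a currency and period, or None if not found."""
    # The whole document is parsed into an in-memory tree before any lookup.
    document = minidom.parseString(xml_text)
    for series in document.getElementsByTagName("Series"):
        if series.getAttribute("CURRENCY") != currency:
            continue
        for obs in series.getElementsByTagName("Obs"):
            if obs.getAttribute("TIME_PERIOD") == period:
                return obs.getAttribute("OBS_VALUE")
    return None


print(dom_get_rate(SAMPLE, "AUD", "1999-01-04"))  # 1.9100
```

The tree can be traversed in any direction and as often as needed, which is the flexibility paid for with the memory cost mentioned above.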
For the purpose of this tutorial, we will use SAX to display the exchange rate for a certain currency on a certain date, and XSLT to extract some data from the SDMX-ML data file and create an HTML table displaying the rates for all currencies at a specific time.
Before using the SDMX-ML data file, we must first validate it. Then we can use a SAX parser to extract the exchange rate for a specific currency at a specific point in time.
A SAX parser is event-based: it calls one method in the application when it encounters an element tag and another when it encounters some text, for example. It is therefore up to the developer to write the callback methods. Java provides classes and methods for working with SAX, but a tutorial on using SAX in Java is beyond the scope of this exercise.
Apart from the set-up code needed (see the main method in the SAXGetRate.java file), the startElement method shows us what we are looking for: when we find the series for the selected currency, we search for the observation that matches the given period and exit when the value has been found.
Just open a shell and try it out following this syntax:
bash$ java -cp tutorial.jar SAXGetRate ecb_exr1_compact.xml <currency> <period>
For example:
bash$ java -cp tutorial.jar SAXGetRate ecb_exr1_compact.xml USD 2006-11-21
Exchange rate value for USD on 2006-11-21: 1.2814
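The startElement logic described above can also be sketched in Python with the standard library's xml.sax module. This is not the code of SAXGetRate.java, only an illustration of the same idea: the class and function names are hypothetical, the sample document is a cut-down data file, and parsing is stopped early by raising an exception from the callback.

```python
import xml.sax

SAMPLE = """<CompactData xmlns="http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message">
  <DataSet xmlns="http://www.ecb.int/vocabulary/stats/exr/1" datasetID="ECB_EXR1">
    <Series FREQ="D" CURRENCY="AUD" CURRENCY_DENOM="EUR" EXR_TYPE="SP00"
            EXR_SUFFIX="A" TIME_FORMAT="P1D" COLLECTION="A">
      <Obs TIME_PERIOD="1999-01-04" OBS_VALUE="1.9100" OBS_STATUS="A" OBS_CONF="F"/>
    </Series>
    <Series FREQ="D" CURRENCY="USD" CURRENCY_DENOM="EUR" EXR_TYPE="SP00"
            EXR_SUFFIX="A" TIME_FORMAT="P1D" COLLECTION="A">
      <Obs TIME_PERIOD="2006-11-21" OBS_VALUE="1.2814" OBS_STATUS="A" OBS_CONF="F"/>
    </Series>
  </DataSet>
</CompactData>"""


class RateFound(Exception):
    """Raised from the callback to stop parsing once the value is known."""


class RateHandler(xml.sax.ContentHandler):
    def __init__(self, currency, period):
        super().__init__()
        self.currency = currency
        self.period = period
        self.in_series = False  # True while inside the series we want
        self.rate = None

    def startElement(self, name, attrs):
        if name == "Series":
            self.in_series = attrs.get("CURRENCY") == self.currency
        elif (name == "Obs" and self.in_series
              and attrs.get("TIME_PERIOD") == self.period):
            self.rate = attrs.get("OBS_VALUE")
            raise RateFound

    def endElement(self, name):
        if name == "Series":
            self.in_series = False


def get_rate(xml_text, currency, period):
    """Stream through the document and return the matching OBS_VALUE, if any."""
    handler = RateHandler(currency, period)
    try:
        xml.sax.parseString(xml_text.encode("utf-8"), handler)
    except RateFound:
        pass  # the rest of the document does not need to be read
    return handler.rate


print("Exchange rate value for USD on 2006-11-21:",
      get_rate(SAMPLE, "USD", "2006-11-21"))
```

Only one flag and one value are kept in memory, whatever the size of the data file, which is why SAX suits this kind of one-off selective read.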
We now want to generate an (X)HTML table containing the exchange rates for all the currencies at a specified time. We will also use the Structure Definition file to extract the names of the currencies so that we can display the full names, rather than just the currency codes.
Again, an introduction to XSLT is beyond the scope of this tutorial, but you will notice that the XSL script is fairly small. Apart from the output settings, we assign the value passed to the script for the desired period to a variable and we retrieve the list of currencies from the Structure Definition file (ecb_exr1_structure.xml). We then match the root of the SDMX-ML data file and output the basic HTML information (head, body, table, etc.). Finally, we match all series; for each series, we retrieve the observation value for the specified date and add it to the HTML table.
To generate the (X)HTML table, we run an XSLT processor with the SDMX-ML data file (ecb_exr1_compact.xml) as the input file, sdmxml2html.xsl as the XSL file and the desired filename for the HTML table (e.g. ecb_exr1_table.html) as parameters. You can use the generateHTMLTable task supplied in the Ant build file to perform this step.
bash$ ant generateHTMLTable
Buildfile: build.xml
generateHTMLTable:
[xslt] Processing ecb_exr1_compact.xml to ecb_exr1_table.html
[xslt] Loading stylesheet sdmxml2html.xsl
BUILD SUCCESSFUL
Total time: 5 seconds
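For readers who want to follow the logic of the transformation without an XSLT processor, the steps described above can be approximated in Python. The sketch below is not XSLT and not the content of sdmxml2html.xsl: it swaps in the standard library's ElementTree module, works on a cut-down data file, and uses an illustrative two-code extract of CL_CURRENCY parsed without namespaces. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

DATA_NS = "{http://www.ecb.int/vocabulary/stats/exr/1}"

# Illustrative extract of the currency code list (assumed content).
CURRENCY_CODES = """<CodeList agencyID="ECB" id="CL_CURRENCY">
  <Code value="AUD"><Description xml:lang="en">Australian dollar</Description></Code>
  <Code value="USD"><Description xml:lang="en">US dollar</Description></Code>
</CodeList>"""

DATA = """<CompactData xmlns="http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message">
  <DataSet xmlns="http://www.ecb.int/vocabulary/stats/exr/1" datasetID="ECB_EXR1">
    <Series FREQ="D" CURRENCY="AUD" CURRENCY_DENOM="EUR" EXR_TYPE="SP00" EXR_SUFFIX="A">
      <Obs TIME_PERIOD="1999-01-04" OBS_VALUE="1.9100" OBS_STATUS="A" OBS_CONF="F"/>
    </Series>
    <Series FREQ="D" CURRENCY="USD" CURRENCY_DENOM="EUR" EXR_TYPE="SP00" EXR_SUFFIX="A">
      <Obs TIME_PERIOD="2006-11-21" OBS_VALUE="1.2814" OBS_STATUS="A" OBS_CONF="F"/>
    </Series>
  </DataSet>
</CompactData>"""


def currency_names(code_list_xml):
    """Map each currency code to its description."""
    root = ET.fromstring(code_list_xml)
    return {code.get("value"): code.findtext("Description")
            for code in root.findall("Code")}


def html_table(data_xml, period, names):
    """Build an HTML table with one row per series observed on the given date."""
    rows = ["<tr><th>Currency</th><th>Rate</th></tr>"]
    for series in ET.fromstring(data_xml).iter(DATA_NS + "Series"):
        for obs in series.iter(DATA_NS + "Obs"):
            if obs.get("TIME_PERIOD") == period:
                code = series.get("CURRENCY")
                name = names.get(code, code)  # fall back to the code itself
                rows.append("<tr><td>%s</td><td>%s</td></tr>"
                            % (escape(name), escape(obs.get("OBS_VALUE"))))
    return "<table>\n" + "\n".join(rows) + "\n</table>"


table = html_table(DATA, "2006-11-21", currency_names(CURRENCY_CODES))
print(table)
```

Series without an observation for the requested date, such as the Australian dollar series in this sample, simply produce no row.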
We have now finished creating the two utilities that the data consumer needs and we have enough knowledge to be able to build fully-fledged SDMX-based software.
 For a detailed description of the SDMX information model, see Section 02 of the SDMX Standards Version 2.0 Complete Package. There is also a very useful introduction to the basic concepts at the end of this document.
 For a detailed description of SDMX-ML, see Section 03 of the SDMX Standards Version 2.0 Complete Package.
 It is similar to the official ECB_EXR1 key family but has been simplified slightly for the purpose of this exercise.
 SDMX schemas are available on the SDMX website and can also be downloaded at the following location: http://www.sdmx.org/data/2_0/SDMX_2_0_SECTION_03B_SDMX-ML_Schemas_and_Samples.zip . For the purpose of this exercise, the schema files have been downloaded and placed in the same directory as the files written for the tutorial.
 The tools can be downloaded (after registration) from: http://www.metadatatechnology.com/software/index.cgi
 Currently, the tools work with version 1.0 of the standard only. We have adapted the XSL code to work with version 2.0 of the standards as well. Where there are differences between the original and the ECB version, comments have been inserted.
 Owing to some technical limitations (you cannot set the attribute xmlns in XSLT 1.0), we need to perform this step after the XML schema file has been generated. In this tutorial, this step is performed manually.
 Multiple datasets may be merged in the same data file using an SDMX MessageGroup message.
 Since we expect you to be familiar with XML technologies, we have only provided a brief overview of these technologies. There are, of course, other technologies that could be used to process an XML data file, but in this tutorial we will limit ourselves to XSLT, SAX and the DOM, as these are probably the most frequently used technologies that are not language or platform dependent.
 We have also created a small DOM class that performs the same operation in order to compare how the two technologies perform for the same task.
 The code used here is very primitive and is merely offered as a simple example of how we can extract data from an SDMX-ML data file. It should by no means be considered to be quality code or at all useful for any purpose other than that described above.