JARV User's Guide

$Id: JARV.html,v 1.4 2003/01/15 02:41:14 kkawa Exp $
Written by Kohsuke KAWAGUCHI

Table of Contents

  1. Introduction
  2. Obtaining implementations
  3. Architecture
  4. Using JARV
    1. Step 1: create VerifierFactory
    2. Step 2: compile a schema
    3. Step 3: create a verifier
    4. Step 4-1: perform validation
    5. Step 4-2: perform validation via SAX
  5. Advanced Topics
    1. Finding implementation at Run-time
    2. Fail-Fast Design
    3. Creating Verifier directly from VerifierFactory
    4. JAXP masquerading
    5. Thread Affinity
    6. MSV and Schema Language Auto Detection
  6. Examples
    1. Validating a bunch of files
    2. Multi-threaded example
    3. DOM validation
    4. SAX validation
    5. JAXP masquerading

Introduction

JARV is an implementation-independent interface set for validators developed by the RELAX community. There are several implementations available that support this interface.

Although it originally came from the RELAX community, JARV is not limited to RELAX; it can be used with many other schema languages. One of the advantages of JARV is that it allows you to use multiple schema languages with minimal changes to your code.

Obtaining implementations

First, you need the latest isorelax.jar file, which is available here.

Then, you need actual implementations. Currently, the following implementations are available (each name is followed by the schema languages it supports):

Sun Multi-Schema XML Validator
  RELAX NG/Core/Namespace, TREX, W3C XML Schema, XML DTD
Jing
  RELAX NG
Xerces-2
  W3C XML Schema
Swift RELAX Verifier for Java
  RELAX Core/Namespace

You need to set up those jars so that the class loader can find them.

Architecture

JARV consists of three components: VerifierFactory, Schema, and Verifier.

The VerifierFactory interface is the main interface between the implementation and your application. It has a method to compile a schema into a Schema object. The Schema interface is the internal representation of the schema. This interface is thread-safe, so you can have multiple threads access one Schema object concurrently. Also, this interface has a method to create a new Verifier object. The Verifier interface represents a so-called "validator"; it has a schema object in it and it validates documents by using that schema.

Using JARV

Step 1: create VerifierFactory

The first step is to create an instance of VerifierFactory. To do that, simply instantiate the VerifierFactory implementation class of the JARV implementation you want to use. For MSV, it will be:

VerifierFactory factory = new com.sun.msv.verifier.jarv.TheFactoryImpl();

To use Swift RELAX Verifier for Java:

VerifierFactory factory = new jp.xml.gr.relax.swift.SwiftVerifierFactory();

JARV is also capable of finding an implementation that supports a particular schema language at run-time. To learn more about this discovery mechanism, see "Finding implementation at Run-time" under Advanced Topics.

Step 2: compile a schema

Once you have a factory, you can use it to compile a schema by calling its compileSchema method.

Schema schema = factory.compileSchema("http://www.example.org/test.xsd");

This method can accept many types of input. For example, you can pass InputSource, File, InputStream, etc.

Schema objects are thread-safe. So even if you have more than one thread, you only need one instance of Schema; you can share that one instance with as many threads as you want.

Step 3: create a verifier

A Schema is just a compiled schema, so it cannot do anything by itself. A Verifier object performs the actual validation. To create a Verifier object, do as follows:

Verifier verifier = schema.newVerifier();

In this way, you can create a Verifier that checks documents against a particular schema.

Verifier is not thread-safe. So typically you want to create one instance per validation (or per thread).

Step 4-1: perform validation

Verifier has several methods to validate documents. One way is to call the verify method, which accepts a DOM tree, File, URL, etc., and returns whether the document is valid. For example, to validate a DOM document, simply pass it as an argument:

if(verifier.verify(domDocument)) {
  // the document is valid
} else {
  // the document is invalid (wrong)
}

This method will only give you a yes/no answer, but you can get more detailed error information by setting an error handler through the setErrorHandler method.

Just as a parser reports well-formedness errors through org.xml.sax.ErrorHandler, JARV implementations (like MSV) report validity errors through the same interface. In this way, you can get the error message, the line number that caused the error, etc. For example, in the following code, a custom error handler is set to report error messages to the client.

verifier.setErrorHandler( new MyErrorHandler() );
try {
  if(verifier.verify(new File("abc.xml"))) {
    // the document is valid
  } else {
    // the execution will never reach here because
    // if the document is invalid, then an exception should be thrown.
  }
} catch( SAXParseException e ) {
  // if the document is invalid, then the execution will reach here
  // because we throw an exception for an error.
}
...

class MyErrorHandler implements ErrorHandler {
  public void fatalError( SAXParseException e ) throws SAXException {
    error(e);
  }
  public void error( SAXParseException e ) throws SAXException {
    System.out.println(e);
    throw e;
  }
  public void warning( SAXParseException e ) {
    // ignore warnings
  }
}

If you throw an exception from the error handler, that exception will not be caught by the verify method; it propagates to the caller, so the validation is effectively aborted there. If you return from the error handler normally, then MSV will try to recover from the error and find as many errors as possible.
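
As a minimal sketch of the second behavior, the handler below returns normally from error() and keeps each reported exception so that all of them can be inspected afterwards. It uses only the standard org.xml.sax classes; the class name CollectingErrorHandler and its describe helper are hypothetical names chosen for this illustration and are not part of JARV or MSV.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

// Hypothetical helper (not part of JARV): records errors instead of throwing.
class CollectingErrorHandler implements ErrorHandler {
    private final List<SAXParseException> errors = new ArrayList<SAXParseException>();

    @Override
    public void warning(SAXParseException e) {
        // warnings are ignored, as in MyErrorHandler
    }

    @Override
    public void error(SAXParseException e) {
        // returning normally lets the reporter continue and look for more errors
        errors.add(e);
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXParseException {
        errors.add(e);
        throw e; // continuing after a fatal error is rarely meaningful
    }

    List<SAXParseException> getErrors() {
        return errors;
    }

    // Formats the location information carried by the exception.
    static String describe(SAXParseException e) {
        return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                + ": " + e.getMessage();
    }

    public static void main(String[] args) {
        CollectingErrorHandler handler = new CollectingErrorHandler();
        // simulate one report; a verifier would call error() itself
        handler.error(new SAXParseException("unexpected element", null, "abc.xml", 3, 7));
        for (SAXParseException e : handler.getErrors()) {
            System.out.println(describe(e));
        }
    }
}
```

An instance of such a handler could be passed to setErrorHandler in place of MyErrorHandler; after verify returns, getErrors() would hold whatever the implementation chose to report.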

Step 4-2: perform validation via SAX

Every JARV implementation supports validation via SAX2 in two ways.

The first one is a validator implemented as ContentHandler, which can be obtained by calling the getVerifierHandler method. This content handler will validate incoming SAX2 events, and you can obtain the validity through the isValid method. For example,

XMLReader reader = ... ; // get XML reader from somewhere
VerifierHandler handler = verifier.getVerifierHandler();
reader.setContentHandler(handler);
reader.parse("http://www.mydomain.com/some/file.xml");

if(handler.isValid()) {
  // the document is correct
} else {
  // the document is incorrect
}

The second one is a validator implemented as XMLFilter, which can be obtained by calling the getVerifierFilter method.

A verifier implemented as a filter, VerifierFilter, is particularly useful because you can plug it right in the middle of any SAX event pipeline.

Not only can you validate documents before you process them, you can also validate them after your application has processed them.

In the following example, a verifier filter is used to validate documents before your own handler processes them.

VerifierFilter filter = verifier.getVerifierFilter();
// create a new XML reader and setup the pipeline
filter.setParent(getNewXMLReader());
filter.setContentHandler( new MyApplicationHandler() );

// parse the document
filter.parse("http://www.mydomain.com/some/file.xml");
if(filter.isValid()) {
  // the parsed document was valid
} else {
  // invalid
}
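
The wiring used here (setParent on one side, setContentHandler on the other) is ordinary SAX filter plumbing. The sketch below reproduces the same topology with standard JDK classes only, so it runs without any JARV implementation: CountingFilter is a hypothetical stand-in that sits where a VerifierFilter would sit, and CountingHandler stands in for the application's own handler. Neither class validates anything; they only show that every event passes through the filter before it reaches the handler.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical stand-in for a VerifierFilter: passes events through and counts them.
class CountingFilter extends XMLFilterImpl {
    int seen = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        seen++;
        super.startElement(uri, localName, qName, atts); // forward to the next stage
    }
}

// Hypothetical stand-in for the application's handler at the end of the pipeline.
class CountingHandler extends DefaultHandler {
    int seen = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        seen++;
    }
}

class FilterPipelineDemo {
    // Parses the given XML text and returns { elements seen by the filter,
    // elements seen by the handler }.
    static int[] run(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            CountingFilter filter = new CountingFilter();
            filter.setParent(reader);              // upstream: the real parser
            CountingHandler handler = new CountingHandler();
            filter.setContentHandler(handler);     // downstream: the application
            filter.parse(new InputSource(new StringReader(xml)));
            return new int[] { filter.seen, handler.seen };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] counts = run("<a><b/><b/></a>");
        System.out.println("filter saw " + counts[0] + ", handler saw " + counts[1]);
    }
}
```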

SAX-based validation does not make much sense unless you set an error handler, because learning that the document was invalid only after you have processed it is too late.

To set an error handler, call the setErrorHandler method just as you did with the verify method.

VerifierFilter filter = verifier.getVerifierFilter();
verifier.setErrorHandler(new MyErrorHandler());
...
filter.parse(...);

In this way, you can abort the processing by throwing an exception in case of an error. If you are using VerifierFilter you can also set an error handler by calling the setErrorHandler method of the VerifierFilter interface.

Some JARV implementations (e.g., MSV, Jing, RELAX Verifier for Java) always run in the fail-fast manner. So as long as you set an error handler that throws an exception, it is guaranteed that your application will never see an incorrect document.

Advanced Topics

Finding implementation at Run-time

A simple, obvious way to create a VerifierFactory is to create a new instance of the appropriate implementation class (like com.sun.msv.verifier.jarv.TheFactoryImpl).

In this way, you decide the JARV implementation at compile time. Especially in the case of MSV, it is advantageous to do so because of its "multi-schema" capability. The MSV factory accepts a schema written in any of the supported languages, so you can change the schema language without changing your code at all.

However, there is one problem in this approach. Specifically, it locks you into a particular JARV implementation, so you need to change your code to use other JARV implementations.

For this reason, you may want to "discover" an implementation at run-time (just like you usually do with JAXP) by calling the static newInstance method of the VerifierFactory class. To do that, you need to pass the name of the schema language you want to use. This method will search the class path for an implementation that supports the given schema language and return its VerifierFactory.

VerifierFactory factory = VerifierFactory.newInstance(
  "http://relaxng.org/ns/structure/1.0");

Usually, the namespace URI of the schema language is used as the name. For the complete list, please consult the javadoc.

Fail-Fast Design

One of the problems of some validators (like the DTD validator in Xerces) is that they don't work in the fail-fast manner. This problem is unique to SAX.

What is "fail-fast"? A fail-fast validator flags an error as soon as the error is found. A non-fail-fast validator may let some part of an invalid document slip through (it will flag the error at a later point).

When you are using a non-fail-fast validator, you need to take extra care in writing your code because your code may be exposed to bad documents.

For example, consider the following simple DTD and a bad document:

<!ELEMENT root (a,b)*>
<!ELEMENT a    EMPTY>
<!ELEMENT b    EMPTY>

<root>
  <b/>  <!-- error -->
  <b/>
</root>

Surprisingly, in a typical non-fail-fast validator, the error will be signaled as late as the end-element event of the root element. So you have to make sure that your application behaves gracefully when it sees the wrong 'b'.

Typically, this defeats the purpose of validation, because you validate precisely to protect your application code from unexpected inputs.

Many JARV implementations (including MSV, Jing, RELAX Verifier for Java) are fail-fast validators, so they will signal an error at the start-element event of the first 'b'. This guarantees that the application will never see a wrong document.

Note that some other JARV implementations may be non fail-fast validators.
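
To see when a particular validator reports the error, you can log the SAX events and the error callbacks in the order they arrive. The following sketch does this for the DTD and document above, using the DTD validator of whatever JAXP parser is on the class path. It uses only standard JAXP and SAX classes; the class name EventOrderProbe is hypothetical, and the order of the logged entries depends on the parser in use, so no particular output is claimed here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical probe: records element events and validity errors in arrival order.
class EventOrderProbe extends DefaultHandler {
    // The DTD and the bad document from the text, with the DTD as an internal subset.
    static final String BAD_DOCUMENT =
          "<!DOCTYPE root [\n"
        + "<!ELEMENT root (a,b)*>\n"
        + "<!ELEMENT a    EMPTY>\n"
        + "<!ELEMENT b    EMPTY>\n"
        + "]>\n"
        + "<root><b/><b/></root>";

    final List<String> log = new ArrayList<String>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        log.add("start " + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        log.add("end " + qName);
    }

    @Override
    public void error(SAXParseException e) {
        // return normally so that parsing continues and later events are logged too
        log.add("ERROR " + e.getMessage());
    }

    // Parses the text with DTD validation turned on and returns the event log.
    static List<String> probe(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // enable DTD validation
            EventOrderProbe handler = new EventOrderProbe();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.log;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String entry : probe(BAD_DOCUMENT)) {
            System.out.println(entry);
        }
    }
}
```

If the ERROR entry appears after the "start b" entries, the application handler has already been shown the offending elements by the time the error is raised, which is the situation this section warns about.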

Creating Verifier directly from VerifierFactory

The VerifierFactory class has the newVerifier method as a short-cut. It is a short-cut in the sense that the following two code fragments have exactly the same meaning:

Verifier v = factory.compileSchema(x).newVerifier();

Verifier v = factory.newVerifier(x);

This is sometimes useful when you are using only one thread.

JAXP Masquerading

The JAXP masquerading feature is a wrapper implementation of JAXP. This wrapper enhances another JAXP implementation (such as Aelfred or Crimson) by adding JARV-based validation capability to it. Parsing is done by the wrapped JAXP implementation, and the JARV implementation adds advanced validation capability to it.

Because the result is used exactly like an ordinary JAXP implementation, this is often the easiest way to incorporate validation into your application.

To create a wrapped SAXParserFactory, do as follows:

Schema schema = /* compile schema */;
SAXParserFactory parserFactory = new org.iso_relax.jaxp.ValidatingSAXParserFactory(schema);

This will create a JAXP SAXParserFactory that validates every parsed document against the specified schema. Similarly, to create a wrapped DocumentBuilderFactory, do as follows:

Schema schema = /* compile schema */;
DocumentBuilderFactory dbf = new org.iso_relax.jaxp.ValidatingDocumentBuilderFactory(schema);

Once those instances are created, just use them as you use a normal JAXP implementation.

Thread Affinity

The VerifierFactory interface is not thread-safe. This basically means that you cannot use one object from two threads at the same time.

The Schema interface is thread-safe. So once you compile a schema file into a Schema object, it can be shared by multiple threads and accessed concurrently. This is useful at server-side, where multiple threads process client requests simultaneously.

The Verifier interface is again not thread-safe. Each thread needs its own copy of Verifier. Verifier objects are still re-usable, as you can use the same object to validate multiple documents one by one. What you cannot do is to validate multiple documents simultaneously.

The thread affinity of JARV is modeled on that of the TrAX API (the javax.xml.transform package). Familiarity with TrAX will help you understand JARV better.
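
For readers who do not know TrAX, the following sketch shows the same three-level pattern with the standard javax.xml.transform classes: a TransformerFactory used by a single thread, one Templates object shared by all threads, and one Transformer per thread. These play roles analogous to VerifierFactory, Schema, and Verifier respectively. The class name TraxAffinityDemo, the stylesheet, and the sample inputs are made up for this illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TraxAffinityDemo {
    // A tiny stylesheet that outputs the text content of the <doc> element.
    static final String XSLT =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'><xsl:value-of select='/doc'/></xsl:template>"
        + "</xsl:stylesheet>";

    // Compiles the stylesheet once, then transforms each input in its own thread.
    static String[] runInThreads(final String[] inputs) {
        try {
            // factory: used from this thread only (compare VerifierFactory)
            TransformerFactory factory = TransformerFactory.newInstance();
            // compiled form: safe to share between threads (compare Schema)
            final Templates templates =
                    factory.newTemplates(new StreamSource(new StringReader(XSLT)));

            final String[] results = new String[inputs.length];
            Thread[] threads = new Thread[inputs.length];
            for (int i = 0; i < inputs.length; i++) {
                final int n = i;
                threads[i] = new Thread(new Runnable() {
                    public void run() {
                        try {
                            // worker object: one per thread (compare Verifier)
                            Transformer transformer = templates.newTransformer();
                            StringWriter out = new StringWriter();
                            transformer.transform(
                                    new StreamSource(new StringReader(inputs[n])),
                                    new StreamResult(out));
                            results[n] = out.toString();
                        } catch (Exception e) {
                            throw new IllegalStateException(e);
                        }
                    }
                });
                threads[i].start();
            }
            for (Thread t : threads) {
                t.join(); // wait for every worker before reading the results
            }
            return results;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] out = runInThreads(new String[] { "<doc>one</doc>", "<doc>two</doc>" });
        System.out.println(out[0] + " / " + out[1]);
    }
}
```

With JARV the shape would be the same: compile the schema once, hand the Schema to each thread, and let each thread call newVerifier for itself.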

MSV and Schema Language Auto Detection

com.sun.msv.verifier.jarv.TheFactoryImpl automatically detects the schema language from the schema file. However, there is one important limitation. Currently, the detection of XML DTDs is based on the file extension. Specifically, if the schema name has the ".dtd" extension, it is treated as an XML DTD; otherwise, it is treated as a schema written in one of the other languages.

This causes a problem when you are passing InputStream as the parameter to the compileSchema method. Since InputStreams do not have names, they are always treated as non-DTD schemas.

To avoid this problem, wrap the stream in an InputSource and call the setSystemId method to set the system id. The following example shows how to do that:

InputSource is = new InputSource(
  MyClass.class.getResourceAsStream("abc.dtd") );
is.setSystemId("abc.dtd");

verifierFactory.compileSchema(is);

This ugly limitation comes from the difficulty of correctly distinguishing XML DTDs, which are written in a non-XML syntax, from other schema languages, which are written in XML syntax.

Any input on this restriction is very welcome.
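
The rule described in this section can be restated as a small predicate, which may help when checking in advance how a given input will be treated. This is only a paraphrase of the behavior described above, not MSV's actual code: the class and method names are hypothetical, and details such as case sensitivity are assumptions.

```java
import java.io.ByteArrayInputStream;
import org.xml.sax.InputSource;

// Hypothetical restatement of the ".dtd" rule; not taken from MSV.
class DtdNameRule {
    // Returns true if the source carries a system id that ends with ".dtd".
    // A source without a system id (e.g. a bare stream) never qualifies.
    static boolean treatedAsDtd(InputSource source) {
        String systemId = source.getSystemId();
        return systemId != null && systemId.endsWith(".dtd");
    }

    public static void main(String[] args) {
        InputSource is = new InputSource(new ByteArrayInputStream(new byte[0]));
        System.out.println(treatedAsDtd(is)); // no system id has been set yet
        is.setSystemId("abc.dtd");
        System.out.println(treatedAsDtd(is)); // the system id now ends with ".dtd"
    }
}
```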

Examples

If you need an example that is not listed here, please let me know so that I can add it in the next release.

Validating a bunch of files

Have a look at the SingleThreadDriver.java example in this zip file. It compiles a schema and obtains a verifier object, then uses the same verifier to validate multiple documents.

Multi-threaded example

Have a look at the MultiThreadDriver.java example in this zip file. This example first compiles a schema, then launches many threads and lets them share one schema object.

This example shows you how to use JARV in a multi-threaded environment and how you can cache a compiled schema in memory.

DOM validation

The following code shows how you can validate a DOM tree by using JARV.

import java.io.File;
import org.iso_relax.verifier.*;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

void f( Document dom, Document anotherDom ) throws Exception
{
  // create a VerifierFactory
  VerifierFactory factory = VerifierFactory.newInstance(
                       "http://relaxng.org/ns/structure/1.0");

  // compile a RELAX NG schema
  Schema schema = factory.compileSchema( new File("foo.rng") );

  // obtain a verifier
  Verifier verifier = schema.newVerifier();

  // check the validity of a DOM.
  if( verifier.verify(dom) ) {
    // the document is valid
  } else {
    // the document is not valid
  }

  // you can use the same verifier object to test multiple DOMs
  // as long as you don't use it concurrently.
  if( verifier.verify(anotherDom) ) {
    // ...
  }

  // or you can pass an Element to validate that subtree.
  // getFirstChild() returns a Node; the cast assumes it is an Element.
  Element e = (Element)dom.getDocumentElement().getFirstChild();
  if( verifier.verify(e) ) {
    // ...
  }
}

SAX validation

The following code shows how you can use JARV together with SAX.

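import java.io.File;
import org.iso_relax.verifier.*;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

void f( javax.xml.parsers.SAXParserFactory parserFactory ) throws Exception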
{
  // create a VerifierFactory with the default SAX parser
  VerifierFactory factory = VerifierFactory.newInstance(
                       "http://www.xml.gr.jp/xmlns/relaxCore");

  // compile a RELAX schema
  Schema schema = factory.compileSchema( new File("foo.rxg") );
  
  
  
  // obtain a verifier
  Verifier verifier = schema.newVerifier();
  
  // set an error handler
  // this error handler will throw an exception if there is an error
  verifier.setErrorHandler( new MyErrorHandler() );
  
  // get a XMLFilter
  VerifierFilter filter = verifier.getVerifierFilter();
  
  // set up the pipe-line
  XMLReader reader = parserFactory.newSAXParser().getXMLReader();
  filter.setParent( reader );
  filter.setContentHandler( new MyContentHandler() );
  
  
  // parse the document
  try {
    filter.parse( "MyInstance.xml" );
    // if the execution reaches here, the document was valid and
    // there was nothing wrong.
  } catch( SAXException e ) {
    // error.
    
    // maybe the document is not well-formed, or it's not valid
    // or some other reasons.
  }
}

JAXP Masquerading

The following code shows how you can use JARV via JAXP-masquerading.

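import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.iso_relax.verifier.*;
import org.iso_relax.jaxp.*;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

void f() throws Exception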
{
  // create a RELAX NG validator
  VerifierFactory factory = VerifierFactory.newInstance(
                       "http://relaxng.org/ns/structure/1.0");

  // compile a schema
  Schema schema = factory.compileSchema( new File("myschema.rng") );
  
  // wrap it into a JAXP
  SAXParserFactory parserFactory = new ValidatingSAXParserFactory(schema);
  
  // create a new XMLReader from it
  parserFactory.setNamespaceAware(true);
  XMLReader reader = parserFactory.newSAXParser().getXMLReader();
  
  // set an error handler
  // this error handler will throw an exception if there is a well-formedness
  // error or a validation error.
  reader.setErrorHandler( new MyErrorHandler() );
  
  // set the content handler
  reader.setContentHandler( new MyContentHandler() );
  
  
  // parse the document
  try {
    reader.parse( "MyInstance.xml" );
    // if the execution reaches here, the document was valid and
    // there was nothing wrong.
  } catch( SAXException e ) {
    // error.
    
    // maybe the document is not well-formed, or it's not valid
    // or some other reasons.
  }
}