14. XS2DTD internals

14.1. Specifications
14.2. DTD Handler documentation
14.3. Annex: in discussion stuff

The main goal of XS2DTD is to convert SchemaDoc XML Schema document models into DTD document models. Given the many differences between these two models and their respective limitations, it is critical to describe exactly what the conversion covers, what it does not cover, and how uncovered constructs are handled.

XS2DTD is part of SchemaDoc, whose objective is to provide a way to define an XML information model that is linked to, and managed with, its own documentation. With this in mind, the scope of the tool is to create, from a set of Schemas grouped under a SchemaDoc documentation, the relevant textual DTDs and, later, the DTD information within the documentation output.

This development is still needed today because many people do not use Schemas and many tools do not support them, especially in the structured document field.

Tools that convert XML Schema to DTD already exist, but they are not sufficient for our needs: they keep only the basic semantics of the information and do not produce readable, documented DTDs. Once a model is not only a validation tool but also a project component used as part of specifications, it must be readable. Moreover, once it is part of an editorial system, it must be possible to engineer it and derive from it, for example to go from a reference DTD to an editing DTD.

Therefore, the DTDs generated by a conversion must keep the order of the original schema file set and, more generally, the whole file architecture of the set of schemas and their relation to the SchemaDoc documentation.

Is it always possible to create a DTD? Certainly not. The designer has to decide whether to create a model compatible with both DTD and Schema, or with Schema only. The objective is therefore not a generic transformer but a transformer that meets the objectives defined above. For this reason, and because many constructs will not be converted, the specification needs a clear transformation policy that defines all the restrictions expected of the schemas.

The objective of this project is to create readable DTDs, based on the readability of the defined Schemas.

Another objective is to provide modular DTDs that can easily be packaged for a specific purpose. For example, DTD authors commonly define much of their content through parameter entities so that it can be redefined in a specific context. Schemas achieve the same with global types and groups. The objective is thus to map this engineering practice when creating DTDs.

The resulting DTDs need to be complete, practical and consistent with the input XML Schemas. They must be documented, since information about the generated DTDs is to be added to our documentation models.

DTDs can be generated with or without namespaces coded within object names.

The generated comments must be written in a chosen language (French or English in the first implementation). These comments report the ordering and the information lost during the conversion. Schema features that have no DTD equivalent are not converted but embedded as comments. All these comments are written to a log file and, if required, also appear in the generated DTDs, as close as possible to the objects concerned. This requirement is passed as a parameter.

PEs will often be used to express modularity and reusability. Conditional sections are not targeted for now, unless a specific requirement appears during the specification steps.

In addition to the generated DTDs, an OASIS-Open catalog managing all generated DTDs will be provided. Because each XML Schema file has a title in SchemaDoc, an entry is added to the catalog for every generated DTD, associating its public name with its URI.

Schemas must respect a quality charter in order to be properly converted into DTDs. Read Section 14.1.7, “Schema quality charter” for more details about this quality charter.

Because a SchemaDoc mechanism allows an existing DTD to be used in place of a Schema transformation, these DTDs must also respect a quality charter. Read Section 14.1.8, “DTD quality charter” for more details.

The XSD to DTD transformation goes through several steps:

A program loads a SchemaDoc document and generates an intermediate DTD Handler structure. This structure can then be dumped as a whole, or reused in a documentation context to output DTD fragments within the documentation itself.


  1. Standard Eclipse XSD loading.

  2. Extend the Eclipse XSD interface in order to resolve the information needed at schema level. The method used is a tree walker that generates all extended objects from the Eclipse XSD objects.

  • Generate a schema ID mechanism all along the schema and SchemaDoc structure

  • Provide schema object information to SchemaDoc output generation.

This includes:

  • identification of all objects of the schema (see also 11.1 Development steps)

  • type resolution, which answers questions such as “what is the ID of the complexType referenced in this schema?”.

Note that for this feature, exposing the XSD methods should be enough.

This breakdown structure allows future conversions into other formats, such as RELAX NG, to be integrated easily: only step 3 and the following steps would have to be changed in order to generate the appropriate format.

Limitation: currently, the structure described above is not fully implemented. Processes 2 and 3 are wrapped together. Integration will remove this limitation.

XS2DTD is a Java program mainly based on the Eclipse XSD schema library and on XML Java libraries such as DOM and XSLT.

It consists of Java source code, Java libraries (internal and external), and scripts to build and run the application.

Integrated within SchemaDoc V2, it uses 5 directories:

XS2DTD Java packages are part of fr.tireme..SchemaDoc.

The main Java components are the DTDHandler, the DTDProcessor and the XSD interfaces. Refer to Figure 11, “XS2DTD main components diagram” for more details.

A Javadoc of SchemaDoc is available at the end of this document.

The classes are deployed as follows:

  • XSD → Wrapped Eclipse XSD package


This section goes through the major Schema concepts and elements and describes the conversion with respect to the DTD limitations.

<element
  abstract = boolean : false
  block = (#all | List of (extension | restriction | substitution))
  default = string
  final = (#all | List of (extension | restriction))
  fixed = string
  form = (qualified | unqualified)
  id = ID
  maxOccurs = (nonNegativeInteger | unbounded) : 1
  minOccurs = nonNegativeInteger : 1
  name = NCName
  nillable = boolean : false
  ref = QName
  substitutionGroup = QName
  type = QName
  {any attributes with non-schema namespace . . .}>
  Content: (annotation?, ((simpleType | complexType)?, (unique | key | keyref)*))
</element>


Schema elements are mapped to DTD ELEMENT declarations of the same name, with explicit content or PEs, and to DTD ATTLIST declarations, with explicit attributes or PEs.

Local elements are made global (brought to the root level) and inserted just after the object they originate from. This globalization of local elements may cause conflicts. XS2DTD V1 does not resolve these conflicts; resolution is planned for a later release. For the moment, local elements are globalized and a warning is raised whenever a conflict is detected.
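The globalization rule can be illustrated with a minimal Java sketch. It is not the XS2DTD implementation: the class LocalElementGlobalizer and its methods are hypothetical, and it assumes that a "conflict" means two elements with the same name but different content models.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of XS2DTD): globalizes elements and reports name conflicts. */
class LocalElementGlobalizer {
    /** Element name -> content model of the first declaration seen, in insertion order. */
    private final Map<String, String> globals = new LinkedHashMap<>();
    private final List<String> warnings = new ArrayList<>();

    /** Registers an element (global, or local made global) under its name. */
    void declare(String name, String contentModel) {
        String existing = globals.putIfAbsent(name, contentModel);
        if (existing != null && !existing.equals(contentModel)) {
            // Assumption: same name with a different content model is a conflict.
            // As in V1, the conflict is only reported, not resolved.
            warnings.add("Conflict on element '" + name + "': '"
                    + existing + "' vs '" + contentModel + "'");
        }
    }

    Map<String, String> globals() { return globals; }

    List<String> warnings() { return warnings; }
}
```

In this sketch the first declaration wins and later conflicting declarations only produce a warning, which mirrors the "globalize and warn" behaviour described for V1.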

Empty schema elements are converted into EMPTY DTD elements.

The following schema objects are not handled: nillable, form, fixed, default, unique, key and keyref.

From the DTDHandler point of view, a schema element is an elementDecl containing a qName, a contentModel and some attributes. Read Section 14.2, “DTD Handler documentation” for more details.

Example :


<simpleType
  final = (#all | (list | union | restriction))
  id = ID
  name = NCName
  {any attributes with non-schema namespace . . .}>
  Content: (annotation?, (restriction | list | union))
</simpleType>

Schema simple types are represented as PEs and can be used in elements and/or attributes.

Every xs: datatype used in the schema must be declared as a PCDATA PE, for attribute use, content model use, or both. This guarantees that the PE exists when it is referenced later.

Simple types are converted differently depending on their use, that is, whether they are part of an element, an attribute or both:

For simple types included in elements:

Comments are added to the handler providing the simple type facet definition and/or derivation.

For simple types included in attributes:

Thus, any reference to a simple type will match one or the other of the PEs so declared.

From the DTDHandler point of view, a schema simpleType is of type contentModelType, which can be EMPTY, ANY, MIXED (= #PCDATA if there is no child), a choice, a sequence, or a reference to another contentModelType. Read Section 14.2, “DTD Handler documentation” for more details.

Derivations are allowed and applied to the PE. See Section 14.1.5.2, “Derivations” for more details.

Example:



Global attributes are mapped as global PEs.

Local attributes are converted into DTD ATTLIST declarations.

An attribute value can be CDATA, ID, IDREF, IDREFS, NMTOKEN, NMTOKENS, Enumeration or PE reference.

Each attribute has a use and a default or fixed value.

As for the values of the ‘use’ attributes, they are handled as follows:

The possible values of use are summarized below:

anyAttribute is not treated and a warning is raised.

Example:



The redefine mechanism makes it possible to redefine simple and complex types, groups, and attribute groups obtained from external schema files. Like the include mechanism, redefine requires the external components to be in the same target namespace as the redefining schema.

The redefine element acts very much like the include element, as it includes all the declarations and definitions from the included schema. The main difference is that the base type of a redefined component is the original component itself.

A redefinition is in effect an include element whose content is redefined inside the including element (the one that redefines a declared element).

In terms of conversion, since the ‘redefinable’ elements are all mapped through PEs, the redefinition is applied to the content of their PEs. In practice, redefining an object consists of creating new PEs, called redefining PEs, which override existing PEs. The redefining PEs have the same name as the overridden ones and must be declared before the inclusion call (the include).

In the DTDHandler, an attribute redef will give the ID of the redefined element and show its redefinition.

It is possible to derive redefined elements. As for extensions, the original PE’s content is used. As for restrictions, attributes are kept except the prohibited ones.

Read Section 14.1.5.2.3, “Particularities: Deriving redefined objects” for more details.

Example:


*) Implementation idea

---------------------

For restrictions, in my opinion there is not much to say: the standard mechanism should be used.

For extensions, it may be possible, to make things easier, to proceed in two steps:

1) use the classic extension mechanism:

    write into the dtdHandler a peDef whose contentModel includes a peRef to itself.

    Keep a record somewhere that this is a redefine concerning a given schema.

2) make a final pass over the dtdHandler looking for these cases:

    - for a peRef included in the peDef in question

        look up the peDef in the handler corresponding to the redefined schema

        replace the peRef with the content model of the peDef found.

It seems to me that this would be fairly simple, but I leave it to Mike to make his choices on this,

since he is the one who knows the algorithms involved best.
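The final pass proposed in step 2 can be sketched in Java as follows. This is only an illustration under stated assumptions: the class RedefineResolver is hypothetical, PEs are represented as plain strings rather than DTDHandler objects, and the content model of the redefined PE is inlined verbatim, parentheses included.

```java
import java.util.Map;

/** Hypothetical sketch of the final pass; names and string representation are assumptions. */
class RedefineResolver {
    /**
     * Replaces a self-reference (%name;) found in a redefining PE with the content
     * model of the PE of the same name taken from the redefined schema.
     *
     * @param name               name shared by the redefining and the redefined PE
     * @param redefiningModel    content model written in step 1, containing %name;
     * @param redefinedSchemaPes PE name -> content model, for the redefined schema
     */
    static String resolve(String name, String redefiningModel,
                          Map<String, String> redefinedSchemaPes) {
        String selfRef = "%" + name + ";";
        String original = redefinedSchemaPes.get(name);
        if (original == null || !redefiningModel.contains(selfRef)) {
            return redefiningModel; // nothing to resolve
        }
        // Inline the original content model in place of the self-reference.
        return redefiningModel.replace(selfRef, original);
    }
}
```

With these assumptions, resolving t1 defined as (%t1;,cm1) against a redefined schema where t1 is (cm2) gives ((cm2),cm1), which no longer refers to itself.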

As stated before, the conversion from Schema to DTD is not straightforward.

Some constraints, related for example to Namespaces, Mixed content or Ordering, are due to DTD limitations. Others, such as Collisions or Structure Management, stem from SchemaDoc requirements. All the constraints that have to be taken into account during the conversion are described and discussed in the following chapter.

Schema allows types to be derived by extension or restriction. This derivation is converted differently depending on the base type.

Several cases are identified and described below:

PE1 → xx_CONTENTMODEL (#PCDATA)

PE2 → xx_ATTRIBUTE (attribute declaration)

Examples: the following examples show complex type extensions based on two reference complex types, ct_sequence and ct_choice:

ct_sequence:

<xs:complexType name="ct_sequence">
  <xs:sequence>
    <xs:element name="t1"/>
    <xs:element name="t2"/>
  </xs:sequence>
</xs:complexType>

ct_choice:

<xs:complexType name="ct_choice">
  <xs:choice>
    <xs:element name="t3"/>
    <xs:element name="t4"/>
  </xs:choice>
</xs:complexType>

<xs:complexType name="ctseq_ext_sequence">
  <xs:complexContent>
    <xs:extension base="ct_sequence">
      <xs:sequence>
        <xs:element ref="new1"/>
        <xs:element ref="new2"/>
      </xs:sequence>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>

is converted as:

<!ENTITY % ctseq_ext_sequence '(%ct_sequence;,(new1,new2))'>

<xs:complexType name="ctchoi_ext_sequence">
  <xs:complexContent>
    <xs:extension base="ct_sequence">
      <xs:choice>
        <xs:element ref="new1"/>
        <xs:element ref="new2"/>
      </xs:choice>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>

is converted as:

<!ENTITY % ctchoi_ext_sequence '(%ct_sequence;,(new1|new2))'>

<xs:complexType name="ctseq_ext_choice">
  <xs:complexContent>
    <xs:extension base="ct_choice">
      <xs:sequence>
        <xs:element ref="new1"/>
        <xs:element ref="new2"/>
      </xs:sequence>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>

is converted as:

<!ENTITY % ctseq_ext_choice '(%ct_choice;,(new1,new2))'>

<xs:complexType name="ctchoi_ext_choice">
  <xs:complexContent>
    <xs:extension base="ct_choice">
      <xs:choice>
        <xs:element ref="new1"/>
        <xs:element ref="new2"/>
      </xs:choice>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>

is converted as:

<!ENTITY % ctchoi_ext_choice '(%ct_choice;,(new1|new2))'>

<xs:complexType name="ctseq_ext_attribute">
  <xs:complexContent>
    <xs:extension base="ct_sequence">
      <xs:attribute name="newA"/>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>

is converted as:

<!ENTITY % ctseq_ext_attribute_CMODEL '(%ct_sequence_CONTENTMODEL;)'>

<!ENTITY % ctseq_ext_attribute_Atts '(%ct_sequence_Attributes;, newA)'>
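The pattern shared by the four content model extensions above (a reference to the base PE, followed by the added group) can be reproduced with a few lines of Java. The class ExtensionPeBuilder and its parameters are hypothetical and only illustrate the shape of the generated PE; they are not taken from XS2DTD.

```java
import java.util.List;

/** Hypothetical sketch: builds the PE declaration for a complex type extension. */
class ExtensionPeBuilder {
    /**
     * @param peName    name of the derived complex type, used as the PE name
     * @param basePe    name of the base complex type PE
     * @param connector "," when the extension adds an xs:sequence, "|" for an xs:choice
     * @param added     names of the elements added by the extension
     */
    static String build(String peName, String basePe, String connector, List<String> added) {
        // The added particles form their own group, appended in sequence to the base PE.
        String addedGroup = "(" + String.join(connector, added) + ")";
        return "<!ENTITY % " + peName + " '(%" + basePe + ";," + addedGroup + ")'>";
    }
}
```

For instance, build("ctseq_ext_sequence", "ct_sequence", ",", List.of("new1", "new2")) yields the first declaration shown above.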

Recall that redefinitions have the same name as their base type. The derivation, or the reference to a group, therefore has to be resolved in order to avoid endless recursion. The resolution consists of copying the original content model of the redefined element into the redefining element.

For example, when a classic extension looks like:

complexType t1
  extension complexType t2
    add content model cm1

complexType t2
  content model cm2

the generated DTD would be:

<!ENTITY % t1 '(%t2;cm1)'>

But considering the re-entrance problem (recursivity), the reference to the PE is replaced with the PE’s content model itself.

Hence the following:

schema root
  redefine schema sub
    complexType t1
      extension complexType t1
        adding content model cm1

schema sub
  complexType t1
    content model cm2

is converted into:

root.dtd

<!ENTITY % t1 '(cm2,cm1)'>
    <!-- and not (%t1;cm1) !!! -->
<!ENTITY % sub SYSTEM "sub.dtd">
%sub;

sub.dtd

<!ENTITY % t1 '(cm2)'>

Pierre wrote: Why not have this instead:

root.dtd

<!ENTITY % t1 '(%t1_ToBeRedefined;,cm1)'>
<!ENTITY % sub SYSTEM "sub.dtd">
%sub;

sub.dtd

<!ENTITY % t1 '(cm2)'>
<!ENTITY % t1_ToBeRedefined '(cm2)'>

This only applies to extensions of complex types, to extensions of simple types with an enumeration, and to redefinitions of groups and attributes groups through explicit references to the included group (group ref=).

Restricting a redefined object does not imply any specific treatment. Refer to general restrictions for more details.

In Schemas, a declaration may appear after its use. For example, it is possible to declare an element of type foo before the type foo has been defined; foo can be declared later in the same document or in an included document.

Unfortunately, this is not possible in a DTD, as every PE must be declared before it is used.

In terms of the DTDHandler, this means that a peDef must be declared before the corresponding peRefs or, when appropriate, before the PE ‘include’ that includes the peRef. The very same model works for DTDs.

As for the DTDHandler, several cases, detailed below, need to be considered:

peDef is moved and inserted just before peRef.

peDef is moved and inserted just before the peDef ‘include’ and the ordering process is restarted for this schema .

Do not do anything but raise and log a warning and generate the DTDHandler as usual.

The ordering process is then just a matter of re-ordering schemas locally. This is a clear and identified limitation.
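The local re-ordering can be sketched in Java as below. The sketch covers only the rule "peDef is moved and inserted just before peRef" within a single schema; the class PeOrdering, the Decl type and the string-based detection of PE references are assumptions made for illustration, not the DTDHandler API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of local re-ordering: each peDef ends up before its first use. */
class PeOrdering {
    /** One declaration; peName is null when the declaration is not a peDef. */
    static final class Decl {
        final String peName;
        final String text;
        Decl(String peName, String text) { this.peName = peName; this.text = text; }
    }

    /** Matches PE references such as %basicType; (no space after the percent sign). */
    private static final Pattern PE_REF = Pattern.compile("%([A-Za-z_][\\w.\\-]*);");

    static List<Decl> reorder(List<Decl> input) {
        List<Decl> decls = new ArrayList<>(input);
        // Bound the number of passes so that cyclic references cannot loop forever.
        int maxPasses = decls.size() * decls.size() + 1;
        for (int pass = 0; pass < maxPasses && moveOne(decls); pass++) {
            // each successful pass moves exactly one peDef, then the scan restarts
        }
        return decls;
    }

    /** Moves the first peDef found after a declaration that references it. */
    private static boolean moveOne(List<Decl> decls) {
        for (int i = 0; i < decls.size(); i++) {
            Matcher m = PE_REF.matcher(decls.get(i).text);
            while (m.find()) {
                int j = indexOfPeDef(decls, m.group(1));
                if (j > i) {
                    decls.add(i, decls.remove(j)); // insert just before the user
                    return true;
                }
            }
        }
        return false;
    }

    private static int indexOfPeDef(List<Decl> decls, String name) {
        for (int k = 0; k < decls.size(); k++) {
            if (name.equals(decls.get(k).peName)) return k;
        }
        return -1;
    }
}
```

References to PEs that are not defined in the same list are left untouched, which corresponds to the situations where only a warning can be raised.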

The following example falls into the unsupported third case:

root.xsd:
<xs:include schemaLocation="sub.xsd"/>
<xs:complexType name="itemType">
  <xs:complexContent>
    <xs:extension base="basicType"> ... </xs:extension>
  </xs:complexContent>
</xs:complexType>

sub.xsd:
<xs:complexType name="basicType"> … </xs:complexType>
<xs:element name="item" type="itemType"/>

It produces the following unorganised DTDs:

root:
<!ENTITY % sub SYSTEM "sub.ent">
%sub;
<!ENTITY % itemType '(%basicType;,.....)'>

sub:
<!ENTITY % basicType '....'>
<!ELEMENT item %itemType;>

There are 3 kinds of Content type:

Several cases need to be considered:

Content model analysis is then mandatory to solve the collisions.

A warning is raised when collisions happen.

Types must be resolved to solve some cases. We must know if:

  • a group is used in a mixedContent

  • a simpleType is used for elements, for attributes or for both.

Please explain …

By default, a type content summation will create a choice group element into which the different content models are inserted. But there are exceptions …

One drawback is that conflicts may arise with respect to DTD semantics.

For example, the choice ((a,b,c)|(a,d))* is not DTD compliant, since the element a begins both alternatives and the content model is therefore not deterministic.

A better solution based on content model analysis might be considered in the future.

As for their attributes, the generalisation described in Section 14.1.5.7.2, “Attributes generalisation” is applied.

But there are of course exceptions that have to be treated …

Case 1: Mixed complex types cannot make references to other mixed types

Given the following example:

<!ENTITY % emphasis.Grp '(#PCDATA|emph|sub|sup)*'>
<!ENTITY % paraType '(#PCDATA|%emphasis.Grp;)*'>
<!ELEMENT title (%emphasis.Grp;)>

two PEs need to be created, emphasis.Grp_formixed and emphasis.Grp, such that:

<!ENTITY % emphasis.Grp_formixed 'emph|sub|sup'>
<!ENTITY % emphasis.Grp '(#PCDATA|emph|sub|sup)*'>
<!ENTITY % paraType '(#PCDATA|%emphasis.Grp_formixed;)*'>
<!ELEMENT title (%emphasis.Grp;)>

Note that the occurrence indicators and the parentheses around the formixed PE have disappeared. The same conversion applies to PEs of PEs, for example:

<!ENTITY % guil_formixed 'quote'>
<!ENTITY % emphasis.Grp_formixed 'emph|sub|sup|%guil_formixed;'>

Case 2: Mixed complex types cannot contain anything else but choices

The following is therefore forbidden:

<!ENTITY % emphasis.Grp '(emph,sub,sup)'>
<!ENTITY % paraType '(#PCDATA|%emphasis.Grp;)*'>
<!ELEMENT title (%emphasis.Grp;)>

as well as any of the following definitions of emphasis.Grp:

<!ENTITY % emphasis.Grp 'emph,sub,sup'>
<!ENTITY % emphasis.Grp '(emph|sub|sup)'>
<!ENTITY % emphasis.Grp 'emph|sub*|sup'>

But it is possible to overcome this by using 2 PEs such that:

<!ENTITY % emphasis.Grp '(emph,sub,sup)*'>
<!ENTITY % emphasis.Grp_formixed 'emph|sub|sup'>
<!ENTITY % paraType '(#PCDATA|%emphasis.Grp_formixed;)*'>

Warning: this rule propagates to every level.

A wrong conversion would be:

<!ENTITY % eqn.Grp '(intro,body)'>
<!ENTITY % baseElements.Grp '(emph|sub|sup|%eqn.Grp;)'>
<!ENTITY % paraType '(#PCDATA|%baseElements.Grp;)*'>

The correct conversion is rather:

<!ENTITY % eqn.Grp '(intro,body)'>
<!ENTITY % eqn.Grp_formixed '(intro|body)'>
<!ENTITY % baseElements.Grp '(emph|sub|sup|%eqn.Grp_formixed;)'>
<!ENTITY % paraType '(#PCDATA|%baseElements.Grp;)*'>

Case 3: Derived mixed complex types

It is possible to derive complex types, originally mixed or not, with a mixed content.

<xs:element name="para">
  <xs:complexType mixed="true">
    <xs:complexContent>
      <xs:extension base="paraType"> … extension … </xs:extension>
    </xs:complexContent>
  </xs:complexType>
</xs:element>

In that case, the rules described above are applicable to paraType.

Case 4: Non simple choice content models used in mixed contents

This case is a simple addition to the two previous ones. Any content model other than a simple choice without occurrence indicators is resolved as in case 2 (even a choice*).

Thus, we cannot have:

<!ENTITY % emphasis.Grp '(emph|sub|sup)*'>
<!ENTITY % paraType '(#PCDATA|%emphasis.Grp;)*'>

but rather:

<!ENTITY % emphasis.Grp '(emph|sub|sup)*'>
<!ENTITY % emphasis.Grp_formixed 'emph|sub|sup'>
<!ENTITY % paraType '(#PCDATA|%emphasis.Grp_formixed;)*'>
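The derivation of a _formixed PE value from an arbitrary content model, as used in the cases above, can be sketched in Java. The class MixedPe is hypothetical and works on the textual content model only; it assumes that every PE referenced inside the model also has a _formixed variant.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical sketch: flattens a content model into a bare choice usable in mixed content. */
class MixedPe {
    static String forMixed(String contentModel) {
        // Drop parentheses, occurrence indicators and whitespace; sequences become choices.
        String flat = contentModel.replaceAll("[()?*+\\s]", "").replace(',', '|');
        Set<String> items = new LinkedHashSet<>(); // keeps order, removes duplicates
        for (String item : flat.split("\\|")) {
            if (item.isEmpty() || item.equals("#PCDATA")) {
                continue; // #PCDATA is supplied by the mixed content model itself
            }
            if (item.startsWith("%") && item.endsWith(";")) {
                // Assumption: referenced PEs are replaced by their _formixed variant.
                item = item.substring(0, item.length() - 1) + "_formixed;";
            }
            items.add(item);
        }
        return String.join("|", items);
    }
}
```

For example, forMixed("(emph,sub,sup)*") and forMixed("(#PCDATA|emph|sub|sup)*") both return emph|sub|sup, the value given to emphasis.Grp_formixed in the examples.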

Case 5: Parentheses

DTD does not allow two opening parentheses in a row in front of #PCDATA, as would result from:

<!ENTITY % paraType '(#PCDATA|italic|%emphasis.Grp;)*'>
<!ELEMENT para (%paraType;)>

The XSLT transformer handles this problem as follows:

<xsl:template match="dtd:mixed">
  <xsl:choose>
    <xsl:when test=".//dtd:*">
      <xsl:text>(#PCDATA</xsl:text>
      <xsl:apply-templates/>
      <xsl:text>)*</xsl:text>
    </xsl:when>
    <xsl:otherwise>
      <xsl:text>(#PCDATA)</xsl:text>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

<xsl:template match="dtd:mixed/dtd:elemRef|dtd:mixed/dtd:peRef">
  <xsl:text>|</xsl:text>
  <xsl:value-of select="name of the object"/>
</xsl:template>

Since DTDs do not support namespaces, XS2DTD maps namespaces by creating xxx.nsdefs.ent files, which are included in every generated DTD (see 5.10).

The dumper is mainly used to dump the structure to DTD files. The documentation generation process will nevertheless also use it: when reference documentation is provided, it must present the same DTD information as the DTD files.

However, managing documentation is not the responsibility of the xs2dtd program. The people responsible for integrating the program into the SchemaDoc environment will therefore decide on the best way to use the dumper for documentation needs.

In parallel, an XML Catalog is generated. The catalog maps the information in an XML external identifier to a URI reference for the desired resource. When a file is moved, only the catalog has to be modified. The catalog is built when the DTDs are generated, using the model names from SchemaDoc.
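As an illustration of the catalog generation, the following Java sketch writes one public entry per generated DTD. It assumes the OASIS XML Catalogs syntax (catalog and public elements); the class CatalogWriter and the public identifier used in the usage note are hypothetical, and the actual format produced by XS2DTD may differ.

```java
import java.util.Map;

/** Hypothetical sketch: writes an OASIS XML Catalog mapping public names to URIs. */
class CatalogWriter {
    static String write(Map<String, String> publicIdToUri) {
        StringBuilder sb = new StringBuilder();
        sb.append("<catalog xmlns=\"urn:oasis:names:tc:entity:xmlns:xml:catalog\">\n");
        for (Map.Entry<String, String> e : publicIdToUri.entrySet()) {
            // One entry per generated DTD: public name -> URI of the DTD file.
            sb.append("  <public publicId=\"").append(escape(e.getKey()))
              .append("\" uri=\"").append(escape(e.getValue())).append("\"/>\n");
        }
        return sb.append("</catalog>\n").toString();
    }

    /** Minimal escaping for attribute values. */
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }
}
```

Calling write with a map such as {"-//EXAMPLE//DTD Sample//EN" -> "sample.dtd"} produces a catalog in which only the uri attribute has to change when the DTD file is moved.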

TBD

Element xs2dtdHandler

— structure handling information from a Schema: contains only parameter entity definitions and element declarations

[Diagram: globalProperties, declarations]

Element globalProperties

— Properties for the main structure

[Diagram: xsProperties, sdProperties]

Element xsProperties

— Properties extracted from XML Schema file relevant to the xs namespace itself

[Diagram: xsFileProperties, targetNamespace, noTargetNamespace, qualificationScope]

Element noTargetNamespace

— The schema has no target namespace


Element sdProperties

— Properties extracted from XML Schema file relevant to the SchemaDoc namespace

[Diagram: sdProperty]

Element declarations

— All declarations: parameter entities and element declarations; comments (from xs annotation) may also be found.

[Diagram: peDef, peType, elementDecl, attributesDecl, peRef, qNameType, comment, notation]

Element peDef

— Parameter Entity declaration. May be a simple content model (from simpleType), a complex content model (from complexType, group), attributes declarations (complexType, attribute, attributeGroup)

Element elementDecl

— element declaration

[Diagram: qName, qNameType, schemaOrigin, comment, contentModel, contentModelType, attributes, attributesType]

Element attributesDecl

— in the case of a complex type redefinition, a new attribute is defined. In this case, a peDef is generated, together with an attribute declaration (for an elemRef) containing a peRef. The attributes declaration refers to an element declared elsewhere (referenced by elementRef).

[Diagram: elementRef, qNameType, schemaOrigin, attributes, attributesType]