ISO 15926 in OWL: Conventional Literals

The proposal for dealing with ISO 15926 literals in OWL (at time of writing) involves creating a "wrapper" object for the literal instantiated from its ISO 15926 class - the literal itself is linked to this as the object of an RDF relationship via a "content" predicate. Moreover, the literal object must have an identifier and must be unique within the RDF graph.

This page argues against that construction for the external OWL as used for a publishing format in RDS/WIP and for façades (SPARQL endpoints for interoperability), primarily on performance and security grounds. Rather, it argues that traditional RDF literals should be used here, and the (few) shortcomings resolved in other ways. At the very least it argues that, if literal objects are retained, they be represented as blank nodes and the uniqueness constraint be removed for use in triple-stores.

Original ISO 15926 Literals in OWL

In the original ISO 15926 literal proposition, a new object is instantiated for a string, so that the instance (class actually, not individual) can be used as the target of properties in a template instance:

<mysite:myTemplate rdf:ID="TPLI23432">

<mysite:nameProperty rdf:resource="#XSST_12345"/> ...

</mysite:myTemplate>

<part4:XmlSchemaString rdf:ID="XSST_12345">

<part2:content rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Hello, world!</part2:content>

</part4:XmlSchemaString>

The main structural advantage this provides is that it allows literals to be the subject of an RDF statement (not that they are actually used that way when a template is involved). It would also allow an extremely naïve RDF client to identify a common XML schema type for the literal content: however, note that more sophisticated clients could easily resolve an ISO 15926 class definition to find the XML schema type (which will be necessary anyway if someone introduces, for example, a floating point representation with a different binary mantissa and exponent precision to those currently supported by XML schema).

From an ISO 15926 part 2 point of view, it's also correct, so long as there is only one instantiated string object in the data set that has the same data type, language and text.

Conventional RDF Literals in OWL

What I would propose is that for the external format at least (that intended to be supported by façades for interoperability scenarios similar to those where you might use a service-oriented architecture (SOA) solution), it is better to use conventional RDF literals, like this:

<mysite:myTemplate rdf:ID="TPLI23432">

<mysite:nameProperty rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Hello, world!</mysite:nameProperty> ...

</mysite:myTemplate>

The reasoning for this is long, arduous and complicated, but I think ultimately sound, so long as the goal is performance and security. At the very least, if that cannot be accepted, then I would present this:

<mysite:myTemplate rdf:ID="TPLI23432">

<mysite:nameProperty>

<part4:XmlSchemaString>

<part2:content rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Hello, world!</part2:content>

</part4:XmlSchemaString> ...

</mysite:nameProperty>

</mysite:myTemplate>

making the string into a blank node and obviating the requirement for uniqueness.

The reasoning follows below (largely excerpts from emails circa September 2007 and January 2008).

Imposition of Uniqueness on Literals

Briefly, my background for this is that I built protocol stacks, key escrow systems and various other bits and pieces for a PKI company in the late 90s, so while I may not be an expert, I know something about security.

Here's the esoteric definition of the problem: if a literal must be unique, then it must also be public (* see addendum below) and it must also be searchable. Hence, all literals are unique, public* and searchable. Even if they are not searchable, they become searchable if their identifiers follow any kind of pattern, since they can be "tested" for with a query. This is different to RDF - in RDF, literals are only the object of a statement - you can only search the object if you can see the statement, so permissions to the statement limit visibility of the object. To restate this in another way, in ISO 15926 part 7, literals are represented by statements that are necessarily public and searchable (or testable if not searchable) and so you cannot limit their visibility.

* Addendum: theoretically, you can construct a projection of the triplestore that evaluates the permissions on a literal on the basis of the permissions to templates using that literal. This is possible, but the complexity it creates for evaluation is very high - I will explore this complexity later on this page. Secondly, it introduces problems for the generation of unique IDs if those IDs need to be tested for existence before provision (which is a normal failsafe for an ID generator algorithm), in that the test must be done in the base triplestore, not the projected triplestore - this requires fairly rigorous code controls (or the appropriate use of aspects) for a truly secure system. While it probably wouldn't matter that much in a system like the RDS/WIP, the necessary controls will introduce burdens in higher security implementations (again, not insurmountable, just something to consider).
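To make the "testing" point concrete, here is a minimal Python sketch. It assumes a toy in-memory set of triples rather than a real triplestore, and every identifier, predicate and string in it is hypothetical. It only illustrates that once wrapped literals carry patterned identifiers, they can be found by guessing, without ever seeing the template statements that use them.

```python
# Toy model: a triple store is a set of (subject, predicate, object) tuples.
# All identifiers, predicates and strings below are hypothetical.
store = {
    ("XSST_12345", "part2:content", "Hello, world!"),
    ("XSST_12346", "part2:content", "Well 7 shut in pending inspection"),
    ("TPLI23432", "mysite:nameProperty", "XSST_12345"),  # publicly visible template
    ("TPLI23433", "mysite:noteProperty", "XSST_12346"),  # restricted template
}


def content_of(literal_id):
    """Look up a wrapped literal directly by its identifier."""
    for subject, predicate, obj in store:
        if subject == literal_id and predicate == "part2:content":
            return obj
    return None


def probe(prefix, start, count):
    """Guess identifiers that follow a pattern and return those that exist."""
    found = {}
    for n in range(start, start + count):
        candidate = f"{prefix}{n}"
        text = content_of(candidate)
        if text is not None:
            found[candidate] = text
    return found


# The prober never touches the template statements, yet recovers both strings.
leaked = probe("XSST_", 12340, 10)
```

With conventional RDF literals there is no subject to guess: the string exists only as the object of the template statement, so hiding that statement hides the string.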

Reconstructing Data

As some background for just generally reconstructing data, there have been papers out there since the late 1970s on how to use statistical techniques for analyzing block length patterns in streaming data and to use that as a hint to the kind of data being transmitted.

That is to say, all the observer could see was the timing and the pattern of changing blocking factors as the stream passed. From that, they could automate guesses on what protocol was being used, who was client, who was server, what applications might have been used over that connection and so on. From a security POV, that is considered too much knowledge.

Some of the more sophisticated algorithms could even make pretty good guesses about large chunks of specific data being transmitted at different times (eg. login banner). That allowed better (ie. faster) attacks to be constructed on the keys to find the remaining data. That of course, is considered way too much knowledge.

So modern high security systems either transmit a constant stream (transmitting noise when there are no packets to send) or they get very creative with the blocking factors, splitting, padding and microsecond-level timing of the packets. I've built systems that did this for this very reason.

The whole point that I am making is that, given that even just the insert pattern of data can give a clue as to what might lie behind it, to actually be able to identify fragments of the data in a possible known exchange gives more ammunition to these statistical attacks.

To wrap up that topic, if unique literal instances are searchable or sequenced or otherwise sampleable, then that would be considered an unreasonable security hole for a shared system.

Practical Applications: Secret Integers - Key Components, Oil Well Logs

Regarding integers - private and public key pairs used in the RSA form of asymmetric cryptography are expressed using sets of large prime numbers. These are admittedly larger than the 18 decimal digits required by XML Schema integers. Typically, a larger integer representation is written as a list of regular integers, each with an understood number of significant bits based on a power of 2 (usually 32 or 64).
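As an illustration of that word-list representation, here is a small Python sketch. The function names `to_words` and `from_words` are my own, and the choice of most-significant-word-first ordering is an assumption, since real key formats fix their own ordering and width.

```python
def to_words(value, word_bits=32):
    """Split a non-negative integer into fixed-width words, most significant first."""
    if value < 0:
        raise ValueError("only non-negative integers are supported")
    mask = (1 << word_bits) - 1
    words = []
    while True:
        words.append(value & mask)  # take the lowest word_bits bits
        value >>= word_bits
        if value == 0:
            break
    return words[::-1]


def from_words(words, word_bits=32):
    """Reassemble the integer from its most-significant-first word list."""
    value = 0
    for word in words:
        value = (value << word_bits) | word
    return value


big = (1 << 100) + 12345          # a 101-bit integer
words = to_words(big)             # four 32-bit words
assert from_words(words) == big   # the round trip is lossless
```

Each word fits comfortably in an ordinary XML Schema integer, which is exactly why a run of such integers in a store can be recognised as one larger value.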

So for example, if you are storing key material in that way in a triple store, then any sequence to the triples may expose it, for example, insert sequence, identifier sequence. Once you have recovered a private key, you can sign documents as another entity, intercept their SSL connections, decode any private message sent to them etc. etc.

I admit that this is an artificial example - it is unlikely that a private key would be stored in the clear in any shared database reliant on access controls only to exclude access, but the point I am making is that integers can be sensitive, sometimes very sensitive.

Let's move this to a more concrete case - what about oil well pressure/flow data? This data is logged at regular intervals and could be considered proprietary information. Some people at the oil platform have access to it, some do not. But if you can sample the new floating point instances added to a triple store between the logging intervals, you have just uncovered the pressure/flow data (assuming for the sake of argument that the floating point values are high precision and therefore the instances do not get re-used frequently).
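The sampling attack described here amounts to a difference between two snapshots of the literal pool. The following Python sketch assumes the pool can be modelled as a dictionary from literal identifier to value. The `XSFL_` identifiers and the readings are invented for illustration.

```python
def sample_new(before, after):
    """Return the literal instances that appeared between two snapshots."""
    return {
        literal_id: value
        for literal_id, value in after.items()
        if literal_id not in before
    }


# Hypothetical floating point literal pool, observed before and after one
# logging interval. The observer has no access to the well-log templates.
snapshot_t0 = {"XSFL_1": 3512.4471, "XSFL_2": 0.8125}
snapshot_t1 = dict(snapshot_t0, XSFL_3=3498.9023, XSFL_4=0.7992)

recovered = sample_new(snapshot_t0, snapshot_t1)
```

Repeating the sample at each logging interval yields the series of readings, even though none of the statements that link the values to the well were ever visible.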

Access Control Policies and Encryption of Strings

Encryption of strings basically makes them unsearchable, so that's not really an option. It's better to use some sort of access control policy to limit visibility of literal content.

Really, any access control policy we come up with should be able to secure both literal content and the relationships they participate in (as well as other relationships of course). My personal belief is that the access control policy should be able to be implemented via an OWL projection of the full graph, just to simplify implementation.

How This Affects Class of Information Representation

Overview: as can be seen from the above, having sequenced IDs, or having public literals, allows so many different kinds of mathematical analysis that it's akin to having all of the text of your private database open for all to see and just relying on the structure of the database to confuse people - as someone who has reconstructed disk data from destroyed lookup blocks and reverse engineered database formats from looking at their files in a hex editor, I can say that this is no kind of protection. And that's not even taking into account cryptanalysis techniques for reconstructing data.

The first problem is strings - when you have a pool of strings, it may be that some of those strings contain text that is only intended for the eyes of a limited audience; while other strings contain text for the public. But if you mandate that strings are unique, you mandate that all strings are necessarily public (* see note above) - since you need to be able to test for the existence of a string before you create a new one, and you need to be able to get an identifier for a string that you create.

Let us say for the sake of argument, you construct some sort of algorithm for determining the visibility of a unique string based on the relationships in which it plays a part. That means, you have created a system where a string is only public if it is used in at least one public relationship. If these identifiers follow any sort of pattern, or if the query response order reveals any sort of sequence (even of only those strings that became public after being used as the target of a property in a publicly visible template instance), then you have instantly provided ammunition to me - that is, the person looking to reconstruct your data. The problem diminishes to some degree as the pool of public strings gets larger, but you won't get this solution past anyone with a security mindset. You can get some degree of insulation by making identifier sequence unpredictable, but it's only a partial solution.
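A visibility rule of that kind might look like the following Python sketch. The statement layout (a tuple with a public flag) and all identifiers are assumptions made for illustration, not anything defined by part 7.

```python
# Each entry: (template_id, property, literal_id, template_is_public).
# All identifiers are hypothetical.
statements = [
    ("TPLI_1", "mysite:nameProperty", "XSST_1", True),
    ("TPLI_2", "mysite:noteProperty", "XSST_1", False),
    ("TPLI_3", "mysite:noteProperty", "XSST_2", False),
]


def visible_literals(statements):
    """A literal is visible if at least one public template refers to it."""
    return {
        literal_id
        for _template, _prop, literal_id, is_public in statements
        if is_public
    }


public_pool = visible_literals(statements)
```

Note what the rule leaks even when it works as intended: XSST_1 is shared with a restricted template, and XSST_2 becomes visible the moment any public template reuses it, so the order and timing in which identifiers enter the public pool is itself information.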

The second problem is numbers - if I store a private key integer in a triple store as an addressable subject then it probably has between 1024 and 4096 bits. I could easily search for, or sample integer identifiers, to locate those integers which fall in this range and reasonably assume that they are asymmetric key material of some sort. Armed with this very nicely filtered subset of my problem space, I could launch an attack on a public-key encoded private message to entity X. Soon enough, I will have located X's corresponding private key and I can then set myself up to spoof them - for example, by a man-in-the-middle attack on SSL connections.

In summary, just about any value can be analyzed using statistical techniques to understand its relevance and approximate usage. This is especially the case with triplestores which typically order their statements by insertion. Not only can you sample for time (periodically check the database, see what new things turn up), you can rely on order - this allows the reconstruction of data, or at least some of it.

For this reason alone, I think that ClassOfInformationRepresentation instances should be at least blank nodes in OWL/RDF triplestores and their corresponding transport formats (RDF/XML, N3, N-Triples, SPARQL result sets).

If a ClassOfInformationRepresentation instance MUST (for the sake of what is being expressed) play a part in multiple statements, then it MAY be given an identifier (using IETF usage of the words MUST and MAY), on the understanding that the data is then in essence public.

A Perceived Problem of RDF Literals: Literal as Subject

Some people have stated that translations are a common example where a string is the subject of a statement - I dispute this, because I believe that that model for translations is atypical - taking from my scant knowledge of other languages, you cannot take a phrase or sentence in one language and provide a single phrase or sentence in another language that always satisfies it, meaning for meaning. Context can make very big differences. The rest of this section deals with this argument.

Typically, when building a localization catalogue, we isolate a specific meaning in a given context and then provide translations for it - the translations are not translations for each other, they are translations for the meaning. In this way, a translation does not have to be a subject in its own right. The meaning is the only thing given an identifier, not the strings themselves.

If we take something like "The cat is dead", it is a conveniently unproblematic sentence for English - we can probably translate it to other languages using a literal as subject approach.

But now, try something more difficult, like "the dog is worn out".

In English this could mean the dog tooth on a synchromesh cone in my motorcycle gearbox has been eroded beyond its engineering design tolerance, or it could mean that a shepherd's dog has been running around the field chasing sheep for too long and is panting too much to do any more work.

There is no guarantee that the Japanese word for the animal "dog" has the same ambiguity. So all you are saying by creating a translation for "the dog is worn out" in Japanese is that it is one possible translation for at least one context. When you have multiple target translations for any given language, you cannot deterministically select the correct translation without context.

Also, I want to back up my position by illustrating localization catalogues from a real-world application:

English catalogue entries:

gui.action.clear.button.text=Clear

gui.action.apply.button.text=Apply

German catalogue entries:

gui.action.clear.button.text=Löschen

gui.action.apply.button.text=Übernehmen

Notice the vast amount of contextual detail in the identifiers, versus the minimal amount of content in the translations? gui.action.clear.button.text is never presented to the user - it is used in the code to say "this is where to put a translation appropriate for the main presentation text of a generic 'clear button' in a graphical user interface". I'm not suggesting that the contextual detail needs to be contained within the identifiers, just that the meaning exists independently of the translations.
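In code, a catalogue like this is simply a lookup keyed by the meaning identifier. The Python sketch below reuses the entries above. The `translate` helper and its fall-back-to-English behaviour are my own assumptions, not part of the application quoted.

```python
# Catalogues keyed by language, then by meaning identifier.
catalogues = {
    "en": {
        "gui.action.clear.button.text": "Clear",
        "gui.action.apply.button.text": "Apply",
    },
    "de": {
        "gui.action.clear.button.text": "Löschen",
        "gui.action.apply.button.text": "Übernehmen",
    },
}


def translate(meaning_id, language, fallback="en"):
    """Resolve a meaning to its text in one language, or in the fallback."""
    catalogue = catalogues.get(language, {})
    if meaning_id in catalogue:
        return catalogue[meaning_id]
    return catalogues[fallback][meaning_id]


label = translate("gui.action.clear.button.text", "de")
```

The strings "Clear" and "Löschen" never refer to each other. Only the meaning identifier is ever the subject of a statement, which is why the translations can remain plain literals.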

I cannot stress enough how important context is, especially when you start dealing with languages like Chinese - in Chinese, the same two characters side by side (simplistically, two words) can have a very, very wide range of meanings, and context is all important in differentiating them. In Japanese, they do not have articles (the, a) or pronouns (he, she, they, it) in the usual sense, rather they employ a concept known as a "topic marker" - what parts are conventionally inferred and what parts are conventionally reiterated in other languages in a given context can also be very different. Or take Russian where there is only a single vestigial article that is only used for emphasis; Russian also has no present tense verb to be, except again a vestigial form for emphasis. How do you differentiate "a dog is here" from "the dog is here" when in Russian it's written as just "dog - here" formally, or "dog here" informally (using Russian words in Cyrillic though of course)?

A General Injunction Against the Requirement for Unique, Identified Literals

NRX is envisaging the case where a standards body provides definitions of various pieces of equipment using part 7. A supplier provides equipment specifications and references to maintenance routines, safety and handling documents and so on, perhaps referencing classifications from the standards body, certainly using templates defined by the standards body, again all using part 7. An Engineering Procurement and Construction (EPC) company might supplement that information with their own engineering details, also using part 7. A commissioning company might create equipment tags and add new data to that provided by the EPC, including for example, operating pressures and other control and monitoring data for the running equipment, also using part 7. The owner operator might add more data into their Asset Information Management (AIM) system, such as maintenance plan modifications, surrogate spares, use/criticality/reliability data and so on, also using part 7. They might plug that data into their Enterprise Resource Planning (ERP) system and integrate yet more data. We see ISO 15926 as the interoperability platform for all of this data.

So the issue is that once this data reaches an AIM system, or an ERP system, it is in fact the merging of data from many different sources. Some of these sources are public (eg. standards bodies, vendors), some are private or privileged (eg. commissioning data). Some of these sources are enduring (eg. standards bodies) some are transient (eg. an EPC project).

At the end of the day, the owner operator is collecting a large amount of data from many different sources and it is important to be able to identify the source of data, what is enduring, what is transient, what is public, what is private. It is also important that identifiers used in data do not collide.

Now, consider the case that a supplier adjusts an equipment specification datum, or an EPC continues to make modifications to a design during commissioning. Data is again flowing. That new data must correlate and "patch" the existing data in an efficient, and fault-free manner. If we need to rename identifiers, then we have just introduced a substantial complexity to that act of patching the data efficiently and flawlessly.

Not only must it "patch" the first system it encounters, it needs to patch every successive downstream system reliably and consistently. Not only that, but what about full circle actions? Isn't it possible that an ERP system might make direct contact with an OEM system using part 7 passed through this long chain? Isn't it possible that an AIM system might consult the original standard reference against which an OEM categorized their equipment for more details or as the basis for a search for substitute equipment?

So long story short, given that data is being continually merged from multiple sources, sometimes from the same source again and again, it is untenable for the identifiers to change or collide. Given that it is unreasonable to create a central registry of literals (for security, scalability and performance reasons), it is not far-fetched to say that it is advisable that literals are not identified at all, unless they absolutely must be for the relationships that they participate in.
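The merging problem can be shown with a toy Python sketch, again modelling graphs as sets of triples. It assumes two sources that each mint literal identifiers locally, with no shared registry. All identifiers and strings are hypothetical.

```python
# Two sources independently mint the same identifier for different strings.
source_a = {("XSST_1", "part2:content", "Pump P-101")}
source_b = {("XSST_1", "part2:content", "Valve V-7")}


def colliding_ids(*graphs):
    """Find literal identifiers bound to different content across graphs."""
    seen = {}
    collisions = set()
    for graph in graphs:
        for subject, predicate, obj in graph:
            if predicate != "part2:content":
                continue
            if subject in seen and seen[subject] != obj:
                collisions.add(subject)
            seen.setdefault(subject, obj)
    return collisions


clashes = colliding_ids(source_a, source_b)  # identifiers needing a rename

# The same data as conventional literals: merging is a plain set union,
# because the strings carry no identifiers that could clash.
plain_a = {("TPLI_A1", "mysite:nameProperty", "Pump P-101")}
plain_b = {("TPLI_B1", "mysite:nameProperty", "Valve V-7")}
merged = plain_a | plain_b
```

Every identifier in `clashes` would have to be renamed, and the rename propagated to every downstream system that has already received the data. The plain-literal merge has no such step.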

Note that the authors of part 7 have declared the above use case to be "out of scope" as far as what part 7 is intended to address. However, there should be some "high degree" interoperability solution for ISO 15926 data, and I believe that the RDS/WIP should store information in that form, since it will be the focal reference system of these other "highly interoperable" ISO 15926 use cases.

@todo

Stuff to be done - O notation for complexity of various different schemes to evaluate literal reachability from permitted templates.

Summary (unfinished)

I think (I hope anyway) that I've made a convincing argument against unique, identified literals in the externally facing, interoperability-focused ISO 15926 OWL representation. I think that a lot of it comes down to dealing with the merging, splitting and rejoining of RDF graphs from different sources. If we take the current state of part 7 as written, it makes the RDF representation very much a transitory state of the data. That's okay for people who have EXPRESS, but the attraction of RDF/OWL for the rest of us is that it is available, free and very active and we'd rather keep the data in RDF if we can, and have to massage it as little as possible. It's not so much a matter of "it can't work", but more a matter of "it's error-prone to make it work well and it will be slower".
