You are here

Uniqueness

9 September, 2015 - 10:08

The assignment of unique identifiers to entities is often the ideal way of distinguishing one entity from another. There are several ways to create a unique identifier. One way to create unique identifiers is to use a combination of existing attributes whose data values, taken together, are certain to be unique within a given data management context.

EX. WM-9:

The approach that we have used to uniquely identify locations for temperature readings has been to combine location names with geographic coordinates. While technically sufficient, we saw that it was less than ideal in terms of usability. Users would be able to distinguish between identical location names (e.g. Albany City Centre) using the coordinates, which would have to be different; however, coordinates would not be the most intuitive type of data for must users.

EX. TR-1:

Suppose we are designing a data management system to handle train reservations. One data modeling task would be to represent the trains. What we mean by “trains” in this context will determine what combination of attributes might be a suitable unique identifier.

Suppose we take “train” to mean a route between an origin and a destination. In this case, we could achieve uniqueness by combining attribute values for origin and destination. This assumes that our train network does not have duplicate city names. A schema suitable for uniquely identifying train routes might be the following: trains (origin, destination). Note: the use of bold within a schema indicates the attributes that make up a unique identifier or key. The following is an example data collection that follows this schema:

<Montréal, Halifax><Halifax, Montréal><Montréal, Ottawa><Ottawa, Montréal>

Note, that this schema has the advantage of yielding a unique identity for reciprocal routes. For example, <Montréal, Halifax> ≠ <Halifax, Montréal> and <Montréal, Ottawa> ≠ <Ottawa, Montréal>.

What if our train service offers multiple trips on the same route? For example, there are at least two <Montréal, Ottawa> trips each day, in both directions. In this case, origin and destination would not guarantee uniqueness. The condition that prevents uniqueness – multiple trips – points to one possible solution, which is to use those data values that distinguish multiple trips on the same route. In this case it is time. Thus, we might modify our schema to the following: trains (origin, destination, time_of_day).

We may find situations where it is important to distinguish between the same train route on different days of the week or specific dates. If a passenger orders two round trips between Montréal and Ottawa on different days, we might record those reservations as follows in our database:

<”Jim Smith”, Montréal, Ottawa, 08:00, 2007-09-01><”Jim Smith”, Ottawa, Montréal, 20:00, 2007-09-01><”Jim Smith”, Montréal, Ottawa, 08:00, 2007-09-07><”Jim Smith”, Ottawa, Montréal, 20:00, 2007-09-07>

As we can infer from the data values that origin, destination, time_of_day, and date must be combined to uniquely identify a route; we might add the passenger_nameto attempt overall uniqueness of each record. Why does that not work?

It is easy to see that the combining of attributes can easily become tedious. We quickly went from juggling two attributes to four attributes. If we add the passenger_nameattribute, we are still not guaranteed uniqueness in certain cases. What are they?

Thus, we need another approach to creating unique identifiers.

The use of multiple attributes to guarantee uniqueness can become quite messy. Often the most reliable approach to creating unique identifiers is to use a single, special attribute to which values are assigned that are guaranteed to be unique. Such an attribute is usually called a key or object identifier. One approach is simply to assign a unique number as the identifier to each entity.

EX. WM-10:

Instead of using a combination of location names and geographic coordinates to distinguish between the locations where a temperature reading was recorded, we could devise a set of unique identifiers for each station. Thus, if we have location names that are the same, like Albany, New York and Albany, Georgia, we can create a unique code for each. This is actually the approach taken by weather monitoring networks around the world . For example, Environment Canada assigns a unique 5 digit code, called an “index number,” to each of the monitoring stations in its network. As an example, the stations listing “Fredericton” in its name are recorded as follows [EnvironmentCanada2007]:

Index Number

Station Name

Lat

Lon

Other attributes …

71668

FREDERICTON CDA CS, NB

45 55N

66 36W

71700

FREDERICTON A, NB

45 52N

66 32W

 

Note that there are two stations in Fredericton with different geographic coordinates, each having its own Index Number attribute value. The other parts of the name have some meaning to those in the Weather Office. Index Number 71700 represents the Fredericton Airport (cf. A Web site such as http://maps.google.com and enter the coordinates).

Manually working out unique identifiers for each weather monitoring station works since there are a finite and slowly growing number of them. In the Weather monitoring example, however, we can imagine other entities that will have to be modeled for which this method of assigning unique identities is not feasible. There is a rapidly growing collection of individual temperature readings from each site in the network. It would not be possible for humans to manually assign unique identifiers to each temperature reading record that is stored. We would need to have our data management system generate unique identifiers somehow.

EX. TR-2:

It is common in transportation networks to assign a unique number to each route. People in a given location often know the routes by these numbers. “If you want to go to Ottawa, take the 35, 37, or 39,” a Montréal resident might say. Recall, that the issue we had with uniquely identifying train routes had not only to do with unique <origin, destination> pairings, but times of day. Thus, VIA Rail assigns a unique number to each <origin, destination, time_of_day> combination. Thus, the Montréal to Ottawa route has unique numbers for each time of day. An example of the routes listed previously in EX. TR-1 with their unique “Train numbers” is as follows:

Train Number

Route name

Origin

Destination

Departure Time

35

Montréal-Ottawa

Montréal

Ottawa

15:05

37

Montréal-Ottawa

Montréal

Ottawa

16:45

39

Montréal-Ottawa

Montréal

Ottawa

18:00

14

The Ocean

Montréal

Halifax

18:30

15

The Ocean

Halifax

Montréal

12:35

34

Ottawa-Montréal

Ottawa

Montréal

15:10

36

Ottawa-Montréal

Ottawa

Montréal

16:25

38

Ottawa-Montréal

Ottawa

Montréal

17:55

 

As we discussed above in Weather monitoring example EX. WM-10, the hand assignment of unique train route codes is feasible because there are a fixed number of routes and new routes are added infrequently. The train reservations, on the other hand, will not be amenable to this approach. The reasons are the same as given for the temperature readings. Need to have our data management system generate unique identifiers somehow.

Question to the reader:

Some trains do not travel each day of the week. How would you model that?

It has been seen in EX. WM-10 and EX. TR-2 that sometimes uniqueness can be worked out manually. Often times, it is necessary to automate the process of assigning unique identifiers. There are a number of ways this can be done. Software developers can create services that give each requester a unique identifier. Thus, each time a new entity is created, a unique identifier is requested from the service. A few basic ideas about how this can be done are given in the following examples. In some cases it is not necessary to write new software to generate unique identifiers since most database management systems provide services for doing this.

EX. WM-11:

As was discussed in example EX. WM-10, we must somehow generate unique identifiers for each temperature reading. One approach is to simply ask the database management system used to store our temperature readings to generate a new unique identifier each time we create a new temperature reading record. That is, each time we store a new record in our collection, the database management system software would be instructed to insert a unique piece of information in that new record. A common method for a database management system to do this is to simply keep a counter. Each time a new record is created, the counter is incremented by 1 and the new value is given. Using this method, we might have something like the following for our temperature readings:

Reading Number

Station Index Number

Date

Time

Temperature

1

71668

2007-08-14

08:00 AM ADT

-10

2

71627

2007-08-14

07:00 AM EDT

-15

3

71700

2007-08-14

08:00 AM ADT

-11

 

107

71668

2007-08-14

08:05 AM ADT

-11

108

71700

2007-08-14

08:05 AM ADT

-12

 

In the example above, we see that the reading_number attribute is simply an integer whose values are obtained from the database management system. It provides each record with a unique identity. This approach works as long as there is a central authority for generating the unique identifiers. In this case, that authority is the database management system. In examples WM-10 and TR-2 the central authorities where Environment Canada and VIA Rail respectively.

What happens, however, if each region or each station in our weather monitoring network has its own database management system. In this scenario, the records would be stored in regional collections, each management by its own system, and then consolidated periodically into a national collection so that the data can be used together. In this scenario, how do we guarantee that the database management system responsible for Fredericton’s two weather stations generates identifiers that are unique from those of some other region, let’s say Montréal. If each database management system simply increments an integer, then we have no practical way of insuring that each system generates identifiers that are different from the other systems.

One solution would to combine the integer values assigned by each system with some unique location or region identifier. Suppose we assign the Atlantic region in which Fredericton exists the code “ATL” and the Eastern region in which Montréal exists the code “EAST.” We might then unique reading numbers such as the following:

Reading Code

Station Number

Index

Date

Time

Temperature

ATL.1

71668

2007-08-14

08:00 AM ADT

-10

EAS.1

71627

2007-08-14

07:00 AM EDT

-15

ATL.2

71700

2007-08-14

08:00 AM ADT

-11

 

ATL.107

71668

2007-08-14

08:05 AM ADT

-11

ATL.108

71700

2007-08-14

08:05 AM ADT

-12

 

A critical problem with simple schemes for selecting unique identifiers is that they are often not secure. This is of special concern if such values are to be used by humans or if they can somehow be intercepted by other information systems.

At least two types of security are of concern here: (1) security from imitation identifiers and (2) security from errors introduced into the identifier itself. Ideally, one should not be able to guess a valid value for a unique identifier. Someone might attempt this in order to co-opt records for malicious reasons.

The introduction of errors is related in a way to the risk in (1). That is, if it is easy to guess a valid value for an identifier, then it is probably also very easy to introduce an erroneous value into a system. Common data recording errors for humans include the transposition of data (e.g. 5401 vs. 4501) and the addition or removal of data. Thus, an ideal scheme for generating unique identifiers or other types of codes is one which is said to be error-correcting. For example, if given an identifier a human should not be able to introduce an error into the code that produces another valid identifier. The corollary to this is that the introduction of such an error should be detectable. Another way to view this is that the code should be self-validating. That is, it should be possible to determine the validity of a code value by examining the value itself.

EX. TR-3:

We could employ the same approaches that were just discussed in EX. WM-11 to identify the reservation records discussed in EX. TR-1. Other ideas might be explored, however. One problem with simply selecting unique identifiers from a sequence is that they carry no useful information other than a guarantee from the database management system from which we get them that they are unique. In the last iteration of the design of our reading-number attribute in EX. WM-11 above, we can now discern regions from identifier.

Likewise, in creating a suitable identifier for train reservations, it would be useful if, beyond being unique, one could glean critical information from them upon examination. For example, an official from the train network might be able to offer additional help to a passenger when presented with such an identifier. We must also be wary of: (1) the possibility of creating imitation reservation numbers or (2) the ease with which errors can be introduced into an identifier’s data value.

In case (1), suppose people know that the train network issues reservation numbers that are integers. It would then be very easy for people to create fake reservation numbers. With a fake reservation number – and lacking any other security measures – someone might then be able to co-opt someone else’s reservation for their own or to discover information about the holder of the reservation. Ideally, we should use a unique identifier scheme for which it is not be possible for someone to guess a valid value.

Case (2) can be seen as related to case (1). It should not be possible for someone to introduce an error into a reservation number that produces another valid reservation number. Suppose Jim Smith makes a reservation over the telephone and is told by the operator that his reservation number is “1010” while he mistakenly writes down “1001.” Suppose that “1001” is someone else’s reservation number. Jim Smith now has created an identifier that can potentially violate someone’s privacy. Train officials cannot easily tell that it is erroneous and, thereby, help Jim Smith to get his correct reservation number. This is a simple example, of course. In reality, other data, such as identifying information, are often used to cross check such things as reservation numbers.

EX. LM-1:

The International Standard Book Number (ISBN) is one scheme that has been developed that addresses the issues of error detection and the introduction of meaning into the structure of data values, as discussed above in example EX. TR-3. Each identifier is composed of unique numbers for group (i.e. Language group or country), publisher, title, and check digit (cf. http://www.isbn.org ). An example of the 10 digit version, called ISBN-10 is: 1-40207-067-5.

1 := English-speaking country group40207 := Publisher number067 := Title (within the publisher’s collection)5 := Check digit

If we have the lists of codes for the different language groups, publishers, and each publisher’s titles, the value of a ISBN itself will lead us to all of that information. The check digit is a value calculated using the values of the group, publisher, and title fields. Thus, if the check digit does not match the value calculated using the other digits, then we know there is an error in that particular ISBN number. The check digit is calculated using the following formula, where di is the ith digit from the left:

check digit = d1+2d2+3d3+4d4+5d5+6d6+7d17+8d8+9d9 mod 11

Thus, the first 9 digits of the ISBN 1-40207-067 would be used as follows to calculate the check digit:

check digit = 1+2*4+3*0+4*2+5*0+6*7+7*0+8*6+9*7 mod 11
            = 170 mod 11
            = 5

Note: The ISBN-10 standard specifies that if the remainder is 10, then the character “X” is to be used as the check digit so as to keep the number to 10 digits.

A data management system can then easily validate any value presented as ISBN-10 by the following rule:

IF (d1+2d2+3d3+4d4+5d5+6d6+7d17+8d8+9d9 mod 11) == check digit THEN
    ISBN is correct
ELSE
    ISBN is invalid.
END IF

Following our discussion in example EX. WM-11, we can see that this particular approach to generating identifiers counts on a central authority to define the group and publisher codes, but that once a publisher has been assigned its own publisher number, it can assign its own titles without interfering with any other publisher. Of course, it must notify some central authority of the new titles it creates.