您在這裡

Derived data values

8 九月, 2015 - 16:25

Data modeling options have varying degrees of disadvantages in terms of the amount of space and time they require.

In deciding what must be represented by data, it is also possible to decide what need not be represented explicitly in the data. What need not be represented explicitly depends on the nature of the data. For example, in cases where multiple unit systems exists, it may make sense to store values in only one of the possible systems and then provide conversions to other systems. Data values that are represented this way are known as derived data values. Using a derived data mechanism, we need not stored an actual value in a data collection. Values can be calculated on demand instead, provided that the necessary input data values exist in the collection. It is important to note that derived data need not be limited to numeric data values.

EX. WM-14:

Suppose we must support users with temperature readings in both metric and imperial units. Using a derived data mechanism, we might modify the latest version of our temperature_reading schema in example EX.WM-11 to arrive at the following:

temperature_readings(
    reading_code,
    station_index_number,
    data,
    time,
    celsius_temperature,
    fahrenheit_temperature(celsius_temperature))

By this syntax, we mean that the attribute fahrenheit_temperatureis actually a function and that it derives its value using the value of the celsius_temperatureattribute. Using some type of programming language, we would specify to the data management system the following algebraic equation in defining the fahrenheit_temperatureattribute:

fahrenheit_temperature(celsius_temperature) :=
    1.8 x celsius_temperature + 32

Question to the reader: What are the trade-offs in choosing a derived data value approach in this case?

Another case where derived values are beneficial is in the calculation of aggregate data values. Aggregate data values are derived using a collection of data values. The most common examples are statistical measures such as the average, mean, or median values of a collection of numbers. Other aggregate measures such as the sum, maximum value, or minimum values of a set of numbers are common as well. Database management systems typically provide built-in services or functions that calculate such measures over a collection of data specified by the user.

EX. WM-15:

Suppose we must support users with the average, maximum, and minimum temperature readings at each station for the last 24 hours. One method would be the following:

Given the schema:

stations (station_index_number, location_name)

temperature_readings(
    reading_code,
    station_index_number,
    date,
    time,
    temperature)

station_statistics (
    unique_id,
    date,
    time,
    station_index_number,
    avg_temperature,
    max_temperature,
    min_temperature)

g. Collect current temperature readings from all stations.
h. Store each reading collected during Step 1 as an entity in the temperature_readings data collection.
i. For each station_index_number in the stations data collection, calculate the average, maximum, and minimum temperature values recorded in the temperature_readings collection over the last 24 hours; and store each calculation in the station_statistics data collection.
j. Wait for a specified time interval then goto Step 1.

Step 3 in our method above will cost us space, but it will also cost us in computational time. The latter will come about with the need to read values from the stations and temperature_readingscollections in order to calculate the values to be stored in the station_statisticscollection.

Using a derived data value approach, we can save space and potentially time by calculating the aggregate statistics only when requested by a user.

Questions for the reader:

  1. What approach would you choose if there are likely to be many requests for statistics?
  2. Can you think of a way to optimize the procedure above in terms of computational time?

The costs in space expenditures are not zero for derived values. In a computer system at least, some amount of space is required to store the methods that are required to derive a value. These are traditionally called storedprocedures in database management systems. For example, our temperature conversion formulas would be stored in the form of a small computer program. Many database management systems allow such procedures to be stored with the data. In other cases, developers decide to implement the procedures within external computer programs.

As was discussed at the start of this section and but can be seen more clearly by now, deciding what should be represented in data that we collect should depend on finding the best ways in which it will be understood and used.