
Granularity

8 September, 2015 - 12:22

Granularity in the context of data modeling refers to the level of specificity or detail at which something is represented. Almost all data modeling problems offer the opportunity to represent some aspect of reality at ever greater levels of detail. Representations that contain a lot of detail are said to be fine-grained. Conversely, coarse-grained representations contain less detail and are often referred to as high-level. The data modeler must ask the question: “How fine-grained must the model be?”

EX. WM-6:

There are several dimensions along which we can increase the level of detail of this particular representation. Weather monitoring includes the dimensions of time and space. We are keeping temperature readings by time and geographic location. We have chosen thus far to represent the geographic location at a level of granularity of whole cities.

As was discussed above, some users of weather data may call for a finer-grained representation. To satisfy the requirements of organizations that manage airports and similar agencies, for example, we would have to adjust our current weather monitoring data model to be finer-grained with respect to geographic area. That is, our monitoring should include more locations within a given geographic area and perhaps mandated locations, such as airports, irrespective of other considerations.

The other dimension of granularity affecting weather monitoring that was discussed above is time. In this context, we can adjust the granularity by increasing or decreasing the frequency with which we take and record measurements. Again, different users may have different requirements of the same type of data. Individuals may be satisfied with an hourly or even daily temperature reading if they need only a sense of how hot or cold it is. Other users of weather data, such as airports, will require more frequent readings.

Temperatures can shift significantly within a short time in most locations. This is another aspect of representing reality that can have a critical impact on people, as we saw in the earlier discussion of deicing airplanes. Thus, to meet the needs of airports, we must collect temperature data frequently enough that they can see temperatures rise or fall past critical points when they need to.
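To make the point about sampling frequency concrete, here is a minimal sketch in Python. The temperature curve is invented purely for illustration (a brief dip below freezing), and the names synthetic_temperature and observed_below are hypothetical, not part of the weather monitoring model.

```python
def synthetic_temperature(minute: int) -> float:
    """Made-up curve (an assumption): 2 °C, with a 20-minute dip to -1 °C."""
    return -1.0 if 130 <= minute < 150 else 2.0


def observed_below(threshold: float, interval_minutes: int,
                   total_minutes: int = 240) -> bool:
    """True if any reading taken every interval_minutes falls below threshold."""
    return any(
        synthetic_temperature(minute) < threshold
        for minute in range(0, total_minutes, interval_minutes)
    )


# Readings every 5 minutes catch the dip below freezing; hourly readings miss it.
print(observed_below(0.0, 5))
print(observed_below(0.0, 60))
```

With this made-up curve, the hourly readings fall at minutes 0, 60, 120, and 180, so none of them lands inside the dip. A coarser time granularity can hide exactly the event an airport cares about.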

The level of granularity at which we choose to represent something brings with it simultaneous advantages and disadvantages. There are trade-offs between cost and accuracy where granularity is concerned. With a fine-grained representation, we can answer more questions with our data; however, the cost of storing a collection of data will rise as its granularity increases. Any detail we add to a data model means that more data must be collected to populate it. Likewise, any increase in the frequency with which data are collected into a given data model will result in higher storage costs. Data management professionals must calculate these costs when designing a system.

EX. WM-7:

The Global Climate Observing System (GCOS) Network consisted of 1016 stations around the world as of January 2007. Each station collects data anywhere from every 5 minutes to twice daily.

Suppose every GCOS station collects data every five minutes according to the current temperature_readings schema. Suppose further that date-time, city-name, and temperature require 64, 40, and 4 bytes of data storage respectively. We would then require 31,104 bytes/day for each GCOS station, or 31,601,664 bytes/day for all GCOS stations in the network. The annual storage requirements would, thus, be 11,534,607,360 bytes. (The reality is that the real GCOS stations collect far more data with each reading.)

Question to the reader:

Can you derive these answers?
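One way to check the figures in EX. WM-7 is to redo the arithmetic in a few lines of Python. The sketch below uses only the numbers given in the example; the 365-day year is an assumption, and the variable names are illustrative.

```python
# Field sizes in bytes, as stated in the example.
BYTES_PER_READING = 64 + 40 + 4   # date-time + city-name + temperature
READINGS_PER_DAY = 24 * 60 // 5   # one reading every five minutes
STATIONS = 1016                   # GCOS stations as of January 2007
DAYS_PER_YEAR = 365               # assumption: a non-leap year

per_station_per_day = BYTES_PER_READING * READINGS_PER_DAY
network_per_day = per_station_per_day * STATIONS
network_per_year = network_per_day * DAYS_PER_YEAR

print(f"{per_station_per_day:,} bytes/day per station")   # 31,104
print(f"{network_per_day:,} bytes/day for the network")   # 31,601,664
print(f"{network_per_year:,} bytes/year for the network") # 11,534,607,360
```

Each reading takes 108 bytes and there are 288 readings per day, which gives the per-station figure. Changing READINGS_PER_DAY shows how directly the collection frequency drives the storage cost.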

Independent of the level of granularity, one key aspect of representing some part of reality is the need to record information about things (objects, concepts, locations, etc.) in such a way that they can be accurately identified later.