Data Types

In a very brief statement, statistics is the study of information. In more detail, we would say that statistics involves planning, summarizing, analyzing, and interpreting data that describes our information. Data comes in many forms, and some data is more structured than other data. Data typically describes some characteristic to be studied, and that characteristic is often described as a variable in the study.

Quantitative or Qualitative

Quantitative data always measures or counts some characteristic. Measuring generally answers the question "how much", while counting answers the question "how many". The number of children on the local playground is quantitative data, since the answer involves a count of the number of children. The distance between your home and your place of employment is quantitative data, since the answer involves measuring that distance. Quantitative data is always numerical.

Qualitative data is information that labels or ranks items, but does not measure or count them. If we collected information on the model of car owned by each teacher at a local high school, that data would be qualitative, since the models are only labels. If we asked shoppers at the mall whether they were "very satisfied", "satisfied", "dissatisfied", or "very dissatisfied" with a particular product they had recently purchased, we would be ranking their satisfaction with that product. The results would be qualitative data. Even if we asked them to rate the product on a scale of 1 to 5, we would still be ranking their satisfaction, since the spacing of the numbers is somewhat arbitrary. In other words, qualitative data may be numerical, or it may not be numerical.

Discrete or Continuous

Continuous data allows for a smooth variation of values from one extreme to another. Generally, if the data requires measurement, answering the question "how much", the data is probably continuous. More specifically, if any fractional or decimal value within a range is acceptable, then the data is usually continuous. The number of gallons of gasoline purchased by each customer on a particular day would be a continuous variable.

Discrete data comes in discrete or separate values. Generally, discrete data answers the question "how many", and counting is required to obtain that answer. Fractional and decimal values are permitted only when gaps remain between the allowable values. The number of children in each family of a village would be discrete data.

The process of recording data may actually change its type. Although the number of gallons of gasoline purchased could be measured with infinite precision (and an infinite number of decimal places to describe the quantity), it is typically rounded. If we round all purchases to the nearest tenth of a gallon, we have actually converted it to discrete values (which do involve decimals, but still have gaps between the values).
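The effect of rounding can be illustrated with a short Python sketch. The purchase amounts below are made-up values chosen only for illustration. After rounding to the nearest tenth of a gallon, every recorded value is a whole number of tenths, so the recorded data can only take values separated by gaps of 0.1 gallon.

```python
# Hypothetical gasoline purchases in gallons (illustrative values only).
purchases = [10.4372, 10.4111, 8.1519, 12.9968]

# Record each purchase to the nearest tenth of a gallon.
recorded = [round(gallons, 1) for gallons in purchases]

# Express each recorded value as a whole number of tenths of a gallon.
# Counting in tenths shows that the recorded values sit on a grid with gaps.
tenths = [round(gallons * 10) for gallons in recorded]

print(recorded)  # [10.4, 10.4, 8.2, 13.0]
print(tenths)    # [104, 104, 82, 130]
```

Notice that two different measured amounts (10.4372 and 10.4111 gallons) are recorded as the same value, which is typical once continuous measurements are rounded.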

Nominal, Ordinal, Interval, or Ratio

Classifying data as one of these four types indicates how much structure is present in the data. Generally, qualitative data are either nominal or ordinal, and quantitative data are either interval or ratio data.

Nominal data has the least structure. If values are simply labels, and cannot be ordered or ranked with any meaning, then the data is nominal data. The colors of cars in the parking lot will generally be nominal data (unless a physicist is measuring the wavelength of the light involved), because the values would be recorded as "red", "green", "white", and so on.

Ordinal data can be ordered or ranked, but does not measure or count some characteristic. Asking about satisfaction with a product or service generally involves a ranking.
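As a minimal sketch of what "can be ordered" means in practice, the Python code below ranks the four satisfaction labels used earlier. The names `LEVELS` and `RANK` and the list of responses are hypothetical, introduced only for this illustration. The ranks support sorting and comparison, but the difference between two ranks is not a measurement.

```python
# Labels listed from lowest to highest satisfaction; list position is the rank.
LEVELS = ["very dissatisfied", "dissatisfied", "satisfied", "very satisfied"]
RANK = {label: position for position, label in enumerate(LEVELS)}

# Hypothetical survey responses (illustrative only).
responses = ["satisfied", "very dissatisfied", "very satisfied", "satisfied"]

# Ordinal data can be sorted by rank ...
ranked = sorted(responses, key=RANK.__getitem__)

# ... and two responses can be compared.
is_higher = RANK["very satisfied"] > RANK["satisfied"]

print(ranked)
print(is_higher)  # True
```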

Interval data does measure or count some characteristic (which gives differences between values a meaning), but ratios of the values have no intrinsic meaning. This happens any time the zero point on the measurement scale does not describe the absence of some quantity. A good example of this type of data is temperature in degrees Fahrenheit. On one day the high temperature in a particular city may be 5 degrees below zero, and the next day the temperature rises to 5 degrees (above zero). The rise in temperature of 10 degrees is well understood, but the ratio $\dfrac{5}{-5}=-1$ does not describe a physical change. Zero degrees Fahrenheit is only a location on a scale, and does not describe the absence of heat or motion. (Absolute zero on the Kelvin scale does, so Kelvin temperature is ratio data.)
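The contrast between the two temperature scales can be checked numerically. The following Python sketch uses the standard Fahrenheit-to-Kelvin conversion; the function name `fahrenheit_to_kelvin` is simply a label chosen for this illustration. The Fahrenheit ratio comes out negative, while the same two temperatures on the Kelvin scale give a ratio slightly above 1, which is a meaningful comparison because zero kelvins is a true zero.

```python
def fahrenheit_to_kelvin(deg_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to kelvins."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15

day1_f, day2_f = -5.0, 5.0  # the two high temperatures from the example

# Differences are meaningful on an interval scale.
difference_f = day2_f - day1_f  # 10.0 degrees Fahrenheit

# The ratio of Fahrenheit readings has no physical interpretation.
ratio_f = day2_f / day1_f  # -1.0

# On the Kelvin scale the zero point is a true zero, so the ratio is meaningful.
day1_k = fahrenheit_to_kelvin(day1_f)
day2_k = fahrenheit_to_kelvin(day2_f)
difference_k = day2_k - day1_k  # the same change, rescaled by 5/9
ratio_k = day2_k / day1_k

print(difference_f, ratio_f)
print(round(difference_k, 2), round(ratio_k, 3))
```

The difference survives the change of scale (it is only multiplied by 5/9), but the ratio changes from -1 to about 1.02, which is why ratios of Fahrenheit readings should not be interpreted.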

Ratio data also measures or counts some characteristic, and the ratios of the values do have an intrinsic meaning. This will occur any time the zero point on the measurement scale describes the absence of some quantity. The number of children in a family is an example of ratio data. When one family has six children and another family three, then the ratio $\dfrac63=2$ means that the first family has twice as many children as the second. To have zero children would mean an absence of any children in the family.

Cross-Sectional or Time Series

Cross-Sectional data describe some characteristic at (roughly) a single point in time. High temperatures in the major cities of the USA on December 22 of last year would be cross-sectional, as one particular day has been singled out.

Time Series data describe a single characteristic over a period of time. The daily high temperatures in Duluth during the year 2009 would be time series data.
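The two types can be seen as two different slices of the same collection of observations. In the Python sketch below, the cities, dates, and temperatures are invented for illustration and are not actual weather records. Fixing the date and varying the city gives cross-sectional data, while fixing the city and varying the date gives time series data.

```python
from datetime import date

# Hypothetical (city, date, high temperature in degrees Fahrenheit) records.
observations = [
    ("Duluth", date(2009, 12, 21), 18),
    ("Duluth", date(2009, 12, 22), 12),
    ("Duluth", date(2009, 12, 23), 15),
    ("Miami", date(2009, 12, 22), 77),
    ("Denver", date(2009, 12, 22), 41),
]

# Cross-sectional: many cities, one date.
chosen_day = date(2009, 12, 22)
cross_section = {
    city: high for city, day, high in observations if day == chosen_day
}

# Time series: one city, many dates, kept in chronological order.
time_series = sorted(
    (day, high) for city, day, high in observations if city == "Duluth"
)

print(cross_section)
print(time_series)
```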

Univariate, Bivariate, and Multivariate

Data is also classified by the number of independent quantities being recorded for each observation. If a statistician observes a football team and records only the weight of each player, then the data has one independent quantity and is called univariate. Any data that involve more than one variable are multivariate. If a statistician records both the height and the weight of each football player, then multivariate data, or more specifically in this case, bivariate data has been collected. And if a statistician records height, weight, eye color, and length of hair, the data is also multivariate (but not bivariate).
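One way to picture this classification is to count the quantities stored for each observation. The Python sketch below is only an illustration: the function name `classify_by_variable_count` and the player records are hypothetical, and the values are made up.

```python
def classify_by_variable_count(n_variables: int) -> str:
    """Name the data type implied by the number of recorded quantities."""
    if n_variables < 1:
        raise ValueError("at least one variable must be recorded")
    if n_variables == 1:
        return "univariate"
    if n_variables == 2:
        return "bivariate"  # bivariate data is a special case of multivariate data
    return "multivariate"

# Hypothetical records for one football player under three recording schemes.
weight_only = {"weight_lb": 220}
height_and_weight = {"height_in": 74, "weight_lb": 220}
four_quantities = {
    "height_in": 74,
    "weight_lb": 220,
    "eye_color": "brown",
    "hair_length_in": 2.5,
}

for record in (weight_only, height_and_weight, four_quantities):
    print(len(record), classify_by_variable_count(len(record)))
```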

Both the display of data and methodology used in statistical testing can depend on the number of independent variables. It should be noted that in this context, frequency is a dependent variable, not an independent one.

Population or Sample

Population data is obtained when ALL values in a group are recorded. Summary data of a population (such as averages) are called parameters of the population. The census that the USA conducts every ten years is population data (or a very good attempt at it, since a few people do get missed), at least for those characteristics that are asked of EVERY individual.

Sample data occurs whenever a subset of a group is selected, some characteristic is observed in that smaller group, and the results are generalized to the entire group. Summary data of a sample (such as averages) are called statistics. Sampling is done because of efficiency and cost, and as it turns out, if the process is carefully done, the loss of accuracy is minor. A major focus of a statistics class is to determine how and when sample data can be used in place of population data. In fact, the interplay between populations and samples even requires us to recognize different variables for population data and sample data.
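The distinction between a parameter and a statistic can be sketched with a small simulation in Python. Everything in it is an assumption made for illustration: the "village" is simulated, with the number of children per family drawn at random from 0 to 6, and the sample size of 200 is arbitrary. The point is only that the sample average is computed from far fewer values yet usually lands close to the population average.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch gives the same result on every run

# Simulated population: number of children in each of 5000 hypothetical families.
population = [random.randint(0, 6) for _ in range(5000)]

# Parameter: a summary computed from ALL values in the population.
parameter = statistics.mean(population)

# Sample: a randomly selected subset of 200 families, drawn without replacement.
sample = random.sample(population, 200)

# Statistic: the same summary computed from the sample only.
statistic = statistics.mean(sample)

print(f"population mean (parameter): {parameter:.3f}")
print(f"sample mean (statistic):     {statistic:.3f}")
```

Changing the seed or the sample size changes the statistic but not the parameter, which reflects the fact that a statistic varies from sample to sample while a parameter is fixed for a given population.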