Gary Perlman
OCLC Online Computer Library Center
6565 Frantz Road
Dublin, OH 43017
perlman@oclc.org
This paper appeared in the proceedings of HFES 2000, the annual meeting of the Human Factors and Ergonomics Society.
The results reported here are a serendipitous set of observations made while researching model-based linking in hypertext. The research explored the idea that an entity-relationship model of data could be used to automatically generate useful hypertext links between related information [Raghavan, 1998]. In that research, links were generated from entity relationships and evaluated, in online and paper forms, against search/index-based access methods.
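As a rough illustration of the idea (not the system described in [Raghavan, 1998]), an entity-relationship model can be reduced to a simple link-generation rule: every relationship instance between two entities yields a hypertext link between their pages. A minimal sketch, with all data and names hypothetical:

```python
# Minimal sketch of entity-relationship-based link generation.
# The data model and names here are hypothetical, not the system
# described in [Raghavan, 1998].

# Entities: id -> page title
entities = {
    "e1": "Author: Smith",
    "e2": "Paper: Hypertext Linking",
    "e3": "Journal: HCI Review",
}

# Relationship instances: (source entity, relationship name, target entity)
relationships = [
    ("e1", "wrote", "e2"),
    ("e2", "appeared-in", "e3"),
]

def generate_links(entities, relationships):
    """Emit one bidirectional hypertext link per relationship instance."""
    links = []
    for src, rel, dst in relationships:
        links.append((entities[src], rel, entities[dst]))
        links.append((entities[dst], "inverse of " + rel, entities[src]))
    return links

for source, label, target in generate_links(entities, relationships):
    print(f"{source} --[{label}]--> {target}")
```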
Both objective (performance) and subjective (preference rating) measures can be used to study differences in the use of relationship links to seek information. Subjective preference ratings can be collected concurrently with the performance of the tasks and retrospectively (e.g., through a post-test questionnaire or interview).
In this paper we report the results of a study showing that the relationship between preference and performance can differ depending on whether ratings are collected concurrently or retrospectively.
Why study ratings of performance when you can use the performance measures themselves? Many products are released without any performance measures, guided only by very informal subjective impressions, such as those from a focus group or, worse yet, from management or marketing. Collecting subjective measures is a way to estimate that biased influence. Moreover, if performance does not match preference, the design choices might need to be explained more carefully than when the two agree.
An experiment evaluated the usefulness of entity-relationship-based links in accessing online versus print information. Subjects sought answers to questions in each of four comparison conditions: Linked Online (LO), Unlinked Online (UO), Linked Paper (LP), and Unlinked Paper (UP).
The performance of subjects answering the questions was assessed with three dependent variables: actual time to answer, actual accuracy of the answers, and concurrent confidence ratings of accuracy collected during the tasks.
A post-test questionnaire gathered subjective (retrospective) ratings for each condition: ratings of speed (Retro Time), accuracy (Retro Accuracy), and overall usability (Retro Usability).
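One way to picture the resulting data set: each trial yields objective measures plus a concurrent rating, and each subject contributes one set of retrospective ratings per condition. A hypothetical sketch of the record layout (field names are illustrative, not from the original study):

```python
# Hypothetical record layout for the measures described above;
# all field names and values are illustrative.

trial = {
    "subject": 12,
    "condition": "LP",         # LO, UO, LP, or UP
    "actual_time_s": 121.3,    # objective: seconds to answer
    "correct": True,           # objective: answer accuracy
    "concurrent_accuracy": 8,  # subjective: confidence rating given during the task
}

posttest = {
    "subject": 12,
    "condition": "LP",
    "retro_time": 5,       # retrospective speed rating
    "retro_accuracy": 7,   # retrospective accuracy rating
    "retro_usability": 5,  # retrospective usability rating
}
```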
All results have been "normalized" to be on comparable scales of measure. Predicted time was based on "designer's intuition". Although potentially biased in favor of the designer's ego, it is worth presenting here as a contrast to the objective (actual time) and subjective (retrospective speed) measures.
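The paper does not state the normalization procedure; one plausible choice is to linearly rescale each measure to a common 0-100 range, as in this sketch:

```python
# One plausible normalization (the paper does not specify its method):
# linearly rescale each measure to a common 0-100 range.

def rescale(values, lo=None, hi=None):
    """Min-max rescale a list of values to 0-100."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    return [100.0 * (v - lo) / (hi - lo) for v in values]

actual_time = [83.3, 106.6, 121.3, 104.6]  # seconds (LO, UO, LP, UP)
retro_usability = [7.7, 6.9, 4.9, 4.4]     # 0-10 rating scale

print(rescale(actual_time))                   # relative to the observed range
print(rescale(retro_usability, lo=0, hi=10))  # relative to the rating scale
```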
Actual Time (in seconds) correlated poorly with Predicted Time, but correlated well (r=0.49) with concurrent ratings of accuracy: low times were accompanied by high ratings of accuracy. Concurrent ratings of accuracy (which were significantly lower for Linked-Paper) correlated well with Actual Accuracy, and subjects were significantly more confident when correct (F(1,510)=87.4, p<0.001). Actual Accuracy varied little across conditions (F(3,45)=0.1). Concurrent ratings of accuracy correlated well with retrospective ratings of accuracy (r=0.62); however, Actual Accuracy did not correlate well with retrospective ratings of accuracy (r=-0.36). All retrospective ratings were intercorrelated (r=0.65, r=0.32, r=0.55), and these correlated best with the designer's predicted times (e.g., for speed ratings, r=-0.44).
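Statistics of this kind can be computed from raw data with standard tools. A simplified sketch on made-up data, using Pearson correlations and a one-way ANOVA (the reported degrees of freedom suggest the original analysis was a repeated-measures design, which this sketch does not replicate):

```python
# Sketch of the analyses reported above, on made-up data.
# scipy.stats.pearsonr gives r; f_oneway gives a one-way ANOVA F.
import numpy as np
from scipy.stats import pearsonr, f_oneway

rng = np.random.default_rng(0)
actual_time = rng.normal(100, 20, size=48)
concurrent_accuracy = 10 - 0.03 * actual_time + rng.normal(0, 1, size=48)

# Correlation between time and concurrent accuracy ratings
r, p = pearsonr(actual_time, concurrent_accuracy)
print(f"r = {r:.2f}, p = {p:.3f}")

# Accuracy per condition (percent correct per subject, hypothetical)
lo, uo, lp, up = (rng.normal(m, 4, size=12) for m in (77.3, 78.1, 75.8, 76.6))
F, p = f_oneway(lo, uo, lp, up)
print(f"F(3,44) = {F:.2f}, p = {p:.3f}")
```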
The results are summarized in Table 1.
Table 1. Summary of objective and subjective (concurrent & retrospective) measures across the four major conditions: Linked Online (LO), Unlinked Online (UO), Linked Paper (LP), and Unlinked Paper (UP). Cell entries are means (standard errors in parentheses).
| Measure | Online: Linked (LO) | Online: Unlinked (UO) | Paper: Linked (LP) | Paper: Unlinked (UP) |
|---|---|---|---|---|
| Predicted Time (PT) | < | < | < | |
| *Objective* | | | | |
| Actual Time (AT) (secs) | 83.3 (6.2) | 106.6 (9.3) | 121.3 (8.2) | 104.6 (7.7) |
| Actual Accuracy (AA) (%) | 77.3 (3.7) | 78.1 (3.8) | 75.8 (3.7) | 76.6 (3.8) |
| *Subjective* | | | | |
| Concurrent Accuracy (CA) | 9.1 (0.17) | 8.9 (0.24) | 8.3 (0.17) | 8.9 (0.15) |
| Retro Time (RT) | 8.2 (0.3) | 6.9 (0.6) | 4.7 (0.6) | 4.3 (0.5) |
| Retro Accuracy (RA) | 8.8 (0.3) | 8.4 (0.4) | 7.1 (0.5) | 6.8 (0.5) |
| Retro Usability (RU) | 7.7 (0.3) | 6.9 (0.4) | 4.9 (0.4) | 4.4 (0.4) |

In the Predicted Time row, "<" indicates the designer's predicted ordering of times, fastest to slowest from left to right.
Figure 1 shows a combined view of the three measures Actual Time (AT), Actual Accuracy (AA), and Concurrent Confidence ratings of Accuracy (CA), the latter two presented as percentages (bars) on the primary Y-axis. To allow for comparisons, the bars for each measure are clustered together. Time is presented as points connected by lines on the secondary Y-axis. Error bars indicate one standard error of the means.
The subjective (retrospective) results are summarized in Figure 2. The graph shows a combined view of all three ratings. The ratings for Retrospective Speed and Accuracy are presented as bars on the primary Y-axis. Retrospective Usability is presented as points connected by a line on the secondary Y-axis. Error bars indicate one standard error of the means.
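The figures themselves are not reproduced here, but a Figure-1-style plot can be reconstructed from Table 1. A matplotlib sketch, assuming the 0-10 concurrent ratings are converted to percentages by multiplying by 10 (one reading of how the scales were "normalized"):

```python
# Sketch of a Figure-1-style plot built from the Table 1 values:
# percentage measures as grouped bars on the primary Y-axis,
# Actual Time as a line on a secondary Y-axis, SE error bars throughout.
# Converting the 0-10 concurrent ratings to percentages (x10) is an
# assumption about how the paper "normalized" its scales.
import numpy as np
import matplotlib.pyplot as plt

conditions = ["LO", "UO", "LP", "UP"]
x = np.arange(len(conditions))

aa, aa_se = [77.3, 78.1, 75.8, 76.6], [3.7, 3.8, 3.7, 3.8]     # Actual Accuracy (%)
ca, ca_se = [91, 89, 83, 89], [1.7, 2.4, 1.7, 1.5]             # Concurrent Accuracy (rating x 10)
at, at_se = [83.3, 106.6, 121.3, 104.6], [6.2, 9.3, 8.2, 7.7]  # Actual Time (s)

fig, ax1 = plt.subplots()
ax1.bar(x - 0.2, aa, width=0.4, yerr=aa_se, capsize=3, label="Actual Accuracy (%)")
ax1.bar(x + 0.2, ca, width=0.4, yerr=ca_se, capsize=3, label="Concurrent Accuracy (%)")
ax1.set_ylabel("Percent")
ax1.set_xticks(x)
ax1.set_xticklabels(conditions)
ax1.legend(loc="lower right")

ax2 = ax1.twinx()  # secondary axis for time
ax2.errorbar(x, at, yerr=at_se, color="black", marker="o", label="Actual Time (s)")
ax2.set_ylabel("Actual Time (secs)")

ax1.set_title("Reconstruction of Figure 1 from Table 1")
plt.show()
```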
Designer-intuition predicted time did not correlate well with actual time. Actual time did correlate well with concurrent confidence ratings, which might tempt one to generalize that subjective confidence ratings about accuracy are good predictors of performance time; one should be less eager, given that actual accuracy did not differ across conditions. The use of isolated measures becomes even more tenuous when we look at the retrospective ratings, which all correlated well with predicted time: if only subjective retrospective ratings had been collected, one might conclude that the designer's intuition was perfect. But the retrospective time and accuracy scores seem to have lost the poor performance (actual time and concurrent confidence ratings) of the Linked-Paper condition (all conditions were counterbalanced in a Latin square).
It is instructive to consider the conclusions that might be drawn if we had measured fewer dependent variables.
We found retrospective ratings of accuracy, time, and usability to be less related to the objective measures than concurrent confidence ratings were. If objective and subjective measures are inconsistent, then we would anticipate needing to explain the benefits of a more effective design. The more they are at odds, the more we might want to find ways to make the most effective design also the most positively received.
In summary, in this experiment, gathering retrospective usability ratings helped demonstrate that such ratings may not serve well as measures of true performance and, had they been collected as the only dependent measure, could have been used to confirm incorrect predictions. On the other hand, if retrospective ratings are used to measure the overall impression that a user takes away from an experience, then the uncorrelated actual data (e.g., the objective time measure) may be less useful in predicting future purchase/use behavior (see [Davis, 1989]).