Journal of Undergraduate Research
Volume 9, Issue 3 - Spring 2008

When a Penguin Is a Bird: An On-Line Study of Category Co-Reference

Maia A. Petee and H. Wind Cowles

ABSTRACT

When you hear the word "bird," what image comes most readily to mind? Is it that of a small, feathered, flighty animal, similar to a robin or a bluebird? Or is it that of a much larger, sleek, oblong creature that never takes to the air but prefers diving among ice floes instead? Chances are, if you haven’t grown up in the Antarctic, the first description far more closely meshes with your experience of birds, of their appearance and actions. Since a robin so well exemplifies all the basic characteristics of a bird, represents the unbidden and instant "mental picture" of one, it can be said that it is a typical bird, and that a penguin, described by the second set of characteristics, is an atypical bird (speaking formally, an atypical exemplar of the category "bird").

Taking this into consideration, what happens during normal language use when we are forced to mentally associate "bird" and "penguin"? This is exactly what happens when we use a category “anaphor” (e.g., bird) to refer back to the referent of an antecedent expression (e.g., penguin). Do we simply focus on the representation of “penguin” that we already have in mind, or do we shift our focus to more bird-like features of penguins? That is, does referring to a penguin as a “bird” cause us to think of the penguin differently? In this paper, we explored this question in an eye-tracking study that examined how actions or characteristics of atypical antecedents are perceived when they follow an anaphor. Properties that were either typical of the category, typical of the exemplar, or equally typical for both were tested and analyzed. Each stimulus was presented in a short two-sentence passage in which the first sentence introduced the antecedent:

"A penguin was sitting on the ice"

and the second referred back to it with a category anaphor, then going on to assert one of three possible properties of the antecedent:

1) Antecedent-Specific: "The bird suddenly dove into the water."
2) Neutral: "The bird suddenly preened its feathers."
3) Category-Specific (Antecedent-Incompatible): "The bird suddenly flew into the air."

In example (3), the predicate given (flying) is very typical of birds, but impossible for penguins. Either a typical bird or a penguin could be engaging in the preening described in (2), but diving into the water, as in (1), is not usually associated with a typical kind of bird, like a bluejay or a robin; it is associated only with water birds, such as penguins. When a reader is first faced with a subject/predicate combination that strikes him or her as impossible or unlikely, this may disrupt ordinary reading processes, causing longer reading times, and he or she may retrace his or her eye movements to try to make more sense of the sentence.

What is not yet known is which combinations will register with the participant as being semantically odd and which will be processed with no problem. Simply put, immediately after reading “the bird,” will the reader so strongly expect to see something typical of birds that he or she will accept an action impossible for a penguin, or will the connection that this "bird" is a penguin be made in time? If the first is true, sentences (3) and (2) will cause no problems, but sentence (1), because of its resulting semantic oddity, will cause a delay in reading and processing time. Conversely, if the antecedent information is still most prevalent, (1) and (2) should be read without difficulty, but (3) will take much longer to process.

BACKGROUND

The typicality of an antecedent relative to its category anaphor plays an important role in how quickly anaphoric reference is processed. Garrod and Sanford (1977) were the first to examine whether the typicality of an exemplar antecedent affects reading time for a following sentence containing a category anaphor. In terms of our "bird/penguin" example, Garrod and Sanford’s experiment tested whether a pair of sentences like,

            “The robin was bright-eyed and alert. The bird preened its feathers.”
            “The penguin was bright-eyed and alert. The bird preened its feathers.”

would be processed any differently; specifically, whether the sentence containing "bird" would be read more quickly after "robin," a typical antecedent, than after "penguin," an atypical one.

The results of their study did indeed show a significant typicality effect. To explore this further, Garrod & Sanford (1977) conducted a similar experiment in which an intervening sentence was inserted between the sentences containing the antecedent and the anaphor, and found that the effect still took place.

To understand more completely the comprehension process that Garrod and Sanford were testing, it is necessary to understand that each time a person encounters a discourse entity while reading, his or her mind automatically checks all previous entities to identify possible antecedents, beginning from the current center of discourse and working back.

Having shown that readers do check and realize when an antecedent and anaphor refer to the same object in the text, Garrod and Sanford next turned their attention to discovering whether the same checking procedure occurred when the antecedent and anaphor actually refer to different objects (Garrod & Sanford, 1977). They found that the effect does still take place, and that any difference in processing times between co-referring noun-phrase pairs and unrelated noun-phrase pairs is negligible; the checking procedure takes about the same length of time regardless of whether the nouns are actually co-referring.

Much research has been conducted on anaphoric processing since Garrod and Sanford’s (1977) influential paper, and their cornerstone finding of a typicality effect has been upheld. However, very little has been learned about what happens after the anaphor – that is, what effect naming an atypical exemplar by its category has on the representation of that exemplar in the mental model of the discourse that readers develop over time. After all, using a full semantic label like "bird" is different from simply using an anaphor with very little semantic content such as a pronoun like "it" or even repeating the same name again (“penguin”).

One model of anaphoric processing (Garnham & Cowles, in press) has argued that one function of using category anaphors is to place additional focus on category-specific features. Thus, using "bird" to refer to "penguin" may be a signal that the text will go on to highlight the more bird-like aspects of the penguin. If this is the case, then bird-specific features may become more available after encountering an anaphor like "bird."

To recap, we return to an example set of stimuli that will be presented to participants. Words in italics are those that are being looked back to (the antecedents); underlined information is that which is consistent with either the antecedent referent, the anaphor, or both; text in bold is the anaphor.

"A penguin was sitting on the ice. The bird suddenly dove into the water."
"A penguin was sitting on the ice. The bird suddenly preened its feathers."
"A penguin was sitting on the ice. The bird suddenly flew into the air."

In all three of these stimuli, at the moment the participant encounters "bird," he or she immediately runs a check to search for possible antecedents to this noun; all three of these stimuli give the same result, "penguin." The question we are seeking to explore is whether, upon the reader's encountering "bird" and automatically drafting a potential set of features that will be relevant in the upcoming discourse, "bird" influences the representation of its referent "penguin." Thus, will future information most salient to penguins or most salient to birds be encountered more naturally, with the least amount of resistance? If antecedent-specific features are still most relevant, then a predicate such as “dove into the water” would be accepted most naturally; conversely, if anaphor-specific features become more relevant, then a predicate featuring a flying bird would be considered far more natural than one featuring a diving bird.

PREPARATION AND METHOD

Stimuli gathered for use in this study were generated by consulting several resources, including Battig & Montague (1969), who published a study of category norms in which semantic associations between categories and exemplars were examined. The data were based on questionnaires given to participants to determine what they perceived to be typical exemplars of a variety of categories. Exemplars were not suggested, but were generated by the participants themselves during sessions in which they wrote down all that came to mind in thirty seconds. The results were expansive, ranging over 56 categories and including at least ten exemplars for each of these, and were presented in a very clean, list-like format that made internal comparisons easily accessible. The goal for the current experiment was to have pairs of category/atypical exemplar nouns that shared one or two basic characteristics or actions, but which were still different enough that there was very little overlap. To each category and to each of its exemplars, three possible predicates were given: an exemplar-biased predicate (X-Bias), a neutral predicate (Neutral), such as “is preening,” and a category-biased predicate, C-Bias. A representation of the range of possible items for one category/exemplar pair is given in Table 1 below, where "bird" is the category and "penguin" is here the exemplar.

Both the subjects and the predicates were run through CELEX frequency databases to ensure that they were neither too frequent nor too infrequent in the English language. All were fairly uniform in frequency. However, frequency alone is not sufficient to determine true usability, and so the subject/predicate pairs were also evaluated by participants before the experiment was run.


Table 1.
Antecedent-Anaphor Feature Sets

Antecedent/Anaphor    Feature        Bias
Bird                  is flying      C-Bias
Bird                  is preening    Neutral
Bird                  is diving      X-Bias
Penguin               is flying      C-Bias
Penguin               is preening    Neutral
Penguin               is diving      X-Bias

 

Pre-Test

To ensure that the typicality of the chosen subject/predicate pairs would be widely perceived as intended, a "pre-experiment" was prepared. Three versions of a 98-question questionnaire were drafted, each with 98 different subject/predicate pairings whose order had been randomized by an online random-number generator. For each item, participants were shown a subject in a column marked “Thing” and a corresponding action or characteristic (e.g., “is breathing” or “is salty”) under the heading “Action/Property,” and were then prompted to rate the pairing on a scale from 1 (Very Typical) to 7 (Very Atypical). An example of the format of the pre-test is given in Table 2.

A number of measures were taken to prevent each stimulus from interfering with the interpretation of those around it. First, all of the stimuli taken from the same item were divided evenly among the three non-overlapping "lists," so that the same stimulus was not seen twice. Further, all stimuli with semantically related subjects or predicates (e.g., "tree" and "cypress") were spaced at least ten items apart.

Participants were 18 undergraduate students at the University of Florida. Ten were female and eight were male. Their ages ranged from 18 to 23, with a mean age of 19.78. Participants were compensated at an hourly rate, and all consented to the experiment when briefed on what it would involve. Testing was done in a quiet, plainly decorated room free of distraction. Participants were instructed not to begin until they had thoroughly read a set of instructions detailing the meaning and specific context of terms such as “typical” and “atypical” and had been presented with sample items to familiarize them with the format shown in Table 2 below.


Table 2.
Pre-Test Format

Number   Thing    Action / Property   Very Typical                 Very Atypical
1        turkey   is flying           1     2     3     4     5     6     7
2        ship     is submerged        1     2     3     4     5     6     7

 

Data were compiled and organized by arranging the average responses for each item in a matrix that grouped them vertically by item number and horizontally by whether they were categories or exemplars and what "bias" their predicate contained: category bias, exemplar bias, or neutral. These averages were then used to see how successful the stimuli had been in distinguishing their intended typicality.

Stimuli whose typicality was rated either significantly lower or significantly higher than the experimenters intended were evaluated and removed if necessary. For "neutral" conditions, items were removed if the "category" and "exemplar" subjects yielded a significant difference, because there should be no difference in rated typicality between category subjects and exemplar subjects. Similarly, for category-biased predicates, a much higher overall typicality was desired for category subjects than for exemplar subjects, and for exemplar-biased predicates, a higher typicality for exemplar subjects. In either of these cases, the more marked the difference, the better. There was no absolute cutoff in any condition; however, any item showing a discrepancy in directionality, such as a Category C-Bias sentence-set being rated as less typical than a Category X-Bias sentence-set, was immediately disqualified. This screening resulted in the elimination of seven of the remaining 30 statistically sound stimuli, leaving 23 usable items out of an original 49. For these, t-tests of the pairs showed that category and exemplar properties were rated as significantly different (p < .0001) in their association with categories and exemplars, while there was no difference in this rating for the neutral properties (p = .76).
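The directionality check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mean ratings below are invented for the example and are not the study's data, and the function name direction_ok is hypothetical. The sketch encodes one reading of the screening rule, using the pre-test scale on which lower numbers mean "more typical."

```python
# Hypothetical mean ratings for one item (1 = Very Typical, 7 = Very Atypical).
# These values are invented for illustration; they are NOT the study's data.
ratings = {
    ("category", "C-Bias"): 1.8,  # "bird is flying"
    ("category", "X-Bias"): 4.9,  # "bird is diving"
    ("exemplar", "C-Bias"): 6.5,  # "penguin is flying"
    ("exemplar", "X-Bias"): 1.6,  # "penguin is diving"
}


def direction_ok(means):
    """Return True when the mean ratings run in the intended direction.

    Lower numbers mean "more typical", so:
    - a C-Bias predicate should be more typical of the category subject
      than of the exemplar subject;
    - an X-Bias predicate should be more typical of the exemplar subject
      than of the category subject;
    - the category subject should be more typical with its C-Bias predicate
      than with its X-Bias predicate (the disqualifying case named in the text).
    """
    return (
        means[("category", "C-Bias")] < means[("exemplar", "C-Bias")]
        and means[("exemplar", "X-Bias")] < means[("category", "X-Bias")]
        and means[("category", "C-Bias")] < means[("category", "X-Bias")]
    )


print(direction_ok(ratings))
```

An item failing this check would be dropped outright; items passing it would still be subject to the significance-based evaluation described in the text.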

Preparation and Method: Final Test

To leave as little room for error as possible in the final test, some additional preparation of the stimuli was needed. The design of the experiment required that the number of stimuli be a multiple of three, and so the 23 items were pared down to 21 before they were formalized. The final 21 items were then divided into three lists such that each list contained equal numbers of items from each condition and every item was in each list exactly once.
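One common way to satisfy these constraints is a Latin-square rotation of conditions across lists. The sketch below assumes that approach (the text does not say how the lists were actually built) and uses the condition labels from Table 1; the function name build_lists is hypothetical.

```python
from collections import Counter

CONDITIONS = ["C-Bias", "Neutral", "X-Bias"]
N_ITEMS = 21


def build_lists(n_items=N_ITEMS, conditions=CONDITIONS):
    """Rotate conditions across lists so that every item appears once per
    list, in a different condition on each list (assumed Latin-square design)."""
    n_lists = len(conditions)
    lists = [[] for _ in range(n_lists)]
    for item in range(n_items):
        for list_index in range(n_lists):
            condition = conditions[(item + list_index) % n_lists]
            lists[list_index].append((item + 1, condition))
    return lists


lists = build_lists()
for number, stimulus_list in enumerate(lists, start=1):
    counts = Counter(condition for _, condition in stimulus_list)
    print(number, len(stimulus_list), dict(counts))
```

With 21 items and three conditions, each list contains seven items per condition, which is why the item count had to be a multiple of three.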

To prevent participants from becoming aware of the repetitive grammatical structure of the stimuli and anticipating the anaphoric sentence before they read it, 39 "filler" stimuli were created for interspersion among the true stimuli. The formats of these were carefully controlled by taking each salient feature of the true stimuli and adjusting it in some way that "violated" the stimuli’s strict format. For example, in eighteen percent of the fillers the NP antecedent was not referred to again; instead, the anaphor in the second sentence referred back to a second entity that had been introduced. Another eighteen percent adopted the same structure, but made the NP anaphor a pronoun instead. Examples of such filler stimuli can be seen in Table 3. As there was no semantic overlap to take into account, the same filler items appeared in each of the three lists. Also drafted at this time were comprehension questions; for each stimulus and filler item, a short yes-or-no question was written that would test the participant’s understanding of the stimulus. These were written mainly for verification purposes: to assure the experimenters that the participant was truly understanding the stimuli as intended, and that the recorded reading times were based on this understanding.

The stimuli and filler were presented in as large a typeface as would permit each to fit on one line, and were saved as image files, parts of which could be tagged and labeled (as S-Subj., T-Subj., Predicate, and other salient parts) to allow the eye-tracker to superimpose its findings onto the image and identify the main measure of interest: the total time readers spent looking at the predicate of the sentence. This was visually represented by the image file of the stimulus being overlaid with a series of small circles whose breadth was dependent on the amount of time that word was dwelt on, marked with specific times (in milliseconds). Data gathering was only somewhat automatic, as it was necessary for the experimenter to go into the program to retrieve and compile, as well as analyze, these recorded numbers for each participant. 

Participants and Equipment

The 22 participants in this experiment were members of the University of Florida community, 11 female and 11 male, with ages ranging from 19 to 31 and a mean age of 22. All but one were compensated at the rate of $7.50/hr.; that one received course credit for participation.

The eye-tracker used was the EyeLink II model from SR Research in Canada, and was run with the SR Research Experiment Builder software v. 1.4.55.RC.

Table 3.
Examples of Filler Stimuli
Example
“A monk paced the halls of the cloister. The pathways were completely silent.”
“Allen picked up a Kleenex and blew his nose. It was cherry-red from the cold.”
“A marker rolled off the desk and bounced on the ground. It was empty.”
“A boy casually munched a poppyseed bagel. He slowly savored each bite.”
“A laptop hummed softly on the café table. The computer rested, unused.”
“A policeman spoke into his radio. The official was afflicted with a bad cold.”
“A young girl jumped around in the snow. The weather delighted the child.”
“A group of boys was playing frisbee on the lawn. After a while, it got dark.”

PROCEDURE

The participant was seated in a chair centered in front of a monitor, and was briefed via a set of instructions on the monitor about what he or she would be required to do and the form that the stimuli and comprehension questions would take. The participant was given a video-game controller with two shoulder buttons, and used this throughout the experiment either to advance the instructions and sentences on the screen or to answer yes (right shoulder button) or no (left shoulder button) to the comprehension questions. After the participant read the instructions and indicated that he or she was ready, the experimenter fitted him or her with the eye-tracking device: a light headset with two tiny cameras to monitor eye movements. The most time-consuming part of the setup was positioning and focusing each camera (one for each eye) precisely for height, distance, horizontal and vertical angles, and horizontal position. Head motion was monitored and corrected for by a signal sent from the headset to sensors on each corner of the monitor. This set-up generally took five to ten minutes. After the cameras were properly positioned and the participant had found a comfortable position in which he or she could sit for an extended period of time without moving, the machine was calibrated by having the participant look steadily at a small circle as it jumped around the screen. The results of this test were validated a second time.

The experimenter was stationed at a nearby monitor that showed a real-time readout of the participant’s progress, rotated so the participant could not see or be distracted by it, and, after prompting the participant, began the experiment. The participant pushed a shoulder button when he or she was done reading the stimulus to bring up the comprehension question, and after he or she had answered it and refocused on the center of the screen (marked with a small circle), the experimenter manually advanced the program to display the next stimulus. This portion of the experiment took at most twenty minutes, as there were sixty stimuli (21 true items and 39 fillers).

RESULTS AND ANALYSIS

There were several complicating factors involving the equipment or participant understanding that resulted in some data being omitted from further analysis. Sometimes head movements on the part of the participant during a trial would cause the eye-tracker to show eye movement “drift” that could not be corrected during data analysis. One or two trials were lost for each participant due to this. Several participants had to have their entire data set excluded from analysis because of low accuracy in answering the comprehension questions. Intended as a control to ensure full reading, these were simple reading-comprehension questions, and should not have been answered incorrectly if the participant was reading normally. A participant’s data were excluded if he or she answered more than four out of twenty-one questions incorrectly; that is, if the percentage answered correctly fell below roughly 81%. Nine participants were disqualified in this manner, with the worst of these answering only 62% of the questions correctly. Another participant’s data were lost due to experimenter error, bringing the total of unused participants up to ten.
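The exclusion criterion is simple arithmetic, and a short sketch makes the boundary explicit. The names below are hypothetical; the only values taken from the text are the 21 questions and the four-error limit. Four errors correspond to 17 of 21 correct (about 81%), and the reported minimum of 62% is what 13 of 21 correct works out to.

```python
N_QUESTIONS = 21  # comprehension questions scored per participant (from the text)
MAX_ERRORS = 4    # more than four incorrect answers led to exclusion


def accuracy(n_incorrect, n_questions=N_QUESTIONS):
    """Percentage of comprehension questions answered correctly."""
    return 100.0 * (n_questions - n_incorrect) / n_questions


def is_excluded(n_incorrect, max_errors=MAX_ERRORS):
    """Apply the exclusion rule: strictly more than max_errors incorrect."""
    return n_incorrect > max_errors


for errors in (4, 5, 8):
    print(errors, round(accuracy(errors), 1), is_excluded(errors))
```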

The data of the twelve remaining participants were analyzed. Average dwell times (in milliseconds) on the targets of each of the three bias conditions (Category, Exemplar, and Neutral) were calculated. These data, reproduced in Table 4, represent the total amount of time that participants spent looking at the target region of the text.  To recap, “Category” refers to those two-sentence pairs whose targets (or predicates) were semantically biased toward the category anaphor as opposed to the exemplar antecedent. Returning to the "penguin/bird" example, “Category” would be the sentence set ending “The bird flew into the air,” and “Exemplar” the set ending “The bird dove into the water,” with “flew” and “dove” being the targets, respectively.

Looking at these data, a clear effect can be seen immediately. On average, the target reading times for category-biased items were over 50 milliseconds longer than reading times for exemplar-biased items, showing that participants had more difficulty processing the category-biased items even though the category anaphor was closer to the target than the antecedent was.

Table 4.
Target Dwell-Times

Mean Reading Times (ms)   Category-Bias   Exemplar-Bias   Neutral
Total                     411.75          360.90          350.316

This pattern of results was confirmed by paired t-tests between conditions. As expected, the longer category-biased reading times were significantly different from the neutral control reading times (t(11) = 2.25, p = .046), indicating that a difference of this size would be unlikely to arise by chance alone. Further, the exemplar-biased reading times did not differ significantly from the neutral times (t < 1). This shows that there was essentially no increase in target-processing time for exemplar-biased items, and therefore no added difficulty.
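As a rough check on the category-versus-neutral comparison, the two-tailed probability for the reported statistic can be recomputed from the t distribution. The sketch below assumes a paired test on the twelve analyzed participants, which gives 11 degrees of freedom, and integrates the t density numerically with Simpson's rule using only the standard library. The function names are hypothetical, and this is not the software the analysis was actually run with.

```python
import math


def t_pdf(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value: 1 minus twice the area under the density on [0, |t|].

    The area is computed with Simpson's rule, so steps must be even.
    """
    t = abs(t)
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 1.0 - 2.0 * area


# Reported statistic: t = 2.25; assumed df = 12 participants - 1 = 11.
p = two_tailed_p(2.25, 11)
print(round(p, 3))
```

Under these assumptions the result rounds to .046, in line with the reported value; it falls just under the conventional .05 threshold.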

“A penguin was sitting on the ice. The bird suddenly dove into the water.”
“A penguin was sitting on the ice. The bird suddenly flew into the air.”

These findings suggest that readers still retain the semantic information from the antecedent and rank that information highly: more highly, even, than information from a more recent reference to the antecedent.  Practically speaking, "bird" does seem to take its meaning from "penguin," but not so much so that it disrupts the processing of its atypical antecedent.

CONCLUSION

The wealth of data gathered in this experiment provides opportunity for many more, and much more in-depth, analyses. The analysis presented here focuses on only one factor, target dwell-times, and shows a basic directional correspondence between NP processing and typicality. Exploration of dwell times for any of the other five "critical areas" of the passages, as well as regression paths and times, could bring more rounded insight. However, the current data go far enough to suggest that Garnham & Cowles’ prediction was not borne out: the model predicted that "bird" would entirely disrupt semantic memory of "penguin," but no such effect was found in this study. This suggests that a penguin remains a penguin in all circumstances, even when it is also a bird.


REFERENCES

  1. Almor, A. “Noun-Phrase Anaphora and Focus: The Informational Load Hypothesis.” Psychological Review 106 (1999): 748-765.
  2. Battig, W.F., and W.E. Montague. “Category Norms for Verbal Items in 56 Categories: A Replication and Extension of the Connecticut Norms.” Journal of Experimental Psychology 80 (1969): 1-46.
  3. Casey, Paul J. “A Reexamination of the Roles of Typicality and Category Dominance in Verifying Category Membership.” Journal of Experimental Psychology 18.4 (1992): 823-834.
  4. CELEX English database (Release E25) [On-line]. 1993. Available: Nijmegen: Centre for Lexical Information [Producer and Distributor].
  5. Garnham, Alan, and Wind Cowles. “Looking Both Ways: The JANUS Model of Noun Phrase Anaphor Processing.” Reference and Reference Processing (to appear): 1-45.
  6. Garrod, Simon, and Anthony Sanford. “Interpreting Anaphoric Relations: The Integration of Semantic Information while Reading.” Journal of Verbal Learning and Verbal Behavior 16 (1977): 77-90.
  7. Garrod, Simon, Daniel Freudenthal, and Elizabeth Boyle. “The Role of Different Types of Anaphor in the On-Line Resolution of Sentences in a Discourse.” Journal of Memory and Language 33 (1994): 39-68.
  8. Grosz, B., A. Joshi, and S. Weinstein. “Centering: A Framework for Modelling the Local Coherence of Discourse.” Computational Linguistics 21 (1995): 203-226.
  9. Murphy, Gregory L., and Mary E. Lassaline. “Hierarchical Structure in Concepts and the Basic Level of Categorization.” Knowledge, Concepts, and Categories. Cambridge: MIT Press, 1997. 93-131.
  10. Rips, Lance J., Edward J. Shoben, and Edward E. Smith. “Semantic Distance and the Verification of Semantic Relations.” Journal of Verbal Learning and Verbal Behavior 12 (1973): 1-20.
  11. Vanoverberghe, Veerle, and Gert Storms. “Feature Importance in Feature Generation and Typicality Rating.” European Journal of Cognitive Psychology 15 (2002): 1-18.
  12. Van Overschelde, James P., Katherine A. Rawson, and John Dunlosky. “Category Norms: An Updated and Expanded Version of the Battig and Montague (1969) Norms.” Journal of Memory and Language 50 (2004): 289-335.
