A. McGuire, M. Bohacek, H. Farid, P. Taylor, S. Nightingale. How Realistic are AI-generated Faces? European Conference on Visual Perception, 2024.

The advent of diffusion-based models has taken the ability to generate fake content to a new level: with a simple text prompt anyone can create a convincing image of almost anything. Prior research has shown that humans are unable to reliably distinguish between real faces and faces synthesised using generative adversarial networks (GANs), the precursors to diffusion models. We wondered whether diffusion models have also passed through the uncanny valley. To examine this question, we used the same 800 faces (400 real and 400 StyleGAN2) from Nightingale and Farid (2022). We synthesised a further 400 faces using Adobe Firefly, matching the original stimulus set in terms of diversity across gender, age, and race. In an online study, participants first see a short tutorial consisting of examples of synthesised and real faces. Following this, participants complete 96 trials, each consisting of a single face, of which a third are real, a third GAN-synthesised, and a third diffusion-synthesised (for each image type, faces are balanced equally in terms of gender and race). Participants have unlimited time to classify the face as “real” or “synthetic”. Based on small pilot data (N=13), overall accuracy is 50% (chance performance). Participants were more accurate at classifying the diffusion faces (60%) than the GAN (45%) or real (49%) faces. We are now collecting the remaining experimental data. Early indications suggest that humans are limited in their ability to distinguish between real and synthetic images. The difference in accuracy scores suggests that diffusion-synthesised faces are highly realistic, but not yet as realistic as GAN-synthesised faces. We also examined ChatGPT’s ability to accurately classify these faces as real or synthetic. Somewhat to our surprise, with an accuracy of 65%, ChatGPT (v.4.0) outperformed humans, but still struggled to accurately perform this task, further emphasising the photo realism of AI-generated faces.


S. Barrington and H. Farid. Perceptual Estimates of the Physical Attributes of People in Photographs. Vision Sciences, 2023. [poster]

On January 6, 2021 a mob attacked the U.S. Capitol seeking to stop confirmation of the presidential election; in the aftermath, five lay dead. Some insurrectionists were recognizable from photos of that day, but others were masked or obscured. Attempting to aid in identifying (or exonerating) photographed suspects, we ask how accurately physical attributes -- height and weight -- can be estimated from a single photograph. Volunteers (n=58) had their height and weight measured, and were then photographed in neutral and dynamic poses in a studio with no surrounding structures, and in a hallway surrounded by familiar structures (doorway, stairs). Study participants (n=325) recruited from Mechanical Turk were asked to estimate the height/weight of the volunteers depicted in 58 photos. The median absolute height/weight error in the studio is 8.4cm/9.1kg, and 6.4cm/7.5kg in the hallway. Studio and hallway accuracy pooled across 20 respondents improved to 5.5cm/6.7kg and 3.9cm/5.8kg. By comparison, a state-of-the-art, deep-learning based, computer-vision system was used to estimate the 3D body pose and shape (scaled to be consistent with a gender-specific, inter-pupillary distance) from which height/weight was estimated. The median studio and hallway error of 7.3cm/8.0kg and 5.0cm/10.4kg is smaller than that of individual respondents, larger than that of pooled responses, but -- with the exception of pooled weight -- not statistically so. Lastly, ten recruited, licensed photogrammetrists were statistically less accurate at height estimation than our pooled participants and no different at weight estimation. Pooled non-expert human estimates of physical attributes from a photo are surprisingly accurate, even in the absence of reference objects. It is not immediately obvious how these estimates are made, but naive observers remain the most accurate way to estimate basic physical attributes from a single photograph.
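For illustration, a minimal sketch of the pooling analysis described above, using hypothetical per-respondent height estimates (the array shapes, noise levels, and values are assumptions, not the study data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 325 respondents each estimate the height (cm) of 58 volunteers.
true_height = rng.normal(170, 10, size=58)                    # ground-truth heights
estimates = true_height + rng.normal(0, 10, size=(325, 58))   # noisy individual estimates

# Median absolute error of individual respondents.
individual_err = np.median(np.abs(estimates - true_height))

# Pool estimates across a random group of 20 respondents, then compute the error.
group = estimates[rng.choice(325, size=20, replace=False)]    # one pooled group
pooled = np.median(group, axis=0)                             # median estimate per volunteer
pooled_err = np.median(np.abs(pooled - true_height))

print(f"individual: {individual_err:.1f} cm, pooled (n=20): {pooled_err:.1f} cm")
```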


S.J. Nightingale and H. Farid. Synthetic Faces Are More Trustworthy Than Real Faces. Vision Sciences, 2022.

The photo realism of synthetic media (deep fakes) continues to amaze and entertain, as well as alarm those concerned about abuses in the form of non-consensual pornography, fraud, and disinformation campaigns. We have previously shown that synthetic faces are visually indistinguishable from real faces. Because in just milliseconds faces elicit implicit inferences about traits such as trustworthiness, we wondered if synthetic and real faces elicit different judgments of trustworthiness. We synthesized 400 faces using StyleGAN2, ensuring diversity across gender, age, and race. A convolutional neural network descriptor was used to extract a perceptually meaningful representation of each face, from which a matching real face was selected from the Flickr-Faces-HQ dataset. Mechanical Turk participants (N=223) read a brief introduction explaining that the purpose of the study was to assess face trustworthiness on a scale of 1 (very untrustworthy) to 7 (very trustworthy). Each participant then saw 128 faces, one at a time, and rated their trustworthiness. Participants had an unlimited amount of time to respond. The average trustworthiness rating of 4.82 for synthetic faces is higher than the rating of 4.48 for real faces. Although only 7.7% more trustworthy, this difference is significant (t(222) = 4.8, p < 0.001, d = 0.49). Although a small effect, Black faces were rated more trustworthy than South Asian faces, but otherwise there was no effect across race. Women were rated as significantly more trustworthy than men, 4.94 as compared to 4.36 (t(222) = 19.5, p < 0.001, d = 0.82). Synthetically-generated faces are not just photo realistic, they are more trustworthy than real faces. This may be because synthesized faces tend to look more like average faces, which themselves are deemed more trustworthy. Regardless of the underlying reason, and ready or not, synthetically-generated faces have emerged on the other side of the uncanny valley.
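A minimal sketch of the reported comparison, assuming each participant's mean rating for synthetic and real faces is available (the values below are simulated stand-ins, not the study data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical per-participant mean trustworthiness ratings (1-7 scale).
synthetic = rng.normal(4.82, 0.7, size=223)
real = rng.normal(4.48, 0.7, size=223)

# Paired t-test across participants.
t, p = stats.ttest_rel(synthetic, real)

# Cohen's d for paired samples: mean difference over the SD of the differences.
diff = synthetic - real
d = diff.mean() / diff.std(ddof=1)

print(f"t({len(diff) - 1}) = {t:.2f}, p = {p:.3g}, d = {d:.2f}")
```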


S.J. Nightingale, S. Agarwal, E. Harkonen, J. Lehtinen, and H. Farid. Synthetic Faces: How Perceptually Convincing Are They? Vision Sciences, 2021. [poster]

Recent advances in machine learning, specifically generative adversarial networks (GANs), have made it possible to synthesize highly photo-realistic faces. Such synthetic faces have been used in the creation of fraudulent social media accounts, including the creation of a fictional candidate for U.S. Congress. It has been shown that deep neural networks can be trained to discriminate between real and synthesized faces; it remains unknown, however, whether humans can. We examined people's ability to discriminate between synthetic and real faces. We selected 400 faces synthesized using the state-of-the-art StyleGAN2, further ensuring diversity across gender, age, and race. A convolutional neural network descriptor was used to extract a low-dimensional, perceptually meaningful, representation of each face. For each of the 400 synthesized faces, this representation was used to find the most similar real faces in the Flickr-Faces-HQ (FFHQ) dataset. From these, we manually selected a matching face that did not contain additional discriminative cues (e.g., complex background, other people in the scene). Participants (N=315) were recruited from Mechanical Turk and given a brief tutorial consisting of examples of synthesized and real faces. Each participant then saw 128 trials, each consisting of a single face, either synthesized or real, and had unlimited time to classify the face accordingly. Although unknown to the participants, half of the faces were real and half were synthesized. Across the 128 trials, faces were equally balanced in terms of gender and race. Average performance was close to chance with no response bias (d-prime = -0.09; beta = 0.99). These results suggest that StyleGAN2 can successfully synthesize faces that are realistic enough to fool naive observers. We are examining whether a more detailed training session, raising participants' awareness of some common synthesis artifacts, will improve their ability to detect synthetic faces.
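The reported d-prime and beta follow standard signal detection formulas; a minimal sketch with hypothetical hit and false-alarm counts (treating "real" responses to real faces as hits):

```python
from scipy.stats import norm

# Hypothetical counts: 64 real and 64 synthetic trials per participant.
hits = 31          # real faces correctly called "real"
false_alarms = 33  # synthetic faces incorrectly called "real"
n_real, n_synth = 64, 64

# z-transform the hit and false-alarm rates.
z_hit = norm.ppf(hits / n_real)
z_fa = norm.ppf(false_alarms / n_synth)

d_prime = z_hit - z_fa                      # sensitivity
beta = norm.pdf(z_hit) / norm.pdf(z_fa)     # response bias (likelihood ratio)

print(f"d' = {d_prime:.2f}, beta = {beta:.2f}")
```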


S.J. Nightingale, S. Agarwal, and H. Farid. Can We Detect Face Morphing to Prevent Identity Theft? Vision Sciences, 2020. [poster]

A relatively new type of identity theft uses morphed facial images in identification documents, in which images of two individuals are digitally blended to create an image that maintains a likeness to each of the original identities. We examined people's ability to detect facial morphing. We collected 3,500 passport-format facial images of a diverse set of people across gender, age, and race. Convolutional neural network descriptors were used to extract a low-dimensional, perceptually meaningful, representation of each face. For each of 54 faces, these representations were used to find the most similar face in the dataset. A mid-way morph was generated between each pair of different individuals; another mid-way morph was generated between two different photos of the same individual. The morphs were manually edited to remove obvious artifacts. In Experiment 1a (all experiments, N=100), on each trial, participants viewed two images - an original image alongside a morph (from the same or a different individual) - and indicated whether they were of the same individual or not. Participants struggled to perform this task accurately and were biased to respond "same" (d'=0.68; β=1.81). In Experiment 1b we focused participants' attention on the eye/nose/mouth regions and provided feedback on each trial. This did not improve sensitivity but led to a reduced bias (d'=0.58; β=1.09). In Experiment 2a, participants saw a single image - a morph or an original - and indicated whether it was a morphed face or not. Participants performed only slightly above chance (d'=0.21; β=0.98). In Experiment 2b, when participants were informed of morphing artifacts to look out for and received feedback, performance improved slightly (d'=0.53; β=0.92). Preliminary results suggest that computational methods for face recognition may outperform humans but remain imperfect. Combined, these results suggest that face morphing might be a worryingly effective technique for committing identity theft.


S.J. Nightingale, K. Wade, H. Farid, and D. Watson. Can Shadows and Reflections Help in the Detection of Photo Forgeries? Society for Applied Research in Memory and Cognition, Sydney, Australia, 2017.

The growing sophistication of photo-editing tools means nearly anyone can make a convincing forgery. Consequently, we find ourselves questioning the authenticity of photos and perhaps wondering if we can distinguish real from fake. In two experiments we found that observers could not reliably detect even large inconsistencies in shadows (d'=0.15) and reflections (d'=-0.05). Furthermore, observers were biased to accept physically impossible shadows as plausible (c=0.20) and physically correct reflections as implausible (c=-0.16). These findings may partially account for people's willingness to accept manipulated images as real, and may lead to methodologies for training observers to distinguish real from fake images.


D. Finnegan, G. Hamilton, L. Stearns, A. LeWinter, H. Farid, and H. Renedo. Tidewater Glacier Velocities from Repeat Ground-Based Terrestrial LiDAR Scanning; Helheim Glacier, Southeast Greenland. Transactions of the American Geophysical Union, San Francisco, CA, 2014.

Tidewater glaciers exhibit dynamic behaviors across a range of spatial and temporal scales, posing a challenge to both in situ and remote sensing observations. In situ measurements capture variability over very short time intervals, but with limited spatial coverage, and at significant cost and risk to deploy. Conversely, airborne and satellite remote sensing is capable of measuring changes over large spatial extents but at limited temporal resolution. Here we use a near-situ approach to observing dynamic glacier behavior. Terrestrial LiDAR Scanning (TLS) combines the rapid acquisition capabilities of in situ measurements with the broad spatial coverage of traditional remote sensing, and can be carried out from a safe off-ice location. Repeat (30 min) high-resolution, long-range (6-10km) TLS surveys were conducted at Helheim Glacier, southeast Greenland, during July 9-14, 2014, and coincident in situ global positioning system (GPS) observations were acquired close to the glacier terminus. Analysis of these data allows for independent estimates of flow displacement and verification of 3D analytic techniques for quantifying vector motion. These techniques will enable the automated processing of the large volumes of repeat scanning data to be collected during the planned deployment of an autonomous version of our LiDAR scanning system.


M.J. Bravo and H. Farid. Search Templates Can be Adapted to the Context, but Only for Unfamiliar Targets. Vision Sciences, St. Pete Beach, FL, 2014.

When observers search repeatedly for a target in a particular context, they learn a target template that is optimized for that context. If the same object is encountered in a different context, observers may learn a different target template. Can observers learn multiple templates for the same object and switch among these templates depending on the context? In an earlier study, we trained observers to search for a target in three contexts (three types of distractors). We then intermixed the contexts and found that search for the target was faster when observers were given a cue that allowed them to anticipate the context. We concluded that observers were switching their target template depending on the context (VSS 2012). This year, we ruled out the alternative explanation that observers use the cue to suppress the context. To do this, we repeated the experiment but randomly varied the target across trials. The context cue no longer benefited search, supporting the idea that observers used the context cue to switch their target template rather than suppress the context. We also tested whether observers could develop multiple search templates for a target that was already very familiar. We again repeated our original experiment, but we first pre-trained observers to discriminate the target from a large set of highly similar objects. This pre-training eliminated the effect of the context cue. Taken together, our results indicate that observers can develop context-specific search templates for unfamiliar targets. If observers have a pre-existing representation of the target, however, they seem unable to adapt their target template to the context.


M.J. Bravo and H. Farid. Symbolic Distractor Cues Facilitate Search. Vision Sciences, Naples, FL, 2012.

When observers practice finding a particular target among a particular type of distractor, their search times become faster. One benefit of practice is that it allows observers to hone a search template that optimally distinguishes the target from the distractors. If observers practice searching for this same target among a different type of distractor, they will likely develop a different search template. This study examined whether observers can switch among these search templates if they are provided with a symbolic cue to the distractors’ identity prior to the search display. The search items were photographs of four very similar objects (four wristwatches or four fishing lures) with one object selected to serve as the target. In two training sessions, observers practiced finding this target among distractors drawn from the other three objects. Each search display was comprised of the target and five identical distractors arranged randomly but without overlap. Observers indicated whether the target appeared on the right or left side of the display. During the training sessions, displays with different distractors were run in separate blocks of trials, and each trial was preceded by a symbolic cue (a number) that identified the distractors. During a subsequent testing session, displays with different distractors were randomly intermixed, and half of the trials were preceded by the symbolic distractor cue. Ten observers ran the experiment with either the wristwatch or fishing lure stimuli. For all observers, the symbolic distractor cue produced a robust decrease in search times (t(9) = 5.16, p < 0.001, 15% average decrease) with no change in accuracy. This result indicates that observers can develop multiple search templates for the same target object and that observers can readily switch among these templates to optimize their search.


M.J. Bravo and H. Farid. Distinctive Features are Prominent in Object Representations. Vision Sciences, Naples, FL, 2011.

Question: Does our representation of a learned object weigh all features equally or are diagnostic features given greater prominence? To find out, we conducted a visual search experiment based on the premise that search will be fastest when the features that are prominent in the observer's representation are also salient in the stimulus.

Methods: Observers were trained to associate names with three butterflies that had different types of texture on their upper and lower wings. For each observer, the texture sample on one set of wings varied (the diagnostic wings) while the texture sample on the other set of wings was fixed (the common wings). Soon after training, the observers were tested on a visual search task with the butterfly names as cues. Each search stimulus contained one butterfly on a textured background; the observer's task was to locate the butterfly. On some trials, the statistics of the background texture matched those of the common wings, causing the common features to be highly camouflaged and the diagnostic features to be salient. On other trials, the statistics of the background texture matched those of the diagnostic wings, causing the diagnostic features to be highly camouflaged and the common features to be salient.

Predictions: If diagnostic features are given special prominence in object representations, then search should be fastest when the diagnostic features are salient in the image. If common features are given special prominence (possibly because they are seen most frequently), then search should be fastest when the common features are salient in the image.

Results: Observers found butterflies faster on background textures that camouflaged the common wings rather than the diagnostic wings. Our internal representation of objects gives greater prominence to diagnostic features than to common features.


H. Farid and M.J. Bravo. Photo Forensics: How Reliable is the Visual System? Vision Sciences, Naples, FL, 2010.

In 1964, the Warren Commission concluded that John F. Kennedy had been assassinated by Lee Harvey Oswald. This conclusion was based in part on the famous "backyard photograph" of Oswald holding a rifle and Marxist newspapers. A number of people, including Oswald himself, have claimed that the photograph was forged. These claims of forgery have been bolstered by what appear to be inconsistencies in the lighting and shadows in the photo.

This is but one of several cases in which accusations of photographic inauthenticity have spawned national or international controversies and conflicts. Because these claims are often based on perceptual judgments of scene geometry, we have examined the ability of observers to make such judgments. To do this, we rendered scenes that were either internally consistent or internally inconsistent with respect to their shadows, reflections, or planar perspective distortions. We then asked 20 observers to judge the veridicality of the scenes. The observers were given unlimited viewing time and no feedback. Except for the most degenerate cases, performance was near chance, even though the information required to make these judgments was readily available in the scenes. We demonstrate the availability of this information by showing that straightforward computational methods can reliably discriminate between possible and impossible scenes.

We have also used computational methods to test the claims of inauthenticity made about the Oswald backyard photo. By constructing a 3D model of the scene, we show that the shadows in the photo are consistent with a single light source. Our psychophysical results suggest that the claims to the contrary arose because human observers are unable to reliably judge certain aspects of scene geometry. Accusations of photo inauthenticity based solely on a visual inspection should be treated with skepticism.


D.T. Bolger, T. Morrison, B. Vance and H. Farid. Development and Application of a Computer-Assisted System for Photographic Mark-Recapture Analysis. Ecological Society of America, Pittsburgh, PA, 2010.

Background/Questions: Photographic mark-recapture is a cost-effective, non-invasive way to study populations. However, to effectively apply photographic mark-recapture to large populations, computer software is needed for efficient image manipulation and pattern matching. This talk describes a new software package and its application to giraffe (Giraffa camelopardalis) populations in the Tarangire Ecosystem in northern Tanzania.

Results/Conclusions: We created an open source application for the storage, pattern extraction, and pattern-matching of digital images for the purposes of mark-recapture analysis. The resulting software package is a stand-alone, multi-platform application implemented in Java. Over 1200 images were acquired in the field in three primary sampling periods: Sept.-Oct. 2008, Jan.-Mar. 2009, and Dec. 2009. The pattern information in these images was extracted and matched, resulting in capture histories for over 600 unique individuals. These histories were then analyzed with Cormack-Jolly-Seber models to estimate survival rates and with closed population models to estimate population sizes for two spatially distinct subpopulations. Our program employs the SIFT operator (Scale Invariant Feature Transform), which extracts distinctive features invariant to image scale and rotation. This was advantageous in this application as it reduced the preprocessing of images and tolerated a greater range of image quality while maintaining low matching error rates. This new tool allowed photographic mark-recapture to be applied successfully to this relatively large population and suggests it can be successfully applied to other suitable species.
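The released package is a stand-alone Java application; purely to illustrate the kind of SIFT-based matching it performs, here is a minimal Python/OpenCV sketch (the image file names are hypothetical):

```python
import cv2

# Load two grayscale coat-pattern images (hypothetical file names).
img1 = cv2.imread("giraffe_capture1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("giraffe_capture2.jpg", cv2.IMREAD_GRAYSCALE)

# Detect SIFT keypoints and descriptors, invariant to scale and rotation.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Match descriptors and apply Lowe's ratio test to keep distinctive matches.
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

# A simple match score: the more good correspondences, the more likely the
# two photographs show the same individual.
print(f"{len(good)} good matches")
```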


M.J. Bravo and H. Farid. Training Determines the Target Representation for Search. Vision Sciences, Naples, FL, 2009.

Purpose: Visual search is facilitated when observers are pre-cued with the target image. This facilitation arises in part because the pre-cue activates a stored representation of the target. We examined whether training designed to alter the nature of this representation can influence the specificity of the cueing effect.

Methods: The experiment involved a training session and, 1-2 days later, a testing session. For both sessions, the stimuli were photo-composites of coral reef scenes and the targets were images of tropical fish. The observer's task was to judge whether a fish was present in each reef scene. During training, observers practiced searching for 3 exemplars of 4 fish species. Half the observers searched for the 12 fish in 12 separate blocks (blocked-by-fish group), the other half searched for the three fish belonging to each species in separate blocks (blocked-by-species group). During testing, the observers were shown a brief pre-cue one second before the search stimulus. The pre-cues were either identical to the target, the same species as the target, or, as a control, the word "fish".

Prediction: During training, we expected that the blocked-by-fish group would develop a specific representation for each of the 12 fish images, while the blocked-by-species group would develop a more general representation of the 4 fish species. We expected this difference to show up during testing as a difference in the specificity of the cueing effect.

Results: For the blocked-by-fish group, pre-cues facilitated search for identical targets but not same-species targets. For the blocked-by-species group, pre-cues facilitated search for identical targets as well as same-species targets.

Conclusion: The pattern of cueing effects suggests that observers trained on the same visual search stimuli can form different representations of the target.


D.T. Bolger, T. Morrison, B. Vance and H. Farid. A New Software Application for Photographic Mark Recapture Analysis. Society for Conservation Biology, Edmonton, Alberta, Canada, 2010.

Photographic mark-recapture (PMR) is a cost-effective, non-invasive way to study populations. However, to effectively apply PMR to large populations, computer software is needed for efficient image manipulation and pattern matching. We have created an open-source application for the storage, pattern extraction, and pattern-matching of digital images for the purposes of PMR. Our software is a stand-alone, multi-platform application implemented in Java that employs the SIFT operator (Scale Invariant Feature Transform), which extracts distinctive features invariant to image scale and rotation. In this poster we present a validation of the application for two species with distinct markings, wildebeest (Connochaetes taurinus) and giraffe (Giraffa camelopardalis). We used ROC curves (Receiver Operator Characteristics) to characterize the trade-off between false negative and false positive error in the photo-matching process and to identify the best performing scoring procedure. Because false negative error was of greater concern than false positive, we selected scoring thresholds that minimized false negative error. For wildebeest, the best procedure generated false negative error rates of 14% while yielding a 130-fold labor savings over an unassisted matching process. For giraffe, error rates were negligible and labor savings were even greater. These results suggest that this software should be useful to other researchers employing PMR.
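A minimal sketch of the threshold selection described above, assuming each candidate photo pair has a match score and a same/different label (the scores below are simulated, and the 5% false-negative target is an assumption):

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(2)

# Hypothetical match scores: same-individual pairs score higher on average.
labels = np.concatenate([np.ones(200), np.zeros(2000)])        # 1 = same individual
scores = np.concatenate([rng.normal(0.7, 0.15, 200),           # same-individual pairs
                         rng.normal(0.3, 0.15, 2000)])         # different-individual pairs

fpr, tpr, thresholds = roc_curve(labels, scores)
fnr = 1 - tpr

# Pick the highest threshold whose false negative rate stays at or below 5%,
# since missing a true match is costlier than reviewing a false one.
ok = fnr <= 0.05
threshold = thresholds[ok][0]   # thresholds are sorted in decreasing order
print(f"threshold = {threshold:.2f}, FNR = {fnr[ok][0]:.2%}, FPR = {fpr[ok][0]:.2%}")
```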


H. Farid and M.J. Bravo. Photorealistic Rendering: How Realistic Is It? Vision Sciences, Sarasota, FL, 2007.

The US Supreme Court recently ruled that portions of the 1996 Child Pornography Prevention Act are unconstitutional. The Court ruled that computer generated (CG) images depicting a fictitious minor are constitutionally protected. Judges, lawyers, and juries are now being asked to determine whether an image is CG, but there is no data on whether they can reliably do so.

To test the ability of human observers to discriminate CG and photographic images, we collected 180 high-quality CG images with human, man-made, or natural content. Since we were interested in tracking the quality of CG over time, we collected images created over the past six years. For each CG image, we found a photographic image that was matched as closely as possible in content. The 360 images were presented in random order to ten observers from the introductory psychology subject pool at Rutgers. Observers were given unlimited time to classify each image.

Observers correctly classified 83% of the photographic images and 82% of the CG images (d'=1.87). Observers inspected each image for an average of 2.4 seconds. Among the CG images, those depicting humans were classified with the highest accuracy at 93% over all six years. This accuracy declined to 63% for images created in 2006.

Because the experiment was self-paced, inspection times differed among observers, and the results show a strong speed-accuracy trade-off. The observer with the longest inspection time (3.5 seconds/image) correctly classified 90% of all photographic images and 96% of all CG images (d'=3.03). This observer correctly classified 95% of CG images depicting humans, and his only errors occurred with 2006 images, where he achieved an accuracy of 78%.

Even with great advances in computer graphics technology, the human visual system is still very good at distinguishing between computer generated and photographic images.


M.J. Bravo and H. Farid. A Measure of Relative Set Size for Search in Clutter. Vision Sciences, Sarasota, FL, 2007.

Visual search performance is typically measured as a function of set size, but it is unclear how to determine set size in cluttered scenes. Previously, we claimed that in such scenes, set size corresponds to the number of segmentable regions rather than the number of objects (Bravo & Farid, 2003). We supported this claim by showing that distractors with multiple regions produce steeper search functions than distractors with a single region. Our goal this year was to quantify this relationship by using a computational model to count the regions in our clutter stimuli. There are many computational models for segmenting an image into regions; of these, graph-based approaches have shown particular promise. We employed one such algorithm (Felzenszwalb and Huttenlocher, 2004) to count the number of regions in our 2003 stimuli. We then used this measure of set size to replot the search time data. Using the number of segmentable regions as the measure of set size produced a better fit (R2=0.995) than did using the number of distractor objects (R2=0.916). Like all computational models of image segmentation, the output of the algorithm is highly scale-dependent. By adjusting the algorithm's parameters, a single image can be segmented into 50 regions or 500 regions. This variability is not a limitation of the algorithm; it reflects the scale ambiguity inherent in image segmentation. To determine whether our choice of parameters was fortuitous, we explored the parameter space and found that nearly every set of values produced an excellent fit to our data. Evidently, the number of regions in our stimuli is roughly proportional over a wide range of scales. We have confirmed that this proportionality also holds across many natural images. We conclude that computational models of image segmentation can provide a good measure of relative set size in cluttered stimuli.
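A minimal sketch of the region-counting and fitting steps, using scikit-image's implementation of the Felzenszwalb-Huttenlocher algorithm (the image path, algorithm parameters, and timing data are hypothetical placeholders):

```python
import numpy as np
from skimage import io
from skimage.segmentation import felzenszwalb

# Count regions in one search stimulus with the graph-based segmentation algorithm.
image = io.imread("clutter_stimulus.png")            # hypothetical stimulus image
segments = felzenszwalb(image, scale=100, sigma=0.8, min_size=50)
n_regions = len(np.unique(segments))

# Fit mean search time (s) against the region count for each display type
# (hypothetical values) and report the linear fit's R^2.
region_counts = np.array([40, 80, 120, 160])
search_times = np.array([1.1, 1.9, 2.8, 3.6])
slope, intercept = np.polyfit(region_counts, search_times, 1)
pred = slope * region_counts + intercept
r2 = 1 - np.sum((search_times - pred) ** 2) / np.sum((search_times - search_times.mean()) ** 2)

print(f"{n_regions} regions in this stimulus; R^2 of the linear fit = {r2:.3f}")
```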


M.J. Bravo and H. Farid. Using an Interest Point Detector to Find Potential Fragments for Recognition. Vision Sciences, Sarasota, FL, 2006.

Inspired by recent computer vision models for object recognition in clutter, we are developing a model of human object recognition based on local, distinctive fragments. The first stage of such models typically involves the selection of a large pool of potential image fragments using an interest point detector. In subsequent stages, this large pool is reduced to a smaller set of distinctive fragments. In developing a model for humans, our first step has been to determine whether the pool of fragments selected by the most common interest point detector, the Harris Detector (HD), includes the fragments humans find distinctive. Our test images were randomly rotated photographs of 12 common tools. We applied an HD to these images and collected fragments with a wide range of interest ratings. The scale of the HD determined the size of the fragments (8-pixel radius, 1-2% of the whole object). These fragments were then used as the stimuli in a recognition experiment. After a brief training period with whole tools, observers identified the tool fragments. Overall, observers were remarkably good at recognizing these tiny fragments. We then compared the recognition results with the interest ratings of the HD. Many fragments that were recognizable to observers were not given high interest ratings by the HD, which responds best to locations with large luminance gradients in multiple directions (e.g., corners). In addition to recognizing such fragments, observers also recognized fragments with subtle or one-dimensional gradients.
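A minimal sketch of fragment selection with a Harris detector, using scikit-image (the image path and detector parameters are hypothetical; the 8-pixel fragment radius follows the description above):

```python
from skimage import io, color
from skimage.feature import corner_harris, corner_peaks

# Load a randomly rotated tool photograph (hypothetical file name).
image = color.rgb2gray(io.imread("tool_rotated.png"))

# The Harris response is largest where luminance gradients are strong in
# multiple directions (e.g., corners).
response = corner_harris(image, sigma=2)
points = corner_peaks(response, min_distance=8)

# Cut out 8-pixel-radius fragments around each interest point.
r = 8
fragments = [image[y - r:y + r + 1, x - r:x + r + 1]
             for y, x in points
             if r <= y < image.shape[0] - r and r <= x < image.shape[1] - r]

print(f"{len(fragments)} candidate fragments")
```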


V. Maljkovic, P. Martini and H. Farid. The Contribution of Statistical Image Differences to Human Rapid Categorization of Natural Scenes is Negligible. Vision Sciences, Sarasota, FL, 2006.

Purpose: To examine the contribution of low-level image properties to the rapid categorization of natural scenes.

Methods: Three image classes were tested in a blocked design: positive/negative emotional images, landscapes/cityscapes, and animals/vehicles. Within each category, half of the images were natural, whereas the remainder were synthetic stimuli generated by matching statistical feature vectors extracted from the same image class (Portilla & Simoncelli, 2000) and lacked any meaning. Each image was presented (masked) for 13-50msec, once per subject. Natural and synthetic images were shown mixed within a block of trials. 24 subjects categorized both natural and synthetic images into their source class in a 2AFC design, and categorization accuracy was calculated per exposure.

Results: All categories of natural images were reliably discriminated after a single video frame. Categorization of synthetic images, however, was impaired at all exposures. Each class of natural images was discriminated far more accurately than the corresponding synthetic images: 93% vs. 67% for cityscapes/landscapes, 94% vs. 56% for animals/vehicles, and 70% vs. 53% for emotional images (averages across exposures, chance 50%).

Conclusions: The contribution of statistical image differences to image categorization at brief exposures is small in general. It is more sizeable for image categories such as animals/vehicles and cityscapes/landscapes, where computational linear discriminant analysis algorithms have some success. In the case of emotional image categorization the contribution is virtually null, matching the failure of computational algorithms (Maljkovic et al., VSS 2004). Thus, scene categorization at brief exposures seems not to rely heavily on low-level image statistics.


D.C. Finnegan, H. Farid, D.E. Lawson and W. Krabill. Quantifying Surface Fluctuations using Optical Flow Techniques and Multi-Temporal LiDAR. Transactions of the American Geophysical Union, San Francisco, CA, 2006.

In recent decades scientific communities have seen a significant increase in technological innovations and applications using airborne and spaceborne remote sensing. In particular, airborne laser altimetry has provided the opportunity to characterize large-scale terrain and geologic processes such as glaciers and ice sheets at fine-scale resolutions. Processing and deriving information from these data, however, can still pose significant challenges. To this end, we describe a novel approach that combines a multi-temporal LiDAR (Light Detection and Ranging) topographic dataset and optical flow techniques, adapted from the computer vision community, to quantify ice flow dynamics of Hubbard Glacier. Using NASA's Airborne Topographic Mapper (ATM-IV) LiDAR as a source of high-resolution (~5cm) topographic data, repeat airborne surveys of the Hubbard Glacier terminus were acquired on August 22nd and 26th, 2005. From the resulting Digital Elevation Model (DEM) we seek to measure a dense motion field that describes both the shift and the change in elevation of the glacier. The change in the DEM is modeled spatially as locally affine but globally smooth. The model also explicitly accounts for changes in elevation, and for missing data. This approach is built upon a differential multi-scale framework, allowing for the measurement of both large- and small-scale motions. The resulting measurement yields a dense 2-D motion vector field for each point in the DEM. On Hubbard Glacier, we achieve an average accuracy within 8% as compared with manual measurements. These results are encouraging and show that the repeat high-resolution elevation data that LiDAR provides allow us to quantify surface processes in a precise yet timely manner. These results may then be incorporated as essential boundary conditions into models that seek to predict geologic behavior such as glacier and ice sheet flow.
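The motion model above is a custom locally affine, globally smooth formulation; as a simplified stand-in, a dense pyramidal (Farneback) flow between the two gridded surveys illustrates the kind of per-point displacement field being estimated (file names, normalization, and flow parameters are assumptions):

```python
import cv2
import numpy as np

def dem_to_uint8(dem):
    """Normalize a gridded DEM (2-D float array) to an 8-bit image for flow estimation."""
    d = dem - np.nanmin(dem)
    return np.uint8(255 * d / np.nanmax(d))

# Hypothetical gridded elevation data from the two survey dates.
dem_aug22 = np.load("hubbard_2005_08_22.npy")
dem_aug26 = np.load("hubbard_2005_08_26.npy")

# Dense 2-D motion field between the two DEMs (Farneback pyramidal flow is a
# simplified stand-in for the locally affine, globally smooth model).
flow = cv2.calcOpticalFlowFarneback(
    dem_to_uint8(dem_aug22), dem_to_uint8(dem_aug26),
    None, pyr_scale=0.5, levels=5, winsize=31, iterations=5,
    poly_n=7, poly_sigma=1.5, flags=0)

# Horizontal displacement magnitude (grid cells) per DEM point; multiplying by
# the grid spacing and dividing by the 4-day interval gives a velocity.
speed_cells = np.hypot(flow[..., 0], flow[..., 1])
print(f"median displacement: {np.median(speed_cells):.2f} cells")
```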


M.J. Bravo and H. Farid. The Depth of Distractor Processing in Search Through Clutter. Vision Sciences, Sarasota, FL, 2005.

Background clutter can make it difficult to segment whole objects. This is especially true for compound objects, which have parts made from different materials (e.g., a table lamp). We reported earlier that when observers search for a category target in dense clutter, search is slower when the distractors are compound objects rather than simple objects. This result is consistent with two interpretations. In the first, observers reject parts, and this process is slow for compound distractors because they have multiple parts. In the second, observers reject whole objects, and this process is slow for compound distractors because they are difficult to segment. In the present search experiment, we used familiar and chimerical distractors to distinguish between these alternatives. Familiar distractors were drawn from a set of 100 color photographs of everyday objects. Each of these objects had at least two clearly delineated parts. Chimerical distractors were created by exchanging parts between objects. Observers searched for a target defined by its membership in a broad category (e.g., animal) or categories (e.g., animal or vehicle). We found that when target uncertainty was high and target recognition was difficult (e.g., the target was partially occluded, randomly rotated or drawn from two categories), search times were significantly slower for chimerical distractors than for normal distractors. This difference suggests that for some search tasks, observers identify and reject whole objects. This difference was greatly reduced, however, when the target was unoccluded, upright and drawn from a single category. For this simpler search task, observers may reject object parts. In sum, the demands of the search task determine the depth of distractor processing required, and this determines whether observers recognize distractor objects.


H. Sun, D.W. Roberts, H. Farid, Z. Wu, A. Hartov and K.D. Paulsen. Cortical Surface Tracking Using a Stereoscopic Operating Microscope. Neurosurgery, 56:86-97, 2005.

OBJECTIVE: In order to measure and compensate for soft tissue deformation during image-guided neurosurgery, we have developed a novel approach to estimate the three-dimensional (3-D) topology of the cortical surface and track its motion over time.

METHODS: We employ stereopsis to estimate the 3-D cortical topology during neurosurgical procedures. To facilitate this process, two CCD cameras have been attached to the binocular optics of a stereoscopic operating microscope. Prior to surgery, this stereo imaging system is calibrated to obtain the extrinsic and intrinsic camera parameters. During surgery the 3-D shape of the cortical surface is automatically estimated from a stereo pair of images and registered to the preoperative image volume to provide navigational guidance. This estimation requires robust matching of features between the images, which, when combined with the camera calibration, yields the desired 3-D coordinates. After estimating the 3-D cortical surface from stereo pairs, its motion is tracked by comparing the current surface to its prior locations.
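A minimal sketch of the underlying stereo geometry using OpenCV; the projection matrices and matched feature coordinates stand in for the microscope calibration and feature matching described above (all file names are hypothetical):

```python
import cv2
import numpy as np

# Hypothetical calibration: 3x4 projection matrices for the left and right
# CCD cameras, obtained from the pre-surgical calibration step.
P_left = np.load("P_left.npy")
P_right = np.load("P_right.npy")

# Hypothetical matched feature locations (2xN pixel coordinates) found by
# robust feature matching between the stereo pair.
pts_left = np.load("features_left.npy")
pts_right = np.load("features_right.npy")

# Triangulate each correspondence to homogeneous 3-D coordinates.
points_h = cv2.triangulatePoints(P_left, P_right, pts_left, pts_right)
points_3d = (points_h[:3] / points_h[3]).T   # N x 3 surface points

# A parametric surface can then be fit to points_3d and registered to the
# preoperative image volume.
print(points_3d.shape)
```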

RESULTS: We are able to estimate the 3-D topology of the cortical surface with an average error less than 1.2mm. Executing on a 1.1 GHz Pentium machine, the 3-D estimation from a stereo pair of 1024 x 768 resolution images requires approximately 60 seconds of computation. By applying stereopsis over time, we are able to track the motion of the cortical surface, including its pulsatile movement, gravitational sag, tissue bulge as a result of increased intracranial pressure, and the parenchymal shape changes associated with tissue resection. The results from ten surgical cases are reported.

CONCLUSION: We have demonstrated that a stereo vision system coupled to the operating microscope can be used to efficiently estimate the dynamic topology of the cortical surface during surgery. The 3-D surface can be co-registered to the preoperative image volume. This unique intraoperative imaging technique expands the capability of the current navigational system in the OR and increases the accuracy of anatomical correspondence with preoperative images through compensation for brain deformation.


H. Farid and D.C. Finnegan. Quantifying Planetary and Terrestrial Geologic Surfaces Using Wavelet Statistics. Transactions of the American Geophysical Union, San Francisco, CA, 2005.

Over the past two decades the planetary and terrestrial scientific communities have seen a significant increase in airborne and space-based scientific monitoring and data acquisition. Laser altimetry, visible and microwave imaging sensors, and radar altimeters provide insights into fine-scale details and large-scale surface processes. Characterization of surface features and processes from such data, however, still poses significant challenges. To this end, we describe a quantitative approach to statistically characterize surface features and processes from gridded elevation data.

The computer vision community has recently seen significant advances in modeling the statistics of natural images. These models consist of statistical measurements extracted from an image (e.g., a parametric description of Fourier energy). The model's descriptive power is verified by synthesizing a new image with matching statistics. If the synthesized image is visually similar to the original, then the model likely captured some inherent properties of the image. The model parameters can then be used as a quantitative similarity metric.

The statistical model employed here is that of Portilla and Simoncelli, 2000. The model first decomposes an image using a complex wavelet transform. From this decomposition and the original image, a number of statistics are extracted: (1) marginal statistics that embody the basic pixel intensity distribution; (2) coefficient correlations that embody the salient spatial frequencies and local spatial regularities; (3) coefficient magnitude statistics that embody higher-order geometric structures; and (4) cross-scale phase statistics that embody long-range spatial correlations. Depending on the image size and wavelet parameters, approximately 1,000 to 10,000 statistics are extracted.
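The full Portilla-Simoncelli model uses a complex steerable pyramid and a large set of joint statistics; the following much-simplified sketch (ordinary wavelet subbands, marginal statistics only, hypothetical input patches) is meant only to illustrate the flavor of such a feature vector and its use as a similarity metric:

```python
import numpy as np
import pywt
from scipy.stats import skew, kurtosis

def wavelet_stats(image, wavelet="db4", levels=4):
    """Simplified descriptor: marginal statistics of the image plus
    mean/variance/skew/kurtosis of each wavelet subband magnitude."""
    stats = [image.mean(), image.var(), skew(image.ravel()), kurtosis(image.ravel())]
    coeffs = pywt.wavedec2(image, wavelet, level=levels)
    for detail in coeffs[1:]:                 # (horizontal, vertical, diagonal) per scale
        for band in detail:
            m = np.abs(band)
            stats += [m.mean(), m.var(), skew(m.ravel()), kurtosis(m.ravel())]
    return np.array(stats)

# Hypothetical shaded-relief patches (2-D arrays) from two terrain regions.
fluvial = np.load("fluvial_patch.npy")
aeolian = np.load("aeolian_patch.npy")

# Euclidean distance between descriptors as a crude structural similarity metric.
d = np.linalg.norm(wavelet_stats(fluvial) - wavelet_stats(aeolian))
print(f"descriptor distance: {d:.2f}")
```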

We applied this model to a grayscale shaded relief image derived from a 2m lidar DEM. We extracted statistical measurements from each of five qualitatively different regions (fluvial, glacial and aeolian). Synthesized images based on these measurements qualitatively capture the underlying structure of each region. When coupled with pattern recognition techniques, the measurements are used to quantify the structural similarity between different regions.

Further development is needed to apply this approach to surfaces imaged with different modalities and at different scales. These results, nevertheless, provide an encouraging first step in quantifying surface features and their underlying processes.


M.J. Bravo and H. Farid. Still Searching a Cluttered Scene. Vision Sciences, Sarasota, FL, 2004.

Purpose: Most of the research on visual search and recognition has used isolated objects presented on uniform backgrounds. It is unclear whether the conclusions from this work generalize to cluttered scenes. For example, in sparse displays, global shape is thought to play a central role in recognition. Background clutter may obscure global shape, however, and so we expect that in clutter, local cues such as color will play a larger role.

Methods: Each search stimulus contained 12 photographs of ordinary objects. These objects were either arranged sparsely (well-separated on a uniform grid) or arranged as clutter (randomly positioned and often overlapping). On each trial, observers were first presented with a category name (animal, vehicle, food). Their task was to locate a member of that category in the search stimulus. This target often overlapped other objects, but was itself never occluded. To measure the effect of color, the stimuli were presented in their original color, in grayscale, or in a hue-shifted color.

Results: In the sparse condition, response times did not differ for the original-color, grayscale, and hue-shifted stimuli. In the clutter condition, however, response times were about 20% faster for the original-color stimuli than for the grayscale and hue-shifted stimuli.

Conclusions: In sparse stimuli, global shape is such a reliable cue for recognition that color plays a minimal role. In clutter stimuli, however, global shape is less reliable and so local cues like color have increased importance. (Color might also be expected to facilitate segmentation in the clutter stimuli, but the similar response times for hue-shifted and grayscale stimuli did not support this prediction.) This result, in conjunction with our results from last year, indicates that the processes underlying recognition and search may differ significantly for sparse and cluttered scenes.


V. Maljkovic, P. Martini and H. Farid. The Time-Course of Categorization of Real-Life Scenes with Affective Content. Vision Sciences, Sarasota, FL, 2004.

Purpose: To establish the temporal dynamics of the human ability to extract meaning from scenes.

Methods: EXP 1: 384 color images with emotional valence from the IAPS set were presented (masked) once to each of 96 subjects, at durations from one video-frame (13 ms) to 1710ms. Subjects rated each image's valence on a 9-point scale. We calculated mean ratings per exposure and derived hazard functions for different valence categories. EXP 2: Three image classes were tested in a blocked design: positive/negative images, landscapes/cityscapes and animals/vehicles. Each image was presented (masked) for 13-50msec. Subjects categorized the images in a 2AFC design and accuracy of categorization was calculated per exposure.

Results: EXP 1: Valence was reliably discriminated after a single video frame and asymptoted at ~1s. The derived hazard functions show that categorization rates for positive and negative images are the same, with a transient peak at ~50ms, and a sharp decline by 200ms. EXP 2: Performance remained constant at ~95% for landscapes/cityscapes and animals/vehicles at all exposures; performance for emotional scenes improved from ~60% at one frame exposure to ~75% at 50 ms exposure. To determine if low-level features could be responsible for these results we built a statistical model consisting of 24 low-level measurements of luminance and spatial frequency. A linear classifier was able to almost perfectly separate the landscapes/cityscapes and animals/vehicles, but was unable to separate the valence categories.
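A minimal sketch of the kind of low-level classification test described, with a simplified, hypothetical stand-in for the 24 measurements (luminance statistics plus coarse spectral-energy bands):

```python
import numpy as np
from numpy.fft import fft2, fftshift
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def low_level_features(image):
    """Simplified stand-in for the 24 measurements: luminance statistics
    plus mean spectral energy in a few coarse frequency bands."""
    feats = [image.mean(), image.std()]
    spectrum = np.abs(fftshift(fft2(image)))
    h, w = spectrum.shape
    y, x = np.ogrid[:h, :w]
    radius = np.hypot(y - h / 2, x - w / 2)
    for lo, hi in [(0, 8), (8, 32), (32, 128)]:       # coarse radial frequency bands
        feats.append(spectrum[(radius >= lo) & (radius < hi)].mean())
    return feats

# Hypothetical stacks of grayscale images (n, H, W) for the two categories.
landscapes = np.load("landscapes.npy")
cityscapes = np.load("cityscapes.npy")

X = np.array([low_level_features(im) for im in np.concatenate([landscapes, cityscapes])])
y = np.array([0] * len(landscapes) + [1] * len(cityscapes))

# Cross-validated linear discriminant analysis: near-perfect separation would
# mirror the landscapes/cityscapes result; chance-level separation the emotional one.
scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5)
print(f"mean classification accuracy: {scores.mean():.2f}")
```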

Conclusions: Image meaning is available at exposures as brief as one video-frame. While rapid categorization of some image classes could exploit differences in low-level image properties, no such differences seem to be available for emotional scenes, and yet image meaning can be extracted from them reliably and quickly. This suggests a true act of object recognition, dependent on mechanisms functioning on similarly fast scales.


M.J. Bravo and H. Farid. Searching a Cluttered Scene. Vision Sciences, Sarasota, FL, 2003.

Purpose: In one popular scenario of vision, bottom-up grouping processes organize a scene into objects and then attention selects one of these objects for recognition. A problem with this scenario is that many ordinary objects are composed of multiple distinct parts (e.g., a lamp with a paper shade and a ceramic base), and when these objects are presented in clutter, it may not be possible to group whole objects using only bottom-up processes. To test whether attention selects whole objects or just object parts, we asked observers to search for a category target (food) in cluttered displays composed of single-part and multi-part objects.

Methods: Each display contained twelve color photographs of ordinary objects. The observer's task was to determine whether these objects included a food item. In half of the displays, the distractors were selected from 66 single-part objects; in the other half, the distractors were selected from 66 multi-part objects. While both types of displays were composed of the same number of similarly sized objects, the multi-part displays had many more parts. We also used two types of object arrangements. In the sparse arrangement, the objects were uniformly positioned and well-separated from one another. In the clutter arrangement, the objects were randomly positioned and overlapped one another.

Results: With the sparse arrangement, there was little difference in the search times for displays composed of single-part objects and those composed of multi-part objects. With the clutter arrangement, however, search times for multi-part displays were much slower than those for single-part displays.

Conclusion: These data suggest that with sparse arrangements (the norm in vision research), it is reasonable to suppose that the visual system can select and reject whole objects when searching for a target. With cluttered arrangements (the norm in everyday vision), object parts are likely the initial units of selective attention.


H. Sun, H. Farid, D. Roberts, K. Rick, A. Hartov, and K. Paulsen. A Non-contacting 3-D Digitizer For Use in Image-Guided Neurosurgery. American Society for Stereotactic and Functional Neurosurgery, New York City, 2003.

Introduction: We have designed and implemented a non-contacting 3-D digitizer that attaches to the binocular optics of an operating microscope. This system can be used to efficiently and automatically register the surgical scene to the preoperative image volume through cortical feature analysis and then track the 3-D surface topology within the operating field in order to account for motion-induced changes that occur during surgery.

Methods: We have attached two CCD cameras to the binocular optics of an operating microscope. Prior to surgery, this stereo imaging system is calibrated to obtain the extrinsic and intrinsic camera parameters. During surgery the 3-D coordinates of salient image features are automatically estimated from a stereo pair of images and registered to the preoperative image volume to provide navigational guidance. This estimation requires the robust matching of features between the images, which, when combined with the camera calibration, yields the desired 3-D coordinates. A parameterized 3-D surface can then be fit to the estimated 3-D coordinates and, when registered to the preoperative image volume, provides navigational information in the face of tissue motion during surgery.

Results: We are able to estimate the 3-D structure of a surgical scene with an average accuracy of 1.3mm. Executing on a 1.1 GHz Pentium machine, the 3-D estimation from a stereo pair of 1024x768 images requires approximately 8 minutes of computation.

Conclusions: We have demonstrated that an operating microscope can be used to digitize 3-D surfaces, without inducing brain deformation, through efficient acquisition and analysis of stereo image pairs, and that these surfaces can be co-registered to the preoperative image volume through feature analysis.

Learning Objectives: The ability to quickly and automatically estimate 3-D cortical surface topology during neurosurgery has several applications: (1) cortical vasculature can be localized in 3-D and registered with pre-operative imaging data; (2) fiducial markers can be localized in 3-D and used for the intraoperative update of calibration parameters; (3) the 3-D cortical surface can be continuously estimated and tracked for use in FEM-based compensation of brain deformation and shift that occurs in the OR.


M.J. Bravo and H. Farid. Segmentation in Clutter. Vision Sciences, Sarasota, FL, 2002.

In a cluttered scene, it may be difficult to fully segment an object using only bottom-up cues. In such cases we may segment the object by first detecting one of its salient, distinctive parts and then using this part to predict the location and orientation of other object parts. For a rigid object, the predictive power of the salient part should depend on its symmetry. For example, a sphere which has infinite rotational symmetry (and so looks the same from all viewpoints) should have less predictive power than a cone.

To test this idea we constructed computer-generated, rigid objects composed of two pieces: a "handle" (a simple geometric shape) and a "tool" (two connected cylinders). In each scene, an object was presented at a random orientation amongst clutter composed of cylinders resembling the tool. A small black or white ring was placed around one of the tool's cylinders at a location that varied across trials. Similar rings were also placed on the clutter. The observer's task was to report the color of the ring located on the tool. Because the tool was camouflaged against the background of clutter, response times were expected to depend on the degree to which the salient handle could be used to predict the tool's location and orientation in the clutter. That is, response times were expected to depend on the symmetry of the handle.

The results supported this idea: response times increased monotonically as the symmetry of the handle increased from 0-fold to 2-fold to 4-fold. Response times for handles with infinite symmetry, however, were no longer than those with 4-fold symmetry.

We conclude that observers can use a salient part to predict the location and orientation of the rest of an object. The predictive power of these salient parts depends, up to a limit, on their symmetry.


H. Farid and E.H. Adelson. Energy versus Synchrony in Perceptual Grouping. Vision Sciences, Sarasota, FL, 2002.

It has been proposed that the human visual system can use temporal synchrony for perceptual grouping. In a compelling demonstration of this theory, a stochastic motion display purportedly driven solely by temporal synchrony was shown to promote grouping. It was then argued that these effects point to the role of synchrony-based mechanisms and processes. We have previously argued that the displays contain a traditional form of contrast energy and thus the grouping phenomena might be attributed to traditional mechanisms.

To further study this topic we devised new stimuli rich in temporal synchrony but devoid of contrast energy. These stimuli allow aspects of synchrony and spatio-temporal energy to be independently manipulated. We find that the energy, and not the synchrony, predicts the results.

The stochastic displays consist of a sea of drifting elements. On each frame every element moves according to a random process. Different random processes drive all the elements in the central and surrounding regions. One might argue that the resulting form cue is defined solely by the fine-grained temporally synchronous motion reversals. We observe, however, that there are moments when all elements in one region repeatedly reverse directions, while in the other region all elements have a run with no reversals. We show that a classic spatio-temporal energy model consisting of a spatial lowpass filter and a temporal bandpass filter can convert these relatively large-scale temporal change differences into a contrast cue.
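A minimal sketch of the energy computation described above (spatial lowpass followed by temporal bandpass), applied to a stimulus movie stored as a (frames, height, width) array; the filter scales and file name are assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Hypothetical stimulus movie of drifting elements: shape (frames, height, width).
movie = np.load("synchrony_stimulus.npy").astype(float)

# Spatial lowpass: blur each frame so individual elements pool into a regional signal.
spatial = gaussian_filter(movie, sigma=(0, 4, 4))

# Temporal bandpass: a difference of two temporal Gaussians passes mid-rate
# fluctuations (runs of reversals vs. runs without) and rejects the mean.
temporal = gaussian_filter(spatial, sigma=(1, 0, 0)) - gaussian_filter(spatial, sigma=(4, 0, 0))

# Energy: the squared bandpass response, averaged over time, yields a spatial map
# in which the central and surrounding regions differ in contrast.
energy = (temporal ** 2).mean(axis=0)
print(energy.shape)
```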

This simple model is consistent with the psychophysical results of Lee and Blake (Science, 1999), Kandil and Fahle (Euro. J. Neuro., 2001), Farid and Adelson (Nat. Neuro., 2001), and Morgan and Castet (Proc. Roy. Soc., 2002). A model based on temporal synchrony alone cannot explain all of these results. We conclude that the proposed synchrony-based mechanisms and processes are neither necessary nor sufficient to explain the phenomena.


A.M. Heimsath and H. Farid. Hillslope Topography from Unconstrained Photographs. Transactions of the American Geophysical Union, San Francisco, CA, 2002.

Quantifications of Earth surface topography are essential for modeling the connections between physical and chemical processes of erosion and the shape of the landscape. Enormous investments are made in developing and testing process-based landscape evolution models. These models may never be applied to real topography because of the difficulties in obtaining high-resolution (1-2 m) topographic data in the form of digital elevation models (DEMs). Here we present a simple methodology to extract the high-resolution 3-dimensional topographic surface from photographs taken with a hand-held camera with no constraints imposed on the camera positions or field survey. This technique requires only the selection of corresponding points in three or more photographs. From these corresponding points the unknown camera positions and surface topography are simultaneously estimated. We compare results from surface reconstructions estimated from high-resolution survey data from field sites in the Oregon Coast Range and northern California to verify our technique. Our most rigorous test of the algorithms presented here is from the soil-mantled hillslopes of the Santa Cruz marine terrace sequence. Results from three unconstrained photographs yield an estimated surface, with errors on the order of 1 m, that compares well with high resolution GPS survey data and can be used as an input DEM in process-based landscape evolution modeling. We further explore this method by quantifying the volume of sediment lost by landsliding. Finally, we compare curvature (used as a proxy for landscape lowering with a simple diffusion-like model) calculated with the photo-estimated topography with high-resolution field surveys to test further the applicability of this methodology to Earth surface process studies.


S. Inati, H. Farid, K. Sherwin, and S. Grafton. A Global Probabilistic Approach to Fiber Tractography with Diffusion Tensor MRI. Human Brain Mapping, Brighton, UK, 2001.

Introduction: Coherently organized tissue has a high degree of diffusional anisotropy, which can be observed non-invasively using diffusion-weighted MRI. Diffusion tensor MRI (DT-MRI) has been applied with some success to the in vivo tracing of white-matter fiber pathways in the brain. Current approaches[1,2,3] "grow" pathways by taking steps in the direction indicated by the diffusion tensor at each point. The noise inherent in DT-MRI data limits the ability of these stepwise techniques to trace long fibers in their entirety. In contrast to these local approaches, we describe a global approach that employs Expectation-Maximization (EM)[4]. EM is a statistical technique that has been used in computer vision to estimate parametric models from 2-D motion vector fields[5]. Here, we apply EM to the estimation of neuronal pathways from 3-D tensor fields. EM is an iterative two-stage process for simultaneous data segmentation and model estimation. To employ EM, a set of models is first defined to describe the data. In the E-step, the probability of every data point belonging to each model is computed. In the M-step, these probabilities are used to re-estimate the model parameters. The E- and M-steps are repeated until the probabilities and model parameters converge to a solution.

Methods: Subjects were scanned using a 1.5T GE Signa Echospeed (LX8.3) equipped with 4 G/cm gradients. Diffusion-weighted images were acquired using a single-shot, diffusion-weighted, spin-echo EPI sequence with 6 gradient encoding directions (b=1000). In our case, the data consisted of a volume of vector-valued data points (the principal eigenvectors of the apparent diffusion tensor). The model was taken to be a low-order Bezier curve. A single run of EM yielded a parametric description of a pathway and a probability of each point belonging to that pathway. Multiple pathways were classified after repeated runs with different random starting conditions.
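
The following is a minimal sketch of this kind of EM fit (one reading of the abstract, not the authors' code): a single cubic Bezier pathway is fit to a cloud of candidate voxel positions, with a uniform background component so that off-pathway voxels receive low membership. The distance-based likelihood, the parameter values, and all names are assumptions; the actual method also uses the eigenvector directions.

    import numpy as np

    def bezier(ctrl, t):
        # Cubic Bezier curve (ctrl: 4 x 3 control points) evaluated at parameters t.
        t = t[:, None]
        return ((1 - t) ** 3 * ctrl[0] + 3 * (1 - t) ** 2 * t * ctrl[1]
                + 3 * (1 - t) * t ** 2 * ctrl[2] + t ** 3 * ctrl[3])

    def em_pathway(points, n_iter=50, sigma=2.0, bg_density=1e-4):
        # points: N x 3 voxel coordinates (e.g. voxels with high anisotropy).
        ctrl = points[np.random.choice(len(points), 4, replace=False)]  # random start
        ts = np.linspace(0.0, 1.0, 100)
        for _ in range(n_iter):
            # E-step: membership probability from distance to the current curve.
            curve = bezier(ctrl, ts)                                # 100 x 3 samples
            d2 = ((points[:, None, :] - curve[None]) ** 2).sum(-1)  # N x 100 distances
            nearest = d2.argmin(axis=1)
            lik = np.exp(-d2[np.arange(len(points)), nearest] / (2 * sigma ** 2))
            w = lik / (lik + bg_density)                            # P(point on pathway)
            # M-step: weighted least-squares refit of the four control points.
            t = ts[nearest]
            B = np.stack([(1 - t) ** 3, 3 * (1 - t) ** 2 * t,
                          3 * (1 - t) * t ** 2, t ** 3], axis=1)    # N x 4 Bezier basis
            BtW = B.T * w                                           # weight each point
            ctrl = np.linalg.solve(BtW @ B, BtW @ points)
        return ctrl, w

Repeated runs from different random starts, as in the abstract, would then be compared (e.g. by their final likelihoods) to identify distinct pathways.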

Results/Discussion: Using the global probabilistic methods outlined above, we have successfully traced white matter fiber pathways. The global nature of this technique provides several advantages over local approaches: 1) Fiber tracts can be traced through noisy regions. 2) Longer fiber tracts can be traced because EM is insensitive to the accumulation of error found in stepwise solutions. 3) EM provides a measure of the likelihood of each pathway given the underlying data. 4) Finally, by fixing both endpoints, pathways connecting two brain regions can be easily found. We have presented a method for fiber tracking using EM. This global approach can be tailored and extended to provide a valuable tool in neuroimaging, separately or in concert with existing methods.

References
[1] P.J. Basser, et al., MRM 44:625-632, 2000
[2] T.E. Conturo, et al., PNAS 96:10422-10427, 1999
[3] S. Mori, et al., Ann Neurol 45:265-269, 1999
[4] G.J. McLachlan et al., "Mixture Models: Inference and Applications to Clustering", 1988
[5] A. Jepson et al., Proc CVPR 760-761, 1993

Acknowledgments: This work was supported in part by NSF Grant P50-NS-17778 (Inati), NSF CAREER Award IIS-99-83806 (Farid), NSF Grant EIA-98-02068 (Farid), PHS Grant NS-33504 (Grafton).


M.J. Bravo and H. Farid. Top-Down and Bottom-Up Processes for Object Segmentation. Vision Sciences, Sarasota, FL, 2001.

Purpose: Last year, we reported a study of the effects of object knowledge on object segmentation. Here we examine the relationship between this top-down segmentation process and a bottom-up segmentation process based on luminance cues.

Methods: Twelve block-objects were generated by randomly assigning a color (R, G, B, or Y) to seven blocks and then randomly but neatly stacking the blocks next to or on top of one another. Half of these objects were studied by one set of subjects, the other half by a second set. The block-objects were then neatly stacked next to one another to create block-scenes. These scenes contained no visual cues marking the boundaries between objects. Also in each scene, stacked between two objects, was a 4-, 5-, or 6-block target. The target luminance varied across trials; sometimes it matched the surrounding blocks, at other times it was noticeably darker. The subject's task was to count the number of blocks in the target under three conditions: (1) Top-down: scenes composed of studied objects, target not defined by a luminance cue. (2) Bottom-up: non-studied objects, target defined by a luminance cue. (3) Both: studied objects and a luminance cue.

Results: (1) With only top-down information, subjects were able to accurately (>90%) segment the target blocks from the other 28 blocks in the scene. This was a slow process requiring 10-15 sec. (2) With a target defined by a strong luminance cue, subjects were also accurate but they responded 2-3 times faster. As the luminance cue was reduced, accuracy fell while response times increased to the same level as the top-down condition. (3) When both top-down and weak bottom-up cues were available, some subjects were able to combine the two strategies: accuracy was similar to the top-down only case, but response times were faster.

Conclusion: Object knowledge can be used for object segmentation. Although this top-down process is slow, it can be combined effectively with a faster bottom-up process.


H. Farid and E.H. Adelson. Standard Mechanisms Can Explain Grouping in Temporally Synchronous Displays. Investigative Ophthalmology and Visual Science, Fort Lauderdale, FL, 2000.

Purpose. In a recent report, Lee and Blake (Science, 284, 1999) argued that the human visual system can use temporal microstructure to bind image regions into unified objects, as has been proposed in some neural models. Their stimuli were designed in an attempt to remove all classical form-giving cues, so that timing itself would provide the only form cue. They found that observers could see synchrony-defined form, and they posited the existence of special synchrony-sensitive mechanisms and binding processes. However, we believe that the filtering properties of early vision can convert the synchrony information into contrast information, from which standard mechanisms can extract form.

Methods. Lee and Blake's stimuli consisted of two dense regions of randomly oriented Gabor elements, where the Gabor phase randomly shifted forward or backward on each frame. The elements in a central rectangular region changed in synchrony according to a random sequence, while the elements in the background region changed independently. We downloaded several such movies from their web site, and simulated the effects of temporal lowpass and bandpass filtering.

Results. In the filtered movies, the target region's contrast fluctuated noticeably above and below that of the background. Consider the case of temporal lowpass filtering (i.e., simple visual persistence). If a Gabor element undergoes a run of multiple shifts in one direction, its effective contrast is low due to the temporal averaging. Conversely, if it undergoes a run of alternating shifts, its effective contrast remains fairly high because it is "jittering" in place. Since the Gabor elements in the target region are synchronized, the effective contrast of the entire region fluctuates en masse, and from one moment to the next can be noticeably different from that of the background. Similar results hold for bandpass temporal filters.
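
A minimal numerical sketch of the lowpass case (assumed parameter values; a sinusoid stands in for a Gabor element): averaging frames, as simple visual persistence would, leaves a drifting element with little effective contrast while a jittering element retains most of its contrast.

    import numpy as np

    x = np.linspace(0, 2 * np.pi, 256, endpoint=False)   # one spatial cycle
    phase_step = np.pi / 2                                # phase shift per frame
    n_frames = 8

    # A run of same-direction shifts (drift) vs. a run of alternating shifts (jitter).
    drift_phases = phase_step * np.arange(n_frames)
    jitter_phases = phase_step * (np.arange(n_frames) % 2)

    for label, phases in [("drift", drift_phases), ("jitter", jitter_phases)]:
        frames = np.array([np.sin(x + p) for p in phases])
        persisted = frames.mean(axis=0)                   # temporal lowpass = averaging
        print(label, "effective contrast:", persisted.max() - persisted.min())

With these numbers the drifting element averages to nearly zero contrast, while the jittering element retains roughly 70% of its amplitude, mirroring the account above.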

Conclusions. Lee and Blake's stimuli were cleverly designed to remove form cues from single frames and frame pairs. However, when one considers the full sequence, strong contrast cues can emerge due to the spatio-temporal filtering present in early vision. These cues may well explain the perception of form in these displays, thus obviating the need to posit special grouping mechanisms based on temporal synchrony.


M.J. Bravo and H. Farid. The Role of Object Recognition in Scene Segmentation. Investigative Ophthalmology and Visual Science, Fort Lauderdale, FL, 2000.

Purpose. How does our familiarity with the objects in our environment affect the way we organize the visual world? To find out, we tested how well subjects could segment 3D scenes with ambiguous low-level grouping cues.

Methods. 3D block objects were generated by a simple computer algorithm which neatly stacked 6-8 colored blocks next to or on top of one another. Subjects were trained to recognize four of these block objects. Block scenes were then created by aligning several block objects next to each other such that the object boundaries were completely ambiguous. These block scenes were either composed of the familiar (learned) or unfamiliar objects. Placed neatly in the scene was a target object consisting of four blocks. A new target was used on each trial, and subjects were either shown the target before they were shown the scene (precue) or after they were shown the scene (postcue). The subject's task was to determine whether the target was present in the scene.

Results. In the precue condition, there was no effect of familiarity on accuracy: subjects could search a scene of unfamiliar objects as effectively as a scene of familiar objects. In contrast, the postcue condition showed a large effect of familiarity on accuracy: subjects rarely reported the presence of the target in scenes composed of unfamiliar objects, but they performed quite well with scenes of familiar objects. With scenes of familiar objects, subjects reported that they first identified the block objects and then directed their attention to “what was left over”.

Conclusions. Subjects appear to be able to organize a scene into familiar objects in the absence of low-level grouping cues. It is this organization that allows them to find a target before they know what it looks like (postcue). If subjects do know what the target looks like (precue), then this perceptual organization appears to play no role in search.


M.J. Bravo and H. Farid. Texture Segmentation in 3D. Investigative Ophthalmology and Visual Science, Fort Lauderdale, FL, 1999.

Purpose: Observers can readily discriminate two textures with different orientations when both are presented on a planar surface. In this case, the discontinuity in the image coincides with the discontinuity in the world. But if the surface is folded, the image it produces may contain additional discontinuities. Can observers distinguish texture discontinuities that are due only to changes in surface slant from those that reflect a change in both surface slant and surface texture?

Methods: Our stimulus was a rendered three-panel surface. The texture on the center panel was oriented bandpass noise, the texture on one side panel was the same (in the world, not the image), while the texture on the other side panel was rotated by a variable amount. The stimuli were presented stereoscopically and all observers reported having a vivid 3D percept. The observer's task was to indicate which side panel had the rotated texture.

Results: Performance levels varied with the orientation of the surface texture. Observers performed best with textures that were oriented horizontally on two of the surfaces but they performed near chance with some diagonal textures.

Conclusions: Observers generally have difficulty determining whether a change in image texture is due solely to a change in surface slant or if it also reflects a change in the intrinsic surface texture. While humans are quite adept at detecting texture discontinuities in an image, they are limited in their ability to interpret them.


M.J. Bravo and H. Farid. The Effects of 2D and 3D Smoothness on Motion Segmentation. Investigative Ophthalmology and Visual Science, Fort Lauderdale, FL, 1998.

Purpose: To measure the sensitivity of observers to small, local perturbations in the flow field produced by the rotation of a rigid plane and to determine whether performance is based solely on detecting deviations in the smoothness of the 2D flow field.

Methods: Test stimuli simulated a textured plane rotating about a vertical or horizontal axis, viewed under perspective projection. The plane's texture consisted of eight patches of dots arranged in a circle around the fixation point. As the plane rotated, the patches moved and their shapes changed; however, the shape of one patch, the target, did not change appropriately. That is, at the center of the target patch the velocity was consistent with the plane, but the spatial derivatives of the velocity were not. The observer's task was to locate this target patch. Control stimuli were generated by transforming the test flow field in two ways: either each vector of the flow field was rotated by 90 degrees, or the sign of either the Vx or Vy component of the flow field was inverted. Both transformations preserve the 2D smoothness of the flow but destroy the 3D percept of a rigid plane.
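
For concreteness, a minimal sketch of the two control transformations (illustrative code; names and array conventions are assumptions): each preserves the smoothness of the 2D flow field while destroying its consistency with a rotating rigid plane.

    import numpy as np

    def rotate_vectors_90(vx, vy):
        # Rotate every flow vector by 90 degrees.
        return -vy, vx

    def invert_component(vx, vy, component="x"):
        # Invert the sign of one velocity component (Vx or Vy).
        return (-vx, vy) if component == "x" else (vx, -vy)

    # vx, vy would be 2-D arrays sampled from the test flow field; applying either
    # transform leaves the spatial derivatives smooth but breaks the rigid-plane structure.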

Results: Subjects were able to locate the target patch in all three stimulus conditions, but they required a larger deviation with the control stimuli than with the test stimuli.

Conclusions: Observers are sensitive to local perturbations in a smooth 2D flow field, but they appear to be more sensitive to such perturbations when the flow field corresponds to a rigid 3D plane.


H. Farid, E.P. Simoncelli, M.J. Bravo and P.R. Schrater. Effects of Contrast and Period on Perceived Coherence of Moving Square-Wave Plaids (evidence for a speed bias in the human visual system). Investigative Ophthalmology and Visual Science, Fort Lauderdale, FL, 1995.

Purpose: The coherence of moving square-wave plaids depends on a number of stimulus parameters: plaid angle (theta), grating speed (Sg), contrast, and period. Last year at ARVO, we explored the dependence on the plaid angle and the grating speed. We found that coherence depended on both of these parameters: this dependence is best understood via a reparameterization in terms of pattern speed (Sp = Sg / cos(theta)). When Sp is below a critical speed (roughly 5 deg/sec), the plaid is more likely to be seen as coherent. Above this critical speed, the plaid has the appearance of two gratings sliding transparently over each other. This year, we examined the effect of contrast and component period on the coherence of square-wave plaids.
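
As a worked illustration of the reparameterization (assumed example values; the 5 deg/sec critical speed is the one reported from the previous experiments described above):

    import numpy as np

    def pattern_speed(grating_speed, theta_deg):
        # Sp = Sg / cos(theta), using the plaid angle theta as defined above.
        return grating_speed / np.cos(np.radians(theta_deg))

    critical_speed = 5.0                                   # deg/sec
    for sg, theta in [(2.0, 30.0), (2.0, 75.0)]:
        sp = pattern_speed(sg, theta)
        label = "coherent" if sp < critical_speed else "transparent"
        print(f"Sg={sg} deg/s, theta={theta} deg -> Sp={sp:.1f} deg/s ({label})")

For a fixed grating speed, increasing the plaid angle alone pushes the pattern speed past the critical speed and flips the predicted percept from coherent to transparent.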

Methods: Subjects were presented with symmetric square-wave plaids of varying period and were asked whether the stimuli appeared transparent or coherent. In a second experiment, subjects judged the coherence of symmetric square-wave plaids of varying contrast.

Results: The experiments reveal that both contrast and period affect the perceived coherence of the stimuli: gratings of higher contrast and gratings of smaller period appear more coherent. For fixed period and contrast, the effect of varying plaid angle and grating speed is consistent with our previous experiments: coherence is determined by the pattern speed relative to a critical speed. However, the current experiments reveal that this critical speed depends on the stimulus contrast and period.

Conclusions: These results suggest that the primary determinant of square-wave plaid coherence is the pattern speed. This behavior may be explained by a model for velocity perception with a built-in preference for slower speeds.


H. Farid and E.P. Simoncelli. The Perception of Transparency in Moving Square-Wave Plaids. Investigative Ophthalmology and Visual Science, Sarasota, FL, 1994.

Purpose: We performed psychophysical experiments to determine the rules governing the perception of transparency in additive square-wave plaids.

Methods: Subjects were presented with a randomized sequence of square-wave plaids of varying grating speed, grating orientation and plaid intersection luminance. The two gratings were symmetrically oriented about vertical, with fixed and equal period and duty-cycle. Presentations lasted two seconds, with a three second inter-trial interval. Subjects were asked whether the stimulus appeared to be transparent or coherent.

Results: Our experimental results suggest that the perception of transparency is primarily governed by the pattern speed and the grating speed. In particular, when the pattern speed exceeds a certain critical speed (Sc), the plaid is more likely to be seen as transparent. Furthermore, when the grating speed exceeds the critical speed, subjects report being unable to make clear judgements. This result can be summarized as an idealized diagram of subject response versus pattern speed (Sp) and grating speed (Sg). Further studies suggest that varying the luminance of the plaid intersections (see Stoner et al., 1990) affects the percept of transparency only when the pattern speed is close to the critical speed.

Conclusions: The existence of such a critical speed suggests that the human visual system may have a perceptual preference for slower speeds. These data and the original data of Stoner et al. are consistent with a fairly simple energy-based model for velocity computation in which the representation of velocity is speed-limited.