M.J. Bravo and H. Farid. Search Templates Can be Adapted to the
Context, but Only for Unfamiliar Targets. Vision Sciences,
St. Pete Beach, FL, 2014.
When observers search repeatedly for a target in a particular context,
they learn a target template that is optimized for that context. If
the same object is encountered in a different context, observers may
learn a different target template. Can observers learn multiple
templates for the same object and switch among these templates
depending on the context? In an earlier study, we trained observers to
search for a target in three contexts (three types of distractors). We
then intermixed the contexts and found that search for the target was
faster when observers were given a cue that allowed them to anticipate
the context. We concluded that observers were switching their target
template depending on the context (VSS 2012). This year, we ruled out
the alternative explanation that observers use the cue to suppress the
context. To do this, we repeated the experiment but randomly varied
the target across trials. The context cue no longer benefited search,
supporting the idea that observers used the context cue to switch
their target template rather than suppress the context. We also tested
whether observers could develop multiple search templates for a target
that was already very familiar. We again repeated our original
experiment, but we first pre-trained observers to discriminate the
target from a large set of highly similar objects. This pre-training
eliminated the effect of the context cue. Taken together, our results
indicate that observers can develop context-specific search templates
for unfamiliar targets. If observers have a pre-existing
representation of the target, however, they seem unable to adapt their
target template to the context.
M.J. Bravo and H. Farid. Symbolic Distractor Cues Facilitate Search. Vision Sciences, Naples, FL, 2012.
When observers practice finding a particular target among a particular
type of distractor, their search times become faster. One benefit of
practice is that it allows observers to hone a search template that
optimally distinguishes the target from the distractors. If observers
practice searching for this same target among a different type of
distractor, they will likely develop a different search template. This
study examined whether observers can switch among these search
templates if they are provided with a symbolic cue to the distractors’
identity prior to the search display. The search items were
photographs of four very similar objects (four wristwatches or four
fishing lures) with one object selected to serve as the target. In two
training sessions, observers practiced finding this target among
distractors drawn from the other three objects. Each search display
was composed of the target and five identical distractors arranged
randomly but without overlap. Observers indicated whether the target
appeared on the right or left side of the display. During the training
sessions, displays with different distractors were run in separate
blocks of trials, and each trial was preceded by a symbolic cue (a
number) that identified the distractors. During a subsequent testing
session, displays with different distractors were randomly intermixed,
and half of the trials were preceded by the symbolic distractor
cue. Ten observers ran the experiment with either the wristwatch or
fishing lure stimuli. For all observers, the symbolic distractor cue
produced a robust decrease in search times (t(9) = 5.16, p < 0.001, 15%
average decrease) with no change in accuracy. This result indicates
that observers can develop multiple search templates for the same
target object and that observers can readily switch among these
templates to optimize their search.
M.J. Bravo and H. Farid. Distinctive Features are Prominent in Object Representations. Vision Sciences, Naples, FL, 2011.
Question: Does our representation of a learned object weigh all
features equally or are diagnostic features given greater prominence?
To find out, we conducted a visual search experiment based on the
premise that search will be fastest when the features that are
prominent in the observer's representation are also salient in the
stimulus.
Methods: Observers were trained to associate names with three
butterflies that had different types of texture on their upper and
lower wings. For each observer, the texture sample on one set of wings
varied (the diagnostic wings) while the texture sample on the other
set of wings was fixed (the common wings). Soon after training, the
observers were tested on a visual search task with the butterfly names
as cues. Each search stimulus contained one butterfly on a textured
background; the observer's task was to locate the butterfly. On some
trials, the statistics of the background texture matched those of the
common wings, causing the common features to be highly camouflaged and
the diagnostic features to be salient. On other trials, the statistics
of the background texture matched those of the diagnostic wings,
causing the diagnostic features to be highly camouflaged and the
common features to be salient.
Predictions: If diagnostic features are given special prominence in
object representations, then search should be fastest when those
features are salient in the image. If common features are given
special prominence (possibly because they are seen most frequently),
then search should be fastest when those features are salient in the
image.
Results: Observers found butterflies faster on background textures
that camouflaged the common wings rather than the diagnostic
wings. We conclude that our internal representation of objects gives
greater prominence to diagnostic features than to common features.
H. Farid and M.J. Bravo. Photo Forensics: How Reliable is the Visual System? Vision Sciences, Naples, FL, 2010.
In 1964, the Warren Commission concluded that John F. Kennedy had been
assassinated by Lee Harvey Oswald. This conclusion was based in part
on the famous "backyard photograph" of Oswald holding a rifle and
Marxist newspapers. A number of people, including Oswald himself, have
claimed that the photograph was forged. These claims of forgery have
been bolstered by what appear to be inconsistencies in the lighting and
shadows in the photo.
This is but one of several cases in which accusations of photographic
inauthenticity have spawned national or international controversies
and conflicts. Because these claims are often based on perceptual
judgments of scene geometry, we have examined the ability of observers
to make such judgments. To do this, we rendered scenes that were
either internally consistent or internally inconsistent with respect
to their shadows, reflections, or planar perspective distortions. We
then asked 20 observers to judge the veridicality of the scenes. The
observers were given unlimited viewing time and no feedback. Except
for the most degenerate cases, performance was near chance, even
though the information required to make these judgments was readily
available in the scenes. We demonstrate the availability of this
information by showing that straightforward computational methods can
reliably discriminate between possible and impossible scenes.
We have also used computational methods to test the claims of
inauthenticity made about the Oswald backyard photo. By constructing a
3D model of the scene, we show that the shadows in the photo are
consistent with a single light source. Our psychophysical results
suggest that the claims to the contrary arose because human observers
are unable to reliably judge certain aspects of scene
geometry. Accusations of photo inauthenticity based solely on a visual
inspection should be treated with skepticism.
D.T. Bolger, T. Morrison, B. Vance and H. Farid. Development and Application of a Computer-Assisted System for Photographic Mark-Recapture Analysis. Ecological Society of America, Pittsburgh, PA, 2010.
Background/Questions: Photographic mark-recapture is a cost-effective,
non-invasive way to study populations. However, to effectively apply
photographic mark-recapture to large populations, computer software is
needed for efficient image manipulation and pattern matching. This
talk describes a new software package and its application to giraffe
(Giraffa camelopardalis) populations in the Tarangire Ecosystem in
northern Tanzania.
Results/Conclusions: We created an open source application for the
storage, pattern extraction, and pattern-matching of digital images
for the purposes of mark-recapture analysis. The resulting software
package is a stand-alone, multi-platform application implemented in
Java. Over 1200 images were acquired in the field in three primary
sampling periods, Sept.-Oct. 2008, Jan.-Mar. 2009, and Dec. 2009. The
pattern information in these images was extracted and matched,
resulting in capture histories for over 600 unique individuals. These
histories were then analyzed with Cormack-Jolly-Seber models to
estimate survival rates and closed population models to estimate
population sizes for two spatially distinct subpopulations. Our
program employs the SIFT operator (Scale Invariant Feature Transform),
which extracts distinctive features invariant to image scale and
rotation. This was advantageous in this application because it reduced
image preprocessing and tolerated a greater range of image quality
while keeping matching error rates low. This new tool allowed
photographic mark-recapture to be applied successfully to this
relatively large population and suggests it can be successfully
applied to other suitable species.
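A minimal sketch of the core matching idea, using OpenCV's SIFT implementation rather than the authors' Java application; the file names and the ratio-test threshold are illustrative assumptions, not details from the talk.

```python
# Match SIFT features between two coat-pattern photographs and report a
# simple similarity score (more surviving matches suggests the same animal).
import cv2

img1 = cv2.imread("giraffe_A.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical files
img2 = cv2.imread("giraffe_B.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, desc1 = sift.detectAndCompute(img1, None)  # scale/rotation-invariant features
kp2, desc2 = sift.detectAndCompute(img2, None)

# Keep only matches that pass Lowe's ratio test.
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(desc1, desc2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(f"{len(good)} matched features between the two patterns")
```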
M.J. Bravo and H. Farid. Training Determines the Target Representation for Search. Vision Sciences, Naples, FL, 2009.
Purpose: Visual search is facilitated when observers are pre-cued with
the target image. This facilitation arises in part because the pre-cue
activates a stored representation of the target. We examined whether
training designed to alter the nature of this representation can
influence the specificity of the cueing effect.
Methods: The experiment involved a training session and, 1-2 days
later, a testing session. For both sessions, the stimuli were
photo-composites of coral reef scenes and the targets were images of
tropical fish. The observer's task was to judge whether a fish was
present in each reef scene. During training, observers practiced
searching for 3 exemplars of 4 fish species. Half the observers
searched for the 12 fish in 12 separate blocks (blocked-by-fish
group); the other half searched for the three fish belonging to each
species in separate blocks (blocked-by-species group). During testing,
the observers were shown a brief pre-cue one second before the search
stimulus. The pre-cues were either identical to the target, the same
species as the target, or, as a control, the word "fish".
Prediction: During training, we expected that the blocked-by-fish
group would develop a specific representation for each of the 12 fish
images, while the blocked-by-species group would develop a more
general representation of the 4 fish species. We expected this
difference to show up during testing as a difference in the
specificity of the cueing effect.
Results: For the blocked-by-fish group, pre-cues facilitated search
for identical targets but not same-species targets. For the
blocked-by-species group, pre-cues facilitated search for identical
targets as well as same-species targets.
Conclusion: The pattern of cueing effects suggests that observers
trained on the same visual search stimuli can form different
representations of the target.
D.T. Bolger, T. Morrison, B. Vance and H. Farid. A New Software Application for Photographic Mark Recapture Analysis. Society for Conservation Biology, Edmonton Alberta, Canada, 2010.
Photographic mark-recapture (PMR) is a cost-effective, non-invasive
way to study populations. However, to effectively apply PMR to large
populations, computer software is needed for efficient image
manipulation and pattern matching. We have created an open-source
application for the storage, pattern extraction, and pattern-matching
of digital images for the purposes of PMR. Our software is a
stand-alone, multi-platform application implemented in Java that
employs the SIFT operator (Scale Invariant Feature Transform) which
extracts distinctive features invariant to image scale and
rotation. In this poster we present a validation of the application
for two species with distinct markings, wildebeest (Connochaetes
taurinus) and giraffe (Giraffa camelopardalis). We used ROC curves
(Receiver Operating Characteristic) to characterize the trade-off
between false negative and false positive error in the photo-matching
process and to identify the best performing scoring procedure. Because
false negative error was of greater concern than false positive, we
selected scoring thresholds that minimized false negative error. For
wildebeest, the best procedure generated false negative error rates of
14% while yielding a 130-fold labor savings over an unassisted
matching process. For giraffe, error rates were negligible and labor
savings even greater. These results suggest that this software should
be useful to other researchers employing PMR.
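The threshold-selection step can be sketched as follows; this is a hedged illustration with synthetic match scores standing in for real photo-match data, and the 14% false negative target is taken from the wildebeest result above.

```python
# Choose the highest match-score threshold whose false negative rate
# stays at or below a target, using an ROC curve.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
labels = np.r_[np.ones(200), np.zeros(2000)]                   # 1 = true match pair
scores = np.r_[rng.normal(3, 1, 200), rng.normal(0, 1, 2000)]  # match scores

fpr, tpr, thresholds = roc_curve(labels, scores)
fnr = 1.0 - tpr                  # false negative rate at each threshold

ok = fnr <= 0.14                 # thresholds meeting the FNR target
best = thresholds[ok][0]         # thresholds are sorted high-to-low
print(f"threshold={best:.2f}, FNR={fnr[ok][0]:.2%}, FPR={fpr[ok][0]:.2%}")
```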
H. Farid and M.J. Bravo. Photorealistic Rendering: How Realistic Is It? Vision Sciences, Sarasota, FL, 2007.
The US Supreme Court recently ruled that portions of the 1996 Child
Pornography Prevention Act are unconstitutional. The Court ruled that
computer generated (CG) images depicting a fictitious minor are
constitutionally protected. Judges, lawyers, and juries are now being
asked to determine whether an image is CG, but there is no data on
whether they can reliably do so.
To test the ability of human observers to discriminate CG and
photographic images, we collected 180 high-quality CG images with
human, man-made, or natural content. Since we were interested in
tracking the quality of CG over time, we collected images created over
the past six years. For each CG image, we found a photographic image
that was matched as closely as possible in content. The 360 images
were presented in random order to ten observers from the introductory
psychology subject pool at Rutgers. Observers were given unlimited
time to classify each image.
Observers correctly classified 83% of the photographic images and 82%
of the CG images (d'=1.87). Observers inspected each image for an
average of 2.4 seconds. Among the CG images, those depicting humans
were classified with the highest accuracy, 93% averaged over all six
years, although accuracy fell to 63% for the images created in 2006.
Because the experiment was self-paced, inspection times differed among
observers, and the results show a strong speed-accuracy trade-off. The
observer with the longest inspection time (3.5 seconds/image)
correctly classified 90% of all photographic images and 96% of all CG
images (d'=3.03). This observer correctly classified 95% of CG images
depicting humans, and his only errors occurred with 2006 images, where
he achieved an accuracy of 78%.
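The reported sensitivities follow directly from the hit and false-alarm rates; a quick check, treating "photograph" responses to photographs as hits and "photograph" responses to CG images as false alarms:

```python
# d' is the separation, in z-units, between the two response distributions.
from scipy.stats import norm

def d_prime(hit_rate, fa_rate):
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

print(d_prime(0.83, 1 - 0.82))  # group average: ~1.87
print(d_prime(0.90, 1 - 0.96))  # slowest, most accurate observer: ~3.03
```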
Even with great advances in computer graphics technology, the human
visual system is still very good at distinguishing between computer
generated and photographic images.
M.J. Bravo and H. Farid. A Measure of Relative Set Size for Search in Clutter. Vision Sciences, Sarasota, FL, 2007.
Visual search performance is typically measured as a function
of set size, but it is unclear how to determine set size in cluttered
scenes. Previously, we claimed that in such scenes, set size
corresponds to the number of segmentable regions rather than the
number of objects (Bravo & Farid, 2003). We supported this claim by
showing that distractors with multiple regions produce steeper search
functions than distractors with a single region. Our goal this year
was to quantify this relationship by using a computational model to
count the regions in our clutter stimuli. There are many computational
models for segmenting an image into regions; of these, graph-based
approaches have shown particular promise. We employed one such
algorithm (Felzenszwalb and Huttenlocher, 2004) to count the number of
regions in our 2003 stimuli. We then used this measure of set size to
replot the search time data. Using the number of segmentable regions
as the measure of set size produced a better fit (R^2 = 0.995) than did
using the number of distractor objects (R^2 = 0.916). Like all
computational models of image segmentation, the output of the
algorithm is highly scale-dependent. By adjusting the algorithm's
parameters, a single image can be segmented into 50 regions or 500
regions. This variability is not a limitation of the algorithm; it
reflects the scale ambiguity inherent in image segmentation. To
determine whether our choice of parameters was fortuitous, we explored
the parameter space and found that nearly every set of values produced
an excellent fit to our data. Evidently, the number of regions in our
stimuli is roughly proportional over a wide range of scales. We have
confirmed that this proportionality also holds across many natural
images. We conclude that computational models of image segmentation
can provide a good measure of relative set size in cluttered stimuli.
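A minimal sketch of the region-counting step, using the scikit-image implementation of the same graph-based algorithm (Felzenszwalb & Huttenlocher, 2004); the file name and parameter values are illustrative and, as noted above, the absolute count depends strongly on the scale parameter.

```python
# Count segmentable regions in a clutter stimulus as a measure of set size.
import numpy as np
from skimage import io
from skimage.segmentation import felzenszwalb

img = io.imread("clutter_stimulus.png")  # hypothetical stimulus image
segments = felzenszwalb(img, scale=100, sigma=0.8, min_size=20)
num_regions = len(np.unique(segments))
print(f"estimated set size: {num_regions} segmentable regions")
```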
M.J. Bravo and H. Farid. Using an Interest Point Detector to Find Potential Fragments for Recognition. Vision Sciences, Sarasota, FL, 2006.
Inspired by recent computer vision models for object recognition in
clutter, we are developing a model of human object recognition based
on local, distinctive fragments. The first stage of such models
typically involves the selection of a large pool of potential image
fragments using an interest point detector. In subsequent stages, this
large pool is reduced to a smaller set of distinctive fragments. In
developing a model for humans, our first step has been to determine
whether the pool of fragments selected by the most common interest
point detector, the Harris Detector (HD), includes the fragments
humans find distinctive. Our test images were randomly rotated
photographs of 12 common tools. We applied the HD to these images and
collected fragments with a wide range of interest ratings. The scale
of the HD determined the size of the fragments (8-pixel radius, 1-2%
of the whole object). These fragments were then used as the stimuli in
a recognition experiment. After a brief training period with whole
tools, observers identified the tool fragments. Overall, observers
were remarkably good at recognizing these tiny fragments. We then
compared the recognition results with the interest ratings of the
HD. Many fragments that were recognizable to observers were not given
high interest ratings by the HD, which responds best to locations with
large luminance gradients in multiple directions (e.g., corners). In
addition to recognizing such fragments, observers also recognized
fragments with subtle or one-dimensional gradients.
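The fragment-selection stage can be sketched as follows, with scikit-image's Harris detector standing in for the detector used in the study; the file name and parameter values are illustrative assumptions.

```python
# Detect Harris interest points on a tool photograph and cut a small
# fragment around each point (8-pixel radius, as in the experiment).
import numpy as np
from skimage import io, color
from skimage.feature import corner_harris, corner_peaks

img = color.rgb2gray(io.imread("tool.png"))      # hypothetical tool image
response = corner_harris(img, sigma=2)           # interest rating per pixel
points = corner_peaks(response, min_distance=8)  # local maxima of the response

radius = 8
fragments = [img[r - radius:r + radius + 1, c - radius:c + radius + 1]
             for r, c in points
             if radius <= r < img.shape[0] - radius
             and radius <= c < img.shape[1] - radius]
print(f"{len(fragments)} candidate fragments")
```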
V. Maljkovic, P. Martini and H. Farid. The Contribution of Statistical Image Differences to Human Rapid Categorization of Natural Scenes is Negligible. Vision Sciences, Sarasota, FL, 2006.
Purpose: To examine the contribution of low-level image properties to
the rapid categorization of natural scenes.
Methods: Three image classes were tested in a blocked design:
positive/negative emotional images, landscapes/cityscapes and
animals/vehicles. Within each category, half of the images were
natural, whereas the rest were synthetic stimuli, lacking any meaning,
generated by matching statistical feature vectors extracted from the
same image class (Portilla & Simoncelli, 2000). Each image
was presented (masked) for 13-50 ms, once per subject. Natural and
synthetic images were shown mixed within a block of trials. 24
subjects categorized both natural and synthetic images into their
natural class within a 2AFC design and accuracy of categorization was
calculated per exposure.
Results: All categories of natural images were reliably discriminated
after a single video frame. Categorization of synthetic images,
however, was impaired at all exposures. Each class of natural images
was discriminated far more accurately than the corresponding synthetic
images: 93% vs. 67% for cityscapes/landscapes, 94% vs. 56% for
animals/vehicles, and 70% vs. 53% for emotional images (averages
across exposures, chance 50%).
Conclusions: The contribution of statistical image differences to
image categorization at brief exposures is small in general. It is
larger for image categories such as animals/vehicles and
cityscapes/landscapes, where computational linear discriminant analysis
algorithms have some success. In the case of emotional image
categorization the contribution is virtually null, matching the
failure of computational algorithms (Maljkovic et al., VSS
2004). Thus, scene categorization at brief exposures seems not to
rely heavily on low-level image statistics.
D.C. Finnegan, H. Farid, D.E. Lawson and W. Krabill. Quantifying Surface Fluctuations using Optical Flow Techniques and Multi-Temporal LiDAR. Transactions of the American Geophysical Union, San Francisco, CA, 2006.
In recent decades scientific communities have seen a significant
increase in technological innovations and applications using airborne
and spaceborne remote sensing. In particular, airborne laser altimetry
has provided the opportunity to characterize large-scale terrain and
geologic processes such as glaciers and ice sheets at fine-scale
resolutions. However, processing and deriving information from these
data can still pose significant challenges. To this end, we describe a
novel approach that combines the use of a multi-temporal LiDAR (Light
Detection and Ranging) topographic dataset and optical flow
techniques, adapted from the computer vision community, to quantify
ice flow dynamics of the Hubbard glacier. Using NASA's Airborne
Topographic Mapper (ATM-IV) LiDAR as a source of high-resolution
(~5cm) topographic data, repeat airborne surveys of the Hubbard
Glacier terminus were acquired on August 22nd and 26th, 2005. From the
resulting Digital Elevation Models (DEMs) we seek to measure a dense
motion field that describes both the shift and the change in elevation
of the glacier. The change between the DEMs is modeled spatially as locally
affine but globally smooth. The model also explicitly accounts for
changes in elevation, and for missing data. This approach is built
upon a differential multi-scale framework, allowing for the
measurement of both large and small scale motions. The resulting
measurement yields a dense 2-D motion vector field for each point in
the DEM. On the Hubbard Glacier, we achieve an average accuracy within
8% as compared with manual measurements. These results are encouraging
and show that the repeat high-resolution elevation data that LiDAR
provides allow us to quantify surface processes in a precise yet
timely manner. These results may then be incorporated as essential
boundary conditions into models that seek to predict geologic behavior
such as glacier and ice sheet flow.
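For illustration only, the displacement-measurement step might look like the sketch below, with OpenCV's standard pyramidal (Farneback) dense flow standing in for the locally affine, globally smooth method described above; the file names and parameter values are assumptions.

```python
# Dense optical flow between two hillshaded DEM rasters: one 2-D
# displacement vector per DEM cell.
import cv2
import numpy as np

dem1 = cv2.imread("hubbard_2005_08_22_shade.png", cv2.IMREAD_GRAYSCALE)
dem2 = cv2.imread("hubbard_2005_08_26_shade.png", cv2.IMREAD_GRAYSCALE)

flow = cv2.calcOpticalFlowFarneback(dem1, dem2, None,
                                    pyr_scale=0.5, levels=5, winsize=25,
                                    iterations=3, poly_n=7, poly_sigma=1.5,
                                    flags=0)
speed = np.hypot(flow[..., 0], flow[..., 1])  # horizontal displacement (cells)
print(f"median displacement: {np.median(speed):.2f} cells over four days")
```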
M.J. Bravo and H. Farid. The Depth of Distractor Processing in Search Through Clutter. Vision Sciences, Sarasota, FL, 2005.
Background clutter can make it difficult to segment whole
objects. This is especially true for compound objects, which have
parts made from different materials (e.g., a table lamp). We reported
earlier that when observers search for a category target in dense
clutter, search is slower when the distractors are compound objects
rather than simple objects. This result is consistent with two
interpretations. In the first, observers reject parts, and this
process is slow for compound distractors because they have multiple
parts. In the second, observers reject whole objects, and this process
is slow for compound distractors because they are difficult to
segment. In the present search experiment, we used familiar and
chimerical distractors to distinguish between these
alternatives. Familiar distractors were drawn from a set of 100 color
photographs of everyday objects. Each of these objects had at least
two clearly delineated parts. Chimerical distractors were created by
exchanging parts between objects. Observers searched for a target
defined by its membership in a broad category (e.g., animal) or
categories (e.g., animal or vehicle). We found that when target
uncertainty was high and target recognition was difficult (e.g., the
target was partially occluded, randomly rotated or drawn from two
categories), search times were significantly slower for chimerical
distractors than for normal distractors. This difference suggests that
for some search tasks, observers identify and reject whole
objects. This difference was greatly reduced, however, when the target
was unoccluded, upright and drawn from a single category. For this
simpler search task, observers may reject object parts. In sum, the
demands of the search task determine the depth of distractor
processing required, and this determines whether observers recognize
distractor objects.
H. Sun, D.W. Roberts, H. Farid, Z. Wu, A. Hartov and K.D. Paulsen. Cortical Surface Tracking Using a Stereoscopic Operating Microscope. Neurosurgery, 56:86-97, 2005.
OBJECTIVE: In order to measure and compensate for soft tissue
deformation during image-guided neurosurgery, we have developed a
novel approach to estimate the three-dimensional (3-D) topology of the
cortical surface and track its motion over time.
METHODS: We employ stereopsis to estimate the 3-D cortical topology
during neurosurgical procedures. To facilitate this process, two CCD
cameras have been attached to the binocular optics of a stereoscopic
operating microscope. Prior to surgery, this stereo imaging system is
calibrated to obtain the extrinsic and intrinsic camera
parameters. During surgery the 3-D shape of the cortical surface is
automatically estimated from a stereo pair of images and registered to
the preoperative image volume to provide navigational guidance. This
estimation requires robust matching of features between the images,
which, when combined with the camera calibration, yields desired 3-D
coordinates. After estimating the 3-D cortical surface from stereo
pairs, its motion is tracked by comparing the current surface to its
prior locations.
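The core geometry of this step can be sketched with OpenCV; the calibration matrices and matched feature locations are assumed to come from the calibration and matching stages described above, and the file names are placeholders.

```python
# Triangulate 3-D cortical surface points from matched features in a
# calibrated stereo pair.
import cv2
import numpy as np

P1 = np.loadtxt("camera_left_3x4.txt")       # hypothetical 3x4 projection matrices
P2 = np.loadtxt("camera_right_3x4.txt")
pts_left = np.loadtxt("matches_left.txt").T  # 2 x N matched feature locations
pts_right = np.loadtxt("matches_right.txt").T

X_h = cv2.triangulatePoints(P1, P2, pts_left, pts_right)  # 4 x N homogeneous
X = (X_h[:3] / X_h[3]).T                                  # N x 3 surface points
print(f"reconstructed {len(X)} cortical surface points")
```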
RESULTS: We are able to estimate the 3-D topology of the cortical
surface with an average error of less than 1.2 mm. Executing on a 1.1 GHz
Pentium machine, the 3-D estimation from a stereo pair of 1024 x 768
resolution images requires approximately 60 seconds of computation. By
applying stereopsis over time, we are able to track the motion of the
cortical surface including the pulsatile movement of the cortical
surface, gravitational sag, tissue bulge as a result of increased
intracranial pressure, and the parenchymal shape changes associated
with tissue resection. The results from ten surgical cases are
reported.
CONCLUSION: We have demonstrated that a stereo vision system coupled
to the operating microscope can be used to efficiently estimate the
dynamic topology of the cortical surface during surgery. The 3-D
surface can be co-registered to the preoperative image volume. This
unique intraoperative imaging technique expands the capability of the
current navigational system in the OR and increases the accuracy of
anatomical correspondence with preoperative images through
compensation for brain deformation.
H. Farid and D.C. Finnegan. Quantifying Planetary and Terrestrial Geologic Surfaces Using Wavelet Statistics. Transactions of the American Geophysical Union, San Francisco, CA, 2005.
Over the past two decades the planetary and terrestrial scientific
communities have seen a significant increase in airborne and
space-based scientific monitoring and data acquisition. Laser
altimetry, visible and microwave imaging sensors, and radar altimeters
provide insights into fine-scale details and large-scale surface
processes. Characterization of surface features and processes from
such data, however, still poses significant challenges. To this end,
we describe a quantitative approach to statistically characterize
surface features and processes from gridded elevation data.
The computer vision community has recently seen significant advances
in modeling the statistics of natural images. These models consist of
statistical measurements extracted from an image (e.g., a parametric
description of Fourier energy). The model's descriptive power is
verified by synthesizing a new image with matching statistics. If the
synthesized image is visually similar to the original, then the model
likely captured some inherent properties of the image. The model
parameters can then be used as a quantitative similarity metric.
The statistical model employed here is that of Portilla and
Simoncelli (2000). The model first decomposes an image using a complex
wavelet transform. From this decomposition and the original image, a
number of statistics are extracted: (1) marginal statistics that
embody the basic pixel intensity distribution; (2) coefficient
correlations that embody the salient spatial frequencies and local
spatial regularities; (3) coefficient magnitude statistics that embody
higher-order geometric structures; and (4) cross-scale phase
statistics that embody long-range spatial correlations. Depending on
the image size and wavelet parameters, approximately 1,000 to 10,000
statistics are extracted.
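A greatly simplified sketch of this kind of statistic extraction is shown below; the actual Portilla-Simoncelli model uses a complex steerable pyramid and a much richer statistic set, so PyWavelets and the specific statistics here are stand-in assumptions.

```python
# Extract simple multi-scale wavelet statistics from a shaded relief image.
import numpy as np
import pywt
from scipy.stats import skew, kurtosis
from skimage import io

img = io.imread("shaded_relief.png", as_gray=True).astype(float)
coeffs = pywt.wavedec2(img, "db4", level=4)   # multi-scale decomposition

flat = img.ravel()
stats = [flat.mean(), flat.var(), skew(flat), kurtosis(flat)]  # marginals

for detail in coeffs[1:]:        # per scale: (horizontal, vertical, diagonal)
    for band in detail:
        stats += [band.var(), np.abs(band).mean()]  # subband energy, magnitude

feature = np.array(stats)        # quantitative descriptor for this region
print(f"{feature.size} statistics extracted")
```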
We applied this model to a grayscale shaded relief image derived from
a 2m lidar DEM. We extracted statistical measurements from each of
five qualitatively different regions (including fluvial, glacial, and
aeolian). Synthesized images based on these measurements qualitatively
capture the underlying structure of each region. When coupled with
pattern recognition techniques, the measurements are used to quantify
the structural similarity between different regions.
Further development is needed to apply this approach to surfaces
imaged with different modalities and at different scales. These
results, nevertheless, provide an encouraging first step in
quantifying surface features and their underlying processes.
M.J. Bravo and H. Farid. Still Searching a Cluttered Scene. Vision Sciences, Sarasota, FL, 2004.
Purpose: Most of the research on visual search and recognition has
used isolated objects presented on uniform backgrounds. It is unclear
whether the conclusions from this work generalize to cluttered
scenes. For example, in sparse displays, global shape is thought to
play a central role in recognition. Background clutter may obscure
global shape, however, and so we expect that in clutter, local cues
such as color will play a larger role.
Methods: Each search stimulus contained 12 photographs of ordinary
objects. These objects were either arranged sparsely (well-separated
on a uniform grid) or arranged as clutter (randomly positioned and
often overlapping). On each trial, observers were first presented with
a category name (animal, vehicle, food). Their task was to locate a
member of that category in the search stimulus. This target often
overlapped other objects, but was itself never occluded. To measure
the effect of color, the stimuli were presented in their original
color, in grayscale, or in a hue-shifted color.
Results: In the sparse condition, response times did not differ for
the original-color, grayscale, and hue-shifted stimuli. In the clutter
condition, however, response times were about 20% faster for the
original-color stimuli than for the grayscale and hue-shifted stimuli.
Conclusions: In sparse stimuli, global shape is such a reliable cue
for recognition that color plays a minimal role. In clutter stimuli,
however, global shape is less reliable and so local cues like color
have increased importance. (Color might also be expected to facilitate
segmentation in the clutter stimuli, but the similar response times
for hue-shifted and grayscale stimuli did not support this
prediction.) This result, in conjunction with our results from last
year, indicates that the processes underlying recognition and search
may differ significantly for sparse and cluttered scenes.
V. Maljkovic, P. Martini and H. Farid. The Time-Course of Categorization of Real-Life Scenes with Affective Content. Vision Sciences, Sarasota, FL, 2004.
Purpose: To establish the temporal dynamics of the human ability to
extract meaning from scenes.
Methods: EXP 1: 384 color images with emotional valence from the IAPS
set were presented (masked) once to each of 96 subjects, at durations
from one video-frame (13 ms) to 1710ms. Subjects rated each image
valence on a 9-point scale. We calculated mean ratings per exposure
and derived hazard functions for different valence categories. EXP 2:
Three image classes were tested in a blocked design: positive/negative
images, landscapes/cityscapes and animals/vehicles. Each image was
presented (masked) for 13-50 ms. Subjects categorized the images in a
2AFC design and accuracy of categorization was calculated per
exposure.
Results: EXP 1: Valence was reliably discriminated after a single
video frame and asymptoted at ~1s. The derived hazard functions show
that categorization rates for positive and negative images are the
same, with a transient peak at ~50ms, and a sharp decline by
200ms. EXP 2: Performance remained constant at ~95% for
landscapes/cityscapes and animals/vehicles at all exposures;
performance for emotional scenes improved from ~60% at one frame
exposure to ~75% at 50 ms exposure. To determine if low-level features
could be responsible for these results we built a statistical model
consisting of 24 low-level measurements of luminance and spatial
frequency. A linear classifier was able to almost perfectly separate
the landscapes/cityscapes and animals/vehicles, but was unable to
separate the valence categories.
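The classifier test might be sketched as follows, with scikit-learn's linear discriminant analysis and placeholder arrays standing in for the 24 low-level measurements per image.

```python
# Test the linear separability of image categories from low-level features.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

X = np.load("lowlevel_features.npy")  # hypothetical: (n_images, 24) features
y = np.load("category_labels.npy")    # e.g., 0 = landscape, 1 = cityscape

clf = LinearDiscriminantAnalysis()
acc = cross_val_score(clf, X, y, cv=5).mean()
print(f"cross-validated linear separability: {acc:.1%}")
```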
Conclusions: Image meaning is available at exposures as brief as one
video-frame. While rapid categorization of some image classes could
exploit differences in low-level image properties, no such differences
seem to be available for emotional scenes, and yet image meaning can
be extracted from them reliably and quickly. This suggests a true act
of object recognition, dependent on mechanisms functioning on
similarly fast scales.
M.J. Bravo and H. Farid. Searching a Cluttered Scene. Vision
Sciences, Sarasota, FL, 2003.
Purpose: In one popular scenario of vision, bottom-up grouping
processes organize a scene into objects and then attention selects one
of these objects for recognition. A problem with this scenario is that
many ordinary objects are composed of multiple distinct parts (e.g., a
lamp with a paper shade and a ceramic base), and when these objects
are presented in clutter, it may not be possible to group whole
objects using only bottom-up processes. To test whether attention
selects whole objects or just object parts, we asked observers to
search for a category target (food) in cluttered displays composed of
single-part and multi-part objects.
Methods: Each display contained twelve color photographs of ordinary
objects. The observer's task was to determine whether these objects
included a food item. In half of the displays, the distractors were
selected from 66 single-part objects; in the other half, the
distractors were selected from 66 multi-part objects. While both types
of displays were composed of the same number of similarly sized
objects, the multi-part displays had many more parts. We also used two
types of object arrangements. In the sparse arrangement, the objects
were uniformly positioned and well-separated from one another. In the
clutter arrangement, the objects were randomly positioned and
overlapped one another.
Results: With the sparse arrangement, there was little difference in
the search times for displays composed of single-part objects and
those composed of multi-part objects. With the clutter arrangement,
however, search times for multi-part displays were much slower than
those for single-part displays.
Conclusion: These data suggest that with sparse arrangements (the norm
in vision research), it is reasonable to suppose that the visual
system can select and reject whole objects when searching for a
target. With cluttered arrangements (the norm in everyday vision),
object parts are likely the initial units of selective attention.
H. Sun, H. Farid, D. Roberts, K. Rick, A. Hartov, and K. Paulsen. A Non-contacting 3-D Digitizer For Use in Image-Guided Neurosurgery. American Society for Stereotactic and Functional Neurosurgery, New York City, 2003.
Introduction: We have designed and implemented a non-contacting 3-D
digitizer that attaches to the binocular optics of an operating
microscope. This system can be used to efficiently and automatically
register the surgical scene to the preoperative image volume through
cortical feature analysis and then track the 3-D surface topology
within the operating field in order to account for motion-induced
changes that occur during surgery.
Methods: We have attached two CCD cameras to the binocular optics of
an operating microscope. Prior to surgery, this stereo imaging system
is calibrated to obtain the extrinsic and intrinsic camera
parameters. During surgery the 3-D coordinates of salient image
features are automatically estimated from a stereo pair of images and
registered to the preoperative image volume to provide navigational
guidance. This estimation requires the robust matching of features
between the images, which, when combined with the camera calibration,
yields the desired 3-D coordinates. A parameterized 3-D surface can
then be fit to the estimated 3-D coordinates and, when registered to
the preoperative image volume, provides navigational information in
the face of tissue motion during surgery.
Results: We are able to estimate the 3-D structure of a surgical scene
with an average accuracy of 1.3 mm. Executing on a 1.1 GHz Pentium
machine, the 3-D estimation from a stereo pair of 1024x768 images
requires approximately 8 minutes of computation.
Conclusions: We have demonstrated that an operating microscope can
digitize 3-D surfaces, without contact or induced brain deformation,
through efficient acquisition and analysis of stereo image pairs; these
surfaces can also be co-registered to the preoperative image volume
through feature analysis.
Learning Objectives: The ability to quickly and automatically estimate
3-D cortical surface topology during neurosurgery has several
applications: (1) cortical vasculature can be localized in 3-D and
registered with pre-operative imaging data; (2) fiducial markers can
be localized in 3-D and used for the intraoperative update of
calibration parameters; (3) the 3-D cortical surface can be
continuously estimated and tracked for use in FEM-based compensation
of brain deformation and shift that occurs in the OR.
M.J. Bravo and H. Farid. Segmentation in Clutter. Vision Sciences, Sarasota, FL, 2002.
In a cluttered scene, it may be difficult to fully segment an object
using only bottom-up cues. In such cases we may segment the object by
first detecting one of its salient, distinctive parts and then using
this part to predict the location and orientation of other object
parts. For a rigid object, the predictive power of the salient part
should depend on its symmetry. For example, a sphere, which has
infinite rotational symmetry (and so looks the same from all
viewpoints), should have less predictive power than a cone.
To test this idea we constructed computer-generated, rigid objects
composed of two pieces: a "handle" (a simple geometric shape) and a
"tool" (two connected cylinders). In each scene, an object was
presented at a random orientation amongst clutter composed of
cylinders resembling the tool. A small black or white ring was placed
around one of the tool's cylinders at a location that varied across
trials. Similar rings were also placed on the clutter. The observer's
task was to report the color of the ring located on the tool. Because
the tool was camouflaged against the background of clutter, response
times were expected to depend on the degree to which the salient
handle could be used to predict the tool's location and orientation in
the clutter. That is, response times were expected to depend on the
symmetry of the handle.
The results supported this idea: response times increased
monotonically as the symmetry of the handle increased from 0-fold to
2-fold to 4-fold. Response times for handles with infinite symmetry,
however, were no longer than those with 4-fold symmetry.
We conclude that observers can use a salient part to predict the
location and orientation of the rest of an object. The predictive
power of these salient parts depends, up to a limit, on their
symmetry.
H. Farid and E.H. Adelson. Energy versus Synchrony in Perceptual Grouping. Vision Sciences, Sarasota, FL, 2002.
It has been proposed that the human visual system can use temporal
synchrony for perceptual grouping. In a compelling demonstration of
this theory, a stochastic motion display purportedly driven solely by
temporal synchrony was shown to promote grouping. It was then argued
that these effects point to the role of synchrony-based mechanisms and
processes. We have previously argued that the displays contain a
traditional form of contrast energy and thus the grouping phenomena
might be attributed to traditional mechanisms.
To further study this topic we devised new stimuli rich in temporal
synchrony but devoid of contrast energy. These stimuli allow aspects
of synchrony and spatio-temporal energy to be independently
manipulated. We find that the energy, and not the synchrony, predicts
the results.
The stochastic displays consist of a sea of drifting elements. On each
frame every element moves according to a random process. Different
random processes drive all the elements in the central and surrounding
regions. One might argue that the resulting form cue is defined solely
by the fine-grained temporally synchronous motion reversals. We
observe, however, that there are moments when all elements in one
region repeatedly reverse directions, while in the other region all
elements have a run with no reversals. We show that a classic
spatio-temporal energy model consisting of a spatial lowpass filter
and a temporal bandpass filter can convert these relatively
large-scale temporal change differences into a contrast cue.
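A minimal sketch of such an energy model, assuming the stimulus movie is available as a NumPy array of frames; the filter sizes are illustrative, not the values used in the study.

```python
# Spatial lowpass filtering followed by temporal bandpass filtering; the
# rectified output serves as the contrast-energy cue.
import numpy as np
from scipy.ndimage import convolve1d, gaussian_filter

movie = np.load("synchrony_movie.npy")   # hypothetical (frames, height, width)

blurred = np.stack([gaussian_filter(f, sigma=3) for f in movie])  # spatial lowpass

kernel = np.array([-0.5, 1.0, -0.5])     # simple temporal bandpass kernel
bandpassed = convolve1d(blurred, kernel, axis=0)

energy = bandpassed ** 2                 # rectified "contrast energy"
# Per-frame mean energy; comparing center vs. surround regions at each
# moment would reveal the form cue described above.
print(energy.mean(axis=(1, 2)))
```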
This simple model is consistent with the psychophysical results of Lee
and Blake (Science, 1999), Kandil and Fahle (Euro. J. Neuro., 2001),
Farid and Adelson (Nat. Neuro., 2001), and Morgan and Castet
(Proc. Roy. Soc., 2002). A model based on temporal synchrony alone
cannot explain all of these results. We conclude that the proposed
synchrony-based mechanisms and processes are neither necessary nor
sufficient to explain the phenomena.
A.M. Heimsath and H. Farid. Hillslope Topography from Unconstrained Photographs. Transactions of the American Geophysical Union, San Francisco, CA, 2002.
Quantifications of Earth surface topography are essential for modeling
the connections between physical and chemical processes of erosion and
the shape of the landscape. Enormous investments are made in
developing and testing process-based landscape evolution models. These
models may never be applied to real topography because of the
difficulties in obtaining high-resolution (1-2 m) topographic data in
the form of digital elevation models (DEMs). Here we present a simple
methodology to extract the high-resolution 3-dimensional topographic
surface from photographs taken with a hand-held camera with no
constraints imposed on the camera positions or field survey. This
technique requires only the selection of corresponding points in three
or more photographs. From these corresponding points the unknown
camera positions and surface topography are simultaneously
estimated. We compare results from surface reconstructions estimated
from high-resolution survey data from field sites in the Oregon Coast
Range and northern California to verify our technique. Our most
rigorous test of the algorithms presented here is from the
soil-mantled hillslopes of the Santa Cruz marine terrace
sequence. Results from three unconstrained photographs yield an
estimated surface, with errors on the order of 1 m, that compares well
with high resolution GPS survey data and can be used as an input DEM
in process-based landscape evolution modeling. We further explore this
method by quantifying the volume of sediment lost to landsliding. Finally,
we compare curvature (used as a proxy for landscape lowering with a
simple diffusion-like model) calculated from the photo-estimated
topography against curvature from high-resolution field surveys to
further test the applicability of this methodology to Earth surface
process studies.
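A hedged two-view sketch of the reconstruction idea follows; the method described above handles three or more unconstrained photographs and estimates cameras and topography simultaneously, so OpenCV's two-view pipeline and the file names here are stand-in assumptions.

```python
# Recover relative camera pose from corresponding points in two hand-held
# photographs, then triangulate the hillslope surface (up to scale).
import cv2
import numpy as np

pts1 = np.loadtxt("corr_photo1.txt")     # hypothetical N x 2 corresponding points
pts2 = np.loadtxt("corr_photo2.txt")
K = np.loadtxt("camera_intrinsics.txt")  # 3 x 3 intrinsic matrix

E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
_, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)

P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])  # first camera at the origin
P2 = K @ np.hstack([R, t])                         # estimated second camera
X_h = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
topo = (X_h[:3] / X_h[3]).T                        # 3-D surface points
print(f"{len(topo)} surface points reconstructed")
```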
S. Inati, H. Farid, K. Sherwin, and S. Grafton. A Global Probabilistic Approach to Fiber Tractography with Diffusion Tensor MRI. Human Brain Mapping, Brighton, UK, 2001.
Introduction: Coherently organized tissue has a high degree of
diffusional anisotropy which can be observed non-invasively using
diffusion weighted MRI. DT-MRI has been applied with some success to
the in vivo tracing of white-matter fiber pathways in the
brain. Current approaches [1,2,3] "grow" pathways by taking steps in
the direction indicated by the diffusion tensor at each point. The
noise inherent in DT-MRI data limits the ability of these stepwise
techniques to trace long fibers in their entirety. In contrast to
these local approaches, we describe a global approach that employs
Expectation-Maximization (EM) [4]. EM is a statistical technique that
has been used in computer vision to estimate parametric models from
2-D motion vector fields [5]. Here, we apply EM to the estimation of
neuronal pathways from 3-D tensor fields. EM is an iterative two stage
process for simultaneous data segmentation and model estimation. In
order to employ EM, a set of models is first defined to describe the
data. In the E-step the probability of every data point belonging to
each model is computed. In the M-step the probabilities are used to
re-estimate the model parameters. The E- and M-steps are repeatedly
performed until the probabilities and model parameters converge to a
solution.
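A minimal generic sketch of this EM loop is given below; a weighted polynomial fit stands in for the low-order Bezier curve, and the data array is a placeholder for the principal-eigenvector field.

```python
# Alternate between computing the probability that each point belongs to
# a parametric curve (E-step) and refitting the curve with those
# probabilities as weights (M-step).
import numpy as np

pts = np.load("candidate_points.npy")   # hypothetical (N, 2) data points
x, y = pts[:, 0], pts[:, 1]
sigma, p_out = 2.0, 0.1                 # residual scale and outlier weight
coef = np.polyfit(x, y, 3)              # initial curve model

for _ in range(50):
    # E-step: posterior probability of membership in the curve model.
    r = y - np.polyval(coef, x)
    on_curve = np.exp(-r**2 / (2 * sigma**2))
    prob = (1 - p_out) * on_curve / ((1 - p_out) * on_curve + p_out)
    # M-step: refit the curve, weighting each point by its probability.
    new_coef = np.polyfit(x, y, 3, w=prob)
    if np.allclose(new_coef, coef, atol=1e-6):
        break
    coef = new_coef
print("converged curve coefficients:", coef)
```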
Methods: Subjects were scanned using a 1.5T GE Signa Echospeed (LX8.3)
equipped with 4 G/cm gradients. Diffusion weighted images were
acquired using a single-shot, diffusion-weighted, spin echo EPI
sequence with 6 gradient encoding directions (b=1000). In our case,
the data consisted of a volume of vector-valued data points (the
principal eigenvectors of the apparent diffusion tensor). The model
was taken to be a low-order Bezier curve. A single run of EM yielded a
parametric description of a pathway and a probability of each point
belonging to the pathway. Multiple pathways were classified after
repeated runs with different random starting conditions.
Results/Discussion: Using the global probabilistic methods outlined
above, we have successfully traced white matter fiber pathways. The
global nature of this technique provides several advantages over local
approaches: 1) Fiber tracts can be traced through noisy regions. 2)
Longer fiber tracts can be traced because EM is insensitive to the
accumulation of error found in stepwise solutions. 3) EM provides a
measure of the likelihood of each pathway given the underlying
data. 4) Finally, by fixing both endpoints, pathways connecting two
brain regions can be easily found. We have presented a method for
fiber tracking using EM. This global approach can be tailored and
extended to provide a valuable tool in neuroimaging, separately or in
concert with existing methods.
References
[1] P.J. Basser, et al., MRM 44:625-632, 2000
[2] T.E. Conturo, et al., PNAS 96:10422-10427, 1999
[3] S. Mori, et al., Ann Neurol 45:265-269, 1999
[4] G.J. McLachlan et al., "Mixture Models: Inference and Applications to Clustering", 1988
[5] A. Jepson et al., Proc CVPR 760-761, 1993
Acknowledgments: This work was supported in part by NSF Grant P50-NS-17778 (Inati), NSF CAREER Award IIS-99-83806 (Farid), NSF Grant EIA-98-02068 (Farid), PHS Grant NS-33504 (Grafton).
M.J. Bravo and H. Farid. Top-Down and Bottom-Up Processes for Object Segmentation. Vision Sciences, Sarasota, FL, 2001.
Purpose: Last year, we reported a study of the effects of object
knowledge on object segmentation. Here we examine the relationship
between this top-down segmentation process and a bottom-up
segmentation process based on luminance cues.
Methods: Twelve block-objects were generated by randomly assigning a
color (R,G,B,Y) to 7 blocks and then randomly but neatly stacking the
blocks next to or on top of one another. Half of these objects were
studied by one set of subjects, the other half were studied by a
second set. The block-objects were then neatly stacked next to one
another to create block-scenes. These scenes contained no visual cues
marking the boundaries between objects. Also in each scene, stacked
between two objects, was a 4-, 5- or 6-block target. The target
luminance varied across trials; sometimes it matched the surrounding
blocks, at other times it was noticeably darker. The subject's task
was to count the number of blocks in the target under three
conditions: (1) Top-down: scenes composed of studied objects, target
not defined by luminance cue. (2) Bottom-up: non-studied objects,
target defined by luminance cue. (3) Both: studied objects and
luminance cue.
Results: (1) With only top-down information, subjects were able to
accurately (>90%) segment the target blocks from the other 28 blocks
in the scene. This was a slow process requiring 10-15 sec. (2) With a
target defined by a strong luminance cue, subjects were also accurate
but they responded 2-3 times faster. As the luminance cue was reduced,
accuracy fell while response times increased to the same level as the
top-down condition. (3) When both top-down and weak bottom-up cues
were available, some subjects were able to combine the two strategies:
accuracy was similar to the top-down only case, but response times
were faster.
Conclusion: Object knowledge can be used for object
segmentation. Although this top-down process is slow, it can be
combined effectively with a faster bottom-up process.
H. Farid and E.H. Adelson. Standard Mechanisms Can Explain Grouping in Temporally Synchronous Displays. Investigative Ophthalmology and Visual Science, Fort Lauderdale, FL, 2000.
Purpose. In a recent report, Lee and Blake (Science, 284, 1999) argued
that the human visual system can use temporal microstructure to bind
image regions into unified objects, as has been proposed in some
neural models. Their stimuli were designed in an attempt to remove all
classical form-giving cues, so that timing itself would provide the
only form cue. They found that observers could see synchrony-defined
form, and they posited the existence of special synchrony-sensitive
mechanisms and binding processes. However, we believe that the
filtering properties of early vision can convert the synchrony
information into contrast information, from which standard mechanisms
can extract form.
Methods. Lee and Blake's stimuli consisted of two dense regions of
randomly oriented Gabor elements, where the Gabor phase randomly
shifted forward or backward on each frame. The elements in a central
rectangular region changed in synchrony according to a random
sequence, while the elements in the background region changed
independently. We downloaded several such movies from their web site,
and simulated the effects of temporal lowpass and bandpass filtering.
Results. In the filtered movies, the target region's contrast
fluctuated noticeably above and below that of the background. Consider
the case of temporal lowpass filtering (i.e., simple visual
persistence). If a Gabor element undergoes a run of multiple shifts in
one direction, its effective contrast is low due to the temporal
averaging. Conversely, if it undergoes a run of alternating shifts,
its effective contrast remains fairly high because it is "jittering"
in place. Since the Gabor elements in the target region are
synchronized, the effective contrast of the entire region fluctuates
en masse, and from one moment to the next can be noticeably different
from the background. Similar results hold for bandpass temporal
filters.
Conclusions. Lee and Blake's stimuli were cleverly designed to remove
form cues from single frames and frame pairs. However, when one
considers the full sequence, strong contrast cues can emerge due to
the spatio-temporal filtering present in early vision. These cues may
well explain the perception of form in these displays, thus obviating
the need to posit special grouping mechanisms based on temporal
synchrony.
M.J. Bravo and H. Farid. The Role of Object Recognition in Scene Segmentation. Investigative Ophthalmology and Visual Science, Fort Lauderdale, FL, 2000.
Purpose. How does our familiarity with the objects in our environment
affect the way we organize the visual world? To find out, we tested
how well subjects could segment 3D scenes with ambiguous low-level
grouping cues.
Methods. 3D block objects were generated by a simple computer
algorithm which neatly stacked 6-8 colored blocks next to or on top of
one another. Subjects were trained to recognize four of these block
objects. Block scenes were then created by aligning several block
objects next to each other such that the object boundaries were
completely ambiguous. These block scenes were either composed of the
familiar (learned) or unfamiliar objects. Placed neatly in the scene
was a target object consisting of four blocks. A new target was used
on each trial, and subjects were either shown the target before they
were shown the scene (precue) or after they were shown the scene
(postcue). The subject's task was to determine whether the target was
present in the scene.
Results. In the precue condition, there was no effect of familiarity
on accuracy: subjects could search a scene of unfamiliar objects as
effectively as a scene of familiar objects. In contrast, the postcue
condition showed a large effect of familiarity on accuracy: subjects
rarely reported the presence of the target in scenes composed of
unfamiliar objects, but they performed quite well with scenes of
familiar objects. With scenes of familiar objects, subjects reported
that they first identified the block objects and then directed their
attention to “what was left over”.
Conclusions. Subjects appear to be able to organize a scene into
familiar objects in the absence of low-level grouping cues. It is this
organization that allows them to find a target before they know what
it looks like (postcue). If subjects do know what the target looks
like (precue), then this perceptual organization appears to play no
role in search.
M.J. Bravo and H. Farid. Texture Segmentation in 3D. Investigative Ophthalmology and Visual Science, Fort Lauderdale, FL, 1999.
Purpose: Observers can readily discriminate two textures with
different orientations when both are presented on a planar surface. In
this case the discontinuity in the image coincides with the
discontinuity in the world. But if the surface is folded the image it
produces may contain additional discontinuities. Can observers
differentiate between texture discontinuities that are due only to
changes in surface slant versus those that reflect a change in both
surface slant and surface texture?
Methods: Our stimulus was a rendered three-panel surface. The texture
on the center panel was oriented bandpass noise, the texture on one
side panel was the same (in the world, not the image), while the
texture on the other side panel was rotated by a variable amount. The
stimuli were presented stereoscopically and all observers reported
having a vivid 3D percept. The observer's task was to indicate which
side panel had the rotated texture.
Results: Performance levels varied with the orientation of the surface
texture. Observers performed best with textures that were oriented
horizontally on two of the surfaces but they performed near chance
with some diagonal textures.
Conclusions: Observers generally have difficulty determining whether a
change in image texture is due solely to a change in surface slant or
if it also reflects a change in the intrinsic surface texture. While
humans are quite adept at detecting texture discontinuities in an
image, they are limited in their ability to interpret them.
M.J. Bravo and H. Farid. The Effects of 2D and 3D Smoothness on Motion Segmentation. Investigative Ophthalmology and Visual Science, Fort Lauderdale, FL, 1998.
Purpose: To measure the sensitivity of observers to small, local
perturbations in the flow field produced by the rotation of a rigid
plane and to determine whether performance is based solely on
detecting deviations in the smoothness of the 2D flow field.
Methods: Test stimuli simulated a textured plane rotating about a
vertical or horizontal axis viewed under perspective projection. The
plane's texture consisted of eight patches of dots arranged in a
circle around the fixation point. As the plane rotated, the patches
moved and their shapes changed; however, the shape of one patch, the
target, did not change appropriately. That is, at the center of the
target patch, the velocity was consistent with the plane, but the
spatial derivatives of the velocity were not. The observer's task was
to locate this target patch. Control stimuli were generated by
transforming the test flow field in two ways: either each vector of
the flow field was rotated by 90 degrees or the sign of either the Vx
or Vy component of the flow field was inverted. Both transformations
preserve the 2D smoothness of the flow, but destroy the 3D percept of
a rigid plane.
Results: Subjects were able to locate the target patch in all three
stimuli, but they required a larger deviation with the control stimuli
compared with the test stimuli.
Conclusions: Observers are sensitive to local perturbations in a
smooth 2D flow field, but they appear to be more sensitive to such
perturbations when the flow field corresponds to a rigid 3D plane.
H. Farid, E.P. Simoncelli, M.J. Bravo and P.R. Schrater. Effects of Contrast and Period on Perceived Coherence of Moving Square-Wave Plaids (evidence for a speed bias in the human visual system). Investigative Ophthalmology and Visual Science, Fort Lauderdale, FL, 1995.
Purpose: The coherence of moving square-wave plaids depends on a
number of stimulus parameters: plaid angle (theta), grating speed
(Sg), contrast, and period. Last year at ARVO, we explored the
dependence on the plaid angle and the grating speed. We found that
coherence depended on both of these parameters: this dependence is
best understood via a reparameterization in terms of pattern speed (Sp
= Sg / cos(theta)). When Sp is below a critical speed (roughly 5
deg/sec), the plaid is more likely to be seen as coherent. Above this
critical speed, the plaid has the appearance of two gratings sliding
transparently over each other. This year, we examined the effect of
contrast and component period on the coherence of square-wave plaids.
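The reparameterization above is simple enough to state as a toy calculation, treating the roughly 5 deg/sec critical speed as fixed (though, as the results below show, it shifts with contrast and period):

```python
# Predict coherence from pattern speed Sp = Sg / cos(theta).
import numpy as np

def pattern_speed(grating_speed, theta_deg):
    return grating_speed / np.cos(np.radians(theta_deg))

S_CRIT = 5.0                            # approximate critical speed (deg/sec)
for theta in (30, 50, 70):
    sp = pattern_speed(2.0, theta)      # grating speed of 2 deg/sec
    verdict = "coherent" if sp < S_CRIT else "transparent"
    print(f"theta={theta:2d} deg: Sp={sp:4.1f} deg/sec -> {verdict}")
```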
Methods: Subjects were presented with symmetric square-wave plaids of
varying period and were asked whether the stimuli appeared transparent
or coherent. In a second experiment, subjects judged the coherence of
symmetric square-wave plaids of varying contrast.
Results: The experiments reveal that both contrast and period affect
the perceived coherence of the stimuli: gratings of higher contrast
and gratings of smaller period appear more coherent. For fixed period
and contrast, the effect of varying plaid angle and grating speed is
consistent with our previous experiments: coherence is determined by
the pattern speed relative to a critical speed. However, the current
experiments reveal that this the critical speed depends on the
stimulus contrast and period.
Conclusions: These results suggest that the primary determinant of
square-wave plaid coherence is the pattern speed. This behavior may be
explained by a model for velocity perception with a built-in
preference for slower speeds.
H. Farid and E.P. Simoncelli. The Perception of Transparency in Moving Square-Wave Plaids. Investigative Ophthalmology and Visual Science, Sarasota, FL, 1994.
Purpose: We performed psychophysical experiments to determine the
rules governing the perception of transparency in additive square-wave
plaids.
Methods: Subjects were presented with a randomized sequence of
square-wave plaids of varying grating speed, grating orientation and
plaid intersection luminance. The two gratings were symmetrically
oriented about vertical, with fixed and equal period and
duty-cycle. Presentations lasted two seconds, with a three second
inter-trial interval. Subjects were asked whether the stimulus
appeared to be transparent or coherent.
Results: Our experimental results suggest that the perception of
transparency is primarily governed by the pattern speed and the
grating speed. In particular, when the pattern speed exceeds a certain
critical speed (Sc), the plaid is more likely to be seen as
transparent. Furthermore, when the grating speed exceeds the critical
speed, subjects report being unable to make clear judgements. This
result can be summarized by an idealized diagram of subject response
versus pattern speed (Sp) and grating speed (Sg). Further studies
suggest that varying the luminance of the plaid
intersections (see Stoner et al., 1990) seems to affect the percept
of transparency only when the pattern speed is close to the critical
speed.
Conclusions: The existence of such a critical speed suggests that the
human visual system may have a perceptual preference for slower
speeds. These data and the original data of Stoner et al. are
consistent with a fairly simple energy-based model for velocity
computation in which the representation of velocity is speed-limited.