On 2016 Oct 26, Lydia Maniatis commented:
Natural scenes
The use of “natural scene statistics” is popular in current vision science and directly linked to its conceptual confusion.
In the words of the authors, “Biological systems evolve to exploit the statistical relationships in natural scenes….”
I want to first address the authors’ use of the term “natural scenes” and its implications, and move on to the problem of the validity and implications of the above quote in a subsequent comment.
“Natural scenes” is a very broad category, even broader given that the authors include in it man-made environments. In order to be valid on their own terms, the “statistics” involved – i.e. the correlations between “cues” and physical features of the environment – must hold across very different distances and orientations of the observer to the world, and across very different environments, including scenes involving close-ups of human faces.
Describing 96 photographs, taken at various locations on the University of Texas campus from a height of six feet, with the camera perpendicular to the ground, at distances of 2-200 meters, as a theoretically meaningful, representative sample of “natural scenes” seems rather flaky. If we include human artifacts, then what counts as a “non-natural scene”?
The authors themselves are forced to confront (but choose to sidestep) the sampling problem when they note that “previous studies have reported that surfaces near 0° of slant are exceedingly rare in natural scenes (Yang & Purves, 2003), whereas we find significant probability mass near 0° of slant. That is, we find—consistent with intuition—that it is not uncommon to observe surfaces that have zero or near-zero slant in natural scenes (e.g., frontoparallel surfaces straight ahead).”
(Quite frankly, the authors’ intuition is leading them to confuse cause and effect: we have a behavioral tendency to orient ourselves fronto-parallel to surfaces, rather than obliquely to them, and this behavior biases the “statistics” in precisely this respect.)
They produce a speculative, technical and preliminary rationalization for the discrepancy between their distributions and those of Yang and Purves, leaving clarification to “future research.”
What they don’t consider is the sampling problem. Is there any doubt WHATSOEVER that different “natural scenes” - or different heights, or different angles of view, or different head orientations - will produce very different “prior probabilities”? If this is a problem, it isn’t a technical one.
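The point is easy to demonstrate. Here is a minimal sketch (mine, not the authors’ pipeline; the function name and the bare ground-plane “scene” are my own toy assumptions) showing that even the simplest imaginable scene yields a slant prior that shifts with eye height and with the range of gaze distances sampled:

```python
import numpy as np

# Minimal sketch (mine, not the authors' pipeline): for an idealized bare
# ground plane, the slant of the surface at a fixated point, relative to
# the line of sight, is arctan(d / h) for eye height h and horizontal
# gaze distance d. The "prior" over slant therefore depends directly on
# the viewing geometry used to collect the sample.
def ground_slant_distribution(eye_height_m, d_min_m, d_max_m, n=100_000, seed=0):
    rng = np.random.default_rng(seed)
    d = rng.uniform(d_min_m, d_max_m, n)  # uniformly sampled gaze distances
    return np.degrees(np.arctan(d / eye_height_m))

for h, d_max in [(1.8, 200.0), (1.8, 10.0), (0.5, 10.0)]:
    slants = ground_slant_distribution(h, 2.0, d_max)
    print(f"h = {h} m, d <= {d_max} m: median slant {np.median(slants):.1f} deg")
```

Different eye heights or gaze habits give different “priors”; nothing in the statistics themselves tells us which geometry is the “natural” one.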
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.
On 2016 Oct 26, Lydia Maniatis commented:
Orientation isn’t prior to shape
The idea that slant/tilt are judged prior to shape, which Burge et al. have adopted, suffers from the problems discussed above, and from empirical evidence that contradicts it.
The notion dates back at least to Marr’s 2.5D sketch. As Pizlo (2008) observes, “Marr assumed that figure-ground organization was not necessary for providing the percept of the 3D shape of an object” and asks “How could he get away with this so long after the Gestalt psychologists had revolutionized perception by demonstrating the importance of figure-ground organization?”
Pizlo references experiments using wire objects (e.g. Rock & DiVita, 1987) that have shown that figure-ground organization is key to the shapes that actual 3D objects produce in perception, “even when binocular disparity or other depth cues are available.”
In general, if Marr had been correct in assuming that depth orientations of edges are prior to, and sufficient or necessary for, shape perception, then monocular perception would have no objective content, and 3D pictorial percepts would not occur, unless some special mechanisms had evolved just for this purpose.
In short, tilt-to-shape is not a principled, credible premise.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.
On 2016 Oct 26, Lydia Maniatis commented:
…the right angle, and is linked to each of the other two, while each of the latter sits at the apex of an acute angle, and the two are connected to each other. (What is being grouped are all the points inside the perceptually constructed triangular outline.)
If we now add a single point to the group, so that the set is compatible with a square, our rule will require that the connection between the two latter points be discarded, as each now becomes the apex of a right angle bounding a square, and each is linked to the new point. This is, in fact, what happens in perception. So applying a rule to the three points, and a rule to the single point, locally, would not add up to the square contour in principle; and it does not in perception.
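To make the logic concrete (this is my own toy computation, not anything in the paper; the convex hull is merely a stand-in for a global outline-forming rule): when a fourth point is added, a global rule discards an edge that a purely local, incremental linking rule would have kept.

```python
import numpy as np
from scipy.spatial import ConvexHull

# Toy illustration (mine, not the paper's): a *global* organizing rule,
# here the convex hull standing in for "outline of the grouped points",
# must discard an edge when a fourth point is added. A local, incremental
# rule that merely adds links for the new point cannot produce this revision.
def hull_edges(points):
    hull = ConvexHull(points)
    # each simplex of a 2D hull is an edge, given as a pair of point indices
    return sorted(tuple(sorted(int(i) for i in s)) for s in hull.simplices)

triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
square = np.vstack([triangle, [[1.0, 1.0]]])  # add a single point

print(hull_edges(triangle))  # includes edge (1, 2): the hypotenuse
print(hull_edges(square))    # edge (1, 2) is gone; the outline reorganized
```

The removal of the old edge is forced by the configuration as a whole, not by any local property of the new point.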
Thus, when the authors assert that…
“the visual system starts with local measurements then combines those local measurements into the global representations; the more accurate the local measurements, the more accurate the global representation,”
…they are making an assertion that might sound simple and commonsensical to a layman (which is perhaps why it is so tenacious) but which is not justified for a vision scientist, any more than it is to say that we can build a house of cards one card at a time, or hear the sound of one hand clapping.
The use of “local cues” is a contemporary version of the reductionist approach to perception known as structuralism/introspectionism with its “sensory elements.” This approach couldn’t address the logical problem of organization of the visual field into shaped objects, discussed above, without invoking “experience” in a paradoxical and inconsistent fashion. “Cues” are similarly impotent.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.
On 2016 Oct 25, Lydia Maniatis commented:
As discussed, the best the authors could achieve in predicting measured “3D tilt” from their cues was not very good. Nevertheless, they describe their results as “complex,” “rich,” and “detailed,” in the sense that they feel able to discern, in the generally inaccurate data, some patterns that might be theoretically important or useful. For example, they say performance was often better when the three cues were in agreement. They propose to go on to compare the performance of the model to the performance of humans in psychophysical experiments.

It seems to me that an important step to take prior to psychophysical testing is to test the model on its own terms; that is, to take a second set of “natural” images (perhaps of a different campus, or a national park) and test whether the ad hoc model derived from the first set produces a qualitatively similar dataset. Will the two datasets, in all their richness and complexity, be statistically consistent with each other? How would the authors compare them? If the data do not prove qualitatively repeatable, then psychophysical experiments would seem premature.
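For what it’s worth, one simple form such a comparison could take (a sketch of my own, not the authors’ procedure; the two-sample Kolmogorov-Smirnov test is just one reasonable choice, and all names and numbers below are hypothetical placeholders) is to collect per-patch tilt-estimation errors from each image set and ask whether they could plausibly come from the same distribution:

```python
import numpy as np
from scipy.stats import ks_2samp

# Sketch of one possible consistency check between two scene sets
# (my suggestion, not the authors' procedure). Angular errors are
# wrapped so that the maximum error is 90 degrees, treating tilt as
# having a 180-degree periodicity.
def tilt_error_deg(estimated_deg, measured_deg):
    err = np.abs(estimated_deg - measured_deg) % 180.0
    return np.minimum(err, 180.0 - err)

# hypothetical per-patch estimates and ground truth from two scene sets;
# random placeholders stand in for real model outputs and range data
rng = np.random.default_rng(1)
errs_campus = tilt_error_deg(rng.uniform(0, 180, 5000), rng.uniform(0, 180, 5000))
errs_park   = tilt_error_deg(rng.uniform(0, 180, 5000), rng.uniform(0, 180, 5000))

stat, p = ks_2samp(errs_campus, errs_park)
print(f"KS statistic {stat:.3f}, p = {p:.3f}")  # small p => distributions differ
```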
P.S. The open-endedness of the term “natural scene,” in which the authors include man-made environments, imposes quite a serious replicability burden on the model. (The sampling problem, assuming the inductive approach were viable at all, includes the fact that humans arguably spend more time looking at human faces and bodies than at trees and shrubs.) How many “scenes” should we test? At the very least, one such attempt seems a minimum.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.
On 2016 Oct 22, Lydia Maniatis commented:
Part 1 This paper is all too similar to a large proportion of the vision literature, in which fussy computations thinly veil a hollow theoretical core composed of indefensible hypotheses asserted as fact (and thus implicitly requiring no justification), sometimes supported by citations that only weakly support them, if at all. The casual yet effective (from a publication point of view) fashion in which many authors assert popular (even if long-debunked) fallacies and conjure up other pretexts for what are, in fact, mere measurements without actual or potential theoretical value is well on display here.
What is surprising, in perhaps every case, is the willful empirical agnosia and lack of common sense, on every level (general purpose, method, data analysis), necessary to enable such studies to be conducted and published. A superficial computational complexity adds insult to injury, as many readers may wrongly feel they are not competent to understand and evaluate the validity of a study whose terms and procedures are so layered, opaque, and jargony. However, the math is a distraction.
Unjustified and/or empirically false assumptions and procedures occur, as mentioned, at every level. I discuss some of the more serious ones below (this is the first of a series of comments on this paper).
- Misleading, theoretically and practically untenable definitions of “3D tilt” (and other variables).
The terms slant and tilt naturally refer to a geometrical characteristic of a physical plane or volume (relative to a reference plane). The first sentence of Burge et al’s abstract gives the impression that we are talking about tilt of surfaces: “Estimating 3D surface orientation (slant and tilt) is an important first step toward estimating 3D shape. Here, we examine how three local image cues …should be combined to estimate 3D tilt in natural scenes.” As it turns out, the authors perform a semantic but theoretically pregnant sleight of hand in the switch from the phrase “3D surface orientation (slant and tilt)” to the phrase “3D tilt” (which is also used in the title).
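For reference, the standard decomposition of surface orientation into slant and tilt, in my own notation (a summary of the textbook convention, not a quotation from the paper), is:

```latex
% Standard slant/tilt decomposition (my summary of the usual convention,
% not a quotation from the paper). Take the line of sight along the
% z-axis and let n = (n_x, n_y, n_z) be the unit surface normal,
% oriented toward the viewer:
\[
  \underbrace{\sigma = \cos^{-1}(n_z)}_{\text{slant,}\; 0^\circ \le \sigma \le 90^\circ}
  \qquad\qquad
  \underbrace{\tau = \operatorname{atan2}(n_y,\, n_x)}_{\text{tilt, an image-plane direction}}
\]
% Slant says how far the surface is rotated away from frontoparallel;
% tilt says in which image-plane direction it is rotated (equivalently,
% the orientation of the projected depth gradient).
```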
The obvious inference from the context is that “3D tilt” is mere shorthand for “3D surface orientation (slant and tilt).” But it is not. In fact, as the authors finally reveal on p. 3 of their introduction, their procedure for estimating what they call “3D tilt” does not allow them to relate their results to the tilt of surfaces: “Our analysis does not distinguish between the tilt of surfaces belonging to individual objects and the tilt (i.e. orientation [which earlier was equated with “slant and tilt”]) of depth discontinuities…We therefore emphasize that our analysis is best thought of as 3D tilt rather than 3D surface tilt estimation.”
“3D tilt” is, in effect, a conceptually incoherent term made up to coincide with the (unrationalised) procedure used to arrive at the measures given this label. I find the description of the procedure opaque, but as far as I am able to understand it, small patches of images are selected and processed to produce “3D tilt” values based on range values collected by a range finder within that region of space. The readings within the region can come from one, two, three, four, or any number of different surfaces or objects; the method does not discriminate among these cases. In other words, these local “3D tilt” values have no necessary relationship to the tilt of surfaces (let alone the tilt of objects, which is more relevant (to be discussed) and which the authors don’t address even nominally). We are talking about a paradoxically abstract, disembodied definition of “3D tilt.”

As a reader, being asked to “think” of the measurements as representing “3D tilt” rather than “3D surface tilt” doesn’t help me understand how this term relates, in any useful or principled way, either to the actual physical structure of the world or to the visual process that represents that world. The idea that measuring this kind of “tilt” could be useful for forming a representation of the physical environment, and that the visual system might have evolved a way to estimate these intrinsically random and incidental values, seems invalid on its face; and the authors make no case for it.
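As best I can reconstruct it, the per-patch measurement amounts to something like the following sketch (my guess at the gist, with hypothetical function and variable names; the paper’s actual pipeline surely differs in its details). The point of the second test case is that the procedure happily assigns a “tilt” to a patch straddling a depth discontinuity, where no single surface exists:

```python
import numpy as np

# Sketch (my reconstruction, not the authors' code): assign a "3D tilt"
# to an image patch as the orientation of the plane that best fits the
# range (distance) samples in the patch. The fit is indifferent to
# whether the samples come from one surface or from several.
def patch_tilt_deg(range_patch):
    """range_patch: 2D array of range values (meters) within one patch."""
    h, w = range_patch.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    # least-squares plane fit: range ~ a*x + b*y + c
    A = np.column_stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    (a, b, _), *_ = np.linalg.lstsq(A, range_patch.ravel(), rcond=None)
    # tilt = orientation of the range gradient (a, b) in the image plane
    return np.degrees(np.arctan2(b, a)) % 360.0

# a single slanted surface: smooth range gradient, well-defined tilt
single = np.fromfunction(lambda y, x: 10.0 + 0.05 * y, (32, 32))
# a depth discontinuity: two flat surfaces at different distances
split = np.where(np.arange(32)[None, :] < 16, 5.0, 20.0) * np.ones((32, 32))

print(patch_tilt_deg(single))  # ~90 deg: gradient points along y
print(patch_tilt_deg(split))   # also yields a "tilt", though no single surface exists
```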
They then proceed to measure three other home-cooked variables, in order to search for possible correlations between these and “3D tilt.” These variables are also chosen arbitrarily, i.e. in the absence of a theoretical rationale, based on “simplicity, historical precedence, and plausibility given known processing in the early visual system” (p. 2). Simplicity is not, by itself, a rationale; it has to have a rational basis. At first glance, at least the third of these reasons would seem to constitute a shadow of a theoretical rationale, but it rests on sparse, premature, and over-interpreted physiological data, primarily concerning V1 neuron activity. Furthermore, the authors’ definitions of their three putative cues (disparity gradient, luminance gradient, texture gradient) are very particular, assumption-laden, paradoxical, and unrationalised.
For example, the measure of “texture orientation” involves the assumption that textures are generally composed of “isotropic [i.e. circular] elements” (p. 8). This assumption is unwarranted to begin with. Given, furthermore, that the authors’ measures at no point involve parsing the “locations” measured into figures and grounds, it is difficult to understand what they can mean by the term “texture element.” Like tilt, reference to an “isotropic texture element” implies a bounded, discrete area of space with certain geometric characteristics and relationships. It makes no sense to apply it to an arbitrary set of pixel luminances.
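For completeness, here is what the isotropy assumption is supposed to buy, in my own summary of the standard argument (not the paper’s derivation):

```latex
% Why isotropy is assumed (standard argument, my paraphrase): under
% locally orthographic projection, a circular element on a surface
% with slant sigma and tilt tau projects to an ellipse with
\[
  \frac{\text{minor axis}}{\text{major axis}} = \cos\sigma ,
  \qquad
  \text{minor-axis direction} = \tau .
\]
% So measured element anisotropy is read as slant, and its image
% orientation as tilt -- which is valid only if the elements were
% circular (isotropic) to begin with.
```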
Also, as in the case of “3D tilt,” the definition of “texture gradient” is both arbitrary and superficially complex: “we define [the dominant orientation of the image texture] in the Fourier domain. First, we subtract the mean luminance and multiply by (window with) the Gaussian kernel above centered on (x, y). We then take the Fourier transform of the windowed image and compute the amplitude spectrum. Finally, we use singular value decomposition ….” One, two, three…. but WHY did you make these choices? Simplicity, historical precedence, Hubel and Wiesel…?
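Mechanically, the quoted recipe seems to amount to something like the following (again my own sketch and helper names; where the quote is ambiguous, notably in exactly how the singular value decomposition is applied, I have made one plausible choice: an SVD of the amplitude-weighted second-moment matrix of the frequency coordinates):

```python
import numpy as np

# Sketch of the quoted texture-orientation recipe as I read it (my code,
# my interpretation; the SVD step especially is one plausible choice):
# window, Fourier-transform, take the amplitude spectrum, then find the
# dominant orientation of the spectrum's energy.
def dominant_texture_orientation_deg(patch):
    """patch: square 2D array of luminances."""
    n = patch.shape[0]
    # 1. subtract mean luminance and apply a Gaussian window
    y, x = np.mgrid[:n, :n] - (n - 1) / 2.0
    gauss = np.exp(-(x**2 + y**2) / (2 * (n / 6.0) ** 2))
    windowed = (patch - patch.mean()) * gauss
    # 2. amplitude spectrum of the windowed patch
    amp = np.abs(np.fft.fftshift(np.fft.fft2(windowed)))
    # 3. amplitude-weighted second-moment matrix of frequency coordinates
    fy, fx = np.mgrid[:n, :n] - n // 2
    w = amp.ravel()
    F = np.column_stack([fx.ravel(), fy.ravel()])
    M = (F * w[:, None]).T @ F
    # 4. SVD: the leading singular vector gives the dominant spectral axis
    u, _, _ = np.linalg.svd(M)
    return np.degrees(np.arctan2(u[1, 0], u[0, 0])) % 180.0

# usage: vertical stripes concentrate spectral energy along the fx axis
stripes = np.sin(2 * np.pi * np.arange(64)[None, :] / 8.0) * np.ones((64, 1))
print(dominant_texture_orientation_deg(stripes))  # ~0 deg
```

Note that nothing in this pipeline, as quoted, ever identifies a “texture element”; it operates on whatever pixel luminances fall in the window.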
If, serendipitously, the authors’ choices of things to measure and compare had led to high correlations, they might have been justified in sharing them. But as it turns out, not surprisingly, the correlations between “cues” and “tilt” are “typically not very accurate.” Certain (unpredicted) particularities of the data, to which the authors speculatively attribute theoretical value (incidentally undermining one of their major premises), will be discussed later.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.