Reviewer #3 (Public Review):
Summary:
This is a tools paper that describes an open-source software package, BonVision, which aims to provide a non-programmer-friendly interface for configuring and presenting 2D as well as 3D visual stimuli to experimental subjects. A major design emphasis of the software is to allow users to define visual stimuli at a high level, independent of the physical rendering devices, which can range from monitors to curved projection surfaces, binocular displays, and augmented reality setups where the position of the subject relative to the display surfaces can vary and needs to be adjusted for. The package provides a number of semi-automated software calibration tools that significantly simplify the job of setting up different rigs to faithfully present the intended stimuli, and it is capable of running at hardware-limited speeds comparable to, and in some conditions better than, existing packages such as Psychtoolbox and PsychoPy.
Major comments:
While much of the classic literature on visual systems has utilized egocentrically defined ("2D") stimuli, it seems logical to project that present and future research will extend not only to 3D objects but also to 3D environments in which subjects can control their virtual locations and viewing perspectives. A single software package that easily supports both modalities can therefore be of particular interest to neuroscientists who wish to study brain function in 3D viewing conditions while also referencing findings to canonical 2D stimulus responses. Although other software packages exist that are specialized for each of the individual functionalities of BonVision, I think that the unifying nature of the package is appealing because it reduces user training and experimental setup time, especially with the semi-automated calibration tools provided as part of the package. The provision of documentation, demo experiments, and performance benchmarks is highly welcome, and one would hope that, with community interest and contributions, this could make BonVision very approachable for new users.
Given that one function of this manuscript is to describe the software in enough detail for users to judge whether it would be suited to their purposes, I feel that the writing should be fleshed out to be more precise and detailed about what the algorithms and functionalities are. This includes not shying away from stating limitations -- which, as I see it, is simply the reality of no tool being universal, but for that very reason is one of the most important pieces of information to convey to potential users. My following comments point out various directions in which I think the manuscript can be improved.
The biggest point of confusion for me was whether the 3D environment functionality of BonVision is the same as that provided by virtual spatial environment packages such as ViRMEn and gaming engines such as Unity. In those packages, the virtual environment is specified by geometrically laying out the shape of the traversable world and the locations of objects in it. The subject then essentially controls an avatar in this virtual world that can move and turn, and the software engine computes the effects of this movement (i.e. without any additional user code) and then renders what the avatar should see onto a display device. I cannot figure out whether this is how BonVision also works. My confusion could probably be cured by some additional description of what exactly the user has to do to specify the placement of 3D objects. From the text on cube mapping (lines 43 and onwards), I guessed that perhaps objects should be specified by their vectorial displacement from the subject, but I have very little confidence in my guess and also cannot locate this information in either the Methods or the software website. For Figure 5F it is mentioned that BonVision can be used to implement running down a virtual corridor for a mouse, so a description of what the user has to do to implement this, and what is done by the software package, may address my confusion; I sketch the two alternatives I have in mind below. If BonVision is indeed not a full 3D spatial engine, it would be important to mention these design/intent differences in the introduction as well as in Supplementary Table 1.
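To make the distinction concrete, here is a hypothetical Python-style sketch of my own (not anything taken from BonVision or its documentation) of the two ways a "virtual corridor" experiment could be specified:

```python
# Hypothetical sketch (not BonVision's actual API): two ways a virtual corridor
# could be specified; which one BonVision implements is what I cannot tell.
import numpy as np

# (a) Full spatial engine (ViRMEn / Unity style): the world is laid out once in
#     world coordinates; the engine integrates the subject's self-motion into an
#     avatar pose and renders the avatar's viewpoint without further user code.
corridor = {"width": 0.1, "length": 10.0}           # world layout (made-up numbers)
avatar_position = np.zeros(3)

def spatial_engine_frame(treadmill_displacement):
    avatar_position[2] += treadmill_displacement    # engine updates the avatar
    return avatar_position                          # camera pose handed to the renderer

# (b) Subject-centred specification: each object is defined by its vector
#     displacement from the subject, and the user's workflow must update those
#     displacements on every frame.
object_offsets = [np.array([0.0, 0.0, 2.0])]        # e.g. a grating 2 m ahead

def subject_centred_frame(treadmill_displacement):
    for offset in object_offsets:
        offset[2] -= treadmill_displacement         # user-maintained bookkeeping
    return object_offsets
```

Which of (a) or (b), or some hybrid, BonVision actually implements is precisely what I could not determine from the manuscript.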
More generally, it would be useful to provide an overview of what the closed-loop rendering procedure is, perhaps including a Figure (different from Supplementary Figure 2, which seems to concern the workflow rather than the structure of the software platform). For example, I imagine that after the user-specified texture/object resources have been loaded, some engine runs a continual loop in which it somehow decides the current scene. As a user, I would want to know what this loop is and how I can control it: can I change the presented stimuli as a function of time, does this time-dependence have to be prespecified before runtime, can I add code that triggers events based on the specific history of what the subject has done in the experiment, and so forth (a generic sketch of the kind of loop I have in mind follows below). The ability to log experiment events, including any viewpoint changes in 3D scenes, is also critical, and most experimenters who intend to use the package for neurophysiological recordings would want to know how the visual display information can be synchronized with the clocks of their recording instruments. In sum, I would like to see a section added to the text that provides a high-level summary of how the package runs an experiment loop, explains which parts are customizable and which are not (without directly editing the open source code), and guides the user through the available experiment control and data logging options.
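For example, the structure I would like to see described explicitly is something along the lines of the following generic sketch (Python pseudocode of my own; none of these names refer to actual BonVision components):

```python
# Generic closed-loop sketch, written only to make my questions concrete;
# none of these names come from BonVision.
import time

def run_experiment(render, get_subject_input, update_scene, log, duration_s=60.0):
    t0 = time.perf_counter()
    history = []
    while (t := time.perf_counter() - t0) < duration_s:
        pose = get_subject_input()              # e.g. treadmill or camera tracking
        history.append((t, pose))
        scene = update_scene(t, pose, history)  # time- and history-dependent stimuli
        flip_time = render(scene)               # when the frame actually appeared
        log({"t": t, "pose": pose, "flip_time": flip_time})
```

Which of these steps I can customize without editing the source code, and how the logged flip times can be aligned with the clock of an electrophysiology or imaging system, are the concrete questions I would like answered.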
Having some experience myself with the tedium (and human-dependent quality) of having to adjust the experimental hardware or write custom software to calibrate display devices, I found the semi-automated calibration capabilities of BonVision to be a strong selling point. However, I did not manage to really understand what these procedures are from the text and Figure 2C-F. In particular, I'm not sure what I have to do as a user to provide the information required by the calibration software (surely it is not the pieces of paper in Fig. 2C and 2E..?). If, for example, the subject is a mouse head-fixed on a ball as in Figure 1E, do I have to somehow take a photo from the vantage point of the mouse's head to provide to the system? What about the augmented reality rig where the subject is free to move? How can the calibration tool work with a single 2D snapshot of the rig when projection surfaces can be arbitrarily curved (e.g. toroidal rather than spherical, or conical, or even more distorted for whatever reason)? Do head-mounted displays require calibration, and if so, how is this done? If the authors feel all of this is too technical to include in the main text, then the information can be provided in the Methods. I would, however, vote for this as a major and important aspect of the software that should be given air time.
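To illustrate where my uncertainty lies: for a single flat monitor I can imagine a calibration along the lines of the following sketch (illustrative only, using OpenCV; I do not know whether BonVision's tools work anything like this), but it is unclear to me how such an approach would extend to arbitrarily curved surfaces or to a freely moving subject.

```python
# Illustrative sketch only (not BonVision's procedure): map display pixels to
# their apparent positions in a photograph taken from the subject's vantage point.
import numpy as np
import cv2

# Pixel coordinates of four fiducial markers as drawn on the display...
display_pts = np.array([[0, 0], [1920, 0], [1920, 1080], [0, 1080]], dtype=np.float32)
# ...and where those markers appear in the photo (made-up numbers).
photo_pts = np.array([[210, 180], [1650, 140], [1700, 960], [170, 990]], dtype=np.float32)

H, _ = cv2.findHomography(display_pts, photo_pts)  # valid for a planar screen only
```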
As the hardware-limited speed of BonVision is also an important feature, I wonder whether the same ~2 frame latency also holds for the augmented reality rendering, where the software has to run pose tracking (DeepLabCut) and compute whole-scene changes before the next render. It would be beneficial to provide more information about the ways in which BonVision can be stressed before it starts dropping frames, which may well differ across the display options (2D vs. 3D, and the various display device types). Does the software maintain the user-specified timing of events as strictly as possible by dropping frames, or can it run into situations where lags accumulate? This type of technical information seems critical for experiments in which stimulus timing has to be carefully controlled, and regardless, one would usually want the actual display times logged, as mentioned above. Some discussion of how a user might keep track of actual lags in their own setup would be appreciated; a simple check of the kind I have in mind is sketched below.
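For instance (a generic numpy sketch with made-up numbers, not a BonVision feature): compare the stimulus-onset times logged by the software against photodiode onsets recorded on the acquisition system.

```python
# Generic check with hypothetical numbers (not a BonVision feature): compare
# onsets logged by the stimulus software against photodiode onsets recorded on
# the neurophysiology acquisition system.
import numpy as np

commanded_onsets = np.array([1.000, 2.000, 3.000])   # s, software log (hypothetical)
photodiode_onsets = np.array([1.017, 2.016, 3.034])  # s, measured (hypothetical)

lags = photodiode_onsets - commanded_onsets
print(f"mean lag {lags.mean() * 1e3:.1f} ms, worst {lags.max() * 1e3:.1f} ms")
# At 60 Hz, lags that grow over a session rather than staying near ~2 frames
# (about 33 ms) would indicate accumulation.
```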
On the augmented reality mode, I am a little puzzled by the layout of Figure 3 and the accompanying video, and I wonder whether this is the best way to showcase this functionality. In particular, I'm not entirely sure what the main scene display is, although it looks like some kind of software rendering, perhaps of what the inside of an actual rig might look like viewed from the top. One way to make this Figure and Movie easier to grasp would be to have the scene display show the panels that would actually be rendered on each physical wall of the experiment box. The inset image of the rig should then have the projection turned on, so that the reader can judge what an actual experiment looks like. Right now, the walls of the rig in the inset of the movie seem, for some reason, to remain blank except for some lighting shadows; I don't know whether this is intentional.