Reviewer #3:
In the paper titled "Auditory detection is modulated by theta phase of silent lip movements", the authors investigate visual entrainment to lip movements using behavior (Exp. 1) and non-invasive physiology (EEG; Exp. 2).
In the first experiment, participants detect a brief tone embedded in noise while viewing a silent movie clip. Tones are timed with respect to the phase of the theta rhythm prevalent in the lip movement trajectory (and its relation to the original audio track). Each trial includes 0, 1 or 2 tones, and subjects give a speeded response when a tone is detected. Tones are presented during the first half of the clip, the second half, both, or neither. This timing parameter is designed to probe whether entrainment to visual lip movement increases as the clip evolves. In the second experiment, the findings of Exp. 1 are complemented by an analysis of visual entrainment and its impact on auditory sources, using EEG and source estimation on data obtained while observers passively viewed the same silent movie clips.
The paper is well written, the premise is clear, and the findings are interesting and timely. In what follows I outline questions and concerns that bear on the validity of the interpretation of the findings. They span the experimental and stimulus design as well as the analysis choices made.
1) The behavioral procedure suggests that the tones were pseudo-randomly positioned w/ respect to the quantified theta phase of the lip movement. It would be interesting to understand whether any care was taken to exhaustively sample the different phases of the theta cycle in the lip movement. It might therefore be important to demonstrate that phases were equivalently sampled in the first-half and second-half trials and over the different clips. An inset in figure 1 would be a good spot to show the descriptive statistics of target positioning (as a function of phase).
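To make the request in point 1 concrete, the following is a minimal sketch of such a check, not the authors' pipeline. It assumes the target phases are available as a list of angles in radians; the function names `phase_counts` and `rayleigh_test` are hypothetical, and the p-value uses Zar's approximation to the Rayleigh test.

```python
import math

def phase_counts(phases, n_bins=8):
    """Count targets per phase bin (phases in radians, any range)."""
    counts = [0] * n_bins
    width = 2 * math.pi / n_bins
    for p in phases:
        # Wrap to [0, 2*pi) and guard against a float landing exactly on 2*pi.
        counts[int((p % (2 * math.pi)) / width) % n_bins] += 1
    return counts

def rayleigh_test(phases):
    """Rayleigh test for non-uniformity of circular data.

    Returns (R, p): the mean resultant length and an approximate p-value
    (Zar's approximation). A small p suggests the phases were NOT sampled
    uniformly around the cycle.
    """
    n = len(phases)
    c = sum(math.cos(p) for p in phases)
    s = sum(math.sin(p) for p in phases)
    rn = math.hypot(c, s)  # length of the summed unit vectors, n * R
    p_value = math.exp(math.sqrt(1 + 4 * n + 4 * (n * n - rn * rn)) - (1 + 2 * n))
    return rn / n, min(p_value, 1.0)

# Illustrative input only: 40 target phases spread evenly over the cycle.
example_phases = [2 * math.pi * k / 40 for k in range(40)]
print(phase_counts(example_phases))
print(rayleigh_test(example_phases))
```

Running this separately on first-half and second-half targets (and per clip) would give the descriptive statistics suggested for the inset. A direct comparison of the two halves would need a two-sample circular test, which is not shown here.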
2) Second, and somewhat related, wouldn't it make more sense to quantify accuracy based on phase bins? This way no division into subpopulations would be required, since each individual could be aligned to their best phase. The methods leave it somewhat unclear whether the stimulus design allows this (i.e., were enough phases sampled in the stimulus/tone timing; see previous point).
In addition, each subject's mean phase of correctly detected targets provides little insight into the periodic nature of performance. Analyzing whether the pattern of responses is periodically modulated over phase would provide richer, more nuanced evidence for the claims.
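As an illustration of the analysis suggested in point 2, here is a minimal sketch under stated assumptions: per-trial target phases (radians) and hit/miss outcomes (1/0) are available for each subject. The function names `hit_rate_by_phase` and `first_harmonic` are hypothetical. The one-cycle Fourier component of the binned hit rates is used here as one possible measure of periodic modulation; it is not necessarily the measure the authors would choose.

```python
import math

def hit_rate_by_phase(phases, hits, n_bins=8):
    """Hit rate per phase bin; phases in radians, hits coded as 1 (hit) or 0 (miss)."""
    counts = [0] * n_bins
    n_hits = [0] * n_bins
    width = 2 * math.pi / n_bins
    for phase, hit in zip(phases, hits):
        b = int((phase % (2 * math.pi)) / width) % n_bins
        counts[b] += 1
        n_hits[b] += hit
    # Bins without any target are reported as NaN rather than 0.
    return [h / c if c else float("nan") for h, c in zip(n_hits, counts)]

def first_harmonic(rates):
    """Amplitude and preferred phase of the one-cycle component of binned rates.

    Assumes equally spaced bins with no missing (NaN) values.
    """
    n = len(rates)
    centers = [(k + 0.5) * 2 * math.pi / n for k in range(n)]
    c = sum(r * math.cos(t) for r, t in zip(rates, centers))
    s = sum(r * math.sin(t) for r, t in zip(rates, centers))
    amplitude = 2 * math.hypot(c, s) / n
    preferred_phase = math.atan2(s, c) % (2 * math.pi)
    return amplitude, preferred_phase
```

The amplitude could be compared against a null distribution obtained by shuffling hit/miss labels across trials within subject, which would test for periodic modulation without assuming a common preferred phase across subjects.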
3) It would be important and interesting to learn whether the first and second parts of the trial have the same MI profile at theta b/w lip movement and audio track. Currently, the characterization of MI was done on the whole movie clips. This is crucial for the interpretation of both Experiment 1 and Experiment 2.
4) The distinction b/w the first and second half, taken to indicate that entrainment needs time to build up, is somewhat overstated in the context of this paper, seeing that the literature suggests that entrainment is fully established by 0.5 s (among others, the authors themselves say so in the TINs piece). Other processes, such as calibration to a given speaker, might take longer, and those might justify (or account for?) the result showing that early vs. late targets differ in the degree to which the phase of the lip movement affects performance.
Important details over the stimuli need to be clarified:
5) Did every clip introduce a new speaker to the subject? If so, does time on clip also amount to degree of familiarity with the speaker?
6) Did each clip have the same degree of MI b/w audio and lip movement or were there better (more pronounced) lip clips than others when considering their link to the audio? Would it make sense to add these measures as covariates in the analysis?
7) Is the same target timing used for the same clip for all subjects? Or are the tones truly randomly placed and matched onto clips such that a given clip could appear w/ tones at different times for different subjects?
At the risk of somewhat repeating point #2 above -- within the analysis the following should be considered:
8) The authors establish that in second-half performance there are, in fact, two subpopulations in the sample. Wouldn't this post hoc grouping factor, which isn't obviously motivated, be better described by properly delineating performance as a function of phase? I can readily understand that the authors might not have a clear hypothesis about which phase should be best for performing on an irrelevant tone probe. Nonetheless, if a periodic process is entraining performance, then once a best phase is identified, adjacent phase bins should show this circular relationship. This would allow a direct quantification of ALL data together after aligning performance to the best phase bin, per subject.
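The alignment described in point 8 could look like the sketch below. It assumes each subject's hit rate has already been computed in the same number of equally spaced phase bins, with no empty bins; the function names `align_to_best_bin` and `group_aligned_profile` are hypothetical.

```python
def align_to_best_bin(rates):
    """Circularly shift a subject's binned hit rates so the best bin is at index 0."""
    best = max(range(len(rates)), key=lambda k: rates[k])
    return rates[best:] + rates[:best]

def group_aligned_profile(subject_rates):
    """Average the best-bin-aligned profiles across subjects, bin by bin."""
    aligned = [align_to_best_bin(r) for r in subject_rates]
    n_bins = len(aligned[0])
    return [sum(a[k] for a in aligned) / len(aligned) for k in range(n_bins)]

# Illustrative numbers only: two subjects whose best phases differ.
subjects = [
    [0.4, 0.8, 0.6, 0.3],
    [0.6, 0.3, 0.4, 0.8],
]
print(group_aligned_profile(subjects))
```

Because the aligned bin at index 0 is the maximum by construction, it should be left out of any statistical test. The evidence for a periodic process lies in the remaining bins, for example in whether the bins adjacent to the best one exceed the bin opposite to it.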
Finally, the following points pertain mostly to the contextualization of this work and the discussion:
9) While the authors discuss at least two mechanisms for how entrainment effects grow by the second part of the clip, it would be nice to relate the concrete reading of this effect to cognitive processes that may evolve within these timescales. In other words, learning that tracking takes 0.5 s, or that visual inputs to frontal cortex take a given amount of time to exert an impact on auditory sensory regions, is just another description of the finding. What might these time scales buy me as a speaker and as a listener? What processes might be reflected by arriving at these states of synchrony and top-down control for speech comprehension?
10) The post hoc description of the subpopulations' preferred phases is interesting and could relate to the entrainment literature (from Spaak 2014 in vision through Hickok 2015 in audition, and others). Might the authors speculate on what part of speech is characterized by one phase vs. another?
11) An additional point on the authors' conjecture in the discussion of this topic: recent work by Assaneo et al. (Poeppel as PI, Nat Neurosci, 2019) shows bi-modal behavior in a spontaneous synchronization task (motor to auditory), which was found to be related to morphological differences in frontal-to-auditory white matter pathways, functional differences AND better learning in a statistical learning paradigm. How do the two sets of bi-modal populations interact? The authors' discussion of the motor cortex suggests they would.
Methods section:
The paper is by and large well written. The exception is the methods section, which currently does not comply with the best practices that would make the work reproducible by others.