681 Matching Annotations
  1. Dec 2023
    1. Other ultrasound methodologies, employing higher frequencies, are more severely impacted by signal attenuation, both in the skull and in soft tissue

      but maybe shallow imaging could work? hmm

    2. Time-of-flight tomography uses a short-wavelength approximation, basing its analysis on the simplified physics of ray theory in which the effects of transmission through a heterogeneous medium are represented by a simple change in travel time. For a finite-wavelength wave transmitted through a medium that is heterogeneous on many scales, such delay times are only sensitive to the properties of the medium averaged over the dimensions of the first Fresnel zone [11].

      hmm why?

  2. Nov 2023
    1. Upon receiving observations y, solving a Bayesian inverse problem involves sampling the conditional distribution of x given y (Tarantola, 2005)

      I guess adding the noise also helps in making the problem well-posed, i.e. having a solution
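
      A toy numerical check of this intuition (my own sketch; the model, sizes and noise level are made up, not from the paper):

```python
import numpy as np

# For a linear model y = A x + eps with Gaussian prior x ~ N(0, I) and noise
# eps ~ N(0, s^2 I), the posterior mean is the ridge solution
# (A^T A + s^2 I)^{-1} A^T y. It is unique even when A^T A is singular,
# i.e. the noisy Bayesian problem is well-posed where plain least squares is not.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))   # underdetermined: least squares has no unique solution
x_true = rng.normal(size=5)
s = 0.1
y = A @ x_true + s * rng.normal(size=3)

posterior_mean = np.linalg.solve(A.T @ A + s**2 * np.eye(5), A.T @ y)
print(posterior_mean.shape)   # -> (5,)
```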

    1. For the transmission inversion we use a model-based large-scale minimization at a particular angular frequency \(\omega_j\) based on an \(L_2\) minimization.

      I think the Fourier transform they apply is in the spatial domain right?

  3. Oct 2023
    1. At test time, \(G_r\) is the same as \(F(P^*)\)

      I think this should be \(G_p\) ?

  4. Jul 2023
    1. Power Doppler intensity for each implant scenario over N=15 acquisitions. Stars indicate statistical differences (paired-sample t-test) between each scenario and the No implant scenario (*: p < 0.05, **: p < 0.01, ***: p < 0.001). (F) SNR attenuation for each implant scenario as a function of depth (N=15 acquisitions). [bioRxiv preprint, this version posted June 15, 2023; doi: https://doi.org/10.1101/2023.06.14.544094]

      would have been nice to compare with skull tissue, to get an idea of how much of an improvement this is!

  5. Oct 2022
    1. A good first guess is one that makes your data decorrelated and whitened.

      I guess the intuition here can be obtained by imagining doing gradient descent on a space where the manifold of images (the data) is very far from a ball (decorrelated and whitened): gradient descent is likely to move out of it
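
      The ball intuition can be made concrete; a minimal ZCA-whitening sketch (my own illustration, with a made-up covariance):

```python
import numpy as np

# Whitening: decorrelate features and scale them to unit variance, so the data
# cloud becomes an isotropic ball -- the geometry on which gradient descent
# behaves best.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ np.array([[2.0, 0.0, 0.0],
                                           [1.0, 0.5, 0.0],
                                           [0.0, 0.0, 0.1]])
X = X - X.mean(axis=0)

cov = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)
W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T   # ZCA whitening matrix
X_white = X @ W

cov_white = X_white.T @ X_white / len(X_white)
print(np.round(cov_white, 3))   # ~ identity matrix
```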

  6. Sep 2022
    1. given legal tractability

      how are u going to deal with this?

    2. A clinical mind enhancement research group could conduct healthy-volunteer clinical trials on effects and safety of MEIs to gather sufficient data to push almost deployable MEIs to deployable level.

      Yes we need more of this!

  7. Aug 2022
    1. you lose all the value of using AI if you won’t use unaligned systems

      isn't this just a matter of weighting values?

    2. having an AI that is competent or intelligent

      which, is something I want

    3. And if what I want is not measurable

      But it has to be, at some level? Or else how can we even objectively talk about it?

    4. I might think it's really important for that new system to not mess up. I think this is a really important problem, but it's worth separating it from the problem of building AI that's trying to do what we want it to do.

      Why do you think it is important separating them?

    1. You could stay on a diet, spend the optimal amount of time learning which activities will achieve your goals, and then follow through on an optimal plan, no matter how tedious it was to execute.

      WHY do we equate rationality with goals not changing

    2. biological cognitive enhancement (Sandberg 2011)

      cognitive enhancement may not be biological


    1. then “slightly increase their degree of influence-seeking behavior” would be just as good a modification as “slightly improve their conception of the goal.”

      but the former would take more computation to connect to the goal? but it could also be simpler hmm? hmm

    2. in the broader landscape of “possible cognitive policies.”

      with a metric related to goals of those policies?

    3. Law enforcement will drive down complaints and increase reported sense of security. Eventually this will be driven by creating a false sense of security, hiding information about law enforcement failures, suppressing complaints, and coercing and manipulating citizens.


    4. Investors will “own” shares of increasingly profitable corporations, and will sometimes try to use their profits to affect the world. Eventually instead of actually having an impact they will be surrounded by advisors who manipulate them into thinking they’ve had an impact.


    5. We will try to harness this power by constructing proxies for what we care about, but over time those proxies will come apart:

      but why cant the proxies change?

    6. that can’t be done by trial and error

      hmm why not? if we have a relatively clear picture of what we want, or at least can tell stuff we want from stuff we dont, why cant we just try stuff until we find whether we want something?

  8. Jul 2022
    1. In particular, this is the problem of getting your AI to try to do the right thing, not the problem of figuring out which thing is right.

      what is the difference really?

    1. As an example to illustrate the base/mesa distinction in a different domain, and the possibility of misalignment between the base and mesa- objectives, consider biological evolution.

      wouldnt say that evolution is an optimizer under their definition. It displays behavior that can be approximated by an optimization process, but it's not really one according to their definition (search + selecting candidates with best objective) or even if u could describe evolution as such a process for a fixed environment, once u take into account that the environment is changed by the evolutionary process, i dont think it can be cleanly described as an optimization process any more (again under their definition of one)

    2. Arguably, any possible system has a behavioral objective—including bricks and bottle caps. However, for non-optimizers, the appropriate behavioral objective might just be “1 if the actions taken are those that are in fact taken by the system and 0 otherwise,”[6] and it is thus neither interesting nor useful to know that the system is acting to optimize this objective. For example, the behavioral objective “optimized” by a bottle cap is the objective of behaving like a bottle cap.[7] However, if the system is an optimizer, then it is more likely that it will have a meaningful behavioral objective. That is, to the degree that a mesa-optimizer’s output is systematically selected to optimize its mesa-objective, its behavior may look more like coherent attempts to move the world in a particular direction.[8]

      right but things which arent optimizers may be equally undesirable in terms of their behaviour?

    3. When a base optimizer generates a mesa-optimizer, safety properties of the base optimizer's objective may not transfer to the mesa-optimizer.

      thus making the base optimizer itself not safe?

      But yeah some properties may not transfer

    4. Whether a system is an optimizer is a property of its internal structure—what algorithm it is physically implementing—and not a property of its input-output behavior. Importantly, the fact that a system’s behavior results in some objective being maximized does not make the system an optimizer. For example, a bottle cap causes water to be held inside the bottle, but it is not optimizing for that outcome since it is not running any sort of optimization algorithm.(1) Rather, bottle caps have been optimized to keep water in place. The optimizer in this situation is the human that designed the bottle cap by searching through the space of possible tools for one to successfully hold water in a bottle. Similarly, image-classifying neural networks are optimized to achieve low error in their classifications, but are not, in general, themselves performing optimization.

      Hmm, I mean one could be pedantic, and say that a bottle cap is searching over atomic configurations using thermal fluctuations and finding a low energy configuration that gives it its shape..

    1. The VR devices of 2025 (or potentially even next year) will be qualitatively different from what we have today:

      Other things that are needed: hand tracking and maybe peripheral tracking (or mixed reality / optical tracking)

    2. Some body language.

      can go beyond. some people already sharing biometrics

    3. Why is it so much better?

      i think face tracking for subtle microexpressions could be necessary

    4. ,


    5. The sense of being transported to another place, being bodily present with others, is important for committing with your whole self to the people you're with and the thing you're doing together.

      makes sure ur attention is focused

    6. ,


    7. Most people don't realize this (and that's a large part of the problem), but when you have an audio call without audio isolation (without wearing headphones), the system silences you when other people are talking in order to prevent audio feedback. This disrupts natural turntaking negotiation and makes very lively conversation impossible.

      this seems not VR-specific but yeah more common in VR anyway

    8. Costs about 100$


    1. "human-level AI

      well, i mean by human-level AI, an AI that is human-level in the sense of its generality. Why should we assume humans are the most general?

    1. For a model using a point estimate of θ (such as an MLE or MAP), rather than a distribution over θ

      why is this necessary?

    1. Specifically, is it necessary to explicitly ground high-level actions into the representation of the environment? Or is it sufficient to simply inform the learner about the abstract structure of policies, without ever specifying how high-level behaviors should make use of primitive percepts or actions?

      do we need grounding?

    1. For the example in Figure 2, f might be an image rating model (Socher et al., 2014) that outputs a scalar judgment y of how well an image x matches a caption w

      like CLIP!

    1. While Transformers are good at modeling long-range global context, they are less capable to extract fine-grained local feature patterns.

      why is this though? hmm

    1. The 10th and 90th percentiles are estimated via bootstrapping data (80% of the dataset is sampled 100 times)

      what does "80% of the dataset is sampled 100 times" mean? as in what dataset are they referring to? the dataset of (N,D,L) values?
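
      My reading of the procedure, as a sketch (the data here is made up; I assume subsampling without replacement, since the phrasing is ambiguous):

```python
import numpy as np

# Bootstrap-style percentile estimate: draw 80% of the dataset 100 times,
# recompute the statistic on each draw, and take the 10th/90th percentiles
# of the 100 resulting estimates.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=500)   # stand-in for the fitted quantities

n_sub = int(0.8 * len(data))
estimates = [rng.choice(data, size=n_sub, replace=False).mean() for _ in range(100)]
p10, p90 = np.percentile(estimates, [10, 90])
print(p10, p90)
```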

    1. The relation between data compression (via prediction)and intelligence

      i would say there's a connection between compression and generalization, and between performance and intelligence. And therefore, because for a given amount of data better generalization means better performance, better compression tends to lead to higher intelligence

    1. This evaluation is perhaps slightly unfair, as we have not performed the obvious step of training the model on a much larger dataset of executions. This is an interesting direction for future work.

      It would be interesting to see if doing this would improve program synthesis

    2. 100% 12%

      hmm if the 12% is due to variability in sampling how come all of the ones that are solved in one but not the other, are in the edited dataset, given that these samples are the same??

    3. , indicating general variability in model performance

      do the two models have the same parameters and prompt? i guess it cant be, unless this variability is purely from the sampling?

    4. This method of checking correctness forced us to filter the MathQA dataset to keep only those problems for which the code evaluates to the declared numerical answer, resulting in us removing 45% of problems.

      what? so almost half of MathQA has wrong answers??

    1. How to characterize "task type," estimating how "difficult" and expensive a task is to “try” or “watch” once.

      this can be interpreted as asking about how much data is needed to be processed basically

    2. Trained on a task where each "try" took days, weeks, or months of intensive "thinking."

      hmm and how many tries are needed?

    1. In the generalisation-based approach, the way to create superhuman CEOs is to use other data-rich tasks (which may be very different from the tasks we actually want an AI CEO to do) to train AIs to develop a range of useful cognitive skills.

      This is just unsupervised pretraining

    2. reinforcement learning (which often requires billions of training steps even for much simpler tasks).

      reinforcement learning isnt the best type of algorithm for talking about this. We should be talking about supervised learning. And then the figures look very different


    1. if this is possible

      if either is possible

    2. BCI as Existential Security Factor

      Also could BCI be used to secure democratic states from devolving into totalitarism by allowing people to more strongly uphold and protect democratic principles?

    3. Supposing our estimates are out by an order of magnitude

      in the sense of the final estimate not the individual factors right?

    4. Having a tiny fraction of humanity free would not have powerful longterm positive effects for humanity overall, as it would be vanishingly unlikely that small, unthreatening populations could liberate humanity and recover a more positive future.

      hmm maybe by escaping into space? hmm

    5. Further evidence can be seen in the steady expansion in government surveillance by even democratic governments.

      again more of what you were talking about that democratic states seem not stable typically? hmm

    6. If there were only a single actor controlling the development of all BCIs, it would likely be an easier situation to control and regulate. Having multiple state actors, each of whom can decide the level of coercion they wish their BCI to be built with, is a much more complex scenario

      lol so if the world was authoritarian, it would be easier to control, wouldnt it?

    7. But BCI based surveillance would have no such flaw.

      Well BCI tech will have loopholes that defectors could probably utilize

  9. May 2022
    1. This allows the system to interactively perform actions and probecausal relationships

      this distinction proposed here amounts to me more like the distinction between on-policy and off-policy learning, hmm. Because if u can do off-policy learning, it doesnt matter much where the data came from

    2. The semantic layer is not contained in the data


    3. And the animal-level part might not be the most crucial part in this equation.

      That is indeed an interesting hypothesis

    4. Another example of what can be done in terms of hybridization is how we could use a genetic algorithm to evolve a formal grammar for a particular language, using a large language model like GPT-3 as an oracle to compute a fitness by comparing the kind of sentences that the current grammar candidates are producing and assessing how statistically realistic they sound.

      like distilling DNNs to a simpler program hmm

    5. its focus on handling only solution spaces that happen to allow for a gradient computation.

      what about the techniques used in dVAEs/MoEs to deal with non-differentiable components?


    1. However, to avoid circular dependencies in the graph, it requires that agent k choose its action before j, and therefore k can influence j but j cannot influence k.

      One thing i dont get is why they didnt try a basic influence computation based on the effect of the action at the previous time step on the current action. That way it doesnt have problems with circular dependencies like they have in their basic set up

    1. As the same as the reward case, DT takes the summation of the state-feature over a trajectory as an input

      hmm yeah this seems inherently less powerful by design no?

    2. In this case, CDT is unnecessary because if Φ learns sufficient features of s, matching their means, i.e. moments, through DT is enough to match any distribution to an arbitrary precision (Wainwright & Jordan, 2008; Li et al., 2015)

      i.e. we can just match the means of Φ, which could include any number of moments of s (thus matching any distribution of s)
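
      A tiny numerical illustration of this (my own sketch): if \(\Phi(s)\) stacks powers of s, the mean of \(\Phi\) is exactly the vector of moments of s, so matching that single mean vector matches k moments at once.

```python
import numpy as np

# Feature map whose mean equals the first k moments of s: matching E[phi(s)]
# between two distributions then matches all k moments simultaneously.
rng = np.random.default_rng(0)
s = rng.exponential(scale=1.0, size=100_000)   # true moments: E[s^k] = k!

def phi(x, k=4):
    return np.stack([x**i for i in range(1, k + 1)], axis=-1)

mean_features = phi(s).mean(axis=0)
print(mean_features)   # ~ [1, 2, 6, 24]
```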

  10. Apr 2022
    1. We split our static dataset 50:50, and trained separate PMs on each half, which we refer to as train PMs and test PMs

      the test PM is not necessarily an unbiased estimate of the true reward tho? Tho I guess we can assume it's close to one~?

    2. This means that our helpfulness dataset goes ‘up’ in desirability during the conversation, while our harmlessness dataset goes ‘down’ in desirability. We chose the latter to thoroughly explore bad behavior, but it is likely not ideal for teaching good behavior. We believe this difference in our data distributions creates subtle problems for RLHF, and suggest that others who want to use RLHF to train safer models consider the analysis in Section 4.4.

      " We chose the latter to thoroughly explore bad behavior, but it is likely not ideal for teaching good behavior." hmm what do they mean? of course the bad behaviour dataset is not good for teaching good behaviour? or do they mean that even negatively-rewarding that data is not an ideal way to teach good behavior? hmm?

    1. We hypothesize that this might be due to a domain gap between the natural images that CLIP has been trained on and the simulated images from CALVIN.

      or that images and related texts are not actually mapped to same spaces, people seem to find? hmm

  11. Mar 2022
    1. The task parameters refer to the variables that can be collected by the system and that describe a situation, such as positions of objects in the environment. The task parameters can be fixed during an execution trial or they can vary while the motion is executed

      So they are basically what would be called "observations" in RL

  12. Feb 2022
    1. \(C^* \in [1, 1 + \tilde{O}(1/N)]\)

      why would C* (or the bounds on it) depend on the number of data samples N?

    2. For analysis purposes, we consider a BC algorithm that matches the empirical behavior policy on states in the offline dataset, and takes uniform random actions outside the support of the dataset.

      Seriously? So no assumption of generalization? Which is precisely the strength of BC....

    3. This condition holds in sparse-reward tasks

      doesnt it hold even if there's a reward at every time step?

  13. Dec 2021
    1. Although in general, the optimal policy does vary with the horizon, in environments where it is possible to stay at the goal, a Markovian policy that reaches, then stays at the goal can be near-optimal

      I guess the optimal action if you know you have a lot of time left, vs if you know the episode is finishing soon, is different

    1. it should be possible to extend it to achieve any goal which does not change due to the simulated character’s behaviour

      A goal parametrizes a reward. What does a reward that doesn't change due to the simulated character's behaviour mean? Rewards are functions of the simulated character behaviour aren't they?

    2. The reward structure is similar to that of Bergamin et al. [2019]

      why use exponential here? It is not intuitive that this would provide better reward gradients. For example, if two of the losses were quite high, the reward would stay close to 0 even if improvement was made on one of the two?

    3. Again, PPO’s reward definition being a multiplication of terms likely lessens the effect of any one loss component having more influence on the motion imitation than another.


    4. before being converted to quaternions

      via the exponential map. Why are outputs ok to be axis angles, but not inputs? hmm

    5. Second, that the transition function of these two individual character states is partially separable; generally the kinematic character’s behaviour does not change because of the simulated

      what do you mean here? Isnt the kinematic character behaviour determined by the simulation?? or does the kinematic character refer to the target pose??

  14. Nov 2021
    1. benefits from models leveraging temporal abstraction [58, 66]. Inspired by these works, we extend the RNN model from BC-RNN into a multi-layered variant (TieredRNN). This new variant integrates multiple layers operating at varying timesteps to better streamline information flow from timesteps early on in a given sequence to timesteps much further downstream, which can be useful for our long-horizon MM tasks.

      why not use Transformers?

    1. (c)

      z hat encodes the probability of different robot strategies, given the human's action

    2. \(z_t \sim \psi(z_t \mid s_{t-K:t}, a^H_{t-1})\)

      wouldnt it make sense to get z from the past history of actions? Tho I guess here they are modelling z as being independent at each time step..

    1. compute energies

      if N_samples is large, then this could be very memory or compute intensive right?

  15. Oct 2021
    1. we can define this simple environment to be in N dimensions

      dimension of the observations no?

  16. Sep 2021
    1. The observation suggests that the semantic structure of word and object categories share the common representation in human cognitive processing.

      is this about that, or about saliency? hmm

    2. The Stroop effect known in human cognitive science literature is one of such biases. When human participants observe a mismatched word, where the word “red” with the “blue” color, and are asked to answer the “color” of the word, their response to the word can be slower than when they see the word “red” with the “red” color (Stroop, 1935).

      This is very Magritte

    3. ir recognitio

      recognition of what?


    1. They are somehow unable to grasp that someone could categorically attack ALL agents on the basis that they do not exist, and that it is potentially harmful to believe that they do.

      but there are agents, they are people no? That's the danger of this idea I believe

    2. But remember, although agent programs tend to share a set of deficiencies, it is your psychology that really makes a program into an agent; a very similar program with identical capabilities would not be an agent if you take responsibility for understanding and editing what it does. An agent is a way of using a program, in which you have ceded your autonomy. Agents only exist in your imagination. I am talking about YOU, not the computer.

      Agents are a social construct xP xd

    1. we leverage the learned reward function to scan goal candidates

      oh so rather than a captioner hmm

    2. Bahdanau et al. [4] and Fu et al. [29] also learn a reward function but require extensive expert knowledge (expert dataset and known environment dynamics respectively), whereas our agent uses experience generated by its own exploration.

      and the social partner!

    3. Language can thus be used to generate out-of-distribution goals by leveraging compositionality to imagine new goals from known ones

      Rather than compositionality, I'd say that language has the property of being lower Kolmogorov complexity, which perhaps is related to its property that different possible generated sentences are more likely to be interesting than randomly generated images, because language is a better world model!

    4. Children do so by leveraging the compositionality of language as a tool to imagine descriptions of outcomes they never experienced before, targeting them as goals during play.

      What is compositionality? Isn't it just modelling the distribution of language?

    1. Existing beam-search [17, 48, 53] and backtracking solutions [24, 29] are infeasible due to the larger action and state spaces, long horizon, and inability to undo certain actions.

      How does their expert planner work then? Hmm

    2. This inference is more realistic than simple object class prediction, where localization is treated as a solved problem

      Although this has been solved for ALFRED now it seems, as everyone is using a FasterRCNN for the mask prediction, and just predicting labels.

      Modularizing the problem~~

    1. This time the model got the “turn right” thing all right. We still have the disambiguation problem, there are two tables involved here. I also disagree with “turning around” as the appropriate description for the navigation part after picking up the box (more like “make a right and then make a left”). The captioner may be picking up correlations in the language directly with bad conditioning on the frames, as picking up an object often is followed by turning around, as seen in the previous examples.

      Tbh I find it hard to pick up the directions in which it turns sometimes, due to the discrete nature of the actions.

    2. That is, use the same framework and algorithm, but replace the sequence of words corresponding to goal sentences with goals as sequences of final state-action tuples. (This would be the time-extended version of simple goal-conditioned RL algos with goals as final states. In this framework, no relabelling is needed).

      One case where this type of goal-conditioning would fail, while the language one wouldn't, would be, for goals that characterize the whole trajectory.

  17. Jul 2021
    1. We conclude that the synthetic data is especially important for generalization to unseen environments.

      well you are basically providing new data

    2. We conjecture that domain-specific language pretraining is important for the ALFRED benchmark.

      I conjecture this shows that the ALFRED benchmark is too narrow

    3. When evaluated on human instructions

      so is this a model that takes either visual embeddings or language embeddings as instructions?

    4. We note that performance of the model with 16 input frames is close to the performance of the full episode memory agent, which can be explained by the average task length of 50 timesteps

      huh? Isn't 16 significantly less than 50?

    5. it is desirable for the agents to be able to handle raw sensory inputs, such as images or videos.

      you should explain why

  18. Jun 2021
    1. This setting is harder as it removes the ability for agents to explore the environment and collect additional feedback.

      but it's easier in that you are not tasked to do the exploration well

    1. evaluate the likelihood under our discretized model, we treat each bin as a uniform distribution over its specified range; by construction, the model assigns zero probability outside of this range.

      Oh I see. Why didn't they use the log likelihood of the discretized sequence? Here they are ignoring the softmax probability and just calculating the log likelihood corresponding to a particular discrete sequence sample?? That doesnt seem likely, as, as long as one token is wrong, this would give a 0 probability. So I guess they are still weighting by softmax hm. Yeah that seems to be what it is, looking at Appendix B. I just think they didn't write it very clearly
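
      My reading of their evaluation, as a sketch (variable names and sizes made up, not their code): the softmax weight of the bin containing each true value is used, with each bin treated as uniform over its width:

```python
import numpy as np

# log p(x) per dimension = log(softmax prob of the bin containing x) - log(bin width),
# summed over dimensions/timesteps. No 0/1 check against a sampled token.
def discretized_loglik(values, logits, edges):
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)                  # softmax over bins
    idx = np.clip(np.searchsorted(edges, values) - 1, 0, len(edges) - 2)
    widths = edges[idx + 1] - edges[idx]
    return float(np.sum(np.log(probs[np.arange(len(values)), idx]) - np.log(widths)))

edges = np.linspace(-1.0, 1.0, 101)                        # 100 equal-width bins
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 100))                         # 5 dimensions, 100 bins each
values = rng.uniform(-1.0, 1.0, size=5)
ll = discretized_loglik(values, logits, edges)
print(ll)
```

Note that with bin widths below 1 the density inside a bin can exceed 1, so such a continuous log likelihood can come out positive.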

    2. To better isolate the source of the Transformer’s improved accuracy over standard single-step models, we also evaluate a Markovian variant of our same architecture. This ablation has a truncated context window that prevents it from attending to more than one timestep in the past. We find that this model performs similarly to the trajectory Transformer on fully-observed environments, suggesting that architecture differences and increased expressivity from the autoregressive state discretization play a large role in the trajectory Transformer’s long-horizon accuracy.

      Aha! So it's more about expressing probability distributions over outputs well!! I've also been noticing that with transflower!^^

    3. log likelihood

      why is their log likelihood positive, if they are using a discrete model, so log likelihoods should be negative??

    4. , the conditional probabilities of the past given the future are still well-defined, allowing us to condition samples not only on the preceding states, actions, and rewards that have already been observed, but also any future context that we wish to occur. If the future context is a state at the end of a trajectory, we decode trajectories with probabilities of the form

      This is the same approach that Lynch et al follow

    5. While this choice may seem inefficient, it allows us to model the distribution over trajectories with more expressivity, without simplifying assumptions such as Gaussian transitions.

      But again there are alternatives like modelling using normalizing flows, VDVAEs, dVAEs, DDPMs, etc

    6. modeling considerations are concerned less with architecture design and more with how to represent trajectory data – consisting of continuous states and actions – for processing by a discrete-token architecture

      but there are ways around this, like MoGlow or Transflower

    7. and we no longer require ensembles or other epistemic uncertainty estimators, as is common in prior work on model-based RL

      Not sure what they are referring to here?

      Are these the probabilistic aspects used for exploration? But here they don't solve the exploration problem as they rely on high-reward demonstrations from some other exploration algorithm no?

  19. May 2021
    1. relaxing the conditioning on X

      again i don't think you are relaxing the conditioning on X; you are finding a high probability limit as X grows to be infinite in size, I think

    2. Relaxing the conditioning on \(X\) gives us the limiting spectral distribution \(\rho^{MP}_{\gamma} \rho^{MP}_{\psi}\), because \(\mu_X\) is almost surely \(\rho^{MP}_{\psi}\), which terminates the proof.

      is it relaxing the conditioning on X, or is it just the latter observation?

      I wouldn't call this "relaxing the conditioning on X"


    1. intra-class correlations and strong inter-class

      I thought we were talking about binary classification?

    2. Observe that if the average test error over the version space \(\varepsilon(VS)\) is low, then most members of the version space must have low test error (either intuitively, or by Markov’s inequality).

      We prove a much stronger high-probability inequality in our "Generalization bounds for deep learning" work

    3. In words, typical solutions may be described in a number of bits close to the typical information content

      if the measure P is sufficiently biased, and simple, etc. See the AIT arguments in Dingle et al and Valle-Perez et al

    1. While the number of steps until validation accuracy >99% grows quickly as dataset size decreases, the number of steps until the train accuracy first reaches 99% generally trends down as dataset size decreases and stays in the range of \(10^3\)–\(10^4\) optimization steps.

      But what if you keep running the training even longer? Are they just doing very inefficient model selection with the validation data, thus effectively using the validation data as extra training data?? I.e. just keep training, until the model happens to reach a good validation accuracy, and then stop? That sounds like training on the validation data?

    1. Articulatory suppression and interference have been used as methods to investigate the contribution of verbal, spatial, and/or motor codes in encoding, rehearsing, and recalling series of dance items from working memory (e.g., Jean et al. 2001; Rossi-Arnaud, Cortese, & Cestari, 2004). Suppression tasks involve performing a task at the same time as observing the to-be-remembered material, and are used to disrupt encoding of the to-be-remembered (TBR) material.

      like the experiments Feynman did for time perception:)

    1. Leveraging the extracted kinematic beats, we define the dance unit in this work. As illustrated in Figure 2(b), a dance unit is a temporally standardized short snippet, consisting of a fixed number of poses, whose kinematic beats are normalized to several specified beat times with a constant beat interval.

      Wait, so do they do this under the assumption that every music beat will have a matching kinematic beat? I don't understand how they "normalize" the dance in a way that every kinematic beat aligns with a music one :P

    2. We thus adopt an auxiliary reconstruction task on the latent movement sequences to facilitate training: \(\mathcal{L}^{m}_{recon} = \mathbb{E}\big[\big\lVert \{\hat{z}^i_{mov}\} - \{z^i_{mov}\}\big\rVert_1\big]\).

      So they are losing some of the distributional modelling properties of the GAN...

    1. While the generation diversity is ensured by the adversarial learning schema

      Doubt. You don't add noise to the model, do you? Or if you do, you don't say it.

    2. So our proposed BC, which finds a matched kinematic beat for each music beat, is a better metric

      but it has the flaw that if there are very frequent kinematic beats this could get a good value, even though the dance isn't really capturing the rhythm
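
      The flaw can be illustrated with a toy beat-matching score (a hypothetical nearest-beat metric, not the paper's exact BC formula): a frantic dance with kinematic beats everywhere matches every music beat almost perfectly, without capturing the rhythm at all.

```python
import numpy as np

# Hypothetical BC-like score: for each music beat, find the nearest kinematic
# beat and score its closeness with a Gaussian kernel.
def beat_match_score(music_beats, kin_beats, sigma=0.1):
    d = np.min(np.abs(music_beats[:, None] - kin_beats[None, :]), axis=1)
    return np.mean(np.exp(-d**2 / (2 * sigma**2)))

music = np.arange(0, 10, 0.5)      # regular music beats (seconds)
aligned = music + 0.02             # dance genuinely following the rhythm
dense = np.arange(0, 10, 0.05)     # kinematic beats everywhere, no rhythm

# Both score near-perfectly, so the metric cannot tell them apart.
assert beat_match_score(music, aligned) > 0.9
assert beat_match_score(music, dense) > 0.9
```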

    3. Position Variance (PVar) evaluates the diversity of the generated dance

      this is quite a crude way to evaluate diversity

    4. For each node, 3 individual fully-connected layers embed the joint node features into Q, K and V vectors, and then we perform 24 individual LLA modules on Q, K and V of the nodes.

      so it's basically independent attention for each node/joint?

      With feature sharing in the KCN layer?

    5. \(\mathcal{N}(t_i, \sigma^2)\)

      what, the mean increases with time? Or what is \(t_i\)?

    6. \(\mathrm{Att}_{\varphi}(q_i, K, V) = \psi(q_i\,\tilde{\varphi}(K^T))\,\varphi(V)\)

      isn't this somewhat similar to T5-style relative positional encoding?

    7. we adopt an adversarial training architecture

      However, their architecture is not probabilistic, in the sense that it won't produce different dances for the same song. That said, since no two songs are likely to be identical, even very similar songs could effectively inject some noise.

    8. During training the start pose is simply selected as the first frame of the dance sequence and the right-shifted outputs are just taken from ground truth with mask for future time steps.

      So they use a causal decoder.

      I guess they could run this indefinitely, in an autoregressive way, by just restricting the encoder/decoder context window to some max length of the past.

    9. In implementation we use a 4-knot CHS to fit a motion curve between two key poses, as an optimal trade-off between fitting precision and representation simplicity

      "optimal"? In which sense

    10. position and rotation

      but position can be determined from rotation no?

    11. The motion capture methods are more accurate but still suffer from noise and misalignment to music.

      Do they? They are usually aligned to music, and the noise from a good mocap system isn't much, and can be cleaned up relatively well as far as I know.

    12. Owing to the LLA module, the training is more stable and the degradation to averaging actions is avoided

      Hmm, I would have thought that more context helps in avoiding degeneration into the mean pose for deterministic models.

    13. Different from sentences in NLP that have long-term correlations, human poses and motions only have local correlations in a short time interval.

      That is very debatable

    1. similarity between them.

      we need a discriminator, or CLIP-like model for music/audio-motion!

    2. This poses a unique challenge for cross-modal learning between music and motion

      Probably better handled by probabilistic models

    1. If a model instead develops a more general representation, it should be able to apply learned functions also to longer input strings. Its performance on longer strings may drop for other, practical, reasons, but this drop should be more smooth than for a model that has not learned a general-purpose representation at all.

      What kind of "practical reaons" are you referring to here?

    2. Note that, in our case, this ability does not require productivity in generating output strings, since the correct output sequences are not distributionally different from those in the training data (in some cases, they may even be exactly the same)

      but it requires productivity w.r.t. the inputs no?

    3. When the percentage of exceptions becomes too low, on the other hand, all models have difficulties memorising them at all:

      what about with larger models, which may be better at memorizing?

    4. One potential explanation for this score discrepancy is that, due to the slightly different distribution of examples in the systematicity data set, the models learn a different solution than before.

      Ah, they train on a different training set for each compositionality task?

    5. As observed by many others before us, insight in the compositional skills of neural networks is not easily acquired by studying models trained on natural language directly. While it is generally agreed upon that compositional skills are required to appropriately model natural language, successfully modelling natural data requires far more than understanding compositional structures. As a consequence, a negative result may stem not from a model's incapability to model compositionality, but rather from the lack of signal in the data that should induce compositional behaviour. A positive result, on the other hand, cannot necessarily be explained as successful compositional learning, since it is difficult to establish that a good performance cannot be reached through heuristics and memorisation.

      if we cannot even measure/evaluate compositionality in natural language, how can we be so sure it is necessary for modeling it?

    6. If an expression is altered by replacing one of its constituents with another constituent with the same meaning (a synonym), this does not affect the meaning of the expression (Pagin, 2003)

      How is "meaning" defined?

    7. As a consequence, the interpretation of the principle of compositionality depends on the type of constraints that are put on the semantic and syntactic theories involved.

      so that should really be the content of a theory of generalization, not the formally vacuous principle

    8. Fodor and Pylyshyn (1988) contrast systematicity with storing all sentences in an atomic way, in a dictionary-like mapping from sentences to meanings. Someone who entertains such a dictionary would not be able to understand new sentences, even if these sentences were similar to the ones occurring in their dictionary. Since humans are able to infer meanings for sentences they have never heard before, they must use some systematic process to construct these meanings from the ones they internalised before

      So here 'systematicity' simply seems to mean "uses a simple rule to describe the data", which indeed is the key to generalization

    9. This ability concerns the recombination of known parts and rules: anyone who understands a number of complex expressions also understands other complex expressions that can be built up from the constituents and syntactical rules employed in the familiar expressions. To use a classic example from Szabó (2012): someone who understands ‘brown dog’ and ‘black cat’ also understands ‘brown cat’.

      this is a bit tautological, depending on how you operationally define 'understand' I think

    10. In the substitutivity test, we evaluate how robust models are towards the introduction of synonyms, and, more specifically, in which cases words are considered synonymous.

      This also feels like a case of (a), no?

    1. The binary nature of many preference tests means that responses are relatively information-poor and makes it harder to verify that two conditions are statistically different.

      can you not do a Likert-type pairwise comparison?

  20. Apr 2021
    1. \(u^T W v\)


    2. pθ(z)

      this should read \(-\ln{p_\theta ( \mathbf{z})}\)

    3. while reverse KL-divergence is not a viable objective function,

      I guess in the above derivation we don't really have access to \(p_d(x)\), even if we had the trained discriminator. We only have the softmaxes, which give us the two ratios in expression (23); these are conditional probabilities (\(p(g|x)\) and \(p(d|x)\)) rather than \(p_d\) and \(p_g\) separately.

    4. Additionally, it has been suggested that reverse KL-divergence, \(D_{KL}(p_g\|p_d)\), is a better measure for training generative models than normal KL-divergence, \(D_{KL}(p_d\|p_g)\), since it minimises \(\mathbb{E}_{x\sim p_g}[\ln p_d(x)]\) [92]

      I think the intuition here is that minimising \(D_{KL}(p_g\|p_d)\) has the property of ensuring that \(p_g\) is contained inside the support of \(p_d\). So we are more sure that samples will be realistic, though we may suffer from mode collapse, like GANs, but that may be a less serious problem.
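
      This mode-seeking vs mode-covering behaviour is easy to check numerically with a discretized 1-D toy example (entirely made up, just to illustrate the asymmetry):

```python
import numpy as np

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def kl(a, b):
    # Discretized KL divergence D(a || b) on the grid.
    eps = 1e-300
    return np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dx

p = 0.5 * gauss(x, -3, 0.5) + 0.5 * gauss(x, 3, 0.5)  # bimodal "data" density
q_mode = gauss(x, 3, 0.5)   # covers one mode only (mode-seeking fit)
q_wide = gauss(x, 0, 3.0)   # spreads over both modes (mode-covering fit)

# Reverse KL D(q||p) prefers staying inside p's support (one mode)...
assert kl(q_mode, p) < kl(q_wide, p)
# ...while forward KL D(p||q) heavily punishes missing a mode.
assert kl(p, q_wide) < kl(p, q_mode)
```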

    5. 2-Stage VAEs: By interpreting VAEs as regularised autoencoders, it is natural to apply less strict regularisation to the latent space during training, then subsequently train a density estimator on this space, thus obtaining a more complex prior [53]. Vector Quantized-Variational Autoencoders (VQ-VAE) [170], [215] achieve this by training an autoencoder with a discrete latent space, then training a powerful autoregressive model (see Section 5) on the latent space.

      So I think what VQ-VAEs are trying to achieve is a VAE with a very flexible learned prior. If we look at the VAE objective (eq. 19), we see that if \(p(z)\) equals \(q(z)\), then the KL divergence (averaged over \(q(x)\), which I'm interpreting to be the data distribution) becomes \(-I(z;x)\), i.e. minus the mutual information between \(z\) and \(x\). Interesting!

      However, we can think of the VQ-VAE as basically ignoring this explicit regularization term during its first stage (or maybe it's implicitly approximated via its codebook training objectives! hmm). In the second stage, we just fit the prior to approximate \(q(z)\), which is what we were assuming to be true in the analysis.

      It seems that some of the other works cited in the previous paragraph try to use the expression in eq. 19 (and thus the \(-I(z;x)\) regularization) directly, though because it is intractable to compute, they approximate it in different ways.
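
      The identity behind this reading (a standard decomposition, with \(q(x)\) the data distribution and \(q(z)=\mathbb{E}_{q(x)}[q(z\mid x)]\) the aggregate posterior) is:

```latex
\mathbb{E}_{q(x)}\big[ D_{KL}\big(q(z \mid x)\,\|\,p(z)\big) \big]
  = I_q(x; z) + D_{KL}\big(q(z)\,\|\,p(z)\big),
```

      so when \(p(z) = q(z)\) the second term vanishes and the KL term reduces to the mutual information, which enters the ELBO with a minus sign.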

    6. 2-stage VAEs that first map data to latents of dimension \(r\) then use a second VAE to correct the learned density can better capture the data [35]

      because then the second VAE does recover the data distribution on the latent, according to that paper. Interesting!

    7. (14)

      I think here \(q(x)\) is supposed to be \(q(x|z)\), and the \(e^{E(x)}\) should be \(e^{E(z)}\)? (Well, \(e^{E(z)}\) is equivalent to \(e^{E(x)}\) if \(x\) is a function of \(z\)...) But yeah, I think the \(q\) part should be \(q(x|z)\). For example, if using normalizing flows, it would be the Jacobian of the inverse flow.

    8. implicit energy models

      What are implicit energy models? I thought they said previously that EBMs are not implicit generative models?

    9. This is made worse by thefinite nature of the sampling process, meaning that samplescan be arbitrarily far away from the model’s distribution[57].

      Does he mean because of the finite step size? Is that a big problem? Hmm, not sure I understand this sentence.

    10. the gradient of the negative log-likelihood loss \(\mathcal{L}(\theta) = \mathbb{E}_{x\sim p_d}[-\ln p_\theta(x)]\) has been shown to approximately demonstrate the following property [21], [193]

      this contrastive divergence result is very cool!
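
      For reference, a sketch of the standard identity being quoted, in the usual EBM convention \(p_\theta(x) = e^{-E_\theta(x)}/Z_\theta\) (the survey's exact statement may differ in sign convention):

```latex
\nabla_\theta \mathcal{L}(\theta)
  = \mathbb{E}_{x \sim p_d}\!\big[\nabla_\theta E_\theta(x)\big]
  - \mathbb{E}_{x \sim p_\theta}\!\big[\nabla_\theta E_\theta(x)\big].
```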

    11. TABLE 1

      what do the stars in training speed, sample speed and param efficiency correspond to, quantitatively?

      Also, it would be nice to know robustness to hyperparameters, as that is often a big part of "training time"

    12. training speed is assessed based on reported total training times

      hmm, ideally we would know if they had trained until convergence, or kept training beyond convergence.

    13. Choosing what to optimise for has implications for sample quality, with direct likelihood optimisation often leading to worse sample quality than alternatives.

      is this in part because of noise in the data, which the likelihood-based models also fit?

    1. constant memory

      They say constant memory, but below the memory is said to be \(O(N)\). Which one is true? As far as I can tell, the latter is a typo?

    1. especially in the most difficult long horizon setting.

      actually the graph shows a larger difference for 1 task than for more tasks?

    2. This process is scalable because pairing happens after-the-fact, making it straightforward to parallelize via crowdsourcing.

      could you not crowdsource instruction following too?

      Maybe this adds extra diversity though

      Probably combining both would be best

  21. Mar 2021
    1. Training an agent for such social interactions most likely requires drastically different methods – e.g. different architectural biases – than classical object-manipulation training

      Or a lot of pre-training data, which, given current empirical findings, tends to work better.

    2. To enable the design and study of complex social scenarios in reasonable computational time

      alternatively you could consider more complex environments but with more offline algorithms, like bootstrapping from supervised learning

    3. rather than language-based social interactions

      some important recent counterexample is IIL.


    1. (8)

      This whole variational calculation is basically like combining Monte Carlo integration (with importance sampling) and Jensen's inequality (to bring the expectation outside the log). The cool thing is that optimizing over \(q\) makes the approximation exact, if our model for \(q\) is sufficiently expressive.
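
      In generic notation, the combination is (importance sampling inside the log, then Jensen's inequality to move the expectation outside):

```latex
\ln p_\theta(x)
  = \ln \mathbb{E}_{q(z \mid x)}\!\left[\frac{p_\theta(x, z)}{q(z \mid x)}\right]
  \;\ge\; \mathbb{E}_{q(z \mid x)}\!\left[\ln \frac{p_\theta(x, z)}{q(z \mid x)}\right],
```

      with equality exactly when \(q(z \mid x) = p_\theta(z \mid x)\).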

    2. Consequently, maximizing the log-likelihood of the continuous model on uniformly dequantized data cannot lead to the continuous model degenerately collapsing onto the discrete data, because its objective is bounded above by the log-likelihood of a discrete model.

      I don't see how this argument works

    1. In our datasets Fig. 2c, we find empirically that for the same amount of collection time, play indeed covers 4.2 times more regions of the available interaction space than 18 tasks worth of expert demonstration data, and 14.4 times more regions than random exploration.

      That is indeed a cool finding

    2. Play data is cheap: Unlike expert demonstrations (Fig. 5), play requires no task segmenting, labeling, or resetting to an initial state, meaning it can be collected quickly in large quantities.

      Well, you still need to have enough people playing for enough time, which may not be cheap. For example, in Imitating Interactive Intelligence they had to spend about 200K pounds to pay people to play with their environment.

    3. We additionally find that play-supervised models, unlike their expert-trained counterparts, are more robust to perturbations and exhibit retrying-till-success behaviors.

      I guess because in the play data there were examples of reaching the goal even from suboptimal trajectories.

    1. Intuitively, this is equivalent to taking the average of demonstrated actions at each specific state.

      unless you model the distribution, e.g. using normalizing flows, which should be better?
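
      A minimal sketch of the point, with hypothetical bimodal demonstrations at a single state: the L2-optimal (behaviour-cloning) prediction is the average action, which was never demonstrated, whereas a density model would keep both modes.

```python
import numpy as np

# At one state, the demonstrator goes left (-1) half the time and right (+1)
# the other half. An MSE-trained policy converges to the conditional mean.
actions = np.array([-1.0, 1.0] * 50)
mse_optimal = actions.mean()

assert abs(mse_optimal) < 1e-9  # predicts ~0: neither demonstrated mode
```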

    1. TextWorld addresses this discrepancy by providing programmatic and aligned linguistic signals during agent exploration.

      but isn't this just substituting the human language instructor with a rule-based one that is bound to be of lower quality?

    1. Informally, all else being equal, discontinuous representations should in many cases be “harder” to approximate by neural networks than continuous ones. Theoretical results suggest that functions that are smoother [34] or have stronger continuity properties such as in the modulus of continuity [33, 10] have lower approximation error for a given number of neurons.

      and they probably generalize better, as there are several works showing that DNNs are implicitly biased towards "smooth" functions

    1. In the skill learning phase, LGB relies on an innate semantic representation that characterizes spatial relations between objects in the scene using predicates known to be used by pre-verbal infants [Mandler, 2012].

      so this is feature-engineered no?

    2. Although the policy over-generalizes, the reward function can still identify whether plants have grown or haven’t.

      how has the reward function learned the association between "feed" and "the object grows"? I guess that was taught from the language descriptions? It should be able to learn the reward function correctly then

    3. more aligned data autonomously.

      I think this is similar to the idea of self-training

    1. This would require a means for representing meaning from experience—a situation model—and a mechanism that allows information to be extracted from sentences and mapped onto the situation model that has been derived from experience, thus enriching that representation

      This basically means adding extra information that is inferred to what is just directly observed, no?

    1. What can and should the user be doing while the AI agent is taking its turn to increase engagement?

      Maybe the agent's actions themselves should be engaging enough? We should aim for that I think

    2. We employ a turn-based framework because it is a common way of organizing co-creative interactions [3,12,13] and because it suits evolutionary and reinforcement-learning approaches that require discrete steps [2, 7, 8, 14].

      I think that's a significant limitation. More fluid interactions can only take place in continuous-time settings.

    1. As can be seen in Figure 3 (left), the training performance was sensitive to the weight scale \(\sigma\), despite the fact that a weight normalisation scheme was being used.

      It would be interesting to explore whether this pitfall can actually have an effect in some scenario where one isn't using an abnormally high initialization.

  22. Feb 2021
    1. natural motions

      more natural than the baseline*

    2. For quantitative evaluation, we computed the mean squared error between the generated motion and motion capture on a left-out test set, for fingertip positions and joint angles

      this is problematic because there could be many motions which are good but quite different, and thus have a big MSE

    3. In total, we used approximately 120 minutes of data

      what? why didn't you use more data? ... We need to do scaling experiments with this

    1. 2) autoregression reduces the amount of fast moments, making the velocity histogram more similar to the ground truth

      huh? I see autoregression increasing the amount of fast movements, no?

    2. “In which video...”: (Q1) “...are the character's movements most human-like?” (Q2) “...do the character's movements most reflect what the character says?” (Q3) “...do the character's movements most help to understand what the character says?” (Q4) “...are the character's voice and movement more in sync?”

      It would also be good to do observational studies where users are simply asked to interact with different characters, and we measure how engaged they are.

    3. Hence, after five epochs of training with autoregression, our model has full teacher forcing: it always receives the ground-truth poses for autoregression. This procedure greatly helps with learning a model that properly integrates non-autoregressive input.

      Interesting, I would have guessed that doing it the other way around (starting with teacher forcing and annealing towards fully autoregressive training) would have been the natural curriculum.

      What was the idea for doing this? Is it basically to gradually make the information in the autoregressive part of the input more and more predictive, so that the network can anneal from using features in the speech input to using features in both the speech and the autoregressive motion?

    4. This pretraining helps the network learn to extract useful features from the speech input, an ability which is not lost during further training.

      I wonder if self-attention, like in transformers, would be better at learning which features to pick up on

    5. we pass a sliding window spanning 0.5 s (10 frames) of past speech and 1 s (20 frames) of future speech features over the encoded feature vectors.

      so you can't generate gestures from audio/text in real time with this

    6. feature vector \(V_s\) was made distinct from all other encodings, by setting all elements equal to \(-15\)

      it may be a good idea to learn these embeddings, no?

    1. three different domains: U.S. presidents, dog breeds, and U.S. national parks. We use multiple domains to include diversity in our tasks, choosing domains that have a multitude of entities to which a single question could be applied

      three domains wow much diversity

    2. We assume each task has an associated metric \(\mu_j(D_j, f_\theta) \in \mathbb{R}\), which is used to compute the model performance for task \(\tau_j\) on \(D_j\) for the model represented by \(f_\theta\).

      So this assumes that the reward is definable. In some tasks, it may not be so easy, right? We may need to learn rewards.

    1. ecological pre-training

      what's ecological pretraining?

    1. BART was pre-trained using a denoising objective and a variety of different noising functions. It has obtained state-of-the-art results on a diverse set of generation tasks and outperforms comparably-sized T5 models [32].

      wait, so it was just trained on reconstruction? hmm, interesting.

      I guess the fine-tuning then really changes the output in this case, even though it still reuses knowledge in the model?

    1. We believe these properties provide good motivation for continuing to scale larger end-to-end imitation architectures over larger play datasets as a practical strategy for task-agnostic control.


    1. Multiple avenues, including understanding more deeply the mechanisms of creative, knowledge-rich thought, or transferring knowledge from large, real world datasets, may offer a way forward.


    2. To go beyond competence within somewhat stereotyped scenarios toward interactive agents that can actively acquire and creatively recombine knowledge to cope with new challenges may require as yet unknown methods for knowledge representation and credit assignment, or, failing this, larger scales of data.

      Probably most reliable approach: Larger scales of data

    3. To record sufficiently diverse behaviour, we have “gamified” human-human interaction via the instrument of language games.


    4. Winograd envisioned computers that are not “tyrants,” but rather machines that understand and assist us interactively, and it is this view that ultimately led him to advocate convergence between artificial intelligence and human-computer interaction (Winograd, 2006)

      And VR is a big part in the next step in human-computer interaction

    5. Generally, these results give us confidence that we could continue to improve the performance of the agents straightforwardly by increasing the dataset size.

      yeah, if you have lots of money to pay people...

      but that is not that scalable

    6. Although the agents do not yet attain human-level performance, we will soon describe scaling experiments which suggest that this gap could be closed substantially simply by collecting more data.

      We need more data

    7. The regularisation schemes presented in the last section can improve the generalisation properties of BC policies to novel inputs, but they cannot train the policy to exert active control in the environment to attain states that are probable in the demonstrator’s distribution.

      Unless that active control can be learned by generalizing from learned actions in the demonstrations?