9 Matching Annotations
  1. Aug 2024
    1. SPOC uses SIGLIP image and text encoders that produce 84 × 768 (n_patch × d_image) features per image and a 768-dimensional feature per text token. We use a 3-layer transformer encoder and decoder with a 512-dimensional hidden state (d_visual) and 8 attention heads for the goal-conditioned visual encoder and action decoder, respectively. We use a context window of 100. All models are trained with a batch size of 224 on 8×A100 GPUs (80 GB memory/GPU) with the AdamW optimizer and a learning rate of 0.0002. Single-task models are trained for 20k iterations, while multi-task models are trained for 50k iterations.
      • encoder - SigLIP (84 patch, 768 emb size)
      • transformer - 3 layer, 512 hidden state, 8 heads
      • context - 100
      • batch size - 224
      • 8xA100 gpu (80 gb)
      • optim - AdamW
      • learning rate - 0.0002
      • iterations - 20k - 50k
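
      The hyperparameters above translate into a short PyTorch sketch. This is illustrative only, using stock nn.Transformer modules rather than SPOC's actual implementation; the variable names and wiring are assumptions, while the numbers (layers, dimensions, heads, context, batch size, learning rate) come from the quoted passage.

      ```python
      import torch
      import torch.nn as nn

      # Numbers from the paper; names and wiring are assumptions.
      N_PATCH, D_IMAGE = 84, 768      # SigLIP image features per frame
      D_VISUAL = 512                  # transformer hidden state (d_visual)
      N_HEADS, N_LAYERS = 8, 3
      CONTEXT = 100                   # context window (frames)

      proj = nn.Linear(D_IMAGE, D_VISUAL)  # map 768-d patch features to 512-d

      # Goal-conditioned visual encoder: 3-layer transformer encoder.
      visual_encoder = nn.TransformerEncoder(
          nn.TransformerEncoderLayer(D_VISUAL, N_HEADS, batch_first=True),
          num_layers=N_LAYERS,
      )
      # Action decoder: 3-layer transformer decoder.
      action_decoder = nn.TransformerDecoder(
          nn.TransformerDecoderLayer(D_VISUAL, N_HEADS, batch_first=True),
          num_layers=N_LAYERS,
      )

      params = (list(proj.parameters())
                + list(visual_encoder.parameters())
                + list(action_decoder.parameters()))
      optimizer = torch.optim.AdamW(params, lr=2e-4)  # LR = 0.0002
      ```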
    2. SPOC uses SIGLIP image and text encoders. We use a 3-layer transformer encoder and decoder and a context window of 100. All models are trained with batch size = 224, AdamW, and LR = 0.0002. Single-task models and multi-task models are trained for 20k and 50k iterations, respectively. Using 16-bit mixed-precision training, SPOC trains at an FPS of ≈3500, compared to an FPS of ≈175 for RL implemented using AllenAct.

      training params:
      • 3-layer transformer encoder, decoder
      • context 100
      • batch_size 224
      • optim AdamW
      • learning rate 0.0002
      • iterations 20k-50k
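
      The quote credits 16-bit mixed-precision training for the ≈3500 FPS throughput. A minimal sketch of such a loop with torch.cuda.amp, assuming a stand-in linear model and random data in place of SPOC's actual policy and dataloader:

      ```python
      import torch
      import torch.nn as nn
      from torch.cuda.amp import autocast, GradScaler

      device = "cuda"
      model = nn.Linear(512, 20).to(device)   # stand-in policy head (hypothetical)
      optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
      scaler = GradScaler()                   # rescales gradients to avoid fp16 underflow

      for step in range(100):                 # paper: 20k (single-task) to 50k (multi-task)
          feats = torch.randn(224, 512, device=device)           # batch size 224
          actions = torch.randint(0, 20, (224,), device=device)  # dummy expert actions
          optimizer.zero_grad(set_to_none=True)
          with autocast(dtype=torch.float16): # 16-bit forward/backward pass
              loss = nn.functional.cross_entropy(model(feats), actions)
          scaler.scale(loss).backward()
          scaler.step(optimizer)
          scaler.update()
      ```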

    3. To analyze models we first present results on a subset of 15 object categories from the full 863 categories, called S. Evaluations on the full category set are named L. Our training data contains an average of 90k episodes per task, and on average each task contains 195 episodes in the evaluation benchmark.

      Training - 90k episodes per task. Eval - 195 episodes per task
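
      The S/L split might look like the following when scoring results. This is illustrative only; the category names, episode records, and success_rate helper are hypothetical, not from the paper.

      ```python
      SUBSET_S = {"Apple", "Bed", "Chair"}    # stand-ins; the paper's S has 15 categories

      def success_rate(episodes, categories=None):
          """Mean success over episodes, optionally filtered to a category subset."""
          kept = [e for e in episodes
                  if categories is None or e["category"] in categories]
          return sum(e["success"] for e in kept) / max(len(kept), 1)

      episodes = [{"category": "Chair", "success": True},
                  {"category": "Toaster", "success": False}]
      print("S:", success_rate(episodes, SUBSET_S))  # 15-category subset
      print("L:", success_rate(episodes))            # full 863-category set
      ```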
