24 Matching Annotations
    1. RealSense 455 fixed cameras, namely the navigation and the manipulation camera, both with a vertical field of view of 59° and capable of 1280×720 RGB-D image capture. The navigation camera is placed looking in the agent's forward direction and points slightly down, with the horizon at a nominal 30°

      sim2real transfer

    2. We discretize the action space into 20 actions: Move Base (±20 cm), Rotate Base (±6°, ±30°), Move Arm (x, z) (±2 cm, ±10 cm), Rotate Grasper (±10°), pickup, dropoff, done with subtask, and terminate.

      Action space for navigation (the full 20-action space is sketched below):
      • move_ahead: +20 cm
      • rotate left/right: ±30°
      • end
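      A minimal sketch of how the full 20-action discretized space could be enumerated in Python (names and grouping are my own, not the authors' implementation):

      ```python
      # Hypothetical enumeration of the 20-action space described in the excerpt.
      from enum import Enum

      class Action(Enum):
          MOVE_BASE_FWD = ("move_base", +0.20)        # meters
          MOVE_BASE_BACK = ("move_base", -0.20)
          ROTATE_BASE_POS_6 = ("rotate_base", +6)     # degrees
          ROTATE_BASE_NEG_6 = ("rotate_base", -6)
          ROTATE_BASE_POS_30 = ("rotate_base", +30)
          ROTATE_BASE_NEG_30 = ("rotate_base", -30)
          MOVE_ARM_X_POS_2 = ("move_arm_x", +0.02)
          MOVE_ARM_X_NEG_2 = ("move_arm_x", -0.02)
          MOVE_ARM_X_POS_10 = ("move_arm_x", +0.10)
          MOVE_ARM_X_NEG_10 = ("move_arm_x", -0.10)
          MOVE_ARM_Z_POS_2 = ("move_arm_z", +0.02)
          MOVE_ARM_Z_NEG_2 = ("move_arm_z", -0.02)
          MOVE_ARM_Z_POS_10 = ("move_arm_z", +0.10)
          MOVE_ARM_Z_NEG_10 = ("move_arm_z", -0.10)
          ROTATE_GRASPER_POS = ("rotate_grasper", +10)
          ROTATE_GRASPER_NEG = ("rotate_grasper", -10)
          PICKUP = ("pickup", None)
          DROPOFF = ("dropoff", None)
          DONE_WITH_SUBTASK = ("done_with_subtask", None)
          TERMINATE = ("terminate", None)

      assert len(Action) == 20
      ```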

    3. Using GPT-3.5, we extract five short descriptions of common usages for each synset, given the synset's definition, and a score indicating the confidence in the correctness of the usage.

      affordances: GPT-3.5 usage descriptions (with confidence scores) per synset
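      A hedged sketch of how such GPT-3.5 queries might look; the prompt wording, helper function, and example synset are assumptions, not the authors' actual pipeline:

      ```python
      # Sketch only: queries GPT-3.5 for usage descriptions of a WordNet-style synset.
      from openai import OpenAI

      client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

      def usage_descriptions(synset_name: str, definition: str) -> str:
          # Hypothetical prompt approximating the recipe in the excerpt.
          prompt = (
              f"Synset: {synset_name}\n"
              f"Definition: {definition}\n"
              "List five short descriptions of common usages for this object. "
              "For each, also give a confidence score between 0 and 1 that the usage is correct."
          )
          response = client.chat.completions.create(
              model="gpt-3.5-turbo",
              messages=[{"role": "user", "content": prompt}],
          )
          return response.choices[0].message.content

      print(usage_descriptions("mug.n.04", "a cup, usually cylindrical, with a handle"))
      ```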

    4. A total of 191,568 houses are sampled from this distribution, with a ratio of 10:1:1 across training, validation, and test.

      ~192k houses, split 10:1:1 train/val/test
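      A quick arithmetic check of what the 10:1:1 split implies; the per-split counts below are my own computation, not stated in the excerpt:

      ```python
      # 10 + 1 + 1 = 12 equal parts over the 191,568 sampled houses.
      total = 191_568
      unit = total // 12                 # 15,964 houses per part (divides exactly)
      train, val, test = 10 * unit, unit, unit
      print(train, val, test)            # 159640 15964 15964
      assert train + val + test == total
      ```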

    5. For the model with detection (SPOC w/ GT Det or SPOC w/ DETIC), we encode the coordinates of the bounding boxes using sinusoidal positional encoding followed by a linear layer with LayerNorm and ReLU. We also add coordinate type embeddings to differentiate the 10 coordinates (5 per camera - x1, y1, x2, y2, and area). The coordinate encodings are then concatenated with the two image features and text features before feeding into Transformer Encoder E_visual. When no object is detected in the image, we use 1000 as the dummy coordinate value and set area to zero.

      Bounding box encoding
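      A minimal PyTorch sketch of the described coordinate encoding (sinusoidal encoding -> Linear -> LayerNorm -> ReLU, plus learned coordinate-type embeddings); the dimensions, module names, and coordinate ordering are assumptions, not the authors' code:

      ```python
      import math
      import torch
      import torch.nn as nn

      def sinusoidal_encoding(values: torch.Tensor, dim: int = 128) -> torch.Tensor:
          """Encode scalar coordinates of shape (B, 10) into (B, 10, dim) sinusoidal features."""
          half = dim // 2
          freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
          angles = values.unsqueeze(-1) * freqs                        # (B, 10, half)
          return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

      class BoxCoordEncoder(nn.Module):
          def __init__(self, enc_dim: int = 128, out_dim: int = 512, num_coords: int = 10):
              super().__init__()
              self.enc_dim = enc_dim
              # Linear -> LayerNorm -> ReLU projection of the sinusoidal features.
              self.proj = nn.Sequential(
                  nn.Linear(enc_dim, out_dim), nn.LayerNorm(out_dim), nn.ReLU()
              )
              # Learned embeddings that differentiate the 10 coordinates (5 per camera).
              self.type_embed = nn.Embedding(num_coords, out_dim)

          def forward(self, coords: torch.Tensor) -> torch.Tensor:
              # coords: (B, 10) = [x1, y1, x2, y2, area] per camera (assumed ordering).
              x = self.proj(sinusoidal_encoding(coords, self.enc_dim))   # (B, 10, out_dim)
              return x + self.type_embed.weight                          # broadcast over batch

      enc = BoxCoordEncoder()
      dummy = torch.full((2, 10), 1000.0)   # dummy coordinate value when nothing is detected
      dummy[:, [4, 9]] = 0.0                # areas set to zero for missing detections (assumed slots)
      print(enc(dummy).shape)               # torch.Size([2, 10, 512])
      ```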

    6. Using 16-bit mixed precision training SPOC trains at approximately 1.2 hours per 1000 iterations

      16-bit mixed precision: 1.2 hours per 1000 iterations -> ~60 hours for 50k iterations

    7. SPOC uses SIGLIP image and text encoders that produce 84 × 768 (n_patch × d_image) features and a 768-dimensional feature per text token. We use a 3-layer transformer encoder and decoder with 512-dimensional hidden state (d_visual) and 8 attention heads for the goal-conditioned visual encoder and action decoder, respectively. We use a context window of 100. All models are trained with a batch size of 224 on 8×A100 GPUs (80 GB memory / GPU) with the AdamW optimizer and a learning rate of 0.0002. Single-task models are trained for 20k iterations, while multi-task models are trained for 50k iterations.
      • encoder - SigLIP (84 patches, 768-dim embeddings)
      • transformer - 3 layers, 512-dim hidden state, 8 heads
      • context - 100
      • batch size - 224
      • 8xA100 gpu (80 gb)
      • optim - AdamW
      • learning rate - 0.0002
      • iterations - 20k (single-task) / 50k (multi-task); see the config sketch below
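      The hyperparameters above, collected into a hypothetical config dataclass (field names are mine; values come from the quoted excerpt):

      ```python
      from dataclasses import dataclass

      @dataclass
      class SPOCTrainingConfig:
          image_text_encoder: str = "SigLIP"   # 84 patches x 768-dim image features, 768-dim text tokens
          num_transformer_layers: int = 3      # goal-conditioned visual encoder and action decoder
          hidden_dim: int = 512
          num_attention_heads: int = 8
          context_window: int = 100
          batch_size: int = 224
          num_gpus: int = 8                    # A100, 80 GB each
          optimizer: str = "AdamW"
          learning_rate: float = 2e-4
          iterations_single_task: int = 20_000
          iterations_multi_task: int = 50_000

      print(SPOCTrainingConfig())
      ```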
    8. Real world results.

      sim ObjNav = 85% (Detic), real ObjNav = 83.3% (Detic)

    9. 100 houses with 1000 episodes each, and 10k houses with 10 episodes each.

      10k houses with 10 episodes > 100 houses with 1k episodes

    10. Effect of context window
    11. Comparing different image encoders

      SigLIP (ViT-B) > DINOv2 (ViT-S) > CLIP (ResNet-50)

    12. Swapping transformer encoder and decoder with alternative architectures.

      Table 4: transformer > GRU

    13. Training on single tasks, IL outperforms RL even with meticulous reward shaping.

      Table 3:
      • SPOC > EmbSigLIP (RL)
      • SPOC single-task = multi-task
      • SPOC with detector > without detector

    14. our IL-trained SPOC dramatically outperforms the popular RL-trained EmbCLIP architecture

      SPOC outperforms EmbCLIP

    15. SPOC with ground truth detection (provided by the simulator) shows 15% absolute average success rate gain across all tasks on CHORES-S and an even larger 24.5% absolute gain on CHORES-L where the detection problem is harder due to the larger object vocabulary.

      Comparing SPOC with and without a (ground-truth) detector

    16. SPOC uses SIGLIP image and text encoders. We use 3-layer transformer encoder and decoder and a context window of 100. All models are trained with batch size=224, AdamW and LR=0.0002. Single-task models and multi-task models are trained for 20k and 50k iterations, respectively. Using 16-bit mixed precision training SPOC trains at an FPS of ≈3500, compared to an FPS of ≈175 for RL implemented using AllenAct

      Training params:
      • 3-layer transformer encoder and decoder
      • context window 100
      • batch size 224
      • optimizer AdamW, learning rate 0.0002
      • iterations: 20k (single-task) / 50k (multi-task)
      • IL training FPS ≈3500 vs ≈175 for RL (AllenAct), roughly 20× faster

    17. To analyze models we first present results on a subset of 15 object categories from the full 863 categories, called S. Evaluations on the full category set are named L. Our training data contains an average of 90k episodes per task and on average each task contains 195 episodes in the evaluation benchmark.

      Training - 90k episodes per task. Eval - 195 episodes per task

    18. Practically, RL in complex visual worlds is sample inefficient, especially when using large action spaces and for long-horizon tasks.

      RL drawback: sample inefficiency with large action spaces and long-horizon tasks

    19. Phone2Proc [19], where a scanned layout of the real-world house is used to generate many simulated variations for agent fine-tuning

      Many simulated environments from one scanned real-world layout

    20. Imitation learning has recently gained popularity in robotics, significantly impacting areas like autonomous driving [14, 36, 43, 48, 52, 55].

      imitation learning in autonomous driving

    21. [1, 16, 39, 53, 61, 77]. Recent simulators model realistic robotic agents but often trade off physical fidelity to increase simulation speed [21, 22, 24, 41, 42, 54, 63, 68, 78].

      Simulators: Matterport3D, AI2-THOR, VirtualHome, Habitat, Gibson

    22. When trained jointly on four tasks – Object-Goal Navigation (OBJNAV), Room Visitation (ROOMVISIT), PickUp Object (PICKUP), and Fetch Object (FETCH) – SPOC achieves an impressive average success rate of 49.9% in unseen simulation environments at test time.

      Success rate in simulation ≈ 50% (49.9% average across the four tasks)

    23. success rate of 85% for OBJNAV.

      ObjNav success rate = 85%

    24. 200,000 procedurally generated houses containing 40,000 unique 3D assets.

      How do the authors generate this data?