85 Matching Annotations
  1. Last 7 days
    1. In practice, we can remove predicted bounding boxes with lower confidence even before performing non-maximum suppression, thereby reducing computation in this algorithm. We may also post-process the output of non-maximum suppression, for example, by only keeping results with higher confidence in the final output.

      pre- and post-processing
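
      A minimal sketch of that pre/post-filtering (assuming torchvision's built-in `nms`; the thresholds `pre_conf` and `post_conf` are illustrative):

      ```python
      import torch
      from torchvision.ops import nms

      def filter_and_nms(boxes, scores, pre_conf=0.1, iou_thresh=0.5, post_conf=0.5):
          """boxes: (N, 4) in (x1, y1, x2, y2); scores: (N,) confidences."""
          keep = scores >= pre_conf              # pre-filter: fewer boxes enter NMS
          boxes, scores = boxes[keep], scores[keep]
          idx = nms(boxes, scores, iou_thresh)   # standard non-maximum suppression
          boxes, scores = boxes[idx], scores[idx]
          keep = scores >= post_conf             # post-filter: keep only confident results
          return boxes[keep], scores[keep]
      ```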

    1. The best way to do this is by first using tesseract to get OCR text in whatever languages you might feel are in there, using langdetect to find what languages are included in the OCR text and then run OCR again with the languages found.

      how about the accuracy?
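
      A sketch of the two-pass approach (assuming `pytesseract` and `langdetect`; the language-code mapping below is illustrative and incomplete). The first pass's OCR quality bounds how reliable the language detection, and hence the second pass, can be:

      ```python
      import pytesseract
      from langdetect import detect_langs
      from PIL import Image

      # Illustrative mapping from langdetect's ISO 639-1 codes to Tesseract codes
      ISO_TO_TESS = {"en": "eng", "de": "deu", "fr": "fra", "zh-cn": "chi_sim"}

      def two_pass_ocr(path, candidate_langs="eng+deu+fra"):
          image = Image.open(path)
          # Pass 1: OCR with a broad guess of possible languages
          rough_text = pytesseract.image_to_string(image, lang=candidate_langs)
          # Detect which languages actually appear in the rough OCR output
          detected = [d.lang for d in detect_langs(rough_text)]
          tess_langs = "+".join(ISO_TO_TESS.get(code, "eng") for code in detected)
          # Pass 2: OCR again restricted to the detected languages
          return pytesseract.image_to_string(image, lang=tess_langs)
      ```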

  2. Aug 2025
    1. As is observed in the above results, after an nn.Sequential instance is scripted using the torch.jit.script function, computing performance is improved through the use of symbolic programming.

      but it takes a longer time?
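
      A minimal timing sketch (layer sizes and iteration count are arbitrary); note that `torch.jit.script` itself adds a one-time compilation cost before any speedup shows up:

      ```python
      import time
      import torch
      from torch import nn

      net = nn.Sequential(nn.Linear(512, 256), nn.ReLU(),
                          nn.Linear(256, 128), nn.ReLU(),
                          nn.Linear(128, 2))
      x = torch.randn(1024, 512)

      def bench(model, n=100):
          start = time.time()
          for _ in range(n):
              model(x)
          return time.time() - start

      eager = bench(net)
      scripted_net = torch.jit.script(net)   # compile to TorchScript (one-time cost)
      scripted = bench(scripted_net)
      print(f"eager: {eager:.3f}s, scripted: {scripted:.3f}s")
      ```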

    1. The photorealistic text-to-image examples in Fig. 11.9.5 suggest that the T5 encoder alone may effectively represent text even without fine-tuning.

      Shouldn't there be another network between the T5 encoder and the output?
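
      In Imagen-style pipelines the frozen T5 encoder only produces text embeddings; a separate, trainable image-generation network (e.g. a diffusion U-Net conditioning on the embeddings via cross-attention) sits between them and the output image. A sketch of just the frozen-encoder half, assuming Hugging Face `transformers` and the `t5-small` checkpoint for illustration:

      ```python
      import torch
      from transformers import T5Tokenizer, T5EncoderModel

      tokenizer = T5Tokenizer.from_pretrained("t5-small")
      encoder = T5EncoderModel.from_pretrained("t5-small").eval()  # frozen, no fine-tuning

      prompt = "a corgi riding a skateboard"
      tokens = tokenizer(prompt, return_tensors="pt")
      with torch.no_grad():
          text_emb = encoder(**tokens).last_hidden_state   # (1, seq_len, d_model)

      # text_emb would then be consumed by the separate, trainable image generator;
      # that downstream network is what sits between the T5 encoder and the output.
      ```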

    1. Note that $h$ heads can be computed in parallel if we set the number of outputs of linear transformations for the query, key, and value to $p_q h = p_k h = p_v h = p_o$.

      If they are not equal, can the heads not be computed in parallel?
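
      With equal per-head widths, one linear layer can produce all $h$ heads at once and a reshape splits them, so the heads run as a single batched matrix multiply. With unequal widths you would fall back to separate per-head projections, which is still parallelizable in principle, just not with this one-tensor trick. A small shape sketch (dimensions are arbitrary):

      ```python
      import torch
      from torch import nn

      batch, seq, d_model, num_heads = 2, 10, 64, 8
      p = d_model // num_heads           # per-head width, so p * num_heads == d_model

      W_q = nn.Linear(d_model, d_model)  # one projection produces all heads at once
      X = torch.randn(batch, seq, d_model)

      q = W_q(X)                               # (batch, seq, num_heads * p)
      q = q.reshape(batch, seq, num_heads, p)  # split into heads
      q = q.permute(0, 2, 1, 3)                # (batch, heads, seq, p): heads in parallel
      ```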

    1. In the case of a (scalar) regression with observations $(x_i, y_i)$ for features and labels respectively, $v_i = y_i$ are scalars, $k_i = x_i$ are vectors, and the query $q$ denotes the new location where $f$ should be evaluated.

      Are $x_i$ and $q$ equal?
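
      Not in general: the keys are the training inputs $x_i$, while the query $q$ is the new input at which the prediction is made, so $q$ need not coincide with any $x_i$. A tiny numeric sketch of this Nadaraya-Watson-style attention pooling with a Gaussian kernel (the numbers are made up):

      ```python
      import torch

      x_train = torch.tensor([1.0, 2.0, 3.0, 4.0])   # keys   k_i = x_i
      y_train = torch.tensor([1.2, 1.9, 3.1, 3.9])   # values v_i = y_i
      q = torch.tensor(2.5)                           # new location, not one of the x_i

      # Gaussian-kernel attention weights over the keys
      weights = torch.softmax(-(q - x_train) ** 2 / 2, dim=0)
      f_q = (weights * y_train).sum()                 # attention-pooled prediction f(q)
      print(weights, f_q)
      ```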

    1. Using word-level tokenization, the vocabulary size will be significantly larger than that using character-level tokenization, but the sequence lengths will be much shorter.

      the sequence lengths?
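
      A toy comparison (one sentence, so the vocabulary-size gap is understated; on a real corpus the word vocabulary runs into the tens of thousands versus on the order of a hundred characters, which is the claim being made):

      ```python
      corpus = "the quick brown fox jumps over the lazy dog"

      word_tokens = corpus.split()   # word-level tokenization
      char_tokens = list(corpus)     # character-level tokenization

      # (unique tokens, sequence length)
      print(len(set(word_tokens)), len(word_tokens))   # 8, 9
      print(len(set(char_tokens)), len(char_tokens))   # 27, 43
      ```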

    1. While we can use the chain rule to compute $\partial h_t/\partial w_h$ recursively, this chain can get very long whenever $t$ is large. Let’s discuss a number of strategies for dealing with this problem.

      I don't understand why this substitution is valid.
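
      The substitution is the total-derivative chain rule: since $h_t = f(x_t, h_{t-1}, w_h)$ and $h_{t-1}$ itself depends on $w_h$, the parameter enters both directly and through the previous state, giving

      $$\frac{\partial h_t}{\partial w_h}
        = \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial w_h}
        + \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial h_{t-1}}\,
          \frac{\partial h_{t-1}}{\partial w_h},$$

      and applying the same rule to $\partial h_{t-1}/\partial w_h$, then $\partial h_{t-2}/\partial w_h$, and so on, is what makes the chain grow with $t$.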

    2. Input from the first step passes through over 1000 matrix products before arriving at the output, and another 1000 matrix products are required to compute the gradient.

      forward and backward passes
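
      One of the standard strategies is truncated backpropagation through time: detach the hidden state every $k$ steps so that neither the forward graph nor the backward pass ever spans more than $k$ matrix products. A sketch (the model, sequence length, and $k$ are arbitrary):

      ```python
      import torch
      from torch import nn

      rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
      readout = nn.Linear(16, 1)
      opt = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.01)

      X = torch.randn(4, 1000, 8)     # a long sequence of 1000 steps
      Y = torch.randn(4, 1000, 1)
      k = 50                          # truncation length
      state = None

      for t in range(0, X.shape[1], k):
          if state is not None:
              state = state.detach()  # cut the graph: gradients span at most k steps
          out, state = rnn(X[:, t:t+k], state)
          loss = ((readout(out) - Y[:, t:t+k]) ** 2).mean()
          opt.zero_grad()
          loss.backward()
          opt.step()
      ```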

    1. Having a small value for this upper bound might be viewed as good or bad. On the downside, we are limiting the speed at which we can reduce the value of the objective. On the bright side, this limits by just how much we can go wrong in any one gradient step.
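
      In practice such a bound is usually enforced with gradient clipping, which rescales $\mathbf{g}$ to $\min(1, \theta/\|\mathbf{g}\|)\,\mathbf{g}$. A minimal sketch using PyTorch's built-in utility (the model and data are placeholders):

      ```python
      import torch
      from torch import nn

      net = nn.Linear(10, 1)
      opt = torch.optim.SGD(net.parameters(), lr=0.1)
      loss = ((net(torch.randn(32, 10)) - torch.randn(32, 1)) ** 2).mean()

      opt.zero_grad()
      loss.backward()
      # Rescale the gradient so its norm is at most max_norm, bounding any single step
      torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=1.0)
      opt.step()
      ```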
  3. Jul 2025
    1. thus batch normalization layers function differently in training mode (normalizing by minibatch statistics) than in prediction mode (normalizing by dataset statistics).

      Why is it infeasible to compute the mean and standard deviation over the entire dataset, yet feasible for a single layer?
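
      During training the activations keep changing as the parameters update, so exact whole-dataset statistics would require a full pass over the data after every step; the current minibatch's statistics are available for free, and the running averages they feed are a good-enough estimate of dataset statistics for prediction. A sketch of the two modes (numbers are arbitrary):

      ```python
      import torch
      from torch import nn

      bn = nn.BatchNorm1d(3)
      x = torch.randn(8, 3) * 5 + 2

      bn.train()
      y_train = bn(x)   # normalized with this minibatch's mean/var;
                        # running_mean/running_var are updated as dataset estimates
      bn.eval()
      y_eval = bn(x)    # normalized with the accumulated running statistics
      print(bn.running_mean, bn.running_var)
      ```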

    1. Because these networks are invariant to the order of the features, we could get similar results regardless of whether we preserve an order corresponding to the spatial structure of the pixels or if we permute the columns of our design matrix before fitting the MLP’s parameters.

      example?
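
      A concrete sketch: permute the feature columns and permute the first layer's weight columns the same way, and the MLP computes exactly the same function, which is the sense in which it ignores spatial structure (sizes and seed are arbitrary):

      ```python
      import torch
      from torch import nn

      torch.manual_seed(0)
      net = nn.Sequential(nn.Linear(6, 16), nn.ReLU(), nn.Linear(16, 1))
      X = torch.randn(4, 6)

      perm = torch.randperm(6)
      X_perm = X[:, perm]                        # permute the feature columns

      net_perm = nn.Sequential(nn.Linear(6, 16), nn.ReLU(), nn.Linear(16, 1))
      net_perm.load_state_dict(net.state_dict())
      with torch.no_grad():
          # permute the first layer's weight columns in the same way
          net_perm[0].weight.copy_(net[0].weight[:, perm])

      print(torch.allclose(net(X), net_perm(X_perm)))   # True: same function on reordered features
      ```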

    1. Note that in this case, only the first layer requires lazy initialization, but the framework initializes sequentially. Once all parameter shapes are known, the framework can finally initialize the parameters.

      ?
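
      A sketch of what this means in PyTorch with `nn.LazyLinear` (sizes are arbitrary): every lazy layer's parameters stay uninitialized until the first forward pass supplies the input shape, after which they are all materialized in order:

      ```python
      import torch
      from torch import nn

      net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
      print(net[0].weight)         # <UninitializedParameter>: shape not known yet

      X = torch.randn(2, 20)
      net(X)                       # first forward pass fixes all shapes sequentially
      print(net[0].weight.shape)   # torch.Size([256, 20])
      print(net[2].weight.shape)   # torch.Size([10, 256])
      ```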

    1. Finally a module must possess a backpropagation method, for purposes of calculating gradients. Fortunately, due to some behind-the-scenes magic supplied by the auto differentiation (introduced in Section 2.5) when defining our own module, we only need to worry about parameters and the forward propagation method.

      We only need to define the forward propagation; backward propagation for calculating gradients is handled by auto differentiation.
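
      A sketch of a custom block that defines only its parameters and `forward`; calling `backward()` shows that gradient computation comes from autograd with no extra code:

      ```python
      import torch
      from torch import nn

      class CenteredLinear(nn.Module):
          """A custom block: we define parameters and forward propagation only."""
          def __init__(self, in_dim, out_dim):
              super().__init__()
              self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.01)
              self.bias = nn.Parameter(torch.zeros(out_dim))

          def forward(self, x):
              y = x @ self.weight.T + self.bias
              return y - y.mean()            # arbitrary extra computation

      block = CenteredLinear(4, 2)
      out = block(torch.randn(3, 4)).sum()
      out.backward()                          # backpropagation supplied by autograd
      print(block.weight.grad.shape)          # torch.Size([2, 4])
      ```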

  4. Jun 2025