valid_ds
?
it likely leads to larger values in the Gram matrix
Negatively correlated?
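For reference, a minimal sketch of a Gram matrix in the spirit of the style-transfer chapter; the division by the number of elements is what keeps the entries from growing simply because the feature map is larger (the input shape is an assumption):
    import torch

    def gram(x):
        # x: feature map of shape (num_channels, height, width)
        num_channels, n = x.shape[0], x.numel() // x.shape[0]
        x = x.reshape(num_channels, n)
        # without dividing by num_channels * n, a larger feature map yields larger entries
        return torch.matmul(x, x.T) / (num_channels * n)

    print(gram(torch.randn(3, 64, 64)).shape)  # torch.Size([3, 3])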
the image printing function requires that each pixel has a floating point value from 0 to 1
why?
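This matches matplotlib's convention: imshow treats floating-point image data as lying in [0, 1], so values are usually clamped into that range before plotting. A minimal sketch with a made-up tensor:
    import torch
    import matplotlib.pyplot as plt

    img = 0.5 + 0.5 * torch.randn(64, 64, 3)   # hypothetical float image; may fall outside [0, 1]
    plt.imshow(img.clamp(0, 1))                # clamp so every pixel is a float in [0, 1]
    plt.show()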
The best way to do this is to first use tesseract to get OCR text in whatever languages you think might be present, use langdetect to find which languages appear in that OCR text, and then run OCR again with the languages found.
how about the accuracy?
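A rough sketch of that two-pass pipeline with pytesseract and langdetect; the file name, the broad first-pass language guess, and the code mapping are assumptions:
    import pytesseract
    from PIL import Image
    from langdetect import detect_langs

    img = Image.open('scan.png')                                   # hypothetical input image
    rough = pytesseract.image_to_string(img, lang='eng+deu+fra')   # first pass with a broad guess
    detected = detect_langs(rough)                                 # e.g. [de:0.72, en:0.27]
    # langdetect returns ISO 639-1 codes; tesseract expects its own codes
    to_tess = {'en': 'eng', 'de': 'deu', 'fr': 'fra', 'es': 'spa'}
    langs = '+'.join(to_tess[d.lang] for d in detected if d.lang in to_tess)
    text = pytesseract.image_to_string(img, lang=langs or 'eng')   # second pass with detected languages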
For comparison, we define an identical model, but initialize all of its model parameters to random values.
Does keeping all parameters at their initial values amount to random assignment?
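If this is the fine-tuning comparison, "initialize to random values" usually just means constructing the same architecture without loading pretrained weights, so the parameters keep their fresh (random) default initialization. A hedged sketch; the string weights argument assumes a recent torchvision:
    import torchvision

    # the same ResNet-18 architecture twice: one copy loads pretrained weights,
    # the other keeps its freshly constructed (random) initialization
    pretrained_net = torchvision.models.resnet18(weights='IMAGENET1K_V1')
    scratch_net = torchvision.models.resnet18(weights=None)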
As is observed in the above results, after an nn.Sequential instance is scripted using the torch.jit.script function, computing performance is improved through the use of symbolic programming.
But it takes a longer time?
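A small timing sketch; the layer sizes, batch size, and loop count are arbitrary. The very first call to the scripted module pays the compilation/optimization cost, which may account for extra time:
    import time
    import torch
    from torch import nn

    net = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 2))
    x = torch.randn(64, 512)

    scripted = torch.jit.script(net)   # compile the Sequential into a TorchScript graph
    scripted(x)                        # warm-up: the first call includes compilation overhead

    for name, model in [('eager', net), ('scripted', scripted)]:
        t0 = time.time()
        for _ in range(1000):
            model(x)
        print(name, time.time() - t0)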
In the context of computer vision this schedule can lead to improved results.
Image augmentation?
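Assuming "this schedule" refers to a learning-rate schedule such as cosine decay (rather than an image-augmentation schedule), a minimal sketch of the decay rule; base_lr, final_lr, and max_epochs are made-up values:
    import math

    def cosine_lr(epoch, max_epochs=20, base_lr=0.3, final_lr=0.01):
        # smooth decay from base_lr at epoch 0 down to final_lr at max_epochs
        return final_lr + (base_lr - final_lr) * (1 + math.cos(math.pi * epoch / max_epochs)) / 2

    print([round(cosine_lr(e), 3) for e in (0, 5, 10, 15, 20)])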
The photorealistic text-to-image examples in Fig. 11.9.5 suggest that the T5 encoder alone may effectively represent text even without fine-tuning.
Shouldn't there still be a network between T5 and the output?
Since we use the fixed positional encoding whose values are always between −1 and 1,
?
position
Does "position" correspond to the time step?
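For reference, a sketch of the fixed sinusoidal positional encoding (assuming an even num_hiddens): every entry is a sine or a cosine, hence always within [−1, 1], and row i encodes position i, i.e. the i-th time step in the sequence:
    import torch

    def sinusoidal_pe(num_steps, num_hiddens):
        # P[i, 2j] = sin(i / 10000^(2j / num_hiddens)), P[i, 2j+1] = cos(i / 10000^(2j / num_hiddens))
        P = torch.zeros(num_steps, num_hiddens)
        pos = torch.arange(num_steps, dtype=torch.float32).reshape(-1, 1)
        div = torch.pow(10000, torch.arange(0, num_hiddens, 2, dtype=torch.float32) / num_hiddens)
        P[:, 0::2] = torch.sin(pos / div)
        P[:, 1::2] = torch.cos(pos / div)
        return P

    P = sinusoidal_pe(60, 32)
    print(P.min().item(), P.max().item())   # both within [-1, 1]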
To compute multiple heads of multi-head attention in parallel, proper tensor manipulation is needed.
Do the different heads have to have the same length (num_hiddens / num_heads)?
Note that h heads can be computed in parallel if we set the number of outputs of linear transformations for the query, key, and value to p_q h = p_k h = p_v h = p_o.
If they are not equal, can the heads not be computed in parallel?
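The tensor manipulation in question, in the spirit of the chapter's transpose_qkv helper: giving every head the same size num_hiddens / num_heads lets all heads be folded into the batch dimension and handled by one batched matrix multiplication, which is what makes the parallel computation convenient. A sketch with assumed shapes:
    import torch

    def split_heads(X, num_heads):
        # (batch, seq_len, num_hiddens) -> (batch * num_heads, seq_len, num_hiddens // num_heads)
        batch, seq_len, num_hiddens = X.shape
        X = X.reshape(batch, seq_len, num_heads, num_hiddens // num_heads)
        X = X.permute(0, 2, 1, 3)              # move the head axis next to the batch axis
        return X.reshape(batch * num_heads, seq_len, -1)

    Q = torch.randn(2, 10, 64)                 # batch=2, seq_len=10, num_hiddens=64
    print(split_heads(Q, num_heads=8).shape)   # torch.Size([16, 10, 8])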
In the case of a (scalar) regression with observations (x_i, y_i) for features and labels respectively, v_i = y_i are scalars, k_i = x_i are vectors, and the query q denotes the new location where f should be evaluated.
Are x_i and q equal?
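A small numeric sketch of that setup with made-up data: the keys are the training inputs x_i, the values are the labels y_i, and the queries q are new locations at which f is evaluated, so q is generally not one of the x_i:
    import torch

    x_train = torch.sort(5 * torch.rand(50)).values           # keys   k_i = x_i
    y_train = torch.sin(x_train) + 0.1 * torch.randn(50)       # values v_i = y_i
    x_query = torch.linspace(0, 5, 100)                        # queries q: new evaluation points

    # Gaussian-kernel attention weights: softmax over -(q - k_i)^2 / 2
    diff = x_query.reshape(-1, 1) - x_train.reshape(1, -1)
    attn = torch.softmax(-0.5 * diff ** 2, dim=1)              # (100, 50); each row sums to 1
    y_hat = attn @ y_train                                     # weighted average of the values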
the conditional probability of each token at time step 3 has also changed in Fig. 10.8.2
Why does it change like that?
Using word-level tokenization, the vocabulary size will be significantly larger than that using character-level tokenization, but the sequence lengths will be much shorter.
the sequence lengths?
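A quick illustration of the trade-off on a made-up sentence: word tokens form a shorter sequence drawn from a much larger vocabulary, character tokens a longer sequence over a tiny vocabulary:
    text = 'the time machine by h g wells'   # hypothetical snippet
    word_tokens = text.split()                # 7 tokens
    char_tokens = list(text)                  # 29 tokens (spaces included)
    print(len(word_tokens), len(char_tokens))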
we can easily get a deep-gated RNN by replacing the hidden state computation in (10.3.1) with that from an LSTM or a GRU.
Is the direction wrong here?
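In code the replacement amounts to swapping the recurrent cell while keeping the stacking, e.g. with PyTorch's built-in layers (the sizes are assumptions):
    import torch
    from torch import nn

    # a deep gated RNN: the plain recurrence is replaced by a GRU cell,
    # and depth comes from stacking layers via num_layers
    deep_gru = nn.GRU(input_size=28, hidden_size=64, num_layers=2)
    X = torch.randn(35, 8, 28)           # (num_steps, batch_size, input_size)
    output, H = deep_gru(X)
    print(output.shape, H.shape)         # (35, 8, 64) and (2, 8, 64)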
Reset gates help capture short-term dependencies in sequences. Update gates help capture long-term dependencies in sequences.
why?
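For reference, the GRU updates (σ is the sigmoid, ⊙ the elementwise product): when the update gate Z_t is close to 1, H_t ≈ H_{t−1}, so old information is carried over (long-term dependencies); when the reset gate R_t is close to 0, the candidate state ignores H_{t−1}, resetting the recent context (short-term dependencies):
$$
\begin{aligned}
\mathbf{R}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xr} + \mathbf{H}_{t-1} \mathbf{W}_{hr} + \mathbf{b}_r),\\
\mathbf{Z}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xz} + \mathbf{H}_{t-1} \mathbf{W}_{hz} + \mathbf{b}_z),\\
\tilde{\mathbf{H}}_t &= \tanh\bigl(\mathbf{X}_t \mathbf{W}_{xh} + (\mathbf{R}_t \odot \mathbf{H}_{t-1}) \mathbf{W}_{hh} + \mathbf{b}_h\bigr),\\
\mathbf{H}_t &= \mathbf{Z}_t \odot \mathbf{H}_{t-1} + (1 - \mathbf{Z}_t) \odot \tilde{\mathbf{H}}_t.
\end{aligned}
$$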
Note that only the hidden state is passed to the output layer.
Isn't the output of the previous time step the input of the current time step?
For instance, if the first token is of great importance we will learn not to update the hidden state after the first observation.
Important -> do not update?
neuron
Is a neuron a cell?
detaching the gradient
?
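"Detaching the gradient" here means cutting the computation graph at the hidden state carried over from the previous minibatch, so backpropagation stops at the minibatch boundary (truncated BPTT). A self-contained sketch; the model, sizes, and random data are all made up:
    import torch
    from torch import nn

    rnn = nn.RNN(input_size=8, hidden_size=16)
    head = nn.Linear(16, 8)
    opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.1)
    loss = nn.MSELoss()

    state = None
    for _ in range(5):                       # stand-in for iterating over consecutive minibatches
        X = torch.randn(10, 4, 8)            # (num_steps, batch_size, input_size)
        Y = torch.randn(10, 4, 8)
        if state is not None:
            state = state.detach()           # cut the graph: gradients do not flow into earlier minibatches
        out, state = rnn(X, state)
        l = loss(head(out), Y)
        opt.zero_grad()
        l.backward()
        opt.step()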
Using the chain rule yields
?
Whenever ξ_t = 0 the recurrent computation terminates at that time step t.
?
While we can use the chain rule to compute ∂h_t/∂w_h recursively, this chain can get very long whenever t is large. Let’s discuss a number of strategies for dealing with this problem.
I don't understand why this substitution can be made.
where computation of h_{t−1} also depends on w_h
?
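For reference, the recurrence these lines build up to, with h_t = f(x_t, h_{t−1}, w_h): the total derivative has a direct term plus a term that chains through h_{t−1}, which itself depends on w_h, and unrolling that second term is what makes the chain grow with t:
$$
\frac{\partial h_t}{\partial w_h}
= \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial w_h}
+ \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial h_{t-1}}\,
  \frac{\partial h_{t-1}}{\partial w_h}.
$$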
Having a small value for this upper bound might be viewed as good or bad. On the downside, we are limiting the speed at which we can reduce the value of the objective. On the bright side, this limits by just how much we can go wrong in any one gradient step.
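Assuming this is the gradient-clipping bound g ← min(1, θ/‖g‖) g, a sketch of the clipping step that produces it; with it, one update can move the parameters by at most the learning rate times θ (net is assumed to be an nn.Module whose gradients were filled in by backward()):
    import torch

    def clip_gradients(net, theta):
        # rescale all gradients jointly so that their overall L2 norm is at most theta
        params = [p for p in net.parameters() if p.grad is not None]
        norm = torch.sqrt(sum(torch.sum(p.grad ** 2) for p in params))
        if norm > theta:
            for p in params:
                p.grad[:] *= theta / norm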
exponentially
why?
There will be many plausible three-word combinations that we likely will not see in our dataset.
?
formulae
What is the relationship between the independence assumptions and the unigram, bigram, and trigram models?
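Written out for a sequence of four tokens, the unigram, bigram, and trigram models correspond to the following independence (Markov) assumptions, conditioning on zero, one, and two preceding tokens respectively:
$$
\begin{aligned}
P(x_1, x_2, x_3, x_4) &\approx P(x_1)\,P(x_2)\,P(x_3)\,P(x_4),\\
P(x_1, x_2, x_3, x_4) &\approx P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_2)\,P(x_4 \mid x_3),\\
P(x_1, x_2, x_3, x_4) &\approx P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1, x_2)\,P(x_4 \mid x_2, x_3).
\end{aligned}
$$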
After all, we will significantly overestimate the frequency of the tail, also known as the infrequent words.
Why will it overestimate?
frequency
Why is this described in terms of frequency?
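The frequency wording comes from how these probabilities are estimated: relative n-gram counts in a corpus stand in for the probabilities, and most higher-order n-grams sit in a long tail of very small counts. A small counting sketch; the corpus file name is made up:
    import collections
    import re

    with open('corpus.txt') as f:                          # hypothetical corpus file
        words = re.findall(r'[a-z]+', f.read().lower())

    unigrams = collections.Counter(words)
    trigrams = collections.Counter(zip(words[:-2], words[1:-1], words[2:]))
    print(unigrams.most_common(3))
    print(sum(1 for c in trigrams.values() if c == 1))     # trigrams seen exactly once: the long tail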
Even today’s massive RNN- and Transformer-based language models seldom incorporate more than thousands of words of context.
How much text do large models take as input at a time?
probabilistic classifier
Classifying from a set of different probability distributions?
compare
Not prediction, but comparison?