31 Matching Annotations
  1. Dec 2016
    1. The product, like the CCM itself, is mass-deficient

      I.e., it assigns some probability mass to impossible outcomes.

    2. Initialization is important to the success of any local search procedure. We chose to initialize EM not with an initial model, but with an initial guess at posterior distributions over dependency structures (completions). For the first round, we constructed a somewhat ad-hoc “harmonic” completion

      This is very important for making the whole thing work, and there has actually been some follow-up research that attempts to get rid of the dependence on this specific initialization.
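
      For reference, a minimal sketch of what such a distance-based “harmonic” completion might look like (the exact smoothing constants are an assumption here, not the paper's):

      ```python
      # Rough sketch of a distance-based ("harmonic") completion for one sentence:
      # closer heads get more initial weight, and the resulting soft counts stand in
      # for the E-step of the first EM iteration.

      def harmonic_completion(n_words, smoothing=1.0):
          """Return attach[child][head]: an initial guess at the posterior over heads."""
          attach = []
          for child in range(n_words):
              weights = [0.0 if head == child else 1.0 / (abs(head - child) + smoothing)
                         for head in range(n_words)]
              total = sum(weights)
              attach.append([w / total for w in weights])
          return attach

      print(harmonic_completion(4)[0])  # child 0 prefers nearby heads
      ```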

    3. The basic inside-outside algorithm (Baker, 1979) can be used for re-estimation.

      By using the mapping from dependency trees to constituency trees.

    4. but lower than simply linking all adjacent words into a left-headed (and right-branching) structure (53.2%)

      This is a strong baseline for English; it would be interesting to see how it compares on other languages.
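
      As a concrete reference point, a minimal sketch of this baseline and of directed attachment accuracy (the gold heads below are made up for illustration):

      ```python
      # Minimal sketch of the adjacent-word baseline: every word takes the word
      # immediately to its left as its head (a left-headed chain); the first word
      # is the root (head index -1).

      def adjacent_baseline_heads(n_words):
          return [-1] + list(range(n_words - 1))

      def directed_accuracy(predicted, gold):
          return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

      gold_heads = [1, -1, 1, 2]   # hypothetical gold tree for a 4-word sentence
      print(directed_accuracy(adjacent_baseline_heads(4), gold_heads))
      ```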

    5. Most recent progress in unsupervised parsing has come from tree or phrase-structure grammar based models

      True at the time, but not necessarily true anymore.

    6. hence undermines arguments based on “the poverty of the stimulus”

      Although the learning algorithm already encodes certain biases about what linguistic structure should look like, i.e. in the form of the grammar they consider. Also, Chomsky had some specific complex examples in mind and did not claim that it was impossible to learn anything.

  2. Nov 2016
    1. This result suggests that some rules are harder to learn than others regardless of their frequency, so their presence in the specified ruleset yields stronger performance gains.

      It could also be the case that the trees inferred without these restrictions just differ from the gold trees in a more complex way.

    2. Variational approximations to the HDP are truncated at 1

      So one could also just assume a parametric version with 10 refinement labels, which would make for a simpler presentation.

    3. Collins head finding rules

      Probably necessary for the Collins rules to work properly in inference.

    4. all mass is concentrated on some single β∗

      At this point you could just do approximate MAP inference and the presentation would become less complicated.
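
      In generic notation (not the paper's), with x the observed sentences, t the latent trees, and β the parameters, restricting q(β) to a point mass gives exactly such a MAP-EM style objective:

      ```latex
      % Point-mass variational posterior q(\beta) = \delta_{\beta^*}; dropping the
      % degenerate entropy of q(\beta), the variational bound collapses to
      \mathbb{E}_{q(t)}\big[\log p(x, t \mid \beta^*)\big] + \log p(\beta^*) + H\big[q(t)\big],
      % which is optimized by alternating E-steps over q(t) and MAP M-steps over \beta^*.
      ```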

    5. explicitly generate words x rather than only part-of-speech tags s

      But since both the part-of-speech tags and the words are observed, and the dependency tree structure in their reduced model is independent of the words given the tags, this makes no difference compared to the original models.
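
      In symbols (generic notation, not the paper's), with t the tree, s the tags, and x the words: if the joint factorizes as below, the word-emission factor cancels from the posterior over trees:

      ```latex
      p(t, s, x) = p(t, s)\, p(x \mid s)
      \;\;\Longrightarrow\;\;
      p(t \mid s, x) = \frac{p(t, s)\, p(x \mid s)}{\sum_{t'} p(t', s)\, p(x \mid s)} = p(t \mid s).
      ```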

    6. Root→Verb

      In my experience, the rules that ensure the root has a verb or an auxiliary as a child are the most important ones. Without such a constraint, most learning algorithms have a real tendency to make the first noun in the sentence the head.

    7. Furthermore, these universal rules are compact and well-understood, making them easy to manually construct.

      But still dependent on knowing gold part-of-speech tags.

    1. our models may be improved by the application of (unsupervised) discriminative learning techniques with features (Berg-Kirkpatrick et al., 2010); or by incorporating topic models and document information (Griffiths et al., 2005; Moon et al., 2010)

      It might also be interesting to explore the use of other information sources, such as the HTML annotations used by Spitkovsky and colleagues.

    2. After chunking is performed, all multiword constituents are collapsed and represented by a single pseudoword

      A slightly "fancier" approach would be to cluster the low level constituents.

    3. The fact that the HMM and PRLG have higher recall on NP identification on Negra than precision is further evidence towards this.

      One takeaway from this is that the model seems to identify relatively short sequences as chunks.

    4. possible tag transitions as a state diagram

      This diagram is key to their approach. B will, with high likelihood, generate words that are likely to be followed by non-stop words; I does the same for words that are likely to follow non-stop words.
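
      As a rough illustration (the exact state inventory and edges are in the paper's figure; this mask only approximates the usual B/I/O-with-STOP constraints), such a diagram can be enforced during decoding by zeroing out disallowed transitions:

      ```python
      # Approximate transition mask for a B/I/O chunking HMM with a STOP symbol for
      # punctuation/boundaries.  Disallowed transitions get zero probability during
      # decoding; e.g. I may only continue a chunk, and STOP never occurs inside one.

      ALLOWED = {
          "STOP": {"B", "O", "STOP"},
          "B":    {"I"},                     # assumes chunks span at least two words
          "I":    {"I", "B", "O", "STOP"},
          "O":    {"B", "O", "STOP"},
      }

      def transition_ok(prev_state, next_state):
          return next_state in ALLOWED[prev_state]

      print(transition_ok("O", "I"))   # False: I cannot start a chunk
      ```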

    5. they impose a hard constraint on constituent spans, in that no constituent (other than sentence root)

      This heuristic does not hold universally:

      Ich denke, [ dass Marie denkt, [ dass John nicht nachgedacht hat ] ]
      ('I think [ that Marie thinks [ that John has not thought it over ] ]')

      but it is useful nonetheless, much like assuming one tag per word type is helpful in unsupervised part-of-speech tagging.

    6. As in previous work, punctuation is not used for evaluation

      I.e. punctuation is used as an information source during training, but then discarded before evaluation.

    7. effectiveness at identifying low-level constituents

      This seems to be true to some extent for all existing techniques for unsupervised parsing. There does not seem to be a really good technique for discovering more "high-level" constituents.

    8. low-level constituents based solely on word co-occurrence frequencies

      This connects directly to Hänig et al.'s paper.

    1. Apparently the variable c already suffices to produce very competitive results, assuming POS tags were used

      Suggesting that significant co-occurrence is already a good indicator of constituency.

    2. our algorithm significantly outperforms the random branching baseline

      Note that the right-branching baseline is much stronger.
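
      For reference, a minimal sketch of what the right-branching baseline produces:

      ```python
      # Minimal sketch of the right-branching baseline: every word starts a
      # constituent that extends to the end of the sentence, i.e. (w1 (w2 (w3 w4))).

      def right_branching_spans(n_words):
          """Constituent spans (start, end), end exclusive, excluding single words."""
          return [(i, n_words) for i in range(n_words - 1)]

      print(right_branching_spans(4))   # [(0, 4), (1, 4), (2, 4)]
      ```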

    1. The second baseline is a more recent vector-based syntactic class induction method, the SVD approach of (Lamar et al., 2010), which extends Schütze (1995)'s original method and, like ours, enforces a one-class-per-tag restriction

      Note that all the comparison systems are type-based.

    2. Note that this model with multiple context features is deficient: it can generate data that are inconsistent with any actual corpus, because there is no mechanism to constrain the left context word of token e_i to be the same as the right context word of token e_{i−1} (and similarly with alignment features)

      Here the independent generation of words becomes a problem again.

    3. type level

      Like capitalization.

    4. choose a class assignment z_j from the distribution θ. For each class i = 1...Z, choose an output distribution over features φ_i. Finally, for each token k = 1...n_j of word type j, generate a feature f_jk from φ_{z_j}, the distribution associated with the class that word type j is assigned to.

      Note that features and words in the generative story are generated independently of each other, meaning there is no requirement in the generative model for feature counts to be "consistent".
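
      A toy sketch of this generative story (theta and phi follow the quote's notation; the concrete classes, features, and numbers below are made-up illustrations):

      ```python
      # Toy sketch of the type-level generative story: each word TYPE draws one class
      # z_j from theta, and every feature observation for that type's tokens is then
      # drawn independently from phi[z_j] -- nothing ties the features together.

      import random

      theta = [0.6, 0.4]                               # distribution over Z = 2 classes
      phi = [{"suffix=-ed": 0.7, "capitalized": 0.3},  # class 0 feature distribution
             {"suffix=-ed": 0.1, "capitalized": 0.9}]  # class 1 feature distribution
      token_counts = {"walked": 3, "Paris": 2}         # n_j tokens per word type j

      def sample(dist):
          """Sample from a list (index-valued) or dict (key-valued) distribution."""
          items = list(enumerate(dist)) if isinstance(dist, list) else list(dist.items())
          r, acc = random.random(), 0.0
          for outcome, p in items:
              acc += p
              if r < acc:
                  return outcome
          return items[-1][0]

      for word_type, n_j in token_counts.items():
          z_j = sample(theta)                                # one class per word type
          features = [sample(phi[z_j]) for _ in range(n_j)]  # i.i.d. per token
          print(word_type, z_j, features)
      ```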

    1. induction system allow it to produce useful prototypes with the current method and/or to develop a specialized system specifically targeted towards inducing useful prototypes

      Maybe Clark's method does not benefit as much because it already uses morphological information?

    2. We note that the two best-performing systems, clark and feat, are also the only two to use morphological information

      This reduces the information that has to come from distributional data, which should be especially important for languages with more complex morphology and less information expressed via word order. But we later see that there is not that much of a difference.

    3. To address this question, we evaluate all the systems as well on the multilingual Multext East corpus

      Still limited to European languages.

    4. This is probably due to the fact that the gold standard encodes “pure” syntactic classes, while substitutability also depends on semantic characteristics (which tend to be picked up by unsupervised clustering systems as well

      This is actually a key "problem" in unsupervised part-of-speech induction: the induced POS classes pick up on semantic distinctions, which lead, among other things, to distinctions between nouns that are more fine-grained than one would expect according to standard syntactic theory.