378 Matching Annotations
  1. Oct 2023
    1. See the examples directory for annotated examples of parsing JSON, Lisp, a Python-ish language, and math.

      test

  2. Oct 2020
    1. Understood in this way, this science does not stand in contradiction
    2. But if free will really

      hi!

    3. This book is, above all, an attempt to investigate the facts
    4. Understood in this way, this science is not in contradiction with any kind of philosophy, for it moves onto entirely different ground. Perhaps mor- 1 We have been reproached (see Beudant, Le droit individuel et l'Etat, p. 244) for having in one place called the question of free will a delicate one. This expression carried nothing disdainful on our part. If we decline to resolve this problem, it is solely because its resolution, whatever it may be, cannot hinder our investigation.
    5. This book is, above all, an attempt to investigate
    1. This book is, above all, an attempt to investigate the facts

      ggggg

    2. ggggg

    3. This book is, above all, an attempt to investigate the facts

      body of annotation, can include markup

    1. Bruno Latour, Steve Woolgar

      ping


  3. Aug 2019
    1. our system is able to address zero anaphora. Thorough cross-lingual analysis by Novák and Nedoluzhko (2015) showed that many counterparts of Czech or English coreferential expressions are zeros. This likely holds for the other pro-drop languages, too
    2. we divide mentions into multiple categories in this paper: (1) personal pronouns, (2) possessive pronouns, (3) reflexive possessive pronouns, (4) reflexive pronouns, all four types of pronouns in the 3rd or ambiguous person, (5) demonstrative pronouns, (6) zero subjects, (7) zeros in non-finite clauses, (8) relative pronouns, (9) the pronouns of types (1)-(4) in the 1st or 2nd person, (10) named entities, (11) common nominal groups, and (12) other expressions.
  4. Jul 2019
    1. Starting with a text in the target language to be labeled with coreference, it first must be machine-translated to the source language. A coreference resolver for the source language is then applied on the translated text and, finally, the newly established coreference links are projected back to the target language
    2. Approaches to cross-lingual projection are usually aimed to bridge the gap of missing resources in the target language.
    3. MT-based approaches apply a machine-translation service to create synthetic data in the source language. Corpus-based approaches take advantage of the human-translated parallel corpus of the two languages.
  5. Jun 2019
    1. Computationally-aided linguistic analysis: The focus of this paper type is new linguistic insight.
    2. NLP engineering experiment paper: This paper type matches the bulk of submissions at recent CL and NLP conferences.
    3. Reproduction paper: The contribution of a reproduction paper lies in analyses of and in insights into existing methods and problems—plus the added certainty that comes with validating previous results.
    4. Resource paper: Papers in this track present a new language resource. This could be a corpus, but also could be an annotation standard, tool, and so on.
    1. Perhaps the simplest and most commonly used query framework is uncertainty sampling (Lewis and Gale, 1994). In this framework, an active learner queries the instances about which it is least certain how to label. This approach is often straightforward for probabilistic learning models.
    2. The main difference between stream-based and pool-based active learning is that the former scans through the data sequentially and makes query decisions individually, whereas the latter evaluates and ranks the entire collection before selecting the best query.
    3. For many real-world learning problems, large collections of unlabeled data can be gathered at once. This motivates pool-based sampling (Lewis and Gale, 1994), which assumes that there is a small set of labeled data L and a large pool of unlabeled data U available.
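
      The three excerpts above describe pool-based uncertainty sampling: rank the pool U by the model's confidence and query the least certain instances. A minimal sketch, assuming a scikit-learn-style probabilistic classifier (clf, X_pool, and n_queries are illustrative names, not from the source):

      ```python
      import numpy as np
      from sklearn.linear_model import LogisticRegression

      def least_confident_queries(clf, X_pool, n_queries=10):
          """Indices of the pool instances whose most likely label has the lowest probability."""
          proba = clf.predict_proba(X_pool)          # class probabilities per pool instance
          confidence = proba.max(axis=1)             # probability of the most likely label
          return np.argsort(confidence)[:n_queries]  # least confident instances first

      # Usage: fit on the small labeled set L, then rank the large unlabeled pool U.
      # clf = LogisticRegression().fit(X_labeled, y_labeled)
      # query_idx = least_confident_queries(clf, X_unlabeled)
      ```
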
    4. An alternative to synthesizing queries is selective sampling (Cohn et al., 1990, 1994). The key assumption is that obtaining an unlabeled instance is free (or inexpensive), so it can first be sampled from the actual distribution, and then the learner can decide whether or not to request its label.
    5. The active learner aims to achieve high accuracy using as few labeled instances as possible, thereby minimizing the cost of obtaining labeled data
    6. Active learning systems attempt to overcome the labeling bottleneck by asking queries in the form of unlabeled instances to be labeled by an oracle (e.g., a human annotator)
    7. The key hypothesis is that if the learning algorithm is allowed to choose the data from which it learns—to be "curious," if you will—it will perform better with less training.
    1. Balance exploration and exploitation: the choice of examples to label is seen as a dilemma between exploration and exploitation over the data space representation. This strategy manages the compromise by modelling the active learning problem as a contextual bandit problem. For example, Bouneffouf et al. [9] propose a sequential algorithm named Active Thompson Sampling (ATS), which, in each round, assigns a sampling distribution on the pool, samples one point from this distribution, and queries the oracle for this sample point's label.
       Expected model change: label those points that would most change the current model.
       Expected error reduction: label those points that would most reduce the model's generalization error.
       Exponentiated Gradient Exploration for Active Learning [10]: a sequential algorithm named exponentiated gradient (EG)-active that can improve any active learning algorithm by an optimal random exploration.
       Membership Query Synthesis: the learner generates its own instance from an underlying natural distribution. For example, if the dataset is pictures of humans and animals, the learner could send a clipped image of a leg to the teacher and query whether this appendage belongs to an animal or a human. This is particularly useful if the dataset is small. [11]
       Pool-Based Sampling: instances are drawn from the entire data pool and assigned an informativeness score, a measurement of how well the learner "understands" the data. The system then selects the most informative instances and queries the teacher for the labels.
       Stream-Based Selective Sampling: each unlabeled data point is examined one at a time, with the machine evaluating the informativeness of each item against its query parameters. The learner decides for itself whether to assign a label or query the teacher for each datapoint.
       Uncertainty sampling: label those points for which the current model is least certain as to what the correct output should be.
       Query by committee: a variety of models are trained on the current labeled data and vote on the output for unlabeled data; label those points for which the "committee" disagrees the most (see the sketch after this list).
       Querying from diverse subspaces or partitions [12]: when the underlying model is a forest of trees, the leaf nodes might represent (overlapping) partitions of the original feature space. This offers the possibility of selecting instances from non-overlapping or minimally overlapping partitions for labeling.
       Variance reduction: label those points that would minimize output variance, which is one of the components of error.
       Conformal Predictors: this method predicts that a new data point will have a label similar to old data points in some specified way, and the degree of similarity within the old examples is used to estimate the confidence in the prediction. [13]
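
      A hedged sketch of the query-by-committee entry above, using vote entropy as the disagreement measure (the committee of trained classifiers and the top-10 cut are illustrative assumptions):

      ```python
      import numpy as np

      def vote_entropy(committee, X_pool):
          """Disagreement per pool instance: entropy of the committee's label votes."""
          votes = np.stack([model.predict(X_pool) for model in committee])  # shape (C, N)
          n_models = votes.shape[0]
          scores = []
          for col in votes.T:                          # votes cast for one instance
              _, counts = np.unique(col, return_counts=True)
              p = counts / n_models                    # empirical vote distribution
              scores.append(-(p * np.log(p)).sum())
          return np.asarray(scores)

      # Query the instances the committee disagrees on the most:
      # query_idx = np.argsort(vote_entropy(committee, X_pool))[::-1][:10]
      ```
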
  6. May 2019
    1. The first NER task was organized by Grishman and Sundheim (1996) in the Sixth Message Understanding Conference. Since then, there have been numerous NER tasks (Tjong Kim Sang and De Meulder, 2003; Tjong Kim Sang, 2002; Piskorski et al., 2017; Segura Bedmar et al., 2013; Bossy et al., 2013; Uzuner et al., 2011).
    2. Starting with Collobert et al. (2011), neural network NER systems with minimal feature engineering have become popular. Such models are appealing because they typically do not require domain-specific resources like lexicons or ontologies, and are thus poised to be more domain independent.
    3. Early NER systems were based on handcrafted rules, lexicons, orthographic features and ontologies. These systems were followed by NER systems based on feature engineering and machine learning (Nadeau and Sekine, 2007

      v

    1. define a set of rules to assign the correct sentiment score to the opinion word

      sentiment "correction"

    2. spaCy's dependency parser is able to identify other words linked by dependency to that particular opinion word. This allows you to extract the aspect term

      dependency tree-based aspect term extraction

    3. identify opinion words by cross referencing the opinion lexicon for negative and positive words

      lexicon-based opinion words extraction
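
      Items 2 and 3 above pair a lexicon lookup for opinion words with a walk over spaCy's dependency tree to recover the aspect term. A rough sketch; the tiny OPINION_LEXICON and the example sentence are stand-ins for the post's actual lexicon and data:

      ```python
      import spacy

      nlp = spacy.load("en_core_web_sm")                 # assumes the model is installed
      OPINION_LEXICON = {"great", "terrible", "slow", "delicious"}  # illustrative subset

      def extract_aspects(text):
          pairs = []
          for token in nlp(text):
              if token.lower_ in OPINION_LEXICON:
                  if token.dep_ == "amod":               # adjectival modifier: head is the aspect
                      pairs.append((token.head.text, token.text))
                  else:                                  # predicative use: find the nominal subject
                      for child in token.head.children:
                          if child.dep_ == "nsubj":
                              pairs.append((child.text, token.text))
          return pairs

      print(extract_aspects("The delicious food came out after a terrible wait."))
      # e.g. [('food', 'delicious'), ('wait', 'terrible')], depending on the parse
      ```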

    4. If the word fails to meet the threshold for proximity between the two words in the vector space, the algorithm falls back on using the category of the entire sentence that was classified previously using ML-NB
    5. I first try to assign based on the similarity of the aspect term to the aspect category with word2vec’s n_similarity
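
      Items 4 and 5 above, sketched with gensim: score the aspect term against seed words per category with n_similarity, and fall back to the sentence-level classifier's label below a threshold. The vector file, seed lists, and the 0.3 threshold are assumptions for illustration:

      ```python
      from gensim.models import KeyedVectors

      wv = KeyedVectors.load("word2vec.kv")      # hypothetical pretrained vectors
      CATEGORY_SEEDS = {                         # illustrative seed words per category
          "food": ["food", "meal", "dish"],
          "service": ["service", "staff", "waiter"],
          "price": ["price", "cost", "cheap"],
      }

      def assign_category(aspect_terms, sentence_label, threshold=0.3):
          best_cat, best_sim = None, -1.0
          for cat, seeds in CATEGORY_SEEDS.items():
              sim = wv.n_similarity(aspect_terms, seeds)   # cosine between word-set means
              if sim > best_sim:
                  best_cat, best_sim = cat, sim
          # below the threshold, trust the whole-sentence ML-NB label instead
          return best_cat if best_sim >= threshold else sentence_label
      ```
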
    6. tag it with an aspect using a Multi-label Naive Bayes model

      multilabel classification
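
      A minimal sketch of the multi-label Naive Bayes tagger mentioned above, via scikit-learn's one-vs-rest wrapper (the toy sentences and labels are placeholders):

      ```python
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.multiclass import OneVsRestClassifier
      from sklearn.naive_bayes import MultinomialNB
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import MultiLabelBinarizer

      sentences = ["great food but slow service", "very affordable"]
      labels = [["food", "service"], ["price"]]    # a sentence may carry several aspects

      mlb = MultiLabelBinarizer()
      Y = mlb.fit_transform(labels)                # binary indicator matrix

      model = make_pipeline(TfidfVectorizer(),
                            OneVsRestClassifier(MultinomialNB()))
      model.fit(sentences, Y)
      predicted = mlb.inverse_transform(model.predict(["cheap and tasty"]))
      ```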

    7. segment the chunk of text into sentences

      sentence tokenize

    8. I first replace the pronouns in the sentence using a pre-trained neural coreference model;

      coref

    9. For example, let’s assume you’re trying to classify a single yelp restaurant review into one of five aspects: food, service, price, ambience, or simply anecdotal/miscellaneous.

      multi-label classification

    1. Schmoller et al. [21] carry out their analysis using a dataset provided by the car sharing operator, which contains more information than what is generally available to the research community at large.
    1. There are various types of n-grams and syntactic n-grams according to the types of elements they are built of: lexical units (words, stems, lemmas), POS tags, SR tags (names of syntactic relations), characters, etc.
    2. recently we have proposed a concept of syntactic n-grams, i.e., n-grams constructed by following paths in syntactic trees [19,21].
    3. The most widely used features are words and n-grams.
    4. if by their nature the features have symbolic values, then they are mapped to numeric values in some manner.
    5. The most common manner to represent objects is the Vector Space Model (VSM) [17]. In this model, the objects are represented as vectors of values of features. The features characterize each object and have numeric values.
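
      A minimal illustration of the VSM with word n-gram features, assuming scikit-learn (not a tool named in the excerpts): each text becomes a numeric vector over the n-gram vocabulary.

      ```python
      from sklearn.feature_extraction.text import CountVectorizer

      texts = ["the cat sat", "the cat ran"]
      vectorizer = CountVectorizer(ngram_range=(1, 2))   # unigram and bigram features
      X = vectorizer.fit_transform(texts)                # documents-by-features matrix
      print(vectorizer.get_feature_names_out())
      print(X.toarray())
      ```
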
    1. Additionally, we defined a subcategory of positive posts that covers frequent speech acts, such as expressions of gratitude, greetings, and congratulations. They are very frequent in VK data, and the sentiment they express is overtly positive, but they are also very formulaic.
    2. We also defined the "skip" class for excluding the posts that were too noisy, unclear, or not in Russian (e.g., in Ukrainian). We also made the decision to exclude jokes, poems, song lyrics, and other such content that was not generated by the users themselves
    3. We prioritized the speed of annotation over detail, opting for a 3-point scale rather than e.g., the 5-point scale in SemEval Twitter datasets (Rosenthal et al., 2017). Thus, the task was to rate the prevailing sentiment in complete posts from VK on a three-point scale ("negative", "neutral", and "positive").
    4. The annotation was performed by six native speakers with backgrounds in linguistics over the course of 5 months. The average annotation speed was 250-350 posts per hour
    5. The datasets from the SentiRuEval 2015 and 2016 competitions are the largest resource that has been available to date (Loukachevitch and Rubtsova, 2016). The SentiRuEval 2016 dataset comprises 10,890 tweets from the telecom domain and 12,705 from the banking domain. The Linis project (Koltsova et al., 2016) reports to have crowdsourced annotation for 19,831 blog excerpts, but only 3,327 are currently available on the project website
    6. RuSentiLex, the largest sentiment lexicon for Russian (Loukachevitch and Levchik, 2016), currently contains 16,057 words
    7. The best results were achieved with a neural network model that made use of word embeddings trained on the VKontakte corpus, which we also release to enable a fair comparison with our baselines in future work. This model achieved an F1 score of 0.728 in a 5-class classification setup.
    8. The overall inter-annotator agreement in terms of Fleiss' kappa stands at 0.58. In total, 31,185 posts were annotated, 21,268 of which were selected randomly (including 2,967 for the test set). 6,950 posts were pre-selected with an active learning-style strategy in order to diversify the data.
    1. On the other hand, new vehicle concepts with stackable capabilities have been recently released or are under development, which can be stacked into a train (through a mechanical and electric coupling) and/or folded together.
    2. It is important to point out that the relocation process is intrinsically inefficient: as one driver per car is needed, to relocate several cars a large workforce or many willing customers are necessary
    3. One-way car sharing is not without drawbacks for the car sharing operators. With one-way car sharing, cars will follow the natural flows of people in a city, hence accumulating in commercial/business areas in the morning and in residential areas at night [3]
    4. One-way systems can be also classified into free-floating or station-based according to their parking restrictions.
    5. One-way car sharing, in which customers are not forced to return the vehicle at the starting point of their journey
    6. people do not own a car, they simply rent it from the car sharing operator when they need it (typically for short-range trips), effectively implementing the concept of Mobility-as-a-Service
    7. Car sharing can also act as a last-kilometre solution for connecting people with public transport hubs, hence becoming a feeder to traditional public transit [2].
    1. Graph network analysis is conducted on the learned DDGF, which shows the DDGF can capture similar information that is embedded in the SD, DE and DC matrices, and extra hidden heterogeneous pairwise correlations between stations
    2. Two architectures of the GCNN-DDGF model, GCNNreg-DDGF and GCNNrec-DDGF, are explored. Their prediction performances are compared with four GCNN models with pre-defined adjacency matrices and seven benchmark models. The proposed GCNNrec-DDGF outperforms all of these models.
    3. Proposing a novel GCNN-DDGF model that can automatically learn hidden heterogeneous pairwise correlations between stations to predict station-level hourly demand.
    4. As pointed out by many previous studies (Chen et al., 2016; Li et al., 2015; Lin, 2018; Zhou, 2015), it is common for BSSs with fixed stations that some stations are empty with no bikes to check out while others are full, precluding bikes from being returned at those locations.
    5. In general, distributed bike-sharing systems (BSSs) can be grouped into two types, dock-based BSS and non-dock BSS.
    1. Our proposed approach fully integrates rebalancing, request assignment and ride sharing, in a fully decentralized manner.
    2. Strategies range from those that use a short window of known future requests (e.g., 5 minutes in [11] and 30 seconds in [4]), to those based on historical demand (e.g., [8]) or those using prediction techniques to predict future demand (e.g., [14]).
    3. To denote the different areas of the system in order to map the demand to a geographical area, the network is generally divided into several zones [13], blocks [11] or hexagons [14].
    4. The relocation of empty vehicles in shared MoD systems has been widely studied in the literature, and can be divided into operator-based approaches [8], [9] (where employees of the car-sharing service relocate the vehicles), user-based approaches [10] (where users are financially incentivized to return the vehicles to high-demand areas) and, more recently, those in shared autonomous vehicle (SAV) systems [4], [11], [12] (where driverless vehicles are autonomously relocated).
    1. The bike-sharing system simulator proposed by Caggiani and Ottomanelli (2012 and 2013) has been used to represent and model the FFBSS under analysis, pretending that the centroids of each zone coincide with a hypothetical bike-sharing station
    2. We assume that in this area a free-floating bike-sharing system (FFBSS) is operating. A further assumption is that a typical user is willing to cover a maximum distance of about 630 meters on foot to reach the bicycle closest to the origin of his/her trip.
    3. we apply the suggested methodology to a study area of 1.2 km x 1.2 km of extension. This area is composed of 36 square zones, with a side length equal to 0.2 km (grid of 6x6 zones).
    4. the zero-vehicle-time (ZVT) (Kek et al., 2009). When ZVT occurs, a zone (or station, in a station-based system) is without any available vehicle; then, a customer requesting a vehicle at that moment in that zone will be rejected/unsatisfied.
    5. Every zone of this FFVSS could be seen as a station (in a station-based sharing system) that aggregates/contains (inside its borders) a number of vehicles.
    6. some authors have shown how cluster analysis is capable of revealing groups of stations with a similar trend of rental and return activities during the day (Vogel et al., 2011).
    7. It is worth mentioning Reiss and Bogenberger (2015), who, in order to apply their operator-based strategy to a bike-sharing system, divided the operating area of the free-floating system into a certain number of zones, which in a way could be interpreted as stations.
    8. all the approaches adopted to relocate the shared fleets can be grouped into two categories according to who actually performs the relocation: user-based and operator-based strategies
    9. These imbalances of supply and demand can be resolved/mitigated only with an appropriate reallocation strategy (Reiss and Bogenberger, 2015), namely a transfer of vehicles from zones with high accumulation to areas where the shortage is experienced (Boyacı et al., 2015).
    10. during the day significant fluctuations in travel demand (due to weather conditions, time of the day and holidays/weekends) can be observed. Sometimes there is vehicle overcrowding in certain zones, and a lack of available vehicles in others, at the time the users need them (Herrmann et al., 2014)
    1. The conversion to the binary scale was performed according to the following scheme: {1, 2} → negative, {4, 5} → positive. Reviews that have a score of 3 on an aspect were not considered for this aspect when assessing the quality of the algorithm
    2. As a result, for the positive sentiment, 342 terms were found (with the threshold of 0.2) and 1203 terms for the negative sentiment (with the threshold of 0.25).
    3. In the same way, sentiment terms were obtained. As the initial terms that set the overall sentiment, the words отличный (excellent) for the positive class and ужасный (terrible) for the negative class were chosen. For each newly generated term, the cosine similarity value with the initial term was found and was assigned to the term as the weight.

      the sentiment weight is the distance between a given word and one of the seed words: отличный (excellent) or ужасный (terrible)

    4. As a result, each of the three aspects has its own list of terms. The number of terms for each aspect is the following: 2550 for Room, 1317 for Location, and 1740 for Service
    5. Thus, for each term a list of 10 new terms closest to the original one was found. These lists were combined, with duplicate terms removed. This process continues, and the resulting list again generates a new one according to the same principle. Repeating this procedure for new term lists is an iterative process that generates aspect terms. To remove noise words which appear during term generation, an additional restriction was used: each newly generated term was stored in the resulting list of aspect terms only if the similarity value with at least three of the five terms in the initial list exceeded 0.3 for each aspect. For each term, the cosine similarity with initial terms is calculated and the maximum is assigned to it as the weight. The weight value will be used at the sentiment assignment step.

      how the aspect term lexicon is built
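
      The iterative expansion in item 5 might be sketched as follows; the vector file and the exact loop structure are assumptions beyond what the excerpt states:

      ```python
      from gensim.models import KeyedVectors

      wv = KeyedVectors.load("reviews_word2vec.kv")   # hypothetical vectors trained on the reviews
      seeds = ["номер", "ванная", "телевизор", "свет", "кровать"]   # the Room seed terms

      def expand_lexicon(seeds, iterations=3, topn=10, min_sim=0.3, min_hits=3):
          lexicon = {s: 1.0 for s in seeds}
          frontier = list(seeds)
          for _ in range(iterations):
              candidates = []
              for term in frontier:                   # 10 nearest neighbours per term
                  candidates += [w for w, _ in wv.most_similar(term, topn=topn)]
              frontier = []
              for cand in set(candidates) - set(lexicon):
                  sims = [wv.similarity(cand, s) for s in seeds]
                  # keep only terms close enough to at least three of the five seeds
                  if sum(sim > min_sim for sim in sims) >= min_hits:
                      lexicon[cand] = max(sims)       # weight reused at the sentiment step
                      frontier.append(cand)
          return lexicon
      ```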

    6. For the aspect Room the initial terms номер (room), ванная (bathroom), телевизор (TV), свет (light), кровать (bed) are selected. For the aspect Service the initial terms are сервис (service), персонал (staff), администратор (administrator), сотрудник (staff member), консьерж (concierge). For the aspect Location the words местоположение (location), достопримечательность (attraction), центр (center), транспорт (transport), месторасположение (location) were chosen
    7. The collocations with the adverb очень (very) were processed in the same way
    8. it was decided to add the prefix not_ to the first adjective, adverb or verb
    9. In total, 50,329 reviews were collected for the training corpus
    10. For the sentiment identification stage of the algorithm, only three aspects were chosen: Room, Location and Service, since they are the most popular ones.
    11. The following information was collected from the site: the text of the review, the overall rating of the hotel (on a 5-point scale), and an assessment of the hotel's characteristics, such as the price-quality ratio, location, room, cleanliness, service, quality of sleep
    12. the reviews were collected from the website TripAdvisor
    13. Another important note is that many methods often benefit from taking advantage of more data, i.e. additional reviews, even without annotated terms. This was well demonstrated by top performers in the SemEval-2014 aspect-based sentiment analysis task [Pontiki et al., 2014]
    14. Liu [2012] lists four main approaches to aspect extraction: (1) using frequent nouns and noun phrases; (2) using opinion and target relations; (3) supervised learning; (4) topic modeling.
    15. State-of-the-art models make use of topic modeling methods, such as Latent Dirichlet Allocation (LDA), and Conditional Random Fields (CRF).
    16. Traditional approaches are based on collecting the most frequent words and phrases which are contained in the manually constructed aspect or sentiment lexicon
    17. The task of aspect-based sentiment analysis [Liu, 2012; Pontiki et al., 2014; Pavlopoulos, 2014] is usually split into two subtasks: aspect term extraction and aspect term polarity estimation, which are treated separately and often use different techniques.
    1. Agreements for aspect expressions are 0.93, 0.94, 0.93.
    2. The Kappa Coefficient is calculated over aspect-sentiment pairs per each location. Pairwise inter-annotator agreement for aspect categories measured using Cohen's Kappa is 0.73, 0.78 and 0.70, which is deemed of sufficient quality
    3. However, this task assumes only the overall sentiment for each entity. Moreover, the existing corpora for this task have so far contained only a single target entity per unit of text.
    4. Another line of research in this field is targeted (a.k.a. target-dependent) sentiment analysis (Jiang et al., 2011; Vo and Zhang, 2015). Targeted sentiment analysis investigates the classification of opinion polarities towards certain target entity mentions in given sentences (often a tweet).
    5. Aspect-based sentiment analysis (ABSA) (Jo and Oh, 2011; Pontiki et al., 2015; Pontiki et al., 2016) relates to the task of extracting fine-grained information by identifying the polarity towards different aspects of an entity in the same unit of text, and recognizing the polarity associated with each aspect separately
    6. Targeted aspect-based sentiment analysis handles extracting the target entities as well as different aspects and their relevant sentiments.
    7. Entities in the dataset are locations or neighbourhoods.
    8. sentences containing one location mention — Single, and sentences containing two location mentions — Multi. This is to observe the difficulty of annotating two groups by human annotators and by the models
    9. In our annotation, however, we only provided "Positive" and "Negative" sentiment labels.
    10. we define the two following special labels. Sentences marked with one of these labels are removed from the dataset
    11. We use the BRAT annotation tool (Stenetorp et al., 2012) to simplify the annotation task.
    12. Aspect general refers to a generic opinion about a location, e.g. "I love Camden Town"
    13. a pre-defined list of aspects is provided for annotators to choose from. These aspects are: live, safety, price, quiet, dining, nightlife, transit-location, touristy, shopping, green-culture and multicultural
    1. In the Aspect Category Polarity (ACP) task the polarity of each expressed category is recognized, e.g. a positive category polarity is expressed in sentence 1.
    2. In the Aspect Category Detection (ACD) task the category evoked in a sentence is identified, e.g. the food category in sentence 1.
    3. In the Aspect Term Polarity (ATP) task the polarity evoked for each aspect is recognized, i.e. a positive polarity is expressed with respect to fried rice.
    4. The Aspect Term Extraction (ATE) subtask aims at finding words suggesting the presence of aspects on which an opinion is expressed, e.g. fried rice in sentence 1
    1. In practice UMAP uses a force directed graph layout algorithm in low dimensional space. A force directed graph layout utilizes a set of attractive forces applied along edges and a set of repulsive forces applied among vertices. Any force directed layout algorithm requires a description of both the attractive and repulsive forces. The algorithm proceeds by iteratively applying attractive and repulsive forces at each edge or vertex. Convergence is guaranteed by slowly decreasing the attractive and repulsive forces in a similar fashion to that used in simulated annealing
    2. In the first phase a particular weighted k-neighbour graph is constructed. In the second phase a low dimensional layout of this graph is computed
    3. The theoretical description of the algorithm works in terms of fuzzy simplicial sets. Computationally this is only tractable for the one skeleton which can ultimately be described as a weighted graph. This means that, from a practical computational perspective, UMAP can ultimately be described in terms of, construction of, and operations on weighted graphs. In particular this situates UMAP in the class of k-neighbour based graph learning algorithms such as Laplacian Eigenmaps, Isomap and t-SNE.
    4. At a high level, UMAP uses local manifold approximations and patches together their local fuzzy simplicial set representations to construct a topological representation of the high dimensional data. Given some low dimensional representation of the data, a similar process can be used to construct an equivalent topological representation. UMAP then optimizes the layout of the data representation in the low dimensional space, to minimize the cross-entropy between the two topological representations.
    5. Dimension reduction algorithms tend to fall into two categories; those that seek to preserve the distance structure within the data and those that favor the preservation of local distances over global distance. Algorithms such as PCA [22], MDS [23], and Sammon mapping [41] fall into the former category while t-SNE [50, 49], Isomap [47], LargeVis [45], Laplacian eigenmaps [5, 6] and diffusion maps [14] all fall into the latter category
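
      A minimal usage sketch of the two-phase pipeline described above, with the umap-learn package (the data and parameter values are illustrative):

      ```python
      import numpy as np
      import umap

      X = np.random.rand(500, 50)                # stand-in high-dimensional data
      reducer = umap.UMAP(n_neighbors=15,        # size of the k-neighbour graph (phase one)
                          min_dist=0.1,          # how tightly the layout packs points (phase two)
                          n_components=2)
      embedding = reducer.fit_transform(X)       # (500, 2) low-dimensional layout
      ```
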
    1. 5. The dissimilarity matrix of the data should be well represented by the clustering (i.e., by the ultrametric induced by a dendrogram, or by defining a binary metric "in same cluster/in different clusters").
       6. Clusters should be stable.
       7. Clusters should correspond to connected areas in data space with high density.
       8. The areas in data space corresponding to clusters should have certain characteristics (such as being convex or linear).
       9. It should be possible to characterize the clusters using a small number of variables.
       10. Clusters should correspond well to an externally given partition or values of one or more variables that were not used for computing the clustering.
       11. Features should be approximately independent within clusters.
       12. All clusters should have roughly the same size.
       13. The number of clusters should be low.
    2. 1. Within-cluster dissimilarities should be small.
       2. Between-cluster dissimilarities should be large.
       3. Clusters should be fitted well by certain homogeneous probability models such as the Gaussian or a uniform distribution on a convex set, or by linear, time series or spatial process models.
       4. Members of a cluster should be well represented by its centroid
  7. Apr 2019
    1. A natural extension of this idea is to use a Negative Binomial distribution, which is a gamma mixture of an infinite number of Poisson distributions. The probability density function of a Negative Binomial distribution is given below,
       $$P(k) = \binom{k+r-1}{r-1} p^r (1-p)^k, \qquad (4)$$
       where p and r are parameters of the distribution
    2. One of the distributions captures the rate of the word occurrence when the word occurs because it is topically relevant to the document. The second distribution captures the rate of the word occurrence when the word occurs without being topically relevant to the document. This mixture of two probability distributions has the probability density function:
       $$P(k) = \alpha \frac{\lambda_1^k e^{-\lambda_1}}{k!} + (1 - \alpha) \frac{\lambda_2^k e^{-\lambda_2}}{k!}$$
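
      Both densities can be evaluated with scipy, whose nbinom uses exactly the form in Eq. (4); the parameter values below are illustrative, not fitted:

      ```python
      import numpy as np
      from scipy.stats import nbinom, poisson

      k = np.arange(10)

      # Negative Binomial: C(k+r-1, r-1) * p**r * (1-p)**k, with r=5, p=0.4
      nb_pmf = nbinom.pmf(k, 5, 0.4)

      # Two-Poisson mixture: topical vs. incidental word occurrence rates
      alpha, lam1, lam2 = 0.3, 4.0, 0.5
      mix_pmf = alpha * poisson.pmf(k, lam1) + (1 - alpha) * poisson.pmf(k, lam2)
      ```
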
    1. from three different perspectives: from a statistically motivated point of view; with a computationally motivated mindset; and in a topologically motivated framework
    2. Finally HDBSCAN* resolves many of the difficulties in parameter selection by requiring only a small set of intuitive and fairly robust parameters.
    3. being a density based approach, DBSCAN only suffers from the difficulty of parameter selection.
    4. The archetypal clustering algorithm, K-Means, suffers from all three of the problems mentioned previously: requiring the selection of the number of clusters; partitioning the data, and hence assigning noise to clusters; and the implicit assumption that clusters have Gaussian distributions.
    5. Partitioning, on the other hand, requires that every data point be associated with a particular cluster. In the presence of noise the partitioning approach can be problematic.
    6. Methods to determine the number of clusters such as the elbow method and silhouette method are often subjective and can be hard to apply in practice.
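
      For reference, the silhouette method mentioned in item 6 in a few lines, assuming scikit-learn and synthetic data:

      ```python
      import numpy as np
      from sklearn.cluster import KMeans
      from sklearn.metrics import silhouette_score

      X = np.random.rand(300, 2)                 # stand-in data
      for k in range(2, 8):
          labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
          print(k, silhouette_score(X, labels))  # higher is better, but often ambiguous
      ```
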
    7. While clustering has many uses to many people, our particular focus is on clustering for the purpose of exploratory data analysis. By exploratory data analysis we mean the process of looking for "interesting patterns" in a data set, primarily with the goal of generating new hypotheses or research questions about the data set in question.
    8. Clustering is the attempt to group data in a way that meets with human intuition. Unfortunately, our intuitive ideas of what makes a 'cluster' are poorly defined and highly context sensitive [26].
    1. People who hold different views most often have different educational backgrounds and choose different places to live. In Moscow it is not ethnic differences that generate segregation processes, but social ones. Social stratification entails property stratification, and that, in turn, entails ethnic stratification.
    2. Moscow is splitting into four large worldview clusters, and they may take on a territorial dimension.
    3. In our cities the notion of a majority is beginning to erode. The higher the level of diversity, the lower the likelihood of a majority that forms a dominant position. A city like Moscow exists as a split community.
    4. You understand perfectly well that this is one of the challenges of democracy: democratic institutions can be used by any forces. And the more developed a democracy is, the greater the opportunity for differences to be represented.
    5. I believe that in today's Russia there are no ghettos; there are anti-ghettos.
    6. Kenneth Benjamin Clark, whom Didier often cites, gives a broader definition in his works of the 1960s. He writes that the ghetto is at once a paradox, a conflict, and a dilemma. It gives hope and it is hopelessness; it is the church and the tavern; cooperation and care in the ghetto are combined with suspicion, rivalry, and exclusion. Ghetto residents are characterized at once by a strong striving for assimilation and a rejection of it, by alienation and by seeking shelter.
    7. The ghetto is not a geographical phenomenon tied to barriers and borders, although those matter; it is a social one, a mode of dual organization of a community. For example, Loïc Wacquant emphasizes that the ghetto functions as a kind of ethno-racial prison.
    8. "The more intense globalization becomes, the more actively ghettos form." This is important in the context of our discussion, because we use concepts such as ghetto and segregation not in an analytical but in a metaphorical sense.
    9. The situation is different in medieval cities, where an estate-based society begins to form. Segregation arises not along the lines of poverty and wealth, but along the lines of corporation, craft affiliation, and so on. If we take medieval Moscow, it too was a segregated city, and this has survived in the names of streets and quarters.
    10. As for the cities of the ancient world, although society was organized hierarchically, it was not spatially segregated along the lines of "poverty" and "wealth". The very structure of the ancient house presupposed the shared life of the poor and the rich, of slaves and free citizens, of servants and masters.
    11. Segregation is not only a problem but also a solution that various social strata find for themselves in order to separate from other social strata. Here one can also observe the formation of what we call ghettos, the formation of ghettos in urban spaces.
    12. A hierarchization of districts, of different places in the city, is taking place. The system of distancing and the logic of division and selectivity lead people to display a certain distinction between their own group and others
    13. The more important the role of these flows becomes, the more of the population seeks places for localization and concentration. Here a certain social logic comes into play: a logic of division and distancing, a selective logic. We see that social groups in cities are becoming ever more divided. The privileges that each group has acquired for itself allow it to distance itself from other groups.
    14. What is happening now is sectorization, if you like. The city is organized no longer around a geographical principle but rather as a set of islands; it increasingly resembles a different structure, ever more varied and changing. It consists of different places: in some, commerce is more present; in others, culture; and in some places driving private vehicles is banned altogether.
    15. the completion of the ring structure of urban organization, of organization around a single center
    16. First of all, it must be said that cities in general have changed. There is more segregation in them; social, urban, and ethnic transformation is under way. The urban model that we saw in the 19th century, built as concentric circles, is now coming to an end.
    17. Can anything positive be found in the ghetto or in segregation? This question, it seems to me, is extremely important and interesting. At least with respect to Jewish ghettos there is a very interesting account in the sociological tradition. It was once proposed by Richard Sennett, the American sociologist. He has a brilliant book called "Flesh and Stone". In it he argues that it was thanks to the ghetto that Jewish culture survived.
    18. First, what is segregation in the city: a normal or an abnormal phenomenon? Perhaps, as civilization develops, we will abandon it? In Zygmunt Bauman's famous book "Globalization: The Human Consequences" there is a fairly clear formula; I won't quote it verbatim, but I'll try to convey it: "The ghetto is the flip side of globalization." The stronger globalization processes become and the more easily they proceed, the more often we will see such segregation and isolated quarters, especially in large, global cities.
    1. clusters similar documents into clusters, and then selects features as bursty events from the clusters. The related works include TDT [2, 3, 14, 18, 21, 26, 27], text mining [9, 13, 14, 17, 19, 20, 22], and visualization [7, 11, 24]. However, the main drawback of adapting these techniques for the new hot bursty events detection problem is that they require many parameters and it is very difficult to find an effective way to tune these parameters
    2. the emphasis of our problem is to identify sets of bursty features, whereas the emphasis of TDT is to find clusters of documents.
    3. TDT is an unsupervised learning task (clustering) that finds clusters of documents matching the real events (sets of documents identified by humans) by reducing the number of missing documents in the clusters found and reducing the possibility of false alarms.
    4. This is because the set of bursty features can be used as a set of features for positive examples, and therefore helps partially supervised text classification [10, 6], which is a text classification technique using positive examples only
    5. hot bursty events detection in a text stream, where a text stream is a sequence of chronologically ordered documents, and a hot bursty event is a minimal set of bursty features that occur together in certain time windows with strong support of documents in the text stream
    1. The rapid increase of a term's frequency of appearance defines a term burst in the text stream.
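
      An illustrative burst score (not the paper's method): flag a term in a time window when its count is well above its mean rate across the stream; the threshold is arbitrary:

      ```python
      from collections import Counter

      def bursty_terms(windows, threshold=3.0):
          """windows: list of token lists, one per time window."""
          totals = Counter(tok for w in windows for tok in w)
          n = len(windows)
          bursts = []
          for i, window in enumerate(windows):
              for term, c in Counter(window).items():
                  if c > threshold * totals[term] / n:   # well above the mean per-window rate
                      bursts.append((i, term))
          return bursts
      ```
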
    1. Adam Kilgarriff referred to this as a "whelk" problem [16]. If you have a text about whelks, no matter how infrequent this word is in the rest of your corpus, it's likely to be in nearly every sentence in this text.
    2. some words are less likely to experience frequency bursts, which puts them in inferior positions in the frequency lists in comparison to those which do.
    1. conflating semantically related words into one word type could improve model fit by intelligently reducing the space of possible models.
    2. stemmers approximate intuitive word equivalence classes, so language models based on stemmed corpora inherit that semantic similarity, which may improve interpretability as perceived by human evaluators
    3. stemmers could reduce the effect of small morphological differences on the stability of a learned model.
    4. However, stemmers have the potential to be confusing, unreliable, and possibly even harmful in language models
    1. For each corpus, we select a set of 20 relevant query words from high probability words from an LDA topic model (Blei et al., 2003) trained on that corpus with 200 topics. We calculate the cosine similarity of each query word to the other words in the vocabulary, creating a similarity ranking of all the words in the vocabulary. We calculate the mean and standard deviation of the cosine similarities for each pair of query word and vocabulary word across each set of 50 models.
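
      The evaluation above, sketched for a set of independently trained embedding models (treating each model as a word-to-vector mapping, e.g. gensim KeyedVectors, is an assumption):

      ```python
      import numpy as np

      def cos_sim(a, b):
          return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

      def similarity_stats(models, query, vocab):
          # rows: models (e.g. the 50 runs), columns: vocabulary words
          sims = np.array([[cos_sim(m[query], m[w]) for w in vocab] for m in models])
          return sims.mean(axis=0), sims.std(axis=0)   # per-word mean and std across runs
      ```
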
    2. Rankings of most similar words are not reliable, and both ordering and membership in such lists are liable to change significantly.
    3. the corpus-centered approach is based on direct human analysis of nearest neighbors to embedding vectors, and the training corpus is not simply an off-the-shelf convenience but rather the central object of study
    4. other researchers take a corpus-centered approach and use relationships between embeddings as direct evidence about the language and culture of the authors of a training corpus (Bolukbasi et al., 2016; Hamilton et al., 2016; Heuser, 2016)
    5. Although PPMI appears deterministic (due to its pre-computed word-context matrix), we find that this algorithm produced results under the FIXED ordering whose variability was closest to the BOOTSTRAP setting. We attribute this intrinsic variability to the use of token-level subsampling.
    6. In general, LSA, GloVe, SGNS, and PPMI are not sensitive to document order in the collections we evaluated
    7. the membership of the lists changes substantially between runs of the BOOTSTRAP setting
    8. The presence of specific documents has a significant effect on all four algorithms (lesser for PPMI), consistently increasing the standard deviations.
    9. We observe that the FIXED and SHUFFLED settings for GloVe and LSA produce the least variable cosine similarities, while PPMI produces the most variable cosine similarities for all settings
    10. We process each corpus by lowercasing all text, removing words that appear fewer than 20 times in the corpus, and removing all numbers and punctuation.
    11. GloVe is sensitive to the presence of specific documents
    12. GloVe is not sensitive to document order.
    13. the presence of specific documents in the corpus can significantly affect the cosine similarities between embedding vectors
    14. we also remove duplicate documents from each corpus
    15. NLP research in word embeddings has so far focused on a downstream-centered use case, where the end goal is not the embeddings themselves but performance on a more complicated task
    16. If users do not account for this variability, their conclusions are likely to be invalid. Fortunately, we also find that simply averaging over multiple bootstrap samples is sufficient to produce stable, reliable results in all cases tested
    17. Embedding algorithms are much more sensitive than they appear to factors such as the presence of specific documents, the size of the documents, the size of the corpus, and even seeds for random number generators