108 Matching Annotations
  1. Aug 2024
    1. for example, our standard English language model is trained with something like maybe 100 gigabytes or so of text; that gives it a strength as if you had thrown BERT at it with the Google corpus. The other thing is, of course, that a small corpus like that is computed in two or three hours on a laptop. By the way, I didn't mention that our fingerprints are actually Boolean; when we train, as I said, we are not using floating points

      for - comparison - cortical io vs normal AI - training dataset size and time

    2. we basically grow models of, let's say, the same quality as all the others by using a thousand or ten thousand times less training data

      for - comparison - semantic folding vs normal machine learning - training dataset sizes and times

    3. in that bitmap representation, at the end I can look at every position in my bitmap and refer it back explicitly to the bits of reference information that I trained it with

      for - semantic fingerprint bitmap - tracing bitmap to training dataset
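      The quoted passages describe Boolean semantic fingerprints whose set bits trace back to the reference texts they were trained on. A minimal sketch of that idea, assuming fingerprints are represented as sets of active bit positions (the bit positions, provenance labels, and `overlap_similarity` helper are my own illustration, not Cortical.io's API):

```python
# Boolean semantic fingerprints as sets of active bit positions
# (hypothetical sketch; all bit positions and labels are made up).

def overlap_similarity(fp_a, fp_b):
    """Similarity of two Boolean fingerprints: fraction of shared bits,
    relative to the smaller fingerprint. No floating-point training involved;
    the representation itself is purely Boolean."""
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / min(len(fp_a), len(fp_b))

# Each set bit can be mapped back to the reference snippet that produced it,
# which is what makes the representation inspectable.
bit_provenance = {3: "reference doc 12", 7: "reference doc 40", 21: "reference doc 12"}

fp_dog = {3, 7, 9, 21}
fp_wolf = {3, 7, 14, 21}

print(overlap_similarity(fp_dog, fp_wolf))  # 0.75: 3 of 4 bits shared
```

      Because the fingerprints are sparse Boolean sets, comparison is a cheap set intersection rather than a dense floating-point dot product.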

  2. Mar 2024
  3. Oct 2023
    1. "Causal Triplet: An Open Challenge for Intervention-centric Causal Representation Learning" Yuejiang Liu1, 2,* YUEJIANG.LIU@EPFL.CH Alexandre Alahi2 ALEXANDRE.ALAHI@EPFL.CH Chris Russell1 CMRUSS@AMAZON.DE Max Horn1 HORNMAX@AMAZON.DE Dominik Zietlow1 ZIETLD@AMAZON.DE Bernhard Sch ̈olkopf1, 3 BS@TUEBINGEN.MPG.DE Francesco Locatello1 LOCATELF@AMAZON.DE

  4. Oct 2022
    1. An old drug and different ways to treat cutaneous leishmaniasis: Intralesional and intramuscular meglumine antimoniate in a reference center, Rio de Janeiro, Brazil.
  5. Jun 2022
    1. The major issue with much of the data that can be downloaded from web portals or through APIs is that they come without context or metadata. If you are lucky you might get a paragraph about where the data are from or a data dictionary that describes what each column in a particular spreadsheet means. But more often than not, you get something that looks like figure 6.3.

      I think the reason behind data's lack of context is a reluctance to add extra columns describing the data, and the inconsiderate, misleading view held by many in technology that data should simply be clean and concise.

      I have encountered insufficiently documented data multiple times, and found it extremely inconvenient when trying to use downloaded online reports in my work experience to illustrate changes in audience growth for a social media platform (Facebook). I used to help run a Facebook page for a student organization. After finishing that role, I went to the "Insights" section of Facebook, hoping to download a report of the increases in Page Likes, Visits, and Interactions during the period I was an admin of the page. It took several glitches to download the report (because it covered a year-long term). When the PDF file was ready to view, I was surprised: it did not mention the years I had worked, the name of the student organization, or other categorizations that should have been highlighted. It would apparently not have been hard to include the years or the name, since the years appeared in the filter I used to extract parts of the report, and the name identified the source the data came from. This laziness in presenting data fit for analysis was frustrating, and I had to add extra analysis of my own. Even after I finished that "extra work", I began to question the validity of the report I had downloaded. Could it be trusted anymore, when without my clarification no analysis could be made even by someone in the data science field? And even if they could manage it, it would take them a while to collect external information before making sense of the data presented to them.

      Understanding, and being constantly bothered by, this ongoing problem gives me justification to call for a more thorough data translation and presentation process. More questions should be raised and answered regarding what a user might wonder about a dataset when encountering it.

  6. May 2022
    1. Such a highly non-linear problem would clearly benefit from the computational power of many layers. Unfortunately, back-propagation learning generally slows down by an order of magnitude every time a layer is added to a network.

      The problem in 1988

  7. Apr 2022
  8. Nov 2021
  9. Jun 2021
  10. May 2021
    1. To investigate these hypotheses, I created an election-year-country dataset covering the period from the early 1990s to the present for all post-communist democracies.7 The dataset is structured as a quasi-time series of 93 parliamentary elections in 17 countries from 1991 to 2012, and the dependent variable is the natural log of the radical right party’s combined vote share in elections held at time t.

      this is the data, her explanation of the dataset she created

  11. Mar 2021
    1. 14 of which were sampled at multiple timepoints
    2. RNA sequencing on samples from 46 individuals with PCR-positive, symptomatic SARS-CoV-2 infection
    3. 77 peripheral blood samples across 46 subjects with COVID-19 and compared them to subjects with seasonal coronavirus, influenza, bacterial pneumonia, and healthy controls.
    4. seasonal coronavirus (n=59)
    5. divided based on disease severity and time from symptom onset
    6. elucidate novel aspects of the host response to SARS-CoV-2
    7. influenza (n=17)
    8. bacterial pneumonia (n=20)
    9. healthy controls (n=19)
    1. elucidate key pathways in the host transcriptome of patients infected with SARS-CoV-2, we used RNA sequencing (RNA Seq) to analyze nasopharyngeal (NP) swab and whole blood (WB) samples from 333 COVID-19 patients and controls, including patients with other viral and bacterial infections.
    2. host response biosignature for COVID-19 from RNA profiling of nasal swabs and blood
  12. Dec 2020
    1. Databases

      If database data is stored on a ZFS filesystem, it’s better to create a separate dataset with several tweaks:

      zfs create -o recordsize=8K -o primarycache=metadata -o logbias=throughput -o mountpoint=/path/to/db_data rpool/db_data

      • recordsize: match the typical RDBMS page size (8 KiB)
      • primarycache: disable ZFS data caching, as RDBMSs have their own
      • logbias: essentially, disable log-based writes, relying on the RDBMSs’ integrity measures (see detailed Oracle post)
  13. Oct 2020
  14. Sep 2020
    1. Bavadekar, Shailesh, Andrew Dai, John Davis, Damien Desfontaines, Ilya Eckstein, Katie Everett, Alex Fabrikant, et al. ‘Google COVID-19 Search Trends Symptoms Dataset: Anonymization Process Description (Version 1.0)’. ArXiv:2009.01265 [Cs], 2 September 2020. http://arxiv.org/abs/2009.01265.

  15. Jul 2020
  16. Jun 2020
  17. May 2020
  18. Apr 2020
    1. Salganik, M. J., Lundberg, I., Kindel, A. T., Ahearn, C. E., Al-Ghoneim, K., Almaatouq, A., Altschul, D. M., Brand, J. E., Carnegie, N. B., Compton, R. J., Datta, D., Davidson, T., Filippova, A., Gilroy, C., Goode, B. J., Jahani, E., Kashyap, R., Kirchner, A., McKay, S., … McLanahan, S. (2020). Measuring the predictability of life outcomes with a scientific mass collaboration. Proceedings of the National Academy of Sciences. https://doi.org/10.1073/pnas.1915006117

  19. Mar 2020
    1. All datasets were supplied by Sutherland in the Supporting Information as 3D geometries aligned according to the original literature, namely by flexible alignment on one or more templates obtained by crystallographic enzyme-inhibitor complexes
    2. eight comprehensive datasets

      What do the datasets look like? This may help in understanding the application domain of this tool.

    Tags

    Annotators

  20. Feb 2019
    1. Impact of Fully Connected Layers on Performance of Convolutional Neural Networks for Image Classification

      The authors conclude: 1) the shallower the CNN, the more nodes the FC layers need; conversely, the deeper the CNN, the fewer FC nodes suffice; 2) besides needing more FC nodes, a shallow CNN also benefits from more FC layers as the number of classes in the dataset grows, and vice versa; 3) for datasets with more samples per class, deeper networks do better, but when the number of classes is large, shallower networks perform better.

    2. Do we train on test data? Purging CIFAR of near-duplicates

      The authors dug into the CIFAR test sets, arguing that some test samples are so close to training samples (near-duplicates) that they pose an overfitting risk, so they replaced the suspect samples and proposed new test sets. After re-running the well-known models on them, they were relieved to report that those models apparently had not been mis-ranked due to overfitting~ (a slightly anticlimactic, face-saving result~)

    3. Semantic Redundancies in Image-Classification Datasets: The 10% You Don't Need

      A deep-neural-network version of "feature engineering"~ [doge]

    4. Deep Learning on Small Datasets without Pre-Training using Cosine Loss

      In modern deep learning, two things seem beyond dispute:

      1. categorical cross-entropy loss after softmax activation is the method of choice for classification;
      2. training a CNN classifier from scratch on a small dataset works poorly.

      In this paper, the authors show that when dealing with small sample classes, the cosine loss function can provide better performance than cross-entropy.
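      The cosine loss mentioned in the note can be sketched in a few lines: one minus the cosine similarity between the L2-normalized prediction and the one-hot target (a generic numpy illustration of the loss formula, not the authors' code; the `cosine_loss` name and example logits are mine):

```python
import numpy as np

def cosine_loss(logits, target_index):
    """1 - cosine similarity between the L2-normalized prediction
    vector and the one-hot target vector."""
    one_hot = np.zeros_like(logits)
    one_hot[target_index] = 1.0
    pred = logits / np.linalg.norm(logits)
    return 1.0 - float(pred @ one_hot)

logits = np.array([3.0, 4.0, 0.0])
print(cosine_loss(logits, 1))  # ~0.2, i.e. 1 - 4/5
```

      Unlike cross-entropy after softmax, the loss depends only on the direction of the prediction vector, not its magnitude, which is part of the paper's argument for why it behaves differently on small datasets.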
  21. Jan 2019
    1. Fitting A Mixture Distribution to Data: Tutorial

      Looks like a really lovely tutorial!

    2. Optimization Models for Machine Learning: A Survey

      I suspect the only part of this paper with real value for me is the Dataset tables compiled in the appendix.....

  22. Dec 2018
    1. Are All Training Examples Created Equal? An Empirical Study

      From this paper I learned about the interesting concept of Active Learning, which seems close to the continuous-parameter training-data sampling pool I designed myself....

      The main contribution of this paper is a study of the importance of training samples in image classification, measuring sample importance with a gradient-based method. Its conclusions suggest that active learning may not always be effective in deep learning.

    2. Image Score: How to Select Useful Samples

      The semi-supervised learning concept proposed here is quite interesting. Scoring every sample in a dataset may also help a bit with interpretability....

  23. Nov 2018
    1. Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift

      The paper's experiments explore model behavior after applying shifts (controlled perturbations) to datasets, and propose a classifier-based method/pipeline to observe and evaluate them:

      For my gravitational-wave data research, I can borrow its data-shift methods as well as its evaluation mechanism (two-sample tests).
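      The two-sample-test idea referenced in the note can be illustrated with a simple permutation test on the difference of means (a generic sketch, not the paper's pipeline, which uses classifier-based tests; the function name and synthetic data are mine):

```python
import numpy as np

def permutation_two_sample_test(x, y, n_perm=2000, seed=0):
    """p-value for the null hypothesis that x and y come from the same
    distribution, using |mean(x) - mean(y)| as the test statistic."""
    rng = np.random.default_rng(seed)
    observed = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling under the null
        diff = abs(pooled[:len(x)].mean() - pooled[len(x):].mean())
        if diff >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

rng = np.random.default_rng(42)
reference = rng.normal(0.0, 1.0, 500)  # original data
shifted = rng.normal(0.5, 1.0, 500)    # data with an injected mean shift
print(permutation_two_sample_test(reference, shifted))  # small p-value: shift detected
```

      A classifier-based variant replaces the mean-difference statistic with the accuracy of a classifier trained to distinguish the two samples; accuracy significantly above chance likewise signals a shift.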

    2. Training neural audio classifiers with few data

      This is a fairly preliminary, simple experiment.

      The conclusions in the figures are not really surprising: more data naturally means better performance; transfer learning does well on very small datasets; the Prototypical model may show some advantage thanks to its specialized structure; and the smaller the dataset, the worse the overfitting problem....

  24. Sep 2016
    1. UK Biobank

      Large UK dataset containing extensive phenotypic, genotypic, and neuroimaging data.

      • License: Unclear, but restrictive
      • Access: Human, ?
      • Needs data use agreement: Yes
      • Needs institutional signature for access: No (?)

    1. View Data Sets

      Public fMRI dataset repository.

      • License: PDDL v.1.0
      • Access: Human, s3
      • Needs data use agreement: No
      • Needs institutional signature for access: No
    1. Brain Genomics Superstruct Project (GSP)

      • License: Data use agreement
      • Access: Human, API
      • Needs data use agreement: Yes
      • Needs institutional signature for access: No

    1. What is studyforrest?

      Rich multimodal dataset on naturalistic stimuli

      • License: PDDL v.1.0
      • Access: Human, rsync, git annex
      • Needs data use agreement: No
      • Needs institutional signature for access: No
      • License: PDDL v.1.0
      • Access: Human, s3, openfmri
      • Needs data use agreement: No
      • Needs institutional signature for access: No
  25. May 2016
  26. Aug 2015
    1. the definition of a “dataset,”

      this is interesting, and will be interesting to track within and across disciplines