- Aug 2024
-
www.youtube.com www.youtube.com
-
For example, our standard English language model is trained with something like maybe 100 gigabytes or so of text. That gives it a strength as if you would throw BERT at it with the Google corpus. The other thing is, of course, that a small corpus like that is computed in two or three hours on a laptop. By the way, I didn't mention that our fingerprints are actually Boolean, so when we train, as I said, we are not using floating points.
for - comparison - cortical io vs normal AI - training dataset size and time
-
We basically grow models of, let's say, the same quality as all the others by using a thousand or ten thousand times less training data.
for - comparison - semantic folding vs normal machine learning - training dataset sizes and times
-
In that bitmap representation, at the end, I can look at every position in my bitmap and refer it back explicitly to the bits of reference information that I trained it with.
for - semantic fingerprint bitmap - tracing bitmap to training dataset
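As a rough illustration of the two properties quoted above (Boolean fingerprints, and bitmap positions that can be traced back to training text), here is a minimal Python sketch. It is not Cortical.io's implementation: the `Fingerprint` alias, the `overlap` measure, the `trace` helper, and the toy `position_sources` table are all assumptions invented for this example.

```python
# Toy sketch only: positions and snippets are invented for illustration.
from typing import Dict, FrozenSet, List

# A fingerprint is modelled as the set of active bit positions in a sparse bitmap.
Fingerprint = FrozenSet[int]

# Hypothetical lookup: bitmap position -> training snippets that set that bit.
position_sources: Dict[int, List[str]] = {
    3: ["snippet about dogs"],
    7: ["snippet about dogs", "snippet about wolves"],
    12: ["snippet about cars"],
}


def overlap(a: Fingerprint, b: Fingerprint) -> int:
    """Similarity as the count of shared active bits (integers only, no floats)."""
    return len(a & b)


def trace(fp: Fingerprint) -> Dict[int, List[str]]:
    """Map each active position back to the snippets recorded for it."""
    return {pos: position_sources.get(pos, []) for pos in sorted(fp)}


dog: Fingerprint = frozenset({3, 7})
wolf: Fingerprint = frozenset({7})
car: Fingerprint = frozenset({12})
```

Because a fingerprint here is just a set of positions, comparing two of them is a set intersection, and explaining one is a dictionary lookup per active bit.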
-
- Mar 2024
- Oct 2023
-
arxiv.org arxiv.org
-
"Causal Triplet: An Open Challenge for Intervention-centric Causal Representation Learning" Yuejiang Liu1, 2,* YUEJIANG.LIU@EPFL.CH Alexandre Alahi2 ALEXANDRE.ALAHI@EPFL.CH Chris Russell1 CMRUSS@AMAZON.DE Max Horn1 HORNMAX@AMAZON.DE Dominik Zietlow1 ZIETLD@AMAZON.DE Bernhard Schölkopf1, 3 BS@TUEBINGEN.MPG.DE Francesco Locatello1 LOCATELF@AMAZON.DE
-
- Oct 2022
-
repositorio.usp.br repositorio.usp.br
-
Free ions in kerosene-based ferrofluid detected by impedance spectroscopy (2021)
-
-
www.alice.cnptia.embrapa.br www.alice.cnptia.embrapa.br
-
Tempo de cultivo contínuo de cana-de-açúcar e influência nas características físicas e carbono orgânico de latossolos vermelhos distróficos em Guaíra/SP.
-
-
www.arca.fiocruz.br www.arca.fiocruz.br
-
An old drug and different ways to treat cutaneous leishmaniasis: Intralesional and intramuscular meglumine antimoniate in a reference center, Rio de Janeiro, Brazil.
-
-
- Jun 2022
-
data-feminism.mitpress.mit.edu data-feminism.mitpress.mit.edu
-
The major issue with much of the data that can be downloaded from web portals or through APIs is that they come without context or metadata. If you are lucky you might get a paragraph about where the data are from or a data dictionary that describes what each column in a particular spreadsheet means. But more often than not, you get something that looks like figure 6.3.
I think data lack context because of a reluctance to add an extra column describing the data, along with the inconsiderate and misleading view held by those in technology that data should be clean and concise.
I have run into insufficiently documented data several times, and it was extremely inconvenient when I tried to attach downloaded online reports to my work history to illustrate how I had grown the audience of a social media page (Facebook). I used to help run a Facebook page for a student organization. After my term ended, I went to the "Insights" section of Facebook, hoping to download a report of the increases in Page Likes, Visits, and Interactions during the period when I was an admin of the page. It took several attempts, with glitches along the way, to download the report (because it covered a year-long term). When the PDF was ready, I was surprised: it did not mention the years I had worked, the name of the student organization, or other categorizations that should have been highlighted. Including them would not have been hard: the years were part of the filter I set to extract that portion of the report, and the organization's name belonged to the very page the data came from. This laziness in presenting data fit for analysis was frustrating, and I had to add extra analysis of my own. Even after finishing that extra work, I began to question the validity of the report I had downloaded. Could it still be trusted, when without my clarification no analysis could be made, even by someone working in data science? Even if they could analyze it, they would first need time to collect external information to make sense of the data presented to them.
Understanding this ongoing problem, and being constantly bothered by it, gives me grounds to call for a more thorough data translation and presentation process. More questions should be raised and answered about what a user might wonder when encountering a dataset.
-
- May 2022
-
www.gwern.net www.gwern.net
-
Such a highly non-linear problem would clearly benefit from the computational power of many layers. Unfortunately, back-propagation learning generally slows down by an order of magnitude every time a layer is added to a network.
The problem in 1988
-
- Apr 2022
-
www.abc.net.au www.abc.net.au
-
Charting the COVID-19 spread: How Australia is faring. (2020, March 16). ABC News. https://www.abc.net.au/news/2020-03-17/coronavirus-cases-data-reveals-how-covid-19-spreads-in-australia/12060704
-
- Nov 2021
-
arxiv.org arxiv.org
-
Just because a dataset is publicly available doesn't mean that you can use it to build commercial AI software.
-
- Jun 2021
-
www.medrxiv.org www.medrxiv.org
-
Karlinsky, A., & Kobak, D. (2021). The World Mortality Dataset: Tracking excess mortality across countries during the COVID-19 pandemic. MedRxiv, 2021.01.27.21250604. https://doi.org/10.1101/2021.01.27.21250604
-
- May 2021
-
moodle.southwestern.edu moodle.southwestern.edu
-
To investigate these hypotheses, I created an election-year-country dataset covering the period from the early 1990s to the present for all post-communist democracies.7 The dataset is structured as a quasi-time series of 93 parliamentary elections in 17 countries from 1991 to 2012, and the dependent variable is the natural log of the radical right party’s combined vote share in elections held at time t.
This is the data: her explanation of the dataset she created.
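The dependent variable described in the highlight can be sketched in a few lines of Python. This is only an illustration of "natural log of the combined vote share": the function name, the percentage-point units, and the handling of a zero share are assumptions, and the paper may treat those details differently.

```python
import math
from typing import Iterable


def log_combined_vote_share(party_shares_percent: Iterable[float]) -> float:
    """Natural log of the summed vote share of all radical right parties
    in one election (shares given in percentage points, an assumption)."""
    combined = sum(party_shares_percent)
    if combined <= 0:
        # log is undefined at zero; how the original study handles
        # elections with no radical right vote is not stated in the quote.
        raise ValueError("combined vote share must be positive")
    return math.log(combined)
```

For one election-year-country row, the input would be the vote shares of every radical right party that contested the election at time t.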
-
- Mar 2021
-
-
Karimi, Fariba, and Petter Holme. ‘A Temporal Network Version of Watts’s Cascade Model’. ArXiv:2103.13604 [Physics], 25 March 2021. http://arxiv.org/abs/2103.13604.
-
-
data.cdc.gov data.cdc.gov
-
Calgary, Open. ‘COVID-19 Case Surveillance Public Use Data with Geography | Data | Centers for Disease Control and Prevention’. Accessed 26 March 2021. https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data-with-Ge/n8mc-b4w4.
-
-
www.ncbi.nlm.nih.gov www.ncbi.nlm.nih.gov
-
14 of which were sampled at multiple timepoints
-
RNA sequencing on samples from 46 individuals with PCR-positive, symptomatic SARS-CoV-2 infection
-
77 peripheral blood samples across 46 subjects with COVID-19 and compared them to subjects with seasonal coronavirus, influenza, bacterial pneumonia, and healthy controls.
-
seasonal coronavirus (n=59)
-
divided based on disease severity and time from symptom onset
-
elucidate novel aspects of the host response to SARS-CoV-2
-
influenza (n=17)
-
bacterial pneumonia (n=20)
-
healthy controls (n=19)
-
-
www.ncbi.nlm.nih.gov www.ncbi.nlm.nih.gov
-
elucidate key pathways in the host transcriptome of patients infected with SARS-CoV-2, we used RNA sequencing (RNA Seq) to analyze nasopharyngeal (NP) swab and whole blood (WB) samples from 333 COVID-19 patients and controls, including patients with other viral and bacterial infections.
-
host response biosignature for COVID-19 from RNA profiling of nasal swabs and blood
-
-
-
Cheng, C., Barceló, J., Hartnett, A. S., Kubinec, R., & Messerschmidt, L. (2020). COVID-19 Government Response Event Dataset (CoronaNet v.1.0). Nature Human Behaviour, 1–13. https://doi.org/10.1038/s41562-020-0909-7
-
- Dec 2020
-
saveriomiroddi.github.io saveriomiroddi.github.io
-
Databases
If databases data is stored on a ZFS filesystem, it’s better to create a separate dataset with several tweaks:
zfs create -o recordsize=8K -o primarycache=metadata -o logbias=throughput -o mountpoint=/path/to/db_data rpool/db_data
- recordsize: match the typical RDBMSs page size (8 KiB)
- primarycache: disable ZFS data caching, as RDBMSs have their own
- logbias: essentially, disabled log-based writes, relying on the RDBMSs’ integrity measures (see detailed Oracle post)
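To make the structure of that command explicit, the following Python sketch assembles the same `zfs create` invocation from a dictionary of properties and prints it without running anything. The helper name `zfs_create_argv` is hypothetical; the property values, mount point, and dataset name are copied from the quoted command.

```python
import shlex
from typing import Dict, List


def zfs_create_argv(dataset: str, props: Dict[str, str]) -> List[str]:
    """Build the argument list for `zfs create`, one `-o name=value` per property."""
    argv = ["zfs", "create"]
    for name, value in props.items():
        argv += ["-o", f"{name}={value}"]
    argv.append(dataset)
    return argv


# Properties exactly as given in the quoted command.
db_props = {
    "recordsize": "8K",          # match the typical RDBMS page size
    "primarycache": "metadata",  # leave data caching to the RDBMS
    "logbias": "throughput",
    "mountpoint": "/path/to/db_data",
}

command = shlex.join(zfs_create_argv("rpool/db_data", db_props))
print(command)  # printed only; nothing is executed
```

After actually creating the dataset, `zfs get recordsize,primarycache,logbias rpool/db_data` is one way to confirm the properties took effect.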
-
- Oct 2020
-
ourworldindata.org ourworldindata.org
-
docs.google.com docs.google.com
-
publications clinical trials datasets
-
-
www.kaggle.com www.kaggle.com
-
github.com github.com
-
storymaps.arcgis.com storymaps.arcgis.com
-
nextstrain.org nextstrain.org
-
www.arcgis.com www.arcgis.com
-
441187 total confirmed cases 111933 recovered 19784 deaths
-
- Sep 2020
-
github.com github.com
-
I forgot to mention in the original issue way back that I have a lot of data. Like 1 to 3 MB that is being passed around via export let foo.
-
-
arxiv.org arxiv.org
-
Bavadekar, Shailesh, Andrew Dai, John Davis, Damien Desfontaines, Ilya Eckstein, Katie Everett, Alex Fabrikant, et al. ‘Google COVID-19 Search Trends Symptoms Dataset: Anonymization Process Description (Version 1.0)’. ArXiv:2009.01265 [Cs], 2 September 2020. http://arxiv.org/abs/2009.01265.
-
- Jul 2020
-
osf.io osf.io
-
Morgan, L., Protopopova, A., Birkler, R. I. D., Itin-Shwartz, B., Sutton, G. A., gamliel, alexandra, Yakobson, B., & Raz, T. (2020). Human-dog relationships during COVID-19 pandemic; booming dog adoption during social isolation [Preprint]. SocArXiv. https://doi.org/10.31235/osf.io/s9k4y
-
-
psyarxiv.com psyarxiv.com
-
Schelhorn, I., Ecker, A., Bereznai, J., Tran, T., Rehm, S., Lugo, R., Sütterlin, S., Kinateder, M., & Shiban, Y. (2020). Depression symptoms during the COVID-19 pandemic in different regions in Germany. [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/p9wz8
-
- Jun 2020
-
www.youtube.com www.youtube.com
-
EU Datathon 2020—Webinar on COVID-19 and media and data monitoring. (2020, April 22). https://www.youtube.com/watch?v=wyNgmEfi_vk&feature=youtu.be
-
-
www.youtube.com www.youtube.com
-
EU Datathon 2020—Webinar on COVID-19 and media and data monitoring. (2020, April 22). https://www.youtube.com/watch?v=wyNgmEfi_vk&feature=youtu.be
-
-
www.youtube.com www.youtube.com
-
EU Datathon 2020—Webinar dedicated to COVID-19 data. (2020, April 9). https://www.youtube.com/watch?v=JIy6NO7QRQM&list=PLT5rARDev_rlAZ21iedz0ynnN4Na3UIoW&index=14&t=270s
-
-
eml.berkeley.edu eml.berkeley.edu
-
DellaVigna, S & Linos E. (2020). RCTs to scale: Comprehensive evidence from two nudge units. UC Berkeley. https://eml.berkeley.edu/~sdellavi/wp/NudgeToScale2020-03-20.pdf
-
-
psyarxiv.com psyarxiv.com
-
Yamada, Y., Ćepulić, D.-B., Coll-Martín, T., Debove, S., Gautreau, G., Han, H., Rasmussen, J., Tran, T. P., Travaglino, G. A., & Lieberoth, A. (2020). COVIDiSTRESS Global Survey dataset on psychological and behavioural consequences of the COVID-19 outbreak [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/v7cep
-
- May 2020
-
docs.google.com docs.google.com
-
www.ukcdr.org.uk www.ukcdr.org.uk
-
UKCDR - COVID-19 Research Project Tracker
-
-
ai.googleblog.com ai.googleblog.com
-
Tsitsulin, A. & Perozzi B. Understanding the Shape of Large-Scale Data. (2020 May 05). Google AI Blog. http://ai.googleblog.com/2020/05/understanding-shape-of-large-scale-data.html
-
-
www.kaggle.com www.kaggle.com
-
COVID-19 Open Research Dataset Challenge (CORD-19). (n.d.). Retrieved May 6, 2020, from https://kaggle.com/allen-institute-for-ai/CORD-19-research-challenge
-
-
leoferres.info leoferres.info
-
Ferres, L. (2020 April 10). COVID19 mobility reports. Leo's Blog. https://leoferres.info/blog/2020/04/10/covid19-mobility-reports/
-
-
coviz.apps.allenai.org coviz.apps.allenai.org
-
About. (n.d.). Retrieved May 6, 2020, from https://coviz.apps.allenai.org/
-
-
epjdatascience.springeropen.com epjdatascience.springeropen.com
-
Vilella, S., Paolotti, D., Ruffo, G. et al. News and the city: understanding online press consumption patterns through mobile data. EPJ Data Sci. 9, 10 (2020). https://doi.org/10.1140/epjds/s13688-020-00228-9
-
- Apr 2020
-
rajpurkar.github.io rajpurkar.github.io
-
-
-
Killeen, B.D., et al. (2020, April 1). A county-level dataset for informing the United States' response to COVID-19. Cornell University. arXiv:2004.00756.
-
-
www.ofcom.org.uk www.ofcom.org.uk
-
Ofcom. (2020 April 09). Covid-19 news and information: consumption and attitudes. https://www.ofcom.org.uk/research-and-data/tv-radio-and-on-demand/news-media/coronavirus-news-consumption-attitudes-behaviour
Tags
- news
- access
- COVID-19
- response
- misinformation
- survey
- dataset
- BARB
- consumption
- comScore
- lang:en
- information
- interactive
- attitude
- is:webpage
-
-
www.pnas.org www.pnas.org
-
Salganik, M. J., Lundberg, I., Kindel, A. T., Ahearn, C. E., Al-Ghoneim, K., Almaatouq, A., Altschul, D. M., Brand, J. E., Carnegie, N. B., Compton, R. J., Datta, D., Davidson, T., Filippova, A., Gilroy, C., Goode, B. J., Jahani, E., Kashyap, R., Kirchner, A., McKay, S., … McLanahan, S. (2020). Measuring the predictability of life outcomes with a scientific mass collaboration. Proceedings of the National Academy of Sciences. https://doi.org/10.1073/pnas.1915006117
-
-
trello.com trello.com
-
Collective Intelligence and COVID-19 | Trello. (n.d.). Retrieved April 20, 2020, from https://trello.com/b/STdgEhvX/collective-intelligence-and-covid-19
-
-
arxiv.org arxiv.org
-
Alam, F., Sajjad, H., Imran, M., & Ofli, F. (2020). Standardizing and Benchmarking Crisis-related Social Media Datasets for Humanitarian Information Processing. ArXiv:2004.06774 [Cs]. http://arxiv.org/abs/2004.06774
-
-
github.com github.com
-
experience.arcgis.com experience.arcgis.com
- Mar 2020
-
Local file Local file
-
All datasets were supplied by Sutherland in the Supporting Information as 3D geometries aligned according to the original literature, namely by flexible alignment on one or more templates obtained by crystallographic enzyme-inhibitor complexes
-
eight comprehensive datasets
What do the datasets look like? This may help in understanding the application domain of this tool.
-
-
ourworldindata.org ourworldindata.org
-
favorito,data_science
-
-
multimedia.scmp.com multimedia.scmp.com
-
unidad_COVID2019,favorita
-
-
www.visualcapitalist.com www.visualcapitalist.com
-
favorito,hermoso
-
-
coronavirus.thebaselab.com coronavirus.thebaselab.com
-
-
www.apprise.org.au www.apprise.org.au
-
www.gov.uk www.gov.uk
-
github.com github.com
-
unidad_COVID2019
-
-
coronavirus.jhu.edu coronavirus.jhu.edu
-
unidad_COVID2019
-
-
www.worldometers.info www.worldometers.info
-
bnonews.com bnonews.com
-
linea_tiempo
-
-
covid2019.app covid2019.app
-
acceso_abierto
-
-
www.consulta.mx www.consulta.mx
-
unidad_COVID2019,encuesta
-
-
coronavirus-disasterresponse.hub.arcgis.com coronavirus-disasterresponse.hub.arcgis.com
-
unidad_COVID2019,imprescindible
-
-
www.kff.org www.kff.org
- Feb 2019
-
iphysresearch.github.io iphysresearch.github.io
-
Impact of Fully Connected Layers on Performance of Convolutional Neural Networks for Image Classification
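The authors conclude: 1) the fewer convolutional layers a CNN has, the more nodes its FC layers need; conversely, a deeper CNN gets by with fewer FC nodes; 2) besides needing more FC nodes, a shallow CNN should have more FC layers the more classes the dataset has, and vice versa; 3) for datasets with more samples per class, deeper networks work better, but when the number of classes is large, shallower networks perform better.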
-
Do we train on test data? Purging CIFAR of near-duplicates
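The authors examined the CIFAR test set and argued that some test samples are so close to training samples that they raise an overfitting concern, so they replaced the suspect samples to build a new test set. After evaluating well-known models on it, they note with relief that these models do not seem to have overfitted or to have been ranked wrongly. (It feels a bit like a slap in the face.)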
-
Semantic Redundancies in Image-Classification Datasets: The 10% You Don't Need
A deep-neural-network version of "feature engineering" [doge]
-
Deep Learning on Small Datasets without Pre-Training using Cosine Loss
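In contemporary deep learning, two things seem beyond dispute:
- categorical cross-entropy loss after a softmax activation is the method of choice for classification;
- training a CNN classifier from scratch on a small dataset does not work well.
In this paper the authors show that, when only a few samples per class are available, the cosine loss gives better performance than cross-entropy.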
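For readers who want to see the two losses side by side, here is a minimal pure-Python sketch for a single example. It assumes one-hot class targets and defines the cosine loss as one minus the cosine similarity between the network output and the target vector; the function names are illustrative, and the paper's exact formulation and training setup may differ.

```python
import math
from typing import List


def softmax_cross_entropy(logits: List[float], target: int) -> float:
    """Categorical cross-entropy after softmax, computed stably via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def cosine_loss(outputs: List[float], target: int) -> float:
    """1 - cosine similarity between the outputs and a one-hot target.

    For a one-hot target the dot product is just outputs[target],
    and the target's norm is 1, so only the output norm is needed.
    """
    norm = math.sqrt(sum(z * z for z in outputs))
    if norm == 0.0:
        raise ValueError("cosine loss is undefined for an all-zero output")
    return 1.0 - outputs[target] / norm
```

One visible difference: the cosine loss depends only on the direction of the output vector, so scaling all outputs by a positive constant leaves it unchanged, whereas cross-entropy keeps decreasing as the correct logit grows.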
-
-
towardsdatascience.com towardsdatascience.com
-
Top Sources For Machine Learning Datasets
-
- Jan 2019
-
iphysresearch.github.io iphysresearch.github.io
-
Fitting A Mixture Distribution to Data: Tutorial
At first glance, this looks like a lovely tutorial!
-
Optimization Models for Machine Learning: A Survey
For me, the only truly valuable part of this paper is probably the collected dataset tables in the appendix at the end...
-
- Dec 2018
-
iphysresearch.github.io iphysresearch.github.io
-
Are All Training Examples Created Equal? An Empirical Study
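This paper introduced me to an interesting concept called active learning, which seems close to the continuous-parameter training-data sampling pool I designed myself...
The paper's main contribution is a study of training-sample importance in image classification, measuring the importance of each sample with a gradient-based method. Its conclusions suggest that active learning may not always be effective in deep learning.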
-
Image Score: How to Select Useful Samples
The semi-supervised learning idea proposed here is fairly interesting. Scoring every sample in a dataset might help a little with interpretability...
-
- Nov 2018
-
iphysresearch.github.io iphysresearch.github.io
-
Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift
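The paper's experiments explore how models behave after shifts (a kind of controlled perturbation) are applied to a dataset, and it proposes a classifier-based method/pipeline for observing and evaluating them.
For my gravitational-wave data research, I can borrow its data-shift methods and its evaluation mechanism (two-sample tests).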
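As a concrete example of a two-sample test, the sketch below computes the two-sample Kolmogorov-Smirnov statistic in pure Python: the largest gap between the empirical CDFs of a reference sample and a possibly shifted sample. It illustrates the general idea only and is not the paper's pipeline; the function name and the choice of a one-dimensional KS statistic are assumptions.

```python
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic: max absolute difference between empirical CDFs."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    # Walk both sorted samples in order of value, updating each ECDF as we go.
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        while i < len(xs) and xs[i] <= v:
            i += 1
        while j < len(ys) and ys[j] <= v:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    # Once one sample is exhausted its ECDF is 1, so the gap can only shrink.
    return d
```

A statistic near 0 means the two samples look alike along that feature, and a value near 1 means they barely overlap; turning the statistic into a decision still requires a significance threshold, which is left out here.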
-
Training neural audio classifiers with few data
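This is a fairly preliminary, simple experiment.
The conclusions from the figures are not surprising: more data naturally gives better performance; transfer learning does well with very small amounts of data; prototypical models may show some advantage thanks to their particular structure; and the less data there is, the worse the overfitting problem...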
-
- Sep 2016
-
www.ukbiobank.ac.uk www.ukbiobank.ac.uk
-
UK Biobank
Large UK dataset containing extensive phenotypic, genotypic, and neuroimaging data.
- License: Unclear, but restrictive.
- Access: Human, ?
- Needs data use agreement: Yes
- Needs institutional signature for access: No (?)
-
-
openfmri.org openfmri.org
-
View Data Sets
Public fMRI dataset repository.
- License: PDDL v.1.0
- Access: Human, s3
- Needs data use agreement: No
- Needs institutional signature for access: No
-
-
dataverse.harvard.edu dataverse.harvard.edu
-
Brain Genomics Superstruct Project (GSP)
- License: Data use agreement
- Access: Human, API
- Needs data use agreement: Yes
- Needs institutional signature for access: No
-
-
studyforrest.org studyforrest.org
-
What is studyforrest?
Rich multimodal dataset on naturalistic stimuli
- License: PDDL v.1.0
- Access: Human, rsync, git annex
- Needs data use agreement: No
- Needs institutional signature for access: No
-
-
myconnectome.org myconnectome.org
-
- License: PDDL v.1.0
- Access: Human, s3, openfmri
- Needs data use agreement: No
- Needs institutional signature for access: No
-
- May 2016
-
www.jstage.jst.go.jp www.jstage.jst.go.jp
-
Bird song data set
-
- Aug 2015
-
europepmc.org europepmc.org
-
the definition of a “dataset,”
this is interesting, and will be interesting to track within and across disciplines
-