But to understand how the system works and why it does what it does, we also need to understand the data used to train the model. And as the open data community has seen, the barriers to sharing data differ from the barriers to sharing source code. Beyond copyright and liability disclaimers, data raise additional considerations, such as privacy and other third-party rights attached to material in the training set. To truly get the benefits of open ML, we need some additional information about the data used to build the overall ML system.
I assume this is subject to discussion, and will ultimately be edited to match what is said under "data and dataset"?