Argument: TEI is proposed as the format for publishing text-related data when the original, complete data cannot be published for ethical or legal reasons. It allows other types of data that are also relevant for analysis, such as textual structure, metadata, documentation or annotation, to be modeled in a single document, which also satisfies several of the FAIR criteria. When modeling derived textual data in TEI, altering the order within the TEI elements is probably the method that best balances the current legal framework, the amount of data and its usability.
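To make the idea of "altering the order" concrete, here is a minimal sketch. It assumes the order being altered is that of the whitespace-separated word tokens inside each paragraph-level element, which these notes do not state explicitly. The function name shuffle_tokens, the choice of the p element and the seed parameter are all hypothetical and not taken from the paper.

```python
import random
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def shuffle_tokens(tei_xml, tag="p", seed=None):
    """Return the TEI document with the word order shuffled inside each <tag>.

    Assumption: tokens are whitespace-separated words. Inline child
    elements (e.g. <hi>) are flattened into plain text before shuffling.
    """
    rng = random.Random(seed)
    ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace
    root = ET.fromstring(tei_xml)
    for el in root.iter(f"{{{TEI_NS}}}{tag}"):
        tokens = "".join(el.itertext()).split()
        rng.shuffle(tokens)
        for child in list(el):
            el.remove(child)  # drop inline markup; its words are already in tokens
        el.text = " ".join(tokens)
    return ET.tostring(root, encoding="unicode")
```

In this sketch the word frequencies of each element are preserved while the running text can no longer be read in sequence. Flattening inline markup is a simplification chosen for brevity.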
Tools and projects referenced:
TEI
International: Digitale Bibliothek in TextGrid Repository, Drama Corpus (DraCor), European Literary Text Collection (ELTeC), Textbox
Spanish: Biblioteca Virtual Miguel de Cervantes, CORDE, CREA, CORPES, Canon 60, Biblioteca Digital Artelope, DISCO, ADSO, BETTE (DraCor), CoNSSA, CONHA, ELTeC
Text+ (NFDI)
Google Ngram Dataset
HTRC Extracted Features Dataset: HathiTrust & JSON for Linking Data (JSON-LD)
GitLab
Voyant Tools
xSample
BNE, VIAF, Wikidata, Wikipedia
Online Picasso Project
Stanford POS tagger
Interesting fact: For narrative segments in TEI that are structured using the <p>, <seg> and <s> elements, it would be possible to model these tags so that their frequencies are also collected in the bag of words. Because this would bring disadvantages such as the loss of the hierarchical structure, the proposal is instead to keep the tags as they are, hanging from the <text> element, but to remove all lexical and typographic content from them; that is, to maintain a textual structure in TEI without text (p. 9).
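The "structure without text" proposal can be illustrated with a short sketch. It assumes a namespaced TEI document and treats p, seg and s as the structural elements to empty. The function name strip_text_content and the exact handling of nested inline elements are hypothetical choices for illustration, not the paper's implementation.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Assumed set of structural elements whose content is emptied.
STRUCTURAL = {f"{{{TEI_NS}}}{name}" for name in ("p", "seg", "s")}


def strip_text_content(tei_xml):
    """Keep the element hierarchy under <text> but remove its character data."""
    ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace
    root = ET.fromstring(tei_xml)
    text_el = root.find(f".//{{{TEI_NS}}}text")
    if text_el is None:
        return ET.tostring(root, encoding="unicode")
    for el in text_el.iter():
        if el.tag not in STRUCTURAL:
            continue
        el.text = None
        for sub in el.iter():
            if sub is not el:
                # Nested elements stay in place; only their text is dropped.
                sub.text = None
                sub.tail = None
    return ET.tostring(root, encoding="unicode")
```

Elements outside the assumed set, such as headings, are left untouched here. Whether they should also be emptied depends on the legal constraints of the corpus in question.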
Els Thant