- Mar 2024
- Oct 2023
-
research.swtch.com
-
the modern textual archive format
The ar format is underrated.
-
- Apr 2022
-
-
function Zip(_io, _parent, _root) {
  this._io = _io;
  this._parent = _parent;
  this._root = _root || this;
  this._read();
}
Zip.prototype._read = function() {
  this.sections = [];
  var i = 0;
  while (!this._io.isEof()) {
    this.sections.push(new PkSection(this._io, this, this._root));
    i++;
  }
}
Although the generated code is very useful...
This is wrong. It treats the ZIP format as if (à la PNG) it were a concatenated series of records/chunks, each marked by one of ZIP's characteristic 4-byte magic numbers beginning with "PK". It isn't. The only way to read a ZIP bytestream is to start from the end and scan backwards for the signature that may mark, at the current byte offset, the record containing the central directory metadata (the end of central directory record), then validate* the file based on that record, and then operate on it accordingly. (* If validation fails, you can continue scanning backwards from the offset that was thought to hold the signature.)
The first validation attempt to pass when carried out in this manner (from back to front) "wins": more than one validation pass, beginning at various offsets, may succeed, but only the one nearest the end of the bytestream is authoritative. If every validation attempt fails, the file may be corrupt, and the implementation may attempt to "repair" it (not necessarily by making on-disk modifications, but merely by being generous with its interpretation of the bytestream, perhaps presenting several different options to the user); alternatively, it may be the case that the file is simply not a ZIP archive.
This is because a ZIP file is permitted to have its records be little embedded "data islands" (in a sea of unrelated bytes). This is what allows spanned/multi-disk archives, and what allows a ZIP to be modified by updating the bytestream in an append-only way (or by selectively rubbing out parts of the existing central directory and updating the pointers/offsets in place). It's also what allows self-extracting archives to be self-extracting: first and foremost, they conform to the binary executable format, and they include code that can open the very same executable, process the records embedded within it, and write them to disk.
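The back-to-front scan described above can be sketched in Python roughly as follows. This is a simplified illustration rather than a complete reader: the names `find_eocd` and `_validate` are hypothetical, the validation is deliberately minimal (it only checks that the record fits and that a central directory header sits immediately before it), and ZIP64 and multi-disk archives are not handled.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"   # end-of-central-directory signature
CDFH_SIG = b"PK\x01\x02"   # central-directory file header signature
EOCD_FMT = "<4s4H2LH"      # fixed 22-byte part of the record
EOCD_SIZE = struct.calcsize(EOCD_FMT)


def _validate(data, pos):
    """Minimal plausibility check for a candidate record at pos (assumption:
    no ZIP64 structures sit between the central directory and the record)."""
    if pos + EOCD_SIZE > len(data):
        return None
    (_sig, _disk, _cd_disk, _n_disk, n_total,
     cd_size, _cd_offset, comment_len) = struct.unpack_from(EOCD_FMT, data, pos)
    if pos + EOCD_SIZE + comment_len > len(data):
        return None
    # Locate the central directory relative to the record itself, so that
    # bytes prepended to the archive (e.g. an executable stub) do not matter.
    cd_start = pos - cd_size
    if cd_start < 0:
        return None
    if n_total > 0 and data[cd_start:cd_start + 4] != CDFH_SIG:
        return None
    return {"eocd_offset": pos, "cd_start": cd_start, "entries": n_total}


def find_eocd(data):
    """Scan backwards; the valid candidate nearest the end wins."""
    pos = data.rfind(EOCD_SIG)
    while pos != -1:
        record = _validate(data, pos)
        if record is not None:
            return record
        # Validation failed: keep scanning backwards from this offset.
        pos = data.rfind(EOCD_SIG, 0, pos + 3)
    return None


if __name__ == "__main__":
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("hello.txt", "hello")
    # Unrelated leading bytes stand in for a self-extractor stub.
    print(find_eocd(b"unrelated leading bytes" + buf.getvalue()))
```

Because the central directory is located relative to the record rather than from the start of the file, the same function works whether or not the archive has been prefixed with other data.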
-
-
news.ycombinator.com
-
What I like best about pdf files is that I can just give them to someone and be almost certain that any questions will be about the content rather than the format of the file.
Almost every time I've used FedEx's "Print and Go" for a PDF I've created by "printing" e.g. HTML (and that I've verified looks good when previewing it on-screen), it comes out mangled when actually printed to paper.
-
- Jul 2021
-
www.w3.org
-
The original document file (I think - I can't test it)
Referenced in an HN thread:
https://news.ycombinator.com/item?id=12793157
In the thread, William Woodruff mentions that LibreOffice is capable of displaying this file.
-
- Apr 2021
-
www.infoworld.com
-
Ideally, GitHub would understand rich formats
I've advocated for a different approach.
Most of these "rich formats" are, let's just be honest, Microsoft Office file formats that people aren't willing to give up. But these aren't binary formats through-and-through; the OOXML formats are ZIP archives (following Microsoft's "Open Packaging Conventions") that when extracted are still almost entirely simple "files containing lines of text".
So rather than committing your "final-draft.docx", "for-print.oxps" and what-have-you to the repo, run them through a ZIP extractor and commit the extracted files instead. Then, just like any other source code repo, include a "build script" for these, which just zips them back up and gives the result the appropriate file extension.
(I have found through experimentation that some of these packages do include some binary files (which I can't recall offhand), but they tend to be small, and you can always come up with a text-based serialization for them, and then rework your build script so it's able to go from that serialization format to the correct binary before zipping everything up.)
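A minimal sketch of such an extract step and "build script", using Python's standard `zipfile` module. The function names `unpack` and `build` are hypothetical, and the sketch assumes the consuming application does not depend on the original entry order or compression settings; it makes no attempt to reproduce the original package byte for byte.

```python
import zipfile
from pathlib import Path


def unpack(package: Path, tree: Path) -> None:
    """Extract a ZIP-based document into a directory that can be committed."""
    with zipfile.ZipFile(package) as zf:
        zf.extractall(tree)


def build(tree: Path, package: Path) -> None:
    """Zip the committed directory back up under the desired file name.

    Keep `package` outside `tree`, or the output would try to include itself.
    """
    with zipfile.ZipFile(package, "w", zipfile.ZIP_DEFLATED) as zf:
        # Sorted order keeps the output stable from one build to the next.
        for path in sorted(tree.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(tree).as_posix())
```

For example, `unpack(Path("final-draft.docx"), Path("final-draft"))` would be run once before committing, and `build(Path("final-draft"), Path("out/final-draft.docx"))` whenever the document needs to be regenerated.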
-
- May 2019
-
sites.google.com
-
"list" (0x6C696E74)
The hex spells "lint", not "list". Also, in a real WAV file it appears to be capitalized: "LIST" = 0x4C495354.
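The mismatch is easy to check by decoding the values as four ASCII bytes. A small Python sketch, where `fourcc` is a hypothetical helper name:

```python
def fourcc(value: int) -> str:
    """Interpret a 32-bit value as four ASCII characters, most significant byte first."""
    return value.to_bytes(4, "big").decode("ascii")


print(fourcc(0x6C696E74))                    # lint
print(fourcc(0x4C495354))                    # LIST
print(hex(int.from_bytes(b"list", "big")))   # 0x6c697374
```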
-