author | Adam Spragg <adam@spra.gg> | 2022-05-17 10:44:27 +0100 |
---|---|---|
committer | Adam Spragg <adam@spra.gg> | 2022-05-18 17:19:40 +0100 |
commit | 588b3e780fef14631f7ec5c369bff07efdc5e013 (patch) | |
tree | 5874b387da5e6f055abf70597c19cfaefe1760bb /FILEFORMAT_1 | |
parent | 96df5969b11b9a64f95c0c28347154b06cfc9d15 (diff) |
Add documentation for new Format 1
Diffstat (limited to 'FILEFORMAT_1')
-rw-r--r-- | FILEFORMAT_1 | 88 |
1 files changed, 88 insertions, 0 deletions
diff --git a/FILEFORMAT_1 b/FILEFORMAT_1
new file mode 100644
index 0000000..b16d85a
--- /dev/null
+++ b/FILEFORMAT_1
@@ -0,0 +1,88 @@
+Version 1
+---------
+
+Following sections explain internals of metastore file (.metadata), version 1
+
+
+### Data types
+
+    SIGNATURE = Magic signature, 10 bytes long = "MeTaSt00r3"
+    VERSION   = Format version string, 8 bytes long. This version = "00000001"
+    URLSTRING = URL-encoded string. Chars 0x00 to 0x20 (inclusive), 0x7F, and
+                0x25 (%) *must* be encoded. Terminated by any character which
+                must be encoded, which is not (e.g. "\t", "\n").
+    INTSTRING = ASCII-encoded integer, in a pre-specified base from 2 (binary)
+                to 16 (hexadecimal). May be preceded by "-" for negative values
+
+### File layout
+
+    SIGNATURE VERSION "\n"
+    n * (ENTRY "\n")
+
+
+### ENTRY format
+
+    URLSTRING        - Path (absolute or relative)
+    "\t" URLSTRING   - Owner (owner name, not uid)
+    "\t" URLSTRING   - Group (group name, not gid)
+    "\t" INTSTRING   - Mode (base 8, of struct stat.st_mode & 0177777.
+                       i.e. File type and mode, as per inode(7)
+    "\t" URLSTRING   - Mtime (including nanoseconds) in ISO-8601 format, UTC.
+                       "YYYY-mm-ddTHH:MM:SS.nnnnnnnnnZ"
+
+    m * ("\t" URLSTRING "\t" URLSTRING)
+                     - xattr name/value pairs. `m` may be 0.
+
+    "\n"             - Entry-terminating newline.
+
+
+### Discussion
+
+This format is designed to work with version control systems, specifically
+`git(1)`.
+
+To fit in with `git` and its related tooling, this format is a line-based text
+file. Each record is a bunch of text fields separated by tabs, terminated by a
+newline. This means records should be identifiable and somewhat understandable
+to readers, and should work with `diff(1)` and `patch(1)` (and their `git`ified
+descendants). Merge conflicts should produce files that are resolvable with any
+ordinary text editor. (Even `ed(1)`, if you insist!)
+
+This format is generally slightly larger than Format 0, but shouldn't be
+significantly so for most use cases, and this is a reasonable trade-off for
+readability and diff/patch/merge-ability.
+
+The format could be significantly larger if files have large amounts of binary
+data in xattrs, as 35 out of the 256 possible bytes require URL-encoding as a
+3-byte sequence, giving a 27.3% increase (by my calculations). This clearly
+isn't ideal, but this author suspects that the proportion of files with large
+binary xattrs is fairly small, and this should not cause an issue in practice.
+
+If a user does have large amounts of binary xattr data but can't handle the 27%
+size increase this format incurs, they can still use Format 0 to store it
+instead. If *you* have large amounts of binary xattr data that you have to store
+in git in a way that's diff/patch/merge-able - well, feel free to submit patches
+for Format 2 yourself ;-)
+
+If you do update this format, remember to change the man page as well as this
+document! I've tried to keep the info in the man page as short as possible, and
+to only include what a user should need to work with the resulting files.
+Extended musings and notes for implementors go here (or in the `git commit`
+log :-)
+
+
+### UTF-8 cleanliness
+
+Note that because bytes >= 0x80 are not required to be URL-encoded, binary xattr
+data is very unlikely to be UTF-8 clean. If this is a problem for the editor
+you use to resolve conflicts... I dunno. Get a better editor maybe? We could
+URL-encode all high bytes, but that would triple the size of half the bytes in
+binary data, and of all non-ASCII byte sequences in UTF-8 text. I suppose it
+might be possible to URL-encode all sequences of high bytes that are *not* UTF-8
+clean (and that would be backwards-compatible with the existing format) but I
+don't want to add that much complexity at the moment. Also, it might not be
+"enough" as you'd probably want to encode non-printable UTF-8 control codes
+such as RTL/LTR marks (U+200E/U+200F) to prevent the possibility of "Trojan
+Source" type attacks.
+
+(See <https://lwn.net/Articles/874951/> for more info on "Trojan Source")